# Explainer Notebook 

The explainer notebook describes how the project was carried out in terms of data collection, cleaning and analyses, explains the idea behind each of these steps, and reflects on what could have been executed better. This is done in 4 sections, as requested by the course administrator.

The data collection and most of the cleaning were done using the scripts provided in the `exploration`-folder, whereas the preliminary data analysis that was useful for the research questions was carried out in the `preliminary_analysis.ipynb`-file. The actual research questions were investigated in the files `RQ1.ipynb`, `RQ2.ipynb` and `RQ3.ipynb`, and a brief overview of what they investigate and how they were investigated is given in the repository `README`-file.

## 1) Motivation

> - What is your dataset?

Multiple datasets are considered in order to carry out the analyses - they can be accessed through [this link](https://drive.google.com/drive/folders/1e2uLI2JjoN1DJW5UrvhNofq_fbWcLBev). These consist of 1) the **Reddit-ClimateGraph dataset**, 2) the **Twitter Climate Change Sentiment dataset** and 3) the **EM-DAT Natural Disasters dataset**. In the following, these datasets are described in more detail.

The main data source - namely the **Reddit-ClimateGraph dataset** - was collected specifically for this project. It consists of more than 500,000 Reddit submissions posted between the 1st of November 2014 and the 5th of April 2022. From these submissions, a random sample of 90,000 was drawn without replacement (using a random seed of 42), and comments were collected only for the sampled submissions. This was done due to the limited time available for the project and resulted in more than 1,000,000 Reddit comments related to Climate Change. Both submissions and comments were collected using the `Pushshift API` with the queries `'climate change  climate'`, `'global warming  climate'` and `'warming planet  climate'`. These query terms were chosen because the Twitter Climate Change Sentiment dataset was collected using the same terms. The Reddit-ClimateGraph dataset is divided into years, each containing four types of files: 1) a zipped json-file of the extracted submissions with labelled sentiment, 2) a zipped json-file of the extracted comments with labelled sentiment, 3) a zipped json-file of the combined submissions and comments grouped by author IDs and 4) a json-file containing a `networkx` directed graph - the ClimateGraph - of nodes, edges and metadata.
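The sampling step described above can be sketched as follows. This is a minimal illustration and not the project's actual script: the function name `sample_submission_ids` and the toy ID list are hypothetical, and only the sample size (90,000), the absence of replacement and the seed (42) are taken from the description.

```python
import random


def sample_submission_ids(submission_ids, k=90_000, seed=42):
    """Draw up to k distinct submission IDs in a reproducible way."""
    rng = random.Random(seed)            # local generator, global state untouched
    k = min(k, len(submission_ids))      # guard for inputs smaller than k
    return rng.sample(submission_ids, k)  # sample() draws without replacement


# Toy stand-in for the real submission IDs (hypothetical values).
toy_ids = [f"sub_{i}" for i in range(1_000)]
subset = sample_submission_ids(toy_ids, k=100)
```

Fixing the seed means that rerunning the sampling on the same list of submissions yields the same subset, so the comment collection can be repeated or resumed for exactly the same submissions.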

To map an opinion score to the data collected from Reddit, the [**Twitter Climate Change Sentiment dataset**](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset) (also see the [GitHub](https://github.com/edwardcqian/climate_change_sentiment)) was used. Despite its name, it is in practice a dataset on opinions rather than sentiment. It is a manually tagged dataset consisting of 43,943 Tweets collected between Apr 27, 2015 and Feb 21, 2018. These Tweets are annotated with the underlying opinion in the "man-made or not"-debate about climate change, divided into four categories: 1) *'Anti'*, 2) *'Neutral'*, 3) *'Pro'* and 4) *'News'*. The annotation procedure consisted of asking 3 reviewers about the opinion of a given Tweet - only Tweets on which all 3 reviewers agreed were included in the final dataset. As mentioned in the description of the Reddit-ClimateGraph dataset, the selected Tweets were found using the queries `'climate change  climate'`, `'global warming  climate'` and `'warming planet  climate'`.
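The agreement rule used in the annotation procedure (keep a Tweet only if all 3 reviewers gave the same label) can be illustrated with a short sketch. The function name, the dictionary layout and the example annotations are hypothetical and only serve to show the filtering logic; the published dataset already contains the filtered result.

```python
def keep_unanimous(annotations):
    """Keep tweets whose three reviewer labels are identical.

    `annotations` maps a tweet ID to the list of labels given by its reviewers.
    """
    return {
        tweet_id: labels[0]
        for tweet_id, labels in annotations.items()
        if len(labels) == 3 and len(set(labels)) == 1
    }


# Hypothetical annotations, for illustration only.
toy_annotations = {
    "t1": ["Pro", "Pro", "Pro"],
    "t2": ["Pro", "Neutral", "Pro"],   # dropped: reviewers disagree
    "t3": ["News", "News", "News"],
}
agreed = keep_unanimous(toy_annotations)
```

A rule like this trades dataset size for label quality: ambiguous Tweets are removed, which is worth keeping in mind when the resulting labels are used to train a classifier on less clear-cut text.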

The third dataset - namely the [**EM-DAT Natural Disasters dataset**](https://www.emdat.be/database) - is a detailed dataset describing several aspects of natural disasters and their societal impact. It is publicly available from the EM-DAT database after registering. For instance, it expresses the severity of various types of natural disasters through attributes such as `deaths`.

>- Why did you choose this/these particular dataset(s)?

The original idea was to extract Tweets related to Climate Change for investigations on networks and text, rather than Reddit comments and submissions. However, the query limit on the free Twitter API was deemed too limiting for the project to be well executed. To provide a more solid base for analyses and conclusions, Reddit was chosen instead due to the accessibility of the Pushshift API.

The time span of the Reddit network was determined from the last two major IPCC report releases, namely the 5th Assessment Report (November 2014) and the last part of the 6th Assessment Report (April 2022). Within this period, awareness of the Climate Change debate has risen, which provides a solid basis for examining the Reddit discussion from both a static and a dynamic point of view.

Due to the infeasibility of querying the Twitter API, it was decided to use the Twitter opinion dataset for training an opinion classifier that could be used to classify the opinions of Reddit posts. This approach was chosen because there was no obvious, valid data source of Reddit comments and submissions annotated with an opinion related to climate change.
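The idea of training on labelled Tweets and then classifying Reddit text can be sketched with a deliberately small example. The model below is a plain multinomial Naive Bayes with add-one smoothing, chosen here only because it fits in a few lines; it is an assumption and not necessarily the classifier used in the project. The class name, the tokenizer and the example texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesOpinion:
    """Tiny multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen during training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training Tweets and a hypothetical Reddit comment.
tweets = [
    "climate change is a hoax",
    "global warming is a hoax and a scam",
    "we must act on climate change now",
    "act now to stop global warming",
]
tweet_labels = ["Anti", "Anti", "Pro", "Pro"]

model = NaiveBayesOpinion().fit(tweets, tweet_labels)
predicted = model.predict("this so called warming is a scam")
```

On this toy data the Reddit-style comment is assigned the label `'Anti'`. The sketch also makes the main risk of the approach visible: the model only knows the vocabulary of the Tweets, so Reddit text that uses different wording contributes little to the prediction.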

Initially, the thought was to include several types of events - like natural disasters, social events, relevant movie and documentary releases, etc. - in the analysis. However, we were not able to find such a dataset online. After having manually created a simple version of such a dataset containing event types, event names and dates of occurrence, we realized that associating some sort of numerical value with each event would allow for considerably more interesting analyses and conclusions. The EM-DAT dataset was chosen for this reason, which resulted in a slight change of scope for research question 1, meaning that only climate change events of the type natural disasters were considered in the final project hand-in.


>- What was your goal for the end user's experience?

Our goal for the end user's experience was three-fold: 1) to communicate our findings intuitively as a data story on a webpage, 2) to provide a solid dataset about the Climate Change debate on Reddit and 3) to create a well-structured GitHub repository that enables others with an interest in the Climate Change debate to recreate our analyses.

## 2) Basic stats




> - Write about your choices in data cleaning and preprocessing




>- Write a short section that discusses the dataset stats (here you can recycle the work you did for Project Assignment A)

Reddit: size, etc.


Twitter: unbalanced.




Describe attributes!


## 3) Tools, theory and analysis. Describe the process of theory to insight

> - Talk about how you've worked with text, including regular expressions, unicode, etc.


We did the same processing for everything dealing with text. Since the Twitter opinion classifier 




> - Describe which network science tools and data analysis strategies you've used, how those network science measures work, and why the tools you've chosen are right for the problem you're solving.




> - How did you use the tools to understand your dataset?

## 4) Discussion


> - What went well?






> - What is still missing? What could be improved?, Why?







Improved: the classifier. As was seen in preliminary analyses, the classifier generally ...

Also: mapping news as neutral could have been more thought-through. I.e. by disregarding it...

We restrict the analyses in the RQs to the year 2020.

However, we investigate whether the general ClimateGraph network evolves over time and use the findings to suggest digging into this trend for future researchers. 

We found out too late that we mapped the news opinion differently in RQ1 and in {RQ2, RQ3}. In RQ1, news was denoted as `nan`, whereas it was set to 0 in RQ2 and RQ3.
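The effect of the two conventions can be illustrated with a small sketch. The numeric scores for *'Anti'*, *'Neutral'* and *'Pro'* (-1, 0 and 1) and the choice to skip `nan` values when averaging are assumptions made for this example, and the function names are hypothetical; only the two treatments of *'News'* come from the description above.

```python
import math

# Assumed scores for the non-news classes (not taken from the notebooks).
BASE_SCORES = {"Anti": -1.0, "Neutral": 0.0, "Pro": 1.0}


def opinion_score(label, news_as_neutral):
    """Map an opinion label to a score under one of the two 'News' conventions."""
    if label == "News":
        return 0.0 if news_as_neutral else math.nan
    return BASE_SCORES[label]


def mean_opinion(labels, news_as_neutral):
    """Average the scores, skipping NaN entries (assumed aggregation rule)."""
    scores = [opinion_score(label, news_as_neutral) for label in labels]
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Hypothetical labels for one author or thread.
labels = ["Pro", "Pro", "News", "News"]
rq1_style = mean_opinion(labels, news_as_neutral=False)   # News ignored
rq23_style = mean_opinion(labels, news_as_neutral=True)   # News counted as 0
```

With these toy labels, the average opinion is 1.0 when news is ignored and 0.5 when news is counted as neutral, so results from RQ1 are not directly comparable with those from RQ2 and RQ3. Defining the label mapping once and importing it in all notebooks would avoid this kind of mismatch.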