# Milestone 2: Data Collection and Description

In this notebook we present our pipeline for answering the research questions raised in Milestone 1. In the first part, we lay out our step-by-step strategy for answering each question. In the second part, we provide the code used to fetch the data and run the preliminary analysis. In the third part, we present our preliminary statistical analysis along with some visualizations from our exploratory data analysis.

## Part 1

### Pursued strategy to answer the research questions (updated version):

1. **Are we emotionally biased?** Does the number of conflicts or their distance from our home determine our emotions?

    #### Fetching the data 

     From the GDELT dataset we fetch the following information from the "Mentions" and "Events" sets:

    - Time of the event (fetch the data in the available 2-year interval)
    - URL of the source article mentioning the event
    - Average Tone 
    - Location of the event (latitude and longitude)
    - Number of times the event is mentioned in the news (NumMentions, NumSources, NumArticles) 

    #### Analysis
   

    1. Addressing the subquestion: Does the number of conflicts or their distance from our home determine our emotions?

        - Calculation of the distance between the source article and the event: 
            1. Get the country from the URL 
            2. Get the geographic coordinates of the capital of the country
            3. Calculate the geographic distance between that capital (standing in for the source article) and the event

        - Evaluation of the dependency between the emotions and the distance: 
            1. Plot the emotion metrics against the distance (curve with confidence interval)
            2. Evaluate the statistical significance of the regression coefficient

        - Evaluation of the dependency between the emotions and the importance of the conflict:
            1. Statistical evaluation of which of the 3 "importance of an event" metrics provided by GDELT (NumMentions, NumSources, NumArticles) best correlates with the emotion metrics
    

  
2. **Are some countries ignored in the news?**  Does the relation between the number of conflicts taking place in a country and the number of mentions in the media depend on where the conflict happened? 


3. **Are we emotionally predictable?** Can we observe patterns of emotions with respect to a country, religion or ethnic group? Can we derive a model that predicts the emotions raised by a new conflict from its specific features?


4. **Do we have a saturation limit?** Does an increasing number of conflicts make people feel worse and worse, or is there a limit? Do we get used to a conflict over time and become less emotional?


    #### Fetching the data 

    From the GDELT dataset we fetch the following information from the "GKG" set:

    - URL of the source article
    - Average Tone 
    - GCAM 

    #### Analysis

    1. Addressing the subquestion: Does an increasing number of conflicts make people feel worse and worse, or is there a limit?

        - Calculation of the increasing number of conflicts: 
            1. Get the country from the URL 
            2. Parse the GKG files (for the chosen time interval) and get the events relating to that country

        - Possible limit of the emotions: 
            1. Get the average tone and the GCAM emotion scores for these events
            2. Evaluate the emotions expressed for each of these events, observing how the media report them and whether desensitization sets in after a threshold number of events
  
5. **Who is more emotional?** Do we see sensitivity differences between some countries? Do we see a trend towards more negative emotions over the years?
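
The distance calculation planned for question 1 (country from the URL, coordinates of the capital, distance to the event) could be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the lookup tables `TLD_TO_COUNTRY` and `CAPITAL_COORDS` and all function names are hypothetical, the country is guessed from the country-code top-level domain only, and the great-circle distance uses the haversine formula on a spherical Earth.

```python
import math
from urllib.parse import urlparse

# Hypothetical lookup tables for illustration only; a real pipeline would load
# a complete URL-to-country mapping and capital coordinates from local files.
TLD_TO_COUNTRY = {"fr": "France", "de": "Germany", "ch": "Switzerland"}
CAPITAL_COORDS = {  # approximate (latitude, longitude) in decimal degrees
    "France": (48.8566, 2.3522),
    "Germany": (52.5200, 13.4050),
    "Switzerland": (46.9480, 7.4474),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def country_from_url(url):
    """Guess the source country from the URL's country-code top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return TLD_TO_COUNTRY.get(tld)  # None for unknown or generic TLDs (.com, ...)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def source_event_distance_km(url, event_lat, event_lon):
    """Distance from the capital of the article's country to the event, or None."""
    country = country_from_url(url)
    if country is None or country not in CAPITAL_COORDS:
        return None
    cap_lat, cap_lon = CAPITAL_COORDS[country]
    return haversine_km(cap_lat, cap_lon, event_lat, event_lon)
```

Articles hosted on generic domains such as `.com` return `None` here, so in practice a curated URL-to-country table would be needed to cover them.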



## Part 2

In [12]:
# regular imports
import os
import numpy as np
import pandas as pd

# function imports
from Q4_helper_functions import *
from Schema import *

import warnings
warnings.filterwarnings('ignore')

import findspark
findspark.init(r'C:\opt\spark')  # raw string so the backslashes are kept literally

from pyspark.sql import *
from pyspark.sql.functions import to_timestamp, desc, asc, udf, window, explode, unix_timestamp

%matplotlib inline
spark = SparkSession.builder.getOrCreate()

# update when changing functions
%load_ext autoreload
%autoreload 2



In [13]:
# Change this when running on the cluster

#DATA_DIR = 'hdfs:///datasets/gdeltv2'
DATA_DIR = '../data/'

# directory for local files (e.g. UrlToCountry)
DATA_LOCAL = '../data/'

In [14]:
gkg_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.gkg.csv"), schema=GKG_SCHEMA)
events_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.export.CSV"), schema=EVENTS_SCHEMA)
mentions_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.mentions.CSV"), schema=MENTIONS_SCHEMA)

1. **Are we emotionally biased?**

2. **Are some countries ignored in the news?**

3. **Are we emotionally predictable?**

4. **Do we have a saturation limit?** / 5. **Who is more emotional?**
##### 1. GCAM Emotions data
> We select the record ID and the GCAM emotions so that we can attribute to each event the emotions behind it.

In [15]:
ID_GCAM = gkg_df.select("GKGRECORDID","GCAM")

In [16]:
df_ID_GCAM = ID_GCAM.toPandas()
df_ID_GCAM.head()

| | GKGRECORDID | GCAM |
|---|---|---|
| 0 | 20171123073000-0 | wc:122,c12.1:10,c12.10:7,c12.12:3,c12.13:1,c12... |
| 1 | 20171123073000-1 | wc:154,c1.2:1,c12.1:14,c12.10:10,c12.12:1,c12.... |
| 2 | 20171123073000-2 | wc:499,c12.1:23,c12.10:37,c12.12:6,c12.13:18,c... |
| 3 | 20171123073000-3 | wc:797,c1.1:1,c1.2:1,c12.1:26,c12.10:65,c12.11... |
| 4 | 20171123073000-4 | wc:200,c1.2:3,c12.1:1,c12.10:10,c12.12:4,c12.1... |


> Let's make sure that each row has a unique ID (which is expected).

In [17]:
print('\n', 'Length of all dataframe :' ,len(df_ID_GCAM), '\n', 
            'Length of unique IDs    :' ,len(df_ID_GCAM['GKGRECORDID'].unique()))


 Length of all dataframe : 4169 
 Length of unique IDs    : 4169
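
If the two lengths ever differ, it helps to see which IDs repeat. The following is a small dependency-free sketch of that check; the function name `duplicated_ids` and the sample IDs are hypothetical and only mimic the `GKGRECORDID` format.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(ids)
    return {record_id: n for record_id, n in counts.items() if n > 1}


# Toy IDs in the GKGRECORDID format; the last one is repeated on purpose.
sample_ids = ["20171123073000-0", "20171123073000-1", "20171123073000-1"]
print(duplicated_ids(sample_ids))
```

On the DataFrame itself, `df_ID_GCAM['GKGRECORDID'].duplicated(keep=False)` should give an equivalent row mask.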


> We also need to split the emotion scores so that we can attribute the corresponding feelings to each event, since GCAM combines several dictionaries, each with its own collection of words. For a clearer picture, see [GCAM](https://blog.gdeltproject.org/introducing-the-global-content-analysis-measures-gcam/).

In [18]:
# split the GCAM column
df_ID_Emotions = df_ID_GCAM['GCAM'].str.split(',',expand=True)
# insert the corresponding ID
df_ID_Emotions.insert(loc=0, column='GKGRECORDID', value=df_ID_GCAM['GKGRECORDID'])
df_ID_Emotions.head()

| | GKGRECORDID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20171123073000-0 | wc:122 | c12.1:10 | c12.10:7 | c12.12:3 | c12.13:1 | c12.14:3 | c12.3:2 | c12.4:2 | c12.5:9 | ... | | | | | | | | | | |
| 1 | 20171123073000-1 | wc:154 | c1.2:1 | c12.1:14 | c12.10:10 | c12.12:1 | c12.13:4 | c12.14:5 | c12.3:4 | c12.4:2 | ... | | | | | | | | | | |
| 2 | 20171123073000-2 | wc:499 | c12.1:23 | c12.10:37 | c12.12:6 | c12.13:18 | c12.14:14 | c12.3:15 | c12.4:5 | c12.5:5 | ... | | | | | | | | | | |
| 3 | 20171123073000-3 | wc:797 | c1.1:1 | c1.2:1 | c12.1:26 | c12.10:65 | c12.11:1 | c12.12:8 | c12.13:22 | c12.14:38 | ... | | | | | | | | | | |
| 4 | 20171123073000-4 | wc:200 | c1.2:3 | c12.1:1 | c12.10:10 | c12.12:4 | c12.13:4 | c12.14:10 | c12.5:1 | c12.7:4 | ... | | | | | | | | | | |
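
In the wide table above, the same positional column holds different variables from row to row (column 1 is `c12.1` in the first row but `c1.2` in the second), and most of the columns up to 1978 are empty. One possible alternative, sketched here under the assumption that every GCAM entry has the form `variable:value`, is to parse each string into a dictionary keyed by variable, or into long-format rows. The function names `parse_gcam` and `to_long_rows` are hypothetical.

```python
def parse_gcam(gcam):
    """Parse a GCAM string such as 'wc:122,c12.1:10' into {variable: value}."""
    scores = {}
    if not gcam:
        return scores  # missing or empty GCAM field
    for entry in gcam.split(","):
        key, sep, value = entry.partition(":")
        if not sep:
            continue  # skip malformed entries without a ':' separator
        try:
            scores[key] = float(value)
        except ValueError:
            continue  # skip entries whose value is not numeric
    return scores


def to_long_rows(record_id, gcam):
    """Yield (record_id, variable, value) tuples, one per GCAM entry."""
    for variable, value in parse_gcam(gcam).items():
        yield (record_id, variable, value)


# First entries of the first record shown above.
sample = "wc:122,c12.1:10,c12.10:7,c12.12:3"
print(parse_gcam(sample))
```

Keying on the variable name keeps scores comparable across records regardless of the order or number of entries in each string.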


#### 1.1 Building emotions dictionary

In [23]:
# function to build emotion dictionary
Emotions_dictionary = get_emotion_dictionary(DATA_LOCAL, 'GCAM-MASTER-CODEBOOK.txt')

In [25]:
# let's have a look at it
Emotions_dictionary.tail()

| | Variable | DimensionHumanName |
|---|---|---|
| 2973 | c37.33 | Power |
| 2974 | c38.1 | ImageryValence |
| 2975 | c41.1 | POSITIVE |
| 2976 | c41.2 | NEGATIVE |
| 2977 | c41.3 | NEUTRAL |


> We see that different variables refer to different feelings
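
To illustrate how such a dictionary might be used, the sketch below relabels the variables of one parsed record with their human-readable names and scales the counts by the article word count (`wc`). It is only an assumption about a possible next step: `CODEBOOK` holds just the five entries from the tail shown above, the function names are hypothetical, and the record values are invented.

```python
# Subset of the codebook, copied from the tail displayed above.
CODEBOOK = {
    "c37.33": "Power",
    "c38.1": "ImageryValence",
    "c41.1": "POSITIVE",
    "c41.2": "NEGATIVE",
    "c41.3": "NEUTRAL",
}


def per_word(scores):
    """Divide every count variable ('c' prefix) by the word count 'wc'."""
    wc = scores.get("wc")
    if not wc:
        return {}  # no usable word count, nothing to normalise
    return {var: value / wc
            for var, value in scores.items()
            if var.startswith("c")}


def name_scores(scores, codebook):
    """Replace GCAM variable codes with human-readable names where known."""
    return {codebook.get(var, var): value for var, value in scores.items()}


# Hypothetical parsed record; the values are invented for illustration.
record = {"wc": 200.0, "c41.1": 10.0, "c41.2": 4.0}
print(name_scores(per_word(record), CODEBOOK))
```

Human-readable names may not be unique across the GCAM dictionaries, so for the full codebook it is probably safer to keep the variable code alongside the name.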