# Milestone 2: Data Collection and Description

In this notebook we present our pipeline for answering the research questions raised in Milestone 1. The first part lays out our step-by-step strategy for each question. The second part provides the code used to fetch the data and run the preliminary analysis. The third part presents our preliminary statistical analysis, together with some visualizations from our exploratory data analysis.

## Part 1

### Pursued strategy to answer the research questions:

### Question 1

**Are we emotionally biased?** Does the number of conflicts or their distance from our home define our emotions?

#### Fetching & Processing the data 

 From the GDELT dataset we fetch the following information from the "Mentions" and "Events" sets:

- Time of the event (fetch the data in a defined time interval)
- URL of the source article mentioning the event
- Average Tone 
- Location of the event (latitudinal and longitudinal coordinates)
- Number of times the event is mentioned in the news (NumMentions, NumSources, NumArticles) 
- Calculation of the distance between the source article and the event: 
    1. Get the country from the url using **XXX**
    2. Get the geographic coordinates of the capital of the source country using the csv file provided by [this website](http://techslides.com/list-of-countries-and-capitals)
    3. Calculate the geographic distance between the source article and the event with the [Great-Circle distance formula](https://en.wikipedia.org/wiki/Great-circle_distance)
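
The distance step above can be sketched with the haversine form of the great-circle formula. This is a minimal sketch, not the notebook's own helper: the function name `great_circle_km`, the constant `EARTH_RADIUS_KM` and the example coordinates are assumptions made for illustration.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, lam1, phi2, lam2 = map(radians, (lat1, lon1, lat2, lon2))
    dphi = phi2 - phi1
    dlam = lam2 - lam1
    # haversine of the central angle between the two points
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # clamp to 1.0 to guard against rounding slightly above 1 for antipodal points
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

# hypothetical example: source capital near Bern, event near Paris (approximate coordinates)
print(round(great_circle_km(46.95, 7.45, 48.86, 2.35)), "km")
```

Using the capital as a proxy for the source location means every outlet of a country gets the same coordinates, so the distance is only as precise as that approximation.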

#### Analysis

- Evaluation of the dependency between the emotions and the distance: 
    1. Plot the emotion metrics against the distance 
    2. Evaluate the statistical significance of the regression coefficient / correlation coefficient

- Evaluation of the dependency between the emotions and the importance of the conflict:
    1. Evaluate whether the emotion metrics depend on the importance of the conflict, using one of the 3 "importance of an event" metrics provided by GDELT (NumMentions, NumSources, NumArticles)
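
One possible way to test the significance of a correlation coefficient is a permutation test, sketched below with the standard library only. The choice of a permutation test, the helper names `pearson_r` and `permutation_p_value`, and the toy numbers are all assumptions for illustration; they are not results from the GDELT data.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any pairing between x and y
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_perm + 1)

# toy data (made up): distance in km vs. average tone
distance = [120, 450, 900, 1500, 2300, 3100, 4200, 5600, 7000, 8800]
avg_tone = [-1.0, -1.4, -1.9, -2.1, -2.6, -2.4, -3.3, -3.6, -4.1, -4.4]
r = pearson_r(distance, avg_tone)
p = permutation_p_value(distance, avg_tone)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

The same two functions apply unchanged when the x variable is one of the importance metrics instead of the distance.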

### Question 2

**Are some countries ignored in the news?** Does the relation between the number of conflicts taking place in a country and the number of mentions in the media depend on where the conflict happened?

#### Fetching the data 

We use the same dataset as in Question 1.

#### Analysis

- Group by the event country and sum up the number of conflicts and the number of mentions 
- Evaluate whether there is a correlation between the number of conflicts and the number of mentions
- Identify countries with a high number of conflicts but a comparatively low number of mentions
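
The three steps above can be sketched on toy data as follows. The country labels and counts are made up, and the flagging rule (above-median number of conflicts combined with below-median mentions per conflict) is only one assumed way to make "comparatively low" concrete.

```python
from collections import defaultdict
from statistics import median

# toy events (made up): (event country, number of mentions of that event)
events = [
    ("A", 120), ("A", 80), ("A", 100),
    ("B", 5), ("B", 3), ("B", 4), ("B", 6), ("B", 2),
    ("C", 40), ("C", 60),
]

# group by event country: count the conflicts and sum up the mentions
conflicts = defaultdict(int)
mentions = defaultdict(int)
for country, num_mentions in events:
    conflicts[country] += 1
    mentions[country] += num_mentions

# mentions per conflict as a simple coverage ratio
coverage = {c: mentions[c] / conflicts[c] for c in conflicts}

# assumed rule: many conflicts (above median) but little coverage (below median)
med_conflicts = median(conflicts.values())
med_coverage = median(coverage.values())
under_covered = sorted(
    c for c in conflicts
    if conflicts[c] > med_conflicts and coverage[c] < med_coverage
)
print(under_covered)
```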


### Question 3

**Are we emotionally predictable?** Can we observe patterns of emotions with respect to a country, religion or ethnic group? Can we derive a model that predicts the emotions raised by a new conflict based on its specific features?

#### Fetching the data 



#### Analysis


### Question 4

**Do we have a saturation limit?** Does an increasing number of conflicts make people feel worse and worse, or is there some limit? Do we get used to a conflict over time and become less emotional about it?


#### Fetching the data 

From the GDELT dataset we fetch the following information from the "GKG":

  - URL of the source article mentioning the event
  - Average Tone 
  - GCAM 

#### Analysis

1. Addressing the subquestion: Does an increasing number of conflicts make people feel worse and worse, or is there some limit?

      - Calculating the increasing number of conflicts: 
          1. Get the country from the url 
          2. Parse the GKG files (in the chosen time interval) and get the events referring to a country.

      - Possible limit of the emotions: 
          1. Get the average tone and the GCAM emotions referring to the events
          2. Evaluate the emotions for each of these events, observing how the media present the events and whether some desensitization appears after a threshold number of events.
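
One assumed way to look for such a limit is to order the records of a country by time, average the tone over consecutive blocks of events, and check whether the change between blocks shrinks. The sketch below uses made-up records, and the helper name `mean_tone_by_block` is hypothetical.

```python
# toy records (made up): (timestamp, average tone) of articles about one country
records = [(1, -2.0), (2, -3.0), (3, -4.5), (4, -5.0),
           (5, -5.2), (6, -5.1), (7, -5.3), (8, -5.2)]

def mean_tone_by_block(records, block_size):
    """Sort by time and average the tone over consecutive blocks of events."""
    tones = [tone for _, tone in sorted(records)]
    blocks = []
    for start in range(0, len(tones), block_size):
        chunk = tones[start:start + block_size]
        blocks.append(sum(chunk) / len(chunk))
    return blocks

blocks = mean_tone_by_block(records, block_size=2)
# change between consecutive blocks; values near zero suggest the tone has levelled off
changes = [later - earlier for earlier, later in zip(blocks, blocks[1:])]
print(blocks)
print(changes)
```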
  





### Question 5

**Who is more emotional?** Do we see sensitivity differences between some countries? Do we see a trend towards more negative emotions over the years?



## Part 2

In [12]:
# regular imports
import os
import numpy as np
import pandas as pd

# function imports
from Q4_helper_functions import *
from Schema import *

import warnings
warnings.filterwarnings('ignore')

import findspark
# raw string so the backslashes in the Windows path are not read as escape sequences
findspark.init(r'C:\opt\spark')

from pyspark.sql import *
from pyspark.sql.functions import to_timestamp,desc, asc, udf, window, explode, unix_timestamp

%matplotlib inline
spark = SparkSession.builder.getOrCreate()

# update when changing functions
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
## Change this in the cluster

#DATA_DIR = 'hdfs:///datasets/gdeltv2'
DATA_DIR = '../data/'

# directory for local files (e.g. UrlToCountry)
DATA_LOCAL = '../data/'

In [14]:
# open GDELT data
gkg_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.gkg.csv"),schema=GKG_SCHEMA)
events_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.export.CSV"),schema=EVENTS_SCHEMA)
mentions_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.mentions.CSV"),schema=MENTIONS_SCHEMA)

In [None]:
# open helper datasets
UrlToCountry = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "UrlToCountry.csv")
CountryToCapital = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "country-capitals.csv")

1. **Are we emotionally biased?**

#### Q1. Fetching & preprocessing of the data (Code which is executed inside the cluster)

In [None]:
# select the required data from Mentions Dataset
mentions_q1_df = mentions_df.select("GLOBALEVENTID", "EventTimeDate", "MentionType", "MentionSourceName") \
                .filter(mentions_df["MentionType"] == '1')

# join the dataframe url to country
UrlToCountry = UrlToCountry.select(UrlToCountry['Country name'].alias('country_source'), UrlToCountry['Clean URL'].alias('url')) 
mentions_q1_country = mentions_q1_df.join(UrlToCountry, UrlToCountry['url'] == mentions_q1_df['MentionSourceName'], "left_outer") 

# print the number of urls that have no country
print('number of unknown urls: ', mentions_q1_country.filter("country_source is null").select('MentionSourceName').distinct().count())

# filter out urls that are unknown
mentions_filter_df = mentions_q1_country.filter(mentions_q1_country.country_source.isNotNull())

# print the number of urls associated with a country
print('number of known urls: ', mentions_filter_df.select('MentionSourceName').distinct().count())

In [None]:
# join the file with the countries and capitals
mentions_coord_df = mentions_filter_df.join(CountryToCapital, CountryToCapital['CountryName'] == mentions_filter_df['country_source'], "left_outer") 
# count() returns the number of rows, i.e. mentions, not the number of countries
print('Total number of mentions: ', mentions_coord_df.count())
print('Number of mentions with no coordinates: ', mentions_coord_df.filter("CapitalLatitude is null").count())

# filter out rows with no geographic coordinates
mentions_filter_coord_df = mentions_coord_df.filter(mentions_coord_df.CapitalLatitude.isNotNull())

# select relevant columns 
mentions_clean_df = mentions_filter_coord_df.select('GLOBALEVENTID', 'EventTimeDate','CountryCode', 'CountryName', mentions_filter_coord_df['CapitalLatitude'].alias('Source_Lat'),
                                                   mentions_filter_coord_df['CapitalLongitude'].alias('Source_Long'))

In [None]:
# select Data from Events Dataset
events_q1_df= events_df.select("GLOBALEVENTID", "ActionGeo_Lat", "ActionGeo_Long", "NumMentions","NumSources","NumArticles","AvgTone")

# filter out events that have no geographic coordinates
print('Total number of events: ', events_q1_df.count())
events_clean_df = events_q1_df.filter(events_q1_df.ActionGeo_Lat.isNotNull())
print('Number of events with geographic coordinates: ', events_clean_df.count())

# merge the clean events and mentions dfs
event_mentions_df = events_clean_df.join(mentions_clean_df, 'GLOBALEVENTID') 

#### Q1. Processing outside the cluster

In [None]:
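from math import radians

def geo_distance(r):
    '''Converts the geographic coordinates to radians and calls the geo_dist_calc function to compute the geographic distance'''
    # cast to float first: columns read from csv without a schema arrive as strings
    lat1 = radians(float(r['ActionGeo_Lat']))
    lon1 = radians(float(r['ActionGeo_Long']))
    lat2 = radians(float(r['Source_Lat']))
    lon2 = radians(float(r['Source_Long']))
    return geo_dist_calc(lat1, lon1, lat2, lon2)

# bring the joined data out of the cluster as a pandas dataframe
event_mentions_pd = event_mentions_df.toPandas()

# append a column with the geographic distance
event_mentions_pd['distance'] = event_mentions_pd.apply(geo_distance, axis=1)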

2. **Are some countries ignored in the news?**

3. **Are we emotionally predictable?**

4. **Do we have a saturation limit?** / 5. **Who is more emotional?**
##### 1. GCAM Emotions data
> We select the record ID and the GCAM emotions so that we can attribute to each event the emotions behind it

In [15]:
ID_GCAM = gkg_df.select("GKGRECORDID","GCAM")

In [16]:
df_ID_GCAM = ID_GCAM.toPandas()
df_ID_GCAM.head()

| | GKGRECORDID | GCAM |
|---|---|---|
| 0 | 20171123073000-0 | wc:122,c12.1:10,c12.10:7,c12.12:3,c12.13:1,c12... |
| 1 | 20171123073000-1 | wc:154,c1.2:1,c12.1:14,c12.10:10,c12.12:1,c12.... |
| 2 | 20171123073000-2 | wc:499,c12.1:23,c12.10:37,c12.12:6,c12.13:18,c... |
| 3 | 20171123073000-3 | wc:797,c1.1:1,c1.2:1,c12.1:26,c12.10:65,c12.11... |
| 4 | 20171123073000-4 | wc:200,c1.2:3,c12.1:1,c12.10:10,c12.12:4,c12.1... |


> Let's make sure we have only one ID per row (as expected)

In [17]:
print('\n', 'Length of all dataframe :' ,len(df_ID_GCAM), '\n', 
            'Length of unique IDs    :' ,len(df_ID_GCAM['GKGRECORDID'].unique()))


 Length of all dataframe : 4169 
 Length of unique IDs    : 4169


> We also need to split the emotions so that we can attribute the corresponding feelings to each event, since GCAM combines several dictionaries with different collections of words. For a clearer picture, see [GCAM](https://blog.gdeltproject.org/introducing-the-global-content-analysis-measures-gcam/)

In [18]:
# split the GCAM column
df_ID_Emotions = df_ID_GCAM['GCAM'].str.split(',',expand=True)
# insert the corresponding ID
df_ID_Emotions.insert(loc=0, column='GKGRECORDID', value=df_ID_GCAM['GKGRECORDID'])
df_ID_Emotions.head()

| | GKGRECORDID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20171123073000-0 | wc:122 | c12.1:10 | c12.10:7 | c12.12:3 | c12.13:1 | c12.14:3 | c12.3:2 | c12.4:2 | c12.5:9 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 20171123073000-1 | wc:154 | c1.2:1 | c12.1:14 | c12.10:10 | c12.12:1 | c12.13:4 | c12.14:5 | c12.3:4 | c12.4:2 | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 20171123073000-2 | wc:499 | c12.1:23 | c12.10:37 | c12.12:6 | c12.13:18 | c12.14:14 | c12.3:15 | c12.4:5 | c12.5:5 | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 20171123073000-3 | wc:797 | c1.1:1 | c1.2:1 | c12.1:26 | c12.10:65 | c12.11:1 | c12.12:8 | c12.13:22 | c12.14:38 | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 20171123073000-4 | wc:200 | c1.2:3 | c12.1:1 | c12.10:10 | c12.12:4 | c12.13:4 | c12.14:10 | c12.5:1 | c12.7:4 | ... |  |  |  |  |  |  |  |  |  |  |
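
In the positional split above, the same GCAM variable can land in different columns from one row to the next (column 1 holds `c12.1` in row 0 but `c1.2` in row 1). A possible alternative, sketched here under the assumption that every entry has the form `key:value`, is to parse each GCAM string into a dictionary keyed by variable code. The helper name `parse_gcam` is hypothetical.

```python
def parse_gcam(gcam):
    """Turn 'wc:122,c12.1:10,...' into {'wc': 122.0, 'c12.1': 10.0, ...}."""
    scores = {}
    if not gcam:  # missing or empty GCAM field
        return scores
    for entry in gcam.split(','):
        key, sep, value = entry.partition(':')
        if sep:  # skip malformed entries without a ':'
            scores[key] = float(value)
    return scores

row = parse_gcam("wc:122,c12.1:10,c12.10:7,c12.12:3,c12.13:1")
# variables absent from a record simply have no key
print(row["wc"], row.get("c12.1"), row.get("c1.2"))
```

With pandas, such a function could be applied to the `GCAM` column row by row, so that each variable ends up in its own named column.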


#### 1.1 Building emotions dictionary

In [23]:
# function to build emotion dictionary
Emotions_dictionary = get_emotion_dictionary(DATA_LOCAL, 'GCAM-MASTER-CODEBOOK.txt')

In [25]:
# let's have a look at it
Emotions_dictionary.tail()

| | Variable | DimensionHumanName |
|---|---|---|
| 2973 | c37.33 | Power |
| 2974 | c38.1 | ImageryValence |
| 2975 | c41.1 | POSITIVE |
| 2976 | c41.2 | NEGATIVE |
| 2977 | c41.3 | NEUTRAL |


> We see that different variables refer to different feelings
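
To illustrate how such a dictionary could be used, the sketch below replaces variable codes by their human-readable names. The mapping contains only the five rows shown above, the scores are made up, and `label_scores` is a hypothetical helper rather than one of the notebook's functions.

```python
# excerpt of the codebook, limited to the rows displayed above
code_to_name = {
    "c37.33": "Power",
    "c38.1": "ImageryValence",
    "c41.1": "POSITIVE",
    "c41.2": "NEGATIVE",
    "c41.3": "NEUTRAL",
}

def label_scores(scores, code_to_name):
    """Replace GCAM variable codes by their names; unknown codes are kept as-is."""
    return {code_to_name.get(code, code): value for code, value in scores.items()}

scores = {"wc": 200.0, "c41.1": 12.0, "c41.2": 30.0}  # made-up values
labelled = label_scores(scores, code_to_name)
print(labelled)
```

If two codes from different dictionaries shared the same human-readable name, this simple renaming would merge them, so keeping the code alongside the name may be safer on the full codebook.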

## Part 3

1. **Are we emotionally biased?**

In [None]:
# Evaluation of the dependency between the emotions and the importance of the conflict:

def emotion_importance(df, Importance_metrics):
    '''Computes descriptive statistics and exploratory plots.
    Importance_metrics can take the following values: NumMentions, NumSources, NumArticles'''
    emotions_Number = df.select(Importance_metrics, 'AvgTone').toPandas()
    print(emotions_Number.describe())
    # Pearson correlation coefficient between the 2 variables
    print('Pearson Correlation: ', emotions_Number[Importance_metrics].corr(emotions_Number['AvgTone']))
    emotions_Number.plot(x=Importance_metrics, y='AvgTone', style='o')
    return 

In [None]:
emotion_importance(event_mentions_df,'NumSources')

In [None]:
emotion_importance(event_mentions_df,'NumMentions')

In [None]:
emotion_importance(event_mentions_df,'NumArticles')