# Emotions in Conflicts: Data Collection, Analysis & Visualization


In this notebook we present our pipeline for answering the research questions raised in Milestone 1. Part 0 discusses general, mostly technical points that we encountered throughout the project. Part 1 describes our step-by-step strategy for answering each question. Part 2 provides the code used to fetch and analyse the data. Part 3 presents our visualizations of the analysis.

The code in Part 2 is executed on a sample of nine data files fetched from the cluster (three for each of the three schemas: gkg, mentions, export), and intermediate outputs are shown to guide the reader through the data processing. The code used to fetch and process the whole GDELT dataset on the cluster is in the `src` folder, and the individual files are hyperlinked in this notebook.

The analysis and the visualization part are done on the data gathered from the whole GDELT dataset, and the detailed interpretations are presented in our [data story](https://matterhorn-ada.github.io).

- [Part 0](#Part0): General points


- [Part 1](#Part1): Strategy to answer the research questions
    - [Question 0](#Part1_Q0): Exploratory analysis: Where does the news come from?
    - [Question 1](#Part1_Q1): Are we emotionally biased? 
    - [Question 2](#Part1_Q2): Are some countries ignored in the news?
    - [Question 3](#Part1_Q3): Are we emotionally predictable?
    - [Question 4](#Part1_Q4): Do we have a saturation limit?
    - [Question 5](#Part1_Q5): Who is more emotional?
    
    
- [Part 2](#Part2): Code
    - [Question 0](#Part2_Q0): Exploratory Data Fetching 
    - [Question 1](#Part2_Q1): Fetching & preprocessing of the data   
    - [Question 2](#Part2_Q2): Aggregating 
    - [Question 3](#Part2_Q3a): Fetching & preprocessing of the data (Cluster Code)
    - [Question 3](#Part2_Q3b): Decision tree building (learning & validating) 
    - [Question 4 & 5](#Part2_Q4a): GCAM Emotions data
    - [Question 4 & 5](#Part2_Q4b): Emotions dictionary
    
    
- [Part 3](#Part3): Visualization
    - [Question 0](#Part3_Q0): Choropleth maps
    - [Question 1](#Part3_Q1): Dependency evaluation
    - [Question 2](#Part3_Q2): Dependency evaluation
    - [Question 4 & 5](#Part3_Q4): Most common emotions
    

## Part 0 <a id='Part0'></a>

- All filtering and aggregation steps were executed on the cluster, and the resulting datasets were exported as either `parquet` or `json` files. Later processing was done with `PySpark` and `Pandas` dataframes.


- When working with country names and country codes from different data sources (e.g. GDELT country codes, GeoJSON country codes, United Nations country names, ...), we encountered several inconsistencies. GDELT uses the [FIPS country codes](https://en.wikipedia.org/wiki/List_of_FIPS_country_codes), which are mainly used by the US government, whereas the [alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-22) codes introduced by the International Organization for Standardization are more widely used (example: the FIPS code for Switzerland is SZ, and the alpha-2 code is CH). For the processing, we converted the GDELT codes to the ISO standard using a [dictionary](../data/CountryCodes/CountryCode_FIPS_alpha.csv) built from data from [geodatasource.com](https://www.geodatasource.com/resources/tutorials/international-country-code-fips-versus-iso-3166/). Some manual work was still needed for the country names, because some external datasets did not provide a country code and had to be joined on the country name (e.g. the [Human development index](http://hdr.undp.org/en/content/human-development-index-hdi)).
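As a minimal sketch of the code conversion described above, the snippet below maps FIPS codes to alpha-2 codes with a small lookup and flags codes that could not be matched. The mapping is only an illustrative subset (the pairs shown appear in the HDI helper table in Part 2); the function name `fips_to_alpha2` and the unknown code `XX` are hypothetical and not part of our pipeline.

```python
import pandas as pd

# Illustrative subset of a FIPS -> ISO alpha-2 lookup (not the full dictionary).
FIPS_TO_ALPHA2 = {"NO": "NO", "SZ": "CH", "AS": "AU", "EI": "IE", "GM": "DE"}


def fips_to_alpha2(codes: pd.Series) -> pd.Series:
    """Map FIPS codes to alpha-2 codes; unknown codes become NaN."""
    return codes.map(FIPS_TO_ALPHA2)


# "XX" is a made-up code used to show what an unmatched row looks like.
events = pd.DataFrame({"ActionGeo_CountryCode": ["SZ", "GM", "XX"]})
events["alpha2"] = fips_to_alpha2(events["ActionGeo_CountryCode"])

# Rows that need manual attention because no alpha-2 code was found.
unmatched = events[events["alpha2"].isna()]
print(events)
print("unmatched rows:", len(unmatched))
```

Keeping unmatched codes as missing values, rather than dropping them silently, makes it easier to see where manual work is still required.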


## Part 1 <a id='Part1'></a>

### Pursued strategy to answer the research questions:

### Question 0 <a id='Part1_Q0'></a>

**Exploratory analysis: Where does the news come from?**

#### Fetching & Processing the data 

 From the GDELT dataset we fetch the following information from the "Mentions" and "Events" sets:

- Url of the article mentioning the source 
- Location of the event
- Location of the country reporting the news: 
    - Get the country reporting the news from the url of the article using the [GDELT Geographic Source Lookup](https://blog.gdeltproject.org/mapping-the-media-a-geographic-lookup-of-gdelts-sources/)
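The lookup from an article URL to a reporting country can be sketched as follows. This is only an illustration: the notebook itself joins on the `MentionSourceName` column, which already holds the domain, and the two-entry lookup, the function name `source_country` and the example URLs are assumptions made for this sketch.

```python
from urllib.parse import urlparse

# Tiny illustrative lookup; the real table has one row per news domain.
DOMAIN_TO_COUNTRY = {
    "1tv.af": "Afghanistan",
    "afghan-review.com": "Afghanistan",
}


def source_country(article_url, lookup=DOMAIN_TO_COUNTRY):
    """Return the country for an article URL, or None if the domain is unknown."""
    host = (urlparse(article_url).hostname or "").lower()
    # Drop a leading "www." so the host matches the bare domain in the lookup.
    if host.startswith("www."):
        host = host[4:]
    return lookup.get(host)


print(source_country("http://www.1tv.af/en/news/1"))  # hypothetical article path
print(source_country("https://example.org/story"))    # domain not in the lookup
```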

#### Analysis

- Where do the events happen?
    - Group by the countries and count the number of events happening in this country.

- Who reports the news?
    - Group by the countries reporting the news and count the number of events reported by each country.

### Question 1 <a id='Part1_Q1'></a>

**Are we emotionally biased?** Do the number of conflicts or their distance from our home define our emotions? 

#### Fetching & Processing the data 

 From the GDELT dataset we fetch the following information from the "Mentions" and "Events" sets:

- Url of the article mentioning the source 
- Average Tone 
- Location of the event (latitudinal and longitudinal coordinates)
- Number of times the event is mentioned in the news (NumArticles) 
- Calculation of the distance between the source article and the event: 
    1. Get the country from the url of the article using the [GDELT Geographic Source Lookup](https://blog.gdeltproject.org/mapping-the-media-a-geographic-lookup-of-gdelts-sources/)
    2. Get the geographic coordinates of the capital of the source country using the csv file provided by [this website](http://techslides.com/list-of-countries-and-capitals)
    3. Calculate the geographic distance between the source article and the event with the [Great-Circle distance formula](https://en.wikipedia.org/wiki/Great-circle_distance)
    
    Note that in this approach we approximate the location of the source reporting the article by the capital of the country reporting the news.
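
For reference, the distance computed by the `geocalc` function in Part 2 can be written as follows. Point 1 is the event location and point 2 the capital of the source country, with latitudes $\varphi$ and longitudes $\lambda$ in radians. The Earth radius used in the code is $R = 6372.8$ km.

$$
\begin{aligned}
\Delta\lambda &= \lambda_1 - \lambda_2 \\
y &= \sqrt{(\cos\varphi_2 \sin\Delta\lambda)^2 + (\cos\varphi_1 \sin\varphi_2 - \sin\varphi_1 \cos\varphi_2 \cos\Delta\lambda)^2} \\
x &= \sin\varphi_1 \sin\varphi_2 + \cos\varphi_1 \cos\varphi_2 \cos\Delta\lambda \\
d &= R \cdot \operatorname{atan2}(y, x)
\end{aligned}
$$

This is the arctangent form of the great-circle distance, which stays numerically stable for both very small and nearly antipodal distances.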

#### Analysis

- Evaluation of the dependency between the emotions and the distance: 
    1. Plot the emotion metric (Average Tone) against the distance for each event. Colour-code the points by event location to see whether the country where the event occurred influences the emotion or, conversely, whether news from a given country is only reported when it carries a negative or positive association.

- Evaluation of the dependency between the emotions and the importance of the conflict:
    1. Evaluate whether the emotion metric depends on the importance of the conflict, using an "importance of an event" metric provided by GDELT (three very similar metrics are available: NumMentions, NumSources and NumArticles; we decided to use NumArticles).

### Question 2 <a id='Part1_Q2'></a>

**Are some countries ignored in the news?** Does the relation between the number of conflicts taking place in a country and the number of mentions in the media depend on where the conflict happened?

#### Fetching the data 

We use the same dataset as in Question 1, in addition to the dataset ["population by country"](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)) provided by the United Nations (2017).

#### Analysis

- Group by the event country and sum up the number of conflicts and the number of mentions 
- Evaluate whether there is a correlation between the number of conflicts and mentions or not
- Identify countries with a high number of conflicts but a comparatively low number of mentions
- Identify whether the number of people living in a country influences the number of events in that country
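
The first three analysis steps can be sketched in Pandas as shown below. The five rows are copied from the sample output of the Q2 aggregation in Part 2 and serve only as an illustration. The column `articles_per_event` and the choice of a simple Pearson correlation are assumptions of this sketch, not necessarily the exact method used for the final analysis.

```python
import pandas as pd

# Five rows taken from the sample Q2 aggregation output shown in Part 2.
data_q2 = pd.DataFrame({
    "ActionGeo_CountryCode": ["CI", "FI", "IC", "RO", "SL"],
    "sum_articles": [111, 4, 2, 23, 31],
    "count_events": [15, 1, 1, 9, 7],
})

# Pearson correlation between number of events and number of articles.
corr = data_q2["count_events"].corr(data_q2["sum_articles"])

# Coverage per event: low values hint at countries with relatively little coverage.
data_q2["articles_per_event"] = data_q2["sum_articles"] / data_q2["count_events"]
least_covered = data_q2.sort_values("articles_per_event").head(2)

print(f"correlation: {corr:.3f}")
print(least_covered)
```

A ratio such as articles per event is one simple way to compare coverage between countries with very different numbers of events.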


### Question 3 <a id='Part1_Q3'></a>

**Are we emotionally predictable?** Can we observe patterns of emotions with respect to a country, religion or an ethnical group? Can we derive a model predicting emotions in case of a new conflict based on its specific features?

# THIS NEEDS TO BE CHANGED

#### Rationale:

To observe patterns of emotions with respect to a country, religion and ethnicity, we will derive a model with the features being the country where the event happened (a selection of max. 10 countries), the religion (around 3), and the ethnicity (around 3) (categorical features), and the response variable being an emotion metrics (average tone, or an emotion from our emotion dictionary). We will then learn this model for each source country (country reporting the event) and validate our approach on an event not included in the training set. Below is an example to illustrate:

For every country in the world, an “emotion” model is trained on events that happened in Syria, with the actors being of Islamic and Christian religion (in the “Events” set we have access to the religion and ethnicity of Actor 1 and Actor 2). If a new conflict with these same features happens, we can then predict how e.g. Switzerland or the US (as well as every other country in the world) might react.

A further step in observing patterns is to cluster together countries that have similar emotional responses (with a K-means algorithm). Such a clustering might give insight into which countries are culturally close to each other.

#### Fetching:

In addition to the data fetched in Question 1, we select ethnicity- and religion-related data from the “Events” set (`Actor1EthnicCode`, `Actor1Religion1Code`, `Actor1Religion2Code`, `Actor2EthnicCode`, `Actor2Religion1Code`, `Actor2Religion2Code`).


### Question 4 <a id='Part1_Q4'></a>

**Do we have a saturation limit?** Does increasing number of conflicts make people feel worse and worse or is there some limit?

#### Fetching the data 

From the GDELT dataset we fetch the following information from the "GKG":

  - Locations 
  - V2Tone 
  - GCAM 

#### Analysis

  - Calculating the increasing number of conflicts: 
      1. Get the location of the country
      2. Parse the gkg files (in the chosen time interval) and get the events related to a country.
      3. Associate the number of conflicts with the emotions expressed

  - Possible limit of the emotions: 
      1. Get the V2Tone and the GCAM emotions related to the events
      2. Evaluate the emotions for each of these events, observing how the media reports them and whether there is desensitization after a threshold number of events (measured by lower V2Tone scores and more neutral GCAM emotions).






### Question 5 <a id='Part1_Q5'></a>

**Who is more emotional?** Do we see sensitivity differences between some countries or actors? 

#### Fetching the data 

From the GDELT dataset we fetch the following information from the "GKG" and "Events":

  - People
  - Locations 
  - V2Tone 
  - GCAM 

#### Analysis


  - Relation between country and emotion: 
      1. Similar to question 4
      2. Associate specific countries to the emotions. Specifically associate actors (people) to emotions concerning the events related to them.


## Part 2 <a id='Part2'></a>

In [51]:
# regular imports
import os
import numpy as np
import pandas as pd
import math
from math import radians, sqrt, sin, cos, atan2

# function imports
from helper_functions import *
from Schema import *

import warnings
warnings.filterwarnings('ignore')

#import findspark
#findspark.init('C:\opt\spark')

from pyspark.sql import *
from pyspark.sql.functions import to_timestamp,desc, col, abs, split, count, array, concat_ws, sum, min, max, mean, asc, coalesce, udf, when, window, explode, unix_timestamp, lit
from pyspark.sql.types import FloatType, ArrayType, StringType

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

%matplotlib inline
spark = SparkSession.builder.getOrCreate()
#sc = spark.sparkContext

# update when changing functions
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
#DATA_DIR = 'hdfs:///datasets/gdeltv2'  # cluster code
DATA_DIR = '../data/' # local code

# directory for local files 
DATA_LOCAL = '../data/'

In [4]:
# open GDELT data
gkg_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.gkg.csv"),schema=GKG_SCHEMA)
events_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.export.CSV"),schema=EVENTS_SCHEMA)
mentions_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.mentions.CSV"),schema=MENTIONS_SCHEMA)

In [5]:
# open helper datasets
CountryToCapital = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "CountryInformations/country-capitals.csv")
Domains = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "UrlToCountry/urls.csv")
Domains = Domains.select(Domains['alpha-2'].alias('code') ,Domains['name'].alias('country_source'), 'region', Domains['web'].alias('url'))

In [6]:
Domains.show(5)

+----+--------------+------+-----------------+
|code|country_source|region|              url|
+----+--------------+------+-----------------+
|  AF|   Afghanistan|  Asia| 108worldnews.com|
|  AF|   Afghanistan|  Asia|           1tv.af|
|  AF|   Afghanistan|  Asia|       1tvnews.af|
|  AF|   Afghanistan|  Asia|       aff.org.af|
|  AF|   Afghanistan|  Asia|afghan-review.com|
+----+--------------+------+-----------------+
only showing top 5 rows



The `Domains` table links the domain of a news article to the country that is reporting the event.

#### Q0.  Exploratory Data Fetching <a id='Part2_Q0'></a>

# CHANGE ME The whole cluster code is found [here](github balbalb)

In [23]:
#### get the number of events grouped by the country they occurred

# select the required data from Event Dataset
events_1 = events_df.select('ActionGeo_CountryCode').filter(events_df.ActionGeo_CountryCode.isNotNull())
CountryEvent_number = events_1.groupBy('ActionGeo_CountryCode').count()
CountryEvent_number.show(5)

+---------------------+-----+
|ActionGeo_CountryCode|count|
+---------------------+-----+
|                   CI|   15|
|                   FI|    1|
|                   IC|    1|
|                   RO|    9|
|                   SL|    7|
+---------------------+-----+
only showing top 5 rows



In [24]:
#### get the number of events grouped by the country reporting

# select the required data from Mentions Dataset
mentions_1 = mentions_df.select("GLOBALEVENTID", "MentionType", "MentionSourceName") \
                .filter(mentions_df["MentionType"] == '1') # MentionType 1 refers to the web articles and not e.g. books

# join the domain df with the country corresponding to the source
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer")

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

events_1 = events_df.select('GLOBALEVENTID')

mention_event = events_1.join(mentions_3, 'GLOBALEVENTID')
CountrySource_number = mention_event.groupBy('country_source').count()
CountrySource_number.show(5)

+-------------------+-----+
|     country_source|count|
+-------------------+-----+
|Korea (Republic of)|   36|
|        Philippines|  166|
|           Malaysia|  151|
|               Fiji|    5|
|             Turkey|   40|
+-------------------+-----+
only showing top 5 rows



#### Q1.  Fetching & preprocessing of the data  <a id='Part2_Q1'></a>

# ADD THE FILE The whole cluster code is found in the [Q1_cluster.py](FetchingData/Q1_cluster.py)

In [46]:
# select the required data from Mentions Dataset
mentions_1 = mentions_df.select("GLOBALEVENTID", "EventTimeDate", 'MentionIdentifier', "MentionType", "MentionSourceName") \
                .filter(mentions_df["MentionType"] == '1')

# join the domain df with the country corresponding to the source
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer")

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

# print the number of urls that have no country
print('number of unknown urls: ', mentions_2.filter("url is null").select('MentionSourceName').distinct().count())

# print the number of urls associated to a country
print('number of known urls: ', mentions_3.select('url').distinct().count())

number of unknown urls:  12
number of known urls:  843


Only a small fraction of the news domains (12 of 855, about 1.4 %) cannot be assigned to a country.

In [47]:
# join the file with the countries and capitals
mentions_4 = mentions_3.join(CountryToCapital, CountryToCapital['CountryName'] == mentions_3['country_source'], "left_outer") 

# filter out rows with no geographic coordinates
mentions_5 = mentions_4.filter(mentions_4.CapitalLatitude.isNotNull())

# select relevant columns 
mentions_6 = mentions_5.select('GLOBALEVENTID', 'MentionIdentifier','EventTimeDate','CountryCode', 'CountryName', mentions_5['CapitalLatitude'].alias('Source_Lat'),
                                                   mentions_5['CapitalLongitude'].alias('Source_Long'))
mentions_6 = mentions_6.withColumn("Source_Lat", mentions_6["Source_Lat"].cast("float"))
mentions_6 = mentions_6.withColumn("Source_Long", mentions_6["Source_Long"].cast("float"))

+-------------+--------------------+--------------+-----------+-----------+----------+-----------+
|GLOBALEVENTID|   MentionIdentifier| EventTimeDate|CountryCode|CountryName|Source_Lat|Source_Long|
+-------------+--------------------+--------------+-----------+-----------+----------+-----------+
|    709307290|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709307290|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709307289|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709307289|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709306810|https://www.pajhw...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709306810|https://www.pajhw...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    709306809|https://www.pajhw...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|
|    70930

In [52]:
# prepare gkg file
gkg_1 = gkg_df.select('DocumentIdentifier', 'V2Tone')
split_col = split(gkg_df['V2Tone'], ',')
gkg_2 = gkg_1.withColumn('Tone', split_col.getItem(0))
gkg_3 = gkg_2.withColumn('Polarity', split_col.getItem(3))
gkg_4 = gkg_3.withColumn('Emotion_Charge', abs(gkg_3.Tone - gkg_3.Polarity))

In [54]:
# select Data from Events Dataset
events_1= events_df.select("GLOBALEVENTID", 'ActionGeo_CountryCode', "ActionGeo_Lat", "ActionGeo_Long", "NumMentions","NumSources","NumArticles","AvgTone")

# filter out events that have no geographic coordinates
print('Total number of events: ', events_1.count())
events_2 = events_1.filter(events_1.ActionGeo_Lat.isNotNull())
print('Number of events with geographic coordinates: ', events_2.count())

# merge the clean events and mentions dfs
event_mention = events_2.join(mentions_6, 'GLOBALEVENTID') 

data_q1 = event_mention.join(gkg_4.select('DocumentIdentifier', 'Emotion_Charge'), gkg_4['DocumentIdentifier'] == event_mention['MentionIdentifier'])

Total number of events:  5251
Number of events with geographic coordinates:  5095
+-------------+---------------------+-------------+--------------+-----------+----------+-----------+----------+--------------------+--------------+-----------+-----------+----------+-----------+--------------------+-----------------+
|GLOBALEVENTID|ActionGeo_CountryCode|ActionGeo_Lat|ActionGeo_Long|NumMentions|NumSources|NumArticles|   AvgTone|   MentionIdentifier| EventTimeDate|CountryCode|CountryName|Source_Lat|Source_Long|  DocumentIdentifier|   Emotion_Charge|
+-------------+---------------------+-------------+--------------+-----------+----------+-----------+----------+--------------------+--------------+-----------+-----------+----------+-----------+--------------------+-----------------+
|    709307290|                   AF|      34.5167|       69.1833|          6|         1|          6|-3.5294118|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|http://www.khaama.

Most of the events reported by GDELT (5095 of 5251 in this sample, about 97 %) have a reported geographic location.

# CHANGE ME The following part of the code is found [here]()

In [22]:
# append a column with the geographic distance
def geocalc(lat1, lon1, lat2, lon2):
    #print(lat1, lon1, lat2, lon2)
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    
    dlon = lon1 - lon2

    EARTH_R = 6372.8

    y = sqrt((cos(lat2) * sin(dlon)) ** 2 + (cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)) ** 2)
    x = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(dlon)
    c = atan2(y, x)
    return EARTH_R * c

udf_geocalc = udf(geocalc, FloatType())
 
data_q1 = spark.read.parquet(DATA_LOCAL + 'data_q1.parquet')
data_q1 = data_q1.withColumn("distance", udf_geocalc('ActionGeo_Lat', 'ActionGeo_Long', 'Source_Lat', 'Source_Long'))

data_q1 = data_q1.select('ActionGeo_CountryCode', 'distance', 'AvgTone','NumArticles', data_q1['CountryName'].alias('SourceCountry'))
data_q1.show(5)

+---------------------+---------+-----------+-----------+-------------+
|ActionGeo_CountryCode| distance|    AvgTone|NumArticles|SourceCountry|
+---------------------+---------+-----------+-----------+-------------+
|                   AS|1786.5924|  1.2285012|        3.0|    Australia|
|                   AS|1786.5924|  1.2285012|        3.0|    Australia|
|                   IN|1087.6713|-0.29154518|        3.0|        India|
|                   IN|1087.6713|-0.29154518|        3.0|        India|
|                   PE|10281.417|  -6.508876|        2.0|  Netherlands|
+---------------------+---------+-----------+-----------+-------------+
only showing top 5 rows



When the pipeline was applied to the whole GDELT dataset, it produced a dataset of 3.4 GB. Since the goal was to plot each event as a data point in a graph, this dataset was randomly downsampled and only 0.01 % of the rows were kept.
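
The downsampling step can be illustrated with a minimal Pandas sketch. Note that 0.01 % corresponds to a fraction of 0.0001. The data frame below is filled with synthetic placeholder values, and the seeds are arbitrary; none of this is our actual data. On the cluster, Spark's `DataFrame.sample` accepts an analogous `fraction` argument, although it does not guarantee an exact row count.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data standing in for the full Q1 dataset.
rng = np.random.default_rng(0)
n_rows = 1_000_000
full = pd.DataFrame({
    "distance": rng.uniform(0, 20000, n_rows),
    "AvgTone": rng.normal(-2, 3, n_rows),
})

# Keep 0.01 % of the rows: 0.01 / 100 = 0.0001.
fraction = 0.01 / 100
sample = full.sample(frac=fraction, random_state=42)

print(len(full), "->", len(sample), "rows")
```

Fixing the random seed keeps the plotted subset reproducible between runs.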

#### Q2. Aggregation <a id='Part2_Q2'></a>

The whole cluster code is found in the [Q2_cluster.py](FetchingData/Q2_cluster.py) file.

In [27]:
data_q2 = events_2.groupBy('ActionGeo_CountryCode').agg(sum('NumArticles').alias('sum_articles'), count('GLOBALEVENTID').alias('count_events'))
data_q2.show(5)

+---------------------+------------+------------+
|ActionGeo_CountryCode|sum_articles|count_events|
+---------------------+------------+------------+
|                   CI|         111|          15|
|                   FI|           4|           1|
|                   IC|           2|           1|
|                   RO|          23|           9|
|                   SL|          31|           7|
+---------------------+------------+------------+
only showing top 5 rows



#### Q3. Are we emotionally predictable? <a id='Part2_Q3a'></a>

### Fetching the data

The whole cluster code is found in the [Q3_cluster.py](FetchingData/Q3_cluster.py) file.

# CHANGE HDI_categories datafile

In [30]:
# open helper dataset
HDI_df = spark.read.format("csv").option("header", "true").load("../data/CountryInformations/HDI_code_df.csv")
HDI_df.show(5)

+---+-----------+---------+------------+----------+-------+
|_c0|    Country|      HDI|Country name|FIPS_GDELT|alpha-2|
+---+-----------+---------+------------+----------+-------+
|  0|     Norway|Very High|      Norway|        NO|     NO|
|  1|Switzerland|Very High| Switzerland|        SZ|     CH|
|  2|  Australia|Very High|   Australia|        AS|     AU|
|  3|    Ireland|Very High|     Ireland|        EI|     IE|
|  4|    Germany|Very High|     Germany|        GM|     DE|
+---+-----------+---------+------------+----------+-------+
only showing top 5 rows



The Human Development Index is divided into four categories: Very High, High, Medium and Low.

In [31]:
# filter on events that have count information
gkg_1 = gkg_df.filter(gkg_df.Counts.isNotNull())
CountType = split(gkg_1['Counts'], '#')

# add the event type
gkg_2 = gkg_1.withColumn('EventType', CountType.getItem(0))

# add the number of people concerned
gkg_3 = gkg_2.withColumn('NumPeople', CountType.getItem(1))

In [32]:
# prepare mention file
mentions_1 = mentions_df.select("GLOBALEVENTID", "MentionType", "MentionSourceName", 'MentionIdentifier') \
                .filter(mentions_df["MentionType"] == '1')

# join the dataframe url to country
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer") 

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

In [35]:
events_1= events_df.select("GLOBALEVENTID", 'Day_DATE','Actor1Religion1Code', 'Actor2Religion1Code',
                               'NumMentions',"AvgTone", 'GoldsteinScale', 'ActionGeo_CountryCode')

# filter events with no country code
events_2 = events_1.filter(events_1.ActionGeo_CountryCode.isNotNull())

# create religion column (take one of the two actors, because in the example test there was never data for both at the same time)
events_3 = events_2.withColumn('ActorReligion', coalesce(events_2['Actor1Religion1Code'], events_2['Actor2Religion1Code']))

# filter out rows with no religion
events_4 = events_3.filter(events_3.ActorReligion.isNotNull())

In [38]:
#### join the files together

# join the HDI file to country code
events_5 = events_4.join(HDI_df.select('Country', 'HDI', 'FIPS_GDELT'), HDI_df['FIPS_GDELT']==events_4['ActionGeo_CountryCode'])

# join event and mention file
mention_event = events_5.join(mentions_3, 'GLOBALEVENTID')

# join mention_event and gkg
gkg_mention_event = mention_event.join(gkg_3, mention_event['MentionIdentifier'] == gkg_3['DocumentIdentifier'])

In [37]:
print('gkg, original size: ', gkg_df.count())
print('gkg, filtered: ', gkg_3.count())
print('mention, original size: ', mentions_df.count())
print('mention, filtered: ', mentions_3.count())
print('event, original size: ', events_df.count())
print('event, filtered: ', events_4.count())
print('files joined together: ', mention_event.count())
print('all: ', gkg_mention_event.count())

gkg, original size:  6098
gkg, filtered:  772
mention, original size:  13508
mention, filtered:  13421
event, original size:  5251
event, filtered:  199
files joined together:  200
all:  49


We see that the filtering reduces the size of the data files quite drastically. This is because, in the gkg file, most of the articles do not have a specified event type; in the event file, very few events have the religion of the actors defined; and in the mention file, not every article can be linked to a source country.

In the following we will select the emotions of interest from the GCAM data column available in the gkg file. 

In [39]:
HRE = ['c2.152', 'c2.214', 'c3.2', 'c5.7', 'c6.5', 'c7.2', 'c10.1',
       'c14.9', 'c15.3', 'c15.4', 'c15.12', 'c15.26', 'c15.27', 'c15.30',
       'c15.36', 'c15.42', 'c15.53', 'c15.57', 'c15.61', 'c15.92',
       'c15.93', 'c15.94', 'c15.97', 'c15.101', 'c15.102', 'c15.103',
       'c15.105', 'c15.106', 'c15.107', 'c15.108', 'c15.109', 'c15.110',
       'c15.116', 'c15.120', 'c15.123', 'c15.126', 'c15.131', 'c15.136',
       'c15.137', 'c15.152', 'c15.171', 'c15.173', 'c15.179', 'c15.203',
       'c15.219', 'c15.221', 'c15.239', 'c15.260', 'c21.1', 'c35.31',
       'c24.1', 'c24.2', 'c24.3', 'c24.4', 'c24.5', 'c24.6', 'c24.7',
       'c24.8', 'c24.9', 'c24.10', 'c24.11', 'c36.31', 'c37.31', 'c41.1']


Emot_Words_df = gkg_mention_event.select(gkg_mention_event['ActionGeo_CountryCode'].alias('CountryEvent'), 'EventType',
                                   'ActorReligion', 'HDI', 'AvgTone',
                                  'country_source', 'GCAM',split(col("GCAM"), ":").alias("GCAM2"))

Emot_Words_df = Emot_Words_df.withColumn('GCAM2', concat_ws(',', 'GCAM2'))

Emot_Words_df = Emot_Words_df.select('CountryEvent', 'EventType','ActorReligion', 'country_source',
                                     'HDI', 'AvgTone', split(col("GCAM2"), ",").alias("GCAM"))

Emot_Words_df = Emot_Words_df.withColumn("HRE", array([lit(x) for x in HRE]) )

# keep only the GCAM entries that are also in the HRE list (set intersection)
differencer = udf(lambda x, y: list(set(x) & set(y)), ArrayType(StringType()))
Emot_Words_df = Emot_Words_df.withColumn('DIF', differencer('HRE', 'GCAM'))

Emot_Words_df = Emot_Words_df.select('CountryEvent', 'EventType','ActorReligion', 'country_source',
                                     'HDI', 'AvgTone', 'DIF')

data_q3 = Emot_Words_df.dropDuplicates()

In [41]:
data_q3.show(5)

+------------+---------+-------------+--------------------+------+----------+--------------------+
|CountryEvent|EventType|ActorReligion|      country_source|   HDI|   AvgTone|                 DIF|
+------------+---------+-------------+--------------------+------+----------+--------------------+
|          BK|     KILL|          MOS|United States of ...|  High|-10.911809|[c15.110, c15.103...|
|          MR|  PROTEST|          MOS|              Turkey|   Low|-8.5561495|[c2.214, c41.1, c...|
|          ID|   AFFECT|          MOS|            Malaysia|Medium|-2.7181687|[c2.214, c14.9, c...|
|          PK|  PROTEST|          MOS|         Philippines|Medium|-3.1468532|[c15.110, c14.9, ...|
|          IZ|  PROTEST|          MOS|        Saudi Arabia|Medium| 1.9607843|[c15.171, c2.214,...|
+------------+---------+-------------+--------------------+------+----------+--------------------+
only showing top 5 rows



This dataset is used to build a decision tree in the following section. Because of the drastic filtering, the pipeline applied to the whole GDELT dataset returned a dataset with a manageable size of 88 MB.
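
The selection of emotion codes from the GCAM column can also be sketched in plain Python. The sketch assumes that a GCAM field is a comma-separated list of `key:value` pairs, which is what the splitting on `:` and `,` above relies on. The example string, the shortened list of wanted codes and the function names are made up for illustration.

```python
def parse_gcam(gcam):
    """Parse a 'key:value,key:value' GCAM string into a dict of floats."""
    scores = {}
    for item in gcam.split(","):
        key, sep, value = item.partition(":")
        if sep and key:  # skip empty or malformed entries
            scores[key] = float(value)
    return scores


def emotions_present(gcam, wanted):
    """Return the wanted GCAM codes that occur in the given GCAM string."""
    return sorted(set(parse_gcam(gcam)) & set(wanted))


# Hypothetical GCAM field and a shortened list of codes of interest.
example_gcam = "wc:125,c2.152:3,c15.110:1,v10.1:3.5"
wanted_codes = ["c2.152", "c15.110", "c41.1"]

print(emotions_present(example_gcam, wanted_codes))
```

Parsing into a dictionary keeps the scores available, so the same helper could later be used to weight emotions rather than only test for their presence.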

### Building the decision tree <a id='Part2_Q3b'></a>

#### Prediction model:

Decision tree model with the Average Tone as the response.

Features are: CountryEvent, EventType, ActorReligion, HDI

Other considered Features are: NumPeople, ActorEthnicity, GoldsteinScale

#### Feature description:

CountryEvent: Country where the event happened (categorical variable). To have a large sample size for each category, we only pick the 10 countries that are mentioned most often.

EventType: This reflects the type of the event. Categorical value with the following values: AFFECT, ARREST, KIDNAP, KILL, PROTEST, SEIZE, or WOUND.

NumPeople: Number of people concerned by the EventType. Not included, because on its own it has no meaning. It could be replaced by NumMentions if that is found to be significant.

ActorEthnicity: Ethnicity of the actors involved in the event, as indicated by the CAMEO Ethnic Coding Scheme (see Chapter 5 of the [CAMEO Conflict and Mediation Event Observations Event and Actor Codebook](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf)). We use the 10 most common ethnic groups (categorical variable). Not included, because it is redundant with CountryEvent.

ActorReligion: Religion of the actors involved in the event, as indicated by the CAMEO Religious Coding Scheme (see Chapter 4 of the [CAMEO Conflict and Mediation Event Observations Event and Actor Codebook](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf)). 19 religions are reported (categorical variable).

GoldsteinScale: Score that reflects the potential impact an event will have on the stability of a country. The attributed value lies between -10 and 10, where a positive value indicates more cooperation than conflict. Not included, because this variable did not appear to be related to the outcome variable or the situation.

HDI: Human Development Index. This index measures the development of a country, including not only its economic growth (measured by the gross national income) but also life expectancy and education (more information is found on the [United Nations Development Website](http://hdr.undp.org/en/content/human-development-index-hdi)). Continuous variable (HDI $\in [0,1]$); in the helper dataset loaded above it is grouped into the four categories Very High, High, Medium and Low. We use the [latest version](http://hdr.undp.org/en/composite/HDI) (released in Sep. 2018) covering the period of 2017. 

AvgTone: The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.
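
The prediction example further down queries the model with an HDI value of `'Very High'`, which suggests that the continuous index is grouped into tiers before encoding. The following is a minimal sketch of such a grouping, assuming the standard UNDP cutoffs (0.550, 0.700 and 0.800). The function name `hdi_tier` and the tier labels other than `'Very High'` are hypothetical and not taken from the project code.

```python
def hdi_tier(hdi):
    """Map a continuous HDI value in [0, 1] to a UNDP-style development tier."""
    if not 0.0 <= hdi <= 1.0:
        raise ValueError(f"HDI must lie in [0, 1], got {hdi}")
    # assumed cutoffs: UNDP human development groups
    if hdi >= 0.800:
        return "Very High"
    if hdi >= 0.700:
        return "High"
    if hdi >= 0.550:
        return "Medium"
    return "Low"


# illustrative values only, not taken from the HDI table
examples = {value: hdi_tier(value) for value in (0.35, 0.55, 0.74, 0.80, 0.95)}
print(examples)
```

Grouping the index this way turns HDI into a categorical feature with only four levels, which keeps every level well populated when only 10 countries are retained.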


In [None]:
model_data = Data_q3.toPandas()
model_data.head()

In [None]:
# discretize the average tone
bins = pd.IntervalIndex.from_tuples([(-101, -10), (-10, -8), (-8, -5), (-5, -2), (-2, 0), (0, 2),
                                     (2, 5), (5, 10), (10, 101)])
model_data.AvgTone = pd.cut(model_data.AvgTone, bins)

# transform into categorical data & create corresponding dictionary
le_CountryEvent = LabelEncoder()
model_data['CountryEvent'] = le_CountryEvent.fit_transform(model_data['CountryEvent'])
le_CountryEvent_mapping = dict(zip(le_CountryEvent.classes_, le_CountryEvent.transform(le_CountryEvent.classes_)))


# transform back with: list(le_CountryEvent.inverse_transform([2, 2, 1]))
le_EventType = LabelEncoder()
model_data['EventType'] = le_EventType.fit_transform(model_data['EventType'])
le_EventType_mapping = dict(zip(le_EventType.classes_, le_EventType.transform(le_EventType.classes_)))

le_ActorReligion = LabelEncoder()
model_data['ActorReligion'] = le_ActorReligion.fit_transform(model_data['ActorReligion'])
le_ActorReligion_mapping = dict(zip(le_ActorReligion.classes_, le_ActorReligion.transform(le_ActorReligion.classes_)))

le_HDI = LabelEncoder()
model_data['HDI'] = le_HDI.fit_transform(model_data['HDI'])
le_HDI_mapping = dict(zip(le_HDI.classes_, le_HDI.transform(le_HDI.classes_)))

le_AvgTone = LabelEncoder()
model_data['AvgTone'] = le_AvgTone.fit_transform(model_data['AvgTone'])
le_AvgTone_mapping = dict(zip(le_AvgTone.classes_, le_AvgTone.transform(le_AvgTone.classes_)))
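
To make the discretization of the tone easier to follow, here is a small sketch that applies the same bins to a few made-up tone values. It relies on two documented pandas behaviours: intervals built with `IntervalIndex.from_tuples` are closed on the right by default, and values that fall outside every interval become `NaN`. The use of `.cat.codes` to obtain integer codes is an alternative shown for illustration only; the notebook itself uses `LabelEncoder`.

```python
import pandas as pd

# same bins as in the notebook; intervals are right-closed, e.g. (-10, -8]
bins = pd.IntervalIndex.from_tuples([(-101, -10), (-10, -8), (-8, -5), (-5, -2), (-2, 0),
                                     (0, 2), (2, 5), (5, 10), (10, 101)])

# made-up tone values, including a boundary value and an out-of-range value
tones = pd.Series([-10.0, -3.2, 0.0, 1.5, 250.0])

binned = pd.cut(tones, bins)

# position of each bin in the ordered list of intervals; -1 marks "no bin"
codes = binned.cat.codes

summary = pd.DataFrame({"tone": tones, "bin": binned, "code": codes})
print(summary)
```

Note that a tone of exactly -10 lands in the most negative bin and a tone of exactly 0 lands in `(-2, 0]`, so a neutral tone is grouped with the slightly negative ones.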


In [None]:
# show transformed dataset
model_data.head()

In [None]:
# define X and Y
X = model_data.values[:, 0:4]
Y = model_data.values[:,4]
Y = Y.astype(int) 

In [None]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

In [None]:
# train the model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)


In [None]:
y_pred = clf.predict(X_test)
print("Accuracy is", accuracy_score(y_test, y_pred) * 100)

In [None]:
# predict for: ['MY', 'AFFECT', 'CHR', 'Very High']
y_pr = clf.predict([[le_CountryEvent_mapping['MY'], le_EventType_mapping['AFFECT'],
                     le_ActorReligion_mapping['CHR'], le_HDI_mapping['Very High']]])
# look up the tone interval whose encoded label matches the prediction
AvTone = [key for (key, value) in le_AvgTone_mapping.items() if value == y_pr[0]]
AvTone
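
The encode, train, predict and decode steps can be hard to follow when spread over several cells, so here is a compact sketch of the same idea on a toy table. The data, the column name `ToneBin` and its labels are invented for illustration. The sketch uses `LabelEncoder.inverse_transform` to decode the prediction, a possible alternative to searching the mapping dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# invented toy data: every (country, event type) pair occurs exactly once
toy = pd.DataFrame({
    "CountryEvent": ["US", "US", "SY", "SY", "IN", "IN"],
    "EventType": ["PROTEST", "KILL", "KILL", "WOUND", "PROTEST", "ARREST"],
    "ToneBin": ["neutral", "negative", "very negative",
                "very negative", "neutral", "negative"],
})

# one encoder per column, so that each column can be decoded independently
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}
encoded = pd.DataFrame({col: encoders[col].transform(toy[col]) for col in toy.columns})

X = encoded[["CountryEvent", "EventType"]].to_numpy()
y = encoded["ToneBin"].to_numpy()

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# encode the query with the same encoders that were used for training
query = [[encoders["CountryEvent"].transform(["SY"])[0],
          encoders["EventType"].transform(["KILL"])[0]]]
pred_code = clf.predict(query)

# decode the predicted class back to its original label
pred_label = encoders["ToneBin"].inverse_transform(pred_code)[0]
print(pred_label)
```

Because the toy table has no conflicting rows, the fully grown tree reproduces its training labels. On real data the accuracy should always be measured on the held-out test set, as done above.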

Q4. **Do we have a saturation limit?** / Q5. **Who is more emotional?**

The data needed to answer questions 4 & 5 is very similar, so the processing for both is shown here.

##### Q4. a) GCAM Emotions data <a id='Part2_Q4a'></a>
> We select the ID, the emotions, the locations, the persons and the tone so that we can attribute to each event the emotion behind it. For questions 4 and 5 we mainly use the gkg files, as they contain all the information needed to answer our questions (locations, persons, emotions).

In [None]:
ID_GCAM = gkg_df.select("GKGRECORDID","GCAM","Locations","Persons","V2Tone")

In [None]:
df_ID_GCAM = ID_GCAM.toPandas()
df_ID_GCAM.head()

> Let's make sure we only have one ID per row (as expected).

In [None]:
print('\n', 'Length of all dataframe :' ,len(df_ID_GCAM), '\n', 
            'Length of unique IDs    :' ,len(df_ID_GCAM['GKGRECORDID'].unique()))

> Let's focus on the emotions, since the other columns are pretty straightforward. We need to split the emotions so that we can attribute the corresponding feeling to each event, as there are several dictionaries with different collections of words. For a clearer picture, see [GCAM](https://blog.gdeltproject.org/introducing-the-global-content-analysis-measures-gcam/).

In [None]:
# split the GCAM column
df_ID_Emotions = df_ID_GCAM['GCAM'].str.split(',',expand=True)
# insert the corresponding ID
df_ID_Emotions.insert(loc=0, column='GKGRECORDID', value=df_ID_GCAM['GKGRECORDID'])
df_ID_Emotions.head()
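
Each GCAM field is a comma-separated list of `key:value` entries: `wc` holds the word count of the document, keys starting with `c` hold the number of words matched by a dictionary dimension, and keys starting with `v` hold average scores. As a complement to the column-wise split above, here is a minimal sketch of parsing a single field into a Python dictionary. The function name `parse_gcam` and the sample string are illustrative and not part of the project code.

```python
def parse_gcam(gcam_field):
    """Parse one GCAM string into a {key: value} dictionary of floats."""
    scores = {}
    for entry in gcam_field.split(","):
        key, sep, value = entry.partition(":")
        if not sep:
            # skip empty or malformed entries that have no "key:value" form
            continue
        scores[key.strip()] = float(value)
    return scores


# illustrative sample in the GCAM key:value format
sample = "wc:125,c2.21:4,c10.1:40,v10.1:3.21"
parsed = parse_gcam(sample)

# keep only the word-count entries of the dictionary dimensions
word_counts = {key: value for key, value in parsed.items() if key.startswith("c")}
print(parsed)
print(word_counts)
```

Keeping the key with each value avoids relying on column positions, which differ from record to record because every record lists only the dimensions it actually matched.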

##### Q4. b) Building the emotion dictionary <a id='Part2_Q4b'></a>

In [None]:
# build the emotion dictionary from the GCAM master codebook
Emotions_dictionary = get_emotion_dictionary(DATA_LOCAL, 'GCAM-MASTER-CODEBOOK.txt')

In [None]:
# let's have a look at it
Emotions_dictionary.tail()

> We see that different variables refer to different feelings.
> As you will see, one of the most common emotions is "H4Lvd", which is clearly not an emotion or a feeling expressed in the text. Instead, it corresponds to two dictionaries: these words belong to the Harvard and Lasswell dictionaries. To understand what this means, we looked at the spreadsheet of the words in these dictionaries; additional information is found in [H4Lvd](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm).
> When we don't have the specific emotion, it is useful to know which feeling is most commonly associated with the most common dictionaries (hence the check of the spreadsheets). For others, we already know the feeling the dictionary refers to, such as "Positivity" via Lexicoder, “Smugness” via WordNet Affect, “Passivity” via Regressive Imagery Dictionary, etc. With this information we can associate these sentiments with the news and speeches for each event.

# Part 3 <a id='Part3'></a>

We create most of our plots with the plotly library. The choropleth world maps are drawn with GeoJSON (?).

In [17]:
from Visualization import *
import pandas as pd

# update when changing functions
%load_ext autoreload
%autoreload 2



Q0. **Where does the news come from?** <a id='Part3_Q0'></a>

In [None]:
# plot the world map of the number of events grouped by country

In [None]:
# plot the world map of the number of events reported by each country

Q1. **Are we emotionally biased?** <a id='Part3_Q1'></a>

In [None]:
# plot the average tone vs the distance

Q2. **Are some countries ignored in the news?** <a id='Part3_Q2'></a>

We draw a bubble plot, where the size of each bubble reflects the population of the country.

In [18]:
data_q2_final = pd.read_csv("data_q2_final.csv")
visualization_q2(data_q2_final)

Q4. **Do we have a saturation limit?** / Q5. **Who is more emotional?** <a id='Part3_Q4'></a>

In [None]:
# plot the top 10 most common emotions in 500 events (for each event)
Top_Emotions_GCAM = dict_top_feelings(df_ID_Emotions,Emotions_dictionary,1,500)
plot_commom_emotion(tuple_list = None, dictionary = Top_Emotions_GCAM, dictionary_flag=True)

> With this analysis we wanted to see whether there are recurrent emotions in the news or speeches. Clearly there are. There may be several reasons for this: the dictionaries may be so large that many words always belong to them, or the words in those dictionaries may simply be common. We will take this into account when attributing emotions to countries or people (also because some words belong to several dictionaries with different meanings). For feelings whose meaning is not obvious (unlike, for example, "POSITIVE"), we made an extra effort to understand what they mean.
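
To illustrate what counting the most common emotion per event can look like, here is a small self-contained sketch. It is not the implementation of `dict_top_feelings`: the helper `top_dimension`, the sample records and the names in `toy_names` are all hypothetical, and the sketch only assumes the GCAM `key:value` format, where keys starting with `c` hold word counts.

```python
from collections import Counter


def top_dimension(gcam_field):
    """Return the count key (c...) with the highest word count, or None."""
    best_key, best_count = None, 0.0
    for entry in gcam_field.split(","):
        key, sep, value = entry.partition(":")
        # only count entries compete; "wc" and "v..." entries are ignored
        if sep and key.startswith("c") and float(value) > best_count:
            best_key, best_count = key, float(value)
    return best_key


# hypothetical code-to-name lookup, standing in for the emotion dictionary
toy_names = {"c2.21": "dimension A", "c10.1": "dimension B"}

# invented records in the GCAM format
records = [
    "wc:125,c2.21:4,c10.1:40,v10.1:3.21",
    "wc:80,c2.21:12,c10.1:3",
    "wc:60,c10.1:9",
]

# count how often each dimension is the strongest one in a record
top_counts = Counter(toy_names.get(top_dimension(record), "unknown") for record in records)
print(top_counts.most_common())
```

A dimension backed by a very large dictionary will win this count in many records simply because it matches more words, which is one of the effects discussed above.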