# Emotions in Conflicts: Data Collection, Analysis & Visualization


In this notebook we present our pipeline to answer the research questions that we raised in Milestone 1. We start with Part 0, where we discuss some general, mostly technical, points that we encountered throughout the project. In Part 1, we describe our step-by-step strategy for answering the individual questions. In Part 2, we provide the code used to fetch and analyse the data. In Part 3, we present our visualizations of the data analysis.

In this notebook, the code in Part 2 is executed on a sample of nine datafiles fetched from the cluster (3 for each of the 3 schemas (gkg, mentions, export)) and intermediate outputs are shown to guide the reader through the data processing part. The code used to fetch and process the whole GDELT dataset in the cluster is found in the `src` folder and the individual files are hyperlinked in this notebook. The processing of the extracted cluster data is shown on the actual fetched data.

The statistical analysis and the visualization are likewise done on the data gathered from the whole GDELT dataset, and the detailed interpretations are presented in our [data story](https://matterhorn-ada.github.io).

- [Part 0](#Part0): General points


- [Part 1](#Part1): Strategy to answer the research questions
    - [Question 0](#Part1_Q0): Exploratory analysis: Where does the news come from?
    - [Question 1](#Part1_Q1): Are we emotionally biased? 
    - [Question 2](#Part1_Q2): Are some countries ignored in the news?
    - [Question 3](#Part1_Q3): Are we emotionally predictable?
    - [Question 4](#Part1_Q4): Do we have a saturation limit?
    - [Question 5](#Part1_Q5): Who is more emotional?
    
    
- [Part 2](#Part2): Code
    - [Question 0](#Part2_Q0): Exploratory Data Fetching 
    - [Question 1](#Part2_Q1): Fetching & preprocessing of the data   
    - [Question 2](#Part2_Q2): Aggregating 
    - [Question 3](#Part2_Q3a): Fetching & preprocessing of the data (Cluster Code)
    - [Question 3](#Part2_Q3b): Model building (learning & validating) 
    - [Question 4](#Part2_Q4): V2Tone and Emotion Metric
    - [Question 5](#Part2_Q5): GCAM Emotions and Emotions dictionary
    
    
- [Part 3](#Part3): Statistics & Visualization
    - [Question 0](#Part3_Q0): Choropleth maps
    - [Question 1](#Part3_Q1): Dependency evaluation
    - [Question 2](#Part3_Q2): Dependency evaluation
    - [Question 4](#Part3_Q4): Emotion Saturation
    - [Question 5](#Part3_Q5): GCAM words and countries
    

## Part 0 <a id='Part0'></a>

- All filtering and aggregation steps were executed on the cluster, and the resulting datasets were exported as either `parquet` or `json` files. The subsequent processing was done with `PySpark` and `Pandas` dataframes. To handle the large dataset locally, e.g. for the plots and the model creation, some downsampling was done: random sampling for the plots, and stratified sampling for the model data to guarantee a similar amount of data in each feature category.


- When working with country names and country codes from different data sources (e.g. GDELT country codes, GeoJSON country codes, United Nations country names), we encountered several inconsistencies in the names and codes. GDELT uses the [FIPS country codes](https://en.wikipedia.org/wiki/List_of_FIPS_country_codes), which are mainly used by the US government, whereas the [alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) codes introduced by the International Organization for Standardization are more widely used (example: the FIPS code for Switzerland is SZ, the alpha-2 code is CH). For the processing, we converted the GDELT codes to the ISO standard with a [dictionary](../data/CountryCodes/CountryCode_FIPS_alpha.csv) built from data taken from [geodatasource.com](https://www.geodatasource.com/resources/tutorials/international-country-code-fips-versus-iso-3166/). Some manual work was still needed for the country names, because some external datasets did not provide a country code and had to be joined on the country name (e.g. the [Human development index](http://hdr.undp.org/en/content/human-development-index-hdi)).
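
As a minimal sketch of the code conversion described above, the snippet below maps FIPS codes to alpha-2 codes with a small dictionary. The five pairs are taken from the HDI helper table printed in Part 2; the code `XX` and the variable names are hypothetical and only illustrate how unmatched codes can be spotted for manual fixing.

```python
import pandas as pd

# FIPS -> ISO alpha-2 pairs as they appear in the HDI helper table (Part 2)
fips_to_alpha2 = {"NO": "NO", "SZ": "CH", "AS": "AU", "EI": "IE", "GM": "DE"}

# toy events table; "XX" is a made-up code with no match in the dictionary
events = pd.DataFrame({"ActionGeo_CountryCode": ["SZ", "GM", "XX"]})

# translate the GDELT (FIPS) code; unknown codes become NaN
events["alpha2"] = events["ActionGeo_CountryCode"].map(fips_to_alpha2)

# rows that still need manual attention
unmatched = events[events["alpha2"].isna()]
print(events)
print("codes without a match:", unmatched["ActionGeo_CountryCode"].tolist())
```

In the real pipeline the dictionary would presumably be loaded from the CSV file linked above rather than typed by hand.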


## Part 1 <a id='Part1'></a>

### Pursued strategy to answer the research questions:

Many of our questions make use of an emotion charge that we defined ourselves, so it makes sense to get familiar with it from the beginning.

**Emotion Charge** *Our emotion metrics explained*

To build a reliable metric that relates emotion first to each event and then to each country, we proceed as follows:
1. Take the absolute value of the difference between the average tone and the polarity, both of which are provided by GDELT and rely on the count of positive and negative words. Why: a highly polarised text carries a large emotional charge, but the average tone cannot convey this, because it is the difference between the positive and the negative score of the text. A text with maximal positive and maximal negative scores has an average tone of 100 - 100 = 0, even though it is highly emotional. With the absolute value, the further the result lies from 0, the more emotional the text.
    
2. Normalize the word count to a value between 0 and 1. Why: the more words GDELT used to compute the polarity and the average tone, the more reliable the result. Multiplying the value from step 1 by the normalized word count therefore weights each value by its reliability in the final sum.
    
3. Sum all the values from step 2 when grouping by country, and divide the sum by the number of events that the country had in that month. The reason is similar to step 2: if a country has a very large number of events (as is the case for the USA), the raw sum is biased. The USA might not be the most emotional country, but since it has the most events recorded by GDELT it would appear so. Dividing by the number of events gives the average emotional charge for that country.
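
The three steps can be sketched in pandas as follows. This is an illustration under stated assumptions, not the cluster code: the word count is normalized with a min-max scaling (the text does not say which normalization was used), each row is treated as one event of a single month, and the column names `Country`, `Tone`, `Polarity` and `WordCount` as well as the toy values are made up.

```python
import pandas as pd

def emotion_charge_by_country(df):
    """Average weighted emotion charge per country (illustrative sketch)."""
    out = df.copy()
    # step 1: absolute difference between tone and polarity
    out["raw_charge"] = (out["Tone"] - out["Polarity"]).abs()
    # step 2: weight by the word count scaled to [0, 1] (min-max is an assumption)
    wc = out["WordCount"]
    span = wc.max() - wc.min()
    out["weight"] = (wc - wc.min()) / span if span > 0 else 1.0
    out["weighted_charge"] = out["raw_charge"] * out["weight"]
    # step 3: sum per country, divided by the number of events of that country
    grouped = out.groupby("Country")["weighted_charge"]
    return (grouped.sum() / grouped.count()).rename("Emotion_Charge")

toy = pd.DataFrame({
    "Country": ["US", "US", "CH"],
    "Tone": [-5.0, 2.0, -1.0],
    "Polarity": [7.0, 4.0, 3.0],
    "WordCount": [100, 300, 200],
})
result = emotion_charge_by_country(toy)
print(result)
```

Note that min-max scaling gives the shortest text a weight of exactly 0, so a different normalization (for example dividing by the maximum) may be preferable in practice.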


### Question 0 <a id='Part1_Q0'></a>

**Exploratory analysis: Where does the news come from?**

#### Fetching & Processing the data 

 From the GDELT dataset we fetch the following information from the "Mentions" and "Events" sets:

- URL of the article mentioning the event 
- Location of the event
- Location of the country reporting the news: 
    - Get the country reporting the news from the url of the article using the [GDELT Geographic Source Lookup](https://blog.gdeltproject.org/mapping-the-media-a-geographic-lookup-of-gdelts-sources/)

#### Analysis

- Where do the events happen?
    - Group by the countries and count the number of events happening in this country.

- Who reports the news?
    - Group by the countries reporting the news and count the number of events reported by each country.

### Question 1 <a id='Part1_Q1'></a>

**Are we emotionally biased?** Does the number of conflicts or their distance from our home define our emotions? 

#### Fetching & Processing the data 

 From the GDELT dataset we fetch the following information from the "Mentions", "Events" and "Global Knowledge Graph" sets:

- URL of the article mentioning the event 
- Average Tone (Event-file), Polarity, Tone (gkg-file)
- Location of the event (latitudinal and longitudinal coordinates)
- Number of times the event is mentioned in the news (NumArticles) 
- Calculation of the distance between the source article and the event: 
    1. Get the country from the url of the article using the [GDELT Geographic Source Lookup](https://blog.gdeltproject.org/mapping-the-media-a-geographic-lookup-of-gdelts-sources/)
    2. Get the geographic coordinates of the capital of the source country using the csv file provided by [this website](http://techslides.com/list-of-countries-and-capitals)
    3. Calculate the geographic distance between the source article and the event with the [Great-Circle distance formula](https://en.wikipedia.org/wiki/Great-circle_distance)
    
    Note that in this approach we approximate the location of the source reporting the article by the capital of the country reporting the news.
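
A plain-Python sketch of step 3 is given below. It uses the haversine form of the great-circle distance, which is a different but equivalent formulation to the arctangent form used in the Part 2 code; the Earth radius is the same value as in that code, and the function name is hypothetical.

```python
from math import radians, sin, cos, asin, sqrt, pi

EARTH_RADIUS_KM = 6372.8  # same radius as in the Part 2 code

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# sanity check: a quarter of the equator should equal R * pi / 2
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(quarter, EARTH_RADIUS_KM * pi / 2)
```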

#### Analysis

- Evaluation of the dependency between the emotions and the distance: 
    1. Plot the emotion metrics (average tone and emotional charge, a measure combining polarity and tone) against the distance for each event. Colour-code the points by the location of the event to see whether the country where the event occurred influences the emotion or, conversely, whether news from a given country only gets reported if it has a negative or positive association. 
    

- Evaluation of the dependency between the emotions and the importance of the conflict:
    1. Evaluate whether the emotion metrics depend on the importance of the conflict, using one of the "importance of an event" metrics provided by GDELT (three very similar metrics are given: NumMentions, NumSources and NumArticles; we decided to use NumArticles).

### Question 2 <a id='Part1_Q2'></a>

**Are some countries ignored in the news?**  Does the relation between the number of conflicts taking place in a country and the number of mentions in the media depend on where the conflict happened? 

#### Fetching the data 

We use the same dataset as in Question 1, in addition to the dataset ["population by country"](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)) provided by the United Nations (2017).

#### Analysis

- Group by the event country and sum up the number of conflicts and the number of mentions 
- Evaluate whether there is a correlation between the number of conflicts and mentions or not
- Identify countries with a high number of conflicts but a comparatively low number of mentions
- Identify whether the number of people living in a country influences the number of events in that country
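
The first three analysis steps could look roughly like this in pandas. The country codes `AA`, `BB`, `CC` and the article counts are invented; the column names `sum_articles` and `count_events` follow the aggregation shown in Part 2, and the Pearson correlation is only one possible choice of dependency measure.

```python
import pandas as pd

# made-up events: one row per event with its number of articles
events = pd.DataFrame({
    "ActionGeo_CountryCode": ["AA", "AA", "AA", "BB", "BB", "CC"],
    "NumArticles": [2, 1, 3, 10, 20, 5],
})

# group by event country: number of events and total number of articles
per_country = events.groupby("ActionGeo_CountryCode").agg(
    count_events=("NumArticles", "size"),
    sum_articles=("NumArticles", "sum"),
)

# correlation between the number of events and the number of articles
pearson_r = per_country["count_events"].corr(per_country["sum_articles"])

# countries with many events but few articles have a low ratio
per_country["articles_per_event"] = (
    per_country["sum_articles"] / per_country["count_events"]
)
least_covered = per_country["articles_per_event"].idxmin()
print(per_country)
print("least covered country:", least_covered)
```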


### Question 3 <a id='Part1_Q3'></a>

**Are we emotionally predictable?** Can we observe patterns of emotions with respect to a country, religion or ethnic group? Can we derive a model predicting emotions in case of a new conflict based on its specific features?

We derive a model using the tone metrics given by GDELT as the response variable and the following variables as input features:

Features are: CountryEvent, CountrySource, EventType, NumPeople, ActorReligion, HDI  

CountryEvent: Geographic country where the event has happened.

CountrySource: Geographic country reporting the news article.

EventType: This reflects the type of the event. Categorical value with the following values: AFFECT, ARREST, KIDNAP, KILL, PROTEST, SEIZE, WOUND, etc. 

NumPeople: Number of people concerned by the EventType. 

ActorReligion: Religion of the actors involved in the event, as indicated by the CAMEO Religious Coding Scheme (see Chapter 4 of the [CAMEO Conflict and Mediation Event Observations Event and Actor Codebook](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf)). 19 religions are reported (categorical variable).

HDI: Human Development Index. This index measures the development of a country, including not only its economic growth (measured by the gross national income) but also life expectancy and education (more information is found on the [United Nations Development Website](http://hdr.undp.org/en/content/human-development-index-hdi)). Categorical variable (HDI $\in$ [very high, high, medium, low]). We use the [latest version](http://hdr.undp.org/en/composite/HDI) (released in Sep. 2018) covering the period of 2017. 

Tone: The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral. We categorize this variable into 5 categories ([-100, -10], [-10, -5], [-5, 0], [0, 5], [5,100])
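
The categorization of the tone can be sketched with `pandas.cut`. The bin edges are the ones listed above; the class labels are hypothetical, and assigning boundary values to the lower bin is an assumption, since the listed intervals share their endpoints.

```python
import pandas as pd

bins = [-100, -10, -5, 0, 5, 100]
# hypothetical labels for the five classes
labels = ["very negative", "negative", "slightly negative",
          "slightly positive", "very positive"]

tone = pd.Series([-12.3, -7.0, -0.5, 2.1, 9.9])

# right-closed bins: (-10, -5], (-5, 0], ...; include_lowest closes the first bin at -100
tone_class = pd.cut(tone, bins=bins, labels=labels, include_lowest=True)
print(tone_class.tolist())
```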

#### Analysis

- We fetch the required data from GDELT and perform a stratified downsampling on the EventType category (since the classes do not have the same number of data points).
- We split the data into a training set and a test set, train a machine learning model (either a decision tree or a random forest) on the training set, and evaluate its accuracy on the test set. 
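
A minimal sketch of the split, train and test procedure with scikit-learn is shown below. The data are synthetic stand-ins with random labels, so the resulting accuracy is meaningless; only a subset of the features is used, and the encoding choice (`LabelEncoder` per categorical column) and hyperparameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 200
# synthetic stand-in for the cluster output; all values are made up
toy = pd.DataFrame({
    "EventType": rng.choice(["KILL", "PROTEST", "ARREST"], size=n),
    "HDI": rng.choice(["Very High", "High", "Medium", "Low"], size=n),
    "NumPeople": rng.integers(1, 50, size=n),
    "ToneClass": rng.choice(["negative", "neutral", "positive"], size=n),
})

# encode the categorical features as integers
X = toy[["EventType", "HDI", "NumPeople"]].copy()
for col in ["EventType", "HDI"]:
    X[col] = LabelEncoder().fit_transform(X[col])
y = toy["ToneClass"]

# hold out 25 % of the rows for testing, keeping the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", test_accuracy)
```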

### Question 4 <a id='Part1_Q4'></a>

**Do we have a saturation limit?** Does increasing number of conflicts make people feel worse and worse or is there some limit?

#### Fetching the data 

For question 4 the way we proceeded was the following:
- Read files by month.
- Get a dataframe where we can associate the gkg GCAM and V2Tone with the country by the url in the mentions dataframe.
- Calculate our metrics of Emotion_Charge from the V2Tone data.
- Get a final dataframe where we group by country summing the Emotion_Charge of the individual events of that country while also counting how many events happened in that country in that month. 


#### Analysis

  - Calculation of the increasing number of conflicts: 
      1. Get the location of the country.
      2. Parse the gkg files (in the chosen time interval) and get the events relating to a country.
      3. Associate the number of conflicts with the emotions conveyed.

  - Possible limit of the emotions: 
      1. Get the V2Tone and the GCAM feelings relating to the events.
      2. Evaluate the emotions expressed for each of these events, observing how the media present the events and whether some desensitization sets in after a threshold number of events (measured by lower V2Tone scores and more neutral GCAM feelings).






### Question 5 <a id='Part1_Q5'></a>

**Who is more emotional and what do they say?** Let's go deeper into question 4, this time associating words with countries. How do countries cluster with respect to the emotions they express (GCAM data)? 

#### Fetching the data 

From the GDELT dataset we fetch the following information from the "GKG" and "Events":

  - Locations
  - GCAM words

#### Analysis


  - Relation between the country and the emotion: 
      1. We take the GCAM data and, using the mentions dataset, associate the GCAM words with the countries.
      2. We map each word code to the word itself, using our own list of almost 800 human-readable words, so that we get an understandable list of words per country.
      3. We look at the words most often used by specific countries.
      4. We perform a principal component analysis to see whether some countries cluster together due to a similar choice of words in their news reports.
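
Step 4 can be sketched as follows with scikit-learn. The country names, word columns and counts are invented; converting counts to shares and standardizing them before the PCA are assumptions about sensible preprocessing, not a description of what was actually run.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# rows: countries, columns: counts of GCAM words (all values made up)
counts = pd.DataFrame(
    [[10, 2, 0, 5], [9, 3, 1, 4], [1, 8, 7, 0], [0, 9, 6, 1]],
    index=["A", "B", "C", "D"],
    columns=["w1", "w2", "w3", "w4"],
)

# use word shares so that countries with more articles do not dominate
shares = counts.div(counts.sum(axis=1), axis=0)
scaled = StandardScaler().fit_transform(shares)

# project the countries onto the first two principal components
pca = PCA(n_components=2)
coords = pd.DataFrame(
    pca.fit_transform(scaled), index=counts.index, columns=["PC1", "PC2"]
)
print(coords)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Countries with a similar word profile (here A and B, and C and D) end up close together in the projected plane.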


## Part 2 <a id='Part2'></a>

In [1]:
# regular imports
import os
import math
import numpy as np
import pandas as pd
from math import radians, sqrt, sin, cos, atan2
import pickle

# function imports
from Schema import *
from Visualization import *
from helper_functions import *

import warnings
warnings.filterwarnings('ignore')

#import findspark
#findspark.init('C:\opt\spark')

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import ArrayType, FloatType, StringType

# scikit learn imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# update when changing functions
%load_ext autoreload
%autoreload 2

In [6]:
#DATA_DIR = 'hdfs:///datasets/gdeltv2'  # cluster code
DATA_DIR = '../data/Gdelt/' # local code

# directory for local files 
DATA_LOCAL = '../data/'

In [7]:
# open GDELT data
gkg_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.gkg.csv"),schema=GKG_SCHEMA)
events_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.export.CSV"),schema=EVENTS_SCHEMA)
mentions_df = spark.read.option("sep", "\t").csv(os.path.join(DATA_DIR, "*.mentions.CSV"),schema=MENTIONS_SCHEMA)

In [8]:
# open helper datasets
CountryToCapital = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "CountryInformations/country-capitals.csv")
Domains = spark.read.format("csv").option("header", "true").load(DATA_LOCAL + "UrlToCountry/urls.csv")
Domains = Domains.select(Domains['alpha-2'].alias('code') ,Domains['name'].alias('country_source'), 'region', Domains['web'].alias('url'))

In [9]:
Domains.show(5)

+----+--------------+------+-----------------+
|code|country_source|region|              url|
+----+--------------+------+-----------------+
|  AF|   Afghanistan|  Asia| 108worldnews.com|
|  AF|   Afghanistan|  Asia|           1tv.af|
|  AF|   Afghanistan|  Asia|       1tvnews.af|
|  AF|   Afghanistan|  Asia|       aff.org.af|
|  AF|   Afghanistan|  Asia|afghan-review.com|
+----+--------------+------+-----------------+
only showing top 5 rows



The `Domains` table links the URL of a news source to the country that is reporting the event.

## Q0.  Exploratory Data Fetching <a id='Part2_Q0'></a>
The whole cluster code is found [here](FetchingData/Q0_cluster.py)

The whole code for processing the news source URLs is found [here](domains.py). The biggest issue we faced was probably the different notations for countries (different names, different country codes), so we had to convert between them to be able to work with and visualize all the data.

In [10]:
#### get the number of events grouped by the country they occurred

# select the required data from Event Dataset
events_1 = events_df.select('ActionGeo_CountryCode').filter(events_df.ActionGeo_CountryCode.isNotNull())
CountryEvent_number = events_1.groupBy('ActionGeo_CountryCode').count()
CountryEvent_number.show(5)

+---------------------+-----+
|ActionGeo_CountryCode|count|
+---------------------+-----+
|                   CI|   15|
|                   FI|    1|
|                   IC|    1|
|                   RO|    9|
|                   SL|    7|
+---------------------+-----+
only showing top 5 rows



In [11]:
#### get the number of events grouped by the country reporting

# select the required data from Mentions Dataset
mentions_1 = mentions_df.select("GLOBALEVENTID", "MentionType", "MentionSourceName") \
                .filter(mentions_df["MentionType"] == '1') # MentionType 1 refers to the web articles and not e.g. books

# join the domain df with the country corresponding to the source
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer")

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

events_1 = events_df.select('GLOBALEVENTID')

mention_event = events_1.join(mentions_3, 'GLOBALEVENTID')
CountrySource_number = mention_event.groupBy('country_source').count()
CountrySource_number.show(5)

+-------------------+-----+
|     country_source|count|
+-------------------+-----+
|Korea (Republic of)|   36|
|        Philippines|  166|
|           Malaysia|  151|
|               Fiji|    5|
|             Turkey|   40|
+-------------------+-----+
only showing top 5 rows



## Q1.  Are we emotionally biased?   <a id='Part2_Q1'></a>

The whole cluster code is found in the [Q1_cluster.py](FetchingData/Q1_cluster.py)

In [12]:
# select the required data from Mentions Dataset
mentions_1 = mentions_df.select("GLOBALEVENTID", "EventTimeDate", 'MentionIdentifier', "MentionType", "MentionSourceName") \
                .filter(mentions_df["MentionType"] == '1')

# join the domain df with the country corresponding to the source
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer")

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

# print the number of urls that have no country
print('number of unknown urls: ', mentions_2.filter("url is null").select('MentionSourceName').distinct().count())

# print the number of urls associated to a country
print('number of known urls: ', mentions_3.select('url').distinct().count())

number of unknown urls:  12
number of known urls:  843


Only a small fraction of the news URLs cannot be assigned to a country.

In [13]:
# join the file with the countries and capitals
mentions_4 = mentions_3.join(CountryToCapital, CountryToCapital['CountryName'] == mentions_3['country_source'], "left_outer") 

# filter out rows with no geographic coordinates
mentions_5 = mentions_4.filter(mentions_4.CapitalLatitude.isNotNull())

# select relevant columns 
mentions_6 = mentions_5.select('GLOBALEVENTID', 'MentionIdentifier','EventTimeDate','CountryCode', 'CountryName', mentions_5['CapitalLatitude'].alias('Source_Lat'),
                                                   mentions_5['CapitalLongitude'].alias('Source_Long'))
mentions_6 = mentions_6.withColumn("Source_Lat", mentions_6["Source_Lat"].cast("float"))
mentions_6 = mentions_6.withColumn("Source_Long", mentions_6["Source_Long"].cast("float"))

In [14]:
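# prepare gkg file: V2Tone is a comma-separated string (Tone is field 0, Polarity is field 3)
gkg_1 = gkg_df.select('DocumentIdentifier', 'V2Tone')
split_col = split(gkg_1['V2Tone'], ',')
gkg_2 = gkg_1.withColumn('Tone', split_col.getItem(0).cast('double'))
gkg_3 = gkg_2.withColumn('Polarity', split_col.getItem(3).cast('double'))
# emotion charge: absolute difference between tone and polarity
gkg_4 = gkg_3.withColumn('Emotion_Charge', abs(gkg_3.Tone - gkg_3.Polarity))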

In [26]:
# select Data from Events Dataset
events_1= events_df.select("GLOBALEVENTID", 'ActionGeo_CountryCode', "ActionGeo_Lat", "ActionGeo_Long", "NumMentions","NumSources","NumArticles","AvgTone")

# filter out events that have no geographic coordinates
print('Total number of events: ', events_1.count())
events_2 = events_1.filter(events_1.ActionGeo_Lat.isNotNull())
print('Number of events with geographic coordinates: ', events_2.count())

# merge the clean events and mentions dfs
event_mention = events_2.join(mentions_6, 'GLOBALEVENTID') 

data_q1 = event_mention.join(gkg_4.select('DocumentIdentifier', 'Emotion_Charge'), gkg_4['DocumentIdentifier'] == event_mention['MentionIdentifier'])
data_q1.show(5)

Total number of events:  5251
Number of events with geographic coordinates:  5095
+-------------+---------------------+-------------+--------------+-----------+----------+-----------+----------+--------------------+--------------+-----------+-----------+----------+-----------+--------------------+----------------+
|GLOBALEVENTID|ActionGeo_CountryCode|ActionGeo_Lat|ActionGeo_Long|NumMentions|NumSources|NumArticles|   AvgTone|   MentionIdentifier| EventTimeDate|CountryCode|CountryName|Source_Lat|Source_Long|  DocumentIdentifier|  Emotion_Charge|
+-------------+---------------------+-------------+--------------+-----------+----------+-----------+----------+--------------------+--------------+-----------+-----------+----------+-----------+--------------------+----------------+
|    709307290|                   AF|      34.5167|       69.1833|          6|         1|          6|-3.5294118|http://www.khaama...|20171123064500|         AF|Afghanistan| 34.516666|  69.183334|http://www.khaama...|

Most of the events reported by GDELT have a reported geographic location.

In [32]:
# calculate the distance between the source and the event
# append a column with the geographic distance
def geocalc(lat1, lon1, lat2, lon2):
    #print(lat1, lon1, lat2, lon2)
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    
    dlon = lon1 - lon2

    EARTH_R = 6372.8

    y = sqrt((cos(lat2) * sin(dlon)) ** 2 + (cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)) ** 2)
    x = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(dlon)
    c = atan2(y, x)
    return EARTH_R * c

udf_geocalc = udf(geocalc, FloatType())
 
data_q1_1 = data_q1.withColumn("distance", udf_geocalc('ActionGeo_Lat', 'ActionGeo_Long', 'Source_Lat', 'Source_Long'))

data_q1_2 = data_q1_1.select('ActionGeo_CountryCode', 'distance', 'AvgTone','NumArticles', data_q1_1['CountryName'].alias('SourceCountry'))

When applying the algorithm to the whole GDELT dataset, a dataset of 3.4 GB was obtained. Since the goal was to plot each event as a data point in a graph, this dataset was randomly downsampled and only 0.01 % of it was kept.

## Q2. Are some countries ignored in the news?<a id='Part2_Q2'></a>

The whole fetching cluster code is found in the [Q2_cluster.py](FetchingData/Q2_cluster.py) file, which is very similar to the Q1 cluster code.

In [27]:
data_q2 = events_2.groupBy('ActionGeo_CountryCode').agg(sum('NumArticles').alias('sum_articles'), count('GLOBALEVENTID').alias('count_events'))
data_q2.show(5)

+---------------------+------------+------------+
|ActionGeo_CountryCode|sum_articles|count_events|
+---------------------+------------+------------+
|                   CI|         111|          15|
|                   FI|           4|           1|
|                   IC|           2|           1|
|                   RO|          23|           9|
|                   SL|          31|           7|
+---------------------+------------+------------+
only showing top 5 rows



## Q3. Are we emotionally predictable? <a id='Part2_Q3a'></a>

### Fetching the data

The whole cluster code is found in the [Q3_cluster.py](FetchingData/Q3_cluster.py) file.

In [30]:
# open helper dataset
HDI_df = spark.read.format("csv").option("header", "true").load("../data/CountryInformations/HDI_code_df.csv")
HDI_df.show(5)

+---+-----------+---------+------------+----------+-------+
|_c0|    Country|      HDI|Country name|FIPS_GDELT|alpha-2|
+---+-----------+---------+------------+----------+-------+
|  0|     Norway|Very High|      Norway|        NO|     NO|
|  1|Switzerland|Very High| Switzerland|        SZ|     CH|
|  2|  Australia|Very High|   Australia|        AS|     AU|
|  3|    Ireland|Very High|     Ireland|        EI|     IE|
|  4|    Germany|Very High|     Germany|        GM|     DE|
+---+-----------+---------+------------+----------+-------+
only showing top 5 rows



The human development index is divided into 4 categories: Very High, High, Medium, Low

In [31]:
# filter on events that have count information
gkg_1 = gkg_df.filter(gkg_df.Counts.isNotNull())
CountType = split(gkg_1['Counts'], '#')

# add the event type
gkg_2 = gkg_1.withColumn('EventType', CountType.getItem(0))

# add the number of people concerned
gkg_3 = gkg_2.withColumn('NumPeople', CountType.getItem(1))

In [32]:
# prepare mention file
mentions_1 = mentions_df.select("GLOBALEVENTID", "MentionType", "MentionSourceName", 'MentionIdentifier') \
                .filter(mentions_df["MentionType"] == '1')

# join the dataframe url to country
mentions_2 = mentions_1.join(Domains, Domains['url'] == mentions_1['MentionSourceName'], "left_outer") 

# filter out urls that are unknown
mentions_3 = mentions_2.filter(mentions_2.country_source.isNotNull())

In [35]:
events_1= events_df.select("GLOBALEVENTID", 'Day_DATE','Actor1Religion1Code', 'Actor2Religion1Code',
                               'NumMentions',"AvgTone", 'GoldsteinScale', 'ActionGeo_CountryCode')

# filter events with no country code
events_2 = events_1.filter(events_1.ActionGeo_CountryCode.isNotNull())

# create religion column (take one of the two actors, because in the example test there was never data for both at the same time)
events_3 = events_2.withColumn('ActorReligion', coalesce(events_2['Actor1Religion1Code'], events_2['Actor2Religion1Code']))

# filter out rows with no religion
events_4 = events_3.filter(events_3.ActorReligion.isNotNull())

In [38]:
#### join the files together

# join the HDI file to country code
events_5 = events_4.join(HDI_df.select('Country', 'HDI', 'FIPS_GDELT'), HDI_df['FIPS_GDELT']==events_4['ActionGeo_CountryCode'])

# join event and mention file
mention_event = events_5.join(mentions_3, 'GLOBALEVENTID')

# join mention_event and gkg
gkg_mention_event = mention_event.join(gkg_3, mention_event['MentionIdentifier'] == gkg_3['DocumentIdentifier'])

In [37]:
print('gkg, original size: ', gkg_df.count())
print('gkg, filtered: ', gkg_3.count())
print('mention, original size: ', mentions_df.count())
print('mention, filtered: ', mentions_3.count())
print('event, original size: ', events_df.count())
print('event, filtered: ', events_4.count())
print('files joined together: ', mention_event.count())
print('all: ', gkg_mention_event.count())

gkg, original size:  6098
gkg, filtered:  772
mention, original size:  13508
mention, filtered:  13421
event, original size:  5251
event, filtered:  199
files joined together:  200
all:  49


We see that the filtering reduces the size of the data files quite drastically. This is because, in the gkg file, most articles do not have a specified event type; in the event file, very few events have the religion of the actors defined; and in the mention file, not every article can be linked to a source country.

In the following we will select the emotions of interest from the GCAM data column available in the gkg file. 

In [39]:
HRE = ['c2.152', 'c2.214', 'c3.2', 'c5.7', 'c6.5', 'c7.2', 'c10.1',
       'c14.9', 'c15.3', 'c15.4', 'c15.12', 'c15.26', 'c15.27', 'c15.30',
       'c15.36', 'c15.42', 'c15.53', 'c15.57', 'c15.61', 'c15.92',
       'c15.93', 'c15.94', 'c15.97', 'c15.101', 'c15.102', 'c15.103',
       'c15.105', 'c15.106', 'c15.107', 'c15.108', 'c15.109', 'c15.110',
       'c15.116', 'c15.120', 'c15.123', 'c15.126', 'c15.131', 'c15.136',
       'c15.137', 'c15.152', 'c15.171', 'c15.173', 'c15.179', 'c15.203',
       'c15.219', 'c15.221', 'c15.239', 'c15.260', 'c21.1', 'c35.31',
       'c24.1', 'c24.2', 'c24.3', 'c24.4', 'c24.5', 'c24.6', 'c24.7',
       'c24.8', 'c24.9', 'c24.10', 'c24.11', 'c36.31', 'c37.31', 'c41.1']


Emot_Words_df = gkg_mention_event.select(gkg_mention_event['ActionGeo_CountryCode'].alias('CountryEvent'), 'EventType',
                                   'ActorReligion', 'HDI', 'AvgTone',
                                  'country_source', 'GCAM',split(col("GCAM"), ":").alias("GCAM2"))

Emot_Words_df = Emot_Words_df.withColumn('GCAM2', concat_ws(',', 'GCAM2'))

Emot_Words_df = Emot_Words_df.select('CountryEvent', 'EventType','ActorReligion', 'country_source',
                                     'HDI', 'AvgTone', split(col("GCAM2"), ",").alias("GCAM"))

Emot_Words_df = Emot_Words_df.withColumn("HRE", array([lit(x) for x in HRE]) )

differencer = udf(lambda x,y: list(set(y)-(set(y)-set(x))), ArrayType(StringType()))
Emot_Words_df = Emot_Words_df.withColumn('DIF', differencer('HRE', 'GCAM'))

Emot_Words_df = Emot_Words_df.select('CountryEvent', 'EventType','ActorReligion', 'country_source',
                                     'HDI', 'AvgTone', 'DIF')

data_q3 = Emot_Words_df.dropDuplicates()
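
To make the selection above easier to follow, here is a plain-Python sketch of the same idea on a single record: parse the GCAM string into its codes and keep only those that appear in the list of emotion codes. The sample string is made up, `EMOTION_CODES` is a small subset of the `HRE` list, and `parse_gcam` is a hypothetical helper, not part of the cluster code.

```python
def parse_gcam(gcam):
    """Turn a 'code:value,code:value' GCAM string into a dict of floats."""
    scores = {}
    for item in gcam.split(","):
        key, sep, value = item.partition(":")
        if sep:  # skip malformed entries without a colon
            scores[key] = float(value)
    return scores

# small subset of the HRE list, for illustration only
EMOTION_CODES = {"c2.152", "c15.110", "c41.1"}

# made-up GCAM value: word count followed by dictionary code counts
sample = "wc:125,c2.152:3,c12.1:7,c41.1:1"

# emotion codes that occur in this record (set intersection)
present = sorted(set(parse_gcam(sample)) & EMOTION_CODES)
print(present)
```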

In [41]:
data_q3.show(5)

+------------+---------+-------------+--------------------+------+----------+--------------------+
|CountryEvent|EventType|ActorReligion|      country_source|   HDI|   AvgTone|                 DIF|
+------------+---------+-------------+--------------------+------+----------+--------------------+
|          BK|     KILL|          MOS|United States of ...|  High|-10.911809|[c15.110, c15.103...|
|          MR|  PROTEST|          MOS|              Turkey|   Low|-8.5561495|[c2.214, c41.1, c...|
|          ID|   AFFECT|          MOS|            Malaysia|Medium|-2.7181687|[c2.214, c14.9, c...|
|          PK|  PROTEST|          MOS|         Philippines|Medium|-3.1468532|[c15.110, c14.9, ...|
|          IZ|  PROTEST|          MOS|        Saudi Arabia|Medium| 1.9607843|[c15.171, c2.214,...|
+------------+---------+-------------+--------------------+------+----------+--------------------+
only showing top 5 rows
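
The Spark cell above flattens the `GCAM` field into a list of tokens and keeps only the codes that also appear in our `HRE` list. Here is a minimal plain-Python sketch of the same idea. The sample string and the short `hre_codes` list are made up for illustration, and we assume the `GCAM` field is a comma-separated list of `code:count` pairs. The result is sorted only to make the output deterministic.

```python
# Hypothetical GCAM value: comma-separated "code:count" pairs (assumed format).
gcam = "wc:125,c2.214:3,c15.110:2,c9.1:7"

# Hypothetical short stand-in for the HRE list of human-readable emotion codes.
hre_codes = ["c2.214", "c15.110", "c41.1"]

def present_codes(gcam_field, wanted):
    """Return the wanted codes that occur in the GCAM field, sorted."""
    # Split on ':' and ',' so codes and counts become separate tokens,
    # mirroring the split / concat_ws / split steps of the Spark code.
    tokens = gcam_field.replace(":", ",").split(",")
    # Set intersection: codes that are both wanted and present.
    return sorted(set(wanted) & set(tokens))

print(present_codes(gcam, hre_codes))
```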



In [None]:
data_q3_strat = data_q3.sampleBy('EventType', fractions={'AFFECT': 0.1, 'ARMEDCONFLICT': 1.0, 'ARREST': 0.1, 'ASSASSINATION': 1,
                                                                'EVACUATION': 1, 'HUMAN_RIGHTS_ABUSES_FORCED_MIGRATION': 1, 'KIDNAP': 0.1,
                                                                'KILL': 0.01, 'MOVEMENT_GENERAL': 1, 'POVERTY': 1, 'PROTEST': 0.1,
                                                                'REFUGEES': 1, 'RELEASE_HOSTAGE': 1, 'STRIKE': 1, 'TERROR': 1, 'WOUND': 0.1,
                                                                'SEIZE': 0.1})

This dataset is used to build the model in the next section. Because the filtering is so strict, running the algorithm on the whole GDELT dataset returned a dataset that only needed light downsampling, which we did by stratified sampling on the EventType column.
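
To make the stratified downsampling concrete, here is a small plain-Python sketch of sampling with one fraction per stratum. It mimics the behaviour we rely on in `sampleBy`: each row is kept independently with the probability given for its key, so the resulting fractions are approximate, and keys without a fraction are dropped. The function name `sample_by`, the toy rows and the seed are ours, purely for illustration.

```python
import random

def sample_by(rows, key, fractions, seed=0):
    """Keep each row with the probability assigned to its stratum.

    Strata that are missing from `fractions` get probability 0.
    """
    rng = random.Random(seed)
    return [row for row in rows
            if rng.random() < fractions.get(row[key], 0.0)]

# Toy data: many KILL events, few TERROR events (hypothetical counts).
rows = ([{"EventType": "KILL"}] * 1000) + ([{"EventType": "TERROR"}] * 20)

sample = sample_by(rows, "EventType", {"KILL": 0.01, "TERROR": 1.0})
kept_terror = sum(r["EventType"] == "TERROR" for r in sample)
kept_kill = sum(r["EventType"] == "KILL" for r in sample)
print(kept_terror, kept_kill)
```

Frequent event types are thinned out while rare ones are kept in full, which is the purpose of the per-type fractions used above.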

### Building the model <a id='Part2_Q3b'></a>

This part is done on the downsampled data gathered from the cluster. We trained the model once with the scikit-learn decision tree algorithm and once with the random forest algorithm. The test accuracy was very similar, but the decision tree overfit the training data, so we decided to use the random forest model.

In [2]:
model_data = pd.read_json('../data/data_q3_tone.json', lines = True) # the data is not included in the GitHub repository due to the large file size
model_data.head(2)

Unnamed: 0,ActorReligion,AvgTone,CountryEvent,DIF,Day_DATE,Emotion_Charge,EventType,HDI,NumPeople,Polarity,Tone,country_source
0,MOS,-8.812949,PK,"[c15.260, c15.93, c2.214, c15.107, c6.5, c7.2,...",20161116,17.173524,WOUND,Medium,3,8.586762,-8.586762,Pakistan
1,MOS,-6.27414,SY,"[c15.123, c2.214, c15.97, c6.5, c15.110, c5.7,...",20170411,13.211845,ARREST,Low,8,7.744875,-5.46697,Canada


In [5]:
# open helper datasets
country_info = pd.read_csv('../data/CountryCodes/countries-info.csv')
country_fips_alpha = pd.read_csv('../data/CountryCodes/CountryCode_FIPS_alpha.csv')

In [6]:
# merge the alpha-3 country codes to the source countries
model_data_1 = model_data.merge(country_info[['name', 'alpha-3']], left_on = 'country_source', right_on = 'name')
model_data_1 = model_data_1.rename(columns = {'country_source': 'CountrySource', 'alpha-3': 'SourceCode'})
model_data_1 = model_data_1.drop(columns = ['name'])

In [7]:
# merge the official alpha country codes to the GDELT country code
model_data_2 = model_data_1.merge(country_fips_alpha, left_on = 'CountryEvent', right_on = 'FIPS_GDELT')
model_data_2 = model_data_2.rename(columns = {'CountryEvent':'CountryEventCode', 'Country name': 'CountryEvent'})
model_data_2 = model_data_2.drop(columns = ['FIPS_GDELT'])
#model_data_2.head()

In [8]:
# merge the alpha-3 country codes to the event countries
model_data_3 = model_data_2.merge(country_info[['alpha-2', 'alpha-3']], left_on = 'alpha-2', right_on = 'alpha-2')
model_data_3 = model_data_3.drop(columns = ['CountryEventCode', 'alpha-2'])
model_data_3 = model_data_3.rename(columns = {'alpha-3': 'CountryEventCode'})
#model_data_3.head()

In [9]:
# transform the features into categorical data to be used by the learning algorithm
def discretize(df):
    model_data = df.copy()
    # discretize the tone into five right-closed intervals
    bins = pd.IntervalIndex.from_tuples([(-101, -10), (-10, -5), (-5, 0), (0,5),
                                         (5,101)])
    model_data.Tone = pd.cut(model_data.Tone, bins)

    # encode each categorical column as integers; `mappings` records label -> code
    # transform back with: list(encoders['CountryEventCode'].inverse_transform([2, 2, 1]))
    encoders, mappings = {}, {}
    for col in ['CountryEventCode', 'SourceCode', 'EventType',
                'ActorReligion', 'HDI', 'Tone']:
        le = LabelEncoder()
        model_data[col] = le.fit_transform(model_data[col])
        encoders[col] = le
        mappings[col] = dict(zip(le.classes_, le.transform(le.classes_)))
    return model_data

In [10]:
model_data = discretize(model_data_3)
model_data.head(2)

Unnamed: 0,ActorReligion,AvgTone,DIF,Day_DATE,Emotion_Charge,EventType,HDI,NumPeople,Polarity,Tone,CountrySource,SourceCode,CountryEvent,CountryEventCode
0,9,-8.812949,"[c15.260, c15.93, c2.214, c15.107, c6.5, c7.2,...",20161116,17.173524,16,2,3,8.586762,1,Pakistan,101,Pakistan,123
1,9,-5.73543,"[c15.92, c2.214, c6.5, c15.42, c7.2, c15.57, c...",20170323,13.381555,6,2,200,7.866184,1,Pakistan,101,Pakistan,123
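
The tone binning in `discretize` can be illustrated without pandas. The sketch below assumes the same five right-closed intervals and maps a tone value to the index of its interval. `tone_bin` is a hypothetical helper, not part of our pipeline, and unlike `pd.cut` it raises an error for out-of-range values instead of returning a missing value.

```python
from bisect import bisect_left

# Inner edges of the intervals (-101,-10], (-10,-5], (-5,0], (0,5], (5,101].
EDGES = [-10, -5, 0, 5]

def tone_bin(tone):
    """Return the index (0-4) of the right-closed interval containing tone."""
    if not -101 < tone <= 101:
        raise ValueError("tone outside the binned range")
    # bisect_left puts a value equal to an edge into the lower interval,
    # which is what right-closed intervals require.
    return bisect_left(EDGES, tone)

print([tone_bin(t) for t in (-8.586762, -5.515371, 0.0, 1.9607843)])
```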


In [11]:
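# define X and Y
# X columns: ActorReligion, EventType, HDI, NumPeople, SourceCode, CountryEventCode
X = model_data.values[:, [0,5,6,7,11,13]]
# Y: the discretized Tone
Y = model_data.values[:,9]
Y = Y.astype(int)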

In [12]:
def model(X, Y, importance = True):
    # split the data
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)
    # train the model (we use the random forest, because the decision tree overfit the data)
    #clf = tree.DecisionTreeClassifier()
    clf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
    clf.fit(X_train, y_train)
    y_pred_test = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    print("Test Accuracy is ", accuracy_score(y_test,y_pred_test)*100)
    print("Train Accuracy is ", accuracy_score(y_train ,y_pred_train)*100)
    if importance:
        importances = clf.feature_importances_
        print(importances)
    return

In [13]:
# train the model and see the accuracy
model(X, Y)

Test Accuracy is  56.0315670800451
Train Accuracy is  55.866341683399014
[0.22240931 0.3796353  0.03289924 0.1575336  0.00932777 0.19819478]
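
To read the importance vector more easily, it can be paired with feature names. The names below are our reading of the column positions `[0,5,6,7,11,13]` against the column order shown in the table above, so treat the pairing as an assumption.

```python
# Importances printed by the model cell above.
importances = [0.22240931, 0.3796353, 0.03289924, 0.1575336, 0.00932777, 0.19819478]

# Assumed feature names, matched by position to the columns selected for X.
features = ["ActorReligion", "EventType", "HDI", "NumPeople",
            "SourceCode", "CountryEventCode"]

# Sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:17s} {value:.3f}")
```

Under this reading, the event type carries the most weight and the source country the least.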


In [15]:
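# train the model with the words
# one indicator column per emotion code in DIF; taken from model_data_3 so that
# the row index matches the frame it is concatenated with
s = model_data_3['DIF']
dummy_words = pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
dummy_words = dummy_words.astype(int)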

In [16]:
model_data_4 = pd.concat([model_data_3, dummy_words], axis=1)
model_data_4 = model_data_4.drop(columns = ['DIF', 'Day_DATE'])
model_data_4.head(2)

Unnamed: 0,ActorReligion,AvgTone,Emotion_Charge,EventType,HDI,NumPeople,Polarity,Tone,CountrySource,SourceCode,...,c15.94,c15.97,c2.152,c2.214,c3.2,c35.31,c41.1,c5.7,c6.5,c7.2
0,MOS,-8.812949,17.173524,WOUND,Medium,3,8.586762,-8.586762,Pakistan,PAK,...,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
1,MOS,-5.73543,13.381555,KIDNAP,Medium,200,7.866184,-5.515371,Pakistan,PAK,...,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
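
The indicator columns built with `get_dummies` amount to a multi-hot encoding of each row's code list. Here is a minimal plain-Python sketch with made-up code lists. It assumes, as in our `DIF` column, that a code appears at most once per row.

```python
def multi_hot(code_lists):
    """Turn lists of codes into 0/1 rows, one column per distinct code."""
    vocabulary = sorted({code for codes in code_lists for code in codes})
    rows = [[1 if code in set(codes) else 0 for code in vocabulary]
            for codes in code_lists]
    return vocabulary, rows

# Hypothetical DIF values for three articles.
dif = [["c2.214", "c6.5"], ["c6.5"], []]
vocabulary, rows = multi_hot(dif)
print(vocabulary)
print(rows)
```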


In [17]:
# do the same procedure
model_data = discretize(model_data_4)
# drop AvgTone, Emotion_Charge, Polarity, Tone and the two country-name columns
X = np.array(np.delete(model_data.values, [1,2,6,7,8,10], axis=1), dtype=float)
Y = model_data.values[:,7]  # the discretized Tone
Y = Y.astype(int)
X[np.isnan(X)] = 0
model(X, Y, importance = False)

Test Accuracy is  60.345734686208196
Train Accuracy is  60.47158364918987


## Q4. **Do we have a saturation limit?**  <a id='Part2_Q4'></a>
The whole cluster code is found in the [Q4_cluster.py](FetchingData/Q4_cluster.py) file.
> The cluster .py file contains the code that was run on the cluster to build the Emotion Charge of each country for each month. In this part we focus on the dataframe we get from the parquet files produced by the cluster code.

Some references to question 5 are made, as the two questions are related.

In [None]:
All_df = concat_parquets_info("../data/Gdelt/Q4_parquets_csv_wReg")

In [14]:
All_df.head()

Unnamed: 0,CountrySource,region,EmotionCharge,CountCountrySource,NormEvnt,year
0,Korea (Republic of),Asia,580.323195,8433,0.068816,2015
1,Philippines,Asia,6840.854788,57366,0.119249,2015
2,Malaysia,Asia,4078.438874,47824,0.08528,2015
3,Fiji,Oceania,269.846193,5039,0.053552,2015
4,Turkey,Asia,4847.62357,57565,0.084211,2015


As always, the US dominates the number of events and therefore obtains the biggest Emotion Charge. However, that does not mean it is the most emotional country (this is the subject of question 5), which is why its NormEvnt is not the largest one we have seen. The Philippines, for example, were more emotional in 2015 than the US. NormEvnt allows us to compare countries while removing the bias in the number of events that GDELT captures for each of them.
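
The values printed in this section are consistent with `NormEvnt` being the Emotion Charge divided by the number of events of the source country. The check below recomputes it for three rows shown in this notebook. The formula is our inference from these numbers, not taken from the cluster code.

```python
# (EmotionCharge, CountCountrySource, NormEvnt) as printed in this section.
rows = {
    "Korea (Republic of), 2015": (580.323195, 8433, 0.068816),
    "Philippines, 2015": (6840.854788, 57366, 0.119249),
    "United States of America, 2017": (641624.399395, 14271395, 0.044959),
}

def norm_evnt(emotion_charge, event_count):
    """Emotion charge per event (assumed definition of NormEvnt)."""
    return emotion_charge / event_count

for name, (charge, count, printed) in rows.items():
    print(f"{name}: {norm_evnt(charge, count):.6f} (printed: {printed})")
```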

In [17]:
All_df.loc[All_df['EmotionCharge'] == All_df["EmotionCharge"].max()]

Unnamed: 0,CountrySource,region,EmotionCharge,CountCountrySource,NormEvnt,year
76,United States of America,Americas,641624.399395,14271395,0.044959,2017


In [22]:
# make list of continents (in order of first appearance)
continents = list(All_df['region'].unique())
        
for reg in continents:
    print('We account for', len(All_df.loc[All_df['region'] == reg]),
          'countries in :', reg)

We account for 105 countries in : Asia
We account for 15 countries in : Oceania
We account for 51 countries in : Africa
We account for 48 countries in : Europe
We account for 18 countries in : Americas


In [23]:
EventCount = All_df.groupby(['region'])['CountCountrySource'].sum()

In [24]:
EventCount

region
Africa       1898486
Americas    31454085
Asia         7319279
Europe       3382402
Oceania      2953846
Name: CountCountrySource, dtype: int64

It is worth noting that, even though the Americas is the region with the second-fewest countries, it has by far the most events. This does not mean that most events happen in the Americas, but that this is where GDELT captures most of them.

The plot in the next section takes this into account.

## Q5. Who is more emotional and what do they say? <a id='Part2_Q5'></a>
The whole cluster code is found in the [Q5_cluster.py](FetchingData/Q5_cluster.py) file.
> For this question we focused specifically on the words that were used by some countries in the months that we analyze.

In [27]:
# to start playing with the GCAM words we must have the info of the parquet files in a dataframe
All_df_Q5 = concat_parquets_info("../data/Gdelt/Q5_parquets_csv", drop_flag=False)

In [28]:
# we must do some string operations in order to have the desired list
# regex=False: treat "[" and "]" as literal characters, not as regex syntax
All_df_Q5.SpeechWordsList = All_df_Q5.SpeechWordsList.str.replace("'", "", regex=False)
All_df_Q5.SpeechWordsList = All_df_Q5.SpeechWordsList.str.replace("[", "", regex=False)
All_df_Q5.SpeechWordsList = All_df_Q5.SpeechWordsList.str.replace("]", "", regex=False)
All_df_Q5.SpeechWordsList = All_df_Q5.SpeechWordsList.str.split(",")
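
If the `SpeechWordsList` column holds the string representation of a Python list (an assumption suggested by the characters stripped above), `ast.literal_eval` from the standard library is an alternative that parses it in one step and leaves no stray spaces in the items. `parse_code_list` is a hypothetical helper name.

```python
import ast

def parse_code_list(text):
    """Parse a string such as "['c2.214', 'c6.5']" into a list of strings."""
    value = ast.literal_eval(text)
    if not isinstance(value, list):
        raise ValueError("expected a list literal")
    return [str(item) for item in value]

print(parse_code_list("['c2.214', 'c6.5', 'c15.110']"))
```

With pandas, this could be applied to the whole column with `All_df_Q5.SpeechWordsList.apply(parse_code_list)`.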

In [29]:
# this part is only for visualization purposes (adding another column with size referent to years)
All_df_Q5["SizeViz"] = np.where(All_df_Q5['year']== '2015', 10, 20)
All_df_Q5.loc[All_df_Q5['year'] == '2016', 'SizeViz'] = 15

In [32]:
# now we want to have a list of words for each country and year
Concat_lists = All_df_Q5.groupby(['country_source', 'year','SizeViz'])['SpeechWordsList'].sum()
Q5_df = Concat_lists.to_frame()
Q5_df = Q5_df.reset_index()

In [37]:
# get dictionary with the code words and how many times were used
All_dicts = build_dict(Q5_df)

Next we associate the codes with words. For this we use our own list of human-readable words. You can find how we built this dictionary in [HumanReadableEmotions_creation.py](../data/Gdelt/HumanReadableEmotions_creation.py)

In [38]:
HRE = pd.read_csv("../data/Gdelt/Human_Readable_Emotions_v2",sep='\t')
HRE = HRE.drop(['Unnamed: 0'], axis=1)

In [44]:
All_dicts_Words = dict_with_words(All_dicts,Q5_df,HRE)

In [51]:
# example of random country and words associated
All_dicts_Words[5]

(8348,
 {'futurity': 25,
  'priority': 14,
  'posteriority': 20,
  'preterition': 21,
  'newness': 18,
  'work': 27,
  'time': 22,
  'death': 8,
  'religion': 1,
  'leisure': 28,
  'achievement': 28,
  'money': 28,
  'home': 17,
  'negative': 28,
  'positive': 28,
  'air': 15,
  'sanguine': 8,
  'fire': 7,
  'mythology': 17,
  'music': 28,
  'mechanics': 27,
  'mathematics': 28,
  'meteorology': 26,
  'medicine': 28,
  'military': 3,
  'metrology': 28,
  'mountaineering': 15,
  'acad': 28,
  'arousal': 26,
  'begin': 22,
  'be': 28,
  'earliness': 23,
  'photography': 25,
  'philosophy': 28,
  'physiology': 28,
  'physics': 28,
  'pharmacy': 20,
  'person': 28,
  'philology': 24,
  'plants': 28,
  'noun': 28,
  'name': 24,
  'need': 25,
  'negate': 21,
  'no': 8,
  'forgiveness': 5,
  'sure': 28,
  'fearlessness': 9,
  'favor': 16,
  'indef': 28,
  'if': 27,
  'impers': 28,
  'incr': 23,
  'peace': 4,
  'optimism': 10,
  'captivation': 11,
  'joylessness': 4,
  'statistics': 14,
  'jol

In the list you see the words used by each country during the years 2015, 2016 and 2017. This list was built in order to have a subset of the GCAM feature with human-readable words. As you saw from our model, adding the 64 chosen words (mostly positive) raised the test accuracy from about 56% to about 60%, which supports our premise that these **words are representative** of the speech used by countries.
Some words are used often by all countries; we took this into account when clustering the groups.

We perform a principal component analysis to see whether some countries use similar words and thus cluster together. To do so, we build a matrix with one column per word and one row per country, where each entry is the count of that word for that country.
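
As a small illustration of this matrix and of the standardization applied before the PCA, the sketch below builds it from `(country, {word: count})` pairs and z-scores each column. The countries and counts are invented, and `build_matrix` and `standardize` are hypothetical helper names.

```python
from statistics import mean, pstdev

def build_matrix(items, words):
    """One row per country, one column per word; missing words count as 0."""
    return [[counts.get(word, 0) for word in words] for _, counts in items]

def standardize(matrix):
    """Z-score each column with the population standard deviation."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 so it becomes all zeros.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in matrix]

# Invented example data.
items = [("A", {"peace": 4, "death": 8}), ("B", {"peace": 2}), ("C", {"death": 8})]
words = ["peace", "death", "music"]

matrix = build_matrix(items, words)
print(matrix)
print(standardize(matrix))
```

After standardization every column has zero mean, so words that are frequent everywhere no longer dominate the components simply because of their scale.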

In [20]:
# open the dictionary with the word counts for each country
with open ('../data/Emotions/word_dictionary', 'rb') as fp:
    itemlist = pickle.load(fp)
# open the file with all the 800 words
HRE = pd.read_csv("../data/Emotions/Human_Readable_Emotions_v2",sep='\t')
HRE = HRE.drop(['Unnamed: 0'], axis=1)

In [21]:
# initialize matrix with the words as columns and the counts for each country as rows
feature_list = HRE.Variable.values
zero_data = np.zeros(shape=(0, HRE.shape[0]))
df = pd.DataFrame(zero_data, columns=feature_list)
n_countries = len(itemlist)
# each item is a (country, {word: count}) pair
countries_list = [item[0] for item in itemlist]
# DataFrame.append was removed in pandas 2.0; concatenate all rows at once instead
counts = pd.DataFrame([item[1] for item in itemlist])
df = pd.concat([df, counts], ignore_index=True)
df = df.fillna(value= 0)
df.head(2)

Unnamed: 0,c1.1,c2.7,c2.8,c2.9,c2.10,c2.12,c2.13,c2.16,c2.17,c2.18,...,c36.31,c36.32,c36.33,c37.31,c37.32,c37.33,c41.1,c41.2,c41.3,Unnamed: 21
0,4.0,10.0,8.0,80.0,19.0,93.0,2.0,10.0,56.0,92.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,8.0,6.0,6.0,26.0,14.0,28.0,0.0,9.0,26.0,28.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
# prepare the data for the principal component analysis
x = df.loc[:, feature_list].values
x = StandardScaler().fit_transform(x) # standardize

In [23]:
# calculate the first 6 principal components
pca = PCA(n_components=6)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal1', 'principal2', 'principal3', 'principal4', 'principal5', 'principal6'])
# add the countries to the table
principalDf['countries'] = countries_list
principalDf.head(5)

Unnamed: 0,principal1,principal2,principal3,principal4,principal5,principal6,countries
0,-3.054989,-0.12016,0.687818,1.187802,0.428953,0.05104,Afghanistan
1,-3.434259,-0.196003,-0.106151,-0.063644,-0.0836,-0.067572,Albania
2,-3.460121,-0.208953,-0.128601,-0.046756,-0.091875,-0.105244,Angola
3,-3.520203,-0.250687,-0.098517,-0.031553,-0.12563,-0.104326,Anguilla
4,-3.509191,-0.239283,-0.108023,-0.035506,-0.112824,-0.109143,Argentina


As specified in Milestone 2:
> We see that different variables refer to different feelings.
> After some research (as you'll see), one of the most common emotions is "H4Lvd", which is clearly not an emotion or a feeling of the speech but corresponds to 2 dictionaries, meaning that these words belong to the Harvard and Lasswell dictionaries. To understand what this means, we looked at the spreadsheet of the words in these dictionaries; additional information is found in [H4Lvd](http://www.wjh.harvard.edu/~inquirer/spreadsheet_guide.htm).
When we don't have the specific emotion, it is useful to know which feeling is most commonly associated with the most common dictionaries (hence the check of the spreadsheets). For some others, we already know the feeling the dictionary refers to, such as "Positivity" via Lexicoder, “Smugness” via WordNet Affect, “Passivity” via Regressive Imagery Dictionary, etc. With this information we can associate these sentiments with the news and speeches for each event.

This was overcome by building our own dictionary (as you saw just now) and referring to those words in our analysis.

# Part 3 <a id='Part3'></a>

We mainly do our plots with the plotly library. The choropleth world maps are drawn with GeoJSON.
The plots are linked by the html source code and the code is found in the [Visualization script](Visualization.py).

In [17]:
from Visualization import *
import pandas as pd

# update when changing functions
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Q0. **Where does the news come from?** <a id='Part3_Q0'></a>

On the map below we see the number of GDELT online news sources aggregated by country, on a logarithmic scale. Whole code [here](domains.py).

In [1]:
%%html
<iframe src="https://matterhorn-ada.github.io/urls-log.html" width="100%" height="400px" frameBorder="0" scrolling="no"></iframe>

On the map below we see the number of events happening around the world aggregated by country, on a logarithmic scale. Whole code [here](domains.py).

In [2]:
%%html
<iframe src="https://matterhorn-ada.github.io/events-log.html" width="100%" height="400px" frameBorder="0" scrolling="no"></iframe>

On the map below we see the number of reports about events happening around the world aggregated by country, on a logarithmic scale. Whole code [here](domains.py).

In [3]:
%%html
<iframe src="https://matterhorn-ada.github.io/reports-log.html" width="100%" height="400px" frameBorder="0" scrolling="no"></iframe>

Q1. **Are we emotionally biased?** <a id='Part3_Q1'></a>

In [4]:
%%html
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~matterhorn_ada/24.embed"></iframe>

In [5]:
%%html
<iframe width="900" height="500" frameborder="0" scrolling="no" src="//plot.ly/~matterhorn_ada/6.embed"></iframe>

Q2. **Are some countries ignored in the news?** <a id='Part3_Q2'></a>

We plot a bubble plot, where the size of the bubble relates to the population count of the country.

In [6]:
%%html
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~matterhorn_ada/5.embed"></iframe>

In [8]:
# we calculate the correlation coefficient 
data_q2 = pd.read_csv("../data/data_q2_final.csv") #due to its size this data is not in the Github

In [37]:
data_q2.head()

Unnamed: 0.1,Unnamed: 0,sum_articles,count_events,country,pop,continent
0,0,11894575.0,2710326.0,Iran (Islamic Republic of),81162788,Asia
1,1,1177686.0,240928.0,Morocco,35739580,Africa
2,2,681964.0,127535.0,Belize,374681,Americas
3,3,451584.0,94845.0,El Salvador,6377853,Americas
4,4,611711.0,123985.0,Ecuador,16624858,Americas


In [41]:
data_q2.sum_articles.corr(data_q2.count_events)

0.9995895757120425

The Pearson correlation coefficient is almost 1: the number of articles about a country increases almost linearly with the number of events recorded there.
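
For reference, the Pearson coefficient can be computed by hand. The sketch below uses only the five rows printed above, so its value is illustrative and not the coefficient of the full dataset. It also computes the coefficient on log-transformed values, a common check when a few very large countries could dominate the result (this check is a suggestion and was not part of our analysis).

```python
from math import log, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# The five rows shown by data_q2.head().
sum_articles = [11894575.0, 1177686.0, 681964.0, 451584.0, 611711.0]
count_events = [2710326.0, 240928.0, 127535.0, 94845.0, 123985.0]

print(pearson(sum_articles, count_events))
print(pearson([log(x) for x in sum_articles], [log(y) for y in count_events]))
```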

Q4. **Do we have a saturation limit?** <a id='Part3_Q4'></a>

In [9]:
%%html
<iframe width="900" height="800" frameborder="0" scrolling="no" src="//plot.ly/~StudentUni/47.embed"></iframe>

In this plot you can see the number of events by country in one month and how they relate to the emotion charge. There is indeed a saturation: the emotion value for countries with many events does not go above 0.1, while the emotion charge for countries with fewer events in that month is more variable.
After a certain point we do get tired.

Q5. **Who is more emotional and what do they say?** <a id='Part3_Q5'></a>

In [75]:
fig = visualization_Q5(Q5_df,All_dicts_Words,['United States of America','Kenya','Philippines','Mexico','Greece', 'Jamaica'])
py.iplot(fig, filename='line-mode', auto_open=False)

Here we see some of the most common words used by some countries. The somewhat stable percentages are due to the fact that we count the events in which a word appeared, not the number of times the word appeared in each event. This means that some words appear as clusters, but we can still see differences in speech between some of the countries.

In [43]:
%%html
<iframe width="900" height="500" frameborder="0" scrolling="no" src="//plot.ly/~matterhorn_ada/22.embed"></iframe>

The principal component analysis mainly shows that most countries form one big cluster, which is explained by the fact that news articles do not differ tremendously from each other in word choice. However, some small clusters also form, which shows that some countries share similar, slightly different word choices. There are also some outliers, meaning that some countries use very different words from the others.