# Milestone 3 - Report
------
In this notebook we review everything that was done in the project since Milestone 2. At this stage, the focus was on constructing the awareness model and the final data story.

In [2]:
import os
import json
import time
import folium
import branca
import jenkspy
import pickle
import unicodedata
import collections
import numpy as np
import pandas as pd

## 0. Abstract

*What's the motivation behind your project? A 150 word description of the project idea, goals, dataset used. What story you would like to tell and why?*

Major events happen regularly all around the world, some involving a high number of casualties, but the resulting international reaction is often far from proportional. Most of the time, the largest reaction comes from the place where the incident occurred or from places close by. The objective would be to create an awareness map and determine why people react to an event. From that, we would attempt to define an awareness metric. We want to see how factors other than physical proximity come into play, such as country, culture, language and religion. With this, we could determine which country has the highest level of international awareness. The project would require the Twitter API to acquire hashtag-specific tweets with geolocation, and thereby measure the awareness and reactions of different communities to a given event. GDELT would be used to recover standardised information about the different events.
____
____
____


# 1. Recapitulation of Milestone 2

To recap what was done in the previous milestone, we present a brief summary of the previous notebook, which can be found here **(TODO: add the link)**.

____

## 1.1 Tweet Acquisition
A lot of time was spent acquiring the Tweets for two main reasons : 
- The Twitter dump did not contain the geolocations of the tweets or the users. 
- The Twitter API does not permit the recovery of tweets which are over 1 week old when searching for tweets with a given hashtag around a specific date. 

That is why a significant amount of time was put into recovering the tweets through web scraping with PhantomJS, using random scrolling times and random breaks to avoid being blocked by Twitter.

The tweets were acquired based on their hashtags, all the while making sure to have a wide range of hashtags in different languages. 
____

## 1.2. Tweet Geolocalisation

Once the tweets were recovered, we only kept those for which we had the geolocation of either the tweet itself or the user. As the locations are entered manually, they are not standardized, and the corresponding location needed to be determined.

The first step was therefore to format the strings, for example by removing all special characters and setting everything to lower case. Afterwards, we created multiple mappings, which are applied in the order presented below, from the most trusted method to the least trusted:
- A country mapper, which uses multiple variations of the country name (different spellings and languages) to determine the corresponding country, identified by its ISO2 code
- A capital mapper, which maps the capital of a country to the ISO2 code
- A first city mapper, from which duplicates were removed, as there are multiple cities in the world with the same name (example: Toronto in Canada, the US and Australia)
- A final city mapper with all the cities of all the countries in the world, using the database provided here **(TODO: add the link)**
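The cascade described above can be sketched as follows. This is only a minimal illustration: the function names, the toy dictionary contents and the choice to replace special characters by spaces are assumptions made for the example, not the code used in the project.

```python
import re
import unicodedata


def normalize_location(raw):
    """Lower-case the string, strip accents and replace special characters by spaces."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# Toy mappers (hypothetical contents), ordered from most to least trusted
country_mapper = {"france": "FR", "suisse": "CH", "switzerland": "CH"}
capital_mapper = {"paris": "FR", "bern": "CH"}
unique_city_mapper = {"lausanne": "CH"}
all_city_mapper = {"toronto": "CA"}
MAPPERS = [country_mapper, capital_mapper, unique_city_mapper, all_city_mapper]


def locate(raw):
    """Return the ISO2 code given by the first mapper that matches, else None."""
    text = normalize_location(raw)
    # Try the whole string first, then each individual word
    candidates = [text] + text.split()
    for mapper in MAPPERS:
        for candidate in candidates:
            if candidate in mapper:
                return mapper[candidate]
    return None


print(locate("Zürich, Suisse!"))  # matched by the country mapper
```

Because the loop over mappers is the outer one, a country match always wins over a city match, which mirrors the confidence ordering of the list.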

## 1.3. Visualization of the Reaction For An Event

For one of the selected events, we then plotted the number of related tweets on a choropleth map. This helped us spot certain issues in the mappings (a high number of tweets in Antarctica, for example).

____


## 1.4. Enriching the Data 

To create our awareness model, we wanted information that could be used to define a kind of distance between countries. That is why we recovered the following for all countries:
- area
- ISO2
- ISO3
- languages
- border countries
- latitude and longitude
- language_codes
- Internet users
- population
- GDP
- GDP per capita
- religions
- government types
- population in poverty
- unemployment rate

As you can see, we have more features than we originally presented in Milestone 2. That is because we realized that one of the datasets we used contained a lot more (and more complete) information of interest to us. In the end, the four datasets we were originally using were reduced to just two:

- Dataset 1: https://github.com/mledoze/countries
- Dataset 2: https://github.com/opendatajson/factbook.json

For the complete data extraction process, see "Data enriching.ipynb".
____
____
____


# 2. Creating the Awareness Model

To create our model, we decided to construct a graph linking all countries, where the edge weights represent the proximity between countries in terms of awareness of major events.

For that we used the dataset containing all the information regarding the countries and looked to give numeric values to all the different categories mentioned previously and a few new ones: 

1. Languages
2. Governments 
3. Distances between countries :
    - Bird's-eye (straight-line) distances
    - Hop matrix: the number of countries which separate two given countries (**TODO: note the limitation that the distances to cross within each country are not taken into account, e.g. going from Finland to China only passes through Russia**).
    - Normalized adjacency matrix: a matrix which gives the neighboring countries, with a normalizing factor to account for multiple countries on the border
    - Flight routes between countries
4. Religion
5. Population, Area, GDP
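As an illustration of the hop matrix and the normalized adjacency matrix listed above, here is a minimal sketch. It assumes that a hop is one border crossing (so direct neighbours are at 1) and that the normalizing factor is one over the number of neighbours; both choices, the function names and the toy border lists are assumptions for the example.

```python
from collections import deque

import numpy as np

# Toy subset of land borders (ISO2 codes); the real lists come from the country dataset
borders = {
    "FI": ["NO", "RU", "SE"],
    "SE": ["FI", "NO"],
    "NO": ["FI", "RU", "SE"],
    "RU": ["CN", "FI", "NO"],
    "CN": ["RU"],
    "IS": [],  # island: no land border
}


def country_index(borders):
    """Fix a row/column position for every country code."""
    return {code: i for i, code in enumerate(sorted(borders))}


def hop_matrix(borders):
    """Breadth-first search from every country; unreachable pairs stay at inf."""
    idx = country_index(borders)
    hops = np.full((len(idx), len(idx)), np.inf)
    for source in borders:
        hops[idx[source], idx[source]] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbour in borders[current]:
                if np.isinf(hops[idx[source], idx[neighbour]]):
                    hops[idx[source], idx[neighbour]] = hops[idx[source], idx[current]] + 1
                    queue.append(neighbour)
    return hops


def normalized_adjacency(borders):
    """Each neighbour gets weight 1 / (number of neighbours of the row country)."""
    idx = country_index(borders)
    adj = np.zeros((len(idx), len(idx)))
    for country, neighbours in borders.items():
        for neighbour in neighbours:
            adj[idx[country], idx[neighbour]] = 1.0 / len(neighbours)
    return adj


idx = country_index(borders)
hops = hop_matrix(borders)
adj = normalized_adjacency(borders)
print(hops[idx["FI"], idx["CN"]])  # Finland -> Russia -> China: two border crossings
```

In this toy graph Finland and China are only two hops apart, which is exactly the limitation noted in the list: the size of the country being crossed is ignored.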


What needed to be taken into account here is that many of our variables are categorical. We therefore needed to find a way to assign them numeric values in order to construct the final graph.

Remark: most of the information was extracted from the CIA Factbook, as it is a complete and reliable source.

**TODO: check whether a country being very large, with a large population, can decrease its interest in what happens elsewhere in the world**

____

## 2.1. Languages

Languages are representative of culture, and most of the time countries which speak the same language share a certain proximity. That is why this metric was used to create the graph.

Two main possibilities presented themselves:
- binary: if two countries have the same official language, then they are linked
- more informed: some languages are similar in terms of mutual understanding, ease of learning, etc. That is why we wanted to take linguistic distances into account. Unfortunately, this is a current research topic and there is no database which gives the linguistic distances between all the official languages of all countries. Certain databases exist linking English to other languages, but this is not sufficient for our application. That is why we decided to use another, albeit more abstract, metric: the phylogenetic trees linking languages together.


Using the phylogenetic trees from Glottolog (http://glottolog.org/glottolog/family), we were able to construct a distance metric and compute the distances between all languages, as can be seen below. The notebook handling this can be found here (TODO: add the link to the notebook).
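The exact distance definition is not given here. One plausible construction, shown purely as an assumption, counts the number of tree edges between two languages through their deepest common ancestor and returns infinity when they belong to different families. The lineages below are heavily shortened and only illustrative; they are not the actual Glottolog classification paths.

```python
import math

# Hypothetical, shortened lineages (family root first, language last)
lineages = {
    "french": ["Indo-European", "Italic", "Romance", "French"],
    "spanish": ["Indo-European", "Italic", "Romance", "Spanish"],
    "german": ["Indo-European", "Germanic", "German"],
    "finnish": ["Uralic", "Finnic", "Finnish"],
}


def tree_distance(lang_a, lang_b):
    """Number of edges between two languages via their deepest common ancestor."""
    path_a, path_b = lineages[lang_a], lineages[lang_b]
    shared = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared += 1
    if shared == 0:
        # No common family: the languages are not connected in the tree
        return math.inf
    return (len(path_a) - shared) + (len(path_b) - shared)


for pair in [("french", "spanish"), ("french", "german"), ("french", "finnish")]:
    print(pair, tree_distance(*pair))
```

With this definition, closely related languages get small distances, and languages from unrelated families get an infinite distance, which matches the "non infinite distances" mentioned further down.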

In [None]:
# TODO: INSERT THE DISTANCE MATRIX BETWEEN LANGUAGES

From this language distance matrix we were able to compute a standardized distance between all countries. 

The idea is the following:
- if the countries have an official language in common, set the distance to 0
- otherwise, take the distances between their official languages and compute the average of the finite ones
- if none of these distances is finite, set the distance to inf
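These three rules can be sketched as a small function. The function name, the `lang_dist` lookup keyed by unordered language pairs and its toy values are assumptions for the example; missing pairs are treated as infinitely distant.

```python
import math

# Hypothetical language distances, keyed by unordered pairs
lang_dist = {
    frozenset({"french", "spanish"}): 2,
    frozenset({"french", "german"}): 5,
    frozenset({"french", "finnish"}): math.inf,
}


def country_language_distance(langs_a, langs_b, lang_dist):
    """Distance between two countries given their lists of official languages."""
    # Rule 1: a shared official language gives distance 0
    if set(langs_a) & set(langs_b):
        return 0.0
    # Rule 2: average of the finite pairwise language distances
    pairwise = [
        lang_dist.get(frozenset({a, b}), math.inf)
        for a in langs_a
        for b in langs_b
    ]
    finite = [d for d in pairwise if math.isfinite(d)]
    if finite:
        return sum(finite) / len(finite)
    # Rule 3: no finite distance at all
    return math.inf


print(country_language_distance(["german", "spanish"], ["french"], lang_dist))
```

Averaging only the finite distances means that a single related language pair is enough to connect two countries, even if their other official languages are unrelated.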


In [None]:
# TODO: INSERT THE LANGUAGE-BASED DISTANCE MATRIX BETWEEN COUNTRIES

All of these values are then standardized. 

In [None]:
# TODO: INSERT THE STANDARDIZED MATRIX

____

## 2.2. Governments

The type of government is also a good indicator of how people perceive worldwide events. People in different countries with the same government type are likely to be influenced in a similar way.

After getting the list of all the government types (see "Data enriching.ipynb" for more details on how we got them), we have to convert them to a meaningful numeric scale.

First, we divide all the government types into three groups and assign them the numerical values -1, 0 and 1. These are the resulting groups:
- Group 1:
    - 'parliamentary democracy'
    - 'parliamentary republic'
    - 'federal republic'
    - 'federation of monarchies'
    - 'semi-presidential republic'
    - 'semi-presidential federation'
- Group 2: 
    - 'non-self-governing overseas territory'
    - 'in transition'
    - 'unknown'
- Group 3: 
    - 'presidential republic'
    - 'presidential democracy'
    - 'monarchy'
    - 'theocratic republic'
    - 'communist state'
    - 'absolute monarchy'

As our criterion for grouping the government types, we used the amount of power held by the leaders of the government. Countries where the leader has a lot of power go in group 3 and are assigned the value 1; the others go in group 1 and are assigned the value -1. Government types that are not well defined (group 2) are placed at the center of the scale, with the value 0.

Here is the actual implementation (full code in "Data enriching.ipynb"):

In [None]:
data = pd.read_pickle('./DataEnriching/data.pickle')
gov_type_array = [['parliamentary democracy', 'parliamentary republic', 'federal republic', 'federation of monarchies',
                   'semi-presidential republic', 'semi-presidential federation'],
                  ['non-self-governing overseas territory', 'in transition', 'unknown'],
                  ['presidential republic', 'presidential democracy', 'monarchy', 'theocratic republic', 'communist state',
                   'absolute monarchy']]
# Invert the groups into a {government type: numeric value} lookup:
# group 1 -> -1, group 2 -> 0, group 3 -> 1
mapping_gov_type = {gov: value
                    for value, gov_group in zip((-1, 0, 1), gov_type_array)
                    for gov in gov_group}

# Map gov_type values to their numerical value (-1, 0, 1)
data['gov_type_num'] = data.gov_type.map(mapping_gov_type)
data[['gov_type', 'gov_type_num']].head()

____

## 2.3. Distances 

____

## 2.4. Religions 

Religion, like government, affects the way people think. Religious ties between countries are therefore likely to increase the awareness of events that happen in countries sharing the same religions.

We extracted the religions from Dataset 2. Here is the dataframe with all the selected data:

In [18]:
rel_df = pd.read_pickle("./DataEnriching/data.pickle")
cols = rel_df.columns.tolist()[15:]
rel_df = rel_df[cols]
rel_df.head()

All columns below belong to the top-level `religion` group.

| name | protestants | catholics | ortodox | buddhism | hindu | jewish | muslim | oriental | other | animist | atheist | unaffiliated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | 0.049 | 0.753 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Afghanistan | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Angola | 0.381 | 0.411 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Anguilla | 0.732 | 0.068 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Åland Islands | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


____

## 2.5. Population, Area, GDP, Internet users, population in poverty, unemployment rate

This data is also straightforward to obtain from Dataset 2: it can be read directly from the data.pickle file. The result is the following:

In [21]:
other_features = pd.read_pickle('./DataEnriching/data.pickle')
cols = [('POP',''),('area',''),('gdp',''),('gdp_capita',''),('Internet users',''),('pop_pov',''),('unemployment','')]
other_features = other_features[cols]
other_features.head(5)

| name | POP | area | gdp | gdp_capita | Internet users | pop_pov | unemployment |
|---|---|---|---|---|---|---|---|
| Aruba | 113648.0 | 180.0 | 2516000000.0 | 25300.0 | 99.0 | | 0.069 |
| Afghanistan | 33332025.0 | 652230.0 | 18400000000.0 | 2000.0 | 2690000.0 | 0.358 | 0.35 |
| Angola | 20172332.0 | 1246700.0 | 91940000000.0 | 6800.0 | 2434000.0 | 0.405 | |
| Anguilla | 16752.0 | 91.0 | 175400000.0 | 12200.0 | 12.0 | 0.23 | 0.08 |
| Åland Islands | | 1580.0 | | | | | |


____

## 2.6. Constructing the Graph

____
____
____

# 3. Improving the Mappings

One main issue with the mapping so far was that duplicate city names were not handled correctly. At the time of Milestone 2, we simply went through the complete mapping sequentially and stopped at the first match. Because of the duplicates, there was no guarantee that the city found was the most plausible one. We therefore corrected the mapping to take into account the population of each city, which is also provided in the GeoNames database (http://download.geonames.org/export/dump/). When several cities in different countries share the same name, we now keep only the one with the largest population, as it is the most probable match.
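A minimal pandas sketch of this deduplication is given below. The column names and the population figures are toy values chosen for the example; they are neither the GeoNames column headers nor real data.

```python
import pandas as pd

# Hypothetical excerpt of a city table (toy populations)
cities = pd.DataFrame({
    "city": ["toronto", "toronto", "toronto", "lausanne"],
    "iso2": ["CA", "US", "AU", "CH"],
    "population": [2600000, 5000, 5400, 116000],
})

# Sort by population so that the largest homonym comes first,
# then keep a single row per city name
city_mapper = (
    cities.sort_values("population", ascending=False)
          .drop_duplicates(subset="city", keep="first")
          .set_index("city")["iso2"]
          .to_dict()
)

print(city_mapper["toronto"])
```

The resulting dictionary has exactly one entry per city name, so the order in which the mapping is traversed no longer affects the result.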

____
____
____

# 4. Predictions Using the Awareness Model and Graph Diffusion