# Milestone 3 - Report
------
In this notebook we review everything that was done in the project since Milestone 2. At this stage the focus was on constructing the awareness model and the final data story.

In [42]:
import os
import json
import time
import folium
import branca
import jenkspy
import pickle
import unicodedata
import collections
import numpy as np
import pandas as pd
from scipy import spatial
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 0. Abstract

*What's the motivation behind your project? A 150 word description of the project idea, goals, dataset used. What story you would like to tell and why?*

Major events happen regularly all around the world, some involving a high number of casualties, yet the resulting international reaction is often far from proportional. Most of the time the largest reaction comes from the place where the incident occurred or from places close by. The objective would be to create an awareness map and determine why people react to an event. From that we would attempt to define an awareness metric. We want to see how factors other than physical proximity come into play, such as country, culture, language and religion. With this we could determine which country has the highest level of international awareness. The project would require the Twitter API to acquire hashtag-specific tweets with geolocation and thereby measure the awareness and reactions of different communities to a given event. GDELT would be used to recover standardised information regarding different events.
____
____
____


# 1. Recapitulation of Milestone 2

To recap what was done in the previous milestone, we present here a brief summary of the previous notebook, which can be found here **(TODO: add the link)**.

____

## 1.1. Tweet Acquisition
A lot of time was spent acquiring the Tweets for two main reasons : 
- The Twitter dump did not contain the geolocations of the tweets or the users. 
- The Twitter API does not permit the recovery of tweets which are over 1 week old when searching for tweets with a given hashtag around a specific date. 

That is why a significant amount of time was put into recovering the tweets through web scraping with PhantomJS, using random scrolling times and random breaks to avoid being blocked by Twitter.

The tweets were acquired based on their hashtags, all the while making sure to have a wide range of hashtags in different languages. 
____

## 1.2. Tweet Geolocalisation

Once the tweets were recovered, we only kept those for which we had the geolocation of either the tweet itself or the user. As the locations are entered manually, they are not standardized and the corresponding location needed to be determined.

The first step was therefore to format the strings, for example by removing all special characters and setting everything to lower case. Afterwards, we created multiple mappings, applied in the order presented below, which reflects our confidence in each method:
- A country mapper which uses multiple variations of the country name (different spellings and languages) to determine the country it corresponds to with the given ISO2 code
- A capital mapper which maps the capital of a country to the ISO2 code
- A first city mapper where duplicates were removed, as there are multiple cities in the world with the same name (example: Toronto in Canada, the US and Australia).
- A final city mapper with all the cities of all the countries in the world using the database provided here **(TODO: add the link)**
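
The cascade described above can be sketched as follows. This is only a minimal illustration, not the code used in Milestone 2: the function names `normalize` and `locate` and the tiny dictionaries are hypothetical, and the real mappers contain far more entries.

```python
import re
import unicodedata

def normalize(location):
    """Lower-case the string, strip accents and replace special characters with spaces."""
    text = unicodedata.normalize("NFKD", location)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

# Tiny illustrative mappers (hypothetical content), ordered from most to least trusted
country_mapper = {"france": "FR", "frankreich": "FR", "canada": "CA"}
capital_mapper = {"paris": "FR", "ottawa": "CA"}
unique_city_mapper = {"lyon": "FR"}
all_city_mapper = {"toronto": "CA"}
MAPPERS = [country_mapper, capital_mapper, unique_city_mapper, all_city_mapper]

def locate(location):
    """Return the ISO2 code given by the first mapper that recognises the location, else None."""
    cleaned = normalize(location)
    # Try the whole cleaned string first, then each of its words
    candidates = [cleaned] + cleaned.split()
    for mapper in MAPPERS:
        for candidate in candidates:
            if candidate in mapper:
                return mapper[candidate]
    return None

print(locate("Paris, FRANCE!"))
print(locate("Montréal / Toronto"))
```

Because the mappers are tried in order, a country name anywhere in the string wins over a city name, which mirrors the confidence ordering of the list above.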

## 1.3. Visualization of the Reaction For An Event

We then displayed, for one of the selected events, the number of related tweets per country on a choropleth map. This helped us spot certain issues in the mappings (a high number of tweets in Antarctica, for example).

____


## 1.4. Enriching the Data 

To create our awareness model we wanted to recover information which could be used to determine a sort of distance between the different countries, that is why we recovered for all countries: 
- area
- ISO2
- ISO3
- languages
- border countries
- latitude and longitude
- language_codes
- Internet users
- population
- GDP
- GDP per capita
- religions
- government types
- population in poverty
- unemployment rate

As you can see, we have more features than we originally presented in Milestone 2. That is because we realized that one of the datasets we used had a lot more (and more complete) information of interest to us. In the end, the 4 datasets that we were using originally were reduced to just two:

- Dataset 1: https://github.com/mledoze/countries
- Dataset 2: https://github.com/opendatajson/factbook.json

The detailed data extraction process can be found in "Data enriching.ipynb".
____
____
____


# 2. Creating the Awareness Model

To create our model we decided to construct a graph linking all countries, where the edge weights represent the proximity between countries in terms of awareness of major events.

For that we used the dataset containing all the information regarding the countries and sought to assign numeric values to the categories mentioned previously, plus a few new ones:

1. Languages
2. Governments 
3. Distances between countries :
    - Bird's-eye (straight-line) distances
    - Hop matrix: the number of countries which separate two given countries (**TODO: note that the distance travelled within each country is not taken into account; for example, Finland to China passes through Russia only**).
    - Normalized adjacency matrix: a matrix which gives the neighboring countries, with a normalizing factor to account for multiple countries on the border
    - Flight routes between countries
4. Religion
5. Population, Area, GDP
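
As an illustration of the hop matrix, here is a minimal sketch that counts hops with a breadth-first search over the border lists. It assumes that a hop means one border crossing (so direct neighbours are one hop apart); the `borders` dictionary is a hypothetical subset of the real border data and `hop_distances` is a name chosen for this example.

```python
from collections import deque

def hop_distances(borders, source):
    """Breadth-first search over the border graph: hops from `source` to every reachable country."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in borders.get(current, []):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    return hops

# Toy border lists (hypothetical subset, ISO2 codes)
borders = {
    "FI": ["RU", "SE", "NO"],
    "RU": ["FI", "CN", "NO"],
    "CN": ["RU"],
    "SE": ["FI", "NO"],
    "NO": ["FI", "SE", "RU"],
    "IS": [],
}

hops_from_fi = hop_distances(borders, "FI")
print(hops_from_fi["CN"])
```

Countries without a land route from the source (Iceland here) are simply absent from the result, so a full hop matrix would need a convention such as infinity for those pairs.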


Many of these variables are categorical, so we needed a way to assign them numeric values in order to construct the final graph.

Remark: most of the information was extracted from the CIA World Factbook, as it is a complete and reliable source.

**TODO: check whether a very large country with a large population shows less interest in what happens elsewhere in the world**

____

## 2.1. Languages

Languages are representative of culture and, most of the time, countries which speak the same language are close to one another in this sense. That is why this metric was used to create the graph.

Two main possibilities presented themselves:
- binary: if two countries have the same official language then they are linked
- more informed: some languages are similar, in the sense that speakers of one can more easily understand or learn the other. That is why we wanted to take linguistic distances into account. Unfortunately this is an active research topic and, to our knowledge, there is no database which gives the linguistic distances between all the official languages of all countries. Certain databases exist linking English to other languages, but this is not sufficient for our application. That is why we decided to use another, albeit more abstract, metric: the phylogenetic trees linking languages together.


Using the phylogenetic trees from Glottolog (TODO: insert a nicer link) http://glottolog.org/glottolog/family, we were able to construct a distance metric and compute the distances between all languages, as can be seen below. The notebook handling this can be found here (TODO: add the link to the notebook).
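
One possible way to turn a phylogenetic tree into a distance is to count the tree edges separating two languages. The sketch below assumes exactly that, which may differ from the metric used in the dedicated notebook; the lineages are hypothetical and heavily simplified, and `tree_distance` is a name chosen for this example.

```python
import math

def tree_distance(lineage_a, lineage_b):
    """Number of tree edges between two languages, or inf if they share no family."""
    common = 0
    for node_a, node_b in zip(lineage_a, lineage_b):
        if node_a != node_b:
            break
        common += 1
    if common == 0:
        return math.inf
    # Edges from each language up to the deepest shared ancestor
    return (len(lineage_a) - common) + (len(lineage_b) - common)

# Hypothetical, simplified lineages (family root first, language last)
lineages = {
    "french": ["Indo-European", "Italic", "Romance", "French"],
    "spanish": ["Indo-European", "Italic", "Romance", "Spanish"],
    "german": ["Indo-European", "Germanic", "German"],
    "finnish": ["Uralic", "Finnic", "Finnish"],
}

print(tree_distance(lineages["french"], lineages["spanish"]))
print(tree_distance(lineages["french"], lineages["finnish"]))
```

Languages from unrelated families get an infinite distance under this assumption, which is consistent with the infinite values handled in the country-level distance.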

In [None]:
# TODO: insert the distance matrix between languages

From this language distance matrix we were able to compute a standardized distance between all countries. 

The idea is the following:
- if two countries have an official language in common, set the distance to 0;
- otherwise, take the distances between all pairs of their official languages and average the finite ones;
- if none of these distances is finite, set the distance to inf.
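
The three rules can be sketched as a small function. This is an illustrative sketch only: `country_language_distance` is a hypothetical name, the language distances are toy values, and language pairs missing from the dictionary are assumed to be infinitely far apart.

```python
import math
from itertools import product

def country_language_distance(langs_a, langs_b, lang_distance):
    """Distance between two countries given their lists of official languages."""
    # Rule 1: a shared official language gives distance 0
    if set(langs_a) & set(langs_b):
        return 0.0
    # Rule 2: average the finite distances over all pairs of official languages
    distances = [lang_distance.get(frozenset((a, b)), math.inf)
                 for a, b in product(langs_a, langs_b)]
    finite = [d for d in distances if math.isfinite(d)]
    if finite:
        return sum(finite) / len(finite)
    # Rule 3: no finite distance at all
    return math.inf

# Toy language distances (hypothetical values); missing pairs count as inf
lang_distance = {
    frozenset(("french", "spanish")): 2,
    frozenset(("french", "german")): 5,
    frozenset(("german", "spanish")): 5,
}

print(country_language_distance(["french", "german"], ["spanish"], lang_distance))
```

Keying the dictionary by `frozenset` makes the lookup symmetric, so the pair (french, spanish) and the pair (spanish, french) share one entry.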


In [None]:
# TODO: insert the distance matrix between countries based on language

All of these values are then standardized. 
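
The text does not specify which standardization is applied. As one plausible option, the sketch below rescales the finite distances to the range [0, 1] (min-max scaling) and leaves infinite distances untouched; both the choice of method and the name `min_max_standardize` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def min_max_standardize(distance_df):
    """Rescale the finite entries of a distance matrix to [0, 1]; inf entries stay inf."""
    values = distance_df.to_numpy(dtype=float)
    finite = np.isfinite(values)
    low, high = values[finite].min(), values[finite].max()
    scaled = values.copy()
    if high > low:
        scaled[finite] = (values[finite] - low) / (high - low)
    else:
        # All finite distances are identical: map them to 0
        scaled[finite] = 0.0
    return pd.DataFrame(scaled, index=distance_df.index, columns=distance_df.columns)

# Toy country-by-country distance matrix (hypothetical values)
toy = pd.DataFrame([[0, 2, np.inf], [2, 0, 4], [np.inf, 4, 0]],
                   index=["A", "B", "C"], columns=["A", "B", "C"])
standardized = min_max_standardize(toy)
print(standardized)
```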

In [None]:
# TODO: insert the standardized matrix

____

## 2.2. Governments

The type of government is also a good indicator of how people perceive worldwide events. We assume that people in different countries with the same government type are influenced in a similar way.

After getting the list of all the government types (see "Data enriching.ipynb" for more details on how we got them), we have to convert them to a meaningful numeric scale.

First, we start by dividing all the government types in three groups and assigning them numerical values -1, 0 and 1. These are the resulting groups:
- Group 1:
    - 'parliamentary democracy'
    - 'parliamentary republic'
    - 'federal republic'
    - 'federation of monarchies'
    - 'semi-presidential republic'
    - 'semi-presidential federation'
- Group 2: 
    - 'non-self-governing overseas territory'
    - 'in transition'
    - 'unknown'
- Group 3: 
    - 'presidential republic'
    - 'presidential democracy'
    - 'monarchy'
    - 'theocratic republic'
    - 'communist state'
    - 'absolute monarchy'

To group the government types, we used the amount of power held by the leaders of the government as the criterion. Countries where the leader has a lot of power go in group 3 and are assigned the value 1; the others go in group 1 and are assigned the value -1. Government types that are not well defined are placed at the center of the scale with the value 0.

Here is the actual implementation (full code in "Data enriching.ipynb"):

In [3]:
# Reading the dataframe
data = pd.read_pickle('./DataEnriching/data.pickle')

# Defining the three groups
gov_type_array = [['parliamentary democracy', 'parliamentary republic', 'federal republic', 'federation of monarchies',
                   'semi-presidential republic', 'semi-presidential federation'],
                  ['non-self-governing overseas territory', 'in transition', 'unknown'],
                  ['presidential republic', 'presidential democracy', 'monarchy', 'theocratic republic', 'communist state',
                   'absolute monarchy']]

# Creating the mapping dictionary: group 1 -> -1, group 2 -> 0, group 3 -> 1
mapping_gov_type = {}
for value, gov_group in zip((-1, 0, 1), gov_type_array):
    for gov in gov_group:
        mapping_gov_type[gov] = value

# Mapping gov_type values to their numerical value (1, 0, -1)
data['gov_type_num'] = data.gov_type.map(mapping_gov_type)
data[['gov_type', 'gov_type_num']].head()

name,gov_type,gov_type_num
Aruba,parliamentary democracy,-1.0
Afghanistan,presidential republic,1.0
Angola,presidential republic,1.0
Anguilla,parliamentary democracy,-1.0
Åland Islands,unknown,0.0


To compute the distance between government types, we use the numeric government type. The distance is defined as the absolute value of the difference between the values of each pair of countries. That way, we end up with a symmetric metric with possible values 0, 1 and 2.

In [9]:
# Reading the dataframe and setting ISO2 as index (as it will be the useful country code)
data = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'data.pickle'))
data.reset_index(inplace=True)
data.set_index('ISO2', inplace=True)

In [23]:
# Selecting numeric government type column
gov_type_df = data[['gov_type_num']]

# Creating the matrix with the distances: |value_i - value_j| for every pair of countries.
# NumPy broadcasting replaces the nested loops and DataFrame.append (removed in pandas 2.0).
values = gov_type_df['gov_type_num'].to_numpy(dtype=float)
gov_distance_df = pd.DataFrame(np.abs(values[:, None] - values[None, :]),
                               index=gov_type_df.index, columns=gov_type_df.index.tolist())
gov_distance_df.head(10)

ISO2,AW,AF,AO,AI,AX,AL,AD,AE,AR,AM,...,VG,VI,VN,VU,WF,WS,YE,ZA,ZM,ZW
AW,0.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,...,0.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0
AF,2.0,0.0,0.0,2.0,1.0,2.0,2.0,2.0,0.0,0.0,...,2.0,0.0,0.0,2.0,2.0,2.0,1.0,2.0,0.0,0.0
AO,2.0,0.0,0.0,2.0,1.0,2.0,2.0,2.0,0.0,0.0,...,2.0,0.0,0.0,2.0,2.0,2.0,1.0,2.0,0.0,0.0
AI,0.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,...,0.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0
AX,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
AL,0.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,...,0.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0
AD,0.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,...,0.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0
AE,0.0,2.0,2.0,0.0,1.0,0.0,0.0,0.0,2.0,2.0,...,0.0,2.0,2.0,0.0,0.0,0.0,1.0,0.0,2.0,2.0
AR,2.0,0.0,0.0,2.0,1.0,2.0,2.0,2.0,0.0,0.0,...,2.0,0.0,0.0,2.0,2.0,2.0,1.0,2.0,0.0,0.0
AM,2.0,0.0,0.0,2.0,1.0,2.0,2.0,2.0,0.0,0.0,...,2.0,0.0,0.0,2.0,2.0,2.0,1.0,2.0,0.0,0.0


____

## 2.3. Distances 

____

## 2.4. Religions 

Religion, like government, affects the way people think. Religious bonds between countries are likely to have an impact on the awareness of events that happen in countries sharing the same religions.


### Methodology
We could extract the religions from Dataset 2, but some cleaning was needed to get the data into the desired format. First we needed to know which religions were in the dataset, so we created an array containing the unique names of the religions that appeared across all countries. Once we knew all the religions, we created the religions dataframe, containing the percentage of every religion in every country. Finally, we grouped the small religions into 10 broader categories.

As said, we start by creating the unique religions array. Here is a sample of the array:

In [19]:
#unique_religions[:20]
['Adventist', 'Animist', 'Armenian Apostolic', 'Assembly of God', 'Awakening Churches/Christian Revival', 'Badimo',
 "Baha'i", 'Baptist', 'Bektashi', 'Buddhism', 'Buddhist', 'Bukot nan Jesus', 'Calvinist', 'Cao Dai', 'Catholic', 'Christian',
 'Christianity','Church of England', 'Church of Ireland', 'Church of Norway']

['Adventist',
 'Animist',
 'Armenian Apostolic',
 'Assembly of God',
 'Awakening Churches/Christian Revival',
 'Badimo',
 "Baha'i",
 'Baptist',
 'Bektashi',
 'Buddhism',
 'Buddhist',
 'Bukot nan Jesus',
 'Calvinist',
 'Cao Dai',
 'Catholic',
 'Christian',
 'Christianity',
 'Church of England',
 'Church of Ireland',
 'Church of Norway']

Now that we have all possible religions in one array, we will create the dataframe (rel_df), with all religions as columns, and the country as the index. The cell values will represent the percentage of the religion in the country.

Some of the countries had badly formatted data (e.g. percentages given as ranges like 10-30%, or no percentages at all). We manually added the rows of the countries that had issues. For the countries without percentages, we looked them up on Wikipedia and added them manually as well.

To create the rel_df dataframe, rows were appended to an initially empty dataframe. Each row was created by looping through unique_religions and checking whether the country had that religion: if so, we inserted its percentage; if not, we inserted zero in that religion's column.
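
The same table can be built without an explicit loop. The sketch below is an alternative to the row-by-row approach described above, not the code from "Data enriching.ipynb": it starts from a hypothetical dictionary `country_religions` (the values for BI and BW are copied from the rel_df rows displayed in this notebook) and lets pandas fill in the zeros.

```python
import pandas as pd

# Hypothetical input: for each ISO2 code, the religions reported and their share
country_religions = {
    "BI": {"Catholic": 0.621, "other": 0.036, "unspecified": 0.079},
    "BW": {"Christian": 0.791, "none": 0.152, "other": 0.014, "unspecified": 0.003},
}

# Every religion name that appears in at least one country
unique_religions_sketch = sorted({name for shares in country_religions.values() for name in shares})

# One row per country, one column per religion, 0.0 where the religion is absent
rel_df_sketch = (
    pd.DataFrame.from_dict(country_religions, orient="index")
    .reindex(columns=unique_religions_sketch)
    .fillna(0.0)
)
rel_df_sketch.index.name = "ISO2"
print(rel_df_sketch)
```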

Here is what the rel_df looks like, after dropping the smallest religions (columns) that didn't have more than 10% in any country:

In [51]:
rel_df = pd.read_pickle('./DataEnriching/rel_df.pickle')
rel_df.set_index('ISO2', inplace=True)
rel_df.head()

ISO2,Armenian Apostolic,Assembly of God,Awakening Churches/Christian Revival,Buddhism,Buddhist,Calvinist,Catholic,Christian,Church of Norway,Congregational Christian Church,...,none,none or other,other,other Christian,other and unspecified,other or none,other or unspecified,unaffiliated,unaffiliated or other,unspecified
DZ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AO,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.123,0.0,0.086,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BW,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.791,0.0,0.0,...,0.152,0.0,0.014,0.0,0.0,0.0,0.0,0.0,0.0,0.003
BJ,0.0,0.0,0.0,0.0,0.0,0.0,0.255,0.0,0.0,0.0,...,0.058,0.0,0.026,0.095,0.0,0.0,0.0,0.0,0.0,0.0
BI,0.0,0.0,0.0,0.0,0.0,0.0,0.621,0.0,0.0,0.0,...,0.0,0.0,0.036,0.0,0.0,0.0,0.0,0.0,0.0,0.079


### Grouping religions under broader categories
We grouped the religions into broader categories by summing their percentages. We had to drop some columns that didn't provide any information about the religious similarity between two countries (e.g. unspecified, none or other, other).

In [52]:
# Here are the broader categories that we came up with
christianity = ['Calvinist', 'Church of Norway', 'Congregational Christian Church', 'Ekalesia Niue', 'Evangelical', 
               'Evangelical Lutheran', 'Evangelical Lutheran Church of Iceland', 'Evangelical or Protestant', 'Lutheran',
               'Protestant', 'Protestant and other', 'Seventh-Day Adventist', 'non-Catholic Christians', 'Armenian Apostolic', 
               'Assembly of God', 'Christian', 'Kimbanguist', 'Mormon', 'Zionist Christian', 'nondenominational', 
               'Awakening Churches/Christian Revival', 'Catholic', 'Roman Catholic', 'nominally Roman Catholic', 
                'Eastern Orthodox', 'Ethiopian Orthodox', 'Greek Orthodox', 'Macedonian Orthodox', 'Orthodox', 'Orthodox Christian',
                'Russian Orthodox', 'Serbian Orthodox']
buddhism = ['Buddhism', 'Buddhist', 'Lamaistic Buddhist']
hindu = ['Hindu', 'Indian- and Nepalese-influenced Hinduism']
jewish = ['Jewish', 'Zionist', ]
muslim = ['Muslim', 'Sunni Muslim']
oriental = ['Shintoism', 'Taoist', 'mixture of Buddhist and Taoist']
other = ['Vodoun', 'eclectic mixture of local religions', 'folk religion', 'indigenous beliefs']
animist = ['animist', 'animist or no religion']
atheist = ['atheist or agnostic', 'no religion', 'non-believer/agnostic', 'non-believers']
unaffiliated = ['unaffiliated', 'unaffiliated or other']

# Those religions will be dropped
dropped_cols = ['Kempsville Presbyterian Church', 'none', 'none or other', 'other',
                'other Christian', 'other and unspecified', 'other or none', 'other or unspecified', 'unspecified',
                'atheist and agnostic']

# Arrays needed to run next cell
final_categories = [christianity, buddhism, hindu, jewish, muslim, oriental, other, animist,
                   atheist, unaffiliated]
final_categories_names = ['christianity', 'buddhism', 'hindu', 'jewish', 'muslim', 
                           'oriental', 'other', 'animist', 'atheist', 'unaffiliated']

The function that unifies the small religions to broader religion groups, and the resulting dataframe are the following:

In [53]:
def generate_categories_df(dataframe):
    df = pd.DataFrame()
    for category, category_name in zip(final_categories, final_categories_names):
        df[category_name] = dataframe[category].sum(axis=1)
    return df

categorized_df = generate_categories_df(rel_df)
categorized_df.head()

ISO2,christianity,buddhism,hindu,jewish,muslim,oriental,other,animist,atheist,unaffiliated
DZ,0.0,0.0,0.0,0.0,0.99,0.0,0.0,0.0,0.0,0.0
AO,0.792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BW,0.791,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BJ,0.39,0.0,0.0,0.0,0.277,0.0,0.116,0.0,0.0,0.0
BI,0.86,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0


Lastly, some countries in this dataframe have zero values for every religion. This is a serious issue because, when computing the distance between countries, these countries end up at distance zero from each other not because of religious proximity but because we lack the data. We managed to find the percentages of the main religions for most of the important countries. Here is an example of some manually inserted values:

In [54]:
# We found the percentages on Wikipedia by searching Google for: percentages of main religions [country]
# .loc replaces DataFrame.set_value, which was removed in pandas 1.0
data.loc['Greenland', ('religion', 'christianity')] = 0.96
data.loc['Madagascar', ('religion', 'atheist')] = 0.84
data.loc['North Korea', ('religion', 'atheist')] = 0.643
data.loc['Palestine', ('religion', 'muslim')] = 0.85
data.loc['Palestine', ('religion', 'jewish')] = 0.12
data.loc['Sudan', ('religion', 'muslim')] = 0.97

To compute the distances, the Euclidean distance was applied to the category percentages of each pair of countries. The following call returns the matrix of distances between each pair of countries.

In [55]:
rel_distances = spatial.distance.squareform(spatial.distance.pdist(categorized_df,'euclidean'))

In [61]:
# Converting it to a dataframe
rel_distance_df = pd.DataFrame(rel_distances, index=rel_df.index.tolist(), columns=rel_df.index.tolist())
rel_distance_df.head(10)

ISO2,DZ,AO,BW,BJ,BI,TD,CG,CD,CM,KM,...,UY,VE,AF,BD,BT,LK,IN,IO,MV,NP
DZ,0.0,1.267819,1.267194,0.820929,1.292604,0.534932,1.231132,1.196704,1.015729,0.022361,...,1.393018,0.007,0.140716,1.263309,1.144487,1.164662,0.99,0.99,1.250672,0.026
AO,1.267819,0.0,0.001,0.501786,0.07245,0.735916,0.042154,0.100319,0.260465,1.247551,...,0.188,1.273292,1.196305,1.11495,1.02589,1.117286,0.792,0.792,1.12973,1.247622
BW,1.267194,0.001,0.0,0.500985,0.073389,0.73531,0.041231,0.100404,0.25991,1.246933,...,0.189,1.27267,1.195643,1.11424,1.025178,1.116598,0.791,0.791,1.129041,1.246987
BJ,0.820929,0.501786,0.500985,0.0,0.545766,0.333528,0.461894,0.461395,0.295406,0.802848,...,0.662031,0.827016,0.743338,0.926356,0.814111,0.8962,0.492225,0.492225,0.937118,0.798452
BI,1.292604,0.07245,0.073389,0.545766,0.0,0.760445,0.107378,0.096047,0.286986,1.271859,...,0.122577,1.297838,1.224564,1.164506,1.073436,1.162352,0.860363,0.860363,1.176922,1.273311
TD,0.534932,0.735916,0.73531,0.333528,0.760445,0.0,0.699909,0.66481,0.481126,0.514482,...,0.862909,0.540264,0.474937,1.038375,0.910315,0.968554,0.679979,0.679979,1.035612,0.515476
CG,1.231132,0.042154,0.041231,0.461894,0.107378,0.699909,0.0,0.096255,0.227203,1.211026,...,0.227563,1.236677,1.158721,1.087711,0.997048,1.088843,0.75317,0.75317,1.102712,1.210666
CD,1.196704,0.100319,0.100404,0.461395,0.096047,0.66481,0.096255,0.0,0.196026,1.175925,...,0.205913,1.201919,1.12946,1.1251,1.027039,1.114584,0.806226,0.806226,1.135782,1.177496
CM,1.015729,0.260465,0.25991,0.295406,0.286986,0.481126,0.227203,0.196026,0.0,0.995342,...,0.397122,1.021121,0.947032,1.039844,0.931534,1.016762,0.682221,0.682221,1.048866,0.995876
KM,0.022361,1.247551,1.246933,0.802848,1.271859,0.514482,1.211026,1.175925,0.995342,0.0,...,1.37186,0.026249,0.135355,1.255647,1.135804,1.157176,0.980204,0.980204,1.243061,0.025612


____

## 2.5. Population, Area, GDP, Internet users, population in poverty, unemployment rate

This data is also straightforward to extract from Dataset 2: it can be read directly from the data.pickle file. The result is the following:

In [21]:
other_features = pd.read_pickle('./DataEnriching/data.pickle')
cols = [('POP',''),('area',''),('gdp',''),('gdp_capita',''),('Internet users',''),('pop_pov',''),('unemployment','')]
other_features = other_features[cols]
other_features.head(5)

name,POP,area,gdp,gdp_capita,Internet users,pop_pov,unemployment
Aruba,113648.0,180.0,2516000000.0,25300.0,99.0,,0.069
Afghanistan,33332025.0,652230.0,18400000000.0,2000.0,2690000.0,0.358,0.35
Angola,20172332.0,1246700.0,91940000000.0,6800.0,2434000.0,0.405,
Anguilla,16752.0,91.0,175400000.0,12200.0,12.0,0.23,0.08
Åland Islands,,1580.0,,,,,


____

## 2.6. Constructing the Graph

____
____
____

# 3. Improving the Mappings

One main issue with the mapping up until now was that duplicate entries were not handled correctly. At the time of Milestone 2 we would just go through the complete mapping sequentially and stop at the first match. As there were duplicates, there was no guarantee that the city found was the most plausible one. That is why we corrected the mapping to take into account the population of the different cities, which is also provided in the GeoNames database (http://download.geonames.org/export/dump/). When there are multiple cities with the same name in different countries, we now keep only the city with the largest population, as it is the most probable.
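
A minimal sketch of this rule with pandas is shown below. The column names and the population figures are hypothetical round numbers chosen for illustration, not values read from the GeoNames dump.

```python
import pandas as pd

# Toy city table (hypothetical column names and made-up populations)
cities = pd.DataFrame({
    "name": ["toronto", "toronto", "toronto", "paris", "paris"],
    "ISO2": ["CA", "US", "AU", "FR", "US"],
    "population": [2600000, 5000, 6000, 2100000, 25000],
})

# For each city name keep only the most populated entry, then map name -> ISO2
city_mapper = (
    cities.sort_values("population", ascending=False)
    .drop_duplicates("name")
    .set_index("name")["ISO2"]
    .to_dict()
)
print(city_mapper)
```

Sorting before `drop_duplicates` matters here: by default the first occurrence of each name is kept, which after the descending sort is the most populated city.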

____
____
____

# 4. Predictions Using the Awareness Model and Graph Diffusion