# Milestone 3 - Report
------
In this notebook we are going to summarize what was done in Milestone 2 and explain all that was done for the final Milestone. The focus at this point of the project was placed on constructing the awareness model using our different metrics and the final datastory. 

In [None]:
import os
import json
import time
import folium
import branca
import jenkspy
import pickle
import unicodedata
import collections
import numpy as np
import pandas as pd
from scipy import spatial
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline

## 0. Abstract

*What's the motivation behind your project? A 150 word description of the project idea, goals, dataset used. What story you would like to tell and why?*

Major events happen on a regular basis all around the world, some involving a high number of casualties, but the resulting reaction on the international scale is often far from proportional. Most of the time the largest reaction comes from the place where the incident occurred or from places which are close by. The objective would be to create an awareness map and determine why people react to an event. From that we would attempt to define an awareness metric. We want to see how factors other than physical proximity come into play, such as country, culture, language and religion. With this we could determine which country has the highest level of international awareness. The project would require the Twitter API to acquire hashtag-specific tweets with geolocation and therefore measure the awareness and reactions of different communities to a given event. GDELT would be used to recover standardised information regarding the different events.
____
____
____


# 1. Recapitulation of Milestone 2

To recap what was done in the previous milestone, here we present a brief summary of the previous notebook which can be found here **(TODO: add the link)**. 

____

## 1.1 Tweet Acquisition
A lot of time was spent acquiring the Tweets for two main reasons : 
- The Twitter dump did not contain the geolocations of the tweets or the users and only contained 10% of all tweets for a given period. Given that users do not necessarily give their location, we wanted to avoid cases where we would not have enough located tweets overall. 
- The Twitter API does not permit the recovery of tweets which are over 1 week old when searching for tweets with a given hashtag around a specific date. 

That is why a significant amount of time was put into recovering the tweets through web scraping using PhantomJS, with random scrolling times and random breaks to avoid being blocked by Twitter.

The tweets were acquired based on their hashtags, all the while making sure to have a wide range of hashtags in different languages. Here is the link to the file where we put all the hashtags for each event: https://github.com/LailaHms/ADA2017_Homeworks/blob/master/Project/TweetAcquisition/Hashtags%20Per%20Event.docx

____

## 1.2. Tweet Geolocalisation

Once the tweets were recovered we only kept those for which we had the geolocation of either the tweet itself or the user. As the locations are entered manually, they are not standardized and the corresponding location needed to be determined. 

The first step was therefore to format the strings, for example by removing all special characters and setting everything to lower case. Afterwards, we created multiple mappings which were used in the order presented, as this order reflects the confidence in each method: 
- A country mapper which uses multiple variations of the country name (different spellings and languages) to determine the country it corresponds to with the given ISO2 code
- A capital mapper which maps the capital of a country to the ISO2 code
- A first city mapper where duplicates were removed, as there are multiple cities in the world with the same name (for example, Toronto exists in Canada, the US and Australia). 
- A final city mapper with all the cities of all the countries in the world using the database provided here: http://download.geonames.org/export/dump/
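
The lookup cascade described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `normalize` and `locate`, the four tiny dictionaries and the word-by-word matching are hypothetical, and the real mappers are far larger and may match multi-word names.

```python
import unicodedata

def normalize(text):
    """Lower-case, strip accents and replace non-alphanumeric characters by spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())

# Hypothetical, tiny mappers (location word -> ISO2), listed by decreasing confidence
country_mapper = {"france": "FR", "frankreich": "FR", "canada": "CA"}
capital_mapper = {"paris": "FR", "ottawa": "CA"}
unique_city_mapper = {"lyon": "FR"}
all_city_mapper = {"toronto": "CA"}
MAPPERS = [country_mapper, capital_mapper, unique_city_mapper, all_city_mapper]

def locate(raw_location):
    """Return the ISO2 code given by the first mapper that matches, else None."""
    tokens = normalize(raw_location).split()
    for mapper in MAPPERS:          # most trusted mapper first
        for token in tokens:
            if token in mapper:
                return mapper[token]
    return None
```

With these toy dictionaries, `locate("PARIS (Frankreich)")` is resolved by the country mapper before the capital mapper is even consulted, which is the point of ordering the mappers by confidence.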


## 1.3. Visualization of the Reaction For An Event

For one of the selected events, we then displayed the number of related tweets on a choropleth map. This helped us spot certain issues in the mappings (a high number of tweets in Antarctica, for example). 

Here is the number of reactions for each event, along with other interesting data.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "event_reactions.png"), width=600)

## 1.4. Enriching the Data 

To create our awareness model we wanted to recover information which could be used to determine a sort of distance between the different countries, that is why we recovered for all countries: 
- area
- ISO2
- ISO3
- languages
- border countries
- latitude and longitude
- language_codes
- Internet users
- population
- GDP
- GDP per capita
- religions
- government types
- population in poverty
- unemployment rate

As you can see, we have more features than we originally presented in Milestone 2. That is because we realized that one of the datasets we used had a lot more (and more complete) information of interest to us. In the end, the 4 datasets that we were originally using were reduced to just two:

- Dataset 1: https://github.com/mledoze/countries
- Dataset 2: https://github.com/opendatajson/factbook.json

For the detailed data extraction process, see "Data enriching.ipynb".

____
____
____

# 2. Improving the Mappings

One main issue with the mapping up until now is that there were redundancies which were not correctly handled. At the time of Milestone 2 we would just go through the complete mapping sequentially and stop at the first match. As there were duplicates there was no guarantee that the city found was the most coherent. That is why we corrected the mapping to take into account the population of the different cities which was also provided in the GeoNames database http://download.geonames.org/export/dump/. Therefore when there are multiple cities with the same name in different countries, we only keep the city with the largest population as it would be the most probable. 

We also replaced our intermediate city mapping with a mapping containing the top 20'000 cities in the world (and therefore the most probable). With approximately 200 countries this corresponds to around 100 cities per country. 
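
A minimal sketch of this de-duplication, assuming rows of the form (city name, ISO2, population). The function name `build_city_mapper` is hypothetical and the population figures are illustrative placeholders, not real GeoNames values.

```python
# Illustrative rows only: (normalized city name, ISO2, placeholder population)
cities = [
    ("toronto", "CA", 2600000),
    ("toronto", "US", 5000),
    ("toronto", "AU", 6000),
    ("paris", "FR", 2100000),
    ("paris", "US", 25000),
    ("lyon", "FR", 500000),
]

def build_city_mapper(rows, top_n=None):
    """Keep, for each city name, the country of its most populated homonym.

    If top_n is given, only the top_n most populated cities are kept.
    """
    best = {}
    for name, iso2, population in rows:
        if name not in best or population > best[name][1]:
            best[name] = (iso2, population)
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    return {name: iso2 for name, (iso2, _) in ranked}

city_mapper = build_city_mapper(cities)
top_city_mapper = build_city_mapper(cities, top_n=2)
```

The same function covers both steps: without `top_n` it resolves homonyms by population, and with `top_n` it additionally restricts the mapping to the largest cities.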

# 3. Metrics For the Awareness Model

To create our model we originally planned to construct a graph linking all countries where the weights of the segments would represent the proximity between countries in terms of awareness to major events. 

For that we used the dataset containing all the information regarding the countries and looked to give numeric values to all the different categories mentioned previously and a few new ones: 

1. Languages
2. Governments 
3. Distances between countries :
    - Birds eye view distances 
    - Hop matrix : the number of countries which separate two given countries. This method has the problem of not accounting for the actual extent of the countries in between. For example, the distance between Finland and China is small because from Finland we only have to cross Russia to get to China, while in reality they are very far apart.
  
    - Normalized Adj matrix : a matrix which gives the neighboring countries with a normalizing factor to account for multiple countries on the border
    - Flight routes between countries
4. Religion
5. Population, Area, GDP, Internet Users etc...


What needed to be taken into account here is that many of our variables are categorical. Therefore we needed to find a way to assign numeric values to them in order to construct the final graph.

Remark : most of the information was extracted from the CIA factbook as it is a complete and reliable source (https://github.com/opendatajson/factbook.json).

____

## 3.1. Languages

Languages are representative of culture and most of the time, countries which speak the same language have a given proximity. That is why this metric was used to create the graph. 

Two main possibilities presented themselves :
- binary : if two countries have the same official language then they are linked
- more informed : some languages are similar in the sense of understanding the other language, learning it, etc. That is why we wanted to take linguistic distances into account. Unfortunately this is a current research topic and there is no database which gives the linguistic distances between all the official languages of all countries. Certain databases exist linking English to other languages, but this is not sufficient for our application. That is why we decided to use another metric, albeit a more abstract one, namely the phylogenetic trees linking languages together. 


Using the phylogenetic trees taken here http://glottolog.org/glottolog/family, we were able to construct a distance metric and compute the distances between all languages as can be seen below. The notebook handling this can be found here: https://github.com/LailaHms/ADA2017_Homeworks/blob/master/Project/LinkingLanguages/Linking%20Languages.ipynb
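
The exact tree metric is defined in the linked notebook and is not reproduced here. As a hedged illustration, one common way to turn a classification tree into a distance is to count the edges on the path between two languages. The `lineage` paths below are hypothetical and heavily simplified, not Glottolog's actual tree, and `tree_distance` is an assumed name.

```python
import math

# Hypothetical, simplified classification paths (family root first, parent of the language last)
lineage = {
    "Spanish": ["Indo-European", "Italic", "Romance", "Western Romance"],
    "Italian": ["Indo-European", "Italic", "Romance", "Italo-Dalmatian"],
    "English": ["Indo-European", "Germanic", "West Germanic", "Anglic"],
    "Finnish": ["Uralic", "Finnic"],
}

def tree_distance(lang_a, lang_b):
    """Number of edges between two languages in the tree, inf across families."""
    if lang_a == lang_b:
        return 0
    path_a, path_b = lineage[lang_a], lineage[lang_b]
    shared = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared += 1
    if shared == 0:
        return math.inf  # no common ancestor: different families
    # Edges from each language (a leaf) up to the deepest shared ancestor
    return (len(path_a) - shared + 1) + (len(path_b) - shared + 1)
```

A metric of this kind only takes integer values and depends on how finely each branch of the tree is subdivided, which is worth keeping in mind when reading the matrices.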

In [None]:
# Matrix of distances between languages
dist_languages = pd.read_pickle(os.path.join(os.getcwd(), 'LinkingLanguages', 'dist_languages.pkl'))
dist_languages.head()

In [None]:
# Heatmap of language distances between languages
plt.figure(figsize=[10,8])
sns.heatmap(dist_languages.iloc[:25,:25],cmap="Reds", vmin=0., vmax=25.);

From this language distance matrix we were able to compute a standardized distance between all countries. 

The idea is the following :
- if the two countries have an official language in common, set the distance to 0
- otherwise, get the distances between all pairs of their official languages and compute the average of the non-infinite distances
- if none of these distances is finite, set the distance to inf
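
These three rules can be sketched in a few lines. The function name `country_language_distance`, the `TOY` table and its values are assumptions made for illustration. The real computation uses the `dist_languages` matrix.

```python
import math

def country_language_distance(langs_a, langs_b, lang_dist):
    """Distance between two countries given their lists of official languages."""
    if set(langs_a) & set(langs_b):
        return 0.0                                   # rule 1: shared official language
    distances = [lang_dist(a, b) for a in langs_a for b in langs_b]
    finite = [d for d in distances if math.isfinite(d)]
    if finite:
        return sum(finite) / len(finite)             # rule 2: mean of finite distances
    return math.inf                                  # rule 3: nothing comparable

# Made-up language distances, for illustration only
TOY = {
    frozenset({"French", "German"}): 8.0,
    frozenset({"Italian", "German"}): 10.0,
    frozenset({"Finnish", "German"}): math.inf,
}

def toy_lang_dist(a, b):
    return 0.0 if a == b else TOY.get(frozenset({a, b}), math.inf)
```

Note that infinite pairwise distances are simply ignored as long as at least one finite distance exists, so a single related language pair is enough to give two countries a finite distance.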


In [None]:
# Matrix of distances between countries
country_dist_languages = pd.read_pickle(os.path.join(os.getcwd(), 'LinkingLanguages', 'country_dist_languages.pkl'))
country_dist_languages.head()

In [None]:
# Heatmap of language distances between countries
plt.figure(figsize=[11,9])
sns.heatmap(country_dist_languages.iloc[:25,:25],cmap="Blues", vmin=0., vmax=25.);

The issue with this type of distance is that it takes discrete values which do not account for the actual distance between languages in terms of mutual comprehension, ease of learning the other language, etc. For example, in the cell below we show two pairs of languages, one of which (Italian-Spanish) we know to be significantly closer than the other (English-German), whose members are in fact very different. With this distance metric the two pairs end up at the same distance, which should not be the case.

In [None]:
print('Distance between English-German:', dist_languages.loc['English', 'German'])
print('Distance between Italian-Spanish:', dist_languages.loc['Italian', 'Spanish'])

Results more representative of reality could have been obtained had we had not only the phylogenetic links between the languages but also the moment at which each new language split from its mother language. Incorporating this time component would have given a more precise approximation, assuming that languages evolve at similar rates and that, once a new language has emerged, there is minimal contact with or influence from the mother or any sister languages. 

Still we managed to have an algorithm that worked quite well as we can see in the next map where we show the language distances relative to Spain.

### Example 1

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "Language distance example 1.jpg"), width=1000)

We can see that in Europe our algorithm displayed correctly the most similar languages relative to Spain (Italian, Portuguese and French) and it also correctly matched all South and Central America as having close similarity.

To play with the interactivity of the map you can visit http://tweet-awareness.eu/ in the section Modelling the reactions, by selecting Language Distance at the top of the map and clicking any country.

### Example 2

We noticed a problem with the language distance to Switzerland. The map shows a distance of 16 between Switzerland and Germany, which is a lot considering that both speak German. 

Because Switzerland has 4 official languages and Swiss German and German are not identified as the same language, the distance between Switzerland and Germany is computed as the average of the distances between Switzerland's 4 official languages and German. Ideally we would have weighted the computation by the proportion of the population speaking each of the official languages. Unfortunately we did not have this data at hand.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "Language distance example 2.jpg"), width=1000)

____

## 3.2. Governments

The type of government is also a good indicator of how people perceive worldwide events. People of different countries with the same government type will be influenced in a similar way. 

After getting the list of all the government types (see "Data enriching.ipynb" for more details on how we got them) we have to convert them to a numeric scale with a meaning.

First, we start by dividing all the government types in three groups and assigning them numerical values -1, 0 and 1. These are the resulting groups:
- Group 1:
    - 'parliamentary democracy'
    - 'parliamentary republic'
    - 'federal republic'
    - 'federation of monarchies'
    - 'semi-presidential republic'
    - 'semi-presidential federation'
- Group 2: 
    - 'non-self-governing overseas territory'
    - 'in transition'
    - 'unknown'
- Group 3: 
    - 'presidential republic'
    - 'presidential democracy'
    - 'monarchy'
    - 'theocratic republic'
    - 'communist state'
    - 'absolute monarchy'

As the criterion to group the government types, we used the amount of power held by the leaders of the government. Countries where the leader has a lot of power go in group 3 and are assigned the value 1; the others go in group 1 and are assigned the value -1. The government types that are not well defined are placed at the center of the scale with the value 0.

Here is the actual implementation (full code in "Data enriching.ipynb"):

In [None]:
# Reading the dataframe
data = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'data.pickle'))

# Defining the three groups
gov_type_array = [['parliamentary democracy', 'parliamentary republic', 'federal republic', 'federation of monarchies',
                   'semi-presidential republic', 'semi-presidential federation'],
                  ['non-self-governing overseas territory', 'in transition', 'unknown'],
                  ['presidential republic', 'presidential democracy', 'monarchy', 'theocratic republic', 'communist state',
                   'absolute monarchy']]

# Creating the mapping dictionary
mapping_gov_type=dict()
for i, gov_group in enumerate(gov_type_array):
    for gov in gov_group:
        if i==0:
            mapping_gov_type.update({gov: -1})
        elif i==1:
            mapping_gov_type.update({gov: 0})
        else:
            mapping_gov_type.update({gov: 1})

# Mapping gov_type values to their numerical value (1, 0, -1)
data['gov_type_num'] = data.gov_type.map(mapping_gov_type)
data[['gov_type', 'gov_type_num']].head()

In order to compute the distance metric between government types, the numeric government type will be used. The distance is defined as the absolute value of the difference between the values of each pair of countries. That way, we end up with a symmetric metric with possible values of 0, 1 and 2.

In [None]:
# Reading the dataframe and setting ISO2 as index (as it will be the useful country code)
data = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'data.pickle'))
data.reset_index(inplace=True)
data.set_index('ISO2', inplace=True)

In [None]:
# Selecting numeric government type column
gov_type_df = data[['gov_type_num']]

# Creating the matrix with the distances: |value_i - value_j| for every pair of countries
# (vectorized with NumPy broadcasting; DataFrame.append was removed in pandas 2.0)
values = gov_type_df['gov_type_num'].values
gov_distance_df = pd.DataFrame(np.abs(values[:, None] - values[None, :]),
                               index=gov_type_df.index, columns=gov_type_df.index)
gov_distance_df.head()

**TODO: comment on why we did not use this metric in the end**

____

## 3.3. Distances 

Distance tends to play an important role in determining people's interest regarding a specific event. But there are several types of distances which can be considered when speaking of countries. 

The first and most evident is the **birds-eye-view distance** between the countries. This is computed from the positions of the centers of the countries. But this is not enough. Take for example the US and Canada: the distance between the centers of these two countries is much larger than the distance between France and any of its neighboring countries. 

So what else could we consider to obtain a more faithful representation of the distance between countries? We came up with three other distance metrics which, combined, are more complete.


1. **Relative Importance of Neighbors** : To take into account the fact that two countries can be neighbors and still have a large distance between them, we decided to create a metric which gives importance to countries which are direct neighbors. We also wanted to give each neighbor the importance it is due. For example, France has multiple countries at its border, but that does not mean that each of these countries is of similar importance. That is why we weighted this metric by the size of the countries. A small neighboring country such as Luxembourg therefore has a smaller weight than Germany, for example.  

2. **Hop Distance** : This second metric accounts not only for direct neighbors but also for the smallest number of countries which would need to be traversed to connect any two countries in the world. 

3. **Flight Routes** : This last metric accounts for movement of populations between the different countries. The assumption is that the existence of flight routes is due to the fact that people exhibit a certain interest for the other country. The more often you visit a place, the more likely you are to be interested in what is going on there. 


These metrics are all constructed in the country distances notebook in the GeoMetrics folder (https://github.com/LailaHms/ADA2017_Homeworks/blob/master/Project/GeoMetrics/country_distances.ipynb). Refer to this notebook for more detail on how the metrics were constructed. 

**Bird's Eye View Distance / Real Distance**

Constructing this mapping was relatively straightforward. Using the latitude and longitude of each country, the distance between all pairs of countries is computed, while being careful with the extreme values of longitude. It is important to consider the shortest path as the earth is round. 
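
The notebook works with distances in degrees and converts them to kilometers with a constant factor. As a point of comparison only, and not as the method used here, the haversine formula gives the great-circle distance directly and handles the longitude wrap-around at ±180° without special cases. The function name and the mean Earth radius of 6371 km are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Two points on the equator on either side of the antimeridian are only 2 degrees apart
wrap_km = great_circle_km(0.0, 179.0, 0.0, -179.0)
```

A naive difference of longitudes would give 358° for the last example, which is exactly the kind of extreme value the paragraph above warns about.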

In [None]:
# Reading the dataframe and dropping the level 0 to make the plot nicer
real_distance = pd.read_pickle(os.path.join(os.getcwd(), 'GeoMetrics', 'real_distance.pickle'))
real_distance.columns = real_distance.columns.droplevel(0)

# Factor to change from lat-lon distance to kilometric distance
real_distance = real_distance.multiply(105.)
real_distance.head()

In [None]:
# Heatmap of real distance between countries
plt.figure(figsize=[11,9])
sns.heatmap(real_distance.iloc[:25,:25],cmap="Blues");

Here is an example of the visualization of the real distance metric for the case of France where we can see that the algorithm worked really well.

To play with the interactivity of the map you can visit http://tweet-awareness.eu/ in the section Modelling the reactions, by selecting Real Distance at the top of the map and clicking any country.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "Real distance example.jpg"), width=1000)


**Hop Distance**


For this metric as well as for the following one, we established an adjacency matrix of the countries connected by land borders. It is important to note that islands and certain continents are isolated, which means that the resulting graph is not fully connected. 

This will also be a useful tool to create a neighborhood influence matrix.
We can visualize the graph using networkx. 

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "Hop Distance Graph.jpg"), width=1000)

With the obtained graph in networkx we compute the shortest path between all pairs of nodes to obtain what we call the hop matrix between all countries. Countries which are not connected have a hop distance of infinity between them in the final dataframe.
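
The notebook relies on networkx for this step. To keep the illustration dependency-free, the same idea is sketched below with a plain breadth-first search. The `borders` dictionary is a deliberately incomplete toy subset of land borders and `hop_distances` is a hypothetical name.

```python
import math
from collections import deque

# Deliberately incomplete toy subset of land borders (ISO2 codes)
borders = {
    "ES": ["PT", "FR"],
    "PT": ["ES"],
    "FR": ["ES", "DE"],
    "DE": ["FR", "PL"],
    "PL": ["DE"],
    "CU": [],  # island: no land border
}

def hop_distances(adjacency):
    """All-pairs hop counts via breadth-first search; unreachable pairs stay at inf."""
    result = {}
    for source in adjacency:
        dist = {country: math.inf for country in adjacency}
        dist[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if dist[neighbor] == math.inf:   # not visited yet
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        result[source] = dist
    return result

hops = hop_distances(borders)
```

In this toy graph Spain reaches Poland in three hops, and Cuba stays at an infinite distance from every other country.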

In the next cell we plot the heatmap of the hop distance. We can see that the countries that are not connected (islands) have an infinite distance and are represented with the darkest color of the scale.

In [None]:
# Reading the dataframe and dropping the level 0 to make the plot nicer
hop_distance = pd.read_pickle(os.path.join(os.getcwd(), 'GeoMetrics', 'hop_distance.pickle'))
hop_distance.columns = hop_distance.columns.droplevel(0)
hop_distance.head()

In [None]:
# Heatmap of hop distance between countries
plt.figure(figsize=[11,9])
sns.heatmap(hop_distance.iloc[:25,:25],cmap="Blues", vmin=0., vmax=15.);

In [None]:
# Examples of islands with infinite distance
print(hop_distance.loc['MV'].head()) # Maldives
print(hop_distance.loc['CU'].head()) # Cuba

In the next cell we visualize the hop distances related to Spain. We can see an example to verify that it works: to go from Spain to Poland we just need to jump three times (France-Germany-Poland).

To play with the interactivity of the map you can visit http://tweet-awareness.eu/ in the section Modelling the reactions, by selecting Hop Distance at the top of the map and clicking any country.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "hop distance example.jpg"), width=1000)

**Relative Importance of Neighbors**

The adjacency matrix established previously is then used to create a weighted adjacency matrix where each edge is weighted by the size of the neighboring country over the sum of the sizes of all the neighbors. The graph is therefore no longer undirected. 

The following visualization needs an explanation in order to be interpreted correctly. When you click on a country (C1 from now on), the countries that share a border with C1 are highlighted with a percentage of influence. What do these percentages mean? They should be interpreted as the relative level of reaction expected in each neighbor if an event happened in the country that we clicked on, C1.

In the following cell we visualize an easy example. We click on Spain and we are interested in the reaction levels of the surrounding countries. As explained in more detail in country_distances.ipynb, this metric was computed by comparing the areas of the neighboring countries. For Portugal, the only bordering country is Spain, so when we click on Spain, Portugal displays 100%: the area of Spain divided by the total area of all the countries bordering Portugal (in this case just Spain) gives that value. For France, whose neighbors have a larger total area, the relative area of Spain is smaller and the percentage of importance is therefore lower (41%); for Morocco it is lower still (16%).

In conclusion, when we click on a country C1, each percentage should be read as the importance that the corresponding neighbor gives to an event happening in C1.
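
The weighting can be sketched as follows. The countries `A`, `B`, `C`, their areas and the function name `neighbor_importance` are hypothetical. The actual construction is in country_distances.ipynb.

```python
# Hypothetical countries: symmetric land borders and made-up areas
borders = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
area = {"A": 300.0, "B": 100.0, "C": 300.0}

def neighbor_importance(borders, area):
    """weights[x][n]: importance that country x gives to its neighbor n.

    Each neighbor is weighted by its area divided by the total area of x's neighbors.
    """
    weights = {}
    for country, neighbors in borders.items():
        total = sum(area[n] for n in neighbors)
        weights[country] = {n: area[n] / total for n in neighbors} if total else {}
    return weights

weights = neighbor_importance(borders, area)
```

Here `B` plays the role of Portugal: its only neighbor `A` gets a weight of 1.0, whereas `A` gives `B` only 0.25, which illustrates why the resulting graph is directed.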

To play with the interactivity of the map you can visit http://tweet-awareness.eu/ in the section Modelling the reactions, by selecting Relative Importance for Neighbors at the top of the map and clicking any country.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "neighbor influence example.jpg"), width=1000)


**Flight Routes**

We used a dataset of 59036 routes between 3209 airports on 531 airlines around the world, as of January 2012. In addition, a second dataset allowed us to associate every airport with its country. Both were taken from https://openflights.org/data.html

Once that was done, we associated all the routes with their departure and arrival countries. That way we were able to determine, for each country, the proportion of flights to every other country. 
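
A minimal sketch of this aggregation, assuming the routes have already been converted to (departure country, arrival country) pairs. The route list is made up and `outbound_proportions` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Made-up routes already mapped to (departure country, arrival country)
routes = [
    ("UZ", "RU"), ("UZ", "RU"), ("UZ", "RU"), ("UZ", "TR"),
    ("RU", "UZ"), ("RU", "DE"),
]

def outbound_proportions(country_routes):
    """For each departure country, the share of its routes going to each destination."""
    counts = defaultdict(Counter)
    for origin, destination in country_routes:
        counts[origin][destination] += 1
    return {
        origin: {dest: n / sum(dests.values()) for dest, n in dests.items()}
        for origin, dests in counts.items()
    }

proportions = outbound_proportions(routes)
```

Because each row is normalized by the departing country's own total, the share of `UZ` routes going to `RU` generally differs from the share of `RU` routes going to `UZ`.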

In [None]:
# Reading the dataframe and dropping the level 0 to make the plot nicer
flight_routes = pd.read_pickle(os.path.join(os.getcwd(), 'GeoMetrics', 'flight_routes.pickle'))
flight_routes.columns = flight_routes.columns.droplevel(0)
flight_routes.head()

In the same way as before the graph is a directed graph. Here we can visualize the nodes at the positions of the countries.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "flight routes graph.jpg"), width=1000)

As an example, here we visualize the flights going to Russia. This map has to be interpreted in the same way as the previous one.
As we clicked on Russia, the percentages shown on the other countries represent the share of each country's flights that go to Russia. We assumed that the more flights a country has to Russia, the more interest it has in Russia. In this case we can see that Uzbekistan is strongly interested in Russia, with a percentage of awareness of 68%.

To play with the interactivity of the map you can visit http://tweet-awareness.eu/ in the section Modelling the reactions, by selecting Percentage of outbound flights at the top of the map and clicking any country.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "flight routes example.jpg"), width=1000)

____

## 3.4. Religions 

Religion, like government, affects people's way of thinking. Religious bonds between countries will definitely have an impact on the awareness of events that happen in countries sharing the same religions.

Here we can see the religion data that we were originally using in Milestone 2:

In [None]:
rel_df_milestone2 = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'Pickles for Milestone 3', 'final_rel_df.pickle'))
rel_df_milestone2.head()

Unfortunately, there were too many countries for which the religion information was not complete. That is why we went back to the factbook which was more complete. 

### Methodology
From now on, we will explain how we took the data from the factbook dataset.

We could extract the religions from Dataset 2, but we had to do some cleaning to get the data into the desired format. First we needed to know which religions were in the dataset, so we created an array containing the unique names of the religions that appear across all countries. Once we knew all the religions, we created the religions dataframe, containing the percentage of every religion in every country. Finally, we grouped the small religions into 12 broader categories.

As said, we start by creating the unique religions array. Here is a sample of part of the array:

In [None]:
#unique_religions[:20]
['Adventist', 'Animist', 'Armenian Apostolic', 'Assembly of God', 'Awakening Churches/Christian Revival', 'Badimo',
 "Baha'i", 'Baptist', 'Bektashi', 'Buddhism', 'Buddhist', 'Bukot nan Jesus', 'Calvinist', 'Cao Dai', 'Catholic', 'Christian',
 'Christianity','Church of England', 'Church of Ireland', 'Church of Norway']

Now that we have all possible religions in one array, we will create the dataframe (rel_df), with all religions as columns, and the country as the index. The cell values will represent the percentage of the religion in the country.

Some of the countries had badly formatted data (e.g. percentages given as ranges like 10-30%, or no percentages at all). We manually added the rows of the countries that had issues. For the countries without percentages, we looked them up on Wikipedia and added them manually as well.

To create the rel_df dataframe, the rows were appended into an initial empty dataframe. The rows were created by looping through all unique_religions and checking if the country in particular had that religion. If it had it, we inserted the percentage. If it didn't have it, we inserted the value zero in that religion column.

Here is what the rel_df looks like, after dropping the smallest religions (columns) that didn't have more than 10% in any country:

In [None]:
rel_df = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'rel_df.pickle'))
rel_df.set_index('ISO2', inplace=True)
rel_df.head()

### Grouping religions under broader categories
We grouped the religions into broader categories by adding up the percentages. We had to drop some columns that didn't provide any information about the religious similarity between two countries (e.g. unspecified, none or other, other, etc.).

In [None]:
# Here are the broader categories that we came up with
protestants = ['Calvinist', 'Church of Norway', 'Congregational Christian Church', 'Ekalesia Niue', 'Evangelical', 
               'Evangelical Lutheran', 'Evangelical Lutheran Church of Iceland', 'Evangelical or Protestant', 'Lutheran',
               'Protestant', 'Protestant and other', 'Seventh-Day Adventist', 'non-Catholic Christians', 'Armenian Apostolic', 
               'Assembly of God', 'Christian', 'Kimbanguist', 'Mormon', 'Zionist Christian', 'nondenominational']
               
catholic = ['Awakening Churches/Christian Revival', 'Catholic', 'Roman Catholic', 'nominally Roman Catholic']
orthodox = ['Eastern Orthodox', 'Ethiopian Orthodox', 'Greek Orthodox', 'Macedonian Orthodox', 'Orthodox', 'Orthodox Christian',
                'Russian Orthodox', 'Serbian Orthodox']
buddhism = ['Buddhism', 'Buddhist', 'Lamaistic Buddhist']
hindu = ['Hindu', 'Indian- and Nepalese-influenced Hinduism']
jewish = ['Jewish', 'Zionist', ]
muslim = ['Muslim', 'Sunni Muslim']
oriental = ['Shintoism', 'Taoist', 'mixture of Buddhist and Taoist']
other = ['Vodoun', 'eclectic mixture of local religions', 'folk religion', 'indigenous beliefs']
animist = ['animist', 'animist or no religion']
atheist = ['atheist or agnostic', 'no religion', 'non-believer/agnostic', 'non-believers']
unaffiliated = ['unaffiliated', 'unaffiliated or other']

# Those religions will be dropped
dropped_cols = ['Kempsville Presbyterian Church', 'none', 'none or other', 'other',
                'other Christian', 'other and unspecified', 'other or none', 'other or unspecified', 'unspecified',
                'atheist and agnostic']

# Arrays needed to run next cell
final_categories = [protestants, catholic, orthodox, buddhism, hindu, jewish, muslim, oriental, other, animist,
                   atheist, unaffiliated]
final_categories_names = ['protestants', 'catholic', 'orthodox', 'buddhism', 'hindu', 'jewish', 'muslim', 
                           'oriental', 'other', 'animist', 'atheist', 'unaffiliated']

The function that unifies the small religions to broader religion groups, and the resulting dataframe are the following:

In [None]:
def generate_categories_df(dataframe):
    df = pd.DataFrame()
    for category, category_name in zip(final_categories, final_categories_names):
        df[category_name] = dataframe[category].sum(axis=1)
    return df

categorized_df = generate_categories_df(rel_df)
categorized_df.head()

To compute the distances, the Euclidean distance between the religion profiles of the countries was computed using the following function, which returns the matrix of distances between each pair of countries.

In [None]:
rel_distances = spatial.distance.squareform(spatial.distance.pdist(categorized_df,'euclidean'))

In [None]:
# Converting it to a dataframe
rel_distance_df = pd.DataFrame(rel_distances, index=rel_df.index.tolist(), columns=rel_df.index.tolist())
rel_distance_df.head()
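
As a small illustration of what `pdist` with the `'euclidean'` metric returns, the sketch below uses three hypothetical countries with made-up shares. Two countries with identical profiles end up at distance zero, while countries with different profiles are further apart.

```python
import numpy as np
import pandas as pd
from scipy import spatial

# Hypothetical religion shares (illustrative numbers only)
toy = pd.DataFrame(
    {'catholic': [0.6, 0.0, 0.6], 'muslim': [0.0, 0.8, 0.0]},
    index=['A', 'B', 'C'],
)

# pdist returns the condensed pairwise distances, squareform expands
# them into a symmetric matrix with zeros on the diagonal
toy_distances = spatial.distance.squareform(
    spatial.distance.pdist(toy.values, 'euclidean')
)
toy_distance_df = pd.DataFrame(toy_distances, index=toy.index, columns=toy.index)
print(toy_distance_df)
```

Here A and C share the same profile, so their distance is 0, whereas the distance between A and B is sqrt(0.6² + 0.8²) = 1.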

Here is an example of the religion distance map, obtained by clicking on France. Because the distance computed above is Euclidean, countries whose religious composition is similar to that of France have small distance values, and countries with a different composition have large values. In the map, similar countries are displayed in darker colors and different countries in lighter colors.

In [None]:
Image(filename=os.path.join(os.getcwd(), 'Visualization', "religion distance good example.jpg"), width=1000)

To compute the last map, we also had to deal with the lack of data for some countries. In the categorized_df dataframe, some countries have null values for every religion. This is a serious issue because, when computing the distance between countries, any two of these countries will have distance zero, not because of their religious proximity but because we lacked the data. We managed to find the percentages of the main religions for most of the important countries by searching Wikipedia. Here is an example of some manually inserted values:

In [None]:
# We found the percentages on Wikipedia by searching Google for: percentages of main religions [country]
# DataFrame.set_value was removed in pandas 1.0; .loc performs the same label-based assignment
data.loc['Greenland', ('religion', 'christianity')] = 0.96
data.loc['Madagascar', ('religion', 'atheist')] = 0.84
data.loc['North Korea', ('religion', 'atheist')] = 0.643
data.loc['Palestine', ('religion', 'muslim')] = 0.85
data.loc['Palestine', ('religion', 'jewish')] = 0.12
data.loc['Sudan', ('religion', 'muslim')] = 0.97
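
One possible way to list the countries that need manual values is to look for rows whose shares are all zero or missing. The sketch below runs on a toy dataframe with hypothetical countries; it is an assumption about how such a check could be written, not necessarily how the countries were identified in this project.

```python
import pandas as pd

# Hypothetical categorized shares (illustrative numbers only)
toy_categorized = pd.DataFrame(
    {'catholic': [0.6, 0.0, 0.0], 'muslim': [0.1, 0.0, 0.0]},
    index=['Country A', 'Country B', 'Country C'],
)

# A row whose shares are all zero (or missing) carries no information
no_data = toy_categorized.fillna(0).eq(0).all(axis=1)
missing_countries = no_data[no_data].index.tolist()
print(missing_countries)
```

Any two countries in this list would be at Euclidean distance zero from each other, which is exactly the artifact described above.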

____

## 3.5. Population, Area, GDP, Internet users, population in poverty, unemployment rate

Countries can be characterized by general attributes such as population, area, gross domestic product (GDP), share of the population in poverty, number of internet users and so forth. These metrics are interesting because they are directly linked to the number of tweets in a country and can be used as normalizing factors.
         
As announced previously, using the factbook dataset for Milestone 3 allowed us to extract more features than in Milestone 2. These features are straightforward to extract from Dataset 2 (see "Data enriching.ipynb"), and we can read them directly from the data.pickle file. The result is the following:

In [None]:
other_features = pd.read_pickle(os.path.join(os.getcwd(), 'DataEnriching', 'data.pickle'))
cols = [('POP',''),('area',''),('gdp',''),('gdp_capita',''),('Internet users',''),('pop_pov',''),('unemployment','')]
other_features = other_features[cols]
other_features.head()
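
As an illustration of how such a feature could serve as a normalizing factor, the sketch below divides hypothetical tweet counts by hypothetical numbers of internet users. All numbers and names are made up, and the toy dataframe uses a flat `'Internet users'` column rather than the two-level columns of `other_features`.

```python
import pandas as pd

# Hypothetical number of internet users per country
toy_features = pd.DataFrame(
    {'Internet users': [50_000_000, 5_000_000]},
    index=['Country A', 'Country B'],
)

# Hypothetical raw tweet counts for one event
toy_tweets = pd.Series({'Country A': 10_000, 'Country B': 2_000})

# Tweets per million internet users, aligned on the country index
tweets_per_million_users = toy_tweets * 1_000_000 / toy_features['Internet users']
print(tweets_per_million_users)
```

Country A produces more tweets in absolute terms, but Country B reacts more strongly once the counts are scaled by the number of internet users.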

____

## 3.6. Estimating the Number of Active Tweeters

As will be explained in the next portion of the notebook, it is important to have a normalizing factor that accounts for differences in Twitter activity between countries. Ideally we would have had the number of active tweeters per country, or failing that, simply the number of tweeters per country. Unfortunately, neither is available to the public, so we had to create our own estimate of these values to build the model.

We had at our disposal the number of tweets worldwide for 8 different events. We used this to compute the average number of tweets per country and used it as a baseline. We were careful to compute the average for a given country using only the data from events that did not take place in that country. For example, we did not use the number of tweets in France during Charlie Hebdo to compute the French average; we relied on the 7 other events. In the same fashion, we did not consider the tweets from France during the event in Belgium, as the two countries are sufficiently close to generate larger than normal reactions. The same exclusion was applied to Belgium for the event in France.
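
The exclusion rule described above can be sketched as follows. The tweet counts, the event labels other than Charlie Hebdo and the `excluded` mapping are hypothetical; the sketch only shows one way to mask the affected (event, country) pairs before averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical tweet counts: rows are events, columns are countries
toy_counts = pd.DataFrame(
    {
        'France': [9000, 100, 300],
        'Belgium': [500, 40, 4000],
        'Japan': [50, 70, 60],
    },
    index=['Charlie Hebdo', 'Event X', 'Belgium event'],
)

# Countries whose counts are ignored for each event (assumed mapping)
excluded = {
    'Charlie Hebdo': ['France', 'Belgium'],
    'Belgium event': ['Belgium', 'France'],
}

# Mask the excluded (event, country) pairs, then average per country
masked = toy_counts.astype(float)
for event, countries in excluded.items():
    masked.loc[event, countries] = np.nan

baseline = masked.mean(axis=0)    # NaN entries are skipped by default
naive = toy_counts.mean(axis=0)   # average without any exclusion
print(pd.DataFrame({'baseline': baseline, 'naive': naive}))
```

In this toy example the naive average for France is dominated by the Charlie Hebdo row, while the masked baseline only reflects the event that did not involve France or its neighbour.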

**TODO : load this data and print the average number of tweets in a week for a given country, show the difference for France with and without Charlie Hebdo to justify why**

____

# 4. Constructing the Awareness Model 

## 4.1. Correlations

## 4.2. Non Negative Matrix Factorization

Explain why it did not work, Bloop

####  Graph Construction and Graph Diffusion

## 4.3. Regression Model To Predict Reaction Levels

____
____
____

# 5. Critical Assessment and Conclusion