# Milestone 2 : Data Collection and Description
------
In this notebook we are going to review everything that was done so far in the project and evaluate what the remaining tasks are.

In [14]:
import numpy as np
import pandas as pd
import unicodedata
import time
import os
import collections
import pickle  # used below to load the saved tweet dataframes

## 0. Abstract

*What's the motivation behind your project? A 150 word description of the project idea, goals, dataset used. What story you would like to tell and why?*

Major events happen regularly all around the world, some involving a high number of casualties, but the resulting international reaction is often far from proportional. Most of the time the largest reaction comes from the place where the incident occurred or from places close by. The objective would be to create an awareness map and determine why people react to an event. From that we would attempt to define an awareness metric. We want to see how factors other than physical proximity, such as country, culture, language and religion, come into play. With this we could determine which country has the highest level of international awareness. The project would require the Twitter API to acquire hashtag-specific tweets with geolocation and thereby measure the awareness and reactions of different communities to a given event. GDELT would be used to recover standardised information about the different events.
____
____
____


## 1.  Identifying Relevant Tweets
-----

### 1.1 Reminder of the events that we chose

**Case 1**: Events of similar magnitude, civilian casualties, 6 months timeframe

- Nigeria 30/01/2016, Shooting 65 Deaths, 136 Injured
- Belgium 22/03/2016, Bombing in airport, 35 Deaths, 300+ Injured
- Pakistan 27/03/2016, Bombing, 70 Deaths, 300 Injured
- US 12/06/2016 Shooting in gay bar, 49 Deaths, 53 Injured
- Turkey 28/06/2016, Shooting + bombing in airport, 45 Deaths, 230 Injured

**Case 2**: Events of different magnitude

- France 07/01/2015, Charlie Hebdo, 12 Deaths, 11 Injured
- Nigeria 08/01/2015, Massacre Boko Haram, 200+ Deaths, unknown Injured
- Lebanon 10/01/2015, suicide bombing, 9 Deaths, 30+ Injured


### 1.2 Hashtags as Key Elements for Searching

On Twitter, hashtags are mainly used during events. In our case they are the perfect tool to evaluate awareness across the world. They are very convenient because a hashtag is often specific to one event and tends to be in English even when the rest of the tweet is in a different language. In order to find all the tweets related to an event, we needed to find as many related hashtags as possible, in as many languages as possible.

### 1.3 Selection of Hashtags 
For the selection of the hashtags we need to take into account these factors:
- Which hashtags do we select?
- How far do we have to go in time to make sure we get all the tweets to study the time evolution?
- Hashtags can be written in different languages


#### Which hashtags do we select?
To select the hashtags we used the website http://hashtagify.me/hashtag/smm which, given an initial hashtag, returns the most related hashtags based on the timeframe and the hashtag similarity. In addition, we ran an advanced search on Twitter to check manually whether the hashtags were related to the event and to look for other hashtags that may not appear on the website (some people use more than one related hashtag, which is why we also checked manually).

Here is an example of the hashtags that we selected for Charlie Hebdo:

- PrayForParis
- JeSuisCharlie
- NousSommesChalie
- CharlieHebdo
- LaFranceEstCharlie
- LeMondeestCharlie
- IAmCharlie
- ParisShooting
- FreedomOfSpeech
- somCharlie
- soyCharlie
- SomCharlieHebdo
- YoSoyCharlie
- YoTambienSoyCharlieHebdoç
- أنا_شارلي  
- IchBinCharlie
- EuSouCharlie
- JsemCharlie
- TodosSomosCharlieHebdo
- ЯШарлиЭбдо
- من‌شارلی‌هستم

#### How far do we have to go in time to make sure we get all the tweets to study the time evolution?
We decided to retrieve only the tweets posted from the day of the event until at most one week after. We consider this sufficient because, after one week, people normally stop commenting massively on this kind of event. Although we expect this assumption to hold, it might not be true for Charlie Hebdo, which is the most commented event; after analyzing all the events we can always rerun the tweet-gathering code to get more.

#### Hashtags can be written in different languages
To cover as many languages as possible, we first collected the main hashtags in English and then checked manually whether their translations were also used. We searched for the translations on the website above and also checked manually on Twitter for similar hashtags in languages other than English.

____
____

## 2.  Tweets Acquisition
We had originally planned to use the Twitter dataset that was given in the course. Unfortunately it contained only 10% of the tweets in a given time period and did not include any information on the location of the user or the user profile. Because of this we decided to collect the tweets about specific events ourselves.

------
### 2.1 Twitter API 
Our initial idea was to get the information we needed with the Twitter API, but there again we encountered several problems:

- The **Rate Limit** of the Twitter API: it would have taken a lot of time to get the tweets of a specific event, but we were ready to wait and launch the code on several computers (or on clusters).
- The **Search Query** limitations: after writing code that would allow us to get the tweets by searching specific hashtags over a time interval, we discovered a huge limitation: tweets can only be searched with the API if they are *less than one week old*.

So we had to discard the idea of using the Twitter API.

------
### 2.2 Scraping the Tweets Manually
Fortunately, the Twitter HTML interface (the website) allows us to search for any query over any time interval, so we decided to get the data by scraping the website directly. For that we use **PhantomJS**, a browser without a user interface, and **Selenium**, a Python package that lets us load URLs in this browser and scroll down the search page in order to load results. Once the page is loaded, we use **Beautiful Soup 4** with the **LXML** parser to extract every tweet on the page.

This was done using one script : [`tweet_acquisiton.py`](ADA2017_Homeworks/Project/TweetAcquisition/tweet_acquisition.py). 
For each event a new folder is created (here, for example, `Nigeria_1`). The log of the tweet acquisition is saved in this folder under a matching name (here `Nigeria_1.log`). Here is an example of the start of the log file:

-----
```text
------------------------------------------- ACQUISITION PARAMETERS -------------------------------------------
Started at : 2017-11-27 10:10:47.485905
Tweets saved in ./Nigeria_1/
Searching from 2016-01-29 to 2016-02-06
Hastags used : ['Dalori', 'Dalorilivesmatter', 'Nigeria', 'BokoHaram', 'Bokoharam', 'bokoharam', 'Borno', 'StopBokoHaram', 'PrayForNigeria']
------------------------------------------- STARTING ACQUISITION -------------------------------------------
1 - Tweets : 2772 - Total : 2772 - Date : 2016-02-05 07:39:06 - Elapsed Time : 810.799 s - Delay : 810.799 s - Rate : 3.419 tw/s - Executed at 2017-11-27 10:24:20.470199
     + First Tweet Time : 2016-02-05 22:11:24
     + Last Tweet Time : 2016-02-05 07:39:06
```

------
The query url is created using the list of hashtags specified inside the script. The explanations on how to use the scripts are in the [`README.md`](ADA2017_Homeworks/Project/TweetAcquisition/README.md) file.
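As a rough illustration of how such a query URL can be assembled, here is a minimal sketch. It is not the code of the acquisition script: the function name `build_search_url`, the base URL and the one-day margin before the event are assumptions (the margin is inferred from the log above, where an event on 2016-01-30 is searched from 2016-01-29 to 2016-02-06). The `OR`, `since:` and `until:` search operators are the ones offered by Twitter's advanced search.

```python
from datetime import date, timedelta
from urllib.parse import quote


def build_search_url(hashtags, event_day, days_before=1, days_after=7):
    """Build a search URL matching any of the hashtags within a date window.

    Hypothetical helper: the window margins and the base URL are assumptions.
    """
    since = event_day - timedelta(days=days_before)
    until = event_day + timedelta(days=days_after)
    # Any of the hashtags may match, hence the OR operator between them
    terms = " OR ".join("#" + tag for tag in hashtags)
    query = f"{terms} since:{since.isoformat()} until:{until.isoformat()}"
    # Percent-encode the query so '#', ':' and non-Latin hashtags survive in a URL
    return "https://twitter.com/search?q=" + quote(query)


url = build_search_url(["Dalori", "BokoHaram", "PrayForNigeria"], date(2016, 1, 30))
print(url)
```

Because the query is percent-encoded as UTF-8, hashtags written in Arabic, Cyrillic or other scripts can be passed in the same list.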


The tweets are acquired in segments: we scroll the page 500 times before parsing the HTML and saving a pickle containing the raw data. Each pickle contains an average of 7000 tweets. Here is an example of the structure of the acquired dataframe:


In [9]:
df = pickle.load(open('TweetAcquisition/Nigeria_1/Tweets_1.pickle', 'rb'))
df.head(4)

Unnamed: 0,hashtags,id,language,text,time_stamp,user_id,user_name,date
0,"[@FitzMP, #Biafrans, #Nigeria, #TyrantBuhari]",695731474243440640,en,"@FitzMP,We #Biafrans have died enough, we don’...",1454710284,354778701,EmekaGift,2016-02-05 22:11:24
1,"[#Nigeria, http://bit.ly/1SR2k89 ]",695758765627281408,es,"A más de dos años de que comenzó la crisis, ¿q...",1454716790,57683930,MSF_Mexico,2016-02-05 23:59:50
2,[#PrayForNigeria],695758763517730816,en,"I wish I was a little kid again, where all I h...",1454716790,518819812,allthingselliej,2016-02-05 23:59:50
3,"[#Nigeria, http://bit.ly/1odNc9y , #VOA]",695758537289367552,en,#Nigeria E-readers Help Thousands in Africa Le...,1454716736,2468196914,Vincecob,2016-02-05 23:58:56


We have scraped as much information as possible from the HTML page of the search query, but we are still missing the most important thing: the location of the tweet.

------
### 2.3 Scraping the location of the tweets 
From each tweet we take the `user_name` field and go to the user profile to get the location that the user has written there. 
The script that does this is [`location_acquisiton.py`](ADA2017_Homeworks/Project/TweetAcquisition/location_acquisiton.py). As we don't need to scroll down the page, we directly use the **requests** Python package combined with **Beautiful Soup 4** and **LXML**. As the code is very slow, we launch the process several times in parallel so that locations are acquired at the same rate as the tweets. 
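The parallel part can be sketched as follows. This is only an illustration under assumptions: it uses a thread pool inside one process rather than several launches of the script, `fetch_location` is a hypothetical placeholder for the real profile request, and looking up each distinct user only once is an optimisation added here, not something the original script is known to do.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_location(user_name):
    """Hypothetical placeholder for the profile request and HTML parsing."""
    return None


def locate_users(user_names, fetch=fetch_location, workers=8):
    """Look up the location of each distinct user with a pool of threads."""
    # Each profile is requested once, even if the user tweeted many times
    unique_names = sorted(set(user_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the results in the same order as unique_names
        locations = list(pool.map(fetch, unique_names))
    return dict(zip(unique_names, locations))


# Usage with a stand-in fetch function, so no network access is needed
demo = locate_users(["EmekaGift", "Vincecob", "EmekaGift"], fetch=lambda name: name.upper())
print(demo)
```

Threads suit this kind of work because most of the time is spent waiting for the web server, not computing.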

Below we display the head of the *Located* version of the pickled dataframe. 



In [12]:
df = pickle.load(open('TweetAcquisition/Nigeria_1/Located_Tweets_1.pickle', 'rb'))
df.head(4)

Unnamed: 0,hashtags,id,language,text,time_stamp,user_id,user_name,date,location
0,"[@FitzMP, #Biafrans, #Nigeria, #TyrantBuhari]",695731474243440640,en,"@FitzMP,We #Biafrans have died enough, we don’...",1454710284,354778701,EmekaGift,2016-02-05 22:11:24,[www.radiobiafra.co]
1,"[#Nigeria, http://bit.ly/1SR2k89 ]",695758765627281408,es,"A más de dos años de que comenzó la crisis, ¿q...",1454716790,57683930,MSF_Mexico,2016-02-05 23:59:50,"[Ciudad, de, Mexico]"
2,[#PrayForNigeria],695758763517730816,en,"I wish I was a little kid again, where all I h...",1454716790,518819812,allthingselliej,2016-02-05 23:59:50,[SomewhereOnlyWeKnow]
3,"[#Nigeria, http://bit.ly/1odNc9y , #VOA]",695758537289367552,en,#Nigeria E-readers Help Thousands in Africa Le...,1454716736,2468196914,Vincecob,2016-02-05 23:58:56,"[Brussels,, Belgium]"


Now we have the raw location information for each event. We need to geocode it to the associated country. 

------

## 3.  Geocoding the tweets

------

When geocoding the tweets multiple factors needed to be taken into account : 
- Not all the tweets had a provided location, in which case we scraped the user's location. However, most users do not provide that information which means that a large number of tweets had to be discarded as no location could be attributed. 
- The locations provided are entered manually. That means that users can write their locations however they want, with any spelling ("USAAAAAA"), with any type of special characters ("P@ris"). Some locations are even invented ("Somewhere only we know", "Heaven"). 
- The locations can be either countries, cities, regions, neighborhoods... 
- The locations can be written in any language

These are only a few of the issues we encountered. That is why the first step was to create a mapping dictionary which links various country names (in different languages and with different spellings) as well as cities to the ISO2 country code. Afterwards we went through the various dataframes containing the tweets for the different events and mapped the locations using the different dictionaries created. Finally, for each event we determined the activity level of each country as the number of tweets associated with that country.

All of this was done using dictionaries to have a fast localization process. Currently, for pickles containing around 8000 tweets, we require around one second to identify the corresponding countries from the different dictionaries.
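To make the idea of an activity level concrete, here is a minimal sketch that maps tokenised locations to country codes with a dictionary and counts the tweets per country. The tiny `toy_mapping` and the helper `activity_per_country` are hypothetical stand-ins for the real mappings and notebooks.

```python
from collections import Counter

# Hypothetical miniature mapping; the real dictionaries are far larger
toy_mapping = {"belgium": "BE", "brussels": "BE", "mexico": "MX"}


def activity_per_country(locations, mapping):
    """Count tweets per country; tweets without a known location are skipped."""
    counts = Counter()
    for words in locations:
        for word in words:
            code = mapping.get(word.lower().strip(","))
            if code is not None:
                counts[code] += 1
                break  # one country per tweet
    return counts


# Locations in the tokenised form shown in the Located dataframe above
locations = [["Brussels,", "Belgium"], ["Ciudad", "de", "Mexico"], ["SomewhereOnlyWeKnow"]]
activity = activity_per_country(locations, toy_mapping)
print(activity)
```

Each dictionary lookup takes roughly constant time, which is why this approach stays fast even on pickles of several thousand tweets.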


The notebooks used for the different steps are available here : 
-  [`Constructing the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Constructing%20the%20Mappings.ipynb)

- [`Geocoding Tweets Using the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Geocoding%20Tweets%20Using%20the%20Mappings.ipynb)

- [`Determining Number of Tweets Per Event`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Determining%20Number%20of%20Tweets%20Per%20Event.ipynb)

We will however explain some of the main points here. 

Remark: because of their size, certain files and some of these mappings could not be put on GitHub. 

_____

### 3.1 Creating the Different Location Mappings

As mentioned previously, we need to take multiple things into account for the mappings, the most important being languages and alternative spellings. That is why we created our mappings from databases which give alternative names and spellings for each country, capital and city when possible. Once all of these were obtained, we also removed all accents from the strings. We then combined the original spellings with the formatted strings and removed duplicates using sets. The remaining names were then used as the keys of the dictionaries and the corresponding country as the value.

String formatting functions and accent removal as well as list formatting functions

In [67]:
#https://stackoverflow.com/questions/8694815/removing-accent-and-special-characters
def remove_accents(data):
    if data is None:
        return None
    else :
        clean = ''.join(x.lower().strip() for x in unicodedata.normalize('NFKD', data) if \
                unicodedata.category(x)[0] == 'L').lower()
        return clean
    
def string_formatting(string):
    string = string.replace("-", " ").replace(" ", ",").split(",")
    formatted_string = [remove_accents(x.lower()) for x in string]
    return string,formatted_string

def clean_sublist(x):
    return list(set(filter(None, np.hstack(x))))

def remove_accents_in_sublist(l):
    return list(map(lambda x:remove_accents(x.lower()),l))
    
def remove_accents_in_list(lists):
    return list(map(lambda x:remove_accents_in_sublist(x),lists))

def clean_and_remove_accents_in_list(lists):
    return list(map(lambda x:clean_sublist(remove_accents_in_sublist(x)),lists))

Here we show an example of the string formatting applied to two countries, using the data provided in https://datahub.io/core/country-codes which gives the name of the country in different languages and the capital in English. 

In [68]:
test_list = [['أفغانستان', 'afganistán', '阿富汗', 'афганистан', 'Kabul', 'afghanistan'], ['阿尔巴尼亚', 'албания', 'Tirana', 'ألبانيا', 'albania', 'albanie']]
clean_and_remove_accents_in_list(test_list)

[['阿富汗', 'афганистан', 'afganistan', 'afghanistan', 'kabul', 'افغانستان'],
 ['阿尔巴尼亚', 'албания', 'tirana', 'albania', 'albanie', 'البانيا']]
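The accent removal relies on Unicode NFKD normalisation, which splits an accented character into its base letter followed by a combining mark; keeping only characters whose category starts with `L` (letters) then drops the marks, along with spaces, digits and punctuation. The sketch below restates that logic in a standalone helper, here called `strip_to_letters` (a hypothetical name), so the effect can be checked on a few strings.

```python
import unicodedata


def strip_to_letters(text):
    """Decompose the text (NFKD), keep only letters and lowercase the result."""
    decomposed = unicodedata.normalize("NFKD", text)
    letters = (ch for ch in decomposed if unicodedata.category(ch).startswith("L"))
    return "".join(letters).lower()


for raw in ["Afganistán", "São Tomé", "P@ris"]:
    print(raw, "->", strip_to_letters(raw))
```

Note that spaces and special characters disappear as well, so `"São Tomé"` becomes `"saotome"` and `"P@ris"` becomes `"pris"`; this explains keys such as `islamicrepublicofafghanistan` in the mappings further down.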

The function below converts the dataframe to a dictionary in which the keys are all the elements of the different rows and the value is the index of the row they come from. We also take the variants of all the elements in the rows and use them as keys.

In [90]:
def convert_df_to_dict(df, do_prints = False):
    
    # Converting the dataframe values to list and cleaning them
    t = time.time()
    df_list = list(map(lambda x:clean_sublist(x),df.values.tolist()))
    if do_prints : print("Converting to list :", time.time()-t)

    # Removing all the accents from the elements in the list
    t = time.time()
    df_variants = clean_and_remove_accents_in_list(df_list)
    if do_prints : print("Getting variants :", time.time()-t)
    
    # Combining the lists with original spellings and without accents
    t = time.time()
    df_all =  list(map(lambda x: list(set(df_list[x] + df_variants[x])),range(len(df))))
    if do_prints : print("Combining Lists :", time.time()-t)
        
    # Getting all the keys
    t = time.time()
    keys = list(map(lambda x: [df.index[x]]*(len(df_all[x])),range(len(df_all))))
    if do_prints : print("Getting all keys :", time.time()-t)
      
    # Creating the dictionary
    t = time.time()
    mapping = dict(zip(sum(df_all, []),sum(keys, [])))
    if do_prints : print("Converting to dict :", time.time()-t)
        
    return mapping

### 3.1.1 The country_mapping dictionary

Here is an example of how some of the previous functions were used to construct a portion of the country_mapping dictionary using the data from https://datahub.io/core/country-codes. At the end we print the first 22 elements of the dictionary, which are used to identify Afghanistan and Albania.

In [98]:
# Load the country names in different languages mapping
country_codes = pd.read_csv("GeocodingTweets/Mapping Files/country-codes.csv")
keep_columns = ['official_name_ar', 'official_name_cn', 'official_name_en',
                'official_name_es', 'official_name_fr', 'official_name_ru',
                'ISO3166-1-Alpha-2', 'ISO3166-1-Alpha-3', 'ISO3166-1-numeric',
                'Capital', 'Continent', 'Region Name','Sub-region Name']       
# Keep only the desired columns
country_codes = country_codes[keep_columns]
country_codes.rename(inplace = True, index=str, columns={"official_name_ar": "arabic", "official_name_cn":"chinese", "official_name_en":"english", 
                                                         "official_name_es":"spanish", "official_name_fr":"french", "official_name_ru":"russian",
                                                         "ISO3166-1-Alpha-2":"ISO2", "ISO3166-1-Alpha-3":"ISO3", "ISO3166-1-numeric":"ISONum"})
# Remove the first element in the dataframe
country_codes = country_codes.iloc[1:]

# Create the dictionary
country_codes.set_index("ISO2", inplace = True)
col = ["english", "french", "spanish", "chinese", "russian", "arabic", "Capital"]
country_mapping1 = convert_df_to_dict(country_codes[col])

print(list(zip(country_mapping1.keys(), country_mapping1.values()))[:22])

[('Afghanistan', 'AF'), ('阿富汗', 'AF'), ('Афганистан', 'AF'), ('афганистан', 'AF'), ('Afganistán', 'AF'), ('afghanistan', 'AF'), ('Kabul', 'AF'), ('afganistan', 'AF'), ('kabul', 'AF'), ('افغانستان', 'AF'), ('أفغانستان', 'AF'), ('Албания', 'AL'), ('阿尔巴尼亚', 'AL'), ('албания', 'AL'), ('Albanie', 'AL'), ('tirana', 'AL'), ('albania', 'AL'), ('ألبانيا', 'AL'), ('albanie', 'AL'), ('Tirana', 'AL'), ('البانيا', 'AL'), ('Albania', 'AL')]


The second portion of the country mapping was created taking the data from https://raw.githubusercontent.com/mledoze/countries/master/countries.json
which gives the names of the different countries in different languages with alternative spellings

In [99]:
# Functions necessary to extract the information from the different cells of the dataframe

# Get the common native name from the dictionary in the native column
def extract_native_name(x):
    try:
        return x["native"][list(x["native"].keys())[0]]["common"]
    except (KeyError, IndexError, TypeError, AttributeError):
        # No usable native name for this country
        return None

# Get the different translations from the dictionary in the translations column
def extract_translations(x):
    try:
        return [name["common"] for name in x.values()]
    except (KeyError, TypeError, AttributeError):
        # Missing or malformed translations entry
        return None

# Load the json into a dataframe and keep only relevant columns
country_df = pd.read_json("GeocodingTweets/Mapping Files/countries.json")
country_df = country_df[["altSpellings", "capital", "cca2", "name", "translations"]]
country_df.rename(inplace = True, index=str, columns={"cca2": "ISO2"})
country_df.set_index("ISO2", inplace = True)

# Extract from the different columns the alternative names and spellings in different languages
country_df["common"] = country_df["name"].apply(lambda x: x["common"])
country_df["official"] = country_df["name"].apply(lambda x: x["official"])
country_df["native"] = country_df["name"].apply(lambda x: extract_native_name(x))
country_df["common translations"] = country_df["translations"].apply(lambda x: extract_translations(x))
country_df["altSpellings"] = country_df["altSpellings"] .apply(lambda x: x[1:] if len(x)>1 else [])
country_df.drop(["name","translations"], axis = 1, inplace = True)

# Convert the dataframe to a dictionary
country_mapping2 = convert_df_to_dict(country_df)

Finally both mappings were merged into one. The entire process is only run once as the output is pickled. For more details refer to the original [`Constructing the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Constructing%20the%20Mappings.ipynb) notebook. Here we output the mapping for Afghanistan using respectively the first dataset, the second and the combination of both.


In [100]:
country_mapping = {**country_mapping1, **country_mapping2}

for mapping in [country_mapping1, country_mapping2,country_mapping]:
    print("---------------------------------------------------------------")
    for key, val in zip(mapping.keys(), mapping.values()):
        if val == "AF":
            print(key, val)
print("---------------------------------------------------------------")

---------------------------------------------------------------
Afghanistan AF
阿富汗 AF
Афганистан AF
афганистан AF
Afganistán AF
afghanistan AF
Kabul AF
afganistan AF
kabul AF
افغانستان AF
أفغانستان AF
---------------------------------------------------------------
Afġānistān AF
アフガニスタン AF
Afghanistan AF
Афганистан AF
Islamic Republic of Afghanistan AF
阿富汗 AF
афганистан AF
アフカニスタン AF
islamicrepublicofafghanistan AF
Afganistan AF
Afganistán AF
Kabul AF
afganistan AF
afghanistan AF
افغانستان AF
Afeganistão AF
Affganistan AF
kabul AF
afeganistao AF
affganistan AF
---------------------------------------------------------------
Afghanistan AF
阿富汗 AF
Афганистан AF
афганистан AF
Afganistán AF
afghanistan AF
Kabul AF
afganistan AF
kabul AF
افغانستان AF
أفغانستان AF
Afġānistān AF
アフガニスタン AF
Islamic Republic of Afghanistan AF
アフカニスタン AF
islamicrepublicofafghanistan AF
Afganistan AF
Afeganistão AF
Affganistan AF
afeganistao AF
affganistan AF
--------------------------------------------------------
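One property of the merge syntax used above is worth keeping in mind: with `{**a, **b}`, a key present in both dictionaries takes its value from the right-hand one, while keeping the position of its first appearance. The toy entries below are hypothetical and only illustrate this behaviour.

```python
# Hypothetical toy mappings with one conflicting key
first = {"Kabul": "AF", "Georgia": "GE"}
second = {"Georgia": "US", "Tirana": "AL"}

merged = {**first, **second}

print(merged["Georgia"])  # value taken from `second`, the right-hand dictionary
print(list(merged))       # keys keep the order of their first appearance
```

For the country mappings this means that, if the two datasets ever disagreed on a name, the second dataset would silently win.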

### 3.1.2 The city_mapping dictionary

This mapping was created using the Cities of the world dataset in Json format which is based on GeoNames Gazetteer taken from https://github.com/lutangar/cities.json. 

In [101]:
city_df = pd.read_json("GeocodingTweets/Mapping Files/cities.json")
city_df.drop(["lat", "lng"], axis = 1, inplace = True)
city_df.rename(inplace = True, index=str, columns={"country": "ISO2", "name":"city"})
city_df.set_index("city", inplace = True)
city_df.head()

Unnamed: 0_level_0,ISO2
city,Unnamed: 1_level_1
Sant Julià de Lòria,AD
Pas de la Casa,AD
Ordino,AD
les Escaldes,AD
la Massana,AD


The issue with this mapping is that there are multiple cities with the same name in different countries. As we have no way of determining which city is the most likely, we drop those rows from the dataframe and store them in a second one. 

In [102]:
doublons = city_df.copy()
doublons["num"] = 1
doublons = doublons.groupby("city").sum()
doublons = doublons[doublons.num>1]
doublons = doublons.index.tolist()
print(len(doublons))

10409


Here is an example of why the provided mapping is problematic, especially since we cannot rely on language to determine which country the city belongs to. 

In [103]:
city_df.loc["Toronto","ISO2"]

city
Toronto    AU
Toronto    CA
Toronto    US
Name: ISO2, dtype: object

Dropping all problematic cities from the mapping and creating a dictionary from the remaining cities. 

In [104]:
reduced_city_df = city_df.drop(doublons)
city_mapping = dict(zip(reduced_city_df.index, reduced_city_df.ISO2))

alt_names = [remove_accents(x) for x in reduced_city_df.index]
city_mapping = {**dict(zip(alt_names, reduced_city_df.ISO2)), **city_mapping}


Unfortunately, this mapping is far from complete and is missing many cities, especially after the removal of the cities with identical names. However, we can quickly check a few cities:

In [105]:
print("Nantes in :", city_mapping[remove_accents("Nantes")])
print("Lausanne in :", city_mapping[remove_accents("Lausanne")])
print("Abu Dhabi in :", city_mapping[remove_accents("Abu Dhabi")])
print("Shanghai in :", city_mapping[remove_accents("Shanghai")])
print("Beijing in :", city_mapping[remove_accents("Beijing")])
print("Tokyo in :", city_mapping[remove_accents("Tokyo")])

Nantes in : FR
Lausanne in : CH
Abu Dhabi in : AE
Shanghai in : CN
Beijing in : CN
Tokyo in : JP


### 3.1.3 The full_city_mapping dictionaries

This mapping was constructed using a database of cities in each country taken from http://download.geonames.org/export/dump/. This database contains a zip file for each country with a text file containing the different cities as well as alternate names. The functions used to process this database are much longer, which is why we refer curious readers to the original notebook used to construct this mapping, [`Constructing the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Constructing%20the%20Mappings.ipynb), for more details (see the section Method 4 : Using the Geonames Database). 

The main steps are the following: 
- Load the text file for each country
- If the text file is larger than a max size, split it into smaller text files to speed up the processing
- For each text file load the data into a dataframe 
- Use the `convert_df_to_dict` function to save a dictionary for each text file
- Once the text files have been processed, load the different dictionaries and merge them into dictionaries with the name full_city_mapping_i which each contain 100 dictionaries. 

Breaking up the data into smaller subsets was necessary to speed up the processing time as well as to account for the large amount of memory needed and the time it takes to load the dictionaries from the pickle formats. 
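The split-then-merge pattern described above can be sketched as follows. This is a simplified illustration, not the notebook's code: it works on in-memory `(city, ISO2)` pairs instead of text files, and the helper names `chunked`, `build_partial_mappings` and `merge_mappings` are hypothetical.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_partial_mappings(rows, chunk_size):
    """Build one small dictionary per chunk of (city, ISO2) pairs."""
    return [dict(chunk) for chunk in chunked(rows, chunk_size)]


def merge_mappings(mappings):
    """Merge the partial dictionaries; later ones win on duplicate keys."""
    merged = {}
    for mapping in mappings:
        merged.update(mapping)
    return merged


rows = [("Nantes", "FR"), ("Lausanne", "CH"), ("Tokyo", "JP")]
parts = build_partial_mappings(rows, chunk_size=2)
full_mapping = merge_mappings(parts)
print(len(parts), full_mapping)
```

In the real pipeline each partial dictionary would be pickled to disk between the two steps, so that a failure only costs one chunk of work.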

It is important to mention that we did not handle cities with the same name in this dataset. One idea was to use the population of each city to determine which one is the most likely. This question still has to be addressed. 
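A possible way to implement the population idea, which has not been done in this project, is sketched below: for each city name, keep the country of the most populous city bearing that name. The function `most_populous_mapping` is hypothetical, and the population values are placeholders chosen for the example, not real figures.

```python
def most_populous_mapping(rows):
    """Map each city name to the country of its most populous namesake.

    `rows` is an iterable of (city_name, iso2, population) tuples.
    """
    best = {}
    for name, iso2, population in rows:
        current = best.get(name)
        if current is None or population > current[1]:
            best[name] = (iso2, population)
    return {name: iso2 for name, (iso2, _) in best.items()}


# Placeholder populations, for illustration only
rows = [
    ("Toronto", "AU", 10),
    ("Toronto", "CA", 1000),
    ("Toronto", "US", 5),
    ("Lausanne", "CH", 100),
]
resolved = most_populous_mapping(rows)
print(resolved)
```

This heuristic favours large cities, so a user living in a small namesake town would still be assigned to the wrong country.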

------
### 3.2. Using the Dictionaries to Map the Locations 


To map the tweets to their locations we used in order : 

- The country_mapping dictionary, to check whether a country name or capital is contained in the string. It was built from the data obtained from https://mledoze.github.io/countries/ and https://datahub.io/core/country-codes. The first links the country ISO codes to country names in multiple languages, with not only the official but also the common names of a country. The latter links the country ISO codes to country names in six languages (Arabic, Chinese, English, Spanish, French, Russian). 

- The city_mapping dictionary, which was constructed using data from https://github.com/lutangar/cities.json from which we removed duplicate cities. Note that this list was not exhaustive to begin with, which is why we also used the last mapping. 

- A city-to-country mapper extracted from http://www.geonames.org/export/ and http://download.geonames.org/export/dump/. The issue with this mapper is that the duplicate cities were not handled. Its advantage, however, is that it is more extensive than the previous one, containing a larger number of cities as well as alternative spellings and different languages. Ideally, in the case of multiple cities with the same name, we would select one based on the population of the cities. However, we do not have the population size for all the cities provided. 

- If none of the mappings provides a valid location, one option, had we paid for subscriptions, would have been to use APIs such as the Google API to obtain the most likely location corresponding to a given input. The issue with these APIs is that they tend to be slow (1 second per tweet) and are limited to a given number of queries, which is why they were not used. Some of the APIs we looked at are based on the same dataset as the one used to create the full_city_mapping dictionaries, such as the ones which can be found here: http://geocoder.readthedocs.io/results.html. They output the most probable location corresponding to the location entered by the user. Even with subscriptions to these services, given that the number of tweets is in the millions and that each query is relatively slow, this would not have been feasible on the entire dataset. Note: we do not know whether it would have been faster with a subscription, but it would have had to be at least 100 times faster to be a viable candidate solution.
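The priority order of the mappings listed above can be sketched as a simple cascade: each mapping is tried in turn and the first hit is returned. The toy dictionaries and the function name `locate` are hypothetical; the real function in the notebook also handles word combinations and formatting.

```python
def locate(candidates, mappings):
    """Return the country code from the first mapping that knows a candidate."""
    for mapping in mappings:  # mappings are ordered from most to least trusted
        for candidate in candidates:
            if candidate in mapping:
                return mapping[candidate]
    return None  # no mapping recognised the location


# Hypothetical miniature versions of the three kinds of mapping
country_map = {"belgium": "BE"}
city_map = {"nantes": "FR"}
full_city_map = {"brussels": "BE", "lausanne": "CH"}
ordered = [country_map, city_map, full_city_map]

print(locate(["lausanne"], ordered))
print(locate(["nantes", "belgium"], ordered))
print(locate(["heaven"], ordered))
```

In the second call the country name wins over the city name even though it comes later in the string, because the country mapping is consulted first.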

Here is a simplified version of the function used in the [`Geocoding Tweets Using the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Geocoding%20Tweets%20Using%20the%20Mappings.ipynb) notebook. The function here only looks at the country mapping which is already in the notebook just to serve as a proof of concept and see how much time it takes to identify the country / city using the dictionaries. 

The idea is that certain locations are made up of multiple words, which is why we test the combinations of adjacent words. We then check whether each combination is in the mapping. If it is, we output the result. 

In [81]:
def country_in_string(loc, do_prints = False): 
    t = time.time()
    
    # Get the formatted and non formatted version of the words
    words, formatted_words = string_formatting(loc)
    if do_prints : print(words)
        
    # Remove words of two characters or fewer and get all their combinations,
    # considering only adjacent words
    words = [x.lower() for x in words if len(x)>2]
    formatted_words = [x for x in formatted_words if len(x)>2]
    
    word_combinations = [" ".join(words[i:j]) for j in range(len(words)+1) for i in range(j)]
    # Build the second set of combinations from the accent-free words
    word_combinations += [" ".join(formatted_words[i:j]) for j in range(len(formatted_words)+1) for i in range(j)]
    if do_prints : print(word_combinations)
    
    # Return the country code of the first combination found in the dict
    for word in word_combinations:
        if do_prints : print("Testing: ", word)
        if word in country_mapping:
            return country_mapping[word], time.time()-t

    return None, time.time()-t


Testing the function above with different countries, in different languages playing with the upper case and lower case letters

In [82]:
print(country_in_string("أفغانستان hello I am bored"))
print(country_in_string("أفغانستان"))
print(country_in_string("España oiejdoew sdjoidsjf sdnfosid"))
print(country_in_string("autriche"))
print(country_in_string("oesterreich"))
print(country_in_string("osterreich"))
print(country_in_string("austria"))
print(country_in_string("vienna"))
print(country_in_string("Hello New Zealand"))
print(country_in_string("Hello New ZeAlAND"))
print(country_in_string("Washington"))
print(country_in_string("CAIro"))

('AF', 8.988380432128906e-05)
('AF', 3.409385681152344e-05)
('ES', 8.416175842285156e-05)
('AT', 5.817413330078125e-05)
('AT', 3.314018249511719e-05)
('AT', 3.62396240234375e-05)
('AT', 2.288818359375e-05)
('AT', 2.2172927856445312e-05)
('NZ', 3.409385681152344e-05)
('NZ', 3.0279159545898438e-05)
('US', 2.1219253540039062e-05)
('EG', 1.9073486328125e-05)


Examples of tests using all the different mappings can be seen in the original notebook [`Geocoding Tweets Using the Mappings`](https://github.com/LailaHms/ADA2017_Homeworks/blob/Laila_Project/Project/GeocodingTweets/Geocoding%20Tweets%20Using%20the%20Mappings.ipynb). 

This function was then run on the different dataframes of the tweets which were acquired and the results were stored into the Geolocated folder. This way we can always access the different tweets and their locations if we want. 
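
As a rough illustration of this step, the sketch below applies a simplified lookup to a small dataframe. It is not the project's actual code: `toy_mapping`, `toy_country_lookup` and the `user_location` column are hypothetical stand-ins, and only the `text` and `country` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the real country_mapping dictionary (assumption).
toy_mapping = {"new zealand": "NZ", "vienna": "AT", "cairo": "EG"}


def toy_country_lookup(text):
    """Return the code of the first run of adjacent words found in toy_mapping, else None."""
    words = [w.lower() for w in text.split() if len(w) > 2]
    for end in range(1, len(words) + 1):
        for start in range(end):
            candidate = " ".join(words[start:end])
            if candidate in toy_mapping:
                return toy_mapping[candidate]
    return None


# Hypothetical tweets dataframe; "user_location" is an assumed column name.
tweets = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "user_location": ["Hello New ZeAlAND", "CAIro", "somewhere else"],
})

# One country code per tweet; rows without a match stay empty.
tweets["country"] = tweets["user_location"].apply(toy_country_lookup)
print(tweets[["user_location", "country"]])
```

Rows for which no combination matches keep a missing value in `country`, so they can be dropped or counted separately later.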

----
### 3.3. Determining the Number of Tweets Per Country Per Event 

Using the dataframes with the tweets and their locations, we then used groupby and count to determine how many tweets there were per country in each dataframe, and merged the counts into one summary dataframe in the Geolocated folder of each event. The code used can be seen here:

In [None]:
from tqdm import tqdm  # progress bar used in the loop below

execute = False

if execute:
    cwd = os.getcwd()
    path = os.path.join(cwd, "../../../Project Data/Tweets")
    # List everything in the Tweets directory
    folders = os.listdir(path)
    # Keep only the folders excluding the checkpoints folder -> event folders
    folders = [x for x in folders if os.path.isdir(os.path.join(path, x)) if "checkpoints" not in x if "DS_Store" not in x]

    do_prints = False

    # Get the country codes from the country mapping pickle. This will be used to init
    # the dataframe which will contain the overall number of tweets per country per event. 

    country_codes = pd.read_pickle("country_mapping.pickle")
    if do_prints : print(type(list(set(country_codes.values()))[0]))
    # Keep the unique country codes and drop NaN entries (which are floats)
    country_codes = [x for x in set(country_codes.values()) if not isinstance(x, float)]

    # Go through all the different events folders
    for folder in folders:
        # Get all the files in the event folder
        files_path = os.path.join(path, folder, "Geocoded")
        located_files = [x for x in  os.listdir(files_path) if "Located" in x]

        # Create the first empty dataframe in which all the counts will be stored
        event_locations = pd.DataFrame(pd.Series(country_codes), columns = ["country"])
        event_locations.set_index("country", inplace = True)
        event_locations["text"] = 0

        # Go through all the different files in the folder and process them.
        for pkl_file in tqdm(located_files):
            # Read the pickle file, groupby country and count the number of tweets then add
            # to the final df for the event
            df = pd.read_pickle(os.path.join(files_path,pkl_file))
            interm_df = df[["country", "text"]].groupby("country").count()
            event_locations = event_locations.add(interm_df, fill_value=0)

        # Pickle the event dataframes
        if do_prints: print(event_locations["text"].tolist())
        event_locations.to_pickle(os.path.join(files_path, "summary.pickle"))
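
The accumulation step in the loop above can be checked on toy data. This is a minimal sketch with made-up tweets and three country codes. It shows why `fill_value=0` matters: a country missing from one chunk keeps its running total instead of turning into NaN.

```python
import pandas as pd

# Running totals, initialised to zero for every known country code (toy list).
country_codes = ["BE", "FR", "US"]
totals = pd.DataFrame({"text": 0}, index=pd.Index(country_codes, name="country"))

# Two made-up chunks standing in for the pickled dataframes of located tweets.
chunk_1 = pd.DataFrame({"country": ["BE", "BE", "US"], "text": ["a", "b", "c"]})
chunk_2 = pd.DataFrame({"country": ["US", "FR"], "text": ["d", "e"]})

for chunk in (chunk_1, chunk_2):
    # Number of tweets per country in this chunk
    counts = chunk[["country", "text"]].groupby("country").count()
    # Countries absent from `counts` are treated as 0 instead of producing NaN
    totals = totals.add(counts, fill_value=0)

# add() with fill_value returns floats, so cast back to integers
totals["text"] = totals["text"].astype(int)
print(totals)
```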


------
## 4.  Enriching the Data
------
### 4.1 Need for enriching with data of each country
So far we have explained the whole process, from retrieving the tweets to geolocating them. That would be enough to visualize the data and get a general overview of the awareness in each country. However, we didn't want to stop here. As explained in Milestone 1, in part two of our project we want to study the different factors that influence the level of awareness of a given country to a certain event.

In this section, we describe the process of gathering and cleaning the data. Where did we take the datasets from? What do they look like? We will include a description of the raw data and the necessary steps to clean and transform the data according to our needs.

Basically, we want a final dataframe with one row per country that contains all the features of that country. The necessary features are listed in the last cell of section 4.1.3.

### 4.1.1. Size of data
Because there are relatively few countries in the world, the size of the data and memory usage were not a problem for this part. Each dataframe has around 250 rows, one per country, and fewer than 100 features.

### 4.1.2. Description of the raw data
The datasets that we used were taken from the Internet. They all come from open projects that rely on official data and can be freely used for studies. We kept a few more features than we currently need because we think they could be useful later. Here is the list of features and a description of the raw data:

Link dataset 1: https://github.com/mledoze/countries
- Country name: dictionary of dictionaries.
    - common : common name in english
    - official : official name in english
    - native : list of all native names
        - key: ISO 639-3 language code
        - value: name object
            - key: official - official name translation
            - key: common - common name translation
            
- Country code: code ISO 3166-1 alpha-2
- Country code: code ISO 3166-1 alpha-3
- Borders: list of the country codes (alpha-3) of all the countries that share a border with the country
- Land area (in $km^2$)
- Latitude and Longitude: in a list [latitude, longitude]
- Official languages: dictionary of dictionaries.
    - key: ISO 639-3 language code
    - value: name of the language in english

Link dataset 2: http://www.thearda.com/Archive/Files/Downloads/WRDNATL_DL2.asp
- Population: total
- Religions: each religion is one column and the data is given both in total number of adherents in each country and also as a percentage
- Country code: code ISO 3166-1 alpha-3

Link dataset 3a: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
- Total GDP: in USD

Link dataset 3b: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
- Per capita GDP: in USD

Link dataset 4: https://github.com/opendatajson/factbook.json
- Type of government: not categorized. Needs string processing to create categorical government types
- Country code: GEC codes

##### Missing feature
- Number of active Twitter users.

This issue is addressed in section 6.

#### Examples of the first dataset


In [None]:
# Show the useful columns as they are directly read from the file
cols = ['area', 'cca2', 'cca3', 'borders', 'name', 'latlng', 'languages']
countries = pd.read_json(r'DataEnriching/countries.json')[cols]
countries.head()

In [None]:
# Example of format of name column and language column 
print('name', countries.name[0], '\n')
print('languages', countries.languages[0])

#### Examples of the second dataset

In [None]:
# Show the useful columns as they are directly read from the file  
pop_rel_df = pd.read_excel('DataEnriching/World Religion Dataset - National Religion Dataset.xlsx')
cols = ['YEAR', 'ISO3', 'POP', 'DUALREL'] + \
        [col for col in pop_rel_df.columns if 'PCT' in col]
    
pop_rel_df = pop_rel_df[cols]
pop_rel_df.head()

As we have data for every year, we selected the most recent data which is from 2010.

We only selected the columns that hold the percentage of adherents, not the totals; these columns have PCT in their name. For any given religion, we have the percentage of adherents for its main branches as well as for the religion as a whole (e.g. for Christianity we have CHGENPCT: total percentage of adherents, but also CHPRTPCT: percentage of Protestants, CHANGPCT: percentage of Anglicans, etc.). For our project we are only interested in the total percentage of adherents of each religion, not in its branches.

To decide whether a religion is practiced in a country, we will need to set a threshold on the percentage of adherents.

We also kept the DUALREL and SUMPCT columns to show that the sum of the percentages (SUMPCT) can exceed 1, because some countries report dual religion, as shown in the next example.

In [None]:
pop_rel_df[pop_rel_df.DUALREL == 1][['ISO3', 'DUALREL', 'SUMPCT']].head(1)

#### Examples of the third dataset

In [None]:
gdp_total = pd.read_csv('DataEnriching/gdp_total.csv', skiprows=3)[['Country Name', 'Country Code', '2016']]
gdp_total.rename(columns={'2016': '2016_gdp_total', 'Country Code': 'ISO3'}, inplace=True)
gdp_capita = pd.read_csv('DataEnriching/gdp_per_capita.csv', skiprows=4)[['Country Name', 'Country Code', '2016']]
gdp_capita.rename(columns={'2016': '2016_gdp_capita', 'Country Code': 'ISO3'}, inplace=True)

gdp_df = pd.merge(gdp_total, gdp_capita, on=['ISO3', 'Country Name'])
gdp_df.head()

#### Examples of the fourth dataset

In [None]:
# Names of the folders
region_folders = ['africa', 'australia-oceania', 'central-america-n-caribbean', 'central-asia', 'east-n-southeast-asia',
          'europe', 'middle-east', 'north-america', 'south-america', 'south-asia']

# Collect one record per country file, then build GEC_gov_type_df in a single step
# (DataFrame.append was removed in pandas 2.0)
records = []
for region in region_folders:
    region_path = os.path.join('DataEnriching', 'factbook.json', region)
    for country_file in os.listdir(region_path):
        df = pd.read_json(os.path.join(region_path, country_file))
        try:
            gov_type = df.loc['Government type', 'Government']['text']
        except Exception:
            # Some countries have no government type entry
            gov_type = 'unknown'

        records.append({'GEC_code': country_file[:2], 'gov_type': gov_type})

GEC_gov_type_df = pd.DataFrame(records)
GEC_gov_type_df.head()

This dataset doesn't contain any ISO code for the countries. We could only get the GEC code, so we will have to map it to the ISO codes that we use to merge all the information into a single dataframe.

With respect to the government type, we can see in the cell below that it is not standardized. We will have to analyse each value and group them into broader categories so that we end up with a categorical feature.

In [None]:
# Example to show that the gov_type is not standardized
print(GEC_gov_type_df.gov_type[10])
print(GEC_gov_type_df.gov_type[29])
print(GEC_gov_type_df.gov_type[221])
print(GEC_gov_type_df.gov_type[234])

### 4.1.3. Filtering, transforming the data according to our needs
In this section the pipeline will be inverted. We will explain all the cleaning and selection process and finally we will present the final dataframe with a description of the features.

The actual code of all the procedures is in the notebook: "Country data gathering.ipynb".
Here we just load the pickled dataframes in order to show the results.

#### Dataset 1
For this dataset we added some columns to the initial dataframe. From the original languages column we extracted the official languages and their codes and added them as two separate columns, dropping the original one. The same procedure was applied to the name column to create the name and name_native columns. The name column holds the 'common' English name of the country (see the raw data description for all possible options).

In [None]:
countries_df = pd.read_pickle('DataEnriching/countries_df.pickle')
countries_df.head()

#### Dataset 2
We only selected the most recent data (2010) with a simple query and kept the columns that we need, which are the percentages.
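
A minimal sketch of such a query is shown below, on invented toy values. The column names `YEAR`, `ISO3`, `POP` and `CHGENPCT` appear in the dataset as described above; `CHGENALL` is a hypothetical name for a column of totals, and none of the numbers are real.

```python
import pandas as pd

# Invented toy rows: two countries, two years each (values are not real data).
raw = pd.DataFrame({
    "YEAR": [2005, 2010, 2005, 2010],
    "ISO3": ["ESP", "ESP", "AUT", "AUT"],
    "POP": [100, 110, 50, 55],
    "CHGENPCT": [0.90, 0.88, 0.85, 0.80],
    "CHGENALL": [90, 97, 42, 44],  # hypothetical column of totals, to be dropped
})

# Keep the most recent year and only the percentage columns
latest_year = raw["YEAR"].max()
pct_cols = [col for col in raw.columns if "PCT" in col]
latest = raw.loc[raw["YEAR"] == latest_year, ["YEAR", "ISO3", "POP"] + pct_cols]
latest = latest.reset_index(drop=True)
print(latest)
```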

In [None]:
# Showing just data of 2010
pop_rel_df = pd.read_pickle('DataEnriching/pop_rel_df.pickle')
pop_rel_df.head()

#### Dataset 3
No cleaning needed

#### Dataset 4
We only read the JSON files and extracted the government type. For the countries where this data was missing, we set the value to 'unknown'. The GEC code was taken from the first two characters of the file name (e.g. sp for Spain, from the file sp.json).

To merge the datasets we need ISO3 codes, so we mapped the GEC codes to ISO3. To build this mapping we scraped http://www.statoids.com/wab.html, and then merged everything into a single dataframe.
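
The mapping step could look roughly like the sketch below. It assumes the scraped table has already been loaded into a two-column dataframe, here called `gec_to_iso3` (a hypothetical name), and uses three illustrative rows, one of which has a code that cannot be mapped.

```python
import pandas as pd

# Illustrative rows in the shape of GEC_gov_type_df; "xx" is a made-up code.
gec_gov = pd.DataFrame({
    "GEC_code": ["sp", "au", "xx"],
    "gov_type": ["parliamentary constitutional monarchy",
                 "federal parliamentary republic",
                 "unknown"],
})

# Hypothetical mapping table, assumed to come from the scraped page.
gec_to_iso3 = pd.DataFrame({"GEC_code": ["sp", "au"], "ISO3": ["ESP", "AUT"]})

# A left merge with indicator=True keeps every row and flags the unmapped codes
merged = gec_gov.merge(gec_to_iso3, on="GEC_code", how="left", indicator=True)
unmapped = merged.loc[merged["_merge"] == "left_only", "GEC_code"].tolist()

# Rows that received an ISO3 code are ready to be merged with the other datasets
gov_by_iso3 = merged.loc[merged["_merge"] == "both", ["ISO3", "gov_type"]]
gov_by_iso3 = gov_by_iso3.reset_index(drop=True)
print(unmapped)
print(gov_by_iso3)
```

Listing the unmapped codes before dropping them makes it easy to spot territories that exist in one coding scheme but not in the other.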

In [None]:
gov_type_df = pd.read_pickle('DataEnriching/gov_type_df.pickle')
gov_type_df.head()

#### Merging into one dataframe & transforming columns
We merged the four previous dataframes on ISO3 country codes. Now we need to categorize the gov_type column and compress the religions into one column.

For the religions, we proceeded as follows:
- We only consider that a country has a certain religion if the corresponding percentage is greater than a threshold of 10%.

So we loop over every religion in every country and only keep the ones that pass the threshold.
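
The thresholding can be sketched as follows. This is only an illustration on invented values: `CHGENPCT` is a column name from the dataset, while `REL_B_PCT`, `REL_C_PCT` and the country codes `AAA` and `BBB` are placeholders. Percentages are assumed to be stored as fractions, so 10% is written `0.10`.

```python
import pandas as pd

THRESHOLD = 0.10  # 10%, assuming percentages are stored as fractions

# Invented values for two placeholder countries
toy = pd.DataFrame(
    {
        "CHGENPCT": [0.70, 0.05],
        "REL_B_PCT": [0.04, 0.92],  # placeholder column name
        "REL_C_PCT": [0.25, 0.02],  # placeholder column name
    },
    index=["AAA", "BBB"],
)

pct_cols = [col for col in toy.columns if "PCT" in col]

# One dict per country, keeping only the religions above the threshold
toy["religion"] = [
    {col: row[col] for col in pct_cols if row[col] > THRESHOLD}
    for _, row in toy[pct_cols].iterrows()
]
print(toy["religion"])
```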

For gov_type, the methodology was the following. We apply a function to all the rows of the gov_type column that returns the most common sequences of words, so that we can then manually check which are the main types of government. 
Once the types of government are defined, we loop through each row and replace the value with the mapped categorical government type.

Here is an example of the code we used for the manual check. As we noticed that most gov_types had 2 or 3 words, we kept only sequences of that length.

In [None]:
def phrases(string):
    """Splits the input string on whitespace and returns every sequence of adjacent words, of any length"""
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
            result.append(" ".join(words[start:start+number+1]))
    return result

# Example
phrases('Hi my name is Jacob')

In [None]:
data = pd.read_pickle('DataEnriching/data.pickle')

all_strings = list(data.gov_type)

# Count all occurrences of each phrase
all_phrases = collections.Counter(phrase for subject in all_strings for phrase in phrases(subject))

# Keep the phrases that occur more than once, most frequent first
occurrences = [(phrase, count) for phrase, count in all_phrases.most_common() if count > 1]
# Keep only the phrases made of 2 or 3 words and show the first ten
filtered_occurrences = [phrase for phrase, _ in occurrences if 2 <= len(phrase.split()) <= 3]
filtered_occurrences[:10]

After the manual checking, the types of government considered were:
- parliamentary democracy
- parliamentary republic
- presidential republic
- semi-presidential republic
- presidential democracy
- absolute monarchy
- federal republic
- communist state
- monarchy
- others

Missing data is under the category 'unknown'.
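
One possible way to apply these categories is plain substring matching, sketched below. This is an assumption made for illustration, not necessarily the mapping used in "Country data gathering.ipynb"; `categorize_gov_type` is a hypothetical helper. More specific categories are tested first so that, for instance, 'semi-presidential republic' is not swallowed by 'presidential republic'.

```python
# Categories from the list above, ordered from most to least specific
CATEGORIES = [
    "semi-presidential republic",
    "parliamentary democracy",
    "parliamentary republic",
    "presidential republic",
    "presidential democracy",
    "absolute monarchy",
    "federal republic",
    "communist state",
    "monarchy",
]


def categorize_gov_type(raw):
    """Map a free-text government type to one of the categories (hypothetical helper)."""
    if raw == "unknown":
        return "unknown"
    text = raw.lower()
    for category in CATEGORIES:
        if category in text:
            return category
    return "others"


print(categorize_gov_type("constitutional monarchy"))
print(categorize_gov_type("theocratic republic"))
```

With a dataframe, the same helper could be applied with `data.gov_type.apply(categorize_gov_type)`.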

After all the modifications, the final dataframe with all the data looks like this.

In [None]:
data.head(20)

### Description of the cleaned data
Index:
- name: common country name in english

Features:
- area: land area (in $km^2$)
- ISO2: code ISO 3166-1 alpha-2
- languages: list of official languages
- latlng: latitude and longitude
- language_codes: list of the official language codes
- POP: total number of inhabitants
- religion: dictionary of main religions (PCT>10%). Example of value: {rel1: percentage1, ... , relN: percentageN}. Value can be empty dict {} when we did not have the data (see note below)
- 2016_gdp_total: total gdp in USD
- 2016_gdp_capita: per capita gdp in USD
- gov_type: categorical value. List of categories above.
- active Twitter users: we have not found this data yet. We discuss this problem in section 6.

------
## 5. Data Visualization


------
## 6. Critical Assessment

- **Twitter is biased by nature**

The share and profile of the population that uses Twitter almost certainly differ from country to country. Because we measure awareness with Twitter alone, our results are probably biased. Ideally, to counteract this effect, we would have scraped data from several social media platforms.

- **Location is not always provided**

Users in some countries are more aware of the risks of sharing their location on social media. Discarding the tweets without a location therefore adds to the bias. For example, of two otherwise similar countries, one may have reacted more strongly to an event while very few of its users actually provide a location. The reactions we measure are therefore only true for the subset of users who share a location, and for the results to be meaningful we have to assume that the same proportion of users provides a location in every country.


- **Locations are never perfect, and the location information is not objective**

The mapping is not perfect. As there are multiple cities with the same name all over the world, when we obtain a match we can never be sure which place the user meant. Of course some locations are more probable than others, which is something we will attempt to address for the next milestone. 

We could have used the Google API, for example, but it limits the number of queries, and it won't be perfect either, because Google Maps usually relies on contextual information to find the location you are looking for.


- **Didn't need to use the GDELT dataset for the moment**

We thought that we would need the GDELT dataset to gather extra information on the events, but after exploring it we did not find features that were useful to us. So although we said we would use it, for the moment we can continue without it.

- **Didn't find the number of active Twitter users**

This data exists (for example on Statista), but accessing it requires a yearly subscription fee of 600 dollars. This was not an option as the course was not going to fund it. We asked the TAs whether the lab had access to a similar dataset but it did not seem to be the case. For the time being, this data is not available for the project.


------
## 7. What's next ? 
- Merge the retrieved tweets dataframe with the enriching country information.
- Define how we are going to normalize the number of tweets of each country, given that we didn't find the number of active Twitter users.
- For each event, plot the normalized awareness (number of tweets) and its evolution over time on a choropleth map.
- Define the awareness metric.
- Start thinking about the data story.