In [15]:
import pandas as pd
import numpy as np
import os
import collections

# Abstract

*What's the motivation behind your project? A 150 word description of the project idea, goals, dataset used. What story you would like to tell and why?*

Major events happen regularly all around the world, some involving a high number of casualties, but the resulting international reaction is often far from proportional. Most of the time the largest reaction comes from the place where the incident occurred or from places close by. The objective would be to create an awareness map and determine why people react to an event. From that we would attempt to define an awareness metric. We want to see how factors other than physical proximity come into play, such as country, culture, language and religion. With this we could determine which country has the highest level of international awareness. The project would require the Twitter API to acquire hashtag-specific tweets with geolocation and thereby measure the awareness and reactions of different communities to a given event. GDELT would be used to recover standardised information regarding the different events.

# Milestone 2 : Data Collection and Description
------
In this notebook we are going to review everything that was done so far in the project and evaluate what the remaining tasks are.

____
____


## 1.  Identifying Relevant Tweets
-----

### 1.1 Reminder of the events that we chose

**Case 1**: Events of similar magnitude, civilian casualties, 6 months timeframe

- Nigeria 30/01/2016, Shooting 65 Deaths, 136 Injured
- Belgium 22/03/2016, Bombing in airport, 35 Deaths, 300+ Injured
- Pakistan 27/03/2016, Bombing, 70 Deaths, 300 Injured
- US 12/06/2016 Shooting in gay bar, 49 Deaths, 53 Injured
- Turkey 28/06/2016, Shooting + bombing in airport, 45 Deaths, 230 Injured

**Case 2**: Events of different magnitude

- France 07/01/2015, Charlie Hebdo, 12 Deaths, 11 Injured
- Nigeria 08/01/2015, Massacre Boko Haram, 200+ Deaths, unknown Injured
- Lebanon 10/01/2015, suicide bombing, 9 Deaths, 30+ Injured


### 1.2 Hashtags as Key Elements for Searching

On Twitter, hashtags are mainly used during events. In our case they are the perfect tool to evaluate awareness across the world. They are very convenient because a hashtag is often specific to one event and tends to be in English even when the rest of the tweet is in a different language. In order to find all the tweets related to an event, we needed to find as many related hashtags as possible, in as many languages as possible.

### 1.3 Selection of Hashtags 
For the selection of the hashtags we need to take into account these factors:
- Which hashtags do we select?
- How far do we have to go in time to make sure we get all the tweets to study the time evolution?
- Hashtags can be written in different languages


#### Which hashtags do we select?
To select the hashtags we used the website http://hashtagify.me/hashtag/smm which, given an initial hashtag, returns the most related hashtags based on the timeframe and the hashtag similarity. In addition, we ran an advanced search on Twitter to manually check that the hashtags were related to the event and to look for other hashtags that may not appear on the website (some people use more than one related hashtag, which is why we also checked manually).

Here is an example of the hashtags that we selected for Charlie Hebdo:

- PrayForParis
- JeSuisCharlie
- NousSommesChalie
- CharlieHebdo
- LaFranceEstCharlie
- LeMondeestCharlie
- IAmCharlie
- ParisShooting
- FreedomOfSpeech
- somCharlie
- soyCharlie
- SomCharlieHebdo
- YoSoyCharlie
- YoTambienSoyCharlieHebdoç
- أنا_شارلي  
- IchBinCharlie
- EuSouCharlie
- JsemCharlie
- TodosSomosCharlieHebdo
- ЯШарлиЭбдо
- من‌شارلی‌هستم

#### How far do we have to go in time to make sure we get all the tweets to study the time evolution?
We decided to retrieve only the tweets posted from the day of the event until at most one week later. We consider this sufficient because we concluded that after one week people normally stop commenting massively on this kind of event. Although we think this assumption holds, it might not be true for Charlie Hebdo, which is the most commented event; after analyzing all the events we can always rerun the code that gathers the tweets and get more.
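
As a minimal sketch of this time window, the helper below computes the start and end dates of a search from an event date. The function name `search_window` and its parameters are hypothetical and are not taken from the acquisition script. `days_before` is an assumption added so that a search can start slightly before the event.

```python
from datetime import date, timedelta

def search_window(event_date, days_before=0, days_after=7):
    """Return (start, end) dates bracketing an event (illustrative helper)."""
    start = event_date - timedelta(days=days_before)
    end = event_date + timedelta(days=days_after)
    return start, end

# Nigeria event of 30/01/2016, searched until one week after
start, end = search_window(date(2016, 1, 30))
print(start, end)
```

Widening the window later only requires changing `days_after` and rerunning the acquisition.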

#### Hashtags can be written in different languages
To cover as many languages as possible, we first collected the main hashtags in English and then manually checked whether their translations were also used. We searched for the translations on the website above and also checked manually on Twitter for similar hashtags in languages other than English.

____
____

## 2.  Tweets Acquisition
We had originally planned to use the Twitter dataset that was given in the course. Unfortunately it contained only 10% of the tweets in a given time period and included neither the location of the user nor the user profile. Because of this we decided to collect the tweets about specific events ourselves.

------
### 2.1 Twitter API 
Our initial idea was to get the information we needed with the Twitter API, but there again we encountered several problems: 

- The **Rate Limit** of the Twitter API: it would have taken a lot of time to get the tweets of a specific event, but we were ready to wait and launch the code on several computers (or on clusters)
- The **Search Query** limitations: after designing code that would allow us to get the tweets by searching specific hashtags over a time interval, we discovered a huge limitation: tweets can only be searched with the API if they are *less than one week old*. 

So we had to discard the idea of using the Twitter API.

------
### 2.2 Scraping the Tweets Manually 
Fortunately, the Twitter HTML interface (the website) allows us to search for any query over any time interval, so we decided to get the data by scraping the website directly. For that we use **PhantomJS**, a browser without a user interface, and **Selenium**, a Python package that allows us to load URLs in this browser and scroll down the search page in order to load results. Once the page is loaded, we use **Beautiful Soup 4** with the **LXML** parser to extract every tweet on the page.

This was done using one script: [`tweet_acquisition.py`](ADA2017_Homeworks/Project/TweetAcquisition/tweet_acquisition.py). 
For each event a new folder is created (for example here `Nigeria_1`). The log of the tweet acquisition is saved in this folder under a matching name (here `Nigeria_1.log`). Here is an example of the start of the log file: 

-----
```text
------------------------------------------- ACQUISITION PARAMETERS -------------------------------------------
Started at : 2017-11-27 10:10:47.485905
Tweets saved in ./Nigeria_1/
Searching from 2016-01-29 to 2016-02-06
Hastags used : ['Dalori', 'Dalorilivesmatter', 'Nigeria', 'BokoHaram', 'Bokoharam', 'bokoharam', 'Borno', 'StopBokoHaram', 'PrayForNigeria']
------------------------------------------- STARTING ACQUISITION -------------------------------------------
1 - Tweets : 2772 - Total : 2772 - Date : 2016-02-05 07:39:06 - Elapsed Time : 810.799 s - Delay : 810.799 s - Rate : 3.419 tw/s - Executed at 2017-11-27 10:24:20.470199
     + First Tweet Time : 2016-02-05 22:11:24
     + Last Tweet Time : 2016-02-05 07:39:06
```
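
To monitor an acquisition from its log, the numbered segment lines can be parsed back into values. The sketch below assumes every segment line follows exactly the layout of the line shown above. The regular expression and the name `parse_segment_line` are illustrative and are not part of `tweet_acquisition.py`.

```python
import re
from datetime import datetime

# Pattern written against the single segment line shown in the log excerpt
SEGMENT_RE = re.compile(
    r"^(?P<segment>\d+) - Tweets : (?P<tweets>\d+) - Total : (?P<total>\d+)"
    r" - Date : (?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r" - Elapsed Time : (?P<elapsed>[\d.]+) s"
    r" - Delay : (?P<delay>[\d.]+) s"
    r" - Rate : (?P<rate>[\d.]+) tw/s"
)

def parse_segment_line(line):
    """Return the fields of one segment line as a dict, or None if it does not match."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    return {
        "segment": int(match["segment"]),
        "tweets": int(match["tweets"]),
        "total": int(match["total"]),
        "date": datetime.strptime(match["date"], "%Y-%m-%d %H:%M:%S"),
        "elapsed_s": float(match["elapsed"]),
        "delay_s": float(match["delay"]),
        "rate_tw_per_s": float(match["rate"]),
    }

line = ("1 - Tweets : 2772 - Total : 2772 - Date : 2016-02-05 07:39:06 - "
        "Elapsed Time : 810.799 s - Delay : 810.799 s - Rate : 3.419 tw/s - "
        "Executed at 2017-11-27 10:24:20.470199")
info = parse_segment_line(line)
print(info["tweets"], info["rate_tw_per_s"])
```

The logged rate is consistent with the other fields: 2772 tweets in 810.799 s is about 3.419 tweets per second.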

------
The query url is created using the list of hashtags specified inside the script. The explanations on how to use the scripts are in the [`README.md`](ADA2017_Homeworks/Project/TweetAcquisition/README.md) file.
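
As a rough illustration of how such a query URL can be assembled, the sketch below joins the hashtags with `OR` and appends Twitter's `since:` and `until:` search operators. The base URL, the exact query layout and the name `build_search_url` are assumptions. The script may build its URL differently.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_search_url(hashtags, since, until, base="https://twitter.com/search"):
    """Build a search URL for a list of hashtags and a date range (illustrative)."""
    query = " OR ".join("#" + tag for tag in hashtags)
    query += f" since:{since} until:{until}"
    # urlencode escapes '#', ':' and spaces so the query survives in a URL
    return base + "?" + urlencode({"q": query})

url = build_search_url(["Dalori", "Borno"], "2016-01-29", "2016-02-06")
# Decode the query again to check what would be searched
print(parse_qs(urlparse(url).query)["q"][0])
```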


The tweets are acquired in segments: we scroll the page 500 times before parsing the HTML and saving a pickle containing the raw data. Each pickle contains an average of 7000 tweets. Here is an example of the structure of the acquired dataframe:


In [16]:
# pandas can read the pickled dataframe directly, so no separate pickle import is needed
df = pd.read_pickle('TweetAcquisition/Nigeria_1/Tweets_1.pickle')
df.head(4)

We have scraped as much information as possible from the HTML page of the search query, but we are still missing the most important thing: the location of the tweet.

------
### 2.3 Scraping the location of the tweets 
From each tweet we take the `user_name` field and go to the user's profile to get the location that the user has written there. 
The script that does this is [`location_acquisiton.py`](ADA2017_Homeworks/Project/TweetAcquisition/location_acquisiton.py). As we don't need to scroll down the page, we directly use the **requests** Python package combined with **Beautiful Soup 4** and **LXML**. As the code is very slow, we launch the process several times in parallel so that the locations are acquired at the same rate as the tweets. 
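
The parallel lookup can be sketched with a thread pool from the standard library. This is an assumption made for illustration: the project launches several processes, whereas this example uses threads inside one process. `fetch_locations` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real profile request so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_locations(user_names, fetch_one, max_workers=8):
    """Map each distinct user name to the location returned by fetch_one."""
    unique_names = list(dict.fromkeys(user_names))  # keep order, drop repeats
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        locations = list(pool.map(fetch_one, unique_names))
    return dict(zip(unique_names, locations))

def fake_fetch(user_name):
    # Placeholder for the real profile request; the values are made up
    profiles = {"alice": "Lagos, Nigeria", "bob": "Brussels"}
    return profiles.get(user_name, "")

locations = fetch_locations(["alice", "bob", "alice", "carol"], fake_fetch)
print(locations)
```

Looking up each user only once avoids repeated requests when the same user has several tweets.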

Below we display the head of the *Located* version of the pickled dataframe. 



In [17]:
# pandas can read the pickled dataframe directly, so no separate pickle import is needed
df = pd.read_pickle('TweetAcquisition/Nigeria_1/Located_Tweets_1.pickle')
df.head(4)

Now we have the raw location information for each event. We need to geocode it to the associated country. 

------

## 3.  Geocoding the tweets

------
### 3.1 



------
## 4.  Enriching the Data
------
### 4.1 Need for enriching with data of each country
So far we have explained the whole process from retrieving the tweets to geolocating them. That would be enough to visualize the data and get a general overview of the awareness in each country. However, we didn't want to stop here. As explained in Milestone 1, in part two of our project we want to study the different factors that influence the level of awareness of a given country to a certain event. 

In this section, we describe the process of gathering and cleaning the data. Where did we take the datasets from? What do they look like? We will include a description of the raw data and the necessary steps to clean and transform the data according to our needs.

Basically, we want a final dataframe with one row per country that contains all the features of that country. The necessary features are listed in the last cell of section 4.1.3.

### 4.1.1. Size of data
Because there are relatively few countries in the world, we did not have any problems with data size or memory usage in this part. The dataframes all have around 250 rows, one per country, and fewer than 100 features.

### 4.1.2. Description of the raw data
The datasets that we used were taken from the Internet. They are all open projects that use official data and can be freely used for studies. We kept a few more features than we actually need because we think they could be useful in the future. Here is the list of features and a description of the raw data that we got:

Link dataset 1: https://github.com/mledoze/countries
- Country name: dictionary of dictionaries.
    - common : common name in english
    - official : official name in english
    - native : list of all native names
        - key: ISO 639-3 language code
        - value: name object
            - key: official - official name translation
            - key: common - common name translation
            
- Country code: code ISO 3166-1 alpha-2
- Country code: code ISO 3166-1 alpha-3
- Borders: list of all country codes (alpha-3) that touch the border of each country
- Land area (in $km^2$)
- Latitude and Longitude: in a list [latitude, longitude]
- Official languages: dictionary of dictionaries.
    - key: ISO 639-3 language code
    - value: name of the language in english

Link dataset 2: http://www.thearda.com/Archive/Files/Downloads/WRDNATL_DL2.asp
- Population: total
- Religions: each religion is one column and the data is given both in total number of adherents in each country and also as a percentage
- Country code: code ISO 3166-1 alpha-3

Link dataset 3a: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
- Total GDP: in USD

Link dataset 3b: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
- Per capita GDP: in USD

Link dataset 4: https://github.com/opendatajson/factbook.json
- Type of government: not categorized. Needs string processing to create categorical government types
- Country code: GEC codes

##### Missing feature
- Number of active Twitter users.

This issue will be addressed in section 6.

#### Examples of the first dataset

In [18]:
# Show the useful columns as they are directly read from the file
cols = ['area', 'cca2', 'cca3', 'borders', 'name', 'latlng', 'languages']
countries = pd.read_json(r'DataEnriching/countries.json')[cols]
countries.head()

Unnamed: 0,area,cca2,cca3,borders,name,latlng,languages
0,180.0,AW,ABW,[],"{'common': 'Aruba', 'official': 'Aruba', 'nati...","[12.5, -69.96666666]","{'nld': 'Dutch', 'pap': 'Papiamento'}"
1,652230.0,AF,AFG,"[IRN, PAK, TKM, UZB, TJK, CHN]","{'common': 'Afghanistan', 'official': 'Islamic...","[33, 65]","{'prs': 'Dari', 'pus': 'Pashto', 'tuk': 'Turkm..."
2,1246700.0,AO,AGO,"[COG, COD, ZMB, NAM]","{'common': 'Angola', 'official': 'Republic of ...","[-12.5, 18.5]",{'por': 'Portuguese'}
3,91.0,AI,AIA,[],"{'common': 'Anguilla', 'official': 'Anguilla',...","[18.25, -63.16666666]",{'eng': 'English'}
4,1580.0,AX,ALA,[],"{'common': 'Åland Islands', 'official': 'Åland...","[60.116667, 19.9]",{'swe': 'Swedish'}


In [19]:
# Example of format of name column and language column 
print('name', countries.name[0], '\n')
print('languages', countries.languages[0])

name {'common': 'Aruba', 'official': 'Aruba', 'native': {'nld': {'official': 'Aruba', 'common': 'Aruba'}, 'pap': {'official': 'Aruba', 'common': 'Aruba'}}} 

languages {'nld': 'Dutch', 'pap': 'Papiamento'}


#### Examples of the second dataset

In [20]:
# Show the useful columns as they are directly read from the file  
pop_rel_df = pd.read_excel('DataEnriching/World Religion Dataset - National Religion Dataset.xlsx')
cols = ['YEAR', 'ISO3', 'POP', 'DUALREL'] + \
        [col for col in pop_rel_df.columns if 'PCT' in col]
    
pop_rel_df = pop_rel_df[cols]
pop_rel_df.head()

Unnamed: 0,YEAR,ISO3,POP,DUALREL,CHPRTPCT,CHCATPCT,CHORTPCT,CHANGPCT,CHOTHPCT,CHGENPCT,...,SHGENPCT,BAGENPCT,TAGENPCT,JAGENPCT,COGENPCT,SYGENPCT,ANGENPCT,NORELPCT,OTGENPCT,SUMPCT
0,1945,USA,139928000,0,0.4722,0.2767,0.007999,0.017199,0.013999,0.788,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1635,0.003899,0.9961
1,1950,USA,152271008,0,0.48,0.28,0.019999,0.019999,0.007699,0.8077,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1482,0.0041,0.9959
2,1955,USA,165931000,0,0.4779,0.2796,0.020799,0.015499,0.013699,0.8076,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1404,0.019299,0.9807
3,1960,USA,180671000,0,0.502,0.28,0.018499,0.014999,0.016099,0.8315,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.119299,0.007599,0.9924
4,1965,USA,194631000,0,0.4838,0.3327,0.024599,0.014499,0.004999,0.8607,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101999,0.003,0.997


As we have data for every year, we selected the most recent data which is from 2010.

We only selected the columns that hold the percentage of adherents, not the totals. The percentage columns have PCT in the column name. For any given religion, we have the percentage of adherents for its main branches but also for the religion as a whole (e.g. for Christianity we have CHGENPCT: total percentage of adherents, but also CHPRTPCT: percentage of Protestants, CHANGPCT: percentage of Anglicans, etc.). For our project we are only interested in the total percentage of adherents of the whole religion, not its branches.

To decide whether a religion is practiced in a country or not, we will need to set a threshold on the percentage of adherents.

We also took the DUALREL and the SUMPCT columns to show that the sum of the percentages ('SUMPCT') can add up to more than 1 because in some countries they have dual religion as shown in the next example.

In [21]:
pop_rel_df[pop_rel_df.DUALREL == 1][['ISO3', 'DUALREL', 'SUMPCT']].head(1)

Unnamed: 0,ISO3,DUALREL,SUMPCT
36,CUB,1,1.7277


#### Examples of the third dataset

In [22]:
gdp_total = pd.read_csv('DataEnriching/gdp_total.csv', skiprows=3)[['Country Name', 'Country Code', '2016']]
gdp_total.rename(columns={'2016': '2016_gdp_total', 'Country Code': 'ISO3'}, inplace=True)
gdp_capita = pd.read_csv('DataEnriching/gdp_per_capita.csv', skiprows=4)[['Country Name', 'Country Code', '2016']]
gdp_capita.rename(columns={'2016': '2016_gdp_capita', 'Country Code': 'ISO3'}, inplace=True)

gdp_df = pd.merge(gdp_total, gdp_capita, on=['ISO3', 'Country Name'])
gdp_df.head()

Unnamed: 0,Country Name,ISO3,2016_gdp_total,2016_gdp_capita
0,Aruba,ABW,,
1,Afghanistan,AFG,19469020000.0,561.778746
2,Angola,AGO,89633160000.0,3110.808183
3,Albania,ALB,11926890000.0,4146.89625
4,Andorra,AND,,


#### Examples of the fourth dataset

In [23]:
# Names of the folders
region_folders = ['africa', 'australia-oceania', 'central-america-n-caribbean', 'central-asia', 'east-n-southeast-asia',
          'europe', 'middle-east', 'north-america', 'south-america', 'south-asia']

# Collect one record per country file, then build GEC_gov_type_df in one go
# (DataFrame.append was removed in pandas 2.0)
records = []
for region in region_folders:
    region_dir = os.path.join('DataEnriching', 'factbook.json', region)
    for country_file in os.listdir(region_dir):
        df = pd.read_json(os.path.join(region_dir, country_file))
        try:
            gov_type = df.loc['Government type', 'Government']['text']
        except (KeyError, TypeError):
            # Some countries have no government type entry
            gov_type = 'unknown'

        # The first two characters of the file name are the GEC code
        records.append({'GEC_code': country_file[:2], 'gov_type': gov_type})

GEC_gov_type_df = pd.DataFrame(records)
GEC_gov_type_df.head()

Unnamed: 0,GEC_code,gov_type
0,ag,presidential republic
1,ao,presidential republic
2,bc,parliamentary republic
3,bn,presidential republic
4,by,presidential republic


This dataset doesn't contain any ISO codes for the countries. We could only get the GEC code, so we will have to map it to the ISO codes that we use to merge all the information into a single dataframe.

With respect to the government type, we can see in the cell below that it is not standardized. We will have to analyse each value and group them into broader categories so that we end up with a categorical feature.

In [24]:
# Example to show that the gov_type is not standardized
print(GEC_gov_type_df.gov_type[10])
print(GEC_gov_type_df.gov_type[29])
print(GEC_gov_type_df.gov_type[221])
print(GEC_gov_type_df.gov_type[234])

presidential republic
parliamentary constitutional monarchy
parliamentary democracy (Parliament); self-governing overseas territory of the UK
parliamentary democracy (Legislative Assembly); self-governing overseas territory of the UK


### 4.1.3. Filtering, transforming the data according to our needs
In this section the pipeline will be inverted. We will explain all the cleaning and selection process and finally we will present the final dataframe with a description of the features.

The actual code of all the procedures is in the notebook: "Country data gathering.ipynb".
Here we just load the pickled dataframes in order to show the results.

#### Dataset 1
For this dataset we added some columns to the initial dataframe. From the original languages column we extracted the official languages and their codes and added them as two separate columns, dropping the original one. The same procedure was applied to the name column to create the name and name_native columns. The value used for the name column is the 'common' English name of the country (see the raw data description for all possible options).
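
A minimal sketch of this extraction on the Aruba record shown earlier. The helper names `split_languages` and `split_name` are hypothetical, and taking the 'common' entry of each native name is an assumption about how name_native was built. The actual code is in the separate data gathering notebook.

```python
def split_languages(languages):
    """Split {code: language} into a list of codes and a list of language names."""
    return list(languages.keys()), list(languages.values())

def split_name(name):
    """Return the common English name and the list of common native names."""
    common = name["common"]
    native = [entry["common"] for entry in name.get("native", {}).values()]
    return common, native

# Raw values of the first row (Aruba) as printed above
raw_name = {
    "common": "Aruba",
    "official": "Aruba",
    "native": {
        "nld": {"official": "Aruba", "common": "Aruba"},
        "pap": {"official": "Aruba", "common": "Aruba"},
    },
}
raw_languages = {"nld": "Dutch", "pap": "Papiamento"}

language_codes, languages = split_languages(raw_languages)
name, name_native = split_name(raw_name)
print(name, language_codes, languages, name_native)
```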

In [25]:
countries_df = pd.read_pickle('DataEnriching/countries_df.pickle')
countries_df.head()

Unnamed: 0,area,ISO2,ISO3,ISO_num,borders,name,language_codes,latlng,languages,name_native
0,180.0,AW,ABW,533,[],Aruba,"[nld, pap]","[12.5, -69.96666666]","[Dutch, Papiamento]","[Aruba, Aruba]"
1,652230.0,AF,AFG,4,"[IRN, PAK, TKM, UZB, TJK, CHN]",Afghanistan,"[prs, pus, tuk]","[33, 65]","[Dari, Pashto, Turkmen]","[افغانستان, افغانستان, Owganystan]"
2,1246700.0,AO,AGO,24,"[COG, COD, ZMB, NAM]",Angola,[por],"[-12.5, 18.5]",[Portuguese],[Angola]
3,91.0,AI,AIA,660,[],Anguilla,[eng],"[18.25, -63.16666666]",[English],[Anguilla]
4,1580.0,AX,ALA,248,[],Åland Islands,[swe],"[60.116667, 19.9]",[Swedish],[Åland]


#### Dataset 2
The only things that we did were to select the most recent data of 2010 with a simple query and select the columns that we will need, which are the percentages.

In [26]:
# Showing just data of 2010
pop_rel_df = pd.read_pickle('DataEnriching/pop_rel_df.pickle')
pop_rel_df.head()

Unnamed: 0,ISO3,COUNTRY,POP,DUALREL,CHPRTPCT,CHCATPCT,CHORTPCT,CHANGPCT,CHOTHPCT,CHGENPCT,...,SHGENPCT,BAGENPCT,TAGENPCT,JAGENPCT,COGENPCT,SYGENPCT,ANGENPCT,NORELPCT,OTGENPCT,SUMPCT
0,USA,United States of America,312750000,0,0.3829,0.2507,0.022499,0.015499,0.0738,0.7454,...,0.0005,0.0015,0.0,0.0003,0.0003,0.002599,0.005699,0.19,0.0025,0.9975
1,CAN,Canada,34500000,0,0.2298,0.4202,0.022799,0.078899,0.014399,0.7661,...,0.0,0.0005,9.9e-05,9.9e-05,9.9e-05,0.0008,0.0021,0.1643,0.001,0.999
2,BHS,Bahamas,313312,0,0.676,0.14,0.0,0.15,0.0,0.966,...,0.0,0.0,0.0003,0.0,0.0,0.0,0.0032,0.028999,0.0005,0.9995
3,CUB,Cuba,11241161,1,0.048899,0.6,0.0,0.0,0.009999,0.6589,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.1315,0.0,1.2935
4,HTI,Haiti,9760832,1,0.1,0.72,0.0,0.0,0.0,0.82,...,0.0,0.0009,0.0,0.0,0.0,0.45,0.0,0.1,0.0,1.3711


#### Dataset 3
No cleaning needed

#### Dataset 4
We only read the JSON files and extracted the government type. For some countries this data was missing, so we set the variable to 'unknown'. The GEC code was taken from the first two characters of the file name (e.g. sp for Spain was taken from the sp.json file).

To merge the datasets together we need ISO3 codes, so we mapped the GEC codes. To do that we did some web scraping in http://www.statoids.com/wab.html and then we merged into a single dataframe.
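
The mapping step can be sketched with a plain dictionary. The five code pairs below are the ones visible in the first rows of the resulting dataframe. The helper `add_iso3` is hypothetical, and in the project the full table comes from the scraped page rather than a hand-written dict.

```python
# GEC -> ISO3 pairs taken from the first rows of gov_type_df
GEC_TO_ISO3 = {"ag": "DZA", "ao": "AGO", "bc": "BWA", "bn": "BEN", "by": "BDI"}

def add_iso3(rows, gec_to_iso3):
    """Return copies of the rows with an ISO3 field (None when the GEC code is unmapped)."""
    return [{**row, "ISO3": gec_to_iso3.get(row["GEC_code"])} for row in rows]

rows = [
    {"GEC_code": "ag.json"[:2], "gov_type": "presidential republic"},
    {"GEC_code": "bc.json"[:2], "gov_type": "parliamentary republic"},
    {"GEC_code": "zz", "gov_type": "unknown"},  # made-up code to show the unmapped case
]
mapped = add_iso3(rows, GEC_TO_ISO3)
print([row["ISO3"] for row in mapped])
```

Leaving unmapped codes as `None` makes it easy to list the countries that still need a manual fix before merging.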

In [27]:
gov_type_df = pd.read_pickle('DataEnriching/gov_type_df.pickle')
gov_type_df.head()

Unnamed: 0,GEC_code,gov_type,ISO2,ISO3,ISO_num
0,ag,presidential republic,DZ,DZA,12
1,ao,presidential republic,AO,AGO,24
2,bc,parliamentary republic,BW,BWA,72
3,bn,presidential republic,BJ,BEN,204
4,by,presidential republic,BI,BDI,108


#### Merging into one dataframe & transforming columns
We merged the four previous dataframes on ISO3 country codes. Now we need to categorize the gov_type column and compress the religions into one column.

For the religions, we proceeded as follows:
- we only consider that a country has a certain religion if the corresponding percentage is greater than a threshold of 10%

So we run a loop over every religion in every country and only keep the ones that pass the threshold.
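
The thresholding can be sketched as a dictionary filter. The function name `main_religions` is hypothetical. The input uses a few of the 2010 values for the USA from the table above, keyed by their original column names.

```python
def main_religions(percentages, threshold=0.10):
    """Keep only the religions whose share is strictly above the threshold."""
    return {rel: pct for rel, pct in percentages.items() if pct > threshold}

# A subset of the 2010 USA row (general columns only)
usa_2010 = {
    "CHGENPCT": 0.7454,
    "SYGENPCT": 0.002599,
    "ANGENPCT": 0.005699,
    "NORELPCT": 0.19,
    "OTGENPCT": 0.0025,
}
print(main_religions(usa_2010))
```

A country with no value above the threshold, or with no data at all, ends up with an empty dictionary.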

For the gov_type column the methodology was the following. We apply a function to every row of the gov_type column that returns the most common sequences of words, so that we can then manually check which are the main types of government. 
Once the types of government are defined, we loop through each row and replace the value with the mapped categorical government type.

Here is an example of the code we used for the manual check. As we noticed that most gov_types had 2 or 3 words, we filtered for sequences of that length.

In [28]:
def phrases(string):
    """Splits the input string on whitespace and returns all possible substrings of any length"""
    words = string.split()
    result = []
    for number in range(len(words)):
        for start in range(len(words)-number):
             result.append(" ".join(words[start:start+number+1]))
    return result

# Example
phrases('Hi my name is Jacob')

['Hi',
 'my',
 'name',
 'is',
 'Jacob',
 'Hi my',
 'my name',
 'name is',
 'is Jacob',
 'Hi my name',
 'my name is',
 'name is Jacob',
 'Hi my name is',
 'my name is Jacob',
 'Hi my name is Jacob']

In [29]:
data = pd.read_pickle('DataEnriching/data.pickle')

all_strings = list(data.gov_type)

# Count how often each word sequence occurs across all gov_type strings
all_phrases = collections.Counter(phrase for subject in all_strings for phrase in phrases(subject))

# Keep the sequences that occur more than once and are 2 or 3 words long
occurrences = [(phrase, count) for phrase, count in all_phrases.items() if count > 1]
filtered_occurrences = [phrase for phrase, _ in occurrences if 2 <= len(phrase.split()) <= 3]
filtered_occurrences[:10]

['parliamentary democracy',
 'presidential republic',
 'parliamentary republic',
 'presidential democracy',
 'federal republic',
 'communist state']

After the manual check, the types of government considered were:
- parliamentary democracy
- parliamentary republic
- presidential republic
- semi-presidential republic
- presidential democracy
- absolute monarchy
- federal republic
- communist state
- monarchy
- other

Missing data is under the category 'unknown'.
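
One possible way to apply these categories is simple substring matching, sketched below. This is an assumption for illustration and not necessarily the mapping used in the project. Longer category names are tested first so that, for example, 'semi-presidential republic' is not absorbed by 'presidential republic'.

```python
CATEGORIES = [
    "parliamentary democracy", "parliamentary republic", "presidential republic",
    "semi-presidential republic", "presidential democracy", "absolute monarchy",
    "federal republic", "communist state", "monarchy",
]

def categorize(gov_type):
    """Map a free-text government type to one of the categories (illustrative)."""
    text = gov_type.strip().lower()
    if not text or text == "unknown":
        return "unknown"
    # Longest names first, so the most specific category wins
    for category in sorted(CATEGORIES, key=len, reverse=True):
        if category in text:
            return category
    return "other"

print(categorize("parliamentary democracy (Parliament); self-governing overseas territory of the UK"))
print(categorize("parliamentary constitutional monarchy"))
```

With this rule a value such as 'parliamentary constitutional monarchy' falls under 'monarchy', which is one reason a manual check of the results remains useful.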

After all the modifications, the final dataframe with all the data looks like this.

In [30]:
data

Unnamed: 0_level_0,area,ISO2,languages,latlng,language_codes,POP,religion,2016_gdp_total,2016_gdp_capita,gov_type,Active tweeter users
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Aruba,180,AW,"[Dutch, Papiamento]","[12.5, -69.96666666]","[nld, pap]",,{},,,parliamentary democracy,TBD
Afghanistan,652230,AF,"[Dari, Pashto, Turkmen]","[33, 65]","[prs, pus, tuk]",27000000,{'Islam': 0.9956},1.9469e+10,561.779,other,TBD
Angola,1.2467e+06,AO,[Portuguese],"[-12.5, 18.5]",[por],19114176,{'Christianism': 0.8912},8.96332e+10,3110.81,presidential republic,TBD
Anguilla,91,AI,[English],"[18.25, -63.16666666]",[eng],,{},,,parliamentary democracy,TBD
Åland Islands,1580,AX,[Swedish],"[60.116667, 19.9]",[swe],,{},,,unknown,TBD
Albania,28748,AL,[Albanian],"[41, 20]",[sqi],3195525,"{'Christianism': 0.2144, 'Islam': 0.63, 'Non-r...",1.19269e+10,4146.9,parliamentary republic,TBD
Andorra,468,AD,[Catalan],"[42.5, 1.5]",[cat],85500,{'Christianism': 0.907},,,parliamentary democracy,TBD
United Arab Emirates,83600,AE,[Arabic],"[24, 54]",[ara],6236650,"{'Islam': 0.6748, 'Hindu': 0.2225}",3.48743e+11,37622.2,other,TBD
Argentina,2.7804e+06,AR,"[Guaraní, Spanish]","[-34, -64]","[grn, spa]",40399992,"{'Christianism': 0.8515, 'Non-religious': 0.12}",5.45866e+11,12449.2,presidential republic,TBD
Armenia,29743,AM,"[Armenian, Russian]","[40, 45]","[hye, rus]",3245781,{'Christianism': 0.951},1.05473e+10,3606.15,presidential republic,TBD


### Description of the cleaned data
Index:
- name: common country name in english

Features:
- area: land area (in $km^2$)
- ISO2: code ISO 3166-1 alpha-2
- languages: list of official languages
- latlng: latitude and longitude
- language_codes: list of the official language codes
- POP: total number of inhabitants
- religion: dictionary of main religions (PCT>10%). Example of value: {rel1: percentage1, ... , relN: percentageN}. Value can be empty dict {} when we did not have the data (see note below)
- 2016_gdp_total: total gdp in USD
- 2016_gdp_capita: per capita gdp in USD
- gov_type: categorical value. List of categories above.
- active tweeter users: we did not find this data yet. We explain our solution to this problem in section 6.

------
## 5. Data Visualization


------
## 6. Critical Assessment

- Fact that twitter is biased by nature

It is very unlikely that the distribution of Twitter users is the same across countries, which probably introduces a bias because we only used Twitter to define the awareness. Ideally, to counteract this effect, we should have scraped data from several social media platforms.  


- Locations will never be perfect: the location information is not objective

- We could have used the Google API, for example, but it is limited in the number of queries, and it won't be perfect either because Google Maps usually uses contextual information to find the location you are looking for

- Didn't need to use GDELT dataset for the moment

We thought that we would need the GDELT dataset to gather extra information on the events, but after searching it we found that it did not provide useful features. So although we said we would use it, for the moment we can continue without it.

- Didn't find number of active twitter users

This data might not exist or it might not be available. We searched for it but we could not find it. 

------
## 7. What's next ? 


- Merge retrieved tweets dataframe with the enriching country information.
- Define how we are going to normalize the number of tweets of each country given that we didn't find the number of active users in Twitter.
- For each event, plot normalized awareness (number of tweets) & time evolution in a choropleth map.


- awareness metric
- Start thinking about the data story