# Milestone 2 : The notebook #

## Introduction ## 

In this notebook the reader will find a pre-analysis of the dataset, some code samples showing that the ideas presented in the README are actually feasible, and finally a couple of data visualisations exploring how we will use the data at our disposal.

In [4]:
import numpy as np
import pandas as pd
import folium
import urllib
import json
import socket
from ipwhois import IPWhois
import pycountry


To showcase what we can do with the current data, we downloaded three example files.

The way we handle big data will be explained later in the notebook.

In [5]:
# Load data
# 20150218230000.export.CSV  20150218230000.gkg.csv  20150218230000.mentions.CSV
DATA_PATH = "data/"
MAP_PATH = DATA_PATH + "world.geo.json/countries/"
COUNTRY_CODE_DATA = DATA_PATH + "country-codes/data/country-codes.csv"

export_d = pd.read_csv(DATA_PATH + "20150218230000.export.CSV",sep='\t', names=get_export_names())
gkg = pd.read_csv(DATA_PATH + "20150218230000.gkg.csv",sep='\t', header=None)
mention_d = pd.read_csv(DATA_PATH + "20150218230000.mentions.CSV", sep="\t", names=get_mentions_names())

### Helper functions ###

The next cell contains a few functions that help us interact with the data easily.

```isNaN```: a quick way to check whether a field equals float('nan').

```get_export_names```: fetches the column names for the export table.

```get_mentions_names```: same as above, but for the mentions table.

```get_map_site```: returns two dictionaries. The first maps a website extension to the ```ISO3166-1-Alpha-3``` code of the corresponding country; the second is the reverse mapping.

In [6]:
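def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num

def get_export_names():
    # Column names are stored space-separated on the first line of the file.
    # The last name keeps that line's trailing newline (e.g. 'SOURCEURL\n').
    with open(DATA_PATH + "event_table_name", "r") as file:
        names = file.readlines()[0].split(" ")
    return names

def get_mentions_names():
    # Same layout as the export names file
    with open(DATA_PATH + "mentions_table_name", "r") as file:
        names = file.readlines()[0].split(" ")
    return names

def get_map_site():
    # Returns (extension -> alpha-3 code, alpha-3 code -> extension)
    codes = pd.read_csv(COUNTRY_CODE_DATA)
    return dict(zip(codes['TLD'], codes['ISO3166-1-Alpha-3'])), dict(zip(codes['ISO3166-1-Alpha-3'], codes['TLD']))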
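
As a quick sanity check of the two ideas behind these helpers, the sketch below shows why `num != num` detects NaN and how `dict(zip(...))` builds a forward and a reverse mapping. The short lists of extensions and country codes are sample values chosen for illustration; the real ones come from `country-codes.csv`.

```python
def isNaN(num):
    # NaN is the only float value that is not equal to itself
    return num != num

# Sample values standing in for two columns of country-codes.csv
tlds = [".fr", ".ch", ".tr"]
alpha3 = ["FRA", "CHE", "TUR"]

# zip pairs the two lists element by element; dict turns the pairs into a lookup
tld_to_alpha3 = dict(zip(tlds, alpha3))
alpha3_to_tld = dict(zip(alpha3, tlds))

print(isNaN(float("nan")))   # True
print(isNaN(12.5))           # False
print(tld_to_alpha3[".ch"])  # CHE
print(alpha3_to_tld["TUR"])  # .tr
```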


## Analysis of the data ##

### Peek at the data ###

In the next cells we display the first rows of all three dataframes to get a concrete view of their content.

In [102]:
export_d

Unnamed: 0,GlobalEventID,Day,MounthYear,Year,FractionDate,Actor1Code,Actor1Name,Actor1CountryCode,Actor1KnownGroupCode,Actor1EthnicCode,...,ActionGeo_Type,ActionGeo_Fullname,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_ADM2Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,DATEADDED,SOURCEURL
0,410412347,20140218,201402,2014,2014.1315,,,,,,...,4,"Waterkloof, Free State, South Africa",SF,SF03,77359,-30.3098,25.29710,-1299321,20150218230000,http://www.dailymaverick.co.za/article/2015-02...
1,410412348,20140218,201402,2014,2014.1315,,,,,,...,4,"Bengaluru, Karnataka, India",IN,IN19,70159,12.9833,77.58330,-2090174,20150218230000,http://timesofindia.indiatimes.com/city/bengal...
2,410412349,20140218,201402,2014,2014.1315,,,,,,...,4,"Great Southern, Victoria, Australia",AS,AS07,5387,-36.0667,146.48300,-1576477,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
3,410412350,20140218,201402,2014,2014.1315,,,,,,...,1,New Zealand,NZ,NZ,,-41.0000,174.00000,NZ,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
4,410412351,20140218,201402,2014,2014.1315,,,,,,...,2,"Idaho, United States",US,USID,,44.2394,-114.51000,ID,20150218230000,http://www.eastidahonews.com/2015/02/neil-patr...
5,410412352,20140218,201402,2014,2014.1315,AUS,AUSTRALIA,AUS,,,...,4,"Brisbane, Queensland, Australia",AS,AS04,154654,-27.5000,153.01700,-1561728,20150218230000,http://www.businessspectator.com.au/article/20...
6,410412353,20140218,201402,2014,2014.1315,AUS,AUSTRALIAN,AUS,,,...,4,"Great Southern, Victoria, Australia",AS,AS07,5387,-36.0667,146.48300,-1576477,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
7,410412354,20140218,201402,2014,2014.1315,AUS,AUSTRALIA,AUS,,,...,1,New Zealand,NZ,NZ,,-41.0000,174.00000,NZ,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
8,410412355,20140218,201402,2014,2014.1315,AUS,AUSTRALIA,AUS,,,...,4,"Great Southern, Victoria, Australia",AS,AS07,5387,-36.0667,146.48300,-1576477,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...
9,410412356,20140218,201402,2014,2014.1315,AUS,AUSTRALIAN,AUS,,,...,1,New Zealand,NZ,NZ,,-41.0000,174.00000,NZ,20150218230000,http://www.voxy.co.nz/entertainment/coast-new-...


In [8]:
gkg.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17,18,19,20,21,22,23,24,25,26
0,20150218230000-0,20150218230000,2,BBC Monitoring,as listed in Russian /BBC Monitoring/(c) BBC,"ARREST#400#political#4#Rossiya, Orenburgskaya ...","ARREST#400#political#4#Rossiya, Orenburgskaya ...",TERROR;REBELS;TAX_ETHNICITY;TAX_ETHNICITY_UKRA...,"TAX_FNCACT,2011;TAX_FNCACT,3449;TAX_FNCACT,381...","4#Budapest, Budapest, Hungary#RS#HU05#47.5#19....",...,"wc:693,c1.1:1,c1.2:1,c12.1:43,c12.10:58,c12.11...",,,,,1332|134||prisoners are treated humanely ; the...,"Channel One,628;Channel One,755;Channel One,14...","3,channels,476;3,channels reported,824;1,corre...",,
1,20150218230000-1,20150218230000,2,BBC Monitoring,"Al-Sharq al-Awsat website, London/BBC Monitori...",,,TAX_FNCACT;TAX_FNCACT_ENVOY;TAX_ETHNICITY;TAX_...,"IDEOLOGY,6790;IDEOLOGY,10592;KILL,1960;KILL,13...",1#Qatar#SA#QA#25.5#51.25#QA;1#Syria#SA#SY#35#3...,...,"wc:2376,c1.2:1,c1.4:7,c12.1:163,c12.10:276,c12...",,,,,332|26||greatly complicate matters#4494|37||ma...,"Emrullah Isler,83;Arab Spring,3571;Development...","2,fronts at the political,1455;2,main strategi...",,
2,20150218230000-2,20150218230000,1,wjon.com,http://wjon.com/wjon-news-on-demand-wednesday-...,,,MANMADE_DISASTER;MANMADE_DISASTER_WITHOUT_POWE...,"MANMADE_DISASTER,91;POWER_OUTAGE,91;",,...,"wc:93,c12.1:7,c12.10:10,c12.12:5,c12.13:4,c12....",http://wac.450F.edgecastcdn.net/80450F/wjon.co...,http:/wac.450F.edgecastcdn.net/80450F/wjon.com...,,https://youtube.com/channel/;https://youtube.c...,,"Waite Park,125;City Food,233;Adobe Flash Playe...","500,people were without power,51;",,
3,20150218230000-3,20150218230000,1,wjol.com,http://www.wjol.com/common/more.php?m=15&r=3&i...,,,LEADER;TAX_FNCACT;TAX_FNCACT_GOVERNOR;TAX_POLI...,"TAX_FNCACT,131;TAX_POLITICAL_PARTY,664;TAX_FNC...","2#Wisconsin, United States#US#USWI#44.2563#-89...",...,"wc:103,c12.1:4,c12.10:13,c12.12:7,c12.13:6,c12...",,,,,,"Governor Rauner,41;Speaker Michael Madigan,158...",,,
4,20150218230000-4,20150218230000,1,straitstimes.com,http://www.straitstimes.com:80/news/world/unit...,,,DRONES;TAX_WORLDMAMMALS;TAX_WORLDMAMMALS_MICE;...,"TAX_FNCACT,634;TAX_FNCACT,1248;TAX_FNCACT,907;...","1#United States#US#US#38#-97#US;3#Miami, Flori...",...,"wc:255,c12.1:10,c12.10:28,c12.12:11,c12.13:7,c...",http://www.straitstimes.com/sites/straitstimes...,,,,,"United States,583;Columbia University Medical ...","5,weeks,1194;",,


In [9]:
mention_d.head()

Unnamed: 0,GlobalEventId,EventTimeDate,MentionTimeDate,MentionType,MentionSourceName,MentionIdentifier,SentenceID,ActorCharOffset,Actor2CharOffset,ActionCharOffset,InRawTest,Confidence,MentionDocLen,MentionDocTone,MentionDocTranslationinfo,Extras
0,410412347,20150218230000,20150218230000,1,dailymaverick.co.za,http://www.dailymaverick.co.za/article/2015-02...,19,-1,4594,4634,1,50,6665,-4.477612,,
1,410412348,20150218230000,20150218230000,1,indiatimes.com,http://timesofindia.indiatimes.com/city/bengal...,2,-1,300,344,1,50,2541,2.078522,,
2,410412349,20150218230000,20150218230000,1,voxy.co.nz,http://www.voxy.co.nz/entertainment/coast-new-...,4,-1,1297,1232,0,10,2576,7.517084,,
3,410412350,20150218230000,20150218230000,1,voxy.co.nz,http://www.voxy.co.nz/entertainment/coast-new-...,4,-1,1298,1233,1,20,2576,7.517084,,
4,410412351,20150218230000,20150218230000,1,eastidahonews.com,http://www.eastidahonews.com/2015/02/neil-patr...,1,-1,103,122,1,100,1432,0.0,,


### A story of duplicates ###

Every element is present twice [...] and this part explains how we deal with it.
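
The text above does not spell out the method, so the following is only a minimal sketch of one possible approach. It assumes that a duplicate is a row repeating an already seen `GlobalEventID`, and it uses a toy dataframe with invented values instead of the real export table.

```python
import pandas as pd

# Toy stand-in for the export table; the values are invented for illustration
toy = pd.DataFrame({
    "GlobalEventID": [1, 1, 2, 3, 3],
    "ActionGeo_Lat": [10.0, 10.0, 20.0, 30.0, 30.0],
})

# Keep the first occurrence of every event ID and drop the repeats
deduplicated = toy.drop_duplicates(subset="GlobalEventID", keep="first")
deduplicated = deduplicated.reset_index(drop=True)

print(len(toy), "->", len(deduplicated))
```

If duplicates turn out to differ in some columns, the `subset` argument would have to be adapted accordingly.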

### Columns we keep and why ###

As a side note, the reader can find an exhaustive description of each column in the GDELT documentation (available in this repo in the PDF folder).

#### In export ####

- GlobalEventID : This column holds the unique ID of each event in the dataset. It is also the link between the export table and the mentions table.

- Actor1Geo_Lat, Actor1Geo_Long : These fields hold the geographical coordinates of the first actor involved in the event. They will be displayed on a map.

- Actor2Geo_Lat, Actor2Geo_Long : Same as above, for the second actor of a specific event, if there is one.

- ActionGeo_Lat, ActionGeo_Long : Same as above, but this time for the location where the event took place. Like the two previous items, this information will be displayed on a world map.

- GoldsteinScale : This value, which estimates the impact of an event on the stability of the country, will be used to compute a homemade index named "Bias".

- SOURCEURL : Maybe we don't need to keep it?

#### In mentions ####  

- GlobalEventId : This field holds the ID of the event that the article is about. Unlike in the export table, the same ID can appear multiple times in this table (once for each mention).

- MentionSourceName : This field holds the short version of the source's URL. It will be used to geolocate the source of the information, since GDELT does not provide it.

- Confidence : ???

- MentionDocTone : This field holds a numerical value which quantifies the hostility of an article. This feature is very useful for viewing the general opinion of an article on an event, which will help us highlight biases in the information.
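
To make the column selection concrete, here is a minimal sketch that keeps only some of the columns listed above and links the two tables through the event ID. The toy dataframes and their values are invented for illustration, and the `EXPORT_COLUMNS` and `MENTION_COLUMNS` lists are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical constants naming a subset of the columns described above
EXPORT_COLUMNS = ["GlobalEventID", "ActionGeo_Lat", "ActionGeo_Long", "GoldsteinScale"]
MENTION_COLUMNS = ["GlobalEventId", "MentionSourceName", "MentionDocTone"]

# Toy rows with invented values
toy_export = pd.DataFrame({
    "GlobalEventID": [1, 2],
    "ActionGeo_Lat": [38.79, 44.24],
    "ActionGeo_Long": [35.66, -114.51],
    "GoldsteinScale": [-2.0, 3.4],
    "Actor1Code": ["TUR", "USA"],  # a column we do not keep
})
toy_mentions = pd.DataFrame({
    "GlobalEventId": [1, 1, 2],
    "MentionSourceName": ["a.example", "b.example", "c.example"],
    "MentionDocTone": [-4.5, 1.2, 0.0],
    "SentenceID": [3, 7, 1],  # a column we do not keep
})

# One row per mention, enriched with the data of the event it refers to.
# The key is spelled differently in the two tables, hence left_on / right_on.
merged = toy_mentions[MENTION_COLUMNS].merge(
    toy_export[EXPORT_COLUMNS],
    left_on="GlobalEventId",
    right_on="GlobalEventID",
    how="left",
)
print(merged)
```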

In [10]:
# Maybe display the relevant data without all the garbage around?

## Visualisations ##

### Homemade tools for event visualisation ###

This section presents the functions we created to analyse the data. Each of them has its own subsection so that the reader gets a good understanding of how everything fits together.

#### treat_event: ####

This function displays on a map where an event happened and, if present, the location of one or two actors.

##### Arguments: #####

- export : The GDELT export table

- mention : The GDELT mentions table (maybe not needed for this one?)

- id_event : The ```GlobalEventID``` of the event to display

- f_map : The map on which the event should be displayed

##### How it works: #####

This function goes through two steps.

First, we look up the specific event in the export table. Once it is found, we gather all the relevant geographical information.

Second, the location of the action and of each actor is added to the map, if it is present.

In [29]:
def treat_event(export, mention, id_event, f_map):
    """
    Add markers for the event `id_event` to `f_map`: blue for the action,
    green for actor 1 and red for actor 2 (each only if coordinates exist).
    `mention` is currently unused.
    """
    # Step 1: find the event and gather its geographical information
    dat = export.loc[export['GlobalEventID'] == id_event]
    
    act_one_lat = dat['Actor1Geo_Lat'].values[0]
    act_one_long = dat['Actor1Geo_Long'].values[0]

    act_two_lat = dat['Actor2Geo_Lat'].values[0]
    act_two_long = dat['Actor2Geo_Long'].values[0]

    a_lat = dat['ActionGeo_Lat'].values[0]
    a_long = dat['ActionGeo_Long'].values[0]
    
    # The last column name keeps the trailing newline of the names file
    src = dat['SOURCEURL\n'].values[0]
    
    # Step 2: add a marker for every location that has coordinates
    if not isNaN(a_lat) and not isNaN(a_long):
        folium.Marker(
            location=[a_lat, a_long],
            popup=src,
            icon=folium.Icon(color="blue")
        ).add_to(f_map)
        
    if not isNaN(act_one_lat) and not isNaN(act_one_long):
        folium.Marker(
            location=[act_one_lat, act_one_long],
            popup=src,
            icon=folium.Icon(color="green")
        ).add_to(f_map)
        
    if not isNaN(act_two_lat) and not isNaN(act_two_long):
        folium.Marker(
            location=[act_two_lat, act_two_long],
            popup=src,
            icon=folium.Icon(color="red")
        ).add_to(f_map)
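
The two steps of `treat_event` can also be sketched without a map, which makes the NaN filtering easy to check in isolation. The helper name `event_points` and the sample row are hypothetical; a plain dictionary stands in for one row of the export table.

```python
import math

def event_points(row):
    """Return (role, lat, lon) tuples for every location of `row` that has coordinates."""
    roles = [
        ("action", "ActionGeo_Lat", "ActionGeo_Long"),
        ("actor1", "Actor1Geo_Lat", "Actor1Geo_Long"),
        ("actor2", "Actor2Geo_Lat", "Actor2Geo_Long"),
    ]
    points = []
    for role, lat_col, lon_col in roles:
        lat, lon = row[lat_col], row[lon_col]
        # Skip locations where either coordinate is missing
        if not math.isnan(lat) and not math.isnan(lon):
            points.append((role, lat, lon))
    return points

nan = float("nan")
# Invented sample row: the first actor has no coordinates
sample = {
    "ActionGeo_Lat": -30.3098, "ActionGeo_Long": 25.2971,
    "Actor1Geo_Lat": nan, "Actor1Geo_Long": nan,
    "Actor2Geo_Lat": -27.5, "Actor2Geo_Long": 153.017,
}
print(event_points(sample))
```

Each returned tuple corresponds to one marker that `treat_event` would add to the map.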

#### src_to_country ####

This function draws a country on the map, coloured according to the tone of its article.

##### Arguments : #####

- web : A website address
- f_map : The map on which we draw
- color : The colour in which the country should be drawn
- ip : How the article should be located (True: by IP, False: by website extension)

##### How it works : ##### 

If ip = True : we locate the news outlet via its IP. First we do a DNS lookup to obtain the IP of the website. We then run a ```whois``` query on that IP (through the ```ipwhois``` package), which tells us where the website is based via a ```country``` field. Finally we convert the alpha-2 country code into alpha-3 to fetch the JSON file containing the drawing information for that country.

If ip = False : we geolocate the website using its extension. The helper functions shown earlier in the notebook build dictionaries that offer a one-to-one mapping from website extensions to alpha-3 country codes. Once the country code has been retrieved, we use the same process as before to draw the country.
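
The extension-based branch, and the decision of when to fall back to the IP, can be sketched offline as follows. The names `GENERIC_TLDS`, `TLD_TO_ALPHA3`, `needs_ip_lookup` and `country_from_extension` are hypothetical, and the small dictionary is only an illustrative excerpt of what `get_map_site` would return.

```python
# Extensions that say nothing about the country, so the IP has to be used instead
GENERIC_TLDS = {"com", "org", "net"}

# Illustrative excerpt of the extension -> alpha-3 dictionary
TLD_TO_ALPHA3 = {".za": "ZAF", ".nz": "NZL", ".au": "AUS"}

def extension_of(website):
    # "voxy.co.nz" -> "nz"
    return website.split(".")[-1].lower()

def needs_ip_lookup(website):
    return extension_of(website) in GENERIC_TLDS

def country_from_extension(website, table=TLD_TO_ALPHA3):
    # Returns None when the extension is not a known country extension
    return table.get("." + extension_of(website))

for site in ["dailymaverick.co.za", "voxy.co.nz", "eastidahonews.com"]:
    print(site, needs_ip_lookup(site), country_from_extension(site))
```

Using `dict.get` instead of indexing avoids a `try`/`except` around the lookup for unknown extensions.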

##### Note : #####

Note that geolocation through DNS lookup and whois requires an active internet connection.

In [None]:
def src_to_country(web, f_map, color, ip=True):  # True -> country by IP, False -> country by website extension
    """Draw the country of `web` on `f_map`; return its alpha-3 code, or None on failure."""
    if ip:
        try:
            obj = IPWhois(socket.gethostbyname(web))  # DNS lookup, needs internet
            results = obj.lookup_rdap()
            country = pycountry.countries.get(alpha_2=results['asn_country_code'])
            layout = MAP_PATH + country.alpha_3 + ".geo.json"

            style_function = lambda x: {'fillColor': color}

            folium.GeoJson(layout, style_function=style_function).add_to(f_map)
            return country.alpha_3
        except Exception:
            # DNS, whois, country-code or file resolution failed
            return None

    else:
        try:
            site, _ = get_map_site()  # dictionary: extension -> alpha-3 code
            country = site["." + web.split('.')[-1]]  # look up the website extension
            layout = MAP_PATH + country + ".geo.json"

            style_function = lambda x: {'fillColor': color}

            folium.GeoJson(layout, style_function=style_function).add_to(f_map)
            return country
        except Exception:
            # Unknown extension or missing country file
            return None

def src_to_country_v2(web_data):  # Look at the extension first; if that fails, fall back to the IP. Returns alpha-3 codes.
    site, _ = get_map_site()  # dictionary: extension -> country code, loaded once

    def extension_lookup(website):
        try:
            return site['.' + website.split('.')[-1]]
        except Exception:
            return None

    def ip_lookup(website):
        try:
            obj = IPWhois(socket.gethostbyname(website))
            results = obj.lookup_rdap()
            country = pycountry.countries.get(alpha_2=results['asn_country_code'])
            return country.alpha_3
        except Exception:
            return None

    def two_way_lookup(website):
        ret = extension_lookup(website)
        if ret is None:
            ret = ip_lookup(website)
        return ret

    return web_data.map(two_way_lookup)

#### tone_to_color ####

Provides a mapping from a tone value to a color.

##### Arguments : #####

- tone : the tone of an article ]-100, 100[

##### How it works : #####

Trivial for the moment; needs a redesign.

In [14]:
def tone_to_color(tone): # Tone is between -100 and 100, but between -10 and 10 99% of the time
    if tone < 0:
        if tone < -5:
            return "red"
        else:
            return "pink"
    else:
        if tone > 5 :
            return "green"
        else:
            return "lightgreen"
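
Since the mapping is meant to be redesigned, here is one possible direction, given only as a sketch: a continuous scale from red through white to green. The function name `tone_to_hex` and the clipping limit of 10 (motivated by the comment that most tones lie between -10 and 10) are assumptions, not the final design.

```python
def tone_to_hex(tone, limit=10.0):
    """Map a tone to a hex color: red for negative, white for neutral, green for positive."""
    # Clip the tone to [-limit, limit] so that rare extreme values do not flatten the scale
    clipped = max(-limit, min(limit, tone))
    strength = abs(clipped) / limit          # 0.0 (neutral) .. 1.0 (strongest)
    fade = int(round(255 * (1 - strength)))  # the other channels fade out as strength grows
    if clipped < 0:
        red, green, blue = 255, fade, fade
    else:
        red, green, blue = fade, 255, fade
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)

print(tone_to_hex(-10))  # #ff0000
print(tone_to_hex(0))    # #ffffff
print(tone_to_hex(10))   # #00ff00
```

A hex string like this can be used as the `fillColor` of the style function that colours a country.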

### Visualisation examples ###

In this section we show the versatility of the tools we created through a few examples.

#### Example 1 : ####

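Display a single event together with the countries of the sources that mention it, coloured by the tone of each mention.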

In [40]:
map_d = folium.Map(
    location=[39, 36],
    zoom_start=4,
    tiles='Stamen Terrain'
)

treat_event(export_d, mention_d, 410412361, map_d)

references = mention_d.loc[mention_d['GlobalEventId'] == 410412361]

for web, tone in zip(references['MentionSourceName'], references['MentionDocTone']):
    # Generic extensions carry no country information, so use the IP instead
    if web.split('.')[-1] in ('com', 'org'):
        print(src_to_country(web, map_d, tone_to_color(tone), ip = True))
    else:
        print(src_to_country(web, map_d, tone_to_color(tone), ip = False))

map_d

38.7918
35.6592
TUR
TUR


#### Example 2 : ####

Display media coverage for all events in a country.

In [96]:
us_event = export_d.loc[export_d['Actor1CountryCode'] == 'USA']

map_e = folium.Map(
    location=[39, 36],
    zoom_start=4,
    tiles='Stamen Terrain'
)

for ids in us_event['GlobalEventID'][:150]:
    references = mention_d.loc[mention_d['GlobalEventId'] == ids]

    for web, tone in zip(references['MentionSourceName'], references['MentionDocTone']) :
        if web.split('.')[-1] in ('com', 'org', 'net'):
            src_to_country(web, map_e, tone_to_color(tone), ip = True)
        else:
            src_to_country(web, map_e, tone_to_color(tone), ip = False)

map_e

IND
USA
USA
USA
USA
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
USA
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
Country resolution for ap.org via IP has failed
Country resolution for stamfordadvocate.com via IP has failed
USA
Country resolution for ap.org via IP has failed
Coun