# MEDIA FRAMING : THE BIAS IN INTERNATIONAL NEWS COVERAGE #

This notebook presents all of the work we have done to understand the data.

In [18]:
import os
import io
import json
import socket
import urllib

import numpy as np
import pandas as pd
import requests
import xarray as xr
import folium
import pycountry
import geopandas
from ipwhois import IPWhois

import holoviews as hv
import geoviews as gv
import geoviews.feature as gf
import geoviews.tile_sources as gts

from bokeh.palettes import YlOrBr3 as palette

import cartopy
from cartopy import crs as ccrs

from bokeh.tile_providers import STAMEN_TONER
from bokeh.models import WMTSTileSource

hv.notebook_extension('bokeh')

In [10]:
DATA_PATH = "data/"
MAP_PATH = DATA_PATH + "world.geo.json/countries/"
COUNTRY_CODE_DATA = DATA_PATH + "country-codes/data/country-codes.csv"

## 1. A story of loading the data ##

The first step of the project is to find a way of handling the data. The GDELT project works with a system of archives: every 15 minutes it scrapes the web for news and saves its findings in three csv files:
- export: contains information about the events
- mentions: contains information about the articles that mention the events
- gkg: contains analysis of the articles made by gdelt

The main problem here is that we have a tremendous amount of data divided into many csv files. Scraping every 15 minutes means:
- We have 4 csv files per hour for each dataset, which represents a total of 4x3 = 12 csv files per hour
- We have 24 hours in a day, which means 4x24 = 96 csv files per day per dataset, for a total of 96x3 = 288 csv files per day
- We have a maximum of 31 days in a month, which means at most 31x96 = 2 976 csv files per month per dataset, for a total of 2976x3 = 8 928 csv files per month
- We have 12 months per year, which means at most 12x2976 = 35 712 csv files per year per dataset, for a total of 3x35 712 = 107 136 csv files per year (an upper bound, since every month is counted as 31 days)

In terms of size, taking the upper bound of each range, each dataset represents:
- Between 300 and 500 kB (0.3 to 0.5 MB) per file for the export dataset:
  0.5x4 = 2 MB per hour <-> 0.5x96 = 48 MB per day <-> 0.5x2976 = 1 488 MB ≈ 1.5 GB per month <-> 12x1.5 = 18 GB per year
- Between 600 and 1 500 kB (0.6 to 1.5 MB) per file for the mentions dataset:
  1.5x4 = 6 MB per hour <-> 1.5x96 = 144 MB per day <-> 1.5x2976 = 4 464 MB ≈ 4.5 GB per month <-> 12x4.5 = 54 GB per year
- Between 15 and 30 MB per file for the gkg dataset:
  30x4 = 120 MB per hour <-> 30x96 = 2 880 MB ≈ 2.9 GB per day <-> 30x2976 = 89 280 MB ≈ 90 GB per month <-> 12x90 = 1 080 GB ≈ 1 TB per year

Therefore, in a year we have at most 18 + 54 + 1 080 = 1 152 GB ≈ 1.15 TB. As we have 3 years of data, the total size is about 3.5 TB.

The GDELT database therefore represents terabytes of data, which is consistent with the amount of data saved in the cluster.
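
The estimate above can be reproduced with a few lines of Python. This is only a sketch: it assumes the upper-bound file sizes quoted in the text and counts every month as 31 days, and the names `MAX_FILE_MB` and `yearly_size_gb` are hypothetical helpers that are not used elsewhere in this notebook.

```python
# Upper-bound size per 15-minute file, in MB (values quoted in the text above)
MAX_FILE_MB = {"export": 0.5, "mentions": 1.5, "gkg": 30.0}

FILES_PER_HOUR = 4                      # one file every 15 minutes
FILES_PER_DAY = FILES_PER_HOUR * 24     # 96
FILES_PER_MONTH = FILES_PER_DAY * 31    # 2976, every month counted as 31 days
FILES_PER_YEAR = FILES_PER_MONTH * 12   # 35712


def yearly_size_gb(file_mb):
    """Upper bound on the yearly size of one dataset, in GB (1 GB = 1000 MB)."""
    return file_mb * FILES_PER_YEAR / 1000


yearly = {name: yearly_size_gb(mb) for name, mb in MAX_FILE_MB.items()}
total_gb = sum(yearly.values())

for name, gb in yearly.items():
    print(f"{name:>8}: {gb:8.1f} GB per year")
print(f"   total: {total_gb:8.1f} GB per year, {3 * total_gb / 1000:.2f} TB for 3 years")
```

Because the script does not round at the monthly step, its figures differ slightly from the hand-rounded ones, but the order of magnitude (terabytes) is the same.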

### 1.1 Get the csv files' URLs from the master file list ###

Instead of using the cluster, we decided to start with the information available on the GDELT website. From the url of masterfilelist.txt, we stored the urls of the csv files in a pandas dataframe. Eventually, instead of the urls of the csv files, we will use their pathnames in the cluster.

In [3]:
url='http://data.gdeltproject.org/gdeltv2/masterfilelist.txt'
s=requests.get(url).content

In [4]:
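# A regex separator requires the python parsing engine; naming it explicitly avoids a ParserWarning
df_list = pd.read_csv(io.StringIO(s.decode('utf-8')), sep=r'\s', engine='python', header=None, names=['Size', 'Code', 'url'])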


In [5]:
df_list = df_list.dropna(subset=['url'])

In [6]:
df_list['url'].head()

0    http://data.gdeltproject.org/gdeltv2/201502182...
1    http://data.gdeltproject.org/gdeltv2/201502182...
2    http://data.gdeltproject.org/gdeltv2/201502182...
3    http://data.gdeltproject.org/gdeltv2/201502182...
4    http://data.gdeltproject.org/gdeltv2/201502182...
Name: url, dtype: object

In [7]:
# regex=False so that the dots of the pattern are matched literally
df_list[df_list['url'].str.contains('.export.CSV', regex=False)].head()

Unnamed: 0,Size,Code,url
0,150383,297a16b493de7cf6ca809a7cc31d0b93,http://data.gdeltproject.org/gdeltv2/201502182...
3,149211,2a91041d7e72b0fc6a629e2ff867b240,http://data.gdeltproject.org/gdeltv2/201502182...
6,149723,12268e821823aae2da90882621feda18,http://data.gdeltproject.org/gdeltv2/201502182...
9,158842,a5298ce3c6df1a8a759c61b5c0b6f8bb,http://data.gdeltproject.org/gdeltv2/201502182...
12,362610,c4268d558bb22c02b3c132c17818c68b,http://data.gdeltproject.org/gdeltv2/201502190...
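
By default `str.contains` interprets its pattern as a regular expression, in which `.` matches any character. The toy example below (made-up file names, not taken from the master file list) illustrates the difference with a literal match; it is only meant as an illustration.

```python
import pandas as pd

# Toy file names (hypothetical, for illustration only)
urls = pd.Series([
    "20150218230000.export.CSV.zip",
    "20150218230000.mentions.CSV.zip",
    "20150218230000_export_CSV.zip",   # no dots around "export"
])

regex_match = urls.str.contains(".export.CSV")                  # '.' means "any character"
literal_match = urls.str.contains(".export.CSV", regex=False)   # '.' means a dot

print(regex_match.tolist())    # [True, False, True]
print(literal_match.tolist())  # [True, False, False]
```

On the real GDELT urls both variants select the same files, since the file names always contain the dots; the literal match simply states the intent more precisely.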


In [None]:
# We get the columns names of the datasets from the text files we've created
col_ex = get_export_names()
col_men = get_mentions_names()

# We define the list of the columns we want to keep in each dataset
col_ex_list = ['GlobalEventID', 'Day', 'MounthYear', 'Year', 'ActionGeo_CountryCode', 'ActionGeo_Lat', 'ActionGeo_Long', 'GoldsteinScale', 'NumMentions']
col_men_list = ['GlobalEventId', 'MentionSourceName', 'Confidence', 'MentionDocTone']

# We create the empty aggregated dataframes with the column names we want to keep
export_df = pd.DataFrame(columns=col_ex_list)
mentions_df = pd.DataFrame(columns=col_men_list)

display(export_df.head())
display(mentions_df.head())

print(col_ex)

In [None]:

def scrape_list(url_ex, url_men, export_df, mentions_df):
    '''
    This function downloads the export.csv and mentions.csv files listed in url_ex and url_men,
    caches their contents and only keeps the relevant columns
    '''
    for i in range(url_ex.shape[0]):
        # Appending is slightly faster than Concat when ignore_index=True, so we used append to add the new scraped dataFrame
        ## But appending gets inefficient for a large dataFrame, so instead of appending the new scraped dataframe to a ...
        ## ... large dataFrame, we recursively call our function to use a new empty dataFrame for appending to achieve...
        ## ... much faster speed when scraping a large number of dataframes
        if i >= 100:
            export_df_2 = pd.DataFrame(columns=col_ex_list)
            mentions_df_2 = pd.DataFrame(columns=col_men_list)
            # Positional slicing (iloc) on both series keeps the export and mentions urls aligned
            export_df_2, mentions_df_2 = scrape_list(url_ex.iloc[100:], url_men.iloc[100:], export_df_2, mentions_df_2)
            export_df = export_df.append(export_df_2, ignore_index=True)
            mentions_df = mentions_df.append(mentions_df_2, ignore_index=True)
            break
        else:
            s_ex = requests.get(url_ex.iloc[i])
            s_men = requests.get(url_men.iloc[i])
            # We only keep the pair of files if both downloads succeeded
            if s_ex.status_code == 200 and s_men.status_code == 200:
                df_i_ex = pd.read_csv(io.BytesIO(s_ex.content), sep='\t', compression='zip', names=col_ex)
                df_i_men = pd.read_csv(io.BytesIO(s_men.content), sep='\t', compression='zip', names=col_men)
                export_df = export_df.append(df_i_ex[col_ex_list], ignore_index=True)
                mentions_df = mentions_df.append(df_i_men[col_men_list], ignore_index=True)
    return export_df, mentions_df
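
An alternative to growing a dataframe step by step is to collect the per-file dataframes in a Python list and concatenate them once at the end. The sketch below is not the code used in this notebook: the helper name `concat_selected` and the toy frames are hypothetical, and the download step is left out so that the example runs offline.

```python
import pandas as pd


def concat_selected(frames, columns):
    """Keep only `columns` from each frame and concatenate everything once."""
    chunks = [frame[columns] for frame in frames]
    if not chunks:
        # No file could be read: return an empty frame with the expected columns
        return pd.DataFrame(columns=columns)
    return pd.concat(chunks, ignore_index=True)


# Toy frames standing in for two downloaded export files (made-up values)
file_1 = pd.DataFrame({"GlobalEventID": [1, 2], "GoldsteinScale": [1.9, 2.8], "Other": ["a", "b"]})
file_2 = pd.DataFrame({"GlobalEventID": [3], "GoldsteinScale": [-5.0], "Other": ["c"]})

combined = concat_selected([file_1, file_2], ["GlobalEventID", "GoldsteinScale"])
print(combined)
```

With this layout each file is copied only once, at concatenation time, so no recursion is needed to keep the intermediate dataframes small.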
     

In [None]:
# Parsing the data and returning the aggregated dataFrame
export_df, mentions_df = scrape_list(df_ex_w01['url'], df_men_w01['url'], export_df, mentions_df)

In [5]:
# Saving the resulted dataframes
#export_df.to_csv(os.path.join('results','export_df.csv'))
#mentions_df.to_csv(os.path.join('results','mentions_df.csv'))

# Loading the dataFrame
export_df = pd.read_csv('export_df.csv', index_col=0)
mentions_df = pd.read_csv('mentions_df.csv', index_col=0)


In [6]:
# Merging the two dataFrames (export and mentions)
df_merged = export_df.set_index('GlobalEventID').join(mentions_df.set_index('GlobalEventId'), how='left')
df_merged = df_merged.drop_duplicates(keep='first').reset_index()
df_merged = df_merged.rename(columns= {df_merged.columns[0]:'GlobalEventID'})

# Downcasting data types to decrease the memory footprint of the dataframe
df_merged.iloc[:,np.r_[0:2,7:8]] = df_merged.iloc[:,np.r_[0:2,7:8]].apply(pd.to_numeric, downcast='integer', errors='coerce')
df_merged = df_merged.astype({"ActionGeo_CountryCode": str, "MentionSourceName": str})
display(df_merged.head())
df_merged.dtypes

Unnamed: 0,GlobalEventID,Day,ActionGeo_CountryCode,ActionGeo_Lat,ActionGeo_Long,GoldsteinScale,MentionSourceName,Confidence,MentionDocTone
0,410412347,20140218,SF,-30.3098,25.2971,2.8,dailymaverick.co.za,50.0,-4.477612
1,410412348,20140218,IN,12.9833,77.5833,1.9,indiatimes.com,50.0,2.078522
2,410412349,20140218,AS,-36.0667,146.483,1.9,voxy.co.nz,10.0,7.517084
3,410412350,20140218,NZ,-41.0,174.0,1.9,voxy.co.nz,20.0,7.517084
4,410412351,20140218,US,44.2394,-114.51,1.9,eastidahonews.com,100.0,0.0


GlobalEventID              int32
Day                        int32
ActionGeo_CountryCode     object
ActionGeo_Lat            float64
ActionGeo_Long           float64
GoldsteinScale           float64
MentionSourceName         object
Confidence               float64
MentionDocTone           float64
dtype: object

In [None]:
### Getting News Source Countries
# Selecting MentionSourceName column
df_sourceName = df_merged.iloc[:,6:7]
# Dropping duplicates to have only the unique source names
df_sourceName = df_sourceName.drop_duplicates(keep='first')

df_sourceName.shape

In [None]:
# Getting the News Sources Countries
df_sourceName.loc[:,'Source_Country'] = src_to_country_v2(df_sourceName.loc[:,'MentionSourceName'].copy())

In [7]:
# for Keeping only US news Sources
#df_data = df_merged[df_merged['Source_Country']=='US'].copy()

# For now we keep the whole database since we have only a sample of data
df_data = df_merged.copy()
df_data.head()

Unnamed: 0,GlobalEventID,Day,ActionGeo_CountryCode,ActionGeo_Lat,ActionGeo_Long,GoldsteinScale,MentionSourceName,Confidence,MentionDocTone
0,410412347,20140218,SF,-30.3098,25.2971,2.8,dailymaverick.co.za,50.0,-4.477612
1,410412348,20140218,IN,12.9833,77.5833,1.9,indiatimes.com,50.0,2.078522
2,410412349,20140218,AS,-36.0667,146.483,1.9,voxy.co.nz,10.0,7.517084
3,410412350,20140218,NZ,-41.0,174.0,1.9,voxy.co.nz,20.0,7.517084
4,410412351,20140218,US,44.2394,-114.51,1.9,eastidahonews.com,100.0,0.0


## 2. A story of saving the datasets in pandas dataframes ##

Now that we have the urls of the csv files, we need to download them and aggregate them into one single dataframe.
For each dataset (export, mentions and gkg) we create a new pandas dataframe. To name the columns of the dataframe we use the list of column names stored locally in a text file. When aggregating each dataset we try to do it in the least costly way:
- When we download the csv files we only keep the columns we want to use
- We only hold one csv file of each type (export, mentions and gkg) in memory at a time: we add its content to the aggregated dataframe and overwrite it with the file from the next url.

### 2.1 Helper functions to load the datasets ###

We defined a few functions which help us to easily interact with the data.

```isNaN```: This function detects when a field is NaN (float('nan')).

```get_export_names```: This function fetches the column names of the *export* dataset from the local text file.

```get_mentions_names```: This function fetches the column names of the *mentions* dataset from the local text file.

```get_map_site```: This function returns two dictionaries. The first one takes a website extension as input and returns the ```ISO3166-1-Alpha-3``` code of the corresponding country. The second one does the same thing the other way around.

In [8]:
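def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num

def get_export_names():
    # The column names are stored, separated by spaces, on the first line of the file.
    # The line is not stripped, so the last name keeps its trailing newline ('SOURCEURL\n').
    with open(DATA_PATH + "event_table_name", "r") as file:
        return file.readline().split(" ")

def get_mentions_names():
    # Same layout as the export file: one line of space-separated column names
    with open(DATA_PATH + "mentions_table_name", "r") as file:
        return file.readline().split(" ")

def get_map_site():
    # Returns two dictionaries: (extension -> alpha-3 code) and (alpha-3 code -> extension)
    file = pd.read_csv(COUNTRY_CODE_DATA)
    return dict(zip(file['TLD'], file['ISO3166-1-Alpha-3'])), dict(zip(file['ISO3166-1-Alpha-3'], file['TLD']))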

### 2.2 Column selection ###

As a side note, the reader can find an exhaustive description of each column in the GDELT documentation (available in this repo in the PDF folder).

#### In export ####

- GlobalEventID : This column holds the unique ID of each event in the dataset. It is also the link between the export table and the mentions table.

- Actor1Geo_Lat, Actor1Geo_Long : These fields hold the geographical coordinates of the first party involved in the event. Their content will be displayed on a map.

- Actor2Geo_Lat, Actor2Geo_Long : Same as above, for the second actor of a specific event, if there is one.

- ActionGeo_Lat, ActionGeo_Long : Same as above, but these fields hold the geographical coordinates of where the event took place. Like the previous two, this information will be displayed on a world map.

- GoldsteinScale : This value, which estimates the impact of an event on the stability of the country, will be used to compute a homemade index named "Bias".

- SOURCEURL : The URL of the source article. It will be useful to add this URL to the marker of an event on the map.

#### In mentions ####  

- GlobalEventId : This field holds the ID of the event the article is about. Unlike in the export table, the same ID can appear multiple times in this table (once for each mention).

- MentionSourceName : This field holds the short version of the URL of the source. It will be used to geolocate the source of the information, since GDELT does not provide it.

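- Confidence : According to the GDELT documentation, this field holds the percent confidence in the extraction of the event from the article.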

- MentionDocTone : This field holds a numerical value which quantifies the tone of an article, from negative to positive. This feature is very useful to see the general opinion of an article on an event, which will help us highlight biases in the news.

In [9]:
# We get the columns names of the datasets from the text files we've created
col_ex = get_export_names()
col_men = get_mentions_names()

# We define the list of the columns we want to keep in each dataset
col_ex_list = ['GlobalEventID', 'Day', 'MounthYear', 'Year', 'ActionGeo_CountryCode', 'ActionGeo_Lat', 'ActionGeo_Long', 'GoldsteinScale', 'NumMentions']
col_men_list = ['GlobalEventId', 'MentionSourceName', 'Confidence', 'MentionDocTone']

# We create the empty aggregated dataframes with the column names we want to keep
export_df = pd.DataFrame(columns=col_ex_list)
mentions_df = pd.DataFrame(columns=col_men_list)

display(export_df.head())
display(mentions_df.head())

print(col_ex)

Unnamed: 0,GlobalEventID,Day,MounthYear,Year,ActionGeo_CountryCode,ActionGeo_Lat,ActionGeo_Long,GoldsteinScale,NumMentions


Unnamed: 0,GlobalEventId,MentionSourceName,Confidence,MentionDocTone


['GlobalEventID', 'Day', 'MounthYear', 'Year', 'FractionDate', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor1KnownGroupCode', 'Actor1EthnicCode', 'Actor1Religioni1Code', 'Actor1Religion2Code', 'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'Actor2KnownGroupCode', 'Actor2EthnicCode', 'Actor2Religioni1Code', 'Actor2Religion2Code', 'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code', 'IsRootEvent', 'EventCode', 'EventBaseCode', 'EventRootCode', 'QuadClass', 'GoldsteinScale', 'NumMentions', 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Type', 'Actor1Geo_Fullname', 'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code', 'Actor1Geo_ADM2Code', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor1Geo_FeatureID', 'Actor2Geo_Type', 'Actor2Geo_Fullname', 'Actor2Geo_CountryCode', 'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code', 'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID', 'ActionGeo_Type', 'ActionGeo_Fullname', 'ActionGeo_CountryCod

In [10]:
# We filter the urls, keeping only those containing an export dataset
df_ex_w01 = df_list[df_list['url'].str.contains('.export.CSV', regex=False)]
df_ex_w01 = df_ex_w01.iloc[:96*30,2:3] # 96 files per day: this keeps the first 30 days of events

# We filter the urls, keeping only those containing a mentions dataset
df_men_w01 = df_list[df_list['url'].str.contains('.mentions.CSV', regex=False)]
df_men_w01 = df_men_w01.iloc[:96*30,2:3] # 96 files per day: this keeps the first 30 days of mentions

## 3. A story of visualisation ##

Using the export and mentions dataframes we came up with several visualizations to help us answer some of the key questions of this project.

### 3.1 Helper functions to visualize the data ###

In order to visualize different information, we've defined a few functions which help us to extract and interact with the data.

#### 3.1.1 Function 1: treat_event ####
```treat_event```: This function displays on a map the location of an event and of its one or two actors.

There are two steps in this function:
- First step: we extract the information of a specific event from the export dataframe given its ID. Then we extract the geographical information for the event, Actor1 and Actor2.
- Second step: for the event and each actor, if the location is present in the dataset, a marker is added to the map. We use a different color for each marker: blue for the location of the event, green for the location of Actor 1 and red for the location of Actor 2.

In [11]:
def treat_event(export, mention, id_event, f_map):
    """
        Displays the location of an event on a map and the location of the 
        actors of this event if the information is provided.
    
    Inputs:
    
        export [Pandas dataframe]: 
                    The GDELT export dataset stored in a pandas dataframe
                    
        mention [Pandas dataframe]:
                    The GDELT mentions dataset stored in a pandas dataframe

        id_event [int]:
                    The ID of the event we are interested in visualizing

        f_map [folium map]:
                    The map on which the event should be displayed
    """
    
    # We extract the row corresponding to the event ID in the export dataframe
    data = export.loc[export['GlobalEventID'] == id_event]
    
    # We extract the longitude and the latitude of the Actor 1 from the data
    act_one_lat = data['Actor1Geo_Lat'].values[0]
    act_one_long = data['Actor1Geo_Long'].values[0]

    # We extract the longitude and the latitude of the Actor 2 from the data
    act_two_lat = data['Actor2Geo_Lat'].values[0]
    act_two_long = data['Actor2Geo_Long'].values[0]

    # We extract the longitude and the latitude of the event 
    a_lat = data['ActionGeo_Lat'].values[0]
    a_long = data['ActionGeo_Long'].values[0]
    
    # We extract the URL of one source so that it can be displayed on the map
    src = data['SOURCEURL\n'].values[0]
    
    # We check if the variables aren't empty and we add a marker in different colors
    if not isNaN(a_lat) and not isNaN(a_long):
        folium.Marker(location=[a_lat, a_long], popup=src, icon=folium.Icon(color="blue")) \
              .add_to(f_map)
        
    if not isNaN(act_one_lat) and not isNaN(act_one_long):
        folium.Marker(location=[act_one_lat, act_one_long], popup=src, icon=folium.Icon(color="green")) \
              .add_to(f_map)
        
    if not isNaN(act_two_lat) and not isNaN(act_two_long):
        folium.Marker(location=[act_two_lat, act_two_long], popup=src, icon=folium.Icon(color="red")) \
              .add_to(f_map)

#### 3.1.2 Function 2: src_to_country ####

```src_to_country```: This function adds a color layer to the map over the country in which a news website is based. We call it with a color derived from the tone of the article.

If ip = True : we locate the news outlet via its IP address. First we do a DNS lookup to obtain the IP of the website. Then we query the registration record of this IP with the ```ipwhois``` library, which tells us where the website is based via its country code. Finally we convert the alpha-2 country code into alpha-3 to fetch the json file containing the drawing information for said country.

If ip = False : we geolocate the website using its extension. Earlier in the notebook, the helper function ```get_map_site``` built dictionaries that map website extensions to alpha-3 country codes. Once the country code has been retrieved, we use the same process as before to draw the country.

Note: the geolocation through DNS lookup and whois query needs an active internet connection.

In [12]:
def src_to_country(web, f_map, color, ip = True):
    """
    Adds a color layer to the map over the country in which a news
    website is based.

    Inputs:

    web [string]: 
                The URL of a website
    f_map [folium map]:
                The map on which we will add the colormap layer
    color [string]:
                The desired color layer for the country  
    ip [boolean]:
                The way we want to localise the article source website
                True: we find the country by IP addresses
                False: we find the country by URL
    """ 
    if ip:
        try:
            # DNS lookup, then RDAP query on the resulting IP address
            obj = IPWhois(socket.gethostbyname(web))
            results = obj.lookup_rdap()
            country = pycountry.countries.get(alpha_2=results['asn_country_code'])
            layout = MAP_PATH + country.alpha_3 + ".geo.json"

            style_function = lambda x: {'fillColor': color}

            folium.GeoJson(layout, style_function).add_to(f_map)
        except Exception:
            # Best effort: sources that cannot be located are skipped
            pass

    else:
        try:
            site, _ = get_map_site() # get dictionary
            country = site["." + web.split('.')[-1]] # look up the web extension
            layout = MAP_PATH + country + ".geo.json"

            style_function = lambda x: {'fillColor': color}

            folium.GeoJson(layout, style_function).add_to(f_map)
        except Exception:
            # Best effort: sources that cannot be located are skipped
            pass

def src_to_country_v2(web_data):
    """
    Maps a series of website names to alpha-3 country codes: we look at the
    web extension first and, if that fails, we fall back to the IP lookup.
    """
    site, _ = get_map_site() # Get dictionary: extension to country code (loaded once)

    def extension_lookup(website):
        try:
            return site['.' + website.split('.')[-1]]
        except Exception:
            return None

    def ip_lookup(website):
        try:
            obj = IPWhois(socket.gethostbyname(website))
            results = obj.lookup_rdap()
            country = pycountry.countries.get(alpha_2=results['asn_country_code'])
            return country.alpha_3
        except Exception:
            return None

    def two_way_lookup(website):
        ret = extension_lookup(website)
        if ret is None:
            ret = ip_lookup(website)
        return ret

    return web_data.map(two_way_lookup)
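
The "extension first, IP second" logic can be tried offline by passing the slow lookup as a parameter. The sketch below is an illustration only: `make_two_way_lookup`, the toy extension table and `fake_ip_lookup` are hypothetical stand-ins, and the small cache is an addition that avoids querying the same website twice.

```python
def make_two_way_lookup(tld_to_country, fallback):
    """Build a lookup that tries the web extension first, then `fallback`."""
    cache = {}

    def lookup(website):
        if website not in cache:
            extension = "." + website.rsplit(".", 1)[-1]
            country = tld_to_country.get(extension)
            if country is None:
                # Unknown extension (e.g. ".com"): use the slower lookup
                country = fallback(website)
            cache[website] = country
        return cache[website]

    return lookup


# Toy extension table and a stand-in for the IP-based lookup (both hypothetical)
tld_to_country = {".za": "ZAF", ".nz": "NZL"}
fallback_calls = []


def fake_ip_lookup(website):
    fallback_calls.append(website)
    return "USA"


lookup = make_two_way_lookup(tld_to_country, fake_ip_lookup)
print(lookup("dailymaverick.co.za"))   # ZAF
print(lookup("eastidahonews.com"))     # USA
print(lookup("eastidahonews.com"))     # USA (served from the cache)
print(fallback_calls)                  # ['eastidahonews.com']
```

Since the same source name usually appears in many mentions, caching the result per website keeps the number of network queries close to the number of unique sources.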

#### 3.1.3 Function 3: tone_to_color ####

```tone_to_color```: This function provides a mapping from a tone value to a color.

The tone of an article is a score delivered between -100 (negative) and 100 (positive). Most of the time the tone is between -10 and 10, therefore we defined a very simple function to easily get some basic insights. This function will be re-designed in the future.

In [13]:
def tone_to_color(tone): 
    """
    Provides a mapping from a tone value to a color

    Inputs:

    tone [float]: 
                The tone of an article, value in [-100, 100]
    """ 
    if tone < 0:
        if tone < -5:
            return "red"
        else:
            return "pink"
    else:
        if tone > 5 :
            return "green"
        else:
            return "lightgreen"
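
One possible direction for the re-design is a continuous color scale instead of four fixed colors. The sketch below is an assumption, not the final design: the name `tone_to_hex` and the clipping limit of 10 (chosen because most tones fall between -10 and 10) are hypothetical.

```python
def tone_to_hex(tone, limit=10.0):
    """Map a tone to a hex color: red (negative), white (neutral), green (positive)."""
    # Clip the tone to [-limit, limit] and rescale it to [-1, 1]
    t = max(-limit, min(limit, tone)) / limit
    # 255 for a neutral tone, 0 when the tone reaches the limit
    fade = int(round(255 * (1 - abs(t))))
    if t < 0:
        red, green, blue = 255, fade, fade
    else:
        red, green, blue = fade, 255, fade
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)


print(tone_to_hex(-10))  # #ff0000
print(tone_to_hex(0))    # #ffffff
print(tone_to_hex(10))   # #00ff00
```

The returned string can be used wherever a color name is accepted, for instance as the `fillColor` of a map layer.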

### 3.2 Visualizations ###

Now that we have functions to extract information from the data, we can use them to make visualizations.

Note: There are better ways to do the visualization; the idea here was to try the functions and see what we can do. Many improvements remain to be made: we will select a better map and better colors, display legends and color scales, etc.

#### 3.2.1 Example 1: Tone of an article for a specific event ####

This is a very simple visualization to test our functions. In [Example 1](http://localhost:8888/maps/Example1.html) we can see a special case where an event happened in Turkey and Actor1 and Actor2 are both located in Turkey (the 3 markers are superimposed and we only see the last one added, the red marker of Actor 2). If we click on the marker, we get the URL of the source article. The tone of the article written about the event is very negative, as the country is colored in red.

In [14]:
# Here we try our functions on a small subset of all the data (only one csv for export and mention)
export_d = pd.read_csv(DATA_PATH + "20150218230000.export.CSV",sep='\t', names=get_export_names())
mention_d = pd.read_csv(DATA_PATH + "20150218230000.mentions.CSV", sep="\t", names=get_mentions_names())
#gkg = pd.read_csv(DATA_PATH + "20150218230000.gkg.csv",sep='\t', header=None)

In [18]:
map_d = folium.Map(location=[39, 36], zoom_start=4, tiles='Stamen Terrain')

treat_event(export_d, mention_d,410412361, map_d )

references = mention_d.loc[mention_d['GlobalEventId'] == 410412361]

for web, tone in zip(references['MentionSourceName'], references['MentionDocTone']) :
    if web.split('.')[-1] == 'com' or web.split('.')[-1] == 'org':
        print(src_to_country(web, map_d, tone_to_color(tone), ip = True))
    else:
        print(src_to_country(web, map_d, tone_to_color(tone), ip = False))

map_d.save('maps/Example1.html')

map_d

None
None


#### 3.2.2 Example 2: Tone of all events relayed by the media based in one country ####

This visualization, [Example 2](http://localhost:8888/maps/Example2.html), allows us to see the tone of the articles written in different countries about the USA. For each event that has USA as the country code of Actor 1, we add a color layer on the map for each article, over the country where it was published. The resulting color is the superposition of the different layers. For instance, China is green, therefore the tone of the articles is mostly positive, whereas the USA looks orange, therefore the tone of the articles is mostly negative.

In [17]:
us_event = export_d.loc[export_d['Actor1CountryCode'] == 'USA']

map_e = folium.Map(
    location=[39, 36],
    zoom_start=4,
    tiles='Stamen Terrain'
)

for ids in us_event['GlobalEventID'][:150]:
    references = mention_d.loc[mention_d['GlobalEventId'] == ids]

    for web, tone in zip(references['MentionSourceName'], references['MentionDocTone']) :
        if web.split('.')[-1] == 'com' or web.split('.')[-1] == 'org' or web.split('.')[-1] == 'net':
            src_to_country(web, map_e, tone_to_color(tone), ip = True)
        else:
            src_to_country(web, map_e, tone_to_color(tone), ip = False)

map_e.save('maps/Example2.html')
map_e

#### 3.2.3 Example 3: Distribution of events being covered by various news sources ####

In this visualization we look at how the events covered by the various news sources are distributed around the world.

In [8]:
# Preparing the dataframe for plotting
df_plot = df_data.iloc[:,:5].copy()
df_plot = df_plot.groupby(by=['ActionGeo_CountryCode', 'ActionGeo_Lat', 'ActionGeo_Long'])\
                 .agg({'GlobalEventID': ['count']}).reset_index()
df_plot.columns = [col[0] for col in df_plot.columns]
df_plot = df_plot.rename(columns={'GlobalEventID':'Mentions'})
df_plot.head()

Unnamed: 0,ActionGeo_CountryCode,ActionGeo_Lat,ActionGeo_Long,Mentions
0,AA,12.5,-69.9667,54
1,AA,12.5167,-70.0333,1
2,AC,16.9167,-62.3167,1
3,AC,17.0,-61.7667,1
4,AC,17.0167,-61.7667,2


In [11]:
# Converting the two-letter FIPS country codes to ISO 3166-1 alpha-3 country codes
country_code_df = pd.read_csv(COUNTRY_CODE_DATA)
s = country_code_df.set_index('FIPS')['ISO3166-1-Alpha-3']
df_plot['ActionGeo_CountryCode'] = df_plot['ActionGeo_CountryCode'].replace(s)
df_plot.head()

Unnamed: 0,ActionGeo_CountryCode,ActionGeo_Lat,ActionGeo_Long,Mentions
0,ABW,12.5,-69.9667,54
1,ABW,12.5167,-70.0333,1
2,ATG,16.9167,-62.3167,1
3,ATG,17.0,-61.7667,1
4,ATG,17.0167,-61.7667,2
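
When converting codes it can be useful to spot the ones that are missing from the conversion table. The toy example below compares `replace`, which leaves unknown codes untouched, with `map`, which turns them into NaN. The two code pairs are the ones visible in the outputs above; the code `XX` is made up to play the role of an unknown code.

```python
import pandas as pd

# Conversion table limited to the two pairs visible in the outputs above
fips_to_iso3 = {"AA": "ABW", "AC": "ATG"}
codes = pd.Series(["AA", "AC", "XX"])   # "XX" is a made-up, unknown code

replaced = codes.replace(fips_to_iso3)  # unknown codes are kept as they are
mapped = codes.map(fips_to_iso3)        # unknown codes become NaN

print(replaced.tolist())                # ['ABW', 'ATG', 'XX']
print(codes[mapped.isna()].tolist())    # ['XX']
```

With `replace`, an unconverted two-letter code silently stays in the column, so checking for NaN after a `map` is a simple way to list the codes that still need attention.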


In [14]:
# Preparing the geoviews dataset for plotting
mentions_ds = gv.Dataset(df_plot[['ActionGeo_Long', 'ActionGeo_Lat','Mentions']])
points = mentions_ds.to(gv.Points, ['ActionGeo_Long', 'ActionGeo_Lat'], ['Mentions'])

In [17]:
# Plotting
(gts.CartoMidnight * points.options(width=900, height=500, tools=['hover'], size_index=2, size=0.1, alpha=0.6, color=palette[2],cmap='YlOrBr'))

<img src="images/plot_03.png">

We can clearly see from this visualization that events located in countries like the U.S. are mentioned much more than events located elsewhere, for instance in South American countries.

#### 3.2.4 Example 4: Amount of mentions around the world ####

In this visualization we show the amount of mentions in the news sources for each country. We will use this kind of plot for the next milestone to show the bias of the news media.

In [20]:
# Defining the geolocations of different countries
path = geopandas.datasets.get_path('naturalearth_lowres')
df = geopandas.read_file(path)
df = df.rename(columns={'gdp_md_est': 'Mentions'})

In [21]:
# Adding the number of mentions to the dataframe
s = df_plot.set_index('ActionGeo_CountryCode')['Mentions']
df['Mentions']=df['iso_a3']
df['Mentions'] = df['Mentions'].replace(s)
df.Mentions = df.Mentions.apply(pd.to_numeric, downcast='integer', errors='coerce')
df = df.dropna()
df = df.drop(columns=['pop_est'])
df.head()

Unnamed: 0,continent,name,iso_a3,Mentions,geometry
0,Asia,Afghanistan,AFG,1.0,"POLYGON ((61.21081709172574 35.65007233330923,..."
1,Africa,Angola,AGO,1.0,(POLYGON ((16.32652835456705 -5.87747039146621...
2,Europe,Albania,ALB,4.0,"POLYGON ((20.59024743010491 41.85540416113361,..."
3,Asia,United Arab Emirates,ARE,1.0,"POLYGON ((51.57951867046327 24.24549713795111,..."
4,South America,Argentina,ARG,9.0,(POLYGON ((-65.50000000000003 -55.199999999999...
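
Since `df_plot` holds one row per (country, latitude, longitude) triple, the same country code can appear on several rows. If a single total per country is wanted, one option is to sum the mentions per country before attaching them to the country table. The sketch below uses toy stand-ins (`toy_plot` and `toy_countries` are hypothetical and only mimic the relevant columns).

```python
import pandas as pd

# Toy stand-in for df_plot: one row per (country, latitude, longitude)
toy_plot = pd.DataFrame({
    "ActionGeo_CountryCode": ["ABW", "ABW", "ATG"],
    "ActionGeo_Lat": [12.5, 12.5167, 17.0],
    "ActionGeo_Long": [-69.9667, -70.0333, -61.7667],
    "Mentions": [54, 1, 2],
})
# Toy stand-in for the country table loaded with geopandas
toy_countries = pd.DataFrame({"iso_a3": ["ABW", "ATG", "FRA"]})

# One total per country, then a lookup on the alpha-3 code
per_country = toy_plot.groupby("ActionGeo_CountryCode")["Mentions"].sum()
toy_countries["Mentions"] = toy_countries["iso_a3"].map(per_country)

print(toy_countries)
```

Countries without any mention (FRA in this toy example) end up with NaN and can then be dropped or set to zero, depending on how they should appear on the map.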


In [23]:
#plotting
#%%opts Polygons (cmap='Spectral')
plot_opts = dict(tools=['hover'], width=900, height=600, color_index='Mentions',
                 colorbar=True, toolbar='above', xaxis=None, yaxis=None)
gv.Polygons(df, vdims=['name', 'Mentions'], label='Number of mentions around the world for different countries').opts(plot=plot_opts).redim.range(Latitude=(-60, 90))

<img src="images/plot_04.png">
Again, we can see that the number of mentions for the U.S. is much higher than for other countries.

## 4. What comes next? ##

#### Summary ####

For this milestone we've experimented with a few ways to display the data using the information contained in the export and mentions datasets. We've been thinking of ways to merge the data and got a sense of the amount of data we will be handling. Here we only used subsets of the data to quickly get some insights. However, we will eventually need to deal with terabytes of data, therefore it will be more appropriate to use pyspark.

We've also removed the empty or redundant rows of the datasets. We've decided to limit ourselves to a few columns of the datasets that we knew we could use just by looking at their content. These few columns were already enough to make different kinds of visualizations, as presented in this notebook. However, we will need to do a proper study of the features, analysing their distributions and correlations. We will also need to look into the gkg dataset, which is the heaviest dataset and contains more information about the articles, such as their themes, which can be parsed and used to study news per theme.

#### Key Questions ####

With our work in this notebook we think we can answer one of our key questions for this project:
- How differently is international news covered depending on the country or the media?

By looking at the gkg dataset and using the themes of the article we will be able to answer the question below:
- Which category of news tends to be framed or less covered depending on the country or the media?

However, we will need some more work to be able to answer the questions below:
- How can we use the graph visualization to demonstrate the bias in the news sources?
- What conclusions can we draw about systematic bias in different countries by studying its media coverage?

Also given the data it will be hard to answer questions such as:
- How do media try to keep their credibility as a source of information while introducing bias in the news?


#### Plan for next milestone: ####

1. Further study of the features:

This notebook helped us understand how to aggregate the data and its potential by making simple visualizations using a subset of the features. The next step is to make an exhaustive description and a statistical analysis of all of the features contained in the datasets, including the gkg dataset.

2. Bias measurement:

One issue with our project is finding a way to measure bias. This can be done using the "Average Tone" or the "Goldstein Scale", but this isn't enough. It will be necessary to find a better measurement of bias. For instance, we could use either supervised or unsupervised machine learning algorithms to group news sources into clusters. This type of study will help us answer some of the questions we couldn't answer yet.

3. Visualizations:

We will have to come up with new visualizations and choose the best ones to answer our different key questions. In addition to maps, we will also need more diversified visualizations such as bar charts and graphs.

4. Answer to the key questions

Using the results of the previous step, we will need to analyse them and explain how bias can be noticed in international media coverage.

5. Create a website containing the data story

The idea is to aggregate everything that we did in this project in the form of a data story. We will create a website to showcase our visualizations and analysis, with each key question getting us closer to answering the main research question.

6. Create a poster for the presentation

We will use everything we have done during the project to synthesize the main results in a poster format.