# Milestone 3 - Media coverage of violent events in war zones

## 1. Hypothesis and Data retrieval

Modern media coverage of armed conflicts is at the center of many ethical discussions. A common criticism is that newspapers choose the topics of their articles arbitrarily and often have a strange sense of priority.

If such a claim were true, its consequences could be significant, given the importance of modern media in the daily lives of people in western countries. A lack of newspaper coverage of a conflict could lead people to think that the situation in a country at war has improved when it has not.

**Our hypothesis is that media interest in some countries declines over time even while those countries are still at war, and that the coverage is biased towards the views of western countries, in particular the U.S.A.**

To test this hypothesis we chose to use the GDELT 2.0 dataset.

We chose to analyse the media coverage of violent events in some specific countries (Afghanistan, Syria, Iraq, Pakistan and Mexico) over the years 2000 to 2016.

We used:

* *ActionGeo_CountryName*, the 2-character FIPS10-4 country code of the location where each event took place, to select the countries.
* [*EventRootCode*](http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf) values 18, 19 and 20, which correspond to the most violent event categories in the documentation (assault, fight, and unconventional mass violence).
* *MonthYear*, the date (month and year) of each event.
* *Events*, computed by our SQL query as the number of events per month.
* *Articles*, also computed by our SQL query, as the number of articles per month.

We also extracted data from the [**UCDP**](http://ucdp.uu.se/) dataset, keeping only our featured countries and the best estimate of the number of deaths per year.

## 2. Imports

In [4]:
import pandas as pd
import pyarrow.parquet as pq
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sbs
import numpy as np

## 3. Data preprocessing

### Processing of the UCDP dataset

In [5]:
# Yearly totals of articles and events, used later to normalize the monthly counts
articles_year = pd.read_csv('totart_code181920.csv')
articles_year = articles_year[(articles_year['Year'] >= 2000) & (articles_year['Year'] <= 2016)]

events_year = pd.read_csv('totevent_code181920.csv')
events_year = events_year[(events_year['Year'] >= 2000) & (events_year['Year'] <= 2016)]

events_year.head()

|    | Year | NumEvents |
|----|------|-----------|
| 21 | 2000 | 330047 |
| 22 | 2001 | 390659 |
| 23 | 2002 | 344684 |
| 24 | 2003 | 403647 |
| 25 | 2004 | 401826 |


In [6]:
# Getting deaths datasets (UCDP)

deaths = pd.read_csv('ged171.csv')
deaths = deaths[(deaths['year'] >= 2000) & (deaths['year'] <= 2016)]
deaths = deaths[['year', 'country', 'best']]
deaths.columns = ['Date', 'Country', 'Deaths']

Create one dataframe per country:

In [7]:
def deaths_country(_df, country_name):
    """Return the yearly death toll of one country, indexed by year."""
    x = _df[_df['Country'] == country_name].drop('Country', axis=1).groupby('Date').sum().reset_index()
    if country_name == 'Mexico':
        # The dataset has no rows for Mexico in 2000, 2001 and 2003 (no recorded deaths),
        # so we add explicit zero rows to keep the yearly series complete.
        missing = pd.DataFrame([[2000, 0], [2001, 0], [2003, 0]], columns=['Date', 'Deaths'])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        x = pd.concat([x, missing]).sort_values('Date')
    return x.set_index('Date')
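
The hard-coded Mexico rows could be generalized so that any country with missing years gets explicit zeros. The following is only a minimal sketch of that idea, assuming the same `Date`, `Country` and `Deaths` columns as above; the function name `deaths_country_filled`, its year parameters and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def deaths_country_filled(_df, country_name, first_year=2000, last_year=2016):
    """Yearly death toll for one country, with 0 for years absent from the data."""
    yearly = (_df[_df['Country'] == country_name]
              .groupby('Date')['Deaths'].sum())
    # reindex inserts every year of the range; years without rows get 0
    full_range = pd.RangeIndex(first_year, last_year + 1, name='Date')
    return yearly.reindex(full_range, fill_value=0).to_frame()

# Toy data for illustration only (not real UCDP figures)
toy = pd.DataFrame({
    'Date': [2002, 2002, 2004],
    'Country': ['Mexico', 'Mexico', 'Mexico'],
    'Deaths': [3, 4, 10],
})
toy_mex = deaths_country_filled(toy, 'Mexico', 2000, 2004)
```

With this variant the special case for Mexico would no longer be needed, at the cost of assuming that a missing year always means zero recorded deaths.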

In [None]:
deaths_irq = deaths_country(deaths, 'Iraq')
deaths_afg = deaths_country(deaths, 'Afghanistan')
deaths_mex = deaths_country(deaths, 'Mexico')
deaths_pak = deaths_country(deaths, 'Pakistan')

deaths_irq.head()

### Processing of our aggregation of the GDELT dataset

We first tried to get the data from the cluster, but the dataset available there was only a subset of GDELT, so most of the features were missing. To locate an event, we initially used the "Source" and "Target" fields that matched the country we were interested in, but those entries were not consistent.
The full dataset, by contrast, has an ActionGeo_Country code, which represents the country where the event took place, and gives access to features such as "MonthYear" and "EventRootCode".

For all those reasons we decided to stop using the cluster and to run our queries on the full GDELT dataset using [Google BigQuery](https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events).

The SQL queries can be found in the repository.

In [8]:
# Getting the aggregate dataset from our GDELT query (2000-2016)

df = pd.read_csv('big_query_2000_2016.csv')
df.columns = ['Country', 'Date', 'EventCode', 'Events', 'Articles']

# Date is encoded as YYYYMM, so integer division by 100 yields the year
tmp = pd.DataFrame({'Year': df['Date'] // 100})
tmp[['Articles', 'Events']] = df[['Articles', 'Events']]
# Attach the yearly totals (joined on Year) and express each count as a fraction of them
tmp = tmp.merge(articles_year).merge(events_year)
tmp['Articles'] = tmp['Articles'] / tmp['NumArticles']
tmp['Events'] = tmp['Events'] / tmp['NumEvents']

df[['Articles', 'Events']] = tmp[['Articles', 'Events']]

# Display the normalized values only
tmp.drop(['NumEvents', 'NumArticles', 'Year'], axis=1)

|   | Articles | Events |
|---|----------|--------|
| 0 | 5.236505e-03 | 0.004158 |
| 1 | 4.821809e-03 | 0.003789 |
| 2 | 4.537839e-03 | 0.004389 |
| 3 | 4.397721e-03 | 0.003578 |
| 4 | 4.289281e-03 | 0.003825 |
| 5 | 4.117429e-03 | 0.002609 |
| 6 | 3.901486e-03 | 0.003086 |
| 7 | 3.823644e-03 | 0.002495 |
| 8 | 3.594583e-03 | 0.002369 |
| 9 | 3.459495e-03 | 0.003180 |
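
The cell above divides each monthly count by the yearly total for the same year, so the values become fractions rather than raw counts. The same idea can be sketched on toy numbers with `Series.map`, which looks up the yearly total row by row and therefore keeps the original index and row order. This is only an illustration under assumptions: the `ArticleShare` column name and all values below are made up.

```python
import pandas as pd

# Toy monthly counts (YYYYMM dates) and yearly totals; all values are made up
monthly = pd.DataFrame({
    'Date': [200001, 200002, 200101],
    'Articles': [10, 30, 50],
})
yearly_totals = pd.DataFrame({'Year': [2000, 2001], 'NumArticles': [100, 200]})

# Look up each row's yearly total by year, keeping the original row order and index
year = monthly['Date'] // 100
totals = year.map(yearly_totals.set_index('Year')['NumArticles'])
monthly['ArticleShare'] = monthly['Articles'] / totals
```

A lookup of this kind avoids relying on the merged frame having exactly the same row order as the original one.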


The data that we get out of our query is the number of events and the number of articles, aggregated per country, MonthYear and EventCode.
Aggregating the data this way lets us work on a concise dataset (compared to the enormous original GDELT). However, we still keep the different event codes and months separate for further analysis.
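
The actual aggregation was done in SQL (see the repository). As a rough pandas equivalent, the aggregation described here might look like the sketch below. The event-level column names (`Country`, `MonthYear`, `EventRootCode`, `NumArticles`) and all values are assumptions chosen for illustration; they are not taken from our queries.

```python
import pandas as pd

# Toy event-level rows; column names and values are illustrative assumptions
events = pd.DataFrame({
    'Country': ['SY', 'SY', 'SY', 'IZ', 'SY'],
    'MonthYear': [201609, 201609, 201608, 201609, 201609],
    'EventRootCode': [19, 19, 19, 18, 4],
    'NumArticles': [5, 7, 2, 3, 9],
})

# Keep only the violent root codes, then count events and sum articles per group
violent = events[events['EventRootCode'].isin([18, 19, 20])]
aggregated = (violent
              .groupby(['Country', 'MonthYear', 'EventRootCode'])
              .agg(Events=('NumArticles', 'size'), Articles=('NumArticles', 'sum'))
              .reset_index())
```

Each output row then corresponds to one (country, month, event code) combination, which is the granularity of the dataframe loaded above.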

In [9]:
df.head()

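|   | Country | Date | EventCode | Events | Articles |
|---|---------|------|-----------|--------|----------|
| 0 | SY | 201609 | 19 | 0.004158 | 0.005237 |
| 1 | SY | 201608 | 19 | 0.003789 | 0.004822 |
| 2 | SY | 201602 | 19 | 0.004389 | 0.004538 |
| 3 | SY | 201510 | 19 | 0.003578 | 0.004398 |
| 4 | SY | 201612 | 19 | 0.003825 | 0.004289 |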


In [10]:
def df_country(_df, country):
    return _df[_df.Country == country].drop('Country', axis=1).set_index(['Date'])
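
As a usage illustration, `df_country` can be combined with a `groupby` on the date index to collapse the event codes into one value per month. This is a sketch on toy rows that mimic the columns of `df`; the values and the `syria_monthly` name are hypothetical.

```python
import pandas as pd

def df_country(_df, country):
    return _df[_df.Country == country].drop('Country', axis=1).set_index(['Date'])

# Toy rows with the same columns as df; values are made up
toy = pd.DataFrame({
    'Country': ['SY', 'SY', 'SY', 'AF'],
    'Date': [201608, 201609, 201609, 201609],
    'EventCode': [19, 19, 18, 18],
    'Events': [0.0038, 0.0042, 0.0010, 0.0020],
    'Articles': [0.0048, 0.0052, 0.0015, 0.0030],
})

syria = df_country(toy, 'SY')
# Sum over event codes to get one row per month
syria_monthly = syria.groupby(level='Date')[['Events', 'Articles']].sum()
```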

## 4. Analysis

## 5. Results