# Exploration of GDELT Dataset

## Executive Summary

< Problem statement>

< Introduce dataset>

< Introduce models used>

< Present important results>

< Communicate relevant insights>

## Introduction

< Stated the source of the data and introduced it to provide a complete
context for understanding the rest of the report>

< Has an explicit and clearly stated problem statement>

< Has an explicit and convincing motivation statement>

## Data Collection and Description

< Described the data format and provided relevant metadata that
allows the reader to clearly understand the data processing code>

In [2]:
# importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import glob
import dask.dataframe as dd
import dask.bag as db
from dask.delayed import delayed
from dask.distributed import Client
# joblib is imported directly; sklearn.externals.joblib was removed from scikit-learn
from joblib import parallel_backend
from dask.diagnostics import ProgressBar

## Exploratory Data Analysis

< Performed a complete exploratory data analysis that is fully explained>

< All figures are properly labeled and captioned>

## Methodology

< Complete methodology stated, described and justified>

## Results

< Results presented are clean, robust and relevant, allowing a strong
answer or solution to the stated problem>

< Explicit answer to the stated problem, with interesting insights
that can be traced back to the methods and results >

## References

## Acknowledgements

## Test Code

In [6]:
# connect to the Dask scheduler so that later Dask commands run on this cluster
client = Client('10.233.29.219:8786')

In [7]:
# check the contents of the folder
path = '/mnt/data/public/gdeltv2/*'
glob.glob(path)

['/mnt/data/public/gdeltv2/20170101190000.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170131134500.gkg.csv.zip',
 '/mnt/data/public/gdeltv2/20170210224500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170213203000.gkg.csv.zip',
 '/mnt/data/public/gdeltv2/20170215044500.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170225004500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170407133000.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170412044500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170418044500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170427171500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170506113000.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170511183000.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170516070000.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170518223000.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170528044500.mentions.CSV.zip',
 '/mnt/data/public/gdeltv2/20170528213000.export.CSV.zip',
 '/mnt/data/public/gdeltv2/20170602141500.gkg.
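
The listing mixes three kinds of files, and each name appears to follow the pattern `<timestamp>.<kind>.csv.zip`, with the extension case varying between kinds. A minimal sketch of how the names could be split and counted is shown below. The helper `split_gdelt_name` is a hypothetical name, and the `YYYYMMDDHHMMSS` timestamp layout is an assumption inferred from the file names.

```python
from collections import Counter
from datetime import datetime
from pathlib import PurePosixPath


def split_gdelt_name(path):
    """Return (timestamp, kind) for a name like '20170215044500.export.CSV.zip'."""
    stamp, kind = PurePosixPath(path).name.split('.')[:2]
    # assumption: the leading digits are YYYYMMDDHHMMSS
    return datetime.strptime(stamp, '%Y%m%d%H%M%S'), kind.lower()


# a few paths copied from the listing above
sample_paths = [
    '/mnt/data/public/gdeltv2/20170101190000.mentions.CSV.zip',
    '/mnt/data/public/gdeltv2/20170131134500.gkg.csv.zip',
    '/mnt/data/public/gdeltv2/20170215044500.export.CSV.zip',
    '/mnt/data/public/gdeltv2/20170506113000.export.CSV.zip',
]

# count how many files of each kind are present
kind_counts = Counter(split_gdelt_name(p)[1] for p in sample_paths)
print(kind_counts)
```

Lower-casing the kind keeps the count independent of the `CSV` versus `csv` spelling in the extensions.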

In [9]:
# we see three kinds of files
# let's open the contents one by one

# we define sample sets
f1 = ['/mnt/data/public/gdeltv2/20170611004500.mentions.CSV.zip']
f2 = ['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']
f3 = ['/mnt/data/public/gdeltv2/20170611004500.gkg.csv.zip']

# we import the progress bar
# pbar = ProgressBar()
# pbar.register()

# we load mentions.CSV.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', engine='python') for fn in f1]
df = dd.from_delayed(dfs)
print(f1)
display(df.head().T)

# we load export.CSV.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', engine='python') for fn in f2]
df = dd.from_delayed(dfs)
print(f2)
display(df.head().T)

# we load gkg.csv.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', engine='python') for fn in f3]
df = dd.from_delayed(dfs)
print(f3)
display(df.head().T)

['/mnt/data/public/gdeltv2/20170611004500.mentions.CSV.zip']


Unnamed: 0,0,1,2,3,4
0,549285546,549255686,549316095,549316095,663766551
1,20160611023000,20160611001500,20160611053000,20160611053000,20170611004500
2,20170611004500,20170611004500,20170611004500,20170611004500,20170611004500
3,1,1,1,1,1
4,nbcnews.com,nbcnews.com,cbs8.com,kaaltv.com,nbcnews.com
5,http://www.nbcnews.com/news/us-news/jeff-sessi...,http://www.nbcnews.com/news/us-news/jeff-sessi...,http://www.cbs8.com/story/35635737/london-brid...,http://www.KAALtv.com/world/london-bridge-atta...,http://www.nbcnews.com/storyline/isis-uncovere...
6,28,28,6,6,10
7,-1,-1,1174,1149,2894
8,5102,5102,-1,-1,2922
9,5060,5060,1122,1097,2909


['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']


Unnamed: 0,0,1,2,3,4
0,663766551,663766552,663766553,663766554,663766555
1,20160611,20160611,20160611,20160611,20160611
2,201606,201606,201606,201606,201606
3,2016,2016,2016,2016,2016
4,2016.4411,2016.4411,2016.4411,2016.4411,2016.4411
5,REB,USA,USA,USA,USA
6,SUICIDE BOMBER,UNITED STATES,THE US,UNITED STATES,UNITED STATES
7,,USA,USA,USA,USA
8,,,,,
9,,,,,


['/mnt/data/public/gdeltv2/20170611004500.gkg.csv.zip']


Unnamed: 0,0,1,2,3,4
0,20170611004500-0,20170611004500-1,20170611004500-2,20170611004500-3,20170611004500-4
1,20170611004500,20170611004500,20170611004500,20170611004500,20170611004500
2,1,1,1,1,1
3,dailyexcelsior.com,thebusinessjournal.com,therepublic.com,echonews.com.au,borderwatch.com.au
4,http://www.dailyexcelsior.com/ddc-kupwara-insp...,http://www.thebusinessjournal.com/news/constru...,http://www.therepublic.com/2017/06/10/us-comey...,https://www.echonews.com.au/news/going-electri...,http://www.borderwatch.com.au/story/4721481/pe...
5,,,,,
6,,,,,
7,GENERAL_HEALTH;MEDICAL;SOC_POINTSOFINTEREST;SO...,MANMADE_DISASTER_IMPLIED;TAX_FNCACT;TAX_FNCACT...,TAX_FNCACT;TAX_FNCACT_DIRECTOR;LEADER;TAX_FNCA...,MANMADE_DISASTER_IMPLIED;,
8,"ECON_STOCKMARKET,1471;TAX_FNCACT_PARAMEDICS,60...","TAX_FNCACT_BUILDER,675;ECON_HOUSING_PRICES,533...","TAX_FNCACT_DEPUTY,4410;TAX_POLITICAL_PARTY_REP...","MANMADE_DISASTER_IMPLIED,224;MANMADE_DISASTER_...",
9,"4#Kupwara, Jammu And Kashmir, India#IN#IN12#34...","2#California, United States#US#USCA#36.17#-119...","2#New York, United States#US#USNY#42.1497#-74....","4#Ewingsdale, New South Wales, Australia#AS#AS...","4#Penrith, New South Wales, Australia#AS#AS02#..."
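
The reads above pass `delimiter='\t'` and `header=None` because each archive holds a tab-delimited text file without a header row (pandas can only read a zip archive directly when it contains a single data file). The sketch below illustrates that layout with the standard library only. It builds a tiny in-memory archive, so the member name `sample.export.CSV` and the two shortened rows are hypothetical stand-ins rather than real GDELT content.

```python
import csv
import io
import zipfile

# build a tiny in-memory archive that mimics the layout (hypothetical content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('sample.export.CSV',
                     '663766551\t20160611\t201606\n'
                     '663766552\t20160611\t201606\n')

# read it back: no header row, fields separated by tabs
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member = archive.namelist()[0]
    with archive.open(member) as raw:
        text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
        rows = list(csv.reader(text, delimiter='\t'))

print(rows[0])
```

Every value comes back as a string, which matches the effect of `dtype='str'` in the pandas calls.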


In [10]:
# We see that the files have no header row, so the columns are unnamed
# We took the column names from the GDELT website

events_columns = ['GlobalEventID', 'Day', 'MonthYear', 'Year', 'FractionDate',
                  'Actor1Code', 'Actor1Name', 'Actor1CountryCode',
                  'Actor1KnownGroupCode', 'Actor1EthnicCode',
                  'Actor1Religion1Code', 'Actor1Religion2Code',
                  'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code',
                  'Actor2Code', 'Actor2Name', 'Actor2CountryCode',
                  'Actor2KnownGroupCode', 'Actor2EthnicCode',
                  'Actor2Religion1Code', 'Actor2Religion2Code',
                  'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code',
                  'IsRootEvent', 'EventCode', 'EventBaseCode',
                  'EventRootCode', 'QuadClass', 'GoldsteinScale',
                  'NumMentions', 'NumSources', 'NumArticles', 'AvgTone',
                  'Actor1Geo_Type', 'Actor1Geo_Fullname',
                  'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code',
                  'Actor1Geo_ADM2Code', 'Actor1Geo_Lat', 'Actor1Geo_Long',
                  'Actor1Geo_FeatureID', 'Actor2Geo_Type',
                  'Actor2Geo_Fullname', 'Actor2Geo_CountryCode',
                  'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code',
                  'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID',
                  'ActionGeo_Type', 'ActionGeo_Fullname',
                  'ActionGeo_CountryCode', 'ActionGeo_ADM1Code',
                  'ActionGeo_ADM2Code', 'ActionGeo_Lat', 'ActionGeo_Long',
                  'ActionGeo_FeatureID', 'DATEADDED', 'SOURCEURL']
gkg_columns = ['GKGRECORDID', 'V2.1DATE', 'V2SOURCECOLLECTIONIDENTIFIER', 'V2SOURCECOMMONNAME',
               'V2DOCUMENTIDENTIFIER', 'V1COUNTS', 'V2.1COUNTS', 'V1THEMES', 'V2ENHANCEDTHEMES',
               'V1LOCATIONS', 'V2ENHANCEDLOCATIONS', 'V1PERSONS', 'V2ENHANCEDPERSONS',
               'V1ORGANIZATIONS', 'V2ENHANCEDORGANIZATIONS', 'V1.5TONE', 'V2.1ENHANCEDDATES',
               'V2GCAM', 'V2.1SHARINGIMAGE', 'V2.1RELATEDIMAGES', 'V2.1SOCIALIMAGEEMBEDS',
               'V2.1SOCIALVIDEOEMBEDS', 'V2.1QUOTATIONS', 'V2.1ALLNAMES', 'V2.1AMOUNTS',
               'V2.1TRANSLATIONINFO', 'V2EXTRASXML']
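
The lists above hold 61 names for events and 27 for GKG. Names are matched to fields purely by position, so a missing or extra name would shift every label after it. A minimal sketch of a sanity check follows; `check_names` is a hypothetical helper, and the example uses only the first five event columns with values from the sample output.

```python
def check_names(sample_line, names, delimiter='\t'):
    """Label the fields of one raw line, or raise ValueError on a mismatch."""
    fields = sample_line.rstrip('\n').split(delimiter)
    if len(fields) != len(names):
        raise ValueError(f'{len(fields)} fields but {len(names)} names')
    if len(set(names)) != len(names):
        raise ValueError('duplicate column names')
    return dict(zip(names, fields))


# first five events columns and the matching values from the sample output
names = ['GlobalEventID', 'Day', 'MonthYear', 'Year', 'FractionDate']
line = '663766551\t20160611\t201606\t2016\t2016.4411\n'

labelled = check_names(line, names)
print(labelled)
```

Running such a check on the first line of one file per kind before the full load is cheap compared with discovering shifted columns later.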

In [14]:
# We load the datasets again with columns defined above
# We skip mentions since we won't need those data

# we define sample sets
f2 = ['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']
f3 = ['/mnt/data/public/gdeltv2/20170611004500.gkg.csv.zip']

# we import the progress bar
# pbar = ProgressBar()
# pbar.register()

# we load export.CSV.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', names=events_columns, engine='python') for fn in f2]
df = dd.from_delayed(dfs)
print(f2)
display(df.head().T)

# we load gkg.csv.zip into a delayed Pandas dataframe

dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', names=gkg_columns, engine='python') for fn in f3]
df = dd.from_delayed(dfs)
print(f3)
display(df.head().T)

['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']


Unnamed: 0,0,1,2,3,4
GlobalEventID,663766551,663766552,663766553,663766554,663766555
Day,20160611,20160611,20160611,20160611,20160611
MonthYear,201606,201606,201606,201606,201606
Year,2016,2016,2016,2016,2016
FractionDate,2016.4411,2016.4411,2016.4411,2016.4411,2016.4411
Actor1Code,REB,USA,USA,USA,USA
Actor1Name,SUICIDE BOMBER,UNITED STATES,THE US,UNITED STATES,UNITED STATES
Actor1CountryCode,,USA,USA,USA,USA
Actor1KnownGroupCode,,,,,
Actor1EthnicCode,,,,,


['/mnt/data/public/gdeltv2/20170611004500.gkg.csv.zip']


Unnamed: 0,0,1,2,3,4
GKGRECORDID,20170611004500-0,20170611004500-1,20170611004500-2,20170611004500-3,20170611004500-4
V2.1DATE,20170611004500,20170611004500,20170611004500,20170611004500,20170611004500
V2SOURCECOLLECTIONIDENTIFIER,1,1,1,1,1
V2SOURCECOMMONNAME,dailyexcelsior.com,thebusinessjournal.com,therepublic.com,echonews.com.au,borderwatch.com.au
V2DOCUMENTIDENTIFIER,http://www.dailyexcelsior.com/ddc-kupwara-insp...,http://www.thebusinessjournal.com/news/constru...,http://www.therepublic.com/2017/06/10/us-comey...,https://www.echonews.com.au/news/going-electri...,http://www.borderwatch.com.au/story/4721481/pe...
V1COUNTS,,,,,
V2.1COUNTS,,,,,
V1THEMES,GENERAL_HEALTH;MEDICAL;SOC_POINTSOFINTEREST;SO...,MANMADE_DISASTER_IMPLIED;TAX_FNCACT;TAX_FNCACT...,TAX_FNCACT;TAX_FNCACT_DIRECTOR;LEADER;TAX_FNCA...,MANMADE_DISASTER_IMPLIED;,
V2ENHANCEDTHEMES,"ECON_STOCKMARKET,1471;TAX_FNCACT_PARAMEDICS,60...","TAX_FNCACT_BUILDER,675;ECON_HOUSING_PRICES,533...","TAX_FNCACT_DEPUTY,4410;TAX_POLITICAL_PARTY_REP...","MANMADE_DISASTER_IMPLIED,224;MANMADE_DISASTER_...",
V1LOCATIONS,"4#Kupwara, Jammu And Kashmir, India#IN#IN12#34...","2#California, United States#US#USCA#36.17#-119...","2#New York, United States#US#USNY#42.1497#-74....","4#Ewingsdale, New South Wales, Australia#AS#AS...","4#Penrith, New South Wales, Australia#AS#AS02#..."


In [20]:
# We are ready to load a larger dataset

# we define the sample set: one month of files (January 2017)
f_events = glob.glob('/mnt/data/public/gdeltv2/201701*.export.CSV.zip')
f_gkg = glob.glob('/mnt/data/public/gdeltv2/201701*.gkg.csv.zip')

# we import the progress bar
# pbar = ProgressBar()
# pbar.register()

# we load export.CSV.zip into a delayed Pandas dataframe

dfs_events = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', names=events_columns, engine='python') for fn in f_events]
df_events = dd.from_delayed(dfs_events)
print('EVENTS')
display(df_events.head().T)

# we load gkg.csv.zip into a delayed Pandas dataframe

dfs_gkg = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                            dtype='str', names=gkg_columns, engine='python') for fn in f_gkg]
df_gkg = dd.from_delayed(dfs_gkg)
print('GLOBAL KNOWLEDGE GRAPH')
display(df_gkg.head().T)

EVENTS


Unnamed: 0,0,1,2,3,4
GlobalEventID,619124482,619124483,619124484,619124485,619124486
Day,20160121,20160121,20160121,20160121,20160121
MonthYear,201601,201601,201601,201601,201601
Year,2016,2016,2016,2016,2016
FractionDate,2016.0575,2016.0575,2016.0575,2016.0575,2016.0575
Actor1Code,,AZE,AZE,CAN,DEUGOVMED
Actor1Name,,AZERBAIJAN,AZERBAIJAN,CANADA,GERMANY
Actor1CountryCode,,AZE,AZE,CAN,DEU
Actor1KnownGroupCode,,,,,
Actor1EthnicCode,,,,,


GLOBAL KNOWLEDGE GRAPH


Unnamed: 0,0,1,2,3,4
GKGRECORDID,20170131134500-0,20170131134500-1,20170131134500-2,20170131134500-3,20170131134500-4
V2.1DATE,20170131134500,20170131134500,20170131134500,20170131134500,20170131134500
V2SOURCECOLLECTIONIDENTIFIER,2,2,2,2,2
V2SOURCECOMMONNAME,BBC Monitoring,BBC Monitoring,BBC Monitoring,BBC Monitoring,BBC Monitoring
V2DOCUMENTIDENTIFIER,"president.ir website, Tehran/BBC Monitoring/(c...","Voice of the Islamic Republic of Iran website,...",/BBC Monitoring/(c) BBC,The Moscow Times in English /BBC Monitoring/(c...,"Website of Ekho Kavkaza, Radio Liberty/Radio F..."
V1COUNTS,,,,,
V2.1COUNTS,,,,,
V1THEMES,LEADER;TAX_FNCACT;TAX_FNCACT_PRESIDENT;USPEC_P...,TAX_ETHNICITY;TAX_ETHNICITY_KYRGYZ;TAX_ETHNICI...,LEADER;TAX_FNCACT;TAX_FNCACT_PRESIDENT;USPEC_P...,MILITARY;CYBER_ATTACK;USPEC_POLITICS_GENERAL1;...,CONSTITUTIONAL;TAX_ETHNICITY;TAX_ETHNICITY_GEO...
V2ENHANCEDTHEMES,"ELECTION,633;ELECTION,740;GENERAL_GOVERNMENT,1...","GENERAL_GOVERNMENT,224;EPU_POLICY_GOVERNMENT,2...","WB_678_DIGITAL_GOVERNMENT,48;WB_678_DIGITAL_GO...","GENERAL_GOVERNMENT,1261;EPU_POLICY_GOVERNMENT,...","EXILE,5179;EXILE,8480;TAX_FNCACT_CHAIRMAN,765;..."
V1LOCATIONS,"4#Alborz, Markazi, Iran#IR#IR34#34.0048#49.390...","5#Sughd, Leninobod, Tajikistan#TI#TI03#40#69#-...","4#Beirut, Beyrouth, Lebanon#LE#LE04#33.8719#35...","4#Moscow, Moskva, Russia#RS#RS48#55.7522#37.61...",1#Georgia#GG#GG#42#43.5#GG;1#China#CH#CH#35#10...
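
Several GKG columns pack many values into one string. In the sample output, `V1THEMES` is separated by `;`, `V2ENHANCEDTHEMES` pairs each theme with a number after a comma (understood here to be a character offset in the article), and each `V1LOCATIONS` entry has seven parts separated by `#`. The sketch below shows one way to unpack them. The function names and the labels in `LOCATION_FIELDS` are this sketch's own assumptions, not names from GDELT.

```python
def parse_v1_themes(field):
    """Split a ';'-separated theme string, dropping empty pieces."""
    return [theme for theme in field.split(';') if theme]


def parse_v2_themes(field):
    """Split 'THEME,number;...' into (theme, number) pairs."""
    pairs = []
    for item in field.split(';'):
        if item:
            theme, _, number = item.rpartition(',')
            pairs.append((theme, int(number)))
    return pairs


# labels are this sketch's own names for the seven '#'-separated parts
LOCATION_FIELDS = ('geo_type', 'full_name', 'country_code', 'adm1_code',
                   'lat', 'long', 'feature_id')


def parse_v1_locations(field):
    """Turn each complete 'a#b#c#d#e#f#g' block into a dict of strings."""
    locations = []
    for block in field.split(';'):
        parts = block.split('#')
        if len(parts) == len(LOCATION_FIELDS):
            locations.append(dict(zip(LOCATION_FIELDS, parts)))
    return locations


# strings taken from the untruncated parts of the sample output
print(parse_v1_themes('MANMADE_DISASTER_IMPLIED;'))
print(parse_v2_themes('ELECTION,633;ELECTION,740'))
print(parse_v1_locations('1#Georgia#GG#GG#42#43.5#GG'))
```

The values stay as strings to match `dtype='str'`; blocks that do not have exactly seven parts are skipped rather than guessed at.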


In [21]:
type(df_events), type(df_gkg)

(dask.dataframe.core.DataFrame, dask.dataframe.core.DataFrame)
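
Both objects are Dask dataframes, but every column was read with `dtype='str'`, so dates and numbers still need converting before any arithmetic or time-based grouping. A minimal pandas sketch of that conversion is shown below, using three columns and two rows from the EVENTS sample. The choice of columns and the `%Y%m%d` format for `Day` are assumptions based on the displayed values.

```python
import pandas as pd

# three columns and two rows taken from the EVENTS sample; all strings
events = pd.DataFrame({
    'GlobalEventID': ['619124482', '619124483'],
    'Day': ['20160121', '20160121'],
    'FractionDate': ['2016.0575', '2016.0575'],
})

# convert each column to a more useful type
typed = events.assign(
    GlobalEventID=pd.to_numeric(events['GlobalEventID']),
    Day=pd.to_datetime(events['Day'], format='%Y%m%d'),
    FractionDate=pd.to_numeric(events['FractionDate']),
)
print(typed.dtypes)
```

On the Dask dataframe, the same conversion could be wrapped in a function and applied per partition with `map_partitions`, so that it stays lazy until a result is requested.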

In [26]:
# how many entries in events for 1 month
len(df_events)

6154996
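
This count only covers the files that the glob matched. The timestamps in the folder listing all fall on quarter hours, which suggests one file per kind every 15 minutes, or 31 × 96 = 2976 files per kind for a complete January. Under that assumption, the sketch below looks for missing intervals; `missing_slots` is a hypothetical helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def missing_slots(stamps, step=timedelta(minutes=15)):
    """Return the expected timestamps between the first and last that are absent."""
    present = set(stamps)
    current, last = min(present), max(present)
    missing = []
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# hypothetical timestamps: the 00:30 file is absent
stamps = [datetime(2017, 1, 1, 0, 0), datetime(2017, 1, 1, 0, 15),
          datetime(2017, 1, 1, 0, 45)]
print(missing_slots(stamps))
```

Feeding it the timestamps parsed from `f_events` would show whether the row count reflects a full month or a month with gaps.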

In [28]:
# how many entries in gkg for 1 month
# len(df_gkg)