# GDELT Python Script

Run each cell from top to bottom to execute the code. You might need to copy this file to your private Google Drive, as I am not sure whether we can run the same file in parallel. You might also first need to link the AI Capstone folder to your private Drive so that the code can access it. To do this, right-click the Capstone folder and create a shortcut in "My Drive". The runtime type can be changed by clicking *Runtime -> Change runtime type* in the top menu and selecting the desired hardware accelerator. I think setting the accelerator to TPU leads to the fastest performance in our case.
When you are done, you can close the session to unmount your Drive by clicking *Runtime -> Manage sessions* and terminating the session.


---



Mount the Google Drive folder by running this cell. Click the link and sign in with your Google account to get the access token, then enter the token into the prompt in the code cell.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Install the `gdelt` package into the runtime. This step has to be done every time you open this script, as the runtime is set up from scratch each time.

In [None]:
!pip install gdelt

Import the Python libraries necessary for the next steps.

In [1]:
import gdelt
import pandas as pd
import time
from pathlib import Path
from dateutil import relativedelta
from datetime import timedelta, date, datetime
from concurrent.futures import ProcessPoolExecutor

This function is called by `parallel_gdelt` and loads the data for a single day from the Global Knowledge Graph (GKG) of GDELT. Currently the filters keep a row if the location is Germany AND the source domain (`SourceCommonName`) ends in .de, .net or .com AND either the article URL (`DocumentIdentifier`) contains "brauer" OR the field `AllNames` contains "brewery" or "breweries". Rows whose `V2Persons` field contains "brauer" are excluded, presumably so that articles about people named Brauer do not match. Information about the data fields and column names can be found in the GKG documentation. The function returns a pandas DataFrame containing the filtered GDELT data, or `None` if the day could not be loaded. We are using version 2 of the GKG, which contains data from February 2015 onwards, because this version provides articles in other languages (German in our case); this is why `translation` is set to True in the search statement.

> [Link to GKG documentation](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf)

In [2]:
def load_single_date(single_date):
  gd = gdelt.gdelt()
  try:
    gkg = gd.Search(single_date.strftime("%Y-%m-%d"), table='gkg', output='df', coverage=True, translation=True)
    result = gkg[gkg['Locations'].str.contains("Germany", na=False, case=False)
              # Source domain ends in .de, .net or .com (dots escaped so they match literally)
              & gkg['SourceCommonName'].str.contains(r"\.(?:de|net|com)$", na=False, case=False)
              # Article is about breweries
              & (gkg['DocumentIdentifier'].str.contains("brauer", na=False, case=False)
              | gkg['AllNames'].str.contains("brewery", na=False, case=False)
              | gkg['AllNames'].str.contains("breweries", na=False, case=False))
              # Exclude rows where "brauer" appears as a person's name
              & ~gkg['V2Persons'].str.contains("brauer", na=False, case=False)
              ]
    return result
  except Exception as exc:
    # Report the failed day and return None so that callers can skip it
    print(f"Could not load {single_date:%Y-%m-%d}: {exc}")
    return None
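
To see how the combined filter behaves, here is a minimal sketch that applies the same kind of boolean mask to a small hand-made DataFrame. The rows are invented for illustration and are not real GDELT records; only the column names are taken from the function above. The helper name `filter_brewery_rows` is hypothetical, and the anchored regular expression with escaped dots is just one way to express "ends in .de, .net or .com".

```python
import pandas as pd

# Invented sample rows; only the column names mirror the GKG table
toy = pd.DataFrame({
    "Locations": ["Berlin, Germany", "Munich, Germany", "Paris, France",
                  "Hamburg, Germany", None, "Cologne, Germany"],
    "SourceCommonName": ["example.de", "example.com", "example.de",
                         "example.de", "example.de", "example.org"],
    "DocumentIdentifier": [
        "https://example.de/brauerei-eroeffnet",
        "https://example.com/local-news",
        "https://example.de/brauerei-eroeffnet",
        "https://example.de/interview-brauer",
        "https://example.de/brauerei-eroeffnet",
        "https://example.org/brauerei-eroeffnet",
    ],
    "AllNames": ["Town Hall", "Example Brewery", "Town Hall",
                 "Town Hall", "Town Hall", "Town Hall"],
    "V2Persons": ["", "", "", "Hans Brauer", "", ""],
})


def filter_brewery_rows(frame):
    """Keep German brewery articles from .de/.net/.com sources (hypothetical helper)."""
    in_germany = frame["Locations"].str.contains("Germany", na=False, case=False)
    # Escaped dot and "$" anchor: the domain must literally end in one of the suffixes
    wanted_domain = frame["SourceCommonName"].str.contains(
        r"\.(?:de|net|com)$", na=False, case=False
    )
    about_beer = (
        frame["DocumentIdentifier"].str.contains("brauer", na=False, case=False)
        | frame["AllNames"].str.contains("brewery", na=False, case=False)
        | frame["AllNames"].str.contains("breweries", na=False, case=False)
    )
    person_named_brauer = frame["V2Persons"].str.contains("brauer", na=False, case=False)
    return frame[in_germany & wanted_domain & about_beer & ~person_named_brauer]


matches = filter_brewery_rows(toy)
print(list(matches.index))  # [0, 1]
```

Row 3 is dropped because "Brauer" appears as a person, row 4 because its location is missing (`na=False` treats missing values as no match), and row 5 because of its .org domain.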

This function runs `load_single_date` for a specified date range (in our case one month). It uses multiprocessing to load multiple days at once. This is not very effective, as we can only run two days at a time on Google Colab, but it at least provides some parallelization. The GDELT data for the month is collected in a DataFrame and saved to Google Drive as a CSV file whose name contains the year and the month of the downloaded data.

In [None]:
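def parallel_gdelt(start_date):
  # First day of the following month, capped at today
  end_date = start_date + relativedelta.relativedelta(months=1, day=1)
  end_date = min(end_date, date.today())
  output_file = Path(f'/content/drive/My Drive/AI Capstone/GDELT/Data/brauerei_{start_date.year}_{start_date.month}.csv')
  # Every day from start_date up to, but not including, end_date
  days = pd.date_range(start_date, end_date)[:-1]

  # The context manager shuts the worker processes down once all days are loaded
  with ProcessPoolExecutor() as executor:
    # Skip days for which load_single_date returned None
    df_list = [df for df in executor.map(load_single_date, days) if df is not None]
  if not df_list:
    print(f"No data loaded for {start_date.year}-{start_date.month}")
    return
  df = pd.concat(df_list, ignore_index=True)
  df.to_csv(output_file)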
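
The date window used above can be checked in isolation. The following sketch assumes a hypothetical helper `month_days` with an explicit `today` argument (the notebook uses the current date instead), so the result is reproducible.

```python
from datetime import date

import pandas as pd
from dateutil import relativedelta


def month_days(start_date, today):
    """Days from start_date to the end of its month, capped at today (hypothetical helper)."""
    # months=1 adds one month and day=1 pins the day to the first,
    # which gives the first day of the following month
    end_date = start_date + relativedelta.relativedelta(months=1, day=1)
    end_date = min(end_date, today)
    # date_range includes both endpoints, so the last entry is dropped
    return pd.date_range(start_date, end_date)[:-1]


days = month_days(date(2016, 2, 1), today=date(2020, 1, 1))
print(len(days), days[0].date(), days[-1].date())  # 29 2016-02-01 2016-02-29
```

Note that when the month is still running, the cap means the current day itself is not included in the range.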


In [None]:
def load_gdelt_csv_append(start_date):
  # Sequential alternative to parallel_gdelt: each day is appended to the CSV
  # as soon as it is loaded, so only one day's data is held in memory at a time
  end_date = start_date + relativedelta.relativedelta(months=1, day=1)
  if end_date > datetime.now().date():
    end_date = datetime.now().date()
  output_file = Path(f'/content/drive/My Drive/AI Capstone/GDELT/Data/brauerei_{start_date.year}_{start_date.month}.csv')
  for i in pd.date_range(start_date, end_date)[:-1]:
    result = load_single_date(i)
    if result is None:
      # load_single_date returns None when a day could not be loaded
      continue
    if output_file.is_file():
      result.to_csv(output_file, mode='a', header=False)
    else:
      result.to_csv(output_file)
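
The append pattern used above (write the header only when the file does not exist yet) can be tried out without GDELT. This is a minimal sketch with invented data and a hypothetical helper `append_to_csv`; unlike the function above, it passes `index=False` so that the row index is not written to the file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def append_to_csv(frame, output_file):
    """Append frame to output_file, writing the header only for a new file."""
    is_new = not output_file.is_file()
    frame.to_csv(output_file, mode="a", header=is_new, index=False)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    # Two invented "days" written one after the other
    append_to_csv(pd.DataFrame({"day": [1], "hits": [3]}), csv_path)
    append_to_csv(pd.DataFrame({"day": [2], "hits": [5]}), csv_path)
    combined = pd.read_csv(csv_path)

print(combined.shape)  # (2, 2)
```

Because each chunk is written as soon as it arrives, memory use stays at roughly one day's worth of data, at the cost of losing the parallel download.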

Run one of the next three code blocks. The first loads the data for just one month, the second iterates over the months of one year, and the third uses `load_gdelt_csv_append` to load one month while appending each day to the CSV file, which is meant to work around the RAM issue. When changing the date, only change the year and the month and leave the day as the first of the month. The warnings about GDELT not returning data for a specific time can be ignored: GDELT tries to access data in 15-minute intervals for each day, and not every interval returns new data.

In [None]:
# single month
start_time = time.time()
start_date = date(2015, 3, 1)  # GKG version 2 data only starts in February 2015
parallel_gdelt(start_date)
end_time = time.time()
print(end_time - start_time)

In [None]:
# whole year (here: March to December 2015)
for month in range(3, 13):
  start_date = date(2015, month, 1)
  parallel_gdelt(start_date)
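
If the range should span more than one calendar year, the start dates can also be generated instead of looping over month numbers. This is a sketch with a hypothetical helper `month_starts`; it relies on the pandas frequency alias `"MS"` (month start).

```python
from datetime import date

import pandas as pd


def month_starts(first, last):
    """First day of every month from first to last, inclusive, as date objects."""
    return [ts.date() for ts in pd.date_range(first, last, freq="MS")]


starts = month_starts(date(2015, 3, 1), date(2015, 12, 1))
print(len(starts), starts[0], starts[-1])  # 10 2015-03-01 2015-12-01
```

Each entry could then be passed to `parallel_gdelt` in the same way as `start_date` in the loop above.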

In [None]:
# hopefully fixes the RAM issue, at least as a temporary solution
start_time = time.time()
start_date = date(2018, 3, 1)
load_gdelt_csv_append(start_date)
end_time = time.time()
print(end_time - start_time)
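
As a rough illustration of why so many warnings can appear, the sketch below counts the 15-minute intervals in one day. The helper name `quarter_hour_slots` is hypothetical, and the code only shows the arithmetic; it does not reproduce how the `gdelt` package requests the data.

```python
import pandas as pd


def quarter_hour_slots(day):
    """The 15-minute timestamps that fall within one calendar day."""
    return pd.date_range(day, periods=24 * 4, freq="15min")


slots = quarter_hour_slots("2018-03-01")
print(len(slots), slots[0], slots[-1])  # 96 2018-03-01 00:00:00 2018-03-01 23:45:00
```

With 96 intervals per day, a month involves close to 3,000 requests, so a number of empty intervals is to be expected.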