### Requirements

In [1]:
# %pip install requests beautifulsoup4

### Data Scraping

To avoid depending on Google's BigQuery, this notebook retrieves the GDELT data by other means, starting from the master file list published on the GDELT server.

In [2]:
import requests
from bs4 import BeautifulSoup

In [4]:
print("""Running all cells will overwrite current data. Do you want to continue? y/n""")
answer = input()

# startswith() also copes with an empty answer, where indexing [0] would fail
if answer.strip().lower().startswith('n'):
    raise Exception("Preventing the execution of the notebook to not overwrite data.")

Running all cells will overwrite current data. Do you want to continue? y/n


Exception: Preventing the execution of the notebook to not overwrite data.

In [None]:
url = 'http://data.gdeltproject.org/gdeltv2/masterfilelist.txt'

# Once this cell is run, it takes about a minute to finish because a large
# amount of data has to be retrieved and written to file
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    
    with open('./data/01events.txt', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
else:
    print(f'Failed to retrieve gdelt data. Status code: {response.status_code}')

The main interest of this project is the events between countries, so from the data retrieved into `01events.txt` we keep only the **events** files and discard the *mentions* and *global knowledge graph* files.
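As a rough illustration of this selection, the sketch below parses a single line of the master file list. It assumes each line holds three whitespace-separated fields with the URL last, which is also what the parsing cell in this notebook relies on. `parse_event_line` and the sample lines are hypothetical and made up for illustration.

```python
import re

# Captures the 14-digit timestamp (YYYYMMDDHHMMSS) of an events export file.
EXPORT_PATTERN = re.compile(r'gdeltv2/(\d{14})\.export\.CSV\.zip')

def parse_event_line(line):
    """Return (date, time, url) for an events line, or None otherwise.

    Hypothetical helper; assumes the URL is the third field of the line.
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    match = EXPORT_PATTERN.search(fields[2])
    if match is None:
        return None  # mentions, GKG files, or malformed lines
    stamp = match.group(1)
    return stamp[:8], stamp[8:12], fields[2]

# Made-up sample lines (the size and hash fields are placeholders).
sample_event = '1234 abcdef http://data.gdeltproject.org/gdeltv2/20150218230000.export.CSV.zip'
sample_mention = '5678 abcdef http://data.gdeltproject.org/gdeltv2/20150218230000.mentions.CSV.zip'

print(parse_event_line(sample_event))
print(parse_event_line(sample_mention))
```

Returning `None` for anything that is not an events file keeps the filtering and the regex match in one place.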

In [None]:
import re
import json

In [None]:
pattern = r'gdeltv2\/(.*?)\.export\.CSV\.zip'
last_date_time = None

with (
    open('./data/01events.txt', 'r', encoding='utf-8') as in_file,
    open('./data/02relevant_events.json', 'w') as out_file
):
    out_file.write('{\n')

    for line in in_file:
        try:
            line_url = line.strip().split()[2]
            if 'export' not in line_url:
                continue  # We skip mentions and gkg's

            match = re.search(pattern, line_url)

            if not match:
                print("Failed to match regex.")
                continue

            date_time = match.group(1)

            if not last_date_time:  # We write a temporary JSON file
                out_file.write('"' + date_time[:8] + '":[\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')

            elif date_time[:8] != last_date_time:
                out_file.write('\n],\n')
                out_file.write('"' + date_time[:8] + '":[\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')
                
            else:
                out_file.write(',\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')

            last_date_time = date_time[:8]
        except IndexError:  # The line has fewer than three fields
            continue
    
    out_file.write(']\n}\n')
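An alternative to writing the JSON by hand, sketched here under the assumption that all entries fit in memory, is to group them in a dictionary and let `json.dumps` serialise it. A dictionary cannot hold the same key twice, so dates that appear out of order are merged rather than duplicated. `group_by_date` and the sample entries are hypothetical.

```python
import json
from collections import defaultdict

def group_by_date(entries):
    """Group (date, time, url) tuples into {date: [{time: url}, ...]}."""
    grouped = defaultdict(list)
    for date, time, url in entries:
        grouped[date].append({time: url})
    return dict(grouped)

# Made-up entries; the first date reappears after the second one.
entries = [
    ('20150218', '2300', 'http://example.org/a.export.CSV.zip'),
    ('20150219', '0000', 'http://example.org/b.export.CSV.zip'),
    ('20150218', '2315', 'http://example.org/c.export.CSV.zip'),
]

grouped = group_by_date(entries)
print(json.dumps(grouped, indent=1))
```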

The `02relevant_events.json` file contains duplicate keys, which is not permissible. The code in the following cell fixes this by joining the values of such duplicate keys.

In [None]:
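def merge_duplicate_keys(pairs):
    # json.load keeps only the last value of a repeated key, so the lists
    # of duplicate date keys are joined while the file is being parsed.
    merged = {}
    for key, value in pairs:
        if isinstance(value, list):
            merged.setdefault(key, []).extend(value)
        else:
            merged[key] = value  # Inner {time: url} objects
    return merged

with (
    open('./data/02relevant_events.json', 'r') as in_file,
    open('./data/03cleaned_events.json', 'w') as out_file
):
    agg_data = json.load(in_file, object_pairs_hook=merge_duplicate_keys)
    json.dump(agg_data, out_file, indent=1)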
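When an object repeats a key, Python's `json` module keeps only the last value, so duplicates have to be merged while parsing rather than afterwards. The sketch below shows that behaviour on a made-up string and joins the lists with `object_pairs_hook`. `merge_lists` is a hypothetical name used only for this illustration.

```python
import json

# Made-up input in which the key "20150218" appears twice.
raw = '{"20150218": [1, 2], "20150219": [3], "20150218": [4]}'

# Default behaviour: the later "20150218" silently replaces the earlier one.
default_result = json.loads(raw)
print(default_result)

def merge_lists(pairs):
    """Build a dict from (key, value) pairs, concatenating repeated list values."""
    merged = {}
    for key, value in pairs:
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged

merged_result = json.loads(raw, object_pairs_hook=merge_lists)
print(merged_result)
```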