# Global conflict prediction and analysis of the relations between countries using the GDELT database

*Introduction*: Hi!


(Some parts of this notebook require user input to run cells. Please keep that in mind when running multiple cells at once.)

### Requirements

Make sure to have the following libraries installed!

In [None]:
%pip install requests beautifulsoup4 numpy pandas jq tqdm fake_useragent

In [2]:
import datetime
import json
import os
import random
import re
import time
import zipfile

import jq
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from tqdm import tqdm

### Data Retrieval

To avoid depending on Google's BigQuery, this notebook loads the GDELT database through alternative means: it downloads the raw files listed in GDELT's public master file list.

In [None]:
print("Running all cells will overwrite all current data. Do you want to continue? y/n")
answer = input()

# startswith() also handles an empty answer, for which answer[0] would raise an IndexError
if answer.strip().lower().startswith('n'):
    raise Exception("Stopping the notebook so that existing data is not overwritten.")

os.makedirs('./data/', exist_ok=True)

In [4]:
url = 'http://data.gdeltproject.org/gdeltv2/masterfilelist.txt'

# Once this cell is run, it takes some time to finish (~1 min), because a large
# amount of data has to be retrieved, parsed, and written to file
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    with open('./data/01events.txt', 'w') as file:
        file.write(soup.prettify())
else:
    print(f'Failed to retrieve gdelt data. Status code: {response.status_code}')

The main interest of this project will be the events between countries, so from the retrieved data at `01events.txt` we select only **events**, and discard *mentions* and *global knowledge graphs*.

In [5]:
pattern = r'http:\/\/data\.gdeltproject\.org\/gdeltv2\/(.*?)\.export\.CSV\.zip'
last_date_time = None

with (
    open('./data/01events.txt', 'r') as in_file,
    open('./data/02relevant_events.json', 'w') as out_file
):
    out_file.write('{\n')

    for line in in_file:
        try:
            line_url = line.strip().split()[2]
            if 'export' not in line_url:
                continue  # Skip mentions and GKG files

            match = re.search(pattern, line_url)

            if not match:
                print("Failed to match regex.")
                continue

            date_time = match.group(1)

            if not last_date_time:  # We write a temporary JSON file
                out_file.write('"' + date_time[:8] + '":[\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')

            elif date_time[:8] != last_date_time:
                out_file.write('\n],\n')
                out_file.write('"' + date_time[:8] + '":[\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')

            else:
                out_file.write(',\n')
                out_file.write('{"' + date_time[8:12] + '":"' + line_url + '"}')

            last_date_time = date_time[:8]
        except IndexError:  # The line has fewer than three fields, so it is irregular and can be skipped
            # print(f'Corrupt line in 01events.txt containing the following line: {line}')
            continue

    out_file.write(']\n}\n')
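
The selection step can also be written as a small standalone function. This is a minimal sketch, assuming (as the cell above does) that the URL is the third whitespace-separated field of each line. The helper name `parse_master_line` and the sample lines are hypothetical, and the size and hash values in them are made up.

```python
import re
from typing import Optional

# Same idea as the pattern above, restricted to a 14-digit timestamp.
EXPORT_PATTERN = re.compile(
    r'http://data\.gdeltproject\.org/gdeltv2/(\d{14})\.export\.CSV\.zip'
)


def parse_master_line(line: str) -> Optional[tuple]:
    """Return (yyyymmdd, hhmm, url) for an events line, otherwise None."""
    fields = line.split()
    if len(fields) != 3:
        return None  # irregular line
    match = EXPORT_PATTERN.fullmatch(fields[2])
    if match is None:
        return None  # mentions, gkg, or anything else
    stamp = match.group(1)
    return stamp[:8], stamp[8:12], fields[2]


# Hypothetical sample lines (sizes and hashes are invented for illustration).
sample = [
    "1234 0123456789abcdef http://data.gdeltproject.org/gdeltv2/20190102001500.export.CSV.zip",
    "5678 fedcba9876543210 http://data.gdeltproject.org/gdeltv2/20190102001500.mentions.CSV.zip",
    "broken line",
]
parsed = [parse_master_line(line) for line in sample]
```

Only the first sample line yields a result; the mentions line and the irregular line both map to `None`.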

The `02relevant_events.json` file contains duplicate keys, which is not permissible: a standard parser such as Python's `json` module keeps only the last value of a repeated key. The code in the following cell fixes this by joining the values of such duplicate keys.

In [6]:
def multidict(ordered_pairs):
    data = {}

    for key, value in ordered_pairs:
        if len(key) == 4:
            data[key] = value
            continue

        if not data.get(key):
            data[key] = []
        data[key].extend(value)

    return data


with (
    open('./data/02relevant_events.json', 'r') as in_file,
    open('./data/03cleaned_events.json', 'w') as out_file
):
    data = json.load(in_file, object_pairs_hook=multidict)
    json.dump(data, out_file, indent=1)


After this step, the `03cleaned_events.json` file maps each date, and then each time of day, to the link of the corresponding GDELT events file.
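
To see why the hook is needed, the sketch below parses a tiny JSON string with a repeated date key, once without and once with an `object_pairs_hook`. The hook here is a simplified variant that checks the value type instead of the key length, and the `url-a` / `url-b` values are placeholders.

```python
import json


def merge_duplicate_dates(ordered_pairs):
    """Merge list values of repeated keys instead of keeping only the last one."""
    merged = {}
    for key, value in ordered_pairs:
        if isinstance(value, list):
            merged.setdefault(key, []).extend(value)
        else:
            merged[key] = value  # inner {"hhmm": url} objects
    return merged


raw = '{"20190102": [{"0000": "url-a"}], "20190102": [{"0015": "url-b"}]}'

plain = json.loads(raw)  # the first list is silently dropped
merged = json.loads(raw, object_pairs_hook=merge_duplicate_dates)
```

Here `plain` keeps only the `"0015"` entry, while `merged` keeps both entries under the single `"20190102"` key.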

### Data Preparation

In [7]:
if not os.path.exists(r'./data/03cleaned_events.json'):
    raise Exception("There is no data to preprocess. Please run the cells in the Data Retrieval section.")

In [8]:
pd.set_option('display.max_columns', None)

In [9]:
col_names = [
    "Global_Event_ID", "Day", "YYYYMM", "YYYY", "Day_Time", "Actor_1_Country_Code", "Actor_1_Name",
    "Actor_1_Country_ABBR", "Actor_1_Known_Group_Code", "Actor_1_Ethnic_Code", "Actor_1_Religion_Code",
    "Actor_1_Religion_2_Code", "Actor_1_Role", "Actor_1_Role2", "Actor_1_Role3", "Actor_2_Country_Code",
    "Actor_2_Name", "Actor_2_Country_ABBR", "Actor_2_Know_Group_Code", "Actor_2_Ethnic_Code",
    "Actor_2_Religion_Code", "Actor_2_Religion_2_Code", "Actor_2_Role", "Actor_2_Role2", "Actor_2_Role3",
    "Is_Root_Event", "Event_Code", "Event_Base_Code", "Event_Root_Code", "Quad_Class", "Goldstein_Scale",
    "Num_Mentions", "Num_Sources", "Num_Articles", "AVG_TONE", "Actor_1_Geo_Type", "Actor_1_Geo_FullName",
    "Actor_1_Geo_Country_Code", "Actor1Geo_ADM1Code", "Actor1Geo_ADM2Code", "Actor1Geo_Lat", "Actor1Geo_Long",
    "Actor1Geo_FeatureID", "Actor_2_Geo_Type", "Actor_2_Geo_FullName", "Actor_2_Geo_Country_Code",
    "Actor2Geo_ADM1Code", "Actor2Geo_ADM2Code", "Actor2Geo_Lat", "Actor2Geo_Long", "Actor2Geo_FeatureID",
    "Mention_Type", "ST_PR_CNTRY", "Country", "ADM1Code_Extra", "ADM2Code_Extra", "Lat_Extra", "Long_Extra",
    "ActorGeo_FeaturID_Extra", "Date_Added", "Source_URL"
]

In [10]:
mentions_col_names = [
    'Global_Event_ID', 'Event_Time_Date', 'Mention_Time_Date', 'Mention_Type',
    'Mention_Source_Name', 'Mention_Identifier', 'Sentence_ID',
    'Actor_1_Char_Offset', 'Actor_2_Char_Offset', 'Action_Char_Offset',
    'In_Raw_Text', 'Confidence', 'Mention_Doc_Len', 'Mention_Doc_Tone'
]

In [11]:
url = 'https://www.gdeltproject.org/data/lookups/CAMEO.country.txt'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    with open(r'./data/06country_codes.txt', 'w') as file:
        file.write(soup.prettify())
else:
    print(f'Failed to retrieve gdelt country codes. Status code: {response.status_code}')

In [None]:
country_codes: dict[str, str] = {}

with open(r'./data/06country_codes.txt', 'r') as file:
    for line in file:
        code, *country = line.strip().split()
        country = ' '.join(country)
        country_codes[code] = country

country_codes.pop('CODE')

In [13]:
url = 'https://www.gdeltproject.org/data/lookups/CAMEO.eventcodes.txt'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    with open(r'./data/07event_codes.txt', 'w') as file:
        file.write(soup.prettify())
else:
    print(f'Failed to retrieve gdelt event codes. Status code: {response.status_code}')

In [None]:
event_codes: dict[str, str] = {}

with open(r'./data/07event_codes.txt', 'r') as file:
    for line in file:
        code, *event_description = line.strip().split()
        event_description = ' '.join(event_description)
        event_codes[code] = event_description

event_codes.pop('CAMEOEVENTCODE')

In [15]:
def is_valid_country_code(code: str) -> bool:
    # Exact lookup; `code in ccode` on strings would also accept substrings such as "US" in "USA"
    return bool(pd.notnull(code) and code in country_codes)

We define auxiliary functions to ease the preprocessing step.

In [16]:
def format_time_yyyymmddhhmmss_to_str(
        year: str | int, month: str | int, day: str | int,
        hours: str | int, minutes: str | int = "00", seconds: str | int = "00"
) -> str:
    """
    Transform given date time to string.
    """
    # Left-pad every component to at least two characters, e.g. 1 -> "01"
    parts = [str(part).zfill(2) for part in (year, month, day, hours, minutes, seconds)]

    return "".join(parts)


def format_time_yyyymmdd_to_str(
        year: str | int, month: str | int, day: str | int
) -> str:
    """
    Transform given date to string.
    """
    # Left-pad every component to at least two characters, e.g. 1 -> "01"
    parts = [str(part).zfill(2) for part in (year, month, day)]

    return "".join(parts)

In [17]:
def format_str_yyyymmdd_to_time_str(x: str) -> str:
    return str(datetime.date(int(str(x)[:4]), int(str(x)[4:6]), int(str(x)[6:8])))
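
As an alternative to manual string padding, the same conversions can be sketched with the standard `datetime` module. The function names `compact_to_iso` and `parts_to_compact` are hypothetical. Unlike the helpers above, this version raises a `ValueError` for impossible dates such as month 13.

```python
import datetime


def compact_to_iso(yyyymmdd) -> str:
    """Convert 20220430 or "20220430" to "2022-04-30"."""
    parsed = datetime.datetime.strptime(str(yyyymmdd)[:8], "%Y%m%d")
    return parsed.date().isoformat()


def parts_to_compact(year, month, day) -> str:
    """Convert (2019, 1, 2) to "20190102"."""
    return datetime.date(int(year), int(month), int(day)).strftime("%Y%m%d")


iso_example = compact_to_iso(20220430)
compact_example = parts_to_compact(2019, "1", 2)
```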

In [18]:
def load_gdelt_from_url(url: str) -> pd.DataFrame | None:
    """
    Load gdelt data from url on certain date time.
    """

    # Attempt to avoid rate limiting by varying the request headers
    # < --------------------------------------------------------------- >

    # Candidate addresses for the X-Forwarded-For header
    ipv6_addresses = [
        '1550:dc4:d62e:6c4:327d:51d3:be53:9e8f',
        '402:d2b:b458:b8fa:b931:44b8:353e:234',
        'd9fa:8e46:6d80:602:e98f:cd43:6d0c:af74',
        '33e2:99aa:ab04:dc05:3fde:4ed5:d17c:b1c9',
        'df34:2251:5433:98e5:9560:6916:33ba:947b',
        '9b06:c07:c2b4:4de4:823b:49ff:b34:dcb6',
        'cb2:f49:28ed:8eae:5719:aa7b:1b85:e903',
        '6f9f:ff57:8d73:c18a:e74f:3b88:b964:114b',
        '9180:ab16:6c99:5e6:b147:ed3f:fd7c:b6e7',
        '7ae3:7e16:db65:99b1:bc55:a27b:29d4:cdb5',
        '6e7b:481f:877c:10b4:d7ef:7d9d:1b9d:2032',
        '218c:5222:7fcc:f286:896c:3a1b:2657:22c'
    ]
    random_ipv6_address = random.choice(ipv6_addresses)  # Currently unused, see the headers below

    # Send a random User-Agent header as well

    ua = UserAgent()
    
    headers = {
        'User-Agent' : ua.random, 
        'X-Forwarded-For': '127.0.0.1' # Perhaps this will work better instead of random_ipv6
    }
    
    # < --------------------------------------------------------------- >

    
    response = requests.get(url, stream=True, headers=headers)

    if response.status_code == 200:
        with open(r'./data/04temp_15min_data.zip', 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    else:
        print(f'Failed to retrieve gdelt data for url: {url}. Status code: {response.status_code}')
        return None

    with zipfile.ZipFile(r'./data/04temp_15min_data.zip', 'r') as zip_ref:
        zip_ref.extractall(r'./data/')

    pattern = r'gdeltv2\/(.*?)\.export\.CSV\.zip'

    match = re.search(pattern, url)

    if not match:
        print(f"Failed to match regex. {url}")
        return None

    date_time = match.group(1)

    os.rename(r'./data/' + date_time + ".export.CSV", r'./data/05temp_15min_data.CSV')

    data = pd.read_csv(r'./data/05temp_15min_data.CSV', sep=r'\t', engine='python', header=None, names=col_names, dtype={'Event_Code': str})

    os.remove(r'./data/04temp_15min_data.zip')
    os.remove(r'./data/05temp_15min_data.CSV')

    # Drop the events where a country is absent
    data = data.dropna(subset=['Actor_1_Country_ABBR', 'Actor_2_Country_ABBR'])
    
    # Drop the events where the country codes are equal
    data = data[data['Actor_1_Country_ABBR'] != data['Actor_2_Country_ABBR']]
    
    # Filter only the countries/continents
    data = data [
                    data[['Actor_1_Country_ABBR', 'Actor_2_Country_ABBR']]
                        .map(is_valid_country_code)
                        .all(axis=1)
                ]

    data = (
        # Get only the relevant columns of the data
        data[
                [
                    'Global_Event_ID', 'Day', 'Actor_1_Country_ABBR',
                    'Actor_2_Country_ABBR', 'Event_Code', 'Quad_Class',
                    'Goldstein_Scale', 'Num_Articles', # 'AVG_TONE', 'Source_URL'
                ]
            ]
        # Filter only the events that happened in the wanted time-frame
            [
                data['Day'] == int(date_time[:8])
            ]
            .reset_index(drop=True)
    )

    mentions_url = url.replace('export', 'mentions')

    response = requests.get(mentions_url, stream=True)

    if response.status_code == 200:
        with open(r'./data/04temp_15min_data.zip', 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
    else:
        print(f'Failed to retrieve gdelt data for url: {mentions_url}. Status code: {response.status_code}')
        return None

    with zipfile.ZipFile(r'./data/04temp_15min_data.zip', 'r') as zip_ref:
        zip_ref.extractall(r'./data/')

    os.rename(r'./data/' + date_time + ".mentions.CSV", r'./data/05temp_15min_data.CSV')

    mentions_data = pd.read_csv(r'./data/05temp_15min_data.CSV', sep=r'\t', engine='python', header=None, names=mentions_col_names)

    os.remove(r'./data/04temp_15min_data.zip')
    os.remove(r'./data/05temp_15min_data.CSV')

    mentions_data = (
        # Get only the relevant columns of the data
        mentions_data[['Global_Event_ID', 'Confidence']]
        # Average the confidence over all mentions of the same event (taking the maximum is an alternative)
            .groupby(by='Global_Event_ID', as_index=False)
            .aggregate('mean')
            .reset_index(drop=True)
    )

    mentions_data['Confidence'] = mentions_data['Confidence'].astype(int)

    return pd.merge(data, mentions_data, how='left', on='Global_Event_ID')
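
The last steps of the function (averaging the confidence per event, truncating it to an integer, and left-joining it onto the events) can be illustrated in plain Python. This is only a sketch of the logic on made-up rows, not the pandas code itself; in the pandas version an event without mentions ends up with `NaN` rather than `None`.

```python
from collections import defaultdict

# Hypothetical rows: (Global_Event_ID, Confidence) from the mentions table.
mentions = [(1, 100), (1, 50), (2, 25)]
# Hypothetical event IDs from the events table; event 3 has no mentions.
event_ids = [1, 2, 3]

scores = defaultdict(list)
for event_id, confidence in mentions:
    scores[event_id].append(confidence)

# int() truncates the mean, like .astype(int) on the averaged column.
mean_confidence = {eid: int(sum(vals) / len(vals)) for eid, vals in scores.items()}

# Left join: every event is kept; a missing confidence becomes None.
joined = [(eid, mean_confidence.get(eid)) for eid in event_ids]
```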

In [19]:
def load_gdelt_by_yyyymmddhhmmss(
        year: str | int, month: str | int, day: str | int,
        hours: str | int, minutes: str | int, seconds: str | int = "00"
) -> pd.DataFrame | None:
    """
    Load the gdelt events dataset logged on given date time.
    """
    date_time = format_time_yyyymmddhhmmss_to_str(year, month, day, hours, minutes, seconds)

    url = (
        r'http://data.gdeltproject.org/gdeltv2/' +
        date_time +
        r'.export.CSV.zip'
    )

    return load_gdelt_from_url(url)

In [20]:
def load_gdelt_by_yyyymmdd(
        year: str | int, month: str | int, day: str | int
) -> pd.DataFrame | None:
    """
    Load the gdelt events dataset logged on given date.
    Equivalent to getting all 96 15-minute interval datasets on given date.
    """
    hours_list = list(range(24))
    minutes_list = list(range(0, 60, 15))

    data_frames = []
    for hours in hours_list:
        for minutes in minutes_list:
            data = load_gdelt_by_yyyymmddhhmmss(year, month, day, hours, minutes)
            if data is not None:
                data_frames.append(data)

    return pd.concat(data_frames, ignore_index=True)
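
The 96 intervals of a day can also be enumerated with `datetime` arithmetic instead of nested loops. The sketch below only builds the URLs and downloads nothing; the function name `export_urls_for_day` is hypothetical, and the URL layout is the one used in the cells above.

```python
import datetime

BASE_URL = "http://data.gdeltproject.org/gdeltv2/"


def export_urls_for_day(year: int, month: int, day: int) -> list:
    """Build the 96 export URLs (one per 15-minute interval) of a given day."""
    start = datetime.datetime(year, month, day)
    step = datetime.timedelta(minutes=15)
    return [
        f"{BASE_URL}{(start + i * step):%Y%m%d%H%M%S}.export.CSV.zip"
        for i in range(96)
    ]


urls = export_urls_for_day(2019, 1, 2)
```

The first URL ends in `20190102000000.export.CSV.zip` and the last one in `20190102234500.export.CSV.zip`.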


In [21]:
def load_gdelt_from_to_yyyymmdd(
        year_from: str|int, month_from: str|int, day_from: str|int,
        year_to: str|int, month_to: str|int, day_to: str|int
) -> pd.DataFrame|None:
    """
    Load the gdelt events dataset between selected days (inclusive).
    """
    start_date = format_time_yyyymmdd_to_str(year_from, month_from, day_from)
    end_date = format_time_yyyymmdd_to_str(year_to, month_to, day_to)

    with open(r'./data/03cleaned_events.json', 'r') as file:
        json_data = json.dumps(json.load(file))

    # JQ query to filter URLs between the given dates
    jq_query = f'. | to_entries | map(select(.key >= "{start_date}" and .key <= "{end_date}")) | .[].value | .[] | .[]'

    urls = jq.compile(jq_query).input(text=json_data).all()

    data_frames = []
    for url in urls:
        data = load_gdelt_from_url(url)
        if data is not None:
            data_frames.append(data)

    return pd.concat(data_frames, ignore_index=True) if len(data_frames) > 0 else None
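
For readers unfamiliar with jq, the date filter can be sketched in plain Python on a small dictionary with the same shape as `03cleaned_events.json`. Comparing the keys as strings works because fixed-width `yyyymmdd` strings sort chronologically. The function name `urls_between` and the `url-N` values are placeholders.

```python
def urls_between(events: dict, start_date: str, end_date: str) -> list:
    """Collect all URLs whose date key lies in [start_date, end_date]."""
    urls = []
    for date_key, slots in events.items():
        if start_date <= date_key <= end_date:
            for slot in slots:  # each slot is a {"hhmm": url} object
                urls.extend(slot.values())
    return urls


events = {
    "20190101": [{"0000": "url-1"}],
    "20190102": [{"0000": "url-2"}, {"0015": "url-3"}],
    "20190103": [{"0000": "url-4"}],
}
selected = urls_between(events, "20190102", "20190103")
```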

### Posting to FastAPI

In [22]:
def custom_sigmoid(n: int) -> float:
    return 1 / (1 + 1 / (np.e ** ((n - 50) / 10)))
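
The expression above is algebraically the logistic function evaluated at `(n - 50) / 10`. The sketch below writes it in that form with the standard `math` module; the parameter names `midpoint` and `scale` are introduced here for illustration only.

```python
import math


def shifted_sigmoid(n: float, midpoint: float = 50.0, scale: float = 10.0) -> float:
    """Logistic function shifted to `midpoint` and stretched by `scale`."""
    q = (n - midpoint) / scale
    return 1.0 / (1.0 + math.exp(-q))


# A confidence of 50 halves the weight of an event, so a Goldstein value
# of -10 contributes an event score of -5.
event_score = -10.0 * shifted_sigmoid(50)
```

Low confidence values push the weight towards 0 and high ones towards 1, so events extracted with little confidence contribute less to the score.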

In [None]:
start_date = datetime.datetime(2019, 1, 2)
end_date = datetime.datetime(2022, 1, 1)

current_date = start_date
error_date = None
exponential_wait_time = 1

total_days = (end_date - start_date).days + 1

with tqdm(total=total_days) as pbar:
    while current_date <= end_date:
        times = [current_date.year, current_date.month, current_date.day]
        try:
            current_data = load_gdelt_by_yyyymmdd(*times)
        except Exception as e:
            print(str(e))
            time.sleep(exponential_wait_time)
    
            if current_date == error_date:
                exponential_wait_time *= 2
            else:
                error_date = current_date
            continue
    
        exponential_wait_time = 1
    
        current_data['Country_Pairs'] = (
            current_data.apply(lambda x: str(tuple(
                sorted([x['Actor_1_Country_ABBR'], x['Actor_2_Country_ABBR']])
            )), axis=1)
        )
    
        current_data['Event_Score'] = (
            current_data['Goldstein_Scale'] *
            current_data['Confidence'].apply(lambda x: custom_sigmoid(x))
        )
    
        current_data = current_data.groupby(['Country_Pairs', 'Day']).aggregate(
            country_code_a=('Actor_1_Country_ABBR', 'first'),
            country_code_b=('Actor_2_Country_ABBR', 'first'),
            relations_score=('Event_Score', 'mean'),
            num_verbal_coop=('Quad_Class', lambda x: (x == 1).sum()),
            num_material_coop=('Quad_Class', lambda x: (x == 2).sum()),
            num_verbal_conf=('Quad_Class', lambda x: (x == 3).sum()),
            num_material_conf=('Quad_Class', lambda x: (x == 4).sum())
           
        ).reset_index()

        current_data = current_data.drop(columns='Country_Pairs')
        current_data = current_data.rename(columns={'Day': 'date'})
        current_data['date'] = current_data['date'].apply(lambda x: format_str_yyyymmdd_to_time_str(str(x)))
    

        for row in current_data.index:
            data_to_post = current_data.iloc[row].to_json()

            try:
                response = requests.post(url=r'http://127.0.0.1:8000/api/v1/relations/', data=data_to_post, headers={
                    "x-key": os.getenv("API_KEY")
                })

            except Exception as e:
                # `response` is not assigned when the request itself fails, so report the exception instead
                print(str(e))
                continue

            try:
                response = requests.post(url=r'https://gdelt-api-staging.filipovski.net/api/v1/relations/', data=data_to_post, headers={
                    "x-key": os.getenv("API_KEY")
                })

            except Exception as e:
                print(str(e))
                continue

        
        pbar.update(1)
        current_date += datetime.timedelta(days=1)
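
The aggregation inside the loop can be made explicit in plain Python. This is a sketch on made-up events for a single day, not the pandas code itself: events are grouped by the unordered country pair, the event scores are averaged, and the events are counted per quad class. Unlike the cell above, which keeps the actor order of the first row in each group, the sketch uses the sorted pair as the key.

```python
from collections import defaultdict

# Hypothetical events: (actor_1, actor_2, quad_class, event_score)
events = [
    ("USA", "CHN", 1, 2.0),
    ("CHN", "USA", 4, -6.0),
    ("USA", "CHN", 1, 1.0),
    ("FRA", "DEU", 2, 3.0),
]

groups = defaultdict(lambda: {"scores": [], "quad": {1: 0, 2: 0, 3: 0, 4: 0}})
for actor_1, actor_2, quad_class, score in events:
    pair = tuple(sorted((actor_1, actor_2)))  # order-independent key
    groups[pair]["scores"].append(score)
    groups[pair]["quad"][quad_class] += 1

summary = {
    pair: {
        "relations_score": sum(g["scores"]) / len(g["scores"]),
        "num_verbal_coop": g["quad"][1],
        "num_material_coop": g["quad"][2],
        "num_verbal_conf": g["quad"][3],
        "num_material_conf": g["quad"][4],
    }
    for pair, g in groups.items()
}
```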

In [None]:
data_to_post

In [30]:
response = requests.post(url=r'http://127.0.0.1:8000/api/v1/relations/', data=data_to_post)

In [None]:
response.content

{
   "date": "2024-08-25",
   "country_code_a": "string",
   "country_code_b": "string",
   "relations_score": 0,
   "num_verbal_coop": 0,
   "num_material_coop": 0,
   "num_verbal_conf": 0,
   "num_material_conf": 0
 }

### Calculate Score

The GDELT-GRF conflict score should take into account:
- The potential impact of the type of the events (Goldstein scale, custom scale using event code, quad class, maybe average tone)
- The scale of the events (number of articles, confidence)
---
Contents of master table:
- `Pair of countries`: (AFG, AUS)
- `Day`: 20220430
- `Number of events with quad class 1`: 10
- `Number of events with quad class 2`: 20
- `Number of events with quad class 3`: 100
- `Number of events with quad class 4`: 3
- Custom aggregate measure of the Goldstein scale over all events:
  - `Goldstein scale`: -0.8
  - `Confidence of extracted event from article`: 100
  - ~`Average tone of article`: -1 $~~~~$ [is it necessary?]~
  - $avgs = \frac{1}{N} \sum_{\text{events}} gs \cdot \sigma(conf)$ (the mean over the $N$ events, matching the `relations_score` aggregation in the posting cell)

  where $\sigma(n) = \frac{e^Q}{1 + e^Q}$ and $Q = \frac{n-50}{10}$ (a shifted and scaled sigmoid).

  If we want to add custom Goldstein mappings $cgs$, we could replace the Goldstein value with $$gs \gets \beta \cdot gs + (1 - \beta) \cdot cgs$$ but even if this gives more predictive power, it will be less interpretable.

---

The daily score, which we will predict later, is calculated as:
$$\frac{\sum_t \alpha_t \cdot avgs(t) \cdot N_t}{\sum_t N_t}$$
where $N_t$ is the total number of events, i.e. the sum of the event counts over the four quad classes: $N_t = N_1 + N_2 + N_3 + N_4$.
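
The formula can be sketched numerically as follows. The text does not define the weights $\alpha_t$ or the range of $t$, so the values below are placeholders (all weights set to 1), and returning 0 when there are no events is an additional assumption of this sketch. The function name `weighted_daily_score` is hypothetical.

```python
def weighted_daily_score(avgs: list, counts: list, alphas: list) -> float:
    """Compute sum(alpha_t * avgs_t * N_t) / sum(N_t)."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: no events means a neutral score
    weighted = sum(a * s * n for a, s, n in zip(alphas, avgs, counts))
    return weighted / total


# Placeholder inputs: two terms with aggregate scores -2.0 and 1.0,
# backed by 3 events and 1 event respectively, and unit weights.
score = weighted_daily_score(avgs=[-2.0, 1.0], counts=[3, 1], alphas=[1.0, 1.0])
```

With unit weights the result is simply the event-count-weighted mean of the aggregate scores, here $(-2 \cdot 3 + 1 \cdot 1) / 4 = -1.25$.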