This notebook merges the monthly GDELT news sentiment data with the previously merged regional panel dataset, which comprises HPI, macro variables, and Google Trends indices. The resulting dataset is then used for pre-processing, exploratory analysis, and modelling. Links to the two datasets:
1. GDELT: https://drive.google.com/file/d/1xYBAlUKmPz9qsMS5K6Q6pJ63-raR_uDy/view?usp=sharing
2. Merged Data: https://docs.google.com/spreadsheets/d/1mEkqvOxpx4t1xQi7lYnfRsycjBrcq9r1/edit?usp=sharing&ouid=111965315671490167461&rtpof=true&sd=true
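
Because the GDELT series is reduced to one row per month while the regional panel has many rows per month, the join is many-to-one on the date key. The sketch below illustrates that shape on toy data. The frames `panel` and `tone` and all of their values are hypothetical and are not taken from the real files; `validate="many_to_one"` is an optional pandas safeguard, not something the notebook itself uses.

```python
import pandas as pd

# Toy regional panel: two regions observed over two months (hypothetical values).
panel = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]),
    "Region": ["North", "South", "North", "South"],
    "HPI": [100.0, 120.0, 101.0, 121.5],
})

# Toy monthly tone series: exactly one row per month (hypothetical values).
tone = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "AvgTone": [-1.2, -0.8],
})

# validate="many_to_one" makes pandas raise MergeError if the right-hand key
# is not unique, which would otherwise silently duplicate panel rows.
toy = panel.merge(tone, on="Date", how="left", validate="many_to_one")
print(toy)
```

Every region in a given month receives the same tone value, and the row count of the panel is unchanged.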

In [None]:
# Merging GDELT tone indices into the already merged dataset
import re
import pandas as pd
from google.colab import files

print(" Upload the GDELT monthly tone CSV "
      "(e.g., gdelt_tone_2005_2025_stitched_ECONxGKG_UPDATED.csv)")
up_gdelt = files.upload()

print("\n Upload merged modelling dataset (Excel) "
      "(e.g., Merged dataset.xlsx)")
up_merged = files.upload()

gdelt_file  = next(iter(up_gdelt))
merged_file = next(iter(up_merged))

gdelt  = pd.read_csv(gdelt_file)
merged = pd.read_excel(merged_file)

# Converting any date-like column into Month Start dates
def to_month_start(s: pd.Series) -> pd.Series:
    s = s.copy()

    # First attempt: standard date parsing.
    # (infer_datetime_format is deprecated since pandas 2.0; format inference is now the default.)
    dt = pd.to_datetime(s, errors="coerce")

    # Fallback for "YYYYMM" strings (e.g., 201905)
    s_str = s.astype(str).str.strip()
    yyyymm = s_str.str.fullmatch(r"\d{6}", na=False)
    if yyyymm.any():
        dt.loc[yyyymm] = pd.to_datetime(s_str[yyyymm], format="%Y%m", errors="coerce")

    # Normalize to month-start timestamps
    return dt.dt.to_period("M").dt.to_timestamp(how="start")

# Basic validation: expected key columns exist
if "Month" not in gdelt.columns:
    raise ValueError(
        "GDELT CSV must contain a 'Month' column. "
        f"Found: {gdelt.columns.tolist()}"
    )

if "Date" not in merged.columns:
    raise ValueError(
        "Merged dataset must contain a 'Date' column. "
        f"Found: {merged.columns.tolist()}"
    )

# Standardising both datasets onto the same monthly Date key
gdelt["Date"]  = to_month_start(gdelt["Month"])
merged["Date"] = to_month_start(merged["Date"])

# Dropping any GDELT rows where Date couldn't be parsed
gdelt = gdelt.dropna(subset=["Date"]).copy()

# If GDELT has duplicate months, keep the first one in file order
# (a stable sort preserves the original order among equal dates)
gdelt = (
    gdelt.sort_values("Date", kind="stable")
         .drop_duplicates(subset=["Date"], keep="first")
)

# Keeping only GDELT value columns
gdelt_value_cols = [c for c in gdelt.columns if c not in ["Month", "Date"]]
gdelt_for_merge  = gdelt[["Date"] + gdelt_value_cols].copy()

# Merge: keep all rows from main dataset (left join)
final = merged.merge(gdelt_for_merge, on="Date", how="left")

if len(final) != len(merged):
    raise ValueError("Row count changed after merge — check duplicates in the Date key.")

if not merged["Date"].value_counts().sort_index().equals(final["Date"].value_counts().sort_index()):
    raise ValueError("Per-date row counts changed — check Date normalisation.")

# Saving output
out_path = "Master file.csv"
final.to_csv(out_path, index=False)

print(f"\n Merge complete. Saved file: {out_path}")
print(f"Rows: {final.shape[0]:,} | Columns: {final.shape[1]:,}")


 Upload the GDELT monthly tone CSV (e.g., gdelt_tone_2005_2025_stitched_ECONxGKG_UPDATED.csv)


Saving gdelt_tone_2005_2025_stitched_ECONxGKG_UPDATED.csv to gdelt_tone_2005_2025_stitched_ECONxGKG_UPDATED.csv

 Upload merged modelling dataset (Excel) (e.g., Merged dataset.xlsx)


Saving HPI_regional_merged_2005_2025_ddmmyyyy_MODIFIED.xlsx to HPI_regional_merged_2005_2025_ddmmyyyy_MODIFIED.xlsx





 Merge complete. Saved file: Master file.csv
Rows: 99,630 | Columns: 23
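
To see what the month-start normalisation does, the following sketch applies the same idea to a small hand-made series that mixes full dates, a compact `YYYYMM` value, and an unparseable entry. It is a simplified variant under stated assumptions, not the `to_month_start` function itself: the names `raw`, `text`, `is_yyyymm`, `parsed`, and `month_start` are illustrative, and the `YYYYMM` values are masked out before the generic parse rather than overwritten afterwards.

```python
import pandas as pd

# Hypothetical inputs: two full dates, one YYYYMM string, one bad value.
raw = pd.Series(["2019-05-17", "201905", "2019-06-30", "not a date"])

text = raw.astype(str).str.strip()
is_yyyymm = text.str.fullmatch(r"\d{6}", na=False)

# Generic parse for everything except the YYYYMM entries (masked to NaN here);
# values that cannot be parsed become NaT because of errors="coerce".
parsed = pd.to_datetime(text.where(~is_yyyymm), errors="coerce")

# Explicit format for the six-digit YYYYMM entries.
parsed.loc[is_yyyymm] = pd.to_datetime(
    text[is_yyyymm], format="%Y%m", errors="coerce"
)

# Snap every timestamp to the first day of its month.
month_start = parsed.dt.to_period("M").dt.to_timestamp(how="start")
print(month_start)
```

Both `2019-05-17` and `201905` end up on the same key, 2019-05-01, which is what allows the two datasets to be joined on `Date`; the unparseable entry stays missing.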


In [None]:
# Missingness summary for the new GDELT columns
if gdelt_value_cols:
    print("\nMissingness in GDELT columns (after merging):")
    for c in gdelt_value_cols:
        print(f" - {c}: {final[c].isna().mean():.2%}")


Missingness in GDELT columns (after merging):
 - AvgTone_Stitched: 0.00%
 - Docs_Stitched: 0.00%
 - Source: 0.00%
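
A missingness share alone does not show whether a gap comes from a month that is absent from the GDELT file or from a value that is empty in the source. One way to separate the two, sketched here on toy data, is the `indicator` option of `DataFrame.merge`. The frames `left` and `right` and their values are hypothetical; only the `indicator=True` behaviour and the `_merge` column it produces are standard pandas.

```python
import pandas as pd

# Toy panel covering three months (hypothetical values).
left = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "HPI": [100.0, 101.0, 102.0],
})

# Toy tone series covering only the first two months (hypothetical values).
right = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "AvgTone": [-1.2, -0.8],
})

# indicator=True adds a "_merge" column recording where each row's key was found.
checked = left.merge(right, on="Date", how="left", indicator=True)

# Months present in the panel but absent from the tone series.
unmatched_months = checked.loc[checked["_merge"] == "left_only", "Date"].unique()

# Share of missing values per added column, computed in one vectorised call.
missing_share = checked[["AvgTone"]].isna().mean()

print(unmatched_months)
print(missing_share)
```

In this toy case March 2020 has no match, so one third of the `AvgTone` values are missing and the unmatched month is listed explicitly.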
