# Price–News Join Analysis

Analyze the GDELT–OHLCV join table, in which **news on day t is matched to prices on day t+1** (the next trading day).  
We first validate the data, then compute mean and median sentiment (and, optionally, price metrics) per ticker per day.



In [4]:
import pandas as pd
from pathlib import Path

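# Find project root: walk up from the working directory until a "data" folder appears
current = Path.cwd()
while not (current / "data").exists() and current != current.parent:
    current = current.parent
# Fail early if the walk reached the filesystem root without finding "data"
if not (current / "data").exists():
    raise FileNotFoundError("No 'data' directory found in any parent of the working directory")
PROJECT_ROOT = current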
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
INPUT_PATH = PROCESSED_DIR / "gdelt_ohlcv_join.csv"

df = pd.read_csv(
    INPUT_PATH,
    parse_dates=["seendate", "article_date", "price_date"],
)
# Required for analysis
required = ["sentiment_score", "ticker", "article_date", "price_date"]
missing = [c for c in required if c not in df.columns]
assert not missing, f"Missing columns: {missing}"
print(f"Loaded {len(df):,} rows from {INPUT_PATH.name}")
print(f"Columns: {list(df.columns)}")

Loaded 2,591 rows from gdelt_ohlcv_join.csv
Columns: ['seendate', 'url', 'title', 'language', 'domain', 'socialimage', 'company', 'ticker', 'sentiment_score', 'sentiment_hits', 'sentiment_present', 'article_date', 'price_date', 'next_open', 'next_high', 'next_low', 'next_close', 'next_adj_close', 'next_volume']


## 1. Data evaluation and validation

**First objective: validate the matching between prices and news** (join integrity, then schema and coverage).

In [5]:
# Schema and shape
print("Shape:", df.shape)
print("\nDtypes:")
print(df.dtypes)
print("\nKey columns present:")
key_cols = ["article_date", "price_date", "ticker", "sentiment_score", "next_close", "next_volume"]
for c in key_cols:
    print(f"  {c}: {c in df.columns}")

Shape: (2591, 19)

Dtypes:
seendate             datetime64[us, UTC]
url                                  str
title                                str
language                             str
domain                               str
socialimage                          str
company                              str
ticker                               str
sentiment_score                  float64
sentiment_hits                   float64
sentiment_present                   bool
article_date              datetime64[us]
price_date                datetime64[us]
next_open                        float64
next_high                        float64
next_low                         float64
next_close                       float64
next_adj_close                   float64
next_volume                        int64
dtype: object

Key columns present:
  article_date: True
  price_date: True
  ticker: True
  sentiment_score: True
  next_close: True
  next_volume: True


### 1.1 Validate price–news matching

Check that **price_date** is the next trading day after **article_date** and that attached prices match the source OHLCV.

In [9]:
# 1) price_date must be strictly after article_date (next trading day)
df["_gap_days"] = (df["price_date"] - df["article_date"]).dt.days
bad_order = (df["_gap_days"] <= 0).sum()
print("1) Article date → price date (next trading day)")
print(f"   Rows where price_date ≤ article_date: {bad_order} (expect 0)")
assert bad_order == 0, "Every row must have price_date > article_date"
print("   ✓ All rows have price_date after article_date")
print("\n   Calendar-day gap (article_date to price_date):")
print(df["_gap_days"].value_counts().sort_index().to_string())

1) Article date → price date (next trading day)
   Rows where price_date ≤ article_date: 0 (expect 0)
   ✓ All rows have price_date after article_date

   Calendar-day gap (article_date to price_date):
_gap_days
1    2462
2      74
3      55
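
The check above confirms that `price_date` falls strictly after `article_date`, but not that it is the *first* trading day after it. A stricter check could compare each row against a trading calendar. The sketch below is a minimal illustration under stated assumptions: `next_trading_day` is a hypothetical helper, and `trading_days` is a toy calendar built by hand (in this project it could presumably be derived from the `date` column of the source OHLCV file).

```python
import numpy as np
import pandas as pd

def next_trading_day(article_dates: pd.Series, trading_days) -> pd.Series:
    """Return, for each article date, the first trading day strictly after it (NaT if none)."""
    # Normalise both inputs to the same datetime unit; np.unique also sorts
    days = np.unique(np.asarray(trading_days, dtype="datetime64[ns]"))
    arts = np.asarray(article_dates, dtype="datetime64[ns]")
    # side="right" skips a trading day that equals the article date itself
    pos = np.searchsorted(days, arts, side="right")
    result = np.full(arts.shape, np.datetime64("NaT"), dtype="datetime64[ns]")
    found = pos < len(days)
    result[found] = days[pos[found]]
    return pd.Series(result, index=article_dates.index)

# Toy calendar (assumption): Thu, Fri, then Mon, Tue after a weekend
trading_days = pd.to_datetime(["2026-01-08", "2026-01-09", "2026-01-12", "2026-01-13"])
articles = pd.Series(pd.to_datetime(["2026-01-08", "2026-01-09", "2026-01-10", "2026-01-13"]))
expected = next_trading_day(articles, trading_days)
print(expected)
```

On the real table, rows where the expected date differs from `price_date` would be the ones to inspect; articles dated after the last calendar day come back as `NaT`.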


In [10]:
# 2) Within join: each (price_date, ticker) should have exactly one set of next_* values (no conflicts)
price_cols = [c for c in df.columns if c.startswith("next_")]
check = df.groupby(["price_date", "ticker"])[price_cols].nunique()
max_per_col = check.max()
conflicts = (check > 1).any(axis=1).sum()
print("2) One price per (price_date, ticker)")
print(f"   Unique (price_date, ticker) pairs: {len(check):,}")
print(f"   Pairs with conflicting next_* values: {conflicts} (expect 0)")
if conflicts == 0:
    print("   ✓ All rows for same (price_date, ticker) have identical next_* values")
else:
    print("   ⚠ Some (price_date, ticker) have multiple different prices — investigate")

2) One price per (price_date, ticker)
   Unique (price_date, ticker) pairs: 88
   Pairs with conflicting next_* values: 0 (expect 0)
   ✓ All rows for same (price_date, ticker) have identical next_* values
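
If the conflict count were ever non-zero, the next step would be to see which pairs disagree and in which columns. A small sketch on made-up data (the toy frame and its values are assumptions, not rows from the join table):

```python
import pandas as pd

# Toy data: (2026-01-06, AAPL) carries two different next_close values
toy = pd.DataFrame({
    "price_date": pd.to_datetime(["2026-01-06", "2026-01-06", "2026-01-06", "2026-01-07"]),
    "ticker": ["AAPL", "AAPL", "MSFT", "AAPL"],
    "next_close": [100.0, 100.5, 400.0, 101.0],
    "next_volume": [10, 10, 20, 30],
})

price_cols = [c for c in toy.columns if c.startswith("next_")]
n_unique = toy.groupby(["price_date", "ticker"])[price_cols].nunique()
# Keep only the pairs where at least one next_* column has more than one distinct value
conflicting = n_unique[(n_unique > 1).any(axis=1)]
print(conflicting)
```

Note that `nunique` ignores missing values by default, so a pair holding one price and one `NaN` would not be reported as a conflict by this approach.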


In [11]:
# 3) Cross-check: join next_* values vs source OHLCV (prices_daily_accumulated)
ohlcv_path = PROCESSED_DIR / "prices_daily_accumulated.csv"
if ohlcv_path.exists():
    ohlcv = pd.read_csv(ohlcv_path, parse_dates=["date"])
    # One row per (date, ticker) in join; take first next_* per (price_date, ticker)
    join_prices = df.groupby(["price_date", "ticker"])["next_close"].first().reset_index()
    join_prices = join_prices.rename(columns={"price_date": "date", "next_close": "join_close"})
    merged = join_prices.merge(ohlcv[["date", "ticker", "close"]], on=["date", "ticker"], how="left")
    # Rows absent from the source have NaN close; count them separately from true mismatches
    missing_in_ohlcv = merged["close"].isna()
    merged["match"] = merged["join_close"].round(6) == merged["close"].round(6)
    mismatches = (~merged["match"] & ~missing_in_ohlcv).sum()
    n_missing = missing_in_ohlcv.sum()
    print("3) Join vs source OHLCV (next_close vs close)")
    print(f"   (price_date, ticker) pairs checked: {len(merged):,}")
    print(f"   Mismatches (join next_close ≠ OHLCV close): {mismatches}")
    print(f"   Missing in OHLCV: {n_missing}")
    if mismatches == 0 and n_missing == 0:
        print("   ✓ All join prices match source OHLCV")
    else:
        print("   ⚠ Review mismatches or missing dates")
else:
    print("3) Skip cross-check (prices_daily_accumulated.csv not found)")

3) Join vs source OHLCV (next_close vs close)
   (price_date, ticker) pairs checked: 88
   Mismatches (join next_close ≠ OHLCV close): 0
   Missing in OHLCV: 0
   ✓ All join prices match source OHLCV
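
Rounding both sides and testing for equality can, in rare cases, flag two values that differ only by floating-point noise but fall on opposite sides of a rounding boundary. One alternative is a tolerance-based comparison. The sketch below assumes a hypothetical helper `compare_close`, an absolute tolerance of `1e-6`, and toy input frames; it is not the notebook's own method.

```python
import numpy as np
import pandas as pd

def compare_close(join_prices: pd.DataFrame, ohlcv: pd.DataFrame, atol: float = 1e-6) -> pd.DataFrame:
    """Left-join join prices to source closes and flag missing rows and mismatches."""
    merged = join_prices.merge(
        ohlcv[["date", "ticker", "close"]],
        on=["date", "ticker"],
        how="left",
        indicator=True,           # adds a _merge column: "both" or "left_only"
        validate="one_to_one",    # raises if either side has duplicate (date, ticker) keys
    )
    merged["missing"] = merged["_merge"] == "left_only"
    # np.isclose returns False when either value is NaN
    merged["match"] = np.isclose(merged["join_close"], merged["close"], rtol=0.0, atol=atol)
    merged["mismatch"] = ~merged["match"] & ~merged["missing"]
    return merged.drop(columns="_merge")

# Toy inputs (assumption): the third date has no source row
join_prices = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-06", "2026-01-07", "2026-01-08"]),
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "join_close": [100.0, 101.0000004, 102.0],
})
source = pd.DataFrame({
    "date": pd.to_datetime(["2026-01-06", "2026-01-07"]),
    "ticker": ["AAPL", "AAPL"],
    "close": [100.0, 101.0],
})
report = compare_close(join_prices, source)
print(report[["date", "ticker", "missing", "mismatch"]])
```

Keeping `missing` and `mismatch` as separate flags means a date absent from the source is not also counted as a price disagreement.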


### 1.2 Schema, date ranges, and coverage

This subsection checks date ranges, missing values, and ticker coverage ahead of the per-ticker, per-day aggregation.

In [12]:
# Date ranges and join alignment
art_min, art_max = df["article_date"].min(), df["article_date"].max()
price_min, price_max = df["price_date"].min(), df["price_date"].max()
print("Article date range:", art_min.date(), "to", art_max.date())
print("Price date range: ", price_min.date(), "to", price_max.date())
print("\nExpected: price_date = next trading day after article_date (weekends/holidays skipped).")
# Spot-check: article_date and price_date should differ by 1–3 calendar days (Fri→Mon = 3)
df["days_to_next"] = (df["price_date"] - df["article_date"]).dt.days
print("\nCalendar days from article_date to price_date (10 most frequent gaps):")
print(df["days_to_next"].value_counts().head(10))

Article date range: 2026-01-05 to 2026-02-08
Price date range:  2026-01-06 to 2026-02-09

Expected: price_date = next trading day after article_date (weekends/holidays skipped).

Calendar days from article_date to price_date (10 most frequent gaps):
days_to_next
1    2462
2      74
3      55
Name: count, dtype: int64
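
Gaps of 2 and 3 days are what weekends would produce, and cross-tabulating the gap against the article's weekday is one way to confirm that. A minimal sketch on toy dates (the frame below is an assumption for illustration, not data from the join table):

```python
import pandas as pd

# Toy rows: Thursday, Friday, Saturday and Sunday articles around one weekend
toy = pd.DataFrame({
    "article_date": pd.to_datetime(["2026-01-08", "2026-01-09", "2026-01-10", "2026-01-11"]),
    "price_date": pd.to_datetime(["2026-01-09", "2026-01-12", "2026-01-12", "2026-01-12"]),
})
toy["gap_days"] = (toy["price_date"] - toy["article_date"]).dt.days
toy["weekday"] = toy["article_date"].dt.day_name()

# Rows: weekday of the article; columns: calendar-day gap; cells: row counts
gap_by_weekday = pd.crosstab(toy["weekday"], toy["gap_days"])
print(gap_by_weekday)
```

In a table free of holidays, 3-day gaps would be expected only in the Friday row and 2-day gaps only in the Saturday row; counts elsewhere would point to holidays or to a join problem.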


In [13]:
# Missing values in columns used for aggregation
agg_cols = ["sentiment_score", "ticker", "article_date", "price_date"]
if "next_close" in df.columns:
    agg_cols.append("next_close")
missing = df[agg_cols].isna().sum()
print("Missing values (columns used for mean/median per ticker per day):")
print(missing[missing > 0] if missing.any() else "  None")
print("\nRows with any missing in these columns:", df[agg_cols].isna().any(axis=1).sum())

Missing values (columns used for mean/median per ticker per day):
  None

Rows with any missing in these columns: 0
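
With no missing values in these columns, the per-ticker, per-day aggregation can run without dropping rows. A rough sketch of what that aggregation might look like, using toy data and illustrative output column names (`mean_sentiment`, `median_sentiment`, `n_articles` are assumptions):

```python
import pandas as pd

# Toy articles: three for AAPL on one day, one for MSFT on the next
toy = pd.DataFrame({
    "article_date": pd.to_datetime(["2026-01-05", "2026-01-05", "2026-01-05", "2026-01-06"]),
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT"],
    "sentiment_score": [0.2, -0.1, 0.5, 0.3],
})

# One row per (article_date, ticker) with mean, median and article count
daily = (
    toy.groupby(["article_date", "ticker"])["sentiment_score"]
    .agg(mean_sentiment="mean", median_sentiment="median", n_articles="count")
    .reset_index()
)
print(daily)
```

Carrying the article count alongside the mean and median makes it easier to tell a well-supported daily value from one based on a single article.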


In [15]:
# Ticker coverage: articles and (article_date, ticker) pairs
print("Articles per ticker:")
print(df["ticker"].value_counts().sort_index())
print("\nUnique (article_date, ticker) pairs per ticker (calendar days with at least one article):")
# Per ticker, count distinct article_date (same as count of (article_date, ticker) per ticker)
unique_days_per_ticker = df.groupby("ticker")["article_date"].nunique()
print(unique_days_per_ticker.to_string())
print("\n(Multiple articles per day per ticker is expected; we aggregate to mean/median per (date, ticker) later.)")

Articles per ticker:
ticker
AAPL     232
AMZN     269
GOOGL    528
META     478
MSFT     290
NVDA     564
TSLA     230
Name: count, dtype: int64

Unique (article_date, ticker) pairs per ticker (calendar days with at least one article):


KeyError: Index(['ticker'], dtype='str')
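
A `KeyError` naming the grouping column can arise, in some pandas versions, when a grouped operation refers to that column after it has been excluded from each group; whether that is the cause here is an assumption, since the failing code is not shown. The sketch below counts distinct article days per ticker in two ways that do not touch the grouping column inside the group, using toy data:

```python
import pandas as pd

# Toy articles: AAPL on two distinct days (one day repeated), MSFT on one day
toy = pd.DataFrame({
    "article_date": pd.to_datetime(["2026-01-05", "2026-01-05", "2026-01-06", "2026-01-05"]),
    "ticker": ["AAPL", "AAPL", "AAPL", "MSFT"],
})

# Option 1: de-duplicate the (article_date, ticker) pairs first, then count rows per ticker
pairs = toy[["article_date", "ticker"]].drop_duplicates()
days_per_ticker = pairs.groupby("ticker").size()

# Option 2: count distinct dates directly within each ticker group
days_per_ticker_alt = toy.groupby("ticker")["article_date"].nunique()

print(days_per_ticker)
```

Both options should give the same counts; the second matches the `nunique` call in the cell above.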