# Price–News Join Analysis

Analyze the GDELT–OHLCV join table, in which **news on day t is matched to prices on day t+1** (the next trading day).  
We first validate the join, then compute mean and median sentiment (plus next-day close and volume where available) per ticker per day.



In [12]:
import pandas as pd
from pathlib import Path

# Find project root
current = Path.cwd()
while not (current / "data").exists() and current != current.parent:
    current = current.parent
PROJECT_ROOT = current
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"
INPUT_PATH = PROCESSED_DIR / "gdelt_ohlcv_join.csv"

df = pd.read_csv(
    INPUT_PATH,
    parse_dates=["seendate", "article_date", "price_date"],
)
# Required for analysis
required = ["sentiment_score", "ticker", "article_date", "price_date"]
missing = [c for c in required if c not in df.columns]
assert not missing, f"Missing columns: {missing}"
print(f"Loaded {len(df):,} rows from {INPUT_PATH.name}")
print(f"Columns: {list(df.columns)}")

Loaded 3,090 rows from gdelt_ohlcv_join.csv
Columns: ['seendate', 'url', 'title', 'language', 'domain', 'socialimage', 'company', 'ticker', 'sentiment_score', 'sentiment_hits', 'sentiment_present', 'article_date', 'price_date', 'next_open', 'next_high', 'next_low', 'next_close', 'next_adj_close', 'next_volume']


## 1. Data evaluation and validation

**First objective: validate the matching between prices and news** (join integrity, then schema and coverage).

In [13]:
# Schema and shape
print("Shape:", df.shape)
print("\nDtypes:")
print(df.dtypes)
print("\nKey columns present:")
key_cols = ["article_date", "price_date", "ticker", "sentiment_score", "next_close", "next_volume"]
for c in key_cols:
    print(f"  {c}: {c in df.columns}")

Shape: (3090, 19)

Dtypes:
seendate             datetime64[us, UTC]
url                                  str
title                                str
language                             str
domain                               str
socialimage                          str
company                              str
ticker                               str
sentiment_score                  float64
sentiment_hits                   float64
sentiment_present                   bool
article_date              datetime64[us]
price_date                datetime64[us]
next_open                        float64
next_high                        float64
next_low                         float64
next_close                       float64
next_adj_close                   float64
next_volume                        int64
dtype: object

Key columns present:
  article_date: True
  price_date: True
  ticker: True
  sentiment_score: True
  next_close: True
  next_volume: True


### 1.1 Validate price–news matching

Check that **price_date** is the next trading day after **article_date** and that attached prices match the source OHLCV.

In [None]:
# 1) price_date must be strictly after article_date.
#    This assertion checks ordering only; the gap distribution printed below shows whether
#    the gaps are plausible for a next-trading-day join (1 day, or 2–3 days across a weekend).
df["_gap_days"] = (df["price_date"] - df["article_date"]).dt.days
bad_order = (df["_gap_days"] <= 0).sum()
print("1) Article date → price date (next trading day)")
# Number of misaligned rows (should be 0)
print(f"   Rows where price_date ≤ article_date: {bad_order} (expect 0)")
# Fail fast if any row is misaligned
assert bad_order == 0, "Every row must have price_date > article_date"
print("   ✓ All rows have price_date after article_date")
print("\n   Calendar-day gap (article_date to price_date):")
print(df["_gap_days"].value_counts().sort_index().to_string())

1) Article date → price date (next trading day)
   Rows where price_date ≤ article_date: 0 (expect 0)
   ✓ All rows have price_date after article_date

   Calendar-day gap (article_date to price_date):
_gap_days
1    2873
2      86
3     131
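
The ordering check above confirms that `price_date` comes after `article_date`, but not that it is the *first* trading day after it. A minimal sketch of a stricter check follows, assuming a trading calendar is available as a list of dates (for example, the distinct dates in the source OHLCV file). The helper name `next_trading_day` and the toy calendar are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def next_trading_day(article_dates: pd.Series, trading_days) -> pd.Series:
    """For each article date, return the first trading day strictly after it (NaT if none)."""
    days = np.sort(pd.DatetimeIndex(trading_days).unique().values.astype("datetime64[ns]"))
    dates = pd.to_datetime(article_dates).values.astype("datetime64[ns]")
    # side="right" skips a trading day that equals the article date itself
    pos = np.searchsorted(days, dates, side="right")
    out = np.full(len(dates), np.datetime64("NaT"), dtype="datetime64[ns]")
    found = pos < len(days)
    out[found] = days[pos[found]]
    return pd.Series(out, index=article_dates.index)


# Toy calendar (assumption): Mon 2026-01-05 through Fri 2026-01-09, then Mon 2026-01-12
trading_days = pd.to_datetime(
    ["2026-01-05", "2026-01-06", "2026-01-07", "2026-01-08", "2026-01-09", "2026-01-12"]
)
# Articles on a Monday, a Friday, a Saturday, and a Sunday
toy = pd.DataFrame(
    {"article_date": pd.to_datetime(["2026-01-05", "2026-01-09", "2026-01-10", "2026-01-11"])}
)
toy["expected_price_date"] = next_trading_day(toy["article_date"], trading_days)
toy["gap_days"] = (toy["expected_price_date"] - toy["article_date"]).dt.days
print(toy)
```

In the notebook, the same helper could be applied to `df["article_date"]` and the result compared with `df["price_date"]`. This only works if the calendar extends at least one trading day past the last article date.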


In [None]:
# 2) Within join: each (price_date, ticker) should have exactly one set of next_* values (no conflicts)
price_cols = [c for c in df.columns if c.startswith("next_")]
# nunique() counts distinct non-null values per group; more than one means conflicting prices
check = df.groupby(["price_date", "ticker"])[price_cols].nunique()
conflicts = (check > 1).any(axis=1).sum()
print("2) One price per (price_date, ticker)")
print(f"   Unique (price_date, ticker) pairs: {len(check):,}")
print(f"   Pairs with conflicting next_* values: {conflicts} (expect 0)")
# Unit test for unique prices per (price_date, ticker)
assert conflicts == 0, "⚠ Some (price_date, ticker) have multiple different prices — investigate"
print("   ✓ All rows for same (price_date, ticker) have identical next_* values")

2) One price per (price_date, ticker)
   Unique (price_date, ticker) pairs: 97
   Pairs with conflicting next_* values: 0 (expect 0)
   ✓ All rows for same (price_date, ticker) have identical next_* values
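
One caveat about the conflict check: `nunique()` ignores NaN by default, so a group in which one row has a price and another row has none would still count as a single value. The toy example below (invented data, not taken from the join table) sketches the difference when `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Toy group (assumption): three rows for the same (price_date, ticker), one with a missing price
toy = pd.DataFrame(
    {
        "price_date": pd.to_datetime(["2026-01-06"] * 3),
        "ticker": ["AAA"] * 3,
        "next_close": [100.0, 100.0, np.nan],
    }
)
keys = ["price_date", "ticker"]

default_count = toy.groupby(keys)["next_close"].nunique()             # NaN is ignored
strict_count = toy.groupby(keys)["next_close"].nunique(dropna=False)  # NaN counts as a value

print("default:", default_count.iloc[0])
print("dropna=False:", strict_count.iloc[0])
```

This matters only if a `next_*` column can contain missing values; a separate `isna()` check on those columns would serve the same purpose.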


In [None]:
# 3) Cross-check: join next_* values vs source OHLCV (prices_daily_accumulated)
ohlcv_path = PROCESSED_DIR / "prices_daily_accumulated.csv"
if ohlcv_path.exists():
    ohlcv = pd.read_csv(ohlcv_path, parse_dates=["date"])
    # One row per (date, ticker) in join; take first next_* per (price_date, ticker)
    join_prices = df.groupby(["price_date", "ticker"])["next_close"].first().reset_index()
    join_prices = join_prices.rename(columns={"price_date": "date", "next_close": "join_close"})
    merged = join_prices.merge(ohlcv[["date", "ticker", "close"]], on=["date", "ticker"], how="left")
    # NaN never compares equal, so count pairs missing from OHLCV separately from true mismatches
    missing_in_ohlcv = merged["close"].isna()
    merged["match"] = merged["join_close"].round(6) == merged["close"].round(6)
    mismatches = (~merged["match"] & ~missing_in_ohlcv).sum()
    n_missing = missing_in_ohlcv.sum()
    print("3) Join vs source OHLCV (next_close vs close)")
    print(f"   (price_date, ticker) pairs checked: {len(merged):,}")
    print(f"   Mismatches (join next_close ≠ OHLCV close): {mismatches}")
    print(f"   Missing in OHLCV: {n_missing}")
    # Fail fast on any mismatch or missing pair
    assert mismatches == 0 and n_missing == 0, "⚠ Review mismatches or missing dates"
    print("   ✓ All join prices match source OHLCV")
else:
    print("3) Skip cross-check (prices_daily_accumulated.csv not found)")

3) Join vs source OHLCV (next_close vs close)
   (price_date, ticker) pairs checked: 97
   Mismatches (join next_close ≠ OHLCV close): 0
   Missing in OHLCV: 0
   ✓ All join prices match source OHLCV
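
As a variation on the cross-check, the sketch below uses two options of `DataFrame.merge`: `validate="one_to_one"` raises if either side has duplicate keys, and `indicator=True` adds a `_merge` column that marks pairs absent from the source. Prices are compared with `np.isclose` rather than rounding. The tickers, prices, and tolerance are toy assumptions.

```python
import numpy as np
import pandas as pd

# Toy inputs (assumption): one price per (date, ticker) on each side
join_prices = pd.DataFrame(
    {
        "date": pd.to_datetime(["2026-01-06", "2026-01-06", "2026-01-07"]),
        "ticker": ["AAA", "BBB", "AAA"],
        "join_close": [100.0, 200.0, 101.0],
    }
)
source = pd.DataFrame(
    {
        "date": pd.to_datetime(["2026-01-06", "2026-01-06"]),
        "ticker": ["AAA", "BBB"],
        "close": [100.0, 200.5],  # BBB differs from the join; AAA on 01-07 is absent
    }
)

merged = join_prices.merge(
    source, on=["date", "ticker"], how="left", validate="one_to_one", indicator=True
)
is_missing = merged["_merge"] == "left_only"
# np.isclose returns False for NaN, so exclude missing pairs from the mismatch count
is_mismatch = ~is_missing & ~np.isclose(merged["join_close"], merged["close"], rtol=0.0, atol=1e-6)

print("missing:", int(is_missing.sum()), "mismatched:", int(is_mismatch.sum()))
```

Keeping the two counts separate makes it clear whether a failure comes from a gap in the price file or from a genuine price disagreement.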


### 1.2 Schema, date ranges, and coverage

Schema, date ranges, missing values, and ticker coverage (for downstream aggregation).

In [24]:
# Date ranges and join alignment
art_min, art_max = df["article_date"].min(), df["article_date"].max()
price_min, price_max = df["price_date"].min(), df["price_date"].max()
print("Article date range:", art_min.date(), "to", art_max.date())
print("Price date range: ", price_min.date(), "to", price_max.date())
print("\nExpected: price_date = next trading day after article_date (weekends/holidays skipped).")
# Spot-check: article_date and price_date should differ by 1–3 calendar days (Fri→Mon = 3)
df["days_to_next"] = (df["price_date"] - df["article_date"]).dt.days
print("\nCalendar days from article_date to price_date (sample):")
print(df["days_to_next"].value_counts().head(10))

Article date range: 2026-01-05 to 2026-02-09
Price date range:  2026-01-06 to 2026-02-10

Expected: price_date = next trading day after article_date (weekends/holidays skipped).

Calendar days from article_date to price_date (sample):
days_to_next
1    2873
3     131
2      86
Name: count, dtype: int64


In [25]:
# Missing values in columns used for aggregation
agg_cols = ["sentiment_score", "ticker", "article_date", "price_date"]
if "next_close" in df.columns:
    agg_cols.append("next_close")
missing = df[agg_cols].isna().sum()
print("Missing values (columns used for mean/median per ticker per day):")
print(missing[missing > 0] if missing.any() else "  None")
print("\nRows with any missing in these columns:", df[agg_cols].isna().any(axis=1).sum())

Missing values (columns used for mean/median per ticker per day):
  None

Rows with any missing in these columns: 0


In [8]:
# Ticker coverage: articles and (article_date, ticker) pairs
print("Articles per ticker:")
print(df["ticker"].value_counts().sort_index())
print("\nUnique (article_date, ticker) pairs per ticker = distinct calendar days with ≥1 article:")
# Per ticker, count distinct article_date (same as count of (article_date, ticker) per ticker)
unique_days_per_ticker = df.groupby("ticker")["article_date"].nunique()
print(unique_days_per_ticker.to_string())
print("\n(Multiple articles per day per ticker is expected; we aggregate to mean/median per (date, ticker) later.)")

Articles per ticker:
ticker
AAPL     306
AMZN     320
GOOGL    580
META     628
MSFT     351
NVDA     658
TSLA     247
Name: count, dtype: int64

Unique (article_date, ticker) pairs per ticker = distinct calendar days with ≥1 article:
ticker
AAPL     12
AMZN     14
GOOGL    17
META     18
MSFT     13
NVDA     13
TSLA     16

(Multiple articles per day per ticker is expected; we aggregate to mean/median per (date, ticker) later.)


In [None]:
# Preview: mean/median sentiment per ticker, keyed by price_date
# (section 2 builds the full daily table keyed by article_date)
sentiment_per_ticker_per_day = df.groupby(["ticker", "price_date"])["sentiment_score"].agg(["mean", "median"])
print("\nMean and median sentiment per ticker per day:")
print(sentiment_per_ticker_per_day.head())



Mean and median sentiment per ticker per day:
                       mean  median
ticker price_date                  
AAPL   2026-01-06  0.417500    0.38
       2026-01-07  0.262857    0.00
       2026-01-08  0.052500    0.00
       2026-01-12  0.317308    0.00
       2026-01-13  0.233953    0.00


## 2. Sentiment summary statistics per ticker per day

Aggregate to mean and median sentiment per (ticker, day).

In [27]:
# Aggregate: one row per (article_date, ticker)
agg_spec = {
    "sentiment_mean": ("sentiment_score", "mean"),
    "sentiment_median": ("sentiment_score", "median"),
    "article_count": ("sentiment_score", "count"),
}
# next_* values are identical within a group (check 2 in section 1.1), so "first" is safe.
# Aggregating them in the same call aligns them by key instead of by row position.
for col in ("next_close", "next_volume"):
    if col in df.columns:
        agg_spec[col] = (col, "first")
daily = df.groupby(["article_date", "ticker"]).agg(**agg_spec).reset_index()

print("Daily summary (first rows):")
daily.head(10)

Daily summary (first rows):


| | article_date | ticker | sentiment_mean | sentiment_median | article_count | next_close | next_volume |
|---|---|---|---|---|---|---|---|
| 0 | 2026-01-05 | AAPL | 0.4175 | 0.38 | 8 | 262.359985 | 52352100 |
| 1 | 2026-01-05 | AMZN | 0.38 | 0.38 | 2 | 240.929993 | 53764700 |
| 2 | 2026-01-05 | GOOGL | 0.91 | 0.91 | 2 | 314.339996 | 31212100 |
| 3 | 2026-01-05 | MSFT | 0.006667 | 0.0 | 6 | 478.51001 | 23037700 |
| 4 | 2026-01-05 | NVDA | 0.411429 | 0.0 | 7 | 187.240005 | 176862600 |
| 5 | 2026-01-05 | TSLA | 0.066667 | 0.0 | 3 | 432.959991 | 89093800 |
| 6 | 2026-01-06 | AAPL | 0.262857 | 0.0 | 21 | 260.329987 | 48309800 |
| 7 | 2026-01-06 | AMZN | 0.206923 | 0.0 | 52 | 241.559998 | 42236500 |
| 8 | 2026-01-06 | GOOGL | 0.490769 | 0.76 | 13 | 321.980011 | 35104400 |
| 9 | 2026-01-06 | META | 0.125 | 0.0 | 16 | 648.690002 | 12846300 |


## 3. Monday-only analysis: sentiment per ticker per week

Filter to articles published on **Mondays** (by article_date), then number weeks in 7-day increments, with the week starting **2026-01-12** as week 1. Earlier dates fall in week 0 and are excluded; the week number has no upper bound. Compute summary statistics per ticker per week.

In [75]:
# Filter to Mondays only (dayofweek: Monday=0)
mondays = daily[daily["article_date"].dt.dayofweek == 0].copy()
print(f"Total rows: {len(daily):,}")
print(f"Mondays only: {len(mondays):,} ({100*len(mondays)/len(daily):.1f}%)")
print(f"\nMonday dates in data:")
monday_dates = sorted(mondays["article_date"].dt.date.unique())
print(monday_dates)
print(f"\nExpected Mondays (if all weeks present):")
week_start = pd.Timestamp("2026-01-12")
for w in range(1, 6):  # weeks 1-5 based on data up to 2/9
    expected_monday = week_start + pd.Timedelta(days=7*(w-1))
    print(f"  Week {w}: {expected_monday.date()}")
    if expected_monday.date() not in monday_dates:
        print(f"    ⚠ MISSING from data")

Total rows: 103
Mondays only: 27 (26.2%)

Monday dates in data:
[datetime.date(2026, 1, 5), datetime.date(2026, 1, 12), datetime.date(2026, 2, 2), datetime.date(2026, 2, 9)]

Expected Mondays (if all weeks present):
  Week 1: 2026-01-12
  Week 2: 2026-01-19
    ⚠ MISSING from data
  Week 3: 2026-01-26
    ⚠ MISSING from data
  Week 4: 2026-02-02
  Week 5: 2026-02-09


In [74]:
# Assign week number (increments of 7 days starting from 2026-01-12)
# Week 1 = 2026-01-12 + 0-6 days, Week 2 = 2026-01-12 + 7-13 days, etc.
week_start_date = pd.Timestamp("2026-01-12")
mondays["days_since_week1_start"] = (mondays["article_date"] - week_start_date).dt.days
mondays["week"] = (mondays["days_since_week1_start"] // 7) + 1
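# 1/19 is MLK Day (a US market holiday). Treat the note below as a hypothesis: news can still be
# published on a market holiday, and week 3 (1/26) is missing as well.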
print("Note: MLK holiday on 1/19 causes week 2 to be missing")
# Filter to weeks >= 1 (exclude dates before 1/12, but no upper limit - open-ended)
mondays_filtered = mondays[mondays["week"] >= 1].copy()
print(f"Week start date: {week_start_date.date()}")
print(f"Week range in data: {mondays_filtered['week'].min()} to {mondays_filtered['week'].max()} (open-ended)")
print(f"\nMondays per week:")
week_counts = mondays_filtered.groupby("week")["article_date"].nunique()
print(week_counts)
print(f"\nMissing weeks (expected but not present):")
all_weeks = set(range(mondays_filtered['week'].min(), mondays_filtered['week'].max() + 1))
present_weeks = set(week_counts.index)
missing_weeks = sorted(all_weeks - present_weeks)
if missing_weeks:
    for w in missing_weeks:
        # expected_date is a Monday by construction (week_start_date is a Monday plus a multiple
        # of 7 days), so a missing week means daily has no rows at all for that date
        expected_date = week_start_date + pd.Timedelta(days=7*(w-1))
        print(f"  Week {w}: {expected_date.date()} (no articles on this Monday)")
else:
    print("  None")

Note: MLK holiday on 1/19 causes week 2 to be missing
Week start date: 2026-01-12
Week range in data: 1 to 5 (open-ended)

Mondays per week:
week
1    1
4    1
5    1
Name: article_date, dtype: int64
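
The week numbering can be worked through by hand on the four Monday dates listed in the earlier output. The sketch below applies the same formula, `(days since 2026-01-12) // 7 + 1`, and then lists the Mondays that a complete weekly sequence would contain. The variable names are illustrative.

```python
import pandas as pd

week_start = pd.Timestamp("2026-01-12")
# Monday article dates reported by the filter cell
observed = pd.to_datetime(["2026-01-05", "2026-01-12", "2026-02-02", "2026-02-09"])

# Floor division maps 2026-01-05 (7 days before the start) to week 0
weeks = (observed - week_start).days // 7 + 1
print(dict(zip(observed.strftime("%Y-%m-%d"), weeks)))

# Every Monday from week 1 through the last observed Monday
expected = pd.date_range(week_start, observed.max(), freq="W-MON").strftime("%Y-%m-%d")
observed_days = set(observed.strftime("%Y-%m-%d"))
missing_mondays = [d for d in expected if d not in observed_days]
print("Missing Mondays:", missing_mondays)
```

With these inputs the formula gives weeks 0, 1, 4, and 5, and the missing Mondays are 2026-01-19 and 2026-01-26, matching the gaps in the week counts.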


In [69]:
# Summary statistics per ticker per week (Mondays only)
# Use mondays_filtered to exclude week 0 (dates before 1/12)
weekly_stats = mondays_filtered.groupby(["ticker", "week"]).agg(
    sentiment_mean=("sentiment_mean", "mean"),
    sentiment_median=("sentiment_median", "median"),
    # sentiment_std excluded: std of already-aggregated daily means is problematic (NaN when only one Monday)
    monday_count=("article_date", "nunique"),  # number of Mondays in this week with data
    total_articles=("article_count", "sum"),
).reset_index()

print("Summary statistics per ticker per week (Mondays only, weeks 1+):")
print("=" * 80)
weekly_stats

Summary statistics per ticker per week (Mondays only, weeks 1+):


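| | ticker | week | sentiment_mean | sentiment_median | sentiment_std | monday_count | total_articles |
|---|---|---|---|---|---|---|---|
| 0 | AAPL | 1 | 0.233953 | 0.0 | NaN | 1 | 43 |
| 1 | AAPL | 4 | 0.0 | 0.0 | NaN | 1 | 1 |
| 2 | AAPL | 5 | 0.190857 | 0.0 | NaN | 1 | 35 |
| 3 | AMZN | 1 | 0.14037 | 0.0 | NaN | 1 | 27 |
| 4 | AMZN | 4 | -0.19 | 0.0 | NaN | 1 | 4 |
| 5 | AMZN | 5 | 0.171714 | 0.0 | NaN | 1 | 35 |
| 6 | GOOGL | 1 | 0.09 | 0.0 | NaN | 1 | 20 |
| 7 | GOOGL | 4 | -0.016667 | 0.0 | NaN | 1 | 3 |
| 8 | GOOGL | 5 | 0.172133 | 0.0 | NaN | 1 | 75 |
| 9 | META | 1 | 0.016897 | 0.0 | NaN | 1 | 29 |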


In [70]:
# Pivot table for mean sentiment per ticker (per week) - weeks 1+ (open-ended)
weekly_pivot_mean = weekly_stats.pivot(index="ticker", columns="week", values="sentiment_mean")
print("Mean sentiment per ticker per week (Mondays only, weeks 1+) - pivot view:")
print("=" * 80)
print("Note: Week 0 (dates before 1/12) is excluded. If a ticker has NaN for a week,")
print("      that ticker had no articles on the Monday(s) in that week.")
print("      Weeks are open-ended; new weeks will appear as data extends past 2/9.")
weekly_pivot_mean


Mean sentiment per ticker per week (Mondays only, weeks 1+) - pivot view:
Note: Week 0 (dates before 1/12) is excluded. If a ticker has NaN for a week,
      that ticker had no articles on the Monday(s) in that week.
      Weeks are open-ended; new weeks will appear as data extends past 2/9.


| ticker \ week | 1 | 4 | 5 |
|---|---|---|---|
| AAPL | 0.233953 | 0.0 | 0.190857 |
| AMZN | 0.14037 | -0.19 | 0.171714 |
| GOOGL | 0.09 | -0.016667 | 0.172133 |
| META | 0.016897 | 0.081 | 0.004167 |
| MSFT | 0.12 | 0.0 | 0.136 |
| NVDA | 0.364038 | 0.0 | 0.211875 |
| TSLA | 0.087222 | -0.91 | 0.556 |


In [72]:
# Pivot table for median sentiment per ticker (per week)
weekly_pivot_median = weekly_stats.pivot(index="ticker", columns="week", values="sentiment_median")
print("Median sentiment per ticker per week (Mondays only) - pivot view:")
print("=" * 70)
# Display table
weekly_pivot_median

Median sentiment per ticker per week (Mondays only) - pivot view:


| ticker \ week | 1 | 4 | 5 |
|---|---|---|---|
| AAPL | 0.0 | 0.0 | 0.0 |
| AMZN | 0.0 | 0.0 | 0.0 |
| GOOGL | 0.0 | 0.0 | 0.0 |
| META | 0.0 | 0.0 | 0.0 |
| MSFT | 0.0 | 0.0 | 0.0 |
| NVDA | 0.12 | 0.0 | 0.0 |
| TSLA | 0.0 | -0.91 | 0.86 |
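
The pivot views only show weeks that have data, so the gap between weeks 1 and 4 is easy to overlook. One possible way to make absent weeks visible is to reindex the pivot columns over the full week range, which adds all-NaN columns for the missing weeks. The frame below is toy data, not the notebook's `weekly_stats`.

```python
import pandas as pd

# Toy weekly table (assumption): two tickers with data in weeks 1 and 4 only
toy_weekly = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB"],
        "week": [1, 4, 1, 4],
        "sentiment_mean": [0.2, 0.0, 0.1, -0.2],
    }
)
pivot = toy_weekly.pivot(index="ticker", columns="week", values="sentiment_mean")

# Reindex columns over every week from 1 to the last observed week
all_weeks = list(range(1, int(pivot.columns.max()) + 1))
pivot_full = pivot.reindex(columns=all_weeks)
print(pivot_full)
```

Weeks 2 and 3 then appear as NaN columns, which separates "no Monday data for any ticker" from "no Monday data for one ticker".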


### Diagnostic: Why is week 3 (1/26) missing?

Check whether 2026-01-26 appears in the daily table; if it does not, list the article dates within three days of it.

In [None]:
# Check for 1/26 specifically
target_date = pd.Timestamp("2026-01-26")
print(f"Checking date: {target_date.date()} (expected Monday for week 3)")
print(f"Day of week: {target_date.day_name()} (dayofweek={target_date.dayofweek}, Monday=0)")

# Check if this date exists in daily data
has_date = (daily["article_date"].dt.date == target_date.date()).any()
print(f"\nDate exists in daily data: {has_date}")

if has_date:
    date_rows = daily[daily["article_date"].dt.date == target_date.date()]
    print(f"  Rows for {target_date.date()}: {len(date_rows)}")
    print(f"  Tickers: {sorted(date_rows['ticker'].unique())}")
    print(f"  Day of week in data: {date_rows['article_date'].dt.day_name().iloc[0]}")
else:
    print(f"\n  → {target_date.date()} is not in daily data (no articles on this date)")
    print(f"\nChecking nearby dates:")
    nearby = daily[
        (daily["article_date"] >= target_date - pd.Timedelta(days=3)) &
        (daily["article_date"] <= target_date + pd.Timedelta(days=3))
    ]
    if len(nearby) > 0:
        print(nearby[["article_date", "ticker"]].drop_duplicates().sort_values("article_date"))
    else:
        print("  No articles within 3 days of 1/26")
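
The single-date diagnostic can be generalized to every calendar day in the range. The sketch below counts articles per day and reindexes against a complete daily calendar, so days with zero articles show up explicitly. The dates are toy values chosen for illustration.

```python
import pandas as pd

# Toy article dates (assumption): nothing published on 2026-01-25 or 2026-01-26
toy_dates = pd.Series(
    pd.to_datetime(["2026-01-23", "2026-01-23", "2026-01-24", "2026-01-27", "2026-01-28"])
)

# Count articles per day, keyed by ISO date string
counts = toy_dates.dt.strftime("%Y-%m-%d").value_counts()

# Full calendar from first to last article date; days without articles get a count of 0
calendar = pd.date_range(toy_dates.min(), toy_dates.max(), freq="D").strftime("%Y-%m-%d")
per_day = counts.reindex(calendar, fill_value=0)
empty_days = per_day[per_day == 0].index.tolist()

print(per_day)
print("Days with no articles:", empty_days)
```

Applied to `df["article_date"]`, a list like `empty_days` would show whether 1/19 and 1/26 are isolated gaps or part of a longer stretch without articles.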