# 0.2 Historical Earnings Move Analysis

**Objective:** Analyze historical post-earnings price moves to understand:
1. Distribution of |moves| by price bucket (a rough proxy for market cap)
2. Historical volatility around earnings
3. Baseline statistics for ML model comparison
4. Which stocks have predictable vs unpredictable earnings reactions

This provides the foundation for the ML model that will predict |move| quantiles.

In [1]:
import requests
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import os
from dotenv import load_dotenv
import time
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

load_dotenv()

DATA_DIR = Path('../data/earnings')
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Headers for Nasdaq API
NASDAQ_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Origin': 'https://www.nasdaq.com',
    'Referer': 'https://www.nasdaq.com/',
}

## 1. Fetch Historical Earnings Calendar

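Get past earnings announcements from the Nasdaq API. Each record carries a `time` field intended to distinguish BMO (before market open) from AMC (after market close), though in the run below it resolves to a known timing for only 16 of the 84,381 US records.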
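The fetch loop in the next cell retries each date with fixed waits. As a minimal sketch of an alternative, exponential backoff doubles the wait after each failed attempt. The helper `with_retries` and its parameters are hypothetical names for illustration, not part of the notebook or of `requests`; `fetch` stands in for any callable that returns `None` on a retryable failure.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper: `fetch` returns None to signal a retryable failure.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits base_delay, then 2x, 4x, ...
    return None

# Demo with a fake fetcher that fails twice, then succeeds.
calls = {'n': 0}

def flaky_fetch():
    calls['n'] += 1
    return {'rows': []} if calls['n'] >= 3 else None

waits = []
result = with_retries(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in the fetch loop it would default to `time.sleep`.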

In [2]:
def fetch_earnings_calendar_nasdaq(from_date: datetime, to_date: datetime) -> pd.DataFrame:
    """Fetch earnings calendar from Nasdaq API (includes BMO/AMC timing).
    
    Includes retry logic and longer delays to handle rate limiting.
    """
    all_rows = []
    current_date = from_date
    consecutive_errors = 0
    max_consecutive_errors = 10
    
    total_days = (to_date - from_date).days
    
    while current_date <= to_date:
        date_str = current_date.strftime('%Y-%m-%d')
        url = f"https://api.nasdaq.com/api/calendar/earnings?date={date_str}"
        
        success = False
        for attempt in range(3):  # Up to 3 attempts per date
            try:
                r = requests.get(url, headers=NASDAQ_HEADERS, timeout=15)
                if r.status_code == 200:
                    data = r.json()
                    rows = data.get('data', {}).get('rows', [])
                    if rows:
                        for row in rows:
                            row['date'] = date_str
                        all_rows.extend(rows)
                    success = True
                    consecutive_errors = 0
                    break
                elif r.status_code == 429:  # Rate limited
                    print("  Rate limited, waiting 5s...")
                    time.sleep(5)
                else:
                    print(f"  {date_str}: HTTP {r.status_code}")
                    break
            except requests.exceptions.Timeout:
                print(f"  {date_str}: Timeout (attempt {attempt+1}/3)")
                time.sleep(2)
            except requests.exceptions.ConnectionError as e:
                print(f"  {date_str}: Connection error (attempt {attempt+1}/3)")
                time.sleep(3)
            except Exception as e:
                print(f"  {date_str}: {type(e).__name__}: {e}")
                break
        
        if not success:
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                print(f"\n  WARNING: {max_consecutive_errors} consecutive errors, stopping early")
                break
        
        current_date += timedelta(days=1)
        time.sleep(0.15)  # Slightly longer delay between requests
        
        # Progress indicator every 30 days
        days_done = (current_date - from_date).days
        if days_done % 30 == 0:
            pct = days_done / total_days * 100
            print(f"  Progress: {days_done}/{total_days} days ({pct:.0f}%) - {len(all_rows)} records...")
    
    return pd.DataFrame(all_rows)

# Fetch 5 years of earnings (20 quarters) to align with prices.pqt data range
end_date = datetime.now()
start_date = end_date - timedelta(days=1825)  # ~5 years

print(f"Fetching earnings from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}...")
print(f"Total days to fetch: {(end_date - start_date).days}")
print("This may take 20-30 minutes...\n")

earnings_df = fetch_earnings_calendar_nasdaq(start_date, end_date)
print(f"\nTotal earnings records: {len(earnings_df)}")

Fetching earnings from 2021-01-08 to 2026-01-07...
Total days to fetch: 1825
This may take 20-30 minutes...

  Progress: 30/1825 days (2%) - 976 records...
  Progress: 60/1825 days (3%) - 2693 records...
  Progress: 90/1825 days (5%) - 3439 records...
  Progress: 120/1825 days (7%) - 5788 records...
  Progress: 150/1825 days (8%) - 6965 records...
  Progress: 180/1825 days (10%) - 7149 records...
  Progress: 210/1825 days (12%) - 9460 records...
  Progress: 240/1825 days (13%) - 10836 records...
  Progress: 270/1825 days (15%) - 11041 records...
  Progress: 300/1825 days (16%) - 12797 records...
  Progress: 330/1825 days (18%) - 14844 records...
  Progress: 360/1825 days (20%) - 15050 records...
  Progress: 390/1825 days (21%) - 15682 records...
  Progress: 420/1825 days (23%) - 17908 records...
  Progress: 450/1825 days (25%) - 18892 records...
  Progress: 480/1825 days (26%) - 20151 records...
  Progress: 510/1825 days (28%) - 22832 records...
  Progress: 540/1825 days (30%) - 23065 

In [3]:
# Filter to US stocks only (no dots/dashes in symbol)
us_earnings = earnings_df[
    ~earnings_df['symbol'].str.contains(r'[.-]', regex=True, na=False)
].copy()

# Parse dates
us_earnings['date'] = pd.to_datetime(us_earnings['date'])

# Keep only past earnings
us_earnings = us_earnings[us_earnings['date'] < datetime.now()]

# Parse timing from Nasdaq format
def parse_timing(time_str):
    if pd.isna(time_str):
        return 'unknown'
    time_str = str(time_str).lower()
    if 'pre-market' in time_str or 'before' in time_str:
        return 'BMO'
    elif 'after-hours' in time_str or 'after' in time_str:
        return 'AMC'
    return 'unknown'

us_earnings['timing'] = us_earnings['time'].apply(parse_timing)

print(f"US earnings (historical): {len(us_earnings)}")
print(f"Unique symbols: {us_earnings['symbol'].nunique()}")
print(f"\nDate range: {us_earnings['date'].min()} to {us_earnings['date'].max()}")

US earnings (historical): 84381
Unique symbols: 4864

Date range: 2021-01-08 00:00:00 to 2026-01-07 00:00:00


In [4]:
# Check how often the Nasdaq `time` field resolves to BMO/AMC
print("Timing distribution:")
print(us_earnings['timing'].value_counts())
print(f"\nBMO/AMC coverage: {(us_earnings['timing'] != 'unknown').mean()*100:.1f}%")

Timing distribution:
timing
unknown    84365
AMC           11
BMO            5
Name: count, dtype: int64

BMO/AMC coverage: 0.0%


## 2. Fetch Historical Prices for Earnings Stocks

For each stock with earnings, get prices around the earnings date to compute realized moves.
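One way to locate the sessions around an event, sketched here with the standard library rather than pandas: binary-search a sorted list of trading days for the first session on or after the earnings date (T), then take its neighbours as T-1 and T+1. The function name `event_window` and the sample dates are illustrative assumptions.

```python
from bisect import bisect_left
from datetime import date

def event_window(trading_days, earn_date):
    """Return (T-1, T, T+1) for an earnings date, or None if any session is missing.

    `trading_days` must be sorted ascending. T is the first session on or
    after `earn_date`.
    """
    i = bisect_left(trading_days, earn_date)  # index of first day >= earn_date
    if i == 0 or i + 1 >= len(trading_days):
        return None  # no prior session, or no T / T+1 available
    return trading_days[i - 1], trading_days[i], trading_days[i + 1]

sessions = [date(2025, 1, 2), date(2025, 1, 3), date(2025, 1, 6), date(2025, 1, 7)]

# An earnings date on a Saturday rolls forward to the next session.
weekend_window = event_window(sessions, date(2025, 1, 4))

# No session before the first trading day, so no T-1 exists.
first_day_window = event_window(sessions, date(2025, 1, 2))
```

Events that fall at either edge of the price history return `None` and can simply be skipped.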

In [5]:
# We'll use existing prices.pqt - no need to fetch from API
# The prices were previously fetched and saved

In [6]:
# Sample stocks with multiple earnings in our period
# Focus on stocks with >= 4 earnings (1+ year of data)
earnings_counts = us_earnings.groupby('symbol').size()
frequent_earners = earnings_counts[earnings_counts >= 4].index.tolist()

print(f"Stocks with 4+ earnings in period: {len(frequent_earners)}")
print(f"Sample: {frequent_earners[:20]}")

Stocks with 4+ earnings in period: 4651
Sample: ['A', 'AA', 'AACG', 'AAL', 'AAME', 'AAMI', 'AAOI', 'AAON', 'AAP', 'AAPL', 'AAT', 'AB', 'ABAT', 'ABBV', 'ABCB', 'ABCL', 'ABEO', 'ABEV', 'ABG', 'ABM']


In [None]:
# UPDATED: Use existing prices.pqt instead of fetching from API
# This gives us much more coverage (5,644 symbols vs 200 sample)

prices_file = Path('../data/prices.pqt')

if prices_file.exists():
    print("Loading existing prices.pqt...")
    all_prices = pd.read_parquet(prices_file)
    all_prices['date'] = pd.to_datetime(all_prices['date'])
    
    # Get symbols that are in both prices and earnings
    price_symbols = set(all_prices['symbol'].unique())
    earning_symbols = set(us_earnings['symbol'].unique())
    common_symbols = price_symbols & earning_symbols
    
    print(f"Prices: {len(price_symbols):,} symbols")
    print(f"Earnings: {len(earning_symbols):,} symbols")
    print(f"Common: {len(common_symbols):,} symbols")
    
    # Build price cache using groupby (MUCH faster than loop filtering)
    print("Building price cache...")
    all_prices_sorted = all_prices.sort_values(['symbol', 'date'])
    
    # Filter to common symbols first, then groupby
    common_prices = all_prices_sorted[all_prices_sorted['symbol'].isin(common_symbols)]
    price_cache = {symbol: group for symbol, group in common_prices.groupby('symbol')}
    
    print(f"Built price cache for {len(price_cache)} symbols")
else:
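    print("prices.pqt not found, falling back to API fetch...")
    # Original API fetch code (limited to 200 symbols).
    # NOTE: fetch_historical_prices() is not defined in this notebook; it must be
    # defined or imported before this branch can run, otherwise it raises NameError.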
    sample_size = 200
    sample_symbols = frequent_earners[:sample_size]
    
    print(f"Fetching prices for {len(sample_symbols)} symbols...")
    
    price_cache = {}
    for i, symbol in enumerate(sample_symbols):
        if i > 0 and i % 20 == 0:
            print(f"  Progress: {i}/{len(sample_symbols)}")
        
        df = fetch_historical_prices(
            symbol,
            start_date.strftime('%Y-%m-%d'),
            end_date.strftime('%Y-%m-%d')
        )
        if not df.empty:
            price_cache[symbol] = df
        time.sleep(0.15)
    
    print(f"\nGot prices for {len(price_cache)} symbols")

## 3. Compute Earnings Moves

For each earnings event, compute:
- **Gap move:** |Close_T-1 → Open_T| (pure earnings reaction)
- **Full move:** |Close_T-1 → Close_T| (includes intraday)
- **Overnight hold move:** |Close_T-1 → Close_T+1| (matches our exit strategy)
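The three definitions above can be checked by hand on a single made-up event. This is a sketch with invented prices; `earnings_moves` is an illustrative name, not a function from the notebook.

```python
def earnings_moves(close_prev, open_t, close_t, close_next):
    """Signed and absolute moves, all measured from the T-1 close."""
    gap = (open_t - close_prev) / close_prev          # Close_T-1 -> Open_T
    full = (close_t - close_prev) / close_prev        # Close_T-1 -> Close_T
    overnight = (close_next - close_prev) / close_prev  # Close_T-1 -> Close_T+1
    return {
        'gap_move': gap, 'gap_move_abs': abs(gap),
        'full_move': full, 'full_move_abs': abs(full),
        'overnight_move': overnight, 'overnight_move_abs': abs(overnight),
    }

# Invented prices: closes at 100, gaps up to 104, fades to 102, then closes at 97.
moves = earnings_moves(close_prev=100.0, open_t=104.0, close_t=102.0, close_next=97.0)
for name, value in moves.items():
    print(f"{name}: {value:.2%}")
```

Here the gap is +4%, the full move +2%, and the overnight hold move -3%, so the three measures can disagree in both size and sign for the same event.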

In [None]:
def compute_earnings_moves(symbol: str, earnings_dates: list, prices_df: pd.DataFrame) -> list:
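    """Compute moves around each earnings date.

    T is the first trading day on or after the earnings date, regardless of
    BMO/AMC timing. For after-close (AMC) announcements the reaction lands on
    the following session, so only the T-1 -> T+1 move is guaranteed to span it.
    """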
    moves = []
    
    prices_df = prices_df.set_index('date').sort_index()
    
    for earn_date in earnings_dates:
        earn_date = pd.to_datetime(earn_date)
        
        try:
            # Find T-1 (day before earnings)
            t_minus_1_candidates = prices_df[prices_df.index < earn_date].tail(1)
            if t_minus_1_candidates.empty:
                continue
            t_minus_1 = t_minus_1_candidates.index[0]
            
            # Find T (earnings day or next trading day)
            t_candidates = prices_df[prices_df.index >= earn_date].head(1)
            if t_candidates.empty:
                continue
            t = t_candidates.index[0]
            
            # Find T+1 (day after earnings reaction)
            t_plus_1_candidates = prices_df[prices_df.index > t].head(1)
            if t_plus_1_candidates.empty:
                continue
            t_plus_1 = t_plus_1_candidates.index[0]
            
            # Get prices
            close_t_minus_1 = prices_df.loc[t_minus_1, 'close']
            open_t = prices_df.loc[t, 'open']
            close_t = prices_df.loc[t, 'close']
            close_t_plus_1 = prices_df.loc[t_plus_1, 'close']
            
            # Compute moves
            gap_move = (open_t - close_t_minus_1) / close_t_minus_1
            full_move = (close_t - close_t_minus_1) / close_t_minus_1
            overnight_move = (close_t_plus_1 - close_t_minus_1) / close_t_minus_1
            
            moves.append({
                'symbol': symbol,
                'earnings_date': earn_date,
                'close_t_minus_1': close_t_minus_1,
                'open_t': open_t,
                'close_t': close_t,
                'close_t_plus_1': close_t_plus_1,
                'gap_move': gap_move,
                'gap_move_abs': abs(gap_move),
                'full_move': full_move,
                'full_move_abs': abs(full_move),
                'overnight_move': overnight_move,
                'overnight_move_abs': abs(overnight_move),
            })
        except Exception:
            # Skip events whose price rows are missing or malformed
            continue
    
    return moves

In [None]:
# Compute moves for all symbols with prices
all_moves = []

for symbol in price_cache:
    # Get earnings dates for this symbol
    symbol_earnings = us_earnings[us_earnings['symbol'] == symbol]['date'].tolist()
    
    moves = compute_earnings_moves(symbol, symbol_earnings, price_cache[symbol])
    all_moves.extend(moves)

moves_df = pd.DataFrame(all_moves)
print(f"Computed {len(moves_df)} earnings moves")
print(f"Unique symbols: {moves_df['symbol'].nunique()}")

In [None]:
# Save for later use
moves_df.to_parquet(DATA_DIR / 'historical_earnings_moves.parquet', index=False)
print(f"Saved to {DATA_DIR / 'historical_earnings_moves.parquet'}")

## 4. Move Distribution Analysis

In [None]:
# Overall distribution statistics
print("=" * 60)
print("EARNINGS MOVE DISTRIBUTION (Absolute Values)")
print("=" * 60)

for move_type in ['gap_move_abs', 'full_move_abs', 'overnight_move_abs']:
    print(f"\n{move_type.replace('_abs', '').replace('_', ' ').title()}:")
    data = moves_df[move_type] * 100  # Convert to percentage
    print(f"  Mean:   {data.mean():.2f}%")
    print(f"  Median: {data.median():.2f}%")
    print(f"  Std:    {data.std():.2f}%")
    print(f"  Q75:    {data.quantile(0.75):.2f}%")
    print(f"  Q90:    {data.quantile(0.90):.2f}%")
    print(f"  Q95:    {data.quantile(0.95):.2f}%")
    print(f"  Max:    {data.max():.2f}%")

In [None]:
# Distribution by price bucket (proxy for market cap)
moves_df['price_bucket'] = pd.cut(
    moves_df['close_t_minus_1'],
    bins=[0, 20, 50, 100, 200, 500, float('inf')],
    labels=['<$20', '$20-50', '$50-100', '$100-200', '$200-500', '>$500']
)

print("\n" + "=" * 60)
print("OVERNIGHT MOVE BY PRICE BUCKET")
print("=" * 60)

bucket_stats = moves_df.groupby('price_bucket')['overnight_move_abs'].agg([
    'count',
    'mean',
    'median',
    ('q75', lambda x: x.quantile(0.75)),
    ('q90', lambda x: x.quantile(0.90)),
])

# Convert the move statistics to percentages; leave the count untouched
pct_cols = ['mean', 'median', 'q75', 'q90']
bucket_stats[pct_cols] = bucket_stats[pct_cols] * 100
bucket_stats.columns = ['Count', 'Mean %', 'Median %', 'Q75 %', 'Q90 %']
print(bucket_stats.to_string())

In [None]:
# Plot distribution
try:
    import matplotlib.pyplot as plt
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for i, move_type in enumerate(['gap_move_abs', 'full_move_abs', 'overnight_move_abs']):
        ax = axes[i]
        data = moves_df[move_type] * 100
        data = data[data < 30]  # Drop moves >= 30% for readability; median/Q75 lines use the filtered data
        ax.hist(data, bins=50, edgecolor='black', alpha=0.7)
        ax.axvline(data.median(), color='red', linestyle='--', label=f'Median: {data.median():.1f}%')
        ax.axvline(data.quantile(0.75), color='orange', linestyle='--', label=f'Q75: {data.quantile(0.75):.1f}%')
        ax.set_xlabel('|Move| %')
        ax.set_ylabel('Count')
        ax.set_title(move_type.replace('_abs', '').replace('_', ' ').title())
        ax.legend()
    
    plt.tight_layout()
    plt.savefig(DATA_DIR / 'earnings_move_distributions.png', dpi=100)
    plt.show()
    print(f"Saved plot to {DATA_DIR / 'earnings_move_distributions.png'}")
except ImportError:
    print("matplotlib not available - skipping plot")

## 5. Stock-Level Analysis

Which stocks have predictable vs volatile earnings reactions?
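The cells below rank predictability by the coefficient of variation (CV = std / mean of |move|). As a small sketch with invented numbers, two stocks can share the same average move yet differ sharply in CV. `statistics.stdev` is the sample standard deviation, which matches the pandas default (`ddof=1`).

```python
from statistics import mean, stdev

def coefficient_of_variation(abs_moves):
    """CV = sample std / mean; lower means more consistent move sizes."""
    return stdev(abs_moves) / mean(abs_moves)

# Invented |move| histories with the same 5.5% average.
steady = [0.05, 0.06, 0.05, 0.06]
erratic = [0.01, 0.12, 0.02, 0.07]

cv_steady = coefficient_of_variation(steady)
cv_erratic = coefficient_of_variation(erratic)
print(f"steady:  mean={mean(steady):.3f}  cv={cv_steady:.2f}")
print(f"erratic: mean={mean(erratic):.3f}  cv={cv_erratic:.2f}")
```

With only four observations per stock, the CV itself is a noisy estimate, which is one reason to require a minimum earnings count before ranking.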

In [None]:
# Per-stock statistics
stock_stats = moves_df.groupby('symbol').agg({
    'overnight_move_abs': ['count', 'mean', 'std', 'median'],
    'close_t_minus_1': 'last',  # Most recent price
}).round(4)

stock_stats.columns = ['earnings_count', 'mean_move', 'std_move', 'median_move', 'last_price']
stock_stats = stock_stats[stock_stats['earnings_count'] >= 4]  # At least 4 earnings

# Compute coefficient of variation (std/mean) - lower = more predictable
stock_stats['cv'] = stock_stats['std_move'] / stock_stats['mean_move']

print(f"Stocks with 4+ earnings: {len(stock_stats)}")

In [None]:
# Most volatile (largest average moves)
print("\n" + "=" * 60)
print("MOST VOLATILE EARNINGS (Largest Average |Move|)")
print("=" * 60)
volatile = stock_stats.nlargest(15, 'mean_move').copy()
volatile['mean_move_pct'] = volatile['mean_move'] * 100
volatile['median_move_pct'] = volatile['median_move'] * 100
print(volatile[['earnings_count', 'mean_move_pct', 'median_move_pct', 'last_price']].to_string())

In [None]:
# Most predictable (lowest coefficient of variation among decent movers)
print("\n" + "=" * 60)
print("MOST PREDICTABLE EARNINGS (Low CV, Decent Moves)")
print("=" * 60)
# Filter to stocks with at least 3% average move (interesting for trading)
decent_movers = stock_stats[stock_stats['mean_move'] >= 0.03]
predictable = decent_movers.nsmallest(15, 'cv').copy()
predictable['mean_move_pct'] = predictable['mean_move'] * 100
print(predictable[['earnings_count', 'mean_move_pct', 'cv', 'last_price']].to_string())

In [None]:
# Save stock stats
stock_stats.to_parquet(DATA_DIR / 'stock_earnings_stats.parquet')
print(f"Saved stock stats to {DATA_DIR / 'stock_earnings_stats.parquet'}")

## 6. Implied Move Baseline

In the absence of historical options data, estimate what implied moves might look like.

As a rough heuristic (not calibrated against options data), assume the ATM straddle prices in about 1.2-1.5x the historical mean |move|, so:
- If historical mean |move| = 5%, implied move ≈ 6-7.5%
- If historical mean |move| = 10%, implied move ≈ 12-15%

The edge comes from correctly predicting the tails.
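To make the arithmetic concrete, here is a sketch of the edge calculation on invented |move| histories, using the notebook's heuristic multiplier of 1.3. The function name `potential_edge` is illustrative. `statistics.quantiles(..., method='inclusive')` uses linear interpolation, the same convention as the pandas default for `quantile`.

```python
from statistics import mean, quantiles

IMPLIED_MULTIPLIER = 1.3  # heuristic assumption, not derived from options data

def potential_edge(abs_moves, multiplier=IMPLIED_MULTIPLIER):
    """Return q75 of |move| minus the estimated implied move (mean * multiplier)."""
    q75 = quantiles(abs_moves, n=4, method='inclusive')[2]
    estimated_implied = mean(abs_moves) * multiplier
    return q75 - estimated_implied

# One large outlier inflates the mean, so the estimated implied exceeds q75.
edge_fat_tail = potential_edge([0.02, 0.03, 0.04, 0.05, 0.16])

# Moves clustered at two levels push q75 well above mean * 1.3.
edge_bimodal = potential_edge([0.01, 0.02, 0.09, 0.10])

print(f"fat tail: {edge_fat_tail:+.3f}")
print(f"bimodal:  {edge_bimodal:+.3f}")
```

Under this heuristic the sign of the edge depends only on the shape of each stock's historical |move| distribution, since both q75 and the estimated implied move come from the same sample.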

In [None]:
# Estimate implied moves and potential edge
# Assumption: Market prices in ~1.3x historical mean (rough heuristic)
IMPLIED_MULTIPLIER = 1.3

stock_stats['estimated_implied'] = stock_stats['mean_move'] * IMPLIED_MULTIPLIER

# Potential edge = q75 - estimated_implied
# (If we can predict q75 will happen, and market only prices in mean*1.3)

# First need to compute q75 per stock
q75_by_stock = moves_df.groupby('symbol')['overnight_move_abs'].quantile(0.75)
stock_stats['q75_move'] = q75_by_stock
stock_stats['potential_edge'] = stock_stats['q75_move'] - stock_stats['estimated_implied']

print("\n" + "=" * 60)
print("POTENTIAL EDGE (Q75 vs Estimated Implied)")
print("=" * 60)

# Stocks where q75 exceeds estimated implied (potential long vol edge)
edge_stocks = stock_stats[stock_stats['potential_edge'] > 0.01].copy()  # >1% edge
edge_stocks['edge_pct'] = edge_stocks['potential_edge'] * 100
edge_stocks['q75_pct'] = edge_stocks['q75_move'] * 100
edge_stocks['implied_pct'] = edge_stocks['estimated_implied'] * 100

print(f"\nStocks with >1% potential edge: {len(edge_stocks)}")
print("\nTop 15 by potential edge:")
print(edge_stocks.nlargest(15, 'potential_edge')[['earnings_count', 'q75_pct', 'implied_pct', 'edge_pct', 'last_price']].to_string())

## 7. Summary & Next Steps

In [None]:
print("=" * 60)
print("HISTORICAL EARNINGS MOVE ANALYSIS SUMMARY")
print("=" * 60)

print(f"""
Data Collected:
  - Earnings events analyzed: {len(moves_df)}
  - Unique stocks: {moves_df['symbol'].nunique()}
  - Date range: {moves_df['earnings_date'].min()} to {moves_df['earnings_date'].max()}

Overnight Move Distribution (|Close_T-1 → Close_T+1|):
  - Mean: {moves_df['overnight_move_abs'].mean()*100:.2f}%
  - Median: {moves_df['overnight_move_abs'].median()*100:.2f}%
  - Q75: {moves_df['overnight_move_abs'].quantile(0.75)*100:.2f}%
  - Q90: {moves_df['overnight_move_abs'].quantile(0.90)*100:.2f}%
  - Q95: {moves_df['overnight_move_abs'].quantile(0.95)*100:.2f}%

Key Findings:
  1. Average earnings move is ~{moves_df['overnight_move_abs'].mean()*100:.1f}% (overnight hold)
  2. ~25% of earnings moves exceed {moves_df['overnight_move_abs'].quantile(0.75)*100:.1f}% (q75)
  3. ~10% of earnings moves exceed {moves_df['overnight_move_abs'].quantile(0.90)*100:.1f}% (q90)
  4. Lower-priced stocks tend to have larger moves (check against the price-bucket table)
  5. Some stocks have predictable move magnitude (low CV)

Files Saved:
  - {DATA_DIR / 'historical_earnings_moves.parquet'}
  - {DATA_DIR / 'stock_earnings_stats.parquet'}

Next Steps:
  1. Add more features (volatility regime, sector, etc.) for ML model
  2. Compare to actual implied moves when we have options data
  3. Build quantile regression model to predict q50/q75/q90
  4. Validate calibration on held-out data
""")