# 1.0 Feature Engineering for Earnings Move Prediction

Build a comprehensive feature set to predict the distribution of the post-earnings |move|.

**Reuses existing data from the news_ranking project:**
- `news_embeddings.pqt` - 1.7M+ pre-computed embeddings (768-dim)
- `key_metrics.pqt`, `ratios.pqt`, `growth.pqt` - fundamentals
- `filing_dates.pqt` - SEC filing dates for point-in-time alignment

## Feature Categories

1. **Historical earnings behavior** - past moves, consistency
2. **Pre-earnings news** - PCA-reduced embeddings (10 components) from T-7 to T-1
3. **Fundamentals** - key metrics, ratios, growth (point-in-time)
4. **Price context** - momentum, volatility, positioning
5. **Analyst expectations** - surprise history

## Key Design Decision: PCA-10 for News Embeddings

Full 768-dim embeddings hurt model performance (overfitting). Testing showed:
- Full embeddings: worse calibration
- PCA-10: 31% improvement in q75 calibration

We reduce 768-dim → 10 PCA components, capturing ~35% of the variance while limiting overfitting.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pathlib import Path
import requests
import os
from dotenv import load_dotenv
import time
import warnings
import joblib
from sklearn.decomposition import PCA
warnings.filterwarnings('ignore')

load_dotenv()
FMP_KEY = os.getenv('FMP_API_KEY')

DATA_DIR = Path('../data')
EARNINGS_DIR = DATA_DIR / 'earnings'
NEWS_DIR = DATA_DIR / 'news_ranking'
MODEL_DIR = Path('../models')

EARNINGS_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# PCA configuration
N_PCA_COMPONENTS = 10

## 1. Load Existing Data

In [2]:
# Load earnings moves from 0.2 notebook
moves_df = pd.read_parquet(EARNINGS_DIR / 'historical_earnings_moves.parquet')
moves_df['earnings_date'] = pd.to_datetime(moves_df['earnings_date'])
print(f"Earnings moves: {len(moves_df)} events, {moves_df['symbol'].nunique()} symbols")
print(f"Date range: {moves_df['earnings_date'].min().date()} to {moves_df['earnings_date'].max().date()}")

Earnings moves: 5307 events, 2212 symbols
Date range: 2024-03-19 to 2025-12-18


In [3]:
# Load earnings calendar for timing info
earnings_cal = pd.read_parquet(EARNINGS_DIR / 'earnings_calendar.parquet')
earnings_cal['date'] = pd.to_datetime(earnings_cal['date'])
print(f"Earnings calendar: {len(earnings_cal)} events")

Earnings calendar: 21931 events


In [4]:
# Check what data we have from news_ranking
print("Available news_ranking data:")
for f in sorted(NEWS_DIR.glob('*.pqt')):
    size_mb = f.stat().st_size / 1e6
    print(f"  {f.name}: {size_mb:.1f} MB")

Available news_ranking data:
  all_the_news_anon.pqt: 1001.2 MB
  backtest_vol_comparison.pqt: 0.0 MB
  confidence_gating_best.pqt: 0.0 MB
  confidence_scores.pqt: 7.5 MB
  dropout_gridsearch_results.pqt: 0.0 MB
  dropout_search_results.pqt: 0.0 MB
  hyperparam_arch_results.pqt: 0.0 MB
  hyperparam_train_results.pqt: 0.0 MB
  ml_dataset.pqt: 3815.8 MB
  news_embeddings.pqt: 6343.4 MB
  price_features.pqt: 635.8 MB
  risk_management_results.pqt: 0.0 MB
  robust_arch_results.pqt: 0.0 MB
  robust_dropout_results.pqt: 0.0 MB
  robust_train_results.pqt: 0.0 MB
  short_backtest_improved.pqt: 0.0 MB
  short_backtest_results.pqt: 0.0 MB
  strategy_comparison_results.pqt: 0.0 MB
  strategy_evaluation_results.pqt: 0.0 MB
  symbol_metrics_val.pqt: 0.1 MB
  vol_targeting_best.pqt: 0.0 MB


## 2. Historical Earnings Features

For each earnings event, compute features based on that stock's past earnings behavior.
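
The same "past events only" rule can also be written with grouped expanding windows instead of a per-event loop. This is a minimal sketch rather than the notebook's implementation: `expanding_past_stats` is a hypothetical helper, and it assumes a frame with `symbol`, `earnings_date`, and `overnight_move_abs` columns. Shifting by one event before expanding keeps each row's own move out of its statistics.

```python
import pandas as pd

def expanding_past_stats(df: pd.DataFrame, value_col: str = "overnight_move_abs") -> pd.DataFrame:
    """Hypothetical helper: per-symbol statistics over strictly earlier events."""
    df = df.sort_values(["symbol", "earnings_date"]).copy()
    grouped = df.groupby("symbol")[value_col]
    # shift(1) removes the current event from its own window (no lookahead)
    df["hist_move_mean"] = grouped.transform(lambda s: s.shift(1).expanding().mean())
    df["hist_move_max"] = grouped.transform(lambda s: s.shift(1).expanding().max())
    # cumcount() is the number of earlier events for the same symbol
    df["n_past_earnings"] = df.groupby("symbol").cumcount()
    return df

# Toy input: three events for one symbol
toy = pd.DataFrame({
    "symbol": ["A", "A", "A"],
    "earnings_date": pd.to_datetime(["2024-05-29", "2024-11-25", "2025-05-28"]),
    "overnight_move_abs": [0.113285, 0.00396, 0.018156],
})
out = expanding_past_stats(toy)
```

The first event of each symbol gets NaN statistics and `n_past_earnings == 0`, which mirrors the "no history" branch of the loop version.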

In [5]:
def compute_historical_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute historical earnings features for each event.
    Uses only data available BEFORE the event (no lookahead).
    """
    df = df.sort_values(['symbol', 'earnings_date']).copy()
    
    features = []
    
    for symbol in df['symbol'].unique():
        symbol_df = df[df['symbol'] == symbol].copy()
        
        for i in range(len(symbol_df)):
            row = symbol_df.iloc[i]
            
            # Get all PREVIOUS earnings for this symbol
            past = symbol_df.iloc[:i]
            
            feat = {
                'symbol': symbol,
                'earnings_date': row['earnings_date'],
                'target_move': row['overnight_move_abs'],
                'gap_move': row.get('gap_move_abs', np.nan),
                'close_t_minus_1': row['close_t_minus_1'],
            }
            
            if len(past) >= 1:
                # Historical move statistics
                feat['hist_move_mean'] = past['overnight_move_abs'].mean()
                feat['hist_move_median'] = past['overnight_move_abs'].median()
                feat['hist_move_std'] = past['overnight_move_abs'].std() if len(past) > 1 else 0
                feat['hist_move_max'] = past['overnight_move_abs'].max()
                feat['hist_move_min'] = past['overnight_move_abs'].min()
                
                # Coefficient of variation (predictability)
                if feat['hist_move_mean'] > 0:
                    feat['hist_move_cv'] = feat['hist_move_std'] / feat['hist_move_mean']
                else:
                    feat['hist_move_cv'] = 0
                
                # Recent moves (last 2 quarters)
                recent = past.tail(2)
                feat['recent_move_mean'] = recent['overnight_move_abs'].mean()
                
                # Trend (are moves getting bigger or smaller?)
                if len(past) >= 2:
                    feat['move_trend'] = past['overnight_move_abs'].iloc[-1] - past['overnight_move_abs'].iloc[0]
                else:
                    feat['move_trend'] = 0
                
                # Gap vs full move ratio (does stock continue or reverse?)
                if 'gap_move_abs' in past.columns:
                    gap_mean = past['gap_move_abs'].mean()
                    if gap_mean > 0:
                        feat['gap_continuation_ratio'] = feat['hist_move_mean'] / gap_mean
                    else:
                        feat['gap_continuation_ratio'] = 1
                
                # Number of historical observations
                feat['n_past_earnings'] = len(past)
            else:
                # No history - use defaults
                feat['hist_move_mean'] = np.nan
                feat['hist_move_median'] = np.nan
                feat['hist_move_std'] = np.nan
                feat['hist_move_max'] = np.nan
                feat['hist_move_min'] = np.nan
                feat['hist_move_cv'] = np.nan
                feat['recent_move_mean'] = np.nan
                feat['move_trend'] = np.nan
                feat['gap_continuation_ratio'] = np.nan
                feat['n_past_earnings'] = 0
            
            features.append(feat)
    
    return pd.DataFrame(features)

In [6]:
# Compute historical features
hist_features = compute_historical_features(moves_df)
print(f"Computed features for {len(hist_features)} earnings events")
print(f"Events with history (n_past >= 1): {(hist_features['n_past_earnings'] >= 1).sum()}")
print(f"Events with history (n_past >= 4): {(hist_features['n_past_earnings'] >= 4).sum()}")
hist_features.head()

Computed features for 5307 earnings events
Events with history (n_past >= 1): 3095
Events with history (n_past >= 4): 575


| | symbol | earnings_date | target_move | gap_move | close_t_minus_1 | hist_move_mean | hist_move_median | hist_move_std | hist_move_max | hist_move_min | hist_move_cv | recent_move_mean | move_trend | gap_continuation_ratio | n_past_earnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 2024-05-29 | 0.113285 | 0.009918 | 148.21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | A | 2024-11-25 | 0.00396 | 0.005155 | 133.84 | 0.113285 | 0.113285 | 0.0 | 0.113285 | 0.113285 | 0.0 | 0.113285 | 0.0 | 11.421769 | 1 |
| 2 | A | 2025-05-28 | 0.018156 | 0.003865 | 111.26 | 0.058623 | 0.058623 | 0.077305 | 0.113285 | 0.00396 | 1.318684 | 0.058623 | -0.109325 | 7.778092 | 2 |
| 3 | A | 2025-08-27 | 0.056298 | 0.001691 | 118.3 | 0.045134 | 0.018156 | 0.059446 | 0.113285 | 0.00396 | 1.317118 | 0.011058 | -0.09513 | 7.149467 | 3 |
| 4 | A | 2025-11-24 | 0.039339 | 0.004298 | 151.25 | 0.047925 | 0.037227 | 0.048858 | 0.113285 | 0.00396 | 1.019468 | 0.037227 | -0.056988 | 9.292571 | 4 |


## 3. Pre-Earnings News Features (PCA-10)

Aggregate news embeddings from the 7 days before each earnings event.

Strategy:
- Look at news from T-7 to T-1 before earnings
- Mean-pool 768-dim embeddings for that window
- Apply PCA to reduce to 10 components
- Also include news count as a feature

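**Why PCA?** Full 768-dim embeddings cause overfitting. Ten principal components keep about a third of the variance (34.5% in the fit below) in a far smaller feature space, which leaves the model less room to overfit.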
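
The strategy above can be sketched end to end on synthetic data. Everything in this block is an illustrative assumption: the embeddings are random 32-dim vectors rather than the real 768-dim ones, and the event and article counts are made up. It only shows the order of operations: mean-pool per event, keep the article count, then fit PCA on the pooled vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in: 200 events, each with 1 to 5 articles of 32-dim embeddings
event_embs = [rng.normal(size=(int(rng.integers(1, 6)), 32)) for _ in range(200)]

# Mean-pool the articles of each event into one vector, and keep the count
pooled = np.vstack([e.mean(axis=0) for e in event_embs])   # shape (200, 32)
counts = np.array([len(e) for e in event_embs])             # news count feature

# Reduce the pooled vectors to 10 principal components
pca = PCA(n_components=10, random_state=42)
reduced = pca.fit_transform(pooled)                         # shape (200, 10)
explained = pca.explained_variance_ratio_.sum()
```

Pooling before PCA means each event contributes exactly one row to the fit, regardless of how many articles it had.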

In [7]:
# Load news data and embeddings
print("Loading news data...")
news = pd.read_parquet(NEWS_DIR / 'all_the_news_anon.pqt')
news['publishedDate'] = pd.to_datetime(news['publishedDate'])
print(f"News articles: {len(news):,}")

print("\nLoading embeddings (this is 6GB, may take a minute)...")
embeddings = pd.read_parquet(NEWS_DIR / 'news_embeddings.pqt')
print(f"Embeddings: {len(embeddings):,} rows")

# Get embedding columns
emb_cols = [c for c in embeddings.columns if c.startswith('emb_')]
print(f"Embedding dimension: {len(emb_cols)}")

Loading news data...
News articles: 1,747,711

Loading embeddings (this is 6GB, may take a minute)...
Embeddings: 1,748,149 rows
Embedding dimension: 768


In [8]:
# Join embeddings with news metadata
news_with_emb = embeddings.merge(
    news[['url', 'symbol', 'publishedDate']],
    on=['url', 'symbol'],
    how='inner'
)
news_with_emb['pub_date'] = news_with_emb['publishedDate'].dt.date
print(f"News with embeddings: {len(news_with_emb):,}")

News with embeddings: 1,747,711


In [9]:
def aggregate_pre_earnings_news(earnings_df: pd.DataFrame, 
                                 news_df: pd.DataFrame,
                                 emb_cols: list,
                                 lookback_days: int = 7) -> pd.DataFrame:
    """
    For each earnings event, aggregate news embeddings from [T-lookback, T-1].
    Returns DataFrame with mean-pooled 768-dim embeddings and news count.
    (PCA is applied separately after aggregation)
    """
    from tqdm.auto import tqdm
    
    results = []
    
    # Group news by symbol for faster lookup
    news_by_symbol = {symbol: grp for symbol, grp in news_df.groupby('symbol')}
    
    for _, row in tqdm(earnings_df.iterrows(), total=len(earnings_df), desc="Aggregating news"):
        symbol = row['symbol']
        earn_date = row['earnings_date'].date()
        
        result = {
            'symbol': symbol,
            'earnings_date': row['earnings_date'],
        }
        
        if symbol not in news_by_symbol:
            # No news for this symbol
            result['pre_earnings_news_count'] = 0
            for col in emb_cols:
                result[col] = 0.0  # Store raw embedding, not news_emb_
            results.append(result)
            continue
        
        symbol_news = news_by_symbol[symbol]
        
        # Filter to lookback window [T-lookback, T-1]
        start_date = earn_date - timedelta(days=lookback_days)
        end_date = earn_date - timedelta(days=1)
        
        window_news = symbol_news[
            (symbol_news['pub_date'] >= start_date) &
            (symbol_news['pub_date'] <= end_date)
        ]
        
        result['pre_earnings_news_count'] = len(window_news)
        
        if len(window_news) > 0:
            # Mean-pool embeddings
            mean_emb = window_news[emb_cols].mean()
            for col in emb_cols:
                result[col] = mean_emb[col]
        else:
            for col in emb_cols:
                result[col] = 0.0
        
        results.append(result)
    
    return pd.DataFrame(results)

In [10]:
# Aggregate pre-earnings news (7-day lookback)
# This takes a while - cache the result
news_features_full_file = EARNINGS_DIR / 'pre_earnings_news_features_full.parquet'

if news_features_full_file.exists():
    print("Loading cached full news features...")
    news_features_full = pd.read_parquet(news_features_full_file)
else:
    print("Computing pre-earnings news features (this may take 10-20 minutes)...")
    news_features_full = aggregate_pre_earnings_news(
        hist_features,
        news_with_emb,
        emb_cols,
        lookback_days=7
    )
    news_features_full.to_parquet(news_features_full_file, index=False)
    print(f"Saved to {news_features_full_file}")

print(f"Full news features shape: {news_features_full.shape}")
print(f"Events with news: {(news_features_full['pre_earnings_news_count'] > 0).sum()} ({(news_features_full['pre_earnings_news_count'] > 0).mean()*100:.1f}%)")

Loading cached full news features...
Full news features shape: (5313, 771)
Events with news: 2381 (44.8%)


In [11]:
# Apply PCA to reduce 768-dim embeddings to 10 components

# Detect embedding columns in the cached file (could be 'emb_*' or 'news_emb_*')
emb_cols_in_cache = [c for c in news_features_full.columns if c.startswith('emb_') or c.startswith('news_emb_')]
print(f"\nApplying PCA: {len(emb_cols_in_cache)} dims -> {N_PCA_COMPONENTS} components")

# Get rows with actual news (non-zero embeddings)
has_news = news_features_full['pre_earnings_news_count'] > 0
X_emb = news_features_full.loc[has_news, emb_cols_in_cache].values

print(f"Fitting PCA on {len(X_emb)} rows with news...")

# Fit PCA
pca = PCA(n_components=N_PCA_COMPONENTS, random_state=42)
pca.fit(X_emb)

print(f"Variance explained: {pca.explained_variance_ratio_.sum()*100:.1f}%")
print(f"Per component: {[f'{v:.1%}' for v in pca.explained_variance_ratio_]}")

# Save PCA model for inference
pca_path = MODEL_DIR / 'news_pca.joblib'
joblib.dump(pca, pca_path)
print(f"Saved PCA model to {pca_path}")


Applying PCA: 768 dims -> 10 components
Fitting PCA on 2381 rows with news...
Variance explained: 34.5%
Per component: ['9.8%', '6.8%', '4.5%', '2.8%', '2.4%', '2.1%', '2.0%', '1.5%', '1.3%', '1.3%']
Saved PCA model to ../models/news_pca.joblib


In [12]:
# Transform all embeddings to PCA features
# Fill NaN with 0 for rows without news (PCA doesn't accept NaN)
X_all_emb = news_features_full[emb_cols_in_cache].fillna(0).values
X_pca = pca.transform(X_all_emb)

# Create news_features with PCA columns instead of 768-dim
news_features = news_features_full[['symbol', 'earnings_date', 'pre_earnings_news_count']].copy()
for i in range(N_PCA_COMPONENTS):
    news_features[f'news_pca_{i}'] = X_pca[:, i]

print(f"News features shape (with PCA): {news_features.shape}")
news_features.head()

News features shape (with PCA): (5313, 13)


| | symbol | earnings_date | pre_earnings_news_count | news_pca_0 | news_pca_1 | news_pca_2 | news_pca_3 | news_pca_4 | news_pca_5 | news_pca_6 | news_pca_7 | news_pca_8 | news_pca_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HYFT | 2024-09-16 | 1 | -0.165235 | -0.141558 | -0.152304 | 0.128352 | -0.114902 | 0.128625 | -0.155911 | 0.012585 | -0.059119 | -0.073341 |
| 1 | HYFT | 2024-12-10 | 2 | -0.122203 | -0.224748 | -0.133205 | 0.126947 | -0.063816 | 0.069849 | -0.121516 | -0.029824 | -0.061042 | -0.06286 |
| 2 | HYFT | 2025-03-28 | 0 | 0.030873 | -0.170455 | -0.197268 | 0.010112 | -0.106102 | 0.053732 | 0.076889 | 0.15101 | -0.028557 | -0.018098 |
| 3 | HYFT | 2025-09-15 | 0 | 0.030873 | -0.170455 | -0.197268 | 0.010112 | -0.106102 | 0.053732 | 0.076889 | 0.15101 | -0.028557 | -0.018098 |
| 4 | HYFT | 2025-12-15 | 1 | -0.314191 | 0.074169 | 0.037352 | 0.064734 | 0.052504 | 0.015498 | 0.007807 | -0.053889 | -0.017339 | -0.006043 |
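
Rows 2 and 3 have no news yet carry identical, non-zero PCA values. That is expected: scikit-learn's `PCA.transform` subtracts the fitted mean before projecting, so an all-zero input maps to one constant vector. The sketch below reproduces the effect on synthetic data (the 8-dim inputs and 3 components are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic "embeddings" with a non-zero mean, standing in for the pooled vectors
X = rng.normal(loc=0.5, size=(100, 8))
pca = PCA(n_components=3, random_state=42).fit(X)

# Two "no news" rows, encoded as all zeros as in the notebook
zero_proj = pca.transform(np.zeros((2, 8)))

# transform(x) = (x - mean_) @ components_.T, so zeros map to -mean_ @ components_.T
expected = -pca.mean_ @ pca.components_.T
```

Because every no-news event lands on the same point, `pre_earnings_news_count` is what lets a downstream model tell "no news" apart from a real article that happens to project nearby.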


In [13]:
# Merge news features with historical features
news_features['earnings_date'] = pd.to_datetime(news_features['earnings_date'])
features_df = hist_features.merge(
    news_features,
    on=['symbol', 'earnings_date'],
    how='left'
)
print(f"After news merge: {features_df.shape}")

After news merge: (5343, 26)
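
A left merge returns more rows than the left frame only when the right frame repeats a join key, so the growth from 5,307 events to 5,343 rows suggests duplicate `(symbol, earnings_date)` pairs in the cached news features. The sketch below uses toy frames (all values are made up) to show the effect and one possible guard: dropping duplicate keys and asking pandas to validate the join.

```python
import pandas as pd

keys = ["symbol", "earnings_date"]

left = pd.DataFrame({
    "symbol": ["A", "B"],
    "earnings_date": pd.to_datetime(["2024-05-29", "2024-06-01"]),
    "target_move": [0.11, 0.02],
})
# The right frame repeats the key for symbol A
right = pd.DataFrame({
    "symbol": ["A", "A", "B"],
    "earnings_date": pd.to_datetime(["2024-05-29", "2024-05-29", "2024-06-01"]),
    "pre_earnings_news_count": [3, 3, 1],
})

# Unchecked left merge: the duplicated key produces an extra row
inflated = left.merge(right, on=keys, how="left")

# validate="one_to_one" makes pandas raise instead of silently duplicating
try:
    left.merge(right, on=keys, how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Guarded version: de-duplicate the right side first, then validate
right_unique = right.drop_duplicates(subset=keys)
clean = left.merge(right_unique, on=keys, how="left", validate="one_to_one")
```

Whether keeping the first duplicate is appropriate depends on why the cache contains repeats, which this sketch does not address.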


## 4. Fundamental Features (Point-in-Time)

Use fundamentals from the news_ranking project, aligned point-in-time via SEC filing dates so that each event only sees data that had already been filed.
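
As a minimal illustration of a point-in-time join, the sketch below uses `pd.merge_asof` on toy frames (the dates and `returnOnEquity` values are made up). With `direction='backward'` each event takes the most recent row whose `available_date` is not later than the event. Passing `allow_exact_matches=False` makes the comparison strict, so a filing published on the earnings date itself is not used.

```python
import pandas as pd

events = pd.DataFrame({
    "symbol": ["A", "A"],
    "earnings_date": pd.to_datetime(["2024-05-29", "2024-08-27"]),
})
funds = pd.DataFrame({
    "symbol": ["A", "A", "A"],
    "available_date": pd.to_datetime(["2024-03-01", "2024-05-29", "2024-06-15"]),
    "returnOnEquity": [0.10, 0.12, 0.15],
})

# Both frames must be sorted by their time key before merge_asof
joined = pd.merge_asof(
    events.sort_values("earnings_date"),
    funds.sort_values("available_date"),
    left_on="earnings_date",
    right_on="available_date",
    by="symbol",
    direction="backward",
    allow_exact_matches=False,  # strict: available_date < earnings_date
)
```

The first event skips the filing dated 2024-05-29 and falls back to the one from 2024-03-01. With the default `allow_exact_matches=True` it would pick up the same-day filing instead.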

In [14]:
# Load fundamentals
metrics = pd.read_parquet(DATA_DIR / 'key_metrics.pqt')
ratios = pd.read_parquet(DATA_DIR / 'ratios.pqt')
growth = pd.read_parquet(DATA_DIR / 'growth.pqt')
filing_dates = pd.read_parquet(DATA_DIR / 'filing_dates.pqt')

print(f"Metrics: {len(metrics):,} rows")
print(f"Ratios: {len(ratios):,} rows")
print(f"Growth: {len(growth):,} rows")
print(f"Filing dates: {len(filing_dates):,} rows")

Metrics: 307,009 rows
Ratios: 307,009 rows
Growth: 307,009 rows
Filing dates: 305,371 rows


In [15]:
# Select key fundamental features
METRIC_COLS = [
    'evToEBITDA',           # Value
    'freeCashFlowYield',    # Value
    'earningsYield',        # Value
    'returnOnEquity',       # Quality
    'returnOnAssets',       # Quality
    'currentRatio',         # Liquidity
]

RATIO_COLS = [
    'priceToEarningsRatio',  # Value (P/E)
    'priceToBookRatio',      # Value
    'priceToSalesRatio',     # Value
    'grossProfitMargin',     # Quality
    'operatingProfitMargin', # Quality
    'netProfitMargin',       # Quality
    'debtToEquityRatio',     # Leverage
]

GROWTH_COLS = [
    'revenueGrowth',         # Growth
    'netIncomeGrowth',       # Growth
    'epsgrowth',             # Growth
]

FUND_COLS = METRIC_COLS + RATIO_COLS + GROWTH_COLS
print(f"Using {len(FUND_COLS)} fundamental features")

Using 16 fundamental features


In [16]:
# Merge fundamentals into single table
metrics_sub = metrics[['symbol', 'date'] + [c for c in METRIC_COLS if c in metrics.columns]].copy()
ratios_sub = ratios[['symbol', 'date'] + [c for c in RATIO_COLS if c in ratios.columns]].copy()
growth_sub = growth[['symbol', 'date'] + [c for c in GROWTH_COLS if c in growth.columns]].copy()

fundamentals = metrics_sub.merge(ratios_sub, on=['symbol', 'date'], how='outer')
fundamentals = fundamentals.merge(growth_sub, on=['symbol', 'date'], how='outer')
fundamentals['period_end'] = pd.to_datetime(fundamentals['date'])

print(f"Combined fundamentals: {len(fundamentals):,} rows")

Combined fundamentals: 307,481 rows


In [17]:
# Add filing dates for point-in-time alignment
filing_dates_clean = filing_dates[['symbol', 'period_end', 'filing_date']].copy()
filing_dates_clean['period_end'] = pd.to_datetime(filing_dates_clean['period_end'])
filing_dates_clean['filing_date'] = pd.to_datetime(filing_dates_clean['filing_date'])

fundamentals = fundamentals.merge(
    filing_dates_clean,
    on=['symbol', 'period_end'],
    how='left'
)

# Use filing_date where available, fallback to period_end + 45 days
FALLBACK_LAG_DAYS = 45
fundamentals['available_date'] = fundamentals['filing_date'].fillna(
    fundamentals['period_end'] + timedelta(days=FALLBACK_LAG_DAYS)
)

# Sort for merge_asof
fundamentals = fundamentals.sort_values(['symbol', 'available_date'])

print(f"Filing date coverage: {fundamentals['filing_date'].notna().mean()*100:.1f}%")

Filing date coverage: 98.9%


In [18]:
def pit_join_fundamentals(earnings_df: pd.DataFrame, fund_df: pd.DataFrame, fund_cols: list) -> pd.DataFrame:
    """
    Point-in-time join: for each earnings event, get most recent fundamentals
    where available_date < earnings_date.
    """
    # Prepare for merge_asof
    earnings_sorted = earnings_df.sort_values('earnings_date').copy()
    fund_sorted = fund_df.sort_values('available_date').copy()
    
    # Filter fund_cols to those that exist
    fund_cols_exist = [c for c in fund_cols if c in fund_sorted.columns]
    
    merged = pd.merge_asof(
        earnings_sorted,
        fund_sorted[['symbol', 'available_date'] + fund_cols_exist],
        left_on='earnings_date',
        right_on='available_date',
        by='symbol',
        direction='backward',
        allow_exact_matches=False  # strict: available_date < earnings_date, as documented
    )
    
    merged['has_fundamentals'] = merged[fund_cols_exist[0]].notna().astype(int)
    merged = merged.drop(columns=['available_date'], errors='ignore')
    
    return merged

In [19]:
# Join fundamentals
print("Joining fundamentals (point-in-time)...")
features_df = pit_join_fundamentals(features_df, fundamentals, FUND_COLS)

print(f"After fundamental join: {features_df.shape}")
print(f"Has fundamentals: {features_df['has_fundamentals'].sum()} ({features_df['has_fundamentals'].mean()*100:.1f}%)")

Joining fundamentals (point-in-time)...
After fundamental join: (5343, 43)
Has fundamentals: 5329 (99.7%)


## 5. Price Context Features

Realized volatility, momentum, and price positioning over the trading days before earnings.
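
The three feature families can be illustrated on a short toy price series (all prices are made up). The sketch assumes daily closes, annualizes with 252 trading days, and treats an N-day return as the last close relative to the close N sessions earlier, which needs N+1 closes.

```python
import numpy as np
import pandas as pd

closes = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 101.5, 104.0])
highs = closes + 1.0  # made-up daily highs for the example

# Daily simple returns; the first value is NaN and is dropped
returns = closes.pct_change().dropna()

# Realized volatility: std of the last 5 daily returns, annualized
rvol_5d = returns.tail(5).std() * np.sqrt(252)

# Momentum: 5-day return spans five daily changes, i.e. six closes
ret_5d = closes.iloc[-1] / closes.iloc[-6] - 1

# Positioning: distance of the last close from the highest high in the window
dist_from_high = closes.iloc[-1] / highs.max() - 1
```

`dist_from_high` is zero or negative by construction, since the last close cannot exceed the window's highest high.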

In [20]:
# Load price data
prices = pd.read_parquet(DATA_DIR / 'prices.pqt')
prices['date'] = pd.to_datetime(prices['date'])
print(f"Prices: {len(prices):,} rows, {prices['symbol'].nunique()} symbols")

Prices: 5,888,410 rows, 5644 symbols


In [21]:
def compute_price_context(prices_df: pd.DataFrame, earnings_df: pd.DataFrame, lookback: int = 20) -> pd.DataFrame:
    """
    Compute price-based features for each earnings event.
    """
    from tqdm.auto import tqdm
    
    # Pre-compute returns
    prices_df = prices_df.sort_values(['symbol', 'date']).copy()
    prices_df['return'] = prices_df.groupby('symbol')['close'].pct_change()
    
    # Group by symbol for faster lookup
    prices_by_symbol = {symbol: grp.set_index('date') for symbol, grp in prices_df.groupby('symbol')}
    
    results = []
    
    for _, row in tqdm(earnings_df.iterrows(), total=len(earnings_df), desc="Computing price context"):
        symbol = row['symbol']
        earn_date = row['earnings_date']
        
        result = {
            'symbol': symbol,
            'earnings_date': earn_date,
        }
        
        if symbol not in prices_by_symbol:
            for col in ['rvol_5d', 'rvol_10d', 'rvol_20d', 'ret_5d', 'ret_10d', 'ret_20d',
                       'dist_from_high_20d', 'dist_from_low_20d', 'gap_frequency', 'volume_ratio']:
                result[col] = np.nan
            results.append(result)
            continue
        
        symbol_prices = prices_by_symbol[symbol]
        
        # Get prices before earnings
        before = symbol_prices[symbol_prices.index < earn_date].tail(lookback + 5)
        
        if len(before) < 5:
            for col in ['rvol_5d', 'rvol_10d', 'rvol_20d', 'ret_5d', 'ret_10d', 'ret_20d',
                       'dist_from_high_20d', 'dist_from_low_20d', 'gap_frequency', 'volume_ratio']:
                result[col] = np.nan
            results.append(result)
            continue
        
        returns = before['return'].dropna()
        
        # Realized volatility (annualized)
        result['rvol_5d'] = returns.tail(5).std() * np.sqrt(252) if len(returns) >= 5 else np.nan
        result['rvol_10d'] = returns.tail(10).std() * np.sqrt(252) if len(returns) >= 10 else np.nan
        result['rvol_20d'] = returns.tail(20).std() * np.sqrt(252) if len(returns) >= 20 else np.nan
        
        # Momentum
        closes = before['close']
        # An N-day return compares the last close with the close N sessions earlier (needs N+1 closes)
        result['ret_5d'] = (closes.iloc[-1] / closes.iloc[-6] - 1) if len(closes) >= 6 else np.nan
        result['ret_10d'] = (closes.iloc[-1] / closes.iloc[-11] - 1) if len(closes) >= 11 else np.nan
        result['ret_20d'] = (closes.iloc[-1] / closes.iloc[-21] - 1) if len(closes) >= 21 else np.nan
        
        # Position relative to recent range
        if len(before) >= 20:
            result['dist_from_high_20d'] = closes.iloc[-1] / before['high'].tail(20).max() - 1
            result['dist_from_low_20d'] = closes.iloc[-1] / before['low'].tail(20).min() - 1
        else:
            result['dist_from_high_20d'] = np.nan
            result['dist_from_low_20d'] = np.nan
        
        # Gap frequency (how often does stock gap > 2%?)
        if len(before) > 1 and 'open' in before.columns:
            gaps = np.abs(before['open'] / before['close'].shift(1) - 1)
            result['gap_frequency'] = (gaps > 0.02).mean()
        else:
            result['gap_frequency'] = np.nan
        
        # Volume ratio (recent vs average)
        if 'volume' in before.columns and len(before) >= 20:
            recent_vol = before['volume'].tail(5).mean()
            avg_vol = before['volume'].mean()
            result['volume_ratio'] = recent_vol / avg_vol if avg_vol > 0 else np.nan
        else:
            result['volume_ratio'] = np.nan
        
        results.append(result)
    
    return pd.DataFrame(results)

In [22]:
# Compute price context features
price_context_file = EARNINGS_DIR / 'price_context_features.parquet'

if price_context_file.exists():
    print("Loading cached price context...")
    price_context = pd.read_parquet(price_context_file)
else:
    print("Computing price context features...")
    price_context = compute_price_context(prices, hist_features)
    price_context.to_parquet(price_context_file, index=False)
    print(f"Saved to {price_context_file}")

print(f"Price context shape: {price_context.shape}")

Loading cached price context...
Price context shape: (5313, 12)


In [23]:
# Merge price context
price_context['earnings_date'] = pd.to_datetime(price_context['earnings_date'])
features_df = features_df.merge(
    price_context,
    on=['symbol', 'earnings_date'],
    how='left'
)
print(f"After price context merge: {features_df.shape}")

After price context merge: (5427, 53)


## 6. Earnings Surprise Features

In [24]:
def fetch_earnings_surprises(symbol: str, limit: int = 20) -> pd.DataFrame:
    """Fetch historical earnings data from FMP (actual vs estimated EPS)."""
    url = f"https://financialmodelingprep.com/stable/earnings?symbol={symbol}&apikey={FMP_KEY}"
    try:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            data = r.json()
            if data:
                df = pd.DataFrame(data)
                # Filter to rows with actual data and limit
                if 'epsActual' in df.columns:
                    df = df[df['epsActual'].notna()].head(limit)
                return df
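    except (requests.RequestException, ValueError):
        # Network failure or a non-JSON body: treat as "no data" for this symbol
        pass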
    return pd.DataFrame()

In [25]:
# Fetch/load earnings surprises
surprise_cache_file = EARNINGS_DIR / 'earnings_surprises_cache.parquet'

if surprise_cache_file.exists():
    all_surprises = pd.read_parquet(surprise_cache_file)
    print(f"Loaded cached surprises: {len(all_surprises)} rows")
else:
    print("Fetching earnings surprises...")
    # Use hist_features (always available) instead of features_df (created later)
    symbols = hist_features['symbol'].unique()
    
    all_surprises = []
    for i, symbol in enumerate(symbols):
        if i > 0 and i % 50 == 0:
            print(f"  Progress: {i}/{len(symbols)}")
        
        surprises = fetch_earnings_surprises(symbol)
        if not surprises.empty:
            surprises['symbol'] = symbol
            all_surprises.append(surprises)
        time.sleep(0.1)
    
    if all_surprises:
        all_surprises = pd.concat(all_surprises, ignore_index=True)
        all_surprises.to_parquet(surprise_cache_file, index=False)
        print(f"Cached surprises: {len(all_surprises)} rows")
    else:
        all_surprises = pd.DataFrame()

Loaded cached surprises: 40191 rows


In [26]:
def compute_surprise_features(features_df: pd.DataFrame, surprises_df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute earnings surprise features for each event.
    Only uses data available BEFORE the event.
    """
    features_df = features_df.copy()
    
    if surprises_df.empty:
        features_df['surprise_pct_mean'] = np.nan
        features_df['surprise_pct_std'] = np.nan
        features_df['beat_rate'] = np.nan
        features_df['surprise_streak'] = np.nan
        return features_df
    
    surprises_df = surprises_df.copy()
    surprises_df['date'] = pd.to_datetime(surprises_df['date'])
    
    # Initialize columns
    features_df['surprise_pct_mean'] = np.nan
    features_df['surprise_pct_std'] = np.nan
    features_df['beat_rate'] = np.nan
    features_df['surprise_streak'] = np.nan
    
    # Group surprises by symbol for faster lookup
    surprises_by_symbol = {sym: grp for sym, grp in surprises_df.groupby('symbol')}
    
    for idx, row in features_df.iterrows():
        symbol = row['symbol']
        earn_date = row['earnings_date']
        
        if symbol not in surprises_by_symbol:
            continue
            
        symbol_surprises = surprises_by_symbol[symbol]
        
        # Get past surprises for this symbol (before current earnings)
        past = symbol_surprises[symbol_surprises['date'] < earn_date].sort_values('date')
        
        # Use epsActual/epsEstimated from FMP /stable/earnings endpoint
        if len(past) >= 1 and 'epsActual' in past.columns and 'epsEstimated' in past.columns:
            # Surprise percentage
            past_valid = past.dropna(subset=['epsActual', 'epsEstimated'])
            if len(past_valid) > 0:
                past_valid = past_valid.copy()
                past_valid['surprise_pct'] = (past_valid['epsActual'] - past_valid['epsEstimated']) / past_valid['epsEstimated'].abs().clip(lower=0.01)
                
                features_df.at[idx, 'surprise_pct_mean'] = past_valid['surprise_pct'].mean()
                features_df.at[idx, 'surprise_pct_std'] = past_valid['surprise_pct'].std() if len(past_valid) > 1 else 0
                
                # Beat rate
                features_df.at[idx, 'beat_rate'] = (past_valid['epsActual'] > past_valid['epsEstimated']).mean()
                
                # Recent streak
                recent = past_valid.tail(4)
                beats = (recent['epsActual'] > recent['epsEstimated']).values
                streak = 0
                if len(beats) > 0:
                    last_val = beats[-1]
                    for b in reversed(beats):
                        if b == last_val:
                            streak += 1
                        else:
                            break
                    if not last_val:
                        streak = -streak
                features_df.at[idx, 'surprise_streak'] = streak
    
    return features_df

In [27]:
# Add surprise features
features_df = compute_surprise_features(features_df, all_surprises)
print(f"Features after surprises: {features_df.shape}")

Features after surprises: (5427, 57)
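
The two less obvious pieces of the surprise features are the floored denominator and the signed streak. The sketch below isolates them on toy numbers. `surprise_pct` and `signed_streak` are hypothetical helper names, not functions from the notebook, and the EPS values are made up.

```python
import pandas as pd

def surprise_pct(actual: pd.Series, estimate: pd.Series, floor: float = 0.01) -> pd.Series:
    """(actual - estimate) / |estimate|, with the denominator floored to avoid blow-ups."""
    denom = estimate.abs().clip(lower=floor)
    return (actual - estimate) / denom

def signed_streak(beats) -> int:
    """Length of the most recent run of identical outcomes: positive for beats, negative for misses."""
    if len(beats) == 0:
        return 0
    last = bool(beats[-1])
    run = 0
    for b in reversed(beats):
        if bool(b) != last:
            break
        run += 1
    return run if last else -run

actual = pd.Series([1.10, 0.95, 0.002])
estimate = pd.Series([1.00, 1.00, 0.001])
pct = surprise_pct(actual, estimate)

streak_miss = signed_streak([True, False, False])        # two misses in a row
streak_beat = signed_streak([False, True, True, True])   # three beats in a row
```

Without the floor, the third row would report a 100% surprise on a tenth-of-a-cent difference. With it, the value is capped at 10%.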


## 7. Timing Features

In [28]:
# Add timing from earnings calendar
def parse_timing(time_str):
    if pd.isna(time_str):
        return 'unknown'
    time_str = str(time_str).lower()
    if 'bmo' in time_str or 'before' in time_str:
        return 'BMO'
    elif 'amc' in time_str or 'after' in time_str:
        return 'AMC'
    return 'unknown'

if 'time' in earnings_cal.columns:
    earnings_cal['timing'] = earnings_cal['time'].apply(parse_timing)
else:
    earnings_cal['timing'] = 'unknown'

timing_df = earnings_cal[['symbol', 'date', 'timing']].rename(columns={'date': 'earnings_date'})

features_df = features_df.merge(timing_df, on=['symbol', 'earnings_date'], how='left')
features_df['timing'] = features_df['timing'].fillna('unknown')

print("Timing distribution:")
print(features_df['timing'].value_counts())

Timing distribution:
timing
unknown    5631
Name: count, dtype: int64


In [29]:
# Add calendar features
features_df['day_of_week'] = features_df['earnings_date'].dt.dayofweek
features_df['month'] = features_df['earnings_date'].dt.month
features_df['quarter'] = features_df['earnings_date'].dt.quarter

# Earnings season flag
def is_earnings_season(month):
    return month in [1, 2, 4, 5, 7, 8, 10, 11]

features_df['is_earnings_season'] = features_df['month'].apply(is_earnings_season).astype(int)

## 8. Final Dataset Assembly

In [30]:
# List all columns
print(f"Total columns: {len(features_df.columns)}")
print("\nFeature columns by category:")

# Historical
hist_cols = ['hist_move_mean', 'hist_move_median', 'hist_move_std', 'hist_move_max',
             'hist_move_min', 'hist_move_cv', 'recent_move_mean', 'move_trend',
             'gap_continuation_ratio', 'n_past_earnings']
print(f"Historical: {len(hist_cols)} cols")

# News PCA features (not full 768-dim embeddings!)
news_pca_cols = [c for c in features_df.columns if c.startswith('news_pca_')]
print(f"News PCA: {len(news_pca_cols)} cols")

# Fundamentals
fund_cols_actual = [c for c in FUND_COLS if c in features_df.columns]
print(f"Fundamentals: {len(fund_cols_actual)} cols")

# Price context
price_cols = ['rvol_5d', 'rvol_10d', 'rvol_20d', 'ret_5d', 'ret_10d', 'ret_20d',
              'dist_from_high_20d', 'dist_from_low_20d', 'gap_frequency', 'volume_ratio']
print(f"Price context: {len(price_cols)} cols")

# Surprise
surprise_cols = ['surprise_pct_mean', 'surprise_pct_std', 'beat_rate', 'surprise_streak']
print(f"Surprise: {len(surprise_cols)} cols")

# Timing
timing_cols = ['day_of_week', 'month', 'quarter', 'is_earnings_season']
print(f"Timing: {len(timing_cols)} cols")

# +2 for pre_earnings_news_count and timing_encoded
total_features = sum(len(cols) for cols in [hist_cols, news_pca_cols, fund_cols_actual,
                                            price_cols, surprise_cols, timing_cols]) + 2
print(f"\nTotal model features: {total_features} (expected ~52)")

Total columns: 62

Feature columns by category:
Historical: 10 cols
News PCA: 10 cols
Fundamentals: 16 cols
Price context: 10 cols
Surprise: 4 cols
Timing: 4 cols

Total model features: 56 (expected ~52)
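
The printed total (56) is higher than the expected ~52 because it counts the four calendar columns, which the `ALL_NUMERIC_FEATURES` list defined later in this notebook leaves out. A small sketch of the arithmetic, using the group sizes printed above. The `in_model` flags are read off that later list, and treating `timing_encoded` as the encoded form of the categorical `timing` column is an assumption.

```python
# Group sizes as printed above; `in_model` marks groups that appear in
# ALL_NUMERIC_FEATURES or CATEGORICAL_FEATURES later in the notebook.
groups = {
    'historical':   {'n': 10, 'in_model': True},
    'news_pca':     {'n': 10, 'in_model': True},
    'fundamentals': {'n': 16, 'in_model': True},
    'price':        {'n': 10, 'in_model': True},
    'surprise':     {'n': 4,  'in_model': True},
    'calendar':     {'n': 4,  'in_model': False},  # day_of_week, month, quarter, is_earnings_season
    'news_count':   {'n': 1,  'in_model': True},   # pre_earnings_news_count
    'timing':       {'n': 1,  'in_model': True},   # categorical; assumed to back timing_encoded
}

total_columns = sum(g['n'] for g in groups.values())
model_features = sum(g['n'] for g in groups.values() if g['in_model'])

print(f"All engineered columns: {total_columns}")
print(f"In model feature lists: {model_features}")
```

Under these assumptions the two numbers differ by exactly the four calendar columns.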


In [31]:
# Feature coverage
print("\nFeature coverage:")
for col in hist_cols + price_cols + surprise_cols + fund_cols_actual[:5]:
    if col in features_df.columns:
        coverage = features_df[col].notna().mean() * 100
        print(f"  {col}: {coverage:.1f}%")


Feature coverage:
  hist_move_mean: 59.6%
  hist_move_median: 59.6%
  hist_move_std: 59.6%
  hist_move_max: 59.6%
  hist_move_min: 59.6%
  hist_move_cv: 59.6%
  recent_move_mean: 59.6%
  move_trend: 59.6%
  gap_continuation_ratio: 59.6%
  n_past_earnings: 100.0%
  rvol_5d: 97.6%
  rvol_10d: 97.5%
  rvol_20d: 97.0%
  ret_5d: 97.6%
  ret_10d: 97.5%
  ret_20d: 97.0%
  dist_from_high_20d: 97.0%
  dist_from_low_20d: 97.0%
  gap_frequency: 97.6%
  volume_ratio: 97.0%
  surprise_pct_mean: 72.6%
  surprise_pct_std: 72.6%
  beat_rate: 72.6%
  surprise_streak: 72.6%
  evToEBITDA: 99.8%
  freeCashFlowYield: 99.8%
  earningsYield: 99.8%
  returnOnEquity: 99.8%
  returnOnAssets: 99.8%


In [32]:
# Filter to usable rows
print(f"\nFiltering dataset...")
print(f"Starting rows: {len(features_df)}")

# Must have target
df_clean = features_df[features_df['target_move'].notna()].copy()
print(f"With target: {len(df_clean)}")

# Must have some history
df_clean = df_clean[df_clean['n_past_earnings'] >= 1]
print(f"With history (n>=1): {len(df_clean)}")

# Remove extreme outliers (|move| of 100% or more)
df_clean = df_clean[df_clean['target_move'] < 1.0]
print(f"After outlier removal: {len(df_clean)}")


Filtering dataset...
Starting rows: 5631
With target: 5631
With history (n>=1): 3358
After outlier removal: 3350


In [33]:
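# Define all feature columns for model
# Note: the calendar columns in timing_cols (day_of_week, month, quarter,
# is_earnings_season) are not part of this list.
ALL_NUMERIC_FEATURES = hist_cols + price_cols + surprise_cols + fund_cols_actual + ['pre_earnings_news_count'] + news_pca_cols
ALL_NUMERIC_FEATURES = [c for c in ALL_NUMERIC_FEATURES if c in df_clean.columns]

# Announcement timing (BMO / AMC / unknown) stays categorical
CATEGORICAL_FEATURES = ['timing']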

print(f"Numeric features: {len(ALL_NUMERIC_FEATURES)}")
print(f"Categorical features: {len(CATEGORICAL_FEATURES)}")
print(f"\nFeature list:")
for i, col in enumerate(ALL_NUMERIC_FEATURES):
    print(f"  {i+1}. {col}")

Numeric features: 51
Categorical features: 1

Feature list:
  1. hist_move_mean
  2. hist_move_median
  3. hist_move_std
  4. hist_move_max
  5. hist_move_min
  6. hist_move_cv
  7. recent_move_mean
  8. move_trend
  9. gap_continuation_ratio
  10. n_past_earnings
  11. rvol_5d
  12. rvol_10d
  13. rvol_20d
  14. ret_5d
  15. ret_10d
  16. ret_20d
  17. dist_from_high_20d
  18. dist_from_low_20d
  19. gap_frequency
  20. volume_ratio
  21. surprise_pct_mean
  22. surprise_pct_std
  23. beat_rate
  24. surprise_streak
  25. evToEBITDA
  26. freeCashFlowYield
  27. earningsYield
  28. returnOnEquity
  29. returnOnAssets
  30. currentRatio
  31. priceToEarningsRatio
  32. priceToBookRatio
  33. priceToSalesRatio
  34. grossProfitMargin
  35. operatingProfitMargin
  36. netProfitMargin
  37. debtToEquityRatio
  38. revenueGrowth
  39. netIncomeGrowth
  40. epsgrowth
  41. pre_earnings_news_count
  42. news_pca_0
  43. news_pca_1
  44. news_pca_2
  45. news_pca_3
  46. news_pca_4
  47. news

In [34]:
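# Fill missing values: numeric columns get the column median
# (computed over all rows of df_clean)
for col in ALL_NUMERIC_FEATURES:
    if col in df_clean.columns:
        median_val = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(median_val)

# Categorical columns get an explicit 'unknown' level
for col in CATEGORICAL_FEATURES:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].fillna('unknown')

# News PCA columns already default to 0, so the median fill leaves them unchanged
print("Missing values filled.")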

Missing values filled.
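
The medians above are taken over every row in `df_clean`, so later events influence the fill values of earlier ones. If the training notebook splits by date, one option is to compute the medians on the training window only and apply them everywhere. A minimal sketch on toy data; the cutoff date and the helper `impute_with_train_medians` are assumptions for illustration, not part of this pipeline.

```python
import numpy as np
import pandas as pd

def impute_with_train_medians(df, feature_cols, date_col, cutoff):
    """Fill NaNs using medians computed only from rows dated before `cutoff`."""
    train_mask = df[date_col] < cutoff
    medians = df.loc[train_mask, feature_cols].median()
    out = df.copy()
    # DataFrame.fillna with a Series fills each column by its matching label
    out[feature_cols] = out[feature_cols].fillna(medians)
    return out, medians

# Toy data: three "training" events and one later event
toy = pd.DataFrame({
    'earnings_date': pd.to_datetime(['2022-01-15', '2022-04-15', '2022-07-15', '2023-01-15']),
    'rvol_10d': [0.20, np.nan, 0.40, np.nan],
    'beat_rate': [0.50, 0.70, np.nan, 0.90],
})

filled, medians = impute_with_train_medians(
    toy, ['rvol_10d', 'beat_rate'], 'earnings_date', pd.Timestamp('2023-01-01')
)
print(medians)
print(filled)
```

Returning `medians` alongside the filled frame lets the same values be stored and reused at inference time.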


In [35]:
# Save final dataset
output_file = EARNINGS_DIR / 'ml_features.parquet'
df_clean.to_parquet(output_file, index=False)

print(f"\nSaved ML features: {df_clean.shape}")
print(f"File: {output_file}")
print(f"File size: {output_file.stat().st_size / 1e6:.1f} MB")


Saved ML features: (3350, 62)
File: ../data/earnings/ml_features.parquet
File size: 1.2 MB


## 9. Feature Correlation with Target

In [36]:
# Pearson correlation of each feature with the target
correlations = {
    col: df_clean[col].corr(df_clean['target_move'])
    for col in ALL_NUMERIC_FEATURES
    if col in df_clean.columns
}

# Sort by absolute correlation, strongest first
corr_df = pd.DataFrame({
    'feature': list(correlations.keys()),
    'correlation': list(correlations.values())
}).sort_values('correlation', key=abs, ascending=False)

print("Top feature correlations with target |move|:")
print(corr_df.head(25).to_string(index=False))

Top feature correlations with target |move|:
               feature  correlation
              rvol_10d     0.225289
              rvol_20d     0.208986
               rvol_5d     0.204769
     dist_from_low_20d     0.198595
      priceToBookRatio    -0.167643
         hist_move_min     0.160385
      hist_move_median     0.131637
      recent_move_mean     0.125316
            news_pca_4     0.123456
            news_pca_8     0.122636
        hist_move_mean     0.120830
                ret_5d     0.119312
               ret_10d     0.117211
    dist_from_high_20d    -0.108244
         gap_frequency     0.104386
            news_pca_2     0.101398
            news_pca_7    -0.088076
     debtToEquityRatio    -0.084581
            news_pca_1     0.077258
         hist_move_max     0.075607
            news_pca_5    -0.073776
               ret_20d     0.069772
          volume_ratio     0.069288
gap_continuation_ratio     0.067769
            news_pca_6    -0.065369


In [37]:
# News PCA correlations
news_pca_corrs = {}
for col in news_pca_cols:
    if col in df_clean.columns:
        corr = df_clean[col].corr(df_clean['target_move'])
        news_pca_corrs[col] = corr

news_corr_df = pd.DataFrame({
    'feature': news_pca_corrs.keys(),
    'correlation': news_pca_corrs.values()
}).sort_values('correlation', key=abs, ascending=False)

print("\nNews PCA feature correlations:")
print(news_corr_df.to_string(index=False))
print(f"\nMean |correlation| for news PCA: {np.abs(list(news_pca_corrs.values())).mean():.4f}")


News PCA feature correlations:
   feature  correlation
news_pca_4     0.123456
news_pca_8     0.122636
news_pca_2     0.101398
news_pca_7    -0.088076
news_pca_1     0.077258
news_pca_5    -0.073776
news_pca_6    -0.065369
news_pca_0    -0.050050
news_pca_9    -0.034627
news_pca_3    -0.001125

Mean |correlation| for news PCA: 0.0738
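
`Series.corr` defaults to Pearson, which a handful of very large moves can dominate. A rank-based (Spearman) correlation is a common robustness check. The sketch below computes it by ranking both series and taking the Pearson correlation of the ranks. The column names mirror the notebook, but the data is synthetic and `rank_correlations` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def rank_correlations(df, feature_cols, target_col):
    """Pearson and Spearman correlation of each feature with the target."""
    target_rank = df[target_col].rank()
    rows = []
    for col in feature_cols:
        rows.append({
            'feature': col,
            'pearson': df[col].corr(df[target_col]),
            # Spearman = Pearson correlation of the ranks
            'spearman': df[col].rank().corr(target_rank),
        })
    return pd.DataFrame(rows).sort_values('spearman', key=abs, ascending=False)

# Synthetic data: the target scales with rvol_10d; news_pca_3 is pure noise
rng = np.random.default_rng(0)
n = 500
rvol = rng.lognormal(mean=-1.5, sigma=0.5, size=n)
toy = pd.DataFrame({
    'rvol_10d': rvol,
    'news_pca_3': rng.normal(size=n),
    'target_move': rvol * rng.lognormal(mean=-1.0, sigma=0.8, size=n),
})

result = rank_correlations(toy, ['rvol_10d', 'news_pca_3'], 'target_move')
print(result.to_string(index=False))
```

A feature whose Pearson and Spearman values disagree sharply is worth a closer look before it is trusted as a signal.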


## Summary

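Features engineered (56 columns in six groups, counting `pre_earnings_news_count`; the model feature lists hold 51 numeric columns plus the categorical `timing`):
1. **Historical earnings** (10 features) - past moves, consistency, trends
2. **Pre-earnings news PCA** (10 features) - PCA-reduced from 768-dim embeddings
3. **Fundamentals** (16 features) - key metrics, ratios, growth (point-in-time)
4. **Price context** (10 features) - realized vol, momentum, positioning
5. **Earnings surprises** (4 features) - beat/miss history, streaks
6. **Timing** (4+1 features) - day of week, month, quarter, earnings season flag, timing_encoded

Two caveats from this run: all 5,631 events have `timing = unknown`, so the BMO/AMC feature is constant, and the four calendar columns are not included in `ALL_NUMERIC_FEATURES`.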

**Key outputs:**
- `ml_features.parquet` - training dataset
- `news_pca.joblib` - PCA model for live inference

Ready for model training in `1.1 model_training.ipynb`.