# Model Stress Test: Systematic Failure Analysis

**Role**: Senior Quantitative Researcher (Model Validation & Stress Testing)

## Objective

This notebook **does not improve the model**. It systematically **breaks** it to understand:
- Where does it fail?
- Why does it fail?
- Is the signal real or statistical coincidence?

## Hard Constraints

| ❌ FORBIDDEN | ✅ ALLOWED |
|-------------|------------|
| Retraining | Diagnostics only |
| Feature modification | Slicing existing predictions |
| Backtests | IC/correlation analysis |
| Strategy logic | Statistical tests |
| Sharpe optimization | Failure interpretation |

## Model Breakers Implemented

1. **Time-Based Stress**: Does the model rely on a specific regime?
2. **Feature Ablation**: Is it dependent on a narrow feature subset?
3. **Permutation Tests**: Is performance real or luck?
4. **Noise Injection**: Is it robust to small perturbations?
5. **Horizon Mismatch**: Is the signal horizon-specific?
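
Every breaker in this notebook is scored with the information coefficient (IC), the correlation between predictions and realized targets. The following is a minimal sketch on a synthetic panel of how a per-date cross-sectional IC and its t-statistic can be computed. The `daily_ic` helper, its `min_names` and `use_ranks` parameters, and the toy data are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
import pandas as pd


def daily_ic(df: pd.DataFrame, min_names: int = 5, use_ranks: bool = False) -> pd.Series:
    """Per-date correlation between the 'prediction' and 'target' columns.

    With use_ranks=True both columns are ranked first, which gives a
    Spearman-style rank IC that is less sensitive to outliers.
    """
    out = {}
    for date, grp in df.groupby("date"):
        if len(grp) <= min_names:
            continue  # too few names for a meaningful cross-section
        pred, targ = grp["prediction"], grp["target"]
        if use_ranks:
            pred, targ = pred.rank(), targ.rank()
        out[date] = pred.corr(targ)
    return pd.Series(out, dtype=float).dropna()


# Toy panel: 50 dates x 30 tickers with a planted signal (hypothetical data).
rng = np.random.default_rng(0)
n_dates, n_tickers = 50, 30
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n_dates, freq="B").repeat(n_tickers),
    "ticker": np.tile([f"T{i}" for i in range(n_tickers)], n_dates),
    "target": rng.standard_normal(n_dates * n_tickers),
})
toy["prediction"] = 0.5 * toy["target"] + rng.standard_normal(len(toy))

ic = daily_ic(toy)
rank_ic = daily_ic(toy, use_ranks=True)
ic_t_stat = ic.mean() / (ic.std() / np.sqrt(len(ic)))
print(f"mean IC = {ic.mean():.3f}, mean rank IC = {rank_ic.mean():.3f}, "
      f"t-stat = {ic_t_stat:.2f}, days = {len(ic)}")
```

The mean of the daily series and the pooled correlation over all rows can differ, because the pooled number also picks up date-level effects. It helps to state which of the two a given verdict relies on.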

---

In [None]:
# =============================================================================
# CELL 1: SETUP & LOAD MODEL ARTIFACTS
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib
import json
import warnings
warnings.filterwarnings('ignore')

# Paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / 'data' / 'processed'
TARGET_DIR = DATA_DIR / 'targets'
MODEL_DIR = PROJECT_ROOT / 'outputs' / 'models'
OUTPUT_DIR = PROJECT_ROOT / 'outputs' / 'stress_test'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create 'outputs/' if it is missing

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("="*70)
print("🔬 MODEL STRESS TEST: FAILURE ANALYSIS")
print("="*70)
print(f"\n📁 Project root: {PROJECT_ROOT}")
print(f"📁 Output dir: {OUTPUT_DIR}")

In [None]:
# =============================================================================
# CELL 2: LOAD DATA & MODEL
# =============================================================================

print("\n📂 Loading model artifacts...")

# Load features (long format)
features_df = pd.read_parquet(DATA_DIR / 'features_aligned_is.parquet')
print(f"   Features: {features_df.shape}")

# Load target (wide format -> convert to long)
target_wide = pd.read_parquet(TARGET_DIR / 'primary_target_is.parquet')
target_long = target_wide.stack().reset_index()
target_long.columns = ['date', 'ticker', 'target']
print(f"   Target: {target_long.shape}")

# Load feature names
with open(DATA_DIR / 'feature_names.txt', 'r') as f:
    # Skip blank lines so an empty trailing line does not become a feature name
    feature_names = [line.strip() for line in f if line.strip()]
print(f"   Feature names: {len(feature_names)}")

# Load trained model
model_path = MODEL_DIR / 'final_lgb_model.joblib'
if model_path.exists():
    model = joblib.load(model_path)
    print(f"   ✅ Loaded model: {model_path.name}")
else:
    print(f"   ⚠️ No model found at {model_path}")
    print(f"   Will generate synthetic predictions for demonstration")
    model = None

# Load target metadata
with open(TARGET_DIR / 'target_metadata.json', 'r') as f:
    target_config = json.load(f)
print(f"   Target horizon: {target_config['horizon']} days")

In [None]:
# =============================================================================
# CELL 3: GENERATE PREDICTIONS (if not already saved)
# =============================================================================

print("\n🔮 Generating predictions...")

# Merge features with target
panel = features_df.merge(target_long, on=['date', 'ticker'], how='inner')
print(f"   Merged panel: {panel.shape}")

# Drop rows with missing target
panel = panel.dropna(subset=['target'])
print(f"   After dropping NaN targets: {panel.shape}")

# Generate predictions
X = panel[feature_names].values
y = panel['target'].values

if model is not None:
    predictions = model.predict(X)
else:
    # Generate weak synthetic signal for demonstration
    # This mimics a model with IC ~ 0.03 when the target has roughly unit variance
    noise = np.random.randn(len(y))
    predictions = 0.03 * y + 0.97 * noise
    print("   ⚠️ Using synthetic predictions (model not found)")

panel['prediction'] = predictions

# Compute baseline IC
baseline_ic = np.corrcoef(predictions, y)[0, 1]
print(f"\n📊 Baseline IC: {baseline_ic:.4f}")
print(f"   (This is what we're trying to break)")

---

## Model Breaker 1: Time-Based Stress (Regime Dependence)

### Hypothesis Being Tested

> **Does the model rely on a specific market regime?**

If the model was trained predominantly on bull markets, it may fail in bear markets.
If it learned volatility patterns from 2020, it may not generalize.

### Expected Failure If Model Is Fragile

- IC flips sign in certain regimes
- Prediction bias changes across regimes
- Performance clusters in specific time periods
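
The regime check can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's implementation: the `market_ret` and `daily_ic` series are randomly generated stand-ins, and the regime label here uses trailing (backward-looking) volatility, which is one possible choice among several.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=400, freq="B")

# Hypothetical inputs: a daily market-return proxy and a daily IC series.
market_ret = pd.Series(rng.standard_normal(400) * 0.01, index=dates)
daily_ic = pd.Series(rng.standard_normal(400) * 0.1 + 0.03, index=dates)

# Trailing 21-day volatility, split at its median into two regimes.
trailing_vol = market_ret.rolling(21).std()
regime = pd.Series(
    np.where(trailing_vol > trailing_vol.median(), "high_vol", "low_vol"),
    index=dates,
)[trailing_vol.notna()]  # drop the warm-up window with undefined volatility

# Per-regime IC statistics.
summary = daily_ic.loc[regime.index].groupby(regime).agg(["count", "mean", "std"])
summary["t_stat"] = summary["mean"] / (summary["std"] / np.sqrt(summary["count"]))

# A sign flip means the regime means do not all share the same sign.
sign_flip = np.sign(summary["mean"]).nunique() > 1

print(summary.round(4))
print("sign flip across regimes:", bool(sign_flip))
```

Comparing the sign of each regime's mean IC against the others, rather than against zero alone, separates "the model is regime-dependent" from "the model is uniformly bad".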

In [None]:
# =============================================================================
# BREAKER 1: TIME-BASED STRESS - REGIME SLICING
# =============================================================================

print("="*70)
print("🔨 BREAKER 1: TIME-BASED STRESS (Regime Dependence)")
print("="*70)

# Step 1: Compute daily cross-sectional IC
def compute_daily_ic(df):
    """Compute IC for each date."""
    ic_by_date = df.groupby('date').apply(
        lambda x: x['prediction'].corr(x['target']) if len(x) > 5 else np.nan
    )
    return ic_by_date

daily_ic = compute_daily_ic(panel)
daily_ic = daily_ic.dropna()

print(f"\n📊 Daily IC Statistics:")
print(f"   Mean IC: {daily_ic.mean():.4f}")
print(f"   Std IC: {daily_ic.std():.4f}")
print(f"   IC > 0: {(daily_ic > 0).mean()*100:.1f}%")

# Step 2: Define regimes
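# Use rolling volatility of the cross-sectional mean target as regime indicator.
# NOTE: the target is a forward return, so these labels use forward-looking data;
# they are valid for diagnostics only, not as a tradable regime signal.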
daily_mean_target = panel.groupby('date')['target'].mean()
rolling_vol = daily_mean_target.rolling(21).std()

# High vol vs low vol regimes (median split)
vol_median = rolling_vol.median()
high_vol_dates = rolling_vol[rolling_vol > vol_median].index
low_vol_dates = rolling_vol[rolling_vol <= vol_median].index

# Bull vs bear regimes (based on cumulative return)
cumret = daily_mean_target.cumsum()
rolling_trend = cumret.diff(63)  # 3-month trend
bull_dates = rolling_trend[rolling_trend > 0].index
bear_dates = rolling_trend[rolling_trend <= 0].index

print(f"\n📅 Regime Splits:")
print(f"   High Vol dates: {len(high_vol_dates)}")
print(f"   Low Vol dates: {len(low_vol_dates)}")
print(f"   Bull dates: {len(bull_dates)}")
print(f"   Bear dates: {len(bear_dates)}")

In [None]:
# =============================================================================
# BREAKER 1 (cont): REGIME IC ANALYSIS
# =============================================================================

# Compute IC per regime
regimes = {
    'High Volatility': high_vol_dates,
    'Low Volatility': low_vol_dates,
    'Bull Market': bull_dates,
    'Bear Market': bear_dates
}

regime_stats = []
for regime_name, regime_dates in regimes.items():
    regime_ic = daily_ic[daily_ic.index.isin(regime_dates)]
    if len(regime_ic) > 10:
        regime_stats.append({
            'Regime': regime_name,
            'N_Days': len(regime_ic),
            'Mean_IC': regime_ic.mean(),
            'Std_IC': regime_ic.std(),
            'IC_t_stat': regime_ic.mean() / (regime_ic.std() / np.sqrt(len(regime_ic))),
            'IC_positive_pct': (regime_ic > 0).mean() * 100
        })

regime_df = pd.DataFrame(regime_stats)
print("\n📊 IC by Regime:")
print(regime_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# IC time series with regime shading
ax = axes[0]
ax.plot(daily_ic.index, daily_ic.values, alpha=0.5, linewidth=0.5, color='gray')
rolling_ic = daily_ic.rolling(21).mean()
ax.plot(rolling_ic.index, rolling_ic.values, color='blue', linewidth=1.5, label='21d Rolling IC')
ax.axhline(0, color='red', linestyle='--', linewidth=1)
ax.axhline(baseline_ic, color='green', linestyle=':', label=f'Baseline IC: {baseline_ic:.4f}')
ax.set_title('Daily IC Over Time', fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('IC')
ax.legend()
ax.grid(True, alpha=0.3)

# IC boxplot by regime
ax = axes[1]
regime_ic_data = []
regime_labels = []
for regime_name, regime_dates in regimes.items():
    regime_ic = daily_ic[daily_ic.index.isin(regime_dates)].dropna()
    regime_ic_data.append(regime_ic.values)
    regime_labels.append(regime_name.replace(' ', '\n'))

bp = ax.boxplot(regime_ic_data, labels=regime_labels, patch_artist=True)
colors = ['coral', 'lightblue', 'lightgreen', 'salmon']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
ax.axhline(0, color='red', linestyle='--', linewidth=1)
ax.set_title('IC Distribution by Regime', fontweight='bold')
ax.set_ylabel('IC')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'breaker1_regime_ic.png', dpi=150, bbox_inches='tight')
plt.show()

# Interpretation
print("\n" + "="*70)
print("📋 BREAKER 1 INTERPRETATION")
print("="*70)

ic_range = regime_df['Mean_IC'].max() - regime_df['Mean_IC'].min()
sign_flip = (regime_df['Mean_IC'] < 0).any()

if sign_flip:
    verdict = "❌ FAIL"
    interpretation = "IC flips sign across regimes: model is regime-dependent"
elif ic_range > 0.02:
    verdict = "⚠️ WEAK"
    interpretation = f"IC varies by {ic_range:.4f} across regimes: moderate regime sensitivity"
else:
    verdict = "✅ PASS"
    interpretation = "IC is stable across regimes"

print(f"\n   Verdict: {verdict}")
print(f"   {interpretation}")

---

## Model Breaker 2: Feature Ablation (Structural Dependency)

### Hypothesis Being Tested

> **Is the model dependent on a narrow subset of features?**

A robust model should degrade gracefully when feature families are removed.
An overfit model will collapse when its "crutch" features are ablated.

### Expected Failure If Model Is Fragile

- Removing one feature family destroys all signal
- Model relies entirely on momentum OR volatility
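
A minimal sketch of ablation without retraining is shown here on a toy problem. Everything in it is an assumption for illustration: `ToyModel` is a stand-in with fixed linear weights, the feature names are made up, and the signal is far stronger than a realistic IC so the effect is visible. It also shows mean-filling as an alternative to zero-filling, since zero can be an out-of-distribution value for features whose typical level is not zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 20000
X = pd.DataFrame({
    "mom_21": rng.standard_normal(n),
    "mom_63": rng.standard_normal(n),
    "vol_21": rng.standard_normal(n) + 2.0,  # non-zero mean on purpose
})
y = (0.2 * X["mom_21"] + 0.2 * X["mom_63"]
     + 0.05 * (X["vol_21"] - 2.0) + rng.standard_normal(n)).to_numpy()


class ToyModel:
    """Stand-in for a fitted model: fixed linear weights, never refit."""
    weights = np.array([0.2, 0.2, 0.05])

    def predict(self, values: np.ndarray) -> np.ndarray:
        return values @ self.weights


def ablated_ic(model, X: pd.DataFrame, y: np.ndarray, ablate_cols, fill: str = "mean") -> float:
    """Neutralize the given columns, re-predict with the same model, return IC."""
    X_abl = X.copy()
    for col in ablate_cols:
        X_abl[col] = X[col].mean() if fill == "mean" else 0.0
    preds = model.predict(X_abl.to_numpy())
    return float(np.corrcoef(preds, y)[0, 1])


model = ToyModel()
baseline = float(np.corrcoef(model.predict(X.to_numpy()), y)[0, 1])
families = {"momentum": ["mom_21", "mom_63"], "volatility": ["vol_21"]}

results = {name: ablated_ic(model, X, y, cols) for name, cols in families.items()}
for name, ic in results.items():
    print(f"{name:>10}: IC {ic:.4f} ({ic / baseline:.0%} of baseline {baseline:.4f})")
```

In this toy setup, removing the momentum family costs most of the IC while removing volatility costs very little, which is the pattern the breaker is designed to surface.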

In [None]:
# =============================================================================
# BREAKER 2: FEATURE ABLATION - STRUCTURAL DEPENDENCY
# =============================================================================

print("="*70)
print("🔨 BREAKER 2: FEATURE ABLATION (Structural Dependency)")
print("="*70)

# Define feature families
feature_families = {
    'momentum': [f for f in feature_names if 'mom' in f.lower()],
    'volatility': [f for f in feature_names if 'vol' in f.lower()],
    'kalman': [f for f in feature_names if 'kalman' in f.lower()],
    'regime': [f for f in feature_names if 'regime' in f.lower()],
    'cross_sectional': [f for f in feature_names if 'cs_' in f.lower()],
    'technical': [f for f in feature_names if any(x in f.lower() for x in ['ma_', 'bb_', 'rsi'])]
}

print("\n📊 Feature Families:")
for family, features in feature_families.items():
    print(f"   {family}: {len(features)} features")
    if features:
        print(f"      Examples: {features[:3]}")

In [None]:
# =============================================================================
# BREAKER 2 (cont): ABLATION ANALYSIS
# =============================================================================

# Function to compute IC with ablated features
def compute_ablated_ic(panel, feature_names, ablate_features, model):
    """
    Zero out ablated features and recompute predictions.
    NOTE: We don't retrain; we mask inputs to the existing model.
    """
    X = panel[feature_names].copy()
    X[ablate_features] = 0  # Zero out ablated features
    
    if model is not None:
        preds = model.predict(X.values)
    else:
        # For synthetic model, simulate ablation effect
        # Assume each family contributes ~equally to signal
        ablate_frac = len(ablate_features) / len(feature_names)
        signal_remaining = 1 - ablate_frac * 0.8  # 80% signal from features
        noise = np.random.randn(len(panel))
        # Use the same 0.97 noise weight as the synthetic baseline so ICs are comparable
        preds = signal_remaining * 0.03 * panel['target'].values + 0.97 * noise
    
    ic = np.corrcoef(preds, panel['target'].values)[0, 1]
    return ic

# Run ablation for each family
ablation_results = [{'Family': 'None (Baseline)', 'IC': baseline_ic, 'IC_Delta': 0, 'Pct_Retained': 100}]

for family, features in feature_families.items():
    if features:  # Only ablate if family has features
        ablated_ic = compute_ablated_ic(panel, feature_names, features, model)
        ic_delta = ablated_ic - baseline_ic
        pct_retained = (ablated_ic / baseline_ic) * 100 if baseline_ic != 0 else 0
        
        ablation_results.append({
            'Family': family,
            'IC': ablated_ic,
            'IC_Delta': ic_delta,
            'Pct_Retained': pct_retained
        })

ablation_df = pd.DataFrame(ablation_results)
print("\n📊 Ablation Results:")
print(ablation_df.to_string(index=False))

In [None]:
# =============================================================================
# BREAKER 2 (cont): VISUALIZATION & INTERPRETATION
# =============================================================================

fig, ax = plt.subplots(figsize=(10, 6))

families = ablation_df['Family'].tolist()
ics = ablation_df['IC'].tolist()
colors = ['green' if f == 'None (Baseline)' else 'steelblue' for f in families]

bars = ax.barh(families, ics, color=colors, edgecolor='black')
ax.axvline(baseline_ic, color='green', linestyle='--', linewidth=2, label=f'Baseline: {baseline_ic:.4f}')
ax.axvline(0, color='red', linestyle='-', linewidth=1)

# Add value labels
for bar, ic in zip(bars, ics):
    ax.annotate(f'{ic:.4f}', xy=(ic, bar.get_y() + bar.get_height()/2),
               ha='left' if ic >= 0 else 'right', va='center', fontsize=10)

ax.set_xlabel('IC After Ablation')
ax.set_title('Feature Family Ablation: IC Impact\n(Lower = More Dependent)', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'breaker2_feature_ablation.png', dpi=150, bbox_inches='tight')
plt.show()

# Interpretation
print("\n" + "="*70)
print("📋 BREAKER 2 INTERPRETATION")
print("="*70)

# Find most critical family
ablation_no_baseline = ablation_df[ablation_df['Family'] != 'None (Baseline)']
most_critical = ablation_no_baseline.loc[ablation_no_baseline['IC'].idxmin()]
max_drop = baseline_ic - most_critical['IC']

if most_critical['IC'] <= 0:
    verdict = "❌ FAIL"
    interpretation = f"Removing {most_critical['Family']} destroys signal (IC goes to {most_critical['IC']:.4f})"
elif most_critical['Pct_Retained'] < 50:
    verdict = "⚠️ WEAK"
    interpretation = f"Model heavily reliant on {most_critical['Family']} (retains only {most_critical['Pct_Retained']:.1f}% IC)"
else:
    verdict = "✅ PASS"
    interpretation = "No single feature family dominates: model has diverse signal sources"

print(f"\n   Most critical family: {most_critical['Family']}")
print(f"   Verdict: {verdict}")
print(f"   {interpretation}")

---

## Model Breaker 3: Permutation Tests (Luck Detection)

### Hypothesis Being Tested

> **Does the model exploit real structure or random correlations?**

If performance is indistinguishable from shuffled data, the model learned noise.

### Expected Failure If Model Is Fragile

- Real IC lies inside the null distribution
- p-value > 0.05 (cannot reject null hypothesis of no signal)
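
As a minimal sketch of the test logic, the snippet below builds a null distribution by shuffling predictions across assets within each date and converts it to a two-sided p-value. The toy panel and the `permutation_p_value` helper are illustrative assumptions. The helper uses the common `(1 + exceedances) / (1 + N)` form, so a finite number of permutations never reports a p-value of exactly zero.

```python
import numpy as np
import pandas as pd


def permutation_p_value(observed: float, null: np.ndarray) -> float:
    """Two-sided permutation p-value with the +1 correction."""
    exceed = int(np.sum(np.abs(null) >= abs(observed)))
    return (1 + exceed) / (1 + len(null))


# Toy panel sorted by date, with a planted signal (hypothetical data).
rng = np.random.default_rng(3)
n_dates, n_tickers = 60, 25
df = pd.DataFrame({
    "date": np.repeat(np.arange(n_dates), n_tickers),
    "target": rng.standard_normal(n_dates * n_tickers),
})
df["prediction"] = 0.2 * df["target"] + rng.standard_normal(len(df))

observed_ic = np.corrcoef(df["prediction"], df["target"])[0, 1]

# Null: permute predictions within each date, leaving targets in place.
n_perm = 200
target = df["target"].to_numpy()  # rows are already ordered by date
pred_by_date = [g.to_numpy() for _, g in df.groupby("date")["prediction"]]
null_ics = np.empty(n_perm)
for i in range(n_perm):
    shuffled = np.concatenate([rng.permutation(p) for p in pred_by_date])
    null_ics[i] = np.corrcoef(shuffled, target)[0, 1]

p = permutation_p_value(observed_ic, null_ics)
print(f"observed IC = {observed_ic:.4f}, null std = {null_ics.std():.4f}, p = {p:.4f}")
```

With 200 permutations the smallest reportable p-value is 1/201, so more permutations are needed if a stricter threshold than about 0.005 matters.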

In [None]:
# =============================================================================
# BREAKER 3: PERMUTATION TESTS - LUCK DETECTION
# =============================================================================

print("="*70)
print("🔨 BREAKER 3: PERMUTATION TESTS (Luck Detection)")
print("="*70)

N_PERMUTATIONS = 200

# Test 1: Label Permutation (shuffle targets globally)
print("\n🎲 Test 1: Label Permutation (shuffle targets)...")
label_perm_ics = []
for i in range(N_PERMUTATIONS):
    shuffled_target = np.random.permutation(panel['target'].values)
    ic = np.corrcoef(panel['prediction'].values, shuffled_target)[0, 1]
    label_perm_ics.append(ic)
label_perm_ics = np.array(label_perm_ics)

# Test 2: Time Permutation (shuffle predictions over time)
print("🎲 Test 2: Time Permutation (shuffle predictions over time)...")
time_perm_ics = []
# Pivot to wide (date x ticker) so that each date's predictions move as a block
# and stay aligned by ticker. Assumes one row per (date, ticker) pair.
pred_by_date = panel.pivot(index='date', columns='ticker', values='prediction').values
target_by_date = panel.pivot(index='date', columns='ticker', values='target').values
for i in range(N_PERMUTATIONS):
    # Shuffle which date's predictions go where
    shuffled_preds = pred_by_date[np.random.permutation(len(pred_by_date))]
    # Keep only cells where both the shuffled prediction and the target exist
    valid = ~np.isnan(shuffled_preds) & ~np.isnan(target_by_date)
    ic = np.corrcoef(shuffled_preds[valid], target_by_date[valid])[0, 1]
    time_perm_ics.append(ic)
time_perm_ics = np.array(time_perm_ics)

# Test 3: Cross-sectional Permutation (shuffle predictions across assets at fixed t)
print("🎲 Test 3: Cross-sectional Permutation (shuffle across assets)...")
cs_perm_ics = []
for i in range(N_PERMUTATIONS):
    panel_shuffled = panel.copy()
    # Shuffle predictions within each date
    panel_shuffled['prediction'] = panel_shuffled.groupby('date')['prediction'].transform(
        lambda x: np.random.permutation(x.values)
    )
    ic = np.corrcoef(panel_shuffled['prediction'].values, panel_shuffled['target'].values)[0, 1]
    cs_perm_ics.append(ic)
cs_perm_ics = np.array(cs_perm_ics)

print("   Done.")

In [None]:
# =============================================================================
# BREAKER 3 (cont): PERMUTATION VISUALIZATION
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

permutation_tests = [
    ('Label Permutation', label_perm_ics),
    ('Time Permutation', time_perm_ics),
    ('Cross-Sectional Permutation', cs_perm_ics)
]

perm_results = []

for ax, (test_name, null_dist) in zip(axes, permutation_tests):
    # Plot null distribution
    ax.hist(null_dist, bins=30, alpha=0.7, color='gray', edgecolor='black', label='Null Distribution')
    
    # Plot real IC
    ax.axvline(baseline_ic, color='red', linewidth=2, linestyle='-', label=f'Real IC: {baseline_ic:.4f}')
    
    # Compute p-value (two-tailed), with the +1 correction so p is never exactly 0
    p_value = (1 + np.sum(np.abs(null_dist) >= np.abs(baseline_ic))) / (1 + len(null_dist))
    
    # Central 95% interval of the null distribution
    ci_low, ci_high = np.percentile(null_dist, [2.5, 97.5])
    ax.axvline(ci_low, color='orange', linestyle='--', label='95% null interval')
    ax.axvline(ci_high, color='orange', linestyle='--')
    
    ax.set_title(f'{test_name}\np-value: {p_value:.4f}', fontweight='bold')
    ax.set_xlabel('IC')
    ax.set_ylabel('Frequency')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)
    
    perm_results.append({
        'Test': test_name,
        'Real_IC': baseline_ic,
        'Null_Mean': null_dist.mean(),
        'Null_Std': null_dist.std(),
        'p_value': p_value,
        'Significant': p_value < 0.05
    })

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'breaker3_permutation_tests.png', dpi=150, bbox_inches='tight')
plt.show()

perm_df = pd.DataFrame(perm_results)
print("\n📊 Permutation Test Results:")
print(perm_df.to_string(index=False))

In [None]:
# =============================================================================
# BREAKER 3 (cont): INTERPRETATION
# =============================================================================

print("\n" + "="*70)
print("📋 BREAKER 3 INTERPRETATION")
print("="*70)

all_significant = perm_df['Significant'].all()
any_significant = perm_df['Significant'].any()
min_pvalue = perm_df['p_value'].min()

if all_significant:
    verdict = "✅ PASS"
    interpretation = f"All permutation tests significant (min p={min_pvalue:.4f}): signal is unlikely to be chance"
elif any_significant:
    failed_tests = perm_df[~perm_df['Significant']]['Test'].tolist()
    verdict = "⚠️ WEAK"
    interpretation = f"Failed: {', '.join(failed_tests)}: partial signal, possible overfitting"
else:
    verdict = "❌ FAIL"
    interpretation = "No test significant: performance is indistinguishable from luck"

print(f"\n   Verdict: {verdict}")
print(f"   {interpretation}")

---

## Model Breaker 4: Noise Injection (Robustness)

### Hypothesis Being Tested

> **Is the model sensitive to small perturbations in inputs?**

A robust model should be stable under small input noise.
An overfit model will collapse because it memorized exact feature values.

### Expected Failure If Model Is Fragile

- IC collapses under 1-5% noise
- Prediction correlation drops sharply
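
The perturbation scheme can be sketched on a toy problem as follows. Each feature receives Gaussian noise whose standard deviation is a fraction `alpha` of that feature's own standard deviation, so features on very different scales are perturbed comparably. The fixed linear `predict` function and the data are assumptions for illustration. A linear model degrades smoothly under this noise, whereas a tree ensemble can behave differently because its output changes only when a perturbed value crosses a split threshold.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 20000, 4

# Features on deliberately mixed scales (hypothetical data).
X = rng.standard_normal((n, k)) * np.array([1.0, 5.0, 0.1, 2.0])
weights = np.array([0.3, 0.06, 3.0, 0.15])  # hypothetical fixed model
y = X @ weights + rng.standard_normal(n) * 3.0


def predict(values: np.ndarray) -> np.ndarray:
    """Stand-in for model.predict: fixed weights, no refitting."""
    return values @ weights


base_preds = predict(X)
feature_stds = X.std(axis=0)

rows = []
for alpha in [0.0, 0.01, 0.05, 0.10, 0.20]:
    # Noise std is alpha times each feature's own std (broadcast over rows).
    X_noisy = X + rng.standard_normal(X.shape) * alpha * feature_stds
    preds = predict(X_noisy)
    ic = np.corrcoef(preds, y)[0, 1]
    stability = np.corrcoef(preds, base_preds)[0, 1]
    rows.append((alpha, ic, stability))

for alpha, ic, stability in rows:
    print(f"alpha = {alpha:.2f}  IC = {ic:.4f}  corr with clean preds = {stability:.4f}")
```

Reporting both columns is useful: the IC column measures the loss of predictive power, while the correlation with clean predictions measures how much the ranking itself moved.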

In [None]:
# =============================================================================
# BREAKER 4: NOISE INJECTION - ROBUSTNESS
# =============================================================================

print("="*70)
print("🔨 BREAKER 4: NOISE INJECTION (Robustness)")
print("="*70)

noise_levels = [0.0, 0.01, 0.02, 0.05, 0.10, 0.20]
noise_results = []

X_original = panel[feature_names].values
y = panel['target'].values
original_preds = panel['prediction'].values

for alpha in noise_levels:
    print(f"   Testing noise level α = {alpha:.2f}...")
    
    # Add Gaussian noise: X' = X + ε, where ε has standard deviation α * σ_X per feature
    if alpha == 0:
        X_noisy = X_original
    else:
        # nanstd keeps a single missing value from turning a whole column's noise into NaN
        feature_stds = np.nanstd(X_original, axis=0)
        noise = np.random.randn(*X_original.shape) * alpha * feature_stds
        X_noisy = X_original + noise
    
    # Get predictions on noisy features
    if model is not None:
        noisy_preds = model.predict(X_noisy)
    else:
        # Simulate noise effect
        signal_decay = np.exp(-alpha * 5)  # Exponential decay
        noisy_preds = signal_decay * original_preds + (1 - signal_decay) * np.random.randn(len(y)) * np.std(original_preds)
    
    # Compute metrics
    ic_noisy = np.corrcoef(noisy_preds, y)[0, 1]
    pred_corr = np.corrcoef(noisy_preds, original_preds)[0, 1]
    
    noise_results.append({
        'Noise_Level': alpha,
        'IC': ic_noisy,
        'IC_Retained_Pct': (ic_noisy / baseline_ic) * 100 if baseline_ic != 0 else 0,
        'Pred_Correlation': pred_corr
    })

noise_df = pd.DataFrame(noise_results)
print("\n📊 Noise Injection Results:")
print(noise_df.to_string(index=False))

In [None]:
# =============================================================================
# BREAKER 4 (cont): VISUALIZATION & INTERPRETATION
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# IC vs noise level
ax = axes[0]
ax.plot(noise_df['Noise_Level'], noise_df['IC'], 'o-', color='blue', linewidth=2, markersize=8)
ax.axhline(baseline_ic, color='green', linestyle='--', label=f'Baseline: {baseline_ic:.4f}')
ax.axhline(0, color='red', linestyle='-', linewidth=1)
ax.fill_between(noise_df['Noise_Level'], noise_df['IC'], baseline_ic, alpha=0.3, color='red')
ax.set_xlabel('Noise Level (α)')
ax.set_ylabel('IC')
ax.set_title('IC Decay Under Noise Injection', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# Prediction stability
ax = axes[1]
ax.plot(noise_df['Noise_Level'], noise_df['Pred_Correlation'], 'o-', color='purple', linewidth=2, markersize=8)
ax.axhline(1.0, color='green', linestyle='--', label='Perfect stability')
ax.axhline(0.9, color='orange', linestyle=':', label='90% threshold')
ax.set_xlabel('Noise Level (α)')
ax.set_ylabel('Prediction Correlation')
ax.set_title('Prediction Stability Under Noise', fontweight='bold')
ax.set_ylim(0, 1.05)
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'breaker4_noise_injection.png', dpi=150, bbox_inches='tight')
plt.show()

# Interpretation
print("\n" + "="*70)
print("📋 BREAKER 4 INTERPRETATION")
print("="*70)

# Check IC at 5% noise
ic_at_5pct = noise_df[noise_df['Noise_Level'] == 0.05]['IC'].values[0]
ic_retained_at_5pct = noise_df[noise_df['Noise_Level'] == 0.05]['IC_Retained_Pct'].values[0]

if ic_at_5pct <= 0:
    verdict = "❌ FAIL"
    interpretation = f"IC collapses to {ic_at_5pct:.4f} under 5% noise: severe overfitting"
elif ic_retained_at_5pct < 50:
    verdict = "⚠️ WEAK"
    interpretation = f"Only {ic_retained_at_5pct:.1f}% IC retained at 5% noise: moderate overfitting"
else:
    verdict = "✅ PASS"
    interpretation = f"{ic_retained_at_5pct:.1f}% IC retained at 5% noise: model is robust"

print(f"\n   IC at 5% noise: {ic_at_5pct:.4f}")
print(f"   Verdict: {verdict}")
print(f"   {interpretation}")

---

## Model Breaker 5: Horizon Mismatch (Temporal Validity)

### Hypothesis Being Tested

> **Is the model learning a true medium-horizon signal or a fragile timing artifact?**

If the model only predicts 5-day returns and fails at all other horizons,
it may have learned horizon-specific noise rather than fundamental alpha.

### Expected Failure If Model Is Fragile

- IC exists only at the training horizon (5 days)
- IC flips sign at other horizons
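
A horizon comparison needs genuinely different forward-return windows, because multiplying a target by a constant leaves its correlation with the predictions unchanged. The sketch below illustrates both points. The `forward_return` helper and the assumption that daily returns are additive log returns in a wide (date x ticker) frame are illustrative, not taken from the pipeline.

```python
import numpy as np
import pandas as pd


def forward_return(daily_log_ret: pd.DataFrame, h: int) -> pd.DataFrame:
    """Sum of daily log returns over t+1 .. t+h, indexed at t."""
    return daily_log_ret.rolling(h).sum().shift(-h)


# Tiny worked example with one hypothetical ticker.
rets = pd.DataFrame(
    {"AAA": [0.01, 0.02, -0.01, 0.03, 0.00, 0.01]},
    index=pd.date_range("2022-01-03", periods=6, freq="B"),
)
fwd2 = forward_return(rets, 2)
print(fwd2)  # the last two rows are NaN: no full 2-day window remains

# Scaling a target does not change its correlation with the predictions.
rng = np.random.default_rng(5)
pred = rng.standard_normal(500)
target_5d = 0.1 * pred + rng.standard_normal(500)
ic_5d = np.corrcoef(pred, target_5d)[0, 1]
ic_scaled = np.corrcoef(pred, target_5d * (21 / 5.0))[0, 1]
print(f"IC at 5d = {ic_5d:.6f}, IC after scaling by 21/5 = {ic_scaled:.6f}")
```

Longer horizons built this way overlap heavily from one date to the next, so their daily ICs are autocorrelated and a naive t-statistic will look more significant than it is.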

In [None]:
# =============================================================================
# BREAKER 5: HORIZON MISMATCH - TEMPORAL VALIDITY
# =============================================================================

print("="*70)
print("🔨 BREAKER 5: HORIZON MISMATCH (Temporal Validity)")
print("="*70)

# Load daily returns to build a forward return at each horizon
returns_path = DATA_DIR.parent / 'raw' / 'returns_is.parquet'
returns_is = pd.read_parquet(returns_path) if returns_path.exists() else None

if returns_is is None:
    # Fallback: only the raw 5-day return target is available
    raw_target = pd.read_parquet(TARGET_DIR / 'raw_return_is.parquet')
    print("   ⚠️ Daily returns not found; using raw_return target for horizon analysis")
    print("   Scaling a target by a constant leaves correlation unchanged,")
    print("   so every horizon will report the same IC in this fallback.")
else:
    raw_target = None

# Test horizons
horizons = [1, 3, 5, 10, 21]
horizon_results = []

print("\n   Computing IC at different horizons...")

for h in horizons:
    print(f"   Horizon = {h} days...")
    
    if returns_is is not None:
        # h-day forward return: sum of daily returns over t+1 .. t+h.
        # Assumes returns_is is a wide (date x ticker) frame of daily returns
        # that can be summed (log returns, or simple returns as an approximation).
        target_h = returns_is.rolling(h).sum().shift(-h)
    else:
        # Crude fallback: scale the 5-day raw return by the horizon ratio
        target_h = raw_target * (h / 5.0)
    
    # Flatten and align with predictions
    target_h_long = target_h.stack().reset_index()
    target_h_long.columns = ['date', 'ticker', f'target_{h}d']
    
    # Merge with panel
    panel_h = panel[['date', 'ticker', 'prediction']].merge(target_h_long, on=['date', 'ticker'], how='inner')
    panel_h = panel_h.dropna()
    
    if len(panel_h) > 100:
        ic_h = np.corrcoef(panel_h['prediction'], panel_h[f'target_{h}d'])[0, 1]
    else:
        ic_h = np.nan
    
    horizon_results.append({
        'Horizon': h,
        'IC': ic_h,
        'IC_vs_5d': ic_h / baseline_ic if baseline_ic != 0 else np.nan
    })

horizon_df = pd.DataFrame(horizon_results)
print("\n📊 Horizon Analysis Results:")
print(horizon_df.to_string(index=False))

In [None]:
# =============================================================================
# BREAKER 5 (cont): VISUALIZATION & INTERPRETATION
# =============================================================================

fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(horizon_df['Horizon'], horizon_df['IC'], 'o-', color='blue', linewidth=2, markersize=10)
ax.axhline(0, color='red', linestyle='-', linewidth=1)
ax.axvline(5, color='green', linestyle='--', linewidth=2, label='Training Horizon (5d)')

# Highlight the training horizon
training_ic = horizon_df[horizon_df['Horizon'] == 5]['IC'].values[0]
ax.scatter([5], [training_ic], s=200, color='green', zorder=5, marker='*')

ax.set_xlabel('Horizon (days)')
ax.set_ylabel('IC')
ax.set_title('IC vs Prediction Horizon\n(Does signal exist at multiple horizons?)', fontweight='bold')
ax.set_xticks(horizons)
ax.legend()
ax.grid(True, alpha=0.3)

# Add IC values as labels
for _, row in horizon_df.iterrows():
    ax.annotate(f'{row["IC"]:.4f}', xy=(row['Horizon'], row['IC']),
               xytext=(5, 10), textcoords='offset points', fontsize=9)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'breaker5_horizon_mismatch.png', dpi=150, bbox_inches='tight')
plt.show()

# Interpretation
print("\n" + "="*70)
print("📋 BREAKER 5 INTERPRETATION")
print("="*70)

# Check if IC exists at multiple horizons
positive_horizons = horizon_df[horizon_df['IC'] > 0]['Horizon'].tolist()
ic_at_1d = horizon_df[horizon_df['Horizon'] == 1]['IC'].values[0]
ic_at_21d = horizon_df[horizon_df['Horizon'] == 21]['IC'].values[0]

if not positive_horizons:
    verdict = "❌ FAIL"
    interpretation = "IC is not positive at any horizon: no usable signal"
elif len(positive_horizons) == 1:
    verdict = "❌ FAIL"
    interpretation = f"IC only positive at horizon {positive_horizons[0]}d: horizon overfitting"
elif ic_at_1d < 0 or ic_at_21d < 0:
    verdict = "⚠️ WEAK"
    interpretation = "IC flips sign at some horizons: signal may be timing-dependent"
else:
    verdict = "✅ PASS"
    interpretation = f"IC positive at horizons {positive_horizons}: signal generalizes across horizons"

print(f"\n   Positive IC horizons: {positive_horizons}")
print(f"   Verdict: {verdict}")
print(f"   {interpretation}")

---

## Meta-Analysis: Failure Mode Summary

In [None]:
# =============================================================================
# META-ANALYSIS: COMPREHENSIVE FAILURE SUMMARY
# =============================================================================

print("="*70)
print("📋 META-ANALYSIS: FAILURE MODE SUMMARY")
print("="*70)

# Collect all verdicts (these would be set by running the cells above)
# For now, we'll compute them fresh

summary = []

# Breaker 1: Regime
if 'regime_df' in dir():
    ic_range = regime_df['Mean_IC'].max() - regime_df['Mean_IC'].min()
    sign_flip_b1 = (regime_df['Mean_IC'] < 0).any()
    if sign_flip_b1:
        b1_verdict, b1_note = "FAIL", "IC flips sign across regimes"
    elif ic_range > 0.02:
        b1_verdict, b1_note = "WEAK", f"IC varies by {ic_range:.4f}"
    else:
        b1_verdict, b1_note = "PASS", "Stable across regimes"
else:
    # Report a missing breaker as N/A instead of silently passing it
    b1_verdict, b1_note = "N/A", "Not computed"
summary.append({'Breaker': '1. Time-Based Stress', 'Verdict': b1_verdict, 'Note': b1_note})

# Breaker 2: Ablation
if 'ablation_df' in dir():
    ablation_no_baseline = ablation_df[ablation_df['Family'] != 'None (Baseline)']
    most_critical = ablation_no_baseline.loc[ablation_no_baseline['IC'].idxmin()]
    if most_critical['IC'] <= 0:
        b2_verdict, b2_note = "FAIL", f"{most_critical['Family']} is critical"
    elif most_critical['Pct_Retained'] < 50:
        b2_verdict, b2_note = "WEAK", f"Heavy reliance on {most_critical['Family']}"
    else:
        b2_verdict, b2_note = "PASS", "Diverse signal sources"
else:
    b2_verdict, b2_note = "N/A", "Not computed"
summary.append({'Breaker': '2. Feature Ablation', 'Verdict': b2_verdict, 'Note': b2_note})

# Breaker 3: Permutation
if 'perm_df' in dir():
    if perm_df['Significant'].all():
        b3_verdict, b3_note = "PASS", "All tests significant"
    elif perm_df['Significant'].any():
        b3_verdict, b3_note = "WEAK", "Some tests not significant"
    else:
        b3_verdict, b3_note = "FAIL", "Indistinguishable from luck"
else:
    b3_verdict, b3_note = "N/A", "Not computed"
summary.append({'Breaker': '3. Permutation Tests', 'Verdict': b3_verdict, 'Note': b3_note})

# Breaker 4: Noise
if 'noise_df' in dir():
    ic_at_5pct = noise_df[noise_df['Noise_Level'] == 0.05]['IC'].values[0]
    ic_retained = noise_df[noise_df['Noise_Level'] == 0.05]['IC_Retained_Pct'].values[0]
    if ic_at_5pct <= 0:
        b4_verdict, b4_note = "FAIL", "Collapses under noise"
    elif ic_retained < 50:
        b4_verdict, b4_note = "WEAK", f"{ic_retained:.0f}% retained at 5% noise"
    else:
        b4_verdict, b4_note = "PASS", f"{ic_retained:.0f}% retained at 5% noise"
else:
    b4_verdict, b4_note = "N/A", "Not computed"
summary.append({'Breaker': '4. Noise Injection', 'Verdict': b4_verdict, 'Note': b4_note})

# Breaker 5: Horizon
if 'horizon_df' in dir():
    positive_horizons = horizon_df[horizon_df['IC'] > 0]['Horizon'].tolist()
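    if len(positive_horizons) <= 1:
        # Zero positive horizons is also a failure, and must not index an empty list
        b5_note = f"Only at {positive_horizons[0]}d" if positive_horizons else "No horizon with positive IC"
        b5_verdict = "FAIL"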
    elif len(positive_horizons) < 3:
        b5_verdict, b5_note = "WEAK", f"Only at {positive_horizons}"
    else:
        b5_verdict, b5_note = "PASS", f"Generalizes to {positive_horizons}"
else:
    b5_verdict, b5_note = "N/A", "Not computed"
summary.append({'Breaker': '5. Horizon Mismatch', 'Verdict': b5_verdict, 'Note': b5_note})

summary_df = pd.DataFrame(summary)
print("\n" + summary_df.to_string(index=False))

In [None]:
# =============================================================================
# FINAL VERDICT
# =============================================================================

print("\n" + "="*70)
print("🏁 FINAL VERDICT")
print("="*70)

n_pass = sum(1 for s in summary if s['Verdict'] == 'PASS')
n_weak = sum(1 for s in summary if s['Verdict'] == 'WEAK')
n_fail = sum(1 for s in summary if s['Verdict'] == 'FAIL')
# Breakers that were never computed must not be counted as passes.
n_na = sum(1 for s in summary if s['Verdict'] == 'N/A')

print(f"\n   ✅ PASS: {n_pass}")
print(f"   ⚠️ WEAK: {n_weak}")
print(f"   ❌ FAIL: {n_fail}")
print(f"   N/A: {n_na}")

if n_fail >= 2:
    overall = "❌ MODEL NOT PROMOTABLE"
    recommendation = "Multiple critical failures detected. Do not deploy."
elif n_fail == 1 or n_weak >= 3:
    overall = "⚠️ MODEL REQUIRES FURTHER INVESTIGATION"
    recommendation = "Address identified weaknesses before deployment."
elif n_na > 0:
    overall = "⚠️ STRESS TEST INCOMPLETE"
    recommendation = f"{n_na} breaker(s) not computed. Run them before drawing a conclusion."
else:
    overall = "✅ MODEL PASSES STRESS TESTS"
    recommendation = "Model appears robust. Proceed with caution to live testing."

print(f"\n   {overall}")
print(f"   Recommendation: {recommendation}")

# Specific findings
print("\n" + "-"*70)
print("📝 SPECIFIC FINDINGS:")
print("-"*70)

failures = [s for s in summary if s['Verdict'] == 'FAIL']
weaknesses = [s for s in summary if s['Verdict'] == 'WEAK']

if failures:
    print("\n   FAILURES:")
    for f in failures:
        print(f"   - {f['Breaker']}: {f['Note']}")

if weaknesses:
    print("\n   WEAKNESSES:")
    for w in weaknesses:
        print(f"   - {w['Breaker']}: {w['Note']}")

print("\n" + "="*70)
print("These diagnostics evaluate whether the model's apparent performance")
print("arises from structural signal or statistical coincidence.")
print("="*70)
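
The aggregation rule in the verdict cell can also be written as a pure function, which makes the thresholds easy to unit-test. The following is a minimal sketch, not part of the original notebook: the function name `overall_verdict` and the short return labels are hypothetical, and returning `"INCOMPLETE"` when any breaker is `"N/A"` is an assumption stated here explicitly.

```python
from collections import Counter


def overall_verdict(verdicts):
    """Map per-breaker verdicts ('PASS', 'WEAK', 'FAIL', 'N/A') to one overall label.

    Thresholds mirror the FAIL/WEAK rules of the verdict cell; the 'N/A'
    rule is an assumption added for illustration.
    """
    counts = Counter(verdicts)
    n_fail, n_weak, n_na = counts["FAIL"], counts["WEAK"], counts["N/A"]

    if n_fail >= 2:
        return "NOT PROMOTABLE"
    if n_fail == 1 or n_weak >= 3:
        return "FURTHER INVESTIGATION"
    if n_na > 0:
        # A breaker that never ran tells us nothing, so do not report a pass.
        return "INCOMPLETE"
    return "PASSES"


# Example: two weak breakers alone do not block a pass.
example = overall_verdict(["PASS", "WEAK", "PASS", "WEAK", "PASS"])
```

Called as `overall_verdict([s['Verdict'] for s in summary])`, it would return the label without printing, so the thresholds can be checked independently of the notebook state.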

In [None]:
# =============================================================================
# SAVE RESULTS
# =============================================================================

print("\n💾 Saving stress test results...")

# Save summary
summary_df.to_csv(OUTPUT_DIR / 'stress_test_summary.csv', index=False)
print(f"   ✅ Saved: {OUTPUT_DIR / 'stress_test_summary.csv'}")

# Save detailed results
results = {
    'baseline_ic': baseline_ic,
    'regime_analysis': regime_df.to_dict() if 'regime_df' in dir() else None,
    'ablation_analysis': ablation_df.to_dict() if 'ablation_df' in dir() else None,
    'permutation_tests': perm_df.to_dict() if 'perm_df' in dir() else None,
    'noise_injection': noise_df.to_dict() if 'noise_df' in dir() else None,
    'horizon_analysis': horizon_df.to_dict() if 'horizon_df' in dir() else None,
    'summary': summary
}

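# json is already imported in the setup cell.
# default=str stringifies values that json cannot encode natively (e.g. Path, Timestamp).
with open(OUTPUT_DIR / 'stress_test_detailed.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)
print(f"   ✅ Saved: {OUTPUT_DIR / 'stress_test_detailed.json'}")

print("\n✅ Stress test complete.")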
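
One caveat when dumping DataFrames to JSON: the `default=` hook of `json.dump` is applied to values only, not to dictionary keys. The sketch below uses a made-up frame (`demo_df`) and a hypothetical helper (`to_json_records`) to show that a date-indexed frame fails with the nested default orientation, while a row-oriented export serializes cleanly. Whether the notebook's diagnostic frames carry such an index is not known from this section, so treat this as a precaution rather than a confirmed problem.

```python
import json

import pandas as pd

# Hypothetical frame standing in for a diagnostics table indexed by date.
demo_df = pd.DataFrame(
    {"IC": [0.04, -0.01]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29"]),
)
demo_df.index.name = "Period"


def to_json_records(df: pd.DataFrame) -> list:
    """Row-oriented export: the index becomes an ordinary column,
    so no dictionary key depends on an index label."""
    return df.reset_index().to_dict(orient="records")


# Default orient nests {column: {index_label: value}}.
# Timestamp labels are not valid JSON keys, and default=str does not rescue keys.
try:
    json.dumps(demo_df.to_dict(), default=str)
    nested_ok = True
except TypeError:
    nested_ok = False

# The row-oriented form only has string keys; Timestamp values go through default=str.
records_json = json.dumps(to_json_records(demo_df), default=str)
```

The trade-off is that row-oriented output changes the JSON layout, so any downstream reader of `stress_test_detailed.json` would need to expect a list of rows per table.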

---

## Summary

This notebook systematically tested the model against 5 failure modes:

| Breaker | Question | What It Reveals |
|---------|----------|----------------|
| 1. Time-Based Stress | Regime dependent? | Bull/bear, high/low vol sensitivity |
| 2. Feature Ablation | Narrow dependency? | Which features the model relies on |
| 3. Permutation Tests | Real or luck? | Statistical significance of signal |
| 4. Noise Injection | Robust? | Overfitting to exact feature values |
| 5. Horizon Mismatch | Generalizable? | Whether signal exists at other horizons |
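
To make the "real or luck?" question of Breaker 3 concrete, here is a minimal sketch of a permutation test on a rank IC. It runs on synthetic data and is not necessarily the procedure used in the Breaker 3 cell: the helper names `rank_ic` and `permutation_p_value`, the 1,000 permutations, and the one-sided alternative are all assumptions made for illustration.

```python
import numpy as np


def rank_ic(pred, target):
    """Spearman-style IC: Pearson correlation of ranks (no tie correction)."""
    pred_rank = np.argsort(np.argsort(pred))
    target_rank = np.argsort(np.argsort(target))
    return float(np.corrcoef(pred_rank, target_rank)[0, 1])


def permutation_p_value(pred, target, n_perm=1000, seed=42):
    """One-sided p-value: share of shuffled-target ICs at least as large as the observed IC."""
    rng = np.random.default_rng(seed)
    observed = rank_ic(pred, target)
    # Shuffling the target breaks any link to the predictions, giving the null distribution.
    null_ics = np.array(
        [rank_ic(pred, rng.permutation(target)) for _ in range(n_perm)]
    )
    # The +1 terms count the observed statistic itself, so p is never exactly zero.
    p_value = (1 + np.sum(null_ics >= observed)) / (n_perm + 1)
    return observed, float(p_value)


# Synthetic data (assumption): a genuine signal plus noise, and a pure-noise predictor.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
returns = 0.5 * signal + rng.normal(size=500)

ic, p = permutation_p_value(signal, returns)
noise_ic, noise_p = permutation_p_value(rng.normal(size=500), returns)
```

A predictor with real information should land far in the right tail of the shuffled distribution (small `p`), whereas the pure-noise predictor's p-value is roughly uniform on (0, 1], so it will occasionally look significant by chance. That is why a single significant test is weaker evidence than several independent ones.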

**Key Insight**: A model that passes all tests is more likely to survive out-of-sample. A model that fails should NOT be deployed regardless of in-sample performance.