**📊 FEATURE IMPORTANCE RANKING**
**🟢🟢🟢 TIER 1: CRITICAL (Must Have for Modeling)**
Macroeconomic Stress Indicators:
Financial_Stress_Index - Official Fed stress measure
Corporate_Bond_Spread - Direct corporate credit risk
TED_Spread - Banking system stress
High_Yield_Spread - Stressed company risk premium
Yield_Curve_Spread - Recession predictor

Market Fear Indicators:
VIX - Market fear gauge
SP500_Return - Market momentum

Economic Fundamentals:
GDP_Growth - Economic health
Unemployment_Rate - Labor market / recession signal
Federal_Funds_Rate - Cost of capital
CPI_Inflation - Price pressures

Company Performance:
{COMPANY}_Stock_Return (all 25) - Direct performance metrics

Why critical: These directly measure stress, economic health, and company performance - the core of the prediction task.

**🟢🟢 TIER 2: HIGH IMPORTANCE (Strong Predictive Power)**
Consumer_Confidence - Leading consumer spending indicator
Treasury_10Y_Yield - Benchmark borrowing rate
{COMPANY}_Stock_Volume - Unusual activity detection
SP500 - Market level / valuation
Quarter - Earnings cycle alignment
Why high: Strong relationship with economic conditions and company performance, but somewhat redundant with Tier 1.

**🟡 TIER 3: MEDIUM IMPORTANCE (Useful Context)**
Oil_Price - Energy costs, inflation proxy
Trade_Balance - Economic flows
Year - Long-term trends
Month - Seasonal patterns
Why medium: Provide additional context but less direct impact on individual company stress.

**🔴 TIER 4: LOW IMPORTANCE (Minor or Redundant)**
{COMPANY}_Stock_Price - Use returns instead
DayOfWeek - Weak day-of-week (weekly seasonality) effect
IsMonthEnd - Minor microstructure effect
Unnamed: 0 - Index, not a feature
Why low: Either redundant (stock price), weak signal (day of week), or not a feature (index).

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('C:\\Users\\akulc\\mlops_project\\Mlops_Project_FinancialCrises\\data\\processed\\merged\\financial_data_complete_daily.csv')

In [3]:
df1=df.copy()

In [4]:
initial_features = df1.shape[1]
initial_features

97

Phase 1:

1A. Lag Features - Economic Indicators
Why: Economic conditions from the recent past carry predictive information about future stress

In [6]:
# Create lag features for slowly-changing economic indicators
print("\n[1/6] Creating lag features for economic indicators...")

macro_lag_features = [
    'GDP_Growth', 'Unemployment_Rate', 'CPI_Inflation',
    'Federal_Funds_Rate', 'Consumer_Confidence'
]

lags = [1, 5, 10, 20, 60]

lag_count = 0
for feature in macro_lag_features:
    if feature in df1.columns:
        for lag in lags:
            df1[f'{feature}_lag{lag}'] = df1[feature].shift(lag)
            lag_count += 1

print(f"   ✅ Created {lag_count} economic lag features")


[1/6] Creating lag features for economic indicators...
   ✅ Created 25 economic lag features
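One caveat worth checking before relying on the short lags: if the macro series were forward-filled from monthly or quarterly releases onto the daily index (an assumption; the notebook does not show how the merged file was built), then `shift(1)` or `shift(5)` mostly reproduces the current value. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for a forward-filled macro series: the value changes once in 10 days.
# (Hypothetical data, not taken from the notebook's CSV.)
gdp = pd.Series([1.0] * 5 + [2.0] * 5, name="GDP_Growth")

lag1 = gdp.shift(1)  # value from one row earlier; the first row becomes NaN
same_as_today = int((lag1 == gdp).sum())

print(f"lag1 equals the current value on {same_as_today} of {len(gdp)} rows")
```

If the real columns behave like this, the longer lags (20, 60) are likely to add more information than the 1-day and 5-day lags.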


1B. Lag Features - Market Stress Indicators
Why: Market stress evolves quickly - recent history matters

In [7]:
print("\n[2/6] Creating lag features for market stress indicators...")

market_lag_features = [
    'VIX', 'Corporate_Bond_Spread', 'TED_Spread', 
    'High_Yield_Spread', 'Financial_Stress_Index'
]

market_lags = [1, 2, 3, 5, 10]

lag_count = 0
for feature in market_lag_features:
    if feature in df1.columns:
        for lag in market_lags:
            df1[f'{feature}_lag{lag}'] = df1[feature].shift(lag)
            lag_count += 1

print(f"   ✅ Created {lag_count} market stress lag features")


[2/6] Creating lag features for market stress indicators...
   ✅ Created 25 market stress lag features


2. Rolling Statistics (Volatility & Momentum)

In [8]:
print("\n[3/6] Creating rolling statistics (volatility & momentum)...")

# A. VIX rolling statistics
windows = [5, 10, 20, 60]
rolling_count = 0

if 'VIX' in df1.columns:
    for window in windows:
        df1[f'VIX_rolling_mean_{window}'] = df1['VIX'].rolling(window).mean()
        df1[f'VIX_rolling_std_{window}'] = df1['VIX'].rolling(window).std()
        df1[f'VIX_rolling_max_{window}'] = df1['VIX'].rolling(window).max()
        rolling_count += 3

print(f"   ✅ VIX rolling stats: {rolling_count} features")

# B. Stock return volatility and momentum for ALL companies
companies = ['AAPL', 'AMZN', 'BA', 'BAC', 'C', 'CAT', 'COST', 'CVX', 
             'DIS', 'GOOGL', 'GS', 'HD', 'JNJ', 'JPM', 'LIN', 'MCD', 
             'MSFT', 'NFLX', 'NVDA', 'PG', 'TSLA', 'UNH', 'WFC', 'WMT', 'XOM']

rolling_count = 0
for company in companies:
    return_col = f'{company}_Stock_Return'
    if return_col in df1.columns:
        for window in [10, 20, 60]:
            # Volatility
            df1[f'{company}_Return_volatility_{window}'] = \
                df1[return_col].rolling(window).std()
            
            # Momentum (average return)
            df1[f'{company}_Return_momentum_{window}'] = \
                df1[return_col].rolling(window).mean()
            
            rolling_count += 2

print(f"   ✅ Company rolling stats: {rolling_count} features")

# C. Spread widening (credit stress indicators)
rolling_count = 0
for spread in ['Corporate_Bond_Spread', 'High_Yield_Spread', 'TED_Spread']:
    if spread in df1.columns:
        for window in [20, 60]:
            df1[f'{spread}_rolling_mean_{window}'] = df1[spread].rolling(window).mean()
            df1[f'{spread}_rolling_max_{window}'] = df1[spread].rolling(window).max()
            rolling_count += 2

print(f"   ✅ Spread rolling stats: {rolling_count} features")


[3/6] Creating rolling statistics (volatility & momentum)...
   ✅ VIX rolling stats: 12 features
   ✅ Company rolling stats: 150 features
   ✅ Spread rolling stats: 12 features
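Assigning hundreds of columns one at a time can make pandas emit a `PerformanceWarning` about a fragmented DataFrame. A possible alternative, sketched here on a hypothetical two-company toy frame rather than the real `df1`, is to collect the new columns in a dict and concatenate once:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1 (two return columns, 100 rows).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AAPL_Stock_Return": rng.normal(0, 1, 100),
    "MSFT_Stock_Return": rng.normal(0, 1, 100),
})

new_cols = {}  # collect new features here instead of assigning into the frame
for company in ["AAPL", "MSFT"]:
    returns = toy[f"{company}_Stock_Return"]
    for window in [10, 20, 60]:
        new_cols[f"{company}_Return_volatility_{window}"] = returns.rolling(window).std()
        new_cols[f"{company}_Return_momentum_{window}"] = returns.rolling(window).mean()

# One concat instead of twelve single-column inserts.
toy = pd.concat([toy, pd.DataFrame(new_cols)], axis=1)
print(toy.shape)
```

The resulting columns are the same as with direct assignment; only the way they are attached to the frame differs.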




3. Critical Interaction Features
Based on EDA showing VIX correlates 0.81 with Financial_Stress_Index, create stress composites:

Why: The EDA shows these stress measures are highly correlated, and multiplicative interactions highlight periods when several of them are elevated at once

In [9]:
print("\n[4/6] Creating interaction features (stress composites)...")

interaction_count = 0

# A. Combined Stress Indices
if all(col in df1.columns for col in ['Corporate_Bond_Spread', 'TED_Spread', 'High_Yield_Spread']):
    df1['Credit_Liquidity_Stress'] = (
        df1['Corporate_Bond_Spread'] * df1['TED_Spread'] * df1['High_Yield_Spread']
    )
    interaction_count += 1

if all(col in df1.columns for col in ['VIX', 'Corporate_Bond_Spread']):
    df1['Market_Credit_Stress'] = df1['VIX'] * df1['Corporate_Bond_Spread']
    interaction_count += 1

if all(col in df1.columns for col in ['Unemployment_Rate', 'Consumer_Confidence']):
    df1['Unemployment_Confidence_Stress'] = (
        df1['Unemployment_Rate'] * (100 - df1['Consumer_Confidence'])
    )
    interaction_count += 1

# B. Economic divergence
if all(col in df1.columns for col in ['SP500_Return', 'GDP_Growth']):
    df1['Market_Economy_Divergence'] = df1['SP500_Return'] - df1['GDP_Growth']
    interaction_count += 1

if all(col in df1.columns for col in ['CPI_Inflation', 'Federal_Funds_Rate']):
    df1['Inflation_Rate_Product'] = df1['CPI_Inflation'] * df1['Federal_Funds_Rate']
    interaction_count += 1

# C. Yield curve analysis
if 'Yield_Curve_Spread' in df1.columns:
    df1['Inverted_Yield_Curve'] = (df1['Yield_Curve_Spread'] < 0).astype(int)
    df1['Yield_Curve_Stress'] = (df1['Yield_Curve_Spread'] < -0.5).astype(int)
    interaction_count += 2

print(f"   ✅ Created {interaction_count} interaction features")


[4/6] Creating interaction features (stress composites)...
   ✅ Created 7 interaction features
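To make the arithmetic behind two of these composites concrete, here is a small worked example. The numbers are invented for easy mental checking and do not come from the dataset:

```python
import pandas as pd

# Made-up values chosen for easy arithmetic (not from the dataset).
toy = pd.DataFrame({
    "Unemployment_Rate": [4.0, 8.0, 8.0],
    "Consumer_Confidence": [90.0, 60.0, 60.0],
    "Yield_Curve_Spread": [1.0, -0.2, -0.8],
})

# Higher unemployment and lower confidence both push the composite up.
toy["Unemployment_Confidence_Stress"] = (
    toy["Unemployment_Rate"] * (100 - toy["Consumer_Confidence"])
)

# Threshold flags: any inversion vs. a deeper inversion below -0.5.
toy["Inverted_Yield_Curve"] = (toy["Yield_Curve_Spread"] < 0).astype(int)
toy["Yield_Curve_Stress"] = (toy["Yield_Curve_Spread"] < -0.5).astype(int)

print(toy)
```

Note that `100 - Consumer_Confidence` turns negative whenever the index is above 100, which flips the sign of the composite; whether that happens depends on the scale of the confidence series in the data.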




4. Sector Indices (From Your 25 Companies)
Based on the company list, aggregate by sector (NFLX is not assigned to any group below, so it does not enter the sector indices):

Why: EDA showed sector patterns - financials especially stressed during crises

In [10]:
print("\n[5/6] Creating sector indices...")

sector_groups = {
    'Financial': ['JPM', 'BAC', 'C', 'GS', 'WFC'],
    'Tech': ['AAPL', 'MSFT', 'GOOGL', 'NVDA', 'AMZN'],
    'Healthcare': ['UNH', 'JNJ'],
    'Energy': ['XOM', 'CVX'],
    'Consumer_Discretionary': ['TSLA', 'MCD', 'COST', 'DIS'],
    'Consumer_Staples': ['WMT', 'PG'],
    'Industrial': ['BA', 'CAT', 'LIN', 'HD']
}

sector_count = 0
for sector, stocks in sector_groups.items():
    return_cols = [f'{s}_Stock_Return' for s in stocks if f'{s}_Stock_Return' in df1.columns]
    
    if len(return_cols) > 0:
        # Sector average return
        df1[f'{sector}_Sector_Return'] = df1[return_cols].mean(axis=1)
        sector_count += 1
        
        # Sector dispersion: cross-sectional std across the sector's stocks on each day
        df1[f'{sector}_Sector_Volatility'] = df1[return_cols].std(axis=1)
        sector_count += 1
        
        # Sector momentum (20-day average)
        df1[f'{sector}_Sector_Momentum_20'] = df1[f'{sector}_Sector_Return'].rolling(20).mean()
        sector_count += 1

# Financial sector stress (special case)
if all(col in df1.columns for col in ['Financial_Sector_Volatility', 'Corporate_Bond_Spread']):
    df1['Financial_Sector_Stress'] = (
        df1['Financial_Sector_Volatility'] * df1['Corporate_Bond_Spread']
    )
    sector_count += 1

print(f"   ✅ Created {sector_count} sector features")



[5/6] Creating sector indices...
   ✅ Created 22 sector features
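The `std(axis=1)` call measures how much the stocks in a sector disagree with each other on a given day, which is different from how much the sector return varies over time. A sketch with two hypothetical stocks that move in lockstep shows the gap:

```python
import pandas as pd

# Two hypothetical stocks that move in lockstep but swing a lot from day to day.
toy = pd.DataFrame({
    "A_Stock_Return": [0.05, -0.05, 0.05, -0.05],
    "B_Stock_Return": [0.05, -0.05, 0.05, -0.05],
})

sector_return = toy.mean(axis=1)

# Cross-sectional dispersion: std across stocks within each day (what std(axis=1) gives).
dispersion = toy.std(axis=1)

# Time-series volatility: std of the sector return over the sample.
time_series_vol = sector_return.std()

print(dispersion.tolist())
print(round(time_series_vol, 4))
```

Here the dispersion is zero on every day even though the sector return is volatile. If time-series volatility is what the model needs, a rolling std of `{sector}_Sector_Return` would be one option.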



5. Company-Specific Risk Features

Why: Individual company stress matters for prediction targets

In [11]:
print("\n[6/6] Creating company-specific risk features...")

company_feature_count = 0

for company in companies:
    return_col = f'{company}_Stock_Return'
    price_col = f'{company}_Stock_Price'
    volume_col = f'{company}_Stock_Volume'
    
    # A. Relative performance (company vs market)
    if return_col in df1.columns and 'SP500_Return' in df1.columns:
        df1[f'{company}_vs_SP500'] = (
            df1[return_col] - df1['SP500_Return']
        )
        company_feature_count += 1
        
        # Beta proxy: 60-day rolling correlation with the market (not a true beta, which is cov/var)
        df1[f'{company}_Market_Beta_60'] = (
            df1[return_col].rolling(60).corr(df1['SP500_Return'])
        )
        company_feature_count += 1
    
    # B. Drawdown (distance from recent peak)
    if price_col in df1.columns:
        rolling_max = df1[price_col].rolling(252, min_periods=1).max()
        df1[f'{company}_Drawdown'] = (
            (df1[price_col] - rolling_max) / rolling_max * 100
        )
        company_feature_count += 1
    
    # C. Volume anomalies
    if volume_col in df1.columns:
        avg_volume_20 = df1[volume_col].rolling(20).mean()
        df1[f'{company}_Volume_Spike'] = (
            (df1[volume_col] / avg_volume_20 > 2).astype(int)
        )
        company_feature_count += 1

print(f"   ✅ Created {company_feature_count} company-specific features")


[6/6] Creating company-specific risk features...
   ✅ Created 100 company-specific features
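The `_Market_Beta_60` columns hold a rolling correlation, which is bounded in [-1, 1] and does not reflect how strongly a stock amplifies market moves. As a hedged comparison, the sketch below computes both the correlation and a conventional beta (covariance divided by market variance) on synthetic data where the stock moves exactly twice as much as the market:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the stock moves exactly twice as much as the market.
rng = np.random.default_rng(42)
market = pd.Series(rng.normal(0, 0.01, 120), name="SP500_Return")
stock = 2.0 * market

window = 60
rolling_corr = stock.rolling(window).corr(market)
# Beta = Cov(stock, market) / Var(market), computed over the same window.
rolling_beta = stock.rolling(window).cov(market) / market.rolling(window).var()

print(rolling_corr.iloc[-1], rolling_beta.iloc[-1])
```

The correlation comes out at about 1 while the beta is about 2, so the two measures answer different questions; which one is more useful as a feature is a modeling choice.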



6. Market Regime Features
From EDA: 9.6% of days have VIX > 30

Why: Identifies different market environments - models can learn regime-specific patterns

In [12]:
print("\nCreating market regime features...")

regime_count = 0

# A. Volatility regimes
if 'VIX' in df1.columns:
    df1['High_Volatility_Regime'] = (df1['VIX'] > 30).astype(int)
    df1['Extreme_Volatility_Regime'] = (df1['VIX'] > 40).astype(int)
    regime_count += 2

# B. Economic cycle indicators
if 'Unemployment_Rate' in df1.columns and 'GDP_Growth' in df1.columns:
    df1['Recession_Signal'] = (
        ((df1['Unemployment_Rate'] > df1['Unemployment_Rate'].rolling(60).mean()) & 
         (df1['GDP_Growth'] < 0))
    ).astype(int)
    regime_count += 1

if 'Yield_Curve_Spread' in df1.columns:
    # Note: same definition as Inverted_Yield_Curve (section 3), so the two columns are duplicates
    df1['Inverted_Curve_Signal'] = (df1['Yield_Curve_Spread'] < 0).astype(int)
    regime_count += 1

# C. Credit stress regime
if 'Corporate_Bond_Spread' in df1.columns:
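    # Note: the quantile uses the full sample, so early rows are compared against future spreads (look-ahead)
    spread_75th = df1['Corporate_Bond_Spread'].quantile(0.75)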
    df1['Tight_Credit_Regime'] = (df1['Corporate_Bond_Spread'] > spread_75th).astype(int)
    regime_count += 1

# D. Rate hike cycle
if 'Federal_Funds_Rate' in df1.columns:
    df1['Fed_Hiking_Cycle'] = (
        df1['Federal_Funds_Rate'] > df1['Federal_Funds_Rate'].shift(60)
    ).astype(int)
    regime_count += 1

print(f"   ✅ Created {regime_count} regime features")


Creating market regime features...
   ✅ Created 6 regime features
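The `Tight_Credit_Regime` threshold is the 75th percentile of the whole series, so each row is judged against information from later dates. One possible point-in-time variant, shown on a made-up spread series and not part of the original notebook, uses an expanding quantile shifted by one row:

```python
import pandas as pd

# Made-up spread series (not from the dataset).
spread = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0, 1.0], name="Corporate_Bond_Spread")

# Full-sample threshold: every row is compared against a value that uses future data.
full_sample_q75 = spread.quantile(0.75)
leaky_flag = (spread > full_sample_q75).astype(int)

# Point-in-time threshold: 75th percentile of past observations only.
# shift(1) excludes the current row from its own threshold.
past_q75 = spread.expanding(min_periods=1).quantile(0.75).shift(1)
causal_flag = (spread > past_q75).astype(int)

print(leaky_flag.tolist())
print(causal_flag.tolist())
```

The two flags disagree on the early rows, which is where a full-sample threshold leans most heavily on future data. The first row of the causal flag is 0 because it has no history to compare against.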




Phase 2:


In [13]:
phase1_features = df1.shape[1]

In [14]:
# -------------------------------------------------------------------
# 7. EXPONENTIAL MOVING AVERAGES (EMA)
# -------------------------------------------------------------------
print("\n[1/2] Creating Exponential Moving Averages (EMA)...")

# EMA gives more weight to recent observations
spans = [5, 10, 20, 60]
ema_features = ['VIX', 'SP500_Return', 'GDP_Growth', 'Unemployment_Rate']

ema_count = 0
for feature in ema_features:
    if feature in df1.columns:
        for span in spans:
            df1[f'{feature}_ema_{span}'] = df1[feature].ewm(span=span, adjust=False).mean()
            ema_count += 1

print(f"   ✅ Created {ema_count} EMA features")
print(f"      Features: {ema_features}")
print(f"      Spans: {spans}")


[1/2] Creating Exponential Moving Averages (EMA)...
   ✅ Created 16 EMA features
      Features: ['VIX', 'SP500_Return', 'GDP_Growth', 'Unemployment_Rate']
      Spans: [5, 10, 20, 60]
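For reference, `ewm(span=span, adjust=False).mean()` follows a simple recursion with smoothing factor alpha = 2 / (span + 1). The sketch below reproduces it by hand on a few toy values to show what the EMA columns contain:

```python
import pandas as pd

values = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])  # toy VIX-like numbers
span = 5
alpha = 2 / (span + 1)  # smoothing factor implied by span

# Recursion used by ewm(adjust=False): y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

pandas_ema = values.ewm(span=span, adjust=False).mean()
print(pandas_ema.round(4).tolist())
```

Because the recursion starts from the first observation, EMA columns have no leading NaNs, unlike the rolling-window features.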



In [15]:
# -------------------------------------------------------------------
# 8. RATE OF CHANGE FEATURES (Momentum/Acceleration)
# -------------------------------------------------------------------
print("\n[2/2] Creating Rate of Change features...")

# Momentum / acceleration
periods = [5, 10, 20]
roc_features = ['VIX', 'Unemployment_Rate', 'Corporate_Bond_Spread']

roc_count = 0
for feature in roc_features:
    if feature in df1.columns:
        for period in periods:
            # Absolute change
            df1[f'{feature}_change_{period}'] = df1[feature].diff(period)
            roc_count += 1
            
            # Percentage change
            df1[f'{feature}_pct_change_{period}'] = df1[feature].pct_change(period) * 100
            roc_count += 1

print(f"   ✅ Created {roc_count} Rate of Change features")
print(f"      Features: {roc_features}")
print(f"      Periods: {periods}")


[2/2] Creating Rate of Change features...
   ✅ Created 18 Rate of Change features
      Features: ['VIX', 'Unemployment_Rate', 'Corporate_Bond_Spread']
      Periods: [5, 10, 20]
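One edge case to keep in mind with `pct_change`: if a series ever equals zero, the next percentage change is infinite, and `isnull()` will not report it. Whether any of the three features above touches zero is not shown in the notebook, so the following is only a defensive sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy series that touches zero (hypothetical; a spread or rate could do this).
s = pd.Series([10.0, 0.0, 5.0, 10.0])

abs_change = s.diff(1)
pct_raw = s.pct_change(1) * 100  # row 2 divides by zero -> inf
pct_clean = pct_raw.replace([np.inf, -np.inf], np.nan)

print(pct_raw.tolist())
print(pct_clean.tolist())
```

Converting infinities to NaN lets the later missing-value handling treat them like any other gap.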



In [16]:
# Summary of Feature Engineering
print("\n" + "="*70)
print("📊 FEATURE ENGINEERING SUMMARY")
print("="*70)

# Initial state
print(f"\n🔹 Initial Features: {initial_features}")

# Phase 1 Summary
phase1_added = phase1_features - initial_features
print(f"\n{'='*70}")
print("PHASE 1: TIME-BASED & DOMAIN-SPECIFIC FEATURES")
print(f"{'='*70}")
print(f"Features after Phase 1: {phase1_features}")
print(f"Features added in Phase 1: {phase1_added}")

print("\nPhase 1 Breakdown:")
print(f"  ✓ Economic lag features: 25")
print(f"  ✓ Market stress lag features: 25")
print(f"  ✓ Rolling statistics (volatility & momentum): 174")
print(f"    - VIX rolling stats: 12")
print(f"    - Company rolling stats: 150")
print(f"    - Spread rolling stats: 12")
print(f"  ✓ Interaction features (stress composites): 7")
print(f"  ✓ Sector indices: 22")
print(f"  ✓ Company-specific risk features: 100")
print(f"  ✓ Market regime features: 6")

# Phase 2 Summary
final_features = df1.shape[1]
phase2_added = final_features - phase1_features
print(f"\n{'='*70}")
print("PHASE 2: ADVANCED TECHNICAL FEATURES")
print(f"{'='*70}")
print(f"Features after Phase 2: {final_features}")
print(f"Features added in Phase 2: {phase2_added}")

print("\nPhase 2 Breakdown:")
print(f"  ✓ Exponential Moving Averages (EMA): 16")
print(f"  ✓ Rate of Change features: 18")

# Overall Summary
total_added = final_features - initial_features
print(f"\n{'='*70}")
print("OVERALL SUMMARY")
print(f"{'='*70}")
print(f"📈 Total Features Created: {total_added}")
print(f"📊 Initial: {initial_features} → Final: {final_features}")
print(f"🔢 Percentage Increase: {(total_added/initial_features)*100:.1f}%")

# Data quality check
print(f"\n{'='*70}")
print("DATA QUALITY CHECK")
print(f"{'='*70}")
print(f"Total Rows: {len(df1):,}")
print(f"Total Columns: {df1.shape[1]}")
print(f"Memory Usage: {df1.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check for missing values
missing_summary = df1.isnull().sum()
features_with_missing = missing_summary[missing_summary > 0]
print(f"\nFeatures with Missing Values: {len(features_with_missing)}")
if len(features_with_missing) > 0:
    print(f"Total Missing Values: {features_with_missing.sum():,}")
    print(f"Max Missing in Single Feature: {features_with_missing.max():,} ({features_with_missing.idxmax()})")

# Feature categories
print(f"\n{'='*70}")
print("FEATURE CATEGORIES")
print(f"{'='*70}")

lag_features = [col for col in df1.columns if 'lag' in col.lower()]
rolling_features = [col for col in df1.columns if 'rolling' in col.lower()]
ema_features = [col for col in df1.columns if 'ema' in col.lower()]
sector_features = [col for col in df1.columns if 'sector' in col.lower()]
regime_features = [col for col in df1.columns if 'regime' in col.lower() or 'signal' in col.lower()]
interaction_features = [col for col in df1.columns if any(x in col.lower() for x in ['stress', 'divergence', 'product', 'inverted', 'curve'])]

print(f"  📌 Lag Features: {len(lag_features)}")
print(f"  📌 Rolling Window Features: {len(rolling_features)}")
print(f"  📌 EMA Features: {len(ema_features)}")
print(f"  📌 Sector Features: {len(sector_features)}")
print(f"  📌 Regime Features: {len(regime_features)}")
print(f"  📌 Interaction Features: {len(interaction_features)}")

print(f"\n{'='*70}")
print("✅ Feature Engineering Complete!")
print(f"{'='*70}\n")


📊 FEATURE ENGINEERING SUMMARY

🔹 Initial Features: 97

PHASE 1: TIME-BASED & DOMAIN-SPECIFIC FEATURES
Features after Phase 1: 456
Features added in Phase 1: 359

Phase 1 Breakdown:
  ✓ Economic lag features: 25
  ✓ Market stress lag features: 25
  ✓ Rolling statistics (volatility & momentum): 174
    - VIX rolling stats: 12
    - Company rolling stats: 150
    - Spread rolling stats: 12
  ✓ Interaction features (stress composites): 7
  ✓ Sector indices: 22
  ✓ Company-specific risk features: 100
  ✓ Market regime features: 6

PHASE 2: ADVANCED TECHNICAL FEATURES
Features after Phase 2: 490
Features added in Phase 2: 34

Phase 2 Breakdown:
  ✓ Exponential Moving Averages (EMA): 16
  ✓ Rate of Change features: 18

OVERALL SUMMARY
📈 Total Features Created: 393
📊 Initial: 97 → Final: 490
🔢 Percentage Increase: 405.2%

DATA QUALITY CHECK
Total Rows: 6,910
Total Columns: 490
Memory Usage: 25.28 MB

Features with Missing Values: 274
Total Missing Values: 7,664
Max Missing in Single Feature: 

In [17]:
# ===================================================================
# ADDITIONAL HIGH-PRIORITY FEATURES
# ===================================================================

print("\n" + "="*70)
print("🚀 ADDING HIGH-PRIORITY MISSING FEATURES")
print("="*70)

features_before = df1.shape[1]

# -------------------------------------------------------------------
# 1. ROLLING CORRELATIONS
# -------------------------------------------------------------------
print("\n[1/4] Creating rolling correlations...")

windows = [30, 60, 90]
corr_count = 0

# Market correlations
if 'SP500_Return' in df1.columns and 'VIX' in df1.columns:
    for window in windows:
        df1[f'corr_SP500_VIX_{window}'] = (
            df1['SP500_Return'].rolling(window).corr(df1['VIX'])
        )
        corr_count += 1

# Economic correlations
if 'Oil_Price' in df1.columns and 'CPI_Inflation' in df1.columns:
    for window in windows:
        df1[f'corr_OilPrice_CPI_{window}'] = (
            df1['Oil_Price'].rolling(window).corr(df1['CPI_Inflation'])
        )
        corr_count += 1

# Spread correlations (credit stress)
if 'Corporate_Bond_Spread' in df1.columns and 'High_Yield_Spread' in df1.columns:
    for window in windows:
        df1[f'corr_CorpSpread_HYSpread_{window}'] = (
            df1['Corporate_Bond_Spread'].rolling(window).corr(df1['High_Yield_Spread'])
        )
        corr_count += 1

# Sector correlations
if 'Financial_Sector_Return' in df1.columns and 'Tech_Sector_Return' in df1.columns:
    for window in windows:
        df1[f'corr_Financial_Tech_{window}'] = (
            df1['Financial_Sector_Return'].rolling(window).corr(df1['Tech_Sector_Return'])
        )
        corr_count += 1

print(f"   ✅ Created {corr_count} rolling correlation features")

# -------------------------------------------------------------------
# 2. SHOCK INDICATORS (2-Sigma Deviations)
# -------------------------------------------------------------------
print("\n[2/4] Creating shock indicators...")

features_to_monitor = [
    'Federal_Funds_Rate', 'Unemployment_Rate', 'VIX', 
    'Corporate_Bond_Spread', 'Oil_Price', 'CPI_Inflation'
]

shock_count = 0
window = 30  # 30-day baseline

for feature in features_to_monitor:
    if feature in df1.columns:
        # Calculate rolling statistics
        rolling_mean = df1[feature].rolling(window).mean()
        rolling_std = df1[feature].rolling(window).std()
        
        # Binary shock indicator (exceeds 2-sigma)
        df1[f'{feature}_Shock'] = (
            (df1[feature] - rolling_mean).abs() > 2 * rolling_std
        ).astype(int)
        shock_count += 1
        
        # Continuous deviation (how many sigmas away)
        df1[f'{feature}_Deviation'] = (df1[feature] - rolling_mean) / (rolling_std + 1e-6)
        shock_count += 1

print(f"   ✅ Created {shock_count} shock/deviation features")

# -------------------------------------------------------------------
# 3. SHARPE-LIKE RATIOS (Risk-Adjusted Returns)
# -------------------------------------------------------------------
print("\n[3/4] Creating Sharpe-like ratios...")

companies = ['AAPL', 'AMZN', 'BA', 'BAC', 'C', 'CAT', 'COST', 'CVX', 
             'DIS', 'GOOGL', 'GS', 'HD', 'JNJ', 'JPM', 'LIN', 'MCD', 
             'MSFT', 'NFLX', 'NVDA', 'PG', 'TSLA', 'UNH', 'WFC', 'WMT', 'XOM']

sharpe_count = 0

for company in companies:
    return_col = f'{company}_Stock_Return'
    if return_col in df1.columns:
        for window in [30, 60, 90]:
            mean_return = df1[return_col].rolling(window).mean()
            std_return = df1[return_col].rolling(window).std()
            
            # Sharpe-like: mean return / volatility
            df1[f'{company}_Sharpe_{window}'] = mean_return / (std_return + 1e-6)
            sharpe_count += 1

print(f"   ✅ Created {sharpe_count} Sharpe-like ratio features")

# -------------------------------------------------------------------
# 4. KEY INTERACTION FEATURES
# -------------------------------------------------------------------
print("\n[4/4] Creating key interaction features...")

interaction_count = 0

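# Stagflation-style proxy (policy rate × unemployment). Stagflation proper pairs high
# inflation with high unemployment, so CPI_Inflation may be the more direct input.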
if 'Federal_Funds_Rate' in df1.columns and 'Unemployment_Rate' in df1.columns:
    df1['Stagflation_Risk'] = df1['Federal_Funds_Rate'] * df1['Unemployment_Rate']
    interaction_count += 1

# Energy burden on growth
if 'Oil_Price' in df1.columns and 'GDP_Growth' in df1.columns:
    df1['Energy_Burden'] = df1['Oil_Price'] / (df1['GDP_Growth'].abs() + 0.01)
    interaction_count += 1

# Composite stress (Financial Stress × VIX)
if 'Financial_Stress_Index' in df1.columns and 'VIX' in df1.columns:
    df1['Market_Stress_Composite'] = df1['Financial_Stress_Index'] * df1['VIX']
    interaction_count += 1

# Co-movement indicator (day-over-day change in SP500 return × day-over-day change in VIX level)
if 'SP500_Return' in df1.columns and 'VIX' in df1.columns:
    df1['Delta_SP500'] = df1['SP500_Return'].diff()
    df1['Delta_VIX'] = df1['VIX'].diff()
    df1['CoMovement_SP500_VIX'] = df1['Delta_SP500'] * df1['Delta_VIX']
    interaction_count += 3

# Interest rate shock × Market volatility
if 'Federal_Funds_Rate_Deviation' in df1.columns and 'VIX' in df1.columns:
    df1['RateShock_MarketStress'] = df1['Federal_Funds_Rate_Deviation'].abs() * df1['VIX']
    interaction_count += 1

print(f"   ✅ Created {interaction_count} interaction features")

# ===================================================================
# SUMMARY
# ===================================================================

features_after = df1.shape[1]
new_features = features_after - features_before

print("\n" + "="*70)
print("📊 HIGH-PRIORITY FEATURES SUMMARY")
print("="*70)

print(f"\n✅ Feature Engineering Complete!")
print(f"   Features before: {features_before}")
print(f"   New features added: {new_features}")
print(f"   Total features now: {features_after}")

print(f"\n📋 Feature Breakdown:")
print(f"   • Rolling correlations: {corr_count}")
print(f"   • Shock indicators: {shock_count}")
print(f"   • Sharpe ratios: {sharpe_count}")
print(f"   • Interactions: {interaction_count}")
print("   " + "─" * 35)
print(f"   Total added: {corr_count + shock_count + sharpe_count + interaction_count}")


🚀 ADDING HIGH-PRIORITY MISSING FEATURES

[1/4] Creating rolling correlations...
   ✅ Created 12 rolling correlation features

[2/4] Creating shock indicators...
   ✅ Created 12 shock/deviation features

[3/4] Creating Sharpe-like ratios...
   ✅ Created 75 Sharpe-like ratio features

[4/4] Creating key interaction features...
   ✅ Created 7 interaction features

📊 HIGH-PRIORITY FEATURES SUMMARY

✅ Feature Engineering Complete!
   Features before: 490
   New features added: 106
   Total features now: 596

📋 Feature Breakdown:
   • Rolling correlations: 12
   • Shock indicators: 12
   • Sharpe ratios: 75
   • Interactions: 7
   ───────────────────────────────────
   Total added: 106
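A detail of the shock indicators: the rolling mean and std include the current observation, which pulls the baseline toward the shock and caps the achievable z-score at (n - 1) / sqrt(n) for an n-row window (about 5.3 for n = 30). A variant that builds the baseline from the previous rows only is sketched below on toy numbers; the short window of 3 is chosen only to make the effect visible and is not the notebook's setting:

```python
import pandas as pd

# Toy series: small wobble, then a jump on the last row (made-up numbers).
x = pd.Series([1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 5.0])
window = 3  # short window to keep the example readable

# Baseline that includes the current row (same pattern as the shock cell).
z_inclusive = (x - x.rolling(window).mean()) / x.rolling(window).std()

# Baseline built from the previous `window` rows only.
past = x.shift(1)
z_lagged = (x - past.rolling(window).mean()) / past.rolling(window).std()

shock_inclusive = (z_inclusive.abs() > 2).astype(int)
shock_lagged = (z_lagged.abs() > 2).astype(int)

print(round(z_inclusive.iloc[-1], 2), round(z_lagged.iloc[-1], 2))
```

With a 3-row inclusive window the jump can never exceed 2 sigma, so it goes unflagged, while the lagged baseline flags it. With the notebook's 30-day window the dampening is milder, but the same mechanism applies.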




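Assigning many columns one at a time, as the feature cell does with `df1[f'corr_SP500_VIX_{window}'] = ...`, can leave a DataFrame fragmented, and pandas may emit a PerformanceWarning about it. A minimal sketch of an alternative is to collect the new columns in a dict and attach them with one `pd.concat`. The window sizes, the use of `SP500_Return` for the correlation, and the synthetic data are assumptions; only the deviation formula is taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "SP500_Return": rng.normal(0, 0.01, 300),
    "VIX": rng.normal(20, 5, 300),
})

windows = (30, 60, 90)  # assumed; the notebook's actual windows are not shown here
new_cols = {}

# Rolling correlation between two existing columns, one feature per window.
for window in windows:
    new_cols[f"corr_SP500_VIX_{window}"] = (
        demo["SP500_Return"].rolling(window).corr(demo["VIX"])
    )

# Z-score style deviation, same form as the notebook's {feature}_Deviation.
feature, window = "VIX", 60
rolling_mean = demo[feature].rolling(window).mean()
rolling_std = demo[feature].rolling(window).std()
new_cols[f"{feature}_Deviation"] = (demo[feature] - rolling_mean) / (rolling_std + 1e-6)

# One concat instead of many single-column inserts.
demo = pd.concat([demo, pd.DataFrame(new_cols, index=demo.index)], axis=1)
print(demo.shape)
```
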
In [18]:
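# index=False: the default RangeIndex carries no information and would add another unnamed column on reload
df1.to_csv('financial_data_engineered_complete.csv', index=False)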

print("✅ File saved: financial_data_engineered_complete.csv")

✅ File saved: financial_data_engineered_complete.csv


In [19]:
missing_cols = df1.isnull().sum()
missing_cols = missing_cols[missing_cols > 0]   # filter only columns with missing values

for col, count in missing_cols.items():
    print(f"{col}: {count} missing values")

GDP_Growth_lag1: 1 missing values
GDP_Growth_lag5: 5 missing values
GDP_Growth_lag10: 10 missing values
GDP_Growth_lag20: 20 missing values
GDP_Growth_lag60: 60 missing values
Unemployment_Rate_lag1: 1 missing values
Unemployment_Rate_lag5: 5 missing values
Unemployment_Rate_lag10: 10 missing values
Unemployment_Rate_lag20: 20 missing values
Unemployment_Rate_lag60: 60 missing values
CPI_Inflation_lag1: 1 missing values
CPI_Inflation_lag5: 5 missing values
CPI_Inflation_lag10: 10 missing values
CPI_Inflation_lag20: 20 missing values
CPI_Inflation_lag60: 60 missing values
Federal_Funds_Rate_lag1: 1 missing values
Federal_Funds_Rate_lag5: 5 missing values
Federal_Funds_Rate_lag10: 10 missing values
Federal_Funds_Rate_lag20: 20 missing values
Federal_Funds_Rate_lag60: 60 missing values
Consumer_Confidence_lag1: 1 missing values
Consumer_Confidence_lag5: 5 missing values
Consumer_Confidence_lag10: 10 missing values
Consumer_Confidence_lag20: 20 missing values
Consumer_Confidence_lag60: 60 missing values
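
The pattern in these counts (a lag of N has exactly N missing values) is what a row shift produces: the first N rows have no earlier observation to draw from. The sketch below reproduces it on a toy column, assuming the lag features were built with `Series.shift`; the values are made up.

```python
import pandas as pd

demo = pd.DataFrame({"GDP_Growth": [2.1, 2.3, 2.2, 2.4, 2.6, 2.5, 2.7]})

# shift(n) moves values down n rows, leaving the first n rows as NaN.
for lag in (1, 5):
    demo[f"GDP_Growth_lag{lag}"] = demo["GDP_Growth"].shift(lag)

missing = demo.isnull().sum()
missing = missing[missing > 0]
for col, count in missing.items():
    print(f"{col}: {count} missing values")
```

Because these gaps sit at the very start of the series, a forward fill alone cannot close them.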

In [20]:
df1_clean = df1.copy()
# ----------------------------------------
# STRATEGY 1: Forward Fill - Economic & Market Indicators
# ----------------------------------------
# These change slowly, so carrying forward is realistic
economic_features = [col for col in df1.columns if any(x in col for x in [
    'GDP', 'CPI', 'Unemployment', 'Federal_Funds', 'Consumer_Confidence',
    'Treasury', 'Oil_Price', 'Trade_Balance'
])]

print("\n[1/4] Forward filling economic indicators...")
# fillna(method='ffill') is deprecated in recent pandas; .ffill() is the equivalent call
df1_clean[economic_features] = df1_clean[economic_features].ffill()
print(f"   Applied to {len(economic_features)} features")


[1/4] Forward filling economic indicators...
   Applied to 59 features



In [21]:
df1_clean = df1_clean.rename(columns={'Unnamed: 0': 'Date'})
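
A column named 'Unnamed: 0' usually appears when a DataFrame is written with an unnamed index and then read back without `index_col`. That this is how the merged CSV was produced is an assumption; the sketch below only shows the mechanism on toy data, along with one way to avoid the later rename.

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {"VIX": [18.2, 19.5, 17.9]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# Default to_csv writes the unnamed index as a first column with an empty header.
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(reloaded.columns.tolist())

# Naming the index on write and parsing it on read keeps the dates as the index.
buf = io.StringIO()
demo.rename_axis("Date").to_csv(buf)
buf.seek(0)
restored = pd.read_csv(buf, index_col="Date", parse_dates=True)
print(restored.index.name, restored.index.dtype)
```
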

In [22]:
# ----------------------------------------
# STRATEGY 2: Forward Fill - Lag Features
# ----------------------------------------
# Lag columns are missing their first N rows (see the counts above);
# ffill has no earlier value to copy there, so Strategy 4 fills those rows
lag_features = [col for col in df1.columns if '_lag' in col]

print("\n[2/4] Forward filling lag features...")
df1_clean[lag_features] = df1_clean[lag_features].ffill()
print(f"   Applied to {len(lag_features)} features")


[2/4] Forward filling lag features...
   Applied to 50 features



In [23]:
# ----------------------------------------
# STRATEGY 3: Forward Fill - Rolling Stats
# ----------------------------------------
# Rolling windows need initial warm-up period
rolling_features = [col for col in df1.columns if any(x in col for x in [
    'rolling_mean', 'rolling_std', 'rolling_max', 'volatility', 'momentum'
])]

print("\n[3/4] Forward filling rolling statistics...")
df1_clean[rolling_features] = df1_clean[rolling_features].ffill()
print(f"   Applied to {len(rolling_features)} features")


[3/4] Forward filling rolling statistics...
   Applied to 174 features
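
Because the three groups are selected by substring, a column can land in more than one of them: a name like GDP_Growth_lag1 matches both 'GDP' and '_lag'. All three strategies apply the same forward fill, so the overlap is harmless here, but it would matter if the strategies ever diverged. A small sketch for checking the overlap follows; the pattern lists are copied from the cells above, while the column names in the toy index are hypothetical.

```python
import pandas as pd

columns = pd.Index([
    "GDP_Growth", "GDP_Growth_lag1", "Oil_Price_rolling_mean_20",  # hypothetical names
    "VIX", "VIX_lag5",
])

patterns = {
    "economic": ["GDP", "CPI", "Unemployment", "Federal_Funds",
                 "Consumer_Confidence", "Treasury", "Oil_Price", "Trade_Balance"],
    "lag": ["_lag"],
    "rolling": ["rolling_mean", "rolling_std", "rolling_max", "volatility", "momentum"],
}

# Same substring test as the notebook, collected into one set per group.
groups = {
    name: {col for col in columns if any(p in col for p in pats)}
    for name, pats in patterns.items()
}

# Columns the economic group shares with the lag or rolling groups.
overlap = (groups["economic"] & groups["lag"]) | (groups["economic"] & groups["rolling"])
print(sorted(overlap))
```
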



In [24]:
# ----------------------------------------
# STRATEGY 4: Forward + Backward Fill - Everything Else
# ----------------------------------------
print("\n[4/4] Filling remaining features...")
# bfill fills leading NaNs (lag and rolling warm-up rows) with the first later valid value
df1_clean = df1_clean.ffill().bfill()

# Final dropna for any stubborn NaN
#rows_before = len(df1_clean)
#df1_clean = df1_clean.dropna()
#rows_after = len(df1_clean)

print(f"\n✅ Cleaning complete!")
print(f"   Original rows: {len(df1):,}")
#print(f"   Final rows: {rows_after:,}")
#print(f"   Rows dropped: {rows_before - rows_after:,}")
#print(f"   Retention: {(rows_after/len(df1))*100:.2f}%")
print(f"   Missing values: {df1_clean.isna().sum().sum()}")


[4/4] Filling remaining features...

✅ Cleaning complete!
   Original rows: 6,910
   Missing values: 0
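
Zero missing values here comes from the backward fill, not the forward fills: leading NaNs have nothing earlier to copy from. One consequence worth noting is that a back-filled rolling statistic in the first rows is computed from later observations. The sketch below shows this on toy data and contrasts it with dropping the warm-up rows, which is what the commented-out `dropna` lines would do. All names and values are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo["x_lag2"] = demo["x"].shift(2)            # leading NaNs in rows 0-1
demo["x_roll3"] = demo["x"].rolling(3).mean()  # warm-up NaNs in rows 0-1

# Forward fill has nothing earlier to copy, so the leading NaNs survive.
after_ffill = demo.ffill()
print(after_ffill.isna().sum().sum())

# Backward fill copies the first valid value back in time:
# row 0 of x_roll3 becomes the mean of rows 0-2, which includes later rows.
after_bfill = after_ffill.bfill()
print(after_bfill["x_roll3"].tolist())

# Alternative: drop the warm-up rows instead of back-filling them.
trimmed = demo.dropna().reset_index(drop=True)
print(len(demo), "->", len(trimmed))
```

Whether the back-filled rows matter depends on how the data is later split; with a 60-row warm-up out of 6,910 rows, dropping them costs little.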



In [26]:
# Date is already a regular column, so skip writing the RangeIndex
df1_clean.to_csv('all_features.csv', index=False)

print("✅ File saved: all_features.csv")

✅ File saved: all_features.csv


In [27]:
df1_clean.columns

Index(['Date', 'GDP_Growth', 'CPI_Inflation', 'Unemployment_Rate',
       'Federal_Funds_Rate', 'Yield_Curve_Spread', 'Consumer_Confidence',
       'Oil_Price', 'Trade_Balance', 'Corporate_Bond_Spread',
       ...
       'XOM_Sharpe_30', 'XOM_Sharpe_60', 'XOM_Sharpe_90', 'Stagflation_Risk',
       'Energy_Burden', 'Market_Stress_Composite', 'Delta_SP500', 'Delta_VIX',
       'CoMovement_SP500_VIX', 'RateShock_MarketStress'],
      dtype='object', length=596)

In [28]:
df1_clean.shape

(6910, 596)