# S&P 500 Returns Prediction — Comprehensive Analysis & Pipeline


**Note:**
This notebook is a cleaned-up version of the original notebook [Hull Tactic: No Overfitting / LEAKAGE](https://www.kaggle.com/code/ahsuna123/hull-tactic-no-overfitting-leakage?scriptVersionId=269940868), with known bugs fixed.


**Reference:** Insights and explorations are derived from the EDA notebook [Hull Tactical: Complete EDA Deep Dive](https://www.kaggle.com/code/ahsuna123/hull-tactical-complete-eda-deep-dive).

---

## 1. Dataset Overview
- **Rows:** 8,990  
- **Features:** 98 (94 numeric, 4 special)  
- **Target:** `forward_returns`  
- **Feature Categories:**
  - **Market Dynamics:** M1–M18  
  - **Macro Economic:** E1–E20  
  - **Interest Rate:** I1–I9  
  - **Price Valuation:** P1–P13  
  - **Volatility:** V1–V13  
  - **Sentiment:** S1–S12  
  - **Dummy Binary:** D1–D9  
  - **Special:** 4  

✅ Rich in macroeconomic and market features; some categories show **high internal correlations**.
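
The category prefixes above can be used to group columns programmatically. The following is a minimal sketch, assuming column names follow the `<letter><number>` pattern listed above; `CATEGORY_BY_PREFIX` and `group_by_category` are illustrative names and not part of the pipeline.

```python
import re
from collections import defaultdict

# Prefix -> category name, taken from the list above.
CATEGORY_BY_PREFIX = {
    "M": "Market_Dynamics",
    "E": "Macro_Economic",
    "I": "Interest_Rate",
    "P": "Price_Valuation",
    "V": "Volatility",
    "S": "Sentiment",
    "D": "Dummy_Binary",
}


def group_by_category(columns):
    """Group column names such as 'M1' or 'E12' by their category prefix."""
    groups = defaultdict(list)
    for col in columns:
        match = re.fullmatch(r"([A-Z])(\d+)", col)
        if match and match.group(1) in CATEGORY_BY_PREFIX:
            groups[CATEGORY_BY_PREFIX[match.group(1)]].append(col)
        else:
            # Anything that is not <letter><number> (date_id, target, ...) is "Special".
            groups["Special"].append(col)
    # Sort numerically so that M2 comes before M10.
    for name in groups:
        if name != "Special":
            groups[name].sort(key=lambda c: int(c[1:]))
    return dict(groups)


example = group_by_category(["date_id", "M10", "M2", "E1", "V3", "forward_returns"])
print(example)
```

Sorting numerically matters here: a plain `sorted()` on the names would place `M10` before `M2`.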

---

## 2. Target Variable Insights
- `forward_returns`:
  - Not normally distributed (Jarque-Bera p-value ≈ 0)  
  - Stationary (ADF test p-value ≈ 0)  
  - Weak autocorrelation: lag 1 ≈ -0.045, lag 5 ≈ -0.024  
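  - Skewness ≈ -0.176, Kurtosis ≈ 2.19 → slightly platykurtic if this is Pearson kurtosis (normal = 3); if it is excess kurtosis (normal = 0, the pandas default), it indicates heavier tails than normal  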
  - Maximum drawdown ≈ -0.492  

**Implications:**  
- Use robust methods (tree-based, quantile regression).  
- Temporal features can be leveraged due to stationarity.
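
The diagnostics above can be reproduced along the following lines. This is a sketch on synthetic data rather than the EDA notebook's exact code: the helper names are illustrative, and the drawdown is computed on a compounded return path, which is an assumption about how the figure above was derived.

```python
import numpy as np
import pandas as pd


def describe_returns(returns: pd.Series) -> dict:
    """Summarize shape and short-range dependence of a return series."""
    r = returns.dropna()
    return {
        "skewness": float(r.skew()),
        # pandas reports excess (Fisher) kurtosis: a normal distribution gives ~0.
        "excess_kurtosis": float(r.kurt()),
        "autocorr_lag1": float(r.autocorr(lag=1)),
        "autocorr_lag5": float(r.autocorr(lag=5)),
    }


def max_drawdown(returns: pd.Series) -> float:
    """Largest peak-to-trough loss of the compounded return path."""
    wealth = (1.0 + returns.fillna(0.0)).cumprod()
    running_peak = wealth.cummax()
    return float((wealth / running_peak - 1.0).min())


# Synthetic stand-in for `forward_returns`; real values come from train.csv.
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.normal(0.0, 0.01, size=2000))
stats = describe_returns(synthetic)
mdd = max_drawdown(synthetic)
print(stats, mdd)
```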

---

## 3. Feature Correlation & Stability
- Most features are only weakly correlated with the target (absolute correlation < 0.07).  
- High multicollinearity within:
  - Macro_Economic, Interest_Rate, Price_Valuation, Volatility, Sentiment.  
- Feature stability over time:
  - **Stable:** Dummy_Binary, M1  
  - **Unstable:** Macro_Economic (E11, E12), Price_Valuation (P12), Volatility & Sentiment (V10–V12, S12)  

**Implications:**  
- Non-linear models preferred.  
- Unstable features should be transformed (lag/rolling means, smoothing).
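
One simple way to check stability is to compare a feature's correlation with the target across consecutive time blocks. The sketch below uses synthetic data with deliberately exaggerated correlations so the effect is visible; `blockwise_target_corr` is a hypothetical helper, and the EDA notebook may measure stability differently.

```python
import numpy as np
import pandas as pd


def blockwise_target_corr(df, feature, target, n_blocks=4):
    """Pearson correlation between `feature` and `target` in consecutive time blocks."""
    blocks = np.array_split(np.arange(len(df)), n_blocks)
    return pd.Series(
        [df[feature].iloc[idx].corr(df[target].iloc[idx]) for idx in blocks],
        name=feature,
    )


rng = np.random.default_rng(1)
n = 2000
y = rng.normal(size=n)
stable = y + rng.normal(scale=3.0, size=n)            # consistent link to the target
sign = np.where(np.arange(n) < n // 2, 1.0, -1.0)
unstable = sign * y + rng.normal(scale=3.0, size=n)   # link flips sign halfway through
demo = pd.DataFrame({"target": y, "stable": stable, "unstable": unstable})

stable_corr = blockwise_target_corr(demo, "stable", "target")
unstable_corr = blockwise_target_corr(demo, "unstable", "target")
print(stable_corr.round(2).tolist(), unstable_corr.round(2).tolist())
```

A feature whose block correlations change sign, like `unstable` here, can look uncorrelated over the full sample even though it carries signal within each regime.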

---

## 4. Outlier Analysis
- Significant outliers in Macro_Economic and Price_Valuation features.  
- Recommendation: Winsorization, clipping, or robust scaling.
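
A minimal sketch of quantile clipping, assuming bounds are learned on training data only and then reused for new rows. The 5%/95% defaults mirror what `_fit_preprocessors` uses later in this notebook; the helper names are illustrative.

```python
import pandas as pd


def fit_clip_bounds(train: pd.DataFrame, lower: float = 0.05, upper: float = 0.95) -> dict:
    """Per-column quantile bounds, computed on training data only."""
    return {
        col: (train[col].quantile(lower), train[col].quantile(upper))
        for col in train.columns
    }


def apply_clip_bounds(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Clip each known column to the bounds learned from training data."""
    out = df.copy()
    for col, (low, high) in bounds.items():
        if col in out.columns:
            out[col] = out[col].clip(low, high)
    return out


train = pd.DataFrame({"E1": [float(v) for v in range(101)]})  # values 0..100
new = pd.DataFrame({"E1": [-50.0, 42.0, 500.0]})
bounds = fit_clip_bounds(train)
clipped = apply_clip_bounds(new, bounds)
print(bounds, clipped["E1"].tolist())
```

Keeping the fit and apply steps separate avoids recomputing bounds on test data, which would leak information from the evaluation period.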

---

## 5. Feature Engineering Strategy
- **Lag Features:** first 10 M, 5 D, and 5 V columns in lexicographic order (M1 and M10–M18; D1–D5; V1 and V10–V13) → lag1, lag3  
- **Rolling Features:** lag-1 of those volatility columns → rolling mean/std (window=5)  
- **Interaction Features:** Selected pairs between Macro_Economic and Market_Dynamics features  
- **Outlier Handling:** Clip each feature to its 5th–95th percentile range, with bounds computed on training data  
- **Dimensionality Reduction:** PCA (n_components=50) to reduce high correlation noise  
- **Feature Selection:** VarianceThreshold (threshold=1e-5) drops near-constant features before scaling and PCA
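
The lag and rolling steps can be sketched as follows. This is a simplified stand-in for `_create_time_features`, assuming a `date_id` column for ordering; `add_lag_and_rolling` is a hypothetical helper, and unlike the pipeline it keeps the raw columns.

```python
import pandas as pd


def add_lag_and_rolling(df, cols, lags=(1, 3), window=5):
    """Append lagged copies of `cols` plus rolling stats of the lag-1 series."""
    out = df.sort_values("date_id").reset_index(drop=True)
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
        # Roll over the previous day's value so a row never sees its own observation.
        prev = out[col].shift(1)
        out[f"{col}_roll_mean_{window}"] = prev.rolling(window, min_periods=1).mean()
        out[f"{col}_roll_std_{window}"] = (
            prev.rolling(window, min_periods=1).std().fillna(0.0)
        )
    return out


demo = pd.DataFrame({"date_id": [3, 1, 2, 4], "V1": [30.0, 10.0, 20.0, 40.0]})
feats = add_lag_and_rolling(demo, ["V1"], lags=(1,), window=2)
print(feats)
```

The first row has no history, so its lag and rolling mean are `NaN`; the pipeline fills such gaps later with training medians.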

---

## 6. Modeling Approach
- **Ensemble:** LGBM + XGBoost + Random Forest  
- **Hyperparameters:** Reduced for fast inference (n_estimators = 200, shallow trees: `max_depth=3` for XGBoost, `max_depth=6` for RF, `num_leaves=15` for LGBM)  
- **Weighting:** LGBM 0.4, XGB 0.35, RF 0.25 (fixed weights; the code does not renormalize them)  
- **Scaling:** RobustScaler applied to features  
- **Training/Inference Flow:**
  - Fit scalers and PCA during training  
  - Handle outliers for each batch  
  - Create lag/rolling/interaction features consistently for training & prediction  
  - Maintain recent feature history for single-row lag computations  

**Advantages:**
- Robust to outliers  
- Handles weakly correlated features and non-linear effects  
- Maintains consistent feature structure between training and inference
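
The weighted blend amounts to a weighted sum of the per-model predictions. A minimal sketch with made-up prediction values; `blend_predictions` is an illustrative helper that mirrors the loop in `predict_batch`.

```python
import numpy as np


def blend_predictions(preds_by_model: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-model prediction arrays; zero-weight models are skipped."""
    names = [n for n in preds_by_model if weights.get(n, 0.0) > 0]
    if not names:
        raise ValueError("no model has a positive weight")
    blended = np.zeros_like(np.asarray(preds_by_model[names[0]], dtype=float))
    for name in names:
        blended += weights[name] * np.asarray(preds_by_model[name], dtype=float)
    return blended


weights = {"lgb": 0.4, "xgb": 0.35, "rf": 0.25}
# Made-up predictions for two rows, for illustration only.
preds = {
    "lgb": np.array([0.001, -0.002]),
    "xgb": np.array([0.002, -0.001]),
    "rf": np.array([0.000, -0.004]),
}
blended = blend_predictions(preds, weights)
print(blended)
```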

---

## 7. Pipeline Usage
- **Training:** `pipeline.fit_from_file(train_path)`  
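- **Prediction:** `predict(test)` accepts a Polars (or Pandas) DataFrame and returns one clipped float signal; `pipeline.predict_batch(df)` returns raw predictions for a Pandas batch  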
- **Saving/Loading:** `pipeline.save_model('sp500_model.pkl')` / `pipeline.load_model('sp500_model.pkl')`  
- **Server Deployment:** `kaggle_evaluation.default_inference_server.DefaultInferenceServer(predict)`  

**Note:** Polars is used for batch input; features are converted to Pandas internally.
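
The `predict` function defined later in this notebook maps the predicted return to a signal in [0, 2] using a scale of 50 and an offset of 1. The sketch below isolates that mapping; `return_to_signal` is an illustrative name, and the scale is simply the constant used in the code, not a tuned value.

```python
import numpy as np


def return_to_signal(predicted_return: float, scale: float = 50.0) -> float:
    """Map a predicted return to a signal in [0, 2], centred on 1."""
    return float(np.clip(predicted_return * scale + 1.0, 0.0, 2.0))


for r in (-0.03, 0.0, 0.01, 0.05):
    print(r, return_to_signal(r))
```

With this scale, any predicted return at or above 0.02 saturates at 2 and any at or below -0.02 saturates at 0.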

---

## 8. Key Recommendations
1. Start modeling with **tree-based ensembles** (LGBM, XGBoost, RF)  
2. Use **PCA or feature selection** to reduce dimensionality and multicollinearity  
3. Include **lag and rolling features** for unstable categories  
4. Apply **robust scaling/clipping** to mitigate extreme outliers  
5. Implement **time-series aware validation** (avoid random splits)  
6. Monitor **feature stability** for future iterations and potential new engineered features
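
For recommendation 5, scikit-learn's `TimeSeriesSplit` gives expanding-window folds in which every validation block lies strictly after its training block. A minimal sketch on placeholder data, assuming rows are already sorted by `date_id`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for 20 date-ordered rows
splitter = TimeSeriesSplit(n_splits=4)

folds = []
for train_idx, valid_idx in splitter.split(X):
    folds.append((train_idx, valid_idx))
    print(f"train 0..{train_idx[-1]}  ->  validate {valid_idx[0]}..{valid_idx[-1]}")
```

Preprocessors such as clip bounds, scalers, and PCA should be refit inside each fold on the training indices only, otherwise the validation score is optimistic.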

---

**Summary:**  
This pipeline integrates insights from EDA, feature stability analysis, and target characteristics into an end-to-end, deployable model for predicting S&P 500 forward returns.


In [None]:
import os
import pickle
from typing import Dict, Tuple, List

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import lightgbm as lgb
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

import kaggle_evaluation.default_inference_server


class LeakSafeSP500Pipeline:
    """
    Leakage-free, regularized pipeline for time-series S&P500 forward_returns.
    """

    def __init__(
        self,
        selected_lag_prefixes: Dict[str, Tuple[int]] = None,
        rolling_window: int = 5,
        use_pca: bool = True,
        pca_components: int = 50,
        var_threshold: float = 1e-5,
        random_state: int = 42,
        n_history: int = 10,
    ):
        self.random_state = random_state
        self.rolling_window = rolling_window
        self.use_pca = use_pca
        self.pca_components = pca_components
        self.var_threshold = var_threshold
        self.n_history = n_history

        if selected_lag_prefixes is None:
            self.selected_lag_prefixes = {
                'M': (1, 3),
                'D': (1, 3),
                'V': (1, 3),
            }
        else:
            self.selected_lag_prefixes = selected_lag_prefixes

        self.models = {
            'lgb': lgb.LGBMRegressor(
                objective='regression', metric='rmse', num_leaves=15,
                learning_rate=0.01, n_estimators=200, min_child_samples=50,
                subsample=0.7, colsample_bytree=0.7, reg_alpha=1.0, reg_lambda=1.0,
                random_state=self.random_state,
            ),
            'xgb': xgb.XGBRegressor(
                objective='reg:squarederror', n_estimators=200, max_depth=3,
                learning_rate=0.01, min_child_weight=10, subsample=0.7,
                colsample_bytree=0.7, reg_alpha=1.0, reg_lambda=1.0,
                random_state=self.random_state, verbosity=0,
            ),
            'rf': RandomForestRegressor(
                n_estimators=200, max_depth=6, min_samples_leaf=50,
                max_features='sqrt', n_jobs=-1, random_state=self.random_state,
            )
        }

        self.model_weights = {'lgb': 0.4, 'xgb': 0.35, 'rf': 0.25}

        self.clip_bounds = None
        self.medians = None
        self.scaler = None
        self.pca = None
        self.var_selector = None
        self.feature_names = None

        self.is_fitted = False
        self.history_buffer = None

    # ---------------------------
    # Feature engineering helpers
    # ---------------------------
    def _sorted_df(self, df: pd.DataFrame) -> pd.DataFrame:
        if 'date_id' in df.columns:
            return df.sort_values('date_id').reset_index(drop=True)
        else:
            return df.reset_index(drop=True)

    def _select_lag_columns(self, df_cols: List[str]) -> Dict[str, List[str]]:
        selected = {}
        for prefix in self.selected_lag_prefixes.keys():
            matched = [c for c in df_cols if c.startswith(prefix)]
            if prefix == 'M':
                matched = sorted(matched)[:10]
            elif prefix == 'D':
                matched = sorted(matched)[:5]
            elif prefix == 'V':
                matched = sorted(matched)[:5]
            selected[prefix] = matched
        return selected

    def _create_time_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Create lag and rolling features and drop raw columns to prevent mismatch.
        """
        df = df.copy()
        df = self._sorted_df(df)

        # identify columns to lag (excluding target)
        raw_cols = [c for c in df.columns if not c.startswith(('date_id', 'forward_returns', 'is_scored'))]
        matched = self._select_lag_columns(raw_cols)

        # Lags
        for prefix, cols in matched.items():
            lags = self.selected_lag_prefixes.get(prefix, ())
            for col in cols:
                for lag in lags:
                    df[f'lagged_{col}_lag{lag}'] = df[col].shift(lag)
            # drop original columns after lagging
            df.drop(cols, axis=1, inplace=True)

        # Rolling features for volatility
        vol_cols = matched.get('V', [])
        for col in vol_cols:
            df[f'lagged_{col}_roll_mean_{self.rolling_window}'] = (
                df[f'lagged_{col}_lag1'].rolling(self.rolling_window, min_periods=1).mean()
            )
            df[f'lagged_{col}_roll_std_{self.rolling_window}'] = (
                df[f'lagged_{col}_lag1'].rolling(self.rolling_window, min_periods=1).std().fillna(0.0)
            )

        # Interaction features (using lagged only)
        macro_cols = [c for c in df.columns if c.startswith('E')]
        market_cols = [c for c in df.columns if c.startswith('lagged_M')]
        for e in macro_cols[:3]:
            for m in market_cols[:3]:
                df[f'{e}_x_{m}'] = df[e].shift(1) * df[m].shift(1)

        return df

    # ---------------------------
    # Preprocessing (train-only)
    # ---------------------------
    def _fit_preprocessors(self, train_df: pd.DataFrame) -> Dict:
        X = train_df.copy()
        # Exclude identifiers, the target, and columns that are unavailable at inference time.
        feature_cols = [c for c in X.columns if not c.startswith(('date_id', 'forward_returns', 'risk_free_rate', 'market_forward_excess_returns', 'is_scored'))]
        X = X[feature_cols].copy()

        clip_bounds = {col: X[col].quantile([0.05, 0.95]).values for col in X.columns}
        for col in X.columns:
            q05, q95 = clip_bounds[col]
            X[col] = X[col].clip(q05, q95)

        medians = X.median()
        X = X.fillna(medians).fillna(0)

        var_selector = VarianceThreshold(threshold=self.var_threshold)
        X_var = var_selector.fit_transform(X)

        scaler = RobustScaler()
        X_scaled = scaler.fit_transform(X_var)

        pca = None
        if self.use_pca:
            n_comp = min(self.pca_components, X_scaled.shape[1])
            pca = PCA(n_components=n_comp, random_state=self.random_state)
            X_reduced = pca.fit_transform(X_scaled)
            final_feature_names = [f'pca_{i}' for i in range(X_reduced.shape[1])]
        else:
            X_reduced = X_scaled
            sel_mask = var_selector.get_support()
            final_feature_names = [c for c, m in zip(X.columns, sel_mask) if m]

        preprocessors = {
            'clip_bounds': clip_bounds,
            'medians': medians,
            'var_selector': var_selector,
            'scaler': scaler,
            'pca': pca,
            'feature_names': final_feature_names,
        }
        return preprocessors

    def _transform_with_preprocessors(self, df: pd.DataFrame, preprocessors: Dict) -> pd.DataFrame:
        df = df.copy()
        # Use the exact feature columns (and order) seen during fitting;
        # clip_bounds is keyed by them in insertion order.
        feature_cols = list(preprocessors['clip_bounds'].keys())
        X = df[feature_cols].copy()

        clip_bounds = preprocessors['clip_bounds']
        for col in X.columns:
            if col in clip_bounds:
                q05, q95 = clip_bounds[col]
                X[col] = X[col].clip(q05, q95)

        medians = preprocessors['medians']
        X = X.fillna(medians).fillna(0)

        var_selector = preprocessors['var_selector']
        X_var = var_selector.transform(X)

        scaler = preprocessors['scaler']
        X_scaled = scaler.transform(X_var)

        pca = preprocessors.get('pca', None)
        if pca is not None:
            X_final = pca.transform(X_scaled)
            col_names = preprocessors['feature_names']
            X_df = pd.DataFrame(X_final, columns=col_names, index=df.index)
        else:
            col_mask = var_selector.get_support()
            selected_cols = [c for c, m in zip(feature_cols, col_mask) if m]
            X_df = pd.DataFrame(X_scaled, columns=selected_cols, index=df.index)

        return X_df

    # ---------------------------
    # Fit / predict
    # ---------------------------
    def fit_from_file(self, train_path: str, target_col: str = 'forward_returns'):
        df_raw = pd.read_csv(train_path)
        df_raw = self._sorted_df(df_raw)

        df_feats_full = self._create_time_features(df_raw)
        final_preprocessors = self._fit_preprocessors(df_feats_full)

        self.clip_bounds = final_preprocessors['clip_bounds']
        self.medians = final_preprocessors['medians']
        self.var_selector = final_preprocessors['var_selector']
        self.scaler = final_preprocessors['scaler']
        self.pca = final_preprocessors['pca']
        self.feature_names = final_preprocessors['feature_names']

        X_full = self._transform_with_preprocessors(df_feats_full, final_preprocessors)
        y_full = df_feats_full[target_col].fillna(0).reset_index(drop=True)

        for name, model in self.models.items():
            model.fit(X_full, y_full)

        if 'date_id' in df_raw.columns:
            self.history_buffer = df_raw.sort_values('date_id').tail(self.n_history).reset_index(drop=True)
        else:
            self.history_buffer = df_raw.tail(self.n_history).reset_index(drop=True)

        self.is_fitted = True
        print("Training completed.")
        return self

    def predict_batch(self, df_batch: pd.DataFrame) -> np.ndarray:
        if not self.is_fitted:
            raise ValueError("Pipeline not fitted. Call fit_from_file() first.")
        df_input = df_batch.copy().reset_index(drop=True)
        # Prepend the stored training tail so lags and rolling stats are defined for short batches.
        if len(df_input) < self.n_history and self.history_buffer is not None:
            hist = self.history_buffer.copy()
            cols_inter = [c for c in hist.columns if c in df_input.columns or c in ['date_id']]
            hist = hist[cols_inter]
            df_input = pd.concat([hist, df_input], ignore_index=True)
        df_feats = self._create_time_features(df_input)
        preprocessors = {
            'clip_bounds': self.clip_bounds,
            'medians': self.medians,
            'var_selector': self.var_selector,
            'scaler': self.scaler,
            'pca': self.pca,
            'feature_names': self.feature_names,
        }
        X_all = self._transform_with_preprocessors(df_feats, preprocessors)
        # Keep only the rows that belong to the incoming batch.
        X_preds = X_all.iloc[-len(df_batch):].reset_index(drop=True)
        preds = np.zeros(len(X_preds))
        for name, model in self.models.items():
            w = self.model_weights.get(name, 0.0)
            if w <= 0:
                continue
            preds += w * model.predict(X_preds)

        return preds if len(preds) > 1 else preds.flatten()[0]

    def save_model(self, filepath: str = 'leaksafe_sp500_pipeline.pkl'):
        data = {
            'models': self.models,
            'model_weights': self.model_weights,
            'clip_bounds': self.clip_bounds,
            'medians': self.medians,
            'var_selector': self.var_selector,
            'scaler': self.scaler,
            'pca': self.pca,
            'feature_names': self.feature_names,
            'is_fitted': self.is_fitted,
            'history_buffer': self.history_buffer,
        }
        with open(filepath, 'wb') as f:
            pickle.dump(data, f)

    def load_model(self, filepath: str = 'leaksafe_sp500_pipeline.pkl'):
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
        self.models = data['models']
        self.model_weights = data['model_weights']
        self.clip_bounds = data['clip_bounds']
        self.medians = data['medians']
        self.var_selector = data['var_selector']
        self.scaler = data['scaler']
        self.pca = data['pca']
        self.feature_names = data['feature_names']
        self.is_fitted = data['is_fitted']
        self.history_buffer = data.get('history_buffer', None)
        return self  # return the pipeline so callers can use the result of load_model()


# ---------------------------
# Kaggle inference
# ---------------------------
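def create_prediction_function():
    """Fit the pipeline on the competition training file, persist it, and return it."""
    pipeline = LeakSafeSP500Pipeline()
    train_path = '/kaggle/input/hull-tactical-market-prediction/train.csv'
    if not os.path.exists(train_path):
        # Fail with a clear message rather than returning an undefined model.
        raise FileNotFoundError(f"Training file not found: {train_path}")
    pipeline.fit_from_file(train_path)
    pipeline.save_model()
    return pipeline


Model = create_prediction_function()

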
def predict(test):
    """Return the signal for one test batch (Polars or Pandas DataFrame)."""
    test_df = test.to_pandas() if hasattr(test, 'to_pandas') else test
    # Use the already-loaded pipeline rather than re-reading the pickle on every call.
    preds = Model.predict_batch(test_df)
    val = float(preds[0]) if hasattr(preds, '__iter__') else float(preds)
    # Map the predicted return to a signal in [0, 2], centred on 1.
    signal = np.clip(val * 50.0 + 1.0, 0.0, 2.0)
    return float(signal)




In [None]:
pipeline = LeakSafeSP500Pipeline()
Model = pipeline.load_model("/kaggle/working/leaksafe_sp500_pipeline.pkl")
print("Loaded model:", Model)


In [None]:
import os
inference_server = kaggle_evaluation.default_inference_server.DefaultInferenceServer(predict)

if __name__ == "__main__":
    if os.getenv("KAGGLE_IS_COMPETITION_RERUN"):
        inference_server.serve()
    else:
        inference_server.run_local_gateway(("/kaggle/input/hull-tactical-market-prediction/",))