# 08_ensemble_validation.ipynb  
## Ensemble & Walk-Forward Validation

**Objective:**  
1. Load our four saved models (tuned RF, XGB, tuned MLP, stacked LSTM).  
2. Generate each model’s predictions on the **same** improved feature set.  
3. Train a simple **Ridge** meta-model (stacking) on the first 80% of the data, in time order.  
4. Evaluate on the remaining 20% hold-out.  
5. Perform a **walk-forward** validation to ensure robustness.


## 1. Load Saved Models & Scaler

In [1]:
# 1. Imports & Setup
import numpy as np
import pandas as pd
from pathlib import Path

from sklearn.linear_model       import Ridge
from sklearn.metrics            import mean_absolute_error, r2_score
from sklearn.model_selection    import TimeSeriesSplit
import joblib
import xgboost                  as xgb
from tensorflow.keras.models    import load_model

# Paths
root       = Path().resolve().parent
models_dir = root/"models"

# Load models
rf    = joblib.load(models_dir/"rf_tuned_model.joblib")
xgb_model = xgb.XGBRegressor()
xgb_model.load_model(str(models_dir/"xgb_model.json"))
mlp   = load_model(models_dir/"mlp_improved_model")
lstm  = load_model(models_dir/"lstm_improved_model")
scaler= joblib.load(models_dir/"scaler.pkl")


## 2. Rebuild improved features (same as in Notebooks 06 & 07)

In [2]:
# Rebuild improved features (same as in Notebooks 06 & 07)
basic_csv = root/"data"/"processed"/"features.csv"
raw_csv   = root/"data"/"raw"/"sp500.csv"

df = pd.read_csv(basic_csv, index_col="Date", parse_dates=True)

# Lagged returns 1–5
for lag in range(1,6):
    df[f"ret_lag_{lag}"] = df["return"].shift(lag)
# 10-day rolling volatility
df["vol_10"] = df["return"].rolling(10).std()
# Volume % change
vol = pd.read_csv(raw_csv, index_col="Date", parse_dates=True)["Volume"]
df["vol_pct"] = vol.pct_change()

df = df.dropna()
feature_cols = ["rsi","macd"] + \
               [f"ret_lag_{i}" for i in range(1,6)] + \
               ["vol_10","vol_pct"]

X_all = df[feature_cols].values
y_all = df["return"].values

print("Full dataset shape:", X_all.shape, y_all.shape)



Full dataset shape: (1212, 9) (1212,)


## 3. Base-Model Predictions
We need each model’s predictions on the same feature matrix `X_all`.

In [3]:
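# 3. Generate base-model predictions on the FULL dataset
# Caution: if the saved models were trained on any of these rows, the
# predictions for those rows are in-sample and will look better than true forecasts.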
pred_rf   = rf.predict(X_all)
pred_xgb  = xgb_model.predict(X_all)
pred_mlp  = mlp.predict(scaler.transform(X_all)).flatten()

# Prepare LSTM sequences (t=10) on SCALED features
def make_seq(X, t):
    xs = []
    for i in range(len(X)-t):
        xs.append(X[i:i+t])
    return np.array(xs)

t = 10
X_scaled = scaler.transform(X_all)
X_seq    = make_seq(X_scaled, t)
pred_lstm= lstm.predict(X_seq).flatten()

# Align all predictions & true returns to the same index range
# LSTM and y_seq both start at index t
preds = np.vstack([
    pred_rf[t:], 
    pred_xgb[t:], 
    pred_mlp[t:], 
    pred_lstm
]).T
y_seq = y_all[t:]
print("Stacking matrix shape:", preds.shape, y_seq.shape)




Stacking matrix shape: (1202, 4) (1202,)
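The alignment step above is easy to get wrong by one row, so a toy check can help. The sketch below reuses the same windowing logic on a small made-up matrix (`X_toy` and `y_toy` are illustrative, not the notebook's data) in which row `k` contains the value `k`, so each window reveals which rows it covers.

```python
import numpy as np

def make_seq(X, t):
    """Stack overlapping windows X[i:i+t] for i = 0 .. len(X)-t-1."""
    return np.array([X[i:i + t] for i in range(len(X) - t)])

# Toy data (assumption): row k holds the value k in every column.
n_rows, n_feats, t = 15, 2, 10
X_toy = np.repeat(np.arange(n_rows)[:, None], n_feats, axis=1)
y_toy = np.arange(n_rows) * 0.1

X_seq = make_seq(X_toy, t)
y_seq = y_toy[t:]
print("windows:", X_seq.shape, "targets:", y_seq.shape)

# Which row does each window end on, and which row is its target?
last_rows = X_seq[:, -1, 0]        # last feature row inside each window
target_rows = np.arange(t, n_rows)  # rows selected by y_toy[t:]
for last, target in zip(last_rows, target_rows):
    print(f"window ends at row {last}, paired with target row {target}")
```

In this toy run each window ends one row before its target, whereas the tabular models at the same position use the features of the target row itself. Whether that matches how the LSTM was trained in Notebook 07 is not shown here and is worth confirming.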


## 4. Train/Test Hold-out for the meta-model

In [4]:
# 4. Train/Test Hold-out for the meta-model
split = int(len(y_seq)*0.8)
S_tr,  S_te  = preds[:split], preds[split:]
y_tr,  y_te  = y_seq[:split], y_seq[split:]

meta = Ridge()
meta.fit(S_tr, y_tr)
y_meta = meta.predict(S_te)

mae_meta = mean_absolute_error(y_te, y_meta)
r2_meta  = r2_score(y_te, y_meta)
print(f"Ensemble hold-out → MAE: {mae_meta:.5f},  R²: {r2_meta:.3f}")

# Save ensemble
joblib.dump(meta, models_dir/"ensemble_model.joblib")


Ensemble hold-out → MAE: 0.00751,  R²: 0.095


['C:\\Users\\Antho\\OneDrive\\Documentos\\Santiago\\Finance project\\sp500_dl\\models\\ensemble_model.joblib']
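The hold-out score is only a fair estimate if the stacking inputs are themselves out-of-sample. The notebook does not show which rows the saved base models were trained on, so the sketch below illustrates one common remedy under stated assumptions: build the stacking column from forward-chained, out-of-fold predictions. A plain least-squares model stands in for a real base learner, and all names (`fit_ols`, `predict_ols`, `X_demo`, `y_demo`, `oof`) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept (stand-in for a real base model)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Synthetic, time-ordered data (assumption): weak signal plus noise.
rng = np.random.default_rng(1)
n, p = 600, 3
X_demo = rng.normal(size=(n, p))
y_demo = 0.002 * X_demo[:, 0] + rng.normal(0, 0.01, n)

# Forward chaining: fit on everything before a block, predict that block.
block = 100
oof = np.full(n, np.nan)
for start in range(block, n, block):
    stop = min(start + block, n)
    beta = fit_ols(X_demo[:start], y_demo[:start])   # past rows only
    oof[start:stop] = predict_ols(beta, X_demo[start:stop])

usable = ~np.isnan(oof)   # the first block has no out-of-fold prediction
print("rows usable for stacking:", int(usable.sum()), "of", n)
```

Every entry of `oof` comes from a model that never saw that row, so a meta-model trained on `oof[usable]` is not rewarded for base-model overfitting. The cost is that the earliest block cannot be used for stacking.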

## 5. Walk-Forward Validation

In [5]:
# 5. Walk-Forward (Rolling-Origin) Validation
# Note: only the Ridge meta-model is refit per fold; the base-model predictions are fixed.
tscv  = TimeSeriesSplit(n_splits=5)
maes, r2s = [], []

for train_idx, test_idx in tscv.split(preds):
    m = Ridge()
    m.fit(preds[train_idx], y_seq[train_idx])
    y_pf = m.predict(preds[test_idx])
    maes.append(mean_absolute_error(y_seq[test_idx], y_pf))
    r2s.append(r2_score(y_seq[test_idx], y_pf))

print("Walk-forward MAE:", np.mean(maes), "±", np.std(maes))
print("Walk-forward R²: ", np.mean(r2s),  "±", np.std(r2s))


Walk-forward MAE: 0.007575293057280901 ± 0.002258365075420527
Walk-forward R²:  0.09872798984380626 ± 0.044801937695414455
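To see what the five folds look like, it can help to print their boundaries. The sketch below is a hand-rolled helper (`expanding_window_splits` is a hypothetical name, not a scikit-learn function) intended to mimic an expanding-window layout with equal-sized test blocks. The exact indices `TimeSeriesSplit` produces should be checked by printing `train_idx` and `test_idx` in the loop above.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs: the training window grows and
    each test block is the next contiguous chunk of equal size."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield np.arange(0, start), np.arange(start, start + test_size)

# 1202 rows matches the stacking matrix shape printed earlier.
folds = list(expanding_window_splits(1202, 5))
for k, (tr, te) in enumerate(folds, 1):
    print(f"fold {k}: train rows 0..{tr[-1]} ({len(tr)}), "
          f"test rows {te[0]}..{te[-1]} ({len(te)})")
```

Each test block lies strictly after its training rows, which is what keeps the meta-model evaluation free of look-ahead. The earliest fold trains on far fewer rows than the last, which is one reason per-fold scores can differ.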


## Results

- **Hold-out Ensemble MAE:** 0.00751
- **Hold-out Ensemble R²:** 0.095 
- **Walk-forward MAE:** average 0.00758 ± 0.00226
- **Walk-forward R²:** average 0.099 ± 0.045

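The hold-out and walk-forward metrics agree closely, which suggests the Ridge meta-model behaves consistently across time splits. This notebook does not score the base learners individually, so on its own it cannot show that the ensemble outperforms each of them.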
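A like-for-like comparison needs every base learner scored on the same hold-out rows as the ensemble. The sketch below shows the shape of that comparison on synthetic predictions. The data, `preds_demo`, and the closed-form `fit_ridge` helper (a hand-written stand-in for `sklearn.linear_model.Ridge` with an unpenalized intercept) are all assumptions for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def fit_ridge(S, y, alpha=1.0):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    S_mean, y_mean = S.mean(axis=0), y.mean()
    Sc, yc = S - S_mean, y - y_mean
    coef = np.linalg.solve(Sc.T @ Sc + alpha * np.eye(S.shape[1]), Sc.T @ yc)
    return coef, y_mean - S_mean @ coef

# Synthetic stand-ins (assumption): each "model" is a noisy, shrunken copy of y.
rng = np.random.default_rng(0)
n = 500
names = ["rf", "xgb", "mlp", "lstm"]
y_demo = rng.normal(0, 0.01, n)
preds_demo = np.column_stack(
    [0.3 * y_demo + rng.normal(0, 0.01, n) for _ in names]
)

split = int(n * 0.8)
coef, intercept = fit_ridge(preds_demo[:split], y_demo[:split])
y_ens = preds_demo[split:] @ coef + intercept

# Score every base model and the ensemble on the SAME hold-out rows.
scores = {nm: mae(y_demo[split:], preds_demo[split:, j]) for j, nm in enumerate(names)}
scores["ensemble"] = mae(y_demo[split:], y_ens)
for nm, s in scores.items():
    print(f"{nm:>8}: MAE = {s:.5f}")
print("ridge coefficients:", np.round(coef, 4))
```

In this sketch the coefficients come out small because the default `alpha=1.0` is large relative to inputs on the scale of daily returns. Inspecting `meta.coef_` in the notebook would show whether the same shrinkage applies there.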

## 🗣 Discussion & Next Steps

### Discussion of Results  
- **MAE ≃ 0.0075 (hold-out) and 0.0076 ± 0.0023 (walk-forward)**  
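  - Our ensemble’s average absolute error is about **0.75 percentage points** of daily return. How good that is depends on a naive baseline (for example, always predicting zero), which this notebook does not compute.  
  - The hold-out and walk-forward means agree, but the fold-to-fold standard deviation is roughly 30% of the mean, so the error level varies noticeably between periods.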

- **R² ≃ 0.095 (hold-out) and 0.099 ± 0.045 (walk-forward)**  
  - We explain about **10%** of the day-to-day variance.  
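  - Out-of-sample R² for daily index returns is usually close to zero, so a value near 10% is high and deserves scrutiny. The base models predict on the full dataset, and this notebook does not show which rows they were trained on, so part of the score may reflect in-sample fit rather than forecasting skill.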

- **Error vs. Variance**  
  - While the MAE is low (we’re precise on average), the modest R² means large spikes still elude us.  
  - This gap is expected: markets are driven by unpredictable events, so perfectly anticipating extreme moves is unlikely.
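The relationship between MAE and R² discussed above is easier to judge against a naive forecast. The sketch below computes both metrics for a zero forecast and a mean forecast on synthetic returns. The 1% daily standard deviation and all variable names are assumptions, not figures from the notebook's data.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic "daily returns" (assumption): zero mean, 1% standard deviation.
rng = np.random.default_rng(42)
y_demo = rng.normal(0.0, 0.01, 1000)

zero_forecast = np.zeros_like(y_demo)
mean_forecast = np.full_like(y_demo, y_demo.mean())

mae_zero, r2_zero = mae(y_demo, zero_forecast), r2(y_demo, zero_forecast)
mae_mean, r2_mean = mae(y_demo, mean_forecast), r2(y_demo, mean_forecast)

print(f"zero forecast -> MAE: {mae_zero:.5f}, R2: {r2_zero:.4f}")
print(f"mean forecast -> MAE: {mae_mean:.5f}, R2: {r2_mean:.4f}")
```

A forecast with no information still gets an MAE of the same order as the typical daily move, while its R² is at or below zero. Running the same two baselines on `y_te` would show how much of the ensemble's MAE reflects skill rather than the scale of the returns.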

### Key Takeaways  
1. **Ensembling** combines four models (RF, XGB, MLP, LSTM) so that their errors can partly offset one another; confirming that it beats each base model requires scoring them all on the same hold-out.  
2. **Walk-forward validation** shows that the Ridge meta-model is not tuned to a single train/test split. The base models stay fixed across folds, so this step does not re-validate them.  
3. **Feature engineering** (lags, volatility, volume) was critical; adding macro or sentiment features is the next frontier.

### Next Steps  
1. **Enrich the feature set**  
   - Incorporate macro series (VIX, interest rates), sector flows, social-media sentiment.  
2. **Broaden model diversity**  
   - Add a 1D-CNN, Transformer-based time-series model, or even a simple econometric panel model.  
3. **Tail-focused loss functions**  
   - Experiment with **Quantile** or **Expectile** losses to model the tails directly, and with **Huber** loss to keep extreme moves from dominating the fit.  
4. **Risk-adjusted backtests**  
   - Translate return forecasts into position sizes, simulate P&L, and optimize a Sharpe-ratio objective.  
5. **Deployment & Monitoring**  
   - Package this ensemble into a **Streamlit** (or FastAPI) app, schedule daily forecasts, and log performance metrics over time.
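For the tail-focused losses in step 3, it helps to see how each one treats large residuals. The sketch below implements the standard Huber and pinball (quantile) losses in NumPy. The residual values, `delta=0.02`, and `q=0.05` are arbitrary illustrative choices, not tuned settings.

```python
import numpy as np

def huber_loss(residual, delta):
    """Quadratic for |r| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

def pinball_loss(residual, q):
    """Quantile loss, with residual defined as y_true - y_pred."""
    return np.where(residual >= 0, q * residual, (q - 1) * residual)

# Illustrative residuals on the scale of daily returns (assumption).
residuals = np.array([-0.05, -0.01, 0.0, 0.01, 0.05])

huber_vals = huber_loss(residuals, delta=0.02)
squared_vals = 0.5 * residuals ** 2
pinball_vals = pinball_loss(residuals, q=0.05)

print("residual :", residuals)
print("0.5*r^2  :", squared_vals)
print("huber    :", huber_vals)
print("pinball  :", pinball_vals)
```

Under Huber, a residual beyond `delta` is charged linearly rather than quadratically, so large moves influence the fit less. The pinball loss at `q=0.05` penalizes over-predictions (negative residuals) far more than under-predictions, which pushes the forecast toward the 5th percentile of returns.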
