# Chapter 34: Cross‑Validation Techniques

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand why standard k‑fold cross‑validation fails for time‑series data
- Implement time‑series cross‑validation using scikit‑learn's `TimeSeriesSplit`
- Apply blocked cross‑validation to handle multiple independent time series
- Use nested cross‑validation to perform model selection without optimistic bias
- Implement purged cross‑validation to prevent leakage from overlapping windows
- Understand combinatorial purged cross‑validation for backtesting in financial applications
- Apply walk‑forward validation as a realistic out‑of‑sample testing framework
- Implement rolling origin evaluation for multi‑step forecasting
- Choose the appropriate cross‑validation strategy based on data characteristics and modeling goals
- Avoid common pitfalls that lead to overoptimistic performance estimates

---

## **34.1 Why Standard Cross‑Validation Fails for Time‑Series**

In standard machine learning, k‑fold cross‑validation (CV) randomly partitions the data into k folds, trains on k‑1 folds, and tests on the remaining fold. This works under the assumption that samples are independent and identically distributed (i.i.d.). However, time‑series data violates this assumption because observations are temporally dependent. Randomly assigning observations to folds can place future data in the training set and past data in the test set, leading to **look‑ahead bias** – the model is effectively trained on data from the future to predict the past, resulting in overly optimistic performance estimates.

For the NEPSE prediction system, using standard k‑fold CV would be disastrous: the model could learn patterns from 2023 to predict 2022, which is impossible in reality. Therefore, we must use cross‑validation methods that respect the temporal order.
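To make the look‑ahead problem concrete, here is a minimal sketch that uses 20 synthetic, time‑ordered indices (an assumption standing in for trading days, not NEPSE data). For each shuffled fold it counts how many training observations come after the earliest test observation.

```python
import numpy as np
from sklearn.model_selection import KFold

# 20 time-ordered observations; index i stands for trading day i (synthetic)
days = np.arange(20)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
leaky_folds = 0
for fold, (train_idx, test_idx) in enumerate(kfold.split(days), start=1):
    # Training days that lie after the earliest test day are "future" data
    future_in_train = int((train_idx > test_idx.min()).sum())
    leaky_folds += future_in_train > 0
    print(f"Fold {fold}: test days {test_idx.tolist()}, "
          f"{future_in_train} training days come after the first test day")
```

With four test days per fold, at most one fold could consist of only the final four days, so at least four of the five folds train on data from the future relative to their test set.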

---

## **34.2 K‑Fold Cross‑Validation (The Wrong Way)**

Let's first demonstrate why standard k‑fold CV fails on time‑series data.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestRegressor

# Load NEPSE data (simplified)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)

# Use a single symbol for simplicity
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Create a simple feature: lagged return
df_stock['Return'] = df_stock['Close'].pct_change()
df_stock['Lag1'] = df_stock['Return'].shift(1)
df_stock['Target'] = df_stock['Return'].shift(-1)
df_ml = df_stock[['Lag1', 'Target']].dropna()

X = df_ml[['Lag1']]
y = df_ml['Target']

# Standard k-fold CV (WRONG for time series)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-scores)
print(f"Standard k-fold RMSE: {rmse_scores.mean():.4f} (+/- {rmse_scores.std():.4f})")
```

**Explanation:**  
`KFold` with `shuffle=True` randomly mixes the data, so in every fold the test observations are interleaved with training observations, many of which are later in time. A score produced this way can look better than what the model would achieve in deployment, because the model is evaluated on periods surrounded by data it has already seen.

---

## **34.3 TimeSeriesSplit (Scikit‑Learn)**

`TimeSeriesSplit` is a cross‑validator that yields train/test indices in temporal order. The data are divided into `n_splits + 1` consecutive blocks; in split `k`, the training set consists of the first `k` blocks and the test set is the next block. The test sets are non‑overlapping and always come after the training set.

### **34.3.1 How TimeSeriesSplit Works**

For example, with `n_splits=5` the samples are divided into six consecutive blocks (numbered 0 to 5 here), and the splits are:

- Fold 1: train on block [0], test on block [1]
- Fold 2: train on blocks [0,1], test on block [2]
- Fold 3: train on blocks [0,1,2], test on block [3]
- Fold 4: train on blocks [0,1,2,3], test on block [4]
- Fold 5: train on blocks [0,1,2,3,4], test on block [5]

The training set expands with each fold, simulating a scenario where we train on all available data up to a point and test on the next period.
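The expanding pattern can be checked on a toy array. The sketch below assumes 12 synthetic samples, so each test fold contains 12 // 6 = 2 samples; `X_demo` and `demo_splits` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples (synthetic)
tscv_demo = TimeSeriesSplit(n_splits=5)

demo_splits = list(tscv_demo.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(demo_splits, start=1):
    # Every training index precedes every test index
    print(f"Fold {fold}: train={train_idx.tolist()}, test={test_idx.tolist()}")
```

Printing the actual indices like this is a quick sanity check before running any expensive model fitting.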

### **34.3.2 Implementing TimeSeriesSplit**

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
rmse_scores = []

for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    rmse_scores.append(rmse)

print(f"TimeSeriesSplit RMSE: {np.mean(rmse_scores):.4f} (+/- {np.std(rmse_scores):.4f})")
```

**Explanation:**  
This gives a more realistic estimate of how the model would perform if trained up to a certain date and tested on the subsequent period. Note that the first few folds have very small training sets, so the scores may be unstable; in practice, we might use fewer splits or ensure a minimum training size.

---

## **34.4 Blocked Cross‑Validation**

If we have multiple time series observed over the same period (e.g., many stocks), we can perform **blocked cross‑validation**: we split by time across all series simultaneously, so that every series contributes to training up to a cut‑off date and to testing after it. Splitting by series instead (training on some stocks and testing on others) is also possible, but it does not protect against look‑ahead bias when the series share the same calendar period.

For example, with daily data for many stocks, we can define contiguous blocks of, say, 60 trading days. We train on the first `k` blocks and test on the next block. This respects the temporal order across all stocks.

```python
# Compute daily returns per stock, then pivot: rows = Date, columns = Symbol
df['Return'] = df.groupby('Symbol')['Close'].pct_change()
df_pivot = df.pivot(index='Date', columns='Symbol', values='Return').dropna()
# Note: dropna() keeps only dates on which every stock has a return

# Each row is a date and each column a stock, so splitting rows splits by time
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(df_pivot):
    train_block = df_pivot.iloc[train_idx]
    test_block = df_pivot.iloc[test_idx]
    # Features: today's returns; targets: next-day returns within the same block,
    # so no training label reaches into the test period
    X_train = train_block.iloc[:-1]
    y_train = train_block.shift(-1).iloc[:-1]
    X_test = test_block.iloc[:-1]
    y_test = test_block.shift(-1).iloc[:-1]
    # ... fit a model on (X_train, y_train) and score it on (X_test, y_test)
```

**Explanation:**  
Blocked CV applies the same date cut‑off to every series, so each row of the pivot table (one date, all stocks) lands entirely in either the training or the test set. It is appropriate when we have a panel of time series and we want to evaluate a model that uses features from all stocks (e.g., cross‑sectional momentum). The temporal split is still essential.

---

## **34.5 Nested Cross‑Validation**

When we need to both tune hyperparameters and evaluate model performance, we must use **nested cross‑validation** to avoid optimistic bias. The outer loop estimates generalization performance, while the inner loop performs model selection (e.g., grid search) on the training data of each outer fold.

```python
from sklearn.model_selection import GridSearchCV

# Outer CV
tscv_outer = TimeSeriesSplit(n_splits=3)
outer_scores = []

for train_idx_outer, test_idx_outer in tscv_outer.split(X):
    X_train_outer, X_test_outer = X.iloc[train_idx_outer], X.iloc[test_idx_outer]
    y_train_outer, y_test_outer = y.iloc[train_idx_outer], y.iloc[test_idx_outer]
    
    # Inner CV for hyperparameter tuning
    tscv_inner = TimeSeriesSplit(n_splits=3)
    param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [50, 100]}
    model = RandomForestRegressor(random_state=42)
    grid = GridSearchCV(model, param_grid, cv=tscv_inner, scoring='neg_mean_squared_error')
    grid.fit(X_train_outer, y_train_outer)
    
    # Best model on this outer fold
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test_outer)
    rmse = np.sqrt(np.mean((y_test_outer - y_pred)**2))
    outer_scores.append(rmse)

print(f"Nested CV RMSE: {np.mean(outer_scores):.4f} (+/- {np.std(outer_scores):.4f})")
```

**Explanation:**  
The outer scores estimate the performance of the whole procedure, hyperparameter tuning included, with far less optimistic bias than reusing the tuning folds for evaluation. The inner CV uses only the training portion of each outer fold, preventing leakage from the test data into the tuning process.

---

## **34.6 Purged Cross‑Validation**

In financial machine learning, Marcos López de Prado introduced **purged cross‑validation** to address leakage caused by observations whose information sets overlap in time. For example, if a label is computed over a multi‑day horizon, or a feature such as a 20‑day moving average spans a lookback window, then observations close to the train/test boundary share underlying data. Test features that depend on earlier training data are fine, since that information would be available in deployment; the problem arises when training observations draw on data from the test period.

The idea is to "purge" from the training set any observations whose label or feature window overlaps the test period. López de Prado additionally applies an **embargo**, which drops a short stretch of training observations immediately after each test set to limit leakage through serial correlation.

Implementation is more complex and requires careful indexing. We'll outline the concept and provide a simplified version.

```python
def purged_time_series_split(X, y, test_size, gap=0):
    """
    Generator that yields train/test indices with purging.
    test_size: number of test samples per fold.
    gap: number of samples to purge between train and test.
    """
    n_samples = len(X)
    indices = np.arange(n_samples)
    for test_start in range(0, n_samples - test_size + 1, test_size):
        train_end = test_start - gap
        # Skip folds without training data; a negative train_end would also
        # slice from the end of the array and leak future samples into training
        if train_end <= 0:
            continue
        train_indices = indices[:train_end]
        test_indices = indices[test_start:test_start + test_size]
        yield train_indices, test_indices

# Example usage
for train_idx, test_idx in purged_time_series_split(X, y, test_size=50, gap=20):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # train and evaluate
```

**Explanation:**  
The `gap` parameter ensures that the training set ends `gap` samples before the test set starts, purging any potentially overlapping information. The gap should be at least as long as the label horizon; setting it to the maximum lookback period used in feature engineering is a more conservative choice.
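For the simple case of a single gap before each test fold, scikit‑learn's own `TimeSeriesSplit` accepts `test_size` and `gap` arguments (since version 0.24). The sketch below uses 300 synthetic samples; `X_demo` and `gap_splits` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(300).reshape(-1, 1)  # 300 time-ordered samples (synthetic)

# test_size and gap are counted in samples; the `gap` rows just before each
# test fold are excluded from the training set
tscv_gap = TimeSeriesSplit(n_splits=4, test_size=50, gap=20)
gap_splits = list(tscv_gap.split(X_demo))
for train_idx, test_idx in gap_splits:
    print(f"train ends at {train_idx[-1]}, test covers {test_idx[0]}-{test_idx[-1]}")
```

Note that this only leaves a gap before the test fold. It does not implement the embargo after a test set, which matters only when training data can follow test data.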

---

## **34.7 Combinatorial Purged Cross‑Validation (CPCV)**

Combinatorial Purged Cross‑Validation (CPCV) is an advanced technique that creates many train/test splits by combining different training and test periods, while purging and embargoing to avoid leakage. It is particularly useful for backtesting trading strategies where we want to evaluate performance over many different market regimes.

CPCV is complex to implement from scratch. The `mlfinlab` library provides an implementation. We'll outline the idea:

- Divide the time series into `N` sequential groups.
- Choose the number of test groups `k`; the remaining `N - k` groups are used for training.
- Generate all `C(N, k)` combinations of test groups. Unlike walk‑forward schemes, the training groups may lie both before and after the test groups.
- For each combination, purge training observations that overlap the test groups and apply an embargo after each test group.

This yields many backtest paths, providing a robust estimate of performance distribution.
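A minimal sketch of the split‑generation step follows. It is not López de Prado's full algorithm: `cpcv_splits` is a hypothetical helper, and its `purge` argument simply drops training samples within a fixed distance of any test block, a crude stand‑in for proper purging and embargoing. In his formulation the test groups may fall anywhere in the sample, so training data can lie on both sides of a test block, and `N` groups with `k` test groups give `C(N, k)` splits and `C(N, k) * k / N` backtest paths.

```python
from itertools import combinations
from math import comb

import numpy as np

def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=0):
    """Yield (train_idx, test_idx) for every combination of test groups."""
    groups = np.array_split(np.arange(n_samples), n_groups)
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = np.concatenate([groups[g] for g in test_groups])
        is_train = np.ones(n_samples, dtype=bool)
        for g in test_groups:
            lo = max(groups[g][0] - purge, 0)
            hi = min(groups[g][-1] + purge, n_samples - 1)
            is_train[lo:hi + 1] = False  # test block plus a buffer on both sides
        yield np.flatnonzero(is_train), test_idx

cpcv = list(cpcv_splits(n_samples=120, n_groups=6, n_test_groups=2, purge=5))
n_paths = comb(6, 2) * 2 // 6  # each group appears as a test group in C(N, k) * k / N splits
print(f"{len(cpcv)} splits, {n_paths} backtest paths")
```

Assembling the per‑split test predictions into complete backtest paths is the harder part of CPCV and is left out of this sketch.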

For the NEPSE system, CPCV might be overkill, but it's good to know for advanced applications.

---

## **34.8 Walk‑Forward Validation**

Walk‑forward validation is the most realistic simulation of how a forecasting model would be used in practice. It involves:

- Starting with an initial training window.
- Forecasting the next `h` steps.
- Expanding or rolling the training window to include the most recent actual data.
- Repeating until the end of the dataset.

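`TimeSeriesSplit` is a special case of this scheme, with an expanding window and equally sized test blocks. Implementing walk‑forward validation manually gives us control over the window type (expanding or rolling), the forecast horizon `h`, and the retraining schedule.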

```python
def walk_forward_validation(X, y, train_size, test_size, step=1):
    """
    Walk-forward validation with a rolling (fixed-size) training window.
    X, y: arrays (n_samples, ...)
    train_size: number of samples in each training window
    test_size: number of steps to forecast each iteration
    step: how many steps to move forward each time (usually test_size)
    """
    n = len(X)
    scores = []
    for start in range(0, n - train_size - test_size + 1, step):
        train_end = start + train_size
        test_end = train_end + test_size
        
        X_train = X[start:train_end]
        y_train = y[start:train_end]
        X_test = X[train_end:test_end]
        y_test = y[train_end:test_end]
        
        # Retrain the model from scratch on the current window
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rmse = np.sqrt(np.mean((y_test - y_pred)**2))
        scores.append(rmse)
    return scores

# Example (needs at least train_size + test_size = 550 samples; otherwise no folds are produced)
scores = walk_forward_validation(X.values, y.values, train_size=500, test_size=50, step=50)
print(f"Walk-forward RMSE: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
```

**Explanation:**  
This mimics a realistic retraining schedule. It is computationally intensive, since the model is refit at every step, but it is usually the closest approximation to how the model will actually be used.
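The difference between an expanding and a rolling training window is easiest to see from the indices alone. The sketch below is a hypothetical helper, not a scikit‑learn API: `walk_forward_indices` and its `expanding` flag are assumptions made for illustration.

```python
import numpy as np

def walk_forward_indices(n_samples, train_size, test_size, expanding=False):
    """Yield (train_idx, test_idx) pairs for walk-forward validation."""
    for train_end in range(train_size, n_samples - test_size + 1, test_size):
        # Expanding: always start at 0. Rolling: keep only the last train_size rows.
        train_start = 0 if expanding else train_end - train_size
        yield (np.arange(train_start, train_end),
               np.arange(train_end, train_end + test_size))

rolling = list(walk_forward_indices(100, train_size=40, test_size=20))
expanding = list(walk_forward_indices(100, train_size=40, test_size=20, expanding=True))
for (r_train, test_idx), (e_train, _) in zip(rolling, expanding):
    print(f"test {test_idx[0]}-{test_idx[-1]}: rolling train {r_train[0]}-{r_train[-1]}, "
          f"expanding train {e_train[0]}-{e_train[-1]}")
```

A rolling window adapts faster when the data‑generating process drifts, while an expanding window uses more history; which one is preferable depends on the data.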

---

## **34.9 Rolling Origin Evaluation**

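Rolling origin evaluation (also known as time series cross‑validation with a rolling forecast origin) moves the forecast origin, i.e. the last observation in the training set, forward through the series. In the variant below, the training window starts at a fixed point and expands by one observation per iteration, and the test set is always the next `h` steps (`test_size` in the code). This is similar to `TimeSeriesSplit`, but the origin advances one step at a time, so consecutive test windows overlap.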

```python
def rolling_origin_evaluation(X, y, test_size, min_train_size):
    """
    Rolling origin evaluation.
    min_train_size: minimum training size for first fold.
    """
    n = len(X)
    scores = []
    for train_end in range(min_train_size, n - test_size + 1):
        X_train = X[:train_end]
        y_train = y[:train_end]
        X_test = X[train_end:train_end+test_size]
        y_test = y[train_end:train_end+test_size]
        
        model = RandomForestRegressor(n_estimators=50, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rmse = np.sqrt(np.mean((y_test - y_pred)**2))
        scores.append(rmse)
    return scores
```

**Explanation:**  
This yields many evaluation points and is often used in forecasting competitions. It can be computationally expensive, but it gives a very detailed view of performance over time.
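For multi‑step forecasting it is often more informative to report the error separately for each horizon instead of averaging across the whole test window. The sketch below assumes a synthetic trending series and a naive persistence forecast; `rolling_origin_errors` and `naive_forecast` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def rolling_origin_errors(series, min_train_size, horizon, forecast_fn):
    """Return an (n_origins, horizon) array of forecast errors.

    forecast_fn(history, horizon) must return `horizon` forecasts made from
    the history available at the forecast origin.
    """
    errors = []
    for origin in range(min_train_size, len(series) - horizon + 1):
        history = series[:origin]                 # data known at the origin
        actual = series[origin:origin + horizon]  # the next `horizon` values
        errors.append(actual - forecast_fn(history, horizon))
    return np.array(errors)

def naive_forecast(history, horizon):
    # Persistence baseline: repeat the last observed value
    return np.repeat(history[-1], horizon)

series = np.arange(20, dtype=float)  # synthetic upward-trending series
errors = rolling_origin_errors(series, min_train_size=10, horizon=3,
                               forecast_fn=naive_forecast)
rmse_by_horizon = np.sqrt(np.mean(errors ** 2, axis=0))
print(rmse_by_horizon)  # one RMSE per forecast horizon
```

On this trending series the persistence forecast falls further behind at longer horizons, which is the kind of pattern a single averaged score would hide.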

---

## **34.10 Implementation Strategies**

### **34.10.1 Custom Splitters for scikit‑learn**

We can create custom cross‑validator classes that implement `split` and `get_n_splits` methods to use with scikit‑learn's `cross_val_score` and `GridSearchCV`.

```python
from sklearn.model_selection import BaseCrossValidator

class PurgedTimeSeriesSplit(BaseCrossValidator):
    """Expanding-window splitter that leaves `gap` samples between train and test."""

    def __init__(self, n_splits=5, test_size=None, gap=0):
        self.n_splits = n_splits
        self.test_size = test_size
        self.gap = gap

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        indices = np.arange(n_samples)
        # Use a local variable so the constructor parameter is never overwritten
        test_size = self.test_size or n_samples // (self.n_splits + 1)
        # Fail early so split() and get_n_splits() always agree
        if test_size <= self.gap:
            raise ValueError("gap must be smaller than test_size; "
                             "otherwise the first fold has no training data")
        if (self.n_splits + 1) * test_size > n_samples:
            raise ValueError("not enough samples for the requested folds")
        for i in range(self.n_splits):
            test_start = (i + 1) * test_size
            train_end = test_start - self.gap  # positive thanks to the check above
            yield indices[:train_end], indices[test_start:test_start + test_size]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

# Usage
purgedsplit = PurgedTimeSeriesSplit(n_splits=5, test_size=50, gap=20)
for train_idx, test_idx in purgedsplit.split(X):
    print(f"train: 0-{train_idx[-1]}, test: {test_idx[0]}-{test_idx[-1]}")
```

### **34.10.2 Handling Multiple Assets**

When you have multiple assets, you can either:

- Treat each asset separately and average results (if you want asset‑specific models).
- Concatenate assets and use a feature that identifies the asset, then apply time‑series CV on the combined dataset (ensuring that train/test splits are by time, not by asset). This is valid as long as each asset's features are computed from that asset's own history before concatenation.
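For the concatenated approach, one way to guarantee a split by time is to split the unique dates and then map each date fold back to the panel rows. The sketch below assumes a long‑format panel with `Date`, `Symbol`, and `Return` columns; the symbols and returns are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic long-format panel: 3 symbols x 30 business days (placeholder data)
dates = pd.bdate_range("2023-01-02", periods=30)
panel = pd.DataFrame({
    "Date": np.tile(dates, 3),
    "Symbol": np.repeat(["AAA", "BBB", "CCC"], len(dates)),
    "Return": np.random.default_rng(0).normal(0, 0.01, 3 * len(dates)),
})

# Split positions of the sorted unique dates, then select rows by date
unique_dates = pd.DatetimeIndex(panel["Date"].unique()).sort_values()
panel_splits = []
for train_pos, test_pos in TimeSeriesSplit(n_splits=4).split(np.arange(len(unique_dates))):
    train_rows = panel[panel["Date"].isin(unique_dates[train_pos])]
    test_rows = panel[panel["Date"].isin(unique_dates[test_pos])]
    panel_splits.append((train_rows, test_rows))
    print(f"train to {train_rows['Date'].max().date()}, "
          f"test {test_rows['Date'].min().date()} to {test_rows['Date'].max().date()}, "
          f"symbols in test: {test_rows['Symbol'].nunique()}")
```

Splitting the row index of the concatenated frame directly would instead cut through the middle of one asset's history, which is why the split is done on dates.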

### **34.10.3 Computational Considerations**

Walk‑forward and rolling origin evaluations can be slow. Use caching of models or parallel processing where possible. Also, consider reducing the number of folds or using a representative subset of folds.

---

## **34.11 Choosing the Right Cross‑Validation Strategy**

| Strategy | When to Use |
|----------|-------------|
| TimeSeriesSplit | Simple, standard for evaluating one‑step ahead forecasts with expanding window. |
| Blocked CV | Multiple independent time series, need to test across all series simultaneously. |
| Nested CV | When you need to tune hyperparameters without optimistic bias. |
| Purged CV | When features have long lookback windows that could overlap test set. |
| CPCV | Advanced backtesting of trading strategies with many possible paths. |
| Walk‑Forward | Most realistic simulation; use for final model evaluation before deployment. |
| Rolling Origin | Detailed performance over time; often used in research. |

For the NEPSE system, a good starting point is `TimeSeriesSplit` for initial model comparison, and walk‑forward validation for final evaluation.

---

## **34.12 Common Pitfalls**

- **Using standard k‑fold CV:** This will give overly optimistic results and is invalid for time‑series.
- **Not purging when using lagged features:** With time‑based splits, the first few test observations have features that draw on training data, which is fine because that information would be available in deployment. However, when labels or feature windows straddle the train/test boundary, short‑term autocorrelation can carry information from the test period into training. A small gap (purge or embargo) between the two sets helps.
- **Overlapping train/test in multi‑step forecasting:** When forecasting `h` steps ahead, the targets of the last `h` training observations fall inside the test period. Drop those observations from training (e.g., for a 5‑step horizon, leave a gap of at least 5 samples before the test set).
- **Information leakage from scaling:** Always fit scalers on the training fold only, then transform test fold.
- **Not accounting for multiple assets:** If you have many stocks, ensure that the same time split is applied to all, and that features are computed per stock without leakage across stocks.
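The scaling pitfall is easy to avoid by wrapping the preprocessing and the model in a scikit‑learn pipeline, which is cloned and refit on the training portion of every fold. The sketch below uses synthetic data and a `Ridge` model as assumptions for illustration; any estimator could take its place.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))  # synthetic features
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# The scaler is fit inside each training fold, so it never sees test rows
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_scores = cross_val_score(pipeline, X_demo, y_demo,
                              cv=TimeSeriesSplit(n_splits=5),
                              scoring="neg_root_mean_squared_error")
print(f"RMSE per fold: {np.round(-pipe_scores, 4)}")
```

Scaling the full dataset before splitting would let test‑period means and variances influence the training features, which is exactly the leakage described above.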

---

## **34.13 Chapter Summary**

In this chapter, we covered the essential cross‑validation techniques for time‑series forecasting, with applications to the NEPSE dataset.

- **Standard k‑fold CV** is invalid for time‑series due to look‑ahead bias.
- **TimeSeriesSplit** provides a simple expanding‑window CV that respects temporal order.
- **Blocked CV** handles multiple independent series.
- **Nested CV** enables unbiased hyperparameter tuning.
- **Purged CV** removes overlapping observations to prevent leakage.
- **Combinatorial Purged CV** is an advanced method for robust backtesting.
- **Walk‑forward validation** mimics real‑world deployment and gives trustworthy performance estimates.
- **Rolling origin evaluation** provides many evaluation points.
- Implementation strategies and custom splitters make these methods easy to use.
- Choosing the right method depends on the data and the goal.

### **Practical Takeaways for the NEPSE System:**

- Start with `TimeSeriesSplit` for quick model comparisons.
- Use walk‑forward validation for final performance assessment.
- If using features with long lookback (e.g., 50‑day moving average), apply a purge gap equal to that lookback.
- Always fit scalers and preprocessors on training folds only.
- Report performance across folds to understand stability.

In the next chapter, **Chapter 35: Hyperparameter Tuning**, we will explore how to systematically search for the best model parameters using grid search, random search, Bayesian optimization, and more, all within a time‑series context.

---

**End of Chapter 34**

