# **Chapter 18: Machine Learning Fundamentals for Time-Series**

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the core concepts of supervised learning as applied to time-series prediction
- Differentiate between regression, classification, and multi-step forecasting problems
- Grasp the implications of the No-Free-Lunch theorem for model selection
- Explain the bias-variance tradeoff and its role in model performance
- Identify and diagnose overfitting and underfitting in time-series models
- Understand model capacity and its relationship to data complexity
- Distinguish between training and test performance
- Appreciate the importance of generalization in financial forecasting
- Develop a systematic model selection strategy for the NEPSE prediction system

---

## **18.1 Supervised Learning Basics**

Supervised learning is the machine learning paradigm where we train models on labeled data – that is, data that contains both input features and corresponding output targets. For time-series prediction, this means we have historical observations and we want to predict future values. The model learns a mapping function `f` from input features `X` to target `y` such that `y ≈ f(X)`.

### **18.1.1 The Supervised Learning Framework for Time-Series**

In time-series, we transform the sequential data into a supervised learning format by creating lagged features. For the NEPSE dataset, we want to predict tomorrow's closing price (or return) using today's and past days' information.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load and prepare NEPSE data
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

# Create supervised learning format: predict next day's Close
# using today's features
df['Target'] = df['Close'].shift(-1)  # tomorrow's close

# Create feature matrix (using only information available today)
feature_cols = ['Open', 'High', 'Low', 'Close', 'Vol', 'VWAP']
df_features = df[feature_cols].copy()

# Add some lag features
df_features['Close_Lag1'] = df['Close'].shift(1)
df_features['Volume_Lag1'] = df['Vol'].shift(1)

# Drop rows with NaN (created by shift and target)
df_clean = pd.concat([df_features, df['Target']], axis=1).dropna()

# Define X (features) and y (target)
X = df_clean[feature_cols + ['Close_Lag1', 'Volume_Lag1']]
y = df_clean['Target']

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nFirst 3 rows of features:")
print(X.head(3))
print(f"\nCorresponding targets (next day close):")
print(y.head(3))
```

**Explanation:**

- **Target creation:** `df['Target'] = df['Close'].shift(-1)` aligns each row with the next day's closing price. This is the fundamental transformation that turns a time-series into a supervised learning problem. The `shift(-1)` moves future values upward, so row `i` contains today's features and tomorrow's target.
- **Feature matrix:** We select columns that are available at prediction time. Notice we also create `Close_Lag1` – yesterday's close – as an additional feature. In practice, we would include many more lags and derived features.
- **Data cleaning:** After shifting, the last row has a NaN target (since there is no "tomorrow"), and the first row has NaN lag features (a lag of `k` days leaves the first `k` rows empty). We drop these to create a clean dataset.
- **Result:** We now have a standard tabular supervised learning dataset. Models like Random Forest treat each row as a separate example, but the rows are not truly independent: they are temporally ordered and autocorrelated, which is why the splits used later in this chapter are temporal rather than random.
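
The same transformation can be wrapped in a small reusable helper. The following is a minimal sketch, assuming a single price series; the function name `make_supervised` and the synthetic random-walk prices are illustrative stand-ins, not part of the NEPSE pipeline.

```python
import numpy as np
import pandas as pd

def make_supervised(close: pd.Series, n_lags: int = 3, horizon: int = 1) -> pd.DataFrame:
    """Turn a price series into lagged features plus a future target (hypothetical helper)."""
    frame = pd.DataFrame({"Close": close})
    for lag in range(1, n_lags + 1):
        frame[f"Close_Lag{lag}"] = close.shift(lag)  # value observed `lag` days earlier
    frame["Target"] = close.shift(-horizon)           # value `horizon` days later
    return frame.dropna()                             # drop rows without full history or target

# Synthetic random-walk prices stand in for the NEPSE close column
rng = np.random.default_rng(0)
prices = pd.Series(1000 + rng.normal(0, 5, size=50).cumsum())

supervised = make_supervised(prices, n_lags=3, horizon=1)
print(supervised.shape)
print(supervised.head(3))
```

With 50 observations, 3 lags and a 1-day horizon, the first three rows and the last row are dropped, which leaves 46 usable examples.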

### **18.1.2 The Learning Task**

The supervised learning task is to find a function `f` that minimizes the expected prediction error. For regression (predicting a continuous value like price), we typically minimize the mean squared error between predictions and actual values. For classification (predicting a discrete class like up/down), we minimize cross-entropy or misclassification rate.

The model learns patterns from the training data: for example, that when the RSI is below 30 and volume is high, the next day's price tends to increase. These patterns are captured in the model's parameters (e.g., coefficients in linear regression, split points in decision trees).
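
To make the two loss functions concrete, the sketch below computes mean squared error and binary cross-entropy by hand and checks them against scikit-learn. All numbers are made up for illustration and do not come from NEPSE data.

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Regression: squared error between predicted and actual prices (made-up values)
y_true_price = np.array([2000.0, 2010.0, 1995.0])
y_pred_price = np.array([2005.0, 2000.0, 1995.0])
mse_manual = np.mean((y_true_price - y_pred_price) ** 2)
mse_sklearn = mean_squared_error(y_true_price, y_pred_price)

# Classification: cross-entropy between true directions and predicted P(up)
y_true_dir = np.array([1, 0, 1])
p_up = np.array([0.8, 0.3, 0.6])
ce_manual = -np.mean(y_true_dir * np.log(p_up) + (1 - y_true_dir) * np.log(1 - p_up))
ce_sklearn = log_loss(y_true_dir, p_up)

print(f"MSE: {mse_manual:.3f} (sklearn: {mse_sklearn:.3f})")
print(f"Cross-entropy: {ce_manual:.4f} (sklearn: {ce_sklearn:.4f})")
```

Squared error punishes the 10-rupee miss four times as hard as the 5-rupee miss, while cross-entropy penalizes confident wrong probabilities most heavily.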

---

## **18.2 Prediction Problem Types**

Time-series prediction can be framed in several ways depending on the business goal. For the NEPSE system, we might want to predict the exact future price, the direction of movement, or a sequence of future values.

### **18.2.1 Regression**

Regression predicts a continuous numerical value. In our context, this could be the next day's closing price, the percentage return, or the price range.

```python
# Example: Regression to predict next day's close price
regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train-test split (temporal, not random!)
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Regression RMSE: {rmse:.2f} NPR")
```

**Explanation:**

- This uses a Random Forest to predict the exact closing price. RMSE is expressed in Nepalese Rupees and summarizes the typical size of the prediction error, weighting large misses more heavily than small ones.
- Regression is appropriate when the exact value matters – for example, in algorithmic trading where entry/exit prices determine profit.

### **18.2.2 Classification**

Classification predicts a discrete class. For the NEPSE system, a common classification task is predicting the direction of price movement: up, down, or unchanged.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create binary classification target: 1 if price increases, 0 otherwise
next_close = df['Close'].shift(-1)
df['Target_Direction'] = (next_close > df['Close']).astype(float)
# The last row has no "tomorrow"; mark it as missing instead of labelling it 0
df.loc[next_close.isna(), 'Target_Direction'] = np.nan

# Prepare data (same features as before). Separate names keep the
# regression X, y and splits from Section 18.2.1 intact.
df_clf = pd.concat([df_features, df['Target_Direction']], axis=1).dropna()
X_clf = df_clf[feature_cols + ['Close_Lag1', 'Volume_Lag1']]
y_clf = df_clf['Target_Direction'].astype(int)

# Temporal split, sized from this dataset
split_clf = int(len(X_clf) * 0.8)
X_train_clf, X_test_clf = X_clf.iloc[:split_clf], X_clf.iloc[split_clf:]
y_train_clf, y_test_clf = y_clf.iloc[:split_clf], y_clf.iloc[split_clf:]

# Train a classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_clf, y_train_clf)
y_pred_clf = clf.predict(X_test_clf)

accuracy = accuracy_score(y_test_clf, y_pred_clf)
print(f"Direction prediction accuracy: {accuracy:.3f}")
```

**Explanation:**

- The binary target is 1 if tomorrow's close is higher than today's, 0 otherwise.
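- Classification can be easier than regression because small price movements (noise) matter less: only the sign of the move has to be right. A correct direction call is not automatically a profitable trade, though, since transaction costs and the size of the move also count.
- Accuracy measures the fraction of correct direction predictions. Compare it with the share of the majority class in the test period rather than with 0.5: if the market rose on 55% of test days, always predicting "up" already scores 0.55.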

We could also create multi-class targets: e.g., "large increase" (>2%), "small increase" (0-2%), "small decrease" (-2-0%), "large decrease" (<-2%). This might be useful for position sizing.
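
One way to build such a multi-class target is `pd.cut` on the next-day return. This is a minimal sketch on a handful of made-up closing prices; the 2% thresholds follow the text above, and with right-closed bins a return of exactly 0 falls into "small decrease".

```python
import numpy as np
import pandas as pd

# Made-up closing prices, used only to show the binning
close = pd.Series([100.0, 103.0, 104.0, 103.5, 100.0, 100.0])
next_ret = close.shift(-1) / close - 1  # next-day return; NaN for the last row

bins = [-np.inf, -0.02, 0.0, 0.02, np.inf]
labels = ["large decrease", "small decrease", "small increase", "large increase"]

# Bins are right-closed: (-inf, -2%], (-2%, 0], (0, 2%], (2%, inf)
target_class = pd.cut(next_ret, bins=bins, labels=labels)

print(pd.DataFrame({"Close": close, "NextReturn": next_ret, "Class": target_class}))
```

The last row keeps a missing class because it has no next-day return, so it should be dropped before training, just like the regression target.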

### **18.2.3 Multi-Step Forecasting**

Often we need to predict several steps ahead, for example the next 5 days of prices. This is more challenging because uncertainty grows with the horizon, and in recursive schemes each step inherits the errors of the previous ones. There are two main approaches:

1. **Direct multi-step:** Train separate models for each horizon (model for t+1, t+2, etc.).
2. **Recursive multi-step:** Use a one-step model and feed its predictions back as features.

```python
# Example: Direct multi-step forecasting (predict t+1, t+2, t+3)
horizons = [1, 2, 3]
models = {}
predictions = {}

for h in horizons:
    # Create target shifted by h days
    df[f'Target_{h}'] = df['Close'].shift(-h)

    # Prepare data (horizon-specific names, so the one-step X and y are not overwritten)
    df_h = pd.concat([df_features, df[f'Target_{h}']], axis=1).dropna()
    X_h = df_h[feature_cols + ['Close_Lag1', 'Volume_Lag1']]
    y_h = df_h[f'Target_{h}']

    # Temporal split sized for this horizon (each extra day ahead drops one more row)
    split_h = int(len(X_h) * 0.8)

    # Train model for this horizon
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_h.iloc[:split_h], y_h.iloc[:split_h])
    models[h] = model

    # Predict on test set
    preds = model.predict(X_h.iloc[split_h:])
    predictions[h] = preds

    rmse = np.sqrt(mean_squared_error(y_h.iloc[split_h:], preds))
    print(f"Horizon {h} day(s) ahead - RMSE: {rmse:.2f}")
```

**Explanation:**

- For each horizon `h`, we create a target `Close.shift(-h)`. This gives, for each date, the price `h` days in the future.
- We train a separate model for each horizon. This is the **direct** approach. Its advantage is that each model can specialize in its own horizon; its disadvantage is that it ignores dependencies between the forecasts (e.g., t+2 might be related to t+1).
- The **recursive** approach would use the t+1 model, then feed its prediction back as a feature to predict t+2, and so on. It needs only one model, but errors compound from step to step, and every input feature (not just the close) must be forecast or derived for the future steps.

For the NEPSE system, we might need both short-term (1-3 day) and medium-term (5-20 day) forecasts depending on the trading strategy.
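
The recursive approach can be sketched as follows. To keep the example self-contained it assumes a model that uses only lagged closes as features (so each prediction can be fed straight back in) and a synthetic random walk in place of NEPSE prices; `recursive_forecast` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def recursive_forecast(model, last_window, steps):
    """Roll a one-step model forward, feeding each prediction back as the newest lag."""
    window = list(last_window)  # ordered oldest -> newest
    forecasts = []
    for _ in range(steps):
        x = np.array(window).reshape(1, -1)
        next_value = float(model.predict(x)[0])
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest lag, append the prediction
    return forecasts

# Synthetic random-walk prices (illustrative only)
rng = np.random.default_rng(1)
series = 1000 + rng.normal(0, 5, size=200).cumsum()

# Each row holds three consecutive closes; the target is the close that follows them
n_lags = 3
X_lags = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
y_next = series[n_lags:]

one_step = LinearRegression().fit(X_lags, y_next)
forecasts = recursive_forecast(one_step, series[-n_lags:], steps=5)
print([round(f, 2) for f in forecasts])
```

Only the first forecast is based entirely on observed prices; from the second step onward the inputs include earlier predictions, which is where the compounding of errors comes from.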

---

## **18.3 The No-Free-Lunch Theorem**

The No-Free-Lunch (NFL) theorem states that, averaged over all possible problems, no machine learning algorithm outperforms any other: an algorithm that does well on one class of problems must do worse on another. Good performance on one type of problem therefore says nothing about performance elsewhere. For the NEPSE prediction system, this means we cannot assume that a model that works well for US stocks will work for NEPSE, or that a model that performs well during a bull market will perform well during a bear market.

### **18.3.1 Implications for Model Selection**

The NFL theorem guides us to:

- **Experiment with multiple algorithms:** Try linear models, tree-based models, neural networks, and statistical methods (ARIMA, etc.). What works best depends on the specific characteristics of NEPSE data.
- **Use time-series cross-validation:** Compare models on the same temporally ordered folds to see which generalizes best.
- **Consider the data regime:** NEPSE may have different properties (low liquidity, high volatility, specific seasonality) than other markets. A model that accounts for these may outperform a generic one.

```python
# Example: Compare multiple algorithms on NEPSE data
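from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import time

# X_train, y_train, X_test, y_test are the regression split from Section 18.2.1.
# Scale-sensitive models are wrapped with a scaler that is fit on the training data only.
# SVR is also sensitive to the target scale, so on raw price levels it usually
# needs a tuned C (or a scaled target) to be competitive.
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    'Lasso': make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf'))
}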

results = {}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    
    results[name] = {
        'MAE': mae,
        'Train Time (s)': train_time
    }

# Display results
results_df = pd.DataFrame(results).T.sort_values('MAE')
print(results_df)
```

**Explanation:**

- This code trains six different regression models on the same NEPSE dataset and compares their Mean Absolute Error (MAE) on a temporally separated test set.
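- The ranking depends on the data. On price levels, tree-based models such as Random Forest cannot predict values outside the range seen in training, so a linear model can come out ahead when test-period prices move beyond that range; on other targets (returns, direction) the order may reverse. The NFL theorem reminds us that the ranking can also change if we alter the features, the target horizon, or the time period.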
- The practical takeaway: always benchmark multiple models. Do not assume that the latest deep learning architecture is automatically best for NEPSE.

---

## **18.4 Bias-Variance Tradeoff**

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tension between a model's ability to fit the training data (low bias) and its ability to generalize to unseen data (low variance).

- **Bias** is the error introduced by approximating a real-world problem with a simplified model. High bias models underfit the data – they miss important patterns.
- **Variance** is the error introduced by the model's sensitivity to fluctuations in the training set. High variance models overfit – they memorize noise instead of learning the true signal.

### **18.4.1 Visualizing Bias and Variance**

Imagine we are trying to model the relationship between past returns and future returns. A linear model might have high bias (it cannot capture non-linear patterns), while a high-degree polynomial might have high variance (it will fit every wiggle in the training data).

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Generate a synthetic noisy signal (illustrative only, not real NEPSE data).
# The *_syn names keep the NEPSE X and y from earlier sections intact.
np.random.seed(42)
X_syn = np.linspace(0, 10, 100).reshape(-1, 1)
y_syn_true = 2 * np.sin(X_syn).ravel() + 0.5 * X_syn.ravel()
y_syn = y_syn_true + np.random.normal(0, 0.5, size=len(X_syn))

# Fit models with different complexity
degrees = [1, 3, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, degree in zip(axes, degrees):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_syn, y_syn)
    y_syn_pred = model.predict(X_syn)

    ax.scatter(X_syn, y_syn, s=10, alpha=0.5, label='Data')
    ax.plot(X_syn, y_syn_true, 'k--', label='True function')
    ax.plot(X_syn, y_syn_pred, 'r-', label='Model')
    ax.set_title(f'Degree {degree} Polynomial')
    ax.legend()

plt.tight_layout()
plt.show()
```

**Explanation:**

- **Degree 1 (linear):** High bias – the model is too simple and cannot capture the curvature. It underfits.
- **Degree 3:** A better balance. The curve follows the broad shape without chasing the noise, although a cubic still cannot reproduce every turn of the sine component.
- **Degree 15:** High variance. The extra flexibility lets the curve bend toward individual noisy points, especially near the edges of the range, so it fits noise that will not recur in new data.
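
Bias and variance can also be estimated numerically by refitting each polynomial on many noisy resamples of the same underlying function. The sketch below is an illustration under stated assumptions: it reuses the synthetic function from the plot, rescales the input with `MinMaxScaler` to keep high-degree fits numerically stable, and the helper name `bias_variance` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

def true_fn(x):
    return 2 * np.sin(x) + 0.5 * x

rng = np.random.default_rng(42)
x_grid = np.linspace(0, 10, 100).reshape(-1, 1)
f_true = true_fn(x_grid).ravel()

def bias_variance(degree, n_repeats=200, noise_sd=0.5):
    """Estimate squared bias and variance of a polynomial fit over noisy resamples."""
    preds = np.empty((n_repeats, len(x_grid)))
    for r in range(n_repeats):
        y_noisy = f_true + rng.normal(0, noise_sd, size=len(x_grid))
        model = make_pipeline(MinMaxScaler(), PolynomialFeatures(degree), LinearRegression())
        preds[r] = model.fit(x_grid, y_noisy).predict(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)  # systematic error of the average fit
    variance = np.mean(preds.var(axis=0))         # spread of the fits across resamples
    return bias_sq, variance

estimates = {d: bias_variance(d) for d in (1, 3, 15)}
for d, (b, v) in estimates.items():
    print(f"degree {d:>2}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

As the degree rises, the squared bias should shrink while the variance grows, which is the tradeoff the three panels show visually.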

### **18.4.2 Applying to NEPSE Prediction**

In the context of the NEPSE system:

- **High bias (underfitting):** Using only linear features (e.g., today's price) to predict tomorrow's price. The model may miss important patterns like momentum or mean reversion. Training error will be high, and test error will also be high.
- **High variance (overfitting):** Using hundreds of features (e.g., all tsfresh features) without regularization. The model will achieve very low training error but perform poorly on new data because it has learned spurious patterns specific to the training period.
- **Optimal complexity:** A model with enough capacity to capture the true underlying patterns but not so much that it fits noise. This is achieved through careful feature selection, regularization, and validation.

```python
# Demonstration of overfitting with a complex model
from sklearn.tree import DecisionTreeRegressor

# Train a decision tree with unlimited depth (high complexity)
tree_deep = DecisionTreeRegressor(max_depth=None, random_state=42)
tree_deep.fit(X_train, y_train)

# Train a shallow tree (low complexity)
tree_shallow = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_shallow.fit(X_train, y_train)

# Evaluate
print("Deep Tree (max_depth=None):")
print(f"  Train R²: {tree_deep.score(X_train, y_train):.4f}")
print(f"  Test R²: {tree_deep.score(X_test, y_test):.4f}")

print("\nShallow Tree (max_depth=3):")
print(f"  Train R²: {tree_shallow.score(X_train, y_train):.4f}")
print(f"  Test R²: {tree_shallow.score(X_test, y_test):.4f}")
```

**Explanation:**

- The deep tree will likely have perfect or near-perfect training R² (it can memorize the training data) but much lower test R² – a classic sign of overfitting.
- The shallow tree may have lower training R² (higher bias) but better test R² because it has learned more generalizable patterns.
- In practice, we would use cross-validation to find the optimal depth that balances bias and variance.
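
A depth search of this kind might look like the sketch below, which uses `validation_curve` with `TimeSeriesSplit`. The data is synthetic (the `X_demo` and `y_demo` names are placeholders), so the selected depth says nothing about NEPSE; only the procedure carries over.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(0)
n = 400
X_demo = rng.normal(size=(n, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(0, 0.5, size=n)

depths = [1, 2, 3, 5, 8, 12]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=42),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=TimeSeriesSplit(n_splits=5),   # validation folds always follow their training data
    scoring="neg_mean_squared_error",
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_depth = depths[int(np.argmax(mean_val))]

for d, tr, va in zip(depths, mean_train, mean_val):
    print(f"max_depth={d:>2}  train MSE={-tr:.3f}  validation MSE={-va:.3f}")
print(f"Depth with the lowest validation MSE: {best_depth}")
```

Training error keeps falling as depth increases, while validation error typically bottoms out at a moderate depth; that minimum is the depth to carry forward.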

---

## **18.5 Overfitting and Underfitting**

Overfitting and underfitting are the practical manifestations of the bias-variance tradeoff.

### **18.5.1 Identifying Overfitting**

Overfitting occurs when the model performs well on training data but poorly on unseen data. In time-series, overfitting can be especially insidious because of temporal dependencies – a model might appear to perform well in backtesting but fail in live trading due to overfitting to past market regimes.

**Signs of overfitting:**
- Large gap between training and test performance.
- Model weights or feature importances that are unstable across different time periods.
- Predictions that are overly sensitive to small changes in input.

```python
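# Example: Monitoring overfitting during training
from sklearn.model_selection import learning_curve, TimeSeriesSplit

# TimeSeriesSplit keeps every validation fold later in time than its training data.
# An integer cv would use plain K-fold, which validates on periods that
# precede part of the training data.
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=TimeSeriesSplit(n_splits=5),
    scoring='r2'
)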

train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', label='Training score')
plt.plot(train_sizes, test_mean, 'o-', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('R² score')
plt.legend()
plt.title('Learning Curves - Detecting Overfitting')
plt.grid(True)
plt.show()
```

**Explanation:**

- Learning curves show model performance as a function of training set size.
- If the training score remains high while the validation score plateaus or decreases, it suggests overfitting (the model is not generalizing).
- For the NEPSE dataset, if we see a large and persistent gap, we need to reduce model complexity or add regularization.

### **18.5.2 Identifying Underfitting**

Underfitting happens when the model is too simple to capture the underlying structure. Signs include:
- Low training and test performance.
- Performance that does not improve with more data.
- Residuals that show clear patterns (e.g., the model consistently misses turning points).

```python
# Check for underfitting: are residuals random?
y_pred_simple = LinearRegression().fit(X_train, y_train).predict(X_test)
residuals = y_test - y_pred_simple

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(y_test.values, label='Actual')
plt.plot(y_pred_simple, label='Predicted')
plt.legend()
plt.title('Simple Linear Model - Actual vs Predicted')

plt.subplot(1, 2, 2)
plt.scatter(y_pred_simple, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
```

**Explanation:**

- If the residuals show a pattern (e.g., they are positive when predictions are low and negative when predictions are high), the model is underfitting – it's missing a systematic relationship.
- For the NEPSE system, underfitting might mean the model fails to capture volatility clustering or momentum effects.
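
A quick numerical companion to the residual plot is the lag-1 autocorrelation of the residuals. The sketch below fits a straight line to a synthetic trend-plus-cycle series (an assumption made for illustration) and compares the residual autocorrelation with that of pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series: linear trend plus a slow cycle plus noise (not NEPSE data)
rng = np.random.default_rng(0)
t = np.arange(300)
x = t.reshape(-1, 1).astype(float)
y = 0.05 * t + 3 * np.sin(t / 15) + rng.normal(0, 0.3, size=len(t))

# A straight line captures the trend but not the cycle
linear = LinearRegression().fit(x, y)
residuals = pd.Series(y - linear.predict(x))

lag1_residuals = residuals.autocorr(lag=1)
lag1_noise = pd.Series(rng.normal(size=len(t))).autocorr(lag=1)

print(f"Lag-1 autocorrelation of residuals:   {lag1_residuals:.3f}")
print(f"Lag-1 autocorrelation of white noise: {lag1_noise:.3f}")
```

Residuals from an adequate model should look like the white-noise case, with autocorrelation near zero; a value close to 1 means the model left a systematic pattern unexplained.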

---

## **18.6 Model Capacity**

Model capacity refers to a model's ability to fit a variety of functions. High-capacity models (e.g., deep neural networks, ensembles of deep decision trees) can approximate very complex relationships. Low-capacity models (e.g., linear regression) are restricted to functions that are linear in the input features.

### **18.6.1 Matching Capacity to Data**

The optimal model capacity depends on:
- **Amount of data:** With more data, we can safely use higher-capacity models.
- **Noise level:** In noisy environments (like financial markets), high-capacity models are more likely to overfit.
- **Problem complexity:** If the true relationship is simple, a low-capacity model suffices.

```python
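# Demonstrate increasing model capacity
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def scaled_mlp(hidden_layer_sizes):
    """MLP with standardized inputs and target; networks train poorly on raw price scales."""
    net = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=hidden_layer_sizes, max_iter=1000, random_state=42)
    )
    return TransformedTargetRegressor(regressor=net, transformer=StandardScaler())

capacities = [
    ('Linear Regression', LinearRegression()),
    ('Small NN (10 neurons)', scaled_mlp((10,))),
    ('Large NN (100,50 neurons)', scaled_mlp((100, 50)))
]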

results = {}
for name, model in capacities:
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    results[name] = {'Train R²': train_score, 'Test R²': test_score}

results_df = pd.DataFrame(results).T
print(results_df)
```

**Explanation:**

- Linear Regression has the lowest capacity. It may underfit if the relationship is non-linear.
- The small neural network has moderate capacity. It may capture some non-linearity without overfitting.
- The large neural network has high capacity. It might overfit, especially with limited NEPSE data.
- By comparing train and test R², we can see which capacity level generalizes best.

### **18.6.2 Capacity Control in Practice**

We control capacity through:
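- **Model choice:** As a rough guide, linear models < tree-based models < neural networks, although the ordering depends on the hyperparameters (a very shallow tree can have less capacity than a linear model with many features).
- **Hyperparameters:** Tree depth, number of boosting rounds, neural network layers/neurons, regularization strength. Adding trees to a Random Forest mainly reduces variance rather than increasing capacity.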
- **Feature engineering:** More features increase effective capacity; feature selection reduces it.

For the NEPSE system, start with a modest capacity (e.g., Random Forest with depth 5-10) and increase only if validation performance improves.

---

## **18.7 Training vs. Test Performance**

Understanding the difference between training and test performance is crucial for building reliable models.

### **18.7.1 The Training-Test Split**

In time-series, the standard random train-test split is invalid because it would use future data to predict the past. Instead, we must use a temporal split: train on earlier data, test on later data.

```python
# Temporal split for NEPSE data
split_date = pd.Timestamp('2023-06-01')  # example cutoff

# One frame holding the date, all features (including lags) and the target
lag_cols = ['Close_Lag1', 'Volume_Lag1']
data = pd.concat([df[['Date']], df_features, df['Target']], axis=1).dropna()

train = data[data['Date'] < split_date]
test = data[data['Date'] >= split_date]

print(f"Train period: {train['Date'].min()} to {train['Date'].max()}")
print(f"Train samples: {len(train)}")
print(f"Test period: {test['Date'].min()} to {test['Date'].max()}")
print(f"Test samples: {len(test)}")

# Features and targets for both sets
X_train = train[feature_cols + lag_cols]
y_train = train['Target']
X_test = test[feature_cols + lag_cols]
y_test = test['Target']
```

**Explanation:**

- This split respects the temporal order: the model is trained on past data and evaluated on future data, simulating real deployment.
- The cutoff date should be chosen so that the test period is representative of future conditions (e.g., includes both bull and bear phases).

### **18.7.2 Interpreting Performance Metrics**

- **High training accuracy, low test accuracy:** Overfitting. The model has memorized the training data but not learned generalizable patterns.
- **Low training and test accuracy:** Underfitting. The model is too simple.
- **Both high and close:** Good generalization – the sweet spot.

For the NEPSE system, we might see training R² of 0.95 and test R² of 0.30 on a complex model – a classic overfitting signal. A simpler model might show 0.60 and 0.55, which is much better for actual deployment.

---

## **18.8 Generalization**

Generalization is the model's ability to perform well on unseen data. It is the ultimate goal of machine learning. In financial forecasting, generalization is particularly challenging because markets evolve – a model that generalizes well must capture stable relationships, not ephemeral patterns.

### **18.8.1 Factors Affecting Generalization**

- **Data quality:** Clean, representative data leads to better generalization.
- **Feature relevance:** Irrelevant features hurt generalization.
- **Model complexity:** Appropriate capacity (not too high, not too low).
- **Regularization:** Techniques that constrain the model to prefer simpler explanations.
- **Training regime:** Enough data to learn patterns, not just noise.

### **18.8.2 Testing Generalization**

We can test generalization by evaluating the model on:
- **Out-of-time data:** A later period not used in training.
- **Out-of-sample data:** Different stocks (if we trained on some stocks, test on others).
- **Cross-validation:** Time-series CV gives multiple estimates of generalization.

```python
# Time-series cross-validation to estimate generalization
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = []

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
    
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train_fold, y_train_fold)
    score = model.score(X_val_fold, y_val_fold)
    scores.append(score)
    print(f"Fold {fold+1} validation R²: {score:.4f}")

print(f"\nAverage validation R²: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")
```

**Explanation:**

- `TimeSeriesSplit` creates expanding windows: each fold uses more training data and tests on a subsequent period.
- The average score across folds gives a robust estimate of how the model will generalize to new, unseen time periods.
- The standard deviation tells us about stability – high variability suggests the model is sensitive to the specific test period.
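
Printing the fold indices on a tiny array makes the expanding-window behaviour visible. This is a toy illustration with eight placeholder observations, not NEPSE data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(8).reshape(-1, 1)  # eight consecutive "days"

folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X_toy)
]

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx} validate={val_idx}")
# Fold 1: train=[0, 1] validate=[2, 3]
# Fold 2: train=[0, 1, 2, 3] validate=[4, 5]
# Fold 3: train=[0, 1, 2, 3, 4, 5] validate=[6, 7]
```

Each training window starts at the first observation and grows, and each validation window lies strictly after it, so no fold ever trains on data from its own future.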

---

## **18.9 Model Selection Strategy**

Model selection is the process of choosing the best model among candidates. For the NEPSE prediction system, we need a systematic strategy that balances performance, interpretability, and robustness.

### **18.9.1 Steps in Model Selection**

1. **Define evaluation metric:** Choose a metric that aligns with business goals (e.g., RMSE for price prediction, accuracy for direction, or a custom financial metric like Sharpe ratio).

2. **Create candidate models:** Include diverse families – linear, tree-based, neural networks, statistical.

3. **Use time-series cross-validation:** Evaluate each model on multiple train-test splits.

4. **Compare performance:** Use the mean and variance of the chosen metric across folds.

5. **Consider complexity:** Simpler models are preferable if performance is similar (Occam's razor).

6. **Check for stability:** Model coefficients or feature importances should be reasonably consistent across folds.

7. **Final test:** After selecting a model, evaluate it on a completely held-out final test set (the most recent data) to confirm generalization.

```python
# Example model selection framework
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define candidate models with hyperparameter grids
models_to_try = {
    'Ridge': {
        'model': Ridge(),
        'params': {'alpha': [0.1, 1.0, 10.0]}
    },
    'Random Forest': {
        'model': RandomForestRegressor(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'max_depth': [5, 10, None]
        }
    },
    'Gradient Boosting': {
        'model': GradientBoostingRegressor(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'learning_rate': [0.05, 0.1],
            'max_depth': [3, 5]
        }
    }
}

# Use TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=3)

best_models = {}

for name, config in models_to_try.items():
    print(f"\nTuning {name}...")
    grid = GridSearchCV(
        config['model'],
        config['params'],
        cv=tscv,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    grid.fit(X_train, y_train)  # tune on the training period only; X_test stays untouched for the final check
    best_models[name] = grid.best_estimator_
    print(f"Best params: {grid.best_params_}")
    print(f"Best CV score (negative MSE): {grid.best_score_:.4f}")

# Compare on a separate test set (final evaluation)
test_scores = {}
for name, model in best_models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    test_scores[name] = mse

print("\nFinal Test MSE:")
for name, mse in sorted(test_scores.items(), key=lambda x: x[1]):
    print(f"  {name}: {mse:.4f}")
```

**Explanation:**

- We define a dictionary of models and their hyperparameter grids.
- `GridSearchCV` with `TimeSeriesSplit` performs hyperparameter tuning and cross-validation in one go. Note: For a completely unbiased estimate, we should use nested cross-validation (an inner loop for tuning, outer loop for evaluation), but that's computationally expensive.
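- After tuning, we evaluate the best models on a final test set. This gives a realistic estimate of production performance only if the test rows were excluded from the data passed to `grid.fit`.
- Prefer choosing the final model from the cross-validation scores and using the test set once, to confirm the choice. Picking the winner by test MSE turns the test set into another validation set and makes its score optimistic. Interpretability and stability should also weigh in the decision.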

### **18.9.2 Avoiding Common Pitfalls**

- **Lookahead:** Ensure all preprocessing (scaling, feature engineering) is done within each CV fold using only training data.
- **Multiple comparisons:** If you try many models, the best one might outperform by chance. Use a validation set or nested CV to get unbiased performance estimates.
- **Overfitting to CV:** If you tune too aggressively, you might overfit to the validation folds. Keep the search space reasonable.
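
The lookahead pitfall is easiest to avoid by putting preprocessing inside a `Pipeline`, so that the scaler is refit on the training part of every fold. The sketch below assumes synthetic data (`X_demo`, `y_demo` are placeholders) and a Ridge model chosen only for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100, scale=20, size=(300, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0, 0.1, 0.0]) + rng.normal(0, 1, size=300)

# The scaler lives inside the pipeline, so each CV fold fits it on that fold's
# training rows only; validation rows never influence the scaling statistics.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

search = GridSearchCV(
    pipe,
    {"model__alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=4),
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(f"Best alpha: {search.best_params_['model__alpha']}")
print(f"Best CV score (negative MSE): {search.best_score_:.4f}")
```

Scaling the full dataset before calling `GridSearchCV` would instead leak the mean and standard deviation of later periods into every training fold.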

---

## **Chapter Summary**

In this chapter, we established the machine learning foundations necessary for building time-series prediction systems, using the NEPSE stock prediction as our guiding example.

**Key concepts covered:**

- **Supervised learning framework:** Transforming time-series into a supervised format by creating lagged features and future targets.
- **Problem types:** Regression (predicting exact prices), classification (predicting direction), and multi-step forecasting (direct vs. recursive approaches).
- **No-Free-Lunch theorem:** No single model is universally best; we must experiment and compare multiple algorithms on our specific NEPSE dataset.
- **Bias-variance tradeoff:** The tension between underfitting (high bias) and overfitting (high variance). We saw how model complexity affects this balance.
- **Overfitting and underfitting:** Practical signs and diagnostic tools like learning curves and residual analysis.
- **Model capacity:** Matching model complexity to data size and problem difficulty.
- **Training vs. test performance:** The importance of temporal splits and the interpretation of performance gaps.
- **Generalization:** The ultimate goal – models must perform well on unseen future data, not just past data.
- **Model selection strategy:** A systematic approach using time-series cross-validation, hyperparameter tuning, and final hold-out evaluation.

### **Practical Takeaways for the NEPSE System:**

1. Always use temporal splits – never random splits – when evaluating time-series models.
2. Start with simple models (linear regression, shallow trees) as baselines.
3. Gradually increase complexity, monitoring the gap between train and test performance.
4. Use time-series cross-validation to get robust estimates of generalization.
5. Compare multiple model families – what works for US stocks may not work for NEPSE.
6. Be wary of overfitting: in financial markets, yesterday's patterns may not repeat tomorrow.

With these fundamentals in place, we are ready to dive into the specifics of model development. In **Chapter 19: Defining Prediction Targets**, we will explore how to design target variables that align with trading goals, whether we want to predict exact prices, directional moves, or risk-adjusted returns.

---

**End of Chapter 18**

