# Chapter 35: Hyperparameter Tuning

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the role of hyperparameters in machine learning models and their impact on performance
- Distinguish between model parameters (learned during training) and hyperparameters (set before training)
- Perform manual hyperparameter tuning by systematically varying key parameters
- Implement grid search and random search using scikit‑learn with time‑series cross‑validation
- Apply Bayesian optimization techniques (e.g., Gaussian Processes, Tree‑structured Parzen Estimator) for more efficient tuning
- Use libraries like Hyperopt, Optuna, and scikit‑optimize for automated hyperparameter optimization
- Implement multi‑fidelity optimization (e.g., successive halving, Hyperband) to speed up tuning
- Understand early stopping in the context of hyperparameter tuning for neural networks
- Avoid common pitfalls such as overfitting to the validation set and data leakage during tuning
- Develop a practical hyperparameter tuning strategy for the NEPSE prediction system

---

## **35.1 Introduction to Hyperparameters**

In machine learning, we distinguish between two types of parameters:

- **Model parameters:** These are learned from the data during training (e.g., weights in a neural network, coefficients in linear regression, split points in a decision tree).
- **Hyperparameters:** These are set before training and control the learning process and model architecture (e.g., learning rate, number of trees in a random forest, depth of a tree, regularization strength).
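
The distinction can be made concrete with a small sketch. It assumes synthetic data (not the NEPSE features) and uses Ridge regression: `alpha` is the hyperparameter we choose, while `coef_` holds the parameters the model learns for that choice.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# alpha is a hyperparameter: we set it before calling fit
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=1000.0).fit(X, y)

# coef_ holds model parameters: they are learned from the data
print("alpha=0.01 ->", weak.coef_)
print("alpha=1000 ->", strong.coef_)
```

The stronger penalty shrinks the learned coefficients toward zero, which shows how a hyperparameter shapes the parameters that training produces.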

Choosing the right hyperparameters is crucial for model performance. Too simple a model (e.g., shallow tree) may underfit; too complex (e.g., deep tree with many leaves) may overfit. Hyperparameter tuning is the process of searching for the combination of hyperparameters that yields the best generalization performance on unseen data.

For the NEPSE prediction system, we will encounter hyperparameters in almost every model:

- **Linear models:** regularization strength `alpha` in Ridge, Lasso, ElasticNet.
- **Tree‑based models:** `max_depth`, `min_samples_split`, `n_estimators`, `learning_rate` (for boosting).
- **Neural networks:** number of layers, number of units per layer, dropout rate, learning rate, batch size.
- **Support vector machines:** `C`, `gamma`, kernel choice.
- **Statistical models (ARIMA):** `p`, `d`, `q` orders.

Tuning these hyperparameters can significantly improve forecast accuracy, but it must be done carefully to avoid overfitting to the validation data.

---

## **35.2 Hyperparameter Types**

Hyperparameters can be categorized as:

- **Model hyperparameters:** related to the architecture (e.g., number of trees, depth).
- **Training hyperparameters:** related to the optimization (e.g., learning rate, batch size, number of epochs).
- **Regularization hyperparameters:** control overfitting (e.g., dropout rate, L2 penalty).

Some hyperparameters are continuous (e.g., learning rate), others are discrete (e.g., number of layers), and some are categorical (e.g., kernel type). This influences the choice of search method.
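
As a rough sketch of how the three kinds of hyperparameter translate into a search space, the snippet below samples from one continuous, one discrete, and one categorical dimension with scikit‑learn's `ParameterSampler`. The names `learning_rate`, `n_layers`, and `kernel` are illustrative only and do not belong to a single real estimator.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space mixing the three hyperparameter kinds
space = {
    "learning_rate": loguniform(1e-4, 1e-2),  # continuous, sampled on a log scale
    "n_layers": randint(1, 4),                # discrete: 1, 2, or 3 (upper bound exclusive)
    "kernel": ["linear", "rbf", "poly"],      # categorical: a list is sampled uniformly
}

samples = list(ParameterSampler(space, n_iter=5, random_state=0))
for s in samples:
    print(s)
```

Grid search would need each continuous range cut into a handful of values, whereas samplers like this one (and the Bayesian methods later in the chapter) can draw from the range directly.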

---

## **35.3 Manual Tuning**

Manual tuning involves changing hyperparameters by hand based on intuition, experience, or trial and error. It is feasible when the number of hyperparameters is small and the search space is well understood. For example, we might try a few values of `max_depth` for a random forest and observe validation performance.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Assume X_train, y_train are pandas objects sorted in chronological order
tscv = TimeSeriesSplit(n_splits=3)

depths = [3, 5, 7, 10]
best_depth = None
best_score = -np.inf

for depth in depths:
    scores = []
    for train_idx, val_idx in tscv.split(X_train):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
        
        model = RandomForestRegressor(max_depth=depth, n_estimators=100, random_state=42)
        model.fit(X_tr, y_tr)
        score = model.score(X_val, y_val)  # R²
        scores.append(score)
    avg_score = np.mean(scores)
    if avg_score > best_score:
        best_score = avg_score
        best_depth = depth

print(f"Best max_depth: {best_depth} with average R²: {best_score:.4f}")
```

**Explanation:**  
Manual tuning is simple but becomes impractical when the number of hyperparameters grows. It also relies on the practitioner's skill and may miss optimal combinations.

---

## **35.4 Grid Search**

Grid search is an exhaustive search over a manually specified subset of the hyperparameter space. For each combination of hyperparameters, the model is evaluated (usually via cross‑validation) and the best combination is selected.

### **35.4.1 Grid Search with Time‑Series Cross‑Validation**

Scikit‑learn's `GridSearchCV` can be used with a custom cross‑validator like `TimeSeriesSplit`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor

# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10]
}

# Create model
rf = RandomForestRegressor(random_state=42)

# TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)

# Grid search
grid_search = GridSearchCV(rf, param_grid, cv=tscv, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.4f}")
```

**Explanation:**  
Grid search evaluates every combination in the grid. With `n_splits=3` and 4×3×3 = 36 combinations, that's 108 cross‑validation fits, plus one final refit of the best combination on all of `X_train` (the default `refit=True`). The cost rises quickly as the grid grows.
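
Before launching a grid search it helps to count the fits it will trigger. A minimal sketch using scikit‑learn's `ParameterGrid` on the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'n_estimators': [50, 100, 200],
    'min_samples_split': [2, 5, 10],
}

n_combinations = len(ParameterGrid(param_grid))  # every combination in the grid
n_splits = 3                                     # folds used by TimeSeriesSplit
n_cv_fits = n_combinations * n_splits            # fits during cross-validation

print(f"{n_combinations} combinations x {n_splits} folds = {n_cv_fits} fits")
```

Adding a fourth hyperparameter with five candidate values would multiply both numbers by five, which is the exponential growth listed under the limitations.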

**Limitations:**  
- Curse of dimensionality: number of combinations grows exponentially with the number of hyperparameters.
- Continuous parameters must be discretized, potentially missing the optimum.

---

## **35.5 Random Search**

Random search samples hyperparameter combinations from a distribution. It is more efficient than grid search when some hyperparameters are more important than others, because it explores a wider range of values for each hyperparameter with the same budget.
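
The following sketch illustrates that argument with two hypothetical hyperparameters `a` and `b` and a budget of nine trials. A 3×3 grid tries only three distinct values of each hyperparameter, while nine random draws try nine.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Nine trials arranged on a 3 x 3 grid
grid = list(ParameterGrid({"a": [0.1, 0.5, 0.9], "b": [0.1, 0.5, 0.9]}))

# Nine trials drawn at random from the same [0, 1] ranges
random_trials = list(
    ParameterSampler({"a": uniform(0, 1), "b": uniform(0, 1)}, n_iter=9, random_state=0)
)

distinct_grid_a = len({t["a"] for t in grid})
distinct_random_a = len({t["a"] for t in random_trials})
print(f"Distinct values of 'a': grid={distinct_grid_a}, random={distinct_random_a}")
```

If only `a` matters for performance, the grid has effectively spent nine fits to learn about three settings, whereas random search has learned about nine.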

### **35.5.1 Implementing Random Search**

Scikit‑learn provides `RandomizedSearchCV` which samples a fixed number of combinations from specified distributions.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define parameter distributions
param_dist = {
    'max_depth': randint(3, 15),           # integers 3..14 (upper bound is exclusive)
    'n_estimators': randint(50, 300),      # integers 50..299
    'min_samples_split': randint(2, 20),   # integers 2..19
    'max_features': uniform(0.5, 0.5)      # uniform(loc, scale): between 0.5 and 1.0
}

# Random search
random_search = RandomizedSearchCV(rf, param_dist, n_iter=50, cv=tscv, 
                                   scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-random_search.best_score_):.4f}")
```

**Explanation:**  
We specify distributions (e.g., `randint` for integers, `uniform` for floats). The search runs for `n_iter=50` iterations, far fewer than an exhaustive grid over the same ranges: the three integer hyperparameters alone give 12 × 250 × 18 = 54,000 combinations, before `max_features` is discretized. Random search is often more effective than grid search, especially when only a few hyperparameters matter.

---

## **35.6 Bayesian Optimization**

Bayesian optimization builds a probabilistic model of the objective function (e.g., validation RMSE) and uses it to select the most promising hyperparameters to evaluate next. It balances exploration (trying new areas) and exploitation (focusing on areas known to be good).

### **35.6.1 Gaussian Processes**

One common approach uses Gaussian Processes to model the objective function. The expected improvement (EI) acquisition function guides the next sample.
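
To make the idea concrete, here is a minimal sketch of a single Bayesian optimization step written by hand with scikit‑learn's `GaussianProcessRegressor`. The one‑dimensional `toy_objective`, the three starting points, and the exploration margin `xi` are all assumptions chosen for illustration; real libraries handle these details internally.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """EI for minimization: expected amount by which a candidate beats best_so_far."""
    sigma = np.maximum(sigma, 1e-12)       # avoid division by zero at observed points
    improvement = best_so_far - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

def toy_objective(x):
    # Stand-in for "validation RMSE as a function of one hyperparameter"
    return (x - 0.3) ** 2

# Three configurations evaluated so far
X_obs = np.array([[0.0], [0.5], [1.0]])
y_obs = toy_objective(X_obs).ravel()

# Surrogate model of the objective
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, random_state=0)
gp.fit(X_obs, y_obs)

# Score a dense set of candidates and pick the one with the highest EI
candidates = np.linspace(0, 1, 101).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
ei = expected_improvement(mu, sigma, best_so_far=y_obs.min())
next_x = float(candidates[np.argmax(ei), 0])
print(f"Next hyperparameter value to evaluate: {next_x:.2f}")
```

EI is large where the surrogate predicts a low loss (exploitation) or where its uncertainty is high (exploration). A full optimizer would evaluate `next_x`, add the result to the observations, and repeat.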

Libraries like `scikit-optimize` (skopt) provide an easy interface.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Define search spaces
search_spaces = {
    'max_depth': Integer(3, 15),
    'n_estimators': Integer(50, 300),
    'min_samples_split': Integer(2, 20),
    'max_features': Real(0.5, 1.0)
}

# Bayesian search
bayes_search = BayesSearchCV(rf, search_spaces, n_iter=30, cv=tscv, 
                              scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
bayes_search.fit(X_train, y_train)

print(f"Best parameters: {bayes_search.best_params_}")
print(f"Best CV RMSE: {np.sqrt(-bayes_search.best_score_):.4f}")
```

**Explanation:**  
`BayesSearchCV` from `skopt` implements Bayesian optimization over hyperparameters, using a Gaussian Process surrogate by default. It often reaches good configurations in fewer evaluations than random search, which matters most when each model fit is expensive.

### **35.6.2 Tree‑structured Parzen Estimator (TPE)**

TPE is another Bayesian optimization method, popularized by the Hyperopt library. It models the density of good and bad configurations separately.

```python
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK, space_eval
from sklearn.model_selection import cross_val_score

# Define search space
space = {
    'max_depth': hp.choice('max_depth', [3, 5, 7, 10, 15]),
    'n_estimators': hp.choice('n_estimators', [50, 100, 200, 300]),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10, 20]),
    'max_features': hp.uniform('max_features', 0.5, 1.0)
}

def objective(params):
    model = RandomForestRegressor(**params, random_state=42)
    # Use TimeSeriesSplit for CV
    scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-np.mean(scores))
    return {'loss': rmse, 'status': STATUS_OK}  # Hyperopt minimizes 'loss'

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)

# For hp.choice, fmin returns the index of the chosen option, not its value.
# space_eval maps the result back to the actual hyperparameter values.
best_params = space_eval(space, best)
print(f"Best hyperparameters: {best_params}")
```

**Explanation:**  
Hyperopt allows flexible search spaces and can handle conditional hyperparameters (e.g., different kernel choices). TPE often performs well in practice.

---

## **35.7 Multi‑Fidelity Optimization**

Multi‑fidelity methods speed up hyperparameter tuning by evaluating unpromising configurations with lower fidelity (e.g., fewer training iterations, smaller subset of data) and only fully evaluating promising ones.

### **35.7.1 Successive Halving**

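Successive halving allocates a small budget to each configuration in a set, evaluates them, keeps the best half (more generally, the best 1/η fraction), and increases the budget for the survivors. The process repeats until one configuration remains. This is the basis of **Hyperband**.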
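
The loop can be sketched in a few lines of plain Python. The `toy_loss` function and the candidate learning rates below are hypothetical stand‑ins for "train this configuration with this budget and return its validation loss"; `eta` is the reduction factor (2 keeps the best half).

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configurations evaluated) per round
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))  # lower loss first
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], history

def toy_loss(cfg, budget):
    # Hypothetical loss: best near lr=0.1, and it improves as the budget grows
    return abs(cfg["lr"] - 0.1) + 1.0 / budget

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0)]
best, history = successive_halving(configs, toy_loss)

print("Best configuration:", best)
for budget, n in history:
    print(f"budget={budget}: evaluated {n} configurations")
```

Most of the compute goes to the few configurations that survive, while poor ones are discarded after only a cheap evaluation. The risk is that a configuration that learns slowly but ends up strong is eliminated in an early round.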

### **35.7.2 Hyperband**

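Hyperband combines random search with adaptive resource allocation: it samples configurations at random and runs successive halving several times, each time with a different trade‑off between the number of configurations and the budget each one starts with. It is particularly effective for deep learning where training time is long.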

```python
# Example with keras-tuner (for neural networks).
# Assumes X_train, y_train, X_val, y_val are chronological splits prepared earlier.
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential()
    model.add(layers.Dense(units=hp.Int('units_1', 32, 256, step=32), activation='relu'))
    model.add(layers.Dropout(hp.Float('dropout_1', 0.0, 0.5, step=0.1)))
    model.add(layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(hp.Float('lr', 1e-4, 1e-2, sampling='log')),
                  loss='mse')
    return model

# Stop a trial once validation loss has not improved for 5 epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

tuner = kt.Hyperband(build_model, objective='val_loss', max_epochs=50, factor=3, directory='my_dir')
tuner.search(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])
best_hps = tuner.get_best_hyperparameters(1)[0]
```

**Explanation:**  
Hyperband trains many configurations for a few epochs, then promotes the best ones to longer training. This is efficient for deep learning.

---

## **35.8 Early Stopping in Tuning**

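When tuning neural networks, we often use early stopping to prevent overfitting. However, during hyperparameter search, we must be careful: if we stop early based on validation loss, that loss becomes the objective we optimize. This is acceptable as long as we use the same validation set for all trials. Because the validation set now influences both the stopping epoch and the choice of hyperparameters, its loss is an optimistic estimate, so we must still have a separate test set for final evaluation.

`GridSearchCV` and `RandomizedSearchCV` do not accept callbacks themselves. Keyword arguments given to the search's `fit` method are forwarded to the estimator's `fit`, so early stopping is available only when the estimator supports it (for example, a Keras model wrapped in a scikit‑learn compatible class). The search then scores each configuration as it stood when training stopped. This is fine.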

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5)

# Inside a custom objective function for Hyperopt, we can include early stopping
def objective_nn(params):
    model = create_model(params)  # build model with given hyperparameters
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        epochs=100, callbacks=[early_stop], verbose=0)
    best_val_loss = min(history.history['val_loss'])
    return {'loss': best_val_loss, 'status': STATUS_OK}
```

---

## **35.9 Practical Considerations for NEPSE**

### **35.9.1 Time‑Series Cross‑Validation in Tuning**

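Always use time‑series cross‑validation (e.g., `TimeSeriesSplit`) inside the tuning loop. Randomly shuffled folds are invalid for time series because they let the model train on observations that come after the validation period. With `GridSearchCV`, pass `cv=TimeSeriesSplit()`.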
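
A quick way to confirm that no fold looks into the future is to print the indices `TimeSeriesSplit` produces. The sketch below uses twelve dummy rows in place of the real NEPSE data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy observations in chronological order
X_demo = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

In every fold the training indices end before the validation indices begin, and the training window grows from one fold to the next.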

### **35.9.2 Nested Cross‑Validation for Unbiased Performance**

If you want an unbiased estimate of the final model's performance after tuning, use nested CV (as in Chapter 34). The outer loop splits data, inner loop tunes hyperparameters. This gives a realistic estimate of how the model will perform on new data.
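
A minimal sketch of nested time‑series CV follows. It assumes synthetic data and a Ridge model so that it runs quickly; the structure stays the same if `GridSearchCV` is replaced by any other search from this chapter.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic data standing in for chronologically ordered features and targets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=300)

outer_cv = TimeSeriesSplit(n_splits=4)  # estimates performance
inner_cv = TimeSeriesSplit(n_splits=3)  # tunes hyperparameters
outer_rmse = []

for train_idx, test_idx in outer_cv.split(X):
    # Tuning sees only the outer training window
    search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                          cv=inner_cv, scoring="neg_mean_squared_error")
    search.fit(X[train_idx], y[train_idx])

    # The outer test window was never used to choose alpha
    preds = search.predict(X[test_idx])
    outer_rmse.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

print(f"Nested CV RMSE: {np.mean(outer_rmse):.4f} ± {np.std(outer_rmse):.4f}")
```

Each outer fold may select a different `alpha`. The averaged score describes the whole tuning procedure rather than one fixed set of hyperparameters.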

### **35.9.3 Computational Budget**

For the NEPSE dataset (a few thousand rows), random search with 50‑100 iterations is often sufficient. Bayesian optimization can reduce that number. For neural networks, use Hyperband to save time.

### **35.9.4 Hyperparameter Spaces**

Define reasonable ranges based on domain knowledge:

- **Tree depth:** For small datasets, depth > 10 may overfit.
- **Learning rate:** Log‑uniform between 1e-4 and 1e-2 for neural network optimizers; boosted trees usually need larger values (roughly 0.01 to 0.3).
- **Regularization:** Try small values first (e.g., 0.001, 0.01, 0.1).
- **Number of estimators:** Start with 100, go up to 500 if not overfitting.

### **35.9.5 Avoiding Overfitting to Validation**

The more you search, the higher the chance of finding a combination that performs well on the validation set by chance. To mitigate:

- Use a separate test set that is never used during tuning.
- Use nested CV.
- Limit the search space and number of trials.
- Report performance on the test set only after final model selection.
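
The effect is easy to simulate. In the sketch below every "configuration" is assumed to have a true score of zero, and the validation and test scores are pure noise with an arbitrary standard deviation of 0.05. Any apparent improvement therefore comes from selection alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 1000

# Every configuration has the same true score (0); only the noise differs
val_scores = rng.normal(loc=0.0, scale=0.05, size=n_trials)
test_scores = rng.normal(loc=0.0, scale=0.05, size=n_trials)

# Best validation score found after the first k trials
best_val_so_far = np.maximum.accumulate(val_scores)
winner = int(np.argmax(val_scores))

for k in (10, 100, 1000):
    print(f"Best validation score after {k:>4} trials: {best_val_so_far[k - 1]:.3f}")
print(f"Test score of the selected configuration: {test_scores[winner]:.3f}")
```

The best validation score can only rise as more trials are run, even though no configuration is truly better than another. The held‑out test score of the winner is not subject to that selection effect, which is why it is the number to report.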

---

## **35.10 Chapter Summary**

In this chapter, we covered the essential techniques for hyperparameter tuning in machine learning, with applications to the NEPSE prediction system.

- **Hyperparameters** are settings that control model architecture and training; they must be tuned to optimize performance.
- **Manual tuning** is simple but limited.
- **Grid search** exhaustively evaluates all combinations but is inefficient for many hyperparameters.
- **Random search** samples from distributions and is more efficient.
- **Bayesian optimization** (Gaussian Processes, TPE) models the objective and focuses on promising regions.
- **Multi‑fidelity methods** (successive halving, Hyperband) speed up tuning by stopping poor configurations early.
- **Early stopping** can be integrated into tuning for neural networks.
- **Practical considerations** include using time‑series CV, nested CV for unbiased estimates, and defining sensible search spaces.

### **Practical Takeaways for the NEPSE System:**

- For tree‑based models, start with random search (50‑100 iterations) using `RandomizedSearchCV` with `TimeSeriesSplit`.
- For neural networks, use Hyperband (via `keras-tuner`) to efficiently explore architectures.
- Always keep a separate test set for final evaluation after tuning.
- Use nested CV if you need an unbiased estimate of the tuned model's performance.
- Document the tuning process and the final chosen hyperparameters for reproducibility.

In the next chapter, **Chapter 36: Model Interpretation and Explainability**, we will explore methods to understand and explain model predictions, which is crucial for trust and debugging in financial applications.

---

**End of Chapter 35**

