# Chapter 51: Ensemble Methods

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the principles behind ensemble learning and why combining multiple models often yields better predictions
- Implement bagging methods, such as Random Forest, for time‑series classification or regression using the NEPSE dataset
- Apply boosting techniques (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost) and understand their strengths and weaknesses
- Construct stacking ensembles that combine diverse base models with a meta‑learner
- Use blending (holdout stacking) to create robust ensembles without overfitting
- Implement voting classifiers and regressors for simple averaging of predictions
- Appreciate the importance of diversity in ensemble members and how to achieve it
- Evaluate ensemble models and avoid common pitfalls like overfitting and excessive computational cost
- Apply ensemble methods to the NEPSE stock prediction problem and compare their performance

---

## Introduction

No single machine learning model is universally the best—this is the essence of the **No Free Lunch Theorem**. However, by combining multiple models, we can often achieve better and more robust predictions than any single model alone. This is the core idea behind **ensemble methods**. Ensembles work because different models make different errors; by averaging or voting, those errors can cancel out, leading to improved accuracy and stability.

For the NEPSE stock prediction system, ensemble methods are particularly valuable. Financial time series are noisy, and different models may capture different aspects of the data: linear models might capture trends, tree‑based models might capture non‑linear interactions, and neural networks might capture complex patterns. An ensemble can blend these strengths.

In this chapter, we will explore the main families of ensemble methods: **bagging**, **boosting**, **stacking**, and **voting**. We will implement them using Python libraries (scikit‑learn, XGBoost, etc.) and apply them to our NEPSE prediction task. By the end, you will be equipped to build powerful ensembles for your own time‑series problems.

---

## 51.1 Ensemble Learning Principles

Ensemble learning combines several base models (also called weak learners or base learners) to produce one final prediction. The key requirement for an ensemble to be effective is that the base models are **accurate** and **diverse**. Accuracy means each model performs better than random guessing. Diversity means the models make different errors on different instances. If all models make the same errors, the ensemble will not improve.

There are two main ways to create diversity:

1. **Using different training data** (e.g., bootstrap samples in bagging).
2. **Using different model architectures or hyperparameters** (e.g., different algorithms in stacking).

Ensembles can be used for both classification and regression. For classification, common combination rules are **majority voting** (hard voting) or **averaging predicted probabilities** (soft voting). For regression, we typically average the predictions.
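
The error-cancelling effect of majority voting can be illustrated with a small simulation. This is a minimal sketch under an idealised assumption: three hypothetical classifiers that are each correct 60% of the time and whose errors are fully independent. Real models are correlated, so the gain in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 100_000, 3, 0.6

# correct[i, j] is True when (hypothetical) model j classifies sample i correctly.
# Errors are drawn independently, which is the idealised case.
correct = rng.random((n_samples, n_models)) < p_correct

individual_acc = correct.mean(axis=0)
# A majority vote is right whenever at least 2 of the 3 models are right
majority_acc = (correct.sum(axis=1) >= 2).mean()

print("Individual accuracies:", individual_acc.round(3))
print("Majority-vote accuracy:", round(float(majority_acc), 3))
```

With independent errors the vote is correct with probability 3(0.6²)(0.4) + 0.6³ = 0.648, which the simulation should approximate. As the models' errors become more correlated, this advantage shrinks towards zero.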

---

## 51.2 Bagging (Bootstrap Aggregating)

Bagging creates multiple models by training each on a different bootstrap sample (random sample with replacement) of the original training data. The final prediction is the average (for regression) or majority vote (for classification) of all models. Bagging mainly reduces variance while leaving bias roughly unchanged, making it especially effective for high‑variance models like decision trees.
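
The procedure can be sketched by hand in a few lines. The example below is illustrative only: it uses synthetic data from `make_classification` rather than the NEPSE features, and the names `X_demo`, `X_fit`, and `X_eval` are placeholders introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the NEPSE dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, y_fit = X_demo[:400], y_demo[:400]
X_eval, y_eval = X_demo[400:], y_demo[400:]

rng = np.random.default_rng(0)
n_trees = 25
votes = np.zeros((n_trees, len(X_eval)), dtype=int)

for t in range(n_trees):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    tree = DecisionTreeClassifier(random_state=t).fit(X_fit[idx], y_fit[idx])
    votes[t] = tree.predict(X_eval)

# Majority vote across trees (n_trees is odd, so there are no ties)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
single_acc = (votes[0] == y_eval).mean()
bagged_acc = (bagged_pred == y_eval).mean()
print(f"Single tree: {single_acc:.3f}  Bagged: {bagged_acc:.3f}")
```

In practice `sklearn.ensemble.BaggingClassifier` wraps this loop; the manual version is shown only to make the bootstrap and voting steps explicit.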

### 51.2.1 Random Forest

Random Forest is an extension of bagging applied to decision trees. In addition to using bootstrap samples, it also randomly selects a subset of features at each split, further increasing diversity. This makes Random Forest one of the most popular and effective off‑the‑shelf machine learning methods.

**Applying Random Forest to NEPSE Data**

Assume we have already engineered features for each trading day and created a binary target: 1 if next day's closing price is higher than today's, else 0.

```python
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the feature‑engineered dataset (created in previous chapters)
df = pd.read_csv('nepse_features.csv', parse_dates=['Date'])
df = df.sort_values('Date')

# Define features and target
feature_cols = [col for col in df.columns if col not in ['Date', 'Symbol', 'Close', 'Target']]
X = df[feature_cols]
y = df['Target']  # 1 if next day up, 0 otherwise

# Time‑based split: train on data before 2024, test on 2024
train_mask = df['Date'] < '2024-01-01'
test_mask = df['Date'] >= '2024-01-01'
X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

# Feature importance
importances = pd.Series(rf.feature_importances_, index=feature_cols).sort_values(ascending=False)
print("\nTop 10 Important Features:")
print(importances.head(10))
```

**Explanation:**  
We create a Random Forest classifier with 100 trees. Setting `n_jobs=-1` uses all available CPU cores for parallel training. After fitting, we evaluate on the test set. The `feature_importances_` attribute tells us which features the forest relied on most (e.g., lagged returns, RSI, volume). This interpretability is a key advantage of tree‑based ensembles, although impurity‑based importances can overstate high‑cardinality or correlated features, so treat the ranking as a guide rather than proof of predictive value.

**Tuning Random Forest**

Hyperparameters to tune include `n_estimators`, `max_depth`, `min_samples_split`, `max_features`. Use `RandomizedSearchCV` or `GridSearchCV` with time‑series cross‑validation.
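
A randomised search with forward-chaining folds might look like the sketch below. The data is a synthetic stand-in and the search space is an arbitrary illustration, not a recommended grid; on the real dataset you would pass the time-ordered `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in for the time-ordered training data
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Illustrative search space (values chosen for the example only)
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                        # number of sampled configurations
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on data after its training window
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

`TimeSeriesSplit` assumes the rows are already sorted by time, which is why the dataset is sorted by `Date` before splitting.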

---

## 51.3 Boosting

Boosting builds models sequentially, where each new model tries to correct the errors of the previous ones. Whereas bagging mainly reduces variance, boosting primarily reduces bias; combined with a small learning rate and subsampling, it can reduce variance as well. Popular boosting algorithms include AdaBoost, Gradient Boosting, and their modern variants XGBoost, LightGBM, and CatBoost.

### 51.3.1 AdaBoost (Adaptive Boosting)

AdaBoost assigns weights to training instances. Initially, all weights are equal. After each weak learner (often a shallow decision tree), it increases the weights of misclassified instances, so the next learner focuses on them. The final prediction is a weighted vote of all learners.
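
One round of the reweighting step can be worked through numerically. The sketch below uses the textbook binary formulation of AdaBoost and a made-up outcome (3 of 10 instances misclassified); library implementations such as scikit-learn's may use a multi-class variant whose constants differ.

```python
import numpy as np

# Ten training instances with equal initial weights
weights = np.full(10, 0.1)
# Hypothetical outcome of the first weak learner: True marks a misclassified instance
misclassified = np.array([True, True, True] + [False] * 7)

# Weighted error rate and the learner's vote weight (alpha)
err = weights[misclassified].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Raise the weights of misclassified instances, lower the rest, then renormalise
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()

print(f"error = {err:.2f}, alpha = {alpha:.4f}")
print("new weights:", weights.round(4))
```

After the update, the misclassified instances together carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.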

**AdaBoost with scikit‑learn**

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

base_estimator = DecisionTreeClassifier(max_depth=1)  # stump
ada = AdaBoostClassifier(
    estimator=base_estimator,  # named `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)
y_pred_ada = ada.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, y_pred_ada):.4f}")
```

**Explanation:**  
AdaBoost with stumps (depth‑1 trees) often works well. The `learning_rate` controls the contribution of each weak learner. Lower values require more estimators but can generalise better.

### 51.3.2 Gradient Boosting Machines (GBM)

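Gradient boosting generalises AdaBoost by allowing any differentiable loss function. Each new model is fitted to the negative gradient of the loss with respect to the current ensemble's predictions (the pseudo‑residuals); for squared‑error loss these are exactly the **residuals** (errors) of the previous ensemble. scikit‑learn provides `GradientBoostingClassifier` and `GradientBoostingRegressor`.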

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    random_state=42
)
gbm.fit(X_train, y_train)
y_pred_gbm = gbm.predict(X_test)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gbm):.4f}")
```
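
To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch for squared-error regression, where the negative gradient equals the residual. It uses synthetic data and is not how `GradientBoostingClassifier` is implemented internally (the classifier optimises log-loss); the variable names are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate, n_rounds = 0.1, 50
prediction = np.full_like(y_demo, y_demo.mean())  # initial constant model
mse_history = []

for _ in range(n_rounds):
    # Residuals are the negative gradient of 0.5 * (y - f)^2 with respect to f
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Shrink each tree's contribution by the learning rate
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE after round 1: {mse_history[0]:.4f}")
print(f"Training MSE after round {n_rounds}: {mse_history[-1]:.4f}")
```

The training error falls with every round, which is also why boosting needs a stopping rule: given enough rounds it will eventually fit the noise.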

### 51.3.3 XGBoost

XGBoost (Extreme Gradient Boosting) is an optimised implementation of gradient boosting with additional features like regularisation, handling of missing values, and built‑in cross‑validation. It is widely used in Kaggle competitions and industry.

**XGBoost for NEPSE**

```python
import xgboost as xgb

# Create DMatrix (optimised data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 5,
    'eta': 0.1,          # learning rate
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'logloss',
    'seed': 42
}

# Train with early stopping
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
num_rounds = 1000
model = xgb.train(
    params,
    dtrain,
    num_rounds,
    evals=watchlist,
    early_stopping_rounds=50,
    verbose_eval=100
)

# Predict probabilities
y_pred_proba = model.predict(dtest)
y_pred = (y_pred_proba > 0.5).astype(int)
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Accuracy: {accuracy:.4f}")
```

**Explanation:**  
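XGBoost's `train` method accepts an evaluation set to monitor performance and stops early when the metric has not improved for `early_stopping_rounds` rounds, which limits overfitting. The parameters `subsample` and `colsample_bytree` introduce randomness, similar to random forest, to improve generalisation. Note that the code above monitors the test set for brevity, so the number of rounds is tuned on the same data used to report accuracy; for an unbiased estimate, hold out a separate validation period (for example, the last months of the training data) for early stopping.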
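
Early stopping needs an evaluation set, and keeping it separate from the final test set avoids tuning the number of rounds on the data used for the reported score. One simple option for time-ordered data is a chronological split of the training period. The helper below, `chronological_split`, is a hypothetical function introduced for illustration, demonstrated on a synthetic frame.

```python
import numpy as np
import pandas as pd

def chronological_split(X, y, val_fraction=0.2):
    """Split time-ordered data so the last `val_fraction` of rows form the validation set."""
    split_at = int(len(X) * (1 - val_fraction))
    return X.iloc[:split_at], X.iloc[split_at:], y.iloc[:split_at], y.iloc[split_at:]

# Synthetic stand-in for the time-ordered training frame
dates = pd.date_range("2023-01-01", periods=100, freq="D")
X_demo = pd.DataFrame({"feature": np.arange(100.0)}, index=dates)
y_demo = pd.Series(np.arange(100) % 2, index=dates)

X_fit, X_val, y_fit, y_val = chronological_split(X_demo, y_demo)
print(len(X_fit), len(X_val))
# Every validation row is later than every fitting row
print(X_fit.index.max() < X_val.index.min())

# With XGBoost, the validation pair would then take the place of the test set, e.g.
# evals=[(xgb.DMatrix(X_fit, label=y_fit), 'train'), (xgb.DMatrix(X_val, label=y_val), 'eval')]
```

The test set is then touched only once, for the final accuracy figure.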

**XGBoost with scikit‑learn API**

```python
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(
    n_estimators=1000,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    random_state=42
)
# The test set doubles as the early-stopping set for brevity; prefer a separate validation set
xgb_clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
y_pred_xgb = xgb_clf.predict(X_test)
```

### 51.3.4 LightGBM

LightGBM is another gradient boosting framework that uses histogram‑based algorithms and leaf‑wise tree growth for faster training and lower memory usage. On large datasets it is often faster to train than XGBoost, with broadly comparable accuracy.

```python
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1
}

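# LightGBM 4.x removed the early_stopping_rounds and verbose_eval arguments
# of lgb.train; callbacks provide the same behaviour
model_lgb = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[test_data],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50, verbose=False),
        lgb.log_evaluation(period=0),  # silence per-iteration logging
    ]
)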

y_pred_lgb = (model_lgb.predict(X_test) > 0.5).astype(int)
print(f"LightGBM Accuracy: {accuracy_score(y_test, y_pred_lgb):.4f}")
```

### 51.3.5 CatBoost

CatBoost is designed to handle categorical features automatically, but it also works well with numerical data. It uses ordered boosting to reduce overfitting.

```python
from catboost import CatBoostClassifier

# Named cat_clf so the variable is not confused with the catboost package
cat_clf = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=5,
    eval_metric='Accuracy',
    verbose=False,
    random_seed=42
)
cat_clf.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=50)
y_pred_cat = cat_clf.predict(X_test)
print(f"CatBoost Accuracy: {accuracy_score(y_test, y_pred_cat):.4f}")
```

---

## 51.4 Stacking (Stacked Generalization)

Stacking combines multiple base models (level‑0 models) and trains a meta‑model (level‑1 model) to make the final prediction using the base models' outputs as features. The base models are typically diverse (e.g., Random Forest, XGBoost, SVM). To avoid overfitting, the meta‑model is trained on out‑of‑fold predictions of the base models (generated with cross‑validation), never on predictions for samples a base model saw during fitting.

**Stacking with scikit‑learn**

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))  # need probabilities for meta-model
]

# Meta-model (logistic regression)
meta_model = LogisticRegression()

# Stacking ensemble
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5  # use 5-fold cross-validation to generate out-of-fold predictions
)
stack.fit(X_train, y_train)
y_pred_stack = stack.predict(X_test)
print(f"Stacking Accuracy: {accuracy_score(y_test, y_pred_stack):.4f}")
```

**Explanation:**  
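`StackingClassifier` automatically performs cross‑validation for each base model: it splits the training data into folds, trains on k‑1 folds, predicts on the held‑out fold, and uses those predictions as features for the meta‑model. Because every meta‑feature comes from a model that did not see that sample during fitting, the meta‑model learns from realistic rather than overly optimistic predictions, which reduces overfitting. The base models are then refitted on the full training set for use at prediction time. Note that an integer `cv` produces ordinary (non‑chronological) folds, so for time‑ordered data see the look‑ahead caveat in Section 51.10.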

**Custom Stacking Implementation (for deeper understanding)**

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def stacking_cv(base_models, meta_model, X, y, n_folds=5):
    """
    Generate out-of-fold predictions for base models and train meta-model.

    base_models is a list of (name, estimator) pairs; X and y are pandas objects.
    """
    # Shuffled folds are used here only to illustrate the mechanics; for
    # time-ordered data, use forward-chaining splits (see Section 51.10)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    # Array to hold out-of-fold predictions (samples x n_models)
    oof_preds = np.zeros((X.shape[0], len(base_models)))

    for i, (_, model) in enumerate(base_models):
        for train_idx, val_idx in skf.split(X, y):
            X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
            y_train_fold = y.iloc[train_idx]

            # Fit a fresh, unfitted copy so folds do not share state
            model_clone = clone(model)
            model_clone.fit(X_train_fold, y_train_fold)
            oof_preds[val_idx, i] = model_clone.predict_proba(X_val_fold)[:, 1]

    # Train meta-model on out-of-fold predictions
    meta_model.fit(oof_preds, y)

    # Now retrain base models on full training data (fits the passed estimators in place)
    for _, model in base_models:
        model.fit(X, y)

    return meta_model, base_models

# Usage (base_models is the list of (name, estimator) pairs defined above)
meta, base = stacking_cv(base_models, LogisticRegression(), X_train, y_train)

# For test set, get base model predictions and feed to meta-model
test_preds = np.column_stack([model.predict_proba(X_test)[:, 1] for _, model in base])
y_pred_stack_custom = meta.predict(test_preds)

**Explanation:**  
This manual implementation shows the inner workings of stacking. The out‑of‑fold predictions become the training set for the meta‑model. After the meta‑model is trained, we retrain each base model on the full training set so they can be used for test predictions.

---

## 51.5 Blending

Blending is a simpler variant of stacking. Instead of using cross‑validation, you hold out a portion of the training data (e.g., 10%) as a validation set. Base models are trained on the remaining data, then predict on the validation set. Those predictions become the training data for the meta‑model. The meta‑model is then trained, and base models are retrained on the full training set (or left as is).

**Blending example**

```python
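# Split training data chronologically: the most recent 10% becomes the blend set.
# shuffle=False keeps time order, so the base models never train on data
# that comes after the rows they are asked to predict for the meta-model.
X_train_base, X_blend, y_train_base, y_blend = train_test_split(
    X_train, y_train, test_size=0.1, shuffle=False
)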

# Train base models on train_base
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=42),
    XGBClassifier(n_estimators=100, random_state=42),
    SVC(probability=True, random_state=42)
]
for model in base_models:
    model.fit(X_train_base, y_train_base)

# Generate blend predictions
blend_preds = np.column_stack([model.predict_proba(X_blend)[:, 1] for model in base_models])

# Train meta-model on blend predictions
meta_model = LogisticRegression()
meta_model.fit(blend_preds, y_blend)

# Retrain base models on full training data (optional)
for model in base_models:
    model.fit(X_train, y_train)

# Test predictions
test_preds_base = np.column_stack([model.predict_proba(X_test)[:, 1] for model in base_models])
y_pred_blend = meta_model.predict(test_preds_base)
```

**Explanation:**  
Blending is faster than stacking because it requires only one training of base models (plus the retraining step). However, it uses less data to train the meta‑model, so it may be less robust than stacking with cross‑validation.

---

## 51.6 Voting Classifiers and Regressors

The simplest ensemble is voting: combine predictions from multiple models by averaging (for regression) or majority vote (for classification). Scikit‑learn provides `VotingClassifier` and `VotingRegressor`.

**Hard Voting (majority rule)**

```python
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=100, random_state=42)),
        ('svm', SVC(probability=False, random_state=42))  # hard voting needs no probabilities
    ],
    voting='hard'
)
voting_clf.fit(X_train, y_train)
y_pred_vote = voting_clf.predict(X_test)
print(f"Hard Voting Accuracy: {accuracy_score(y_test, y_pred_vote):.4f}")
```

**Soft Voting (average of probabilities)**

```python
voting_clf_soft = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=100, random_state=42)),
        ('svm', SVC(probability=True, random_state=42))
    ],
    voting='soft'
)
voting_clf_soft.fit(X_train, y_train)
y_pred_soft = voting_clf_soft.predict(X_test)
print(f"Soft Voting Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")
```

**Explanation:**  
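Soft voting often performs better because it takes into account the confidence of each model. It calls `predict_proba` on every member, so `SVC` must be created with `probability=True`; this fits an additional Platt‑scaling step with internal cross‑validation, which makes training noticeably slower.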
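
For regression targets (for example, predicting next-day return rather than direction), `VotingRegressor` averages the members' predictions. The sketch below uses synthetic data from `make_regression` and two arbitrarily chosen members; it also checks that the ensemble output equals the plain mean of its members.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

# Synthetic stand-in data (not the NEPSE dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_fit, X_eval = X_demo[:240], X_demo[240:]
y_fit = y_demo[:240]

voting_reg = VotingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]
)
voting_reg.fit(X_fit, y_fit)

ensemble_pred = voting_reg.predict(X_eval)
# Without weights, the ensemble output is the unweighted mean of the fitted members
member_preds = np.column_stack([est.predict(X_eval) for est in voting_reg.estimators_])
print(np.allclose(ensemble_pred, member_preds.mean(axis=1)))
```

Passing `weights=[...]` to `VotingRegressor` turns the plain mean into a weighted average when some members deserve more trust than others.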

---

## 51.7 Ensemble Diversity

Diversity is crucial. If all models are similar, the ensemble will not improve much. Ways to increase diversity:

- **Different algorithms** (trees, linear models, neural networks).
- **Different feature subsets** (feature bagging).
- **Different training data** (bagging, boosting).
- **Different hyperparameters** (e.g., shallow vs. deep trees).

You can measure diversity using metrics like **Yule's Q statistic** or **correlation of errors**. In practice, simply combining a few well‑performing but different models often works well.
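
Both measures can be computed directly from the models' predictions. The following is a minimal sketch with made-up labels and predictions for two hypothetical models; `yules_q` is a helper defined here, not a library function.

```python
import numpy as np

def yules_q(correct_a, correct_b):
    """Yule's Q for two boolean arrays marking where each model is correct.

    Values near 1 mean the models succeed and fail together; values near 0
    or below indicate diversity. Undefined when the denominator is zero.
    """
    n11 = np.sum(correct_a & correct_b)     # both correct
    n00 = np.sum(~correct_a & ~correct_b)   # both wrong
    n10 = np.sum(correct_a & ~correct_b)    # only A correct
    n01 = np.sum(~correct_a & correct_b)    # only B correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Made-up labels and predictions from two hypothetical models
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
pred_a = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
pred_b = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1])

correct_a = pred_a == y_true
correct_b = pred_b == y_true

q = yules_q(correct_a, correct_b)
# Correlation between the two models' error indicators
error_corr = np.corrcoef((~correct_a).astype(float), (~correct_b).astype(float))[0, 1]
print(f"Yule's Q: {q:.3f}, error correlation: {error_corr:.3f}")
```

Here both models are 70% accurate but share only one mistake, so both statistics come out close to zero, which is the kind of pair that benefits from being combined.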

---

## 51.8 Dynamic Selection and Ensemble Pruning

Instead of using all base models all the time, **dynamic selection** chooses, for each test instance, the most competent model(s) based on the local region of the feature space. Methods like **KNORA** (K‑Nearest Oracles) evaluate base models on the k‑nearest neighbours of the test point and select those that performed well.

**Ensemble pruning** removes redundant or underperforming models from the ensemble to reduce complexity and sometimes improve accuracy. Greedy forward selection or optimisation‑based methods can be used.

These advanced techniques are beyond the scope of this chapter, but they are worth exploring if you need to further optimise your ensemble.
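
For a flavour of ensemble pruning, greedy forward selection can be sketched in a few lines. This is an illustrative simplification, not a reference implementation: `greedy_forward_selection` is a helper defined here, the validation probabilities are simulated, and selecting members on a validation set can itself overfit that set.

```python
import numpy as np

def greedy_forward_selection(val_probas, y_val, max_members=None):
    """Add models one at a time while the averaged prediction keeps improving.

    val_probas: array of shape (n_models, n_samples) holding P(class 1) per model.
    Returns the selected model indices and the ensemble's validation accuracy.
    """
    n_models = val_probas.shape[0]
    max_members = min(max_members or n_models, n_models)
    selected, best_acc = [], -1.0

    while len(selected) < max_members:
        candidates = [m for m in range(n_models) if m not in selected]
        scores = []
        for m in candidates:
            avg = val_probas[selected + [m]].mean(axis=0)
            scores.append(((avg > 0.5).astype(int) == y_val).mean())
        best_idx = int(np.argmax(scores))
        if scores[best_idx] <= best_acc:
            break  # no remaining candidate improves the ensemble
        best_acc = scores[best_idx]
        selected.append(candidates[best_idx])
    return selected, best_acc

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
# Simulated validation probabilities: three noisy but informative models and one random model
val_probas = np.vstack(
    [np.clip(y_val + rng.normal(scale=s, size=500), 0, 1) for s in (0.5, 0.6, 0.7)]
    + [rng.random(500)]
)

selected, best_acc = greedy_forward_selection(val_probas, y_val)
print("Selected model indices:", selected)
print(f"Validation accuracy of pruned ensemble: {best_acc:.3f}")
```

The first pick is always the best single model, and a further member is kept only if it raises validation accuracy, so the pruned ensemble is never worse than its best member on the validation data.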

---

## 51.9 Implementation Strategies

When implementing ensembles for production (like the NEPSE system), consider:

- **Computational cost**: Training many models can be expensive. Use parallelisation (`n_jobs=-1`) and cloud resources.
- **Model persistence**: Save all trained models (e.g., with `joblib`) so they can be loaded for inference.
- **Inference latency**: For real‑time predictions, an ensemble of many models may be too slow. Consider using a single strong model (like XGBoost) or pruning the ensemble.
- **Updating models**: In a streaming context, you may need to update ensemble members incrementally. Some libraries (e.g., `river`) support online ensembles.

**Saving and loading an ensemble**

```python
import joblib

# Save
joblib.dump(stack, 'nepse_stacking_ensemble.pkl')

# Load
stack_loaded = joblib.load('nepse_stacking_ensemble.pkl')
# X_new: new rows with the same feature columns (and order) used in training
predictions = stack_loaded.predict(X_new)
```

---

## 51.10 Best Practices and Pitfalls

**Best practices:**

1. **Start simple**: Try a single good model (e.g., XGBoost) before building a complex ensemble.
2. **Ensure diversity**: Use models that are truly different.
3. **Use cross‑validation** to evaluate ensembles (time‑series cross‑validation for temporal data).
4. **Monitor for overfitting**: If the ensemble performs much better on training than test, it may be overfitting.
5. **Consider computational constraints** – ensembles are slower to train and serve.

**Pitfalls:**

- **Overfitting**: Stacking with many base models and a complex meta‑model can overfit the small out‑of‑fold dataset. Use simple meta‑models (e.g., logistic regression).
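- **Look‑ahead bias**: In time series, ensure that when generating out‑of‑fold predictions for stacking, you do not use future data. Use time‑series cross‑validation (e.g., `TimeSeriesSplit`) instead of random folds. `StackingClassifier` builds its meta‑features with `cross_val_predict`, which requires every sample to fall in exactly one test fold, so `TimeSeriesSplit` cannot be passed as its `cv`; use a custom loop instead.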
- **Ignoring calibration**: For probability ensembles, ensure base models output well‑calibrated probabilities (use `CalibratedClassifierCV` from `sklearn.calibration` if needed).
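
As a brief illustration of the calibration point, scikit-learn's `CalibratedClassifierCV` can wrap a model that outputs only decision scores and map them to probabilities. The sketch below uses synthetic data and `LinearSVC` as an arbitrary example member; note that the integer `cv` used here is not time-aware, so a chronological scheme would be preferable on the real dataset.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not the NEPSE dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# LinearSVC has no predict_proba; the wrapper fits a sigmoid (Platt) mapping
# from its decision scores to probabilities using internal cross-validation
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
calibrated.fit(X_demo[:400], y_demo[:400])

proba = calibrated.predict_proba(X_demo[400:])
print(proba[:3].round(3))
```

The calibrated wrapper can then be used as a member of a soft-voting or stacking ensemble like any other classifier with `predict_proba`.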

**Time‑series cross‑validation for stacking**

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
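# Then use tscv.split(X) instead of skf.split(X, y) in the custom stacking implementation.
# The earliest rows never land in a validation fold, so fit the meta-model only on
# rows that actually received out-of-fold predictions.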
```
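
Put together, a forward-chaining version of the custom stacking loop might look like the sketch below. It assumes NumPy arrays sorted by time (with pandas objects, index with `.iloc`), uses synthetic data, and `stacking_time_series` is a helper name introduced here for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

def stacking_time_series(base_models, meta_model, X, y, n_splits=5):
    """Train a meta-model on forward-chaining out-of-fold predictions."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    oof_preds = np.full((len(X), len(base_models)), np.nan)

    for train_idx, val_idx in tscv.split(X):
        for j, model in enumerate(base_models):
            # Each fold model sees only data that precedes its validation block
            fold_model = clone(model).fit(X[train_idx], y[train_idx])
            oof_preds[val_idx, j] = fold_model.predict_proba(X[val_idx])[:, 1]

    # The earliest block is only ever used for training, so it has no predictions
    has_pred = ~np.isnan(oof_preds).any(axis=1)
    meta_model.fit(oof_preds[has_pred], y[has_pred])

    # Refit fresh copies of the base models on all training data for inference
    fitted = [clone(model).fit(X, y) for model in base_models]
    return meta_model, fitted, has_pred

# Synthetic stand-in for time-ordered training data
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
demo_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    LogisticRegression(max_iter=1000),
]
meta, fitted, has_pred = stacking_time_series(
    demo_models, LogisticRegression(), X_demo, y_demo
)
print(f"Rows used to train the meta-model: {has_pred.sum()} of {len(X_demo)}")
```

The cost of avoiding look-ahead is that the meta-model trains on fewer rows, since the first block of the series never receives an out-of-fold prediction.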

---

## Chapter Summary

In this chapter, we explored ensemble methods and their application to the NEPSE stock prediction system. We covered:

- The fundamental principle of combining diverse and accurate models.
- Bagging and Random Forest, demonstrating feature importance.
- Boosting algorithms (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost) and their sequential error‑correction mechanism.
- Stacking, which uses a meta‑model to learn how to best combine base models, with both library and custom implementations.
- Blending as a simpler alternative to stacking.
- Voting classifiers for straightforward averaging or majority voting.
- The importance of diversity and how to achieve it.
- Practical considerations for deploying ensembles in production.

Ensemble methods are a powerful tool in any data scientist's arsenal. By combining models, you can often achieve state‑of‑the‑art performance on challenging problems like financial time‑series prediction. For the NEPSE system, you might find that a well‑tuned XGBoost or a stacking ensemble of a few diverse models yields the best accuracy.

In the next chapter, we will discuss **Transfer Learning and Pre‑training**, exploring how to leverage models trained on large datasets to improve predictions when data is limited.

---

**End of Chapter 51**

