Ensembling models can improve predictive performance by leveraging the **diversity** and **collective wisdom** of multiple models. Instead of relying on a single model, we train several individual models and combine their predictions to make a final decision. 

We have already seen ensemble methods like bagging and boosting. These ensembles primarily reduce error by:

* **Reducing bias** — as in boosting

* **Reducing variance** — as in bagging and random forests

In this chapter, we'll go a step further and explore ensembles that **combine different types of models**. For example, we might ensemble a linear regression model, a random forest, and a gradient boosting model. The goal is to build stronger predictors by combining models that complement each other's strengths and weaknesses.

## Why Ensemble Diverse Models

**Bias Reduction:** 

- Different models often exhibit distinct biases. For example, a linear regression model might underfit complex patterns, while a Random Forest might overfit noisy data. Combining their predictions can mitigate these biases, leading to a more generalized model.
- **Example**: If a linear model overpredicts and a boosting model underpredicts for the same instance, averaging their predictions can cancel out the biases.

**Variance Reduction:**

- As seen with Random Forests, averaging predictions from multiple models reduces variance, especially when the models are uncorrelated (recall the variance reduction formula for bagging).
- **Key Requirement**: For effective variance reduction, the models should have low correlation in their predictions.

###  Mathematical Justification

We can mathematically illustrate how ensembling improves prediction accuracy using the case of regression.  
Let the predictors be denoted by \( X \), and the response by \( Y \). Assume we have \( m \) individual models \( f_1, f_2, \dots, f_m \). The ensemble predictor is the average:

$$
\hat{f}_{ensemble}(X) = \frac{1}{m} \sum_{i=1}^{m} f_i(X)
$$

The expected mean squared error (MSE) of the ensemble model is:

$$
E(MSE_{Ensemble}) = E\left[\left( \frac{1}{m} \sum_{i = 1}^{m} f_i(X) - Y \right)^2 \right]
$$

This expands to:

$$
E(MSE_{Ensemble}) = \frac{1}{m^2} \sum_{i = 1}^{m} E\left[(f_i(X) - Y)^2 \right] + \frac{1}{m^2} \sum_{i \ne j} E\left[(f_i(X) - Y)(f_j(X) - Y) \right]
$$

$$
= \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right) + \frac{1}{m^2} \sum_{i \ne j} E\left[(f_i(X) - Y)(f_j(X) - Y) \right]
$$

If the individual models \( f_1, \dots, f_m \) are **unbiased**, each error \( f_i(X) - Y \) has mean zero, so the cross terms become covariances between the models' errors:

$$
E(MSE_{Ensemble}) = \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right) + \frac{1}{m^2} \sum_{i \ne j} Cov\left(f_i(X) - Y,\ f_j(X) - Y\right)
$$

If the models' errors are also **uncorrelated**, the covariance terms vanish:

$$
E(MSE_{Ensemble}) = \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right)
$$

> 🔍 **Conclusion**: When the individual models are both **unbiased** and have **uncorrelated** errors, the expected MSE of the ensemble is \( 1/m \) times the average MSE of the individual models, and therefore **strictly lower** whenever \( m > 1 \) and the individual errors are nonzero. This provides a strong theoretical justification for ensembling diverse models to improve prediction accuracy.
>
> In practice, the ensemble's performance tends to improve unless a single model is significantly more accurate than the others. For example, if one model has near-zero MSE while others perform poorly, ensembling may actually hurt performance. Therefore, the benefit of ensembling depends not only on diversity but also on the **relative quality** of the individual models.
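
The \( 1/m \) reduction can be checked numerically. The following is a minimal simulation sketch, not part of the car-price analysis: it assumes each "model" equals the true response plus independent zero-mean noise, so the models are unbiased with uncorrelated errors. The sample size, number of models, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, m = 100_000, 5          # arbitrary sample size and number of models
y = rng.normal(size=n_samples)     # simulated true responses

# Each simulated model predicts y plus independent zero-mean noise,
# so the models are unbiased and their errors are uncorrelated.
preds = y[:, None] + rng.normal(scale=1.0, size=(n_samples, m))

# MSE of each individual model and of the simple average of all models
individual_mse = ((preds - y[:, None]) ** 2).mean(axis=0)
ensemble_mse = ((preds.mean(axis=1) - y) ** 2).mean()

print(f"Average individual MSE: {individual_mse.mean():.3f}")
print(f"Ensemble MSE:           {ensemble_mse:.3f}")
print(f"Average MSE / m:        {individual_mse.mean() / m:.3f}")
```

The last two printed values should agree closely. Adding a shared noise component to every column of `preds` would make the errors correlated and shrink the gain.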



## Combining Model Predictions: Two Common Approaches

There are two widely used methods for combining model predictions in ensemble learning:

- **Voting**: Combines the predictions of multiple models directly. In classification, this could be majority voting; in regression, it's often simple averaging. Voting is intuitive and works well when the base models are reasonably strong and diverse.

In [None]:
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/voting_vs_bagging.webp", width=400, height=200)


The idea behind voting is similar to bagging in that it combines predictions from multiple models by averaging their predictions. However, unlike bagging which typically uses homogeneous models (e.g., multiple decision trees), voting allows for combining **heterogeneous models**—different types of algorithms—to leverage their individual strengths.
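
As a small illustration of the two voting rules, the sketch below combines made-up predictions from three hypothetical models. The numbers are invented for the example and do not come from the car dataset.

```python
import numpy as np

# Invented regression predictions: rows are observations, columns are models A, B, C
reg_preds = np.array([
    [10.0, 12.0, 11.0],
    [20.0, 19.0, 24.0],
    [15.0, 15.0, 18.0],
    [30.0, 27.0, 33.0],
])
# Regression voting: simple average across the models
reg_vote = reg_preds.mean(axis=1)

# Invented binary class predictions from the same three models
clf_preds = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])
# Hard (majority) voting: class 1 wins when more than half of the models predict it
clf_vote = (clf_preds.mean(axis=1) > 0.5).astype(int)

print(reg_vote)
print(clf_vote)
```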


- **Stacking**: Trains a new model (called a **meta-learner**) to learn how to best combine the predictions of the base models. Stacking can capture more complex relationships among the models and often yields higher accuracy, especially when base models differ significantly in structure or behavior.


In [3]:
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/stacking.jpg", width=400, height=200)

## Exploring Stacking and Voting in Regression

Building on the previous chapters where we consistently used the car dataset for regression tasks, we will now explore the application of stacking and voting ensemble methods using the same dataset.

Let's begin by importing the necessary libraries:

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score,train_test_split, GridSearchCV, ParameterGrid, \
StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import root_mean_squared_error, mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score, roc_auc_score, f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import VotingRegressor, VotingClassifier, StackingRegressor, \
StackingClassifier, GradientBoostingRegressor,GradientBoostingClassifier, BaggingRegressor, \
BaggingClassifier,RandomForestRegressor,RandomForestClassifier,AdaBoostRegressor,AdaBoostClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression, Ridge, ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import seaborn as sns
import matplotlib.pyplot as plt
import itertools as it
import time as time
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Load the dataset

In [2]:
# Load the dataset
car = pd.read_csv('Datasets/car.csv')
car.head()

Unnamed: 0,brand,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,vw,Beetle,2014,Manual,55457,Diesel,30,65.3266,1.6,7490
1,vauxhall,GTC,2017,Manual,15630,Petrol,145,47.2049,1.4,10998
2,merc,G Class,2012,Automatic,43000,Diesel,570,25.1172,3.0,44990
3,audi,RS5,2019,Automatic,10,Petrol,145,30.5593,2.9,51990
4,merc,X-CLASS,2018,Automatic,14000,Diesel,240,35.7168,2.3,28990


Data preprocessing

In [3]:
X = car.drop(columns=['price'])
y = car['price']

# extract the categorical columns and put them in a list
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# extract the numerical columns and put them in a list
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

While some of the models we will ensemble—such as tree-based models—do not require feature scaling or one-hot encoding, we'll apply a unified preprocessing pipeline for all models to ensure consistency and compatibility. This approach simplifies the workflow and avoids errors when combining models with different preprocessing requirements.

In [4]:
# Create preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

For quick prototyping, we start with default model settings to evaluate whether ensembling improves performance. We also set a fixed random state, for the models that accept one, to ensure reproducibility. Below is a list of the models you have learned so far.

In [5]:
# Define models to evaluate
regressor_models = {
    'Baseline Linear Regression': LinearRegression(),
    'Baseline KNN Regressor': KNeighborsRegressor(),
    'Baseline Decision Tree': DecisionTreeRegressor(random_state=42),
    'Baseline Random Forest': RandomForestRegressor( random_state=42),
    'Baseline XGBoost': xgb.XGBRegressor( random_state=42),
    'Baseline LightGBM': lgb.LGBMRegressor(random_state=42, verbose=0),
    'Baseline CatBoost': cb.CatBoostRegressor(random_state=42, verbose=0)
}

We will first build each model using its default settings to establish baseline performance before applying ensembling techniques.

In [6]:
# store the results
reg_results = {}

# Loop through models
for name, model in regressor_models.items():
    # Create a pipeline with preprocessing and the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    # Fit the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Calculate RMSE
    rmse = root_mean_squared_error(y_test, y_pred)
    
    # Store the results
    reg_results[name] = rmse

# Convert results to DataFrame for better visualization
reg_results_df = pd.DataFrame.from_dict(reg_results, orient='index', columns=['RMSE'])
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True)
reg_results_df.reset_index(inplace=True)
reg_results_df.columns = ['Model', 'RMSE']
# Print the results
reg_results_df



Unnamed: 0,Model,RMSE
0,Baseline CatBoost,3296.493137
1,Baseline XGBoost,3397.155518
2,Baseline Random Forest,3660.14597
3,Baseline LightGBM,3729.955778
4,Baseline KNN Regressor,4062.83968
5,Baseline Decision Tree,5015.812547
6,Baseline Linear Regression,5801.435399


Evidently, **CatBoost** outperforms the other models using its default settings, achieving the lowest RMSE. This aligns with what you've learned—CatBoost typically requires less hyperparameter tuning compared to **XGBoost** and **LightGBM**.


### Voting Regressor

Next, we will build an ensemble model, starting with a **voting regressor**, which by default assigns **equal weight** to each base model. In this approach, all predictions are treated equally when averaged to produce the final combined prediction.

Below is how you can **ensemble the same models** using `VotingRegressor` with the same preprocessor in a pipeline.

In [8]:
# Define base regressors (same as before)
base_regressor_list = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor(random_state=42)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('xgb', xgb.XGBRegressor(random_state=42)),
    ('lgb', lgb.LGBMRegressor(random_state=42, verbose=0)),
    ('cat', cb.CatBoostRegressor(random_state=42, verbose=0))
]

# Create a VotingRegressor with equal weights
voting_regressor = VotingRegressor(estimators=base_regressor_list)

# Create a pipeline with preprocessing and voting ensemble
voting_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('ensemble', voting_regressor)
])

# Fit the ensemble model (no need to fit individual models beforehand!)
voting_pipeline.fit(X_train, y_train)

**Note**: You **do not need to fit the models individually** before including them in the `VotingRegressor`. Doing so would result in unnecessary computation and **waste time**, as `VotingRegressor.fit()` will handle training for all models internally.
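
The sketch below illustrates this on synthetic data (an assumption made to keep the example self-contained). To the best of our knowledge, `VotingRegressor` trains clones of the estimators it is given, so the original objects stay unfitted and the fitted copies are available through `named_estimators_`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

lr = LinearRegression()
dt = DecisionTreeRegressor(random_state=0)

voting_demo = VotingRegressor(estimators=[('lr', lr), ('dt', dt)])
voting_demo.fit(X_demo, y_demo)

# The original object was never fitted; the ensemble holds its own fitted clone
original_is_fitted = hasattr(lr, 'coef_')
clone_is_fitted = hasattr(voting_demo.named_estimators_['lr'], 'coef_')

print(original_is_fitted)
print(clone_is_fitted)
```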

In [9]:
# Predict and evaluate
y_pred_vote = voting_pipeline.predict(X_test)
rmse_vote = root_mean_squared_error(y_test, y_pred_vote)

# Add ensemble result to the results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Voting Regressor', rmse_vote]
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Show updated results
reg_results_df



Unnamed: 0,Model,RMSE
0,Baseline CatBoost,3296.493137
1,Voting Regressor,3302.202159
2,Baseline XGBoost,3397.155518
3,Baseline Random Forest,3660.14597
4,Baseline LightGBM,3729.955778
5,Baseline KNN Regressor,4062.83968
6,Baseline Decision Tree,5015.812547
7,Baseline Linear Regression,5801.435399


It is not surprising that the Voting Regressor (RMSE 3302.20) only roughly matches the best single model, CatBoost (RMSE 3296.49), rather than beating it. Equal-weight averaging gives weak models such as linear regression and the single decision tree the same influence as the strongest ones, and several of the base models are similar tree-based learners, so the ensemble has limited diversity.

####  Strategies to Improve Voting Ensemble Performance

To boost the effectiveness of a voting ensemble, consider the following enhancements:

- **Increase Model Diversity**  
  Incorporate a wider range of model types (e.g., SVMs or neural networks) alongside the tree-based, linear, and k-NN models already in the ensemble. Diverse models are more likely to capture different patterns in the data and produce uncorrelated errors, a key factor in ensemble success, as discussed earlier in this chapter.

- **Tune Base Models Individually**  
  Optimize each base model using hyperparameter tuning techniques such as **Optuna**. Well-tuned individual models provide stronger building blocks for the ensemble, improving the final averaged prediction.

- **Use Weighted Voting**  
  Instead of assigning equal importance to each model, assign weights based on their individual performance (e.g., lower RMSE → higher weight). This helps emphasize the contribution of stronger models like CatBoost or XGBoost.  
  *(Note: In a more advanced setup, stacking takes this further by learning the best combination strategy using a meta-model.)*
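
A minimal sketch of weighted voting follows. It runs on synthetic data with three scikit-learn models so that it is self-contained. The inverse-RMSE weighting is one possible heuristic rather than a prescribed rule, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data and a held-out validation split, for illustration only
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

estimators = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Validation RMSE of each model when fitted on its own
rmses = []
for _, est in estimators:
    est.fit(X_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_val, est.predict(X_val))))

# Heuristic: weights proportional to 1 / RMSE, normalized to sum to 1
inverse_rmse = 1.0 / np.array(rmses)
weights = (inverse_rmse / inverse_rmse.sum()).tolist()

weighted_voting = VotingRegressor(estimators=estimators, weights=weights)
weighted_voting.fit(X_tr, y_tr)
weighted_pred = weighted_voting.predict(X_val)

print(dict(zip([name for name, _ in estimators], np.round(weights, 3))))
```

Because the weights here are derived from the validation split, the final ensemble should be evaluated on a separate test set to avoid an optimistic estimate.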


### Stacking Regressor

**Stacking** is a more sophisticated ensembling technique that learns how to best combine multiple base models using a separate meta-model (also called the `final_estimator`).



Here’s how the process works:

1. **Cross-validated predictions for base models**  
   The training data is split into *K* folds (typically using cross-validation). For each fold:
   - The base models are trained on the remaining *K–1* folds.
   - Predictions are made on the held-out fold.

2. **Out-of-fold predictions become new features**  
   This process generates **out-of-fold predictions** for each training point from each base model (i.e., predictions made on data not seen during training). These predictions are used as **features** for the next stage.

3. **Training the meta-model (`final_estimator`)**  
   The meta-model is trained on these out-of-fold predictions as input features and the original target variable as the response. It learns how to combine the base model outputs to make a better overall prediction.

The schematic below summarizes this process for four base models (C₁-C₄), whose out-of-fold predictions (P₁-P₄) become the meta-features:

   Training Set  
│  
├── Cross-Validation Process (k folds)  
│   ├── Fold 1: Train C₁-C₄ on folds 2-k → Predict on Fold 1 → P₁-P₄ for Fold 1  
│   ├── Fold 2: Train C₁-C₄ on folds 1,3-k → Predict on Fold 2 → P₁-P₄ for Fold 2  
│   └── ... (repeat for all k folds)  
│  
├── Aggregated Meta-Features: P₁-P₄ for entire training set  
│  
└── Meta-Model (`final_estimator`) → Final Prediction  

> The goal of stacking is to leverage the strengths of each individual model while minimizing their weaknesses, often resulting in improved accuracy over any single model.
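
The three steps above can be sketched by hand with `cross_val_predict`. This is an illustrative approximation of what `StackingRegressor` automates, not its actual implementation. It uses synthetic data and three scikit-learn base models chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

base_models = {
    'lr': LinearRegression(),
    'knn': KNeighborsRegressor(),
    'dt': DecisionTreeRegressor(max_depth=5, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Steps 1-2: out-of-fold predictions, one column per base model
oof = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=cv)
    for model in base_models.values()
])

# Step 3: train the meta-model on the out-of-fold predictions
meta = LinearRegression().fit(oof, y_demo)

# At prediction time, base models refitted on all training data supply the meta-features
fitted = [model.fit(X_demo, y_demo) for model in base_models.values()]
X_new = X_demo[:5]  # stand-in for unseen rows
meta_features_new = np.column_stack([model.predict(X_new) for model in fitted])
stacked_pred = meta.predict(meta_features_new)

print(dict(zip(base_models, meta.coef_.round(3))))
```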

#### Metamodel: Linear regression

In [11]:
import warnings
warnings.filterwarnings("ignore", message="X does not have valid feature names")
# Define a meta-model
meta_lr = LinearRegression()

# Create the stacking regressor
stacking_model = StackingRegressor(
    estimators=base_regressor_list,
    final_estimator=meta_lr,
    cv=KFold(n_splits=5, shuffle=True, random_state=42), # ensures all base models use the same 5-fold CV
    n_jobs=-1,  # Use all available cores
    passthrough=False  # Set to True if you want to include original features in meta-model
)

# Wrap with pipeline (using your preprocessor)
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_model)
])

# Fit the stacking model
stacking_pipeline.fit(X_train, y_train)

In [12]:
# Predict and evaluate
y_pred_stack = stacking_pipeline.predict(X_test)
rmse_stack = root_mean_squared_error(y_test, y_pred_stack)

print(f"Stacking Regressor RMSE: {rmse_stack:.2f}")

Stacking Regressor RMSE: 3190.11


In [13]:
# Append the Stacking Regressor result to results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Lr Stacking Regressor', rmse_stack]

# Sort by RMSE in ascending order and reset index
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Display updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,Lr Stacking Regressor,3190.112423
1,Baseline CatBoost,3296.493137
2,Voting Regressor,3302.202159
3,Baseline XGBoost,3397.155518
4,Baseline Random Forest,3660.14597
5,Baseline LightGBM,3729.955778
6,Baseline KNN Regressor,4062.83968
7,Baseline Decision Tree,5015.812547
8,Baseline Linear Regression,5801.435399


From the results, the **Stacking Regressor** not only outperforms the **Voting Regressor**, but also surpasses the best individual model, **CatBoost**, when using default settings.

While voting assigns equal weights to each base model, stacking uses a **meta-model** (in this case, linear regression) to learn an optimal weighted combination of predictions. This allows it to assign different importance to each base model based on their contributions to overall performance.

Next, let's examine the coefficients learned by the stacking model to understand how it weighted each base model's prediction.


In [14]:
# Access the trained meta-model inside the stacking pipeline
meta_model = stacking_pipeline.named_steps['stacking'].final_estimator_

# Get model names in the same order as the coefficients
model_names = [name for name, _ in base_regressor_list]

# Extract coefficients
coefs = meta_model.coef_

# Create a DataFrame to display model weights
import pandas as pd
coef_df = pd.DataFrame({'Base Model': model_names, 'Meta-Model Coefficient': coefs})

# Sort by weight (optional)
coef_df = coef_df.sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)

# Show the weights
coef_df

Unnamed: 0,Base Model,Meta-Model Coefficient
0,cat,0.763462
1,knn,0.140259
2,xgb,0.084874
3,lr,0.04366
4,dt,0.037379
5,rf,0.005741
6,lgb,-0.056054


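Note the above coefficients of the meta-model. The model gives the **highest weight** to the **catboost** model (0.76), followed by **knn** (0.14) and **xgb** (0.08), while **rf** receives almost no weight (0.006) and **lgb** receives the **lowest weight**, which is slightly negative (-0.056).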

Also, note that the **coefficients need not sum to one.**

####  Why did LightGBM get the lowest (even negative) coefficient?

1. **Stacking is not based on model performance alone**
   - The meta-model (`LinearRegression`) **doesn't assign weights based on RMSE directly**.
   - Instead, it **learns how to combine the model predictions** to best fit the training data (specifically, the **out-of-fold predictions** from each base model).
   - So, even if LightGBM performs decently **on its own**, its predictions may be **redundant** or **highly correlated** with stronger models (e.g., CatBoost or XGBoost).

2. **LightGBM and XGBoost are often similar**
   - Both are gradient boosting methods — if they make **very similar predictions**, the meta-model may favor just one of them (in this case, XGBoost slightly more).
   - Including both may introduce **multicollinearity**, and the linear model tries to **suppress redundancy** by assigning a near-zero or **negative weight**.

3. **Linear regression allows negative weights**
   - Unlike voting (which only uses positive weights), a linear model may assign a **negative coefficient** if it slightly improves the overall fit.
   - This doesn’t mean the model is “bad,” but rather that **its prediction direction may not help much** in the presence of other models.
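
One way to check for this kind of redundancy is to look at how strongly the base models' out-of-fold errors are correlated. The sketch below is a hedged illustration on synthetic data. It uses two scikit-learn gradient boosting models as stand-ins for XGBoost and LightGBM, and the model names and settings are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    'gb_a': GradientBoostingRegressor(n_estimators=50, random_state=0),
    'gb_b': GradientBoostingRegressor(n_estimators=50, subsample=0.8, random_state=1),
    'rf': RandomForestRegressor(n_estimators=50, random_state=0),
    'lr': LinearRegression(),
}

# Out-of-fold predictions, one column per model
oof_df = pd.DataFrame({
    name: cross_val_predict(model, X_demo, y_demo, cv=cv)
    for name, model in models.items()
})

# Correlation between the models' out-of-fold errors
residuals = pd.DataFrame(oof_df.to_numpy() - y_demo[:, None], columns=oof_df.columns)
error_corr = residuals.corr()

print(error_corr.round(2))
```

Pairs of models whose errors are highly correlated add little new information to the meta-model, which is where small or negative coefficients tend to appear.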


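To further improve the RMSE, we will refine the ensemble by removing weaker or highly correlated base models. Specifically, we will retain four models: CatBoost, XGBoost, Random Forest, and KNN. Linear regression and the single decision tree are the weakest baselines, and LightGBM received a negative coefficient in the meta-model, which suggests it is redundant with the other boosting models. Note that this selection does not simply follow coefficient magnitude: Random Forest received a near-zero coefficient in the seven-model stack but is kept because it is a strong baseline model.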

In [15]:
# Define top 4 base models
top4_regressor_list = [
    ('cat', cb.CatBoostRegressor(random_state=42, verbose=0)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('xgb', xgb.XGBRegressor(random_state=42)),
    ('knn', KNeighborsRegressor())
]

# Meta-model
top4_meta_lr = LinearRegression()

# Build the stacking ensemble
stacking_top4 = StackingRegressor(
    estimators=top4_regressor_list,
    final_estimator=top4_meta_lr,
    cv=5
)

# Create pipeline with preprocessing
stacking_top4_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_top4)
])

# Fit the pipeline
stacking_top4_pipeline.fit(X_train, y_train)

In [16]:
# Predict and evaluate
y_pred_top4 = stacking_top4_pipeline.predict(X_test)
rmse_top4 = root_mean_squared_error(y_test, y_pred_top4)

# Add result to results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Lr Stacking Top 4 Regressor', rmse_top4]
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Show updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,Lr Stacking Top 4 Regressor,3135.705167
1,Lr Stacking Regressor,3190.112423
2,Baseline CatBoost,3296.493137
3,Voting Regressor,3302.202159
4,Baseline XGBoost,3397.155518
5,Baseline Random Forest,3660.14597
6,Baseline LightGBM,3729.955778
7,Baseline KNN Regressor,4062.83968
8,Baseline Decision Tree,5015.812547
9,Baseline Linear Regression,5801.435399


The stacked ensemble's RMSE **improves further** when only the **stronger, less redundant models** are ensembled.

In [17]:
# Access the trained meta-model inside the stacking pipeline
meta_model_top4 = stacking_top4_pipeline.named_steps['stacking'].final_estimator_

# Get the names of the base models
top_model_names = [name for name, _ in top4_regressor_list]

# Extract coefficients
top4_coefs = meta_model_top4.coef_

# Create a DataFrame to display the weights
top4_coef_df = pd.DataFrame({
    'Base Model': top_model_names,
    'Meta-Model Coefficient': top4_coefs
}).sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)

# Show the result
top4_coef_df

Unnamed: 0,Base Model,Meta-Model Coefficient
0,cat,0.602477
1,rf,0.156181
2,knn,0.131
3,xgb,0.122639


#### Choosing the Meta-Model in Stacking

In stacking, the **meta-model** (also called the *`final_estimator`*) is responsible for learning how to combine the predictions of the base models. It takes the base models' predictions as input features and learns how to best map them to the target.

While `LinearRegression` is a popular default choice due to its **simplicity**, **speed**, and **interpretability**, you are not limited to it. **Any regression model** can be used as the meta-model, depending on your goals:

- Use `Ridge` or `Lasso` if regularization is needed (e.g., to handle multicollinearity).
- Use a **tree-based model** (e.g., `RandomForestRegressor`, `XGBRegressor`) if you suspect **nonlinear interactions** between base model predictions.
- Use `SVR`, `MLPRegressor`, or `KNeighborsRegressor` for flexible, non-parametric alternatives (though often more sensitive to tuning and data scale).

The choice of meta-model can significantly affect the performance of your stacked ensemble.

####  General Guidelines for Choosing a Meta-Model in Stacking

When selecting a meta-model (final estimator) for a stacking ensemble, consider the following:

- **If your base models are highly correlated**  
  Use a regularized linear model like `Ridge` or `Lasso`. These models help reduce overfitting by shrinking or zeroing out redundant coefficients.

- **If you suspect nonlinear interactions between base model predictions**  
  Use a flexible, non-linear model like `RandomForestRegressor`, `XGBRegressor`, or `MLPRegressor` as the meta-model. These can better capture complex relationships among the base model outputs.

- **If interpretability is not a priority**  
  Consider using more powerful learners (e.g., tree ensembles or neural networks) for the meta-model to potentially boost performance, especially on complex datasets.

The meta-model should complement your base models and match the complexity of the prediction task.
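
A quick way to apply these guidelines is to try a few candidate meta-models on the same base models. The sketch below does this on synthetic data with three scikit-learn base models. The candidates and their settings are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

base = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor(max_depth=5, random_state=0)),
]

candidate_meta_models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit one stack per candidate meta-model and record its held-out RMSE
meta_rmse = {}
for label, meta in candidate_meta_models.items():
    stack = StackingRegressor(estimators=base, final_estimator=meta, cv=5)
    stack.fit(X_tr, y_tr)
    meta_rmse[label] = np.sqrt(mean_squared_error(y_te, stack.predict(X_te)))

for label, value in sorted(meta_rmse.items(), key=lambda item: item[1]):
    print(f"{label:>14}: {value:.2f}")
```

In a real project, the meta-model would ideally be selected with cross-validation on the training data, keeping the test set for the final evaluation only.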


In this car dataset, CatBoost, XGBoost, LightGBM, and Random Forest are all tree-based models, and they are likely generating similar predictions. This can lead to:

- **Multicollinearity** among the base models  
- **Unstable or suboptimal coefficients** in the meta-model

#### Metamodel: Ridge regression

To address this, we'll try using a **regularized linear model like Ridge regression** as the meta-model. Ridge can better handle redundant information by shrinking correlated coefficients, which helps reduce overfitting and improve generalization.

In [18]:
# Replace meta-model
stacking_ridge_model = StackingRegressor(
    estimators=base_regressor_list,
    final_estimator=Ridge(),
    cv=5
)

# Create pipeline with preprocessing
stacking_ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_ridge_model)
])
# Fit the pipeline
stacking_ridge_pipeline.fit(X_train, y_train)

In [19]:
# Predict and evaluate
y_pred_ridge = stacking_ridge_pipeline.predict(X_test)
rmse_ridge = root_mean_squared_error(y_test, y_pred_ridge)
# Add result to results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Stacking with Ridge', rmse_ridge]
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)
# Show updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,Lr Stacking Top 4 Regressor,3135.705167
1,Stacking with Ridge,3147.701164
2,Lr Stacking Regressor,3190.112423
3,Baseline CatBoost,3296.493137
4,Voting Regressor,3302.202159
5,Baseline XGBoost,3397.155518
6,Baseline Random Forest,3660.14597
7,Baseline LightGBM,3729.955778
8,Baseline KNN Regressor,4062.83968
9,Baseline Decision Tree,5015.812547


**Interpretation of Results:**

From the table, we observe that the **"Lr Stacking Top 4 Regressor"** achieved the lowest RMSE among all models evaluated so far. This model stacked four regressors with default settings (CatBoost, XGBoost, Random Forest, and KNN) using a simple linear regression as the meta-model.

The **"Stacking with Ridge"** model, which stacks all seven base models with Ridge regression as the meta-model, achieved a lower RMSE (3147.70) than the linear-regression stack over the same seven models (3190.11), which suggests that regularization helps when the base models are redundant. It still fell slightly short of the top-4 linear stack (3135.71), so on this dataset pruning redundant base models helped a little more than regularizing the meta-model. Note that the seven-model linear stack used a shuffled `KFold` splitter while the Ridge stack used `cv=5`, so part of the gap between those two may come from the different folds.

Additional observations:

- **"Voting Regressor"** performed better than most baseline models but was still outperformed by stacking approaches. This is expected, as voting ensembles assign equal weights to base learners, whereas stacking learns optimal weights through a meta-model.
- All **baseline tree-based models** (CatBoost, XGBoost, Random Forest, LightGBM) performed reasonably well, with CatBoost being the best among them even without tuning.
- The **baseline linear regression** performed the worst due to its limited capability to capture non-linear relationships in the data.


###  Should You Tune Base Models Before Stacking or Voting?

Tuning base models is not strictly required, but it’s **strongly recommended** for building effective ensembles — especially when using **stacking**.



####  Benefits of Tuning Base Models
- **Improved accuracy**: Better-tuned models contribute more meaningful predictions.
- **Reduced noise**: Untuned or weak models can hurt ensemble performance, especially in **voting**, where all models contribute equally.
- **Stronger stacking**: The meta-model in stacking learns how to combine predictions — and works best when those predictions are strong and diverse.


####  Summary

| Ensemble Method | Should You Tune Base Models? | Why? |
|------------------|------------------------------|------|
| **Voting**       | Optional but helpful          | Weak models can dilute strong ones. |
| **Stacking**     | Strongly recommended          | Meta-model depends on meaningful base predictions. |

For quick prototyping, so far we have used default model settings to assess whether ensembling could improve performance. Now that we have identified the top 4 models, we will leverage their optimized versions for stacking. These models were fine-tuned in previous chapters using tools like **Optuna**, **RandomizedSearchCV**, or **BayesSearchCV** to efficiently search for the best hyperparameters.
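
As a reminder of what such tuning looks like, here is a minimal sketch that tunes one base model with `RandomizedSearchCV` and then places the tuned model in a stack. It uses synthetic data and a deliberately tiny, hypothetical search space. The tuned values used in this chapter come from the earlier sections, not from this snippet.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data, used only for illustration
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# Hypothetical search space, kept small so the example runs quickly
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        'n_estimators': [20, 40, 60],
        'max_depth': [3, 6, 9, None],
        'max_features': [0.5, 0.75, 1.0],
    },
    n_iter=5,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=0,
)
search.fit(X_demo, y_demo)

# Reuse the selected hyperparameters for a base model in the stack
rf_best = RandomForestRegressor(random_state=0, **search.best_params_)
stack_demo = StackingRegressor(
    estimators=[('rf', rf_best), ('knn', KNeighborsRegressor())],
    final_estimator=LinearRegression(),
    cv=5,
)
stack_demo.fit(X_demo, y_demo)

print(search.best_params_)
```

Both the search and the stack should be fitted on training data only, so that the test set remains untouched until the final comparison.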

In [20]:
# tuned catboost model from section 11.3.6
catboost_tuned = cb.CatBoostRegressor(
    iterations=1021,
    learning_rate=0.0536,
    depth=7,
    l2_leaf_reg=2.56,
    random_seed=42,
    min_data_in_leaf=28,
    bagging_temperature=0.034,
    random_strength=2.2,
    verbose=0
)
# tuned random forest model from section 7.4.4
rf_tuned = RandomForestRegressor(
    n_estimators=60,
    max_depth=28,
    max_features=0.58,
    max_samples=1.0,
    random_state=42
)

#tuned knn model from section 2.4.1
knn_tuned = KNeighborsRegressor(
    n_neighbors=8,
    weights='distance',
    metric='manhattan',
    p=2  # no effect with metric='manhattan'; p is used only by the Minkowski metric
)

# tuned xgboost model from section  10.5.10
xgb_tuned = xgb.XGBRegressor(
    n_estimators=193,
    learning_rate=0.15,
    max_depth=7,
    min_child_weight=1,
    gamma=0,
    subsample=0.78,
    reg_lambda=9.33,
    reg_alpha=5.0,
    colsample_bytree=0.635,
    random_state=42
)

In [21]:
# Define models to evaluate
tuned_regressor_models = {
    'Optmized KNN Regressor': knn_tuned,
    'Optmized Random Forest': rf_tuned,
    'Optmized XGBoost': xgb_tuned,
    'Optmized CatBoost': catboost_tuned
}

In [22]:
# Initialize results dictionary
tuned_top4_results = {}

# Loop through tuned regressor models
for name, model in tuned_regressor_models.items():
    # Create a pipeline with preprocessing and the model
    tuned_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Fit the model
    tuned_pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_tuned_pred = tuned_pipeline.predict(X_test)
    
    # Calculate RMSE (correct target used)
    tuned_rmse = root_mean_squared_error(y_test, y_tuned_pred)
    
    # Store the result with a consistent label
    tuned_top4_results[name] = tuned_rmse

# Append new results to reg_results_df
for name, rmse in tuned_top4_results.items():
    reg_results_df.loc[len(reg_results_df)] = [name, rmse]

# Sort and reset index
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Show updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,Optmized XGBoost,3105.835205
1,Lr Stacking Top 4 Regressor,3135.705167
2,Stacking with Ridge,3147.701164
3,Optmized CatBoost,3175.161327
4,Lr Stacking Regressor,3190.112423
5,Optmized Random Forest,3279.094977
6,Baseline CatBoost,3296.493137
7,Voting Regressor,3302.202159
8,Baseline XGBoost,3397.155518
9,Baseline Random Forest,3660.14597


#### Metamodel:  Linear Regression on Tuned Top 4 Models

In [23]:
# define top 4 optimized base models
tuned_top4_regressor_list = [
    ('cat', catboost_tuned),
    ('rf', rf_tuned),
    ('knn', knn_tuned),
    ('xgb', xgb_tuned)
]

In [24]:
# Build the stacking ensemble
stacking_tuned_top4 = StackingRegressor(
    estimators=tuned_top4_regressor_list,
    final_estimator=LinearRegression(),
    cv=5
)
# Create pipeline with preprocessing
stacking_tuned_top4_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_tuned_top4)
])

# Fit the pipeline
stacking_tuned_top4_pipeline.fit(X_train, y_train)

In [25]:
# Predict and evaluate
y_pred_top4_tuned = stacking_tuned_top4_pipeline.predict(X_test)
rmse_top4_tuned = root_mean_squared_error(y_test, y_pred_top4_tuned)

# Add result to reg_results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Stacking Optimized Top 4 Models', rmse_top4_tuned]

# Sort and reset index
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Display results
reg_results_df

| | Model | RMSE |
|---|---|---|
| 0 | Stacking Optimized Top 4 Models | 2971.389848 |
| 1 | Optmized XGBoost | 3105.835205 |
| 2 | Lr Stacking Top 4 Regressor | 3135.705167 |
| 3 | Stacking with Ridge | 3147.701164 |
| 4 | Optmized CatBoost | 3175.161327 |
| 5 | Lr Stacking Regressor | 3190.112423 |
| 6 | Optmized Random Forest | 3279.094977 |
| 7 | Baseline CatBoost | 3296.493137 |
| 8 | Voting Regressor | 3302.202159 |
| 9 | Baseline XGBoost | 3397.155518 |


**Key Takeaway**

- With tuned base regressors, the stacking model improved further and achieved the lowest RMSE overall (2971.39).
- Tuned, regularized boosted-tree models such as **XGBoost** (3105.84) and **CatBoost** (3175.16) performed strongly even without stacking.
- Tuned XGBoost outperformed every stacking model built on default base regressors, and tuned CatBoost outperformed one of them.
- This highlights the importance of proper **hyperparameter tuning**, especially for powerful individual learners.

Overall, on this test set, stacking well-tuned base models with a simple meta-model (even without regularization) gave an RMSE about 4% below the best individual model (tuned XGBoost) and about 10% below the simple averaging ensemble (Voting Regressor).

### Ensembling Models Based on Different Feature Sets

Ensemble learning benefits significantly from model diversity: it helps reduce overfitting and improves the robustness of predictions. By combining models that learn from different perspectives, an ensemble has the potential to outperform any single model.

So far, we've introduced diversity by varying the algorithms. Another approach is to vary the **features** each model sees, so that each model focuses on specific aspects of the data. The idea is similar to how bagging- and boosting-based methods inject diversity through row sampling or random feature selection (`max_features`, `colsample_bytree`, etc.).

Concretely, we can train **strong base models on different subsets of features**. These feature sets can be built or selected using techniques like:
- **Polynomial features** (constructing interaction and higher-order terms, e.g., for a linear model)
- **Tree-based feature importance**
- **Stepwise feature selection**

By assigning tailored feature sets to different models, tuning each individually, and then combining them in a stacked ensemble, we can potentially achieve even lower RMSE.

Building such a system involves:
1. Selecting a meaningful feature subset for each base model,
2. Tuning each model on its selected features, and
3. Ensembling them using a meta-learner.

This more advanced form of stacking is a powerful technique, and I'll leave its full exploration to you!
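
As a starting point, the three steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_regression`), the two column subsets `linear_cols` and `tree_cols` are hypothetical stand-ins for the output of a real feature-selection step, and no tuning is performed. Each base model is a pipeline that first keeps only its own columns, so the `StackingRegressor` combines models that see different views of the data.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data stands in for a real dataset (assumption)
X_fs, y_fs = make_regression(n_samples=400, n_features=10, n_informative=6,
                             noise=15.0, random_state=0)
X_fs_train, X_fs_test, y_fs_train, y_fs_test = train_test_split(
    X_fs, y_fs, test_size=0.25, random_state=0
)

# Step 1: hypothetical feature subsets (column indices), one per base model
linear_cols = [0, 1, 2, 3, 4]
tree_cols = [3, 4, 5, 6, 7, 8, 9]

def keep_columns(cols):
    """Return a transformer that passes through `cols` and drops the rest."""
    return ColumnTransformer([('keep', 'passthrough', cols)], remainder='drop')

# Step 2: one pipeline per base model, each restricted to its own feature view
ridge_view = Pipeline([
    ('select', keep_columns(linear_cols)),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1.0)),
])
rf_view = Pipeline([
    ('select', keep_columns(tree_cols)),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Step 3: combine the feature-view models with a linear meta-learner
feature_view_stack = StackingRegressor(
    estimators=[('ridge_view', ridge_view), ('rf_view', rf_view)],
    final_estimator=LinearRegression(),
    cv=5,
)
feature_view_stack.fit(X_fs_train, y_fs_train)

fs_rmse = np.sqrt(mean_squared_error(y_fs_test, feature_view_stack.predict(X_fs_test)))
print(f"RMSE of the feature-view stack: {fs_rmse:.2f}")
```

In a real application, each pipeline would be tuned separately (for example with a cross-validated search) before being passed to the stack.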

## Exploring Stacking and Voting in Classification

### Voting Classifier

There are two main types of voting in classification ensembles:

**Hard Voting:**

- Each base model **votes for a class label**.
- The final prediction is the **majority class** across models.
- Simple and interpretable.

**Soft Voting:**
- Each base model predicts **class probabilities**.
- The probabilities are **averaged**, and the final prediction is the class with the **highest average probability**.
- Often performs better than hard voting, especially when the base models are well calibrated.
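
A small worked example may help show why the two rules can disagree. The three probabilities below are made up for illustration: two models lean slightly toward class 0, while one is confident in class 1.

```python
# Hypothetical P(class = 1) from three base models for a single observation
probs = [0.45, 0.45, 0.90]

# Hard voting: each model casts a vote using its own thresholded label
votes = [int(p >= 0.5) for p in probs]
hard_pred = int(sum(votes) > len(votes) / 2)

# Soft voting: average the probabilities first, then threshold once
avg_prob = sum(probs) / len(probs)
soft_pred = int(avg_prob >= 0.5)

print("votes:", votes, "-> hard voting predicts", hard_pred)
print("average probability:", round(avg_prob, 2), "-> soft voting predicts", soft_pred)
```

Hard voting predicts class 0 because two of the three votes are 0, while soft voting predicts class 1 because the confident model pulls the average probability to 0.6.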

The figure below illustrates the difference between hard and soft voting.

In [26]:
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./Datasets/hard_soft.png", width=800, height=300)

We'll build an ensemble of models to predict whether a person has heart disease, focusing on improving classification accuracy.

In [27]:
heart_data = pd.read_csv('./Datasets/Heart.csv')
print(heart_data.shape)
heart_data.head()

(303, 14)


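| | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 1 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 2 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 3 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 4 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |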


In [28]:
# Define target and features
clf_X = heart_data.drop(columns='AHD')
clf_y = heart_data['AHD'].map({'Yes': 1, 'No': 0})  # Convert target to binary 0/1

# Train-test split
clf_X_train, clf_X_test, clf_y_train, clf_y_test = train_test_split(
    clf_X, clf_y, stratify=clf_y, test_size=0.2, random_state=42
)

In [29]:
# Identify column types
clf_numeric_cols = ['Age', 'RestBP', 'Chol', 'MaxHR', 'Oldpeak']
clf_categorical_cols = ['Sex', 'ChestPain', 'Fbs', 'RestECG', 'ExAng', 'Slope', 'Ca', 'Thal']

# Preprocessing pipelines
clf_numeric_transformer = StandardScaler()
clf_categorical_transformer = OneHotEncoder(handle_unknown='ignore')

clf_preprocessor = ColumnTransformer([
    ('num', clf_numeric_transformer, clf_numeric_cols),
    ('cat', clf_categorical_transformer, clf_categorical_cols)
])

In [30]:
# Define all 8 base classifiers
base_clf_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(eval_metric='logloss', random_state=42),
    'LightGBM': lgb.LGBMClassifier(random_state=42, verbose=-1),
    'CatBoost': cb.CatBoostClassifier(random_state=42, verbose=0)
}

# Store results for both base and ensemble models
clf_results = []

for name, model in base_clf_models.items():
    # Create pipeline
    base_clf_pipeline = Pipeline([
        ('preprocessor', clf_preprocessor),
        ('classifier', model)
    ])
    
    # Train
    base_clf_pipeline.fit(clf_X_train, clf_y_train)
    
    # Predict
    base_y_pred = base_clf_pipeline.predict(clf_X_test)
    base_y_proba = base_clf_pipeline.predict_proba(clf_X_test)[:, 1]
    
    # Compute metrics
    base_acc = accuracy_score(clf_y_test, base_y_pred)
    base_roc = roc_auc_score(clf_y_test, base_y_proba)
    base_f1 = f1_score(clf_y_test, base_y_pred)
    
    clf_results.append({
        'Model': name,
        'Accuracy': base_acc,
        'ROC AUC': base_roc,
        'F1 Score': base_f1
    })

# Create DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

| | Model | Accuracy | ROC AUC | F1 Score |
|---|---|---|---|---|
| 0 | LightGBM | 0.901639 | 0.965368 | 0.896552 |
| 1 | Logistic Regression | 0.885246 | 0.958874 | 0.881356 |
| 2 | KNN | 0.868852 | 0.94697 | 0.862069 |
| 3 | CatBoost | 0.885246 | 0.945887 | 0.885246 |
| 4 | Gradient Boosting | 0.868852 | 0.943723 | 0.866667 |
| 5 | Random Forest | 0.868852 | 0.937229 | 0.866667 |
| 6 | XGBoost | 0.836066 | 0.920996 | 0.83871 |
| 7 | Decision Tree | 0.655738 | 0.660173 | 0.655738 |


Next, we define the base classifiers as `(name, estimator)` tuples, the format expected by ensemble models like `VotingClassifier` and `StackingClassifier`:

In [31]:
# Define all classifiers (as tuples for VotingClassifier)
clf_base_learners = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42)),
    ('xgb', xgb.XGBClassifier(eval_metric='logloss', random_state=42)),
    ('lgb', lgb.LGBMClassifier(random_state=42, verbose=-1)),  # verbose=-1 silences training logs, as in the base-model cell
    ('cat', cb.CatBoostClassifier(random_state=42, verbose=0))
]

#### Hard Voting Ensemble

In [32]:
# Create a hard voting classifier (voting='hard')
ens_clf_hard_voting = VotingClassifier(
    estimators=clf_base_learners,
    voting='hard'  # Use predicted class labels for majority rule voting
)

# Pipeline with preprocessing
ens_clf_hard_voting_pipeline = Pipeline([
    ('preprocessor', clf_preprocessor),
    ('voting', ens_clf_hard_voting)
])

# Fit model
ens_clf_hard_voting_pipeline.fit(clf_X_train, clf_y_train)


In [33]:
# Predict
ens_y_pred_hard = ens_clf_hard_voting_pipeline.predict(clf_X_test)

# Compute metrics
ens_acc_hard = accuracy_score(clf_y_test, ens_y_pred_hard)
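# Hard voting exposes no predict_proba, so ROC AUC is computed from predicted labels here;
# it is not directly comparable to the probability-based ROC AUC of the base models
ens_roc_hard = roc_auc_score(clf_y_test, ens_y_pred_hard)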
ens_f1_hard = f1_score(clf_y_test, ens_y_pred_hard)

# Store results
clf_results.append({
    'Model': 'Hard Voting Classifier',
    'Accuracy': ens_acc_hard,
    'ROC AUC': ens_roc_hard,
    'F1 Score': ens_f1_hard
})

# Create or update results DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

| | Model | Accuracy | ROC AUC | F1 Score |
|---|---|---|---|---|
| 0 | LightGBM | 0.901639 | 0.965368 | 0.896552 |
| 1 | Logistic Regression | 0.885246 | 0.958874 | 0.881356 |
| 2 | KNN | 0.868852 | 0.94697 | 0.862069 |
| 3 | CatBoost | 0.885246 | 0.945887 | 0.885246 |
| 4 | Gradient Boosting | 0.868852 | 0.943723 | 0.866667 |
| 5 | Random Forest | 0.868852 | 0.937229 | 0.866667 |
| 6 | XGBoost | 0.836066 | 0.920996 | 0.83871 |
| 7 | Hard Voting Classifier | 0.901639 | 0.906385 | 0.9 |
| 8 | Decision Tree | 0.655738 | 0.660173 | 0.655738 |


#### Soft Voting Ensemble

In [34]:
## Soft Voting Ensemble

# Define soft voting classifier
ens_clf_soft_voting = VotingClassifier(
    estimators=clf_base_learners,
    voting='soft'  # Use predicted probabilities
)

# Create pipeline with preprocessing
ens_clf_soft_voting_pipeline = Pipeline([
    ('preprocessor', clf_preprocessor),
    ('voting', ens_clf_soft_voting)
])

# Fit model
ens_clf_soft_voting_pipeline.fit(clf_X_train, clf_y_train)

In [35]:
# Predict
ens_y_pred_soft = ens_clf_soft_voting_pipeline.predict(clf_X_test)

# Compute metrics
ens_acc_soft = accuracy_score(clf_y_test, ens_y_pred_soft)
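# ROC AUC is computed from predicted labels here (as for hard voting);
# predict_proba(clf_X_test)[:, 1] would give a threshold-independent score instead
ens_roc_soft = roc_auc_score(clf_y_test, ens_y_pred_soft)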
ens_f1_soft = f1_score(clf_y_test, ens_y_pred_soft)

# Store results
clf_results.append({
    'Model': 'Soft Voting Classifier',
    'Accuracy': ens_acc_soft,
    'ROC AUC': ens_roc_soft,
    'F1 Score': ens_f1_soft
})

# Update results DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

| | Model | Accuracy | ROC AUC | F1 Score |
|---|---|---|---|---|
| 0 | LightGBM | 0.901639 | 0.965368 | 0.896552 |
| 1 | Logistic Regression | 0.885246 | 0.958874 | 0.881356 |
| 2 | KNN | 0.868852 | 0.94697 | 0.862069 |
| 3 | CatBoost | 0.885246 | 0.945887 | 0.885246 |
| 4 | Gradient Boosting | 0.868852 | 0.943723 | 0.866667 |
| 5 | Random Forest | 0.868852 | 0.937229 | 0.866667 |
| 6 | XGBoost | 0.836066 | 0.920996 | 0.83871 |
| 7 | Hard Voting Classifier | 0.901639 | 0.906385 | 0.9 |
| 8 | Soft Voting Classifier | 0.901639 | 0.906385 | 0.9 |
| 9 | Decision Tree | 0.655738 | 0.660173 | 0.655738 |
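
One caveat when reading the ROC AUC column: for the voting ensembles, `roc_auc_score` was given predicted class labels, whereas the base models were scored on predicted probabilities. A hard-voting `VotingClassifier` offers no `predict_proba`, but a soft-voting one does. The sketch below, on synthetic data with a smaller set of base models (both assumptions for illustration), shows the two ways of computing ROC AUC side by side; the values it prints are not results for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption, not the heart dataset)
X_auc, y_auc = make_classification(n_samples=400, n_features=10,
                                   n_informative=5, random_state=0)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_auc, y_auc, stratify=y_auc, test_size=0.25, random_state=0
)

soft_demo = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('knn', KNeighborsClassifier()),
    ],
    voting='soft',
)
soft_demo.fit(Xa_train, ya_train)

# ROC AUC from hard labels: reflects only the single 0.5 threshold
auc_from_labels = roc_auc_score(ya_test, soft_demo.predict(Xa_test))

# ROC AUC from averaged probabilities: summarizes ranking quality over all thresholds
auc_from_proba = roc_auc_score(ya_test, soft_demo.predict_proba(Xa_test)[:, 1])

print(f"ROC AUC from labels:        {auc_from_labels:.3f}")
print(f"ROC AUC from probabilities: {auc_from_proba:.3f}")
```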


### Stacking Classifier

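Conceptually, a stacking classifier works like the stacking regressor: the cross-validated predictions of the base models (by default, predicted class probabilities where available) become the inputs to a meta-model, which learns how to combine them.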
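
To make the mechanics concrete, here is a rough manual version of stacking for a binary problem. It is a simplified sketch on synthetic data with three base models (assumptions for illustration), not a reproduction of scikit-learn's internals: out-of-fold probabilities from each base model are used to train the meta-model, and the base models are then refit on the full training set to produce the test-time inputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption)
X_st, y_st = make_classification(n_samples=400, n_features=10,
                                 n_informative=5, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_st, y_st, stratify=y_st, test_size=0.25, random_state=0
)

level0 = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    KNeighborsClassifier(),
]

# Step 1: out-of-fold P(class = 1) from each base model become the meta-features,
# so the meta-model never sees predictions made on data a base model was trained on
meta_train = np.column_stack([
    cross_val_predict(m, Xs_train, ys_train, cv=5, method='predict_proba')[:, 1]
    for m in level0
])

# Step 2: the meta-model learns how to weight the base-model probabilities
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, ys_train)

# Step 3: refit each base model on all training data and stack its test probabilities
meta_test = np.column_stack([
    m.fit(Xs_train, ys_train).predict_proba(Xs_test)[:, 1] for m in level0
])

manual_stack_acc = meta_model.score(meta_test, ys_test)
print(f"Accuracy of the manual stack: {manual_stack_acc:.3f}")
```

`StackingClassifier` packages these steps behind a single `fit`/`predict` interface, which is what the next cell uses.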

In [36]:
# Create a meta-model
ens_meta_clf = LogisticRegression(max_iter=1000, random_state=42)

# Define the stacking classifier
ens_clf_stacking = StackingClassifier(
    estimators=clf_base_learners,
    final_estimator=ens_meta_clf,
    cv=5
)

# Create pipeline with preprocessing
ens_clf_stacking_pipeline = Pipeline([
    ('preprocessor', clf_preprocessor),
    ('stacking', ens_clf_stacking)
])

# Fit the pipeline
ens_clf_stacking_pipeline.fit(clf_X_train, clf_y_train)

# Predict
ens_y_pred_stack = ens_clf_stacking_pipeline.predict(clf_X_test)

# Compute metrics
ens_acc_stack = accuracy_score(clf_y_test, ens_y_pred_stack)
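# ROC AUC is computed from predicted labels here, to match the voting rows;
# predict_proba(clf_X_test)[:, 1] would give a threshold-independent score instead
ens_roc_stack = roc_auc_score(clf_y_test, ens_y_pred_stack)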
ens_f1_stack = f1_score(clf_y_test, ens_y_pred_stack)

# Store results
clf_results.append({
    'Model': 'Stacking Classifier',
    'Accuracy': ens_acc_stack,
    'ROC AUC': ens_roc_stack,
    'F1 Score': ens_f1_stack
})

# Update results DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

| | Model | Accuracy | ROC AUC | F1 Score |
|---|---|---|---|---|
| 0 | LightGBM | 0.901639 | 0.965368 | 0.896552 |
| 1 | Logistic Regression | 0.885246 | 0.958874 | 0.881356 |
| 2 | KNN | 0.868852 | 0.94697 | 0.862069 |
| 3 | CatBoost | 0.885246 | 0.945887 | 0.885246 |
| 4 | Gradient Boosting | 0.868852 | 0.943723 | 0.866667 |
| 5 | Random Forest | 0.868852 | 0.937229 | 0.866667 |
| 6 | XGBoost | 0.836066 | 0.920996 | 0.83871 |
| 7 | Hard Voting Classifier | 0.901639 | 0.906385 | 0.9 |
| 8 | Soft Voting Classifier | 0.901639 | 0.906385 | 0.9 |
| 9 | Stacking Classifier | 0.901639 | 0.906385 | 0.9 |


Let's print the coefficients of the logistic regression meta-model:

In [37]:
# Access the trained meta-model inside the stacking pipeline
ens_meta_clf = ens_clf_stacking_pipeline.named_steps['stacking'].final_estimator_

# Get model names in the same order as the coefficients
meta_model_names = [name for name, _ in clf_base_learners]

# Extract coefficients
meta_model_coefs = ens_meta_clf.coef_[0]  # For binary classification

# Create a DataFrame to display model weights
meta_coef_df = pd.DataFrame({
    'Base Model': meta_model_names,
    'Meta-Model Coefficient': meta_model_coefs
})

# Sort by coefficient value (optional)
meta_coef_df = meta_coef_df.sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)

# Show the coefficients
meta_coef_df

| | Base Model | Meta-Model Coefficient |
|---|---|---|
| 0 | lr | 2.296773 |
| 1 | rf | 1.251281 |
| 2 | cat | 0.636938 |
| 3 | knn | 0.471488 |
| 4 | dt | 0.446707 |
| 5 | xgb | 0.338406 |
| 6 | lgb | 0.16809 |
| 7 | gb | -0.317501 |
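
To see how these coefficients turn base-model outputs into a final probability, the short calculation below applies the logistic formula by hand. It assumes the meta-model's inputs are each base model's predicted probability of class 1. The coefficients are copied from the table above, but the intercept and the base-model probabilities are hypothetical values chosen only for illustration, since neither is shown in this chapter.

```python
import math

# Coefficients copied from the meta-model table above
coefs = {'lr': 2.296773, 'rf': 1.251281, 'cat': 0.636938, 'knn': 0.471488,
         'dt': 0.446707, 'xgb': 0.338406, 'lgb': 0.16809, 'gb': -0.317501}

# Hypothetical values (assumptions): the fitted intercept is not reported,
# and these base-model probabilities are made up for a single patient
intercept = -2.5
base_probs = {'lr': 0.80, 'rf': 0.70, 'cat': 0.75, 'knn': 0.60,
              'dt': 1.00, 'xgb': 0.85, 'lgb': 0.90, 'gb': 0.88}

# Logistic regression: log-odds are a weighted sum of the inputs plus the intercept
log_odds = intercept + sum(coefs[name] * base_probs[name] for name in coefs)
stacked_prob = 1.0 / (1.0 + math.exp(-log_odds))

# Effect on the log-odds of raising one model's probability by 0.1
effect_lr = coefs['lr'] * 0.1
effect_gb = coefs['gb'] * 0.1

print(f"log-odds = {log_odds:.3f}, stacked P(class = 1) = {stacked_prob:.3f}")
print(f"+0.1 in lr shifts log-odds by {effect_lr:+.3f}; +0.1 in gb by {effect_gb:+.3f}")
```

The comparison in the last line reflects the table: a change in the logistic regression base model moves the stacked prediction far more than the same change in gradient boosting, whose coefficient is small and negative.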


### Interpretation of Ensemble Results

Interestingly, the **Hard Voting**, **Soft Voting**, and **Stacking** classifiers all achieved identical accuracy (0.9016), ROC AUC (0.9064), and F1 score (0.90) on the test set. Several factors help explain this outcome:

- **Small test set**: With `test_size=0.2` on 303 rows, the test set contains only 61 observations, so an accuracy of 0.9016 corresponds to 55 correct predictions. Ensembles that differ only on a few borderline cases can easily end up with the same counts.

- **Highly correlated base models**: All three ensembles combine the same eight classifiers, and seven of them (all except the Decision Tree) already reach accuracies between 0.84 and 0.90. Their predictions largely agree, so different ways of combining them lead to nearly identical outputs.

- **ROC AUC computed from labels**: For the ensembles, `roc_auc_score` received predicted class labels rather than probabilities, so the value depends only on the confusion-matrix counts. This is why it matches across the three ensembles, and why it is not directly comparable to the probability-based ROC AUC of the base models.

- **Different meta-model weights, same decisions**: The logistic regression meta-model assigns clearly unequal coefficients (from 2.30 for `lr` down to -0.32 for `gb`), so it is not simply averaging. Even so, once its output is thresholded, the resulting class labels appear to coincide with those of the voting ensembles on this test set.

- **No diversity in feature views or training subsets**: All base models were trained on the same feature set and data. Adding diversity (e.g., by using different subsets of features or training data) could help stacking outperform voting-based ensembles.

Ensemble methods improve performance by combining diverse models. In this case, most base learners were already strong and closely aligned, which limited the potential gain from ensembling.
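
One way to check how much diversity a set of base models actually offers is to compare their predictions directly. The sketch below is an illustration on synthetic data with four base models (assumptions, not the chapter's fitted models): it computes the pairwise correlation of predicted probabilities and the pairwise rate of agreement on class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (assumption)
X_dv, y_dv = make_classification(n_samples=400, n_features=10,
                                 n_informative=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_dv, y_dv, stratify=y_dv, test_size=0.25, random_state=0
)

div_models = {
    'lr': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'rf': RandomForestClassifier(n_estimators=100, random_state=0),
    'gb': GradientBoostingClassifier(random_state=0),
}

# Test-set P(class = 1) for each model, one column per model
prob_matrix = np.column_stack([
    m.fit(Xd_train, yd_train).predict_proba(Xd_test)[:, 1]
    for m in div_models.values()
])

# Pairwise correlation of predicted probabilities (values near 1 mean little diversity)
prob_corr = np.corrcoef(prob_matrix, rowvar=False)

# Pairwise share of test observations on which two models predict the same label
labels = (prob_matrix >= 0.5).astype(int)
agreement = (labels[:, :, None] == labels[:, None, :]).mean(axis=0)

print("Models:", list(div_models))
print("Probability correlation:\n", np.round(prob_corr, 2))
print("Label agreement:\n", np.round(agreement, 2))
```

High correlations and agreement rates close to 1 suggest that voting and stacking will have little room to differ from each other or from the best base model.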



## Summary: Stacking and Voting Ensembles

In this chapter, we explored ensemble learning methods—**Voting** and **Stacking**—and applied them to both regression and classification tasks. We incorporated all the models you’ve learned throughout this sequence and compared their performance against individual base models using a variety of evaluation metrics, including **RMSE**, **Accuracy**, **ROC AUC**, and **F1 Score**.

Key takeaways include:

- **Voting** combines predictions from multiple models either by majority rule (hard voting) or by averaging probabilities (soft voting). It's simple, intuitive, and often improves robustness.
- **Stacking** uses a meta-model to learn the best way to combine base model predictions. It has the potential to outperform voting when base models offer complementary strengths.
- In practice, the performance gains from ensembling depend heavily on the **diversity and strength of the base models**. When base models are highly correlated or already strong, ensembling may yield limited additional benefit.

This chapter also highlighted the importance of:

- Selecting diverse and well-tuned base learners,
- Understanding the behavior of meta-models in stacking,
- Evaluating ensemble strategies based on the specific task and dataset.