Ensembling models can improve predictive performance by leveraging the **diversity** and **collective wisdom** of multiple models. Instead of relying on a single model, we train several individual models and combine their predictions to make a final decision. 

We have already seen ensemble methods like bagging and boosting. These ensembles primarily reduce error by:

* **Reducing bias** — as in boosting

* **Reducing variance** — as in bagging and random forests

In this chapter, we'll go a step further and explore ensembles that **combine different types of models**. For example, we might ensemble a linear regression model, a random forest, and a gradient boosting model. The goal is to build stronger predictors by combining models that complement each other's strengths and weaknesses.

## Why Ensemble Diverse Models:

**Bias Reduction:** 

- Different models often exhibit distinct biases. For example, a linear regression model might underfit complex patterns, while a Random Forest might overfit noisy data. Combining their predictions can mitigate these biases, leading to a more generalized model.
- **Example**: If a linear model overpredicts and a boosting model underpredicts for the same instance, averaging their predictions can cancel out the biases.

**Variance Reduction:**

- As seen with Random Forests, averaging predictions from multiple models reduces variance, especially when the models are uncorrelated (recall the variance reduction formula for bagging).
- **Key Requirement**: For effective variance reduction, the models should have low correlation in their predictions.




###  Mathematical Justification for Ensembles

We can mathematically illustrate how ensembling improves prediction accuracy using the case of regression.  
Let the predictors be denoted by \( X \), and the response by \( Y \). Assume we have \( m \) individual models $( f_1, f_2, \dots, f_m $). The ensemble predictor is the average:

$$
\hat{f}_{ensemble}(X) = \frac{1}{m} \sum_{i=1}^{m} f_i(X)
$$

The expected mean squared error (MSE) of the ensemble model is:

$$
E(MSE_{Ensemble}) = E\left[\left( \frac{1}{m} \sum_{i = 1}^{m} f_i(X) - Y \right)^2 \right]
$$

This expands to:

$$
E(MSE_{Ensemble}) = \frac{1}{m^2} \sum_{i = 1}^{m} E\left[(f_i(X) - Y)^2 \right] + \frac{1}{m^2} \sum_{i \ne j} E\left[(f_i(X) - Y)(f_j(X) - Y) \right]
$$

$$
= \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right) + \frac{1}{m^2} \sum_{i \ne j} E\left[(f_i(X) - Y)(f_j(X) - Y) \right]
$$

If the individual models $( f_1, \dots, f_m $) are **unbiased**, the cross terms become covariances:

$$
E(MSE_{Ensemble}) = \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right) + \frac{1}{m^2} \sum_{i \ne j} Cov(f_i(X), f_j(X))
$$

If the models are **uncorrelated**, the covariance terms vanish:

$$
E(MSE_{Ensemble}) = \frac{1}{m} \left( \frac{1}{m} \sum_{i=1}^m E(MSE_{f_i}) \right)
$$

> 🔍 **Conclusion**: When the individual models are both **unbiased** and **uncorrelated**, the expected MSE of the ensemble is **strictly lower** than the average MSE of the individual models. This provides a strong theoretical justification for ensembling diverse models to improve prediction accuracy.
>
> In practice, the ensemble's performance tends to improve unless a single model is significantly more accurate than the others. For example, if one model has near-zero MSE while others perform poorly, ensembling may actually hurt performance. Therefore, the benefit of ensembling depends not only on diversity but also on the **relative quality** of the individual models.



## Combining Model Predictions: Two Common Approaches

There are two widely used methods for combining model predictions in ensemble learning:

- **Voting**: Combines the predictions of multiple models directly. In classification, this could be majority voting; in regression, it's often simple averaging. Voting is intuitive and works well when the base models are reasonably strong and diverse.

- **Stacking**: Trains a new model (called a **meta-learner**) to learn how to best combine the predictions of the base models. Stacking can capture more complex relationships among the models and often yields higher accuracy, especially when base models differ significantly in structure or behavior.

Let's apply them to our car dataset

In [65]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score,train_test_split, GridSearchCV, ParameterGrid, \
StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import root_mean_squared_error, mean_squared_error,r2_score,roc_curve,auc,precision_recall_curve, accuracy_score, roc_auc_score, f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import VotingRegressor, VotingClassifier, StackingRegressor, \
StackingClassifier, GradientBoostingRegressor,GradientBoostingClassifier, BaggingRegressor, \
BaggingClassifier,RandomForestRegressor,RandomForestClassifier,AdaBoostRegressor,AdaBoostClassifier
from sklearn.linear_model import LinearRegression,LogisticRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
import itertools as it
import time as time
import xgboost as xgb
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [66]:
# Load the dataset
car = pd.read_csv('Datasets/car.csv')
car.head()

Unnamed: 0,brand,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,vw,Beetle,2014,Manual,55457,Diesel,30,65.3266,1.6,7490
1,vauxhall,GTC,2017,Manual,15630,Petrol,145,47.2049,1.4,10998
2,merc,G Class,2012,Automatic,43000,Diesel,570,25.1172,3.0,44990
3,audi,RS5,2019,Automatic,10,Petrol,145,30.5593,2.9,51990
4,merc,X-CLASS,2018,Automatic,14000,Diesel,240,35.7168,2.3,28990


In [67]:
X = car.drop(columns=['price'])
y = car['price']

# extract the categorical columns and put them in a list
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# extract the numerical columns and put them in a list
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [68]:

from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Create preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

Below is the models we have covered so far

In [69]:
import lightgbm as lgb
import catboost as cb

# Define models to evaluate
models = {
    'Linear Regression': LinearRegression(),
    'KNN Regressor': KNeighborsRegressor(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor( random_state=42),
    'XGBoost': xgb.XGBRegressor( random_state=42),
    'LightGBM': lgb.LGBMRegressor(random_state=42),
    'CatBoost': cb.CatBoostRegressor(random_state=42, verbose=0)
}

We’ll first evaluate each model with default settings to establish baseline performance before ensembling.

In [70]:
# store the results
reg_results = {}

# Loop through models
for name, model in models.items():
    # Create a pipeline with preprocessing and the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    # Fit the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Calculate RMSE
    rmse = root_mean_squared_error(y_test, y_pred)
    
    # Store the results
    reg_results[name] = rmse

# Convert results to DataFrame for better visualization
reg_results_df = pd.DataFrame.from_dict(reg_results, orient='index', columns=['RMSE'])
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True)
reg_results_df.reset_index(inplace=True)
reg_results_df.columns = ['Model', 'RMSE']
# Print the results
reg_results_df



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000154 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 764
[LightGBM] [Info] Number of data points in the train set: 6105, number of used features: 86
[LightGBM] [Info] Start training from score 23554.115971


Unnamed: 0,Model,RMSE
0,CatBoost,3296.493137
1,XGBoost,3397.155518
2,Random Forest,3660.14597
3,LightGBM,3729.955778
4,KNN Regressor,4062.83968
5,Decision Tree,5015.812547
6,Linear Regression,5801.435399


## Voting

In this section, we explore the **Voting Regressor**, a simple but effective ensemble method that combines the predictions of multiple models.

### What Is Voting in Regression?

While voting is often associated with classification (e.g., majority vote), in regression tasks, it typically refers to **averaging** the predicted values from multiple base models. This approach is implemented in scikit-learn using `VotingRegressor`.

### How It Works

The `VotingRegressor` takes a list of individual regression models and:
- Trains all of them **simultaneously** on the same training data.
- Averages their predictions during inference to generate the final prediction.

> 📌 **Note**: We **do not need to fit the models individually** before including them in the `VotingRegressor`. Doing so would result in unnecessary computation and **waste time**, as `VotingRegressor.fit()` will handle training for all models internally.

### Why It Works

- **Variance Reduction**: Averaging helps smooth out individual model fluctuations.
- **Error Compensation**: If one model overpredicts and another underpredicts, the ensemble prediction may be closer to the true value.
- **Robustness**: Combining diverse models helps ensure that weaknesses of one model are offset by strengths of another.

### Equal Weights

By default, `VotingRegressor` assigns **equal weights** to all models — treating each prediction equally in the average.  
Let us now ensemble the models using the voting ensemble with **equal weights** to obtain a combined prediction.

### When to Use

- You have multiple well-performing but diverse models.
- You want a quick ensemble method without the added complexity of training a meta-model (as in stacking).


Below is how you can **ensemble the same models** using VotingRegressor with the same preprocessor in a pipeline, without fitting each model individually again:

In [71]:
from sklearn.ensemble import VotingRegressor

# Define base regressors (same as before)
voting_estimators = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor(random_state=42)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('xgb', xgb.XGBRegressor(random_state=42)),
    ('lgb', lgb.LGBMRegressor(random_state=42)),
    ('cat', cb.CatBoostRegressor(random_state=42, verbose=0))
]

# Create a VotingRegressor with equal weights
voting_regressor = VotingRegressor(estimators=voting_estimators)

# Create a pipeline with preprocessing and voting ensemble
voting_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('ensemble', voting_regressor)
])

# Fit the ensemble model (no need to fit individual models beforehand!)
voting_pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred_vote = voting_pipeline.predict(X_test)
rmse_vote = root_mean_squared_error(y_test, y_pred_vote)

# Add ensemble result to the results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Voting Ensemble', rmse_vote]
results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Show updated results
results_df




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000180 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 764
[LightGBM] [Info] Number of data points in the train set: 6105, number of used features: 86
[LightGBM] [Info] Start training from score 23554.115971




Unnamed: 0,Model,RMSE
0,CatBoost,3296.493137
1,Voting Ensemble,3302.202159
2,XGBoost,3397.155518
3,Random Forest,3660.14597
4,LightGBM,3729.955778
5,KNN Regressor,4062.83968
6,Decision Tree,5015.812547
7,Linear Regression,5801.435399


It’s not uncommon for a single, well-optimized model like CatBoost to outperform a voting ensemble, especially if the ensemble’s models lack diversity, aren’t well-tuned, or if the dataset doesn’t benefit from ensembling. CatBoost’s advanced categorical feature handling and robust defaults likely gave it an edge, particularly if your voting ensemble relied on models with correlated predictions or wasn’t optimized for the specific dataset characteristics (e.g., size, complexity, or imbalance).



###  Strategies to Improve Voting Ensemble Performance

To boost the effectiveness of a voting ensemble, consider the following enhancements:

- **Increase Model Diversity**  
  Incorporate a wider range of model types (e.g., SVMs, neural networks, or k-NN) alongside tree-based models. Diverse models are more likely to capture different patterns in the data and produce uncorrelated errors — a key factor in ensemble success, as discussed earlier in this chapter.

- **Tune Base Models Individually**  
  Optimize each base model using hyperparameter tuning techniques such as **Optuna**. Well-tuned individual models provide stronger building blocks for the ensemble, improving the final averaged prediction.

- **Use Weighted Voting**  
  Instead of assigning equal importance to each model, assign weights based on their individual performance (e.g., lower RMSE → higher weight). This helps emphasize the contribution of stronger models like CatBoost or XGBoost.  
  *(Note: In a more advanced setup, stacking takes this further by learning the best combination strategy using a meta-model.)*


## Stacking

**Stacking** is a more sophisticated ensembling technique that learns how to best combine multiple base models using a separate meta-model (also called the `final_estimator`).

Here’s how the process works:

1. **Cross-validated predictions for base models**  
   The training data is split into *K* folds (typically using cross-validation). For each fold:
   - The base models are trained on the remaining *K–1* folds.
   - Predictions are made on the held-out fold.

2. **Out-of-fold predictions become new features**  
   This process generates **out-of-fold predictions** for each training point from each base model (i.e., predictions made on data not seen during training). These predictions are used as **features** for the next stage.

3. **Training the meta-model (`final_estimator`)**  
   The meta-model is trained on these out-of-fold predictions as input features and the original target variable as the response. It learns how to combine the base model outputs to make a better overall prediction.

> The goal of stacking is to leverage the strengths of each individual model while minimizing their weaknesses, often resulting in improved accuracy over any single model.


#### Metamodel: Linear regression

In [78]:
# Define your full list of diverse base models
stacking_estimators = [
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('dt', DecisionTreeRegressor(random_state=42)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('xgb', xgb.XGBRegressor(random_state=42)),
    ('lgb', lgb.LGBMRegressor(random_state=42, verbose=-1)),
    ('cat', cb.CatBoostRegressor(random_state=42, verbose=0))
]

In [None]:
# Define a meta-model
meta_model = LinearRegression()

# Create the stacking regressor
stacking_model = StackingRegressor(
    estimators=stacking_estimators,
    final_estimator=meta_model,
    cv=5,
    passthrough=False  # Set to True if you want to include original features in meta-model
)

# Wrap with pipeline (using your preprocessor)
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_model)
])

# Fit the stacking model
stacking_pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred_stack = stacking_pipeline.predict(X_test)
rmse_stack = root_mean_squared_error(y_test, y_pred_stack)

print(f"Stacking Regressor RMSE: {rmse_stack:.2f}")



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000157 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 764
[LightGBM] [Info] Number of data points in the train set: 6105, number of used features: 86
[LightGBM] [Info] Start training from score 23554.115971




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000191 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 749
[LightGBM] [Info] Number of data points in the train set: 4884, number of used features: 80
[LightGBM] [Info] Start training from score 23420.609746
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000178 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 752
[LightGBM] [Info] Number of data points in the train set: 4884, number of used features: 82
[LightGBM] [Info] Start training from score 23802.703522
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000208 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 752
[LightGBM] [Info] Number of data points in the train set: 4884, number of used features: 81
[LightGBM] [Info] Start t



[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000256 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 750
[LightGBM] [Info] Number of data points in the train set: 4884, number of used features: 81
[LightGBM] [Info] Start training from score 23507.457821
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000201 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 747
[LightGBM] [Info] Number of data points in the train set: 4884, number of used features: 79
[LightGBM] [Info] Start training from score 23429.221130
Stacking Regressor RMSE: 3147.70




In [73]:
# Append the Stacking Regressor result to results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Stacking Regressor', rmse_stack]

# Sort by RMSE in ascending order and reset index
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Display updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,Stacking Regressor,3147.701164
1,CatBoost,3296.493137
2,Voting Ensemble,3302.202159
3,XGBoost,3397.155518
4,Random Forest,3660.14597
5,LightGBM,3729.955778
6,KNN Regressor,4062.83968
7,Decision Tree,5015.812547
8,Linear Regression,5801.435399


In [74]:
# Access the trained meta-model inside the stacking pipeline
meta_model = stacking_pipeline.named_steps['stacking'].final_estimator_

# Get model names in the same order as the coefficients
model_names = [name for name, _ in stacking_estimators]

# Extract coefficients
coefs = meta_model.coef_

# Create a DataFrame to display model weights
import pandas as pd
coef_df = pd.DataFrame({'Base Model': model_names, 'Meta-Model Coefficient': coefs})

# Sort by weight (optional)
coef_df = coef_df.sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)

# Show the weights
coef_df



Unnamed: 0,Base Model,Meta-Model Coefficient
0,cat,0.583701
1,rf,0.147426
2,xgb,0.126955
3,knn,0.124413
4,lr,0.034756
5,dt,0.008769
6,lgb,-0.011164


Note the above coefficients of the meta-model. The model gives the **highest weight** to the **catboost** model, and the **rf** model, and the **lowest weight** to the **lgb** model. 

Also, note that the **coefficients need not sum to one.**

###  Why did LightGBM get the lowest (even negative) coefficient?

1. **Stacking is not based on model performance alone**
   - The meta-model (`LinearRegression`) **doesn't assign weights based on RMSE directly**.
   - Instead, it **learns how to combine the model predictions** to best fit the training data (specifically, the **out-of-fold predictions** from each base model).
   - So, even if LightGBM performs decently **on its own**, its predictions may be **redundant** or **highly correlated** with stronger models (e.g., CatBoost or XGBoost).

2. **LightGBM and XGBoost are often similar**
   - Both are gradient boosting methods — if they make **very similar predictions**, the meta-model may favor just one of them (in this case, XGBoost slightly more).
   - Including both may introduce **multicollinearity**, and the linear model tries to **suppress redundancy** by assigning a near-zero or **negative weight**.

3. **Linear regression allows negative weights**
   - Unlike voting (which only uses positive weights), a linear model may assign a **negative coefficient** if it slightly improves the overall fit.
   - This doesn’t mean the model is “bad,” but rather that **its prediction direction may not help much** in the presence of other models.


Let us try improving the RMSE further by removing the weaker models from the ensemble. Let us remove the three weakest models based on the size of their coefficients in the linear regression metamodel.

In [75]:
# Define top 4 base models
top_models = [
    ('cat', cb.CatBoostRegressor(random_state=42, verbose=0)),
    ('rf', RandomForestRegressor(random_state=42)),
    ('xgb', xgb.XGBRegressor(random_state=42)),
    ('knn', KNeighborsRegressor())
]

# Meta-model
meta_model = LinearRegression()

# Build the stacking ensemble
stacking_top4 = StackingRegressor(
    estimators=top_models,
    final_estimator=meta_model,
    cv=5
)

# Create pipeline with preprocessing
stacking_top4_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_top4)
])

# Fit the pipeline
stacking_top4_pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred_top4 = stacking_top4_pipeline.predict(X_test)
rmse_top4 = root_mean_squared_error(y_test, y_pred_top4)

# Add result to results_df
reg_results_df.loc[len(results_df.index)] = ['Stacking Top 4 Models', rmse_top4]
reg_results_df = results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Show updated results
reg_results_df

Unnamed: 0,Model,RMSE
0,CatBoost,3296.493137
1,Voting Ensemble,3302.202159
2,XGBoost,3397.155518
3,Random Forest,3660.14597
4,LightGBM,3729.955778
5,KNN Regressor,4062.83968
6,Decision Tree,5015.812547
7,Linear Regression,5801.435399


The metamodel accuracy **improves further**, when **strong models** are ensembled.

In [76]:
# Access the trained meta-model inside the stacking pipeline
meta_model_top4 = stacking_top4_pipeline.named_steps['stacking'].final_estimator_

# Get the names of the base models
top_model_names = [name for name, _ in top_models]

# Extract coefficients
top4_coefs = meta_model_top4.coef_

# Create a DataFrame to display the weights
top4_coef_df = pd.DataFrame({
    'Base Model': top_model_names,
    'Meta-Model Coefficient': top4_coefs
}).sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)

# Show the result
top4_coef_df

Unnamed: 0,Base Model,Meta-Model Coefficient
0,cat,0.602477
1,rf,0.156181
2,knn,0.131
3,xgb,0.122639


### Choosing the Meta-Model in Stacking

In stacking, the **meta-model** (also called the *final estimator*) is responsible for learning how to combine the predictions of the base models. It takes the base models' predictions as input features and learns how to best map them to the target.

While `LinearRegression` is a popular default choice due to its **simplicity**, **speed**, and **interpretability**, you are not limited to it. **Any regression model** can be used as the meta-model, depending on your goals:

- Use `Ridge` or `Lasso` if regularization is needed (e.g., to handle multicollinearity).
- Use a **tree-based model** (e.g., `RandomForestRegressor`, `XGBRegressor`) if you suspect **nonlinear interactions** between base model predictions.
- Use `SVR`, `MLPRegressor`, or `KNeighborsRegressor` for flexible, non-parametric alternatives (though often more sensitive to tuning and data scale).

The choice of meta-model can significantly affect the performance of your stacked ensemble.

###  General Guidelines for Choosing a Meta-Model in Stacking

When selecting a meta-model (final estimator) for a stacking ensemble, consider the following:

- **If your base models are highly correlated**  
  Use a regularized linear model like `Ridge` or `Lasso`. These models help reduce overfitting by shrinking or zeroing out redundant coefficients.

- **If you suspect nonlinear interactions between base model predictions**  
  Use a flexible, non-linear model like `RandomForestRegressor`, `XGBRegressor`, or `MLPRegressor` as the meta-model. These can better capture complex relationships among the base model outputs.

- **If interpretability is not a priority**  
  Consider using more powerful learners (e.g., tree ensembles or neural networks) for the meta-model to potentially boost performance, especially on complex datasets.

The meta-model should complement your base models and match the complexity of the prediction task.


In this car dataset, CatBoost, XGBoost, LightGBM, and Random Forest are all tree-based. They're likely making similar predictions. This causes:

* Multicollinearity

* Unstable or suboptimal coefficients
Let's try a regularized model Ridge to suppress redundant contributions more effectively and reduce overfitting


In [79]:
from sklearn.linear_model import Ridge, RidgeCV

# Replace meta-model
stacking_model = StackingRegressor(
    estimators=stacking_estimators,
    final_estimator=Ridge(),
    cv=5
)

# Create pipeline with preprocessing
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_model)
])
# Fit the pipeline
stacking_pipeline.fit(X_train, y_train)
# Predict and evaluate
y_pred_ridge = stacking_pipeline.predict(X_test)
rmse_ridge = root_mean_squared_error(y_test, y_pred_ridge)
# Add result to results_df
reg_results_df.loc[len(results_df.index)] = ['Stacking with Ridge', rmse_ridge]
reg_results_df = results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)
# Show updated results
reg_results_df



Unnamed: 0,Model,RMSE
0,CatBoost,3296.493137
1,Voting Ensemble,3302.202159
2,XGBoost,3397.155518
3,Random Forest,3660.14597
4,LightGBM,3729.955778
5,KNN Regressor,4062.83968
6,Decision Tree,5015.812547
7,Linear Regression,5801.435399


It obtained the same result as linearregression, alpha needs to be tuned for better performance, 

#### Metamodel: CatBoost

In [80]:
# use catboost as meta-model
stacking_model = StackingRegressor(
    estimators=stacking_estimators,
    final_estimator=cb.CatBoostRegressor(random_state=42, verbose=0),
    cv=5
)
# Create pipeline with preprocessing
stacking_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_model)
])
# Fit the pipeline
stacking_pipeline.fit(X_train, y_train)
# Predict and evaluate
y_pred_cat = stacking_pipeline.predict(X_test)
rmse_cat = root_mean_squared_error(y_test, y_pred_cat)
# Add result to results_df
reg_results_df.loc[len(results_df.index)] = ['Stacking with CatBoost', rmse_cat]
reg_results_df = results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)
# Show updated results
reg_results_df



Unnamed: 0,Model,RMSE
0,CatBoost,3296.493137
1,Voting Ensemble,3302.202159
2,XGBoost,3397.155518
3,Random Forest,3660.14597
4,LightGBM,3729.955778
5,KNN Regressor,4062.83968
6,Decision Tree,5015.812547
7,Linear Regression,5801.435399



###  Should You Tune Base Models Before Stacking or Voting?

Tuning base models is not strictly required, but it’s **strongly recommended** for building effective ensembles — especially when using **stacking**.



####  Benefits of Tuning Base Models
- **Improved accuracy**: Better-tuned models contribute more meaningful predictions.
- **Reduced noise**: Untuned or weak models can hurt ensemble performance, especially in **voting**, where all models contribute equally.
- **Stronger stacking**: The meta-model in stacking learns how to combine predictions — and works best when those predictions are strong and diverse.



####  Exceptions & Practical Advice
- For **quick prototypes**, you can start with default settings to test if ensembling helps.
- For **computational efficiency**, focus tuning on your top models (e.g., CatBoost, XGBoost).
- Use tools like **Optuna**, **RandomizedSearchCV**, or **BayesSearchCV** to efficiently tune key hyperparameters.



####  Summary

| Ensemble Method | Should You Tune Base Models? | Why? |
|------------------|------------------------------|------|
| **Voting**       | Optional but helpful          | Weak models can dilute strong ones. |
| **Stacking**     | Strongly recommended          | Meta-model depends on meaningful base predictions. |

Since we have tuned the top 4 models in the previous chapters, let's use their optimized hyperparamters for stacking next

In [81]:
# tuned catboost model from section 11.3.6
catboost_model = cb.CatBoostRegressor(
    iterations=1021,
    learning_rate=0.0536,
    depth=7,
    l2_leaf_reg=2.56,
    random_seed=42,
    min_data_in_leaf=28,
    bagging_temperature=0.034,
    random_strength=2.2,
    verbose=0
)
# tuned random forest model from section 7.4.4
rf_model = RandomForestRegressor(
    n_estimators=60,
    max_depth=28,
    max_features=0.58,
    max_samples=1.0,
    random_state=42
)

#tuned knn model from section 2.4.1
knn_model = KNeighborsRegressor(
    n_neighbors=8,
    weights='distance',
    metric='manhattan',
    p=2
)

# tuned xgboost model from section  10.5.10
xgb_model = xgb.XGBRegressor(
    n_estimators=193,
    learning_rate=0.15,
    max_depth=7,
    min_child_weight=1,
    gamma=0,
    subsample=0.78,
    reg_lambda=9.33,
    reg_alpha=5.0,
    colsample_bytree=0.635,
    random_state=42
)

In [None]:
optimized_estimators = [
    ('cat', cb.CatBoostRegressor(depth=6, learning_rate=0.1, verbose=0, random_state=42)),
    ('rf', RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)),
    ('knn', KNeighborsRegressor(n_neighbors=5)),
    ('xgb', xgb.XGBRegressor(max_depth=4, learning_rate=0.1, n_estimators=100, random_state=42))
]

In [85]:
# loop through the models
tuned_models = {
    'Optmized CatBoost': catboost_model,
    'Optmized Random Forest': rf_model,
    'Optmized KNN': knn_model,
    'Optmized XGBoost': xgb_model
}
# store the results
tuned_results = {}
# Loop through models
for name, model in tuned_models.items():
    # Create a pipeline with preprocessing and the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    # Fit the model
    pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = pipeline.predict(X_test)
    
    # Calculate RMSE
    rmse = root_mean_squared_error(y_test, y_pred)
    
    # Store the results
    tuned_results[name] = rmse
# Convert results to DataFrame for better visualization
tuned_results_df = pd.DataFrame.from_dict(tuned_results, orient='index', columns=['RMSE'])
tuned_results_df = tuned_results_df.sort_values(by='RMSE', ascending=True)
tuned_results_df.reset_index(inplace=True)
tuned_results_df.columns = ['Model', 'RMSE']
# Print the results
tuned_results_df

Unnamed: 0,Model,RMSE
0,Optmized XGBoost,3105.835205
1,Optmized CatBoost,3175.161327
2,Optmized Random Forest,3279.094977
3,Optmized KNN,3680.370319


In [None]:

# Meta-model
meta_model = LinearRegression()

# Build the stacking ensemble
stacking_optimized_top4 = StackingRegressor(
    estimators=optimized_estimators,
    final_estimator=meta_model,
    cv=5
)
# Create pipeline with preprocessing
stacking_top4_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('stacking', stacking_optimized_top4)
])

# Fit the pipeline
stacking_top4_pipeline.fit(X_train, y_train)

In [86]:
# Predict and evaluate
y_pred_top4_optimized = stacking_top4_pipeline.predict(X_test)
rmse_top4_optimized = root_mean_squared_error(y_test, y_pred_top4_optimized)

# Add result to results_df
reg_results_df.loc[len(reg_results_df.index)] = ['Stacking Optimized Top 4 Models', rmse_top4_optimized]

# Sort and reset index
reg_results_df = reg_results_df.sort_values(by='RMSE', ascending=True).reset_index(drop=True)

# Display results
reg_results_df


Unnamed: 0,Model,RMSE
0,Stacking Optimized Top 4 Models,3037.69844
1,Stacking Optimized Top 4 Models,3037.69844
2,CatBoost,3296.493137
3,Voting Ensemble,3302.202159
4,XGBoost,3397.155518
5,Random Forest,3660.14597
6,LightGBM,3729.955778
7,KNN Regressor,4062.83968
8,Decision Tree,5015.812547
9,Linear Regression,5801.435399


## Ensembling Classification Models with Stacking and Voting

###  How Voting Works for Classification

There are two main types of voting in classification ensembles:

####  Hard Voting
- Each base model **votes for a class label**.
- The final prediction is the **majority class** across models.
- Simple and interpretable.

####  Soft Voting
- Each base model predicts **class probabilities**.
- The probabilities are **averaged**, and the final prediction is the class with the **highest average probability**.
- Usually performs better than hard voting (especially with well-calibrated models).


We'll ensemble models for predicting accuracy of identifying people having a heart disease.

In [89]:
heart_data = pd.read_csv('./Datasets/Heart.csv')
print(heart_data.shape)
heart_data.head()

(303, 14)


Unnamed: 0,Age,Sex,ChestPain,RestBP,Chol,Fbs,RestECG,MaxHR,ExAng,Oldpeak,Slope,Ca,Thal,AHD
0,63,1,typical,145,233,1,2,150,0,2.3,3,0.0,fixed,No
1,67,1,asymptomatic,160,286,0,2,108,1,1.5,2,3.0,normal,Yes
2,67,1,asymptomatic,120,229,0,2,129,1,2.6,2,2.0,reversable,Yes
3,37,1,nonanginal,130,250,0,0,187,0,3.5,3,0.0,normal,No
4,41,0,nontypical,130,204,0,2,172,0,1.4,1,0.0,normal,No


In [95]:
# Define target and features
X = heart_data.drop(columns='AHD')
y = heart_data['AHD'].map({'Yes': 1, 'No': 0})  # Convert target to binary 0/1

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Identify column types
numeric_cols = ['Age', 'RestBP', 'Chol', 'MaxHR', 'Oldpeak']
categorical_cols = ['Sex', 'ChestPain', 'Fbs', 'RestECG', 'ExAng', 'Slope', 'Ca', 'Thal']

In [96]:
# Preprocessing pipelines
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, categorical_cols)
])


In [97]:
# Define all 8 classifiers
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42),
    'LightGBM': lgb.LGBMClassifier(random_state=42, verbose=-1),
    'CatBoost': cb.CatBoostClassifier(random_state=42, verbose=0)
}

# Store results
clf_results = []

for name, model in models.items():
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    
    # Train
    pipeline.fit(X_train, y_train)
        # Predict
    y_pred = pipeline.predict(X_test)
    y_proba = pipeline.predict_proba(X_test)[:, 1]
    
    # Compute metrics
    acc = accuracy_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_proba)
    f1 = f1_score(y_test, y_pred)
    
    clf_results.append({
        'Model': name,
        'Accuracy': acc,
        'ROC AUC': roc,
        'F1 Score': f1
    })

# Create DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Unnamed: 0,Model,Accuracy,ROC AUC,F1 Score
0,LightGBM,0.901639,0.965368,0.896552
1,Logistic Regression,0.885246,0.958874,0.881356
2,KNN,0.868852,0.94697,0.862069
3,CatBoost,0.885246,0.945887,0.885246
4,Gradient Boosting,0.868852,0.943723,0.866667
5,Random Forest,0.868852,0.937229,0.866667
6,XGBoost,0.836066,0.920996,0.83871
7,Decision Tree,0.655738,0.660173,0.655738


In [98]:
## Hard voting ensemble

# Define all classifiers (as tuples for VotingClassifier)
voting_estimators = [
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', KNeighborsClassifier()),
    ('rf', RandomForestClassifier(random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42)),
    ('xgb', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)),
    ('lgb', lgb.LGBMClassifier(random_state=42)),
    ('cat', cb.CatBoostClassifier(random_state=42, verbose=0))
]

# Create a hard voting classifier (voting='hard')
hard_voting_clf = VotingClassifier(
    estimators=voting_estimators,
    voting='hard'
)

# Pipeline with preprocessing
hard_voting_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('voting', hard_voting_clf)
])

# Fit model
hard_voting_pipeline.fit(X_train, y_train)

# Predict
y_pred_hard = hard_voting_pipeline.predict(X_test)
# Compute metrics
acc_hard = accuracy_score(y_test, y_pred_hard)
roc_hard = roc_auc_score(y_test, y_pred_hard)
f1_hard = f1_score(y_test, y_pred_hard)

# Store results
clf_results.append({
    'Model': 'Hard Voting Ensemble',
    'Accuracy': acc_hard,
    'ROC AUC': roc_hard,
    'F1 Score': f1_hard
})
# Create DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Unnamed: 0,Model,Accuracy,ROC AUC,F1 Score
0,LightGBM,0.901639,0.965368,0.896552
1,Logistic Regression,0.885246,0.958874,0.881356
2,KNN,0.868852,0.94697,0.862069
3,CatBoost,0.885246,0.945887,0.885246
4,Gradient Boosting,0.868852,0.943723,0.866667
5,Random Forest,0.868852,0.937229,0.866667
6,XGBoost,0.836066,0.920996,0.83871
7,Hard Voting Ensemble,0.901639,0.906385,0.9
8,Decision Tree,0.655738,0.660173,0.655738


### Create a soft voting

In [100]:
# define soft voting ensemble
soft_voting_clf = VotingClassifier(
    estimators=voting_estimators,
    voting='soft'
)
# Create pipeline with preprocessing
soft_voting_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('voting', soft_voting_clf)
])
# Fit model
soft_voting_pipeline.fit(X_train, y_train)
# Predict
y_pred_soft = soft_voting_pipeline.predict(X_test)
# Compute metrics
acc_soft = accuracy_score(y_test, y_pred_soft)
roc_soft = roc_auc_score(y_test, y_pred_soft)
f1_soft = f1_score(y_test, y_pred_soft)
# Store results
clf_results.append({
    'Model': 'Soft Voting Ensemble',
    'Accuracy': acc_soft,
    'ROC AUC': roc_soft,
    'F1 Score': f1_soft
})
# Create DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df



Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Unnamed: 0,Model,Accuracy,ROC AUC,F1 Score
0,LightGBM,0.901639,0.965368,0.896552
1,Logistic Regression,0.885246,0.958874,0.881356
2,KNN,0.868852,0.94697,0.862069
3,CatBoost,0.885246,0.945887,0.885246
4,Gradient Boosting,0.868852,0.943723,0.866667
5,Random Forest,0.868852,0.937229,0.866667
6,XGBoost,0.836066,0.920996,0.83871
7,Hard Voting Ensemble,0.901639,0.906385,0.9
8,Soft Voting Ensemble,0.901639,0.906385,0.9
9,Decision Tree,0.655738,0.660173,0.655738


### Stacking classifier
Conceptually, the idea is similar to that of Stacking regressor.

In [101]:
# create a meta-model
meta_model = LogisticRegression(max_iter=1000, random_state=42)
# Create the stacking classifier    
stacking_model = StackingClassifier(
    estimators=voting_estimators,
    final_estimator=meta_model,
    cv=5
)
# Create pipeline with preprocessing
stacking_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('stacking', stacking_model)
])
# Fit the pipeline
stacking_pipeline.fit(X_train, y_train)
# Predict
y_pred_stack = stacking_pipeline.predict(X_test)
# Compute metrics
acc_stack = accuracy_score(y_test, y_pred_stack)
roc_stack = roc_auc_score(y_test, y_pred_stack)
f1_stack = f1_score(y_test, y_pred_stack)
# Store results
clf_results.append({
    'Model': 'Stacking Ensemble',
    'Accuracy': acc_stack,
    'ROC AUC': roc_stack,
    'F1 Score': f1_stack
})
# Create DataFrame
clf_results_df = pd.DataFrame(clf_results).sort_values(by='ROC AUC', ascending=False).reset_index(drop=True)
clf_results_df



Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Unnamed: 0,Model,Accuracy,ROC AUC,F1 Score
0,LightGBM,0.901639,0.965368,0.896552
1,Logistic Regression,0.885246,0.958874,0.881356
2,KNN,0.868852,0.94697,0.862069
3,CatBoost,0.885246,0.945887,0.885246
4,Gradient Boosting,0.868852,0.943723,0.866667
5,Random Forest,0.868852,0.937229,0.866667
6,XGBoost,0.836066,0.920996,0.83871
7,Hard Voting Ensemble,0.901639,0.906385,0.9
8,Soft Voting Ensemble,0.901639,0.906385,0.9
9,Stacking Ensemble,0.901639,0.906385,0.9


In [102]:
#Coefficients of the logistic regression metamodel
# Access the trained meta-model inside the stacking pipeline
meta_model = stacking_pipeline.named_steps['stacking'].final_estimator_
# Get model names in the same order as the coefficients
model_names = [name for name, _ in voting_estimators]
# Extract coefficients
coefs = meta_model.coef_[0]  # For binary classification, we take the first row
# Create a DataFrame to display model weights
coef_df = pd.DataFrame({'Base Model': model_names, 'Meta-Model Coefficient': coefs})
# Sort by weight (optional)
coef_df = coef_df.sort_values(by='Meta-Model Coefficient', ascending=False).reset_index(drop=True)
# Show the weights
coef_df


Unnamed: 0,Base Model,Meta-Model Coefficient
0,lr,2.296773
1,rf,1.251281
2,cat,0.636938
3,knn,0.471488
4,dt,0.446707
5,xgb,0.338406
6,lgb,0.16809
7,gb,-0.317501


## Ensembling Models Based on Different Feature Sets

While tree-based models like CatBoost and XGBoost are often the most accurate, weaker models (e.g., KNN, bagging, linear models) can sometimes hurt ensemble performance despite adding diversity.

An alternative way to introduce diversity is to train **strong models on different subsets of predictors**. These subsets can be derived using techniques like **PolynomialFeature**, **tree-based feature importance**, or **stepwise feature selection**. Even if the models are of the same type, using distinct feature sets allows the ensemble to benefit from varied perspectives while maintaining strong individual performance.
