## A8: Ensemble Learning for Complex Regression Modeling on Bike Share Data

### Part A: Data Preprocessing and Baseline

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("hour.csv")
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [4]:
df.shape

(17379, 17)

In [5]:
Target = df['cnt']
Features = df.drop(columns=['cnt', 'dteday', 'instant', 'casual', 'registered'])

In [6]:
df.shape

(17379, 17)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


In [None]:

from sklearn.model_selection import train_test_split

# Drop identifiers, the raw date, and the two columns that sum to the target (leakage).
df = df.drop(columns=["instant", "dteday", "casual", "registered"])
print("After dropping columns:", df.shape)

# One-hot encode the categorical columns; drop_first avoids a redundant dummy per feature.
categorical_features = ["season", "yr", "weathersit", "mnth", "hr", "weekday"]
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)
print("After one-hot encoding:", df.shape)

# Separate features and target.
X = df.drop(columns=["cnt"])
y = df["cnt"]

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


After dropping columns: (17379, 13)
After one-hot encoding: (17379, 54)
Train set shape: (13903, 53)
Test set shape: (3476, 53)


In [None]:

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

dt_model = DecisionTreeRegressor(max_depth=6, random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print(f"Decision Tree RMSE: {rmse_dt:.3f}")


lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print(f"Linear Regression RMSE: {rmse_lr:.3f}")

# Compare the two models and choose the baseline
if rmse_dt < rmse_lr:
    print(f"Baseline Model: Decision Tree (RMSE = {rmse_dt:.3f})")
else:
    print(f"Baseline Model: Linear Regression (RMSE = {rmse_lr:.3f})")


Decision Tree RMSE: 118.456
Linear Regression RMSE: 100.446
Baseline Model: Linear Regression (RMSE = 100.446)


### Part B: Ensemble Techniques for Bias and Variance Reduction

#### 1. Bagging (Variance Reduction)

In [None]:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error


single_dt = DecisionTreeRegressor(max_depth=6, random_state=42)
single_dt.fit(X_train, y_train)
y_pred_dt = single_dt.predict(X_test)
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print(f"Single Decision Tree RMSE: {rmse_dt:.4f}")


bag_model = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=6),
    n_estimators=50,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)
bag_model.fit(X_train, y_train)


y_pred_bag = bag_model.predict(X_test)
rmse_bag = np.sqrt(mean_squared_error(y_test, y_pred_bag))
print(f"Bagging Regressor (50 trees) RMSE: {rmse_bag:.4f}")


improvement = (rmse_dt - rmse_bag) / rmse_dt * 100.0
print(f"RMSE improvement vs single tree: {improvement:.2f}%")



Single Decision Tree RMSE: 118.4555
Bagging Regressor (50 trees) RMSE: 112.3281
RMSE improvement vs single tree: 5.17%


The single Decision Tree had an RMSE of 118.46, while the Bagging Regressor (50 trees) achieved 112.33, a reduction of about 5.17%.

This reduction is consistent with bagging lowering variance relative to the single tree. Decision Trees are high-variance models: small changes in the training data can produce very different tree structures and predictions. Bagging counters this by training each tree on a bootstrap sample (rows drawn with replacement from the training set) and averaging the trees' outputs.

Averaging across 50 trees stabilizes the predictions and makes the ensemble less sensitive to fluctuations in the training data. The improvement is modest because the base trees are already regularized with max_depth=6, which limits overfitting but leaves their bias untouched. That bias is a likely reason why the bagged ensemble (112.33) still trails the Linear Regression baseline from Part A (100.45).
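The variance-reduction argument above can be sketched without scikit-learn. The example below is only an illustration under stated assumptions: it uses hypothetical synthetic 1-D data rather than the bike share data, and it swaps the decision tree for a 1-nearest-neighbour predictor, another high-variance learner that is easy to write by hand. The names `one_nn_predict`, `rmse`, and the data settings are made up for this sketch.

```python
import math
import random


def one_nn_predict(train, x):
    """Return the target of the training point closest to x (a high-variance learner)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def rmse(preds, targets):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


rng = random.Random(42)

# Hypothetical toy data: y = sin(x) + Gaussian noise (not the bike share data).
train = []
for _ in range(200):
    x = rng.uniform(0.0, 6.0)
    train.append((x, math.sin(x) + rng.gauss(0.0, 1.0)))

# Evaluate against the noise-free function on an evenly spaced grid.
test_x = [0.03 + 0.06 * i for i in range(100)]
test_y = [math.sin(x) for x in test_x]

# Single learner fitted on the full training set.
single_preds = [one_nn_predict(train, x) for x in test_x]

# Bagging: fit one learner per bootstrap sample and average the predictions.
n_estimators = 50
bagged_preds = [0.0] * len(test_x)
for _ in range(n_estimators):
    bootstrap = rng.choices(train, k=len(train))  # draw rows with replacement
    for i, x in enumerate(test_x):
        bagged_preds[i] += one_nn_predict(bootstrap, x) / n_estimators

rmse_single = rmse(single_preds, test_y)
rmse_bagged = rmse(bagged_preds, test_y)
print(f"Single 1-NN RMSE: {rmse_single:.3f}")
print(f"Bagged 1-NN RMSE: {rmse_bagged:.3f}")
```

Because each bootstrap sample omits roughly a third of the rows, the individual learners disagree, and averaging them smooths out part of the noise that a single learner would reproduce.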

#### 2. Boosting (Bias Reduction)

In [None]:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error


gb_model = GradientBoostingRegressor(
    n_estimators=100,     
    learning_rate=0.1,    
    max_depth=3,          
    random_state=42
)
gb_model.fit(X_train, y_train)


y_pred_gb = gb_model.predict(X_test)

rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print(f"Gradient Boosting Regressor RMSE: {rmse_gb:.4f}")

print(f"Single Decision Tree RMSE: {rmse_dt:.4f}")
print(f"Bagging Regressor RMSE: {rmse_bag:.4f}")


if rmse_gb < min(rmse_dt, rmse_bag):
    print("Boosting achieved the best result — supports the bias reduction hypothesis.")
else:
    print("Boosting did not outperform both models — check tuning or data bias.")


Gradient Boosting Regressor RMSE: 78.9652
Single Decision Tree RMSE: 118.4555
Bagging Regressor RMSE: 112.3281
Boosting achieved the best result — supports the bias reduction hypothesis.


Unlike bagging, which trains independent trees on bootstrap samples and averages them to reduce variance, boosting trains trees sequentially: each new tree is fit to the errors (residuals) of the ensemble built so far. This stepwise correction lets the ensemble capture patterns that a single shallow tree or a bagged set of shallow trees misses, which reduces systematic bias.

Gradient Boosting lowers RMSE by about 33% relative to the single tree (118.46 to 78.97), about 30% relative to bagging (112.33 to 78.97), and about 21% relative to the Linear Regression baseline (100.45 to 78.97). A drop of this size, obtained from shallow max_depth=3 trees, is consistent with the hypothesis that boosting mainly reduces bias.
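The sequential error-correction idea can be sketched as a hand-written boosting loop for squared error. This is a simplified illustration, not scikit-learn's implementation: the data are hypothetical synthetic 1-D points, the weak learner is a depth-1 stump, and `fit_stump`, `stump_predict`, and `rmse` are names invented for this sketch.

```python
import math
import random


def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals in the least-squares sense."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n = len(xs)
    total = sum(residuals)
    best = None
    left_sum = 0.0
    for rank in range(1, n):
        left_sum += residuals[order[rank - 1]]
        lo, hi = xs[order[rank - 1]], xs[order[rank]]
        if lo == hi:
            continue  # no threshold can separate identical x values
        right_sum = total - left_sum
        # Minimising squared error is equivalent to maximising this gain.
        gain = left_sum ** 2 / rank + right_sum ** 2 / (n - rank)
        if best is None or gain > best[0]:
            best = (gain, (lo + hi) / 2.0, left_sum / rank, right_sum / (n - rank))
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def rmse(preds, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]  # hypothetical toy target

learning_rate = 0.1
n_estimators = 100

# Stage 0: start from a constant prediction (the mean of the targets).
preds = [sum(ys) / len(ys)] * len(ys)
history = [rmse(preds, ys)]

for _ in range(n_estimators):
    # For squared error, the residuals are the negative gradient of the loss.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Add a shrunken copy of the new stump to the running ensemble.
    preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    history.append(rmse(preds, ys))

print(f"Training RMSE after stage 0:   {history[0]:.3f}")
print(f"Training RMSE after stage {n_estimators}: {history[-1]:.3f}")
```

The printed values are training errors, so they show only the bias side of the story: each stage removes part of the remaining systematic error, which is why the learning rate and the number of stages have to be tuned against held-out data.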

### Part C: Stacking for Optimal Performance 

Stacking is an ensemble learning technique that combines the predictions of several diverse models (called base learners) through a second model, the meta-learner.

Level-0 (Base Learners):
Multiple models are trained on the same training data, but each captures different aspects of the pattern (KNN is instance-based, tree-based models capture nonlinear splits, etc.).

Level-1 (Meta-Learner):
The base models' predictions on data they were not fitted on are used as input features to train a meta-model (here, Ridge Regression). scikit-learn's StackingRegressor produces these predictions by cross-validation (5 folds by default), so the meta-learner is not trained on predictions the base models made for their own training rows.

The meta-learner learns how to weight and combine the outputs of the base learners, giving more weight to the models whose predictions are most useful. Because Ridge is linear, these weights are global; they do not vary across regions of the data.

Bagging mainly reduces variance and boosting mainly reduces bias. Stacking combines different learners so that the ensemble can improve on both sides of the bias–variance trade-off.
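The two levels described above can be sketched by hand. The example below is a toy illustration under several assumptions: the data are hypothetical synthetic 1-D points, the base learners are a 5-nearest-neighbour average and a least-squares line (stand-ins for the KNN, Bagging, and Gradient Boosting models), and the helpers `fit_knn`, `fit_line`, `out_of_fold`, and `fit_ridge_2d` are invented for this sketch.

```python
import math
import random


def fit_knn(train, k=5):
    """Base learner 1: average the targets of the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def fit_line(train):
    """Base learner 2: ordinary least-squares line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    return lambda x: my + slope * (x - mx)


def out_of_fold(fit, data, n_folds=5):
    """Predict every point with a model that did not see it during fitting."""
    preds = [0.0] * len(data)
    for fold in range(n_folds):
        held_out = set(range(fold, len(data), n_folds))
        model = fit([p for i, p in enumerate(data) if i not in held_out])
        for i in held_out:
            preds[i] = model(data[i][0])
    return preds


def fit_ridge_2d(f1, f2, y, alpha=1.0):
    """Meta-learner: ridge regression on two features with an unpenalised intercept."""
    n = len(y)
    m1, m2, my = sum(f1) / n, sum(f2) / n, sum(y) / n
    a = sum((u - m1) ** 2 for u in f1) + alpha
    b = sum((u - m1) * (v - m2) for u, v in zip(f1, f2))
    d = sum((v - m2) ** 2 for v in f2) + alpha
    c1 = sum((u - m1) * (t - my) for u, t in zip(f1, y))
    c2 = sum((v - m2) * (t - my) for v, t in zip(f2, y))
    det = a * d - b * b
    w1 = (c1 * d - b * c2) / det
    w2 = (a * c2 - b * c1) / det
    return my - w1 * m1 - w2 * m2, w1, w2


def rmse(preds, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))


rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 6.0)
    data.append((x, 2.0 * math.sin(x) + 0.5 * x + rng.gauss(0.0, 0.3)))
train, test = data[:240], data[240:]

# Level 0: out-of-fold predictions become the meta-learner's input features.
base_fitters = {"knn": fit_knn, "line": fit_line}
oof = {name: out_of_fold(fit, train) for name, fit in base_fitters.items()}

# Level 1: learn how to weight the base learners.
train_y = [y for _, y in train]
intercept, w_knn, w_line = fit_ridge_2d(oof["knn"], oof["line"], train_y)

# Refit the base learners on all training rows for use at prediction time.
models = {name: fit(train) for name, fit in base_fitters.items()}
test_y = [y for _, y in test]
knn_preds = [models["knn"](x) for x, _ in test]
line_preds = [models["line"](x) for x, _ in test]
stack_preds = [intercept + w_knn * k + w_line * l for k, l in zip(knn_preds, line_preds)]

rmse_knn, rmse_line, rmse_stack = (rmse(p, test_y) for p in (knn_preds, line_preds, stack_preds))
print(f"weights: knn={w_knn:.2f}, line={w_line:.2f}")
print(f"RMSE  knn={rmse_knn:.3f}  line={rmse_line:.3f}  stack={rmse_stack:.3f}")
```

Using out-of-fold predictions matters most for flexible learners such as KNN: predictions on a model's own training rows look better than they really are, and a meta-learner trained on them would overweight that model.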


In [None]:

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge


# Base learners (Level-0)
base_learners = [
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("bagging", BaggingRegressor(
        estimator=DecisionTreeRegressor(max_depth=6),
        n_estimators=50,
        random_state=42
    )),
    ("gboost", GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    ))
]

# Meta-Learner (Level-1)
meta_learner = Ridge(alpha=1.0)

# Stacking regressor: base-learner predictions feed the Ridge meta-learner
stack_model = StackingRegressor(
    estimators=base_learners,
    final_estimator=meta_learner,
    n_jobs=-1
)

# Train the stacking regressor
stack_model.fit(X_train, y_train)


y_pred_stack = stack_model.predict(X_test)
rmse_stack = np.sqrt(mean_squared_error(y_test, y_pred_stack))
print(f"Stacking Regressor RMSE: {rmse_stack:.4f}")


Stacking Regressor RMSE: 67.0248


### Part D: Final Analysis

In [15]:

# RMSE values copied from the outputs of Parts A to C.
rmse_results = {
    "Model": [
        "Baseline (Linear Regression)",
        "Single Decision Tree",
        "Bagging Regressor",
        "Gradient Boosting Regressor",
        "Stacking Regressor"
    ],
    "RMSE": [
        100.446,
        118.4555,
        112.3281,
        78.9652,
        67.0248
    ]
}

rmse_df = pd.DataFrame(rmse_results)
rmse_df = rmse_df.sort_values(by="RMSE").reset_index(drop=True)

print(" RMSE Comparison Table:\n")
print(rmse_df)

# best model
best_model = rmse_df.iloc[0]
print("\n Best Performing Model:")
print(f"{best_model['Model']} (RMSE = {best_model['RMSE']:.4f})")


 RMSE Comparison Table:

                          Model      RMSE
0            Stacking Regressor   67.0248
1   Gradient Boosting Regressor   78.9652
2  Baseline (Linear Regression)  100.4460
3             Bagging Regressor  112.3281
4          Single Decision Tree  118.4555

 Best Performing Model:
Stacking Regressor (RMSE = 67.0248)


Stacking achieved the lowest test RMSE (67.02), about 15% below Gradient Boosting (78.97) and about 33% below the Linear Regression baseline from Part A (100.45). This result shows how stacking combines the strengths of diverse learners. The single Decision Tree suffered from high variance and limited generalization, and Bagging mainly reduced that variance through averaging. Stacking goes a step further: it integrates KNN, Bagging, and Gradient Boosting, each of which captures different aspects of the data's structure.

The meta-learner (Ridge Regression) then learns how to weight their predictions, exploiting the complementary strengths of these models.
The resulting ensemble addresses both bias and variance, which is consistent with its lower error on the held-out test set.

Stacking Regressor outperformed the single-model baseline because:

1. It reduced bias by including a flexible learner, Gradient Boosting.

2. It reduced variance by combining several models through a regularized linear meta-learner instead of relying on any single one.

3. It exploited model diversity, so the weaknesses of one model can be offset by the strengths of others.

Hence, among the models tested, Stacking offers the best bias–variance trade-off and the most accurate predictions on the test set. Stability was not measured directly, since every result comes from a single 80/20 split.
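As a quick arithmetic check on the comparisons in this analysis, the relative RMSE reductions can be recomputed from the printed results. The RMSE values below are copied from the cell outputs in Parts A to C; the helper `pct_reduction` and the dictionary name `reported_rmse` are introduced here only for illustration.

```python
# RMSE values copied from the notebook outputs (not recomputed here).
reported_rmse = {
    "Linear Regression": 100.446,
    "Single Decision Tree": 118.4555,
    "Bagging": 112.3281,
    "Gradient Boosting": 78.9652,
    "Stacking": 67.0248,
}


def pct_reduction(reference, candidate):
    """Relative RMSE reduction of `candidate` versus `reference`, in percent."""
    return (reference - candidate) / reference * 100.0


# How much lower is the stacking RMSE than each of the other models?
for name, value in reported_rmse.items():
    if name != "Stacking":
        reduction = pct_reduction(value, reported_rmse["Stacking"])
        print(f"Stacking vs {name}: {reduction:.1f}% lower RMSE")

# A negative value means the candidate is worse than the reference.
bagging_vs_baseline = pct_reduction(reported_rmse["Linear Regression"], reported_rmse["Bagging"])
print(f"Bagging vs Linear Regression: {bagging_vs_baseline:.1f}%")
```

The last line makes explicit that bagging improves on the single tree but not on the Linear Regression baseline, so the choice of reference model changes how each improvement should be read.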