# Ensemble Learning for Bike Sharing Demand 
**Objective:** Implement and compare bagging, boosting, and stacking to forecast the hourly bike-sharing count (`cnt`) using the UCI Bike Sharing Dataset.



In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.preprocessing import StandardScaler 
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer 

from sklearn.tree import DecisionTreeRegressor 
from sklearn.linear_model import LinearRegression, Ridge 
from sklearn.ensemble import BaggingRegressor,GradientBoostingRegressor, StackingRegressor
from sklearn.neighbors import KNeighborsRegressor

from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.model_selection import TimeSeriesSplit 

import warnings 
warnings.filterwarnings("ignore") 

RND = 42 

def rmse(y_true, y_pred): 
    return np.sqrt(mean_squared_error(y_true,y_pred)) 


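def plot_actual_vs_pred(y_true, y_pred, title="Actual and Predicted", n_plot=500):
    # Convert to arrays so both series share the same positional x-axis
    # (a pandas Series would otherwise be plotted against its own index).
    y_true = np.asarray(y_true)[:n_plot]
    y_pred = np.asarray(y_pred)[:n_plot]
    plt.figure(figsize=(12, 4))
    plt.plot(y_true, label="Actual")
    plt.plot(y_pred, label="Predicted", alpha=0.8)
    plt.title(title)
    plt.legend()
    plt.tight_layout()
    plt.show()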




### Data loading & quick EDA

- Load `hour.csv`.
- Drop columns `instant`, `dteday`, `casual`, and `registered` as required.
- Convert categorical features (`season`, `weathersit`, `mnth`, `hr`, `weekday`) to dummies (one-hot encoding).
- We will keep the temporal order for train/test split (no random shuffling).
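
The one-hot step listed above could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's actual preprocessing: the preprocessing cell further down keeps these columns as integer codes. The frame `toy`, its values, and the choice of `drop_first=True` are assumptions for the example.

```python
import pandas as pd

# Toy frame mimicking a few integer-coded columns of hour.csv (values are made up).
toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weathersit": [1, 2, 1, 3],
    "temp": [0.24, 0.50, 0.72, 0.30],
})

categorical_cols = ["season", "weathersit"]

# get_dummies expands only the listed columns; other columns pass through unchanged.
# drop_first=True drops one level per feature to avoid perfectly collinear dummies.
toy_encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)
print(toy_encoded.columns.tolist())
```

Tree-based models can often work with the integer codes directly, while linear models and KNN are usually more sensitive to treating a code such as `hr` as an ordered number.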


In [2]:
# load dataset 
df = pd.read_csv("hour.csv") 
print("Initial shape: ",df.shape) 
df.head()

Initial shape:  (17379, 17)


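|   | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |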


In [3]:
print(df.dtypes)

instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object


In [4]:
# Cell: Preprocessing 
# Drop the row id and date string, plus casual/registered, which leak the target
# (cnt = casual + registered).
drop_cols = ['instant','dteday','casual','registered'] 
df = df.drop(columns=[c for c in drop_cols if c in df.columns])


print(df.dtypes)



season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
cnt             int64
dtype: object


As the dtypes of the remaining columns show, none of the features is stored as a categorical (object) column: the categorical features (`season`, `weathersit`, `mnth`, `hr`, `weekday`) are already integer-coded in the published dataset and are used in that form in the cells below.

### Train/Test split (time-aware)

Because the data is an hourly time series, we must preserve temporal order. We use the first 80% of rows for training and the last 20% for testing (an index-based split). This mimics a realistic forecasting scenario, in which the model only ever sees the past.

In [5]:
target = 'cnt' 
X = df.drop(columns=[target])
y = df[target].copy() 

n = len(df) 
split_idx = int(n*0.8) 

X_train,X_test = X.iloc[:split_idx].copy(), X.iloc[split_idx:].copy() 
y_train,y_test = y.iloc[:split_idx].copy(), y.iloc[split_idx:].copy() 

print(f'Train shape: {X_train.shape}, Test Shape: {X_test.shape}') 


#Standard scaling for numeric features
scaler = StandardScaler() 
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),columns = X_train.columns,index = X_train.index) 
X_test_scaled = pd.DataFrame(scaler.transform(X_test),columns = X_test.columns, index = X_test.index) 


Train shape: (13903, 12), Test Shape: (3476, 12)


## Part A: Baseline Models

Train two baseline models:
- Decision Tree Regressor (`max_depth = 6`)
- Linear Regression

Evaluate with RMSE on the test set; the better of the two serves as the baseline for comparison.

In [6]:
dt = DecisionTreeRegressor(max_depth=6,random_state=RND) 
dt.fit(X_train_scaled,y_train) 
y_pred_dt = dt.predict(X_test_scaled) 
rmse_dt = rmse(y_test,y_pred_dt) 
r2_dt = r2_score(y_test,y_pred_dt)


#Linear Regression 
lr = LinearRegression() 
lr.fit(X_train_scaled,y_train) 
y_pred_lr = lr.predict(X_test_scaled) 
rmse_lr = rmse(y_test,y_pred_lr) 
r2_lr= r2_score(y_test,y_pred_lr) 

print(f"Decision Tree(max_depth = 6) -> RMSE: {rmse_dt:.3f},R2: {r2_dt:.3f}")
print(f"Linear Regression -> RMSE: {rmse_lr:.3f}, R2: {r2_lr:.3f}")

baseline_name = "Decision Tree" if rmse_dt < rmse_lr else "Linear Regression" 
baseline_rmse = min(rmse_dt,rmse_lr) 
print(f"Baseline chosen: {baseline_name} with RMSE: {baseline_rmse:.3f}")

Decision Tree(max_depth = 6) -> RMSE: 135.115,R2: 0.624
Linear Regression -> RMSE: 183.278, R2: 0.309
Baseline chosen: Decision Tree with RMSE: 135.115


**Interpretation / diagnostics**

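- RMSE and R² are used to compare fit quality: the decision tree (RMSE 135.115, R² 0.624) clearly beats linear regression (RMSE 183.278, R² 0.309).
- The `plot_actual_vs_pred` helper defined above can be used to plot actual vs. predicted counts for the better baseline and inspect temporal error patterns.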
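
For reference, both metrics can be computed directly from their definitions. The sketch below uses made-up toy values and hypothetical helper names (`rmse_manual`, `r2_manual`); it is only meant to show what the scikit-learn functions used in this notebook measure.

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, i.e. improvement over always predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (not from the bike-sharing data).
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 33.0, 37.0])

print(rmse_manual(y_true_toy, y_pred_toy))  # sqrt(6.5)
print(r2_manual(y_true_toy, y_pred_toy))    # 1 - 26/500
```

RMSE is in the units of `cnt` (bikes per hour), whereas R² is unitless, which is why the two are reported side by side.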


## Part B: Bagging

**Hypothesis**: Bagging reduces variance and improves the R² score by averaging many base learners (here: decision trees), each trained on a bootstrap resample of the training set.

Implementation details:
- Base estimator: `DecisionTreeRegressor(max_depth = 6)`
- `n_estimators = 50`
- Evaluate RMSE and R² on the test set and compare with the baseline decision tree.
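
To make the mechanism concrete, here is a minimal hand-rolled sketch of bagging on synthetic data (a noisy sine curve, which is an assumption for illustration and unrelated to the bike data). It draws bootstrap samples, fits one tree per sample, and averages the predictions, which is roughly what `BaggingRegressor` does internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: noisy sine curve.
X_toy = rng.uniform(0, 6, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)

n_estimators = 50
trees = []
for i in range(n_estimators):
    # Bootstrap sample: draw n rows with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor(max_depth=6, random_state=i)
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Evaluate against the noise-free function on a grid.
X_grid = np.linspace(0, 6, 200).reshape(-1, 1)
truth = np.sin(X_grid[:, 0])

all_preds = np.stack([t.predict(X_grid) for t in trees])  # shape (n_estimators, 200)
bagged_pred = all_preds.mean(axis=0)                      # average over trees

mse_single_avg = np.mean((all_preds - truth) ** 2, axis=1).mean()
mse_bagged = np.mean((bagged_pred - truth) ** 2)
print(f"average single-tree MSE: {mse_single_avg:.4f}, bagged MSE: {mse_bagged:.4f}")
```

The error of the averaged prediction can never exceed the average error of the individual trees; how large the gain is depends on how much the trees disagree with one another.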



In [7]:
base_dt = DecisionTreeRegressor(max_depth=6,random_state=RND) 
bag = BaggingRegressor(estimator=base_dt,n_estimators=50,random_state=RND,n_jobs=-1) 
bag.fit(X_train_scaled,y_train) 
y_pred_bag = bag.predict(X_test_scaled) 
rmse_bag = rmse(y_test,y_pred_bag) 
r2_bag = r2_score(y_test,y_pred_bag)


print(f"Bagging Regressor -> RMSE: {rmse_bag:.3f}, R2: {r2_bag:.3f}")

Bagging Regressor -> RMSE: 130.485, R2: 0.650


**Discussion of observed result**:

- Bagging lowers RMSE relative to the single tree (130.485 vs. 135.115) and raises R² (0.650 vs. 0.624), which is consistent with a modest variance reduction.
- This supports the hypothesis stated above, although the gain is small because depth-6 trees already have limited variance.

## Gradient Boosting Regressor (bias reduction)

**Hypothesis:** Boosting reduces bias by sequentially fitting residuals; can often beat single models when the base estimator is weak.

Implementation details:
- Use `GradientBoostingRegressor` from scikit-learn with sensible defaults.
- We'll set a moderate number of estimators (e.g., 200) and a small learning rate (e.g., 0.05).
- Evaluate RMSE on the test set and compare with bagging and baseline.
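
The "sequentially fitting residuals" idea can be sketched by hand for squared-error loss. The example below is a simplified illustration on synthetic data (an assumption, not the bike data) and omits features of `GradientBoostingRegressor` such as subsampling and other loss functions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem: noisy sine curve.
X_toy = rng.uniform(0, 6, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)

learning_rate = 0.05
n_estimators = 200

# Start from a constant prediction (the training mean).
pred = np.full_like(y_toy, y_toy.mean())
train_mse = [np.mean((y_toy - pred) ** 2)]

for _ in range(n_estimators):
    residuals = y_toy - pred                       # what the current model still gets wrong
    weak = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_toy, residuals)
    pred = pred + learning_rate * weak.predict(X_toy)  # small step toward the residuals
    train_mse.append(np.mean((y_toy - pred) ** 2))

print(f"training MSE: start {train_mse[0]:.4f}, end {train_mse[-1]:.4f}")
```

Each shallow tree corrects part of the remaining error, so training error falls steadily; the small learning rate trades more estimators for smoother, less overfit updates.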


In [8]:
gbr = GradientBoostingRegressor(n_estimators=200,learning_rate=0.05,max_depth=3,random_state=RND) 
gbr.fit(X_train_scaled,y_train) 
y_pred_gbr = gbr.predict(X_test_scaled) 

rmse_gbr = rmse(y_test,y_pred_gbr) 
r2_gbr = r2_score(y_test,y_pred_gbr) 

print(f"Gradient Boosting Regressor -> RMSE: {rmse_gbr:.3f}, R2: {r2_gbr:.3f}")


Gradient Boosting Regressor -> RMSE: 107.405, R2: 0.763


**Discussion** 

- Gradient boosting achieves the lowest RMSE so far (107.405), ahead of bagging (130.485) and the baseline decision tree (135.115).
- This suggests that the bias reduction targeted by gradient boosting is effective on this dataset.

## Part C — Stacking

**Principle:**
Stacking (stacked generalization) trains multiple diverse base learners (level-0) and trains a meta-learner (level-1) on the predictions of those base learners. The meta-learner learns how to weight or combine base learners' predictions to improve generalization.

**Architecture for this assignment**
- Level-0 (base learners):
  1. KNeighborsRegressor
  2. BaggingRegressor (the same as used above)
  3. GradientBoostingRegressor (the same as used above)
- Level-1 (meta-learner): Ridge Regression

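We implement this with `sklearn.ensemble.StackingRegressor`. Its `cv` argument generates the out-of-fold predictions that train the meta-learner, so every training row must land in exactly one validation fold. The code below therefore uses an unshuffled 5-fold `KFold`, which keeps each fold contiguous in time; unlike `TimeSeriesSplit`, it is not a strict forward-chaining scheme, because early folds are predicted by models trained on later data.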
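
The two-level idea can also be written out by hand. The sketch below uses synthetic data and only two base learners (both assumptions for brevity); it builds out-of-fold predictions with `cross_val_predict` and fits a Ridge meta-learner on them, which approximates what `StackingRegressor` does during `fit`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression problem with two features.
X_toy = rng.uniform(0, 6, size=(400, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=400)

base_learners = {
    "knn": KNeighborsRegressor(n_neighbors=8),
    "tree": DecisionTreeRegressor(max_depth=6, random_state=0),
}
cv = KFold(n_splits=5, shuffle=False)

# Level-0: each row is predicted by a model that never saw it during training.
oof = np.column_stack(
    [cross_val_predict(model, X_toy, y_toy, cv=cv) for model in base_learners.values()]
)

# Level-1: the meta-learner sees only the base learners' predictions.
meta = Ridge(alpha=1.0).fit(oof, y_toy)
print(dict(zip(base_learners, meta.coef_.round(3))))
```

Training the meta-learner on out-of-fold rather than in-sample predictions keeps it from simply rewarding whichever base learner overfits the training data most.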


In [9]:
from sklearn.model_selection import KFold

estimators = [
    ('knn',KNeighborsRegressor(n_neighbors=8,n_jobs=-1)),
    ("bag",BaggingRegressor(estimator=DecisionTreeRegressor(max_depth=6),n_estimators=50,random_state=RND,n_jobs=-1)), 
    ('gbr',GradientBoostingRegressor(n_estimators=200,learning_rate=0.05,max_depth=3,random_state=RND))
]

meta_learner = Ridge(alpha=1.0)

cv = KFold(n_splits=5,shuffle=False) 

stack = StackingRegressor(estimators=estimators,final_estimator=meta_learner,cv=cv,n_jobs=-1,passthrough=False) 
stack.fit(X_train_scaled,y_train) 
y_pred_stack = stack.predict(X_test_scaled) 
rmse_stack = rmse(y_test,y_pred_stack) 
r2_stack= r2_score(y_test,y_pred_stack) 

print(f"Stacking Regressor -> RMSE: {rmse_stack:.3f}, R2: {r2_stack:.3f}") 


Stacking Regressor -> RMSE: 104.675, R2: 0.775


In [21]:
# Summarize results in a table
results = pd.DataFrame({
    'model': ['Decision Tree (max_depth=6)', 'Linear Regression', 'Bagging Regressor', 'Gradient Boosting Regressor', 'Stacking Regressor'],
    'RMSE': [rmse_dt, rmse_lr, rmse_bag, rmse_gbr, rmse_stack],
    'R2': [r2_dt, r2_lr, r2_bag, r2_gbr, r2_stack]
}).sort_values('RMSE').reset_index(drop=True)

results


|   | model | RMSE | R2 |
|---|---|---|---|
| 0 | Stacking Regressor | 104.67467 | 0.774615 |
| 1 | Gradient Boosting Regressor | 107.405467 | 0.762701 |
| 2 | Bagging Regressor | 130.484882 | 0.649762 |
| 3 | Decision Tree (max_depth=6) | 135.115449 | 0.624463 |
| 4 | Linear Regression | 183.277847 | 0.309025 |


## Final Analysis & Conclusion

**1. Comparative summary**

- The table above lists RMSE and R² for all models.
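- The best model, with the lowest RMSE (104.67) and highest R² (0.775), is the Stacking Regressor.
- The largest single improvement occurs between the Bagging Regressor (RMSE 130.48) and the Gradient Boosting Regressor (RMSE 107.41); stacking improves on boosting only marginally.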

**2. Why stacking (or the best ensemble) outperformed the single model**

- **Bagging** reduces variance by averaging many high-variance base learners (decision trees). If a single tree overfits particular temporal patterns, bagging smooths those idiosyncratic fits.
- **Boosting** reduces bias by sequentially fitting residuals; it can capture complex nonlinear relationships that linear models miss.
- **Stacking** benefits from model diversity: KNN captures local similarity structure, bagging yields low-variance tree ensembles, and gradient boosting captures complex nonlinear interactions. The Ridge meta-learner learns a regularized linear combination of these predictions, which often generalizes better than any individual model.

**3. Practical considerations & further improvements**

- Hyperparameter tuning with `GridSearchCV` or `RandomizedSearchCV`, ideally using a time-series-aware splitter such as `TimeSeriesSplit`, may further reduce RMSE for each model.
- `XGBoost` or `LightGBM` could be tried as alternative gradient boosting implementations.
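
As a rough sketch of the tuning suggestion, the example below runs a small grid search over a gradient boosting model with forward-chaining folds. The synthetic data and the tiny parameter grid are assumptions chosen to keep the example fast, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic regression problem; rows are treated as if ordered in time.
X_toy = rng.uniform(0, 6, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.2, size=300)

param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(max_depth=3, random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),          # each fold validates on rows after its training rows
    scoring="neg_root_mean_squared_error",   # scikit-learn maximizes scores, hence the negation
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.4f}")
```

On the real data, the search would be fitted on the training portion only, so that the held-out last 20% remains an untouched test set.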

**4. Reproducibility**
- Random seeds set to `42`.
- Train/test split preserved temporal order.
