# DA5401 A8: Ensemble Learning for Complex Regression Modeling
**NAME:** Manish Nayak  
**ROLL NO:** CE22B069


### Part A: Data Preprocessing and Baseline

#### 1. Data Loading and Feature Engineering


In [2]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


In [40]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bike_sharing = fetch_ucirepo(id=275) 
  
# data (as pandas dataframes) 
X = bike_sharing.data.features 
y = bike_sharing.data.targets 
  
# metadata 
print(bike_sharing.metadata) 
  
# variable information 
print(bike_sharing.variables) 


{'uci_id': 275, 'name': 'Bike Sharing', 'repository_url': 'https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset', 'data_url': 'https://archive.ics.uci.edu/static/public/275/data.csv', 'abstract': 'This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.', 'area': 'Social Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 17389, 'num_features': 13, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['cnt'], 'index_col': ['instant'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2013, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C5W894', 'creators': ['Hadi Fanaee-T'], 'intro_paper': {'ID': 422, 'type': 'NATIVE', 'title': 'Event labeling combining ensemble detectors and background knowledge', 'authors': 'Hadi Fanaee-T, João Gama', 'venue': 'Progress

In [41]:
X.columns

Index(['dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed'],
      dtype='object')

In [42]:
X.drop(['dteday'],axis=1 , inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X.drop(['dteday'],axis=1 , inplace=True)
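
The warning above most likely appears because `drop(..., inplace=True)` is applied to a frame that pandas treats as a slice of another frame. A minimal sketch of one common way to avoid it, using a hypothetical toy frame (`raw` and `features` are illustrative names, not part of this notebook): take an explicit copy and reassign the result of `drop` instead of mutating in place.

```python
import pandas as pd

# Hypothetical toy frame standing in for the fetched data
raw = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01"],
    "hr": [0, 1],
    "temp": [0.24, 0.22],
    "cnt": [16, 40],
})

# Select the feature columns and take an explicit copy, so later edits
# cannot be mistaken for writes into `raw`
features = raw[["dteday", "hr", "temp"]].copy()

# Reassign instead of using inplace=True; `drop` returns a new DataFrame
features = features.drop(columns=["dteday"])

print(list(features.columns))
```

If this pattern were used in the notebook, later cells that read `bike_sharing.data.features` directly would still see `dteday`, so they would need to use the reassigned variable instead.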


In [43]:
X.columns

Index(['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed'],
      dtype='object')

#### Apart from `dteday`, the irrelevant columns were already absent from the fetched features

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Combine the features and target fetched above into a single DataFrame
# (no file is read here, so no FileNotFoundError handling is needed)
df = pd.concat([bike_sharing.data.features, bike_sharing.data.targets], axis=1)

# One-Hot Encode categorical features
categorical_features = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)

# Display the first few rows of the preprocessed data
print("Preprocessed Data Head:")
print(df.head())

Preprocessed Data Head:
   yr  holiday  workingday  temp   atemp   hum  windspeed  cnt  season_2  \
0   0        0           0  0.24  0.2879  0.81        0.0   16     False   
1   0        0           0  0.22  0.2727  0.80        0.0   40     False   
2   0        0           0  0.22  0.2727  0.80        0.0   32     False   
3   0        0           0  0.24  0.2879  0.75        0.0   13     False   
4   0        0           0  0.24  0.2879  0.75        0.0    1     False   

   season_3  ...  hr_20  hr_21  hr_22  hr_23  weekday_1  weekday_2  weekday_3  \
0     False  ...  False  False  False  False      False      False      False   
1     False  ...  False  False  False  False      False      False      False   
2     False  ...  False  False  False  False      False      False      False   
3     False  ...  False  False  False  False      False      False      False   
4     False  ...  False  False  False  False      False      False      False   

   weekday_4  weekday_5  weekday

In [21]:
df

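| | yr | holiday | workingday | temp | atemp | hum | windspeed | cnt | season_2 | season_3 | ... | hr_20 | hr_21 | hr_22 | hr_23 | weekday_1 | weekday_2 | weekday_3 | weekday_4 | weekday_5 | weekday_6 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0 | 0 | 0 | 0.24 | 0.2879 | 0.81 | 0.0000 | 16 | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | 0 | 0 | 0 | 0.22 | 0.2727 | 0.80 | 0.0000 | 40 | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 2 | 0 | 0 | 0 | 0.22 | 0.2727 | 0.80 | 0.0000 | 32 | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 3 | 0 | 0 | 0 | 0.24 | 0.2879 | 0.75 | 0.0000 | 13 | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| 4 | 0 | 0 | 0 | 0.24 | 0.2879 | 0.75 | 0.0000 | 1 | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17374 | 1 | 0 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 119 | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 17375 | 1 | 0 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 89 | False | False | ... | True | False | False | False | True | False | False | False | False | False |
| 17376 | 1 | 0 | 1 | 0.26 | 0.2576 | 0.60 | 0.1642 | 90 | False | False | ... | False | True | False | False | True | False | False | False | False | False |
| 17377 | 1 | 0 | 1 | 0.26 | 0.2727 | 0.56 | 0.1343 | 61 | False | False | ... | False | False | True | False | True | False | False | False | False | False |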
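
The preprocessed frame contains `season_2` and `season_3` but no `season_1` because `get_dummies` was called with `drop_first=True`. A small sketch of that behaviour on hypothetical toy data (the frame `toy` is illustrative only):

```python
import pandas as pd

# Hypothetical toy data: one categorical column with four levels
toy = pd.DataFrame({"season": [1, 2, 3, 4], "temp": [0.2, 0.4, 0.7, 0.3]})

# drop_first=True removes the first level (season_1), which becomes the
# implicit reference category; this avoids perfectly collinear dummies
encoded = pd.get_dummies(toy, columns=["season"], drop_first=True)

print(list(encoded.columns))
```

Dropping one level per feature matters mainly for the Linear Regression model, where a full set of dummies plus an intercept would be perfectly collinear.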



#### 2. Train/Test Split


In [26]:
# Define features (X) and target (y)
X = df.drop('cnt', axis=1)
y = df['cnt']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nTraining set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")


Training set size: 13903 samples
Test set size: 3476 samples


#### 3. Baseline Model (Single Regressor)


In [30]:
# Initialize the models
decision_tree = DecisionTreeRegressor(max_depth=6, random_state=42)
linear_regression = LinearRegression()

# Train the models
decision_tree.fit(X_train, y_train)
linear_regression.fit(X_train, y_train)

# Make predictions on the test set
dt_predictions = decision_tree.predict(X_test)
lr_predictions = linear_regression.predict(X_test)

# Calculate RMSE for both models
dt_rmse = np.sqrt(mean_squared_error(y_test, dt_predictions))
lr_rmse = np.sqrt(mean_squared_error(y_test, lr_predictions))

print(f"\nDecision Tree RMSE: {dt_rmse:.4f}")
print(f"Linear Regression RMSE: {lr_rmse:.4f}")

# Determine the baseline model
if dt_rmse < lr_rmse:
    baseline_rmse = dt_rmse
    baseline_model = "Decision Tree"
else:
    baseline_rmse = lr_rmse
    baseline_model = "Linear Regression"

print(f"\nBaseline Model: {baseline_model} with an RMSE of {baseline_rmse:.4f}")


Decision Tree RMSE: 118.4555
Linear Regression RMSE: 100.4449

Baseline Model: Linear Regression with an RMSE of 100.4449


Linear Regression (RMSE 100.4449) outperformed the depth-6 Decision Tree (RMSE 118.4555) on the test set, so it serves as the baseline for the comparisons below.

### Part B: Ensemble Techniques for Bias and Variance Reduction

#### 1. Bagging (Variance Reduction)

**Hypothesis:** Bagging (Bootstrap Aggregating) is an ensemble technique that primarily aims to reduce the variance of a model.


In [36]:
# --- 1. Bagging (Variance Reduction) ---

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Initialize the base estimator (the same single Decision Tree from our baseline)
base_dt = DecisionTreeRegressor(max_depth=6, random_state=42)

# Initialize the Bagging Regressor
# n_estimators is the number of base models to train
# Note: scikit-learn >= 1.2 names this parameter `estimator`;
# the older `base_estimator` was removed in 1.4
bagging_reg = BaggingRegressor(
    estimator=base_dt,
    n_estimators=200,  # Using 200 estimators
    random_state=42,
    n_jobs=-1  # Use all available CPU cores
)

# Train the Bagging model
print("Training the Bagging Regressor...")
bagging_reg.fit(X_train, y_train)
print("Training complete.")

# Make predictions on the test set
bagging_predictions = bagging_reg.predict(X_test)

# Calculate and report the RMSE
bagging_rmse = np.sqrt(mean_squared_error(y_test, bagging_predictions))

print("\n--- Bagging Results ---")
print(f"Single Decision Tree RMSE (from Part A): {dt_rmse:.4f}")
print(f"Bagging Regressor RMSE: {bagging_rmse:.4f}")

Training the Bagging Regressor...
Training complete.

--- Bagging Results ---
Single Decision Tree RMSE (from Part A): 118.4555
Bagging Regressor RMSE: 112.2620


### Bagging Results and Discussion

**Calculated RMSE:**

*   Single Decision Tree RMSE (from Part A): 118.4555
*   **Bagging Regressor RMSE: 112.2620**

**Discussion:**

Bagging improved on the single tree. The Bagging Regressor achieved an RMSE of **112.2620**, a reduction of **6.1935** (about 5.2%) compared to the single Decision Tree's RMSE of 118.4555. It still remains worse than the Linear Regression baseline (100.4449).

This is consistent with the hypothesis that bagging primarily reduces variance. Here's the reasoning:

1.  **High Variance of Single Trees:** A single Decision Tree is known to be a "high variance" estimator. If you train it on slightly different subsets of the data, the resulting tree structures can be quite different, leading to inconsistent predictions. This instability is a classic sign of high variance.

2.  **The Power of Averaging:** Bagging (Bootstrap Aggregating) mitigates this problem by training many decision trees (200 in our case), each on a different bootstrap sample of the training data. While each individual tree might still have high variance and overfit its particular sample, the trees' errors are only partly correlated. By averaging the predictions of all these trees, much of the individual error cancels out.

3.  **Smoother, More Stable Predictions:** This averaging process results in a "smoother" and more stable final prediction. The bagged model is less sensitive to the specific noise and fluctuations in the training data. The reduction in RMSE from 118.4555 to 112.2620 is consistent with reduced variance, although a single test-set RMSE cannot separate bias from variance. The gain is modest, likely because depth-6 trees are also limited by bias, which averaging does not remove.
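
The variance argument above can be illustrated directly. The following is a minimal sketch on synthetic data (an assumption, not the bike-sharing data): it draws many independent training sets and compares how much a single tree's predictions vary across them with how much a manually bagged ensemble's predictions vary. The helper names `make_data`, `fit_single`, and `bagged_predict` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_data(n=200):
    # Synthetic 1-D regression problem (an assumption, not the bike data)
    x = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

def fit_single(x, y):
    return DecisionTreeRegressor(max_depth=6, random_state=0).fit(x, y)

def bagged_predict(x, y, x_query, n_trees=25):
    # Manual bagging: fit each tree on a bootstrap resample, then average
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))
        tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(x[idx], y[idx])
        preds.append(tree.predict(x_query))
    return np.mean(preds, axis=0)

x_query = np.linspace(0.5, 5.5, 20).reshape(-1, 1)
single_preds, bagged_preds = [], []
for _ in range(30):  # 30 independent training sets
    x, y = make_data()
    single_preds.append(fit_single(x, y).predict(x_query))
    bagged_preds.append(bagged_predict(x, y, x_query))

# Variance of the prediction across training sets, averaged over query points
single_var = np.var(single_preds, axis=0).mean()
bagged_var = np.var(bagged_preds, axis=0).mean()
print(f"single tree variance:  {single_var:.4f}")
print(f"bagged trees variance: {bagged_var:.4f}")
```

Unlike a single test RMSE, this measures variance itself, which is only possible here because fresh training sets can be generated at will.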


#### 2. Boosting (Bias Reduction)

**Hypothesis:** Boosting is an ensemble technique that primarily aims to reduce a model's bias.

In [37]:
# --- 2. Boosting (Bias Reduction) ---

from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting Regressor
# We'll use some common hyperparameters to start
grad_boost_reg = GradientBoostingRegressor(
    n_estimators=150,      # Number of sequential trees to build
    learning_rate=0.1,     # How much each tree contributes to the final outcome
    max_depth=5,           # Maximum depth of the individual trees
    random_state=42
)

# Train the Gradient Boosting model
print("\nTraining the Gradient Boosting Regressor...")
grad_boost_reg.fit(X_train, y_train)
print("Training complete.")

# Make predictions on the test set
boosting_predictions = grad_boost_reg.predict(X_test)

# Calculate and report the RMSE
boosting_rmse = np.sqrt(mean_squared_error(y_test, boosting_predictions))

print("\n--- Boosting Results ---")
print(f"Baseline Linear Regression RMSE: {lr_rmse:.4f}")
print(f"Bagging Regressor RMSE: {bagging_rmse:.4f}")
print(f"Gradient Boosting Regressor RMSE: {boosting_rmse:.4f}")


Training the Gradient Boosting Regressor...
Training complete.

--- Boosting Results ---
Baseline Linear Regression RMSE: 100.4449
Bagging Regressor RMSE: 112.2620
Gradient Boosting Regressor RMSE: 54.1999



### Boosting Results and Discussion

**Calculated RMSE:**

*   Baseline Linear Regression RMSE: 100.4449
*   Bagging Regressor RMSE: 112.2620
*   **Gradient Boosting Regressor RMSE: 54.1999**

**Discussion:**

The results show that the Gradient Boosting Regressor achieved a significantly better outcome than all previous models. With an RMSE of **54.1999**, it has nearly **halved the prediction error** of our best single model, the Linear Regression baseline (100.4449), and is substantially more accurate than the Bagging Regressor (112.2620).

This dramatic improvement strongly supports the hypothesis that **boosting effectively reduces bias**. Here's the reasoning:

1.  **Sequential Error Correction:** Unlike bagging, which builds independent models in parallel, boosting is a sequential process. It trains the first weak learner (a shallow decision tree) on the data. The second learner is then fitted to the same features, but its target is the *errors* (residuals) of the first rather than the original target. Each subsequent learner is specifically built to correct the mistakes made by the ensemble of all preceding learners.

2.  **Focusing on Weaknesses:** This sequential method forces the overall model to concentrate on the data points it currently predicts worst, since these have the largest residuals. By iteratively fixing its own weaknesses, the model captures more of the true underlying pattern with each step. A model's systematic inability to capture that true relationship is its **bias**, and this is what each boosting round reduces.

3.  **Evidence of Bias Reduction:** The massive drop in RMSE is the key piece of evidence. The baseline models, with errors over 100, had a relatively high bias; they were fundamentally unable to map the complex relationships between weather, time, and bike rentals. The Gradient Boosting model, by relentlessly correcting its errors, created a much more accurate and complex function that better represents the ground truth, thereby drastically reducing this bias and achieving a much lower overall error.

In conclusion, the superior performance of the Gradient Boosting Regressor is a clear demonstration of its strength in bias reduction, allowing it to build a far more accurate and powerful predictive model than either the single baseline models or the variance-reducing bagging ensemble.
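
The sequential residual-fitting idea described above can be sketched by hand. This is a simplified illustration for squared-error loss on synthetic data (an assumption, not the bike-sharing data), not a reproduction of scikit-learn's `GradientBoostingRegressor` internals; the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic non-linear data (an assumption, not the bike data)
X_toy = rng.uniform(0, 6, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_toy, y_toy.mean())
rmse_history = [np.sqrt(np.mean((y_toy - prediction) ** 2))]

for _ in range(n_rounds):
    residuals = y_toy - prediction          # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)              # same features, new target
    prediction += learning_rate * tree.predict(X_toy)
    rmse_history.append(np.sqrt(np.mean((y_toy - prediction) ** 2)))

print(f"training RMSE after 0 rounds:  {rmse_history[0]:.4f}")
print(f"training RMSE after {n_rounds} rounds: {rmse_history[-1]:.4f}")
```

Each shallow tree on its own is a high-bias model; the falling training error shows how the sum of many small corrections fits structure that no single shallow tree could.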


### Part C: Stacking for Optimal Performance

#### 1. Stacking Implementation

**Principle of Stacking:**

Stacking, or Stacked Generalization, is an advanced ensemble method that combines multiple different regression (or classification) models to produce a final, improved prediction. It operates on a two-level structure:

*   **Level-0 (Base Learners):** This level consists of several different models (e.g., KNN, Bagging, Boosting) that are all trained independently on the full training dataset. Their job is to learn the underlying patterns in the data from different perspectives. Diverse models are chosen intentionally because they will likely make different kinds of errors, which the next level can learn from.

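*   **Level-1 (Meta-Learner):** Instead of simply averaging the predictions of the base learners (like in bagging), Stacking trains a new model, the Meta-Learner, to make the final prediction. The Meta-Learner is trained primarily on the predictions made by the Level-0 Base Learners rather than on the original features. These predictions are generated out-of-fold via cross-validation, so the Meta-Learner never sees a prediction that a base model made on its own training rows. In the implementation below, `passthrough=True` additionally gives the Meta-Learner the original features.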

**How the Meta-Learner Learns:**

The key insight of stacking is that the Meta-Learner learns the optimal way to combine the predictions from the base models. It effectively learns the strengths and weaknesses of each base learner. For example, it might learn that the Gradient Boosting model is very reliable in most cases but tends to over-predict during holidays, while the KNN model performs better in those specific situations. By training on the outputs of the base models, the Meta-Learner figures out what weights or combination rules to apply to their predictions to generate the most accurate final output. It essentially learns a sophisticated, data-driven way to "trust" each base model under different conditions.
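
To make the two-level structure concrete, here is a minimal sketch of how the meta-learner's training data can be built by hand from out-of-fold predictions. It uses synthetic data from `make_regression` and a reduced set of base models (both assumptions for illustration), and it approximates what `StackingRegressor` does rather than reproducing its exact internals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption, not the bike data)
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

base_models = [
    KNeighborsRegressor(n_neighbors=10),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: each row is predicted by a model that never saw it
meta_features = np.column_stack([
    cross_val_predict(model, X_toy, y_toy, cv=5) for model in base_models
])

# The meta-learner is fit on the base-model predictions, not the raw features
meta = Ridge(alpha=1.0).fit(meta_features, y_toy)

print("meta-feature matrix shape:", meta_features.shape)
print("meta-learner weights:", meta.coef_)
```

With a linear meta-learner such as Ridge, the learned combination is a fixed set of weights over the base predictions, one per base model.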

In [38]:

# --- 1. Stacking Implementation ---

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np


# Define the Base Learners (Level-0)
# We reuse the Bagging and Boosting configurations from Part B.
# StackingRegressor clones and refits each estimator, so the earlier fits are not reused.
# To ensure diversity, we add a K-Nearest Neighbors model.
base_learners = [
    ('knn', KNeighborsRegressor(n_neighbors=10)),
    ('bagging', bagging_reg),  # Same configuration as in Part B
    ('boosting', grad_boost_reg) # Same configuration as in Part B
]

# Define the Meta-Learner (Level-1)
# Ridge regression is a good choice as it's a simple, regularized linear model.
meta_learner = Ridge(alpha=1.0)

# Initialize the Stacking Regressor
# The 'passthrough=True' argument allows the meta-learner to see both the
# base model predictions AND the original features, which can sometimes improve performance.
# cv=5 means 5-fold cross-validation will be used to generate the predictions for the meta-learner.
stacking_reg = StackingRegressor(
    estimators=base_learners,
    final_estimator=meta_learner,
    passthrough=True,
    cv=5,
    n_jobs=-1
)

# Train the Stacking model
# This will take longer as it involves cross-validation and training multiple models.
print("Training the Stacking Regressor...")
stacking_reg.fit(X_train, y_train)
print("Training complete.")

# --- 2. Final Evaluation ---

# Make predictions on the test set
stacking_predictions = stacking_reg.predict(X_test)

# Calculate and report the RMSE
stacking_rmse = np.sqrt(mean_squared_error(y_test, stacking_predictions))

print("\n--- Stacking Results ---")
# Reuse the Gradient Boosting RMSE computed in Part B
print(f"Gradient Boosting Regressor RMSE: {boosting_rmse:.4f}")
print(f"Stacking Regressor RMSE: {stacking_rmse:.4f}")


Training the Stacking Regressor...
Training complete.

--- Stacking Results ---
Gradient Boosting Regressor RMSE: 54.1999
Stacking Regressor RMSE: 49.4370


### Part D: Final Analysis

#### 1. Comparative Table


| Model | Technique | RMSE | Performance vs. Baseline |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | **Baseline Single Model** | **100.4449** | **Baseline** |
| Bagging Regressor | Bagging (Variance Reduction) | 112.2620 | -11.76% |
| Gradient Boosting Regressor| Boosting (Bias Reduction) | 54.1999 | +46.04% |
| **Stacking Regressor** | **Stacking (Optimal Combination)**| **49.4370** | **+50.78%** |

*Performance vs. Baseline is calculated as `(1 - (Model RMSE / Baseline RMSE)) * 100`. A positive percentage indicates improvement.*
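
The formula above can be checked with a few lines of Python. The RMSE values are copied from the results reported in this notebook; the function name `improvement_pct` is a hypothetical helper, and the single Decision Tree is included for reference even though it is not a row in the table.

```python
# RMSE values copied from the results reported in this notebook
baseline_rmse = 100.4449
model_rmse = {
    "Decision Tree": 118.4555,
    "Bagging Regressor": 112.2620,
    "Gradient Boosting Regressor": 54.1999,
    "Stacking Regressor": 49.4370,
}

def improvement_pct(rmse, baseline):
    """Percentage reduction in RMSE relative to the baseline."""
    return (1 - rmse / baseline) * 100

improvements = {
    name: improvement_pct(rmse, baseline_rmse) for name, rmse in model_rmse.items()
}
for name, pct in improvements.items():
    print(f"{name}: {pct:+.2f}%")
```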

---

#### 2. Conclusion

**Best-Performing Model:**

Based on the empirical results from this single train/test split, the **Stacking Regressor is the best-performing model**, achieving the lowest RMSE of **49.4370**. It reduced the prediction error by over 50% compared to the initial baseline, demonstrating a substantial improvement in forecasting accuracy.

**Explanation of Superior Performance:**

The significant outperformance of the ensemble models, particularly the Stacking Regressor, over the single baseline model can be explained by referencing the **bias-variance trade-off** and the principle of **model diversity**.
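
For reference, the trade-off mentioned here is usually written as a decomposition of the expected squared error at a point $x$. The notation is standard but not defined elsewhere in this notebook: it assumes $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ denotes a model fitted on a random training set.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
+ \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
+ \sigma^2
$$

The first term is the squared bias, the second is the variance, and the third is irreducible noise. A test RMSE reflects only their sum, so the attributions to bias or variance below are interpretations rather than direct measurements.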

1.  **Addressing the Bias-Variance Trade-off:**
    *   The baseline Linear Regression model, while better than a simple Decision Tree, was a **high-bias** model. It was too simple (linear) to capture the complex, non-linear relationships between factors like the hour of the day, weather, and season on bike rental demand. Its high RMSE of 100.4449 reflects this fundamental inability to model the data's true complexity.
    *   The Gradient Boosting model made its greatest leap in performance by directly attacking this problem. As a **bias-reduction** technique, it sequentially built models to correct the errors of its predecessors, creating a highly complex and accurate function that dramatically lowered the bias, resulting in a much lower RMSE (54.1999).
    *   Stacking provided the final, optimal balance. It took the powerful, low-bias Gradient Boosting model and combined it with other models (like the lower-variance Bagging model). The meta-learner's role is to find the best possible compromise in the bias-variance trade-off, using the strengths of each base learner to compensate for the weaknesses of others.

2.  **The Power of Model Diversity:**
    *   The core strength of Stacking lies in its use of **diverse base learners**. We didn't just combine three similar models; we combined three different approaches to problem-solving:
        *   **Gradient Boosting:** A sequential, error-correcting approach.
        *   **Bagging:** A parallel approach that reduces variance by averaging.
        *   **K-Nearest Neighbors:** An instance-based, non-parametric approach that makes predictions based on local data points.
    *   Because these models have different underlying assumptions, they make different kinds of errors. The Stacking meta-learner is trained on their out-of-fold predictions and learns how much weight to give the Boosting model, the Bagging model's stability, and the KNN model's local perspective. This "wisdom of the diverse crowd" allows it to make a final prediction that is more robust and accurate than any single "expert" could achieve on its own.

In summary, the Stacking Regressor won because it didn't rely on a single perspective. It combined the predictions from a diverse team of specialized models and achieved an RMSE about 8.8% lower than its strongest component, Gradient Boosting (49.4370 vs. 54.1999).
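
One way to see how a fitted stack weighs its base learners is to inspect the meta-learner's coefficients. The sketch below uses synthetic data and a reduced set of base learners (both assumptions for illustration) and sets `passthrough=False` so that the coefficient vector has exactly one entry per base learner. With `passthrough=True`, as in the notebook, it would also contain one entry per original feature.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (an assumption, not the bike data)
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("knn", KNeighborsRegressor(n_neighbors=10)),
        ("boosting", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    passthrough=False,  # meta-learner sees only the base predictions here
    cv=5,
)
stack.fit(X_toy, y_toy)

# One coefficient per base learner, in the order given in `estimators`
names = [name for name, _ in stack.estimators]
weights = dict(zip(names, stack.final_estimator_.coef_))
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Applying the same inspection to `stacking_reg.final_estimator_.coef_` in this notebook would show how strongly the final model leans on each of the three base learners.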