# 🧪 Ensemble Learning Lab: Bagging, Boosting, and Stacking (Classification)

## 🎯 Objective
In this task, you will **design, train, evaluate, and compare ensemble learning models** for a **classification problem** using real-world data.  
You will work with **Bagging**, **Boosting**, and **Stacking** techniques and analyze their performance using appropriate evaluation metrics.

---

## 📊 Dataset
- **Source**: Kaggle  
- **Competition**: *Playground Series – Season 6, Episode 2*  
- **Problem Type**: **Supervised Classification**

🔗 Dataset link:  
[Playground Series S6E2 overview](https://www.kaggle.com/competitions/playground-series-s6e2/overview)

> **Note**: You must download the dataset using the Kaggle API or manually upload it to Colab/Jupyter.

---

## 🧠 Models to Implement

### 1️⃣ Bagging
- **Model**: `BaggingClassifier`
- **Base Estimator**: Decision Tree (recommended)
- Key hyperparameters to explore:
  - `n_estimators`
  - `max_samples`
  - `max_features`
  - `bootstrap`

---

### 2️⃣ Boosting Models

#### a) AdaBoost
- **Model**: `AdaBoostClassifier`
- Base estimator: Decision Tree (stump recommended)
- Tune:
  - `n_estimators`
  - `learning_rate`

#### b) Gradient Boosting
- **Model**: `GradientBoostingClassifier`
- Tune:
  - `n_estimators`
  - `learning_rate`
  - `max_depth`
  - `subsample`

#### c) XGBoost
- **Model**: `XGBClassifier`
- Tune:
  - `n_estimators`
  - `learning_rate`
  - `max_depth`
  - `subsample`
  - `colsample_bytree`

> ⚠️ Handle class imbalance if present.

---

### 3️⃣ Stacking
- **Model**: `StackingClassifier`
- **Base learners** (example):
  - Logistic Regression
  - Random Forest
  - Gradient Boosting
- **Meta-learner**:
  - Logistic Regression (recommended)

---

## ⚙️ Task Instructions

### 🔹 Step 1: Data Preparation
- Load training data
- Separate features and target
- Handle missing values (if any)
- Encode categorical variables
- Perform train/validation split
- Apply feature scaling where necessary

---

### 🔹 Step 2: Model Training
- Train **each ensemble model independently**
- Use **cross-validation** where appropriate
- Record training time and key hyperparameters
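
One way to combine cross-validation with timing is scikit-learn's `cross_validate`, which reports per-fold fit times alongside the scores. The sketch below is illustrative only: it uses a small synthetic dataset from `make_classification` as a stand-in for the competition data, and the model settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption: the real features would replace this).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

model = GradientBoostingClassifier(n_estimators=20, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate returns a dict with fit_time, score_time and test_<metric> arrays.
cv_results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "roc_auc"])

print("Mean accuracy     :", cv_results["test_accuracy"].mean())
print("Mean ROC-AUC      :", cv_results["test_roc_auc"].mean())
print("Total fit time (s):", cv_results["fit_time"].sum())
```

Stratified folds keep the class ratio roughly constant across folds, which matters when the classes are not perfectly balanced.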

---

### 🔹 Step 3: Evaluation Metrics
Evaluate all models using:
- **Accuracy**
- **Precision**
- **Recall**
- **F1-score**
- **ROC-AUC**
- **Confusion Matrix**
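
Unlike the other metrics listed, ROC-AUC is computed from predicted scores (for example the positive-class column of `predict_proba`) rather than from hard labels. It can be read as the fraction of (positive, negative) pairs that the model ranks in the right order. The sketch below illustrates that interpretation with a hypothetical helper, `pairwise_auc`, on made-up toy data. It is quadratic in the number of samples, so it is meant for intuition only; `roc_auc_score` is the practical choice.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
```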

---

### 🔹 Step 4: Model Comparison
Create a **comparison table** that includes:
- Model name
- Accuracy
- F1-score
- ROC-AUC
- Training time
- Key observations

---

## 📈 Analysis & Discussion (Required)

Answer the following:
1. Which ensemble method performed best and why?
2. How does **bagging** differ from **boosting** in terms of bias and variance?
3. Did stacking outperform individual ensemble models?
4. Which model would you choose for deployment and why?
5. What are the computational trade-offs between these methods?

---

## 📝 Deliverables
- Fully executable notebook
- Clean, well-documented code
- Final comparison table
- Written analysis and conclusions

---

## ⭐ Bonus (Optional)
- Perform **hyperparameter tuning** using GridSearchCV or RandomizedSearchCV
- Plot **ROC curves** for all models
- Analyze **feature importance** for boosting models
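
For the tuning bonus, a randomized search might look like the following sketch. The search space and the synthetic data are assumptions chosen so the example runs quickly; a real search over the competition data would use wider ranges and more iterations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: replace with the prepared training split).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# Hypothetical search space over the hyperparameters listed for Gradient Boosting.
param_distributions = {
    "n_estimators": [20, 50],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,           # number of sampled configurations
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best params    :", search.best_params_)
print("Best CV ROC-AUC:", search.best_score_)
```

`RandomizedSearchCV` samples a fixed number of configurations, so its cost is controlled by `n_iter` rather than by the size of the grid, which is usually the cheaper option on a dataset this large.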

---

🎓 **Learning Outcome**  
By completing this task, you will gain hands-on experience with advanced ensemble techniques and develop a strong intuition for **when and why to use bagging, boosting, or stacking in classification problems**.

In [2]:
!pip install -q kaggle

In [8]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle (1).json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [10]:
import json
import os
import shutil # Import shutil for moving files

# Define the target directory for Kaggle credentials
kaggle_config_dir = '/root/.kaggle'

# Ensure the .kaggle directory exists
os.makedirs(kaggle_config_dir, exist_ok=True)

# Define the expected full path for kaggle.json
destination_kaggle_json_path = os.path.join(kaggle_config_dir, 'kaggle.json')

# Check if 'kaggle (1).json' exists in the current directory from a previous upload,
# and if so, move it to 'kaggle.json' in the correct location.
# If 'kaggle.json' exists directly, move that.
source_kaggle_json_file = 'kaggle.json'
if os.path.exists('kaggle (1).json'):
    source_kaggle_json_file = 'kaggle (1).json'

if os.path.exists(source_kaggle_json_file):
    # Move the uploaded file to the .kaggle directory
    shutil.move(source_kaggle_json_file, destination_kaggle_json_path)
    # Set permissions for the kaggle.json file
    os.chmod(destination_kaggle_json_path, 0o600)
    print(f"Kaggle API key file moved from '{source_kaggle_json_file}' to '{destination_kaggle_json_path}' and permissions set.")
else:
    # Stop here: the credential steps below cannot work without the file.
    raise FileNotFoundError(
        "Neither 'kaggle.json' nor 'kaggle (1).json' was found in the current directory. "
        "Upload your kaggle.json file with files.upload() first."
    )

# Set the KAGGLE_CONFIG_DIR environment variable
os.environ['KAGGLE_CONFIG_DIR'] = kaggle_config_dir

# Now, open and load the credentials from the correct path
with open(destination_kaggle_json_path) as f:
    kaggle_creds = json.load(f)

os.environ['KAGGLE_USERNAME'] = kaggle_creds['username']
os.environ['KAGGLE_KEY'] = kaggle_creds['key']

print("Kaggle credentials successfully configured.")

Kaggle API key file moved from 'kaggle (1).json' to '/root/.kaggle/kaggle.json' and permissions set.
Kaggle credentials successfully configured.


In [11]:
!kaggle competitions download -c playground-series-s6e2

Downloading playground-series-s6e2.zip to /content
  0% 0.00/10.2M [00:00<?, ?B/s]
100% 10.2M/10.2M [00:00<00:00, 1.90GB/s]


In [12]:
!unzip playground-series-s6e2.zip

Archive:  playground-series-s6e2.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


# Data preparation

In [13]:
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

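| | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58 | 1 | 4 | 152 | 239 | 0 | 0 | 158 | 1 | 3.6 | 2 | 2 | 7 | Presence |
| 1 | 1 | 52 | 1 | 1 | 125 | 325 | 0 | 2 | 171 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 2 | 2 | 56 | 0 | 2 | 160 | 188 | 0 | 2 | 151 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 3 | 3 | 44 | 0 | 3 | 134 | 229 | 0 | 2 | 150 | 0 | 1.0 | 2 | 0 | 3 | Absence |
| 4 | 4 | 58 | 1 | 4 | 140 | 234 | 0 | 2 | 125 | 1 | 3.8 | 2 | 3 | 3 | Presence |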


In [14]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       630000 non-null  int64  
 1   Age                      630000 non-null  int64  
 2   Sex                      630000 non-null  int64  
 3   Chest pain type          630000 non-null  int64  
 4   BP                       630000 non-null  int64  
 5   Cholesterol              630000 non-null  int64  
 6   FBS over 120             630000 non-null  int64  
 7   EKG results              630000 non-null  int64  
 8   Max HR                   630000 non-null  int64  
 9   Exercise angina          630000 non-null  int64  
 10  ST depression            630000 non-null  float64
 11  Slope of ST              630000 non-null  int64  
 12  Number of vessels fluro  630000 non-null  int64  
 13  Thallium                 630000 non-null  int64  
 14  Hear

In [15]:
X = train.drop(columns=["id", "Heart Disease"])
y = train["Heart Disease"]

In [16]:
y.value_counts()

| Heart Disease | count |
|---|---|
| Absence | 347546 |
| Presence | 282454 |
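
These counts put the positive class (`Presence`) at roughly 45% of the rows, so the imbalance is mild. If reweighting were wanted anyway, two common starting points are the negative-to-positive ratio (the heuristic XGBoost documents for `scale_pos_weight`) and the inverse-frequency weights behind scikit-learn's `class_weight="balanced"`. The helper below, `imbalance_summary`, is a hypothetical name introduced only for this sketch.

```python
def imbalance_summary(n_negative, n_positive):
    """Return the positive-class share plus two common reweighting heuristics."""
    n_total = n_negative + n_positive
    return {
        "positive_share": n_positive / n_total,
        # Negative-to-positive ratio, a usual starting value for scale_pos_weight.
        "scale_pos_weight": n_negative / n_positive,
        # n_samples / (n_classes * class_count), as in class_weight="balanced".
        "balanced_weights": {
            0: n_total / (2 * n_negative),
            1: n_total / (2 * n_positive),
        },
    }


# Counts taken from the value_counts() output above.
summary = imbalance_summary(n_negative=347546, n_positive=282454)
print(summary)
```

With a ratio this close to 1, training without reweighting is a reasonable default; the weights matter more when one class is rare.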


In [17]:
y = y.map({'Absence': 0, 'Presence': 1})

In [18]:
for col in X.columns:
    print(col, X[col].unique())

Age [58 52 56 44 38 59 60 48 41 42 53 50 65 46 62 57 54 66 51 55 43 71 63 61
 35 49 47 67 64 45 40 70 69 37 76 34 68 39 74 77 29 75]
Sex [1 0]
Chest pain type [4 1 2 3]
BP [152 125 160 134 140 138 130 120 150 108 110 178 124  94 112 128 118 100
 105 172 180 145 132 142 122 135 136 126 106 101 115 156 170 146 192 102
 117 148 104 200 165 129 174 123 144 158 133 103 147 155 149 109 168 111
 154 127 114 116 175 141 131 162  99  96  95 184]
Cholesterol [239 325 188 229 234 283 246 245 212 197 230 263 244 231 274 282 199 226
 204 185 177 250 211 303 201 266 256 219 222 249 235 295 258 271 304 277
 203 228 269 208 254 268 206 299 221 196 240 298 288 265 198 270 243 309
 233 330 255 315 261 294 223 214 273 286 267 260 236 289 252 275 302 224
 305 218 340 248 308 300 149 209 225 213 207 180 192 327 232 200 341 227
 322 220 311 210 172 247 360 306 318 215 335 205 178 182 242 168 564 353
 195 253 407 276 313 354 257 307 409 217 175 290 321 184 174 281 319 186
 417 193 293 164 166 160 216 183 326

Apply one-hot encoding to the categorical columns that take 3-4 distinct values

In [19]:
categorical_cols = ["Chest pain type", "EKG results", "Slope of ST", "Number of vessels fluro", "Thallium"]
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
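
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame. The lowest category becomes the implicit baseline and gets no column of its own, which avoids a redundant (perfectly collinear) column for linear models such as logistic regression.

```python
import pandas as pd

# Toy frame with one categorical column (values made up for illustration).
toy = pd.DataFrame({"Age": [58, 52, 56], "Thallium": [7, 3, 6]})

encoded = pd.get_dummies(toy, columns=["Thallium"], drop_first=True)

# The lowest category (3) is the baseline: its row has both indicator columns off.
print(list(encoded.columns))
print(encoded)
```

Because the set of generated columns depends on which categories appear in the data, any other frame encoded the same way should be aligned to the training columns before prediction.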

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Feature scaling (for logistic regression)

In [21]:
from sklearn.preprocessing import StandardScaler
numerical_cols = ["Age", "BP", "Cholesterol", "Max HR", "ST depression"]
scaler = StandardScaler()

In [22]:
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
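
Fitting the scaler on the training split only, as above, keeps validation data out of the scaling statistics. When cross-validation is used, the same guarantee is easier to maintain by putting the scaler inside a pipeline. The following is a sketch under assumptions: the toy data is made up, and only two of the numerical columns are shown.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up); column names mirror a few of the dataset's features.
toy_X = pd.DataFrame({
    "Age": [58, 52, 56, 44, 61, 49],
    "BP": [152, 125, 160, 134, 140, 118],
    "Sex": [1, 1, 0, 0, 1, 0],
})
toy_y = [1, 0, 0, 0, 1, 0]

preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Age", "BP"])],
    remainder="passthrough",  # leave binary / one-hot columns untouched
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics from the data it is given, so inside
# cross-validation each fold is scaled using only its own training portion.
pipeline.fit(toy_X, toy_y)
toy_pred = pipeline.predict(toy_X)
print(toy_pred)
```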

# Model Training

In [23]:
import time
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

# 1. Bagging

In [24]:
start = time.time()
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    max_features=0.8,
    bootstrap=True,
    random_state=42,
)
bagging.fit(X_train, y_train)
bagging_time = time.time() - start

In [25]:
def print_metrics(y_test, y_pred, y_proba):
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-score :", f1_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("ROC-AUC :", roc_auc_score(y_test, y_proba))

In [26]:
bagging_pred = bagging.predict(X_test)
bagging_proba= bagging.predict_proba(X_test)[:,1]

In [27]:
print("Bagging Results")
print_metrics(y_test, bagging_pred, bagging_proba)
print("Time: ",bagging_time)

Bagging Results
Accuracy : 0.8843730158730159
Precision: 0.878057138734579
Recall   : 0.8617832929139155
F1-score : 0.8698441059543485
Confusion Matrix:
 [[62748  6761]
 [ 7808 48683]]
ROC-AUC : 0.9501420142808107
Time:  188.7931568622589
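
As a sanity check, the four label-based metrics can be recomputed by hand from the confusion matrix printed above. The sketch assumes scikit-learn's layout for binary problems, `[[TN, FP], [FN, TP]]`, with `Presence` (1) as the positive class.

```python
# Confusion matrix from the Bagging results, laid out as [[TN, FP], [FN, TP]].
tn, fp = 62748, 6761
fn, tp = 7808, 48683

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")
```

The results agree with the reported values, and the lower recall relative to precision reflects that false negatives (7808) outnumber false positives (6761).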


# 2. Boosting

a) AdaBoost

In [28]:
start = time.time()
adaboost = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
adaboost.fit(X_train, y_train)
adaboost_time = time.time() - start

In [29]:
adaboost_pred = adaboost.predict(X_test)
adaboost_proba = adaboost.predict_proba(X_test)[:,1]

In [30]:
print("AdaBoost Results")
print_metrics(y_test, adaboost_pred, adaboost_proba)
print("Time: ",adaboost_time)

AdaBoost Results
Accuracy : 0.8706746031746032
Precision: 0.8854030835314873
Recall   : 0.8173337345771893
F1-score : 0.8500078240779094
Confusion Matrix:
 [[63533  5976]
 [10319 46172]]
ROC-AUC : 0.9435729416091109
Time:  28.031412839889526


b) Gradient Boosting

In [31]:
start = time.time()
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=42)
gb.fit(X_train, y_train)
gb_time = time.time() - start

In [32]:
gb_pred = gb.predict(X_test)
gb_proba = gb.predict_proba(X_test)[:,1]

In [33]:
print("Gradient Boosting Results")
print_metrics(y_test, gb_pred, gb_proba)
print("Time: ", gb_time)

Gradient Boosting Results
Accuracy : 0.8875238095238095
Precision: 0.8821542740522675
Recall   : 0.8646333044201732
F1-score : 0.8733059181119256
Confusion Matrix:
 [[62984  6525]
 [ 7647 48844]]
ROC-AUC : 0.9541647004666187
Time:  54.490474462509155


c) XGBoost

In [34]:
start = time.time()
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=0.8, colsample_bytree=0.8, random_state=42)
xgb.fit(X_train, y_train)
xgb_time = time.time() - start

In [35]:
xgb_pred = xgb.predict(X_test)
xgb_proba = xgb.predict_proba(X_test)[:,1]

In [36]:
print("XGBoost Results")
print_metrics(y_test, xgb_pred, xgb_proba)
print("Time: ", xgb_time)

XGBoost Results
Accuracy : 0.8876825396825396
Precision: 0.8829157999457358
Recall   : 0.864066842505886
F1-score : 0.8733896364156886
Confusion Matrix:
 [[63036  6473]
 [ 7679 48812]]
ROC-AUC : 0.954419997440051
Time:  2.9481468200683594


# 3. Stacking

In [37]:
start = time.time()
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
]
stacking = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stacking.fit(X_train, y_train)
stacking_time = time.time() - start

In [38]:
stacking_pred = stacking.predict(X_test)
stacking_proba = stacking.predict_proba(X_test)[:,1]

In [39]:
print("Stacking Results")
print_metrics(y_test, stacking_pred, stacking_proba)
print("Time: ", stacking_time)

Stacking Results
Accuracy : 0.8879523809523809
Precision: 0.8826142704928395
Recall   : 0.8651466605299959
F1-score : 0.8737931774297361
Confusion Matrix:
 [[63009  6500]
 [ 7618 48873]]
ROC-AUC : 0.9542511824492755
Time:  656.3559060096741
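
Much of this training time comes from the internal cross-validation: by default `StackingClassifier` uses 5 folds to produce out-of-fold predictions for the meta-learner, in addition to fitting each base learner on the full training set. The sketch below shows the parameters that control this cost. It runs on synthetic stand-in data with deliberately small models, so its timing and scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: replace with X_train / y_train).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,                          # fewer internal folds than the default of 5
    stack_method="predict_proba",  # feed class probabilities to the meta-learner
    n_jobs=-1,                     # fit base learners in parallel where possible
)
stack.fit(X_demo, y_demo)
print("Training accuracy:", stack.score(X_demo, y_demo))
```

Fewer folds reduce training time at the price of noisier meta-features, so the setting is a trade-off rather than a free speed-up.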


# Model Comparison

In [40]:
results=[]
def get_metrics(y_test, y_pred, y_proba, model_name, training_time):

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_proba)

    results.append({
        "Model": model_name,
        "Accuracy": acc,
        "F1-score": f1,
        "ROC-AUC": roc,
        "Training Time (s)": training_time
    })

In [41]:
get_metrics(y_test, bagging_pred, bagging_proba, "Bagging", bagging_time)
get_metrics(y_test, adaboost_pred, adaboost_proba, "AdaBoost", adaboost_time)
get_metrics(y_test, gb_pred, gb_proba, "Gradient Boosting", gb_time)
get_metrics(y_test, xgb_pred, xgb_proba, "XGBoost", xgb_time)
get_metrics(y_test, stacking_pred, stacking_proba, "Stacking", stacking_time)

In [42]:
comparison=pd.DataFrame(results)
comparison

| | Model | Accuracy | F1-score | ROC-AUC | Training Time (s) |
|---|---|---|---|---|---|
| 0 | Bagging | 0.884373 | 0.869844 | 0.950142 | 188.793157 |
| 1 | AdaBoost | 0.870675 | 0.850008 | 0.943573 | 28.031413 |
| 2 | Gradient Boosting | 0.887524 | 0.873306 | 0.954165 | 54.490474 |
| 3 | XGBoost | 0.887683 | 0.87339 | 0.95442 | 2.948147 |
| 4 | Stacking | 0.887952 | 0.873793 | 0.954251 | 656.355906 |


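1. **Which ensemble method performed best and why?**
- Stacking scored highest on accuracy (0.8880) and F1-score (0.8738), while XGBoost had the highest ROC-AUC (0.9544 vs. 0.9543 for stacking). Stacking's edge likely comes from combining base learners of different types, but its margin over XGBoost and Gradient Boosting is under 0.001 on every metric.
2. **Bagging vs Boosting**
- Bagging mainly reduces variance: it trains models independently on bootstrap samples and averages their predictions.
- Boosting mainly reduces bias: it trains models sequentially, with each new model focusing on the errors of the previous ones.
3. **Did stacking outperform individual ensemble models?**
- Only marginally. Stacking beat XGBoost by about 0.0003 in accuracy and 0.0004 in F1-score and trailed it slightly in ROC-AUC, so the difference is not practically significant.
4. **Which model would you choose for deployment and why?**
- XGBoost, since its performance is essentially equal to stacking's while it trains in about 3 s instead of about 656 s.
5. **What are the computational trade-offs between these methods?**
- Stacking has the best accuracy and F1-score but is by far the most expensive to train (about 656 s), because it fits every base learner several times under internal cross-validation.
- XGBoost is the most efficient: near-best scores on every metric in the shortest training time (about 3 s).
- Gradient Boosting (about 54 s) and AdaBoost (about 28 s) train faster than Bagging (about 189 s). Gradient Boosting and Bagging score close to the top models, while AdaBoost trails by roughly 0.017 in accuracy.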
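
The variance argument for bagging can be illustrated with a small simulation. This is an idealised sketch, not a model of the classifiers above: each "model" is just the true value plus independent Gaussian noise, and the function names are made up for the example. Real bagged trees are correlated with one another, so the reduction in practice is smaller than shown here.

```python
import random
import statistics

random.seed(42)


def noisy_estimate(true_value=0.5, noise_sd=0.1):
    """One high-variance 'model': the true value plus independent Gaussian noise."""
    return random.gauss(true_value, noise_sd)


def bagged_estimate(n_models):
    """Average of n_models independent noisy estimates (idealised bagging)."""
    return statistics.fmean(noisy_estimate() for _ in range(n_models))


single = [noisy_estimate() for _ in range(2000)]
bagged = [bagged_estimate(25) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print("Variance, single model :", var_single)
print("Variance, 25-model mean:", var_bagged)
```

Averaging leaves the mean of the estimate unchanged, which is why bagging on its own does little for bias; boosting attacks bias instead by fitting each new model to the remaining errors.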

In [54]:
X_test_final = test.drop(columns=["id"])
X_test_final = pd.get_dummies(X_test_final, columns=categorical_cols, drop_first=True)
X_test_final = X_test_final.reindex(columns=X_train.columns, fill_value=0)
X_test_final[numerical_cols] = scaler.transform(X_test_final[numerical_cols])
# Use a separate name so the validation predictions in xgb_pred are not overwritten
test_pred = xgb.predict(X_test_final)

In [55]:
submission = pd.DataFrame({
    "id": test["id"],
    "Heart Disease": test_pred
})

submission.to_csv("submission.csv", index=False)

In [56]:
!kaggle competitions submit -c playground-series-s6e2 -f submission.csv -m "XGBoost model"

100% 2.32M/2.32M [00:01<00:00, 1.30MB/s]
Successfully submitted to Predicting Heart Disease

In [58]:
!kaggle competitions submissions -c playground-series-s6e2

fileName        date                        description    status                    publicScore  privateScore  
--------------  --------------------------  -------------  ------------------------  -----------  ------------  
submission.csv  2026-02-25 20:20:27.350000  XGBoost model  SubmissionStatus.PENDING                             
submission.csv  2026-02-25 19:46:51.753000  XGBoost model  SubmissionStatus.ERROR                               
