# Ensemble Learning - Bagging


Ensemble learning is a machine learning technique where multiple models are combined to improve the overall performance and generalization of the system. Bagging, which stands for Bootstrap Aggregating, is a popular ensemble learning method.

### Steps of Bootstrap Aggregation

1. Bootstrap Sampling: Draw multiple samples from the training dataset with replacement, each typically the same size as the original dataset. Because sampling is with replacement, some instances may appear several times in a sample while others may not appear at all.

2. Training base models: A base model (weak learner) is trained independently on each bootstrap sample, so each model captures patterns specific to its own sample.

3. Aggregation of predictions: At test time, the predictions from all base models are combined to give the final prediction.
    - Classification: A majority vote is commonly used
    - Regression: The mean of the outputs from the base models
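
The three steps above can be sketched in plain Python. This is a minimal illustration, not how `BaggingClassifier` is implemented: the toy 1-D dataset is made up, and `fit_stump` is a hypothetical weak learner (a threshold halfway between the two class means) chosen only to keep the example short.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) items uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))


def fit_stump(sample):
    """Step 2 (hypothetical weak learner): threshold between the class means."""
    xs0 = [x for x, label in sample if label == 0]
    xs1 = [x for x, label in sample if label == 1]
    if not xs0 or not xs1:
        # The bootstrap sample contains a single class: predict it always.
        constant = 1 if xs1 else 0
        return lambda x: constant
    threshold = (mean(xs0) + mean(xs1)) / 2
    high = 1 if mean(xs1) > mean(xs0) else 0  # class found above the threshold
    return lambda x: high if x > threshold else 1 - high


def majority_vote(votes):
    """Step 3 (classification): return the most common predicted label."""
    return Counter(votes).most_common(1)[0][0]


def bagging_fit(rows, n_estimators, seed=42):
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def bagging_predict(models, x):
    return majority_vote([model(x) for model in models])


# Made-up toy data: (feature, label); class 0 lies near 1.0, class 1 near 5.0.
rows = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1)]
models = bagging_fit(rows, n_estimators=25)
print(bagging_predict(models, 1.2), bagging_predict(models, 4.8))
```

For regression, `majority_vote` would be replaced by the mean of the base models' outputs.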

### Applications 

1. Bioinformatics: can be used for gene and protein selection (https://www.maths.usyd.edu.au/u/pengyi/publication/EnsembleBioinformatics-v6.pdf)

2. Remote sensing: can be used in image classification tasks to improve the accuracy of feature classification in satellite images, with each base model focusing on a different texture (https://www.mdpi.com/2072-4292/12/10/1683/htm)

In [88]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [89]:
data = pd.read_csv('heart_failure_clinical_records.csv')

In [90]:
data.shape

(299, 13)

In [91]:
data.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [92]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [93]:
X = data.drop('DEATH_EVENT', axis=1)

In [94]:
X.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8


In [95]:
y = data['DEATH_EVENT']

In [96]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [97]:
y_train.head()

6      1
183    1
185    1
146    0
30     1
Name: DEATH_EVENT, dtype: int64

## Expt 1: What is the ideal number of base estimators?

In [98]:
base = DecisionTreeClassifier(random_state=42)

In [99]:
print('accuracy scores')
for i in range(10, 150, 10):
    # Pass the loop variable as the ensemble size; in scikit-learn >= 1.2 base_estimator is named estimator
    model = BaggingClassifier(base_estimator=base, n_estimators=i, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f'base estimator - {i}: {acc}')

accuracy scores
base estimator - 10: 0.6833333333333333
base estimator - 20: 0.6833333333333333
base estimator - 30: 0.6833333333333333
base estimator - 40: 0.6833333333333333
base estimator - 50: 0.6833333333333333
base estimator - 60: 0.6833333333333333
base estimator - 70: 0.6833333333333333
base estimator - 80: 0.6833333333333333
base estimator - 90: 0.6833333333333333
base estimator - 100: 0.6833333333333333
base estimator - 110: 0.6833333333333333
base estimator - 120: 0.6833333333333333
base estimator - 130: 0.6833333333333333
base estimator - 140: 0.6833333333333333


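More base estimators do not always give better results: in this run, test accuracy stays at 0.683 for every ensemble size from 10 to 140. Adding estimators to a bagging ensemble does not usually cause overfitting, because the aggregated prediction simply stabilizes, but the gains level off while training and prediction costs keep growing.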
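
One way to see why adding estimators eventually stops helping is the standard variance formula for an average of correlated predictors. It rests on an idealization that this notebook does not state: the $B$ base models' predictions are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$. It applies directly to averaging (regression) and only by analogy to majority voting.

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as $B$ grows, but the first term does not depend on $B$, so beyond some ensemble size extra estimators add cost without reducing variance much further.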

## Expt 2: Does the depth of the base models affect the results?

In [100]:
dc = DecisionTreeClassifier(max_depth=2, random_state=42)  # weak base model: tree limited to depth 2

In [101]:
bagging = BaggingClassifier(base_estimator=dc, n_estimators = 100, random_state=42)

In [102]:
cv_scores = cross_val_score(bagging, X, y, cv = 10)

In [103]:
print(cv_scores.mean())

0.7789655172413793


In [108]:
print('mean cross validation score')
for i in range(1, 10):
    dc = DecisionTreeClassifier(max_depth=i, random_state=42)  # base model: tree limited to depth i
    bagging = BaggingClassifier(base_estimator=dc, n_estimators = 100, random_state=42)
    cv_scores = cross_val_score(bagging, X, y, cv = 10)
    print(f'max_depth {i}: {cv_scores.mean()}')

mean cross validation score
max_depth 1: 0.7789655172413793
max_depth 2: 0.7789655172413793
max_depth 3: 0.7489655172413793
max_depth 4: 0.7522988505747128
max_depth 5: 0.745632183908046
max_depth 6: 0.735632183908046
max_depth 7: 0.7389655172413793
max_depth 8: 0.7389655172413793
max_depth 9: 0.735632183908046


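Shallow (weak) base models combined with a large number of estimators give the best results on this dataset: the mean cross-validation accuracy is highest (about 0.779) at max_depth 1 and 2 and falls to about 0.74 for deeper trees. However, it's important to keep the complexity-performance tradeoff in mind when choosing max_depth.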

### Appropriate evaluation

Cross-validation complements bagging by using the entire dataset through repeated train-validation splits, so that every data point is used for both training and validation at some point. Averaging the scores across these splits reduces the variance of the performance estimate, making it more reliable than the score from a single train-test split.
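
The splitting behind k-fold cross-validation can be sketched in plain Python. This is a simplified illustration using contiguous folds; for classifiers, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds instead. The helper name `k_fold_indices` is hypothetical. The sizes match the 299 rows and `cv = 10` used in this notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_samples, k = 299, 10
val_counts = [0] * n_samples    # times each row is used for validation
train_counts = [0] * n_samples  # times each row is used for training
for train_idx, val_idx in k_fold_indices(n_samples, k):
    for i in val_idx:
        val_counts[i] += 1
    for i in train_idx:
        train_counts[i] += 1

# Every row is validated on exactly once and trained on in the other k - 1 folds.
print(set(val_counts), set(train_counts))  # prints {1} {9}
```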