# <span style="color:purple; font-weight:bold">Bagging</span>

**Bagging, or Bootstrap Aggregation, is an ensemble learning method in machine learning designed to improve the stability and accuracy of prediction models.**  

**It achieves this by reducing variance, which often leads to better generalization on unseen data and helps prevent overfitting.**             


### <span style="color:purple; font-weight:bold">Key Concepts</span>            


- <span style="color:blue; font-weight:bold">Bootstrapping:</span> Draws random samples from the training data with replacement, so the same row can appear more than once in a sample.

- <span style="color:blue; font-weight:bold">Parallel Training:</span> Base models are trained independently of one another, so they can be trained in parallel.

- <span style="color:blue; font-weight:bold">Aggregation:</span> Uses majority voting (classification) or averaging (regression) for the final prediction.               

    * For classification tasks, a majority voting scheme is typically used, where the class predicted by the most individual models is chosen as the final prediction.                  

    * For regression tasks, the predictions from the individual models are averaged to obtain the final prediction.                     
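
The three concepts above can be sketched by hand in a few lines. This is a minimal illustration, not how scikit-learn implements bagging internally: the synthetic data from `make_classification`, the names `X_demo`, `y_demo`, and `n_models`, and the choice of 25 trees are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary majority vote cannot tie
models = []

for _ in range(n_models):
    # Bootstrapping: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Each base model is fit on its own sample, independently of the others
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    models.append(tree)

# Aggregation: majority vote over the base models (labels are 0/1 here)
all_preds = np.stack([m.predict(X_demo) for m in models])  # (n_models, n_samples)
y_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

train_accuracy = (y_pred == y_demo).mean()
print(f"Training accuracy of the hand-rolled ensemble: {train_accuracy:.3f}")
```

For a regression task, the same loop would use a regressor as the base model and replace the vote with `all_preds.mean(axis=0)`.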


### <span style="color:purple; font-weight:bold">Use Cases</span>

- Highly effective with high-variance models, especially decision trees.    
           
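- Random Forest is a variation of bagging that samples both rows and columns of the dataset: each tree is trained on a bootstrap sample of rows and considers a random subset of features at each split.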


--- 

<div style="
    display:flex;
    overflow-x:auto;
    overflow-y:hidden;
    gap:40px;
    padding:10px;
    border:1px solid #ccc;
    white-space:nowrap;
    align-items:center;
">

  <img src="./assets/images/Bagging_1.gif"
       style="
          height:650px;
          width:850px;
          max-width:none;
          max-height:none;
          margin:-10px;
          padding:10px;
          display:inline-block;
       ">

  <img src="./assets/images/DecisionTree_Vs_RandomForest.gif"
       style="
          height:650px;
          width:850px;
          max-width:none;
          max-height:none;
          margin:3px;
          padding:10px;
          display:inline-block;
       ">

   <img src="./assets/images/Sampled_Vs_NonSampled_Data.png"
       style="
          height:650px;
          width:850px;
          max-width:none;
          max-height:none;
          margin:3px;
          padding:10px;
          display:inline-block;
       ">    

</div> 




In [39]:
import pandas as pd
df = pd.read_csv('./assets/files/diabetes.csv')
df.head()

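|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |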


In [40]:
df.isnull().sum()   # Check how many entries are null in each column.

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [41]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [42]:
df.Outcome.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [43]:
X = df.drop('Outcome', axis='columns')
y = df.Outcome

**Scale features**

In [44]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled[:3]

array([[ 0.63994726,  0.84832379,  0.14964075,  0.90726993, -0.69289057,
         0.20401277,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.12339636, -0.16054575,  0.53090156, -0.69289057,
        -0.68442195, -0.36506078, -0.19067191],
       [ 1.23388019,  1.94372388, -0.26394125, -1.28821221, -0.69289057,
        -1.10325546,  0.60439732, -0.10558415]])

**Train-Test Split**

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)

In [46]:
print(f"Size of X_train: {X_train.shape} \nSize of X_test: {X_test.shape}")

Size of X_train: (576, 8) 
Size of X_test: (192, 8)
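
The cells above fit `StandardScaler` on all rows before splitting, so the scaler's mean and standard deviation also reflect the test rows. Decision trees split on thresholds and are largely insensitive to feature scaling, so this is unlikely to change the results here, but for scale-sensitive models one common alternative is to split first and put the scaler in a pipeline. The sketch below assumes synthetic data and hypothetical names (`X_demo`, `pipe`).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=10
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows when scoring
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be passed directly to `cross_val_score`, in which case the scaler is refit on the training portion of each fold.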


In [47]:
# Count how many training samples fall into each class
y_train.value_counts()

Outcome
0    375
1    201
Name: count, dtype: int64

**Let's use a Decision Tree Classifier**

In [48]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.68181818, 0.68181818, 0.68181818, 0.79738562, 0.69934641])

In [49]:
# Use the mean accuracy across the 5 folds as the summary score
scores.mean()

np.float64(0.7084373143196674)

**Let's now use a Bagging Classifier**

In [50]:
from sklearn.ensemble import BaggingClassifier

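bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model to bag
    n_estimators=100,                    # number of trees in the ensemble
    max_samples=0.8,                     # bootstrap sample size: 80% of the training rows
    oob_score=True,                      # score each row using only trees that did not see it
    random_state=0
)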

bag_model.fit(X_train, y_train)
bag_model.oob_score_

0.7534722222222222
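
The out-of-bag (OOB) score works because each tree's bootstrap sample leaves some training rows out. If a tree draws `m` rows with replacement from `n` training rows, a given row is left out of that tree with probability `(1 - 1/n)^m`. The short calculation below estimates this for the settings above; it assumes that `max_samples=0.8` is converted to a draw count by truncating `0.8 * n` to an integer.

```python
import math

n_train = 576              # rows in X_train, from the split above
m = int(0.8 * n_train)     # draws per tree (assumed truncation of 0.8 * n_train)

# Probability that a given row is never drawn in m draws with replacement
p_oob = (1 - 1 / n_train) ** m

print(f"Draws per tree: {m}")
print(f"Approximate out-of-bag fraction per tree: {p_oob:.3f}")
print(f"Large-n approximation exp(-0.8): {math.exp(-0.8):.3f}")
```

Under these assumptions, a little under half of the training rows are out-of-bag for any one tree, so with 100 trees every row is scored by many trees that never saw it.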

In [51]:
bag_model.score(X_test, y_test)

0.7760416666666666

**Let's evaluate the Bagging Classifier using cross-validation**

In [52]:
# Fresh, unfitted model with the same settings; cross_val_score refits it on each fold
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0
)

scores = cross_val_score(bag_model, X, y, cv=5)
scores.mean()

np.float64(0.7578728461081402)

**Let's evaluate Random Forest using cross-validation**

In [53]:
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
scores.mean()

np.float64(0.7695866225277991)