<h2 align='center'>Ensemble Learning: Bagging Tutorial</h2>

**We will use the Pima Indians Diabetes dataset to predict whether a person has diabetes from features such as blood pressure, skin thickness, and age. We will first train a standalone model, then apply the bagging ensemble technique to see how much it improves the model's performance.**



In [7]:
import pandas as pd

df = pd.read_csv("/kaggle/input/pima-indians-diabetes-dataset/diabetes.csv")
df.head()

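| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 6 | 148 | 72.0 | 35 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66.0 | 29 | 102.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64.0 | 32 | 169.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66.0 | 23 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40.0 | 35 | 168.0 | 43.1 | 2.288 | 33 | 1 |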


In [8]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [9]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.677083 | 72.389323 | 29.089844 | 141.753906 | 32.434635 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.464161 | 12.106039 | 8.89082 | 89.100847 | 6.880498 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.75 | 64.0 | 25.0 | 102.5 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 28.0 | 102.5 | 32.05 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 169.5 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [10]:
df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64

The classes are somewhat imbalanced (500 negative vs. 268 positive, roughly 65% to 35%), but the imbalance is mild enough that we will not correct for it here.
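To put the imbalance in numbers, the short sketch below recomputes the class shares from the counts printed above. The variable names are illustrative. The majority-class share is also the accuracy a model would get by always predicting "no diabetes", which gives a useful floor for judging the scores later on.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 500, 1: 268}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Always predicting the majority class achieves this accuracy "for free".
majority_baseline = max(shares.values())

print(f"total rows: {total}")
for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```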

<h3>Train test split</h3>

In [12]:
X = df.drop("Outcome",axis="columns")
y = df.Outcome

In [13]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled[:3]

array([[ 0.63994726,  0.86462486, -0.03218035,  0.66518138,  0.31160394,
         0.16948251,  0.46849198,  1.4259954 ],
       [-0.84488505, -1.20472661, -0.52812374, -0.01011181, -0.44084303,
        -0.84854874, -0.36506078, -0.19067191],
       [ 1.23388019,  2.01426457, -0.69343821,  0.32753478,  0.31160394,
        -1.32847775,  0.60439732, -0.10558415]])
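As a rough check on what `StandardScaler` does, the sketch below standardizes the first Glucose value by hand, using the mean and standard deviation reported by `df.describe()`. One assumption worth flagging: `describe()` reports the sample standard deviation (dividing by n - 1), while `StandardScaler` uses the population form (dividing by n), so the sketch converts between the two.

```python
import math

n_rows = 768                    # row count from df.describe()
glucose_mean = 121.677083       # mean of Glucose from df.describe()
glucose_sample_std = 30.464161  # std of Glucose from df.describe() (ddof=1)

# Convert the sample std (ddof=1) to the population std (ddof=0).
glucose_population_std = glucose_sample_std * math.sqrt((n_rows - 1) / n_rows)

first_glucose = 148  # Glucose value in row 0
z_score = (first_glucose - glucose_mean) / glucose_population_std

print(f"standardized Glucose for row 0: {z_score:.5f}")
```

The result should agree with the second entry in the first row of `X_scaled` (0.86462486), up to rounding of the inputs.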

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=10)

In [16]:
X_train.shape

(576, 8)

In [17]:
X_test.shape

(192, 8)

In [18]:
y_train.value_counts()

0    375
1    201
Name: Outcome, dtype: int64

In [19]:
y_test.value_counts()

0    125
1     67
Name: Outcome, dtype: int64
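The counts above suggest that `stratify=y` kept the class mix essentially identical in both splits. A quick sketch to verify this from the printed numbers (the dictionary and function names are illustrative):

```python
# Class counts copied from the value_counts() outputs above.
full_counts = {0: 500, 1: 268}
train_counts = {0: 375, 1: 201}
test_counts = {0: 125, 1: 67}

def positive_share(counts):
    """Fraction of rows labeled 1 (diabetic)."""
    return counts[1] / (counts[0] + counts[1])

splits = [("full", full_counts), ("train", train_counts), ("test", test_counts)]
for name, counts in splits:
    n_rows = sum(counts.values())
    print(f"{name:>5}: {n_rows} rows, positive share {positive_share(counts):.4f}")

# The test split holds 25% of the rows, the default test_size of train_test_split.
test_fraction = sum(test_counts.values()) / sum(full_counts.values())
print(f"test fraction: {test_fraction:.2f}")
```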

<h3>Train using a standalone model</h3>

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
scores

array([0.86363636, 0.8961039 , 0.88961039, 0.87581699, 0.88235294])

In [21]:
scores.mean()

0.8815041167982344
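Each score returned by `cross_val_score` is the accuracy on one held-out fold. Assuming the usual fold sizing for 768 rows and `cv=5` (three folds of 154 rows and two of 153), the sketch below converts the printed scores back into counts of correct predictions. The fold sizes are inferred from the data size, not printed by the notebook.

```python
# Fold accuracies copied from the cross_val_score output above.
fold_scores = [0.86363636, 0.8961039, 0.88961039, 0.87581699, 0.88235294]

n_rows, n_folds = 768, 5

# The first (n_rows % n_folds) folds get one extra row.
base_size, extra = divmod(n_rows, n_folds)
fold_sizes = [base_size + 1 if i < extra else base_size for i in range(n_folds)]

# Accuracy times fold size recovers the number of correct predictions.
correct_per_fold = [round(score * size) for score, size in zip(fold_scores, fold_sizes)]

pooled_accuracy = sum(correct_per_fold) / n_rows
print("fold sizes:      ", fold_sizes)
print("correct per fold:", correct_per_fold)
print(f"pooled accuracy:  {pooled_accuracy:.4f}")
```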

<h3>Train using Bagging</h3>

In [22]:
from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator before scikit-learn 1.2
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
bag_model.fit(X_train, y_train)
bag_model.oob_score_

0.8715277777777778
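To make the `max_samples=0.8` and `oob_score=True` settings more concrete, here is a simplified, standard-library sketch of one bootstrap draw. It is not scikit-learn's implementation: the helper `draw_bootstrap` is hypothetical, and the sample size of `int(0.8 * 576)` rows is an assumption about how the fraction is rounded. Rows never drawn for a given tree are "out-of-bag" for that tree and can be used to score it, which is the idea behind `oob_score_`.

```python
import random

def draw_bootstrap(n_rows, fraction, rng):
    """Draw row indices with replacement; return (sampled, out_of_bag)."""
    n_draws = int(fraction * n_rows)
    sampled = [rng.randrange(n_rows) for _ in range(n_draws)]
    out_of_bag = sorted(set(range(n_rows)) - set(sampled))
    return sampled, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_train = 576           # rows in X_train

sampled, out_of_bag = draw_bootstrap(n_train, 0.8, rng)
oob_fraction = len(out_of_bag) / n_train

print(f"rows drawn (with repeats): {len(sampled)}")
print(f"distinct rows drawn:       {len(set(sampled))}")
print(f"out-of-bag rows:           {len(out_of_bag)} ({oob_fraction:.0%})")
```

Under these assumptions, roughly 45% of the training rows are expected to be out-of-bag for any single tree, since (1 - 1/576)^460 ≈ 0.45.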

In [23]:
bag_model.score(X_test, y_test)

0.8854166666666666
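Once the trees are trained, the ensemble has to combine their outputs. The sketch below shows plain majority voting over hypothetical per-tree predictions. It is a simplification: `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, which typically agrees with voting when each tree outputs hard 0/1 labels.

```python
from collections import Counter

def majority_vote(predictions_per_tree):
    """Combine per-tree label lists into one label per sample."""
    combined = []
    # zip(*...) regroups the labels so each tuple holds all votes for one sample.
    for votes in zip(*predictions_per_tree):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions from three trees for four samples.
tree_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]

ensemble_prediction = majority_vote(tree_predictions)
print(ensemble_prediction)
```

Using an odd number of voters with two classes avoids ties in this toy setup. Individual trees disagree on the first three samples, yet the vote still settles on a single label for each.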

In [24]:
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # called base_estimator before scikit-learn 1.2
    n_estimators=100, 
    max_samples=0.8, 
    oob_score=True,
    random_state=0
)
scores = cross_val_score(bag_model, X, y, cv=5)
scores

array([0.88311688, 0.85714286, 0.88311688, 0.90849673, 0.90196078])

In [25]:
scores.mean()

0.8867668279432985

With bagging, the mean 5-fold cross-validation accuracy rises slightly, from about 0.882 for the standalone decision tree to about 0.887.
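Because the difference between the two means is small, it helps to look at the individual folds as well. A minimal sketch using the scores printed above (the variable names are illustrative):

```python
from statistics import mean, stdev

# Fold accuracies copied from the two cross_val_score outputs above.
tree_scores = [0.86363636, 0.8961039, 0.88961039, 0.87581699, 0.88235294]
bagging_scores = [0.88311688, 0.85714286, 0.88311688, 0.90849673, 0.90196078]

mean_gain = mean(bagging_scores) - mean(tree_scores)
folds_won = sum(b > t for b, t in zip(bagging_scores, tree_scores))

print(f"mean gain from bagging: {mean_gain:.4f}")
print(f"folds where bagging wins: {folds_won} of {len(tree_scores)}")
print(f"fold-to-fold std (tree):    {stdev(tree_scores):.4f}")
print(f"fold-to-fold std (bagging): {stdev(bagging_scores):.4f}")
```

Bagging wins three of the five folds, and the gain of about half a percentage point is smaller than the fold-to-fold spread of either model, so the result here is best read as modest rather than conclusive.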

<h3>Train using Random Forest</h3>

In [26]:
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=5)
scores.mean()

0.8867753161870808
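A random forest is bagging of decision trees with one extra source of randomness: each split considers only a random subset of the features. The standard-library sketch below illustrates that single step. The helper `candidate_features` is hypothetical, and using the square root of the feature count is an assumption based on the usual default for classification forests.

```python
import math
import random

FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def candidate_features(features, rng):
    """Pick the random feature subset that one split is allowed to consider."""
    n_candidates = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, n_candidates)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Each split draws a fresh subset, which helps decorrelate the trees.
for split_number in range(3):
    print(f"split {split_number}: {candidate_features(FEATURES, rng)}")
```

On this dataset the forest's mean cross-validation score (0.8868) is essentially the same as that of the bagged trees (0.8868).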