<h1>Bagging</h1>

<h3>Bootstrapping and Aggregation</h3> <br>

-> Unlike voting, the models are not all trained on the same data<br>
-> Each model is trained on its own random sample drawn with replacement from the training data; this step is called bootstrapping <br>
-> The final prediction is chosen by majority vote across the models; this step is called aggregation <br>
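The two steps above can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements bagging: the helper names `bootstrap_sample` and `majority_vote`, the ten stand-in rows, and the three hard-coded votes are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (hypothetical helper)."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Return the most common label among the models' predictions (hypothetical helper)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-in for ten training rows

# Bootstrapping: each model gets its own resampled view of the data,
# so some rows repeat and others are left out.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"model {i} trains on rows {sorted(sample)}")

# Aggregation: combine one prediction per model for a single test row.
votes = [1, 0, 1]  # made-up predictions from three models
print("aggregated prediction:", majority_vote(votes))
```

Each printed sample has the same size as the original data but usually contains duplicates, which is what makes the models differ from one another.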

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Datasets/diabetes.csv')

In [3]:
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [5]:
df.shape

(768, 9)

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

In [7]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=2)

In [9]:
dt = DecisionTreeClassifier(random_state=24)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

In [10]:
print('DT accuracy: ', accuracy_score(y_test, y_pred))

DT accuracy:  0.7402597402597403


<h2>Bagging</h2>

In [16]:
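bag = BaggingClassifier(
    estimator = DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators = 200,   # number of trees in the ensemble
    max_samples = 0.50,   # each tree is trained on a sample half the size of X_train
    bootstrap = True,     # rows are drawn with replacement
    random_state=24
)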

In [17]:
bag.fit(X_train, y_train)

In [18]:
y_pred_bag = bag.predict(X_test)

In [19]:
print('Bagging score: ', accuracy_score(y_test, y_pred_bag))

Bagging score:  0.7597402597402597


<h3>On this train/test split, accuracy improves from about 0.74 for the single decision tree to about 0.76 with bagging</h3>
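A difference of this size on a single split can partly reflect the split itself, so a cross-validated comparison is a useful check. The sketch below assumes synthetic data from `make_classification` as a stand-in for the diabetes dataset; the names `X_demo`, `y_demo`, `tree`, and `bagged` are illustrative, and the scores it prints will not match the numbers in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagged = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=25,
    max_samples=0.5,
    bootstrap=True,
    random_state=0,
)

# Five-fold cross-validation; the default score for classifiers is accuracy.
tree_scores = cross_val_score(tree, X_demo, y_demo, cv=5)
bag_scores = cross_val_score(bagged, X_demo, y_demo, cv=5)

print("single tree, mean accuracy:", tree_scores.mean())
print("bagged trees, mean accuracy:", bag_scores.mean())
```

The same two `cross_val_score` calls could be run on `X` and `y` from this notebook to see whether the improvement holds across folds.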

In [20]:
bag_svc = BaggingClassifier(
    estimator = SVC(),  # named base_estimator before scikit-learn 1.2
    n_estimators = 200,
    max_samples = 0.5,
    bootstrap= True,
    random_state=24
)

In [21]:
bag_svc.fit(X_train, y_train)

In [22]:
y_pred_svc = bag_svc.predict(X_test)

In [23]:
print('SVC Bagging: ', accuracy_score(y_test, y_pred_svc))

SVC Bagging:  0.7662337662337663


<h2>Pasting</h2>

Pasting is a variant of bagging in which rows are sampled without replacement, i.e., bootstrap=False
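The difference between the two sampling modes can be seen with the standard library. This is a small sketch under assumed values: 100 stand-in rows and a sample size of 25, which mirrors `max_samples=0.25`.

```python
import random

rng = random.Random(0)
rows = list(range(100))  # stand-in for 100 training rows
k = 25                   # sample size, as with max_samples=0.25 on 100 rows

# Bagging (bootstrap=True): rows are drawn with replacement, so repeats can occur.
with_replacement = rng.choices(rows, k=k)

# Pasting (bootstrap=False): rows are drawn without replacement, so each row
# appears at most once in the sample.
without_replacement = rng.sample(rows, k)

print("unique rows with replacement:   ", len(set(with_replacement)))
print("unique rows without replacement:", len(set(without_replacement)))
```

The pasting sample always contains exactly `k` distinct rows, while the bagging sample may contain fewer.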

In [24]:
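bag_p = BaggingClassifier(
    estimator = DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=200,
    max_samples=0.25,    # each tree sees a quarter of the training rows
    bootstrap=False,     # rows are drawn without replacement (pasting)
    random_state=24,
    verbose=1
)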

In [25]:
bag_p.fit(X_train, y_train)

y_pred = bag_p.predict(X_test)
print("Pasting Bagging", accuracy_score(y_test, y_pred))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Pasting Bagging 0.7597402597402597


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


<h2>Random Subspaces</h2>

In this bagging technique only the features are sampled, not the rows: every model sees all rows (max_samples=1.0, bootstrap=False) but a random subset of the columns (max_features=0.5, bootstrap_features=True)
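To see which columns each model actually receives, the fitted ensemble exposes an `estimators_features_` attribute. The sketch below assumes synthetic data with eight features (the same number of predictors as the diabetes dataset) and only five trees to keep the output short; `X_demo`, `y_demo`, and `subspaces` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

subspaces = BaggingClassifier(
    DecisionTreeClassifier(),   # passed positionally; the keyword name varies by version
    n_estimators=5,
    max_samples=1.0,            # every tree sees all rows
    bootstrap=False,
    max_features=0.5,           # 4 of the 8 columns per tree
    bootstrap_features=True,    # columns drawn with replacement, so repeats can occur
    random_state=0,
)
subspaces.fit(X_demo, y_demo)

# One array of column indices per fitted tree.
for i, cols in enumerate(subspaces.estimators_features_):
    print(f"tree {i} uses feature columns {cols.tolist()}")
```

Because `bootstrap_features=True` draws columns with replacement, a tree may receive the same column more than once; setting it to `False` would give each tree four distinct columns.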

In [27]:
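bag_rs = BaggingClassifier(
    estimator = DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=200,
    max_samples=1.0,          # every tree sees all training rows
    bootstrap=False,
    random_state=24,
    max_features=0.5,         # each tree sees half of the columns
    bootstrap_features=True   # columns are drawn with replacement
)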

In [28]:
bag_rs.fit(X_train, y_train)

y_pred = bag_rs.predict(X_test)
print("Random Subspaces", accuracy_score(y_test, y_pred))

Random Subspaces 0.7532467532467533


<h2>Random Patches</h2>

In this bagging technique, both rows and features are sampled

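In [1]:
# Imports repeated so this cell also runs in a fresh kernel; the cells that
# load the data and create X_train/y_train must be re-run as well.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_rp = BaggingClassifier(
    estimator = DecisionTreeClassifier(),  # named base_estimator before scikit-learn 1.2
    n_estimators=200,
    max_samples=0.5,
    bootstrap=True,
    random_state=24,
    max_features=0.5,
    bootstrap_features=True
)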

In [None]:
bag_rp.fit(X_train, y_train)

y_pred = bag_rp.predict(X_test)
print("Random Patches", accuracy_score(y_test, y_pred))
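The four variants in this notebook differ only in how rows and features are sampled, so they can be written as four parameter sets for the same `BaggingClassifier`. The following sketch makes that comparison on synthetic data from `make_classification`, which is an assumption standing in for the diabetes dataset; the `variants` and `results` dictionaries are illustrative, and the ensemble is kept at 25 trees so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# Sampling settings for each variant, matching the cells in this notebook.
variants = {
    "bagging": dict(max_samples=0.5, bootstrap=True),
    "pasting": dict(max_samples=0.25, bootstrap=False),
    "random subspaces": dict(
        max_samples=1.0, bootstrap=False, max_features=0.5, bootstrap_features=True
    ),
    "random patches": dict(
        max_samples=0.5, bootstrap=True, max_features=0.5, bootstrap_features=True
    ),
}

results = {}
for name, params in variants.items():
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions.
    model = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=24, **params
    )
    model.fit(Xd_train, yd_train)
    results[name] = accuracy_score(yd_test, model.predict(Xd_test))

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Replacing the synthetic split with `X_train`, `X_test`, `y_train`, and `y_test` from this notebook would run the same comparison on the diabetes data.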