# Bagging Classifier Algorithm

- Machine Learning uses several techniques to build models and improve their performance. 

- Ensemble learning methods help improve the accuracy of classification and regression models.
    
    This notebook walks through the ' Bagging Classifier Algorithm '.

## What is Ensemble Learning?

- Ensemble learning is a widely used machine learning technique in which multiple individual models,
  often called base models, are combined to produce a single, stronger prediction model.
  
- The ' Random Forest Algorithm ' is an example of Ensemble learning.
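
As a small illustration of combining base models, the sketch below takes a majority vote over the outputs of three models. The prediction values are hypothetical and chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical 0/1 predictions from three base models on five samples.
preds = np.array([
    [1, 0, 1, 1, 0],  # base model A
    [1, 1, 0, 1, 0],  # base model B
    [0, 0, 1, 1, 1],  # base model C
])

# Majority vote: a sample is labelled 1 when more than half of the models say 1.
votes_for_one = preds.sum(axis=0)
ensemble_pred = (votes_for_one > preds.shape[0] / 2).astype(int)
print(ensemble_pred)
```

Using an odd number of models avoids ties in a two-class vote.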

## What is Bagging Classifier?

- Bagging, short for ' Bootstrap Aggregating ', is an ensemble learning technique that helps 
  improve the performance and accuracy of machine learning algorithms.
  
- It addresses the bias-variance trade-off by reducing the variance of a prediction model.

- Bagging helps reduce overfitting and is used for both regression and classification models, most commonly with the 
  ' Decision tree algorithm '.
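
A standard way to see the variance reduction (added here as an illustration, with symbols that do not appear elsewhere in this notebook): assume $B$ base models $f_1, \dots, f_B$ whose predictions at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as $B$ grows, while the first term remains. This is why bagging helps most when the base models are not strongly correlated with each other.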

<img src="https://dataaspirant.com/wp-content/uploads/2020/09/5-Bagging-ensemble-method.png"/>

### What is Bootstrapping?

- Bootstrapping is the method of randomly drawing samples from a dataset with replacement, typically to estimate
  a population parameter.
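
A minimal sketch of drawing one bootstrap sample with NumPy. The toy data and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
population = np.arange(10)      # toy data: the integers 0..9

# Draw a bootstrap sample: same size as the data, sampled WITH replacement.
sample = rng.choice(population, size=population.size, replace=True)

# Observations that were never drawn are "out-of-bag" for this sample.
out_of_bag = np.setdiff1d(population, sample)

print("bootstrap sample:", sample)
print("out-of-bag:", out_of_bag)
```

Because sampling is with replacement, some values repeat and others are left out. For large datasets, roughly 63% of the distinct observations end up in any one bootstrap sample.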
  

<img src="https://sebastianraschka.com/images/blog/2016/model-evaluation-selection-part2/bootrap_concept.png"/>

## Steps to perform Bagging:

- Consider there are 'n' observations and 'm' features in the training set. A random sample is selected from 
  the training dataset with replacement (a bootstrap sample).
  
    
- A subset of the 'm' features is chosen randomly, and a model is built on the sampled observations using those features.


- The feature offering the best split out of the lot is used to split the nodes.


- The tree is grown by repeating this split at each node, starting from the root.


- These steps are repeated once for each base model. The outputs of the individual decision trees are then aggregated (majority vote for classification, averaging for regression) to give the final prediction.
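
The steps above can be sketched by hand as follows. This is a simplified illustration under stated assumptions: it uses synthetic data from `make_classification`, gives every tree all of the features (no random feature subset), and combines the trees by majority vote. It is not how `BaggingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real training set (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_estimators = 25  # odd, so a two-class majority vote cannot tie
trees = []

for _ in range(n_estimators):
    # Bootstrap: draw row indices with replacement, same size as the training set.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregate: majority vote over the individual tree predictions.
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_estimators, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

print("bagged accuracy:", (bagged_pred == y_test).mean())
```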

## Advantages of Bagging in Machine Learning:

- Bagging reduces overfitting.


- It often improves the model's accuracy.


- It handles high-dimensional data well.

In [1]:
# import required libraries..

import numpy as np
import pandas as pd

In [3]:
# load the heart disease dataset..

df = pd.read_csv("heart disease.zip")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [4]:
df.shape

(918, 12)

In [8]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


### Remove outliers using the 3-standard-deviation method:

In [9]:
# find the mean value of cholesterol column..

df.Cholesterol.mean()

198.7995642701525

In [10]:
# find out the standard deviation of cholesterol column..

df.Cholesterol.std()

109.38414455220337

In [32]:
# find out the outliers of cholesterol column..

df[df.Cholesterol>(df.Cholesterol.mean()+3*df.Cholesterol.std())]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
76,32,M,ASY,118,529,0,Normal,130,N,0.0,Flat,1
149,54,M,ASY,130,603,1,Normal,125,Y,1.0,Flat,1
616,67,F,NAP,115,564,0,LVH,160,N,1.6,Flat,0


In [33]:
# shape of the dataframe before removing the outliers..

df.shape

(918, 12)

In [34]:
# create the new dataframe without outliers..

df1 = df[df.Cholesterol<=(df.Cholesterol.mean()+3*df.Cholesterol.std())]

In [35]:
# after removing the outliers the shape of the new dataframe..

df1.shape

(915, 12)

In [36]:
# find the outliers of the RestingBP column..

df[df.RestingBP>(df.RestingBP.mean()+3*df.RestingBP.std())]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
109,39,M,ATA,190,241,0,Normal,106,N,0.0,Up,0
241,54,M,ASY,200,198,0,Normal,142,Y,2.0,Flat,1
365,64,F,ASY,200,0,0,Normal,140,Y,1.0,Flat,1
399,61,M,NAP,200,0,1,ST,70,N,0.0,Flat,1
592,61,M,ASY,190,287,1,LVH,150,Y,2.0,Down,1
732,56,F,ASY,200,288,1,LVH,133,Y,4.0,Down,1
759,54,M,ATA,192,283,0,LVH,195,N,0.0,Up,1


In [37]:
# remove the outliers of the RestingBP column by creating a new dataframe..

df2 = df1[df1.RestingBP<=(df1.RestingBP.mean()+3*df1.RestingBP.std())]

In [38]:
# after removing the outliers of RestingBP column..

df2.shape

(908, 12)

In [39]:
# find the outliers of the MaxHR column..

df[df.MaxHR>(df.MaxHR.mean()+3*df.MaxHR.std())]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease


In [40]:
# find the outliers of the FastingBS column..

df[df.FastingBS>(df.FastingBS.mean()+3*df.FastingBS.std())]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease


In [41]:
# find the outliers of the Oldpeak column..

df[df.Oldpeak>(df.Oldpeak.mean()+3*df.Oldpeak.std())]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
166,50,M,ASY,140,231,0,ST,140,Y,5.0,Flat,1
702,59,M,TA,178,270,0,LVH,145,N,4.2,Down,0
771,55,M,ASY,140,217,0,Normal,111,Y,5.6,Down,1
791,51,M,ASY,140,298,0,Normal,122,Y,4.2,Flat,1
850,62,F,ASY,160,164,0,LVH,145,N,6.2,Down,1
900,58,M,ASY,114,318,0,ST,140,N,4.4,Down,1


In [42]:
# remove the outliers of the Oldpeak column by creating a new dataframe..

df3 = df2[df2.Oldpeak<=(df2.Oldpeak.mean()+3*df2.Oldpeak.std())]

In [43]:
df3.shape

(902, 12)
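
The filtering cells above repeat one pattern, so it could be wrapped in a small helper. The function name `drop_upper_outliers` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_upper_outliers(frame: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
    """Keep rows whose `column` value is at most mean + n_std * std."""
    upper = frame[column].mean() + n_std * frame[column].std()
    return frame[frame[column] <= upper]

# Toy data (hypothetical values): one value far above the rest.
toy = pd.DataFrame({"Cholesterol": [200] * 19 + [1000]})
cleaned = drop_upper_outliers(toy, "Cholesterol")
print(toy.shape, cleaned.shape)
```

Like the cells above, this only filters the upper tail. The `describe()` output shows a minimum of 0 for both RestingBP and Cholesterol, and an upper-tail filter leaves those rows in place.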

In [44]:
# find the unique values in the ChestPainType column using the unique() function..

df.ChestPainType.unique()

array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [45]:
# find the unique values in the RestingECG column..

df.RestingECG.unique()

array(['Normal', 'ST', 'LVH'], dtype=object)

In [47]:
# find the unique values in the ST_Slope column..

df.ST_Slope.unique()

array(['Up', 'Flat', 'Down'], dtype=object)

In [48]:
df.ExerciseAngina.unique()

array(['N', 'Y'], dtype=object)

### Encode categorical values using Label Encoding:

In [54]:
df4 = df3.copy()

# map each category to an integer code; assigning the result back avoids
# calling replace(..., inplace=True) on a column selected from the dataframe
df4["ChestPainType"] = df4["ChestPainType"].map({'ATA': 1, 'NAP': 2, 'ASY': 3, 'TA': 4})
df4["ExerciseAngina"] = df4["ExerciseAngina"].map({'N': 0, 'Y': 1})
df4["ST_Slope"] = df4["ST_Slope"].map({'Up': 1, 'Flat': 2, 'Down': 3})
df4["RestingECG"] = df4["RestingECG"].map({'Normal': 1, 'ST': 2, 'LVH': 3})
df4["Sex"] = df4["Sex"].map({'M': 1, 'F': 2})

df4.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,1,1,140,289,0,1,172,0,0.0,1,0
1,49,2,2,160,180,0,1,156,0,1.0,2,1
2,37,1,1,130,283,0,2,98,0,0.0,1,0
3,48,2,3,138,214,0,1,108,1,1.5,2,1
4,54,1,2,150,195,0,1,122,0,0.0,1,0
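
The cell above assigns a numeric code to each category by hand, which is label encoding. One-hot encoding is an alternative that avoids implying an order between categories. A minimal sketch with `pd.get_dummies` on a toy dataframe follows. The toy rows are made up and only reuse two column names and their category labels from the dataset.

```python
import pandas as pd

# Toy dataframe reusing two column names and category labels from the dataset.
toy = pd.DataFrame({
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "ST_Slope": ["Up", "Flat", "Down", "Up"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["ChestPainType", "ST_Slope"], dtype=int)
print(encoded.columns.tolist())
```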


In [57]:
# separate the features (x) from the target (y)..

x = df4.drop("HeartDisease",axis='columns')
y = df4.HeartDisease

x.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
0,40,1,1,140,289,0,1,172,0,0.0,1
1,49,2,2,160,180,0,1,156,0,1.0,2
2,37,1,1,130,283,0,2,98,0,0.0,1
3,48,2,3,138,214,0,1,108,1,1.5,2
4,54,1,2,150,195,0,1,122,0,0.0,1


In [58]:
# standardize the features with StandardScaler..

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled

array([[-1.42896269, -0.51485643, -1.6990351 , ..., -0.82065181,
        -0.84676261, -1.0456339 ],
       [-0.47545956,  1.94228905, -0.52558207, ..., -0.82065181,
         0.14079864,  0.62072967],
       [-1.74679706, -0.51485643, -1.6990351 , ..., -0.82065181,
        -0.84676261, -1.0456339 ],
       ...,
       [ 0.37209878, -0.51485643,  0.64787097, ...,  1.21854359,
         0.33831089,  0.62072967],
       [ 0.37209878,  1.94228905, -1.6990351 , ..., -0.82065181,
        -0.84676261,  0.62072967],
       [-1.64085227, -0.51485643, -0.52558207, ..., -0.82065181,
        -0.84676261, -1.0456339 ]])
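
For reference, `StandardScaler` subtracts each column's mean and divides by its standard deviation. The small check below, on made-up values, compares it against the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy single-column data

scaled = StandardScaler().fit_transform(values)

# Manual version: StandardScaler uses the population standard deviation (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```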

In [59]:
# split the data into training and test sets..

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_scaled,y,test_size=0.2,random_state=10)

In [60]:
# check the shapes of the training and test sets..

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((721, 11), (181, 11), (721,), (181,))

## Train models with different algorithms, with and without Bagging, and compare accuracy:

### Train a standalone Support Vector Machine (SVM), then the same model with Bagging:

In [61]:
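# evaluate a standalone SVM with 5-fold cross-validation..
# note: the score below comes from the unscaled features `x`, not `x_scaled`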

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

cvs = cross_val_score(SVC(), x, y, cv=5)
cvs.mean()

0.6906445672191528
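
The cross-validation call above is given `x` rather than `x_scaled`. One way to evaluate an SVM on standardized features is to put the scaler and the model in a pipeline, so the scaler is refit on the training part of every fold. The sketch below uses synthetic data as a stand-in, so its score is not comparable to the numbers in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the heart disease dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The pipeline fits the scaler on the training folds only, then scales the held-out fold.
model = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```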

#### Use Bagging with SVM:

In [63]:
from sklearn.ensemble import BaggingClassifier

# scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `base_estimator` on older versions
bag_model = BaggingClassifier(estimator=SVC(), n_estimators=100, max_samples=0.8, random_state=0)
cvs = cross_val_score(bag_model,x,y,cv=5)
cvs.mean()

0.6839656230816453

#### ! With SVC() as the base model, Bagging does not improve the accuracy (about 69% standalone vs. about 68% with Bagging).

### Train a standalone Decision Tree, then the same model with Bagging:

In [69]:
from sklearn.tree import DecisionTreeClassifier

cvs = cross_val_score(DecisionTreeClassifier(random_state=0), x, y, cv=5)
cvs.mean()

0.7449600982197667

#### Now using Bagging:

In [66]:
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `base_estimator` on older versions
bag_model = BaggingClassifier(
    estimator = DecisionTreeClassifier(random_state=0),
    n_estimators = 100,
    max_samples = 0.8,
    oob_score = True,
    random_state = 0
)

cvs = cross_val_score(bag_model, x, y, cv=5)
cvs.mean()

0.8047943523634131
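
The cell above sets `oob_score=True`, but `cross_val_score` only returns the fold scores and never exposes the out-of-bag estimate. To read it, the model has to be fitted directly and its `oob_score_` attribute inspected. The sketch below does this on synthetic data, which is an assumption for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart disease dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The base model is passed positionally, so the call does not depend on whether
# the installed scikit-learn names the keyword `estimator` or `base_estimator`.
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,
    random_state=0,
)
bag.fit(X, y)

# Accuracy estimated on the rows each tree did not see during training.
print(bag.oob_score_)
```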

#### ! Here the model accuracy increases from about 74% to about 80% when the Decision Tree is used with Bagging.

### Train a standalone Random Forest, then the same model with Bagging:

In [70]:
from sklearn.ensemble import RandomForestClassifier

cvs = cross_val_score(RandomForestClassifier(), x, y, cv=5)
cvs.mean()

0.8159177409453653

#### Using Bagging with RandomForestClassifier:

In [73]:
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; use `base_estimator` on older versions
bag_model = BaggingClassifier(estimator=RandomForestClassifier(), random_state=0)
cvs = cross_val_score(bag_model, x, y, cv=5)
cvs.mean()

0.8358440761203193

#### ! Different algorithms were compared here to see which gives the best accuracy score. From these results, Random Forest with Bagging gives the best result, with about 83.6% accuracy.