# Bagging and Random Forests

Bagging (bootstrap aggregation) is an ensemble method that trains the same algorithm many times, each time on a different bootstrap sample drawn with replacement from the training data. In this chapter, you'll see how bagging can be used to build a tree ensemble. You'll also learn how the random forests algorithm adds further diversity to the ensemble by randomizing the features considered at each split of the trees forming the ensemble.
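
To make the idea concrete, here is a minimal hand-rolled sketch of bagging. It is not the chapter's exercise code: the data comes from `make_classification` (a synthetic stand-in, since the liver dataset file is not assumed to be available), and the names `trees`, `votes`, and `rng` are illustrative. It also uses a plain majority vote, whereas scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the chapter's dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

rng = np.random.default_rng(1)
n_estimators = 25  # odd, so a 0/1 majority vote cannot tie
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=1).fit(X[idx], y[idx])
    trees.append(tree)

# Every tree votes on every sample; the ensemble predicts the majority class
votes = np.stack([t.predict(X) for t in trees])  # shape (n_estimators, n_samples)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Training accuracy of the hand-rolled ensemble:", (y_pred == y).mean())
```

Because each bootstrap sample leaves out some rows and repeats others, the trees differ from one another, and that variation is what the vote averages over.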

## Define the bagging classifier

In the following exercises, you'll work with the Indian Liver Patient dataset from the UCI Machine Learning Repository. Your task is to predict whether a patient suffers from liver disease using 10 features, including albumin, age, and gender. You'll do so using a bagging classifier.

In [44]:
## Preprocessing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_csv("indian_liver_patient_preprocessed.csv", index_col=0)
data.head()

| | Age_std | Total_Bilirubin_std | Direct_Bilirubin_std | Alkaline_Phosphotase_std | Alamine_Aminotransferase_std | Aspartate_Aminotransferase_std | Total_Protiens_std | Albumin_std | Albumin_and_Globulin_Ratio_std | Is_male_std | Liver_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.247403 | -0.42032 | -0.495414 | -0.42887 | -0.355832 | -0.319111 | 0.293722 | 0.203446 | -0.14739 | 0 | 1 |
| 1 | 1.062306 | 1.218936 | 1.423518 | 1.675083 | -0.093573 | -0.035962 | 0.939655 | 0.077462 | -0.648461 | 1 | 1 |
| 2 | 1.062306 | 0.640375 | 0.926017 | 0.816243 | -0.115428 | -0.146459 | 0.478274 | 0.203446 | -0.178707 | 1 | 1 |
| 3 | 0.815511 | -0.372106 | -0.388807 | -0.449416 | -0.36676 | -0.312205 | 0.293722 | 0.329431 | 0.16578 | 1 | 1 |
| 4 | 1.679294 | 0.093956 | 0.179766 | -0.395996 | -0.295731 | -0.177537 | 0.755102 | -0.930414 | -1.713237 | 1 | 1 |


In [35]:
data.shape

In [45]:
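# Set seed for reproducibility
SEED = 1

# Separate the features from the target column
X = data.drop('Liver_disease', axis=1)
y = data['Liver_disease']

# Hold out 20% of the rows for testing; stratify=y keeps the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=SEED)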
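
The effect of `stratify` in the split above can be checked on a small example. This is a sketch on made-up labels (`X_demo` and `y_demo` are hypothetical, with 70% of the samples in class 1), not on the liver dataset itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 1, 30 of class 0
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1
)

# With stratification, the share of class 1 should be nearly identical in both splits
print("Class-1 share in train:", y_tr.mean())
print("Class-1 share in test: ", y_te.mean())
```

Without `stratify`, a small test set can end up with a noticeably different class balance from the training set, which makes accuracy comparisons less reliable.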

In [46]:
X_train.head()

| | Age_std | Total_Bilirubin_std | Direct_Bilirubin_std | Alkaline_Phosphotase_std | Alamine_Aminotransferase_std | Aspartate_Aminotransferase_std | Total_Protiens_std | Albumin_std | Albumin_and_Globulin_Ratio_std | Is_male_std |
|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 0.692113 | -0.356035 | -0.353271 | -0.457635 | -0.27934 | -0.236238 | 0.385998 | 0.833369 | 0.792118 | 1 |
| 377 | -1.529043 | -0.436391 | -0.459878 | -0.367231 | -0.377687 | -0.336376 | -0.352211 | -0.174507 | -0.14739 | 0 |
| 473 | -0.17167 | -0.372106 | -0.424343 | -0.564476 | -0.23563 | -0.308752 | 0.293722 | 0.959354 | 1.105288 | 1 |
| 285 | -1.960935 | -0.291751 | -0.353271 | 1.165532 | -0.284804 | -0.298393 | 1.308759 | 0.959354 | -0.14739 | 1 |
| 358 | -0.480164 | -0.42032 | -0.459878 | -0.474072 | -0.290267 | -0.263863 | -0.813592 | -0.678445 | -0.460559 | 1 |


In [53]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

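# Instantiate bc: 100 trees, each trained on a bootstrap sample of the training set
# (scikit-learn >= 1.2 names this parameter `estimator`; older releases call it `base_estimator`)
bc = BaggingClassifier(estimator=dt, n_estimators=100, random_state=1)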

In the following exercise, you'll train bc and evaluate its test set performance.

In [57]:
# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

Test set accuracy of bc: 0.69


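For comparison, train the single decision tree dt on the same training set and evaluate it on the same test set. Its accuracy can then be set against the 0.69 achieved by bc.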

In [56]:
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)

acc_test_dt = accuracy_score(y_test, y_pred_dt)
print('Test set accuracy of dt: {:.2f}'.format(acc_test_dt))

Test set accuracy of dt: 0.63
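
The chapter introduction also mentions random forests, which add per-split feature randomization on top of bagging. The sketch below compares the three approaches side by side. It runs on synthetic data from `make_classification` (an assumption, standing in for the liver dataset), so its accuracies will not match the 0.69 and 0.63 reported above, and the settings `n_informative=5` and `max_features="sqrt"` are illustrative choices rather than values from the exercises.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic stand-in data with 10 features (an assumption; not the chapter's dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=SEED
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=SEED
)

models = {
    "single tree": DecisionTreeClassifier(random_state=SEED),
    # The base estimator is passed positionally, so this works whether the
    # parameter is called `estimator` or `base_estimator` in your version
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=SEED), n_estimators=100, random_state=SEED
    ),
    # Each split considers only a random subset of sqrt(n_features) features
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=SEED
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print("Test set accuracy of {}: {:.2f}".format(name, scores[name]))
```

On a single train/test split the ranking of the three models can vary, so treat any one run as indicative rather than conclusive.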
