# Bagging

It is short for bootstrap aggregation. The idea is to train weak classifiers on random subsets of the data: each classifier sees a bootstrap sample of the rows (drawn at random with replacement) and, optionally, a random subset of the columns. A voting mechanism then aggregates the predictions of these weak classifiers into a much stronger one. A well-known example is random forest, which builds many decision trees and aggregates their votes.
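To make the two steps concrete, here is a minimal from-scratch sketch of bagging, assuming decision trees as the weak classifiers and non-negative integer class labels. The helper names `fit_bagged_trees` and `predict_majority` are hypothetical and only illustrate the idea; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Train each tree on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n_rows, size=n_rows)  # bootstrap row indices
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[rows], y[rows]))
    return trees


def predict_majority(trees, X):
    """Aggregate by majority vote over the individual tree predictions."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_rows)
    # bincount assumes non-negative integer labels, as in iris
    return np.array([np.bincount(column).argmax() for column in votes.T])


iris_demo = load_iris()
bagged_trees = fit_bagged_trees(iris_demo.data, iris_demo.target)
bagged_pred = predict_majority(bagged_trees, iris_demo.data)
print("training accuracy:", np.mean(bagged_pred == iris_demo.target))
```

This version samples rows only; sampling columns as well would mean passing each tree a random subset of feature indices and reusing the same subset at prediction time.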

In [6]:
import numpy as np
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import pandas as pd

In [4]:
iris = load_iris()
X = iris.data
Y = iris.target

In [5]:
X_fit, X_eval, Y_fit, Y_test = model_selection.train_test_split(X,Y,test_size = 0.3,random_state=1)

In [6]:
seed = 7
# random_state only applies when shuffle=True; recent scikit-learn versions reject it otherwise
kfold = model_selection.KFold(n_splits=5)
kfold

KFold(n_splits=5, random_state=None, shuffle=False)
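To make the fold structure concrete, the sketch below prints which row indices each of the 5 folds holds out. The 10-row array `toy_rows` is a hypothetical stand-in, not data from this notebook. Without shuffling, the held-out folds are consecutive blocks, and `random_state` only has an effect when `shuffle=True`.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_rows = np.arange(10)  # hypothetical stand-in for 10 training rows

held_out = []
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy_rows)):
    # Each row lands in the held-out set of exactly one fold
    print("fold", fold, "train", train_idx.tolist(), "test", test_idx.tolist())
    held_out.extend(test_idx.tolist())
```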

In [7]:
cart = DecisionTreeClassifier()
num_trees = 100

In [8]:
# scikit-learn 1.2 renamed base_estimator to estimator; use estimator=cart on newer versions
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)
model

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=7, verbose=0, warm_start=False)
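The printed defaults (`max_samples=1.0`, `max_features=1.0`, `bootstrap_features=False`) mean every tree sees all four columns and only the rows are resampled. As a sketch of the "random rows and columns" variant, the block below subsamples both. The fractions 0.8 and 0.5 are arbitrary illustrative choices, and the base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

subsampled = BaggingClassifier(
    DecisionTreeClassifier(),   # base estimator, passed positionally
    n_estimators=10,
    max_samples=0.8,            # each tree is trained on 80% as many rows
    max_features=0.5,           # each tree sees 2 of the 4 iris columns
    bootstrap=True,             # rows are drawn with replacement
    bootstrap_features=False,   # columns are drawn without replacement
    random_state=7,
)
subsampled.fit(X_demo, y_demo)

# estimators_features_ records which columns each tree was given
for i, cols in enumerate(subsampled.estimators_features_[:3]):
    print("tree", i, "uses columns", sorted(cols.tolist()))
```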

In [9]:
results = model_selection.cross_val_score(model, X_fit, Y_fit, cv=kfold)
results

array([1.        , 0.95238095, 1.        , 0.9047619 , 0.85714286])

In [10]:
for i, score in enumerate(results):
    print("model ", i, " accuracy ", score)

model  0  accuracy  1.0
model  1  accuracy  0.9523809523809523
model  2  accuracy  1.0
model  3  accuracy  0.9047619047619048
model  4  accuracy  0.8571428571428571


In [11]:
print("Average accuracy ",results.mean())

Average accuracy  0.9428571428571428


# Boosting

It is a sequential way of combining multiple weak classifiers. A weak classifier is trained on the data, and the samples it misclassifies are given higher weight before the next weak classifier is trained. Each new classifier therefore compensates for the weaknesses of the previous ones. This continues until the ensemble is accurate enough or a set number of rounds is reached.
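The reweighting loop can be sketched by hand. The block below is a minimal version of discrete AdaBoost for two classes, assuming labels coded as -1/+1, depth-1 trees as the weak classifiers, and a synthetic dataset from `make_classification`. None of these choices come from the notebook, and scikit-learn's `AdaBoostClassifier` differs in its details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y_toy = 2 * y01 - 1  # recode labels from {0, 1} to {-1, +1}

n_rounds = 10
weights = np.full(len(y_toy), 1.0 / len(y_toy))  # start with uniform weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_toy, y_toy, sample_weight=weights)
    pred = stump.predict(X_toy)

    # Weighted error of this round, clipped to keep the log finite
    err = np.clip(np.sum(weights[pred != y_toy]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this stump's vote strength

    # Misclassified samples (y * pred == -1) are scaled up, correct ones down
    weights *= np.exp(-alpha * y_toy * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
vote = sum(a * s.predict(X_toy) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(vote >= 0, 1, -1)
print("training accuracy:", np.mean(ensemble_pred == y_toy))
```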

In [12]:
from sklearn.ensemble import AdaBoostClassifier

In [13]:
iris = load_iris()
X = iris.data
Y = iris.target

In [15]:
X_fit, X_eval, Y_fit, Y_test = model_selection.train_test_split(X,Y,test_size = 0.20, random_state=1)

In [16]:
cart = DecisionTreeClassifier()
num_trees = 25

In [17]:
# scikit-learn 1.2 renamed base_estimator to estimator; use estimator=cart on newer versions
model = AdaBoostClassifier(base_estimator=cart, n_estimators=num_trees, learning_rate=0.1)
model

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=0.1, n_estimators=25, random_state=None)

In [30]:
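# Fit the ensemble before scoring or predicting
model.fit(X_fit, Y_fit)
# staged_score is lazy: it returns a generator of per-stage accuracies
model.staged_score(X_fit, Y_fit)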

<generator object BaseWeightBoosting.staged_score at 0x107c5c620>

In [34]:
pred_label = model.predict(X_eval)
# Number of correct predictions (np.float was removed in NumPy 1.24; the builtin float is equivalent)
nnz = float(np.shape(Y_test)[0] - np.count_nonzero(pred_label - Y_test))
acc = 100*nnz/np.shape(Y_test)[0]

In [36]:
acc

96.66666666666667
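The `staged_score` call above returns a lazy generator, so nothing is computed until it is iterated. The sketch below shows one way to consume it, yielding the accuracy after each boosting stage. Depth-1 stumps are an assumption made here for illustration: a fully grown tree usually fits the training data perfectly, in which case boosting tends to stop after the first stage and there is little to plot.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.20, random_state=1)

# Base estimator passed positionally: its keyword name differs across scikit-learn versions
booster = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # assumption: stumps instead of full trees
    n_estimators=25,
    learning_rate=0.1,
    random_state=0,
)
booster.fit(X_tr, y_tr)

# Materialize the generator: one held-out accuracy per boosting stage
stage_scores = list(booster.staged_score(X_te, y_te))
for stage, score in enumerate(stage_scores, start=1):
    print("stage", stage, "accuracy", score)
```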

# Stacking

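Multiple weak or strong classifiers are trained independently on the same data. Their predictions are then used as input features for a meta-classifier (here, logistic regression), which learns how much weight to give each base model's prediction and produces the final output. The walkthrough below fits and scores every model on the same 150 samples, so the reported figures are training accuracies.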

In [37]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

In [40]:
iris = load_iris()

In [42]:
X,y = iris.data[:,1:3], iris.target

In [43]:
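def CalculateAccuracy(y_test,pred_label):
    """Return the percentage of predictions that match the true labels."""
    # A zero difference means the prediction is correct
    nnz = np.shape(y_test)[0] - np.count_nonzero(pred_label - y_test)
    acc = 100*nnz/float(np.shape(y_test)[0])
    return acc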
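As a quick check of what `CalculateAccuracy` computes, the sketch below compares the same count-of-zero-differences formula with two common alternatives on a small made-up label vector (`y_true` and `y_pred` are hypothetical). All three should agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1, 0])  # hypothetical true labels
y_pred = np.array([0, 1, 1, 2, 1, 2])  # hypothetical predictions

# Same formula as CalculateAccuracy: total minus the non-zero differences
manual = 100 * (len(y_true) - np.count_nonzero(y_pred - y_true)) / len(y_true)
# Mean of the element-wise equality mask
vectorized = 100 * np.mean(y_pred == y_true)
# scikit-learn's built-in metric returns a fraction, so scale to percent
library = 100 * accuracy_score(y_true, y_pred)

print(manual, vectorized, library)
```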

In [44]:
clf1 = KNeighborsClassifier(n_neighbors=2)
clf2 = RandomForestClassifier(n_estimators=2, random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()

In [45]:
clf1.fit(X,y)
clf2.fit(X,y)
clf3.fit(X,y)

GaussianNB(priors=None)

In [51]:
f1 = clf1.predict(X)
acc1 = CalculateAccuracy(y,f1)
acc1

96.66666666666667

In [52]:
f2 = clf2.predict(X)
acc2 = CalculateAccuracy(y,f2)
acc2

94.66666666666667

In [53]:
f3 = clf3.predict(X)
acc3 = CalculateAccuracy(y,f3)
acc3

92.0

In [57]:
f = [f1,f2,f3]
f = np.transpose(f)

In [59]:
lr.fit(f,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [60]:
final = lr.predict(f)

In [61]:
acc4 = CalculateAccuracy(y,final)

In [62]:
acc4

97.33333333333333
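The manual steps above can also be sketched with scikit-learn's `StackingClassifier` (available since version 0.22). This is an alternative under stated assumptions, not a reproduction of the walkthrough: it trains the meta-classifier on cross-validated predictions, uses class probabilities rather than hard labels as features by default, and is scored here on a held-out split. The split and `cv=5` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_s, y_s = load_iris(return_X_y=True)
X_s = X_s[:, 1:3]  # same two columns as the walkthrough

X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(
    X_s, y_s, test_size=0.3, random_state=1, stratify=y_s
)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=2)),
        ("rf", RandomForestClassifier(n_estimators=2, random_state=1)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # meta-classifier
    cv=5,  # base-model predictions for the meta-classifier come from held-out folds
)
stack.fit(X_s_train, y_s_train)

stack_acc = 100 * stack.score(X_s_test, y_s_test)
print("held-out accuracy:", stack_acc)
```

Because the score is computed on rows none of the models saw during fitting, it is not directly comparable to the training accuracies printed above.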