### ===Task===

#### Out of Bag Evaluation

Our bagging technique seems to work quite well.  One interesting observation is that each tree sees only a subset of the dataset. Any data that a particular tree did not see is called **out of bag** (oob).  Note that the oob set is not the same for all predictors.

Since a tree never sees its oob samples during training, they act as a kind of validation set for that tree.  So after we fit each tree, we can ask it to measure its accuracy on its own oob samples, and then average the accuracy over all trees.
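
To make the idea concrete, here is a minimal sketch of how the oob rows of a single tree can be found. The toy size `m = 10`, the seed, and the variable names `in_bag` and `oob` are assumptions chosen for illustration; they are not part of the scratch code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
m = 10                          # toy number of training rows (assumption)

# Draw a bootstrap sample of row indices (with replacement).
in_bag = rng.integers(0, m, size=m)

# Rows that were never drawn are this tree's out-of-bag rows.
oob = np.setdiff1d(np.arange(m), in_bag)

print("in bag:", np.sort(in_bag))
print("oob   :", oob)
```

When the bootstrap sample has the same size as the training set, each row is left out with probability (1 - 1/m)^m, which approaches 1/e ≈ 0.368 for large m, so roughly a third of the rows end up out of bag for each tree.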

Let's modify the above scratch code to:
- Calculate the oob score for each bootstrapped dataset, and also the average score
- Change the code to sample "without replacement"
- Put everything into a class <code>Bagging</code>.  It should have at least two methods, <code>fit(X_train, y_train)</code>, and <code>predict(X_test)</code>
- Modify the code from above to randomize features.  Set the number of features to be used to <code>sqrt(n)</code>, where <code>n</code> is the total number of features.  This can be easily done by setting our DecisionTreeClassifier <code>max_features</code> to 'sqrt'.  Note that scikit-learn then draws a random subset of features at each split, rather than once per tree
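
As a rough sketch of the second and fourth items, sampling without replacement and picking a feature subset can both be done with NumPy's `Generator.choice`. The sizes below (`m = 105`, `n = 4`, ratio `0.8`) mirror the iris split used in this notebook, while the names `idx`, `oob_idx`, `n_sub`, and `feat_idx` are assumptions for illustration. The explicit per-tree feature subset is only one possible reading of the task.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen only for reproducibility
m, n = 105, 4                     # rows and features of the iris training split
sample_size = int(0.8 * m)

# Without replacement: every row index appears at most once.
idx = rng.choice(m, size=sample_size, replace=False)

# Because nothing repeats, the oob set is exactly the remaining rows.
oob_idx = np.setdiff1d(np.arange(m), idx)

# A feature subset of size sqrt(n) for one tree (hypothetical helper logic).
n_sub = max(1, int(np.sqrt(n)))
feat_idx = rng.choice(n, size=n_sub, replace=False)

print(len(idx), len(oob_idx), feat_idx)
```

One consequence of sampling without replacement is that the oob set always has `m - sample_size` rows, so every tree is scored on the same number of samples.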

In [76]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                test_size=0.3, shuffle=True, random_state=42)

In [72]:
from sklearn.tree import DecisionTreeClassifier
import random
from scipy import stats
from sklearn.metrics import classification_report, accuracy_score

B = 5
m, n = X_train.shape
boostrap_ratio = 0.8
tree_params = {'max_depth': 2, 'criterion':'gini', 'min_samples_split': 5}
models = [DecisionTreeClassifier(**tree_params) for _ in range(B)]

#sample size for each tree
sample_size = int(boostrap_ratio * len(X_train))

xsamples = np.zeros((B, sample_size, n))
ysamples = np.zeros((B, sample_size))

x_oob = []
y_oob = []

wo_rpm = False  # set to True to sample without replacement
idx_list = []
#subsamples for each model
for i in range(B):
    # draw sample_size row indices for the ith predictor; by default this is
    # sampling with replacement, i.e., a row can occur more than once
    idx_list.append([])
    for j in range(sample_size):
        idx = random.randrange(m)
        if wo_rpm:
            # without replacement: redraw until we get an unused index
            while idx in idx_list[i]:
                idx = random.randrange(m)
        idx_list[i].append(idx)
        xsamples[i, j, :] = X_train[idx]
        ysamples[i, j] = y_train[idx]
    # rows whose idx was never drawn for the ith tree form its oob set
    x_oob.append(np.delete(X_train, idx_list[i], axis=0))
    y_oob.append(np.delete(y_train, idx_list[i], axis=0))

#fitting each estimator
for i, model in enumerate(models):
    _X = xsamples[i, :]
    _y = ysamples[i, :]
    model.fit(_X, _y)

# score each tree on its own oob samples, then average the scores
predictions = []
acc = np.zeros(B)
for i in range(B):
    yhat = models[i].predict(x_oob[i])
    predictions.append(yhat)
    acc[i] = accuracy_score(y_oob[i], yhat)
avg_score = np.average(acc)
print(avg_score)



0.9031884057971015


In [104]:
class Bagging:
    def __init__(self, B=5, boostrap_ratio=0.8, wo_rpm=True, max_features='sqrt'):
        self.B = B
        self.boostrap_ratio = boostrap_ratio
        self.wo_rpm = wo_rpm
        self.max_features = max_features

    def fit(self, X_train, y_train):
        m, n = X_train.shape
        tree_params = {
            'max_depth': 2, 
            'criterion':'gini', 
            'max_features': self.max_features}
        self.models = [DecisionTreeClassifier(**tree_params) for _ in range(self.B)]
        sample_size = int(self.boostrap_ratio * len(X_train))
        xsamples = np.zeros((self.B, sample_size, n))
        ysamples = np.zeros((self.B, sample_size))
        # subsamples for each model
        x_oob = []
        y_oob = []
        idx_list = []
        for i in range(self.B):
            idx_list.append([])
            for j in range(sample_size):
                idx = random.randrange(m)
                if (self.wo_rpm):
                    while idx in idx_list[i]:
                        idx = random.randrange(m)
                idx_list[i].append(idx)
                xsamples[i, j, :] = X_train[idx]
                ysamples[i, j] = y_train[idx]
            x_oob.append(np.delete(X_train, idx_list[i], axis=0))
            y_oob.append(np.delete(y_train, idx_list[i], axis=0))
        # fit each model
        for i, model in enumerate(self.models):
            _X = xsamples[i, :]
            _y = ysamples[i, :]
            model.fit(_X, _y)
        # find the average score of OOB
        acc = np.zeros(self.B)
        for i in range(self.B):
            yhat = self.models[i].predict(x_oob[i])
            acc[i] = accuracy_score(y_oob[i], yhat)
            print("Tree", i, ":", acc[i])
        avg_score = np.average(acc)
        print("Average score of OOB =",avg_score)

    def predict(self, X_test):
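        predictions = np.zeros((self.B, X_test.shape[0]))
        for i, model in enumerate(self.models):
            yhat = model.predict(X_test)
            predictions[i, :] = yhat

        # majority vote across trees for each test sample; np.ravel keeps the
        # result 1-D whether or not this SciPy version keeps the reduced axis
        yhat = np.ravel(stats.mode(predictions, axis=0)[0])
        return yhat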
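
The voting step in <code>predict</code> takes the most frequent label per test sample across all trees. The sketch below shows the same idea with plain NumPy on a toy prediction matrix; the function name `majority_vote` and the toy values are assumptions for illustration, not part of the class above.

```python
import numpy as np

# Rows are trees, columns are test samples (hypothetical toy predictions).
predictions = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
])

def majority_vote(preds):
    """Return the most frequent label in each column.

    Ties go to the smaller label, because argmax returns the first maximum.
    """
    preds = preds.astype(int)  # bincount needs non-negative integers
    return np.array([np.bincount(col).argmax() for col in preds.T])

print(majority_vote(predictions))  # [0 1 2 1]
```

This assumes the class labels are non-negative integers, as they are for iris.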

In [105]:
model = Bagging(B=5, boostrap_ratio=0.8, wo_rpm=True, max_features='sqrt')
model.fit(X_train, y_train)

Tree 0 : 0.9047619047619048
Tree 1 : 0.9523809523809523
Tree 2 : 0.9523809523809523
Tree 3 : 0.9523809523809523
Tree 4 : 0.8571428571428571
Average score of OOB = 0.9238095238095237


In [106]:
yhat = model.predict(X_test)
print(classification_report(y_test, yhat))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

