<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Bagging 
---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Import-the-car-evaluation-data." data-toc-modified-id="1.-Import-the-car-evaluation-data.-1">1. Import the car evaluation data.</a></span></li><li><span><a href="#2.-Encode-the-features-properly" data-toc-modified-id="2.-Encode-the-features-properly-2">2. Encode the features properly</a></span></li><li><span><a href="#3.-Create-a-train-test-split-and-cross-validate-a-KNN-classifier" data-toc-modified-id="3.-Create-a-train-test-split-and-cross-validate-a-KNN-classifier-3">3. Create a train-test split and cross-validate a KNN classifier</a></span></li><li><span><a href="#4.-Research-and-describe-the-max_samples-and-max_features-hyperparameters-of-the-bagging-classifier" data-toc-modified-id="4.-Research-and-describe-the-max_samples-and-max_features-hyperparameters-of-the-bagging-classifier-4">4. Research and describe the <code>max_samples</code> and <code>max_features</code> hyperparameters of the bagging classifier</a></span></li><li><span><a href="#5.-Fit-a-BaggingClassifier-with-a-KNN-base-estimator" data-toc-modified-id="5.-Fit-a-BaggingClassifier-with-a-KNN-base-estimator-5">5. Fit a <code>BaggingClassifier</code> with a KNN base estimator</a></span></li><li><span><a href="#6.-Cross-validate-a-decision-tree-classifier" data-toc-modified-id="6.-Cross-validate-a-decision-tree-classifier-6">6. Cross-validate a decision tree classifier</a></span></li><li><span><a href="#7.-Fit-a-BaggingClassifier-with-a-decision-tree-base-estimator" data-toc-modified-id="7.-Fit-a-BaggingClassifier-with-a-decision-tree-base-estimator-7">7. Fit a <code>BaggingClassifier</code> with a decision tree base estimator</a></span></li><li><span><a href="#8.--Of-the-Hypothesis-Space-problems-we-discussed-earlier.--Which-are-solved-by-bagging?" data-toc-modified-id="8.--Of-the-Hypothesis-Space-problems-we-discussed-earlier.--Which-are-solved-by-bagging?-8">8.  Of the Hypothesis Space problems we discussed earlier.  Which are solved by bagging?</a></span><ul class="toc-item"><li><span><a href="#--Statistical?" data-toc-modified-id="--Statistical?-8.1">- Statistical?</a></span></li><li><span><a href="#--Computational?" data-toc-modified-id="--Computational?-8.2">- Computational?</a></span></li><li><span><a href="#--Representational?" data-toc-modified-id="--Representational?-8.3">- Representational?</a></span></li></ul></li><li><span><a href="#Bonus:-Tune-the-bagging-classifiers-with-grid-search" data-toc-modified-id="Bonus:-Tune-the-bagging-classifiers-with-grid-search-9">Bonus: Tune the bagging classifiers with grid search</a></span></li></ul></div>

### 1. Import the car evaluation data.

Use `acceptability` as the target variable.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
sns.set(font_scale=1.5)

In [2]:
df = pd.read_csv('../../../../../resource-datasets/car_evaluation/car.csv')
df.head()

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,acceptability
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc


### 2. Encode the features properly

In [3]:
y = df.pop('acceptability')
X = pd.get_dummies(df, drop_first=True)

In [4]:
X.head()

Unnamed: 0,buying_low,buying_med,buying_vhigh,maint_low,maint_med,maint_vhigh,doors_3,doors_4,doors_5more,persons_4,persons_more,lug_boot_med,lug_boot_small,safety_low,safety_med
0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0
1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1
2,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0
3,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0
4,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1


### 3. Create a train-test split and cross-validate a KNN classifier

In [5]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, shuffle=True, random_state=1)

In [7]:
knn = KNeighborsClassifier()

print("KNN CV training score:\t", cross_val_score(knn, X_train, y_train, cv=5,
                                                  n_jobs=1).mean())
knn.fit(X_train, y_train)
print("KNN test score:\t", knn.score(X_test, y_test))

KNN CV training score:	 0.8222176075690472
KNN test score:	 0.8477842003853564


### 4. Research and describe the `max_samples` and `max_features` hyperparameters of the bagging classifier

The `BaggingClassifier` meta-estimator has several parameters.

Look at the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) for a detailed description of each and find out what `max_samples` and `max_features` do.

> Answer:
>
- `max_samples` is the number of samples to draw from X to train each base estimator, can be given as absolute number or fraction of the total
- `max_features` is the number of features to draw from X to train each base estimator, can be given as absolute number or fraction.

### 5. Fit a `BaggingClassifier` with a KNN base estimator

In [8]:
bagging = BaggingClassifier(base_estimator=knn,
                            max_samples=0.5, max_features=0.5)

print("Bagging CV training score:\t", cross_val_score(bagging, X_train, y_train,
                                                      cv=5, n_jobs=1).mean())

bagging.fit(X_train, y_train)
print("KNN bagging test score:\t", bagging.score(X_test, y_test))

Bagging CV training score:	 0.7286866785264664
KNN bagging test score:	 0.6994219653179191


### 6. Cross-validate a decision tree classifier 

In [9]:
from sklearn.tree import DecisionTreeClassifier

In [10]:
dt = DecisionTreeClassifier()

print("DT CV training score:\t", cross_val_score(dt, X_train, y_train, cv=5,
                                                 n_jobs=1).mean())
dt.fit(X_train, y_train)
print("DT test score:\t", knn.score(X_test, y_test))

DT CV training score:	 0.8991404245311575
DT test score:	 0.8477842003853564


### 7. Fit a `BaggingClassifier` with a decision tree base estimator

In [11]:
bagging = BaggingClassifier(base_estimator=dt,
                            max_samples=0.8, max_features=0.8, n_estimators=100)

print("DT Bagging CV training score:\t", cross_val_score(bagging, X_train, y_train,
                                                         cv=5, n_jobs=1).mean())

bagging.fit(X_train, y_train)
print("DT bagging test score:\t", bagging.score(X_test, y_test))

DT Bagging CV training score:	 0.8527453866521932
DT bagging test score:	 0.9113680154142582


### 8.  Of the Hypothesis Space problems we discussed earlier.  Which are solved by bagging?
#### - Statistical?
#### - Computational?
#### - Representational?

> Answer: all three

### Bonus: Tune the bagging classifiers with grid search

In [12]:
from sklearn.model_selection import GridSearchCV

In [13]:
model = BaggingClassifier(base_estimator=knn, n_estimators=100)
params = {'max_samples': np.linspace(0.8, 1.0, 3),
          'max_features': range(int(3/4.*X.shape[1]), X.shape[1]+1)}

grid = GridSearchCV(model, param_grid=params, cv=5)
grid.fit(X_train, y_train)
grid.best_estimator_

BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=14,
         max_samples=1.0, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [14]:
print(grid.score(X_train, y_train))
print(grid.score(X_test, y_test))

0.9338296112489661
0.8554913294797688


In [15]:
model.get_params()

{'base_estimator': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
            metric_params=None, n_jobs=1, n_neighbors=5, p=2,
            weights='uniform'),
 'base_estimator__algorithm': 'auto',
 'base_estimator__leaf_size': 30,
 'base_estimator__metric': 'minkowski',
 'base_estimator__metric_params': None,
 'base_estimator__n_jobs': 1,
 'base_estimator__n_neighbors': 5,
 'base_estimator__p': 2,
 'base_estimator__weights': 'uniform',
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 100,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [16]:
model = BaggingClassifier(base_estimator=dt, n_estimators=100)
params = {'max_samples': np.linspace(0.8, 1.0, 3),
          'max_features': range(int(3/4.*X.shape[1]), X.shape[1]+1)}

grid = GridSearchCV(model, param_grid=params, cv=5)
grid.fit(X_train, y_train)
grid.best_estimator_

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=15,
         max_samples=0.8, n_estimators=100, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [17]:
print(grid.score(X_train, y_train))
print(grid.score(X_test, y_test))

1.0
0.9344894026974951
