<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Extracting Base Estimators from Bagged Models 

---

In this lab, you will use the attributes that a fitted scikit-learn [BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) exposes. In particular, you will investigate what you can do with:
- `.base_estimator_`
- `.estimators_`
- `.estimators_samples_`
- `.estimators_features_`

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Load-the-breast-cancer-data." data-toc-modified-id="1.-Load-the-breast-cancer-data.-1">1. Load the breast cancer data.</a></span></li><li><span><a href="#2.-Load-required-sklearn-packages." data-toc-modified-id="2.-Load-required-sklearn-packages.-2">2. Load required sklearn packages.</a></span></li><li><span><a href="#3.-Make-a-train-test-split." data-toc-modified-id="3.-Make-a-train-test-split.-3">3. Make a train-test split.</a></span></li><li><span><a href="#4.-Create-and-fit-a-BaggingClassifier-with-a-DecisionTreeClassifier-base-estimator." data-toc-modified-id="4.-Create-and-fit-a-BaggingClassifier-with-a-DecisionTreeClassifier-base-estimator.-4">4. Create and fit a <code>BaggingClassifier</code> with a <code>DecisionTreeClassifier</code> base estimator.</a></span></li><li><span><a href="#5.-Pull-out-the-base-estimator-from-the-ensemble-model." data-toc-modified-id="5.-Pull-out-the-base-estimator-from-the-ensemble-model.-5">5. Pull out the base estimator from the ensemble model.</a></span></li><li><span><a href="#6.-Pull-out-all-the-base-estimators." data-toc-modified-id="6.-Pull-out-all-the-base-estimators.-6">6. Pull out <em>all</em> the base estimators.</a></span></li><li><span><a href="#7.-Get-the-features-used-in-each-of-the-bagged-base-estimators." data-toc-modified-id="7.-Get-the-features-used-in-each-of-the-bagged-base-estimators.-7">7. Get the features used in each of the bagged base estimators.</a></span></li><li><span><a href="#8.-Create-a-list-of-the-features-used-in-the-first-base-estimator." data-toc-modified-id="8.-Create-a-list-of-the-features-used-in-the-first-base-estimator.-8">8. Create a list of the features used in the first base estimator.</a></span></li><li><span><a href="#9.-Get-out-the-samples-used-in-our-first-base-estimator." data-toc-modified-id="9.-Get-out-the-samples-used-in-our-first-base-estimator.-9">9. 
Get out the samples used in our first base estimator.</a></span></li><li><span><a href="#10.-Get-out-the-target-subsample-for-the-estimator." data-toc-modified-id="10.-Get-out-the-target-subsample-for-the-estimator.-10">10. Get out the target subsample for the estimator.</a></span></li><li><span><a href="#11.-Fit-a-decision-tree-equivalent-to-our-first-base-estimator." data-toc-modified-id="11.-Fit-a-decision-tree-equivalent-to-our-first-base-estimator.-11">11. Fit a decision tree equivalent to our first base estimator.</a></span></li><li><span><a href="#12.-Bonus:-Take-each-of-the-decision-trees-from-the-ensemble-above-and-obtain-its-predictions-for-the-target-variable-in-the-test-set.-Use-majority-voting-to-obtain-the-ensemble-prediction-for-the-target-label.-Compare-with-the-bagging-classifier-score." data-toc-modified-id="12.-Bonus:-Take-each-of-the-decision-trees-from-the-ensemble-above-and-obtain-its-predictions-for-the-target-variable-in-the-test-set.-Use-majority-voting-to-obtain-the-ensemble-prediction-for-the-target-label.-Compare-with-the-bagging-classifier-score.-12">12. Bonus: Take each of the decision trees from the ensemble above and obtain its predictions for the target variable in the test set. Use majority voting to obtain the ensemble prediction for the target label. Compare with the bagging classifier score.</a></span></li></ul></div>

### 1. Load the breast cancer data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Converting data into a dataframe structure
X = pd.DataFrame(data['data'], columns=data['feature_names'])
# Setting up our Y value as well
y = pd.Series(data['target'])

np.random.seed(1)

### 2. Load required sklearn packages.

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

### 3. Make a train-test split.

In [3]:
# Train-test split (train_test_split was imported in step 2; default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

### 4. Create and fit a `BaggingClassifier` with a `DecisionTreeClassifier` base estimator.

- Fit on the training data.
- Report the score on the test data.

In [4]:
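# Create the base decision tree and the bagging ensemble around it.
# Each of the 50 trees is trained on a bootstrap sample half the size of the
# training set and on a random half of the features.
# Note: newer scikit-learn releases rename `base_estimator` to `estimator`.
DT = DecisionTreeClassifier()
BC = BaggingClassifier(base_estimator=DT, n_estimators=50, max_features=0.5,
                       max_samples=0.5, oob_score=True)

# Fit the ensemble on the training data
BC.fit(X_train, y_train)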

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
         bootstrap=True, bootstrap_features=False, max_features=0.5,
         max_samples=0.5, n_estimators=50, n_jobs=1, oob_score=True,
         random_state=None, verbose=0, warm_start=False)

In [5]:
predictions = BC.predict(X_test)
accuracy_score(y_test, predictions)

0.9790209790209791

In [6]:
BC.score(X_test, y_test)

0.9790209790209791

In [7]:
BC.oob_score_

0.9507042253521126

### 5. Pull out the base estimator from the ensemble model.

In [8]:
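# The unfitted template that every ensemble member is cloned from.
# All members share its hyperparameters; only the random seed and the
# rows/features each member is trained on differ.
# random_state is None here because the seeds are set on the clones, not on the template.
# (Newer scikit-learn releases expose this attribute as `estimator_`.)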
BC.base_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
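The relationship between the template and its clones can be checked on a small, self-contained ensemble. This is a sketch under assumptions: the helper `make_bag` and every `*_demo` name are illustrative and not part of the lab; the helper only bridges the `base_estimator` / `estimator` rename between scikit-learn releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


def make_bag(tree, **kwargs):
    """Hypothetical helper: build a BaggingClassifier on old or new scikit-learn."""
    try:
        return BaggingClassifier(estimator=tree, **kwargs)       # newer releases
    except TypeError:
        return BaggingClassifier(base_estimator=tree, **kwargs)  # older releases


X_demo, y_demo = load_breast_cancer(return_X_y=True)
bag_demo = make_bag(DecisionTreeClassifier(), n_estimators=5,
                    max_features=0.5, max_samples=0.5, random_state=0)
bag_demo.fit(X_demo, y_demo)

# The unfitted template; the attribute name depends on the installed version.
template_demo = (bag_demo.estimator_ if hasattr(bag_demo, "estimator_")
                 else bag_demo.base_estimator_)
print(template_demo.random_state)   # None: the template itself is never seeded

# Each fitted clone carries its own integer seed.
member_seeds = [tree.random_state for tree in bag_demo.estimators_]
print(member_seeds)
```

The template keeps `random_state=None`, while each fitted member reports its own seed, matching the outputs shown in steps 5 and 6.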

### 6. Pull out *all* the base estimators.

In [9]:
# The fitted ensemble members (first five shown); each clone has its own random_state.
BC.estimators_[:5]

[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False,
             random_state=1028862084, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, presort=False,
             random_state=870353631, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fr

### 7. Get the features used in each of the bagged base estimators.

In [10]:
# Features used by each bagged model, given as column indices
# into the list of feature names (first five models shown)
BC.estimators_features_[:5]

[array([19,  0, 14,  6, 26, 16,  9, 15, 25, 20, 21,  5,  3, 24,  1]),
 array([ 7, 26, 21, 17, 18, 23,  6,  8, 22, 12, 28,  0, 13, 27, 15]),
 array([18, 16, 26, 10,  6, 14,  0, 23, 29,  4,  7, 12, 24, 28, 19]),
 array([15, 22, 12, 26,  2, 19, 28, 25,  0,  8, 21, 23, 27, 11,  3]),
 array([15, 24, 10,  7,  1, 28, 12, 16, 18,  5, 17, 19, 13,  9, 23])]
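With `max_features=0.5` and `bootstrap_features=False`, each member should receive 15 of the 30 features, drawn without replacement. A small check on the arrays printed above, copied by hand into the hypothetical variable `shown_features`:

```python
import numpy as np

# Feature index arrays copied from the output above (first five members).
shown_features = [
    np.array([19, 0, 14, 6, 26, 16, 9, 15, 25, 20, 21, 5, 3, 24, 1]),
    np.array([7, 26, 21, 17, 18, 23, 6, 8, 22, 12, 28, 0, 13, 27, 15]),
    np.array([18, 16, 26, 10, 6, 14, 0, 23, 29, 4, 7, 12, 24, 28, 19]),
    np.array([15, 22, 12, 26, 2, 19, 28, 25, 0, 8, 21, 23, 27, 11, 3]),
    np.array([15, 24, 10, 7, 1, 28, 12, 16, 18, 5, 17, 19, 13, 9, 23]),
]

n_features_total = 30  # the breast cancer data has 30 feature columns

for idx in shown_features:
    # 15 = int(0.5 * 30) indices, all distinct, all valid column positions
    assert len(idx) == int(0.5 * n_features_total)
    assert len(np.unique(idx)) == len(idx)
    assert idx.min() >= 0 and idx.max() < n_features_total

# How many of the five members saw feature 0 ('mean radius')?
members_with_feature_0 = sum(0 in idx for idx in shown_features)
print(members_with_feature_0)  # prints 4
```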

### 8. Create a list of the features used in the first base estimator.

In [11]:
# What are the parameters for the first decision tree in our bag?
BC.estimators_[0]

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False,
            random_state=1028862084, splitter='best')

In [12]:
# What are the features used in the first model
BC.estimators_features_[0]

array([19,  0, 14,  6, 26, 16,  9, 15, 25, 20, 21,  5,  3, 24,  1])

In [13]:
# Map the feature indices of the first model to their names.
sub_features = [data['feature_names'][i] for i in BC.estimators_features_[0]]
sub_features

['fractal dimension error',
 'mean radius',
 'smoothness error',
 'mean concavity',
 'worst concavity',
 'concavity error',
 'mean fractal dimension',
 'compactness error',
 'worst compactness',
 'worst radius',
 'worst texture',
 'mean compactness',
 'mean area',
 'worst smoothness',
 'mean texture']

### 9. Get out the samples used in our first base estimator.

In [14]:
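# Which training rows did the first model see?
# In the scikit-learn version used here, each entry is a boolean mask over the
# rows of X_train (True = in-bag), so its length equals len(X_train).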
samples = BC.estimators_samples_[0]
len(samples)

426

In [15]:
# Number of distinct training rows seen by each ensemble member (first ten shown).
# max_samples=0.5 draws 213 rows with replacement, so fewer than 213 are distinct.
[mask.sum() for mask in BC.estimators_samples_][:10]

[178, 153, 167, 170, 176, 164, 174, 171, 173, 166]
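The counts above sit well below 213 because the rows are drawn with replacement. Assuming uniform draws, the expected number of distinct rows after `m` draws from `n` rows is `n * (1 - (1 - 1/n)**m)`, which comes to roughly 168 for this split. The sketch below evaluates the formula and checks it with a small NumPy simulation; the variable names, seed, and number of repetitions are illustrative choices.

```python
import numpy as np

n_rows = 426    # training rows, len(X_train)
n_draws = 213   # int(0.5 * 426) draws per member (max_samples=0.5)

# Closed-form expectation of the number of distinct rows in one bootstrap draw
expected_distinct = n_rows * (1 - (1 - 1 / n_rows) ** n_draws)
print(round(expected_distinct, 1))

# Monte Carlo check: draw with replacement many times and count distinct rows
rng = np.random.default_rng(0)
simulated = [len(np.unique(rng.integers(0, n_rows, size=n_draws)))
             for _ in range(2000)]
print(np.mean(simulated))
```

The ten counts shown above (153 to 178) scatter around this expectation.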

In [16]:
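# Keep only the in-bag rows (mask is True) and the selected feature columns.
# A boolean mask keeps each in-bag row once, so repeated bootstrap draws are dropped.
X_train_0 = X_train.loc[samples,sub_features]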

In [17]:
X_train_0.head()

|  | fractal dimension error | mean radius | smoothness error | mean concavity | worst concavity | concavity error | mean fractal dimension | compactness error | worst compactness | worst radius | worst texture | mean compactness | mean area | worst smoothness | mean texture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 373 | 0.001711 | 20.64 | 0.006211 | 0.1527 | 0.4159 | 0.02681 | 0.05478 | 0.01895 | 0.3055 | 25.37 | 23.17 | 0.1076 | 1335.0 | 0.1562 | 17.35 |
| 179 | 0.003351 | 12.81 | 0.008534 | 0.009193 | 0.02758 | 0.00618 | 0.06133 | 0.006364 | 0.05445 | 13.63 | 16.15 | 0.03774 | 508.8 | 0.1162 | 13.06 |
| 288 | 0.006517 | 11.26 | 0.01574 | 0.09274 | 0.1546 | 0.08099 | 0.06233 | 0.08262 | 0.1843 | 11.86 | 22.33 | 0.1181 | 394.1 | 0.1028 | 19.96 |
| 219 | 0.002256 | 19.53 | 0.005539 | 0.1145 | 0.3995 | 0.02664 | 0.05313 | 0.02644 | 0.4097 | 27.9 | 45.41 | 0.113 | 1223.0 | 0.1408 | 32.47 |
| 546 | 0.002606 | 10.32 | 0.007086 | 0.01012 | 0.04384 | 0.01012 | 0.06201 | 0.007247 | 0.08842 | 11.25 | 21.77 | 0.04994 | 324.9 | 0.1285 | 16.35 |


In [18]:
X_train_0.shape

(178, 15)
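The `.loc[samples, ...]` selection above relies on `estimators_samples_` returning boolean masks. Later scikit-learn releases return integer position arrays instead, and these can contain repeated rows. A hedged sketch of a selection that works with either form, using a hypothetical helper `to_positions` and a toy frame:

```python
import numpy as np
import pandas as pd


def to_positions(sample_entry):
    """Hypothetical helper: turn a boolean mask or an index array into row positions."""
    arr = np.asarray(sample_entry)
    if arr.dtype == bool:
        return np.flatnonzero(arr)   # mask: positions of the True entries
    return arr.astype(int)           # already positions (duplicates preserved)


# Toy frame with a shuffled index, like X_train after train_test_split
toy = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[373, 179, 288, 219])

mask_entry = np.array([True, False, True, False])
index_entry = np.array([2, 0, 2])

# .iloc selects by position, so the shuffled index labels do not matter
from_mask = toy.iloc[to_positions(mask_entry)]["a"].tolist()
from_index = toy.iloc[to_positions(index_entry)]["a"].tolist()
print(from_mask)    # [10, 30]
print(from_index)   # [30, 10, 30]
```

Selecting with `.iloc` and positions avoids mixing up row positions with the shuffled index labels, and it keeps bootstrap duplicates when they are available.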

### 10. Get out the target subsample for the estimator.

In [19]:
# Select the target values for the same in-bag rows, using the same boolean mask.
y_train_0 = y_train[samples]

### 11. Fit a decision tree equivalent to our first base estimator.

In [20]:
# Rebuild the first base model: a decision tree with default hyperparameters
# (the same ones shown for the ensemble members) and the first member's seed.
# It is trained on the de-duplicated in-bag rows, so it approximates that
# member and may differ slightly from it.
DTC_0 = DecisionTreeClassifier(random_state=BC.estimators_[0].random_state)

# Fit on the first member's rows and features
DTC_0.fit(X_train_0, y_train_0)

# Accuracy on the test set, using the same feature subset
predictions_DTC_0 = DTC_0.predict(X_test[sub_features])
accuracy_score(y_test, predictions_DTC_0)

0.951048951048951
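A fitted member can also be scored directly, without refitting, by passing it the columns listed in `estimators_features_`. The sketch below does this on a fresh, self-contained ensemble; `make_bag` and the `*_demo` names are illustrative assumptions, and the resulting accuracy will depend on the seed and library version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def make_bag(tree, **kwargs):
    """Hypothetical helper: build a BaggingClassifier on old or new scikit-learn."""
    try:
        return BaggingClassifier(estimator=tree, **kwargs)       # newer releases
    except TypeError:
        return BaggingClassifier(base_estimator=tree, **kwargs)  # older releases


X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

bag_demo = make_bag(DecisionTreeClassifier(), n_estimators=10,
                    max_features=0.5, max_samples=0.5, random_state=0)
bag_demo.fit(X_tr, y_tr)

# Each member was trained on a column subset, so it must be given the same
# columns, in the same order, at prediction time.
member_0 = bag_demo.estimators_[0]
features_0 = bag_demo.estimators_features_[0]
member_pred = member_0.predict(X_te[:, features_0])

member_acc = accuracy_score(y_te, member_pred)
print(member_acc)
```

Scoring the stored member this way reflects exactly the tree the ensemble uses, whereas a refit on the extracted subsample is only an approximation.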

### 12. Bonus: Take each of the decision trees from the ensemble above and obtain its predictions for the target variable in the test set. Use majority voting to obtain the ensemble prediction for the target label. Compare with the bagging classifier score.

In [21]:
from sklearn.base import clone

accuracies = []
df_predictions = pd.DataFrame()
df_probabilities = pd.DataFrame()
for i, estimator in enumerate(BC.estimators_):
    # Names of the features used by the i-th model
    sub_features = [data['feature_names'][f] for f in BC.estimators_features_[i]]

    # In-bag rows and matching targets for the i-th model
    samples = BC.estimators_samples_[i]
    data_current = X_train.loc[samples, sub_features]
    target_current = y_train[samples]

    # Refit a clone (same hyperparameters and seed) so that the fitted trees
    # stored inside BC are not overwritten
    DTC0 = clone(estimator)
    DTC0.fit(data_current, target_current)

    # Predictions, class probabilities and accuracy on the test set
    predictions_DTC0 = DTC0.predict(X_test[sub_features])
    probabilities_DTC0 = DTC0.predict_proba(X_test[sub_features])
    df_predictions['predictions_'+str(i)] = predictions_DTC0
    df_probabilities['probabilities_0_'+str(i)] = probabilities_DTC0[:, 0]
    df_probabilities['probabilities_1_'+str(i)] = probabilities_DTC0[:, 1]
    accuracy = accuracy_score(y_test, predictions_DTC0)
    accuracies.append(accuracy)
    print(accuracy)

0.951048951048951
0.8811188811188811
0.9230769230769231
0.9230769230769231
0.8881118881118881
0.9020979020979021
0.9230769230769231
0.9440559440559441
0.9370629370629371
0.8811188811188811
0.9370629370629371
0.972027972027972
0.8951048951048951
0.951048951048951
0.958041958041958
0.9230769230769231
0.8811188811188811
0.9020979020979021
0.951048951048951
0.916083916083916
0.9090909090909091
0.9370629370629371
0.972027972027972
0.9300699300699301
0.9440559440559441
0.9370629370629371
0.965034965034965
0.951048951048951
0.8881118881118881
0.9370629370629371
0.9440559440559441
0.951048951048951
0.916083916083916
0.951048951048951
0.916083916083916
0.9440559440559441
0.9300699300699301
0.9020979020979021
0.916083916083916
0.9440559440559441
0.9440559440559441
0.8951048951048951
0.9370629370629371
0.9300699300699301
0.9020979020979021
0.9370629370629371
0.9440559440559441
0.916083916083916
0.8461538461538461
0.951048951048951


In [22]:
np.mean(accuracies)

0.9265734265734267

In [23]:
# Count the votes for each class in every test row
votes = df_predictions.filter(like='predictions_')
zeros = (votes == 0).sum(axis=1)
ones = (votes == 1).sum(axis=1)

In [24]:
df_predictions['zeros'] = zeros
df_predictions['ones'] = ones
# Majority vote: predict 1 only when it gets strictly more votes (ties go to 0)
df_predictions['prediction'] = (
    df_predictions['ones'] > df_predictions['zeros']).astype(int)

df_predictions.head()

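|  | predictions_0 | predictions_1 | predictions_2 | predictions_3 | predictions_4 | predictions_5 | predictions_6 | predictions_7 | predictions_8 | predictions_9 | ... | predictions_43 | predictions_44 | predictions_45 | predictions_46 | predictions_47 | predictions_48 | predictions_49 | zeros | ones | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50 | 1 |
| 2 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 11 | 39 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 49 | 1 |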


In [25]:
accuracy_score(y_test, df_predictions.prediction)

0.9790209790209791

In [26]:
accuracy_score(y_test, predictions)

0.9790209790209791

In [27]:
# Check agreement with bagging score
accuracy_score(y_test, df_predictions.prediction) - \
    accuracy_score(y_test, predictions)

0.0
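One caveat when comparing with `BC.predict`: for base estimators that implement `predict_proba`, scikit-learn's bagging classifier averages the members' class probabilities rather than counting hard votes. With fully grown trees the two rules usually coincide, which is consistent with the zero difference above. A self-contained sketch comparing both aggregation rules follows; the helper `make_bag` and the `*_demo` names are illustrative assumptions, and an odd number of members is used only to rule out tied votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def make_bag(tree, **kwargs):
    """Hypothetical helper: build a BaggingClassifier on old or new scikit-learn."""
    try:
        return BaggingClassifier(estimator=tree, **kwargs)       # newer releases
    except TypeError:
        return BaggingClassifier(base_estimator=tree, **kwargs)  # older releases


X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=5)

bag_demo = make_bag(DecisionTreeClassifier(), n_estimators=11,
                    max_features=0.5, max_samples=0.5, random_state=0)
bag_demo.fit(X_tr, y_tr)

members = list(zip(bag_demo.estimators_, bag_demo.estimators_features_))

# Hard voting: each member casts one 0/1 vote per test row
votes = np.array([est.predict(X_te[:, feats]) for est, feats in members])
hard_vote = (votes.mean(axis=0) > 0.5).astype(int)

# Soft voting: average the members' class probabilities, then take the argmax
mean_proba = sum(est.predict_proba(X_te[:, feats])
                 for est, feats in members) / len(members)
soft_vote = bag_demo.classes_[np.argmax(mean_proba, axis=1)]

bag_pred = bag_demo.predict(X_te)
print(np.mean(soft_vote == bag_pred))   # agreement of soft voting with predict
print(np.mean(hard_vote == bag_pred))   # agreement of hard voting with predict
```

If the base estimators were shallow trees with impure leaves, the averaged probabilities could disagree with the hard vote on some rows, so matching accuracies are not guaranteed in general.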