# Bootstrap

scikit-learn no longer ships a bootstrap cross-validation iterator: the old Bootstrap class has been removed. Other resampling strategies are available in sklearn.model_selection and can be passed to the cross-validation functions. A common one, ShuffleSplit, is covered below.

## The data

First, we generate the data again:

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

X, y = make_classification(n_samples=20, n_features=10,
                           n_informative=2, n_redundant=0, n_repeated=0,
                           n_classes=2,
                           n_clusters_per_class=1,
                           weights=(0.7, 0.3),
                           class_sep=0.99, random_state=14)


# You already know about training and test splits:
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

In [5]:
X

array([[-1.11633638, -0.13351048,  1.04829099, -0.46575996, -0.72766708,
         0.91304983, -0.93545142,  0.42757826, -0.29281364,  0.15777203],
       [-0.85622801, -0.72284241, -0.07385389, -0.44489146, -0.56010883,
         1.14506528,  0.32430639,  0.64581032, -1.38411962, -0.64943309],
       [-0.78693101, -0.20734668,  0.26283227, -0.25192131, -0.71594518,
         1.1833141 ,  0.36252151, -0.0571729 ,  1.47373936,  1.73949625],
       [-0.97246664, -0.98894727, -1.09472842, -0.01372335, -1.02233934,
         1.0218492 ,  0.09881028,  0.06351657,  0.32060564, -0.66158096],
       [-1.43134338,  0.84392356,  0.4221067 , -1.23712238,  0.69796872,
         0.50854392,  0.14364655,  1.27049168, -1.45193896, -0.29468653],
       [ 0.63184683,  0.22916528, -1.33947353, -0.34866842,  0.38542667,
        -0.14618963,  0.01712358, -0.13739208,  0.85666441,  0.11689612],
       [ 1.1400525 , -0.98631525,  1.34915239,  1.33973859, -0.0054752 ,
        -1.3641906 ,  1.38574576,  0.47089321

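## The ShuffleSplit class

Now let's look at ShuffleSplit. It is a cross-validation iterator that draws a fresh random permutation of the data for every split, so the training and test sets do not depend on the order of the samples. Within a single split the training and test sets are disjoint, but the splits are drawn independently of each other, so the same instance can end up in the test set of several splits: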

In [8]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate

metrics = ['accuracy']
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=10)

# Printing the indices:
for train_index, test_index in ss.split(X):
    print("Training indices:", train_index, "Test indices:", test_index)

Training indices: [13  2 14  8 17 16 19 12 11  1  0 15  4  9] Test indices: [ 7 10  5  6  3 18]
Training indices: [ 7 16 18 12  5  6  3 17 14  4  8  1 19 11] Test indices: [ 2  0  9 10 15 13]
Training indices: [ 3 17  1 15  7  6  0  2  9  8 10 11 13 18] Test indices: [ 5 16 19 12 14  4]
Training indices: [18 16  9 14 13  6 19  2 15  8  7  4  5 12] Test indices: [ 1  3  0 17 10 11]
Training indices: [ 7 14 18  2  1 11  9 13 10  8 12 15 17  0] Test indices: [ 3  4  6 16 19  5]
Training indices: [ 7 11 10 17  0  9  5  2  1  3 15 14 16 18] Test indices: [ 4 19 12 13  6  8]
Training indices: [ 3  2 12 10  6  4 18  9  5 15  0 16 19 17] Test indices: [ 1  8 11  7 14 13]
Training indices: [10  0  8 14 15  5  6  3  1 18 19 16 11 13] Test indices: [ 4 17  2 12  7  9]
Training indices: [11  3 12  7  4 15  9 17  5  1 14 13 19  8] Test indices: [10  2 16  6 18  0]
Training indices: [ 5 13  3 18 12 16  1  2  9 17 10 14  6  4] Test indices: [ 0  7 15 11 19  8]


In [10]:
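# The sklearn.cross_validation module and its Bootstrap iterator have been
# removed from scikit-learn, so this cell fails with the ImportError shown below.
from sklearn import cross_validation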
bs = cross_validation.Bootstrap(n_bootstraps=10, test_size=0.3, random_state=10)
for train_index, test_index in bs:
    print("Training indices:", train_index, "Test indices:", test_index)

ImportError: cannot import name 'cross_validation' from 'sklearn' (C:\Users\leo_w\anaconda3\lib\site-packages\sklearn\__init__.py)
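
Since the Bootstrap iterator is gone, a bootstrap split has to be built by hand if one is really needed. The following is a minimal sketch using only NumPy. The helper name `bootstrap_splits` is hypothetical (it is not a scikit-learn API), and it implements the classic out-of-bag variant, which may differ from what the removed class did: training indices are drawn with replacement, and the samples that were never drawn form the test set.

```python
import numpy as np

def bootstrap_splits(n_samples, n_bootstraps=10, random_state=None):
    """Yield (train_index, test_index) pairs from bootstrap resampling.

    Hypothetical helper, not part of scikit-learn. Training indices are
    drawn with replacement; test indices are the out-of-bag samples.
    """
    rng = np.random.default_rng(random_state)
    all_indices = np.arange(n_samples)
    for _ in range(n_bootstraps):
        # Sample n_samples indices with replacement (duplicates are expected).
        train_index = rng.choice(all_indices, size=n_samples, replace=True)
        # Everything that was never drawn is "out of bag" and used for testing.
        test_index = np.setdiff1d(all_indices, train_index)
        yield train_index, test_index

for train_index, test_index in bootstrap_splits(20, n_bootstraps=3, random_state=10):
    print("Training indices:", train_index, "Test indices:", test_index)
```

Unlike ShuffleSplit, the training indices here contain duplicates and the test-set size varies from split to split.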

In [7]:
print("\nMetrics: ")
outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics,
                          cv=ss, return_train_score=False)
for metric in outcomes.keys():
    print(f"{metric} value: {outcomes[metric]}")


Metrics: 
fit_time value: [0.01284528 0.00229001 0.00200796 0.0011816  0.00099897 0.00100136
 0.00107646 0.00099754 0.0009582  0.00096202]
score_time value: [0.00306463 0.00095654 0.00132871 0.00099397 0.00100112 0.00099611
 0.0009675  0.0009985  0.0010426  0.0010612 ]
test_accuracy value: [0.8 0.6 0.8 1.  0.6 0.6 0.4 0.6 0.8 0.6]


Notice how some samples appear in the test set of more than one split.
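
To make the repetition across test sets concrete, one option is to count how often each sample index is selected for testing. This is a small sketch that reuses the splitter settings from above on 20 placeholder samples; the names `ss_demo` and `test_counts` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 20
ss_demo = ShuffleSplit(n_splits=10, test_size=0.3, random_state=10)

# Count how many times each sample lands in a test set across all splits.
test_counts = np.zeros(n_samples, dtype=int)
# Only the number of rows matters for splitting, so placeholder features suffice.
for _, test_index in ss_demo.split(np.zeros((n_samples, 1))):
    test_counts[test_index] += 1

print("Times each sample was used for testing:", test_counts)
print("Samples tested more than once:", np.flatnonzero(test_counts > 1))
```

With 10 splits of 6 test samples each, there are 60 test slots for only 20 samples, so repeats are unavoidable.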

## Stratified shuffling

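A stratified version, StratifiedShuffleSplit, exists as well. It keeps the class proportions of y approximately the same in every training and test set, which helps here because the classes are imbalanced (weights of 0.7 and 0.3):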

In [4]:
from sklearn.model_selection import StratifiedShuffleSplit

metrics = ['accuracy']
ss = StratifiedShuffleSplit(n_splits=10, test_size=0.3)

outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics,
                          cv=ss, return_train_score=False)

for metric in outcomes.keys():
    print(f"{metric} value: {outcomes[metric]}")

fit_time value: [0.00055051 0.00058889 0.00043964 0.00036931 0.00030208 0.00030017
 0.00029993 0.00032496 0.00030494 0.00030375]
score_time value: [0.00025678 0.0002501  0.00034833 0.00021386 0.00019121 0.00018954
 0.00019073 0.00019264 0.00018954 0.00018835]
test_accuracy value: [0.8 0.6 0.8 0.8 1.  0.8 0.6 0.6 0.8 0.8]
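
The effect of stratification can be checked by comparing how many minority-class samples each splitter puts into its test sets. The sketch below uses toy labels with a 70/30 balance (`y_demo` and `X_demo` are assumptions made for illustration, not the data generated above).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Toy labels with a 70/30 class balance (an assumption for illustration).
y_demo = np.array([0] * 14 + [1] * 6)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for splitting

plain = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
stratified = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

for name, splitter in [("ShuffleSplit", plain),
                       ("StratifiedShuffleSplit", stratified)]:
    # Count the minority class (label 1) in each test set.
    minority_counts = [int(y_demo[test_index].sum())
                       for _, test_index in splitter.split(X_demo, y_demo)]
    print(name, "minority samples per test set:", minority_counts)
```

With 6 test samples and a 30% minority class, the stratified splitter should place about 2 minority samples in every test set, while the plain splitter's counts can vary from split to split.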
