# Bootstrap

scikit-learn no longer ships a bootstrap cross-validation iterator. Other resampling schemes are available among its cross-validation splitters, and a commonly used one, `ShuffleSplit`, is covered below.
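
As one illustration of the "other techniques" mentioned above, a bootstrap estimate can still be put together by hand with `sklearn.utils.resample`. The sketch below is only a minimal example under stated assumptions: the helper name `bootstrap_oob_scores`, the 200-sample toy dataset and the number of rounds are all made up for the illustration and are not part of this notebook's data. Each round fits on rows drawn with replacement and scores on the rows that were not drawn (the out-of-bag rows).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Assumption: a larger toy dataset than the notebook's, so that every
# bootstrap sample is very likely to contain both classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=14,
)


def bootstrap_oob_scores(model, X, y, n_rounds=20, seed=0):
    """Hypothetical helper: fit on a bootstrap sample, score on out-of-bag rows."""
    rng = np.random.RandomState(seed)
    all_idx = np.arange(len(y))
    scores = []
    for _ in range(n_rounds):
        # Draw len(y) row indices with replacement.
        boot_idx = resample(all_idx, replace=True, n_samples=len(y), random_state=rng)
        # Rows never drawn in this round form the out-of-bag evaluation set.
        oob_idx = np.setdiff1d(all_idx, boot_idx)
        model.fit(X[boot_idx], y[boot_idx])
        scores.append(model.score(X[oob_idx], y[oob_idx]))
    return np.array(scores)


oob_scores = bootstrap_oob_scores(LogisticRegression(solver='liblinear'), X_demo, y_demo)
print('Mean out-of-bag accuracy:', oob_scores.mean())
```

Unlike the splitters used in the rest of this notebook, a bootstrap training set contains repeated rows, which is the main practical difference to keep in mind.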

## The data

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

X, y = make_classification(
    n_samples=20, n_features=10, n_informative=2, n_redundant=0,
    n_repeated=0, n_classes=2, n_clusters_per_class=1,
    weights=(0.7, 0.3), class_sep=0.99, random_state=14,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

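## The ShuffleSplit class

Now let's look at that class. `ShuffleSplit` is a cross-validation splitter that draws a fresh random permutation of the data for every split instead of partitioning it once into fixed folds. Within a single split the training and test indices are disjoint, but the splits are drawn independently of each other, so the same instance can end up in the test set of several splits: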

In [10]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate

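metrics = ['accuracy']
ss = ShuffleSplit(n_splits=10, test_size=0.3)

# Note: this loop splits the full X (20 samples), whereas the
# cross_validate call further down runs on X_train (14 samples).
for train_index, test_index in ss.split(X):
    print('Training indices:', train_index, 'Test indices:', test_index)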

Training indices: [12  4  1 11 10  6  9  2  3 18 17 19  0 14] Test indices: [13  8  5 15  7 16]
Training indices: [ 3  2  5  9  8  0 15 13  7  4  6 11 17 18] Test indices: [ 1 19 16 12 10 14]
Training indices: [ 0 19  3  2  4 15 10 18 12 16 13 11 14  5] Test indices: [ 1  6 17  8  9  7]
Training indices: [17  3  4 15  7  8 19  1 16 10  6 12 11 13] Test indices: [ 0  2  5 14 18  9]
Training indices: [16 12 18  0  2 11 17  8  3  6 13 19  5  7] Test indices: [10 15 14  4  9  1]
Training indices: [13  3 10  1  2  8 19 12 15  7  0  9 18 16] Test indices: [ 6 14  4  5 17 11]
Training indices: [ 7  5  2  6 12  0  8 14  4  3  1 17 18 11] Test indices: [16 19 10 15 13  9]
Training indices: [10  5 15 19 18  1  4 14  3  6  8 17  7 16] Test indices: [ 2  9  0 13 12 11]
Training indices: [17  0 10  1 11 13 12  7  4  6 14 19  2 16] Test indices: [18  8  5  3 15  9]
Training indices: [ 9  7  1 13 12  0  6 11  8 15  4  2 17 16] Test indices: [ 3 19 18  5 10 14]
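
The overlap described above can be checked directly. The following is a small sketch, assuming a dummy 20-row array (`X_demo`) and a fixed `random_state` that are not part of the notebook: it confirms that training and test indices never intersect within a split, and counts how often each sample is held out across the ten splits.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Assumption: 20 dummy samples standing in for the notebook's X.
X_demo = np.arange(40).reshape(20, 2)
demo_ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

test_counts = np.zeros(len(X_demo), dtype=int)
for train_idx, test_idx in demo_ss.split(X_demo):
    # Within one split, no sample is in both the training and the test set.
    assert np.intersect1d(train_idx, test_idx).size == 0
    test_counts[test_idx] += 1

# Across splits, samples are held out a varying number of times.
print('Times each sample was held out:', test_counts)
```

With 10 splits of 6 test samples each, 60 test slots are spread over 20 samples, so at least one sample must be held out three or more times. In `KFold`, by contrast, every sample is held out exactly once.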


In [5]:
print('\nMetrics: ')
outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics, cv=ss, return_train_score=False)
for metric in outcomes.keys():
    print(metric+' value: '+str(outcomes[metric]))


Metrics: 
fit_time value: [0.00199986 0.00100064 0.00099587 0.00100064 0.         0.00100517
 0.00100088 0.0010004  0.0009973  0.00099969]
score_time value: [0.00099659 0.00100541 0.         0.         0.00099921 0.00099325
 0.         0.00099921 0.00100017 0.        ]
test_accuracy value: [0.8 0.6 1.  0.8 1.  0.4 0.8 0.6 0.8 1. ]
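
The scores above are all multiples of 0.2, which suggests very small test sets. The sketch below is a rough illustration, assuming a placeholder array with 14 rows (the size of `X_train`) and a fixed `random_state`. It checks how many rows a 30% test split of 14 samples contains, then summarizes the accuracies copied from the output above by their mean and standard deviation.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Accuracies copied from the cross_validate output above.
test_accuracy = np.array([0.8, 0.6, 1.0, 0.8, 1.0, 0.4, 0.8, 0.6, 0.8, 1.0])

# Assumption: a placeholder array with as many rows as X_train (14).
placeholder = np.zeros((14, 1))
one_split = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(one_split.split(placeholder))
print('Rows per split (train, test):', len(train_idx), len(test_idx))

mean_acc = test_accuracy.mean()
std_acc = test_accuracy.std()
print('Mean accuracy:', mean_acc)
print('Standard deviation:', std_acc)
```

With so few test rows per split, a single misclassified sample moves the accuracy by a large step, so the spread across splits says as much about the sample size as about the classifier.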


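## Stratified shuffling

`StratifiedShuffleSplit` works like `ShuffleSplit`, but it preserves the class proportions of `y` in every training and test set. This matters here because the classes were generated with unequal weights (0.7 and 0.3).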

In [7]:
from sklearn.model_selection import StratifiedShuffleSplit

metrics = ['accuracy']
ss = StratifiedShuffleSplit(n_splits=10, test_size=0.3)

outcomes = cross_validate(classifier, X_train, y_train, scoring=metrics, cv=ss, return_train_score=False)

for metric in outcomes.keys():
    print(metric+' value: '+str(outcomes[metric]))

fit_time value: [0.00100255 0.00200343 0.0010035  0.00200057 0.00099325 0.0009973
 0.         0.         0.         0.00100064]
score_time value: [0.00099969 0.         0.00099564 0.         0.         0.00100231
 0.00099754 0.00100994 0.00100493 0.00100255]
test_accuracy value: [0.8 0.8 1.  0.8 0.8 0.6 0.8 0.6 0.6 1. ]
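
To make the difference between the two splitters visible, the sketch below counts the minority-class rows that land in each test set. It assumes hand-made labels (14 of class 0 and 6 of class 1) rather than the notebook's `y_train`, and the helper `minority_counts` is hypothetical, introduced only for this comparison.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Assumption: hand-made imbalanced labels, 14 of class 0 and 6 of class 1.
y_demo = np.array([0] * 14 + [1] * 6)
X_demo = np.zeros((len(y_demo), 1))  # the features are irrelevant for splitting


def minority_counts(splitter):
    """Hypothetical helper: number of class-1 rows in each test set."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]


plain_counts = minority_counts(
    ShuffleSplit(n_splits=10, test_size=0.3, random_state=0))
strat_counts = minority_counts(
    StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0))

print('ShuffleSplit, class-1 rows per test set:          ', plain_counts)
print('StratifiedShuffleSplit, class-1 rows per test set:', strat_counts)
```

The stratified counts stay fixed at the class ratio of the labels, while the plain counts are free to drift from split to split.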
