**Correctness verified on Python 3.6:**
+ numpy 1.15.4
+ sklearn 0.20.2

# Sklearn

## sklearn.model_selection

Documentation: http://scikit-learn.org/stable/modules/cross_validation.html

In [1]:
from sklearn import model_selection, datasets

import numpy as np

### One-off split of the data into training and test sets with train_test_split

In [2]:
iris = datasets.load_iris()

In [3]:
train_data, test_data, train_labels, test_labels = model_selection.train_test_split(iris.data, iris.target, 
                                                                                     test_size = 0.3)

In [4]:
# make sure the test set really makes up 0.3 of all the data
len(test_labels) / len(iris.data)

0.3

In [5]:
print('Training set size: {} objects \nTest set size: {} objects'.format(len(train_data),
                                                                         len(test_data)))

Training set size: 105 objects 
Test set size: 45 objects
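
The split above is drawn at random, so rerunning the cell gives a different partition each time. A minimal sketch of how the split can be made repeatable, assuming a fixed `random_state` (the seed value 0 is an arbitrary choice for this illustration, not something the notebook uses):

```python
from sklearn import datasets, model_selection
import numpy as np

iris = datasets.load_iris()

# random_state=0 is an arbitrary seed picked for this illustration
split_a = model_selection.train_test_split(iris.data, iris.target,
                                           test_size=0.3, random_state=0)
split_b = model_selection.train_test_split(iris.data, iris.target,
                                           test_size=0.3, random_state=0)

# the same seed yields the same partition on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(len(split_a[0]), len(split_a[1]))
```

Fixing the seed is mostly useful when results have to be compared between runs; leaving it unset keeps the split random.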


In [6]:
print('Training set:\n', train_data[:5])
print('\n')
print('Test set:\n', test_data[:5])

Training set:
 [[5.6 2.7 4.2 1.3]
 [5.5 2.4 3.8 1.1]
 [6.3 3.4 5.6 2.4]
 [6.5 2.8 4.6 1.5]
 [5.4 3.7 1.5 0.2]]


Test set:
 [[5.6 2.5 3.9 1.1]
 [5.1 3.8 1.5 0.3]
 [5.  3.4 1.6 0.4]
 [4.8 3.4 1.9 0.2]
 [5.9 3.2 4.8 1.8]]


In [7]:
print('Class labels in the training set:\n', train_labels)
print('\n')
print('Class labels in the test set:\n', test_labels)

Class labels in the training set:
 [1 1 2 1 0 1 0 0 1 1 0 1 2 2 0 2 2 1 2 0 2 1 0 1 0 2 0 1 0 0 0 0 0 0 2 1 0
 2 1 1 2 2 1 2 1 0 0 1 1 1 2 2 0 1 1 1 1 2 1 0 0 1 2 2 1 0 0 0 1 1 2 0 1 0
 2 2 1 0 2 1 2 1 2 2 2 1 1 1 0 1 1 0 1 0 2 2 2 0 1 2 2 0 2 2 0]


Class labels in the test set:
 [1 0 0 0 1 1 2 2 2 1 2 0 2 2 0 2 0 2 2 2 1 2 0 2 1 0 0 1 0 0 0 1 2 2 0 1 0
 2 2 0 0 0 2 1 1]
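
A purely random split does not guarantee that the classes are equally represented in both parts. The sketch below shows one way to check the class balance with `np.bincount` and to request a stratified split through the `stratify` parameter of `train_test_split`. The seed value and the variable names `train_counts` and `test_counts` are assumptions made for this illustration.

```python
from sklearn import datasets, model_selection
import numpy as np

iris = datasets.load_iris()

# stratify=iris.target asks for the class proportions of the full data set
# to be preserved in both parts; random_state=0 is an arbitrary seed
_, _, train_labels, test_labels = model_selection.train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0)

# np.bincount counts how many objects of each class landed in each part
train_counts = np.bincount(train_labels)
test_counts = np.bincount(test_labels)
print(train_counts, test_counts)
```

Since iris contains 50 objects of each class, a stratified 70/30 split should place 35 objects of every class in the training part and 15 in the test part.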


### Cross-validation strategies

In [15]:
# generate a short stand-in for a dataset whose elements equal their index
X = range(0, 10)

#### KFold

In [3]:
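# the output below was recorded when X held 20 elements, so a 20-element range is split here
kf = model_selection.KFold(n_splits = 5)
for train_indices, test_indices in kf.split(range(20)):
    print(train_indices, test_indices)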

[ 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [0 1 2 3]
[ 0  1  2  3  8  9 10 11 12 13 14 15 16 17 18 19] [4 5 6 7]
[ 0  1  2  3  4  5  6  7 12 13 14 15 16 17 18 19] [ 8  9 10 11]
[ 0  1  2  3  4  5  6  7  8  9 10 11 16 17 18 19] [12 13 14 15]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] [16 17 18 19]
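
Without shuffling, the test folds in the output above are simply consecutive blocks of indices. The following sketch reproduces that partitioning by hand and compares it with `KFold`. The helper `contiguous_folds` is hypothetical and written only for this illustration; it follows the documented rule that the first `n_samples % n_splits` folds receive one extra element, and it is not the library's own implementation.

```python
import numpy as np
from sklearn import model_selection

def contiguous_folds(n_samples, n_splits):
    """Hypothetical helper: yield (train, test) index arrays in consecutive blocks."""
    indices = np.arange(n_samples)
    # the first n_samples % n_splits folds get one extra element
    fold_sizes = np.full(n_splits, n_samples // n_splits, dtype=int)
    fold_sizes[:n_samples % n_splits] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = indices[start:stop]
        train = np.concatenate([indices[:start], indices[stop:]])
        yield train, test
        start = stop

manual = list(contiguous_folds(20, 5))
reference = list(model_selection.KFold(n_splits=5).split(np.arange(20)))

# compare the hand-made folds with the ones produced by KFold
matches = all(np.array_equal(m_tr, r_tr) and np.array_equal(m_te, r_te)
              for (m_tr, m_te), (r_tr, r_te) in zip(manual, reference))
print(matches)
```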


In [15]:
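# the output below was recorded when X held 20 elements, so a 20-element range is split here
kf = model_selection.KFold(n_splits = 2, shuffle = True)
for train_indices, test_indices in kf.split(range(20)):
    print(train_indices, test_indices)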

[ 2  4  6  7  9 10 11 13 16 19] [ 0  1  3  5  8 12 14 15 17 18]
[ 0  1  3  5  8 12 14 15 17 18] [ 2  4  6  7  9 10 11 13 16 19]


In [21]:
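# the output below was recorded when X held 20 elements, so a 20-element range is split here
kf = model_selection.KFold(n_splits = 2, shuffle = True, random_state = 1)
for train_indices, test_indices in kf.split(range(20)):
    print(train_indices, test_indices)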

[ 0  5  8  9 11 12 13 15 18 19] [ 1  2  3  4  6  7 10 14 16 17]
[ 1  2  3  4  6  7 10 14 16 17] [ 0  5  8  9 11 12 13 15 18 19]
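
The difference between the last two cells is the `random_state` argument: with it, the shuffled folds should come out the same on every run. A small sketch of that property follows; the helper name `shuffled_test_folds` is hypothetical, and the 20-element array is an assumption chosen to match the outputs above.

```python
import numpy as np
from sklearn import model_selection

X = np.arange(20)

def shuffled_test_folds(random_state):
    """Hypothetical helper: collect the test indices of a shuffled 2-fold split."""
    kf = model_selection.KFold(n_splits=2, shuffle=True, random_state=random_state)
    return [test.tolist() for _, test in kf.split(X)]

# two splitters built with the same seed produce identical folds
repeatable = shuffled_test_folds(1) == shuffled_test_folds(1)
print(repeatable)

# the two test folds are disjoint and together cover every element exactly once
first, second = shuffled_test_folds(1)
covers_everything = sorted(first + second) == list(range(20))
print(covers_everything)
```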


#### StratifiedKFold

In [16]:
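# y holds 20 labels while X has only 10 elements, so split() raises the ValueError recorded below;
# the next cell uses a label array whose length matches X
y = np.array([0] * 10 + [1] * 10)
print(y)

skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True, random_state = 0)
for train_indices, test_indices in skf.split(X, y):
    print(train_indices, test_indices)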

[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]


ValueError: Found input variables with inconsistent numbers of samples: [10, 20]
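
The error above comes from passing 10 samples together with 20 labels. One way to avoid such a mismatch, sketched here under the assumption that the stand-in data set may be rebuilt from the labels, is to derive the number of samples from `len(y)`:

```python
import numpy as np
from sklearn import model_selection

y = np.array([0] * 10 + [1] * 10)
# build the stand-in data set from the labels so that both have 20 samples
X = np.arange(len(y))

skf = model_selection.StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
splits = list(skf.split(X, y))
for train_indices, test_indices in splits:
    print(train_indices, test_indices)
```

With matching lengths, each of the two test folds should contain 5 objects of every class.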

In [17]:
target = np.array([0, 1] * 5)
print(target)

skf = model_selection.StratifiedKFold(n_splits = 2, shuffle = True)
for train_indices, test_indices in skf.split(X, target):
    print(train_indices, test_indices)

[0 1 0 1 0 1 0 1 0 1]
[0 2 3 5 9] [1 4 6 7 8]
[1 4 6 7 8] [0 2 3 5 9]
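
To see what stratification changes, it can help to count the classes inside each test fold. The sketch below compares plain `KFold` with `StratifiedKFold` on labels sorted by class. The 8-to-4 label array and the helper `class_counts_per_test_fold` are assumptions introduced for this illustration.

```python
import numpy as np
from sklearn import model_selection

# illustrative labels: 8 objects of class 0 followed by 4 objects of class 1
labels = np.array([0] * 8 + [1] * 4)
data = np.arange(len(labels))

def class_counts_per_test_fold(splitter):
    """Hypothetical helper: [count of class 0, count of class 1] for every test fold."""
    return [np.bincount(labels[test], minlength=2).tolist()
            for _, test in splitter.split(data, labels)]

plain = class_counts_per_test_fold(model_selection.KFold(n_splits=4))
stratified = class_counts_per_test_fold(model_selection.StratifiedKFold(n_splits=4))
print(plain)
print(stratified)
```

Plain `KFold` cuts consecutive blocks, so some test folds contain a single class, whereas the stratified folds keep the 2:1 class ratio of the whole label array.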


#### ShuffleSplit

In [23]:
ss = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2)

for train_indices, test_indices in ss.split(X):
    print(train_indices, test_indices)

[4 0 8 1 5 2 6 3] [9 7]
[4 5 0 6 3 2 1 8] [9 7]
[2 0 5 7 6 8 4 9] [3 1]
[5 7 0 1 6 3 4 8] [9 2]
[8 9 2 7 6 1 5 0] [4 3]
[5 8 3 4 7 1 0 9] [2 6]
[4 8 0 6 9 5 3 2] [7 1]
[0 7 1 6 8 2 9 5] [4 3]
[8 1 7 9 3 0 5 6] [4 2]
[9 1 4 5 7 3 2 0] [6 8]
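
Unlike `KFold`, `ShuffleSplit` draws every iteration independently, so test sets may repeat or overlap: in the output above the pair `[9 7]` is the test set in the first two iterations. A rough sketch that counts how often each element ends up in a test set (the seed and the name `test_appearances` are assumptions for this illustration):

```python
import numpy as np
from sklearn import model_selection

X = np.arange(10)
# random_state=0 is an arbitrary seed so that the sketch is repeatable
ss = model_selection.ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

test_appearances = np.zeros(len(X), dtype=int)
parts_disjoint = True
for train_indices, test_indices in ss.split(X):
    # within one iteration the two parts should never share an element
    parts_disjoint = parts_disjoint and not (set(train_indices) & set(test_indices))
    test_appearances[test_indices] += 1

# across iterations an element may be tested several times or not at all
print(test_appearances)
print(parts_disjoint)
```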


#### StratifiedShuffleSplit

In [19]:
target = np.array([0] * 5 + [1] * 5)
print(target)

sss = model_selection.StratifiedShuffleSplit(n_splits = 4, test_size = 0.2)
for train_indices, test_indices in sss.split(X, target):
    print(train_indices, test_indices)

[0 0 0 0 0 1 1 1 1 1]
[2 7 8 0 6 1 5 3] [9 4]
[6 0 2 1 5 4 9 8] [3 7]
[8 6 3 1 2 7 9 0] [4 5]
[3 0 5 4 8 6 9 2] [7 1]
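
In each iteration above the test set holds one index below 5 and one index of 5 or more, that is, one object per class. The sketch below checks the same property programmatically; the seed and the name `test_class_counts` are assumptions for this illustration.

```python
import numpy as np
from sklearn import model_selection

target = np.array([0] * 5 + [1] * 5)
X = np.arange(len(target))

# random_state=0 is an arbitrary seed so that the sketch is repeatable
sss = model_selection.StratifiedShuffleSplit(n_splits=4, test_size=0.2, random_state=0)

# count the objects of each class in every test set
test_class_counts = [np.bincount(target[test], minlength=2).tolist()
                     for _, test in sss.split(X, target)]
print(test_class_counts)
```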


#### Leave-One-Out

In [14]:
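# the output below was recorded when X held 20 elements, so a 20-element range is split here
loo = model_selection.LeaveOneOut()

for train_indices, test_index in loo.split(range(20)):
    print(train_indices, test_index)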

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [0]
[ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [1]
[ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [2]
[ 0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [3]
[ 0  1  2  3  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [4]
[ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [5]
[ 0  1  2  3  4  5  7  8  9 10 11 12 13 14 15 16 17 18 19] [6]
[ 0  1  2  3  4  5  6  8  9 10 11 12 13 14 15 16 17 18 19] [7]
[ 0  1  2  3  4  5  6  7  9 10 11 12 13 14 15 16 17 18 19] [8]
[ 0  1  2  3  4  5  6  7  8 10 11 12 13 14 15 16 17 18 19] [9]
[ 0  1  2  3  4  5  6  7  8  9 11 12 13 14 15 16 17 18 19] [10]
[ 0  1  2  3  4  5  6  7  8  9 10 12 13 14 15 16 17 18 19] [11]
[ 0  1  2  3  4  5  6  7  8  9 10 11 13 14 15 16 17 18 19] [12]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 14 15 16 17 18 19] [13]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 15 16 17 18 19] [14]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 1...
[output truncated]
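
Leave-One-Out can be viewed as the limiting case of K-fold validation in which the number of folds equals the number of objects. A short sketch comparing the two splitters on a 20-element array (the array size is an assumption chosen to match the output above):

```python
import numpy as np
from sklearn import model_selection

X = np.arange(20)

loo_splits = list(model_selection.LeaveOneOut().split(X))
kfold_splits = list(model_selection.KFold(n_splits=len(X)).split(X))

# both splitters should hold out exactly one element per iteration, in the same order
same = all(np.array_equal(l_tr, k_tr) and np.array_equal(l_te, k_te)
           for (l_tr, l_te), (k_tr, k_te) in zip(loo_splits, kfold_splits))
print(len(loo_splits), same)
```

Because the number of iterations grows with the data set, this strategy tends to be practical only for small samples.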

More cross-validation strategies are available here: http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators