https://medium.com/@julie.yin/understanding-the-data-splitting-functions-in-scikit-learn-9ae4046fbd26

### Train/test split

In [6]:
import sklearn.model_selection as model_selection

In [5]:
# create list
X = list(range(10))
print(X)

# create squares list
y = [x*x for x in X]
print(y)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [7]:
# To disable shuffling, set the shuffle parameter as False (default = True).

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=101
)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test:  [8, 2, 0]
y_test:  [64, 4, 0]


Because we only have ten data points, the 75:25 ratio cannot be met exactly, so train_test_split rounded it to 7:3. It's okay to omit the test_size parameter if train_size is already specified: the test set is then the complement of the training set, although some older scikit-learn releases print a warning in that case.
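As a minimal sketch of how the split sizes come out, the snippet below passes only train_size and turns shuffling off so the result is easy to check by eye. It assumes a recent scikit-learn release, where an omitted test_size is taken as the complement of train_size.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# Only train_size is given; the remaining samples form the test set.
# shuffle=False keeps the original order, so the split is a simple cut.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, shuffle=False
)

print(len(X_train), len(X_test))
print("X_train: ", X_train)
print("X_test: ", X_test)
```

With ten samples, the training set gets seven points and the test set the remaining three, which matches the 7:3 split seen above.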
 

### Cross Validation

In [11]:
# The legacy sklearn.cross_validation module was deprecated in scikit-learn 0.18
# and removed in 0.20; train_test_split now lives in sklearn.model_selection.
# Note that cross_validate is a function in sklearn.model_selection, not a
# submodule, so
#     import sklearn.model_selection.cross_validate as cross_validation
# fails with:
#     ModuleNotFoundError: No module named 'sklearn.model_selection.cross_validate'

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.75, random_state=101)

This call generates exactly the same split as above, given that we passed the same value to random_state. If you want a different random split on each run, simply leave random_state at its default value, None.
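The effect of random_state can be checked directly. The following is a small sketch, not part of the original notebook, that splits the same data twice with the same seed and once without a seed.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# Same seed twice: the two splits should be identical.
first = train_test_split(X, y, train_size=0.75, random_state=101)
second = train_test_split(X, y, train_size=0.75, random_state=101)
print(first == second)

# No seed: the split comes from NumPy's global random state,
# so it may differ from one run to the next.
unseeded = train_test_split(X, y, train_size=0.75)
print("X_test: ", unseeded[1])
```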

In [13]:
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5)
X = np.array(X)
y = np.array(y)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)
    
# The data points were not shuffled: KFold defaults to shuffle=False, unlike train_test_split, which defaults to shuffle=True.

X_test:  [0 1]
X_test:  [2 3]
X_test:  [4 5]
X_test:  [6 7]
X_test:  [8 9]


In [14]:
kf = KFold(n_splits=5, shuffle=True)
X = np.array(X)
y = np.array(y)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)
    
# With shuffle=True, KFold mixes the data points before forming the folds, much like train_test_split does.

X_test:  [3 6]
X_test:  [1 9]
X_test:  [0 5]
X_test:  [2 8]
X_test:  [4 7]
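Because the shuffled KFold above has no random_state, its folds change on every run. A possible way to make them reproducible is sketched below; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# random_state fixes the shuffle, so the same folds come back on every run.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
test_folds = [X[test_index] for _, test_index in kf.split(X)]
for fold in test_folds:
    print("X_test: ", fold)

# Every sample lands in exactly one test fold.
all_test = np.sort(np.concatenate(test_folds))
print(np.array_equal(all_test, X))
```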


In addition, scikit-learn provides built-in functions that compute an evaluation metric on each test fold, which makes it easy to evaluate machine learning models across multiple folds. For example, with model being any scikit-learn estimator:
model_selection.cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error')
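To make the call above concrete, here is a minimal runnable sketch. The choice of LinearRegression as the model and the seed 0 are assumptions made only for illustration; the original does not say which estimator is used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.arange(10).reshape(-1, 1)  # estimators expect a 2-D feature array
y = (X ** 2).ravel()              # the same squares as in the lists above

model = LinearRegression()        # assumed model, for illustration only
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; the metric is negated so that higher is always better.
scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_absolute_error")
print(scores)
print("mean absolute error:", -scores.mean())
```

The scores come back negative because scikit-learn's scoring convention is "greater is better", so flip the sign to read them as ordinary mean absolute errors.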