# Understanding scikit-learn data splitting

We are going to work with scikit-learn to practice splitting data into training and test sets. First, make a list of the integers from 0 to 9:

In [6]:
X = list(range(10))
print(X)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


Then we create another list containing the squares of the numbers in X, using a list comprehension:

In [7]:
y = [x * x for x in X]
print(y)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


Next, we will import ```model_selection``` from scikit-learn, and use the function ```train_test_split()``` to split our data into two sets:

In [8]:
import sklearn.model_selection as model_selection

# Define train/test variables for X and y
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=101
)

print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test:  [8, 2, 0]
y_test:  [64, 4, 0]


By specifying ```train_size=0.75```, we ask for 75% of the data to go into the training set and the rest into the test set. Because we only have 10 data points, an exact 7.5:2.5 split is impossible, so scikit-learn rounds it to 7:3. You can omit the ```test_size``` parameter if ```train_size``` is already specified; the test set then receives whatever is left over.
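
The 7:3 split can be checked directly. The snippet below is a small sketch that rebuilds the same lists and counts the elements in each set. The `math.floor` line reflects our understanding that scikit-learn rounds the training count down, which is an assumption about its internals rather than something stated above.

```python
import math

from sklearn.model_selection import train_test_split

# Rebuild the example data: integers 0 to 9 and their squares
X = list(range(10))
y = [x * x for x in X]

# Only train_size is given; the test set receives the remaining samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=101
)

# Assumption: the training count is rounded down from 0.75 * 10 = 7.5
n_train_expected = math.floor(0.75 * len(X))

print(len(X_train), len(X_test))
print(n_train_expected)
```

Both counts should add up to 10, with 7 samples in the training set and 3 in the test set.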
<br>
For example, the following call imports ```train_test_split``` directly, omits ```test_size```, and produces the same 7:3 split:

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=101)

The ```random_state``` argument is a random seed. Fixing it makes the shuffle, and therefore the split, reproducible for anyone who reruns the code. If you want a different split on each run, leave it at its default value, ```None```.
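
To see the effect of the seed, here is a minimal sketch that splits the same data twice with the same ```random_state``` and compares the results. The variable names `first`, `second`, and `same_seed_matches` are ours, chosen only for illustration.

```python
from sklearn.model_selection import train_test_split

# Rebuild the example data: integers 0 to 9 and their squares
X = list(range(10))
y = [x * x for x in X]

# Two independent calls that share the same seed
first = train_test_split(X, y, train_size=0.75, random_state=101)
second = train_test_split(X, y, train_size=0.75, random_state=101)

# Each call returns [X_train, X_test, y_train, y_test]; with equal seeds
# the four lists should match element by element
same_seed_matches = first == second
print(same_seed_matches)
```

Changing one of the two seeds, or leaving ```random_state``` unset, will generally give a different ordering.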
<br>
<br>
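For cross-validation, we can use the ```KFold``` class to split the dataset into ```n_splits``` consecutive folds. Each fold serves once as the test set while the remaining folds form the training set.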

In [16]:
from sklearn.model_selection import KFold
import numpy as np

# Define the KFold splitter with five folds
kf = KFold(n_splits=5)

# Convert X and y to NumPy arrays so they can be indexed with index arrays
X = np.array(X)
y = np.array(y)

# Run the KFold cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)

X_test:  [0 1]
X_test:  [2 3]
X_test:  [4 5]
X_test:  [6 7]
X_test:  [8 9]


By setting ```n_splits=5```, we divided both X and y into five folds (the y folds are not shown here). Notice that each test fold consists of two neighboring values from the original data, because ```KFold``` does not shuffle the data by default.
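
For completeness, the following sketch prints the index arrays and the y values that belong to each test fold. It rebuilds X and y as NumPy arrays so that it runs on its own; the list `y_test_folds` is a helper name introduced here only to collect the results.

```python
import numpy as np
from sklearn.model_selection import KFold

# Rebuild the example data as NumPy arrays
X = np.arange(10)
y = X ** 2

kf = KFold(n_splits=5)

# Collect the y values of each test fold
y_test_folds = []
for train_index, test_index in kf.split(X):
    y_test_folds.append(y[test_index].tolist())
    print("train idx:", train_index, "test idx:", test_index,
          "y_test:", y[test_index])
```

Because the folds are consecutive, each y test fold holds the squares of two neighboring integers, starting with 0 and 1.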
<br>
<br>
However, using ```kf = KFold(n_splits=5, shuffle=True)``` shuffles the data before splitting, which gives the same kind of mixing we saw earlier with ```train_test_split()```.
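
A short sketch of the shuffled variant follows. The ```random_state=101``` value is our own addition so that the shuffle is repeatable; it is not required. The exact fold contents depend on the seed, so none are listed here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Rebuild the example data as a NumPy array
X = np.arange(10)

# shuffle=True mixes the samples before they are divided into folds
kf = KFold(n_splits=5, shuffle=True, random_state=101)

test_folds = [X[test_index].tolist() for _, test_index in kf.split(X)]
print(test_folds)

# Every sample still appears in exactly one test fold
all_tested = sorted(value for fold in test_folds for value in fold)
print(all_tested)
```

Even with shuffling, each of the ten values lands in exactly one test fold, so the folds still cover the whole dataset.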
<br>
<br>
In addition, scikit-learn provides built-in functions that compute error metrics across multiple test folds, which is useful for evaluating machine learning models.
<br>
<br>
For example,
```
model_selection.cross_val_score(model, X, y, cv=kf, scoring='neg_mean_absolute_error')
```
returns an array with one score per fold. Each score is the negated mean absolute error, so values closer to zero indicate a better fit.
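
Here is a runnable sketch of that call. The text above does not say what ```model``` is, so we assume a plain ```LinearRegression``` purely for illustration. X is also reshaped into a single-column 2-D array, because scikit-learn estimators expect features in that shape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Features must be 2-D: ten samples, one feature each
X = np.arange(10).reshape(-1, 1)
y = X.ravel() ** 2

# Assumption: a linear model stands in for the unspecified `model`
model = LinearRegression()
kf = KFold(n_splits=5)

# One negated mean absolute error per fold
scores = cross_val_score(model, X, y, cv=kf,
                         scoring="neg_mean_absolute_error")

# Flip the sign to read the values as ordinary MAE
mae_per_fold = -scores
print("MAE per fold:", mae_per_fold)
print("Mean MAE:", mae_per_fold.mean())
```

Averaging the per-fold values gives a single summary number. A straight line is a poor fit for squared data, so the errors here are expected to be noticeably above zero.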
