<h1>K-fold Cross Validation in Sklearn</h1>

<h3>KFold Class</h3>

<p>Scikit-learn has already implemented the code to break the dataset into k chunks and create k training and test sets.</p>

<p>For simplicity, let’s take a dataset with just 6 datapoints and 2 features and run a 3-fold cross validation on it. We’ll take the first 6 rows from the Titanic dataset and use just the Age and Fare columns.</p>

In [1]:
from sklearn.model_selection import KFold
import pandas as pd

df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
X = df[['Age', 'Fare']].values[:6]
y = df['Survived'].values[:6]

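<p>
    We start by instantiating a KFold class object. We will set three of its parameters: 
</p>
<ul>
    <li>n_splits (this is k, the number of chunks to create)</li>
    <li>shuffle (whether or not to randomize the order of the data)</li>
    <li>random_state (a seed that makes the shuffling reproducible; it only has an effect when shuffle is True)</li>
</ul>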
<strong>
    It’s generally good practice to shuffle the data since you often get a dataset that’s in a sorted order.
</strong>
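<p>To see why shuffling matters, here is a minimal sketch of what KFold does when shuffle is left at its default of False. The array X_toy is made-up illustration data, not the Titanic rows. Without shuffling, each test fold is simply a consecutive block of rows, so a sorted dataset would produce unrepresentative folds.</p>

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 datapoints with 2 made-up features each (not the Titanic rows).
X_toy = np.arange(12).reshape(6, 2)

# With shuffle=False (the default), the test folds are consecutive blocks.
kf_ordered = KFold(n_splits=3, shuffle=False)
ordered_tests = [test.tolist() for _, test in kf_ordered.split(X_toy)]
print(ordered_tests)  # [[0, 1], [2, 3], [4, 5]]
```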

In [2]:
kf = KFold(n_splits=3, shuffle=True, random_state=99)

<p>The KFold class has a split method that creates the 3 splits for our data.</p>
<p>This split method returns a generator, so we use the list function to turn it into a list.</p>

In [3]:
splits = list(kf.split(X))
for set_index, (train_indices, test_indices) in enumerate(splits):
    print(f"Training set nr. {set_index}: {train_indices}")
    print(f"Test set nr. {set_index}: {test_indices}", "\n")
    

Training set nr. 0: [0 1 3 5]
Test set nr. 0: [2 4] 

Training set nr. 1: [1 2 3 4]
Test set nr. 1: [0 5] 

Training set nr. 2: [0 2 4 5]
Test set nr. 2: [1 3] 



<p>
    As we can see, we have 3 training and testing sets as expected:
</p>

<ul>
    <li>The first training set is made up of datapoints 0, 1, 3, 5 and the test set is made up of datapoints 2, 4.</li>
    <li>The second training set is made up of datapoints 1, 2, 3, 4 and the test set is made up of datapoints 0, 5.</li>
    <li>The third training set is made up of datapoints 0, 2, 4, 5 and the test set is made up of datapoints 1, 3.</li>
</ul>
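<p>The pattern in the list above, where every datapoint lands in exactly one test set, holds for any k-fold split. A small sketch of how one might check this, again using made-up toy data (X_toy and the variable names are assumptions for illustration):</p>

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 6 datapoints with 2 made-up features each.
X_toy = np.arange(12).reshape(6, 2)
kf_check = KFold(n_splits=3, shuffle=True, random_state=99)

all_test_indices = []
for train_idx, test_idx in kf_check.split(X_toy):
    # Within a split, training and test indices never overlap.
    assert set(train_idx).isdisjoint(test_idx)
    all_test_indices.extend(test_idx.tolist())

# Across all splits, the test sets together cover each index exactly once.
covers_everything = sorted(all_test_indices) == list(range(len(X_toy)))
print(covers_everything)  # True
```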

<h3>Creating Training and Test Sets with the Folds</h3>

<p>Now let's pull out the training and testing datapoints of the first split:</p>

In [4]:
first_split = splits[0]
train_indices, test_indices = first_split
print("training set indices:", train_indices)
print("test set indices:", test_indices)

X_train = X[train_indices]
X_test = X[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]
print("X_train")
print(X_train)
print("y_train", y_train)
print("X_test")
print(X_test)
print("y_test", y_test)

training set indices: [0 1 3 5]
test set indices: [2 4]
X_train
[[22.      7.25  ]
 [38.     71.2833]
 [35.     53.1   ]
 [27.      8.4583]]
y_train [0 1 1 0]
X_test
[[26.     7.925]
 [35.     8.05 ]]
y_test [1 0]


<strong>At this point, we have training and test sets in the same format as we did using the train_test_split function.</strong>

<h3>Build a Model</h3>

<p>
    Now we can use the training and test sets to build a model and make a prediction like before. Let’s go back to using the entire dataset (since 4 datapoints is not enough to build a decent model).
</p>

In [5]:
from sklearn.linear_model import LogisticRegression

df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values

kf = KFold(n_splits=5, shuffle=True)

splits = list(kf.split(X))
train_indices, test_indices = splits[0]
X_train = X[train_indices]
X_test = X[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]

model = LogisticRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.8370786516853933


<strong>So far, we’ve essentially done a single train/test split. In order to do a k-fold cross validation, we need to use each of the other 4 splits to build and score a model.</strong>

<h3>Loop Over All the Folds</h3>

<p>We have been doing one fold at a time, but really we want to loop over all the folds to get all the values. We will put the code from the previous part inside our for loop.</p>

In [6]:
import numpy as np

scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
print(scores)
print(np.mean(scores))
final_model = LogisticRegression()
final_model.fit(X, y)

[0.7696629213483146, 0.7921348314606742, 0.7966101694915254, 0.7853107344632768, 0.8418079096045198]
0.7971053132736622


<p>Since we have 5 folds, we get 5 accuracy values. Recall that to get a single final value, we take the mean of those values.</p>

In [7]:
print(np.mean(scores))

0.7971053132736622
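<p>As an aside, scikit-learn also provides a cross_val_score helper that runs this kind of fit-and-score loop in one call. The sketch below uses synthetic data from make_classification as a stand-in (an assumption for illustration, so the scores will not match the Titanic numbers above). For a classifier, the default score is accuracy, the same quantity model.score returns.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Passing a KFold object as cv makes cross_val_score use exactly those splits.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=kf_demo)

print(cv_scores)           # one accuracy value per fold
print(np.mean(cv_scores))  # the single cross-validated accuracy
```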


<p>
    Now that we’ve calculated the accuracy, we no longer need the 5 different models that we’ve built. For future use, we just want a single model. 
</p>
<p>  
    To get the single best possible model, we build a model on the whole dataset.
</p>
<p>
   If we’re asked for the accuracy of this model, we use the accuracy calculated by cross validation above (about 0.80) even though we haven’t actually tested this particular model with a test set.
</p>

<p>
    Expect to get slightly different values every time you run the code. The KFold class is randomly splitting up the data each time, so a different split will result in different scores, though you should expect the average of the 5 scores to generally be about the same.
</p>
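<p>If you need the same splits on every run, for example to compare two models fairly, you can pass a fixed random_state. A minimal sketch, using made-up toy data and a hypothetical helper named fold_test_indices:</p>

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 datapoints with 2 made-up features each.
X_toy = np.arange(20).reshape(10, 2)

def fold_test_indices(seed):
    """Return the test indices of each fold for a given seed (hypothetical helper)."""
    kf_seeded = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kf_seeded.split(X_toy)]

# The same seed always reproduces the same shuffled folds.
same_seed_matches = fold_test_indices(42) == fold_test_indices(42)
print(same_seed_matches)  # True
```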

<strong>Here is how to build the final model:</strong>

In [8]:
final_model = LogisticRegression()
final_model.fit(X, y)

<strong>And here is a single prediction for a 45-year-old male passenger in second class, who has 1 sibling/spouse and no parents/children on board and paid a fare of 1000 for his ticket:</strong>

In [9]:
prediction = final_model.predict([[2, True, 45, 1, 0, 1000]])

print(prediction)

[1]


<strong>The model predicts that he survived!</strong>