<h1>k-fold Cross Validation</h1>

<h3>Concerns with Training and Test Set</h3>

<p>We evaluate a model because we want an accurate measure of how well it performs. If our dataset is small, our test set will be small too. It may then not be a representative sample of the data: by random chance it can end up with mostly easy or mostly difficult datapoints.</p>

<p>
    The code below randomly splits the data, using 75% to build a model and the remaining 25% of the original dataset to evaluate it, and repeats this several times in a loop.
</p>

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import numpy as np

df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv')
df['male'] = df['Sex'] == 'male'
X = df[['Pclass', 'male', 'Age', 'Siblings/Spouses', 'Parents/Children', 'Fare']].values
y = df['Survived'].values

for i in range(0, 5):
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    # building the model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    # evaluating the model
    y_pred = model.predict(X_test)
    print(f"*Run nr.: {i}")
    print(" accuracy: {0:.5f}".format(accuracy_score(y_test, y_pred)))
    print("precision: {0:.5f}".format(precision_score(y_test, y_pred)))
    print("   recall: {0:.5f}".format(recall_score(y_test, y_pred)))
    print(" f1 score: {0:.5f}".format(f1_score(y_test, y_pred)), "\n")

*Run nr.: 0
 accuracy: 0.81532
precision: 0.75000
   recall: 0.68000
 f1 score: 0.71329 

*Run nr.: 1
 accuracy: 0.79279
precision: 0.85526
   recall: 0.65000
 f1 score: 0.73864 

*Run nr.: 2
 accuracy: 0.81081
precision: 0.72840
   recall: 0.74684
 f1 score: 0.73750 

*Run nr.: 3
 accuracy: 0.81532
precision: 0.80723
   recall: 0.72826
 f1 score: 0.76571 

*Run nr.: 4
 accuracy: 0.79730
precision: 0.75342
   recall: 0.67073
 f1 score: 0.70968 



<p>
    You can see that each time we run it, we get different values for the metrics.
</p>
<ul>
    <li>The accuracy ranges from 0.79 to 0.82</li>
    <li>The precision ranges from 0.73 to 0.86</li>
    <li>The recall ranges from 0.65 to 0.75</li>
    <li>The f1 score ranges from 0.71 to 0.77</li>
</ul>
<p>
    The ranges for precision and recall in particular are wide, and they depend only on how lucky or unlucky we were in which datapoints ended up in the test set.
</p>
<p>
    Since our goal is to get the best possible measure of our metrics (accuracy, precision, recall and F1 score), we can do a little better than just a single training and test set. So instead of doing a single train/test split, we’ll split our data into a training set and test set multiple times.
</p>

<strong>Conclusion: Splitting the dataset into a single training set and test set for evaluation purposes might yield an inaccurate measure of the evaluation metrics when the dataset is small.</strong>

<h3>Multiple Training and Test Sets</h3>

<p>
    We want to get a measure of how well our model does in general, not just a measure of how well it does on one specific test set. We approach this by doing the following:
</p>
<ul>
    <li>Let’s assume we have 200 datapoints in our dataset</li>
    <li>We break our dataset into 5 chunks</li>
    <li>Each of these 5 chunks will serve as a test set. When Chunk 1 is the test set, we use the remaining 4 chunks as the training set.</li>
</ul>

<p>Thus we have 5 training and test sets as follows.</p>

<table border="1">
    <tr>
        <th>Split nr.</th>
        <th>Chunk 1</th>
        <th>Chunk 2</th>
        <th>Chunk 3</th>
        <th>Chunk 4</th>
        <th>Chunk 5</th>
        <th>Accuracy</th>
    </tr>
  <tr>
      <td style="background-color: white;">1</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: yellow;">Test</td>
      <td style="background-color: white;">0.83</td>
  </tr>
  <tr>
      <td style="background-color: white;">2</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: yellow;">Test</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: white;">0.79</td>
  </tr>
  <tr>
      <td style="background-color: white;">3</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: yellow;">Test</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: white;">0.78</td>
  </tr>
  <tr>
      <td style="background-color: white;">4</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: yellow;">Test</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: white;">0.80</td>
  </tr>
  <tr>
      <td style="background-color: white;">5</td>
      <td style="background-color: yellow;">Test</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: lightblue;">Train</td>
      <td style="background-color: white;">0.75</td>
  </tr>
</table>

<strong>In each of the 5 splits we have a test set of 20% (40 datapoints) and a training set of 80% (160 datapoints). And every datapoint is in exactly 1 test set.</strong>
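
<p>The chunking described above can be sketched in plain Python. This is a minimal illustration, assuming the 200 datapoints are identified by the indices 0 to 199 and are split in order without shuffling (in practice the data is usually shuffled first). All variable names are illustrative.</p>

```python
# Illustrative sketch: split 200 datapoint indices into 5 equal chunks
# and build the 5 train/test splits by hand.
n_datapoints = 200
k = 5
indices = list(range(n_datapoints))
chunk_size = n_datapoints // k  # 40, assuming n_datapoints divides evenly by k

# chunks[j] holds the indices of chunk j
chunks = [indices[j * chunk_size:(j + 1) * chunk_size] for j in range(k)]

splits = []
for test_chunk in range(k):
    # One chunk is the test set, the remaining k - 1 chunks form the training set
    test_idx = chunks[test_chunk]
    train_idx = [i for j, chunk in enumerate(chunks) if j != test_chunk for i in chunk]
    splits.append((train_idx, test_idx))
    print(f"Split {test_chunk + 1}: train={len(train_idx)}, test={len(test_idx)}")

# Every datapoint should appear in exactly one test set
all_test = sorted(i for _, test_idx in splits for i in test_idx)
print("each datapoint tested exactly once:", all_test == indices)
```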

<h3>Building and Evaluating with Multiple Training and Test Sets</h3>

<ul>
    <li>Now, for each training set, we build a model and evaluate it using the associated test set. Thus we build 5 models and calculate 5 scores.</li>
    <li>Then we report the accuracy as the mean of the 5 values:</li>
</ul>
$$\text{Accuracy} = \frac{0.83+0.79+0.78+0.80+0.75}{5} = 0.79$$

<p>If we had just done a single training and test set and had randomly gotten the first one, we would have reported an accuracy of 0.83.</p>
<p>If we had randomly gotten the last one, we would have reported an accuracy of 0.75.</p>
<p>Averaging over all 5 splits reduces the impact of which test set a datapoint lands in.</p>
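
<p>The averaging can be checked in a few lines. This sketch uses the five example accuracies from the table; the standard deviation is an addition here, included only as one possible way to summarise how much the per-split scores vary.</p>

```python
from statistics import mean, stdev

# The five per-split accuracies from the example table
fold_accuracies = [0.83, 0.79, 0.78, 0.80, 0.75]

mean_accuracy = mean(fold_accuracies)  # the value we report
spread = stdev(fold_accuracies)        # sample standard deviation across splits

print(f"mean accuracy: {mean_accuracy:.2f}")
print(f"std deviation: {spread:.3f}")
print(f"lowest / highest single split: {min(fold_accuracies)} / {max(fold_accuracies)}")
```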

<p>This process for creating multiple training and test sets is called <strong>k-fold cross validation</strong>. The k is the number of chunks we split our dataset into. A common choice is k = 5, as in our example above.</p>

<p>Our goal in cross validation is to get accurate measures for our metrics (accuracy, precision, recall). We are building extra models in order to feel confident in the numbers we calculate and report.</p>
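
<p>As a rough sketch, scikit-learn's <code>KFold</code> and <code>cross_validate</code> can run this procedure for several metrics at once. The example assumes synthetic stand-in data from <code>make_classification</code> (not the Titanic dataset) so that it runs without a download; the choice of 200 samples, 6 features and <code>random_state=0</code> is arbitrary.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (an assumption for this sketch, not the Titanic dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 5 folds; shuffling guards against any ordering in the rows
kf = KFold(n_splits=5, shuffle=True, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]

# Fits one model per fold and scores it on that fold's test set
results = cross_validate(LogisticRegression(), X, y, cv=kf, scoring=metrics)

for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric:>9}: folds={fold_scores.round(2)}, mean={fold_scores.mean():.3f}")
```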

<h3>Final Model Choice in k-fold Cross Validation</h3>

<p>
    These 5 models were built just for evaluation purposes, so that we can report the metric values. We don’t actually need these models; what we want is to build the best possible model.
</p>

<p>
    The best possible model is going to be a model that uses all of the data. So we keep track of our calculated values for our evaluation metrics and then build a model using all of the data.
</p>
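
<p>The two steps can be sketched as follows: first estimate the metric with cross validation, then fit a fresh model on the full dataset. This again assumes synthetic stand-in data from <code>make_classification</code>; the name <code>final_model</code> is illustrative.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Step 1: estimate performance with 5-fold cross validation.
# With an integer cv and a classifier, scikit-learn uses stratified folds.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(f"estimated accuracy: {cv_scores.mean():.3f}")

# Step 2: the 5 evaluation models are discarded; fit the final model on all the data
final_model = LogisticRegression()
final_model.fit(X, y)
```

<p>The cross-validated mean is the number to report; <code>final_model</code> is the model to keep.</p>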

<p>
    This may seem incredibly wasteful, but computers have a lot of computation power, so it’s worth using a little extra to make sure we’re reporting the right values for our evaluation metrics. We’ll be using these values to make decisions, so calculating them correctly is very important.
</p>

<strong>
    Computation time can be a concern when the dataset is large, because k-fold cross validation builds k models instead of one. In these cases, we just do a single train/test split.
</strong>