<h2><font color="#004D7F" size=6>Module 3. Data Preprocessing</font></h2>

<h1><font color="#004D7F" size=5>3. Resampling Methods</font></h1>

<br><br>

<h2><font color="#004D7F" size=5>Index</font></h2>
<a id="indice"></a>

* [1. Introduction](#section1)
    * [1.1. Libraries and CSV](#section11)
* [2. Cross-Validation](#section2)
    * [2.1. _k_-fold Cross Validation](#section21)
    * [2.2. Repeated Cross-Validation](#section22)
    * [2.3. Leave One Out Cross-Validation](#section23)
* [3. Percentage Split](#section3)
    * [3.1. Train/Test Percentage Split](#section31)
    * [3.2. Random Repeated Train/Test Split](#section32)
* [4. Choosing a Technique](#section4)

---
<a id="section1"></a>
# <font color="#004D7F">1. Introduction</font>


An evaluation is an estimate of how well we believe the algorithm will perform in practice; it is not a guarantee of performance. Once we have estimated the performance of our algorithm, we can retrain the final model on the entire training dataset and prepare it for operational use. Below, we look at four techniques for splitting the training dataset to obtain useful performance estimates for our Machine Learning algorithms:

* How to split a dataset into training and validation subsets by percentage.
* How to assess model robustness using _k_-fold cross-validation, with and without repetitions.
* How to assess model robustness using leave-one-out cross-validation (LOOCV).
* How to evaluate a model using random repeated train/test splits.
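
The "estimate first, then retrain on everything" workflow described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `estimate_model` and `final_model` are assumptions introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the notebook itself uses the Pima CSV)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

# Step 1: estimate performance with a resampling method (5-fold cross-validation here)
estimate_model = LogisticRegression(max_iter=1000)
scores = cross_val_score(estimate_model, X_demo, y_demo, cv=5)
print(f"Estimated accuracy: {scores.mean():.3f} (std {scores.std():.3f})")

# Step 2: retrain a fresh model on all available data for operational use
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_demo, y_demo)
```

The scores from step 1 describe how a model of this kind is expected to behave on unseen data; the model fitted in step 2 is the one that would be deployed.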

<a id="section11"></a>
## <font color="#004D7F">1.1. Libraries and CSV</font>

As always, we load the CSV file that we are going to use. Additionally, we will load the main libraries that we will use in this section.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

filename = 'data/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names = names)
array = data.values

X = array[:, 0:8]  # Features: all rows, columns 0 through 7 (the slice end is exclusive)
Y = array[:, 8]    # Target: all rows, last column (index 8)
print(X)

[[  6.  148.   72.  ...  33.6 627.   50. ]
 [  1.   85.   66.  ...  26.6 351.   31. ]
 [  8.  183.   64.  ...  23.3 672.   32. ]
 ...
 [  5.  121.   72.  ...  26.2 245.   30. ]
 [  1.  126.   60.  ...  30.1 349.   47. ]
 [  1.   93.   70.  ...  30.4 315.   23. ]]


<a id="section2"></a>
# <font color="#004D7F">2. Cross-Validation</font>


Cross-validation is a process in which the dataset is divided into $K$ partitions, or folds, and $K$ different evaluations are performed, so that every case appears in the test set exactly once. In evaluation $i$, partition $i$ serves as the test set and the remaining $K-1$ partitions form the training set. Finally, the results obtained in the different evaluations are averaged. The following image shows an example of this:
<img src="https://static.oschina.net/uploads/img/201609/26155106_OfXx.png" alt="cross-validation" width="500">
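
The index bookkeeping behind this scheme can be checked on a toy array. The sketch below is only an illustration: the ten-sample array `X_toy` and the counter `test_counts` are hypothetical names introduced here, and `KFold` is used without shuffling so that the folds are contiguous blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (hypothetical data, used only to show how indices are split)
X_toy = np.arange(10).reshape(-1, 1)

kfold_toy = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block
test_counts = np.zeros(len(X_toy), dtype=int)

for i, (train_idx, test_idx) in enumerate(kfold_toy.split(X_toy)):
    # Count how many times each sample lands in a test fold
    test_counts[test_idx] += 1
    print(f"Evaluation {i}: test={test_idx}, train={train_idx}")

# Each entry counts how often that sample was used for testing
print(test_counts)
```

With five folds over ten samples, each evaluation tests on two samples and trains on the other eight, and every sample is tested once across the five evaluations.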

<a id="section21"></a>
## <font color="#004D7F">2.1. k-fold Cross Validation</font>


The _k_-fold cross-validation method splits the dataset into _k_ partitions, also called folds. Each fold is held out in turn while the model is trained on the remaining _k_ - 1 folds and then evaluated on the held-out fold. The process is repeated until every fold has served as the test set, so each instance is used for testing exactly once, and the _k_ scores are combined into an overall accuracy estimate. It is a robust method for estimating accuracy, and the choice of _k_ affects the bias of the estimate; 5 and 10 are popular values.

In the example below, we report both the mean and the standard deviation of the performance measure. When summarizing performance measures, it is good practice to describe their distribution; here we assume the scores are approximately Gaussian (usually a reasonable assumption) and report the mean and the standard deviation.

In [7]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression(solver='lbfgs', max_iter=1000)

results = cross_val_score(model, X, Y, cv=kfold)
print(f"Accuracy: {results.mean()*100.0:.2f}% ({results.std()*100.0:.2f}%)")


Accuracy: 77.34% (4.90%)
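
To make explicit what `cross_val_score` does with a `KFold` splitter, the same scores can be reproduced with a manual loop. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the Pima CSV so that it runs on its own, and the names `X_syn`, `y_syn`, `kfold_syn`, `fold_scores` and `reference` are introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the dataset used in the notebook)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=7)

# Integer random_state makes repeated calls to split() return the same folds
kfold_syn = KFold(n_splits=10, shuffle=True, random_state=7)

fold_scores = []
for train_idx, test_idx in kfold_syn.split(X_syn):
    # Train a fresh model on the k-1 training folds
    fold_model = LogisticRegression(solver='lbfgs', max_iter=1000)
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])
    # Score (accuracy) on the single held-out fold
    fold_scores.append(fold_model.score(X_syn[test_idx], y_syn[test_idx]))
fold_scores = np.array(fold_scores)

# The helper performs the same fit/score cycle internally
reference = cross_val_score(
    LogisticRegression(solver='lbfgs', max_iter=1000), X_syn, y_syn, cv=kfold_syn
)

print(f"Manual loop:     {fold_scores.mean()*100.0:.2f}% ({fold_scores.std()*100.0:.2f}%)")
print(f"cross_val_score: {reference.mean()*100.0:.2f}% ({reference.std()*100.0:.2f}%)")
```

The two summaries should agree, because the default score for a classifier is accuracy and the splitter yields identical folds on both passes.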
