# Holdout Technique

- The simplest method to evaluate a classifier.
- Randomly divide the dataset into a training set and a test set.
- The drawback is that the result depends on which samples land in each set, so a single random split can give a misleading evaluation.
- It works well with a large dataset, where one split is less likely to be unrepresentative.

- The larger part of the data goes to the training set and the remainder to the test set, usually 70/30.
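
The split described above can be sketched by hand with NumPy. This is only an illustration of the idea, not how `train_test_split` is implemented; the helper name `holdout_split`, its parameters, and the toy data are made up for this example.

```python
import numpy as np

def holdout_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle the row indices once, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = holdout_split(X, y)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Changing `seed` changes which rows end up in the test set, which is the source of the variability mentioned in the list.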

<img src="http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png"/>

# Cross Validation Technique

- Avoids the dependence on a single random split (high variance in results) that occurs in the holdout technique.

- The objective is to train and validate the model with all the available data.

- The result is more robust (a better indication of how well the model generalizes).
- It can be slow: the model is trained and validated k times (once per fold), which is costly on a large dataset.

- In cross validation the data is divided into k portions (folds); one portion is used for validation and the rest for training.
- In the image below, k (the number of folds) is 5.

In each fold (iteration) the training and validation data change, so we train k different models, one per fold.
The final (overall) result is the mean of the per-fold results for the chosen evaluation metric.

This technique is useful for checking which ML algorithm is the best for the specific problem we want to solve.
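
The loop over folds can be written out explicitly. The sketch below uses scikit-learn's `KFold` to produce the index splits and trains a fresh `SVC` per fold; the choice of `shuffle=True`, `random_state=0`, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# shuffle=True because the iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    model = svm.SVC(gamma='auto')  # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores)         # one accuracy per fold
print(fold_scores.mean())  # overall result: mean over the folds
```

Every sample is used for validation exactly once and for training in the other four iterations.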

<img src="https://miro.medium.com/max/800/1*kkMtezwv8qj1t9uG4nw_8g.png" title="CrossValidationImage">


In [13]:
import numpy as np
from sklearn.model_selection import train_test_split # holdout technique
from sklearn.model_selection import cross_val_score # cross validation
from sklearn import datasets # dataset
from sklearn import svm # support vector machine

In [6]:
import pandas as pd

In [86]:
dataset = datasets.load_iris()

In [87]:
dataset.data.shape

(150, 4)

In [88]:
dataset.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [89]:
dataset.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [90]:
dataset.target 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# Holdout technique

Randomly separates the data into training and test sets.

In [91]:
# test_size defines the fraction of the data held out for testing
# dataset.data holds the features (X) and dataset.target holds the labels (y)
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.3) # 70% train, 30% test
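
If a repeatable split is wanted, `train_test_split` also accepts `random_state` and `stratify`. The sketch below is a variant of the call above, not what the notebook ran; the seed `42` and the shortened variable names are arbitrary choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target,
    test_size=0.3,
    random_state=42,       # arbitrary seed: the same split on every run
    stratify=iris.target,  # keep the class proportions in both sets
)

print(X_tr.shape, X_te.shape)  # (105, 4) (45, 4)
print(np.bincount(y_te))       # [15 15 15]
```

With `stratify`, each of the three iris classes contributes the same number of samples to the test set.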

# SVM Classifier

In [92]:
# instantiate the classifier
sv = svm.SVC(gamma='auto')

In [93]:
# Train the algorithm to generate the model
sv.fit(x_train, y_train)

SVC(gamma='auto')

In [94]:
# evaluate the model: score() returns the mean accuracy on the given test data and labels
sv.score(x_test, y_test)

0.9333333333333333

In [95]:
# predict the class of each test sample
prediction = sv.predict(x_test)

In [96]:
right_predictions = (y_test == prediction).sum() # number of correct predictions

In [97]:
(right_predictions / len(y_test)) * 100 # percentage of correct predictions

# the result matches the value returned by the score method

93.33333333333333
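
The same equivalence can be checked in a standalone sketch that also includes `accuracy_score` from `sklearn.metrics`. The fixed `random_state=0` and the variable names are assumptions added so the example is repeatable.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = svm.SVC(gamma='auto').fit(X_tr, y_tr)
pred = model.predict(X_te)

manual = (y_te == pred).sum() / len(y_te)  # fraction of correct predictions
via_metric = accuracy_score(y_te, pred)    # same quantity from sklearn.metrics
via_score = model.score(X_te, y_te)        # same quantity from the estimator

print(manual, via_metric, via_score)
```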

In [98]:
# if I run the split again
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.3) # 70% train, 30% test

In [99]:
# Train the algorithm again to generate a new model
sv.fit(x_train, y_train)

SVC(gamma='auto')

In [100]:
# Evaluate the model again: score() returns the mean accuracy on the given test data and labels
sv.score(x_test, y_test)

# The score (accuracy) changes because the split is done randomly

# So the result of this technique depends on which data ends up in training and test (high variance)

# If the training and test results are close, the model is probably good, but if the data vary a lot
# between splits, one unlucky split can produce a poor model

0.9777777777777777
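
To see how much the holdout score moves, the split, fit, and score steps can be repeated over several seeds. This is a rough sketch; the 20 repetitions and the seed range are arbitrary choices, not part of the run above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

holdout_scores = []
for seed in range(20):  # 20 different random splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    model = svm.SVC(gamma='auto').fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

holdout_scores = np.array(holdout_scores)
print(holdout_scores.min(), holdout_scores.max())  # range of accuracies
print(holdout_scores.mean(), holdout_scores.std()) # centre and spread
```

The gap between the minimum and maximum shows how far a single holdout score can be from the typical value.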

# Cross Validation

cross_val_score automates this process.

In [101]:
# the function receives as parameters:
# the instantiated model, the data (X), the target (y), the number of folds, and the evaluation metric
# it splits the dataset into training and validation folds internally
scores = cross_val_score(sv, dataset.data, dataset.target, cv=5, scoring='accuracy')

In [102]:
scores # each value in the array is the accuracy for one fold

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [103]:
# the overall result of the model is the mean of the per-fold results
scores.mean()

0.9800000000000001
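
Since cross validation is useful for comparing algorithms, here is a minimal sketch that scores two classifiers on the same 5 folds. `KNeighborsClassifier` with `n_neighbors=5` is a candidate chosen only for illustration; it does not appear in the notebook above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Candidate models to compare (the KNN entry is an assumption for this example)
candidates = {
    'SVC': svm.SVC(gamma='auto'),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    cv_scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='accuracy')
    results[name] = (cv_scores.mean(), cv_scores.std())
    print(f'{name}: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}')
```

Reporting the standard deviation next to the mean shows whether a difference between two models is larger than the fold-to-fold variation.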