## Cross-validation

- Cross-validation is a statistical technique for estimating how well a machine learning model will perform on data it has not seen.
- It helps detect overfitting in a predictive model, particularly when the amount of data is limited.
- In cross-validation, you split the data into a fixed number of folds (or partitions), train on all folds but one, evaluate on the held-out fold, repeat until every fold has been held out once, and then average the resulting error estimates.

Cross-validation is commonly used in machine learning to compare different models and select the most appropriate one for a specific problem. It is easy to understand, easy to implement, and tends to give a less biased performance estimate than a single train-test split. Now, let’s explore the main cross-validation techniques.


### The Train-Test Split technique for Cross Validation

- The train-test split approach randomly splits a dataset into two parts. One part is used to train the machine learning model, and the other part is held out to test it.

- Typically, 70% to 80% of the dataset is reserved for training, and the remaining 20% to 30% is used for testing.

In [14]:
# Libraries

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits



### Goal: classify the digits dataset from the sklearn library
- The dataset contains hand-written digits from 0 to 9.
- We will train several algorithms and evaluate which one performs best.



In [3]:
# Load the digits dataset and list its attributes

digits = load_digits()
dir(digits)


['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

In [9]:
df = pd.DataFrame(digits.data)
df.head()

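|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 |
| 3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 9.0 | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 |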


In [12]:
# x holds the features (pixel values); y holds the target labels (digits 0-9).
x = df
y = digits.target

In [15]:
# Split the dataset into training (80%) and testing (20%) sets.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
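
To confirm that the split above has the intended proportions, the shapes of the resulting arrays can be inspected. This is a minimal, self-contained sketch that assumes the same `test_size=0.2` and `random_state=42`; it works on the raw NumPy arrays instead of the DataFrame, which is an illustrative simplification.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
x, y = digits.data, digits.target  # 1797 samples, 64 pixel features each

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# The test set size is rounded up, so 20% of 1797 samples gives 360 test rows.
print(x_train.shape, x_test.shape)  # (1437, 64) (360, 64)
print(f"test fraction: {len(x_test) / len(x):.3f}")
```

With 360 test samples, each misclassified digit changes the reported accuracy by about 0.28 percentage points, which helps when reading the scores further down.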

In [16]:
x_train

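|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1734 | 0.0 | 0.0 | 3.0 | 14.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 11.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 13.0 | 4.0 | 0.0 |
| 855 | 0.0 | 0.0 | 9.0 | 9.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 16.0 | 14.0 | 3.0 | 0.0 | 0.0 |
| 1642 | 0.0 | 0.0 | 0.0 | 10.0 | 13.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 11.0 | 13.0 | 6.0 | 0.0 | 0.0 |
| 175 | 0.0 | 1.0 | 10.0 | 16.0 | 16.0 | 11.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 4.0 | 0.0 | 0.0 | 1.0 | 15.0 | 14.0 | 11.0 | 4.0 | 0.0 | 0.0 |
| 925 | 0.0 | 0.0 | 6.0 | 14.0 | 13.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 16.0 | 9.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1130 | 0.0 | 1.0 | 13.0 | 16.0 | 16.0 | 12.0 | 1.0 | 0.0 | 0.0 | 12.0 | ... | 9.0 | 0.0 | 0.0 | 1.0 | 14.0 | 16.0 | 16.0 | 11.0 | 1.0 | 0.0 |
| 1294 | 0.0 | 3.0 | 15.0 | 16.0 | 15.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 3.0 | 16.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 860 | 0.0 | 0.0 | 9.0 | 16.0 | 16.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 14.0 | 16.0 | 16.0 | 2.0 | 0.0 |
| 1459 | 0.0 | 0.0 | 1.0 | 13.0 | 16.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 15.0 | 7.0 | 0.0 | 0.0 | 0.0 |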


In [18]:
x_test

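|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1245 | 0.0 | 0.0 | 0.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 16.0 | 2.0 | 0.0 | 0.0 | 0.0 | 9.0 | 14.0 | 14.0 | 5.0 | 0.0 |
| 220 | 0.0 | 0.0 | 11.0 | 16.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 16.0 | 11.0 | 1.0 | 0.0 | 0.0 |
| 1518 | 0.0 | 0.0 | 8.0 | 15.0 | 12.0 | 4.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 7.0 | 0.0 | 0.0 | 0.0 | 13.0 | 16.0 | 15.0 | 8.0 | 0.0 | 0.0 |
| 438 | 0.0 | 0.0 | 2.0 | 12.0 | 12.0 | 12.0 | 9.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 1270 | 0.0 | 2.0 | 13.0 | 16.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | ... | 14.0 | 0.0 | 0.0 | 3.0 | 15.0 | 16.0 | 16.0 | 10.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1731 | 0.0 | 0.0 | 0.0 | 2.0 | 14.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 10.0 | 0.0 | 0.0 | 0.0 |
| 1630 | 0.0 | 0.0 | 6.0 | 16.0 | 15.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 16.0 | 16.0 | 12.0 | 1.0 | 0.0 |
| 1037 | 0.0 | 0.0 | 7.0 | 15.0 | 16.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 11.0 | 10.0 | 10.0 | 0.0 | 0.0 |
| 965 | 0.0 | 0.0 | 7.0 | 16.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 14.0 | 0.0 | 0.0 | 0.0 | 7.0 | 16.0 | 16.0 | 16.0 | 4.0 | 0.0 |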


In [21]:
# Helper used for every algorithm: fit on the training split and print the accuracy on the test split.

def algorithm(ml_model):
    model = ml_model.fit(x_train, y_train)
    print(f'Accuracy: {model.score(x_test, y_test)}')

### Random Forest

In [26]:
algorithm(RandomForestClassifier(n_estimators=40))

Accuracy: 0.9722222222222222


### Support Vector Machine

In [27]:
algorithm(SVC())

Accuracy: 0.9861111111111112


### Logistic Regression

In [28]:
algorithm(LogisticRegression())

Accuracy: 0.9694444444444444


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
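
The warning above names two remedies: raising `max_iter` or scaling the data. The sketch below applies both on the same hold-out split. The `StandardScaler` pipeline and the value `max_iter=1000` are illustrative assumptions, not settings from the original notebook, and the accuracy it prints may differ from the figure reported above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Scale the pixel features, then fit logistic regression with a larger
# iteration budget (both choices are assumptions made for this sketch).
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(x_train, y_train)

accuracy = scaled_lr.score(x_test, y_test)
print(f"Accuracy: {accuracy}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test split leaks into preprocessing.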


### KFold
- K-fold is another technique for cross-validation.
- Compared with a single train-test split, it generally gives a less biased estimate of model performance.
- The key benefit of K-fold is that every observation in the original dataset appears in the test set exactly once and in the training set K-1 times.
- The data is split into K folds. The model is trained on K-1 folds (K minus 1) and validated on the remaining fold. Scores and errors are recorded.
- This process is repeated until each fold has been used as the test set once. The average of the recorded scores is the model’s performance metric.
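
The claim that each observation is tested once and trained on K-1 times can be checked by counting. This is a minimal sketch on a toy array; the sizes `n_samples=10` and `k=5` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 10, 5  # arbitrary illustrative sizes
data = np.arange(n_samples)

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)

# Count how often each sample index lands in the train and test parts.
for train_index, test_index in KFold(n_splits=k).split(data):
    train_counts[train_index] += 1
    test_counts[test_index] += 1

print(test_counts)   # every entry is 1
print(train_counts)  # every entry is k - 1
```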

In [36]:
#  libraries

from sklearn.model_selection import KFold

kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [37]:
# Show the train and test indices that KFold generates for a 9-element list

for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]


### Stratified K-Fold 
- Provides train/test indices to split data in train/test sets.
- This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.


### Notes

The implementation is designed to:

- Generate test sets such that all contain the same distribution of classes, or as close as possible.
- Be invariant to class label: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
- Preserve order dependencies in the dataset ordering, when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
- Generate test sets where the smallest and largest differ by at most one sample.
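
Two of these properties are easy to observe on a toy target. The sketch below uses a made-up label vector with six samples of one class and three of another, then checks the class counts in each test fold and compares the splits produced for integer and string labels. The arrays are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: six samples of class 0, three of class 1.
y_int = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
y_str = np.where(y_int == 0, "Sad", "Happy")  # same classes, relabelled
X = np.zeros((len(y_int), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=3, shuffle=False)

# Class counts per test fold: each fold keeps the 2:1 class ratio.
fold_class_counts = []
for train_index, test_index in skf.split(X, y_int):
    counts = np.bincount(y_int[test_index], minlength=2)
    fold_class_counts.append(counts.tolist())
    print(test_index, counts)

# Relabelling the classes should leave the generated indices unchanged.
splits_int = [test.tolist() for _, test in skf.split(X, y_int)]
splits_str = [test.tolist() for _, test in skf.split(X, y_str)]
print(splits_int == splits_str)
```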


In [45]:
#  libraries

from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=3,shuffle=False)
folds

StratifiedKFold(n_splits=3, random_state=None, shuffle=False)

In [49]:
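score_lr = []
score_svm = []
score_rf = []

# Note: this loop iterates over kf (the plain KFold defined earlier), not folds,
# and never uses train_index or test_index. algorithm() always fits on the
# x_train/x_test split created earlier, so every iteration re-scores that same
# hold-out split. algorithm() prints the score and returns None, so the lists
# collect None values.
for train_index, test_index in kf.split(digits.data):
    score_rf.append(algorithm(RandomForestClassifier(n_estimators=40)))
    score_svm.append(algorithm(SVC(gamma='auto')))
    score_lr.append(algorithm(LogisticRegression(solver='liblinear', multi_class='ovr')))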


Accuracy: 0.975
Accuracy: 0.4666666666666667
Accuracy: 0.9611111111111111
Accuracy: 0.9694444444444444
Accuracy: 0.4666666666666667
Accuracy: 0.9611111111111111
Accuracy: 0.975
Accuracy: 0.4666666666666667
Accuracy: 0.9611111111111111
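
The loop above iterates over `kf.split(...)` without using the indices it yields, so `algorithm` scores the same hold-out split on every pass, which is why the printed accuracies barely change between iterations. A per-fold evaluation might look like the sketch below. The helper `get_score` is a hypothetical name, and the logistic regression is swapped for a scaled pipeline with `max_iter=1000` so the sketch does not depend on the `multi_class` argument; both are assumptions, not the notebook's method.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target


def get_score(model, x_tr, x_te, y_tr, y_te):
    """Hypothetical helper: fit on one training fold, return test-fold accuracy."""
    model.fit(x_tr, y_tr)
    return model.score(x_te, y_te)


folds = StratifiedKFold(n_splits=3, shuffle=False)
scores = {"rf": [], "svm": [], "lr": []}

for train_index, test_index in folds.split(X, y):
    # Build this fold's train and test sets from the indices.
    x_tr, x_te = X[train_index], X[test_index]
    y_tr, y_te = y[train_index], y[test_index]

    scores["rf"].append(
        get_score(RandomForestClassifier(n_estimators=40), x_tr, x_te, y_tr, y_te)
    )
    scores["svm"].append(get_score(SVC(gamma="auto"), x_tr, x_te, y_tr, y_te))
    scores["lr"].append(
        get_score(
            make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
            x_tr, x_te, y_tr, y_te,
        )
    )

for name, values in scores.items():
    print(name, values)
```

Each list then holds one accuracy per fold, which can be averaged to compare the models.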


### cross_val_score function


In [51]:
from sklearn.model_selection import cross_val_score


In [53]:
# Random forest
cross_val_score(RandomForestClassifier(n_estimators=40), digits.data, digits.target, cv=3)


array([0.93155259, 0.94991653, 0.92320534])

In [54]:
# SVC 
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)


array([0.38063439, 0.41068447, 0.51252087])

In [55]:
# Logistic regression
cross_val_score(LogisticRegression(solver='liblinear', multi_class='ovr'), digits.data, digits.target, cv=3)


array([0.89482471, 0.95325543, 0.90984975])
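
When `cv` is an integer and the estimator is a classifier, `cross_val_score` splits the data with stratified K-fold, so `cv=3` in the calls above corresponds to `StratifiedKFold(n_splits=3)`. The sketch below checks this and summarises the fold scores with a mean and standard deviation. The fixed `random_state=0` is an assumption added so that both calls build identical forests.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

digits = load_digits()

# A fixed seed (an assumption for this sketch) makes the two runs comparable.
model = RandomForestClassifier(n_estimators=40, random_state=0)

scores_int = cross_val_score(model, digits.data, digits.target, cv=3)
scores_explicit = cross_val_score(
    model, digits.data, digits.target, cv=StratifiedKFold(n_splits=3)
)

print(f"mean={scores_int.mean():.3f}, std={scores_int.std():.3f}")
print(np.allclose(scores_int, scores_explicit))
```

Reporting the mean together with the spread makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.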