# Cross-Validation

## K-fold Cross Validation

To assess how well a model will generalize in practice, we use k-fold cross-validation. The process for k-fold cross-validation is simple. First, make sure that the data is shuffled. This ensures that each fold is relatively similar to the others, so the cross-validation scores won't vary drastically from fold to fold.


1. Split the training set into k equal parts or folds.
2. For each fold, take the current fold and treat it as the test set. Take the remaining k-1 folds and treat this as the training set.
3. Train and test your model using the current training and test sets.
4. Average the performance across all k trials as the summarized performance metric.
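
The steps above can be sketched by hand with NumPy. This is only an illustrative sketch, not how scikit-learn implements it: `k_fold_scores`, `majority_class_accuracy`, and the toy `X_demo`/`y_demo` arrays are hypothetical names made up for this example, and the "model" simply predicts the majority class of the training folds.

```python
import numpy as np


def k_fold_scores(X, y, k, fit_and_score, seed=0):
    """Shuffle the data, split it into k folds, and return one score per fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # shuffle before splitting
    folds = np.array_split(indices, k)      # step 1: k (nearly) equal folds
    scores = []
    for i in range(k):
        test_idx = folds[i]                 # step 2: current fold is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # step 3: train on the remaining k-1 folds, evaluate on the held-out fold
        scores.append(fit_and_score(X[train_idx], y[train_idx],
                                    X[test_idx], y[test_idx]))
    return np.array(scores)


def majority_class_accuracy(X_tr, y_tr, X_te, y_te):
    """Hypothetical stand-in for a model: always predict the majority training class."""
    values, counts = np.unique(y_tr, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(y_te == majority))


# Toy data, assumed purely for illustration
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

demo_scores = k_fold_scores(X_demo, y_demo, k=5, fit_and_score=majority_class_accuracy)
print(demo_scores)          # one accuracy per fold
print(demo_scores.mean())   # step 4: average across the k trials
```

Passing the training and scoring logic in as a function keeps the splitting loop independent of any particular model.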


**Advantages of Cross-Validation**
1. Every sample is used for both training and evaluation, which is particularly beneficial when working with a dataset limited in size.
2. Cross-validation produces a distribution of scores rather than a single number, so we can obtain additional statistics such as the precision of the estimate (its standard deviation).

## Using Cross-validation

In [29]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
X, y = iris.data, iris.target
# train_test_split returns X_train, X_test, y_train, y_test (in that order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

In [30]:
from sklearn.linear_model import SGDClassifier


sgd_clf = SGDClassifier(random_state=42, max_iter=1000, tol=.20)
sgd_clf.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=0.2,
       validation_fraction=0.1, verbose=0, warm_start=False)

### Version 1 - Base

In [31]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.base import clone


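# No shuffling here, so random_state is omitted (recent scikit-learn versions
# raise an error if random_state is set while shuffle=False)
skfolds = StratifiedKFold(n_splits=3)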
for i, (train_index, test_index) in enumerate(skfolds.split(X_train, y_train)):
    clone_clf = clone(sgd_clf)
    X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
    X_test_fold, y_test_fold = X_train[test_index], y_train[test_index]
    clone_clf.fit(X_train_fold, y_train_fold)
    y_pred_fold = clone_clf.predict(X_test_fold)
    print(f'accuracy fold {i}: {accuracy_score(y_test_fold, y_pred_fold)}')

accuracy fold 0: 0.8285714285714286
accuracy fold 1: 0.6363636363636364
accuracy fold 2: 0.96875


### Version 2 - From Source

In [32]:
from sklearn.model_selection import cross_val_score


scores = cross_val_score(sgd_clf, X_train, y_train, scoring="accuracy", cv=3)
scores

array([0.82857143, 0.63636364, 0.96875   ])
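
Since cross-validation returns one score per fold, the scores can be summarized by their mean and spread. A minimal sketch, assuming the three fold accuracies reported above are copied into an array by hand (the names `mean_score` and `std_score` are illustrative):

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above
scores = np.array([0.82857143, 0.63636364, 0.96875])

mean_score = scores.mean()   # summarized performance metric
std_score = scores.std()     # spread across folds (population std, ddof=0)

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

With only three folds the standard deviation is a rough indicator at best, but a large spread like this one suggests the score depends heavily on which samples land in each fold.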