K-Fold Cross Validation is a popular technique for evaluating the performance of a machine learning model on a dataset. It helps ensure that the model's performance is stable and not overly dependent on any particular part of the dataset. Here's an overview of K-Fold Cross Validation:

### 1. **Purpose of K-Fold Cross Validation**:
   - **Model Evaluation**: When training machine learning models, we want to test the model’s performance on data it hasn’t seen. Splitting data into training and testing sets helps, but a single train-test split might not provide a reliable estimate if the dataset is small or has high variability. K-Fold Cross Validation addresses this by repeatedly testing the model on different subsets of the data.
   - **Helps Detect Overfitting and Underfitting**: Cross-validation does not prevent overfitting or underfitting by itself; it exposes them. A model that scores well on its training folds but poorly on the held-out folds is overfitting, while a model that scores poorly on both is underfitting.
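
As an illustration of how held-out scores expose overfitting, the sketch below compares an unpruned decision tree's accuracy on the data it was fitted on with its cross-validated accuracy on the digits dataset. The choice of `DecisionTreeClassifier`, `cv=5`, and `random_state=0` is an assumption made for this example only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Assumption: an unpruned tree is used only because it memorizes training data easily
tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same data the model was fitted on (optimistic)
train_accuracy = tree.fit(X, y).score(X, y)

# Mean accuracy over 5 held-out folds (closer to performance on unseen data)
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()

print(f"training accuracy: {train_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_accuracy:.3f}")
```

A large gap between the two numbers is the signal to look for; the cross-validated figure is the one to report.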

### 2. **How K-Fold Cross Validation Works**:
   - **Splitting the Data**: The dataset is divided into `k` subsets, or "folds," of approximately equal size (exactly equal only when the number of samples is divisible by `k`). For example, in 5-Fold Cross Validation, the dataset is split into 5 subsets.
   - **Training and Testing Process**: The model is trained `k` times, each time on a different combination of folds. Specifically:
      - In each iteration, one of the `k` folds is used as the test set, and the remaining `k-1` folds are used as the training set.
      - This process is repeated `k` times, each time with a different fold serving as the test set.
   - **Calculating the Performance Metric**: After each iteration, a performance metric (e.g., accuracy, precision, recall) is recorded on the test fold. Once all `k` iterations are completed, the model’s overall performance is assessed by averaging these scores. This final average gives a more robust estimate of model performance.
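
The splitting and rotation steps above can be sketched in plain NumPy. The helper name `kfold_indices` is hypothetical and exists only for this illustration; it mimics contiguous, unshuffled folds and is not scikit-learn's implementation.

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    # array_split tolerates n_samples not divisible by k: fold sizes differ by at most 1
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        # The other k-1 folds together form the training set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, 3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Every sample lands in exactly one test fold, so after `k` iterations each sample has been used for testing once and for training `k-1` times.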

### 3. **Choosing the Number of Folds (`k`)**:
   - **Common Choices**: `k` is typically set between 5 and 10. For instance, 10-Fold Cross Validation is popular in many applications. A higher value of `k` (like 10) trains each model on a larger share of the data, which makes the estimate less pessimistic, but it requires more model fits and is therefore more computationally expensive.
   - **Leave-One-Out Cross Validation (LOOCV)**: LOOCV is an extreme case where `k` is set equal to the number of data points, meaning each fold contains just one data point. While LOOCV uses almost all data for training in each iteration, it can be very computationally intensive and often leads to high variance in performance estimates.
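
To make the "extreme case" concrete, the sketch below checks that scikit-learn's `LeaveOneOut` splitter produces the same splits as `KFold` with `n_splits` equal to the number of samples. The six-row array `X` is toy data invented for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 6 samples with a single feature each
X = np.arange(6).reshape(-1, 1)

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X)).split(X)]

# Each test set holds exactly one sample, and both splitters agree
for train_idx, test_idx in loo_splits:
    print(train_idx, test_idx)
print(loo_splits == kf_splits)
```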

### 4. **Advantages of K-Fold Cross Validation**:
   - **Efficient Use of Data**: K-Fold Cross Validation uses the entire dataset for both training and testing (at different points), making it especially useful when data is limited.
   - **More Reliable Estimates**: Averaging performance across `k` test folds makes the estimate depend less on which samples happen to land in the test set than a single train-test split does.
   - **Reduced Variance**: Because every data point is used for testing exactly once and for training `k-1` times, the averaged metric fluctuates less than a single-split score, leading to a more stable and consistent evaluation.

### 5. **Disadvantages of K-Fold Cross Validation**:
   - **Computationally Intensive**: Training a model `k` times can be computationally expensive, especially for large datasets or complex models.
   - **Variance with Small Datasets**: With small datasets, different splits may yield significantly different results, so the averaged score can still be noisy, and a good cross-validation score does not by itself rule out overfitting.

### 6. **Applications of K-Fold Cross Validation**:
   - **Model Selection and Hyperparameter Tuning**: K-Fold Cross Validation is commonly used during model selection to compare different algorithms or during hyperparameter tuning to find the best configuration for a model.
   - **General Performance Evaluation**: In research or benchmarking, K-Fold Cross Validation is used to present a reliable, generalizable estimate of how a model is expected to perform on unseen data.

In summary, K-Fold Cross Validation is a robust method for model validation, providing insights into the model’s performance on various subsets of data. By using multiple folds, it helps ensure that the model's evaluation is fair, consistent, and not overly influenced by any single subset of data.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

**Logistic Regression**

In [3]:
lr = LogisticRegression(solver='liblinear',multi_class='ovr')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9722222222222222

**SVM**

In [4]:
svm = SVC(gamma='auto')
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.362962962962963

In [5]:
svm = SVC()
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

0.9925925925925926

**Random Forest**

In [6]:
rf = RandomForestClassifier(n_estimators=50)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.9833333333333333

**Now using K-Fold cross validation**

In [7]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
kf

KFold(n_splits=3, random_state=None, shuffle=False)

In [8]:
for train_index, test_index in kf.split([1,2,3,4,5,6,7,8,9]):
    print(train_index, test_index)

[3 4 5 6 7 8] [0 1 2]
[0 1 2 6 7 8] [3 4 5]
[0 1 2 3 4 5] [6 7 8]
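
The folds above are contiguous because `shuffle=False` is the default. If the rows of a dataset are ordered (for example, sorted by class), contiguous folds can be unrepresentative. A minimal sketch of the shuffled variant follows; the seed `random_state=0` is an arbitrary choice made here for reproducibility.

```python
from sklearn.model_selection import KFold

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=True permutes the indices before splitting; random_state fixes the permutation
kf_shuffled = KFold(n_splits=3, shuffle=True, random_state=0)

shuffled_splits = list(kf_shuffled.split(data))
for train_index, test_index in shuffled_splits:
    print(train_index, test_index)
```

Each index still appears in exactly one test fold; only the assignment of indices to folds changes.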


In [9]:
def get_score(model, X_train, X_test, y_train, y_test):
    # Fit on the training split and return the mean accuracy on the test split
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [10]:
get_score(SVC(), X_train, X_test, y_train, y_test)

0.9925925925925926

In [11]:
get_score(LogisticRegression(), X_train, X_test, y_train, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9648148148148148
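
The warning above suggests two remedies: raising `max_iter` or scaling the data. The sketch below applies both, assuming a `StandardScaler` in a pipeline and `max_iter=1000`. Those values, the `random_state=0` seed, and the variable names are choices made for this example, not taken from the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits_data = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits_data.data, digits_data.target, test_size=0.3, random_state=0
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
scaled_score = scaled_lr.score(X_te, y_te)
print(scaled_score)
```

Wrapping the scaler in a pipeline also matters for cross-validation: passing the pipeline to `cross_val_score` refits the scaler inside each training fold, so no information from the held-out fold leaks into preprocessing.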

In [12]:
from sklearn.model_selection import StratifiedKFold
folds = StratifiedKFold(n_splits=3)
scores_logistic = []
scores_svm = []
scores_rf = []

for train_index, test_index in folds.split(digits.data, digits.target):
    # Select this fold's training and test rows by index
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    # Fit a fresh model of each type on the training folds and score it on the held-out fold
    scores_logistic.append(get_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X_train, X_test, y_train, y_test))
    scores_svm.append(get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test))
    scores_rf.append(get_score(RandomForestClassifier(n_estimators=50), X_train, X_test, y_train, y_test))

In [13]:
scores_logistic

[0.8948247078464107, 0.9532554257095158, 0.9098497495826378]

In [14]:
scores_svm

[0.3806343906510851, 0.41068447412353926, 0.5125208681135225]

In [15]:
scores_rf

[0.9449081803005008, 0.9532554257095158, 0.9265442404006677]
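
Each list holds one accuracy per fold. To obtain the single summary figure that K-Fold Cross Validation is meant to produce, the fold scores can be averaged. The sketch below does this with the values printed above; the names `fold_scores`, `summary`, and `best_model` are introduced here for illustration.

```python
import numpy as np

# Fold accuracies copied from the three outputs above
fold_scores = {
    "logistic": [0.8948247078464107, 0.9532554257095158, 0.9098497495826378],
    "svm": [0.3806343906510851, 0.41068447412353926, 0.5125208681135225],
    "rf": [0.9449081803005008, 0.9532554257095158, 0.9265442404006677],
}

# The mean summarizes performance; the standard deviation shows spread across folds
summary = {name: (np.mean(s), np.std(s)) for name, s in fold_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")

best_model = max(summary, key=lambda name: summary[name][0])
print("highest mean accuracy:", best_model)
```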

**cross_val_score function**

In [16]:
from sklearn.model_selection import cross_val_score

**Logistic regression model performance using cross_val_score**

In [17]:
cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), digits.data, digits.target,cv=3)

array([0.89482471, 0.95325543, 0.90984975])

**SVM model performance using cross_val_score**

In [18]:
cross_val_score(SVC(gamma='auto'), digits.data, digits.target,cv=3)

array([0.38063439, 0.41068447, 0.51252087])

**Random forest performance using cross_val_score**

In [19]:
cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target,cv=3)

array([0.92821369, 0.94824708, 0.92654424])
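
The logistic regression and SVM scores from `cross_val_score` match the manual `StratifiedKFold` loop earlier. This is consistent with the scikit-learn documentation: when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold`. The sketch below checks the equivalence; `SVC()` with default settings is an assumption chosen because its results are deterministic.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Integer cv: the splitter is chosen by cross_val_score
scores_int_cv = cross_val_score(SVC(), X, y, cv=3)

# Explicit splitter: the same folds, stated openly
scores_explicit_cv = cross_val_score(SVC(), X, y, cv=StratifiedKFold(n_splits=3))

print(scores_int_cv)
print(scores_explicit_cv)
print(np.allclose(scores_int_cv, scores_explicit_cv))
```

The random forest scores differ between the two approaches because the forest is randomized and was given a different `n_estimators` value, not because the folds differ.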

**Parameter tuning using k-fold cross validation**

In [20]:
scores1 = cross_val_score(RandomForestClassifier(n_estimators=5),digits.data, digits.target, cv=10)
np.average(scores1)

0.8681160769708255

In [21]:
scores2 = cross_val_score(RandomForestClassifier(n_estimators=20),digits.data, digits.target, cv=10)
np.average(scores2)

0.9382184978274364

In [22]:
scores3 = cross_val_score(RandomForestClassifier(n_estimators=30),digits.data, digits.target, cv=10)
np.average(scores3)

0.9393513345747981

In [23]:
scores4 = cross_val_score(RandomForestClassifier(n_estimators=40),digits.data, digits.target, cv=10)
np.average(scores4)

0.9398789571694598
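
The four cells above repeat one pattern with a different `n_estimators` each time. A compact sketch of the same comparison as a loop follows. The `random_state=0` seed is an assumption added so that repeated runs give identical scores, which means the numbers will not exactly match the unseeded outputs above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

mean_scores = {}
for n_estimators in [5, 20, 30, 40]:
    # random_state=0 is an assumption added for reproducibility
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    mean_scores[n_estimators] = np.average(cross_val_score(model, X, y, cv=10))

for n_estimators, score in mean_scores.items():
    print(n_estimators, round(score, 4))

best_n = max(mean_scores, key=mean_scores.get)
print("best n_estimators:", best_n)
```

For larger search spaces, scikit-learn's `GridSearchCV` automates this loop, but the explicit version keeps the role of cross-validation visible.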

## Exercise
Use the iris flower dataset from the sklearn library and apply `cross_val_score` to the following models to measure the performance of each. Finally, identify the model with the best performance:

- Logistic Regression
- SVM
- Decision Tree
- Random Forest

In [24]:
from sklearn.datasets import load_iris
iris = load_iris()

In [25]:
l_score = cross_val_score(LogisticRegression(), iris.data, iris.target)
np.average(l_score)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9733333333333334

In [26]:
l_score_ = cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), iris.data, iris.target,cv=3)
np.average(l_score_)

0.9533333333333333

In [27]:
cross_val_score(RandomForestClassifier(n_estimators=40),iris.data, iris.target,cv=3)

array([0.98, 0.94, 0.94])

In [28]:
svm_score = cross_val_score(SVC(), iris.data, iris.target)
np.average(svm_score)

0.9666666666666666

In [29]:
svm_score_ = cross_val_score(SVC(gamma='auto'), iris.data, iris.target,cv=3)
np.average(svm_score_)

0.9733333333333333

In [30]:
decision_score = cross_val_score(DecisionTreeClassifier(), iris.data, iris.target)
np.average(decision_score)

0.9666666666666668

In [31]:
decision_score_ = cross_val_score(DecisionTreeClassifier(criterion='entropy', random_state=0), iris.data, iris.target,cv=3)
np.average(decision_score_)

0.9533333333333333

In [32]:
rf_score = cross_val_score(RandomForestClassifier(),iris.data, iris.target)
np.average(rf_score)

0.96

In [36]:
rf_score_ = cross_val_score(RandomForestClassifier(n_estimators=40),iris.data, iris.target, cv=10)
np.average(rf_score_)

0.96
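
To answer the exercise's final question in one place, the four models can be scored in a single loop and ranked by mean accuracy. This is a sketch under stated assumptions: `max_iter=1000`, `cv=5`, `n_estimators=40`, and the `random_state=0` seeds are choices made here, and the names `candidates` and `iris_means` are introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hyperparameters and seeds below are assumptions for this sketch
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=40, random_state=0),
}

# Same data and same number of folds for every model, so the means are comparable
iris_means = {
    name: np.average(cross_val_score(model, iris.data, iris.target, cv=5))
    for name, model in candidates.items()
}
for name, mean in sorted(iris_means.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {mean:.4f}")
```

The iris dataset has only 150 samples, so a difference of about 0.0067 in mean accuracy corresponds to a single flower classified differently. Small gaps between models should be read with that in mind.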