# Cross-Validation in Machine Learning

Cross-validation is a powerful technique used to evaluate the performance of a machine learning model.  
Instead of relying on just one train-test split, cross-validation uses **multiple splits** of the dataset, making the evaluation more **robust and reliable**.

In this notebook, we will cover:
1. Why Cross-Validation is needed  
2. Different types of Cross-Validation  
3. Practical implementation in Python (Scikit-learn)  

## Why Do We Need Cross-Validation?

- When we split data into **train (80%) and test (20%)**, the result depends heavily on **how the split is made**.  
- The test set may happen to be too easy or too difficult → the measured performance is misleading.  
- We want to check **how well our model generalizes** to unseen data.  

👉 Cross-validation solves this problem by evaluating the model on **multiple different splits** and then taking the average score.  

In [1]:
# Importing required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
import numpy as np

In [2]:
# Load a sample dataset (Iris dataset)
X, y = load_iris(return_X_y=True)

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (150, 4)
Shape of y: (150,)


## Limitation of Single Train-Test Split

If we do only one split (say 80–20), the evaluation depends on which data points end up in training and testing.  
This can make the score **unstable (high variance)** or even **misleadingly optimistic or pessimistic**.  
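
To make this dependence visible, the sketch below repeats an 80/20 split with a few different seeds and compares the resulting accuracies. The seed values and the `split_scores` name are arbitrary choices for illustration, and the exact numbers may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Repeat the same 80/20 split with different seeds (seeds are arbitrary)
split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Fix the tree's own randomness so only the split changes between runs
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train, y_train)
    split_scores.append(clf.score(X_test, y_test))

print("Accuracy per split:", split_scores)
print("Spread (max - min):", max(split_scores) - min(split_scores))
```

Any spread in these scores comes only from which samples landed in the test set, since the model and data are otherwise identical.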

## K-Fold Cross-Validation

- Dataset is divided into **K folds** (subsets).  
- Each time, one fold is used as the test set and the remaining **K-1 folds** are used for training.  
- Process is repeated K times, and performance is averaged.  

Example: In **5-Fold CV**, dataset is split into 5 parts.  
The model is trained and tested 5 times, each time with a different test fold.  
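
The procedure described above can also be written as an explicit loop, which shows what happens in each fold. This is a minimal sketch using `KFold` directly. Shuffling with a fixed `random_state` is an assumption made here because the Iris samples are stored sorted by class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffle before splitting, because Iris is ordered by class label
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
fold_test_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train a fresh model on K-1 folds and test on the held-out fold
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    score = clf.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    fold_test_sizes.append(len(test_idx))
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, accuracy={score:.3f}")

print("Mean accuracy:", np.mean(fold_scores))
```

Each sample appears in exactly one test fold, so every data point is used for testing once and for training K-1 times.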

In [3]:
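# Apply 5-Fold Cross Validation
# Note: for a classifier, an integer cv makes scikit-learn use StratifiedKFold;
# pass cv=KFold(n_splits=5, shuffle=True) for plain (unstratified) K-Fold.
model = DecisionTreeClassifier()

scores = cross_val_score(model, X, y, cv=5)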

print("Cross-validation scores:", scores)
print("Average CV Score:", np.mean(scores))

Cross-validation scores: [0.96666667 0.96666667 0.9        1.         1.        ]
Average CV Score: 0.9666666666666668


## Stratified K-Fold Cross-Validation

- In classification, if classes are imbalanced, normal K-Fold may create folds with **unequal class distribution**.  
- **Stratified K-Fold** ensures that each fold has the same **class proportion** as the overall dataset.  
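
The difference is easy to see by counting the classes in each test fold. The sketch below compares unshuffled `KFold` with `StratifiedKFold` on Iris, whose 150 samples are stored sorted by class (50 per class). The helper `class_counts_per_test_fold` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_test_fold(splitter):
    """Return, for each test fold, the number of samples of each class."""
    return [
        np.bincount(y[test_idx], minlength=3).tolist()
        for _, test_idx in splitter.split(X, y)
    ]

kfold_counts = class_counts_per_test_fold(KFold(n_splits=5))
strat_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold (no shuffle):", kfold_counts)
print("StratifiedKFold:   ", strat_counts)
```

Without shuffling, the first `KFold` test fold contains only class 0, while every `StratifiedKFold` test fold holds 10 samples of each class.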

In [4]:
# Stratified K-Fold preserves the overall class proportions in each fold
skf = StratifiedKFold(n_splits=5)

scores = cross_val_score(model, X, y, cv=skf)

print("Stratified K-Fold CV scores:", scores)
print("Average Stratified CV Score:", np.mean(scores))

Stratified K-Fold CV scores: [0.96666667 0.96666667 0.9        1.         1.        ]
Average Stratified CV Score: 0.9666666666666668


## Leave-One-Out Cross-Validation (LOOCV)

- Extreme case of K-Fold CV, where **K = number of samples**.  
- Each time, **1 sample** is used for testing, and the rest for training.  
- Uses almost all the data for training in every iteration (low bias), but the estimate can have high variance and it is computationally expensive for large datasets.  
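
The statement that LOOCV is K-Fold with K equal to the number of samples can be checked directly on a toy array. This is a small sketch in which `X_small` is a made-up array of 6 samples, used only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 6 samples with 2 features each (values are arbitrary)
X_small = np.arange(12).reshape(6, 2)

# Test indices produced by LeaveOneOut
loo_tests = [test_idx.tolist() for _, test_idx in LeaveOneOut().split(X_small)]

# Test indices produced by unshuffled K-Fold with K = number of samples
kfold_tests = [
    test_idx.tolist()
    for _, test_idx in KFold(n_splits=len(X_small)).split(X_small)
]

print("LeaveOneOut test folds:", loo_tests)
print("KFold(n_splits=6) test folds:", kfold_tests)
print("Identical:", loo_tests == kfold_tests)
```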

In [6]:
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

print("Number of iterations:", len(scores))
print("Average LOOCV Score:", np.mean(scores))

Number of iterations: 150
Average LOOCV Score: 0.9466666666666667


## Time Series Cross-Validation

- For time series, we **cannot shuffle data** (since order matters).  
- Use **forward chaining**:
  - Train on data up to time `t`, test on time `t+1`.  
  - Expand training set step by step.  

Scikit-learn provides `TimeSeriesSplit` for this purpose.  
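
Forward chaining is easier to read on a tiny example. The sketch below applies `TimeSeriesSplit` to a made-up sequence of 8 time steps (`X_time` is a hypothetical array used only for illustration) and checks that every training index comes before every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series: 8 consecutive time steps, one feature each
X_time = np.arange(8).reshape(-1, 1)

tscv_small = TimeSeriesSplit(n_splits=3)
splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv_small.split(X_time)
]

for train_idx, test_idx in splits:
    # The training window always ends before the test window starts
    assert max(train_idx) < min(test_idx)
    print("Train:", train_idx, "Test:", test_idx)
```

The training window grows with each split while the test window moves forward, so the model is never evaluated on data older than what it was trained on.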

In [7]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

for train_index, test_index in tscv.split(X):
    print("Train:", train_index, "Test:", test_index)

Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24] Test: [25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49] Test: [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
 74]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74] Test: [75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
 99]
Train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84

## Conclusion

- **Cross-validation** is a reliable method to evaluate models.  
- **K-Fold CV (k=5 or 10)** is most commonly used.  
- **Stratified K-Fold** is better for classification with imbalanced data.  
- **LOOCV** trains on nearly all the data each time, but it is slow and its estimate can have high variance.  
- **Time Series CV** is best when order matters.  

👉 Combine **cross-validation with hyperparameter tuning**, so that hyperparameters are chosen on validation folds rather than on the final test set.  
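
One common way to combine the two in scikit-learn is `GridSearchCV`, which runs cross-validation for every hyperparameter combination in a grid. The sketch below is illustrative only: the `max_depth` candidates and the shuffled `StratifiedKFold` settings are assumptions, not recommended values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values are arbitrary; None lets the tree grow without a depth limit
param_grid = {"max_depth": [2, 3, 4, None]}

# Stratified, shuffled folds with a fixed seed for repeatable splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

The best score reported here is itself chosen from several candidates, so it tends to be slightly optimistic. A separate held-out test set (or nested cross-validation) gives a fairer final estimate.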
