# Cross-Validation in Machine Learning

* Cross-validation is a widely used model evaluation technique that helps assess how well a machine learning model is likely to perform on unseen data.
* It works by dividing the dataset into multiple segments (or folds), where the model is trained on a portion of the data and tested on the remaining part.
* This process is repeated multiple times, each time using a different fold as the test set.
* The performance metrics from each iteration are then averaged, giving a more reliable estimate of the model's ability to generalize to new data than a single train/test split.
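
The steps above can be sketched in a few lines with scikit-learn's `cross_val_score`. This is only an illustration: the Iris dataset, the logistic regression model, and the choice of 5 folds are assumptions made for the example, not requirements of the technique.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: Iris and logistic regression stand in for any dataset and model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into five folds; each fold is the test set exactly once.
fold_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# Average the per-fold metrics to get the cross-validated estimate.
mean_score = fold_scores.mean()
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", mean_score)
```

Reporting the per-fold scores next to the mean makes it easier to see how much the estimate depends on the particular split.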


**The main purpose of cross-validation is to detect overfitting and to estimate how well a model generalizes.**

Cross-validation does not prevent overfitting by itself, but it reveals when a model is merely memorizing the training data instead of adapting to real-world data, which makes it a standard tool for model selection and tuning.

## Types of Cross-Validation

### 1. Holdout Validation


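Holdout Validation is a simple technique where the dataset is split into two parts, one for training and one for testing. The example in this section uses a 50/50 split; in practice, splits such as 70/30 or 80/20 are more common because they leave more data for training. The model is trained on the training set and evaluated on the testing set.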

**How it works:**

  * Randomly divide the dataset into two portions.
  * Train the model on one portion.
  * Test the model on the other portion.

**Advantages:**

  * Easy to implement.
  * Fast and requires less computational time.

**Disadvantages:**

  * The model might miss key patterns if informative samples happen to land in the testing set.
  * Can give a pessimistic (higher-bias) estimate, as the model is trained on only half of the data.
  * The evaluation can vary significantly depending on how the data was split.
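
To illustrate the last point, here is a minimal sketch that repeats the same 50/50 holdout split with different random seeds and reports the spread of the resulting accuracies. The number of seeds (10) and `n_estimators=50` are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):  # hypothetical choice: 10 different random splits
    # Only the split changes between runs; the model settings stay fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

accuracies = np.array(accuracies)
print("Min / max accuracy:", accuracies.min(), accuracies.max())
print("Standard deviation:", accuracies.std())
```

Any gap between the minimum and maximum comes purely from which samples landed in the test half, which is the instability that the multi-fold methods in the following sections try to average out.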



```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Perform Holdout Validation: 50% training, 50% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Train a classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Holdout Validation Accuracy:", accuracy)
```

```text
Holdout Validation Accuracy: 0.9866666666666667
```



### 2. Leave-One-Out Cross-Validation (LOOCV)

LOOCV is an exhaustive cross-validation method where one data point is left out for testing and the rest are used for training. This is repeated for every data point.

**How it works:**

  - For a dataset of n samples, train the model on n - 1 samples.
  - Test on the one left-out sample.
  - Repeat this process n times, each time leaving out a different sample.

**Advantages:**

  - Uses nearly the entire dataset for training, leading to low bias.
  - Effective when datasets are very small.

**Disadvantages:**

  - High variance in results, as testing is done on a single point at a time.
  - Extremely computationally expensive, especially with large datasets.
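
Both disadvantages can be made concrete with a short sketch. It counts how many model fits LOOCV needs on the 150-sample Iris dataset and shows that each individual score can only be 0 or 1, because every test set holds a single sample. Logistic regression is used here purely as an assumption to keep the 150 fits fast.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

# LOOCV needs one model fit per sample in the dataset.
n_fits = loo.get_n_splits(X)
print("Number of model fits:", n_fits)

# Each test set has exactly one sample, so every score is either 0.0 or 1.0.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("Distinct per-sample scores:", sorted(set(scores)))
print("LOOCV accuracy:", scores.mean())
```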



```python
from sklearn.model_selection import LeaveOneOut
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Initialize Leave-One-Out cross-validator
loo = LeaveOneOut()

# Initialize model
model = RandomForestClassifier(random_state=42)

# Store predictions and true values
predictions = []
true_values = []

# Perform LOOCV
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train model
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)
    predictions.append(y_pred[0])
    true_values.append(y_test[0])

# Evaluate
accuracy = accuracy_score(true_values, predictions)
print("LOOCV Accuracy:", accuracy)
```

```text
LOOCV Accuracy: 0.9533333333333334
```


### 3. Stratified Cross-Validation

Stratified Cross-Validation is a variation of K-Fold Cross-Validation (covered in section 4) designed for classification problems, especially imbalanced datasets. It ensures that each fold keeps approximately the same class distribution as the entire dataset.

**How it works:**

  - The dataset is split into k folds, keeping the class proportions the same in each fold.
  - Train the model on k - 1 folds and test on the remaining fold.
  - Repeat the process k times, ensuring each fold is used once as the test set.

**Advantages:**

  - Ensures each fold is representative of the overall class distribution, reducing bias towards over-represented classes.
  - Improves model stability and generalization on imbalanced datasets.

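**Disadvantages:**

  - Slightly more involved than basic K-Fold, because the splitter needs the class labels to build the folds, and it applies only to classification targets.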
  - Might still be computationally heavy for large datasets.
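
To see what stratification changes, the following sketch compares how many minority-class samples land in each test fold with plain `KFold` and with `StratifiedKFold`. The 90/10 label vector and the placeholder feature matrix are hypothetical and exist only to make the imbalance visible.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]


plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold minority samples per fold:          ", plain)
print("StratifiedKFold minority samples per fold:", stratified)
```

With stratification, the 10 minority samples are spread evenly, two per fold, whereas plain `KFold` leaves the per-fold count to chance.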



```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define StratifiedKFold with 5 splits
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Initialize model
model = LogisticRegression(max_iter=200)

# Track scores
scores = []

# Perform Stratified Cross-Validation (the splitter needs y to stratify)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)

# Print results
print("Stratified Cross-Validation Accuracy for each fold:", scores)
print("Average Accuracy:", np.mean(scores))
```

```text
Stratified Cross-Validation Accuracy for each fold: [1.0, 0.9666666666666667, 0.9333333333333333, 1.0, 0.9333333333333333]
Average Accuracy: 0.9666666666666668
```


### 4. K-Fold Cross-Validation

K-Fold Cross-Validation is a widely used cross-validation method where the dataset is split into k folds of (approximately) equal size. The model is trained on k - 1 folds and tested on the remaining fold. This process is repeated k times, each time with a different fold used as the test set.

**How it works:**

  * Divide the data into k folds.
  * In each iteration, use k - 1 folds for training and the remaining fold for testing.
  * Average the results from all k iterations for the final performance metric.

**Advantages:**

  * Provides a balanced trade-off between bias and variance.
  * More efficient than LOOCV.
  * Suitable for most general datasets.

**Disadvantages:**

  * Still requires k iterations, which can be time-consuming with large datasets.
  * Class imbalance might still be an issue unless used with stratification.

**Best Practice Tip:**

  * k = 5 and k = 10 are common choices; k = 10 is often considered a good balance, offering reliable performance estimates without excessive computation.
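
As a rough way to explore the choice of k, the sketch below runs K-Fold with k = 5 and k = 10 on the same data and prints the mean and spread of the fold scores. The dataset, the model, and the two values of k are assumptions made for illustration; the numbers will differ on other data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

results = {}
for k in (5, 10):
    # A larger k means more model fits, each tested on a smaller fold.
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf)
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: {len(scores)} fits, mean={scores.mean():.3f}, std={scores.std():.3f}")
```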




```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define KFold with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize model
model = LogisticRegression(max_iter=200)

# Track scores
scores = []

# Perform K-Fold Cross-Validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)

# Print results
print("K-Fold Cross-Validation Accuracy for each fold:", scores)
print("Average Accuracy:", np.mean(scores))
```

```text
K-Fold Cross-Validation Accuracy for each fold: [1.0, 1.0, 0.9333333333333333, 0.9666666666666667, 0.9666666666666667]
Average Accuracy: 0.9733333333333334
```


### 5. Out-of-Bag (OOB) Evaluation

Out-of-Bag (OOB) evaluation is a built-in validation technique used specifically with ensemble methods like Random Forests. It allows the model to validate itself using data samples that were not included in the bootstrap sample during training.

**How it works:**

 * In ensemble models like **Random Forest**, each tree is trained on a random sample (drawn with replacement) of the training data. This is called **bootstrap sampling**.

 * On average, about 63% of the distinct training samples end up in each tree's bootstrap sample; the remaining 37% or so, which that tree never sees, are its Out-of-Bag samples.

 * Each training sample is then predicted using only the trees for which it was out-of-bag, effectively providing a validation set without explicitly splitting the dataset.
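
The 63% / 37% figures follow from sampling with replacement: the chance that a given sample is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A small NumPy simulation can check this. The sample size and number of simulated trees below are arbitrary values chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # hypothetical training-set size
n_trees = 200       # hypothetical number of bootstrap samples to simulate

in_bag_fractions = []
for _ in range(n_trees):
    # Draw n_samples indices with replacement, as bootstrap sampling does.
    bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
    # Fraction of distinct training samples that made it into this bootstrap.
    in_bag_fractions.append(np.unique(bootstrap_idx).size / n_samples)

mean_in_bag = float(np.mean(in_bag_fractions))
print("Mean in-bag fraction:", round(mean_in_bag, 3))
print("Mean OOB fraction:   ", round(1 - mean_in_bag, 3))
print("Theory, 1 - 1/e:     ", round(1 - np.exp(-1), 3))
```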

**Advantages:**

 * No need to set aside a separate validation set.
 * Efficient use of data — training and validation are done simultaneously.
 * Reduces computation by eliminating the need for manual cross-validation in many cases.

**Disadvantages:**

 * Only available with **bagging-based algorithms** (like Random Forest); models that do not use bootstrap sampling, such as linear models, have no OOB samples to evaluate on.
 * May not be as stable or reliable as k-fold cross-validation on small datasets.

**Best Practice Tip:**

Enable **oob_score=True** when training a Random Forest in scikit-learn to automatically calculate the OOB score.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data (just for comparing OOB score with test accuracy)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest with OOB enabled
rf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Get the OOB score
print("OOB Score:", rf.oob_score_)

# Evaluate on test set
y_pred = rf.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
```

```text
OOB Score: 0.9428571428571428
Test Accuracy: 1.0
```
