# Stratified K-Fold for Imbalanced Classification ⚖️

**K-Fold Cross-Validation** is a robust method for evaluating machine learning models. However, the standard K-Fold technique can be problematic when dealing with **imbalanced datasets**, where one class is much more frequent than the others.

### The Problem with Imbalanced Data

Standard K-Fold assigns samples to folds without considering the class distribution. In an imbalanced dataset, this can leave some folds (test sets) with a noticeably different class proportion than the full dataset. In extreme cases, a fold might contain no minority-class samples at all! Scores computed on such folds can be misleading and can vary widely from fold to fold.

### The Solution: Stratified K-Fold

**Stratified K-Fold Cross-Validation** is a variation of K-Fold designed for this problem. It builds the folds so that each one has **approximately the same percentage of samples for each class** as the original dataset (exact equality is not always possible, because class counts rarely divide evenly by the number of folds). Every test set is therefore a representative sample of the overall class distribution, which makes the model evaluation more reliable.
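To make the extreme case concrete, here is a minimal sketch on toy labels. The arrays `X_toy` and `y_toy` and the helper `minority_counts` are illustrative names, not part of the notebook. The labels are deliberately sorted so that unshuffled `KFold` puts every minority sample into the last fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels (illustrative only): 95 majority samples followed by 5 minority samples.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how folds are built


def minority_counts(cv, X, y):
    """Return the number of class-1 samples in each test fold."""
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]


# Unshuffled KFold cuts the data into consecutive blocks of 20 samples.
kfold_counts = minority_counts(KFold(n_splits=5), X_toy, y_toy)
# StratifiedKFold distributes each class across the folds separately.
strat_counts = minority_counts(StratifiedKFold(n_splits=5), X_toy, y_toy)

print("KFold:          ", kfold_counts)  # [0, 0, 0, 0, 5]
print("StratifiedKFold:", strat_counts)  # [1, 1, 1, 1, 1]
```

Four of the five plain K-Fold test sets contain no minority samples, so metrics such as recall for class 1 cannot be computed meaningfully on them. Shuffling reduces the risk but does not remove it when the minority class is very small.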

---

## 1. Creating an Imbalanced Dataset

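First, let's create a synthetic dataset where the minority class (class 1) makes up roughly 10% of the data. We use `make_classification` with the `weights` parameter to achieve this. The resulting share is not exactly 10%, because `make_classification` also reassigns a small fraction of labels at random by default (`flip_y=0.01`).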


In [24]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [25]:
X, y = make_classification(
    n_features=10,
    n_samples=1000,
    n_informative=8,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    weights=[0.9, 0.1],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

We can use `Counter` to confirm the imbalance.

In [26]:
from collections import Counter

Counter(y)

Counter({0: 897, 1: 103})

## 2. The Issue with Standard K-Fold

Now, let's see what happens when we apply a standard `KFold` cross-validator to our imbalanced data. We will check the class distribution in each of the 5 test folds.


In [27]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Use a dedicated name for the fold labels so the y_test from
# train_test_split above is not overwritten.
for _, test_index in kf.split(X, y):
    y_fold = y[test_index]
    print(Counter(y_fold))

Counter({0: 177, 1: 23})
Counter({0: 179, 1: 21})
Counter({0: 183, 1: 17})
Counter({0: 181, 1: 19})
Counter({0: 177, 1: 23})


**Observation:** The count of the minority class (1) varies across the folds, from 17 to 23 out of 200 samples (8.5% to 11.5%), while the full dataset contains 10.3%. Evaluating a model on test sets with inconsistent class ratios can lead to unreliable scores.


## 3. The Solution: Stratified K-Fold

`StratifiedKFold` preserves the class ratio in each fold as closely as the class counts allow.


In [28]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use a dedicated name for the fold labels so the y_test from
# train_test_split above is not overwritten.
for _, test_index in skf.split(X, y):
    y_fold = y[test_index]
    print(Counter(y_fold))

Counter({0: 180, 1: 20})
Counter({0: 180, 1: 20})
Counter({0: 179, 1: 21})
Counter({0: 179, 1: 21})
Counter({0: 179, 1: 21})


**Observation:** With stratification, the number of minority class samples in each fold is much more consistent (either 20 or 21 out of 200, i.e. 10% or 10.5%), which stays close to the 10.3% in the full dataset and provides a stable basis for model evaluation.
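The two observations can be put side by side with a small helper. This is a sketch only: the names `X_demo`, `y_demo`, `minority_counts_per_fold`, and `spreads` are introduced here for illustration, and the dataset is regenerated with the same `make_classification` arguments as in section 1 so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Recreate the dataset from section 1 so this block is self-contained.
X_demo, y_demo = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, weights=[0.9, 0.1], random_state=42,
)


def minority_counts_per_fold(cv, X, y):
    """Count class-1 samples in every test fold produced by the splitter."""
    return [int((y[test_idx] == 1).sum()) for _, test_idx in cv.split(X, y)]


splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

spreads = {}
for name, cv in splitters.items():
    counts = minority_counts_per_fold(cv, X_demo, y_demo)
    # Spread = difference between the richest and poorest fold.
    spreads[name] = max(counts) - min(counts)
    print(f"{name:16s} counts={counts} spread={spreads[name]}")
```

With stratification, the per-class count differs by at most one sample between folds, so the spread for `StratifiedKFold` is 0 or 1 regardless of the random seed. The spread for plain `KFold` depends on the shuffle.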


## 4. Evaluating Models with Stratified Folds

We can now use our `StratifiedKFold` object (`skf`) with `cross_val_score` to get a reliable performance estimate for different models.


### a) Logistic Regression

In [29]:
from sklearn.model_selection import cross_val_score

scores_logistic = cross_val_score(LogisticRegression(), X, y, cv=skf, scoring='accuracy')
np.average(scores_logistic)

0.9019999999999999

### b) Decision Tree Classifier

In [30]:
from sklearn.tree import DecisionTreeClassifier

scores_dt = cross_val_score(DecisionTreeClassifier(), X, y, cv=skf, scoring='accuracy')
np.average(scores_dt)

0.8940000000000001

### c) Random Forest Classifier

In [31]:
from sklearn.ensemble import RandomForestClassifier

scores_rf = cross_val_score(RandomForestClassifier(n_estimators=10), X, y, cv=skf, scoring='accuracy')
np.average(scores_rf)

0.9129999999999999
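When reading the accuracies above, keep in mind that about 90% of the samples belong to class 0, so a model that always predicts the majority class already scores close to 0.9. The sketch below is an optional sanity check rather than part of the original workflow. It assumes a `DummyClassifier` baseline and the `balanced_accuracy` scorer are acceptable additions, and it uses the illustrative names `X_demo`, `y_demo`, `models`, and `summary`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Recreate the dataset from section 1 so this block is self-contained.
X_demo, y_demo = make_classification(
    n_features=10, n_samples=1000, n_informative=8, n_redundant=2,
    n_repeated=0, n_classes=2, weights=[0.9, 0.1], random_state=42,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    # Baseline that ignores the features and always predicts the majority class.
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}
metrics = ("accuracy", "balanced_accuracy")

summary = {}
for name, model in models.items():
    results = cross_validate(model, X_demo, y_demo, cv=cv, scoring=list(metrics))
    # cross_validate stores each metric under the key "test_<metric name>".
    summary[name] = {m: results[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in summary[name].items()})
```

Balanced accuracy averages the recall of each class, so the baseline scores exactly 0.5 on it even though its plain accuracy is high. Comparing a model against this baseline shows how much of its accuracy comes from actually detecting the minority class.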

### A Quick Shortcut

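For classification tasks, `scikit-learn` handles this for you. If you pass an integer to the `cv` parameter of `cross_val_score` and the estimator is a classifier with binary or multiclass targets, it uses `StratifiedKFold` by default. Note that this default does not shuffle the data, so the folds (and therefore the scores) can differ from those produced by the shuffled `skf` object above.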


In [32]:
# Passing cv=5 automatically uses StratifiedKFold for classifiers
cross_val_score(RandomForestClassifier(n_estimators=10), X, y, cv=5, scoring="accuracy")

array([0.92 , 0.895, 0.905, 0.92 , 0.915])
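To see which splitter an integer `cv` resolves to, you can call `check_cv` from `sklearn.model_selection` directly. The snippet below is a small sketch on toy labels (`y_toy` is an illustrative name). It sets `classifier=True` by hand to mimic what `cross_val_score` does when the estimator is a classifier.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Toy binary labels (illustrative only).
y_toy = np.array([0] * 90 + [1] * 10)

# Integer cv + classifier + binary/multiclass target -> StratifiedKFold.
cv_for_classifier = check_cv(cv=5, y=y_toy, classifier=True)
# The same integer for a non-classifier falls back to plain KFold.
cv_for_regressor = check_cv(cv=5, y=y_toy, classifier=False)

print(type(cv_for_classifier).__name__, "shuffle =", cv_for_classifier.shuffle)
print(type(cv_for_regressor).__name__, "shuffle =", cv_for_regressor.shuffle)
```

Both resolved splitters have `shuffle=False`. If you want shuffled, reproducible folds, pass an explicit `StratifiedKFold(n_splits=5, shuffle=True, random_state=...)` object as `cv` instead of an integer.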

## 5. Conclusion

When working with **imbalanced classification datasets**, prefer **Stratified K-Fold Cross-Validation** over standard K-Fold. It keeps the class distribution of each evaluation fold close to that of the full dataset, leading to more reliable and trustworthy model performance estimates.