# Stratified Splits for Imbalanced Classification


In this notebook, we will:

- Start with a motivating example of classification with unbalanced classes

- Show how a standard train-test split can lead to different class distributions in the training and testing sets

- Introduce StratifiedKFold in sklearn to maintain consistent class distribution in cross-validation


## 1. Motivation: Classification with Unbalanced Classes

In many real-world classification problems, we encounter **unbalanced datasets**, where one class significantly outnumbers the other(s). This imbalance can pose challenges for training a model, as it may not learn enough from the underrepresented class, leading to poor performance on that class during testing.

To illustrate this, let’s create a simple dataset `y` representing our target variable with two classes: `0` and `1`. However, class `0` appears more frequently than class `1`, creating an imbalance.

In [10]:
# Imbalanced target variable
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # 70% zeros, 30% ones
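
To make the imbalance concrete, the class proportions can be computed directly. This is a small sketch using `collections.Counter`; the names `counts` and `proportions` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Same imbalanced target as above
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Count each label, then divide by the total number of samples
counts = Counter(y)
proportions = {label: n / len(y) for label, n in counts.items()}
print(proportions)
```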

Let’s start by making a simple train-test split and observe how the imbalance in the classes can lead to unequal class distributions in the training and testing sets.

In [11]:
from sklearn.model_selection import train_test_split


# First train-test split
y_train_1, y_test_1 = train_test_split(y, test_size=0.3, random_state=1)
print("First split:")
print("y_train_1:", y_train_1)
print("y_test_1:", y_test_1)

# Second train-test split with a different random state
y_train_2, y_test_2 = train_test_split(y, test_size=0.3, random_state=2)
print("\nSecond split:")
print("y_train_2:", y_train_2)
print("y_test_2:", y_test_2)

First split:
y_train_1: [0, 0, 0, 0, 1, 1, 0]
y_test_1: [0, 1, 0]

Second split:
y_train_2: [0, 1, 0, 0, 0, 1, 1]
y_test_2: [0, 0, 0]


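Notice that the class distribution differs between the two splits, especially for the minority class (`1`). In the first split the test set contains one sample of class `1`; in the second it contains none, so a model evaluated on that test set could not be assessed on the minority class at all. This instability in class proportions can lead to inconsistent performance estimates, particularly when a test set lacks sufficient samples from the minority class.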

**Why This Matters**:  
A major assumption in machine learning is that both training and testing data are drawn from the same underlying distribution. When class distributions vary between splits, it can impact the model's performance on minority classes, leading to poor generalization.

To address this, we use **stratified splits**. Stratifying ensures that each split (both the training and the testing set) has approximately the same class distribution as the original data. This is especially important when using cross-validation on imbalanced datasets.
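
As a minimal sketch of the idea, `train_test_split` accepts a `stratify` argument. Passing the labels there asks sklearn to keep the class proportions as close as possible in both parts. The variable names `y_tr`, `y_te`, and `stratified_results` are illustrative only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # 70% zeros, 30% ones

# Repeat the split with the two seeds used earlier, this time stratified on y
stratified_results = []
for seed in (1, 2):
    y_tr, y_te = train_test_split(y, test_size=0.3, random_state=seed, stratify=y)
    stratified_results.append((Counter(y_tr), Counter(y_te)))
    print(f"seed={seed}  train={dict(Counter(y_tr))}  test={dict(Counter(y_te))}")
```

With stratification, the per-class counts in the training and test sets no longer depend on the seed; only which individual samples land in each set changes.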

## Implementing `StratifiedKFold` in `sklearn`

`StratifiedKFold` is a cross-validation technique that divides the data into `k` folds while preserving the original class distribution within each fold.

Let's see how it works. Our motivating example `y` is too small for this demonstration, so we build a larger target with the same 70/30 class ratio.

In [55]:
y = [0] * 14 + [1] * 6
# Define X to be data set for label set y
X = ["zero" if label == 0 else "one" for label in y]

print('X =', X)
print('y =', y)

X = ['zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'zero', 'one', 'one', 'one', 'one', 'one', 'one']
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


In [67]:
from sklearn.model_selection import StratifiedKFold
from collections import Counter

# Instantiate StratifiedKFold with 3 folds
skf = StratifiedKFold(n_splits=3)

# Demonstrate stratified splits
fold = 1
for train_index, test_index in skf.split(X, y):
    print(f"\nFold {fold}")

    # Get training and testing splits
    X_train, y_train = [X[i] for i in train_index], [y[i] for i in train_index]
    X_test, y_test = [X[i] for i in test_index], [y[i] for i in test_index]

    # Calculate and print class distribution in each split
    train_counts = Counter(y_train)
    test_counts = Counter(y_test)
    total_train = sum(train_counts.values())
    total_test = sum(test_counts.values())

    #print("X_train:", X_train)
    #print("y_train:", y_train)
    print("Train class distribution:", {k: v / total_train for k, v in train_counts.items()})

    #print("X_test:", X_test)
    #print("y_test:", y_test)
    print("Test class distribution:", {k: v / total_test for k, v in test_counts.items()})

    fold += 1



Fold 1
Train class distribution: {0: 0.6923076923076923, 1: 0.3076923076923077}
Test class distribution: {0: 0.7142857142857143, 1: 0.2857142857142857}

Fold 2
Train class distribution: {0: 0.6923076923076923, 1: 0.3076923076923077}
Test class distribution: {0: 0.7142857142857143, 1: 0.2857142857142857}

Fold 3
Train class distribution: {0: 0.7142857142857143, 1: 0.2857142857142857}
Test class distribution: {0: 0.6666666666666666, 1: 0.3333333333333333}


Note that in each fold, `StratifiedKFold` creates a different training and test set while keeping the class distribution in both close to the original 70% class `0` and 30% class `1`.

**Consistent Distribution**: Each test set includes samples from both classes (even the minority class `1`), which is not guaranteed with a regular `KFold` split on an imbalanced dataset.

Keeping the distribution consistent across splits makes the per-fold scores more comparable and ensures every fold is evaluated on the minority class.
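
To see the contrast with a regular `KFold`, the sketch below counts the class `1` samples in each test fold for both splitters on the same 20-sample target. The helper `minority_counts` is a hypothetical name introduced for this illustration. Neither splitter shuffles here, so the result depends on the labels being sorted as they are in this example.

```python
from sklearn.model_selection import KFold, StratifiedKFold

y = [0] * 14 + [1] * 6
X = ["zero" if label == 0 else "one" for label in y]

def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [sum(y[i] for i in test_idx) for _, test_idx in splitter.split(X, y)]

kfold_counts = minority_counts(KFold(n_splits=3))
stratified_counts = minority_counts(StratifiedKFold(n_splits=3))

print("KFold           minority samples per test fold:", kfold_counts)
print("StratifiedKFold minority samples per test fold:", stratified_counts)
```

On this sorted target, the plain `KFold` produces two test folds with no minority samples and one test fold made up entirely of them, while `StratifiedKFold` spreads them evenly.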

## Beer Dataset
Explore the given beer dataset. How many classes does it have? What is the distribution of each class? Use `StratifiedKFold` to create 5 training and test sets, preserving the approximate distribution of each class.

In [69]:
import pandas as pd

beer = pd.read_csv("beer_df.csv")

In [57]:
beer.sample(5)

     IBU   ABV  Rating  Beer_Type
197   44   6.2   3.469  IPA
22    75  10.5   4.07   Stout
314   70   6.9   3.752  IPA
258   80   7.6   3.767  IPA
327   72   9.0   3.793  IPA


In [58]:
beer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   IBU        347 non-null    int64  
 1   ABV        347 non-null    float64
 2   Rating     347 non-null    float64
 3   Beer_Type  347 non-null    object 
dtypes: float64(2), int64(1), object(1)
memory usage: 11.0+ KB


In [59]:
beer.Beer_Type.value_counts(normalize=True)

Beer_Type  proportion
IPA        0.56196
Stout      0.43804


In [60]:
## Make the split
beer_train, beer_test = train_test_split(beer.copy(),
                                            shuffle=True,
                                            random_state=21,
                                            stratify=beer['Beer_Type'])

In [61]:
## look at the distribution for the training data

beer_train.Beer_Type.value_counts(normalize=True)


Beer_Type  proportion
IPA        0.561538
Stout      0.438462


In [35]:
## look at the distribution for the test data
beer_test.Beer_Type.value_counts(normalize=True)


Beer_Type  proportion
IPA        0.561905
Stout      0.438095

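
The exercise asks for five stratified folds, while the cells above use a single stratified `train_test_split`. A minimal sketch of the five-fold version follows. Because `beer_df.csv` is not loaded here, it uses a synthetic stand-in for the `Beer_Type` column (195 IPA and 152 Stout labels, which is an assumption chosen to match the 347 rows and the proportions printed earlier). The names `beer_type`, `placeholder_X`, and `ipa_test_shares` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for beer["Beer_Type"] (assumption: 195 IPA + 152 Stout = 347 rows)
beer_type = np.array(["IPA"] * 195 + ["Stout"] * 152)
# The features do not influence a stratified split, so a placeholder is enough here
placeholder_X = np.zeros((len(beer_type), 1))

skf5 = StratifiedKFold(n_splits=5, shuffle=True, random_state=21)

ipa_test_shares = []
for fold, (train_idx, test_idx) in enumerate(skf5.split(placeholder_X, beer_type), start=1):
    train_share = np.mean(beer_type[train_idx] == "IPA")
    test_share = np.mean(beer_type[test_idx] == "IPA")
    ipa_test_shares.append(test_share)
    print(f"Fold {fold}: IPA share  train={train_share:.3f}  test={test_share:.3f}")
```

With the real data, `beer[['IBU', 'ABV']]` and `beer['Beer_Type']` would take the place of the two stand-in arrays.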

## Beer Classification

In [None]:
## Needed Libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# EarlyStopping stops training if the validation loss doesn't improve, which is
# useful for smaller or imbalanced datasets to prevent overfitting.
from tensorflow.keras.callbacks import EarlyStopping




In [70]:
# Load the dataset
beer = pd.read_csv("beer_df.csv")

# Extract features and labels
X = beer[['IBU', 'ABV']].values
y = beer['Beer_Type'].values
# Encode labels to binary
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)  # classes are sorted alphabetically: 'IPA' becomes 0, 'Stout' becomes 1
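
`LabelEncoder` assigns integer codes in sorted order of the class names, so it is worth checking the mapping instead of assuming it. The sketch below does this on a small hand-written list of labels; `demo_encoder`, `encoded`, and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(["Stout", "IPA", "IPA", "Stout"])

# classes_ holds the original labels; the position of each label is its integer code
mapping = {str(name): code for code, name in enumerate(demo_encoder.classes_)}
print("mapping:", mapping)
print("encoded:", list(encoded))

# inverse_transform converts integer codes back to the original labels
print("decoded:", list(demo_encoder.inverse_transform([0, 1])))
```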


In [71]:
# Normalize features
#scaler = StandardScaler()
#X = scaler.fit_transform(X)
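
The scaling cell above is commented out. If it is re-enabled, one common practice is to fit the scaler inside each fold on the training portion only, so that statistics from the test fold do not influence preprocessing. The sketch below illustrates this on randomly generated stand-in data; `X_demo`, `y_demo`, and the chosen means and spreads are assumptions, not values from the beer dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the [IBU, ABV] feature matrix and the encoded labels
X_demo = rng.normal(loc=[60.0, 7.0], scale=[15.0, 1.5], size=(40, 2))
y_demo = np.array([0] * 24 + [1] * 16)

skf_demo = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

train_means = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    scaler = StandardScaler()
    # Learn the mean and standard deviation from the training fold only
    X_train_scaled = scaler.fit_transform(X_demo[train_idx])
    # Apply the same training-fold statistics to the test fold
    X_test_scaled = scaler.transform(X_demo[test_idx])
    train_means.append(X_train_scaled.mean(axis=0))
    print("train mean:", X_train_scaled.mean(axis=0).round(3),
          " test mean:", X_test_scaled.mean(axis=0).round(3))
```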


In [72]:
# Use StratifiedKFold for cross-validation
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_accuracies = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"\nFold {fold+1}")

    # Split data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Build the model
    model = Sequential([
        Dense(16, input_shape=(X_train.shape[1],), activation='relu'),
        Dense(8, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

    # Define early stopping to prevent overfitting
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

    # Train the model
    model.fit(X_train, y_train, epochs=50, batch_size=2, validation_split=0.2, callbacks=[early_stopping], verbose=0)

    # Evaluate the model on the test set for this fold
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"Fold {fold+1} Test Accuracy: {accuracy:.2f}")
    fold_accuracies.append(accuracy)

# Calculate the average accuracy across all folds
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Cross-Validation Accuracy: {average_accuracy:.2f}")
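
One detail to be aware of: Keras takes the `validation_split` fraction from the last samples of the arrays passed to `fit`, before any shuffling, so the validation portion is not stratified. A possible alternative, sketched here under that assumption, is to carve out a stratified validation set with `train_test_split` and pass it through the `validation_data` argument. `X_fold` and `y_fold` are hypothetical stand-ins for one fold's `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for one fold's training data (features and encoded labels)
X_fold = np.arange(40, dtype=float).reshape(20, 2)
y_fold = np.array([0] * 11 + [1] * 9)

# Hold out 20% for validation while preserving the class proportions
X_fit, X_val, y_fit, y_val = train_test_split(
    X_fold, y_fold, test_size=0.2, stratify=y_fold, random_state=42
)
print("validation class counts:", np.bincount(y_val))

# The held-out pair could then replace validation_split in the training call:
# model.fit(X_fit, y_fit, epochs=50, batch_size=2,
#           validation_data=(X_val, y_val), callbacks=[early_stopping], verbose=0)
```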


Fold 1
Fold 1 Test Accuracy: 0.83

Fold 2
Fold 2 Test Accuracy: 0.91

Fold 3
Fold 3 Test Accuracy: 0.83

Average Cross-Validation Accuracy: 0.86
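
Alongside the average, the spread of the per-fold scores gives a rough sense of how stable the estimate is. The sketch below computes both from the accuracies printed above, copied by hand and therefore rounded to two decimals; `mean_acc` and `std_acc` are illustrative names.

```python
import numpy as np

# Per-fold accuracies copied from the printed output above (rounded to two decimals)
fold_accuracies = [0.83, 0.91, 0.83]

mean_acc = np.mean(fold_accuracies)
std_acc = np.std(fold_accuracies)  # population standard deviation across folds
print(f"Accuracy: {mean_acc:.2f} +/- {std_acc:.2f}")
```

With only three folds this spread is a coarse indicator, not a confidence interval.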
