# How to Handle Imbalanced Classes in Machine Learning?

Imbalanced classes put “accuracy” out of business. This is a surprisingly common problem in machine learning (specifically in classification), occurring in datasets with a disproportionate ratio of observations in each class.

Standard accuracy no longer reliably measures performance, which makes model training much trickier.

Imbalanced classes appear in many domains, including:

- Fraud detection
- Spam filtering
- Disease screening
- SaaS subscription churn
- Advertising click-throughs

__Intuition: Disease Screening Example:__

Let’s say your client is a leading research hospital, and they’ve asked you to train a model for detecting a disease based on biological inputs collected from patients.

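But here’s the catch: the disease is relatively rare; it occurs in only 8% of patients who are screened.

Now imagine you skip training altogether and write a single line of code that always predicts “No Disease”. That “solution” would be 92% accurate.

Unfortunately, that accuracy is misleading.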

For patients who do not have the disease, you’d have 100% accuracy.

For patients who do have the disease, you’d have 0% accuracy.

Your overall accuracy would be high simply because most patients do not have the disease (not because your model is any good).
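
To make the arithmetic concrete, here is a minimal sketch with made-up numbers: 1,000 hypothetical patients at the 8% disease rate from the example, scored against a "model" that always predicts no disease. The cohort size and variable names are illustrative assumptions, not data from the hospital scenario.

```python
# Hypothetical screening cohort: 1,000 patients, 8% with the disease.
n_patients = 1000
n_sick = 80
y_true = [1] * n_sick + [0] * (n_patients - n_sick)

# A "model" that always predicts "no disease".
y_pred = [0] * n_patients

# Overall accuracy: share of predictions that match the truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_patients

# Accuracy within each group of patients.
healthy_hits = sum(p == 0 for t, p in zip(y_true, y_pred) if t == 0)
sick_hits = sum(p == 1 for t, p in zip(y_true, y_pred) if t == 1)
acc_healthy = healthy_hits / (n_patients - n_sick)
acc_sick = sick_hits / n_sick

print(accuracy, acc_healthy, acc_sick)  # 0.92 1.0 0.0
```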

This is clearly a problem because many machine learning algorithms are designed to maximize overall accuracy. The rest of this guide will illustrate different tactics for handling imbalanced classes.

__Note:__  Not every technique below will work for every problem. However, 9 times out of 10, at least one of these techniques should do the trick.

__Balance Scale Dataset:__

For this guide, we’ll use a synthetic dataset called Balance Scale Data, which you can download from the UCI Machine Learning Repository "http://archive.ics.uci.edu/ml/datasets/balance+scale".

This dataset was originally generated to model psychological experiment results, but it’s useful for us because it’s a manageable size and has imbalanced classes.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
 
# Read dataset
df = pd.read_csv('balance-scale.data', 
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])
 
# Display example observations
df.head()

|   | balance | var1 | var2 | var3 | var4 |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |


The dataset contains information about whether a scale is balanced or not, based on weights and distances of the two arms.

It has 1 target variable, which we've labeled balance.

It has 4 input features, which we've labeled var1 through var4.

The target variable has 3 classes.

R for right-heavy, i.e. when var3 * var4 > var1 * var2

L for left-heavy, i.e. when var3 * var4 < var1 * var2

B for balanced, i.e. when var3 * var4 = var1 * var2
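
The three rules can be written as a small function. The sketch below applies it to every combination of feature values, assuming each of the four features takes the integer values 1 through 5 (an assumption about how the data was generated). The function name `label_scale` is hypothetical. The resulting counts can be compared with the `value_counts()` output for the real file.

```python
from collections import Counter
from itertools import product


def label_scale(var1, var2, var3, var4):
    """Apply the rules above: compare var1 * var2 with var3 * var4."""
    left, right = var1 * var2, var3 * var4
    if right > left:
        return "R"
    if right < left:
        return "L"
    return "B"


# Assumption: every feature takes each integer value from 1 to 5.
counts = Counter(label_scale(*row) for row in product(range(1, 6), repeat=4))
print(counts["R"], counts["L"], counts["B"])  # 288 288 49
```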

In [2]:
# Count of each class
df['balance'].value_counts()

R    288
L    288
B     49
Name: balance, dtype: int64

However, for our purpose, we're going to turn this into a binary classification problem.

We're going to label each observation as 1 (positive class) if the scale is balanced or 0 (negative class) if the scale is not balanced.

In [3]:
# Transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]
 
df.head()

|   | balance | var1 | var2 | var3 | var4 |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 | 2 |
| 2 | 0 | 1 | 1 | 1 | 3 |
| 3 | 0 | 1 | 1 | 1 | 4 |
| 4 | 0 | 1 | 1 | 1 | 5 |


In [4]:
df['balance'].value_counts()

0    576
1     49
Name: balance, dtype: int64

As you can see, only about 8% of the observations were balanced. Therefore, if we were to always predict 0, we'd achieve an accuracy of 92%.

__The Danger of Imbalanced Classes:__

Now that we have a dataset, we can really show the dangers of imbalanced classes.

First, let's import the Logistic Regression algorithm and the accuracy metric from Scikit-Learn.

In [5]:
# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_0 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_0 = clf_0.predict(X)

In [6]:
# How's the accuracy?
print( accuracy_score(y, pred_y_0) )

0.9216


In [7]:
# Should we be excited?
print( np.unique( pred_y_0 ) )

[0]
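
A confusion matrix and the recall on the positive class make the problem with all-zero predictions explicit. The sketch below uses stand-in arrays with the same class counts as our data (576 negatives, 49 positives) rather than the fitted model, so the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Stand-in arrays with the class counts from this guide:
# 576 negatives, 49 positives, and a model that always predicts 0.
y_true = np.array([0] * 576 + [1] * 49)
y_pred = np.zeros_like(y_true)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[576   0]
#  [ 49   0]]

# Recall on the positive class: share of balanced scales the model finds.
print(recall_score(y_true, y_pred))  # 0.0
```

All 49 balanced scales land in the "predicted 0" column, so recall is zero even though accuracy is 92%.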


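The model predicts only 0. It ignores the minority class entirely, so its 92% accuracy simply mirrors the class ratio.

Let's look at some techniques for handling imbalanced classes: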

# 1. Up-sample Minority Class:

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

There are several heuristics for doing so, but the most common way is to simply resample with replacement.
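
To see what sampling with replacement does mechanically, here is a NumPy-only sketch that works on row positions instead of a DataFrame. The index ranges and the seed are hypothetical; only the class sizes (576 and 49) come from our dataset.

```python
import numpy as np

rng = np.random.default_rng(123)  # seed chosen arbitrarily

# Hypothetical row positions: 576 majority rows, then 49 minority rows.
majority_idx = np.arange(0, 576)
minority_idx = np.arange(576, 625)

# Draw minority rows with replacement until both classes are the same size.
upsampled_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)

# Every drawn row is an existing minority row, so many rows repeat.
print(len(upsampled_idx))                    # 576
print(len(np.unique(upsampled_idx)) <= 49)   # True
```

No new information is created: the up-sampled class contains at most the 49 original rows, each repeated several times.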

First, we'll import the resample utility from Scikit-Learn:

In [8]:
# Utility function for resampling
from sklearn.utils import resample

Next, we'll create a new DataFrame with an up-sampled minority class. Here are the steps:

>1. First, we'll separate observations from each class into different DataFrames.

>2. Next, we'll resample the minority class with replacement, setting the number of samples to match that of the majority class.

>3. Finally, we'll combine the up-sampled minority class DataFrame with the original majority class DataFrame.

In [9]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.balance.value_counts()

1    576
0    576
Name: balance, dtype: int64

In [10]:
# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)
 
# Train model
clf_1 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_1 = clf_1.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_1) )

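[0 1]
0.5147569444444444

The model no longer predicts just one class. Accuracy dropped to about 51%, but on balanced classes it is now a more meaningful measure of performance.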


# 2. Down-sample Majority Class:

Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

The most common heuristic for doing so is resampling without replacement.

The process is similar to that of up-sampling. Here are the steps:

>1. First, we'll separate observations from each class into different DataFrames.

>2. Next, we'll resample the majority class without replacement, setting the number of samples to match that of the minority class.

>3. Finally, we'll combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [11]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64
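
A more compact variant of the same down-sampling step is to draw an equal number of rows from every class with `groupby(...).sample(...)`, assuming pandas 1.1 or later. The sketch rebuilds the data as the full grid of feature values 1 to 5, which is an assumption made so the example runs without the data file. `df_demo` and `df_down` are illustrative names.

```python
from itertools import product

import pandas as pd

# Assumption: rebuild the data as the full grid of feature values 1 to 5.
rows = list(product(range(1, 6), repeat=4))
df_demo = pd.DataFrame(rows, columns=["var1", "var2", "var3", "var4"])
df_demo["balance"] = (
    df_demo["var1"] * df_demo["var2"] == df_demo["var3"] * df_demo["var4"]
).astype(int)

# Size of the smallest class (49 here).
n_min = df_demo["balance"].value_counts().min()

# Draw n_min rows from each class without replacement.
df_down = df_demo.groupby("balance").sample(n=n_min, random_state=123)

print(df_down["balance"].value_counts().to_dict())
```

Because the sample size is taken from the data, this version keeps working if the class counts change.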

In [12]:
# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)
 
# Train model
clf_2 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_2 = clf_2.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_2) )

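[0 1]
0.5612244897959183

Again the model predicts both classes, this time with about 56% accuracy on the balanced, down-sampled data.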


# 3. Penalize Algorithms (Cost-Sensitive Training):

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.
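
One common way to set those costs is to weight each class inversely to its frequency. Scikit-Learn's `class_weight='balanced'` option documents the heuristic `n_samples / (n_classes * np.bincount(y))`. The sketch below computes it by hand for our class counts; the label array is a stand-in, not the DataFrame column.

```python
import numpy as np

# Class labels with the counts from this guide: 576 zeros, 49 ones.
y = np.array([0] * 576 + [1] * 49)

# "balanced" heuristic: n_samples / (n_classes * count of each class).
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)

print(weights.round(3))  # [0.543 6.378]
```

With these weights, a mistake on a balanced scale costs roughly 12 times as much as a mistake on an unbalanced one.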

A popular algorithm for this technique is Penalized-SVM:

In [13]:
from sklearn.svm import SVC

In [14]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_3 = SVC(kernel='linear', 
            class_weight='balanced', # penalize
            probability=True)
 
clf_3.fit(X, y)
 
# Predict on training set
pred_y_3 = clf_3.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_3 ) )
 
# How's our accuracy?
print( accuracy_score(y, pred_y_3) )

[0 1]
0.688
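
Because the classifier is built with `probability=True`, it can also output class probabilities, which allow a threshold-independent check such as the area under the ROC curve. This metric is not part of the walkthrough above; it is one possible complement to accuracy. The sketch assumes the data can be rebuilt as the full grid of feature values 1 to 5, and `random_state=0` is an arbitrary choice.

```python
from itertools import product

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Assumption: rebuild the data as the full grid of feature values 1 to 5.
X = np.array(list(product(range(1, 6), repeat=4)))
y = (X[:, 0] * X[:, 1] == X[:, 2] * X[:, 3]).astype(int)

clf = SVC(kernel="linear", class_weight="balanced",
          probability=True, random_state=0)
clf.fit(X, y)

# Column 1 holds the estimated probability of the positive class.
prob_pos = clf.predict_proba(X)[:, 1]

# Area under the ROC curve on the training data (0.5 is chance level).
auc = roc_auc_score(y, prob_pos)
print(round(auc, 3))
```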


# 4. Use Tree-Based Algorithms

The final tactic we'll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform singular decision trees, so we'll jump right into those:

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)
 
# Predict on training set
pred_y_4 = clf_4.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_4 ) )
 
# How's our accuracy?
print( accuracy_score(y, pred_y_4) )

[0 1]
1.0


Wow! 100% accuracy? Is this magic? A sleight of hand? Cheating? Too good to be true?

Well, tree ensembles have become very popular because they perform extremely well on many real-world problems. We certainly recommend them wholeheartedly.

__However:__

While these results are encouraging, the model could be overfit, so you should still evaluate your model on an unseen test set before making the final decision.

Note: your numbers may differ slightly due to the randomness in the algorithm. You can set a random seed for reproducible results.
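
A minimal sketch of that held-out evaluation with a fixed seed might look like the following. It assumes the data can be rebuilt as the full grid of feature values 1 to 5. The 25% test size and the seed value 123 are arbitrary choices.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Assumption: rebuild the data as the full grid of feature values 1 to 5.
X = np.array(list(product(range(1, 6), repeat=4)))
y = (X[:, 0] * X[:, 1] == X[:, 2] * X[:, 3]).astype(int)

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)

# Fixing random_state makes the forest reproducible.
clf = RandomForestClassifier(random_state=123)
clf.fit(X_train, y_train)

pred_train = clf.predict(X_train)
pred_test = clf.predict(X_test)

print("train accuracy:", accuracy_score(y_train, pred_train))
print("test accuracy: ", accuracy_score(y_test, pred_test))
print("test recall:   ", recall_score(y_test, pred_test))
```

A large gap between the train and test numbers, or a low test recall on the minority class, would suggest the training-set score is too optimistic.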

__Recommended Homework:__

# Create Synthetic Samples (Data Augmentation)

Creating synthetic samples is a close cousin of up-sampling, and some people might categorize them together. For example, the SMOTE algorithm is a method of resampling from the minority class while slightly perturbing feature values, thereby creating "new" samples.

You can find an implementation of SMOTE in the imbalanced-learn library (imported as imblearn).
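
To build intuition before reaching for the library, here is a simplified, NumPy-only sketch of the interpolation idea behind SMOTE. It is not the library's implementation: the function name `smote_like`, its parameters, and the random minority rows are all hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    row and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour

    # Indices of the k nearest neighbours of every row.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # starting rows
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # position on the segment

    return X_min[base] + gap * (X_min[partner] - X_min[base])


# Hypothetical minority rows with four features, values between 1 and 5.
X_min = np.random.default_rng(1).integers(1, 6, size=(49, 4))
X_new = smote_like(X_min, n_new=100)
print(X_new.shape)  # (100, 4)
```

Each synthetic row lies on the line segment between two existing minority rows, which is what distinguishes this from plain duplication. For real work, imbalanced-learn's `SMOTE` class with its `fit_resample(X, y)` method is the more robust option.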

# Happy Learning