# 5 tactics for handling imbalanced classes in machine learning:

## 1. Up-sample the minority class
## 2. Down-sample the majority class
## 3. Change your performance metric
## 4. Penalize algorithms (cost-sensitive training)
## 5. Use tree-based algorithms

In [1]:
import numpy as np
import pandas as pd

In [5]:
# Read dataset
df = pd.read_csv('balance-scale.data', names=['balance', 'var1', 'var2', 'var3', 'var4'])

# Display example observations
df.head()

  balance  var1  var2  var3  var4
0       B     1     1     1     1
1       R     1     1     1     2
2       R     1     1     1     3
3       R     1     1     1     4
4       R     1     1     1     5


In [6]:
# Count of each class
df['balance'].value_counts()

R    288
L    288
B     49
Name: balance, dtype: int64

In [7]:
# Transform into binary classification (scale is either balanced or not)
df['balance'] = [1 if b=='B' else 0 for b in df.balance]
df['balance'].value_counts()

0    576
1     49
Name: balance, dtype: int64

In [8]:
### Only about 8% of observations (49 of 625) are balanced, so always guessing unbalanced (predicting 0) would already give about 92% accuracy.

# Example of the dangers of imbalanced classes: logistic regression with default parameters
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Separate into target and input variables
y = df.balance
X = df.drop('balance', axis = 1)

# Train model
clf0 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y0 = clf0.predict(X)

# Check the accuracy
print(accuracy_score(y, pred_y0))

0.9216


In [9]:
# Great accuracy! But wait... is the model predicting anything besides 0?
print(np.unique(pred_y0))

[0]


In [10]:
# :( :( :( 
# We need a better way of doing things

## 1) Up-sample Minority Class
### Randomly duplicate observations from minority class to reinforce its signal
#### 1. First, we'll separate observations from each class into different DataFrames.
#### 2. Next, we'll resample the minority class with replacement, setting the number of samples to match that of the majority class.
#### 3. Finally, we'll combine the up-sampled minority class DataFrame with the original majority class DataFrame.

In [12]:
# module for resampling
from sklearn.utils import resample

# Separate minority and majority classes
df_minority = df[df.balance==1]
df_majority = df[df.balance==0]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                replace = True, # Sample with replacement
                                n_samples = 576, # match majority class
                                random_state = 123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.balance.value_counts()

1    576
0    576
Name: balance, dtype: int64

In [14]:
# Train model on upsampled dataset

y = df_upsampled.balance
X = df_upsampled.drop('balance', axis = 1)

clf1 = LogisticRegression().fit(X, y) # train

pred_y1 = clf1.predict(X) # predict

# Is our model still predicting only 0s?
print(np.unique(pred_y1))

# What is the accuracy?
print(accuracy_score(y, pred_y1))

[0 1]
0.5138888888888888


In [15]:
# Low accuracy, but predicting more than 1 class is a step in the right direction

## 2) Down-Sample Majority Class

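### Down-sampling randomly removes observations from the majority class so that it no longer dominates training. The most common heuristic for doing so is resampling without replacement.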
### The process is similar to that of up-sampling. Here are the steps:

#### 1. First, we'll separate observations from each class into different DataFrames.
#### 2. Next, we'll resample the majority class without replacement, setting the number of samples to match that of the minority class.
#### 3. Finally, we'll combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [17]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64

In [18]:
# Train model based on logistic regression

# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)
 
# Train model
clf_2 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_2 = clf_2.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )
 
# How's our accuracy?
print( accuracy_score(y, pred_y_2) )

[0 1]
0.5816326530612245


In [19]:
# Not just predicting one class, results seem slightly better

## 3) Change Your Performance Metric

#### For a general-purpose metric for classification, we recommend Area Under ROC Curve (AUROC). Intuitively, AUROC represents the likelihood of your model distinguishing observations from two classes. In other words, if you randomly select one observation from each class, what's the probability that your model will be able to "rank" them correctly?
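
The "rank them correctly" reading of AUROC can be written out directly. The sketch below is a minimal pure-Python illustration, not scikit-learn's implementation: the function name `pairwise_auroc` and the toy labels and scores are made up for this example. For binary labels it should agree with `roc_auc_score`, which computes the same quantity more efficiently.

```python
from itertools import product


def pairwise_auroc(y_true, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the balance-scale dataset).
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

# Of the 4 positive/negative pairs, 3 are ranked correctly.
print(pairwise_auroc(y_toy, scores_toy))  # 0.75
```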

In [23]:
from sklearn.metrics import roc_auc_score

In [21]:
# To calculate AUROC, you'll need predicted class probabilities instead of just the predicted classes. 
# You can get them using the .predict_proba() function like so:

# Predict class probabilities using downsampled data
prob_y_2 = clf_2.predict_proba(X)
 
# Keep only the positive class
prob_y_2 = [p[1] for p in prob_y_2]

prob_y_2[:5] # Example

[0.4541919722647962,
 0.4820596221328388,
 0.46862327066392495,
 0.4786837883268913,
 0.5814385682015971]

In [24]:
# How did this model (trained on the down-sampled dataset) do in terms of AUROC?
print(roc_auc_score(y, prob_y_2))

0.5680966264056644


In [26]:
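# How does this compare to the original model trained on the imbalanced dataset?
# (X and y still hold the down-sampled data from the previous cells, so both models are scored on the same rows.)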

prob_y0 = clf0.predict_proba(X)
prob_y0 = [p[1] for p in prob_y0]
 
print( roc_auc_score(y, prob_y0) )

0.47480216576426487


In [27]:
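# Note: the last value should be 0.530718537415, but this run gave about 0.47.
# An AUROC below 0.5 means the scores rank the two classes slightly worse than chance on this data.
# Scikit-Learn is not misinterpreting the positive class: predict_proba columns follow clf0.classes_, so p[1] is the probability of class 1.
# Inverting the predictions turns an AUROC of a into 1 - a (here about 0.525), which does not reproduce the expected value.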
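
To make the effect of inverting predictions concrete, here is a small sketch under stated assumptions: the helper `pairwise_auroc` and the toy scores are hypothetical, and the AUROC is computed by brute force over positive/negative pairs rather than with `roc_auc_score`.

```python
def pairwise_auroc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and hypothetical P(class 1) scores that rank worse than chance.
y_toy = [0, 0, 0, 1, 1]
p_class1 = [0.2, 0.6, 0.7, 0.5, 0.4]

# Inverting the scores is the same as using P(class 0) = 1 - P(class 1).
p_class0 = [1 - p for p in p_class1]

a = pairwise_auroc(y_toy, p_class1)
a_inv = pairwise_auroc(y_toy, p_class0)

print(a, a_inv)   # the two values sum to 1 (up to rounding)
```

Inverting only mirrors the score around 0.5. It makes a worse-than-chance number look better without changing what the model has learned.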

## 4) Penalize Algorithms (Cost-Sensitive Training)

### The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class. A popular algorithm for this technique is Penalized-SVM:

In [29]:
# Support Vector Machine
from sklearn.svm import SVC

In [30]:
# During training, we can use the argument class_weight='balanced' 
# to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

# We also want to include the argument probability=True if we want to enable probability estimates for SVM algorithms.
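
For intuition about what "proportional to how under-represented it is" means, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python for the class counts in this dataset. The function name `balanced_class_weights` is hypothetical and is not a scikit-learn API.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Class counts from the binary target above: 576 zeros, 49 ones.
y_counts = [0] * 576 + [1] * 49
weights = balanced_class_weights(y_counts)

# Class 0 gets roughly 0.54 and class 1 roughly 6.38, so a mistake on a
# balanced scale costs about 576 / 49 (nearly 12) times as much.
print(weights)
print(weights[1] / weights[0])
```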

In [31]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train Model
clf_3 = SVC(kernel = 'linear', 
           class_weight = 'balanced', # Penalize
           probability = True)

clf_3.fit(X, y)

# Predict on training set
pred_y_3 = clf_3.predict(X)

# Is our model predicting more than 1 class?
print(np.unique(pred_y_3))

# What is our accuracy?
print(accuracy_score(y, pred_y_3))

[0 1]
0.688


In [33]:
# What about AUROC?
prob_y_3 = clf_3.predict_proba(X)
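# NOTE: p[0] is the probability of class 0, not of the positive class (columns follow clf_3.classes_).
# Scoring with p[0] reports 1 - AUROC of the p[1] scores, so the class-1 AUROC here is about 0.47.
prob_y_3 = [p[0] for p in prob_y_3]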
print( roc_auc_score(y, prob_y_3) )
# 0.5305236678

0.5305236678004536


## 5) Use Tree-Based Algorithms

#### Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes. In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we'll jump right into those.

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
# Train random forest on imbalanced dataset

y = df.balance
X = df.drop('balance', axis = 1)

# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)

# Predict
pred_y_4 = clf_4.predict(X)

# Is our model still predicting 1 class?
print(np.unique(pred_y_4))

# What is our accuracy?
print(accuracy_score(y, pred_y_4))

[0 1]
0.9744


In [37]:
# Wow!

# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y, prob_y_4) )

0.9954294217687076


In [None]:
# O_O

# Note: While these results are encouraging, the model could be overfit, 
# so you should still evaluate your model on an unseen test set before making the final decision.
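
One way to set up such a test set is to split before any resampling, keeping the class ratio similar in both parts. The sketch below is a minimal standard-library illustration with a hypothetical helper `stratified_split`. In practice scikit-learn's `train_test_split(..., stratify=y)` is the usual tool, and the 25% test fraction here is an arbitrary assumption.

```python
import random


def stratified_split(y, test_fraction=0.25, seed=123):
    """Return (train_idx, test_idx), keeping class ratios similar in both."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts from the binary target above: 576 zeros, 49 ones.
y_all = [0] * 576 + [1] * 49
train_idx, test_idx = stratified_split(y_all)

# Up- or down-sample only the training rows; leave the test rows untouched
# so duplicated observations cannot leak into the evaluation.
print(len(train_idx), len(test_idx))      # 469 156
print(sum(y_all[i] for i in test_idx))    # 12 minority rows held out
```

Metrics such as accuracy and AUROC would then be computed on the `test_idx` rows only, which the model never saw during training.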