# How to deal with Imbalanced Data

This note follows the [EliteDataScience tutorial](https://elitedatascience.com/imbalanced-classes) on imbalanced classes. Our purpose is only to introduce approaches for dealing with imbalanced data, not to compare models. The examples use the [Balance Scale dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/) from the UCI repository.

In [3]:
import pandas as pd
import numpy as np
 
df = pd.read_csv('balance-scale.csv', names=['balance', 'var1', 'var2', 'var3', 'var4'])
 
df.head()

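|   | balance | var1 | var2 | var3 | var4 |
|---|---------|------|------|------|------|
| 0 | B       | 1    | 1    | 1    | 1    |
| 1 | R       | 1    | 1    | 1    | 2    |
| 2 | R       | 1    | 1    | 1    | 3    |
| 3 | R       | 1    | 1    | 1    | 4    |
| 4 | R       | 1    | 1    | 1    | 5    |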


In [4]:
df['balance'].value_counts()

R    288
L    288
B     49
Name: balance, dtype: int64

In [6]:
df['balance'] = pd.Series([1 if b=='B' else 0 for b in df.balance], index=df.index)
df['balance'].value_counts()

0    576
1     49
Name: balance, dtype: int64

## The Danger of Imbalanced Classes

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [11]:
y = df.balance
X = df.drop('balance', axis=1)
 
clf_0 = LogisticRegression().fit(X, y)

pred_y_0 = clf_0.predict(X)

print('accuracy =', round(accuracy_score(y, pred_y_0), 3))

accuracy = 0.922


In [12]:
print( np.unique( pred_y_0 ) )

[0]


As you can see, this model is only predicting 0, which means it's completely ignoring the minority class in favor of the majority class.
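
To see why 92% accuracy is unimpressive here, note that a classifier that always predicts the majority class is right exactly as often as the majority class occurs. The following is a minimal check using the class counts printed earlier; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_majority = 576  # label 0
n_minority = 49   # label 1

# A model that always predicts 0 is right on every majority example
# and wrong on every minority example.
baseline_accuracy = n_majority / (n_majority + n_minority)
minority_recall = 0 / n_minority

print(round(baseline_accuracy, 3))  # 0.922
print(minority_recall)              # 0.0
```

The accuracy matches the logistic regression above, while recall on the minority class is zero.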

## 1 Up-sample Minority Class

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

In [13]:
from sklearn.utils import resample

In [18]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

print(len(df_majority), len(df_minority))
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=576, random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.balance.value_counts()

576 49


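ValueError: Cannot sample 576 out of arrays with dim 49

This error appears to come from the older scikit-learn release used for this run, which rejected an `n_samples` larger than the input even with `replace=True`. The remaining cell in this section was therefore not executed, which is why it shows no output.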

In [None]:
# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)
 
# Train model
clf_1 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_1 = clf_1.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_1) )
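
If `resample` raises the error shown above in your environment, one possible workaround is pandas' own `DataFrame.sample` with `replace=True`. The sketch below uses a small toy frame (`toy` and its columns are stand-ins, not the Balance Scale data) to show the mechanics of up-sampling the minority class to the size of the majority class.

```python
import pandas as pd

# Toy frame standing in for df: 6 majority rows (0) and 2 minority rows (1).
toy = pd.DataFrame({"balance": [0, 0, 0, 0, 0, 0, 1, 1],
                    "var1":    [1, 2, 3, 4, 5, 1, 2, 3]})

toy_majority = toy[toy.balance == 0]
toy_minority = toy[toy.balance == 1]

# Draw minority rows with replacement until both classes have equal size.
toy_minority_upsampled = toy_minority.sample(n=len(toy_majority),
                                             replace=True,
                                             random_state=123)

# Combine majority class with the up-sampled minority class.
toy_upsampled = pd.concat([toy_majority, toy_minority_upsampled])
counts = toy_upsampled.balance.value_counts()
print(counts)
```

Both classes end up with six rows; the minority rows are duplicates of the two originals, so no new information is created.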

## 2 Down-sample Majority Class

Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

In [19]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64

In [20]:
# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)
 
# Train model
clf_2 = LogisticRegression().fit(X, y)
 
# Predict on training set
pred_y_2 = clf_2.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_2) )

[0 1]
0.581632653061


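The model is no longer predicting just one class. Its accuracy of about 58% is far below the baseline's 92%, but it is now a more meaningful number because both classes are equally represented.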

We'd still want to validate the model on an unseen test dataset, but the results are more encouraging.
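
One way to do that validation is to hold out a stratified test set before resampling and to down-sample only the training split. The sketch below runs on synthetic data; the frame `demo`, its columns `f1` and `f2`, and the split sizes are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for df: a rare positive class driven mostly by f1.
rng = np.random.RandomState(0)
n = 1000
demo = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
demo["balance"] = (demo.f1 + rng.normal(scale=0.5, size=n) > 1.5).astype(int)

# Hold out a stratified test set BEFORE any resampling.
train, test = train_test_split(demo, test_size=0.25,
                               stratify=demo.balance, random_state=0)

# Down-sample the majority class in the training split only.
train_majority = train[train.balance == 0]
train_minority = train[train.balance == 1]
train_majority_down = resample(train_majority, replace=False,
                               n_samples=len(train_minority),
                               random_state=123)
train_balanced = pd.concat([train_majority_down, train_minority])

clf = LogisticRegression().fit(train_balanced[["f1", "f2"]],
                               train_balanced.balance)

# Score on the untouched, still-imbalanced test split.
test_prob = clf.predict_proba(test[["f1", "f2"]])[:, 1]
test_auc = roc_auc_score(test.balance, test_prob)
print(round(test_auc, 3))
```

Resampling after the split keeps the test set representative of the real class balance and prevents duplicated or discarded rows from leaking into the evaluation.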

## 3 Change Your Performance Metric

So far, we've looked at two ways of addressing imbalanced classes by resampling the dataset. Next, we'll look at using other performance metrics for evaluating the models.

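For a general-purpose classification metric, we recommend the Area Under the ROC Curve (AUROC).
Intuitively, AUROC is the probability that the model assigns a higher score to a randomly chosen positive observation than to a randomly chosen negative one, so 0.5 corresponds to random ranking and 1.0 to perfect separation.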
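
To make the ranking interpretation concrete, here is a small sketch that computes AUROC by counting correctly ordered (positive, negative) pairs. The helper `pairwise_auroc` is a hypothetical illustration, not a scikit-learn function; in practice `roc_auc_score` does this far more efficiently.

```python
def pairwise_auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Pairs: (0.35 vs 0.1) win, (0.35 vs 0.4) loss, (0.8 vs 0.1) win, (0.8 vs 0.4) win
print(pairwise_auroc(y_demo, scores_demo))  # 0.75
```

Because only the ordering of the scores matters, AUROC is unaffected by the class ratio, which is why it is more informative than accuracy on imbalanced data.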


In [21]:
from sklearn.metrics import roc_auc_score

In [22]:
# Predict class probabilities
prob_y_2 = clf_2.predict_proba(X)
 
# Keep only the positive class
prob_y_2 = [p[1] for p in prob_y_2]
 
prob_y_2[:5] # Example

[0.45419197226479668,
 0.48205962213283932,
 0.4686232706639249,
 0.4786837883268909,
 0.58143856820159612]

The AUROC of the model trained on the down-sampled dataset is:

In [23]:
print( roc_auc_score(y, prob_y_2) )

0.568096626406


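On the other hand, the baseline model (which predicts only the negative class) can be scored on the same down-sampled `X` and `y`. Note that the cell prints `1 - AUROC`, so the raw AUROC of the baseline is slightly below 0.5:

In [26]:
# X and y are still the down-sampled data from In [20]
prob_y_0 = clf_0.predict_proba(X)
prob_y_0 = [p[1] for p in prob_y_0]
 
# Reported as 1 - AUROC (the ranking is treated as inverted)
print( 1 - roc_auc_score(y, prob_y_0) )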

0.526447313619


Remember, our original model trained on the imbalanced dataset had an accuracy of 92%, which is much higher than the 58% accuracy of the model trained on the down-sampled dataset.

However, the latter model has an AUROC of 57%, which is higher than the 53% of the original model (but not by much).

## 4 Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.

A popular algorithm for this technique is Penalized-SVM:

In [27]:
from sklearn.svm import SVC

During training, we can use the argument `class_weight='balanced'` to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also need to pass `probability=True` if we want the SVM to produce probability estimates.
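
To see what "proportional to how under-represented it is" means in numbers, the sketch below applies scikit-learn's documented `'balanced'` heuristic, `n_samples / (n_classes * count)`, to the class counts from this dataset. The variable names are illustrative.

```python
# Class counts from the value_counts() output earlier in this note.
class_counts = {0: 576, 1: 49}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# scikit-learn's documented 'balanced' heuristic:
#   weight[c] = n_samples / (n_classes * count[c])
balanced_weights = {c: n_samples / (n_classes * k)
                    for c, k in class_counts.items()}

for label, weight in balanced_weights.items():
    print(label, round(weight, 3))
# 0 0.543
# 1 6.378
```

A mistake on a minority example therefore costs roughly 11.8 times as much as a mistake on a majority example, which is the ratio 576 / 49.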

In [29]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_3 = SVC(kernel='linear', 
            class_weight='balanced', # penalize
            probability=True)
 
clf_3.fit(X, y)
 
# Predict on training set
pred_y_3 = clf_3.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_3 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_3) )
# 0.688
 
# What about AUROC? (reported as 1 - AUROC, as for the baseline model above)
prob_y_3 = clf_3.predict_proba(X)
prob_y_3 = [p[1] for p in prob_y_3]
print( 1 - roc_auc_score(y, prob_y_3) )

[0 1]
0.688
0.530576814059


Again, **our purpose here is only to illustrate this technique. To really determine which of these tactics works best for this problem, you'd want to evaluate the models on a hold-out test set.**

## 5 Use Tree-Based Algorithms

The final tactic we'll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we'll jump right into those:

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [31]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)
 
# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)
 
# Predict on training set
pred_y_4 = clf_4.predict(X)
 
# Is our model still predicting just one class?
print( np.unique( pred_y_4 ) )
# [0 1]
 
# How's our accuracy?
print( accuracy_score(y, pred_y_4) )
 
# What about AUROC? (no random_state is set, so values vary slightly between runs)
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y, prob_y_4) )

[0 1]
0.9808
0.998742205215


Wow! 98% accuracy and nearly 100% AUROC? Is this magic? A sleight of hand? Cheating? Too good to be true?

Well, tree ensembles have become very popular because they perform extremely well on many real-world problems. We certainly recommend them wholeheartedly.

**However, while these results are encouraging, the model could be overfit, so you should still evaluate your model on an unseen test set before making the final decision.**
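
One way to check for that overfitting is to compare the training-set AUROC with a stratified cross-validated AUROC. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the dataset parameters, the number of folds, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 90% / 10%), standing in for X and y.
X_demo, y_demo = make_classification(n_samples=600, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     weights=[0.9, 0.1], random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Training-set AUROC: the forest is scored on rows it has already seen.
forest.fit(X_demo, y_demo)
train_auc = roc_auc_score(y_demo, forest.predict_proba(X_demo)[:, 1])

# Cross-validated AUROC: each fold is scored by a model that never saw it.
# Stratification keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("train AUROC:", round(train_auc, 3))
print("CV AUROC:   ", round(cv_auc.mean(), 3))
```

A large gap between the two numbers suggests that the training-set score is optimistic and should not be used to choose between models.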

## 6 Reframe as Anomaly Detection

Anomaly detection, a.k.a. outlier detection, is for detecting outliers and rare events. Instead of building a classification model, you'd have a "profile" of a normal observation. If a new observation strays too far from that "normal profile," it would be flagged as an anomaly.
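
As a rough illustration of the "normal profile" idea, the sketch below fits scikit-learn's `IsolationForest` on synthetic "normal" points and then scores two new observations. The data, the `contamination` value, and the variable names are assumptions for demonstration only; this is one of several possible detectors, not the method prescribed by the tutorial.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal" observations clustered around the origin.
rng = np.random.RandomState(42)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit the "normal profile" on typical observations only.
detector = IsolationForest(n_estimators=100, contamination=0.05,
                           random_state=42)
detector.fit(normal_points)

new_points = np.array([[0.1, -0.2],   # close to the bulk of the data
                       [8.0, 8.0]])   # far from anything seen in training

# predict() returns +1 for inliers and -1 for anomalies.
labels = detector.predict(new_points)
print(labels)
```

In an imbalanced classification setting, the majority class would play the role of the normal data and the rare class would be what the detector is expected to flag.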