## Balance Scale Dataset


For this guide, we’ll use a synthetic dataset called Balance Scale Data, which you can download from the UCI Machine Learning Repository.

This dataset was originally generated to model psychological experiment results, but it’s useful for us because it’s a manageable size and has imbalanced classes.

In [81]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error



 
# Read dataset
df = pd.read_csv('/Users/nathanamar/Desktop/balance-scale.data', 
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])
 
# Display example observations
df.head()


  balance  var1  var2  var3  var4
0       B     1     1     1     1
1       R     1     1     1     2
2       R     1     1     1     3
3       R     1     1     1     4
4       R     1     1     1     5


In [2]:
df.shape

(625, 5)

In [3]:
df['balance'].value_counts()


R    288
L    288
B     49
Name: balance, dtype: int64

In [4]:
# Transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]

df['balance'].value_counts()


0    576
1     49
Name: balance, dtype: int64

Only about 8% of the observations (49 of 625) are balanced.
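
To quantify the imbalance directly, `value_counts(normalize=True)` returns class proportions instead of raw counts. The following is a minimal sketch; it assumes a label series rebuilt from the counts shown above (576 zeros and 49 ones) rather than the downloaded file, and the name `imbalance_ratio` is purely illustrative.

```python
import pandas as pd

# Stand-in for df['balance'], rebuilt from the counts shown above
# (assumption: the real column would come from the CSV file instead)
balance = pd.Series([0] * 576 + [1] * 49, name='balance')

# Share of each class in the dataset
proportions = balance.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of majority to minority observations
counts = balance.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

With these counts the minority class makes up 49 of 625 rows, and there are roughly twelve majority observations for every minority one.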

## The Danger of Imbalanced Classes

Now that we have a dataset, we can demonstrate the danger of imbalanced classes.

In [6]:


# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:

 
# Train model
clf_0 = LogisticRegression().fit(X_train, y_train)
 
# Predict on test set
pred_y_0 = clf_0.predict(X_test)



As mentioned above, many machine learning algorithms are designed to maximize overall accuracy by default.

In [9]:

print( accuracy_score(y_test, pred_y_0) )

0.9042553191489362


In [10]:
print( np.unique( pred_y_0 ) )

[0]


As you can see, this model is only predicting 0, which means it's completely ignoring the minority class in favor of the majority class.
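
To see why a high accuracy is not reassuring here, it helps to compare against a model that always predicts the majority class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with the same class counts as the full dataset; the arrays `X_demo` and `y_demo` are assumptions made for illustration, not the train/test split used above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same class counts as the full dataset (assumption:
# the features are irrelevant here, so a column of zeros stands in for them)
y_demo = np.array([0] * 576 + [1] * 49)
X_demo = np.zeros((len(y_demo), 1))

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
pred_demo = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, pred_demo)  # 576 / 625
minority_recall = recall_score(y_demo, pred_demo)      # recall for class 1
print(baseline_accuracy)
print(minority_recall)
```

The baseline scores above 92% accuracy while never identifying a single minority observation, which is the same behavior the logistic regression shows.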

Next, we'll look at the first technique for handling imbalanced classes: up-sampling the minority class.

## 1. Up-sample Minority Class

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.

There are several heuristics for doing so, but the most common way is to simply resample with replacement.

First, we'll import the resampling module from Scikit-Learn:

In [11]:
from sklearn.utils import resample

Next, we'll create a new DataFrame with an up-sampled minority class. Here are the steps:
- First, we'll separate observations from each class into different DataFrames.
- Next, we'll resample the minority class with replacement, setting the number of samples to match that of the majority class.
- Finally, we'll combine the up-sampled minority class DataFrame with the original majority class DataFrame.

In [12]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]


In [13]:
df_majority.shape, df_minority.shape

((576, 5), (49, 5))

In [14]:
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results

In [15]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [16]:
# Display new class counts
df_upsampled.balance.value_counts()

1    576
0    576
Name: balance, dtype: int64

As you can see, the new DataFrame has more observations than the original, and the ratio of the two classes is now 1:1.

Let's train another model using Logistic Regression, this time on the balanced dataset:

In [17]:
# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [19]:
 
# Train model
clf_1 = LogisticRegression().fit(X_train, y_train)
 
# Predict on test set
pred_y_1 = clf_1.predict(X_test)



In [20]:
# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )

[0 1]


In [21]:
# How's our accuracy?
print( accuracy_score(y_test, pred_y_1) )

0.5202312138728323


In [22]:
mat_conf = confusion_matrix(y_test,pred_y_1)
mat_conf

array([[ 65, 118],
       [ 48, 115]])

In [23]:
accuracy = (mat_conf[0,0] + mat_conf[1,1]) /( mat_conf[0,0] + mat_conf[1,1] +  mat_conf[0,1] +  mat_conf[1,0])
accuracy

0.5202312138728323

Great, now the model is no longer predicting just one class. While the accuracy also took a nosedive, it's now more meaningful as a performance metric.
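
One caveat: when duplicates are created before the train/test split, as above, copies of the same minority row can land in both sets. A hedged sketch of the alternative follows, splitting first and up-sampling only the training portion. The synthetic DataFrame `demo` and its column names `x1`, `x2`, and `target` are illustrative assumptions, not the Balance Scale data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: 180 majority rows, 20 minority rows)
rng = np.random.RandomState(0)
demo = pd.DataFrame({
    'x1': rng.normal(size=200),
    'x2': rng.normal(size=200),
    'target': [0] * 180 + [1] * 20,
})

# Split first, so the test set keeps the original class ratio
train, test = train_test_split(demo, test_size=0.3, random_state=42,
                               stratify=demo['target'])

# Up-sample the minority class inside the training split only
train_majority = train[train['target'] == 0]
train_minority = train[train['target'] == 1]
train_minority_up = resample(train_minority, replace=True,
                             n_samples=len(train_majority), random_state=123)
train_balanced = pd.concat([train_majority, train_minority_up])

print(train_balanced['target'].value_counts())
print(test['target'].value_counts())
```

The training data ends up balanced, while the test set still reflects the imbalance the model would face in practice and shares no rows with the training set.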



## 2. Down-sample Majority Class

Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm.

The most common heuristic for doing so is resampling without replacement.

The process is similar to that of up-sampling. Here are the steps:
- First, we'll separate observations from each class into different DataFrames
- Next, we'll resample the majority class without replacement, setting the number of samples to match that of the minority class.
- Finally, we'll combine the down-sampled majority class DataFrame with the original minority class DataFrame.

In [24]:
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

In [25]:
df_majority.shape , df_minority.shape

((576, 5), (49, 5))

In [26]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results

In [27]:
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.balance.value_counts()

1    49
0    49
Name: balance, dtype: int64

This time, the new DataFrame has fewer observations than the original, and the ratio of the two classes is now 1:1.
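
The same steps can be wrapped in a small helper that works for any number of classes. This is a sketch only: `downsample_to_smallest` is a hypothetical name, the DataFrame `demo` is synthetic, and the code relies on `DataFrameGroupBy.sample`, which assumes pandas 1.1 or later.

```python
import pandas as pd

def downsample_to_smallest(frame, label_col, random_state=None):
    """Sample every class, without replacement, down to the smallest class size."""
    smallest = frame[label_col].value_counts().min()
    return (frame.groupby(label_col, group_keys=False)
                 .sample(n=smallest, replace=False, random_state=random_state))

# Synthetic frame with the same class counts as the binary target above
demo = pd.DataFrame({'balance': [0] * 576 + [1] * 49,
                     'var1': range(625)})

demo_down = downsample_to_smallest(demo, 'balance', random_state=123)
print(demo_down['balance'].value_counts())
```

Because sampling is done without replacement, every row in the result is a distinct row of the input.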



Again, let's train a model using Logistic Regression:



In [28]:
# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [30]:
# Train model
clf_2 = LogisticRegression().fit(X_train, y_train)
 
# Predict on test set
pred_y_2 = clf_2.predict(X_test)
 



In [31]:
# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )

[0 1]


In [32]:
# How's our accuracy?
print(accuracy_score(y_test, pred_y_2))

0.3


In [33]:
mat_conf = confusion_matrix(y_test,pred_y_2)
mat_conf

array([[ 5, 12],
       [ 9,  4]])

In [34]:
accuracy = (mat_conf[0,0] + mat_conf[1,1]) /( mat_conf[0,0] + mat_conf[1,1] +  mat_conf[0,1] +  mat_conf[1,0])
accuracy

0.3

## 3. Change Your Performance Metric

So far, we've looked at two ways of addressing imbalanced classes by resampling the dataset. Next, we'll look at using other performance metrics for evaluating the models.

For a general-purpose metric for classification, we recommend Area Under ROC Curve (AUROC).


To calculate AUROC, you'll need predicted class probabilities instead of just the predicted classes. You can get them using the .predict_proba() method like so:

In [35]:
# Predict class probabilities
prob_y_2 = clf_2.predict_proba(X)
 
# Keep only the positive class
prob_y_2 = [p[1] for p in prob_y_2]

prob_y_2[:5]

[0.5181585727137468,
 0.549077509236976,
 0.3865150971472426,
 0.4557978266827516,
 0.6667465627257082]

In [36]:

print( roc_auc_score(y, prob_y_2) )

0.5385256143273636
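
AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking. A small illustration with made-up labels and scores (not model output from this guide):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

# AUROC as computed by scikit-learn
auc = roc_auc_score(y_true, scores)

# Same quantity by hand: fraction of (positive, negative) pairs in which the
# positive example gets the higher score (ties count as half)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc, pairwise)
```

Three of the four positive/negative pairs are ranked correctly here, so both calculations give 0.75.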


In [39]:
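# Baseline model (clf_0) for comparison
prob_y_0 = clf_0.predict_proba(X)
# NOTE: p[0] is the probability of class 0 (the majority class), not p[1] as
# used for clf_2, so the printed score equals 1 minus the positive-class AUROC
prob_y_0 = [p[0] for p in prob_y_0]
 
print( roc_auc_score(y, prob_y_0) )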

0.5481049562682214


## 4. Penalize Algorithms (Cost-Sensitive Training)

The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on the minority class.

A popular algorithm for this technique is Penalized-SVM.

During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.

We also need to include the argument probability=True if we want the SVM to produce probability estimates.
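
To make "proportional to how under-represented it is" concrete: with `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below checks this with `compute_class_weight`, assuming synthetic labels `y_demo` that have the same counts as the binary target defined earlier.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the binary class counts from earlier (assumption)
y_demo = np.array([0] * 576 + [1] * 49)
classes = np.unique(y_demo)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

# The same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

A mistake on the minority class is penalized roughly twelve times as heavily as one on the majority class, which mirrors the ratio of the class counts.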

In [42]:
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)


In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [60]:
# Train model
clf_3 = SVC(kernel='linear', 
            class_weight='balanced', # penalize
            probability=True)
 
clf_3.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [61]:
pred_y_3 = clf_3.predict(X_test)

In [62]:
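# Is our model still predicting just one class?
print( np.unique( pred_y_3 ) )

['B' 'L' 'R']

The output lists the three original labels (B, L, R) rather than 0 and 1, so the target column held the untransformed labels when this cell ran. The results below are therefore for the three-class problem.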


In [63]:
# How's our accuracy?
print( accuracy_score(y_test, pred_y_3) )

0.8882978723404256


In [67]:
print('Accuracy of SVM classifier on test set: {:.2f}'.format(clf_3.score(X_test, y_test)))

Accuracy of SVM classifier on test set: 0.89


In [64]:
mat_conf = confusion_matrix(y_test,pred_y_3)
mat_conf

array([[14,  2,  2],
       [ 6, 74,  0],
       [11,  0, 79]])

In [65]:
# With three classes, accuracy is the diagonal (correct predictions) over all cells
accuracy = np.trace(mat_conf) / mat_conf.sum()
accuracy

0.8882978723404256
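
The confusion matrix also yields per-class recall and precision, which say more about the minority class than overall accuracy does. The sketch below reuses the matrix printed above and relies on scikit-learn's convention of rows for true labels and columns for predictions, with labels in sorted order (B, L, R).

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class,
# in the order B, L, R)
mat = np.array([[14,  2,  2],
                [ 6, 74,  0],
                [11,  0, 79]])

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(mat) / mat.sum()

# Recall: correct predictions over the true count of each class (row sums)
recall = np.diag(mat) / mat.sum(axis=1)

# Precision: correct predictions over the predicted count of each class (column sums)
precision = np.diag(mat) / mat.sum(axis=0)

for label, r, p in zip(['B', 'L', 'R'], recall, precision):
    print(f'{label}: recall={r:.3f} precision={p:.3f}')
```

For the minority class B, recall is 14 of 18, while precision is only 14 of 31, because 17 L and R observations were predicted as B.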

## 5. Use Tree-Based Algorithms

The final tactic we'll consider is using tree-based algorithms. Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.

In modern applied machine learning, tree ensembles (Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we'll jump right into those.

Now, let's train a model using a Random Forest. Note that the code below uses the up-sampled dataset (df_upsampled), not the original imbalanced one.

In [83]:
df_upsampled.balance.value_counts()

1    576
0    576
Name: balance, dtype: int64

In [84]:
df_upsampled.shape

(1152, 5)

In [86]:
# Separate input features (X) and target variable (y)
y =df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)

In [87]:
y.value_counts()

1    576
0    576
Name: balance, dtype: int64

In [88]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [89]:
clf_4 = RandomForestClassifier()
clf_4.fit(X_train, y_train)




RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [90]:
pred_y_4 = clf_4.predict(X_test)

In [91]:
# Is our model still predicting just one class?
print( np.unique( pred_y_4 ) )

[0 1]


In [92]:
# How's our accuracy?
print( accuracy_score(y_test, pred_y_4) )

0.9508670520231214


In [93]:
print("Score:",round(accuracy_score(y_test,pred_y_4)*100,2))
print("MSE: ",mean_squared_error(y_test,pred_y_4))

Score: 95.09
MSE:  0.049132947976878616


In [94]:
# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y, prob_y_4) )

0.9991319444444444


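Tree ensembles have become very popular because they perform extremely well on many real-world problems, and we recommend them. The scores above are optimistic, though: the model was trained on the up-sampled data, so duplicated minority rows can appear in both the training and test sets, and the AUROC was computed on X, which includes the training rows.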
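
Because the AUROC above is computed on `X`, which includes the rows the forest was trained on, a held-out estimate is a useful cross-check. The following is a sketch on synthetic data from `make_classification`, not on the Balance Scale file; the roughly 92% / 8% class weights, `n_estimators=100`, and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 92% / 8% class split)
X_demo, y_demo = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                     n_redundant=0, weights=[0.92, 0.08],
                                     random_state=0)

# Stratified split so both sets keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Probability of the positive class, for held-out rows only
prob_te = forest.predict_proba(X_te)[:, 1]
auc_te = roc_auc_score(y_te, prob_te)
print(round(auc_te, 3))
```

Scoring only `X_te` keeps training rows out of the metric, so the number reflects how the model ranks observations it has not seen.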

