# Dealing with Imbalanced Data

In machine learning, *imbalanced data* refers to training data where one class is vastly more common than the other. This can cause a problem: a learner can be very accurate (i.e., most of the predictions it makes are correct) simply by always predicting the most common class, without really learning anything about the less common one.

This is the case with our Kaggle data, where we have several thousand examples of the 0 case (hospital not in distress) versus a couple hundred examples of the 1 case (hospital in distress). A learner can perform very well (about 95% correct predictions) just by predicting 0 every time and ignoring the 1 cases. It will get every one of those wrong, but they make up such a small proportion of the data that the accuracy barely suffers. Our AUC measure combats this slightly, since it takes into account the true positive rate as well as the false positive rate, but we can still learn a better model if we have more balanced data.
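
To make the accuracy trap concrete, here is a minimal sketch that scores an "always predict 0" learner. It does not load the Kaggle files; it only assumes a label vector with the same class counts as our data (4,542 zeros and 234 ones, as printed by the first code cell in this notebook). The variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Label vector with the same class counts as our data (order does not matter here)
y_true = np.concatenate([np.zeros(4542, dtype=int), np.ones(234, dtype=int)])

# A "learner" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)  # recall on class 1

print("accuracy:", round(baseline_accuracy, 3))
print("recall on class 1:", baseline_recall)
```

The accuracy comes out at roughly 95% even though the learner never finds a single hospital in distress, which is why accuracy alone is a poor yardstick here.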

One way to do this is to artificially create a balanced dataset by "oversampling" the underrepresented class or by "undersampling" the very common class. This notebook will show an example of each, using scikit-learn together with the Imbalanced Learn library.

**Note: You should always try to build a model with the original data distribution first, since this is the most accurate representation of how your testing data will look out in the wild.**

To do this we will be using a new library called Imbalanced Learn (imported in Python as `imblearn`) that has a bunch of different data preprocessing tools that may be useful. You can get Imbalanced Learn by activating your Anaconda environment and running:

`conda install -c conda-forge imbalanced-learn`

In [1]:
# Import all required libraries
from __future__ import division # For python 2.*

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import confusion_matrix, classification_report 
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE 
from imblearn.under_sampling import NearMiss 


# load in the data--I will do this with the Kaggle data
X = np.genfromtxt('data/X_train.txt', delimiter=None) 
Y = np.genfromtxt('data/Y_train.txt', delimiter=None) 

# remove the header rows; focus only on the classification value for Y (column 1)
X,Y =(X[1:],Y[1:,1:])

# Tells numpy not to print everything in scientific notation 
np.set_printoptions(suppress=True)

# print out the number of each class
(y_class, counts) = np.unique(Y, return_counts=True)
class_frequencies = np.asarray((y_class, counts)).T

print(class_frequencies)

[[   0. 4542.]
 [   1.  234.]]


So, out of our dataset we have 4,542 0s and only 234 1s--I'd say this is pretty imbalanced! 

Let's learn a quick logistic regression model on this just to see how bad it is...

In [2]:
# split into training and validation data
Xtr, Xval, Ytr, Yval = train_test_split(X, Y, test_size = 0.3, random_state = 0)

# scale the data so our different X features and their ranges don't confuse the learner
scale = StandardScaler().fit(Xtr)
Xtr_scaled = scale.transform(Xtr)
Xval_scaled = scale.transform(Xval)

# logistic regression object 
lr = LogisticRegression() 
  
# train the model on training data 
lr.fit(Xtr_scaled, Ytr.ravel()) 

# see how well we do on the training data
pred_tr = lr.predict(Xtr_scaled)
  
# see how well we do on the validation data
pred_val = lr.predict(Xval_scaled) 
  
# print classification reports
print("Performance on Training Data:")
print(classification_report(Ytr,pred_tr))

print("\n Performance on Validation Data:")
print(classification_report(Yval, pred_val))

Performance on Training Data:
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.98      3196
         1.0       0.77      0.48      0.59       147

    accuracy                           0.97      3343
   macro avg       0.87      0.73      0.79      3343
weighted avg       0.97      0.97      0.97      3343


 Performance on Validation Data:
              precision    recall  f1-score   support

         0.0       0.97      0.99      0.98      1346
         1.0       0.75      0.51      0.60        87

    accuracy                           0.96      1433
   macro avg       0.86      0.75      0.79      1433
weighted avg       0.96      0.96      0.96      1433





Looking at these classification reports, the two things we want to look at are the precision (of the times I predicted a class, how often was I right?) and the recall (of the times the class should have been 1, how often did I actually predict 1?). The learner currently has very high overall accuracy (0.96 on the validation data) and near-perfect scores on class 0, but very low recall for class 1 (0.51 on the validation data). It misses about half of the hospitals in distress because it is skewed toward predicting 0.
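
As a small worked example of how precision and recall are computed, the sketch below derives both for class 1 from a confusion matrix. The labels are toy values made up for illustration, not the Kaggle data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, made up for illustration (not the Kaggle data)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision: of the times we predicted 1, how often were we right?
precision_1 = tp / (tp + fp)

# recall: of the true 1s, how many did we find?
recall_1 = tp / (tp + fn)

print("precision on class 1:", precision_1)
print("recall on class 1:", recall_1)

# scikit-learn's helpers should agree with the hand computation
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Here the learner predicts 1 three times and is right twice (precision 2/3), and it finds two of the four true 1s (recall 1/2).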

<hr>

## Oversampling

The first method we will use for trying to balance out our data and get better at predicting the 1s is called oversampling. This is when we simply try and bump up the number of 1s relative to the number of 0s by either weighting the 1s higher or by physically resampling them to increase their numbers in the dataset. 
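
The weighting option is not used in the rest of this notebook, but as a rough sketch of what it could look like, scikit-learn can compute "balanced" class weights and `LogisticRegression` accepts a `class_weight` argument. The label vector below is rebuilt from the training-split counts reported above (3,196 zeros and 147 ones) rather than loaded from the data files, and `lr_weighted` is a name made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Label vector with the same class counts as our training split
y_train = np.concatenate([np.zeros(3196, dtype=int), np.ones(147, dtype=int)])

# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
print(dict(zip([0, 1], weights.round(3))))

# The same weighting can be requested directly on the learner;
# it would then be fit on the original (not resampled) training data
lr_weighted = LogisticRegression(class_weight="balanced")
```

With these counts, each class 1 example ends up weighted roughly twenty times more heavily than a class 0 example, so mistakes on the rare class cost the learner far more.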

We are going to use a very popular resampling method called SMOTE (Synthetic Minority Oversampling Technique). SMOTE aims to balance the class distribution by adding minority class examples. Rather than simply replicating existing examples, it synthesizes new minority instances *between* existing ones. For an example in the minority class, SMOTE randomly selects one of its k nearest minority-class neighbors, draws the line segment connecting the two points, and generates a new point at a random position along that segment. These new points are added to the training dataset, increasing the proportion of 1s represented.
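
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not Imbalanced Learn's actual implementation; the function name `smote_sketch`, its parameters, and the toy data are all made up for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority examples X_min."""
    rng = np.random.default_rng(seed)

    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    # Pick a random minority example and one of its k neighbors (skip column 0)
    base = rng.integers(0, len(X_min), size=n_new)
    neighbor = idx[base, rng.integers(1, k + 1, size=n_new)]

    # New point = base + lam * (neighbor - base), with lam drawn from [0, 1)
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neighbor] - X_min[base])

# Toy minority class: 20 random points in 2 dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority examples, the new points stay inside the region the minority class already occupies.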

SMOTE is included in the Imbalanced Learn module that we installed at the top of this notebook.

In [4]:
sm = SMOTE(random_state = 2) 

# resample the training data only (fit_resample replaces the older fit_sample)
Xtr_res, Ytr_res = sm.fit_resample(Xtr, Ytr.ravel())

# print out the number of each class
(y_class, counts) = np.unique(Ytr_res, return_counts=True)
class_frequencies = np.asarray((y_class, counts)).T

print(class_frequencies)

[[   0. 3196.]
 [   1. 3196.]]


Now we have an equal number of training examples for each possible class. This gives our learner more to work with in terms of understanding what is important to predicting distress. Let's retrain the learner and see how it does.

In [5]:
# scale the data so our different X features and their ranges don't confuse the learner
scale = StandardScaler().fit(Xtr_res)
Xtr_res_scaled = scale.transform(Xtr_res)
Xval_scaled = scale.transform(Xval)

# logistic regression object 
lr_smote = LogisticRegression() 
  
# train the model on training data 
lr_smote.fit(Xtr_res_scaled, Ytr_res.ravel()) 

# see how well we do on the training data
pred_tr = lr_smote.predict(Xtr_res_scaled)
  
# see how well we do on the validation data
pred_val = lr_smote.predict(Xval_scaled) 
  
# print classification reports
print("Performance on Training Data:")
print(classification_report(Ytr_res,pred_tr))

print("\n Performance on Validation Data:")
print(classification_report(Yval, pred_val))



Performance on Training Data:
              precision    recall  f1-score   support

         0.0       0.97      0.93      0.95      3196
         1.0       0.93      0.97      0.95      3196

    accuracy                           0.95      6392
   macro avg       0.95      0.95      0.95      6392
weighted avg       0.95      0.95      0.95      6392


 Performance on Validation Data:
              precision    recall  f1-score   support

         0.0       0.99      0.94      0.96      1346
         1.0       0.47      0.89      0.61        87

    accuracy                           0.93      1433
   macro avg       0.73      0.91      0.79      1433
weighted avg       0.96      0.93      0.94      1433



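Now our accuracy is still pretty good (93% on the validation data, down slightly from 96%), but our recall on class 1 has gone way, way up, from 0.51 to 0.89. The price is lower precision on class 1 (0.47, down from 0.75), which means more false alarms. If catching instances of distress is the priority, this is a much better model than before.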

<hr>

## Undersampling

Another way we can rebalance the data is called undersampling, where we reduce the prevalence of the more common class. For undersampling, we can use a technique called the NearMiss algorithm. NearMiss balances the class distribution by eliminating majority class examples, but rather than dropping them at random it decides which ones to keep based on their distance to the minority class. In its first variant (NearMiss-1, which to my knowledge is the default in Imbalanced Learn), it keeps the majority examples whose average distance to their k nearest minority examples is smallest, that is, the majority examples closest to the minority class. Choosing examples with a nearest-neighbor rule like this is meant to limit the information loss that comes with most undersampling techniques.
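
As a rough sketch of the distance-based selection rule used by the NearMiss-1 variant, the code below keeps the majority examples whose average distance to their nearest minority examples is smallest. It is a simplified illustration rather than Imbalanced Learn's implementation; the function name `nearmiss1_sketch`, the choice of `k=3`, and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nearmiss1_sketch(X_maj, X_min, k=3):
    """Return indices of the majority examples to keep, plus their scores."""
    # For every majority example, find its k nearest minority examples
    nn = NearestNeighbors(n_neighbors=k).fit(X_min)
    dist, _ = nn.kneighbors(X_maj)

    # Score each majority example by its average distance to those neighbors
    mean_dist = dist.mean(axis=1)

    # Keep as many majority examples as there are minority examples,
    # preferring the ones with the smallest average distance
    keep = np.argsort(mean_dist)[: len(X_min)]
    return keep, mean_dist

# Toy data: 200 majority points and 15 minority points in 2 dimensions
rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, size=(200, 2))
X_min = rng.normal(loc=2.0, size=(15, 2))

keep, mean_dist = nearmiss1_sketch(X_maj, X_min)
X_maj_kept = X_maj[keep]
print(X_maj_kept.shape)
```

The resampled training set would then be `X_maj_kept` stacked with `X_min`, giving an equal number of examples per class.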

NearMiss is also available as part of the Imbalanced Learn package.



In [6]:
# NearMiss instance
nr = NearMiss() 

# Apply NearMiss to our data
Xtr_miss, Ytr_miss = nr.fit_resample(Xtr, Ytr.ravel())

# print out the number of each class
(y_class, counts) = np.unique(Ytr_miss, return_counts=True)
class_frequencies = np.asarray((y_class, counts)).T

print(class_frequencies)

[[  0. 147.]
 [  1. 147.]]


We now have a balanced dataset, though we have made our training data much smaller (294 examples instead of 3,343). Throwing away that much information may hurt performance on the testing data. Let's see how it does when we build a model.

In [7]:
# scale the data so our different X features and their ranges don't confuse the learner
scale = StandardScaler().fit(Xtr_miss)
Xtr_miss_scaled = scale.transform(Xtr_miss)
Xval_scaled = scale.transform(Xval)

# logistic regression object 
lr_miss = LogisticRegression() 
  
# train the model on training data 
lr_miss.fit(Xtr_miss_scaled, Ytr_miss.ravel()) 

# see how well we do on the training data
pred_tr = lr_miss.predict(Xtr_miss_scaled)
  
# see how well we do on the validation data
pred_val = lr_miss.predict(Xval_scaled) 
  
# print classification reports
print("Performance on Training Data:")
print(classification_report(Ytr_miss,pred_tr))

print("\n Performance on Validation Data:")
print(classification_report(Yval, pred_val))

Performance on Training Data:
              precision    recall  f1-score   support

         0.0       0.97      0.95      0.96       147
         1.0       0.95      0.97      0.96       147

    accuracy                           0.96       294
   macro avg       0.96      0.96      0.96       294
weighted avg       0.96      0.96      0.96       294


 Performance on Validation Data:
              precision    recall  f1-score   support

         0.0       0.99      0.91      0.95      1346
         1.0       0.39      0.90      0.55        87

    accuracy                           0.91      1433
   macro avg       0.69      0.90      0.75      1433
weighted avg       0.96      0.91      0.93      1433





This also improved our recall on class 1 (0.90 on the validation data), though precision on class 1 dropped further, to 0.39. Either of these might be useful preprocessing methods when training models for your project.