# Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python


## What is imbalanced data?

- __Imbalanced Data Distribution__ generally happens when the observations in _one of the classes_ are much more or much less numerous than in the other classes.

- Most Machine Learning algorithms are trained to increase overall accuracy by reducing the error, so they do not take the class distribution into account.

- This problem is prevalent in examples such as __Fraud Detection, Anomaly Detection, Facial recognition__ etc.

## Why is it a problem?

- Standard ML techniques, such as Decision Tree and Logistic Regression, have a bias towards the __majority class__, and they tend to ignore the minority class. 

  - They tend to predict only the majority class, so the minority class is misclassified far more often than the majority class.

- In more technical terms, if the class distribution in our dataset is imbalanced, the model is prone to having negligible or very __low__ recall for the minority class.

  - you end up with a model that has high accuracy but is biased
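
To make the "high accuracy but biased" point concrete, here is a minimal sketch in plain Python. The 950/50 class split and the always-majority "model" are illustrative assumptions, not results from this notebook.

```python
# Hypothetical labels: 950 majority (0) and 50 minority (1) examples.
y_true = [0] * 950 + [1] * 50

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall of the minority class: minority cases found / all minority cases.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
minority_recall = true_pos / actual_pos

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy looks excellent even though not a single minority example is detected, which is why recall on the minority class is the number to watch.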

## How to handle imbalanced data in theory?

- In general, there are two techniques: undersampling and oversampling
  - suppose we have two classes (`0: 200, 1: 1000`)
  - Undersampling means downsampling the majority class to match the minority class (e.g. randomly sample `200` out of `1,000`)
  - Oversampling means upsampling the minority class to match the majority class (e.g. duplicate the `200` five times to match `1,000`)


- Undersampling and oversampling both have their pros and cons
  - Undersampling: you lose part of your dataset, so you should avoid it if you have a __small__ dataset;
  - Oversampling: if we do _simple_ oversampling by duplicating the data, we introduce error in the dataset
    - It emphasizes patterns that might not be truly important in the dataset
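
The two naive strategies above can be sketched with the standard library alone. The rows below are hypothetical placeholders that only reuse the `200` vs `1,000` counts from the example, and oversampling is done by random draws with replacement, which is one common way to "duplicate" minority rows.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical rows as (label, row_id): 200 of class 0 and 1,000 of class 1.
minority = [(0, i) for i in range(200)]
majority = [(1, i) for i in range(1000)]

# Random undersampling: keep a random subset of the majority class (no replacement).
under_majority = random.sample(majority, k=len(minority))
undersampled = minority + under_majority

# Random oversampling: draw minority rows with replacement until the classes match.
over_minority = random.choices(minority, k=len(majority))
oversampled = over_minority + majority

print(len(undersampled), "rows after undersampling")
print(len(oversampled), "rows after oversampling")
print(len(set(over_minority)), "distinct minority rows after oversampling")
```

Undersampling throws away 800 majority rows, while oversampling adds no new information: the minority side still contains at most 200 distinct rows.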

## How to handle imbalanced data in practice?

- Synthetic Minority Oversampling Technique (SMOTE)
  - it is an oversampling technique
  - proposed by Nitesh Chawla, et al. in [“SMOTE: Synthetic Minority Over-sampling Technique.”](https://arxiv.org/abs/1106.1813)
- Near Miss Algorithm
  - it is an undersampling technique


## How does SMOTE work?

- SMOTE works by selecting examples that are close in the feature space,
  - drawing a line between the examples in the feature space, and drawing a new sample at a point along that line.
- It is also recommended to combine it with undersampling: "The combination of SMOTE and under-sampling performs better than plain under-sampling."


- SMOTE generates the synthetic training records by linear interpolation
  - such as the points `s1` through `s5` in the figure below

<img src = 'https://www.researchgate.net/profile/Christina_Bogner/publication/322701982/figure/fig1/AS:586787038715904@1516912340346/Illustration-of-SMOTE-Synthetic-points-crosses-denoted-s1-through-s5-generated-by.png' alt = 'SMOTE Illustration' width="300" height="300" />

## The SMOTE Algorithm

- __Step 1__: Let $A$ be the minority class set. For each $x \in A$, the __k-nearest neighbors__ of $x$ are obtained by calculating the __Euclidean distances__ between $x$ and every other sample in set $A$.
- __Step 2__: The __sampling rate__ $n$ is set according to the imbalance ratio. For each $x \in A$, $n$ examples (i.e. $x_1, x_2, \dots, x_n$) are randomly selected from its k-nearest neighbors, and they form a new set $A_1$.
- __Step 3__: For each example $x_k \in A_1$ ($k = 1, 2, \dots, n$), the following formula is used to generate a new example:
$$x' = x + rand(0, 1) \times (x_k - x)$$
in which $rand(0, 1)$ represents a random number between 0 and 1, so that $x'$ lies on the line segment between $x$ and $x_k$.
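
The three steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy points are all hypothetical, and neighbors are drawn with replacement for brevity.

```python
import math
import random


def smote_sketch(minority, k=5, n_per_sample=1, seed=0):
    """Generate synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: the k nearest minority neighbors of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            # Step 2: pick one of those neighbors at random.
            x_k = rng.choice(neighbors)
            # Step 3: move a random fraction of the way from x toward x_k.
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, x_k)))
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (2.5, 2.5), (3.0, 1.8), (1.2, 2.8)]
new_points = smote_sketch(minority, k=3, n_per_sample=2)
print(len(new_points), "synthetic points")
```

Because every synthetic point is a convex combination of two existing minority points, the new points always stay inside the region already covered by the minority class.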

## SMOTE in Action

Now that we are familiar with the technique, let’s look at a worked example for an imbalanced classification problem.

We will need the imbalanced-learn (`imblearn`) library, which is already installed for you. If you want to install it on your own, you can do so with:
```shell
  sudo pip install imbalanced-learn
```

In [0]:
# skip this 
# import warnings
# warnings.filterwarnings("ignore")

In [0]:
# skip this
# ! pip install imbalanced-learn==0.5.0

In [0]:
# check version number
import imblearn
print(imblearn.__version__)

In [0]:
# import necessary packages  
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import confusion_matrix, classification_report 
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss 

# render plots inline in the notebook
%matplotlib inline
plt.style.use('ggplot')

### Dataset Used 


In this section, we will develop an intuition for the SMOTE by applying it to an __imbalanced binary classification problem__.

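First, we can use the `make_classification()` function from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) to create a synthetic binary classification dataset with `50,000` examples, `15` features, and a nominal `5:95` class distribution (set through `weights`). Note that the `flip_y=0.3` label noise used in the call pulls the realized class balance away from `5:95`, as the value counts further down show.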

In [0]:

# define dataset
X, y = make_classification(n_samples=50000, n_features=15, n_informative = 10, n_redundant=2, n_classes=2,
	n_clusters_per_class=1, weights=[0.95], flip_y=0.3, random_state=2019) 
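
A quick back-of-the-envelope check shows why the realized class balance differs from the nominal `weights`. The sketch below assumes that each label re-drawn because of `flip_y` lands in either class with equal probability; the variable names are only for illustration.

```python
# Values taken from the make_classification call above.
majority_weight = 0.95  # weights=[0.95]: nominal share of class 0
flip_y = 0.3            # fraction of labels that are assigned at random
n_classes = 2

# Assumption: a randomly assigned label is equally likely to be either class.
expected_majority = (1 - flip_y) * majority_weight + flip_y / n_classes
expected_minority = 1 - expected_majority

print(f"expected class 0 share: {expected_majority:.3f}")
print(f"expected class 1 share: {expected_minority:.3f}")
```

Under that assumption, roughly four out of five rows end up in class `0`, so the data is still imbalanced but far less extreme than `95:5`.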

We can then turn them into a DataFrame (`feature_df`) and a Series (`target_df`) to review them.

In [0]:
feature_df = pd.DataFrame(X)
feature_df.shape

In [0]:
feature_df.head()

In [0]:
target_df = pd.Series(y)
target_df.shape

In [0]:
target_df.value_counts()

We can plot it in a bar chart to see how imbalanced the data is.

In [0]:
target_df.value_counts().plot(kind='bar');

We can also look at the `info` of the data

In [0]:
feature_df.info()

Let's look at the descriptive statistics of the data.

In [0]:
feature_df.describe()

Let's scale the data so that all the features are on the same scale (zero mean and unit variance).

In [0]:
scaled_df = pd.DataFrame(StandardScaler().fit_transform(feature_df))
scaled_df.describe()

Now we split the data into training and testing sets.

In [0]:
# split into an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(scaled_df.values, target_df.values, test_size = 0.2, random_state = 2019) 
  
# describe the shapes of the train and test sets
print("Shape of X_train: ", X_train.shape) 
print("Shape of y_train: ", y_train.shape) 
print("Shape of X_test: ", X_test.shape) 
print("Shape of y_test: ", y_test.shape) 

Now train the model without handling the imbalanced class distribution.

In [0]:
# logistic regression object 
lr = LogisticRegression(max_iter=1000) 
  
# train the model on train set 
lr.fit(X_train, y_train.ravel()) 
  
predictions = lr.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, predictions)) 


The accuracy comes out to be `84%` but did you notice something strange?

The _recall_ of the minority class (`1`) is very low. This shows that the model is biased towards the majority class (`0`), so it is not the best model.

Now, we will apply the SMOTE techniques and see the changes in the model performance.

__NOTE__: we only perform SMOTE, or any other balancing technique, on the __training__ dataset. You should __never__ touch the test data.
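
The following sketch illustrates why the order matters. It uses naive duplication as a stand-in for any resampler, and the toy rows and the helper functions `oversample` and `split` are hypothetical. With SMOTE the leak is less literal (synthetic points are interpolated from rows that may end up in the test set), but the problem is the same.

```python
import random

rng = random.Random(0)

# Hypothetical toy data as (row_id, label): 100 majority and 20 minority rows.
rows = [(i, 0) for i in range(100)] + [(100 + i, 1) for i in range(20)]


def oversample(data, rng):
    """Naive oversampling: duplicate minority rows until the classes are balanced."""
    minority = [r for r in data if r[1] == 1]
    majority = [r for r in data if r[1] == 0]
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return data + extra


def split(data, rng, test_fraction=0.2):
    """Shuffle a copy of the data and cut off the last part as the test set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Wrong order: resample first, then split. Copies of one row can land on both sides.
train_a, test_a = split(oversample(rows, rng), rng)
leaked = set(train_a) & set(test_a)

# Right order: split first, then resample the training part only.
train_b, test_b = split(rows, rng)
train_b = oversample(train_b, rng)
clean = set(train_b) & set(test_b)

print(len(leaked), "rows shared by train and test (resample, then split)")
print(len(clean), "rows shared by train and test (split, then resample)")
```

When rows are shared between the two sets, the test score rewards memorization and overstates how well the model generalizes.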

First, let's look at the original distribution again.


In [0]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) 

In [0]:
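sm = SMOTE(random_state = 123) 
# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
X_train_res, y_train_res = sm.fit_resample(X_train, y_train) 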
  
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0))) 


Notice that the SMOTE algorithm has oversampled the minority class and made it equal in size to the majority class (`32,581`), so both classes now have the same number of records. Accordingly, the overall size of the training set increased from `40,000` to `65,162`. _Note_ that the additional `25,162` minority instances are synthesized and are not in the actual data; hence, we introduce error in the dataset and then in our model.
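
As a quick bookkeeping check on these sizes, the arithmetic can be written out. It assumes the 80:20 split of the `50,000` generated rows and the majority count of `32,581` reported above; the variable names are only for illustration.

```python
# Training rows after the 80:20 split of 50,000 examples.
n_train = 50_000 * 80 // 100
n_majority = 32_581                  # class 0 count reported for the training set
n_minority = n_train - n_majority    # the remaining training rows are class 1

# SMOTE adds synthetic minority rows until both classes have n_majority rows.
n_synthetic = n_majority - n_minority
n_resampled = n_train + n_synthetic

print(n_minority, n_synthetic, n_resampled)  # prints: 7419 25162 65162
```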

Now see the accuracy and recall results after applying SMOTE algorithm (Oversampling).

In [0]:
lr1 = LogisticRegression(max_iter=1000) 
lr1.fit(X_train_res, y_train_res) 
y_pred = lr1.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, y_pred)) 


Wait, what? We see a huge decrease in model accuracy from `84%` to `66%`. Nobody likes that kind of drop, correct?

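That reading is too simple. The `84%` accuracy of the initial model has to be judged against a baseline, not against zero. Why?

If we assign all the instances in the testing set to the majority class (`0`), without learning anything, the accuracy equals the share of the majority class. The `weights=[0.95]` setting suggests `95%`, but because of the `flip_y=0.3` label noise the realized share is closer to `81%` (`32,581` of the `40,000` training instances are in class `0`). In that sense, our model is <span style="color:red"> __only a few points better__ </span> than always predicting the majority class, and its recall on the minority class is very low. Clearly that is not a good model.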

Now let's look at the modeling results after resampling. This model was trained on a balanced set, where random guessing scores `50%`, so it can no longer reach a high accuracy by simply favoring class `0`. Measured against that `50%` level, its `66%` accuracy is a <span style="color:blue"> __16-point gain__ </span>. Keep in mind that the test set is still imbalanced, so recall is the more informative number here: the recall of the minority class increased by `0.26`, which means our new model is more capable of capturing the pattern(s) in the minority class.

SMOTE clearly works in our favor. But we still introduced error in our dataset, which you can observe in the weaker results for the majority class.

That is why we need to try the NearMiss (undersampling) method.

## How does NearMiss work?

- NearMiss is an under-sampling technique. It aims to balance the class distribution by __eliminating__ majority class examples, chosen by their distance to the minority class rather than at random.

- To limit the information loss that affects most under-sampling techniques, __near-neighbor__ methods are widely used.

- The basic intuition about the working of near-neighbor methods is as follows:

  - __Step 1__: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the __majority class__ is the one to be under-sampled.
  - __Step 2__: Then, the $n$ instances of the majority class that have the __smallest distances__ to the minority class are selected.
  - __Step 3__: If there are $k$ instances in the minority class, the near-neighbor method will result in $k \times n$ instances of the majority class.

Key difference from SMOTE: NearMiss looks at both the majority and minority classes (it removes majority instances based on their distances to minority instances), whereas SMOTE only works on the minority class.

- To find the $n$ closest instances in the majority class, there are several versions of the NearMiss algorithm:

  - Version 1: It selects the samples of the majority class for which the _average_ distance to the $k$ __closest__ instances of the minority class is the _smallest_.
  - Version 2: It selects the samples of the majority class for which the _average_ distance to the $k$ __farthest__ instances of the minority class is the _smallest_.
  - Version 3: It works in 2 steps. 
    - Firstly, for each minority class instance, its $M$ nearest neighbors are stored. 
    - Secondly, the majority class instances are selected for which the average distance to the $N$ nearest neighbors is the _largest_.
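
Version 1 can be sketched in plain Python as follows. This is a simplified illustration rather than the `imblearn` implementation: the function name `near_miss_1`, the choice to keep exactly as many majority points as there are minority points, and the toy data are all assumptions made for the example.

```python
import math


def near_miss_1(majority, minority, k=3):
    """Keep the majority points whose mean distance to their k closest minority points is smallest."""

    def mean_dist_to_closest(point):
        dists = sorted(math.dist(point, m) for m in minority)
        closest = dists[:k]
        return sum(closest) / len(closest)

    ranked = sorted(majority, key=mean_dist_to_closest)
    # Keep as many majority points as there are minority points (a balanced result).
    return ranked[: len(minority)]


# Hypothetical 2-D data: the minority points cluster near the origin.
minority = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6)]
majority = [(1.0, 1.0), (5.0, 5.0), (0.8, 0.1), (6.0, 4.0), (0.3, 1.0), (4.0, 6.0)]

kept = near_miss_1(majority, minority, k=2)
print(kept)
```

The majority points far from the minority cluster are dropped, so the retained points are the ones that sit closest to the class boundary.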

Let's look at the original class distribution again.

In [0]:
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0))) 

In [0]:
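# apply near miss 
# (NearMiss selects samples by distance; newer imbalanced-learn releases accept no random_state)
nr = NearMiss() 
  
# fit_resample replaces the older fit_sample
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train) 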
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0))) 

The NearMiss algorithm has undersampled the majority class and made it equal in size to the minority class. Here, the majority class has been reduced to the size of the minority class (`7,419`), for `14,838` training records in total, so that both classes have an equal number of records. __Note__ that by doing so, we lost $32,581 - 7,419 = 25,162$ instances in the majority class. Hence, we will observe a loss in the model performance.

Let's check out the modeling results after applying the NearMiss technique on the data.

In [0]:
# train the model on train set 
lr2 = LogisticRegression(max_iter=1000) 
lr2.fit(X_train_miss, y_train_miss) 
y_pred = lr2.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, y_pred)) 

Similarly, we observe a decline in accuracy (`84%` -> `64%`). But using the same logic as in our analysis of the SMOTE results, we now have an edge of $+14$ percentage points over the `50%` level of a balanced set.

In addition, by comparing the SMOTE and NearMiss results, we see that SMOTE leads to better results here, likely because it does not discard data the way NearMiss does. Thus, we can go with the SMOTE technique here.

## Conclusion

Now we understand how oversampling (__SMOTE__) and undersampling (__NearMiss__) work, and how to use them to balance an imbalanced dataset.

They both have their own pros and cons, so you should choose them wisely. Additionally, researchers and data scientists are working on merging the two, or in general oversampling and undersampling together, to battle with imbalanced datasets. If you are interested, please [let me know](mailto:jtao@fairfield.edu).

## References

1. [Imbalance-Learn examples](http://glemaitre.github.io/imbalanced-learn/auto_examples/index.html)
2. [ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python](https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/)
3. [SMOTE Oversampling for Imbalanced Classification with Python](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
4. [Survey of resampling techniques for improving classification performance in unbalanced datasets - research article](https://arxiv.org/pdf/1608.06048.pdf)
5. [Undersampling Algorithms for Imbalanced Classification](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/)