## How to Handle Imbalanced Data in ML Classification using Python?

When working on binary classification, almost every data scientist has run into the problem of imbalanced data. Data is imbalanced when the classes are distributed unequally, that is, when the number of data points (rows) in one class is much higher than in the other classes.

For example, suppose we have a COVID dataset whose target class indicates whether a person has COVID. If 10% of the samples are positive and 90% are negative, the data is imbalanced.

![image.png](attachment:image.png)
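As a quick illustration of the 10% / 90% example above, the class distribution can be checked by counting labels. This is a minimal sketch: the `labels` list is made-up data standing in for a real target column.

```python
from collections import Counter

# Hypothetical target column: 90 negative and 10 positive samples.
labels = ["negative"] * 90 + ["positive"] * 10

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the dataset.
ratios = {cls: n / total for cls, n in counts.items()}

# Majority-to-minority ratio; values far above 1 indicate imbalance.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(ratios)
print(imbalance_ratio)
```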

## Problem with Imbalanced Data

Most machine learning algorithms are designed to maximize accuracy and reduce error. In doing so, they do not take the distribution of classes into account.

Also, standard machine learning algorithms like Decision Trees and Logistic Regression are biased toward the majority class and tend to ignore the minority class. So even if a model reaches 95% accuracy, it cannot be called a good model: the majority class may make up 95% of the test data, and the 5% of predictions that are wrong may all belong to the minority class.

## Accuracy pitfall

Before looking at how to handle imbalanced datasets, let's understand which metrics we should use when evaluating models. Accuracy, computed in scikit-learn by accuracy_score, is the ratio of the number of correct predictions to the total number of predictions.

    Accuracy = Number of Correct Predictions / Total Number of Predictions.

So accuracy_score does not consider the distribution of classes; it only counts the number of correct predictions. Even with an accuracy above 95%, as in the example above, we cannot guarantee that the model performs well on the minority class.

So for classification on imbalanced data, instead of relying on accuracy_score, it is recommended to use the confusion matrix, precision_score, recall_score, and the area under the ROC curve (AUC) as evaluation metrics.
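The pitfall can be reproduced with a few lines of plain Python. The sketch below assumes a made-up test set with 95 negatives and 5 positives and a hypothetical "model" that always predicts the majority class; the helper `confusion_counts` is written here for illustration and is not a library function.

```python
# Hypothetical test labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when a class is never predicted or never present.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(accuracy, precision, recall)
```

Accuracy looks high, yet recall on the positive class is zero: the model never finds a single minority sample.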

## Handling Imbalanced Data

A technique that is widely used while handling imbalanced data is Sampling. 

**There are two types of Sampling:**

- Under Sampling
- Over Sampling

In Under Sampling, samples are removed from the majority class, whereas, in Over Sampling, samples are added to the minority class.
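Under-sampling is covered in more detail further on. For over-sampling, the simplest variant duplicates randomly chosen minority rows until the classes are the same size. The sketch below shows that idea in plain Python; the function name `random_over_sample`, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: row ids 0..89 belong to class 0, ids 90..99 to class 1.
rows = list(range(100))
labels = [0] * 90 + [1] * 10

X_over, y_over = random_over_sample(rows, labels)
print(Counter(y_over))
```

Because the added rows are exact copies, this balances the class counts without adding new information, so it is best applied to the training split only.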

To demonstrate these techniques, we will first consider an example without handling the imbalance. The dataset used can be found here.

## Importing Libraries

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

## Loading Data

## Handling Imbalanced Data using Under Sampling

Under Sampling involves the removal of records from the majority class to balance out with the minority class.

![image.png](attachment:image.png)

The simplest under-sampling technique is random under-sampling, which removes randomly chosen records from the majority class. However, removing rows at random can discard important information, so more selective under-sampling techniques have been developed. One such important technique is NearMiss undersampling.
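Random under-sampling, as described above, can be sketched in plain Python as follows. The function name `random_under_sample`, the `seed` parameter, and the toy data are assumptions made for illustration.

```python
import random

def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement, so no row is kept twice.
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: row ids 0..89 belong to class 0, ids 90..99 to class 1.
rows = list(range(100))
labels = [0] * 90 + [1] * 10

X_under, y_under = random_under_sample(rows, labels)
print(len(X_under), y_under.count(0), y_under.count(1))
```

Here 80 of the 90 majority rows are thrown away, which makes the information loss mentioned above concrete.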

## NearMiss Undersampling

In this technique, majority class data points are selected based on their distance to minority class data points. It has three versions, and each version selects different data points from the majority class.

- Version 1: It selects the majority class data points whose average distance to their K closest minority class instances is the smallest.
- Version 2: It selects the majority class data points whose average distance to their K farthest minority class instances is the smallest.
- Version 3: It works in two steps. First, for each minority class instance, its M nearest neighbors are stored. Then, from these, the majority class instances are selected for which the average distance to the N nearest neighbors is the largest.

In short, Version 3 is generally the least sensitive to noise, because its first step keeps only the majority class points that lie close to the minority class, which tends to give a cleaner decision boundary and makes classification easier.
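To make the selection rule concrete, here is a minimal pure-Python sketch of Version 1 only. It is not the implementation of any library: the function name `near_miss_1`, its parameters `k` and `n_keep`, and the toy 2-D points are made up for illustration. In practice, the imbalanced-learn package provides a `NearMiss` class with a `version` parameter.

```python
import math

def near_miss_1(majority, minority, k=3, n_keep=None):
    """Keep the majority points with the smallest mean distance
    to their k nearest minority points (NearMiss Version 1 rule)."""
    if n_keep is None:
        n_keep = len(minority)  # assumption: balance to the minority class size
    k = min(k, len(minority))

    scored = []
    for point in majority:
        # Euclidean distances from this majority point to every minority point.
        dists = sorted(math.dist(point, m) for m in minority)
        mean_k_nearest = sum(dists[:k]) / k
        scored.append((mean_k_nearest, point))

    # Smallest mean distance first; sort on the score only.
    scored.sort(key=lambda pair: pair[0])
    return [point for _, point in scored[:n_keep]]

# Made-up toy data: three minority points near the origin,
# six majority points, three of them far away.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (1.5, 0.5), (5.0, 5.0),
            (6.0, 5.0), (5.0, 6.0), (0.2, 1.5)]

kept = near_miss_1(majority, minority, k=2, n_keep=3)
print(kept)
```

The three majority points closest to the minority cluster are kept, and the distant ones around (5, 5) are dropped.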

![image-2.png](attachment:image-2.png)