# <p style='text-align: center;'> Handle Imbalanced Data For a Classification Problem </p>

### Introduction
- Classification problems are common in machine learning. In a classification problem, we try to predict a class label from the input data (the predictors); the target, or output variable, is categorical.


- If you have worked on classification problems, you have probably met cases where one class label has far fewer observations than the others. Such a dataset is called an imbalanced class dataset, and it is very common in practical classification scenarios. The usual approaches to a machine learning problem often give poor results on this kind of data.


- In this article, I’ll discuss the imbalanced dataset, the problem regarding its prediction, and how to deal with such data more efficiently than the traditional approach.


### What is imbalanced data, and how do you handle an imbalanced dataset?
- Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number. An example makes this easier to understand.

- Let’s assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned that some fraudulent transactions are going on, and when it checks its data it finds that out of every 2000 transactions only 30 are recorded as fraud. So fraud makes up 1.5% of transactions (less than 2%), and more than 98% of transactions are “No Fraud”. Here, the class “No Fraud” is called the majority class, and the much smaller “Fraud” class is called the minority class.

![image.png](attachment:image.png)


More examples of imbalanced data are:

- Disease diagnosis

- Customer churn prediction

- Fraud detection

- Natural disaster

Some class imbalance is normal in classification problems. In some cases, however, the imbalance is acute: the majority class is far more frequent than the minority class.

### Approach to deal with the imbalanced dataset problem
- In problems where the class of interest is rare, such as fraud detection or disease prediction, it is vital to identify the minority class correctly. So the model should not be biased towards detecting only the majority class; it should give equal weight or importance to the minority class too. Here I discuss a few techniques that can deal with this problem. There is no single right or wrong method; different techniques work well for different problems.


<b>1. Choose a Proper Evaluation Metric:</b>
    
The accuracy of a classifier is the number of correct predictions divided by the total number of predictions. This may be good enough for well-balanced classes, but it is not ideal for an imbalanced class problem: in the bank example, a model that always predicts “No Fraud” is 98.5% accurate yet catches no fraud at all. Other metrics are more informative. Precision measures how many of the classifier’s predictions of a specific class are correct, and recall measures how many of the actual members of that class the classifier identifies.
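
To make the point about accuracy concrete, here is a minimal sketch using the numbers from the bank example (2000 transactions, 30 of them fraud). The labels, the always-"No Fraud" model, and the helper `confusion_counts` are illustrative assumptions, not part of any library.

```python
# Hypothetical labels mirroring the bank example: 1 = "Fraud", 0 = "No Fraud".
y_true = [1] * 30 + [0] * 1970
# A naive model that always predicts the majority class.
y_pred = [0] * 2000

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when the model never predicts the positive class.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.985 precision=0.000 recall=0.000
```

The high accuracy comes entirely from the majority class; precision and recall for "Fraud" expose that the model is useless for the class we care about.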

For an imbalanced class dataset, the F1 score is a more appropriate metric. It is the harmonic mean of precision and recall:
    
                            Precision * Recall
           F1-Score = 2 * -----------------------
                            Precision + Recall  
    
    
So, if the classifier predicts the minority class but many of those predictions are wrong, false positives increase, and precision and the F1 score will be low. Likewise, if the classifier identifies the minority class poorly, i.e. many members of this class are wrongly predicted as the majority class, false negatives increase, and recall and the F1 score will be low. The F1 score only increases if both the number and the quality of the predictions improve.
    
    
**Note:** F1 score keeps the balance between precision and recall and improves the score only if the classifier identifies more of a certain class correctly.
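
The behaviour described above can be checked with a short sketch. The function `f1_score_from_counts` and the three classifiers' confusion counts are hypothetical, chosen only to show how the harmonic mean punishes a low precision or a low recall.

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), with zero-division guards."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier A: flags many transactions (precision 0.27, recall 0.90).
f1_a = f1_score_from_counts(tp=27, fp=73, fn=3)
# Hypothetical classifier B: flags very few (precision 1.00, recall 0.20).
f1_b = f1_score_from_counts(tp=6, fp=0, fn=24)
# Hypothetical classifier C: balanced (precision 0.80, recall 0.80).
f1_c = f1_score_from_counts(tp=24, fp=6, fn=6)

print(round(f1_a, 3), round(f1_b, 3), round(f1_c, 3))
```

Classifier C scores highest even though A has the best recall and B the best precision, because F1 rewards doing well on both at once.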

<b>2. Resampling (Oversampling and Undersampling):</b>
    
This technique upsamples the minority class or downsamples the majority class. With an imbalanced dataset, we can enlarge the minority class by sampling its rows with replacement. This is called **oversampling**. Similarly, we can randomly delete rows from the majority class until it matches the minority class, which is called **undersampling**. After resampling, the majority and minority classes are balanced. When both classes have a similar number of records in the dataset, we can expect the classifier to give roughly equal importance to both.
    
![image.png](attachment:image.png)
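
As a rough illustration of both ideas, the sketch below resamples a toy dataset with Python's standard `random` module. The row layout `(transaction_id, label)`, the helper names, and the fixed seed are assumptions made for the example.

```python
import random

def random_oversample(minority, target_size, rng):
    """Grow the minority class to target_size by drawing extra rows with replacement."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, rng):
    """Shrink the majority class to target_size by sampling rows without replacement."""
    return rng.sample(majority, target_size)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical rows: 1970 "No Fraud" (label 0) and 30 "Fraud" (label 1).
majority = [(i, 0) for i in range(1970)]
minority = [(i, 1) for i in range(1970, 2000)]

oversampled = majority + random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng) + minority

print(len(oversampled), len(undersampled))
# 3940 60
```

Resampling is normally applied to the training split only; if duplicated minority rows end up in the test data as well, the evaluation will look better than it really is.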
    
<b>Advantages and Disadvantages of Under-Sampling:</b>
    
**Advantages**
    
- It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.
    
    
**Disadvantages**
    
- It can discard potentially useful information which could be important for building rule classifiers.
    
    
- The sample chosen by random undersampling may be biased and may not be an accurate representation of the population, which can lead to inaccurate results on the actual test dataset.
    
    
<b>Advantages and Disadvantages of Over-Sampling:</b>
    
**Advantages**
    
- Unlike undersampling, this method leads to no information loss.
    
    
- It often outperforms undersampling.
    
    
**Disadvantages**
    
- It increases the likelihood of overfitting since it replicates the minority class events.
    
    

<b>3. SMOTE (Synthetic Minority Oversampling Technique):</b>

Synthetic Minority Oversampling Technique, or SMOTE, is another way to oversample the minority class. Simply adding duplicate records of the minority class often doesn’t add any new information to the model. In SMOTE, new instances are synthesized from the existing data. In simple words, SMOTE takes a minority class instance, finds its k nearest minority class neighbors, selects one of them at random, and creates a synthetic instance at a random point between the two in feature space.
    
    
This technique helps to avoid the overfitting that occurs when exact replicas of minority instances are added to the main dataset. A subset of data is taken from the minority class, and new, similar synthetic instances are created from it. These synthetic instances are then added to the original dataset, and the new dataset is used to train the classification models.
    
    
<b>SMOTE Algorithm Working Procedure:</b>
    
**Stage 1:** Let A be the set of minority class examples. For each x in A, the k nearest neighbors of x are obtained by calculating the Euclidean distance between x and every other example in A.

**Stage 2:** The sampling rate N is set according to the degree of imbalance. For each x in A, N examples (x1, x2, … xN) are randomly chosen from its k nearest neighbors, and they form a new set.

**Stage 3:** For every selected example xk (k = 1, 2, 3, …, N), the following formula is used to generate a new example: x' = x + rand(0, 1) * (xk - x), where rand(0, 1) is a random number between 0 and 1.
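
The three stages can be sketched in plain Python as follows. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, the toy 2-D points, and the choice to pick neighbors with replacement are all assumptions. It uses the usual SMOTE interpolation, where a new point is `x + rand(0, 1) * (x_k - x)` for a randomly chosen neighbor `x_k`.

```python
import math
import random

def smote_sketch(minority, n_new_per_point, k, rng):
    """Create synthetic minority points by interpolating toward random nearest neighbors."""
    synthetic = []
    for i, x in enumerate(minority):
        # Stage 1: the k nearest minority neighbors of x by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Stage 2: pick N of those neighbors at random (with replacement, by assumption).
        for _ in range(n_new_per_point):
            x_k = rng.choice(neighbors)
            # Stage 3: place the new point on the segment between x and x_k.
            gap = rng.random()
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, x_k)))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # hypothetical minority points
new_points = smote_sketch(minority, n_new_per_point=2, k=3, rng=rng)

print(len(new_points))
# 8
```

Because every synthetic point lies between two existing minority points, the new data stays inside the region the minority class already occupies instead of repeating exact copies.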

### 4. Ensemble Techniques:
The main objective of ensemble methods is to improve on the performance of a single classifier. The approach involves constructing several two-stage classifiers from the original data and then aggregating their predictions.


In this technique, multiple trees are created, each on a subset of the data whose size matches the minority class, and the results of all the trees are then combined (ensembled); the final result is decided by majority vote.

![image.png](attachment:image.png)
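
One way to read this idea is sketched below: the majority class is split into chunks the size of the minority class, one model is trained per chunk together with all minority rows, and the models vote. Everything here is an assumption for illustration. In particular, a tiny nearest-centroid classifier on a single feature stands in for the trees, and the data is made up.

```python
import random
from statistics import mean

def train_centroid_model(rows):
    """Fit a tiny nearest-centroid classifier on (feature, label) rows."""
    centroids = {
        label: mean(x for x, y in rows if y == label)
        for label in {y for _, y in rows}
    }
    # The model predicts the label whose centroid is closest to x.
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def balanced_ensemble(majority, minority, rng):
    """Train one model per majority chunk, each chunk paired with all minority rows."""
    shuffled = majority[:]
    rng.shuffle(shuffled)
    size = len(minority)
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    return [train_centroid_model(chunk + minority) for chunk in chunks]

def predict_majority_vote(models, x):
    """Return the label predicted by most models (ties are broken arbitrarily)."""
    votes = [model(x) for model in models]
    return max(set(votes), key=votes.count)

rng = random.Random(1)
# Hypothetical 1-D data: 200 majority rows with small values, 20 minority rows with large values.
majority = [(i * 0.01, 0) for i in range(200)]
minority = [(5.0 + i * 0.05, 1) for i in range(20)]

models = balanced_ensemble(majority, minority, rng)
print(len(models), predict_majority_vote(models, 5.0), predict_majority_vote(models, 0.5))
# 10 1 0
```

Each model sees a balanced sample, and across the ensemble every majority row is still used once, which avoids the information loss of plain undersampling.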

### 5. Near Miss:
Near Miss is an undersampling technique. It aims to balance the class distribution by eliminating majority class examples. When instances of two different classes are very close to each other, we remove the instances of the majority class to increase the space between the two classes. This helps the classification process.

Near-neighbor methods are widely used to prevent the problem of information loss found in most undersampling techniques.

<b>The basic intuition behind near-neighbor methods is as follows:</b>
    
**Stage 1:** The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, the majority class is the one to be undersampled.

**Stage 2:** Then, the n instances of the majority class with the smallest distances to those in the minority class are selected.

**Stage 3:** If there are k instances in the minority class, the nearest method will result in k*n instances of the majority class.
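
A minimal sketch of these three stages, under the assumption that the n closest majority instances are chosen separately for each minority instance. The function name `near_miss_sketch` and the toy points are hypothetical. Duplicates are removed, so the result holds at most k*n majority instances.

```python
import math

def near_miss_sketch(majority, minority, n):
    """Keep, for every minority point, its n closest majority points."""
    kept = set()
    for m in minority:
        # Stage 1: distance from this minority point to every majority point.
        by_distance = sorted(range(len(majority)), key=lambda i: math.dist(majority[i], m))
        # Stage 2: keep the indices of the n closest majority points.
        kept.update(by_distance[:n])
    # Stage 3: at most k * n majority points survive (fewer if selections overlap).
    return [majority[i] for i in sorted(kept)]

# Hypothetical 2-D data: 6 majority points, k = 2 minority points.
majority = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0), (9.0, 9.0)]
minority = [(1.5, 1.5), (4.5, 4.5)]

kept = near_miss_sketch(majority, minority, n=2)
print(kept)
# [(0.5, 0.5), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
```

The majority points far from any minority point, here `(0.0, 0.0)` and `(9.0, 9.0)`, are the ones dropped, so the retained majority rows are those nearest the class boundary.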