## Handling Imbalanced Data for a Classification Problem
<b> This article uses the dataset "random_sample.csv", which contains two variables: a feature ("X") and a target ("y"). The classes of the target variable are imbalanced, which makes the dataset a useful example of an imbalanced classification problem.

<b> Import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

<b> The cell above imports NumPy and pandas, the libraries needed to load and inspect the data.

<b> Load the dataset with read_csv(), save it to the 'data' variable, and look at the first 5 rows with the head() method.

In [2]:
# Load the dataset.
data = pd.read_csv("random_sample.csv")

# Display the first 5 lines using the head() method.
data.head()

Unnamed: 0,X,y
0,-0.941008,1
1,1.225229,1
2,1.822382,1
3,1.94094,1
4,0.913569,1
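
The "Unnamed: 0" column in the output above usually appears when a CSV file was saved together with its row index. As a minimal sketch, assuming that is the case for "random_sample.csv" too, the `index_col=0` argument of `read_csv()` reads that column as the index instead. The CSV text below is a hypothetical stand-in built from the rows shown above, not the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text mirroring the first rows shown above; the empty
# first header field is what pandas reports as "Unnamed: 0".
csv_text = """,X,y
0,-0.941008,1
1,1.225229,1
2,1.822382,1
"""

# Default read: the saved index comes back as an extra data column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# index_col=0 tells pandas to use the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean.columns))
```

The rest of the article is unaffected either way, because only the "X" and "y" columns are used.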


<b> Create the independent variable "x" from data[["X"]] and the dependent (target) variable "y" from data["y"].

In [3]:
# Create the independent variable as "x" and the dependent variable as "y".

# Independent variable as x (double brackets keep it as a 2-D DataFrame).
x = data[["X"]]

# Target variable to predict / dependent variable as y.
y = data["y"]

<b> Let's look at the distribution of the dependent variable (y) using the value_counts() method:

In [4]:
# Print "original dataset shape" as a heading for display purposes.
print("original dataset shape")
print("----------------------")

# Apply the value_counts() method to the target variable (y).
y.value_counts()


original dataset shape
----------------------


1    900
0    100
Name: y, dtype: int64

<b> The target variable (y) has 2 unique categories: 900 records belong to class 1 and 100 records belong to class 0.
    
<b> The target classes are therefore imbalanced, with a 9:1 ratio between the majority class (1) and the minority class (0).
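
The imbalance can also be expressed as class proportions and a single ratio. The following is a minimal sketch, assuming a stand-in Series `y_demo` (a hypothetical name) built with the same counts as reported above; in the notebook the same calls could be applied to `y` directly.

```python
import pandas as pd

# Stand-in target with the same counts as reported above (900 ones, 100 zeros).
y_demo = pd.Series([1] * 900 + [0] * 100, name="y")

counts = y_demo.value_counts()
# normalize=True returns the share of each class instead of the raw count.
proportions = y_demo.value_counts(normalize=True)

# Imbalance ratio: majority count divided by minority count.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print("imbalance ratio:", imbalance_ratio)
```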
    
<b> The sections below handle this imbalance using four different resampling techniques (approaches).

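<b> Install the imbalanced-learn library (imported as imblearn) from the Jupyter notebook.

In [5]:
# Install the imbalanced-learn library from the Jupyter notebook.
# The package is imported in code under the name "imblearn".
!pip install imbalanced-learn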



### Approach - 1 : Resampling - Undersampling of majority class:
RandomUnderSampler is a class that performs random under-sampling: it under-samples the majority class(es) by randomly picking samples, with or without replacement.

Parameters:

- sampling_strategy (default='auto'): Sampling information to resample the data set.
- random_state (default=None): Control the randomization of the algorithm.
- replacement (default=False): Whether the sample is with or without replacement.

Attributes (available after fitting):

- n_features_in_ : Number of features in the input dataset.
- feature_names_in_ : Names of features seen during fit. Defined only when X has feature names that are all strings.

In [6]:
# Import RandomUnderSampler from the imblearn.under_sampling module.
from imblearn.under_sampling import RandomUnderSampler

# Create the sampler object as "rs" and pass random_state=42.
rs = RandomUnderSampler(random_state=42)

# Apply fit_resample() to resample the dataset.
x_new, y_new = rs.fit_resample(x, y)

# Print "After undersampling dataset shape" as a heading for display purposes.
print("After undersampling dataset shape")
print("----------------------------------")

# Apply the value_counts() method to the resampled target variable (y_new).
y_new.value_counts()

After undersampling dataset shape
----------------------------------


0    100
1    100
Name: y, dtype: int64

In the above, we under-sampled the majority class. First we imported the RandomUnderSampler class from imblearn.under_sampling and created a sampler object from it (a resampler, not a classifier). We then called rs.fit_resample(x, y), which resamples the dataset. Its parameters are:

- X : Matrix containing the data to be sampled.
- y : Corresponding label for each sample in X.

It returns two values:

- X_resampled : The array containing the resampled data.
- y_resampled : The corresponding labels of `X_resampled`.

The result is a balanced dataset: with the default sampling_strategy='auto', the majority class (1) is reduced from 900 to 100 records to match the 100 records of class 0.

### Approach - 2 : Resampling - Oversampling of minority class by duplication:
RandomOverSampler is a class that performs random over-sampling: it over-samples the minority class(es) by picking samples at random with replacement. Optionally, the bootstrap can be generated in a smoothed manner (see shrinkage).

Parameters:

- sampling_strategy (default='auto'): Sampling information to resample the data set.
- random_state (default=None): Control the randomization of the algorithm.
- shrinkage (default=None): Parameter controlling the shrinkage applied to the covariance matrix when a smoothed bootstrap is generated.

In [7]:
# Import RandomOverSampler from the imblearn.over_sampling module.
from imblearn.over_sampling import RandomOverSampler

# Create the sampler object as "ROS" and pass random_state=42.
ROS = RandomOverSampler(random_state=42)

# Apply fit_resample() to resample the dataset.
x_new, y_new = ROS.fit_resample(x, y)

# Print "After oversampling dataset shape" as a heading for display purposes.
print("After oversampling dataset shape")
print("----------------------------------")

# Apply the value_counts() method to the resampled target variable (y_new).
y_new.value_counts()

After oversampling dataset shape
----------------------------------


1    900
0    900
Name: y, dtype: int64

In the above, we over-sampled the minority class by duplication. First we imported the RandomOverSampler class from imblearn.over_sampling and created a sampler object from it (a resampler, not a classifier). We then called ROS.fit_resample(x, y), which resamples the dataset. Its parameters are:

- X : Matrix containing the data to be sampled.
- y : Corresponding label for each sample in X.

It returns two values:

- X_resampled : The array containing the resampled data.
- y_resampled : The corresponding labels of `X_resampled`.

The result is a balanced dataset: the minority class (0) is grown from 100 to 900 records by repeating existing records, matching the 900 records of class 1.
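
To make "duplication" concrete, the idea can be sketched by hand in NumPy. This is not imbalanced-learn's implementation, only an illustration of drawing minority rows with replacement; the arrays `X_demo` and `y_demo` are synthetic stand-ins with the same class counts as the article's data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 900 majority samples (class 1), 100 minority (class 0).
X_demo = rng.normal(size=(1000, 1))
y_demo = np.array([1] * 900 + [0] * 100)

minority_idx = np.flatnonzero(y_demo == 0)
majority_idx = np.flatnonzero(y_demo == 1)

# Draw extra minority rows with replacement until both classes have equal size.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_over = np.concatenate([X_demo, X_demo[extra_idx]])
y_over = np.concatenate([y_demo, y_demo[extra_idx]])

# Class counts after over-sampling, ordered as [class 0, class 1].
print(np.bincount(y_over))

# Every minority row in the result is an exact copy of an original minority row.
all_copies = np.isin(X_over[y_over == 0], X_demo[minority_idx]).all()
print(all_copies)
```

Because no new values are created, a model trained on such data sees the same minority points many times, which is one reason to consider SMOTE.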

### Approach - 3 : Resampling - Oversampling of minority class by SMOTE:
SMOTE is a class that performs over-sampling using SMOTE, the Synthetic Minority Over-sampling Technique. Instead of duplicating records, it generates new synthetic minority samples.

Parameters:

- sampling_strategy (default='auto'): Sampling information to resample the data set.
- random_state (default=None): Control the randomization of the algorithm.
- k_neighbors (default=5): The nearest neighbors used to define the neighborhood of samples to use to generate the synthetic samples.
- n_jobs (default=None): Number of CPU cores used during the cross-validation loop.

Attributes (available after fitting):

- n_features_in_ : Number of features in the input dataset.
- feature_names_in_ : Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.

In [8]:
# Import SMOTE from the imblearn.over_sampling module.
from imblearn.over_sampling import SMOTE

# Create the sampler object as "smote" and pass random_state=42.
# A lowercase name avoids overwriting the imported SMOTE class.
smote = SMOTE(random_state=42)

# Apply fit_resample() to resample the dataset.
x_new, y_new = smote.fit_resample(x, y)

# Print "After SMOTE dataset shape" as a heading for display purposes.
print("After SMOTE dataset shape")
print("----------------------------------")

# Apply the value_counts() method to the resampled target variable (y_new).
y_new.value_counts()

After SMOTE dataset shape
----------------------------------


1    900
0    900
Name: y, dtype: int64

In the above, we over-sampled the minority class with SMOTE. First we imported the SMOTE class from imblearn.over_sampling and created a sampler object from it (a resampler, not a classifier). We then called fit_resample(x, y) on that object, which resamples the dataset. Its parameters are:

- X : Matrix containing the data to be sampled.
- y : Corresponding label for each sample in X.

It returns two values:

- X_resampled : The array containing the resampled data.
- y_resampled : The corresponding labels of `X_resampled`.

The result is a balanced dataset: the minority class (0) is grown from 100 to 900 records with synthetic samples, matching the 900 records of class 1.
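
The core idea behind a synthetic sample is interpolation between a minority point and one of its nearest minority neighbours. The sketch below illustrates that single step in NumPy under simplifying assumptions: it is not imbalanced-learn's implementation, and `X_min` and `smote_like_sample` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points with two features.
X_min = rng.normal(size=(20, 2))
k = 5  # number of nearest neighbours, matching the k_neighbors default


def smote_like_sample(X_min, k, rng):
    """Create one synthetic point between a minority sample and a neighbour."""
    i = rng.integers(len(X_min))
    # Euclidean distances from sample i to every minority sample.
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    # Position 0 of the sorted order is sample i itself, so skip it.
    neighbours = np.argsort(dists)[1 : k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    # Move from sample i towards neighbour j by a random fraction of the way.
    return X_min[i] + gap * (X_min[j] - X_min[i]), i, j


synthetic, i, j = smote_like_sample(X_min, k, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the two original points, so it is new but stays inside the region already occupied by the minority class.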

### Approach - 4 : Undersampling of majority class by Near Miss:
NearMiss is a class that performs under-sampling based on the NearMiss methods, which select majority-class samples according to their distance to minority-class samples.

Parameters:

- sampling_strategy (default='auto'): Sampling information to resample the data set.
- version (default=1): Version of the NearMiss method to use (1, 2 or 3).
- n_neighbors (default=3): Size of the neighbourhood to consider when computing the average distance to the minority samples.
- n_jobs (default=None): Number of CPU cores used during the cross-validation loop.

Attributes (available after fitting):

- feature_names_in_ : Names of features seen during fit. Defined only when X has feature names that are all strings.

In [9]:
# Import NearMiss from the imblearn.under_sampling module.
from imblearn.under_sampling import NearMiss

# Create the sampler object as "NM" with default settings.
# NearMiss is deterministic, so it takes no random_state.
NM = NearMiss()

# Apply fit_resample() to resample the dataset.
x_new, y_new = NM.fit_resample(x, y)

# Print "After undersampling dataset shape" as a heading for display purposes.
print("After undersampling dataset shape")
print("----------------------------------")

# Apply the value_counts() method to the resampled target variable (y_new).
y_new.value_counts()

After undersampling dataset shape
----------------------------------


0    100
1    100
Name: y, dtype: int64

In the above, we under-sampled the majority class with NearMiss. First we imported the NearMiss class from imblearn.under_sampling and created a sampler object from it (a resampler, not a classifier). We then called NM.fit_resample(x, y), which resamples the dataset. Its parameters are:

- X : Matrix containing the data to be sampled.
- y : Corresponding label for each sample in X.

It returns two values:

- X_resampled : The array containing the resampled data.
- y_resampled : The corresponding labels of `X_resampled`.

The result is a balanced dataset: the majority class (1) is reduced from 900 to 100 records to match the 100 records of class 0. Unlike random under-sampling, the retained records are chosen by distance rather than at random.
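
The selection rule of the first NearMiss variant (version 1, the default) can be sketched by hand: keep the majority points whose average distance to their closest minority points is smallest. The NumPy code below is only an illustration under that reading, not imbalanced-learn's implementation, and `X_maj` and `X_min` are synthetic stand-ins for the two classes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 1-D stand-in data: 900 majority points (class 1), 100 minority (class 0).
X_maj = rng.normal(loc=0.0, size=(900, 1))
X_min = rng.normal(loc=2.0, size=(100, 1))
n_neighbors = 3  # matches the n_neighbors default listed above

# Pairwise distances, shape (900, 100): majority rows against minority columns.
dists = np.abs(X_maj - X_min.T)

# Mean distance from each majority point to its n_neighbors closest minority points.
nearest = np.sort(dists, axis=1)[:, :n_neighbors]
avg_dist = nearest.mean(axis=1)

# Keep as many majority points as there are minority points, smallest average first.
keep = np.argsort(avg_dist)[: len(X_min)]
X_maj_kept = X_maj[keep]

print(X_maj_kept.shape)
```

The kept majority points are the ones nearest the minority class, so the resampled data concentrates on the region where the two classes meet.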