# Introduction :

Hi there! In this notebook we will talk about a common problem you may face in your journey as a data scientist: ```imbalanced data```. But first, let's define what imbalanced data is.

Imbalanced data refers to a situation in which the distribution of classes in a dataset is not equal, with one class having significantly fewer samples than the other.

Imbalanced data is common in the data science world, especially in classification problems such as fraud detection and gene classification.

![image.png](attachment:770a132f-d3ea-4609-8a2d-ac1aa2c760e3.png)![image.png](attachment:a85e8544-c377-40b8-be48-6b9684cb3cad.png)

# Why Imbalanced Data is a Bad Thing : 

Imbalanced data can pose challenges in machine learning because many algorithms implicitly assume a roughly balanced class distribution and are trained to minimize the overall error, which is dominated by the majority class.

When a dataset is imbalanced, the algorithm may have difficulty learning patterns in the minority class, leading to poor performance, biased predictions, and inaccurate results.
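
To make the problem concrete, here is a minimal sketch of why plain accuracy can be misleading on imbalanced data. The labels are made up for illustration (a 90/10 split), and the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 900 negatives (class 0) and 100 positives (class 1).
y_true = [0] * 900 + [1] * 100

# A stand-in "model" that ignores its input and always predicts class 0.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual positives that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

The accuracy looks good even though the model never finds a single minority sample, which is why metrics such as recall on the minority class are worth checking alongside accuracy.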

Sooo, unless you want your classification model to predict that Josh is pregnant, you need to deal with imbalanced data before moving on to the modeling phase.


# How to Deal With Imbalanced Data :

To get around this issue, there are multiple solutions, such as :

- OverSampling / UnderSampling
- Cost-Sensitive Technique
- Feature selection
- Ensemble methods
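
As a small illustration of the cost-sensitive idea from the list above, the sketch below computes per-class weights that are inversely proportional to class frequency. The label vector is hypothetical, and the formula is the "balanced" heuristic that scikit-learn documents for its `class_weight="balanced"` option; treat it as one possible weighting scheme, not the only one.

```python
import numpy as np

# Hypothetical label vector with a 90/10 class split.
y_example = np.array([0] * 900 + [1] * 100)

classes, counts = np.unique(y_example, return_counts=True)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
class_weights = len(y_example) / (len(classes) * counts)

# Map each class label to its weight; rarer classes get larger weights.
weight_by_class = {int(c): float(w) for c, w in zip(classes, class_weights)}
print(weight_by_class)
```

Many scikit-learn classifiers accept a `class_weight` parameter, so errors on the minority class can be penalized more heavily without changing the data itself.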

In this notebook we will focus on resampling. For that, we're gonna use a Python library called ```imbalanced-learn``` (imported as ```imblearn```).

# 1-Oversampling : 

Oversampling involves increasing the number of instances in the minority class, either by duplicating existing instances or by creating synthetic instances.

This is done to balance the class distribution so that the model sees enough minority examples during training. Keep in mind that the resampled dataset no longer reflects the real-world class frequencies.
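
Before reaching for a library, it may help to see the duplication idea in plain NumPy. This is a simplified sketch, not the library's implementation: the helper name `random_oversample` and the toy arrays are made up for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class only.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data: 8 majority rows and 2 minority rows (made up for illustration).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_over, y_over = random_oversample(X_toy, y_toy)
print(np.bincount(y_over))  # class counts after oversampling
```

Every added row is an exact copy of an existing minority row, so no new information is created; the minority class simply carries more weight during training.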

**Now let's apply it using ```imbalanced-learn```**

In [None]:
!pip install imbalanced-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Let's make some imbalanced data using ```make_classification```

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=2, n_informative=1, n_redundant=0,
    n_clusters_per_class=1, class_sep=1.5, weights=[0.9, 0.1], random_state=42,
)
print(pd.Series(y).value_counts())

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.show()

#### Now time for Oversampling

In [None]:
from imblearn.over_sampling import RandomOverSampler

#### Note that ```imblearn``` offers many oversampling techniques; here we will use ```RandomOverSampler``` 
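
One of those other techniques creates synthetic points instead of duplicates (`imblearn` provides this idea as `SMOTE` in `imblearn.over_sampling`). The sketch below is a heavily simplified NumPy illustration of the interpolation step only, not the library's algorithm; the helper name `interpolate_minority`, the parameter `k`, and the sample points are assumptions made for this example.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows on segments between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)  # random starting rows
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + gap * (X_min[partner] - X_min[base])

# Made-up minority points, purely for illustration.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = interpolate_minority(X_min, n_new=6)
print(X_new.shape)
```

Each synthetic row lies somewhere between a minority point and one of its nearest minority neighbours, so the new points stay inside the region the minority class already occupies.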

In [None]:
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

In [None]:
print(pd.Series(y_resampled).value_counts())               # Number of values in class 1 increased 

In [None]:
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled, cmap='coolwarm')
plt.show()

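#### Not bad at all: both classes now have the same number of samples, which gives the model a fairer chance to learn the minority class. Since ```RandomOverSampler``` duplicates existing points, the scatter plot looks almost unchanged. Now time for ```Undersampling```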

# 2-Undersampling :

Undersampling, on the other hand, involves decreasing the number of instances in the majority class, either by randomly selecting instances to remove or by using more sophisticated methods.
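
The random-removal variant can be sketched in a few lines of NumPy. As with any hand-rolled version, this is only an illustration under simple assumptions: the helper name `random_undersample` and the toy arrays are invented for this example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is kept twice.
        keep.append(rng.choice(idx, size=target, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Toy data: 8 majority rows and 2 minority rows (made up for illustration).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_under, y_under = random_undersample(X_toy, y_toy)
print(np.bincount(y_under))  # class counts after undersampling
```

All minority rows survive, while most majority rows are thrown away, which is the main cost of this approach.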

**Time for practice** : 

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

In [None]:
print(pd.Series(y_resampled).value_counts())                     # Number of values in class 0 decreased

In [None]:
plt.scatter(X_resampled[:, 0], X_resampled[:, 1], c=y_resampled, cmap='coolwarm')
plt.show()

#### Not bad either. Undersampling is most practical when the majority class is large enough that discarding samples still leaves plenty of data. Try playing with the ```make_classification``` cell to test different shapes and distributions.

# Last Thing :

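Both oversampling and undersampling have their own advantages and disadvantages. Oversampling keeps all the original data, but duplicated samples can encourage overfitting and make training slower. Undersampling gives a smaller, faster dataset, but it throws away potentially useful majority samples. Which technique to use depends on the specific characteristics of the dataset and the problem being solved. 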


Hope you enjoyed reading this notebook. If you liked it, don't forget to upvote. 