## 🔹 What is an Imbalanced Dataset?

An imbalanced dataset is one where the distribution of classes (or target labels) is not equal.

For example, in a binary classification problem such as fraud detection:

- Legitimate transactions: 98,000
- Fraudulent transactions: 2,000

Here, one class (legitimate) is much more frequent than the other (fraud): 98% of the rows versus 2%, a ratio of 49 to 1. That makes the dataset imbalanced.

## 🔹 Why is it a problem?

Many machine learning algorithms implicitly assume roughly balanced classes because they minimize overall error, which the majority class dominates. If imbalance is not handled properly:

- The model may become biased towards the majority class.
- It may show high accuracy but poor performance on the minority class. For example, a model that always predicts "legit" on the data above is 98% accurate but totally useless at catching fraud.
- Accuracy becomes misleading, so we need other metrics (precision, recall, F1, AUC-ROC).
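The "always predict legit" example can be checked with a few lines of plain Python. This is only a sketch: the label lists are made up to match the 98,000 / 2,000 split above, fraud is treated as the positive class, and precision is reported as 0 by convention when the model makes no positive predictions.

```python
# Hypothetical labels matching the fraud example: 1 = fraud, 0 = legitimate.
y_true = [0] * 98_000 + [1] * 2_000

# A "model" that always predicts the majority class (legitimate).
y_pred = [0] * len(y_true)

# Confusion-matrix counts, treating fraud (1) as the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
# Fall back to 0.0 whenever a denominator would be zero.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.98 precision=0.00 recall=0.00 f1=0.00
```

Accuracy looks excellent, while recall shows that not a single fraud was caught.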

## 🔹 Why do we handle it?

We handle imbalanced datasets because in many real-world problems the minority class is the one we care about:

- Fraud detection (frauds are rare but important)
- Medical diagnosis (diseases are rare but critical to detect)
- Spam detection (spam is less frequent but needs to be caught)

If we don't handle imbalance:

- The model ignores rare but important events.
- The business or real-world impact can be huge (missed frauds, missed cancer diagnoses, etc.).

### Two common ways of handling an imbalanced dataset:
**1. Upsampling**

**2. Downsampling**

In [1]:
import pandas as pd
import numpy as np

In [None]:
# Set a seed for reproducibility
np.random.seed(123)

# Create a dataset with 2 classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0, n_class_1

(900, 100)

In [7]:
## Create dataframe with imbalance
class_0 = pd.DataFrame({
    'feature_1' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2' : np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [8]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [9]:
df.head()

|   | feature_1 | feature_2 | target |
| --- | --- | --- | --- |
| 0 | -1.774224 | 0.285744 | 0 |
| 1 | -1.201377 | 0.333279 | 0 |
| 2 | 1.096257 | 0.531807 | 0 |
| 3 | 0.861037 | -0.354766 | 0 |
| 4 | -1.520367 | -1.120815 | 0 |


In [10]:
df.tail()

|   | feature_1 | feature_2 | target |
| --- | --- | --- | --- |
| 995 | 2.677156 | 1.092048 | 1 |
| 996 | 2.963404 | 0.181955 | 1 |
| 997 | 1.621476 | 1.877267 | 1 |
| 998 | 3.429559 | 3.794486 | 1 |
| 999 | 3.532273 | 1.67949 | 1 |


In [11]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64
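Raw counts can be turned into proportions to make the imbalance easier to read. The sketch below uses a stand-in Series with the same 900 / 100 split rather than the notebook's `df`, so it runs on its own; `imbalance_ratio` is just an illustrative name.

```python
import pandas as pd

# Stand-in for the `target` column built above: 900 zeros and 100 ones.
target = pd.Series([0] * 900 + [1] * 100, name="target")

# Relative class frequencies instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions)

# Imbalance ratio: majority count divided by minority count.
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(f"imbalance ratio: {imbalance_ratio:.1f} : 1")
```

With the notebook's data, the same call would be `df['target'].value_counts(normalize=True)`.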

### What is Upsampling?
Upsampling is a technique used to handle imbalanced datasets. It means increasing the number of samples in the minority class, here by drawing existing minority rows at random with replacement, so that both classes have roughly the same amount of data.

In [12]:
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [13]:
from sklearn.utils import resample

df_minority_upsampled = resample(df_minority,
                               replace=True,
                               n_samples=len(df_majority),
                               random_state=42)



In [14]:
df_minority_upsampled.shape

(900, 3)

In [16]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [17]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

In [18]:
df_upsampled

|   | feature_1 | feature_2 | target |
| --- | --- | --- | --- |
| 0 | -1.774224 | 0.285744 | 0 |
| 1 | -1.201377 | 0.333279 | 0 |
| 2 | 1.096257 | 0.531807 | 0 |
| 3 | 0.861037 | -0.354766 | 0 |
| 4 | -1.520367 | -1.120815 | 0 |
| ... | ... | ... | ... |
| 952 | 1.766644 | 1.532225 | 1 |
| 965 | 1.527330 | 2.182477 | 1 |
| 976 | 2.463277 | 0.795616 | 1 |
| 942 | 2.930412 | 1.067353 | 1 |
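The upsampling steps above could be wrapped in a small reusable function. The sketch below is one possible version, not part of the notebook: `upsample_minority` is a hypothetical helper, it uses pandas' own `DataFrame.sample` instead of `sklearn.utils.resample`, and it also shuffles the result, which the cells above do not do.

```python
import pandas as pd


def upsample_minority(df, target_col="target", random_state=42):
    """Hypothetical helper: oversample every smaller class up to the largest class size."""
    max_size = df[target_col].value_counts().max()
    parts = []
    for _, group in df.groupby(target_col):
        if len(group) < max_size:
            # Sample with replacement so the group can grow beyond its original size.
            group = group.sample(n=max_size, replace=True, random_state=random_state)
        parts.append(group)
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Small toy frame (not the notebook's data): 8 rows of class 0, 2 rows of class 1.
toy = pd.DataFrame({"feature_1": range(10), "target": [0] * 8 + [1] * 2})
balanced = upsample_minority(toy)
print(balanced["target"].value_counts())
```

Every upsampled minority row is a copy of an original one, so no new information is created; the rows are only repeated.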


### What is Downsampling?
Downsampling is another technique for handling imbalanced datasets. Instead of increasing the minority samples (as upsampling does), we reduce the number of majority-class samples, here by randomly keeping only as many majority rows as there are minority rows.

In [19]:
# Re-create an imbalanced dataset (fresh random draws) for the downsampling example
class_0 = pd.DataFrame({
    'feature_1':np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2':np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1':np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2':np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [None]:
# Split the new dataframe by class
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

In [21]:
from sklearn.utils import resample

df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

In [23]:
df_majority_downsampled.shape

(100, 3)

In [25]:
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

In [26]:
df_downsampled.shape

(200, 3)

In [27]:
df_downsampled['target'].value_counts()

target
0    100
1    100
Name: count, dtype: int64
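As with upsampling, the downsampling steps could be collected into one function. This is a sketch under the same assumptions as before: `downsample_majority` is a hypothetical helper built on pandas' `DataFrame.sample` rather than `sklearn.utils.resample`, and it shuffles the result.

```python
import pandas as pd


def downsample_majority(df, target_col="target", random_state=42):
    """Hypothetical helper: undersample every larger class down to the smallest class size."""
    min_size = df[target_col].value_counts().min()
    parts = [
        # Sample without replacement so no row is picked twice.
        group.sample(n=min_size, replace=False, random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Small toy frame (not the notebook's data): 8 rows of class 0, 2 rows of class 1.
toy = pd.DataFrame({"feature_1": range(10), "target": [0] * 8 + [1] * 2})
balanced = downsample_majority(toy)
print(balanced["target"].value_counts())
```

The trade-off is visible in the notebook's numbers: balancing 900 majority rows against 100 minority rows discards 800 majority rows, so downsampling suits cases where the majority class has data to spare.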