Handling Imbalanced Datasets
Up Sampling
Down Sampling

An imbalanced dataset is one where the classes (or categories) in the target variable are not represented equally. In classification problems, this means one class has significantly more samples than the other(s).

For example:

Class 0: 900 samples

Class 1: 100 samples
This results in a 90:10 class imbalance.

Machine learning models trained on imbalanced datasets tend to be biased toward the majority class, which can lead to poor performance, especially on the minority class.
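
To see why overall accuracy can hide this bias, here is a minimal sketch. It assumes the 90:10 split from the example above and a hypothetical baseline that always predicts class 0; the names `y_true` and `y_pred` are illustrative only.

```python
import numpy as np

# Labels for the 90:10 example: 900 samples of class 0, 100 of class 1
y_true = np.array([0] * 900 + [1] * 100)

# Hypothetical baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy looks high because class 0 dominates the dataset
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of class 1 samples that were found
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The baseline scores well on accuracy while never identifying a single minority sample, which is why per-class metrics are worth checking on imbalanced data.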

How to Handle Imbalanced Datasets?
To improve model performance, especially for the minority class, we can use resampling techniques to balance the dataset.

1. Upsampling (Oversampling)
Increases the number of samples in the minority class
Done by randomly duplicating existing minority samples (with replacement)
Can lead to overfitting if not used carefully

2. Downsampling (Undersampling)
Reduces the number of samples in the majority class
Done by randomly removing samples from the majority class
Risk: Loss of potentially useful data
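
The two strategies above can be sketched on a tiny toy dataset. This is only an illustration using pandas' `DataFrame.sample`; the frame `toy` and its column names are made up for the example and are not part of the dataset built later.

```python
import pandas as pd

# Toy dataset: 8 majority rows (target 0) and 2 minority rows (target 1)
toy = pd.DataFrame({
    "feature": range(10),
    "target": [0] * 8 + [1] * 2,
})
minority = toy[toy["target"] == 1]
majority = toy[toy["target"] == 0]

# Upsampling: draw minority rows WITH replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw majority rows WITHOUT replacement down to the minority size
majority_down = majority.sample(n=len(minority), replace=False, random_state=0)
downsampled = pd.concat([majority_down, minority])

print(upsampled["target"].value_counts().to_dict())
print(downsampled["target"].value_counts().to_dict())
```

Upsampling keeps every original row and grows the dataset, while downsampling shrinks it and discards majority rows.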

In [1]:
##Creating a synthetic dataset
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [5]:
## CREATE MY DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

A dataframe with two classes means that the dataset has a target column labeling each row as one of two categories. This is called a binary classification problem.

In [7]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [9]:
df.tail()

     feature_1  feature_2  target
995   1.376371   2.845701       1
996    2.23981   0.880077       1
997    1.13176   1.640703       1
998   2.902006   0.390305       1
999    2.69749    2.01357       1


In [11]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [13]:
## upsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [15]:
from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,replace=True, #Sample With replacement
         n_samples=len(df_majority),
         random_state=42
        )

In [17]:
df_minority_upsampled.shape

(900, 3)

In [19]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992    2.19657   1.397425       1
914    1.93217   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [21]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [23]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
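
Balanced counts do not mean new information was added. As a rough check, the sketch below upsamples a stand-in minority class and counts how many distinct original rows the result contains. The frame `minority` is a simplified, hypothetical substitute for `df_minority`, built here only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in for the minority class: 100 distinct rows
rng = np.random.default_rng(0)
minority = pd.DataFrame({
    "feature_1": rng.normal(loc=2, scale=1, size=100),
    "target": 1,
})

# Upsample to 900 rows with replacement, as in the cells above
minority_up = resample(minority, replace=True, n_samples=900, random_state=42)

# The index still points back to the original rows, so the number of
# unique index values is the number of distinct samples behind the 900 rows
n_distinct = minority_up.index.nunique()
print(f"{len(minority_up)} rows drawn from {n_distinct} distinct originals")
```

The 900 rows can never come from more than the 100 original samples, which is where the overfitting risk mentioned earlier comes from: the model sees the same minority points many times.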

Down Sampling

In [25]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [27]:
## downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [29]:
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False, #Sample without replacement
         n_samples=len(df_minority),
         random_state=42
        )
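
To finish the downsampling step, the reduced majority class can be combined with the untouched minority class, mirroring the upsampling cells. The following is a minimal, self-contained sketch; it rebuilds the same 900/100 dataset, and the names `df_majority_downsampled` and `df_downsampled` are assumed for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Rebuild the 900/100 dataset so this cell runs on its own
np.random.seed(123)
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=900),
    'feature_2': np.random.normal(loc=0, scale=1, size=900),
    'target': [0] * 900
})
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=100),
    'feature_2': np.random.normal(loc=2, scale=1, size=100),
    'target': [1] * 100
})
df = pd.concat([class_0, class_1]).reset_index(drop=True)

df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

# Downsample: draw majority rows WITHOUT replacement, down to the minority size
df_majority_downsampled = resample(df_majority, replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

# Combine the reduced majority class with the full minority class
df_downsampled = pd.concat([df_minority, df_majority_downsampled])
print(df_downsampled['target'].value_counts())
```

Both classes end up with 100 rows each, at the cost of discarding 800 majority samples.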