### Handling Imbalanced Datasets

Suppose we are solving a classification problem, that is, a supervised machine learning problem whose output is a category. When there are only two categories, it is called binary classification.  

Suppose the dataset has 1000 data points and the output feature is binary, for example yes/no.  
900 of the data points are labelled yes and only 100 are labelled no.  
The ratio of yes to no is therefore 9:1, so the two classes are far from evenly represented.  
This is an imbalanced dataset.  

The problem with an imbalanced dataset is that a model trained on it tends to become biased towards the majority class, because that class supplies most of the data points.  
It is therefore often necessary to rebalance the dataset so that both classes have roughly the same number of data points.   
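
To make the bias concrete, consider a hypothetical "model" that ignores its input and always predicts the majority class. The sketch below uses the 900/100 split from the example above; the label arrays and the encoding (1 = yes, 0 = no) are assumptions made up for illustration.

```python
import numpy as np

# Hypothetical labels matching the 900 "yes" / 100 "no" example (1 = yes, 0 = no)
y_true = np.array([1] * 900 + [0] * 100)

# A "model" that always predicts the majority class, whatever the input is
y_pred = np.ones_like(y_true)

# Overall accuracy: fraction of predictions that match the true label
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of true "no" points predicted as "no"
minority_recall = (y_pred[y_true == 0] == 0).mean()

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.90
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The accuracy looks good even though the minority class is never predicted, which is why the imbalance needs attention before training.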


There are two techniques to do that:    
1. Up Sampling: Increase the number of data points of the minority class
2. Down Sampling: Decrease the number of data points of the majority class

In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000 # Taking 1000 data points for instance, we can take 10,000 as well
class_0_ratio = 0.9 # 90% class zero ratio
n_class_0 = int(n_samples * class_0_ratio) # 900 data points
n_class_1 = n_samples - n_class_0 # 100 data points

In [2]:
n_class_0,n_class_1

(900, 100)

In [None]:
## Creating DataFrame with imbalanced dataset
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0), #normal distribution
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0), #normal distribution
    'target': [0] * n_class_0 #900 zeroes will be created
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1), #normal distribution
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1), #normal distribution
    'target': [1] * n_class_1 #100 ones will be created
})

# - feature_1 and feature_2:
#   Two features drawn from a normal (Gaussian) distribution.
#   - loc: mean of the distribution (0 for class 0, 2 for class 1)
#   - scale: standard deviation (1 for both classes)
#   - size: number of data points generated (n_class_0 or n_class_1)
# - target:
#   Binary label for classification: every row of class_0 is 0 and every row of class_1 is 1.

# What scale=1 means
# - With loc=0 and scale=1, the values have mean 0 and standard deviation 1.
# - Most of the data therefore falls within:
#   - ±1 of the mean: about 68% of values
#   - ±2 of the mean: about 95%
#   - ±3 of the mean: about 99.7%
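
The 68% / 95% / 99.7% figures can be checked empirically. The following is a minimal sketch, assuming a large sample drawn with NumPy's `default_rng`; the seed and the sample size are arbitrary choices and are not part of the notebook above.

```python
import numpy as np

# Seed and sample size are arbitrary, chosen only for this illustration
rng = np.random.default_rng(0)
samples = rng.normal(loc=0, scale=1, size=100_000)

# Share of samples lying within k standard deviations of the mean
shares = {}
for k in (1, 2, 3):
    shares[k] = (np.abs(samples) <= k).mean()
    print(f"within ±{k} standard deviations: {shares[k]:.3f}")
```

The printed shares should come out close to 0.68, 0.95 and 0.997, with small deviations due to sampling noise.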



In [4]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [5]:
df.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.252750       0
4  -0.578600  -0.292004       0


In [6]:
df.tail()

     feature_1  feature_2  target
995   1.376371   2.845701       1
996   2.239810   0.880077       1
997   1.131760   1.640703       1
998   2.902006   0.390305       1
999   2.697490   2.013570       1


In [8]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [9]:
## Upsampling
# Create two dataframes
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [None]:
from sklearn.utils import resample
# resample draws rows from the minority class until it matches the size of the majority class
df_minority_upsampled = resample(
    df_minority,                  # the data to upsample
    replace=True,                 # sample with replacement, so existing rows can be drawn more than once
    n_samples=len(df_majority),   # number of rows to draw (900, the size of the majority class)
    random_state=42,              # fixed seed so the result is reproducible
)
# Note: no new data points are created; the extra rows are copies of existing minority rows.

In [None]:
df_minority_upsampled.shape # Now we have 900 datapoints

(900, 3)
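
Sampling with replacement means the upsampled frame is made of repeated copies of existing rows. A minimal sketch of how one might confirm this, using a small hypothetical frame (`minority`, 5 rows) rather than the notebook's data:

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Small hypothetical minority class: 5 rows with distinct index labels 0..4
minority = pd.DataFrame({"feature_1": np.arange(5.0), "target": [1] * 5})

# Upsample to 20 rows by drawing with replacement
upsampled = resample(minority, replace=True, n_samples=20, random_state=42)

print(upsampled.shape)            # (20, 2)

# The original index labels are preserved, so repeated labels reveal copied rows
n_distinct = upsampled.index.nunique()
has_duplicates = upsampled.index.duplicated().any()
print(n_distinct)                 # at most 5, the number of original rows
print(has_duplicates)             # True: 20 draws from 5 rows must repeat some
```

The same check on `df_minority_upsampled` would show at most 100 distinct index labels among its 900 rows.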

In [15]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992   2.196570   1.397425       1
914   1.932170   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [16]:
## Combining two datasets
df_upsampled = pd.concat([df_majority,df_minority_upsampled])

In [17]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

In [22]:
## Down Sampling: recreate the same imbalanced dataset

import pandas as pd
import numpy as np

np.random.seed(123)

n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0,class_1]).reset_index(drop=True)
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [23]:
## Down Sampling
## Creating two datasets
df_majority = df[df['target']==0]
df_minority = df[df['target']==1]

In [25]:
from sklearn.utils import resample
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)
# replace=False samples without replacement: each majority row is picked at most once,
# because the goal is to keep a subset of the existing rows, not to repeat any of them.

In [26]:
df_majority_downsampled.shape

(100, 3)
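
With `replace=False`, no row can appear twice, and `resample` cannot return more rows than the input contains. A minimal sketch on a small hypothetical frame (`majority`, 10 rows) that illustrates both points:

```python
import pandas as pd
from sklearn.utils import resample

# Small hypothetical majority class: 10 rows with distinct index labels 0..9
majority = pd.DataFrame({"feature_1": range(10), "target": [0] * 10})

# Downsample to 4 rows without replacement
down = resample(majority, replace=False, n_samples=4, random_state=42)
print(len(down))             # 4
print(down.index.is_unique)  # True: no row is drawn twice

# Asking for more rows than exist is not possible without replacement
error_raised = False
try:
    resample(majority, replace=False, n_samples=11, random_state=42)
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```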

In [27]:
## Combining the datasets
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [28]:
df_downsampled.head()

     feature_1  feature_2  target
900   1.699768   2.139033       1
901   1.367739   2.025577       1
902   1.795683   1.803557       1
903   2.213696   3.312255       1
904   3.033878   3.187417       1


In [29]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64

In [30]:
## Down sampling is generally the less attractive option because it discards data points from the majority class
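
The amount of data lost can be worked out directly from the class counts. A short sketch using the 900/100 split from this example (the variable names are illustrative only):

```python
# Class counts from the example above
n_majority, n_minority = 900, 100

# After down sampling, both classes have n_minority rows
kept = 2 * n_minority
discarded = n_majority - n_minority

print(f"rows kept: {kept} of {n_majority + n_minority}")
# rows kept: 200 of 1000
print(f"majority rows discarded: {discarded} ({discarded / n_majority:.1%})")
# majority rows discarded: 800 (88.9%)
```

The more severe the imbalance, the larger the share of majority data that down sampling throws away.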