## Handling Imbalanced Datasets

Class imbalance shows up with categorical targets, most commonly binary ones.

Suppose a dataset has a yes/no target with:
            900 Yes     100 No
            So the ratio is 9:1

This means a model trained on it will be biased towards the majority class (the 900 "Yes" rows). There are two techniques to handle this:


1. Up Sampling
2. Down Sampling
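
To see why a 9:1 ratio is a problem, consider a baseline that always predicts the majority class. The sketch below uses hypothetical label lists that mirror the 900 Yes / 100 No example; it is not part of the dataset built in the cells that follow.

```python
# Hypothetical labels matching the 900 Yes / 100 No example.
labels = ["Yes"] * 900 + ["No"] * 100

# A "model" that ignores its input and always predicts the majority class.
predictions = ["Yes"] * len(labels)

# Overall accuracy looks good even though the model learned nothing.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: fraction of "No" rows predicted as "No".
minority_recall = sum(
    p == y for p, y in zip(predictions, labels) if y == "No"
) / labels.count("No")

print(accuracy)         # 0.9
print(minority_recall)  # 0.0
```

A 90% accuracy with zero minority recall is the bias that up sampling and down sampling try to reduce.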

In [13]:
import numpy as np
import pandas as pd

# Set the seed for reproducibility. Any fixed integer works in place of 123.
np.random.seed(123)

# Create a dataframe with two classes
n_samples=1000
class_0_ratio=0.9
n_class_0=int(n_samples*class_0_ratio)
n_class_1=n_samples-n_class_0

In [14]:
n_class_0,n_class_1

(900, 100)

In [21]:
## Create a dataframe with an imbalanced dataset

# loc: The mean of the normal distribution (the center around which the data is distributed).
# scale: The standard deviation (spread or width of the distribution).
# size: The number of random values to generate.

class_0=pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0  # 900 rows labelled 0
})


class_1=pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})
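
The `loc`, `scale`, and `size` parameters described above can be checked with a quick experiment. This is a minimal sketch: the generator `rng`, its seed, and the sample size of 10,000 are assumptions made for illustration and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch

# Same loc and scale as class_1, but with more draws so the
# sample statistics settle close to the requested values.
samples = rng.normal(loc=2, scale=1, size=10_000)

sample_mean = samples.mean()  # should land near loc
sample_std = samples.std()    # should land near scale

print(samples.shape)  # (10000,)
print(round(sample_mean, 1), round(sample_std, 1))
```

With `loc=0` for class 0 and `loc=2` for class 1, the two classes are centred two standard deviations apart on each feature, so they overlap but remain distinguishable.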

In [22]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [23]:
df.head()

   feature_1  feature_2  target
0  -0.471276   0.328462       0
1   1.084072   1.038230       0
2  -0.379223   1.147064       0
3  -0.362274   0.638254       0
4  -0.681071  -1.075766       0


In [24]:
df.tail()

     feature_1  feature_2  target
995   2.720690   1.978489       1
996   1.949078   3.709793       1
997   2.709784   3.324917       1
998   1.817689   1.549237       1
999   2.838910   2.240773       1


In [25]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [28]:
## Upsampling

## Separate the data into minority and majority classes

df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [35]:
pip install scikit-learn



In [37]:
from sklearn.utils import resample

df_minority_upsampled = resample(
    df_minority,                 # The minority class dataframe to be upsampled
    replace=True,                # Allow sampling with replacement
    n_samples=len(df_majority),  # Match the number of samples to the majority class
    random_state=42              # Set seed for reproducibility
)

In [40]:
df_minority_upsampled.shape

(900, 3)

In [41]:
df_minority_upsampled.head()

     feature_1  feature_2  target
951   2.701255   0.056481       1
992   2.322342   3.289170       1
914   2.500618   1.338170       1
971   1.922944   2.907951       1
960   0.630218   0.982405       1


In [43]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [44]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
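
Because upsampling draws with `replace=True`, the 900 minority rows are repeats of the original 100 and add no new information. The sketch below illustrates this on a small stand-in dataframe; the name `minority`, the single feature column, and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)  # arbitrary seed, an assumption of this sketch

# Stand-in for the minority class: 100 rows, as in the example above.
minority = pd.DataFrame({
    "feature_1": rng.normal(loc=2, scale=1, size=100),
    "target": 1,
})

# Draw 900 rows with replacement from the 100 available.
upsampled = resample(minority, replace=True, n_samples=900, random_state=42)

# Every sampled row carries the index label of the original row it came from.
n_unique = upsampled.index.nunique()

print(len(upsampled))    # 900
print(n_unique <= 100)   # True
```

The repeated index labels are also why calling `reset_index(drop=True)` after `pd.concat` can be useful before further processing.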

In [45]:
## DownSampling

import numpy as np
import pandas as pd

# Set the seed for reproducibility. Any fixed integer works in place of 123.
np.random.seed(123)

# Create a dataframe with two classes
n_samples=1000
class_0_ratio=0.9
n_class_0=int(n_samples*class_0_ratio)
n_class_1=n_samples-n_class_0

class_0=pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0  # 900 rows labelled 0
})


class_1=pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})


df=pd.concat([class_0,class_1]).reset_index(drop=True)

## Check the class Distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [46]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [47]:
from sklearn.utils import resample

df_majority_downsampled = resample(
    df_majority,              # The majority class dataframe to be downsampled
    replace=False,            # Sampling without replacement (to avoid duplicates)
    n_samples=len(df_minority), # Reduce to match the size of the minority class
    random_state=42           # Set seed for reproducibility
)


In [48]:
df_majority_downsampled.shape

(100, 3)

In [49]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [50]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64

Downsampling has a cost: it discards data. Here 800 of the 900 majority-class rows are thrown away, so the model is trained on far less information.
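
Both techniques follow the same pattern: split by class, resample one side, and concatenate. A minimal sketch of a reusable wrapper is shown below. The function `balance_binary` and its `how` parameter are hypothetical names introduced here, not part of scikit-learn, and the sketch assumes exactly two classes of different sizes.

```python
import pandas as pd
from sklearn.utils import resample


def balance_binary(df, target="target", how="up", random_state=42):
    """Hypothetical helper: balance a two-class dataframe by resampling.

    Assumes the target column has exactly two classes of different sizes.
    """
    counts = df[target].value_counts()
    majority = df[df[target] == counts.idxmax()]
    minority = df[df[target] == counts.idxmin()]

    if how == "up":
        # Grow the minority class by sampling with replacement.
        minority = resample(minority, replace=True,
                            n_samples=len(majority), random_state=random_state)
    elif how == "down":
        # Shrink the majority class by sampling without replacement.
        majority = resample(majority, replace=False,
                            n_samples=len(minority), random_state=random_state)
    else:
        raise ValueError("how must be 'up' or 'down'")

    return pd.concat([majority, minority])


# Toy data: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "target": [0] * 8 + [1] * 2})
up = balance_binary(toy, how="up")
down = balance_binary(toy, how="down")
print(len(up), len(down))  # 16 4
```

On the toy data, upsampling keeps all 8 majority rows and repeats minority rows, while downsampling keeps only 2 of the 8 majority rows, which is the data loss described above.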