# Handling Imbalanced Datasets

An imbalanced dataset is one in which some classes or outcomes have far fewer instances than others.
A model trained on such data sees few minority-class examples, so it can learn to favor the majority class and predict the minority class poorly.
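To make the problem concrete, here is a minimal sketch using hypothetical labels with the same 90/10 split used later in this notebook. The "model" is an assumption for illustration: it always predicts the majority class. It still reaches high accuracy while never finding a single minority-class instance.

```python
import numpy as np

# Hypothetical labels: 900 instances of Class 0 and 100 of Class 1.
y_true = np.array([0] * 900 + [1] * 100)

# An illustrative "model" that ignores its input and always predicts Class 0.
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of predictions that match the labels.
accuracy = (y_pred == y_true).mean()

# Recall on Class 1: fraction of true Class 1 instances predicted as Class 1.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

This is why accuracy alone can be misleading on imbalanced data, and why the class balance is often adjusted before training.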

**1. Up-sampling**: This technique increases the number of samples in the minority class, either by duplicating existing samples or by creating synthetic ones, so the model sees more minority-class examples.

**2. Down-sampling**: This technique reduces the number of samples in the majority class by randomly removing instances. It balances the classes without adding new data, at the cost of discarding some majority-class samples.
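The rest of this notebook up-samples by duplication. For the "synthetic samples" option mentioned above, the following is a rough sketch of one possible approach: each new point is placed on the line segment between two randomly chosen minority-class points. The function name `interpolate_samples` and the toy data are hypothetical. Methods such as SMOTE interpolate toward nearest neighbors instead of random pairs, so this is a simplification of that idea and not an implementation of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class feature matrix: 5 points with 2 features each.
X_minority = rng.normal(0, 1, size=(5, 2))


def interpolate_samples(X, n_new, rng):
    """Create n_new synthetic rows, each lying between two random rows of X."""
    i = rng.integers(0, len(X), size=n_new)   # first endpoint of each pair
    j = rng.integers(0, len(X), size=n_new)   # second endpoint of each pair
    t = rng.random((n_new, 1))                # position along the segment, in [0, 1)
    return X[i] + t * (X[j] - X[i])


X_synthetic = interpolate_samples(X_minority, n_new=10, rng=rng)
print(X_synthetic.shape)  # (10, 2)
```

Because every synthetic row is a convex combination of two existing rows, the new points stay inside the range already covered by the minority class.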

In [5]:
#creating imbalanced dataset
import pandas as pd
import numpy as np

# Create an imbalanced dataset with 90% of Class 0 and 10% of Class 1
data = pd.DataFrame({
    'Feature_1': np.random.normal(0, 1, 1000),
    'Feature_2': np.random.normal(0, 1, 1000),
    'Target': [0] * 900 + [1] * 100  # 900 samples of Class 0, 100 samples of Class 1
})

# Shuffle the dataset
data = data.sample(frac=1, random_state=1).reset_index(drop=True)

# Display the class distribution
print(data['Target'].value_counts())


Target
0    900
1    100
Name: count, dtype: int64


In [6]:
data.head()

| | Feature_1 | Feature_2 | Target |
|---|---|---|---|
| 0 | 0.902587 | 0.439114 | 0 |
| 1 | 1.088333 | 0.630227 | 0 |
| 2 | -0.977628 | 0.693316 | 0 |
| 3 | 0.606214 | -0.297328 | 0 |
| 4 | 0.521564 | 1.079311 | 0 |


In [7]:
data.tail()

| | Feature_1 | Feature_2 | Target |
|---|---|---|---|
| 995 | -0.442188 | -1.96362 | 0 |
| 996 | -0.846634 | 0.722629 | 0 |
| 997 | 0.642657 | -0.52371 | 1 |
| 998 | 0.583024 | 0.236159 | 0 |
| 999 | 0.050832 | -0.291033 | 0 |


## Up-sampling Technique

In [8]:
df_minority=data[data['Target']==1]

In [9]:
df_majority=data[data['Target']==0]

In [10]:
!pip install scikit-learn

In [11]:
from sklearn.utils import resample

In [13]:
df_minority_upsampled = resample(
    df_minority,
    replace=True,                 # sample with replacement
    n_samples=len(df_majority),   # match the majority-class count
    random_state=42,              # reproducible draw
)

In [14]:
df_minority_upsampled.shape

(900, 3)

In [15]:
df_minority_upsampled.head()

| | Feature_1 | Feature_2 | Target |
|---|---|---|---|
| 565 | -0.63662 | -0.907073 | 1 |
| 939 | -1.197124 | 0.442088 | 1 |
| 147 | -2.111818 | 0.612107 | 1 |
| 735 | 1.00474 | -0.738081 | 1 |
| 638 | 1.634685 | -0.889353 | 1 |


In [16]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [19]:
df_upsampled['Target'].value_counts()

Target
0    900
1    900
Name: count, dtype: int64
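It is worth checking what the 900 up-sampled rows actually contain. The sketch below uses a hypothetical stand-in for the minority subset (the names `df_min` and `df_up` are illustrative only) and counts how many distinct original rows appear after sampling with replacement. Since only 100 originals exist, the 900 rows are necessarily repeats of them.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical stand-in for the minority subset: 100 rows of Class 1.
rng = np.random.default_rng(0)
df_min = pd.DataFrame({
    "Feature_1": rng.normal(0, 1, 100),
    "Target": 1,
})

# Same call pattern as above: draw 900 rows with replacement.
df_up = resample(df_min, replace=True, n_samples=900, random_state=42)

# resample keeps the original index labels, so repeated labels mark duplicated rows.
n_rows = len(df_up)
n_distinct = df_up.index.nunique()

print("rows after up-sampling:", n_rows)
print("distinct original rows used:", n_distinct)
```

Up-sampling by duplication therefore adds no new information. It only gives existing minority rows more weight. The combined frame also carries repeated index labels and is ordered by class, so calling `sample(frac=1)` followed by `reset_index(drop=True)` on it, as done when the dataset was created, gives a shuffled frame with a clean index.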

## Down-sampling Technique

In [20]:
# Re-create the imbalanced dataset for the down-sampling example
import pandas as pd
import numpy as np

# Create an imbalanced dataset with 90% of Class 0 and 10% of Class 1
data = pd.DataFrame({
    'Feature_1': np.random.normal(0, 1, 1000),
    'Feature_2': np.random.normal(0, 1, 1000),
    'Target': [0] * 900 + [1] * 100  # 900 samples of Class 0, 100 samples of Class 1
})

# Shuffle the dataset
data = data.sample(frac=1, random_state=1).reset_index(drop=True)

# Display the class distribution
print(data['Target'].value_counts())

Target
0    900
1    100
Name: count, dtype: int64


In [21]:
df_minority=data[data['Target']==1]
df_majority=data[data['Target']==0]

In [22]:
df_majority_downsampled = resample(
    df_majority,
    replace=False,                # sample without replacement
    n_samples=len(df_minority),   # match the minority-class count
    random_state=42,              # reproducible draw
)

In [24]:
df_majority_downsampled.shape

(100, 3)

In [25]:
df_downsampled=pd.concat([df_minority,df_majority_downsampled])

In [27]:
df_downsampled.Target.value_counts()

Target
1    100
0    100
Name: count, dtype: int64
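The two walkthroughs above follow the same pattern: split by class, resample one side, then concatenate. As a possible convenience, they could be wrapped in a single helper. The function `balance_binary` and its `strategy` parameter below are hypothetical names introduced for this sketch. It assumes a target column with exactly two classes.

```python
import pandas as pd
from sklearn.utils import resample


def balance_binary(df, target, strategy="up", random_state=42):
    """Return a class-balanced, shuffled copy of df for a two-class target."""
    counts = df[target].value_counts()  # sorted from most to least frequent
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    majority_label, minority_label = counts.index[0], counts.index[1]

    df_major = df[df[target] == majority_label]
    df_minor = df[df[target] == minority_label]

    if strategy == "up":
        # Duplicate minority rows until they match the majority count.
        df_minor = resample(df_minor, replace=True,
                            n_samples=len(df_major), random_state=random_state)
    elif strategy == "down":
        # Drop majority rows until they match the minority count.
        df_major = resample(df_major, replace=False,
                            n_samples=len(df_minor), random_state=random_state)
    else:
        raise ValueError("strategy must be 'up' or 'down'")

    balanced = pd.concat([df_major, df_minor])
    # Shuffle and rebuild the index so the classes are interleaved.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Small toy frame: 8 rows of Class 0 and 2 rows of Class 1.
toy = pd.DataFrame({"x": range(10), "Target": [0] * 8 + [1] * 2})
up = balance_binary(toy, "Target", strategy="up")
down = balance_binary(toy, "Target", strategy="down")

print(up["Target"].value_counts().to_dict())
print(down["Target"].value_counts().to_dict())
```

With the toy frame, up-sampling yields 8 rows per class and down-sampling yields 2 rows per class, mirroring the 900/900 and 100/100 results above.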