# What is an Imbalanced Dataset?
An imbalanced dataset is one in which one class has many more examples than the other(s).

Example:
| **Class**    | **Count** |
| ------------ | --------- |
| **Spam**     | 900       |
| **Not Spam** | 100       |

# Problem:
If you train a model on this data, it can reach 90% accuracy simply by predicting "Spam" every time, but it will learn almost nothing about the "Not Spam" cases.

# Why is it a problem?

- Accuracy becomes misleading
- The model tends to ignore the smaller class
- Recall and precision for the minority class are poor
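
To make the first point concrete, here is a minimal sketch that reuses the 900/100 split from the example table. The `y_pred` array is a hypothetical stand-in for a model that has learned nothing and always predicts the majority class; it is not the output of a trained model.

```python
import numpy as np

# Labels matching the 900/100 example: 0 = majority class, 1 = minority class.
y_true = np.array([0] * 900 + [1] * 100)

# Hypothetical "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Accuracy: share of all predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Minority recall: share of true minority samples that were predicted as minority.
minority_mask = y_true == 1
minority_recall = (y_pred[minority_mask] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy comes out at 0.90 while the minority recall is 0.00, which is why accuracy alone is a poor guide on imbalanced data.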



In [1]:
import numpy as np

n_samples = 1000
class_0_ratio = 0.9

n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0
print(f'n_class_0: {n_class_0}')
print(f'n_class_1: {n_class_1}')

n_class_0: 900
n_class_1: 100


In [2]:
# Create one DataFrame per class
import pandas as pd
class_0 = pd.DataFrame({
    'Feature 1': np.random.normal(loc=1,scale=2,size=n_class_0),
    'Feature 2': np.random.normal(loc=1,scale=2,size=n_class_0),
    'target': [0] * n_class_0
})
class_1 = pd.DataFrame({
    'Feature 1': np.random.normal(loc=2,scale=5,size=n_class_1),
    'Feature 2': np.random.normal(loc=2,scale=5,size=n_class_1),
    'target': [1] * n_class_1
})


In [3]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)
df.head()

   Feature 1  Feature 2  target
0  -2.702744  -0.089673       0
1   1.525361  -1.206502       0
2   2.885846   1.511799       0
3   3.073158   1.154589       0
4   2.046138  -0.935897       0


In [4]:
df.target.value_counts()

target
0    900
1    100
Name: count, dtype: int64

# 1️⃣ Upsampling (Oversampling the Minority Class)

You increase the number of samples in the minority class by duplicating existing samples or creating synthetic samples.

Goal: Balance the classes

# What Are Synthetic Samples? (We will practice this later.)
Synthetic samples are artificial data points created by the computer, not collected from real-world data.

In imbalanced datasets, synthetic samples are used to:

- Increase the minority class size
- Avoid the exact duplicates that simple upsampling produces
- Reduce the risk of overfitting to repeated samples
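
As a rough illustration of how a synthetic sample can be created, the sketch below interpolates between a minority point and one of its nearest minority neighbours, in the spirit of SMOTE-style methods. It is a simplified sketch, not the implementation of any particular library; the function `make_synthetic`, its parameter `k`, and the random `minority` data are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points with two features (illustrative data only).
minority = rng.normal(loc=2, scale=5, size=(100, 2))


def make_synthetic(points, n_new, k=5, rng=rng):
    """Create n_new points, each on the line segment between a randomly
    chosen point and one of its k nearest neighbours."""
    synthetic = np.empty((n_new, points.shape[1]))
    for i in range(n_new):
        base = points[rng.integers(len(points))]
        # Euclidean distance from the base point to every minority point.
        dist = np.linalg.norm(points - base, axis=1)
        # Position 0 in the sorted order is the base point itself, so skip it.
        neighbour_idx = np.argsort(dist)[1:k + 1]
        neighbour = points[rng.choice(neighbour_idx)]
        # Random position between the base point (0) and the neighbour (1).
        gap = rng.random()
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic


new_points = make_synthetic(minority, n_new=800)
print(new_points.shape)
```

Because each new point lies between two real minority points, it is not an exact copy of either one, yet it stays inside the region the minority class already occupies.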



In [5]:
df_majority = df[df['target'] == 0]
df_minority = df[df['target'] == 1]

print('shape of df_majority',df_majority.shape)
print('shape of df_minority',df_minority.shape)

shape of df_majority (900, 3)
shape of df_minority (100, 3)


In [6]:
from sklearn.utils import resample  

In [7]:
df_minority_upsampled=resample(df_minority,
         replace=True,
         n_samples=len(df_majority),
         random_state=42)

# replace=True means sampling with replacement: after a sample is selected,
# it is put back into the dataset before the next sample is picked, so the
# same data point can be picked multiple times and duplicates can occur.

In [8]:
df_minority_upsampled.shape

(900, 3)

In [9]:
df_minority_upsampled.head()

     Feature 1  Feature 2  target
951   5.203665   3.615366       1
992   2.255191   0.161786       1
914   6.623633  -2.184547       1
971   6.386927   3.688517       1
960  -0.260834   0.225929       1


In [10]:
df_upsampled = pd.concat([df_minority_upsampled,df_majority])

In [11]:
df_upsampled.shape

(1800, 3)
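
A quick sanity check after upsampling is to confirm that both classes now have the same count and to see how many distinct minority rows remain. The sketch below is self-contained and uses `DataFrame.sample(replace=True)` as a pandas-only stand-in for `sklearn.utils.resample`; the `demo` data and all variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Small stand-in dataset with the same 900/100 split (values are illustrative).
demo = pd.DataFrame({
    "Feature 1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
majority = demo[demo["target"] == 0]
minority = demo[demo["target"] == 1]

# Draw minority rows with replacement until they match the majority count.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

counts = balanced["target"].value_counts()
print(counts)

# Upsampling adds no new information: only the original rows are repeated,
# so there can never be more than 100 distinct minority rows.
n_distinct = minority_up.index.nunique()
print(f"distinct minority rows after upsampling: {n_distinct}")
```

Both classes end up with 900 rows, but the minority half is built from at most 100 distinct rows, which is why plain duplication can encourage overfitting.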

# 2️⃣ Downsampling (Undersampling the Majority Class)

You reduce the number of samples in the majority class by randomly removing data.

Goal: Balance the classes

In [12]:
np.random.seed(124)

n__sample = 1000
class__0_ratio = 0.9
n__class_0 = int(n__sample * class__0_ratio)
n__class_1 = n__sample - n__class_0


In [13]:
class__0 = pd.DataFrame({
    'feature 1' : np.random.normal(loc=1,scale=2,size=n__class_0),
    'feature 2' : np.random.normal(loc=1,scale=2,size=n__class_0),
    'target' : [0] * n__class_0
})

class__1 = pd.DataFrame({
    'feature 1' : np.random.normal(loc=2,scale=4,size=n__class_1),
    'feature 2' : np.random.normal(loc=2,scale=4,size=n__class_1),
    'target' : [1] * n__class_1
})

In [14]:
df1 = pd.concat([class__0,class__1]).reset_index(drop=True)

In [15]:
df1.shape

(1000, 3)

In [16]:
df1.target.value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [17]:
## downsampling
df1_minority=df1[df1['target']==1]
df1_majority=df1[df1['target']==0]

In [18]:
from sklearn.utils import resample
df1_majority_downsampled = resample(df1_majority,
         replace=False,  # sample without replacement
         n_samples=len(df1_minority),
         random_state=42)

In [19]:
df1_majority_downsampled.shape

(100, 3)

In [20]:
df1_downsampled = pd.concat([df1_majority_downsampled,df1_minority])
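
The trade-off of downsampling can be checked in the same way: the classes end up balanced and no row is repeated, but most of the majority data is thrown away. The sketch below is self-contained and uses `DataFrame.sample(replace=False)` as a pandas-only stand-in for `sklearn.utils.resample`; the `demo` data and variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(124)

# Small stand-in dataset with the same 900/100 split (values are illustrative).
demo = pd.DataFrame({
    "feature 1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
majority = demo[demo["target"] == 0]
minority = demo[demo["target"] == 1]

# replace=False: each majority row can be drawn at most once.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
balanced = pd.concat([majority_down, minority])

counts = balanced["target"].value_counts()
n_duplicates = int(majority_down.index.duplicated().sum())
n_discarded = len(majority) - len(majority_down)

print(counts)
print(f"duplicate rows drawn: {n_duplicates}")
print(f"majority rows discarded: {n_discarded}")
```

Here 800 of the 900 majority rows are discarded, so downsampling is most reasonable when the majority class is large enough that the remaining rows still represent it well.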

In [21]:
df1_downsampled

     feature 1  feature 2  target
70    2.997489  -0.208579       0
827   2.091802  -0.823175       0
231   2.246231  -0.024955       0
588   2.541803   4.233045       0
39   -3.605107   2.894630       0
..         ...        ...     ...
995  -1.972379  -0.661294       1
996   0.260730   0.027519       1
997  -0.636976  -6.637926       1
998  -2.691417   6.052280       1
