# **Handling Imbalanced Datasets**

## Overview
In an imbalanced dataset, the classes are not equally represented. A model trained on such data tends to be biased toward the majority class.

### Handling Imbalanced Data

**Two main approaches:**

1. **Downsampling**
    - Reduce majority class samples
    - Example: 1000 datapoints with 900 Yes and 100 No
    - Imbalance ratio: 900:100 = 9:1
    - After downsampling: 100 Yes to 100 No → 1:1 ratio

2. **Upsampling**
    - Increase minority class samples
    - Useful when you cannot afford to lose majority class data

## Example 

### Binary Classification → Class Distribution (Yes/No)
- Total datapoints: 1000
- Yes: 900
- No: 100
- Imbalance Ratio: 9:1
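
The ratio above can be computed directly from the labels. The following is a minimal sketch, assuming a hypothetical list `labels` that mirrors the 900/100 split in this example; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical labels mirroring the example: 900 "Yes" and 100 "No".
labels = ["Yes"] * 900 + ["No"] * 100

counts = Counter(labels)
majority_count = max(counts.values())
minority_count = min(counts.values())

# Imbalance ratio = majority count / minority count.
imbalance_ratio = majority_count / minority_count
# Share of the whole dataset that belongs to the minority class.
minority_share = minority_count / len(labels)

print(f"Imbalance ratio: {imbalance_ratio:.0f}:1")
print(f"Minority share: {minority_share:.0%}")
```

Reporting the minority share alongside the ratio can help when comparing datasets of different sizes.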

## Workflow

1. Start with imbalanced dataset (1000 datapoints)
2. Apply sampling technique (downsampling or upsampling)
3. Train your model on balanced data
4. Evaluate on a held-out test set that keeps the original distribution (split it off before resampling, so duplicated rows cannot leak into it)
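
The workflow steps can be sketched end to end as follows. This is one possible arrangement under stated assumptions, not a definitive pipeline: the synthetic data, the 80/20 stratified split, the choice of upsampling, and `LogisticRegression` as the model are all illustrative choices that do not come from the notes above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)

# 1. Start with an imbalanced dataset (900 majority rows, 100 minority rows).
df = pd.DataFrame({
    "feature_1": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "feature_2": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "target": [0] * 900 + [1] * 100,
})

# Split BEFORE resampling so the test set keeps the original 9:1 distribution.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# 2. Apply a sampling technique to the training data only (upsampling here).
train_majority = train_df[train_df["target"] == 0]
train_minority = train_df[train_df["target"] == 1]
train_minority_up = resample(
    train_minority,
    replace=True,                    # sample with replacement
    n_samples=len(train_majority),   # match the majority class size
    random_state=42,
)
train_balanced = pd.concat([train_majority, train_minority_up], ignore_index=True)

# 3. Train the model on the balanced training data.
features = ["feature_1", "feature_2"]
model = LogisticRegression()
model.fit(train_balanced[features], train_balanced["target"])

# 4. Evaluate on the untouched test set.
predictions = model.predict(test_df[features])
print(classification_report(test_df["target"], predictions))
```

Resampling only the training portion means the reported metrics reflect the class distribution the model would see in practice.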

---

**Note**: The choice between downsampling and upsampling depends on:
- Dataset size
- Computational resources
- Whether losing data is acceptable
- Class importance in your use case

## Creating Synthetic Samples

Generate a synthetic imbalanced dataset to work with.

In [1]:
import numpy as np
import pandas as pd

np.random.seed(42)  # Fix the seed so random operations give the same results on every run

# Dataframe with two classes: 'A' (majority) and 'B' (minority)
n_samples = 1000
class_A_Ratio = 0.9
n_class_A = int(n_samples * class_A_Ratio)
n_class_B = n_samples - n_class_A

In [2]:
n_class_A , n_class_B

(900, 100)

In [5]:
class_A = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_A),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_A),
    'target': [0] * n_class_A
})

class_B = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_B),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_B),
    'target': [1] * n_class_B
})

# Combine both classes to create imbalanced dataset

# method 1
imbalanced_df = pd.concat([class_A, class_B], ignore_index=True)    # ignore_index=True is used to reset the index of the combined DataFrame.
# method 2
# imbalanced_df = pd.concat([class_A, class_B]).reset_index(drop=True)    # reset_index(drop=True) is used to reset the index of the combined DataFrame, dropping the old index.


imbalanced_df.head()

|   | feature_1 | feature_2 | target |
|---|-----------|-----------|--------|
| 0 | -0.863494 | -0.391877 | 0 |
| 1 | -0.031203 | -1.017764 | 0 |
| 2 | 0.018017 | -1.027404 | 0 |
| 3 | 0.47263 | -0.373268 | 0 |
| 4 | -1.366858 | 0.644518 | 0 |


In [7]:
imbalanced_df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

## **Upsampling**

In [8]:
df_majority = imbalanced_df[imbalanced_df['target'] == 0]
df_minority = imbalanced_df[imbalanced_df['target'] == 1]

In [9]:
from sklearn.utils import resample      
# resample is a utility function from scikit-learn that allows for easy upsampling or downsampling of datasets.

In [10]:
Upsampled_minority = resample(df_minority, replace=True,    # sample with replacement => allows the same data point to be selected multiple times.
                                n_samples=len(df_majority),    # match number of majority class
                                random_state=42)    # reproducibility

This increases the number of minority-class samples by randomly duplicating existing rows until the minority class is as large as the majority class.
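
The duplication can be made visible by counting distinct index labels after resampling. This is a small sketch on a hypothetical stand-in DataFrame named `minority` (100 rows), not the notebook's own `df_minority`; it relies on `resample` keeping the original index labels of the rows it draws.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical minority class with 100 distinct rows (stand-in for df_minority).
minority = pd.DataFrame({"feature_1": range(100), "target": [1] * 100})

upsampled = resample(
    minority,
    replace=True,     # the same row may be drawn more than once
    n_samples=900,    # size of the majority class in this example
    random_state=42,
)

# Repeated index labels mean the same original row was drawn again.
n_distinct = upsampled.index.nunique()
n_duplicates = len(upsampled) - n_distinct

print("rows after upsampling:", len(upsampled))
print("distinct original rows:", n_distinct)
print("duplicated draws:", n_duplicates)
```

Since only 100 distinct rows exist, at least 800 of the 900 upsampled rows must be repeats, which is why upsampling adds no new information and can encourage overfitting to the minority examples.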

Now let's check the shape of the upsampled minority class to confirm the changes.

In [12]:
Upsampled_minority.shape       

(900, 3)

Concatenate the upsampled minority class with the original majority class to create a new balanced dataset.

In [None]:
df_balanced = pd.concat([df_majority, Upsampled_minority], ignore_index=True)

In [15]:
df_balanced['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

## **Downsampling**

Downsampling is often discouraged because it discards majority-class data that may contain patterns and information the model could learn from.

However, when the dataset is very large and computational resources are limited, downsampling can be a practical way to balance the classes quickly, often with little loss in model performance.

In [19]:
from sklearn.utils import resample      
# resample is a utility function from scikit-learn that allows for easy upsampling or downsampling of datasets.

In [21]:
Downsampled_majority = resample(df_majority, replace=False,    # sample without replacement => each kept row is selected at most once.
                                n_samples=len(df_minority),    # match number of minority class
                                random_state=42)    # reproducibility

In [22]:
Downsampled_majority.shape

(100, 3)

In [27]:
df_balanced_Downsampled = pd.concat([df_minority, Downsampled_majority], ignore_index=True)

In [28]:
df_balanced_Downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64
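
The downsampling steps can also be wrapped in a reusable helper. The sketch below swaps in a different technique from the cells above: it uses pandas `GroupBy.sample` (available in pandas 1.1 and later) instead of `sklearn.utils.resample`. The function name `downsample_to_smallest_class` and the `toy` DataFrame are hypothetical and exist only for illustration.

```python
import pandas as pd


def downsample_to_smallest_class(df, target_col, random_state=42):
    """Randomly keep as many rows per class as the smallest class has."""
    smallest = df[target_col].value_counts().min()
    # Draw `smallest` rows from each class, without replacement (the default).
    balanced = df.groupby(target_col).sample(n=smallest, random_state=random_state)
    # Shuffle so rows are not grouped by class, then rebuild a clean index.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Hypothetical toy data with the same 900/100 split used in this notebook.
toy = pd.DataFrame({"feature_1": range(1000), "target": [0] * 900 + [1] * 100})

balanced = downsample_to_smallest_class(toy, "target")
print(balanced["target"].value_counts())
```

Because every class is reduced to the size of the smallest one, the same helper also covers problems with more than two classes.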