# 📌 Imbalanced Dataset in Machine Learning

## 🔹 Understanding Imbalanced Data
In machine learning, an **imbalanced dataset** occurs when one class significantly outnumbers the other in a classification problem.

### **Example of an Imbalanced Dataset**
Imagine we have a binary classification dataset where we are predicting **'yes'** or **'no'** responses.

- **Total Data Points** = 1000  
- **Class Distribution:**
  - **900** → "Yes" ✅  
  - **100** → "No" ❌  
- **Ratio** = 900:100 → **9:1 imbalance**  

## ⚠️ Problem with Imbalanced Data
When training a machine learning model on an imbalanced dataset, it tends to **favor the majority class** (900 "Yes") and underperforms in predicting the minority class (100 "No").  
This leads to **biased predictions** and poor recall for the minority class.
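
The effect is easy to reproduce without training anything. The sketch below uses a hypothetical "model" that always predicts the majority class on the 900/100 example above; the variable names are illustrative only.

```python
import numpy as np

# Labels matching the 900 "Yes" / 100 "No" example (1 = "Yes", 0 = "No").
y_true = np.array([1] * 900 + [0] * 100)

# Hypothetical baseline that ignores its input and always predicts "Yes".
y_pred = np.ones_like(y_true)

# Overall accuracy: fraction of predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall for "No": fraction of true "No" rows that were predicted as "No".
no_mask = y_true == 0
recall_no = (y_pred[no_mask] == 0).mean()

print(f"accuracy = {accuracy:.2f}")          # accuracy = 0.90
print(f"recall for 'No' = {recall_no:.2f}")  # recall for 'No' = 0.00
```

Accuracy looks strong at 90% even though not a single "No" is detected, which is why accuracy alone is a misleading metric on imbalanced data.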

---

## 🎯 Solutions to Handle Imbalanced Data
There are two major techniques to handle imbalanced datasets:  

### **1️⃣ Up-Sampling (Oversampling)**
- **Goal:** Increase the number of samples in the minority class by duplicating or generating synthetic samples.  
- **Effect:** Helps balance the dataset by giving the model more examples of the minority class.  
- **Methods:**  
  - Random Over-Sampling  
  - Synthetic Minority Over-sampling Technique (**SMOTE**)  
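
To make the idea behind SMOTE concrete, here is a simplified sketch written with NumPy only. It is not the implementation from any library: the function name `smote_like_oversample`, its parameters, and the stand-in data are assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority point and one of its `k` nearest minority neighbours.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE-style sampler (illustrative sketch, not a library API)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between all minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority point
    pick = rng.integers(0, k, size=n_new)   # one of its k neighbours
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    # New point lies between the base point and the chosen neighbour.
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Stand-in minority class: 100 points with two features.
rng = np.random.default_rng(123)
X_minority = rng.normal(loc=2, scale=1, size=(100, 2))

# 800 synthetic rows bring the minority class from 100 up to 900.
X_synthetic = smote_like_oversample(X_minority, n_new=800)
print(X_synthetic.shape)  # (800, 2)
```

Because every synthetic point is an interpolation between two real minority points, the new rows stay inside the region the minority class already occupies rather than repeating existing rows exactly.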

### **2️⃣ Down-Sampling (Undersampling)**
- **Goal:** Reduce the number of samples in the majority class to balance the dataset.  
- **Effect:** Prevents the model from being biased towards the majority class.  
- **Methods:**  
  - Random Under-Sampling  
  - Cluster Centroids  
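
The cluster-centroid idea can be sketched with scikit-learn's `KMeans`: cluster the majority class into as many clusters as there are minority samples, then keep only the cluster centres. This is a minimal illustration on stand-in data, with assumed variable names, and not a full implementation of any library's Cluster Centroids sampler.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in majority class: 900 points with two features.
rng = np.random.default_rng(123)
X_majority = rng.normal(loc=0, scale=1, size=(900, 2))
n_minority = 100  # target size after down-sampling

# Fit one cluster per minority sample on the majority class only.
kmeans = KMeans(n_clusters=n_minority, n_init=10, random_state=42)
kmeans.fit(X_majority)

# The centroids replace the 900 original majority rows.
X_majority_reduced = kmeans.cluster_centers_
print(X_majority_reduced.shape)  # (100, 2)
```

Note that the centroids are averages of nearby rows, so the reduced majority class consists of summary points rather than original observations.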

---

## 📊 Final Comparison of Class Distributions

| Technique | 'Yes' Count | 'No' Count |
|-----------|------------|-----------|
| **Original Dataset** | 900 | 100 |
| **Oversampling (Up-Sampling)** | 900 | 900 |
| **Undersampling (Down-Sampling)** | 100 | 100 |
| **SMOTE (Synthetic Oversampling)** | 900 | 900 |
| **Random Undersampling** | 100 | 100 |

---

## 🎯 Key Takeaways
✔ **Up-Sampling** (Oversampling) helps **increase** the minority class but may lead to **overfitting**.  
✔ **Down-Sampling** (Undersampling) helps balance the data but may cause **information loss**.  
✔ **SMOTE** is often preferred over basic oversampling because it **generates new synthetic samples** instead of duplicating existing rows, which can reduce the overfitting risk.  
✔ **Choosing the right technique depends on the dataset** and **machine learning model** being used.  
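
One practical consequence of the overfitting risk: resampling is usually applied to the training split only, so that duplicated rows cannot end up in both the training and the test data. The sketch below assumes a small stand-in DataFrame `df` with a `target` column; the names and sizes are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Stand-in imbalanced data: 900 rows of class 0, 100 rows of class 1.
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "feature_1": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "target": [0] * 900 + [1] * 100,
})

# Split first, keeping the class ratio the same in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Up-sample the minority class inside the training split only.
train_major = train_df[train_df["target"] == 0]
train_minor = train_df[train_df["target"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=42
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["target"].value_counts().to_dict())
print(test_df["target"].value_counts().to_dict())
```

The test split keeps its original imbalance, so the evaluation reflects the class mix the model would face in practice.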

---

## 💡 When to Use What?

| Scenario | Recommended Technique |
|----------|----------------------|
| Small dataset with imbalanced classes | **SMOTE** (Synthetic Oversampling) |
| Large dataset with imbalanced classes | **Random Undersampling** |
| No data generation is allowed | **Basic Oversampling/Undersampling** |
| Need to train only on real, unduplicated observations | **Undersampling** |

---

📌 **By understanding and applying these techniques, we can improve how well models trained on imbalanced datasets predict the minority class! 🚀**


In [None]:
import numpy as np
import pandas as pd

# set the random seed for reproducibility
np.random.seed(123)

# create dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [None]:
n_class_0, n_class_1

(900, 100)

In [None]:
## Create the imbalanced DataFrame, starting with class 0 (the majority)
class_0 = pd.DataFrame({
    # Feature 1: Normally distributed with mean 0, std dev 1, n_class_0 samples
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    # Feature 2: Normally distributed with mean 0, std dev 1, n_class_0 samples
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    # Target column: All values are 0 (representing class 0)
    'target': [0]*n_class_0
})

# Create DataFrame for class 1
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1]*n_class_1
})

In [None]:
# Combine class_0 and class_1 DataFrames and reset the index
imbalanced_df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [None]:
imbalanced_df.head()

| | feature_1 | feature_2 | target |
|---|-----------|-----------|--------|
| 0 | -1.085631 | 0.551302 | 0 |
| 1 | 0.997345 | 0.419589 | 0 |
| 2 | 0.282978 | 1.815652 | 0 |
| 3 | -1.506295 | -0.25275 | 0 |
| 4 | -0.5786 | -0.292004 | 0 |


In [None]:
imbalanced_df.tail()

| | feature_1 | feature_2 | target |
|---|-----------|-----------|--------|
| 995 | 1.376371 | 2.845701 | 1 |
| 996 | 2.23981 | 0.880077 | 1 |
| 997 | 1.13176 | 1.640703 | 1 |
| 998 | 2.902006 | 0.390305 | 1 |
| 999 | 2.69749 | 2.01357 | 1 |


In [None]:
imbalanced_df['target'].value_counts()

| target | count |
|--------|-------|
| 0 | 900 |
| 1 | 100 |


In [None]:
# Up-sampling: separate the minority (target == 1) and majority (target == 0) rows
df_minority = imbalanced_df[imbalanced_df['target'] == 1]
df_majority = imbalanced_df[imbalanced_df['target'] == 0]

In [None]:
from sklearn.utils import resample
# resample with replace=True draws minority rows with replacement, so existing rows
# are duplicated until the class matches the majority size; no new points are created
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42)

In [None]:
df_minority_upsampled.shape

(900, 3)

In [None]:
df_minority_upsampled.head()

| | feature_1 | feature_2 | target |
|---|-----------|-----------|--------|
| 951 | 1.125854 | 1.843917 | 1 |
| 992 | 2.19657 | 1.397425 | 1 |
| 914 | 1.93217 | 2.998053 | 1 |
| 971 | 2.272825 | 3.034197 | 1 |
| 960 | 2.870056 | 1.550485 | 1 |


In [None]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [None]:
df_upsampled['target'].value_counts()

| target | count |
|--------|-------|
| 0 | 900 |
| 1 | 900 |
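
A quick check makes it visible that random up-sampling only repeats existing rows. The sketch below rebuilds a stand-in minority class (the values will not match the cells above exactly, since the random draws differ) and counts how many distinct original rows appear among the 900 up-sampled ones.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in minority class: 100 rows, as in the example above.
np.random.seed(123)
df_minority = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=100),
    "feature_2": np.random.normal(loc=2, scale=1, size=100),
    "target": [1] * 100,
})

# Up-sample to 900 rows by drawing with replacement.
df_minority_upsampled = resample(
    df_minority, replace=True, n_samples=900, random_state=42
)

n_rows = len(df_minority_upsampled)
# The index still refers to the original rows, so unique labels = distinct source rows.
n_unique = df_minority_upsampled.index.nunique()
print(n_rows, n_unique)
```

The number of distinct source rows can never exceed 100, so the model sees the same minority examples many times over. This repetition is the source of the overfitting risk mentioned earlier.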


In [None]:
## Down Sampling

# set the random seed for reproducibility
np.random.seed(123)

# create dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

## Create the imbalanced DataFrame, starting with class 0 (the majority)
class_0 = pd.DataFrame({
    # Feature 1: Normally distributed with mean 0, std dev 1, n_class_0 samples
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    # Feature 2: Normally distributed with mean 0, std dev 1, n_class_0 samples
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    # Target column: All values are 0 (representing class 0)
    'target': [0]*n_class_0
})

# Create DataFrame for class 1
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1]*n_class_1
})

# Combine class_0 and class_1 DataFrames and reset the index
imbalanced_df = pd.concat([class_0, class_1]).reset_index(drop=True)

df_minority = imbalanced_df[imbalanced_df['target'] == 1]
df_majority = imbalanced_df[imbalanced_df['target'] == 0]

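# Down-sample the majority class WITHOUT replacement so that no majority row
# is picked more than once
df_majority_downsampled = resample(
    df_majority,
    replace=False,
    n_samples=len(df_minority),
    random_state=42)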

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
df_downsampled['target'].value_counts()

| target | count |
|--------|-------|
| 0 | 100 |
| 1 | 100 |
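
The down-sampling steps can be wrapped in a small reusable helper. The function below is a sketch under stated assumptions: the name `downsample_majority` and its parameters are hypothetical, and it draws without replacement so that every kept row is a distinct original observation.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

def downsample_majority(df, target_col="target", random_state=42):
    """Shrink every class to the size of the smallest one (illustrative helper)."""
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())

    parts = []
    for label, group in df.groupby(target_col):
        if label == minority_label:
            parts.append(group)  # keep the minority class untouched
        else:
            # replace=False: each majority row can be selected at most once
            parts.append(resample(group, replace=False,
                                  n_samples=n_minority,
                                  random_state=random_state))
    return pd.concat(parts)

# Stand-in data with the same 900/100 split as above.
rng = np.random.default_rng(123)
demo_df = pd.DataFrame({
    "feature_1": np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    "target": [0] * 900 + [1] * 100,
})

balanced_df = downsample_majority(demo_df)
print(balanced_df["target"].value_counts().to_dict())
print(balanced_df.index.is_unique)
```

Sampling without replacement matters here: with `replace=True` the 100 selected majority rows could contain repeats, discarding even more information than down-sampling already does.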
