### Handling Imbalanced Datasets

#### 1. Upsampling
#### 2. Downsampling

### Creating an Imbalanced Dataset
We create a simple synthetic imbalanced dataset to see how imbalanced classification works and how upsampling and downsampling are applied later.

### Why Create Synthetic Data?

#### Machine learning requires understanding how models behave with different types of datasets.
- An imbalanced dataset mimics real-world situations:
    - Fraud detection → 99% normal, 1% fraud
    - Medical diagnosis → rare diseases
    - Spam detection → only some emails are spam

#### So here we intentionally create an imbalanced dataset with:
- 90% Class 0
- 10% Class 1


In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility. Without a fixed seed, NumPy's
# random functions return different numbers on every run.
# In machine learning we want consistent, reproducible results on every run.

np.random.seed(123)

# Compute the size of each class for a two-class dataset
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [2]:
n_class_0, n_class_1

(900, 100)
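The effect of the seed described in the comments above can be sketched as follows. This is a minimal illustration, not part of the original notebook; the variable names are hypothetical, and only the seed value 123 is taken from the cell above.

```python
import numpy as np

# Seeding the global generator twice with the same value replays the same draws.
np.random.seed(123)
first_draw = np.random.normal(loc=0, scale=1, size=3)

np.random.seed(123)
second_draw = np.random.normal(loc=0, scale=1, size=3)

same_after_reseed = np.array_equal(first_draw, second_draw)

# Without reseeding, the generator keeps advancing, so the next draw differs.
third_draw = np.random.normal(loc=0, scale=1, size=3)
differs_without_reseed = not np.array_equal(first_draw, third_draw)

print(same_after_reseed, differs_without_reseed)
```

This is also why re-running cells out of order in a notebook can change the generated data: the values depend on how many draws have happened since the last call to np.random.seed.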

### Explanation
#### Class_0
- np.random.normal() draws numbers from a normal distribution.
- loc=0 → the mean is 0
- scale=1 → the standard deviation is 1
- size=n_class_0 → create 900 samples
- So both feature_1 and feature_2 follow a bell-shaped distribution centered at 0.
- The target column is 0 for all 900 rows: [0, 0, 0, ..., 0]
- This represents the majority class.
#### Class_1
- Again we draw data from a normal distribution.
- loc=2 → the center is shifted to 2 instead of 0
- This simulates a different feature pattern for the minority group.
- size=n_class_1 → create 100 samples
- The target column is 1 for all 100 rows: [1, 1, 1, ..., 1]
- This represents the minority class.
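To make the roles of loc and scale concrete, here is a small sketch that draws from both distributions and checks the empirical mean and standard deviation. It assumes a much larger sample size (100,000) than the notebook uses, so that the estimates sit close to the true parameters; the seed and variable names are hypothetical.

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, chosen only for this illustration

# Same parameters as the two classes, but with many more samples so that
# the sample statistics sit close to loc and scale.
majority_like = np.random.normal(loc=0, scale=1, size=100_000)
minority_like = np.random.normal(loc=2, scale=1, size=100_000)

majority_mean = majority_like.mean()
minority_mean = minority_like.mean()
majority_std = majority_like.std()
minority_std = minority_like.std()

print(f"majority-like: mean={majority_mean:.3f}, std={majority_std:.3f}")
print(f"minority-like: mean={minority_mean:.3f}, std={minority_std:.3f}")
```

With only 900 and 100 samples, the sample means will wander further from 0 and 2, but the two classes still overlap only partially because their centers are two standard deviations apart.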


In [4]:
# Create one DataFrame per class; together they form the imbalanced dataset
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

### Combine Class 0 & Class 1 into One Dataset
Now that we have both the majority and minority classes, we combine them into a single DataFrame.

In [5]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)
df.head()


   feature_1  feature_2  target
0  -1.774224   0.285744       0
1  -1.201377   0.333279       0
2   1.096257   0.531807       0
3   0.861037  -0.354766       0
4  -1.520367  -1.120815       0


In [6]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64
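The raw counts can also be expressed as proportions and as an imbalance ratio. The sketch below uses a stand-in Series with the same 900/100 split rather than the notebook's df, so it runs on its own; the variable names are hypothetical.

```python
import pandas as pd

# Stand-in for df['target'] with the same 900/100 split as above.
target = pd.Series([0] * 900 + [1] * 100, name="target")

counts = target.value_counts()
proportions = target.value_counts(normalize=True)  # fractions instead of counts

# Ratio of the largest class to the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print("imbalance ratio:", imbalance_ratio)
```

An imbalance ratio of 9 means there are nine majority samples for every minority sample, which is the gap that upsampling is meant to close.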

## Upsampling: Separating Majority & Minority Classes

Before upsampling, we first separate the dataset into majority and minority classes so that resampling is applied only to the minority class.

#### Why extract the minority class?
- Upsampling means increasing the number of minority samples.
- We isolate them first so that we can later duplicate them or synthetically generate more samples.

#### Once separated, we will:
- Increase (upsample) the minority class
- Until the upsampled minority set is the same size as df_majority
- This lets the model see both classes equally often and reduces bias toward the majority class

In [2]:
# Requires df from the cells above; in a fresh kernel this cell fails with
# NameError: name 'df' is not defined.

# 1. Extract the minority class (target = 1)
df_minority = df[df['target'] == 1]

# 2. Extract the majority class (target = 0)
df_majority = df[df['target'] == 0]

### Upsampling Using sklearn.utils.resample
After separating the minority and majority classes, the next step is to increase the number of minority samples so that both classes are balanced.

#### What does the upsampled data look like?
- df_minority_upsampled will have 900 rows
- Many rows will be duplicates (because replace=True)
- The combined dataset is balanced: 900 rows per class

#### Why Does Upsampling Help?

##### Without upsampling:
- The model learns mostly from the majority class
- Recall for the minority class tends to be poor
- There is a high chance of predicting only class 0

##### With upsampling:
- The dataset is balanced
- The minority class carries as much weight in training as the majority class
- Recall and F1-score for the minority class often improve
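The risk of a model that predicts only class 0 can be illustrated with a small calculation. This is a sketch under the assumption of the same 900/100 split; no model is trained, and the "predictions" are simply all zeros.

```python
import numpy as np

# True labels with the same 900/100 split as the dataset above.
y_true = np.array([0] * 900 + [1] * 100)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall for class 1 = correctly predicted positives / all actual positives.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
actual_positives = np.sum(y_true == 1)
minority_recall = true_positives / actual_positives

print("accuracy:", accuracy)
print("minority recall:", minority_recall)
```

The accuracy looks good even though not a single minority sample is found, which is why recall and F1-score are the more informative metrics on imbalanced data.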

In [1]:
from sklearn.utils import resample  # the package is installed as scikit-learn but imported as sklearn

df_minority_upsampled = resample(
    df_minority,
    replace=True,                # sample with replacement, so the same row can be picked multiple times
    n_samples=len(df_majority),  # match the majority class size; replace=False would fail because there are not enough unique rows
    random_state=42              # fixed seed so the resampling is reproducible
)
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
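A quick way to confirm that upsampling did what was intended is to check the class counts and the number of distinct minority rows afterwards. The sketch below is self-contained: it rebuilds a simplified version of the data (one feature column instead of two, which is an assumption made for brevity) and imports resample from sklearn.utils.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

np.random.seed(123)

# Simplified stand-ins for the notebook's class frames (single feature column).
df_majority = pd.DataFrame({
    "feature_1": np.random.normal(loc=0, scale=1, size=900),
    "target": 0,
})
df_minority = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=100),
    "target": 1,
})

# Draw 900 minority rows with replacement.
df_minority_upsampled = resample(
    df_minority,
    replace=True,
    n_samples=len(df_majority),
    random_state=42,
)
df_upsampled = pd.concat([df_minority_upsampled, df_majority]).reset_index(drop=True)

class_counts = df_upsampled["target"].value_counts()

# The upsampled rows keep their original index labels, so counting distinct
# labels shows how many different source rows were actually reused.
n_unique_minority_rows = df_minority_upsampled.index.nunique()

print(class_counts)
print("distinct minority rows used:", n_unique_minority_rows)
```

Both classes end up with 900 rows, but at most 100 of the minority rows are distinct; the rest are exact copies, so no new information has been added to the dataset.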

In [11]:
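# Check which Python interpreter the notebook is running and whether
# scikit-learn is installed in that environment.
import sys
print(sys.executable)
!{sys.executable} -m pip show scikit-learn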

c:\Users\LAPTOP\Python\venv\python.exe


