In [1]:
!pip install numpy pandas scikit-learn





In [2]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Sizes of the two classes: 90% class 0, 10% class 1
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [3]:
n_class_0, n_class_1

(900, 100)

In [4]:
# Create one DataFrame per class; together they form the imbalanced dataset
# Majority class: both features drawn from N(0, 1)
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

# Minority class: both features drawn from N(2, 1)
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [6]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [7]:
df.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295   -0.25275       0
4    -0.5786  -0.292004       0


In [9]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64

In [10]:
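df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [14]:
from sklearn.utils import resample

# Upsample the minority class by drawing rows with replacement
# until it has as many rows as the majority class
df_minority_upsample = resample(
    df_minority,
    replace=True,                # sample with replacement
    n_samples=len(df_majority),  # match the majority class size
    random_state=42,
)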


In [13]:
df_minority_upsample.shape

(900, 3)

In [15]:
df_minority_upsample.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992    2.19657   1.397425       1
914    1.93217   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [18]:
df_upsample = pd.concat([df_majority, df_minority_upsample])

In [19]:
df_upsample['target'].value_counts()

0    900
1    900
Name: target, dtype: int64

## Handling Imbalanced Datasets Summary

**What is Class Imbalance?**
- Class imbalance occurs when one class has significantly more samples than others
- Common in real-world datasets (fraud detection, medical diagnosis, etc.)
- Can lead to biased models that favor the majority class

**Problems with Imbalanced Datasets:**
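1. Models tend to predict the majority class far more often than the minority class
2. Poor performance on the minority class (low recall)
3. Accuracy becomes misleading: on a 90/10 split like the one above, always predicting the majority class already scores 90%
4. Evaluation is biased when the chosen metric or validation split underrepresents the minority class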
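
To make the accuracy problem concrete, here is a minimal sketch using the same 900/100 split as this notebook. The "model" is a hypothetical baseline that ignores its input and always predicts class 0; it is not something trained in the cells above.

```python
import numpy as np

# Labels with the same 900/100 split used in this notebook
y_true = np.array([0] * 900 + [1] * 100)

# Hypothetical baseline: always predict the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good...
accuracy = (y_pred == y_true).mean()

# ...but recall on the minority class (share of true 1s that were found) is zero
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy        = {accuracy:.2f}")         # 0.90
print(f"minority recall = {minority_recall:.2f}")  # 0.00
```

The baseline never finds a single minority sample, yet its accuracy is 90%, which is why accuracy alone should not be trusted here.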

**Techniques to Handle Imbalance:**

### 1. Resampling Methods:

#### Oversampling:
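- **Random Oversampling**: Duplicate minority class samples by drawing them with replacement (the approach used in the cells above)
- **SMOTE**: Generate synthetic minority samples by interpolating between a minority sample and its nearest minority neighbors
- **ADASYN**: Adaptive synthetic sampling that generates more synthetic samples around minority points that are harder to learn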
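
The interpolation idea behind SMOTE can be sketched with NumPy alone. This is a simplified illustration, not the reference implementation: `smote_sketch` is a hypothetical helper name, the brute-force neighbor search only suits small datasets, and the minority data below is simulated rather than taken from `df`. In practice a maintained library implementation is usually the better choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority points and their neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Brute-force pairwise Euclidean distances between minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # For each synthetic sample: pick a base point and one of its k neighbors
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between them
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Simulated minority class: 100 points around (2, 2), as in this notebook
rng = np.random.default_rng(123)
X_minority = rng.normal(loc=2, scale=1, size=(100, 2))

# 800 synthetic points would bring the minority class up to 900
X_synthetic = smote_sketch(X_minority, n_new=800)
print(X_synthetic.shape)  # (800, 2)
```

Unlike random oversampling, the new rows are not exact duplicates: each one lies on a line segment between two existing minority samples.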

#### Undersampling:
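- **Random Undersampling**: Remove randomly chosen majority class samples
- **Tomek Links**: Remove majority samples that form a nearest-neighbor pair with a minority sample, which cleans up the class boundary
- **Edited Nearest Neighbors**: Remove samples whose class disagrees with the majority of their nearest neighbors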
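
Random undersampling is the mirror image of the upsampling done earlier: instead of growing the minority class, the majority class is shrunk. The sketch below uses `DataFrame.sample` on a small simulated frame; `df_demo`, `majority_down`, and `df_down` are illustrative names, and `sklearn.utils.resample` with `replace=False` could be used in the same role.

```python
import numpy as np
import pandas as pd

# Simulated imbalanced frame with the same 900/100 split as this notebook
rng = np.random.default_rng(123)
df_demo = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})

majority = df_demo[df_demo["target"] == 0]
minority = df_demo[df_demo["target"] == 1]

# Keep only as many majority rows as there are minority rows (no replacement)
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)

df_down = pd.concat([majority_down, minority]).reset_index(drop=True)
print(df_down["target"].value_counts())
```

The result is balanced at 100 rows per class, at the cost of discarding 800 majority rows, so this tends to suit cases where the majority class is large enough to spare the data.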

### 2. Algorithm-Level Approaches:
- **Cost-Sensitive Learning**: Assign different costs to different classes
- **Threshold Tuning**: Adjust classification threshold
- **Ensemble Methods**: Use balanced bagging or boosting
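
Two of these approaches can be illustrated without training a model. The first part computes "balanced" class weights with the heuristic `n_samples / (n_classes * count_per_class)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The second part sweeps the decision threshold over a set of scores and keeps the one with the best F1. The scores are simulated and `f1_at` is a hypothetical helper, so treat this as a sketch of the idea rather than a tuned procedure.

```python
import numpy as np

y = np.array([0] * 900 + [1] * 100)

# --- Cost-sensitive learning: weight each class inversely to its frequency ---
counts = np.bincount(y)                           # [900, 100]
class_weights = len(y) / (len(counts) * counts)   # roughly [0.556, 5.0]
print(class_weights)

# --- Threshold tuning: choose the cut-off that maximizes F1 ---
def f1_at(threshold, y_true, scores):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Simulated model scores (assumption): positives tend to score higher
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(loc=0.2 + 0.4 * y, scale=0.15), 0, 1)

thresholds = np.linspace(0.05, 0.95, 19)
f1_scores = [f1_at(t, y, scores) for t in thresholds]
best = thresholds[int(np.argmax(f1_scores))]
print(f"best threshold = {best:.2f}, F1 = {max(f1_scores):.3f}")
```

The threshold should be chosen on a validation split rather than on the test data, otherwise the reported score is optimistic.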

### 3. Evaluation Metrics:
- **Precision**: TP / (TP + FP)
- **Recall**: TP / (TP + FN)
- **F1-Score**: 2 × (Precision × Recall) / (Precision + Recall)
- **ROC-AUC**: Area under ROC curve
- **PR-AUC**: Area under Precision-Recall curve
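
The first three formulas can be checked by hand on a small example. The predictions below are made up for illustration (60 of the 100 positives found, 20 false alarms among the 900 negatives), and `binary_metrics` is a hypothetical helper name.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 900 negatives followed by 100 positives
y_true = np.array([0] * 900 + [1] * 100)
# Made-up predictions: FP = 20, TN = 880, TP = 60, FN = 40
y_pred = np.array([1] * 20 + [0] * 880 + [1] * 60 + [0] * 40)

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision = {precision:.3f}")  # 60 / 80  = 0.750
print(f"recall    = {recall:.3f}")     # 60 / 100 = 0.600
print(f"f1        = {f1:.3f}")         # 0.667
```

Accuracy for these predictions is 94%, while recall shows that 40% of the minority class is still being missed.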

**Best Practices:**
1. Use appropriate evaluation metrics
2. Consider the business cost of different types of errors
3. Try multiple techniques and compare results
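4. Validate on unseen data, and resample only the training split so duplicated or synthetic samples never leak into the test set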
5. Consider the computational cost of resampling
