# 1. Understanding the Problem
Suppose you have a dataset with:

* 90% of samples from class 0
* 10% of samples from class 1

A model trained on such data tends to favor the majority class, which leads to poor recall or precision for the minority class.
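To see why this matters, here is a minimal sketch using hypothetical labels (1,000 samples in the 90/10 split described above, not the dataset generated later). A trivial "model" that always predicts class 0 looks accurate while never finding a single minority sample.

```python
import numpy as np

# Hypothetical labels with the 90/10 split described above
y_true = np.array([0] * 900 + [1] * 100)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall for class 1: fraction of true class-1 samples predicted as class 1
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy:        {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy comes out at 0.90 and the minority recall at 0.00, which is the failure mode the techniques in this notebook try to avoid.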

Let's simulate an imbalanced dataset to work with.

In [None]:
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

# Create imbalanced dataset
X, y = make_classification(n_samples=5000, n_features=10, n_informative=2,
                           n_redundant=2, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)

print(pd.Series(y).value_counts())

0    4472
1     528
Name: count, dtype: int64


## 2. Data-Level Techniques

## (a) Random Undersampling

Randomly removes samples from the majority class until the classes are balanced. This is fast, but it discards potentially useful majority-class data.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print(pd.Series(y_res).value_counts())

0    528
1    528
Name: count, dtype: int64


## (b) Random Oversampling

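Randomly duplicates minority-class samples (sampling with replacement) until the classes are balanced. No data is discarded, but exact duplicates can encourage overfitting.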

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)

print(pd.Series(y_res).value_counts())

0    4472
1    4472
Name: count, dtype: int64
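Both random resamplers come down to choosing row indices. The following is a rough NumPy sketch of that idea on hypothetical toy data (`X_toy`, `y_toy` are made up for illustration). It is not how imbalanced-learn implements the samplers, only the underlying mechanism.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 90 majority (0) and 10 minority (1) samples
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = rng.normal(size=(100, 2))

maj_idx = np.flatnonzero(y_toy == 0)
min_idx = np.flatnonzero(y_toy == 1)

# Undersampling: keep a random subset of the majority, without replacement
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep_maj, min_idx])

# Oversampling: draw extra minority rows with replacement (exact duplicates)
extra_min = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
over_idx = np.concatenate([maj_idx, min_idx, extra_min])

X_under, y_under = X_toy[under_idx], y_toy[under_idx]
X_over, y_over = X_toy[over_idx], y_toy[over_idx]

print("undersampled:", np.bincount(y_under))
print("oversampled: ", np.bincount(y_over))
```

Undersampling leaves 10 samples per class and oversampling leaves 90 per class, mirroring the 528/528 and 4472/4472 counts shown above.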


## (c) SMOTE (Synthetic Minority Oversampling Technique)

Generates synthetic minority-class samples (not duplicates) by interpolating between a minority sample and one of its nearest minority-class neighbors.

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print(pd.Series(y_res).value_counts())

0    4472
1    4472
Name: count, dtype: int64
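The interpolation step can be illustrated with a short NumPy sketch. The helper `smote_like_sample` and the toy points `X_min` are hypothetical, and the code is a simplification of the idea rather than imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2D
X_min = rng.normal(loc=2.0, scale=0.5, size=(20, 2))

def smote_like_sample(X_min, k=5, rng=rng):
    """Create one synthetic point by interpolating toward a near neighbor."""
    i = rng.integers(len(X_min))
    # Distances from the chosen point to every minority point
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    # Indices of the k nearest neighbors, skipping the point itself
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    # Random position on the segment between the point and its neighbor
    lam = rng.random()
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_syn = np.array([smote_like_sample(X_min) for _ in range(30)])
print(X_syn.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.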


## (d) ADASYN (Adaptive Synthetic Sampling)

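A variant of SMOTE that generates more synthetic samples around minority points that are harder to learn, meaning those whose neighborhoods contain many majority-class samples. As a result, the class counts end up only approximately balanced.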

In [None]:
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X, y)

print(pd.Series(y_res).value_counts())

0    4472
1    4458
Name: count, dtype: int64
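The adaptive part of ADASYN is the way it decides how many synthetic samples each minority point receives. The sketch below shows that allocation on hypothetical 2D data, following the general scheme from the ADASYN paper. The variable names and data are assumptions for illustration, and rounding of this kind is one plausible reason the counts above are not exactly equal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data: 200 majority points and 20 overlapping minority points
X_maj = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_min = rng.normal(loc=1.5, scale=1.0, size=(20, 2))
X_all = np.vstack([X_maj, X_min])
y_all = np.array([0] * len(X_maj) + [1] * len(X_min))

k = 5
n_to_generate = len(X_maj) - len(X_min)  # synthetic samples needed for balance

# For each minority point: share of majority samples among its k nearest neighbors
ratios = np.empty(len(X_min))
for n, x in enumerate(X_min):
    dists = np.linalg.norm(X_all - x, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    ratios[n] = np.mean(y_all[neighbors] == 0)

# Normalize to a distribution, then allocate synthetic samples proportionally
weights = ratios / ratios.sum()
per_point = np.rint(weights * n_to_generate).astype(int)

print("requested:", n_to_generate, "allocated:", per_point.sum())
```

Minority points surrounded by majority samples get the largest allocations, while points deep inside minority territory may get none.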


## (e) Combine Over + Under Sampling

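Combines both approaches: SMOTE first oversamples the minority class, then Edited Nearest Neighbours (ENN) removes samples whose neighbors belong to the other class. The result has cleaner class boundaries, but the classes are only approximately balanced.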

In [None]:
from imblearn.combine import SMOTEENN

smote_enn = SMOTEENN(random_state=42)
X_res, y_res = smote_enn.fit_resample(X, y)

print(pd.Series(y_res).value_counts())

1    4400
0    3564
Name: count, dtype: int64
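The counts above are neither equal nor as large as after plain SMOTE, because the cleaning step removes samples from both classes. A simplified sketch of an Edited Nearest Neighbours style rule is shown below on hypothetical toy data. It assumes a "keep only if all k neighbors agree" rule, and imbalanced-learn's implementation has options that are not modelled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced 2D data with overlapping classes
X_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_b = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
X_toy = np.vstack([X_a, X_b])
y_toy = np.array([0] * 100 + [1] * 100)

k = 3
keep = np.ones(len(X_toy), dtype=bool)
for i, x in enumerate(X_toy):
    dists = np.linalg.norm(X_toy - x, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    # Simplified rule: drop a sample unless all k neighbors share its label
    keep[i] = np.all(y_toy[neighbors] == y_toy[i])

X_clean, y_clean = X_toy[keep], y_toy[keep]
print("before:", np.bincount(y_toy), "after:", np.bincount(y_clean))
```

Samples in the overlap region are the ones that get dropped, which is why the cleaned dataset is smaller and no longer perfectly balanced.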


## 3. Algorithm-Level Techniques

## (a) Change Class Weights

Many algorithms (such as Logistic Regression, Random Forest, and SVM) let you specify class weights, so that errors on the minority class cost more during training.

Example with Logistic Regression:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.85      0.91      1335
           1       0.41      0.84      0.55       165

    accuracy                           0.85      1500
   macro avg       0.69      0.85      0.73      1500
weighted avg       0.92      0.85      0.87      1500
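For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the training-set class counts, which are derived here by subtracting the test supports in the report from the full-dataset counts (an inference from the outputs above, not something the notebook prints).

```python
import numpy as np

# Training-set class counts, derived from the outputs above:
# 4472 - 1335 = 3137 samples of class 0 and 528 - 165 = 363 of class 1
counts = np.array([3137, 363])

n_samples = counts.sum()
n_classes = len(counts)

# Heuristic scikit-learn documents for class_weight='balanced'
weights = n_samples / (n_classes * counts)

for label, w in enumerate(weights):
    print(f"class {label}: weight {w:.3f}")
```

Class 1 ends up weighted roughly 8.6 times more heavily than class 0 (about 4.821 versus 0.558), which matches the ratio of the class counts.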



## (b) Use Ensemble Methods

Some ensembles handle imbalance internally:

* Balanced Random Forest

* EasyEnsembleClassifier

In [None]:
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# Balanced Random Forest: each tree is trained on a balanced bootstrap sample
brf = BalancedRandomForestClassifier(random_state=42)
brf.fit(X_train, y_train)
# Note: score() returns plain accuracy
print("Balanced RF:", brf.score(X_test, y_test))

# Easy Ensemble: boosted learners trained on balanced random subsets
eec = EasyEnsembleClassifier(random_state=42)
eec.fit(X_train, y_train)
print("Easy Ensemble:", eec.score(X_test, y_test))

Balanced RF: 0.93
Easy Ensemble: 0.876
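What these ensembles have in common is that each base learner is trained on its own balanced subset. The sketch below illustrates only that sampling idea, using hypothetical toy labels. The real classifiers differ in detail (for example in how they bootstrap), so treat this as a simplification.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy labels: 900 majority (0) and 100 minority (1) samples
y_toy = np.array([0] * 900 + [1] * 100)
maj_idx = np.flatnonzero(y_toy == 0)
min_idx = np.flatnonzero(y_toy == 1)

n_estimators = 10
subsets = []
for _ in range(n_estimators):
    # Each learner sees every minority sample plus an equally sized majority draw
    maj_draw = rng.choice(maj_idx, size=len(min_idx), replace=False)
    subsets.append(np.concatenate([maj_draw, min_idx]))

# Share of majority samples that at least one learner has seen
seen = np.unique(np.concatenate(subsets))
coverage = np.isin(maj_idx, seen).mean()
print(f"majority samples used by at least one learner: {coverage:.0%}")
```

A single undersampled dataset would use only about 11% of the majority class in this setup. Spreading the draws across several learners recovers much of the information that one-off undersampling throws away.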


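## 4. Evaluation Metrics for Imbalanced Data

Accuracy is misleading on imbalanced data: always predicting class 0 would already score about 90% here. Use the confusion matrix, per-class precision, recall, and F1, and ROC-AUC instead: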

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
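# Caution: y_pred holds hard 0/1 labels, so this value equals balanced accuracy.
# For a threshold-independent ROC-AUC, pass model.predict_proba(X_test)[:, 1] instead.
print("ROC-AUC:", roc_auc_score(y_test, y_pred))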

[[1135  200]
 [  26  139]]
              precision    recall  f1-score   support

           0       0.98      0.85      0.91      1335
           1       0.41      0.84      0.55       165

    accuracy                           0.85      1500
   macro avg       0.69      0.85      0.73      1500
weighted avg       0.92      0.85      0.87      1500

ROC-AUC: 0.8463057541709227
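The class 1 figures in the report can be reproduced from the confusion matrix alone. The sketch below does that, and it also shows that the ROC-AUC value printed above coincides with balanced accuracy. That is what `roc_auc_score` reduces to when it is given hard 0/1 predictions rather than probabilities or decision scores.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[1135, 200],
               [26, 139]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate for class 1
specificity = tn / (tn + fp)   # recall for class 0
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"precision (class 1): {precision:.2f}")
print(f"recall (class 1):    {recall:.2f}")
print(f"F1 (class 1):        {f1:.2f}")
print(f"balanced accuracy:   {balanced_accuracy:.4f}")
```

This gives 0.41, 0.84, and 0.55 for class 1, matching the report, and a balanced accuracy of 0.8463, matching the printed ROC-AUC.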
