**Before you dive into the implementations, I highly recommend first learning the heart of each algorithm—its core idea and how it works. You can explore this through YouTube tutorials, books, or online courses. This repository is meant to complement that knowledge by showing how to translate concepts into working code.**

# Handling Unbalanced Datasets

This project focuses on addressing the challenges of **unbalanced datasets** in machine learning. Unbalanced datasets occur when one class significantly outnumbers the other(s), leading to biased models that perform poorly on minority classes. This notebook demonstrates various techniques to handle unbalanced datasets, including **undersampling**, **oversampling**, **SMOTE**, **ensemble methods**, and **Focal Loss**.


## **Techniques Covered**

### **1. Undersampling**
- **What it does**: Reduces the number of instances in the majority class to balance the dataset.
- **Methods**:
  - Random Undersampling
  - Cluster Centroids
  - Tomek Links
- **Pros**: Simple and reduces computational cost.
- **Cons**: May lead to loss of important information.
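
Random undersampling, the simplest of the methods listed above, can be sketched in a few lines of NumPy. This is an illustration only: `random_undersample` is a hypothetical helper, not the imbalanced-learn API, and it assumes `X` and `y` are NumPy arrays.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep an equally sized random subset of every class (size = smallest class)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data: 95 majority rows and 5 minority rows.
X_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_undersample(X_demo, y_demo)
print(np.bincount(y_bal))  # [5 5]
```

Every minority row survives, while 90 of the 95 majority rows are discarded, which is where the information loss comes from.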

### **2. Oversampling**
- **What it does**: Increases the number of instances in the minority class to balance the dataset.
- **Methods**:
  - Random Oversampling
  - SMOTE (Synthetic Minority Oversampling Technique)
  - ADASYN (Adaptive Synthetic Sampling)
- **Pros**: Improves minority class representation.
- **Cons**: Can lead to overfitting, especially when minority samples are simply duplicated.

### **3. SMOTE (Synthetic Minority Oversampling Technique)**
- **What it does**: Generates synthetic samples for the minority class by interpolating between existing samples.
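- **Formula**: $x_{\text{new}} = x_i + \lambda \cdot (x_j - x_i)$, where $x_i$ is a minority-class sample, $x_j$ is one of its $k$ nearest minority-class neighbors, and $\lambda$ is drawn uniformly from $[0, 1]$.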
- **Pros**: Reduces overfitting compared to random oversampling.
- **Cons**: Can create noisy samples.
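
The interpolation formula above can be sketched directly in NumPy. This is a simplified illustration, not the imbalanced-learn implementation: `smote_like_samples` is a hypothetical name, neighbors are found by brute-force distances, and `X_min` is assumed to hold only minority-class rows.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward random nearest neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors
    i = rng.integers(0, n, size=n_new)                  # base samples x_i
    j = neighbors[i, rng.integers(0, k, size=n_new)]    # one random neighbor x_j each
    lam = rng.random((n_new, 1))                        # lambda in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min_demo = np.random.default_rng(1).normal(size=(20, 2))
X_syn = smote_like_samples(X_min_demo, n_new=50)
print(X_syn.shape)  # (50, 2)
```

Each synthetic point lies on the segment between two real minority samples, so it never leaves the bounding box of the minority class.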

### **4. Ensemble Methods**
- **What it does**: Combines multiple models to improve performance on minority classes.
- **Methods**:
  - Balanced Random Forest
  - EasyEnsemble
  - XGBoost with `scale_pos_weight`
- **Pros**: Robust and effective for imbalanced datasets.
- **Cons**: Computationally expensive.
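
Balanced Random Forest and EasyEnsemble share one idea: each ensemble member is trained on a class-balanced subset of the data. A minimal sketch of that idea follows. It is an assumption-laden illustration: the base learner is a hypothetical nearest-centroid classifier instead of trees or boosting, the minority class is assumed to be labeled 1, and none of the function names belong to a real library.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class (0 and 1)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_centroids(centroids, X):
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def balanced_bagging_predict(X_fit, y_fit, X_new, n_members=11, seed=0):
    """Majority vote over members, each fit on all minority rows plus an
    equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y_fit == 1)
    idx_maj = np.flatnonzero(y_fit == 0)
    votes = np.zeros(len(X_new))
    for _ in range(n_members):
        sub_maj = rng.choice(idx_maj, size=len(idx_min), replace=False)
        idx = np.concatenate([idx_min, sub_maj])
        centroids = fit_centroids(X_fit[idx], y_fit[idx])
        votes += predict_centroids(centroids, X_new)
    return (votes > n_members / 2).astype(int)

# Toy data: 500 majority points around (0, 0), 25 minority points around (5, 5).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
                   rng.normal(5.0, 1.0, size=(25, 2))])
y_toy = np.array([0] * 500 + [1] * 25)
X_query = np.array([[0.0, 0.0], [5.0, 5.0]])
print(balanced_bagging_predict(X_toy, y_toy, X_query))  # [0 1]
```

Because every member sees a different majority subset, the ensemble as a whole still uses most of the majority data, unlike a single undersampled model.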

### **5. Focal Loss**
- **What it does**: A loss function that focuses on hard-to-classify examples and handles class imbalance.
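- **Formula**: $\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability of the true class, $\gamma \ge 0$ down-weights easy examples, and $\alpha_t$ is a class-dependent weight.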
- **Pros**: Improves performance on minority classes.
- **Cons**: Requires tuning of $\gamma$ and $\alpha$ parameters.
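
To see how the formula above shifts attention toward hard examples, it helps to compare focal loss with plain cross-entropy, $-\log(p_t)$, at a few values of $p_t$. The sketch below assumes $\gamma = 2$ and a single fixed $\alpha_t = 0.25$; the function names are illustrative.

```python
import numpy as np

def focal_loss_value(p_t, gamma=2.0, alpha_t=0.25):
    """Focal loss for a given probability of the true class."""
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def cross_entropy_value(p_t):
    return -np.log(p_t)

# Ratio FL / CE equals alpha_t * (1 - p_t) ** gamma.
for p_t in (0.9, 0.5, 0.1):
    ratio = focal_loss_value(p_t) / cross_entropy_value(p_t)
    print(f"p_t={p_t}: FL/CE = {ratio:.4f}")
```

An easy example ($p_t = 0.9$) keeps only 0.25% of its cross-entropy loss, while a hard one ($p_t = 0.1$) keeps about 20%, so hard examples dominate the total loss.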


## **Notebook Overview**

The notebook demonstrates the following techniques:

1. **Undersampling**:
   - Random Undersampling

2. **Oversampling**:
   - Random Oversampling
   - SMOTE

3. **SMOTE and Ensemble Methods**:
   - SMOTE + Logistic Regression
   - XGBoost with `scale_pos_weight`

4. **Focal Loss**:
   - Comparison with Cross-Entropy Loss



In [71]:
#import necessary libraries
from sklearn.datasets import make_classification
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import numpy as np

In [79]:
# create a simple dummy unbalanced dataset using make_classification
# (note: weights=[0.91, 0.01] does not sum to 1; the class counts actually generated are printed below)
X, y = make_classification(
    n_samples=6000, n_classes=2, n_features=2, n_informative=2,
    n_clusters_per_class=1, n_redundant=0, weights=[0.91, 0.01],
    random_state=42,
)
df = pd.DataFrame(X, columns=['f0', 'f1'])
df['t'] = y
print(Counter(y))
df

Counter({0: 5674, 1: 326})


| | f0 | f1 | t |
|---|---|---|---|
| 0 | 1.661717 | -0.439285 | 0 |
| 1 | 0.969619 | 0.424131 | 0 |
| 2 | 1.402264 | -0.195857 | 0 |
| 3 | -0.116644 | -1.692852 | 0 |
| 4 | 1.511872 | -1.183038 | 0 |
| ... | ... | ... | ... |
| 5995 | -0.124438 | -1.009315 | 0 |
| 5996 | 1.254716 | 0.508662 | 0 |
| 5997 | 0.266659 | -0.760996 | 0 |
| 5998 | 1.364727 | -0.779099 | 0 |


In [80]:
#split the data for train and test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
# helper that fits a logistic regression on the training split and returns predictions for the test split
def model_function(X_train, y_train, X_test):
    model = LogisticRegression()
    model.fit(X_train, y_train)
    return model.predict(X_test)

In [81]:
# train, test, and evaluate the baseline model
# (scikit-learn metrics expect the true labels first, then the predictions)
y_pred = model_function(X_train, y_train, X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1138
           1       0.67      0.45      0.54        62

    accuracy                           0.96      1200
   macro avg       0.82      0.72      0.76      1200
weighted avg       0.95      0.96      0.96      1200

[[1124   14]
 [  34   28]]


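As expected, the minority class (1) scores poorly, with an F1 of only 0.54, even though overall accuracy is 0.96. The imbalance lets the model look accurate while handling the minority class badly.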
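
The gap between accuracy and minority-class F1 can be reproduced from the four counts in the confusion matrix above. A small worked sketch, using the identity F1 = 2·TP / (2·TP + FP + FN), which depends only on the total number of errors and not on which axis of the matrix holds the predictions:

```python
# Counts taken from the baseline confusion matrix printed above.
true_pos = 28      # minority samples predicted as minority
true_neg = 1124    # majority samples predicted as majority
errors = 34 + 14   # the two off-diagonal cells (false negatives + false positives)

total = true_pos + true_neg + errors
accuracy = (true_pos + true_neg) / total
f1_minority = 2 * true_pos / (2 * true_pos + errors)

print(f"accuracy    = {accuracy:.2f}")     # 0.96
print(f"minority F1 = {f1_minority:.2f}")  # 0.54
```

Accuracy is dominated by the 1124 correctly classified majority samples, whereas F1 for class 1 ignores them entirely.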

In [82]:
# split the DataFrame into one frame per target class
class_0 = df[df['t']== 0]
class_1 = df[df['t']== 1]
print(class_0.shape)
print(class_1.shape)

(5674, 3)
(326, 3)


In [83]:
# undersample class_0 (5674 samples) down to the size of class_1
class_0_under = class_0.sample(n=class_1.shape[0], random_state=42)
df_new = pd.concat([class_0_under, class_1])

print(df_new['t'].value_counts())


t
0    326
1    326
Name: count, dtype: int64


Now both classes (0 and 1) have the same number of samples.

In [84]:
#train with new balanced data
X1=df_new.drop(columns=['t'])
y1= df_new['t']

X_train1, X_test1, y_train1, y_test1= train_test_split(X1, y1, test_size=0.2, random_state=42, stratify=y1)

y_pred1 = model_function(X_train1, y_train1, X_test1)
# true labels first, then predictions
print(classification_report(y_test1, y_pred1))
print(confusion_matrix(y_test1, y_pred1))


              precision    recall  f1-score   support

           0       0.92      0.85      0.88        66
           1       0.86      0.92      0.89        65

    accuracy                           0.89       131
   macro avg       0.89      0.89      0.89       131
weighted avg       0.89      0.89      0.89       131

[[56 10]
 [ 5 60]]


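The results look much better even with far less data; a simple model such as LogisticRegression can do well on a small dataset. Keep in mind that this test split comes from the undersampled (balanced) data, so the scores are not directly comparable to the baseline above, which was evaluated on the original imbalanced test set.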

In [85]:
# Let's try oversampling class_1 (the minority class) by sampling with replacement
class_1_over = class_1.sample(n=class_0.shape[0], replace=True, random_state= 42)
df_new1 = pd.concat([class_0,class_1_over])

print(df_new1['t'].value_counts())

t
0    5674
1    5674
Name: count, dtype: int64


In [86]:
#train and evaluate the new balanced data
X2=df_new1.drop(columns=['t'])
y2= df_new1['t']

X_train2, X_test2, y_train2, y_test2= train_test_split(X2, y2, test_size=0.2, random_state=42, stratify=y2)

y_pred2 = model_function(X_train2, y_train2, X_test2)
# true labels first, then predictions
print(classification_report(y_test2, y_pred2))
print(confusion_matrix(y_test2, y_pred2))

              precision    recall  f1-score   support

           0       0.92      0.86      0.89      1135
           1       0.87      0.92      0.89      1135

    accuracy                           0.89      2270
   macro avg       0.89      0.89      0.89      2270
weighted avg       0.89      0.89      0.89      2270

[[ 974  161]
 [  90 1045]]


The results are essentially the same as with undersampling; LogisticRegression again performs well.
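
One caveat when splitting after random oversampling: copies of the same original row can end up in both the training and the test split, which tends to inflate test scores. A small simulation of that effect, assuming the class sizes reported above (326 minority rows resampled to 5674) and a plain random 80/20 split instead of the stratified split used in the notebook:

```python
import numpy as np

rng = np.random.default_rng(42)

# One id per original minority row, then oversample with replacement.
minority_ids = np.arange(326)
oversampled = rng.choice(minority_ids, size=5674, replace=True)

# Random 80/20 split of the oversampled rows.
perm = rng.permutation(len(oversampled))
n_test = int(0.2 * len(oversampled))
test_ids = oversampled[perm[:n_test]]
train_ids = oversampled[perm[n_test:]]

# Share of test rows whose original row also appears in the training split.
leaked = np.isin(test_ids, train_ids).mean()
print(f"test rows with a duplicate in training: {leaked:.1%}")
```

A common way to avoid this is to split first and resample only the training split, keeping the test split untouched.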

**SMOTE:** Generates new, synthetic data points by interpolation, making the minority class more diverse and reducing the risk of overfitting.

In [87]:
#Let's try SMOTE
smote= SMOTE(random_state=42)
X_train_fit, y_train_fit= smote.fit_resample(X_train, y_train)

print(pd.Series(y_train_fit).value_counts())

#SMOTE + LogisticRegression
X_train3, X_test3, y_train3, y_test3= train_test_split(X_train_fit, y_train_fit, test_size=0.2, random_state=42, stratify=y_train_fit)

y_pred3 = model_function(X_train3, y_train3, X_test3)
# true labels first, then predictions
print(classification_report(y_test3, y_pred3))
print(confusion_matrix(y_test3, y_pred3))

0    4536
1    4536
Name: count, dtype: int64
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       908
           1       0.86      0.91      0.89       907

    accuracy                           0.88      1815
   macro avg       0.88      0.88      0.88      1815
weighted avg       0.88      0.88      0.88      1815

[[774 134]
 [ 80 827]]




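The results are almost the same as with random oversampling. Note that this evaluation uses a held-out part of the SMOTE-resampled training data rather than the original `X_test`. Let's try ensemble methods.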

In [89]:
# XGBoost with scale_pos_weight
# (the two forest classifiers are imported for experimentation; only XGBoost is run in this cell)
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# weight positive (minority) samples by the negative-to-positive ratio in the training split
ratio = sum(y_train == 0) / sum(y_train == 1)

xgb = XGBClassifier(scale_pos_weight=ratio, random_state=42)

xgb.fit(X_train, y_train)

y_predict = xgb.predict(X_test)

# true labels first, then predictions
print(classification_report(y_test, y_predict))

print(confusion_matrix(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97      1138
           1       0.43      0.55      0.48        62

    accuracy                           0.94      1200
   macro avg       0.70      0.75      0.72      1200
weighted avg       0.95      0.94      0.94      1200

[[1093   45]
 [  28   34]]


Here, XGBoost with `scale_pos_weight` does not perform well on the minority class either (F1 of 0.48 on the original test set); the dataset may simply be too small.

**Focal Loss** is a loss function designed to address class imbalance in classification tasks, particularly in object detection and image segmentation. Let's try it with XGBoost through a custom objective.

In [78]:
# Custom objective for XGBoost: returns per-sample gradients and Hessians for a focal-style loss.
# Note: grad and hess are the cross-entropy derivatives scaled by alpha * (1 - p_t) ** gamma,
# i.e. an approximation that treats the focal factor as a constant.
def focal_loss(gamma=2., alpha=0.25):
    def focal_loss_obj(y_true, y_pred):
        # Sigmoid turns the raw scores (logits) into probabilities
        p = 1.0 / (1.0 + np.exp(-y_pred))

        # Probability assigned to the true class
        p_t = y_true * p + (1 - y_true) * (1 - p)

        # Focal modulating factor; the loss itself would be -focal_weight * np.log(p_t)
        focal_weight = alpha * (1 - p_t) ** gamma

        # Gradient: cross-entropy gradient (p - y_true), scaled by the focal factor
        grad = focal_weight * (p - y_true)

        # Hessian: cross-entropy Hessian p * (1 - p), scaled by the focal factor
        hess = focal_weight * p * (1 - p)

        return grad, hess

    return focal_loss_obj

# Apply Focal Loss with XGBoost classifier
xgb = XGBClassifier(objective=focal_loss(gamma=2., alpha=0.25), random_state=42)

# Fit the model
xgb.fit(X_train, y_train) #unbalanced data
# xgb.fit(X_train_fit, y_train_fit) # balanced data after SMOTE


# Make predictions
y_pred = xgb.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


print(pd.Series(y_train).value_counts())
# print(pd.Series(y_train_fit).value_counts())

              precision    recall  f1-score   support

           0       0.96      0.99      0.98      1138
           1       0.57      0.34      0.42        62

    accuracy                           0.95      1200
   macro avg       0.77      0.66      0.70      1200
weighted avg       0.94      0.95      0.95      1200

[[1122   16]
 [  41   21]]
0    4536
1     264
Name: count, dtype: int64
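
The custom objective above uses the cross-entropy gradient scaled by the focal factor, which leaves out the term that comes from differentiating $(1 - p_t)^\gamma$ itself. A sketch of the exact gradient with respect to the raw score, checked against a finite-difference estimate, is shown below. The function names are illustrative, a single fixed `alpha` is assumed as in the notebook, and the exact Hessian is not covered.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss_from_logit(z, y, gamma=2.0, alpha=0.25):
    p = sigmoid(z)
    p_t = y * p + (1 - y) * (1 - p)
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

def focal_grad_exact(z, y, gamma=2.0, alpha=0.25):
    """Exact d(FL)/dz, including the derivative of the focal factor."""
    p = sigmoid(z)
    p_t = y * p + (1 - y) * (1 - p)
    sign = 2 * y - 1  # +1 for positives, -1 for negatives
    return sign * alpha * (1 - p_t) ** gamma * (gamma * p_t * np.log(p_t) + p_t - 1)

def focal_grad_scaled_ce(z, y, gamma=2.0, alpha=0.25):
    """Cross-entropy gradient scaled by the focal factor (as in the objective above)."""
    p = sigmoid(z)
    p_t = y * p + (1 - y) * (1 - p)
    return -alpha * (1 - p_t) ** gamma * (y - p)

z = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([1.0, 0.0, 1.0, 0.0])
eps = 1e-6
numeric = (focal_loss_from_logit(z + eps, y) - focal_loss_from_logit(z - eps, y)) / (2 * eps)

print(np.allclose(numeric, focal_grad_exact(z, y), atol=1e-6))      # True
print(np.allclose(numeric, focal_grad_scaled_ce(z, y), atol=1e-6))  # False
```

Whether the exact gradient would change the scores reported above is not something this sketch can show; it only makes the difference between the two formulas explicit.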


Here too the model does not improve; the result is similar to the ensemble approach. Logistic regression appears to give the best results on this simple unbalanced dataset, although its scores were measured on resampled, balanced test splits, so the comparison is not like for like. Real-world datasets are usually much larger and more complex, where logistic regression may not be the best model, and there are many methods and algorithms to try, test, and optimize for better performance.