<font color="red" size="6">Techniques for handling imbalanced datasets</font>
<P><font color="yELLOW" size="5"><B>3_BorderlineSMOTE</B></font>

<font color="pink" size=4>BorderlineSMOTE</font>

BorderlineSMOTE is a variant of SMOTE (Synthetic Minority Over-sampling Technique). Instead of oversampling every minority sample, it generates synthetic samples only around minority samples that lie near the decision boundary between the majority and minority classes. These borderline samples are the hardest to classify, so adding samples there is intended to help the classifier separate the two classes.

<font color="pink" size=4>How BorderlineSMOTE Works:</font>
<ol>
     <li><font color="orange">Borderline Samples:</font> For each minority sample, BorderlineSMOTE inspects its nearest neighbors. A sample is treated as borderline (in "danger") when at least half of its neighbors, but not all of them, belong to the majority class. If every neighbor belongs to the majority class, the sample is considered noise and is skipped.</li>
     <li><font color="orange">Synthetic Samples Generation:</font> New minority samples are created as in SMOTE, by interpolating between a sample and one of its nearest minority neighbors, but only the borderline samples are used as starting points.</li></ol>
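The two steps above can be sketched with NumPy alone. This is a minimal illustration, not the imbalanced-learn implementation: the helper names `classify_minority` and `synthesize`, the brute-force distance search, and the toy one-dimensional data are all assumptions made for the example.

```python
import numpy as np


def classify_minority(X, y, minority_label=1, m_neighbors=3):
    """Label each minority sample as 'noise', 'danger', or 'safe' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = {}
    for i in np.flatnonzero(y == minority_label):
        # Brute-force Euclidean distance from sample i to every sample.
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the sample itself
        nearest = np.argsort(dist)[:m_neighbors]
        n_majority = int(np.sum(y[nearest] != minority_label))
        if n_majority == m_neighbors:
            labels[int(i)] = "noise"    # surrounded by the majority class
        elif n_majority >= m_neighbors / 2:
            labels[int(i)] = "danger"   # borderline: used as a seed
        else:
            labels[int(i)] = "safe"
    return labels


def synthesize(X, y, seed_idx, rng, minority_label=1, k_neighbors=2):
    """Interpolate between a seed and one of its nearest minority neighbors (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    others = np.flatnonzero(y == minority_label)
    others = others[others != seed_idx]
    dist = np.linalg.norm(X[others] - X[seed_idx], axis=1)
    candidates = others[np.argsort(dist)[:k_neighbors]]
    neighbor = X[rng.choice(candidates)]
    gap = rng.uniform(0.0, 1.0)
    return X[seed_idx] + gap * (neighbor - X[seed_idx])


# Toy data on a line: majority (0) on the left, minority (1) on the right,
# plus one minority point buried inside the majority region.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0],
              [1.5], [4.4], [6.0], [6.5], [7.0], [7.5]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

labels = classify_minority(X, y, minority_label=1, m_neighbors=3)
print(labels)

rng = np.random.default_rng(0)
danger_idx = [i for i, kind in labels.items() if kind == "danger"]
synthetic = synthesize(X, y, danger_idx[0], rng)
print("Synthetic sample:", synthetic)
```

In this toy set only the point at 4.4 qualifies as a seed, so every synthetic sample lands between it and its nearest minority neighbors rather than deep inside the minority cluster.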

from collections import Counter

from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying BorderlineSMOTE
print("Class distribution before BorderlineSMOTE:", Counter(y))

# Step 3: Apply BorderlineSMOTE to oversample the minority class
borderline_smote = BorderlineSMOTE(random_state=42)
X_resampled, y_resampled = borderline_smote.fit_resample(X, y)

# Step 4: Check the class distribution after applying BorderlineSMOTE
print("Class distribution after BorderlineSMOTE:", Counter(y_resampled))

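# Step 5: Split the data into training and testing sets
# Caveat: the split happens after resampling, so the test set contains
# synthetic minority points and the scores reported below are optimistic.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.3, random_state=42)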

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before BorderlineSMOTE: Counter({0: 898, 1: 102})
Class distribution after BorderlineSMOTE: Counter({0: 898, 1: 898})
              precision    recall  f1-score   support

           0       0.98      0.95      0.96       265
           1       0.95      0.98      0.97       274

    accuracy                           0.96       539
   macro avg       0.97      0.96      0.96       539
weighted avg       0.97      0.96      0.96       539



<font color="pink" size=4>Advantages of BorderlineSMOTE:</font>
<ol>
    <li><font color="orange">Focuses on Borderline Samples:</font> Synthetic samples are generated only around minority samples near the decision boundary, which are typically the hardest to classify.</li>
    <li><font color="orange">Improved Classifier Performance:</font> Reinforcing the minority class where the classes meet can help a classifier distinguish between them, especially when class overlap is high.</li>
    <li><font color="orange">Better for Noisy Data:</font> Minority samples whose neighbors all belong to the majority class are treated as noise and are not used to generate synthetic samples, so isolated outliers are not replicated.</li></ol>

<font color="pink" size=4>Drawbacks of BorderlineSMOTE:</font>
<ol>
    <li><font color="orange">Noise Sensitivity:</font> If there are noisy data points near the decision boundary, BorderlineSMOTE may amplify this noise, leading to overfitting.</li>
    <li><font color="orange">Computational Complexity:</font> Like SMOTE, BorderlineSMOTE relies on nearest-neighbor searches, which can be computationally expensive for large datasets.</li>
    <li><font color="orange">Overfitting Risk:</font> If the synthetic samples are too close to existing samples, overfitting can occur, especially when training more complex models.</li></ol>