<font color="red" size="6">Techniques for handling imbalanced datasets</font>
<P><font color="yELLOW" size="5"><B>2. SMOTE (Synthetic Minority Over-sampling Technique)</B></font></P>

This section applies SMOTE (Synthetic Minority Over-sampling Technique) from the imbalanced-learn library. SMOTE balances an imbalanced dataset by generating synthetic samples of the minority class.


<font color="pink" size=4>SMOTE Overview:</font>
<ol>
    <li>SMOTE works by creating synthetic samples for the minority class rather than just duplicating the existing minority class samples (as in RandomOverSampler).</li>
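    <li>For each selected minority sample, it finds that sample's k nearest minority-class neighbors (5 by default in imbalanced-learn), picks one of them at random, and places a new point at a random position on the line segment between the two.</li></ol>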
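The interpolation step can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the imbalanced-learn implementation; the function name `smote_sketch` and its parameters are hypothetical, and it assumes the minority class has more than `k` samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)  # sample to start from
    pick = rng.integers(0, k, size=n_new)           # which of its k neighbors to use
    lam = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    x_i = X_min[base]
    x_nn = X_min[neighbors[base, pick]]
    # New point on the segment between the sample and the chosen neighbor
    return x_i + lam * (x_nn - x_i)

X_min = np.random.default_rng(42).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies between them and never outside the range the minority class already covers.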

In [2]:
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying SMOTE
print("Class distribution before SMOTE:", Counter(y))

# Step 3: Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Step 4: Check the class distribution after applying SMOTE
print("Class distribution after SMOTE:", Counter(y_resampled))

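# Step 5: Split the data into training and testing sets
# Caveat: the split happens after resampling, so the test set also contains
# synthetic samples and the scores below are likely optimistic. For an unbiased
# estimate, split first and apply SMOTE to the training set only.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)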

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before SMOTE: Counter({0: 898, 1: 102})
Class distribution after SMOTE: Counter({0: 898, 1: 898})
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       265
           1       0.96      0.96      0.96       274

    accuracy                           0.96       539
   macro avg       0.96      0.96      0.96       539
weighted avg       0.96      0.96      0.96       539



<font color="pink" size=4>Advantages of SMOTE:</font>
<ol>
    <li><font color="orange">Synthetic Samples:</font> Unlike random oversampling, SMOTE creates new, synthetic samples rather than just duplicating the minority class samples.</li>
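    <li><font color="orange">Improves Performance:</font> Because the synthetic points fill in the region around existing minority samples instead of repeating them, the classifier tends to learn a broader decision region for the minority class, which often improves minority-class recall.</li>
    <li><font color="orange">Balances the Dataset:</font> By default, SMOTE oversamples the minority class until it matches the majority class (898 samples each in the example above), so training is no longer dominated by the majority class.</li></ol>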

<font color="pink" size=4>Drawbacks of SMOTE:</font>
<ol>
    <li><font color="orange">Risk of Overfitting:</font> Synthetic samples are interpolations of existing minority points and add no genuinely new information, so a model can still overfit to the small set of original minority samples they were built from.</li>
    <li><font color="orange">Noise in the Data:</font> SMOTE does not check whether a minority sample is an outlier or mislabeled. Interpolating from such a point can place synthetic samples in noisy or less meaningful areas of the feature space, including inside majority-class regions, potentially harming model performance.</li>
    <li><font color="orange">Computationally Expensive:</font> SMOTE needs a nearest-neighbor search over the minority class, which makes it slower than simple oversampling techniques, especially for large or high-dimensional datasets.</li></ol>
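The noise drawback can be illustrated with a toy case in plain NumPy. The sketch assumes a tight minority cluster plus a single outlying minority point; all names and coordinates are made up for the illustration, and the interpolation is written by hand rather than taken from imbalanced-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(10, 2))  # genuine minority region
outlier = np.array([10.0, 10.0])                         # noisy or mislabeled minority point

# Nearest minority neighbor of the outlier
nearest = cluster[np.argmin(np.linalg.norm(cluster - outlier, axis=1))]

# SMOTE-style interpolation between the outlier and that neighbor
lams = np.array([0.25, 0.5, 0.75])
synthetic = outlier + lams[:, None] * (nearest - outlier)

centre = cluster.mean(axis=0)
radius = np.linalg.norm(cluster - centre, axis=1).max()
print("Cluster radius:", round(radius, 2))
for lam, point in zip(lams, synthetic):
    dist = np.linalg.norm(point - centre)
    print(f"lambda={lam}: distance from cluster centre = {dist:.2f}")
```

Every synthetic point lands on the line between the cluster and the outlier, well outside the cluster itself. If that space belongs to the majority class, these points are labeled as minority samples in a region where they do not belong.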