<font color="red" size="6">Hybrid Model</font>
<p><font color="Yellow" size="5"><b>2_SMOTETomek (SMOTE + Tomek Links)</b> </font>

<b><font size=4>SMOTETomek</font></b>

<font color="pink" size=4>SMOTETomek is another hybrid technique designed to handle class imbalance in datasets. It combines two methods:</font>
<ol>
    <li><font color="orange">SMOTE (Synthetic Minority Over-sampling Technique):</font> This technique generates synthetic samples for the minority class by interpolating between existing minority class samples.</li>
    <li><font color="orange">Tomek Links:</font> A Tomek link is a pair of samples from different classes that are each other's nearest neighbors. Tomek Links is an under-sampling technique that finds these pairs and removes one or both of their samples. The idea is to eliminate borderline or ambiguous samples that lie close to the decision boundary between classes and could confuse a classifier.</li></ol>
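
The two building blocks can be illustrated with a small NumPy sketch. This is not the imbalanced-learn implementation: the helper names `smote_point` and `tomek_links` and the toy 1-D data are hypothetical, chosen only to show the interpolation step and the mutual-nearest-neighbor test.

```python
import numpy as np


def smote_point(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and a minority-class neighbor."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    # Pairwise Euclidean distances; the diagonal is set to inf so a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and y[i] != y[j]]


rng = np.random.default_rng(0)

# Toy 1-D data (hypothetical values): class 0 is the majority, class 1 the minority.
X = np.array([[0.0], [1.0], [2.0], [2.9], [3.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

synthetic = smote_point(X[5], X[6], rng)  # lies somewhere between 5.0 and 6.0
links = tomek_links(X, y)                 # samples 3 (x=2.9) and 4 (x=3.0) form the only link

print("Synthetic minority sample:", synthetic)
print("Tomek links:", links)
```

The brute-force distance matrix keeps the sketch short but grows quadratically with the number of samples, so it is only suitable for small illustrations.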

<font color="pink" size=4>How SMOTETomek Works:</font>
<ol>
    <li><font color="orange">Generate Synthetic Minority Samples (SMOTE):</font> 
        The first step in SMOTETomek is to apply SMOTE to generate synthetic samples for the minority class. This increases the number of minority class instances by interpolating between existing minority class samples.</li>
    <li><font color="orange">Remove Borderline Samples Using Tomek Links:</font> 
        After applying SMOTE, Tomek Links is applied to the enlarged dataset to find pairs of samples that are each other's nearest neighbors but belong to different classes, and to remove them. These pairs are likely to be borderline samples that sit close to the decision boundary and could introduce noise or ambiguity into the classifier.</li>
    <li><font color="orange">Balanced Dataset:</font> 
        The result is a dataset that is balanced, or very nearly so, and cleaned of the borderline samples that formed Tomek links, which can lead to better generalization by the model.</li></ol>

In [1]:
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying SMOTETomek
print("Class distribution before SMOTETomek:", Counter(y))

# Step 3: Apply SMOTETomek to balance the dataset
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

# Step 4: Check the class distribution after applying SMOTETomek
print("Class distribution after SMOTETomek:", Counter(y_resampled))

# Step 5: Split the resampled data into training and testing sets
# (note: the test set therefore also contains synthetic samples)
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before SMOTETomek: Counter({0: 898, 1: 102})
Class distribution after SMOTETomek: Counter({0: 896, 1: 896})
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       264
           1       0.96      0.96      0.96       274

    accuracy                           0.96       538
   macro avg       0.96      0.96      0.96       538
weighted avg       0.96      0.96      0.96       538



<font color="pink" size=4>Advantages of SMOTETomek:</font>
<ol>
    <li><font color="orange">Improved Decision Boundaries:</font> The Tomek Links step cleans the dataset by removing borderline or noisy instances, which can give the classifier a clearer decision boundary between classes.</li>
    <li><font color="orange">Hybrid Approach:</font> SMOTETomek combines both over-sampling and under-sampling techniques, making it a more sophisticated solution than applying either technique alone.</li>
    <li><font color="orange">Reduces Overfitting:</font> By removing noisy instances, including synthetic samples that SMOTE places inside majority-class regions, SMOTETomek can help the classifier generalize better and reduce the risk of overfitting to the minority class.</li></ol>

<font color="pink" size=4>Drawbacks of SMOTETomek:</font>
<ol>
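    <li><font color="orange">Computational Complexity:</font> SMOTETomek involves both an over-sampling and an under-sampling step, each relying on nearest-neighbor searches, and the Tomek Links step runs on the enlarged, over-sampled dataset. It can therefore be computationally expensive, especially for large datasets.</li>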
    <li><font color="orange">Loss of Information:</font> The Tomek Links step removes real samples as well as synthetic ones, so informative majority-class samples near the boundary may be discarded, which could hurt model performance in certain cases.</li>
    <li><font color="orange">Not Always Effective:</font> If the dataset doesn't have significant noise or borderline instances, the effectiveness of SMOTETomek may be limited.</li></ol>