<font color="red" size="6">Undersampling Methods</font>
<p><font color="yellow" size="5"><b>2_TomekLinks</b></font></p>

Tomek Links is an under-sampling technique that cleans a dataset by removing ambiguous or borderline samples lying between the minority and majority classes. It identifies pairs of mutual nearest neighbors that belong to different classes and removes the majority class sample from each pair. Removing these noisy or borderline instances improves class separability and gives classifiers a cleaner decision boundary.

<font color="pink" size=4>How Tomek Links Work:</font>
<ol>
    <li><font color="orange">Find Nearest Neighbors:</font>
       <ol><li>The first step in identifying a Tomek Link is to calculate the nearest neighbor for each sample in the dataset.</li></ol></li>
    <li><font color="orange">Identify Pairs:</font>
        <ol><li>If two samples are each other's nearest neighbor and belong to different classes (i.e., one is from the minority class and the other is from the majority class), the pair is called a Tomek Link.</li></ol></li>
    <li><font color="orange">Remove Majority Class Instances:</font>
       <ol><li>After identifying the Tomek Links, the algorithm removes the majority class instance from each link. This helps to clean the decision boundary by removing noisy or ambiguous samples.</li></ol></li></ol>
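
The three steps above can be sketched in plain Python. This is a minimal illustration, not the `imblearn` implementation: the helper names (`nearest_neighbor`, `find_tomek_links`, `remove_majority_in_links`) and the toy points are made up for this example, and Euclidean distance is assumed.

```python
from math import dist


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))


def find_tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority class member of every Tomek Link."""
    links = find_tomek_links(points, labels)
    drop = {k for pair in links for k in pair if labels[k] == majority_label}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.0, 1.0), (1.5, 0.0), (3.0, 3.0),
          (3.2, 3.0), (5.0, 5.0), (5.0, 5.5)]
labels = [0, 0, 0, 0, 1, 1, 1]

links = find_tomek_links(points, labels)
cleaned_points, cleaned_labels = remove_majority_in_links(points, labels, majority_label=0)
print("Tomek Links:", links)
print("Labels after cleaning:", cleaned_labels)
```

Points 3 and 4 are each other's nearest neighbor and carry different labels, so they form the only link in this toy set, and the majority sample (index 3) is the one removed. Pairs such as (0, 1) are also mutual nearest neighbors but share a label, so they are left alone.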

In [2]:
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying TomekLinks
print("Class distribution before TomekLinks:", Counter(y))

# Step 3: Apply TomekLinks to clean the dataset
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X, y)

# Step 4: Check the class distribution after applying TomekLinks
print("Class distribution after TomekLinks:", Counter(y_resampled))

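# Step 5: Split the data into training and testing sets
# Note: resampling happened before this split, so the test set has also been cleaned.
# For an unbiased evaluation, split first and resample only the training data.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)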

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before TomekLinks: Counter({0: 898, 1: 102})
Class distribution after TomekLinks: Counter({0: 889, 1: 102})
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       267
           1       0.95      0.65      0.77        31

    accuracy                           0.96       298
   macro avg       0.96      0.82      0.87       298
weighted avg       0.96      0.96      0.96       298



<font color="pink" size=4>Advantages of Tomek Links:</font>
<ol>
    <li><font color="orange">Improves Decision Boundary:</font> By removing borderline instances that are ambiguous or noisy, Tomek Links helps clean the decision boundary between classes.</li>
    <li><font color="orange">No Synthetic Data Generation:</font> Unlike methods like SMOTE, Tomek Links doesn't generate synthetic data. Instead, it focuses on cleaning the existing data.</li>
    <li><font color="orange">Simple and Efficient:</font> It's a simple technique to implement and can be useful for improving classifier performance without overcomplicating the dataset.</li></ol>

<font color="pink" size=4>Drawbacks of Tomek Links:</font>
<ol>
    <li><font color="orange">Loss of Majority Class Data:</font> Removing samples from the majority class might lead to a loss of important information.</li>
    <li><font color="orange">Limited Effectiveness:</font> Tomek Links helps mainly when the classes overlap or the decision boundary is noisy. If the classes are already well separated, few links exist and the method has little impact. In the example above, only 9 of the 898 majority samples were removed.</li>
    <li><font color="orange">Doesn't Balance the Classes:</font> While it reduces noise, it doesn’t balance the class distribution the way methods like SMOTE or RandomUnderSampler do, so the dataset usually remains imbalanced after cleaning.</li></ol>