<font color="red" size="6"><b>Nearest Neighbor Methods</b></font>
<p><font color="Yellow" size="5"><b>2_ADASYN (Adaptive Synthetic Sampling)</b></font></p>

<font color="pink" size=4>ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning)</font>

ADASYN is an over-sampling technique that generates synthetic minority-class samples adaptively. It produces more samples around minority points that are harder to learn, i.e., those surrounded by many majority-class neighbors, which typically lie near the decision boundary. SMOTE, by contrast, treats every minority point equally when generating synthetic data. By concentrating new samples where the minority class is most outnumbered, ADASYN aims to help the classifier learn these difficult regions.

<font color="pink" size=4>How ADASYN Works:</font>
<ol>
    <li><font color="orange">Identify Hard-to-Learn Samples:</font>
        The algorithm first finds the k nearest neighbors (k-NN) of each minority-class sample, searching across both classes.
        Samples that are harder to classify (i.e., those with more majority-class neighbors) receive a higher weight and are over-sampled more.</li>
    <li><font color="orange">Generate Synthetic Samples:</font>
        Synthetic samples are created by SMOTE-style interpolation: each new point is placed on the line segment between a minority sample and one of its minority-class nearest neighbors. The number of synthetic samples generated from each minority sample is proportional to its difficulty weight, so harder-to-learn samples are over-sampled more.</li>
    <li><font color="orange">Balanced Dataset:</font>
        ADASYN results in an approximately balanced dataset, with the new samples concentrated in the regions where the classifier would have the most difficulty classifying the minority class. The class counts may not match exactly because the per-sample counts are rounded.</li></ol>
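The weighting step above can be sketched in plain NumPy. This is a minimal illustration, not imbalanced-learn's implementation: the helper names `adasyn_weights` and `synthetic_counts`, the toy data, and the uniform fallback when no minority sample has a majority neighbor are all assumptions made for the example.

```python
import numpy as np


def adasyn_weights(X, y, minority_label=1, k=5):
    """Return minority indices and their normalized difficulty weights."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)

    ratios = np.empty(len(minority_idx))
    for pos, i in enumerate(minority_idx):
        # Distances from sample i to every sample in the full dataset
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the sample itself
        neighbors = np.argsort(dist)[:k]
        # Fraction of the k neighbors that belong to another class
        ratios[pos] = np.mean(y[neighbors] != minority_label)

    total = ratios.sum()
    if total == 0:
        # Assumption for this sketch: fall back to uniform weights
        uniform = np.full(len(minority_idx), 1.0 / len(minority_idx))
        return minority_idx, uniform
    return minority_idx, ratios / total


def synthetic_counts(weights, n_to_generate):
    """Allocate the synthetic-sample budget in proportion to the weights."""
    return np.rint(weights * n_to_generate).astype(int)


# Toy 1-D data: four majority points (label 0), four minority points (label 1)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [3.4], [4.5], [6.2], [7.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

idx, weights = adasyn_weights(X_toy, y_toy, minority_label=1, k=3)
counts = synthetic_counts(weights, n_to_generate=6)
print(dict(zip(idx.tolist(), counts.tolist())))
```

In this toy set, the minority point at 3.4 sits closest to the majority class, so it receives the largest share of the six synthetic samples. The two points at 6.2 and 7.0 have only minority neighbors and receive none.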

import numpy as np
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying ADASYN
print("Class distribution before ADASYN:", Counter(y))

# Step 3: Apply ADASYN to balance the dataset
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

# Step 4: Check the class distribution after applying ADASYN
print("Class distribution after ADASYN:", Counter(y_resampled))

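# Step 5: Split the data into training and testing sets
# Caveat: the split happens after resampling, so the test set contains
# synthetic samples and the scores reported below are likely optimistic.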
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before ADASYN: Counter({0: 898, 1: 102})
Class distribution after ADASYN: Counter({0: 898, 1: 890})
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       270
           1       0.95      0.95      0.95       267

    accuracy                           0.95       537
   macro avg       0.95      0.95      0.95       537
weighted avg       0.95      0.95      0.95       537



<font color="pink" size=4>Advantages of ADASYN:</font>
<ol>
    <li><font color="orange">Focus on Hard Samples:</font> Unlike SMOTE, ADASYN generates more synthetic samples around minority points that have many majority-class neighbors, which can make it more effective on problems where the classes overlap.</li>
    <li><font color="orange">Improves Boundary Learning:</font> By concentrating synthetic samples near the decision boundary, ADASYN can improve the model's ability to distinguish between the minority and majority classes.</li>
    <li><font color="orange">Dynamic Sampling:</font> ADASYN dynamically adjusts the number of synthetic samples generated for each minority class sample based on its difficulty.</li></ol>

<font color="pink" size=4>Limitations of ADASYN:</font>
<ol>
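    <li><font color="orange">Computationally Expensive:</font> It is somewhat more computationally expensive than SMOTE. Both methods search for minority-class neighbors to interpolate with, but ADASYN also searches the full dataset around every minority sample to estimate its difficulty.</li>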
    <li><font color="orange">Overfitting Risk:</font> Because ADASYN gives the highest weights to minority samples surrounded by majority-class points, noisy or mislabeled minority samples can attract the most synthetic points. This amplifies the noise and can lead to overfitting, especially when many synthetic samples are generated.</li></ol>