<font color="red" size="6"><b>Nearest Neighbor Methods</b></font>
<p><font color="Yellow" size="5"><b>1_KMeansSMOTE</b></font>

<font color="pink" size=4>KMeansSMOTE (K-Means Synthetic Minority Over-sampling Technique)</font>

KMeansSMOTE is an oversampling technique for imbalanced datasets, and it is most useful when the minority class forms several distinct groups in feature space. It combines K-Means clustering with SMOTE (Synthetic Minority Over-sampling Technique) so that synthetic minority samples are generated only inside regions where the minority class is already well represented.

<font color="pink" size=4>How KMeansSMOTE Works:</font>
<ol>
    <li><font color="orange">K-Means Clustering:</font>
        The full dataset (both classes) is partitioned into k clusters with K-Means. Only clusters in which the minority class makes up a large enough share of the samples are kept for oversampling.</li>
    <li><font color="orange">SMOTE:</font>
        Within each selected cluster, SMOTE creates synthetic points by interpolating between a minority sample and one of its nearest minority neighbors in the same cluster. Clusters whose minority samples are sparse receive more synthetic points.</li>
    <li><font color="orange">Balanced Dataset:</font>
        The synthetic points are appended to the original data, giving a balanced (or nearly balanced) dataset in which the new minority samples fall inside minority-dominated regions rather than between the classes.</li></ol>
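The three steps above can be sketched by hand with NumPy and scikit-learn. This is a simplified illustration, not the imbalanced-learn implementation: the function name `kmeans_smote_sketch`, its parameters, and the toy data are hypothetical, it assumes a binary problem, and it splits the synthetic samples evenly across the selected clusters instead of weighting them by cluster sparsity.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def kmeans_smote_sketch(X, y, minority=1, n_clusters=5, balance_threshold=0.5,
                        k_neighbors=2, random_state=0):
    """Illustrative only: cluster, keep minority-rich clusters, interpolate."""
    rng = np.random.default_rng(random_state)

    # Step 1: cluster the whole dataset (both classes).
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)

    # Keep clusters that are minority-rich and have enough minority points
    # to supply k_neighbors neighbours each.
    selected = []
    for c in range(n_clusters):
        in_cluster = labels == c
        X_min = X[in_cluster & (y == minority)]
        if len(X_min) > k_neighbors and len(X_min) / in_cluster.sum() > balance_threshold:
            selected.append(X_min)

    n_needed = int(np.sum(y != minority) - np.sum(y == minority))
    if not selected or n_needed <= 0:
        return X, y  # nothing to do, or no usable cluster

    # Step 2: SMOTE-style interpolation inside each selected cluster.
    # Simplification: synthetic samples are shared out evenly.
    per_cluster = np.full(len(selected), n_needed // len(selected))
    per_cluster[: n_needed % len(selected)] += 1
    synthetic = []
    for X_min, n_new in zip(selected, per_cluster):
        nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_min)
        neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
        base = rng.integers(0, len(X_min), size=n_new)
        partner = neighbours[base, rng.integers(0, k_neighbors, size=n_new)]
        gap = rng.random((n_new, 1))  # position along the segment base -> partner
        synthetic.append(X_min[base] + gap * (X_min[partner] - X_min[base]))

    # Step 3: append the synthetic minority samples to the original data.
    X_new = np.vstack(synthetic)
    y_new = np.full(len(X_new), minority)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Toy data: 200 majority points around the origin, 20 minority points far away.
data_rng = np.random.default_rng(0)
X_toy = np.vstack([data_rng.normal(0.0, 1.0, size=(200, 2)),
                   data_rng.normal(6.0, 0.5, size=(20, 2))])
y_toy = np.array([0] * 200 + [1] * 20)

X_bal, y_bal = kmeans_smote_sketch(X_toy, y_toy)
print("before:", Counter(y_toy.tolist()), "after:", Counter(y_bal.tolist()))
```

Because every synthetic point is a convex combination of two minority points from the same cluster, it cannot land outside the region that cluster already occupies.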

In [1]:
#pip install -U imbalanced-learn

In [3]:
from imblearn.over_sampling import KMeansSMOTE

In [4]:
import numpy as np
from imblearn.over_sampling import KMeansSMOTE  # lives in over_sampling, not imblearn.combine
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying KMeansSMOTE
print("Class distribution before KMeansSMOTE:", Counter(y))

# Step 3: Apply KMeansSMOTE to balance the dataset
kmeans_smote = KMeansSMOTE(random_state=42)
X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)

# Step 4: Check the class distribution after applying KMeansSMOTE
print("Class distribution after KMeansSMOTE:", Counter(y_resampled))

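# Step 5: Split the data into training and testing sets
# Caveat: the split happens after resampling, so the test set contains
# synthetic minority samples and the scores reported below are optimistic.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)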

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before KMeansSMOTE: Counter({0: 898, 1: 102})
Class distribution after KMeansSMOTE: Counter({0: 898, 1: 898})




              precision    recall  f1-score   support

           0       0.97      0.99      0.98       265
           1       0.99      0.97      0.98       274

    accuracy                           0.98       539
   macro avg       0.98      0.98      0.98       539
weighted avg       0.98      0.98      0.98       539



<font color="pink" size=4>Advantages of KMeansSMOTE:</font>
<ol>
    <li><font color="orange">Cluster-Aware Synthetic Data Generation:</font> Synthetic points are interpolated only between minority samples that share a cluster, so they tend to resemble real minority samples more closely than points interpolated across the whole dataset.</li>
    <li><font color="orange">Handling Complex Data Structures:</font> When the minority class is spread over several separate groups, the clustering step treats each group on its own instead of drawing lines between distant groups.</li>
    <li><font color="orange">Reduces Noise:</font> Clusters dominated by the majority class are skipped, which lowers the risk of placing synthetic minority points in majority regions or around isolated outliers.</li></ol>

<font color="pink" size=4>Drawbacks of KMeansSMOTE:</font>
<ol>
    <li><font color="orange">Computational Complexity:</font> The clustering step adds cost on top of plain SMOTE, which can matter for large datasets. imbalanced-learn uses MiniBatchKMeans by default to limit this cost.</li>
    <li><font color="orange">Tuning Hyperparameters:</font> Results depend on the number of clusters (k) and on the threshold that decides which clusters count as minority-rich. A poor choice can leave no cluster eligible for oversampling.</li>
    <li><font color="orange">Limited Availability:</font> In Python the method is available through imbalanced-learn (<code>imblearn.over_sampling.KMeansSMOTE</code>), but it is less widely implemented than plain SMOTE and may be missing from other libraries.</li></ol>