<font color="red" size="6">Undersampling Methods</font>
<p><font color="yellow" size="5"><b>3_ClusterCentroids</b></font></p>

ClusterCentroids is an undersampling technique for handling class imbalance. It clusters the majority class samples (KMeans by default) and replaces each cluster with its centroid, the mean of the samples assigned to that cluster. This shrinks the majority class while approximately preserving its overall distribution, because each centroid summarizes a group of similar samples.

<font color="pink" size=4>How ClusterCentroids Works:</font>
<ol>
    <li><font color="orange">Clustering the Majority Class:</font>
        The first step is to perform clustering (typically using KMeans) on the majority class samples. The goal is to group similar samples together into clusters.</li>
    <li><font color="orange">Generate Centroids:</font>
        After clustering the majority class, the centroids of these clusters are calculated. A centroid represents the mean of all the samples in a cluster.</li>
    <li><font color="orange">Replace Majority Class Samples with Centroids:</font>
        The majority class samples are then replaced with a smaller number of centroids. This reduces the number of majority class instances, while still preserving the essential characteristics of the data.</li>
    <li><font color="orange">Resampling:</font>
        The resulting dataset has fewer majority class samples and is therefore more balanced. With the default <code>sampling_strategy</code>, the majority class is reduced to the size of the minority class. The minority class samples remain unchanged.</li></ol>
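
The four steps above can be sketched by hand with scikit-learn's `KMeans`. This is a minimal illustration under stated assumptions, not imbalanced-learn's actual implementation: the helper `cluster_centroid_undersample`, its parameters, and the small dataset are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification


def cluster_centroid_undersample(X, y, majority_label, n_centroids, random_state=0):
    """Replace the majority class with KMeans centroids (illustrative helper)."""
    majority_mask = y == majority_label
    X_majority = X[majority_mask]

    # Steps 1-2: cluster the majority class; each centroid is the mean of its cluster.
    kmeans = KMeans(n_clusters=n_centroids, n_init=10, random_state=random_state)
    kmeans.fit(X_majority)
    centroids = kmeans.cluster_centers_

    # Steps 3-4: keep all other samples unchanged and append the centroids.
    X_resampled = np.vstack([X[~majority_mask], centroids])
    y_resampled = np.concatenate(
        [y[~majority_mask], np.full(n_centroids, majority_label, dtype=y.dtype)]
    )
    return X_resampled, y_resampled


# Small imbalanced dataset (assumed for illustration): class 0 is the majority.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=1, weights=[0.9, 0.1], random_state=0)
n_minority = int(np.sum(y == 1))
X_res, y_res = cluster_centroid_undersample(X, y, majority_label=0,
                                            n_centroids=n_minority)
print("Class counts after manual resampling:", np.bincount(y_res))
```

Setting the number of clusters equal to the minority class size yields a balanced dataset, which mirrors the default behaviour seen in the output further down.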

In [2]:
import numpy as np
from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, 
                            n_redundant=10, n_classes=2, weights=[0.9, 0.1], 
                            random_state=42)

# Step 2: Check the class distribution before applying ClusterCentroids
print("Class distribution before ClusterCentroids:", Counter(y))

# Step 3: Apply ClusterCentroids to balance the dataset
cluster_centroids = ClusterCentroids(random_state=42)
X_resampled, y_resampled = cluster_centroids.fit_resample(X, y)

# Step 4: Check the class distribution after applying ClusterCentroids
print("Class distribution after ClusterCentroids:", Counter(y_resampled))

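# Step 5: Split the data into training and testing sets
# Note: because the split happens after resampling, the test set is balanced and
# contains centroids, so the scores below do not reflect the original imbalanced data.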
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, 
                                                    test_size=0.3, random_state=42)

# Step 6: Train a classifier (RandomForest) on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Evaluate the classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Class distribution before ClusterCentroids: Counter({0: 898, 1: 102})




Class distribution after ClusterCentroids: Counter({0: 102, 1: 102})
              precision    recall  f1-score   support

           0       0.88      0.94      0.91        32
           1       0.93      0.87      0.90        30

    accuracy                           0.90        62
   macro avg       0.91      0.90      0.90        62
weighted avg       0.90      0.90      0.90        62



<font color="pink" size=4>Advantages of ClusterCentroids:</font>
<ol>
    <li><font color="orange">Preserves Data Structure:</font> By using centroids of clusters, this technique retains the important patterns of the majority class while reducing its size.</li>
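    <li><font color="orange">Improves Class Separation:</font> By shrinking the majority class and collapsing groups of similar samples into single centroids, it can make the classes easier to separate, since a classifier trained on the balanced data is less biased toward the majority class.</li>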
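    <li><font color="orange">No Synthetic Minority Samples:</font> Unlike SMOTE, ClusterCentroids leaves the minority class untouched and adds no new minority samples, which avoids the overfitting risk that can come from generating many synthetic minority points. Note that the centroids themselves are averaged points rather than original samples; setting <code>voting="hard"</code> uses the original sample nearest to each centroid instead.</li></ol>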

<font color="pink" size=4>Drawbacks of ClusterCentroids:</font>
<ol>
    <li><font color="orange">Loss of Information:</font> By reducing the number of majority class samples, some information from the majority class may be lost. The centroids may not capture all the nuances of the original data.</li>
    <li><font color="orange">Cluster Dependency:</font> The effectiveness of the technique depends on the quality of the clustering. If clustering does not produce meaningful clusters, the resulting centroids may not be representative of the majority class.</li>
    <li><font color="orange">Not Suitable for All Types of Data:</font> ClusterCentroids is particularly useful when the data has well-defined clusters. It may not perform well if the data does not exhibit clear clustering structures.</li></ol>