# 3. Under-sampling

[under-sampling page](https://imbalanced-learn.org/stable/under_sampling.html)

One way of handling imbalanced datasets is to reduce the number of observations in every class except the minority class.
The minority class is the one with the fewest observations.
The best-known algorithm in this group is random undersampling, which removes samples from the targeted classes at random.
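
To make the idea concrete, here is a minimal pure-Python sketch of random undersampling. It is not imbalanced-learn's implementation: the helper name `random_undersample` and the toy data are assumptions made for illustration, and the sketch always shrinks every class to the size of the minority class.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class matches the minority class size."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())  # size of the minority class

    kept = []
    for label in counts:
        # Indices of all samples carrying this label.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Keep n_min of them, chosen uniformly without replacement.
        kept.extend(rng.sample(idx, n_min))

    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data (hypothetical): 3 samples of class 0, 10 of class 1, 87 of class 2.
y = [0] * 3 + [1] * 10 + [2] * 87
X = [[float(i), float(i) ** 0.5] for i in range(len(y))]

X_res, y_res = random_undersample(X, y)
print(sorted(Counter(y_res).items()))  # [(0, 3), (1, 3), (2, 3)]
```

Every retained row is an original sample, so the result is a subset of the input. The minority class is left untouched because all of its samples are kept.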

These algorithms can be grouped based on their undersampling strategy into:

* Prototype generation methods.

* Prototype selection methods.

And within the latter, we find:

* Controlled undersampling

* Cleaning methods

### 3.1. Prototype generation

Given an original data set S, prototype generation algorithms generate a new set S' where |S'| < |S| and S' is not a subset of S.
In other words, prototype generation techniques reduce the number of samples in the targeted classes, but the remaining samples are generated from the original set rather than selected from it.
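
The difference between selecting and generating prototypes can be sketched on a tiny example. Both helpers below are hypothetical and deliberately naive (keeping every third point, averaging groups of three consecutive points); they only illustrate the set relationship described above and are not how any imbalanced-learn sampler works.

```python
# Original set S: six 2-D points from one targeted class.
S = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0),
     (10.0, 10.0), (12.0, 10.0), (10.0, 12.0)]


def select_prototypes(points, step):
    """Prototype selection: keep every `step`-th original point."""
    return points[::step]


def generate_prototypes(points, group_size):
    """Prototype generation: replace each group of consecutive points by its mean."""
    prototypes = []
    for start in range(0, len(points), group_size):
        group = points[start:start + group_size]
        mean = tuple(sum(coord) / len(group) for coord in zip(*group))
        prototypes.append(mean)
    return prototypes


selected = select_prototypes(S, step=3)
generated = generate_prototypes(S, group_size=3)

print(len(selected), len(generated))  # 2 2 -> both are smaller than S
print(set(selected) <= set(S))        # True: selected points come from S
print(set(generated) <= set(S))       # False: generated points are new
```

Both results contain fewer points than S, but only the generated set contains points that never appeared in the original data.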

`ClusterCentroids` uses K-means to reduce the number of samples. Each targeted class is therefore represented by the centroids found by K-means instead of by its original samples:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Build a 3-class toy dataset with a strong imbalance (about 1% / 5% / 94%).
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print(sorted(Counter(y).items()))

# Replace the samples of each targeted class with K-means centroids.
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
```

```text
[(np.int64(0), 64), (np.int64(1), 262), (np.int64(2), 4674)]
[(np.int64(0), 64), (np.int64(1), 64), (np.int64(2), 64)]
```


![](https://imbalanced-learn.org/stable/_images/sphx_glr_plot_comparison_under_sampling_001.png)

`ClusterCentroids` offers an efficient way to represent each data cluster with a reduced number of samples.
Keep in mind that this method requires your data to be grouped into clusters.
In addition, the number of centroids should be set so that the under-sampled clusters are representative of the original ones.
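
One rough way to see why the number of centroids matters is to measure how far the original samples lie from their nearest centroid. The sketch below uses a hand-written Lloyd's iteration with a deterministic initialisation on made-up data. The helper names and the distance-based measure are assumptions for illustration, not part of imbalanced-learn.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with a deterministic, evenly spaced initialisation."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for j, cluster in enumerate(clusters):
            if cluster:
                updated.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        centroids = updated
    return centroids


def mean_distance_to_nearest(points, centroids):
    """Average distance from each original sample to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)


# One majority class made of two well-separated groups of four points each.
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
            (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

errors = {k: mean_distance_to_nearest(majority, kmeans(majority, k)) for k in (1, 2, 4)}
for k, err in errors.items():
    print(f"{k} centroid(s): mean distance to nearest centroid = {err:.3f}")
```

With a single centroid, the prototype falls in the empty region between the two groups and represents neither of them. Two centroids already capture both groups, and four bring the prototypes closer still to the original samples.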

### 3.2. Prototype selection