# 3. Under-sampling

[under-sampling page](https://imbalanced-learn.org/stable/under_sampling.html)

One way of handling imbalanced datasets is to reduce the number of observations in every class except the minority class, that is, the class with the fewest observations.
The best-known algorithm in this group is random under-sampling, where samples from the targeted classes are removed at random.

Under-sampling algorithms can be grouped by strategy into:

* Prototype generation methods.

* Prototype selection methods.

Prototype selection methods are further divided into:

* Controlled under-sampling

* Cleaning methods

### 3.1 Prototype generation

Given an original data set S, prototype generation algorithms generate a new set S' where |S'| < |S| and S' is not a subset of S.  
In other words, prototype generation techniques reduce the number of samples in the targeted classes, but the remaining samples are generated from the original set rather than selected from it.

`ClusterCentroids` uses K-means to reduce the number of samples. Each targeted class is therefore represented by the K-means centroids instead of the original samples:

In [2]:
from collections import Counter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                           n_redundant=0, n_repeated=0, n_classes=3,
                           n_clusters_per_class=1,
                           weights=[0.01, 0.05, 0.94],
                           class_sep=0.8, random_state=0)
print(sorted(Counter(y).items()))
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

[(np.int64(0), 64), (np.int64(1), 262), (np.int64(2), 4674)]
[(np.int64(0), 64), (np.int64(1), 64), (np.int64(2), 64)]


![](https://imbalanced-learn.org/stable/_images/sphx_glr_plot_comparison_under_sampling_001.png)

`ClusterCentroids` offers an efficient way to represent the data clusters with a reduced number of samples.
Keep in mind that this method requires your data to be grouped into clusters.
In addition, the number of centroids should be set so that the under-sampled clusters are representative of the original ones.
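
To make the "generated, not selected" idea concrete, the sketch below runs a plain K-means loop on one synthetic majority class using only the standard library. It is an illustration under simplifying assumptions: the helper `kmeans_centroids`, the Gaussian toy data, and the choice of 10 centroids are all hypothetical, and this is not how `ClusterCentroids` is implemented.

```python
import random


def kmeans_centroids(points, k, n_iter=20, seed=0):
    """Return k centroids of `points` using plain Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct samples
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids


# Hypothetical majority class: 200 points drawn from a 2-D Gaussian.
data_rng = random.Random(0)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]

prototypes = kmeans_centroids(majority, k=10)
n_original = sum(p in majority for p in prototypes)
print(len(prototypes), "prototypes,", n_original, "of them are original samples")
```

Because each centroid is an average of its cluster members, it generally does not coincide with any original sample, which is what separates prototype generation from prototype selection.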

### 3.2 Prototype selection

Given an original data set S, prototype selection algorithms select a new set S' where |S'| < |S| and S' is a subset of S.  

Prototype selection algorithms can be divided into two groups: (i) controlled under-sampling techniques and (ii) cleaning under-sampling techniques.

Controlled under-sampling methods reduce the number of observations in the majority class or classes to an arbitrary number of samples specified by the user. Typically, they reduce the number of observations to the number of samples observed in the minority class.

In contrast, cleaning under-sampling techniques “clean” the feature space by removing either “noisy” or “too easy to classify” observations, depending on the method. The final number of observations in each class varies with the cleaning method and can’t be specified by the user. 
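
The difference between the two groups can be sketched on a toy dataset. The helpers `controlled` and `cleaning` below are hypothetical, and the cleaning rule (drop a majority sample whose nearest neighbour belongs to another class) is a deliberately simplified stand-in rather than any specific imbalanced-learn method.

```python
import random
from collections import Counter

# Toy data: six majority samples (class 0) and three minority samples (class 1).
X_toy = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5),
         (5, 6), (6, 5), (6, 6)]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1]


def controlled(y, target_class, n_keep, seed=0):
    """Keep exactly `n_keep` randomly chosen samples of `target_class`."""
    rng = random.Random(seed)
    idx = [i for i, label in enumerate(y) if label == target_class]
    kept = set(rng.sample(idx, n_keep))
    return [i for i, label in enumerate(y) if label != target_class or i in kept]


def cleaning(X, y, target_class):
    """Drop samples of `target_class` whose nearest neighbour has another label."""
    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others,
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    return [i for i, label in enumerate(y)
            if label != target_class or y[nearest(i)] == target_class]


controlled_counts = Counter(y_toy[i] for i in controlled(y_toy, target_class=0, n_keep=3))
cleaning_counts = Counter(y_toy[i] for i in cleaning(X_toy, y_toy, target_class=0))
print(sorted(controlled_counts.items()))  # [(0, 3), (1, 3)]
print(sorted(cleaning_counts.items()))    # [(0, 5), (1, 3)]
```

The controlled variant ends with exactly the requested number of majority samples, whereas the cleaning variant removes only the one sample sitting among the minority points, so its final count is determined by the data.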


### 3.2.1.1 Random under-sampling

`RandomUnderSampler` is a fast and easy way to balance the data by randomly selecting a subset of data for the targeted classes:

In [3]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

[(np.int64(0), 64), (np.int64(1), 64), (np.int64(2), 64)]


![](https://imbalanced-learn.org/stable/_images/sphx_glr_plot_comparison_under_sampling_002.png)
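
Conceptually, random under-sampling amounts to drawing, for each class, as many indices as the minority class has samples. The following standard-library sketch illustrates the idea; `random_under_sample` and the toy labels are hypothetical, and this is not the `RandomUnderSampler` implementation.

```python
import random
from collections import Counter


def random_under_sample(y, seed=0):
    """Return sorted indices that keep an equal number of samples per class."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    # Every class is reduced to the size of the smallest one.
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))  # sampling without replacement
    return sorted(kept)


labels = [0] * 5 + [1] * 20 + [2] * 75
kept_idx = random_under_sample(labels)
balanced = sorted(Counter(labels[i] for i in kept_idx).items())
print(balanced)  # [(0, 5), (1, 5), (2, 5)]
```

Only indices are drawn, so the same selection can be applied to the feature matrix whatever its column types are.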

`RandomUnderSampler` allows bootstrapping the data by setting `replacement` to `True`. When there are multiple classes, each targeted class is under-sampled independently:

In [7]:
import numpy as np
print(X_resampled.shape)  # rows kept when sampling without replacement
rus = RandomUnderSampler(random_state=0, replacement=True)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(np.unique(X_resampled, axis=0).shape)  # unique rows after bootstrapping

(192, 2)
(181, 2)
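
The drop from 192 rows to 181 unique rows is a consequence of sampling with replacement: the same original sample can be drawn more than once. A small standard-library sketch of the difference, using a hypothetical pool of 100 indices standing in for one targeted class:

```python
import random

rng = random.Random(0)
pool = list(range(100))  # hypothetical indices of one targeted class

without_repl = rng.sample(pool, 64)    # without replacement: no index repeats
with_repl = rng.choices(pool, k=64)    # with replacement: indices may repeat

print(len(without_repl), len(set(without_repl)))
print(len(with_repl), len(set(with_repl)))
```

Both draws return 64 indices, but only the draw without replacement is guaranteed to contain 64 distinct ones.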


`RandomUnderSampler` handles heterogeneous data types, such as numerical, categorical, and date values:

In [8]:
X_hetero = np.array([['xxx', 1, 1.0], ['yyy', 2, 2.0], ['zzz', 3, 3.0]],
                    dtype=object)
y_hetero = np.array([0, 0, 1])
X_resampled, y_resampled = rus.fit_resample(X_hetero, y_hetero)
print(X_resampled)
print(y_resampled)

[['xxx' 1 1.0]
 ['zzz' 3 3.0]]
[0 1]


`RandomUnderSampler` also accepts pandas DataFrames as input:

In [11]:
from sklearn.datasets import fetch_openml
df_adult, y_adult = fetch_openml(
    'adult', version=2, as_frame=True, return_X_y=True)
df_adult.head() 

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 |  | 103497 | Some-college | 10 | Never-married |  | Own-child | White | Female | 0 | 0 | 30 | United-States |


In [12]:
df_resampled, y_resampled = rus.fit_resample(df_adult, y_adult)
df_resampled.head()  

|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3582 | 29 | Private | 201101 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 50 | United-States |
| 27844 | 23 | Private | 188950 | Assoc-voc | 11 | Never-married | Sales | Own-child | White | Male | 0 | 0 | 40 | United-States |
| 39877 | 24 | Private | 282604 | Some-college | 10 | Married-civ-spouse | Protective-serv | Other-relative | White | Male | 0 | 0 | 24 | United-States |
| 42144 | 29 | Private | 174419 | HS-grad | 9 | Never-married | Other-service | Unmarried | White | Female | 0 | 0 | 30 | United-States |
| 27199 | 20 | Private | 236592 | 12th | 8 | Never-married | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 35 | Italy |


`NearMiss` adds some heuristic rules to select samples [MZ03](https://imbalanced-learn.org/stable/zzz_references.html#id2). It implements three different heuristics, which can be selected with the `version` parameter:

In [10]:
from imblearn.under_sampling import NearMiss
nm1 = NearMiss(version=1)
X_resampled_nm1, y_resampled = nm1.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

[(np.int64(0), 64), (np.int64(1), 64), (np.int64(2), 64)]


The `NearMiss` heuristic rules are based on the nearest neighbors algorithm.
Therefore, the parameters `n_neighbors` and `n_neighbors_ver3` accept estimators derived from `KNeighborsMixin` in scikit-learn.
The former parameter is used to compute the average distance to the neighbors, while the latter is used for the pre-selection of the samples of interest.
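
As a rough illustration of the heuristics, the sketch below ranks majority samples by their average distance to minority samples, using the nearest ones for version 1 and the farthest ones for version 2. It assumes the usual description of NearMiss-1 and NearMiss-2 and omits NearMiss-3; the function `near_miss` and the toy points are hypothetical, and this is not the `NearMiss` implementation.

```python
import math


def near_miss(majority, minority, n_keep, n_neighbors=2, version=1):
    """Keep the `n_keep` majority samples with the smallest average distance
    to their nearest (version 1) or farthest (version 2) minority samples."""
    def score(point):
        dists = sorted(math.dist(point, other) for other in minority)
        chosen = dists[:n_neighbors] if version == 1 else dists[-n_neighbors:]
        return sum(chosen) / len(chosen)
    return sorted(majority, key=score)[:n_keep]


minority_pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
majority_pts = [(2.0, 0.0), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]

selected_v1 = near_miss(majority_pts, minority_pts, n_keep=2, version=1)
selected_v2 = near_miss(majority_pts, minority_pts, n_keep=2, version=2)
print(selected_v1)  # [(1.0, 1.0), (2.0, 0.0)]
print(selected_v2)  # [(1.0, 1.0), (2.0, 0.0)]
```

On this tiny example both versions keep the two majority points closest to the minority class; they diverge on data where the nearest and farthest minority samples give different rankings.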


When under-sampling a specific class, NearMiss-1 can be affected by the presence of noise.
Samples of the targeted class then tend to be selected around the noisy samples, as is the case for the yellow class in the illustration below.
In the normal case, however, samples next to the class boundaries are selected.
NearMiss-2 does not have this effect, since it focuses on the farthest samples rather than the nearest ones.
Noise can still alter its sampling, mainly in the presence of marginal outliers.
NearMiss-3 is probably the version least affected by noise, thanks to its first-step sample selection.


![](https://imbalanced-learn.org/stable/_images/sphx_glr_plot_comparison_under_sampling_003.png)