<font color="red" size="6">NearMiss</font>

NearMiss is not a classifier but a data preprocessing method that handles class imbalance by under-sampling the majority class. It is available in the imbalanced-learn library as the NearMiss class in the imblearn.under_sampling module. The technique keeps the majority-class samples that lie closest to minority-class samples, with the aim of retaining the part of the majority class that is most informative for training.

<font color="pink" size=4>Key Features of NearMiss:</font>
<ol>
    <li><font color="orange">Under-Sampling Method:</font> Reduces the majority class samples to balance the dataset.</li>
    <li><font color="saffron">Strategies:</font>
        <ol><li>NearMiss-1: Selects the majority samples with the smallest average distance to their k nearest minority samples.</li>
        <li>NearMiss-2: Selects the majority samples with the smallest average distance to their k farthest minority samples.</li>
        <li>NearMiss-3: Works in two steps. First, for each minority sample, its nearest majority neighbors are kept as candidates. Then, from those candidates, the majority samples with the largest average distance to their k nearest minority samples are selected.</li></ol></li></ol>
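The difference between NearMiss-1 and NearMiss-2 is easiest to see on a tiny example. The following is a minimal NumPy sketch of the two selection rules, not the imbalanced-learn implementation; the function name nearmiss_select and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def nearmiss_select(X_maj, X_min, n_keep, k=3, version=1):
    """Return indices of the majority samples kept by a simplified
    NearMiss-1 or NearMiss-2 rule (illustrative sketch only)."""
    if k > len(X_min):
        raise ValueError("k cannot exceed the number of minority samples")
    # Pairwise Euclidean distances, shape (n_majority, n_minority)
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Sort each row so column 0 is the nearest minority sample
    sorted_d = np.sort(dists, axis=1)
    if version == 1:
        # Average distance to the k nearest minority samples
        avg = sorted_d[:, :k].mean(axis=1)
    elif version == 2:
        # Average distance to the k farthest minority samples
        avg = sorted_d[:, -k:].mean(axis=1)
    else:
        raise ValueError("this sketch covers versions 1 and 2 only")
    # Keep the majority samples with the smallest average distance
    return np.argsort(avg, kind="stable")[:n_keep]

X_min = np.array([[0.0, 0.0], [10.0, 0.0]])
X_maj = np.array([[1.0, 0.0], [5.0, 0.0], [20.0, 0.0]])

keep_v1 = nearmiss_select(X_maj, X_min, n_keep=2, k=1, version=1)
keep_v2 = nearmiss_select(X_maj, X_min, n_keep=1, k=1, version=2)
print("NearMiss-1 keeps majority indices:", keep_v1.tolist())  # [0, 1]
print("NearMiss-2 keeps majority indices:", keep_v2.tolist())  # [1]
```

With k=1, NearMiss-1 favors the point sitting next to a single minority sample, while NearMiss-2 favors the point that is reasonably close to all minority samples, here the one lying between them.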

In [1]:
from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from collections import Counter

# Step 1: Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                            n_redundant=10, weights=[0.9, 0.1], random_state=42)

# Step 2: Check class distribution
print("Class distribution before NearMiss:", Counter(y))

# Step 3: Apply NearMiss for under-sampling
nm = NearMiss(version=1)  # Specify the NearMiss version (1, 2, or 3)
X_res, y_res = nm.fit_resample(X, y)

# Step 4: Check the new class distribution
print("Class distribution after NearMiss:", Counter(y_res))

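# Step 5: Split the resampled data into training and testing sets
# Note: the test set is drawn from the already balanced data, so the scores
# below describe a balanced distribution, not the original 90/10 one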
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)

# Step 6: Train a classifier on the balanced dataset
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Step 7: Make predictions
y_pred = clf.predict(X_test)

# Step 8: Evaluate the classifier
print("Classification Report:\n", classification_report(y_test, y_pred))


Class distribution before NearMiss: Counter({0: 898, 1: 102})
Class distribution after NearMiss: Counter({0: 102, 1: 102})
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.88      0.88        32
           1       0.87      0.87      0.87        30

    accuracy                           0.87        62
   macro avg       0.87      0.87      0.87        62
weighted avg       0.87      0.87      0.87        62



<font color="pink" size=4>Parameters of NearMiss</font>
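<ol>
    <li><font color="orange">version:</font> The version of the NearMiss algorithm: 1, 2, or 3 (default: 1).</li>
    <li><font color="orange">sampling_strategy:</font> How to resample (default: 'auto', which under-samples every class except the minority class). A float sets the desired ratio of minority to majority samples after resampling and applies only to binary problems; a dict maps each targeted class to its desired number of samples.</li>
    <li><font color="orange">n_neighbors:</font> Number of minority neighbors used to compute each majority sample's average distance (default: 3).</li>
    <li><font color="orange">n_neighbors_ver3:</font> Used only for version=3; the number of nearest majority neighbors kept for each minority sample in the pre-selection step (default: 3).</li>
    <li><font color="orange">n_jobs:</font> Number of CPU cores used for the neighbor searches. NearMiss is deterministic, so recent imbalanced-learn releases do not take a random_state parameter.</li>
</ol>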

<b><font color="sky blue">When to Use NearMiss?</font></b>
<ol>
   <li>When you want to balance an imbalanced dataset by selectively under-sampling the majority class.</li>
    <li>When the dataset size is large, and removing redundant majority samples improves computational efficiency.</li>
    <li>When you want to retain majority class samples that are informative and close to the minority class.</li></ol>