If you get this error on import:

    AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DatasetsPair'

downgrade scikit-learn:

    pip install scikit-learn==1.1.0

If you hit a permission error, install for the current user only:

    pip install scikit-learn==1.1.0 --user
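
Before changing versions, it can help to confirm what is actually installed, since this kind of error usually points to a version mismatch in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and the package names are the distribution names used on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (hypothetical helper, real package names).
for name in ("scikit-learn", "imbalanced-learn"):
    print(name, installed_version(name))
```

If the printed scikit-learn version is not the one you just installed, the notebook kernel is probably still using the old environment.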

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
  
# load the data set
df = pd.read_csv('Data/Diabetes.csv')
  
# print info about columns in the dataframe
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
None


In [2]:
# check Target distribution

df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [3]:
X = df.drop('Outcome', axis =1)
Y = df['Outcome']

In [4]:
from sklearn.model_selection import train_test_split
  
# split into a 70:30 train/test ratio
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
  
# shapes of the train and test sets
print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)

Shape of X_train:  (537, 8)
Shape of y_train:  (537,)
Shape of X_test:  (231, 8)
Shape of y_test:  (231,)


In [5]:
#Outcome distribution in training data
y_train.value_counts()

0    343
1    194
Name: Outcome, dtype: int64

### SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for handling class imbalance.
It balances the class distribution by adding new minority-class examples. Unlike random oversampling, it does not replicate existing rows.
Instead, SMOTE synthesises new minority instances between existing ones by linear interpolation: for each chosen minority example, it randomly selects one or more of its k nearest minority-class neighbours and places a synthetic record at a random point on the line segment joining the two.
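
The interpolation step can be sketched in a few lines of plain Python. This is an illustration only, not imbalanced-learn's implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample and find its k nearest minority neighbours.
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # Pick one neighbour and a random position on the segment base -> neighbour.
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (made-up values, not the diabetes data).
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)
```

Every synthetic point is a convex combination of two real minority points, so it always falls between them rather than duplicating either one.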

In [6]:
sm = SMOTE(random_state = 2)
X_train_sm,y_train_sm = sm.fit_resample(X_train, y_train)

In [7]:
#Outcome distribution in training data
#way : 1

print('After OverSampling, the shape of train_X: {}'.format(X_train_sm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_sm.shape))
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_sm == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_sm == 0)))

#way : 2

y_train_sm.value_counts()

After OverSampling, the shape of train_X: (686, 8)
After OverSampling, the shape of train_y: (686,) 

After OverSampling, counts of label '1': 343
After OverSampling, counts of label '0': 343


1    343
0    343
Name: Outcome, dtype: int64
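
As a quick sanity check on these numbers: with its default settings, SMOTE grows the minority class until it matches the majority class and leaves the original rows in place. The arithmetic below uses only the counts printed above; the variable names are chosen for illustration.

```python
# Class counts in the training split, taken from the outputs above.
majority_before = 343  # Outcome == 0
minority_before = 194  # Outcome == 1

# Number of synthetic minority rows needed to reach the majority count.
synthetic_rows = majority_before - minority_before
total_after = majority_before + minority_before + synthetic_rows

print(synthetic_rows)  # 149
print(total_after)     # 686
```

So 149 of the 343 rows labelled `1` after resampling are synthetic, and the total of 686 matches the reported shape.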

### Near Miss Algorithm

NearMiss is an undersampling algorithm for balancing an imbalanced dataset. It shrinks the larger (majority) class, but not at random: it uses distances between the two classes to decide which majority samples to keep. In general, majority samples that lie close to the smaller (minority) class are retained, and the remaining majority samples are dropped until the classes are balanced.

The algorithm first calculates the distances between every instance of the larger class and the instances of the smaller class.
It then selects the instances of the larger class that have the shortest distance to the smaller class. These n instances are the ones that are kept; the rest of the larger class is discarded.
If n instances of the larger class are selected for each of the m instances of the smaller class, up to m*n instances of the larger class remain. With the default settings of imbalanced-learn's NearMiss, the larger class is reduced to the size of the smaller class.

Types of the near-miss algorithm:

Version 1: Keeps the majority-class samples whose average distance to their n closest minority-class samples (three by default in imbalanced-learn) is smallest.

Version 2: Keeps the majority-class samples whose average distance to their n farthest minority-class samples is smallest.

Version 3: A two-step procedure. First, for each minority-class sample, its m nearest majority-class neighbours are shortlisted. Then, from that shortlist, the majority-class samples whose average distance to their n nearest minority-class neighbours is largest are kept.
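
To make the difference between versions 1 and 2 concrete, here is a minimal sketch in plain Python. It scores each majority point by its average distance to its closest (version 1) or farthest (version 2) minority points and keeps the lowest-scoring ones. The function name `near_miss_sketch`, its parameters, and the toy data are assumptions for illustration; this is not imbalanced-learn's code, and version 3 is not covered.

```python
import math


def near_miss_sketch(majority, minority, n_keep, version=1, n_neighbors=3):
    """Keep the n_keep majority points with the smallest average distance to
    their n_neighbors closest (version 1) or farthest (version 2) minority points."""
    if version not in (1, 2):
        raise ValueError("this sketch only covers versions 1 and 2")
    n_neighbors = min(n_neighbors, len(minority))

    def score(point):
        distances = sorted(math.dist(point, m) for m in minority)
        chosen = distances[:n_neighbors] if version == 1 else distances[-n_neighbors:]
        return sum(chosen) / len(chosen)

    return sorted(majority, key=score)[:n_keep]


# Toy 1-D data (made-up values, not the diabetes data).
minority = [(0.0,), (1.0,), (10.0,)]
majority = [(2.0,), (6.0,), (20.0,), (-1.5,)]

print(near_miss_sketch(majority, minority, n_keep=2, version=1, n_neighbors=2))
# [(2.0,), (-1.5,)]
print(near_miss_sketch(majority, minority, n_keep=2, version=2, n_neighbors=2))
# [(2.0,), (6.0,)]
```

The two versions keep different points from the same data: version 1 favours majority points next to some cluster of minority points, while version 2 favours points that are reasonably close to the minority class as a whole.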

In [8]:
near_miss = NearMiss()  # defaults to version 1

In [9]:
X_train_nm,y_train_nm = near_miss.fit_resample(X_train, y_train)

In [10]:
#Outcome distribution in training data
#way : 1

print('After UnderSampling, the shape of train_X: {}'.format(X_train_nm.shape))
print('After UnderSampling, the shape of train_y: {} \n'.format(y_train_nm.shape))
  
print("After UnderSampling, counts of label '1': {}".format(sum(y_train_nm == 1)))
print("After UnderSampling, counts of label '0': {}".format(sum(y_train_nm == 0)))

#way : 2

y_train_nm.value_counts()

After UnderSampling, the shape of train_X: (388, 8)
After UnderSampling, the shape of train_y: (388,) 

After UnderSampling, counts of label '1': 194
After UnderSampling, counts of label '0': 194


0    194
1    194
Name: Outcome, dtype: int64