# Oversampling Techniques (Class-13, Module-2)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


* An imbalanced data distribution occurs when one class has far more (or far fewer) observations than the other classes.
* Most machine learning algorithms are trained to maximize overall accuracy by minimizing error, so they do not take the class distribution into account.
* As a result, standard ML techniques are biased towards the majority class and tend to ignore the minority class. They predict the majority class most of the time, so minority class samples are misclassified far more often than majority class samples.
* In more technical terms, a model trained on imbalanced data is prone to very low, sometimes negligible, recall on the minority class.
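
To make the recall point concrete, here is a minimal sketch using only the training-split class counts reported later in this notebook (174 'Alive', 50 'Dead'). The always-predict-majority "model" is a hypothetical stand-in for a badly biased classifier, not something trained in this notebook.

```python
# Training-split class counts reported later in this notebook.
n_alive, n_dead = 174, 50

# Labels use the notebook's encoding: 1 = Alive (majority), 0 = Dead (minority).
y_true = [1] * n_alive + [0] * n_dead
# Hypothetical degenerate "model" that always predicts the majority class.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total

recall_dead = recall(y_true, y_pred, 0)
recall_alive = recall(y_true, y_pred, 1)
print(f"accuracy={accuracy:.3f}, recall(Dead)={recall_dead:.3f}, recall(Alive)={recall_alive:.3f}")
```

Accuracy looks acceptable even though not a single minority sample is identified, which is why recall on the minority class is the metric to watch.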

### Imbalanced Data Handling Techniques:
-> SMOTE (oversampling)  
-> Near Miss Algorithm (undersampling)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
data = pd.read_csv('/content/drive/MyDrive/imbalanced data handling/breast_cancer_survival.csv')
data.head()

,Age,Protein1,Protein2,Protein3,Protein4,Patient_Status
0,42,0.95256,2.15,0.007972,-0.04834,Alive
1,54,0.0,1.3802,-0.49803,-0.50732,Dead
2,63,-0.52303,1.764,-0.37019,0.010815,Alive
3,78,-0.87618,0.12943,-0.37038,0.13219,Alive
4,42,0.22611,1.7491,-0.54397,-0.39021,Alive


In [None]:
data.isnull().sum()

Age                0
Protein1           0
Protein2           0
Protein3           0
Protein4           0
Patient_Status    13
dtype: int64

In [None]:
data = data.dropna(axis=0)

In [None]:
data['Patient_Status'] = data['Patient_Status'].map({'Alive':1,'Dead':0})

In [None]:
data.columns

Index(['Age', 'Protein1', 'Protein2', 'Protein3', 'Protein4',
       'Patient_Status'],
      dtype='object')

The dataset is imbalanced: in the training split used below, 'Alive' outnumbers 'Dead' 174 to 50 (about 3.5 to 1).

In [None]:
x = data[['Age', 'Protein1', 'Protein2', 'Protein3', 'Protein4']]

y = data['Patient_Status']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

# SMOTE

* Short for Synthetic Minority Oversampling Technique.
* It aims to balance the class distribution by adding synthetic minority class examples rather than replicating existing ones.  
* It generates each synthetic training record by interpolating between a minority class example and one of its randomly selected k-nearest minority class neighbours.


-> for every sample in the minority class, its k-nearest neighbours within the minority class are found using the Euclidean distance.  
-> for each sample in the minority class, a subset of its k-nearest neighbours is randomly selected.  
-> for each neighbour in the subset, a new synthetic sample is generated as  
$x' = x + \mathrm{rand}(0,1) \cdot (x_k - x)$  
where **x** is the original sample,  
**xk** is the neighbour taken from the subset, and  
**x'** is the new synthetic sample, which lies on the line segment between **x** and **xk**.
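
The steps above can be sketched in plain Python. This is an illustrative simplification, not the `imblearn` implementation: the function name `smote_sketch`, its parameters, and the toy 2-D points are all assumptions made up for this example.

```python
import math
import random

def smote_sketch(minority, k=3, n_per_sample=1, seed=0):
    """Illustrative SMOTE-style interpolation (hypothetical helper, not imblearn's code)."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x by Euclidean distance, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Randomly pick a subset of those neighbours.
        for xk in rng.sample(neighbours, n_per_sample):
            lam = rng.random()  # uniform in [0, 1)
            # x' = x + lam * (xk - x): a point on the segment from x to xk.
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, xk)))
    return synthetic

# Hypothetical 2-D minority samples, made up for illustration.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_sketch(minority)
for p in new_points:
    print(p)
```

Because every synthetic point is a convex combination of two existing minority samples, it always falls between them and never outside the region the minority class already occupies.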


In [None]:
print("Before OverSampling, counts of label 'Alive': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label 'Dead': {} \n".format(sum(y_train == 0)))

Before OverSampling, counts of label 'Alive': 174
Before OverSampling, counts of label 'Dead': 50 



In [None]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 2)
# fit_resample accepts a pandas Series directly; Series.ravel() is deprecated in recent pandas versions
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)

print('After OverSampling, the shape of train_X: {}'.format(x_train_res.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label 'Alive': {}".format(sum(y_train_res == 1)))
print("After OverSampling, counts of label 'Dead': {}".format(sum(y_train_res == 0)))


After OverSampling, the shape of train_X: (348, 5)
After OverSampling, the shape of train_y: (348,) 

After OverSampling, counts of label 'Alive': 174
After OverSampling, counts of label 'Dead': 174


# Near Miss Algorithm

* NearMiss is an under-sampling technique.
* It aims to balance the class distribution by removing majority class examples, chosen by their distance to minority class examples rather than at random.
* It uses the distances between majority class and minority class instances to decide which majority class instances to keep; the rest are removed.

-> first, the distances between all samples of the majority class and all samples of the minority class are computed.  
-> for each sample of the minority class, the *n* majority class samples with the smallest distances to it are selected.  
-> hence, we get at most k\*n majority class samples, where k is the number of samples in the minority class.
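
The distance-based selection can be sketched in plain Python. This is a minimal sketch assuming the NearMiss-1 style rule (keep the majority samples whose mean distance to their nearest minority samples is smallest), which is believed to match the default `version=1` of `imblearn`'s `NearMiss`; the function name `near_miss_sketch`, its parameters, and the toy points are hypothetical.

```python
import math

def near_miss_sketch(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples whose mean distance to their
    k nearest minority samples is smallest (hypothetical helper)."""
    def mean_dist_to_nearest_minority(x):
        dists = sorted(math.dist(x, m) for m in minority)[:k]
        return sum(dists) / len(dists)
    return sorted(majority, key=mean_dist_to_nearest_minority)[:n_keep]

# Hypothetical 2-D samples, made up for illustration.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0), (0.0, 2.0), (9.0, 9.0)]

# Keep as many majority samples as there are minority samples.
kept = near_miss_sketch(majority, minority, n_keep=len(minority))
print(kept)
```

Setting `n_keep` to the minority class size mirrors the balanced 50/50 result shown in the output of the `NearMiss` cell below.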

In [None]:
print("Before Undersampling, counts of label 'Alive': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'Dead': {} \n".format(sum(y_train == 0)))

Before Undersampling, counts of label 'Alive': 174
Before Undersampling, counts of label 'Dead': 50 



In [None]:
from imblearn.under_sampling import NearMiss
nr = NearMiss()

# fit_resample accepts a pandas Series directly; Series.ravel() is deprecated in recent pandas versions
x_train_miss, y_train_miss = nr.fit_resample(x_train, y_train)

print('After Undersampling, the shape of train_X: {}'.format(x_train_miss.shape))
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape))

print("After Undersampling, counts of label 'Alive': {}".format(sum(y_train_miss == 1)))
print("After Undersampling, counts of label 'Dead': {}".format(sum(y_train_miss == 0)))


After Undersampling, the shape of train_X: (100, 5)
After Undersampling, the shape of train_y: (100,) 

After Undersampling, counts of label 'Alive': 50
After Undersampling, counts of label 'Dead': 50
