**Dealing with Imbalanced Datasets**

Class imbalance is very common in classification problems: one class of the response variable has far fewer observations than the other. Typical examples are fraud detection in banking and the diagnosis of rare diseases in health care, where the positive cases are vastly outnumbered by the negative ones. It has been observed that positive cases (default or fraud) make up approximately 2-3% of the total data. In such a scenario a machine learning algorithm can fail to learn the underlying pattern and will not correctly identify the cases where a real default occurs.

One way to deal with this kind of problem is to resample the training data, for example by oversampling the minority class of the response variable until the class ratio is 50:50 or 60:40 (class=0 : class=1).

There are many ways to deal with this; a few techniques are described below:

**1. Resampling techniques - Undersampling majority class**

Let us consider a fraud detection dataset where we have

*  Total observations = 2000
*  Non-fraudulent rows = 1960
*  Fraudulent rows = 40

So the fraudulent rows are only 2% of the total dataset.

Undersampling the majority class means taking a sample, say 10% or 15%, of the non-fraud instances without replacement and combining it with all the fraudulent rows.

*  10% of the 1960 non-fraudulent rows = 196
*  Total observations = 40 + 196 = 236
*  Fraudulent rows % = 40/236 = 16.9%

The share of fraudulent rows in the dataset has now increased significantly.
 
**Disadvantages:**

*  With so few rows left, the model is prone to bias: the machine learning algorithm may fail to learn many underlying patterns and may predict poorly on new data.

*  A lot of useful data is discarded.
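
The undersampling step described above can be sketched on toy data. This is only an illustration: the `toy` DataFrame, its `amount` column and the 10% fraction are assumptions made for the example. `resample` from `sklearn.utils` is called with `replace=False` so that no majority row is drawn twice.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Toy data (assumed): 2000 rows, 40 of them fraud (Class == 1).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.normal(100, 20, size=2000),
    "Class": [1] * 40 + [0] * 1960,
})

not_fraud = toy[toy.Class == 0]
fraud = toy[toy.Class == 1]

# Keep 10% of the majority class, sampled without replacement.
not_fraud_down = resample(not_fraud,
                          replace=False,
                          n_samples=len(not_fraud) // 10,
                          random_state=27)

# Combine the reduced majority class with all the fraud rows.
undersampled = pd.concat([not_fraud_down, fraud])
fraud_share = undersampled.Class.mean()

print(undersampled.Class.value_counts())
print("Fraud share:", round(fraud_share, 3))
```

Only the majority class is sampled here; every fraud row is kept, which is what raises the fraud share.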
 

**2. Resampling techniques - Oversampling minority class**

In oversampling the minority class, we increase the number of fraud rows, by sampling them with replacement, until they are in a 1:1 ratio with the non-fraud rows, so that both classes are equally represented.

Non-fraudulent rows = 1960 (2000 total rows minus the 40 fraudulent ones)

We increase the fraudulent rows from 40 to 1960 to get an equal ratio between the two classes.
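
As a rough sketch of the arithmetic, the number of minority rows needed for a given class ratio can be computed as below. The helper `minority_rows_needed` is a hypothetical name introduced here, and the counts assume the example of 2000 total rows of which 40 are fraudulent.

```python
def minority_rows_needed(n_majority, minority_share):
    """Rows the minority class needs to make up `minority_share` of the data."""
    return round(n_majority * minority_share / (1 - minority_share))

n_majority = 1960  # non-fraudulent rows (2000 total minus 40 fraudulent)
n_minority = 40    # fraudulent rows

# 0.5 corresponds to a 50:50 split, 0.4 to a 60:40 split (class=0 : class=1).
for share in (0.5, 0.4):
    target = minority_rows_needed(n_majority, share)
    added = target - n_minority
    print(f"{share:.0%} minority: {target} rows, {added} added by resampling")
```

The result can be passed as `n_samples` to `resample` when a ratio other than 1:1 is wanted.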

Let us learn by solving one example. We will use the **Credit Card Fraud Detection Dataset** available on Kaggle for our operations.




In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, precision_score
from sklearn.utils import resample ## Used for sampling the data

In [None]:
cc = pd.read_csv("../input/creditcardfraud/creditcard.csv")


In [None]:
cc.head()

In [None]:
cc.shape

In [None]:
cc.info()

In [None]:
cc.Class.value_counts()

In [None]:
Y = cc['Class']

In [None]:
Y.count()

In [None]:
X = cc.drop(['Class'], axis = 1)
X.head()

In [None]:
Y.value_counts()

In [None]:
## Preparing the Training and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.20, random_state = 30)

In [None]:
X_train.shape

In [None]:
Y_train.shape

In [None]:
X_test.shape

In [None]:
Y_test.shape

**Let us run Logistic regression and evaluate the performance metrics**

In [None]:
## Logistic Regression
lr_model = LogisticRegression(solver='liblinear').fit(X_train,Y_train)

In [None]:
lr_pred = lr_model.predict(X_test)

In [None]:
print("Logistic Regression Metrics:")
print("")
print("Accuracy Score:",accuracy_score(Y_test, lr_pred))
print("F1 Score:", f1_score(Y_test,lr_pred))
print("Recall Score:",recall_score(Y_test, lr_pred))

**Let us run Random Forest and evaluate the performance metrics**

In [None]:
## Random Forest Classifier 

rf = RandomForestClassifier(n_estimators=10)

In [None]:
rf_model = rf.fit(X_train, Y_train)

In [None]:
rf_pred = rf_model.predict(X_test)

In [None]:
print("Random Forest Metrics:")
print("")
print("Accuracy Score:",accuracy_score(Y_test, rf_pred))
print("Recall Score:", recall_score(Y_test, rf_pred))
print("F1 Score:", f1_score(Y_test, rf_pred))

**Oversampling minority class**

In [None]:
# concatenate our training data back together

X = pd.concat([X_train, Y_train], axis=1)
X.head()

**We will apply the resample function from sklearn**

In [None]:
not_fraud = X[X.Class==0]
fraud = X[X.Class==1]

# upsample minority
fraud_upsampled = resample(fraud,
                          replace=True, # sample with replacement
                          n_samples=len(not_fraud), # match number in majority class
                          random_state=27) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([not_fraud, fraud_upsampled])

# check new class counts
upsampled.Class.value_counts()

**We can see that the fraud and non-fraud rows are now equal in count. Now we will run the classifiers again and check whether the metrics have changed.**

In [None]:
y_train = upsampled.Class
X_train = upsampled.drop('Class', axis = 1)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
## Logistic Regression
lr_model2 = LogisticRegression(solver='liblinear').fit(X_train,y_train)

In [None]:
lr_pred2 = lr_model2.predict(X_test)

In [None]:
print("Logistic Regression Metrics after Oversampling minority class:")
print("")
print("Accuracy Score:",accuracy_score(Y_test, lr_pred2))
print("F1 Score:", f1_score(Y_test,lr_pred2))
print("Recall Score:",recall_score(Y_test, lr_pred2))

In [None]:
## Random Forest Classifier 

rf = RandomForestClassifier(n_estimators=10)

In [None]:
rf_model2 = rf.fit(X_train, y_train)

In [None]:
rf_pred2 = rf_model2.predict(X_test)

In [None]:
print("Random Forest Metrics after Oversampling minority class:")
print("")
print("Accuracy Score:",accuracy_score(Y_test, rf_pred2))
print("Recall Score:", recall_score(Y_test, rf_pred2))
print("F1 Score:", f1_score(Y_test, rf_pred2))

**If we observe here, after oversampling the minority class the accuracy has reduced but the recall has increased significantly, which serves part of the purpose of our classification model: the proportion of false negatives (FN) has dropped a lot.**

**Recall is defined as TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.**
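
To make the formula concrete, here is a small sketch that computes recall by hand from a confusion matrix and compares it with `recall_score`. The labels and predictions are made-up toy values, not results from the credit card dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Assumed toy labels: 1000 test rows, 20 of them fraud (Class == 1).
y_true = np.array([1] * 20 + [0] * 980)

# Hypothetical predictions: 12 frauds caught, 8 missed, 30 false alarms.
y_pred = np.array([1] * 12 + [0] * 8 + [1] * 30 + [0] * 950)

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_by_hand = tp / (tp + fn)
print("TP:", tp, "FN:", fn)
print("Recall by hand:", recall_by_hand)
print("recall_score:  ", recall_score(y_true, y_pred))
print("Accuracy:      ", accuracy_score(y_true, y_pred))
```

In this toy case the accuracy stays above 96% even though 8 of the 20 frauds are missed, which is why recall is the more informative metric for the minority class.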

**3. Generating Synthetic Samples - SMOTE (Synthetic Minority Oversampling Technique)**

A technique similar to upsampling is to create synthetic samples. Here we will use imblearn’s SMOTE or Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data.

* Works by creating synthetic samples from the minority class (fraud) instead of creating copies.
* For a minority observation, randomly chooses one of its k nearest neighbors and uses it to create a similar, but randomly tweaked, new observation.
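
The two points above can be sketched in plain NumPy. This is a simplified illustration of the interpolation idea, not imblearn's actual implementation; the function name `smote_like_sample`, the toy points and `k=2` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(27)

# Assumed toy minority class: 6 points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [3.0, 3.0], [3.1, 2.8], [2.9, 3.2]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point by interpolating toward a random neighbor."""
    i = rng.integers(len(X))                   # pick a minority point
    dists = np.linalg.norm(X - X[i], axis=1)   # distances to every point
    neighbors = np.argsort(dists)[1:k + 1]     # k nearest, skipping itself
    j = rng.choice(neighbors)                  # pick one neighbor at random
    gap = rng.random()                         # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng)
                      for _ in range(4)])
print(synthetic)
```

Each synthetic point lies on the line segment between a real minority point and one of its neighbors, so it is new rather than an exact copy of an existing row.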

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
# Separate input features and target
y = cc.Class
X = cc.drop('Class', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=27)

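# sampling_strategy=1.0 resamples the minority class up to a 1:1 ratio with the majority class
sm = SMOTE(random_state=27, sampling_strategy=1.0)
X_train, y_train = sm.fit_resample(X_train, y_train)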

**Running Logistic regression**

In [None]:
# fit the model on the SMOTE-resampled training data
lr_smote = LogisticRegression(solver='liblinear').fit(X_train, y_train)

smote_pred = lr_smote.predict(X_test)


In [None]:
print("Logistic Regression Metrics after SMOTE:")
print("")
print("Accuracy Score:",accuracy_score(y_test, smote_pred))
print("F1 Score:", f1_score(y_test,smote_pred))
print("Recall Score:",recall_score(y_test, smote_pred))

**Running Random Forest**

In [None]:
# fit the model on the SMOTE-resampled training data
rf_smote = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

rf_smote_pred = rf_smote.predict(X_test)


In [None]:
print("Random Forest Metrics after SMOTE:")
print("")
print("Accuracy Score:",accuracy_score(y_test, rf_smote_pred))
print("Recall Score:", recall_score(y_test, rf_smote_pred))
print("F1 Score:", f1_score(y_test, rf_smote_pred))

**We can see that the recall score has increased significantly after applying SMOTE**

**CONCLUSION**

We explored 3 different methods for dealing with imbalanced datasets:

1. Undersampling majority class
2. Oversampling minority class
3. Generating Synthetic Samples - SMOTE

There are many other methods for dealing with imbalanced datasets. We have to choose the one that best suits our problem.

