### Qinhui Xu 09/15/2018

#### Data Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

As the data exploration step shows, the dataset is highly imbalanced: fewer than 2% of the records are fraud. When we predict whether a record is fraud or non-fraud, we care more about the true positive rate (how many of the actual fraud records we catch). If the imbalance is left untreated, a model can easily achieve high accuracy without achieving good sensitivity. Dealing with the imbalance is therefore essential.
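To make the accuracy-versus-sensitivity point concrete, here is a minimal sketch with made-up counts (10,000 transactions, 20 of them fraud; these numbers are an assumption for illustration, not taken from the Kaggle data). The helper `rates` is hypothetical. A "model" that labels every record as non-fraud scores very high accuracy while catching no fraud at all.

```python
def rates(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity

# Hypothetical data: 10,000 transactions, 20 of them fraud (0.2%).
n_total, n_fraud = 10_000, 20

# A "model" that labels everything non-fraud: no true or false positives.
acc, sens, spec = rates(tp=0, fp=0, tn=n_total - n_fraud, fn=n_fraud)
print(f"accuracy={acc:.4f} sensitivity={sens:.4f} specificity={spec:.4f}")
```

This is why the rest of the notebook reports Sensitivity and Specificity rather than accuracy.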

In this Jupyter Notebook, I am going to use six different undersampling and oversampling methods to deal with the imbalance in the dataset.
Before dealing with the imbalance, I will calculate Sensitivity and Specificity from a simple **Logistic Regression** on the original dataset as a baseline.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score,accuracy_score
from sklearn.metrics import confusion_matrix




In [2]:
# Data Preprocessing based on Data Exploration / Feature Engineering
df = pd.read_csv("C:/Users/Tiffany Xu/Documents/MachineLearningStudy/DeepLearning/creditcard.csv")

df['V1_'] = df.V1.map(lambda x: 1 if x < -3 else 0)
df['V2_'] = df.V2.map(lambda x: 1 if x > 2.5 else 0)
df['V3_'] = df.V3.map(lambda x: 1 if x < -4 else 0)
df['V4_'] = df.V4.map(lambda x: 1 if x > 2.5 else 0)
df['V5_'] = df.V5.map(lambda x: 1 if x < -4.5 else 0)
df['V6_'] = df.V6.map(lambda x: 1 if x < -2.5 else 0)
df['V7_'] = df.V7.map(lambda x: 1 if x < -3 else 0)
df['V9_'] = df.V9.map(lambda x: 1 if x < -2 else 0)
df['V10_'] = df.V10.map(lambda x: 1 if x < -2.5 else 0)
df['V11_'] = df.V11.map(lambda x: 1 if x > 2 else 0)
df['V12_'] = df.V12.map(lambda x: 1 if x < -2 else 0)
df['V14_'] = df.V14.map(lambda x: 1 if x < -2.5 else 0)
df['V16_'] = df.V16.map(lambda x: 1 if x < -2 else 0)
df['V17_'] = df.V17.map(lambda x: 1 if x < -2 else 0)
df['V18_'] = df.V18.map(lambda x: 1 if x < -2 else 0)
df['V19_'] = df.V19.map(lambda x: 1 if x > 1.5 else 0)
df['V21_'] = df.V21.map(lambda x: 1 if x > 0.6 else 0)

df = df.drop(['V28','V27','V26','V25','V24','V23','V22','V20','V15','V13','V8'], axis =1)

#### Simple Logistic Regression without Dealing with Imbalance

In [3]:
df['normalized_amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df = df.drop(['Amount','Time'], axis=1)
X = df.loc[:,df.columns != 'Class']
y = df.loc[:,df.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 12)
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))

Sensitivity/Recall is: 0.6938775510204082
Specificity is: 0.9997772462952542


From the result above, we can see that, compared to Specificity, *Recall is pretty low*: about 31% of the fraud records in the test set are missed. Next, I will use different methods to deal with the imbalance in the dataset, then calculate sensitivity for each and make a comparison.

In [4]:
df_notfraud = df[df["Class"]==0]
df_fraud = df[df["Class"]==1]

The methods to deal with imbalance can be mainly divided into three kinds:

1. Undersampling - reduce the size of the majority class (non-fraud) to match the minority class (fraud)

2. Oversampling - increase the size of the minority class (fraud)

3. Oversampling followed by Undersampling - increase the size of the minority class first, then use an undersampling technique to clean up potential issues introduced by oversampling
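As a rough illustration of the first two kinds, the sketch below applies random undersampling and random oversampling to a toy list of labelled rows and counts the classes afterwards. The data (95 non-fraud, 5 fraud) and all variable names are hypothetical; the combined approaches in the third kind are covered by the later sections.

```python
import random
from collections import Counter

rng = random.Random(12)

# Hypothetical toy rows: (row_id, label) with 95 non-fraud (0) and 5 fraud (1).
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
majority = [r for r in rows if r[1] == 0]
minority = [r for r in rows if r[1] == 1]

# 1. Undersampling: keep only as many majority rows as there are minority rows.
under = rng.sample(majority, len(minority)) + minority

# 2. Oversampling: draw minority rows with replacement up to the majority size.
over = majority + rng.choices(minority, k=len(majority))

under_counts = Counter(label for _, label in under)
over_counts = Counter(label for _, label in over)
print(under_counts, over_counts)
```

Both results are balanced, but undersampling ends up with 10 rows while oversampling ends up with 190, which is the trade-off between losing information and duplicating it.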

#### Random Undersampling

In [5]:
df_notfraud_sample = df_notfraud.sample(len(df_fraud))
df_random_undersample = pd.concat([df_notfraud_sample,df_fraud],axis=0)

X_random_under = df_random_undersample.loc[:,df_random_undersample.columns != 'Class']
y_random_under = df_random_undersample.loc[:,df_random_undersample.columns == 'Class']
X_under_train, X_under_test, y_under_train, y_under_test = train_test_split(X_random_under,y_random_under,test_size = 0.3, random_state = 12)

lr = LogisticRegression()
lr.fit(X_under_train,y_under_train)
y_pred = lr.predict(X_under_test)
tn, fp, fn, tp = confusion_matrix(y_under_test, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))

Sensitivity/Recall is: 0.9078947368421053
Specificity is: 0.9722222222222222


Using random undersampling, Sensitivity increases substantially, to **0.9079**. But random undersampling can discard useful information, because most non-fraud records are simply thrown away. Therefore, I will try undersampling based on K-means clusters: I extract a fixed number of cluster centroids, each of which represents the characteristics of its cluster, to limit the information loss.
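The idea of replacing the majority class with cluster centroids can be sketched without any library. The code below is a plain Lloyd's-algorithm K-means written for illustration only; the helper `kmeans_centroids` and the toy data are assumptions, and it is not the H2O implementation used in this notebook. Note that the resulting centroids are averages, not real transactions.

```python
import random

def kmeans_centroids(points, k, n_iter=20, seed=12):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Hypothetical toy data: 200 majority points, 4 minority points, 2 features each.
data_rng = random.Random(0)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
minority = [(3.0, 3.1), (3.2, 2.9), (2.8, 3.3), (3.1, 3.0)]

# Replace the 200 majority rows with as many centroids as there are minority rows.
centers = kmeans_centroids(majority, k=len(minority))
balanced = [(c, 0) for c in centers] + [(list(m), 1) for m in minority]
print(len(balanced))
```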

#### Cluster based Undersampling

In [6]:
import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime: 1 hour 48 mins
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.2
H2O cluster version age: 3 months and 3 days
H2O cluster name: H2O_from_python_Tiffany_Xu_osupl8
H2O cluster total nodes: 1
H2O cluster free memory: 3.332 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


In [7]:
df_kmeans_notfraud = df_notfraud.drop(['Class'], axis=1)
hf = h2o.H2OFrame(df_kmeans_notfraud)
cls = H2OKMeansEstimator(k=len(df_fraud), standardize=True)
cls.train(x=hf.columns, training_frame=hf)

Parse progress: |█████████████████████████████████████████████████████████| 100%
kmeans Model Build progress: |████████████████████████████████████████████| 100%


In [9]:
df_centers = pd.DataFrame(cls.centers())
df_centers.columns = hf.columns
df_centers["Class"] = 0
df_kmeans_undersample = pd.concat([df_centers,df_fraud],axis=0)

In [10]:
X_Kmeans_under = df_kmeans_undersample.loc[:,df_kmeans_undersample.columns != 'Class']
y_Kmeans_under = df_kmeans_undersample.loc[:,df_kmeans_undersample.columns == 'Class']
X_under_train, X_under_test, y_under_train, y_under_test = train_test_split(X_Kmeans_under,y_Kmeans_under,test_size = 0.3, random_state = 12)

lr = LogisticRegression()
lr.fit(X_under_train,y_under_train)
y_pred = lr.predict(X_under_test)
tn, fp, fn, tp = confusion_matrix(y_under_test, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))

Sensitivity/Recall is: 0.9342105263157895
Specificity is: 0.9097222222222222


We can see that Sensitivity **increases by about 0.026** (from 0.9079 to 0.9342) compared with random undersampling, while Specificity drops from 0.9722 to 0.9097.
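One caveat when comparing these numbers: in the cells above, the train/test split is made after resampling, so the test set is itself balanced and differs from the test set used for the baseline. A common alternative, sketched below under that assumption, is to split first and resample only the training part so the test part keeps the original class ratio. The helper `stratified_split` and the toy labels are hypothetical and this is not what the notebook does.

```python
import random

def stratified_split(labels, test_size=0.3, seed=12):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Hypothetical labels: 990 non-fraud (0) and 10 fraud (1).
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

# Undersample the majority class in the training part only.
rng = random.Random(12)
train_fraud = [i for i in train_idx if labels[i] == 1]
train_ok = [i for i in train_idx if labels[i] == 0]
train_balanced = rng.sample(train_ok, len(train_fraud)) + train_fraud

# The test part is untouched, so it keeps the original fraud rate.
test_fraud_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_balanced), len(test_idx), test_fraud_rate)
```

With scikit-learn, the same effect can be approached by calling `train_test_split` with its `stratify` argument before any resampling step.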

#### SMOTE - Synthetic Minority Oversampling Technique

In [11]:
from imblearn.over_sampling import SMOTE

In [12]:
df_features = df.drop(['Class'], axis=1)
df_target = df["Class"]

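# Newer imbalanced-learn releases use sampling_strategy (formerly ratio)
# and fit_resample (formerly fit_sample).
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_res,y_res = sm.fit_resample(df_features,df_target)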

x_train_res, x_val_res, y_train_res, y_val_res = train_test_split(x_res,
                                                    y_res,
                                                    test_size = .3,
                                                    random_state=12)

lr = LogisticRegression()
lr.fit(x_train_res,y_train_res)
y_pred = lr.predict(x_val_res)
tn, fp, fn, tp = confusion_matrix(y_val_res, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))



Sensitivity/Recall is: 0.9192237421530965
Specificity is: 0.9706824716859339


Using SMOTE gives a sensitivity of **0.919**, which is better than random undersampling (0.908) but not as good as cluster-based undersampling (0.934).
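For readers unfamiliar with how SMOTE produces new fraud records, the core step is interpolation: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below shows only that step on hypothetical toy data; `smote_like` is an illustrative helper, not the imbalanced-learn implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=12):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between x and the neighbour.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical minority (fraud) points with two features each.
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two existing fraud points, SMOTE fills in the minority region instead of duplicating rows exactly.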

#### Oversampling followed by Undersampling - SMOTE + Tomek Links

In [13]:
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=12)
x_res,y_res = smote_tomek.fit_resample(df_features,df_target)  # fit_sample in older imbalanced-learn
x_train_res, x_val_res, y_train_res, y_val_res = train_test_split(x_res,
                                                    y_res,
                                                    test_size = .3,
                                                    random_state=12)

lr = LogisticRegression()
lr.fit(x_train_res,y_train_res)
y_pred = lr.predict(x_val_res)
tn, fp, fn, tp = confusion_matrix(y_val_res, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))

Sensitivity/Recall is: 0.9192237421530965
Specificity is: 0.9706824716859339


The combination of SMOTE and Tomek Links does not improve on SMOTE alone (the scores are identical), which suggests that the dataset created by SMOTE does not contain many of the cases Tomek Links targets: pairs of nearest neighbours that belong to opposite classes.
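A Tomek link is a pair of points from different classes that are each other's nearest neighbour; the cleaning step removes one or both points of each such pair, depending on configuration. The sketch below only detects the links, on hypothetical one-dimensional toy data, and `nearest` and `tomek_links` are illustrative helpers.

```python
def nearest(i, points):
    """Index of the nearest other point to points[i] (squared Euclidean)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

# Hypothetical 1-D toy data: the two classes touch around x = 2.0.
points = [(0.0,), (0.5,), (1.9,), (2.0,), (3.0,), (3.4,)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)
print(links)
```

Only the pair at 1.9 and 2.0 qualifies here. If the classes are well separated after SMOTE, few such pairs exist and the cleaning step changes very little.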

#### Oversampling followed by Undersampling - SMOTE + NearMiss

In [18]:
from imblearn.under_sampling import NearMiss

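# Newer imbalanced-learn releases: NearMiss no longer takes random_state,
# ratio became sampling_strategy, and fit_sample became fit_resample.
nr = NearMiss()
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_res,y_res = sm.fit_resample(df_features,df_target)
x_res,y_res = nr.fit_resample(x_res,y_res)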

x_train_res, x_val_res, y_train_res, y_val_res = train_test_split(x_res,
                                                    y_res,
                                                    test_size = .3,
                                                    random_state=12)

lr = LogisticRegression()
lr.fit(x_train_res,y_train_res)
y_pred = lr.predict(x_val_res)
tn, fp, fn, tp = confusion_matrix(y_val_res, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))



Sensitivity/Recall is: 0.9174981262881768
Specificity is: 0.970186743664683


The combination of SMOTE and NearMiss does change the result, but it makes Sensitivity slightly **lower, by about 0.002** (from 0.9192 to 0.9175), which is not what I want.
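As a reminder of what NearMiss does, the NearMiss-1 rule keeps the majority points whose average distance to their k nearest minority points is smallest. The sketch below implements that rule on hypothetical toy data; `near_miss_1` is an illustrative helper, not the library code. Since SMOTE with a 1:1 ratio has already balanced the classes, there may be little left for an undersampler to remove, which would be consistent with the small change observed.

```python
def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their k nearest minority points (the NearMiss-1 rule)."""
    def mean_dist_to_nearest_minority(p):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, m)) ** 0.5 for m in minority
        )
        return sum(dists[:k]) / min(k, len(dists))
    return sorted(majority, key=mean_dist_to_nearest_minority)[:n_keep]

# Hypothetical 1-D toy data.
majority = [(0.0,), (1.0,), (4.0,), (9.0,)]
minority = [(5.0,), (6.0,), (7.0,)]
kept = near_miss_1(majority, minority, n_keep=2)
print(kept)
```

The points at 4.0 and 9.0 are kept because they sit closest to the minority class, so the retained majority points concentrate near the class boundary.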

#### Oversampling followed by Undersampling - SMOTE + ENN (Edited Nearest Neighbor)

In [19]:
from imblearn.combine import SMOTEENN

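# Newer imbalanced-learn releases use sampling_strategy (formerly ratio)
# and fit_resample (formerly fit_sample).
se = SMOTEENN(random_state=12, sampling_strategy=1.0)
x_res,y_res = se.fit_resample(df_features,df_target)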

x_train_res, x_val_res, y_train_res, y_val_res = train_test_split(x_res,
                                                    y_res,
                                                    test_size = .3,
                                                    random_state=12)

lr = LogisticRegression()
lr.fit(x_train_res,y_train_res)
y_pred = lr.predict(x_val_res)
tn, fp, fn, tp = confusion_matrix(y_val_res, y_pred).ravel()
print("Sensitivity/Recall is:",tp/(tp+fn))
print("Specificity is:",tn/(tn+fp))



Sensitivity/Recall is: 0.918257344066744
Specificity is: 0.9708363687636594


The combination of SMOTE and ENN does not improve on the sensitivity from SMOTE alone (0.9183 versus 0.9192).
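Edited Nearest Neighbours cleans a dataset by dropping points whose label disagrees with their neighbourhood. The sketch below uses the majority-vote variant of the rule on hypothetical toy data (libraries may default to a stricter variant that requires all neighbours to agree); `edited_nearest_neighbours` is an illustrative helper.

```python
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Return indices kept by ENN: drop points whose label disagrees with
    the majority label among their k nearest neighbours."""
    keep = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep

# Hypothetical 1-D toy data: one class-1 point (index 3) sits inside class 0.
points = [(0.0,), (0.2,), (0.4,), (0.5,), (3.0,), (3.2,), (3.4,)]
labels = [0, 0, 0, 1, 1, 1, 1]
kept_idx = edited_nearest_neighbours(points, labels)
print(kept_idx)
```

Only the isolated point at 0.5 is removed. When SMOTE output contains few such isolated points, ENN has little effect, which matches the nearly unchanged scores above.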

#### Though sensitivity is highest with cluster-based undersampling, undersampling could still cause loss of information in the dataset.

#### Therefore, I will try building models on the data from cluster-based undersampling and from SMOTE separately.