We are told that the V columns (V1-V28) may be the result of a PCA dimensionality reduction, applied to protect user identities and sensitive features.

Time: Number of seconds elapsed between this transaction and the first transaction in the dataset

Amount: Transaction amount

Class: 1 for fraudulent transactions, 0 otherwise

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('creditcard.csv')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.isnull().any()

In [None]:
df.describe()

In [None]:
print('Fraud Count: {}\nNon-Fraud Count: {}'.format(df['Class'].value_counts()[1], df['Class'].value_counts()[0]))
print('Fraud: {:.3%}\nNon-Fraud: {:.3%}'.format(df['Class'].value_counts()[1] / len(df), 
                                                df['Class'].value_counts()[0] / len(df)))

df['Class'].value_counts().plot(kind='bar')

Let's look at the distribution of Amount and Time to determine how skewed these features are.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(18, 4))

amount_val = df['Amount'].values
time_val = df['Time'].values

sns.distplot(amount_val, ax=ax[0], color='r')
ax[0].set_title('Distribution of Transaction Amount', fontsize=14)
ax[0].set_xlim([min(amount_val), max(amount_val)])
ax[0].set_xlabel('Amount ($)')

sns.distplot(time_val, ax=ax[1], color='b')
ax[1].set_title('Distribution of Transaction Time', fontsize=14)
ax[1].set_xlim([min(time_val), max(time_val)])
ax[1].set_xlabel('Time (s)')



plt.show()

### Scaling & Distribution
Note that the features are spread over very different ranges (i.e. they do not all share a common interval such as [0, 10]).

So that features with large ranges do not bias scale-sensitive models, we should scale Time and Amount.

- StandardScaler subtracts the mean and divides by the standard deviation, so the transformed feature has a mean of 0 and a standard deviation of 1.
- RobustScaler subtracts the median and divides by the interquartile range (IQR), which makes it less sensitive to outliers.
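
To make the difference concrete, here is a minimal sketch that applies both formulas by hand with NumPy. The amounts are made-up values with one extreme outlier, not rows from the dataset, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical transaction amounts with one extreme outlier
amounts = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Standard scaling: subtract the mean, divide by the standard deviation
standard = (amounts - amounts.mean()) / amounts.std()

# Robust scaling: subtract the median, divide by the interquartile range
q25, q50, q75 = np.percentile(amounts, [25, 50, 75])
robust = (amounts - q50) / (q75 - q25)

print(np.round(standard, 3))
print(np.round(robust, 3))
```

With standard scaling the outlier inflates the standard deviation, so the four ordinary values end up squeezed very close together. With robust scaling they keep an even spacing around 0 and only the outlier lands far away.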

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler

# RobustScaler is less prone to outliers.

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1, 1))

df.drop(['Time', 'Amount'], axis=1, inplace=True)

### Sub-Sample of Data

Fewer than 0.2% of the transactions are fraudulent. So that the model sees both classes equally often, we will build a balanced subsample containing the same number of fraudulent and non-fraudulent transactions.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

print('No Frauds {:.3%}'.format(df['Class'].value_counts()[0] / len(df)))
print('Frauds {:.3%}'.format(df['Class'].value_counts()[1] / len(df)))

X = df.drop('Class', axis=1)
y = df['Class']

sss = KFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in sss.split(X, y):
    print('Train: {}     Test: {}'.format(train_index, test_index))
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]
    

# Turn into an array
original_Xtrain = original_Xtrain.values
original_Xtest = original_Xtest.values
original_ytrain = original_ytrain.values
original_ytest = original_ytest.values

# See if both the train and test label distribution are similarly distributed
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
print('-' * 100)

print('\nLabel Distributions:')
print(train_counts_label / len(original_ytrain))
print(test_counts_label / len(original_ytest))

#### Random Undersampling

Random undersampling removes samples from the majority class, with or without replacement. It is a simple way to reduce class imbalance, but it may increase the variance of the classifier because potentially useful majority-class samples are discarded.
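
As a rough sketch of the idea, the helper below balances any labelled DataFrame by sampling each class down to the size of the rarest one. The function name `random_undersample` and the synthetic `toy` frame are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def random_undersample(frame, label_col="Class", random_state=0):
    """Return a shuffled frame in which every class has as many rows as the rarest class."""
    n_minority = frame[label_col].value_counts().min()
    # Sample n_minority rows from each class without replacement
    parts = [group.sample(n=n_minority, random_state=random_state)
             for _, group in frame.groupby(label_col)]
    # Shuffle so the classes are interleaved
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Synthetic stand-in for the transaction table: 1,000 rows, 2% labelled as fraud
rng = np.random.default_rng(0)
toy = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [1] * 20 + [0] * 980})

balanced_toy = random_undersample(toy)
print(balanced_toy["Class"].value_counts())
```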

In [None]:
# Balance the classes 50/50: keep every fraud row and an equal number of non-fraud rows.

# First shuffle the data (frac=1 means 100% of the dataframe is sampled)
df = df.sample(frac=1)

# All fraudulent rows (492 rows) and the same number of non-fraudulent rows
fraud_df = df.loc[df['Class'] == 1]
non_fraud_df = df.loc[df['Class'] == 0][:len(fraud_df)]

normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

# Shuffle df again
new_df = normal_distributed_df.sample(frac=1, random_state=42)

new_df.head()

In [None]:
new_df['Class'].value_counts().plot(kind='barh', title='Number of Classes')

We have evenly distributed the subset.

#### Correlation Matrix (Heatmap)

In [None]:
#### Pearson Correlation of Features
dfCorr = df.corr()
new_dfCorr = new_df.corr()


# Original df
fig, ax = plt.subplots(2, 1, figsize=(24, 24))
a = sns.heatmap(dfCorr, cmap='coolwarm',         # Hotter = Positive, Colder = Negative
            annot=False,
            vmin=-1,                             # set the minimum value of the color scale to -1
            linewidths=2,
            linecolor='white',
           ax=ax[0])
bottom, top = a.get_ylim()
a.set_ylim(bottom + 0.5, top - 0.5)
a.set_title("Pearson Correlation - Imbalanced Dataset", size=16)  # plt.title would target the last subplot
a.tick_params(axis='both', which='major', rotation=45, labelsize=10)


# New df
b = sns.heatmap(new_dfCorr, cmap='coolwarm',     # Hotter = Positive, Colder = Negative
            annot=False,
            vmin=-1,                             # set the minimum value of the color scale to -1
            linewidths=2,
            linecolor='white',
           ax=ax[1])
bottom, top = b.get_ylim()
b.set_ylim(bottom + 0.5, top - 0.5)
b.set_title("Pearson Correlation - Balanced Dataset", size=16)
b.tick_params(axis='both', which='major', rotation=45, labelsize=10)

Notice how the 'Class' row has a few strong correlations. The following features may therefore help distinguish fraudulent transactions:

- V4
- V3
- V7
- V9
- V10
- V11
- V12
- V14
- V16
- V17
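
Rather than reading the list off the heatmap, the ranking could be computed directly. The sketch below does this on synthetic data; the columns `A`, `B` and `C` are invented for the example, and in the notebook the same two lines would be applied to the balanced dataframe instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=500)

# Synthetic features: "A" falls with the label, "B" rises with it, "C" is pure noise
toy = pd.DataFrame({
    "A": -3.0 * labels + rng.normal(size=500),
    "B": 1.0 * labels + rng.normal(size=500),
    "C": rng.normal(size=500),
    "Class": labels,
})

# Pearson correlation of every feature with the target, strongest first
class_corr = toy.corr()["Class"].drop("Class")
ranked = class_corr.reindex(class_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value keeps strongly negative and strongly positive features together, which matters here because the sign only tells us the direction of the relationship.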

In [None]:
# Box plots of the features most strongly correlated with Class
fig, ax = plt.subplots(nrows=4, ncols=3, figsize=(24,24))

sns.boxplot(x="Class", y="V3", data=new_df, ax=ax[0, 0]).set_title('V3')
sns.boxplot(x="Class", y="V4", data=new_df, ax=ax[0, 1]).set_title('V4')
sns.boxplot(x="Class", y="V7", data=new_df, ax=ax[0, 2]).set_title('V7')
sns.boxplot(x="Class", y="V9", data=new_df, ax=ax[1, 0]).set_title('V9')
sns.boxplot(x="Class", y="V10", data=new_df, ax=ax[1, 1]).set_title('V10')
sns.boxplot(x="Class", y="V11", data=new_df, ax=ax[1, 2]).set_title('V11')
sns.boxplot(x="Class", y="V12", data=new_df, ax=ax[2, 0]).set_title('V12')
sns.boxplot(x="Class", y="V14", data=new_df, ax=ax[2, 1]).set_title('V14')
sns.boxplot(x="Class", y="V16", data=new_df, ax=ax[2, 2]).set_title('V16')
sns.boxplot(x="Class", y="V17", data=new_df, ax=ax[3, 0]).set_title('V17')
fig.delaxes(ax[3, 1])
fig.delaxes(ax[3, 2])

The goal now is to remove the outliers visible in these boxplots: not necessarily all of them, but at least the extreme ones. This should help the model generalize better.

'V14' looks nicely centered and appears to be roughly normally distributed. 'V12' and 'V10' look more skewed.

In [None]:
from scipy.stats import norm

f, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(20, 6))

v14_fraud_dist = new_df['V14'].loc[new_df['Class'] == 1].values
sns.distplot(v14_fraud_dist, ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = new_df['V12'].loc[new_df['Class'] == 1].values
sns.distplot(v12_fraud_dist, ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)


v10_fraud_dist = new_df['V10'].loc[new_df['Class'] == 1].values
sns.distplot(v10_fraud_dist, ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)

plt.show()

#### Removing Outliers

In [None]:
# V14
# Apply the standard 1.5 * IQR rule to the fraud transactions

v14_fraud = new_df['V14'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(v14_fraud, 25), np.percentile(v14_fraud, 75)
print('Quartile 25: {:.3f}\nQuartile 75: {:.3f}'.format(q25, q75))
v14_iqr = q75 - q25
print('IQR: {:.3f}'.format(v14_iqr))

v14_cut_off = v14_iqr * 1.5
v14_lower, v14_upper = q25 - v14_cut_off, q75 + v14_cut_off
print('\nCut off: {:.3f}\nLower Bound: {:.3f}\nUpper Bound:{:.3f}'.format(v14_cut_off, v14_lower, v14_upper))

# Uncomment the next line to actually drop the rows outside the bounds
# new_df = new_df.drop(new_df[(new_df['V14'] > v14_upper) | (new_df['V14'] < v14_lower)].index)
print('----' * 10)
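
The same rule could be wrapped in a small helper and reused for other features such as 'V12' and 'V10'. This is only a sketch: `iqr_bounds` is a hypothetical name and the sample values are made up.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds located k * IQR outside the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = k * (q75 - q25)
    return q25 - cut_off, q75 + cut_off

# Made-up feature values with one extreme negative outlier
sample = np.array([-9.0, -8.5, -8.0, -7.5, -7.0, -6.5, -6.0, -30.0])

lower, upper = iqr_bounds(sample)
kept = sample[(sample >= lower) & (sample <= upper)]
print(lower, upper, kept)
```

Raising `k` (for example to 3) keeps more points and removes only the most extreme ones, which is one way to trade lost information against cleaner data.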

## Dimensionality Reduction

t-SNE is a non-linear technique that works well for visualizing data that forms distinct clusters.

PCA and SVD are closely related. Both compute an orthogonal transformation that decorrelates the variables, and both keep the directions with the largest variance.
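
The relationship between the two can be checked numerically. The sketch below uses random synthetic data and plain NumPy (not the scikit-learn classes used in the notebook) to project onto two components in both ways.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic features: 200 samples, 5 columns
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Xc = X - X.mean(axis=0)  # PCA works on mean-centered data

# Route 1: eigen-decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]          # largest variance first
pca_scores = Xc @ eigvecs[:, order[:2]]

# Route 2: singular value decomposition of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_scores = U[:, :2] * S[:2]

# The two projections agree up to the sign of each component
print(np.allclose(np.abs(pca_scores), np.abs(svd_scores)))
```

The eigenvalues of the covariance matrix equal the squared singular values divided by `n - 1`, which is why both routes rank the directions identically.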

In [None]:
X = new_df.drop('Class', axis=1)
y = new_df['Class']

# t-SNE
from sklearn.manifold import TSNE

X_reduced_tsne = TSNE(n_components=2, random_state=42).fit_transform(X.values)

# PCA
from sklearn.decomposition import PCA

X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X.values)

# Truncated SVD 
from sklearn.decomposition import TruncatedSVD

X_reduced_svd = TruncatedSVD(n_components=2, algorithm='randomized', random_state=42).fit_transform(X.values)

In [None]:
X_reduced_tsne[:3]

In [None]:
X_reduced_pca[:3]

In [None]:
import matplotlib.patches as mpatches

blue_patch = mpatches.Patch(color='#0A0AFF', label='No Fraud')
red_patch = mpatches.Patch(color='#AF0000', label='Fraud')

fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 4))

# t-SNE
ax[0].scatter(X_reduced_tsne[:, 0], X_reduced_tsne[:, 1], c=(y == 1), cmap='coolwarm', linewidths=2)  # fraud = red, matching the legend
ax[0].set_title('t-SNE', fontsize=14)
ax[0].grid(True)
ax[0].legend(handles=[blue_patch, red_patch])

# PCA
ax[1].scatter(X_reduced_pca[:, 0], X_reduced_pca[:, 1], c=(y == 1), cmap='coolwarm', linewidths=2)  # fraud = red, matching the legend
ax[1].set_title('PCA', fontsize=14)
ax[1].grid(True)
ax[1].legend(handles=[blue_patch, red_patch])

# SVD
ax[2].scatter(X_reduced_svd[:, 0], X_reduced_svd[:, 1], c=(y == 1), cmap='coolwarm', linewidths=2)  # fraud = red, matching the legend
ax[2].set_title('SVD', fontsize=14)
ax[2].grid(True)
ax[2].legend(handles=[blue_patch, red_patch])

Notice how PCA and SVD yield very similar results. This is expected: PCA amounts to an SVD of the mean-centered data, whereas TruncatedSVD factorizes the data without centering it first.

### Model

In [None]:
X = new_df.drop('Class', axis=1)
y = new_df['Class']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Turn the values into arrays
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    'LogisticRegression' : LogisticRegression(),
    'KNN' : KNeighborsClassifier(),
    'SVC' : SVC(),
    'DecisionTree' : DecisionTreeClassifier()
}

for key, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    trainingScore = cross_val_score(classifier, X_train, y_train, cv=5)
    print('{}\nTraining Score: {:.3%}\n'.format(key, trainingScore.mean()))

##### GridSearchCV - Optimal Parameters

In [None]:
from sklearn.model_selection import GridSearchCV

# [LogisticRegression Parameters]
log_reg_params = {'penalty' : ['l1', 'l2'],
                  'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# The 'liblinear' solver supports both the l1 and l2 penalties in the grid above
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
grid_log_reg.fit(X_train, y_train)
# LR best parameters
log_reg = grid_log_reg.best_estimator_ 


# [KNN Parameters]
knears_params = {'n_neighbors': list(range(2,5,1)), 
                 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}

grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params)
grid_knears.fit(X_train, y_train)
# KNN best parameters
knears_neighbors = grid_knears.best_estimator_



# [SVC Parameters]
svc_params = {'C': [0.5, 0.7, 0.9, 1], 
              'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

# SVC best parameters
svc = grid_svc.best_estimator_



# [DecisionTree Parameters]
tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)), 
              "min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)

# DT best parameters
tree_clf = grid_tree.best_estimator_

In [None]:
# Overfitting Case

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5)
print('Logistic Regression Cross Validation Score: {:.3%}'.format(log_reg_score.mean()))


knears_score = cross_val_score(knears_neighbors, X_train, y_train, cv=5)
print('Knears Neighbors Cross Validation Score: {:.3%}'.format(knears_score.mean()))

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('Support Vector Classifier Cross Validation Score: {:.3%}'.format(svc_score.mean()))

tree_score = cross_val_score(tree_clf, X_train, y_train, cv=5)
print('DecisionTree Classifier Cross Validation Score: {:.3%}'.format(tree_score.mean()))

### Learning Curves:

- The wider the gap between the training score and the cross validation score, the more likely your model is overfitting (high variance).
- If the score is low in both training and cross-validation sets this is an indication that our model is underfitting (high bias)
- Logistic Regression Classifier shows the best score in both training and cross-validating sets.
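
The first two rules can be turned into a rough check on the last point of a learning curve. In this sketch the function name, the two thresholds and the score arrays are all assumptions chosen for illustration; sensible cut-offs depend on the problem.

```python
import numpy as np

def summarize_learning_curve(train_scores, val_scores, gap_tol=0.05, low_score=0.80):
    """Label the final point of a learning curve; both thresholds are arbitrary assumptions."""
    train_final = float(np.mean(train_scores[-1]))
    val_final = float(np.mean(val_scores[-1]))
    gap = train_final - val_final
    if gap > gap_tol:
        verdict = "likely overfitting (high variance)"
    elif max(train_final, val_final) < low_score:
        verdict = "likely underfitting (high bias)"
    else:
        verdict = "reasonable fit"
    return gap, verdict

# Hypothetical scores shaped like learning_curve output: (n_train_sizes, n_cv_folds)
train_scores = np.array([[1.00, 1.00, 1.00], [0.99, 0.98, 0.99], [0.97, 0.98, 0.97]])
val_scores = np.array([[0.80, 0.82, 0.78], [0.85, 0.86, 0.84], [0.88, 0.87, 0.89]])

gap, verdict = summarize_learning_curve(train_scores, val_scores)
print(round(gap, 3), verdict)
```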


##### Random Undersampling

- Randomly remove samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in the dataset, however, it may increase the variance of the classifier and may potentially discard useful or important samples.
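
When undersampling is combined with cross-validation, the resampling should touch only the training part of each fold, so that every validation fold keeps the original class imbalance. The sketch below shows that pattern with plain NumPy on synthetic labels; `undersample_indices` is a hypothetical helper, and it uses random undersampling rather than the NearMiss method applied in the next cell.

```python
import numpy as np

def undersample_indices(y, indices, rng):
    """Pick a class-balanced subset of `indices` by dropping majority-class rows at random."""
    pos = indices[y[indices] == 1]
    neg = indices[y[indices] == 0]
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    return rng.permutation(keep)

rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)                 # synthetic labels, 5% positive
folds = np.array_split(rng.permutation(len(y)), 5)

for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    balanced_train_idx = undersample_indices(y, train_idx, rng)
    # A model would be fit on balanced_train_idx and scored on the untouched val_idx
    print(i, y[balanced_train_idx].mean(), round(y[val_idx].mean(), 3))
```

Resampling before splitting would instead let the evaluation run on artificially balanced data, which tends to overstate performance on the real, imbalanced distribution.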

In [None]:
from imblearn.under_sampling import NearMiss

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report

from collections import Counter

from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.metrics import classification_report_imbalanced

# We will undersample during cross validating
undersample_X = df.drop('Class', axis=1)
undersample_y = df['Class']

# Gather data from our KFold cross-validator
for train_index, test_index in sss.split(undersample_X, undersample_y):
#     print("Train:", train_index, "Test:", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]
    
    
# Set undersampled data to arrays
undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values 

# Define metrics arrays
undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

# Implementing NearMiss Technique 
# Distribution of NearMiss (just to see how it distributes the labels we won't use these variables)
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)
print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))

# Cross-validating the right way: NearMiss runs inside each fold, on the training part only
for train, test in sss.split(undersample_Xtrain, undersample_ytrain):
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])
    
    # Score against the labels of the same rows the predictions were made on.
    # (df was reshuffled after original_Xtrain was built, so those rows no longer line up.)
    undersample_accuracy.append(undersample_pipeline.score(undersample_Xtrain[test], undersample_ytrain[test]))
    undersample_precision.append(precision_score(undersample_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(undersample_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(undersample_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(undersample_ytrain[test], undersample_prediction))

In [None]:
# Logistic Regression Learning Curves

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator1, estimator2, estimator3, estimator4, X, y, ylim=None, cv=None, n_jobs=1, 
                        train_sizes=np.linspace(0.1, 1.0, 5)):
    
    # Plot all 4 estimators on a 2x2 grid with a shared y-axis
    fig, ax = plt.subplots(2, 2, figsize=(20, 14), sharey=True)
    
    if ylim is not None:
        plt.ylim(*ylim)  # sharey=True applies the limits to every panel
    
    panels = [(estimator1, "Logistic Regression Learning Curve"),
              (estimator2, "KNN Learning Curve"),
              (estimator3, "SVM Learning Curve"),
              (estimator4, "Decision Tree Learning Curve")]
    
    for axis, (estimator, title) in zip(ax.ravel(), panels):
        sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        
        # Shade one standard deviation around each mean curve
        axis.fill_between(sizes, train_scores_mean - train_scores_std,
                          train_scores_mean + train_scores_std, alpha=0.1, color="#FF9124")
        axis.fill_between(sizes, test_scores_mean - test_scores_std,
                          test_scores_mean + test_scores_std, alpha=0.1, color="#2492FF")
        
        # Labels are needed so that legend() has entries to show
        axis.plot(sizes, train_scores_mean, 'o-', color="#FF9124", label="Training score")
        axis.plot(sizes, test_scores_mean, 'o-', color="#2492FF", label="Cross-validation score")
        axis.set_title(title, fontsize=14)
        axis.set_xlabel('Training size (m)')
        axis.set_ylabel('Score')
        axis.grid(True)
        axis.legend(loc="best")

In [None]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
plot_learning_curve(log_reg, knears_neighbors, svc, tree_clf, X_train, y_train, (0.87, 1.01), cv=cv, n_jobs=4)

The general behavior we would expect from a learning curve is this:

- A model of a given complexity will __overfit__ a small dataset: this means the training score will be relatively high, while the validation score will be relatively low.

- A model of a given complexity will __underfit__ a large dataset: this means that the
training score will decrease, but the validation score will increase.

- A model will never, except by chance, give a better score to the validation set than
the training set: this means the curves should keep getting closer together but
never cross.

A decision function is a function which takes a dataset as input and gives a decision as output. What the decision can be depends on the problem at hand. In our scenario, we will attempt to classify an output.

- Estimation problems: the "decision" is the estimate.
- Hypothesis testing problems: the decision is to reject or not reject the null hypothesis.
- Classification problems: the decision is to classify a new observation (or observations) into a category.
- Model selection problems: the decision is to choose one of the candidate models.
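
For the classification case, a linear classifier's decision function is simply a signed score that is then thresholded into a class. The sketch below illustrates this with made-up weights and inputs; a fitted scikit-learn linear model exposes comparable scores through its `decision_function` method.

```python
import numpy as np

def decision_function(X, weights, bias):
    """Signed score of a linear classifier: a positive score means the positive class."""
    return X @ weights + bias

# Hypothetical weights and bias, chosen only for illustration
weights = np.array([1.5, -2.0])
bias = -0.5

X_new = np.array([[2.0, 0.5], [0.0, 1.0], [1.0, 0.0]])
scores = decision_function(X_new, weights, bias)

# The decision: classify as positive whenever the score is above 0
labels = (scores > 0).astype(int)
print(scores, labels)
```

Keeping the raw scores rather than only the labels is what makes threshold-based tools such as ROC and precision-recall curves possible.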


### Our goal now is to predict the output from our cross validation as opposed to just the scores. Hence we will use cross_val_predict and not cross_val_score.

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Create dataframe with all scores and classifier names
log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5, method='decision_function')

knears_pred = cross_val_predict(knears_neighbors, X_train, y_train, cv=5)

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5,
                             method="decision_function")

tree_pred = cross_val_predict(tree_clf, X_train, y_train, cv=5)



# Take predicted results and find roc_auc_score
print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))

#### Plot ROC Curve

In [None]:
from sklearn.metrics import roc_curve

log_fpr, log_tpr, log_threshold = roc_curve(y_train, log_reg_pred)
knear_fpr, knear_tpr, knear_threshold = roc_curve(y_train, knears_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)


def graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr):
    plt.figure(figsize=(16,8))
    plt.title('ROC Curve \n Top 4 Classifiers', fontsize=18)
    plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train, log_reg_pred)))
    plt.plot(knear_fpr, knear_tpr, label='KNears Neighbors Classifier Score: {:.4f}'.format(roc_auc_score(y_train, knears_pred)))
    plt.plot(svc_fpr, svc_tpr, label='Support Vector Classifier Score: {:.4f}'.format(roc_auc_score(y_train, svc_pred)))
    plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train, tree_pred)))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([-0.01, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('Minimum ROC Score of 50% \n (This is the minimum score to get)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
                arrowprops=dict(facecolor='#6E726D', shrink=0.05),
                )
    plt.legend()
    
graph_roc_curve_multiple(log_fpr, log_tpr, knear_fpr, knear_tpr, svc_fpr, svc_tpr, tree_fpr, tree_tpr)
plt.show()

**True Positives**: Fraud transactions correctly classified as fraud

**False Positives**: Non-fraud transactions incorrectly classified as fraud

**True Negatives**: Non-fraud transactions correctly classified as non-fraud

**False Negatives**: Fraud transactions incorrectly classified as non-fraud

**Precision**: True Positives/(True Positives + False Positives)

**Recall**: True Positives/(True Positives + False Negatives)
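
A small worked example may help fix the definitions. The labels and predictions below are invented (1 = fraud, 0 = non-fraud), and `confusion_counts` is a hypothetical helper written in plain Python.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

# Invented labels: 4 fraud cases and 6 non-fraud cases
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud transactions that were flagged
print(tp, fp, tn, fn, precision, recall)
```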

**Precision** tells us how reliable the model's fraud predictions are (what share of the flagged transactions are truly fraudulent), while **recall** is the share of actual fraud cases that the model is able to detect.

**Precision/Recall Tradeoff**: The more selective (precise) our model is, the fewer cases it will detect.

Example: suppose the model flags a transaction only when it is at least 95% confident that it is fraud, and only 5 cases reach that level. If another 5 cases sit at around 90% confidence, lowering the threshold to 90% lets the model detect them as well, which raises recall at the likely cost of some precision.

Precision starts to descend between 0.90 and 0.92; nevertheless, our precision score is still pretty high and we still have a decent recall score.
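
The tradeoff can be seen by sweeping the decision threshold over a set of scores. The scores and labels below are hypothetical, and `precision_recall_at` is an illustrative helper rather than a scikit-learn function.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

# Hypothetical predicted fraud probabilities and true labels
scores = np.array([0.99, 0.97, 0.95, 0.92, 0.90, 0.85, 0.60, 0.40, 0.20, 0.10])
y_true = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 0])

for threshold in (0.95, 0.90, 0.50):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold drops, more transactions are flagged, so recall rises while precision falls.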

In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

precision, recall, threshold = precision_recall_curve(y_train, log_reg_pred)

y_pred = log_reg.predict(X_train)


# Overfitting Case
print('---' * 10)
print('Overfitting: \n')
print('Recall Score: {:.2f}'.format(recall_score(y_train, y_pred)))
print('Precision Score: {:.2f}'.format(precision_score(y_train, y_pred)))
print('F1 Score: {:.2f}'.format(f1_score(y_train, y_pred)))
print('Accuracy Score: {:.2f}'.format(accuracy_score(y_train, y_pred)))
print('---' * 10)

# What it should look like
print('---' * 10)
print('How it should be:\n')
print("Accuracy Score: {:.2f}".format(np.mean(undersample_accuracy)))
print("Precision Score: {:.2f}".format(np.mean(undersample_precision)))
print("Recall Score: {:.2f}".format(np.mean(undersample_recall)))
print("F1 Score: {:.2f}".format(np.mean(undersample_f1)))
print('---' * 10)

In [None]:
# Weighted mean of precisions achieved at each threshold (average precision score)

undersample_y_score = log_reg.decision_function(original_Xtest)

from sklearn.metrics import average_precision_score

undersample_average_precision = average_precision_score(original_ytest, undersample_y_score)
print("Average precision-recall score: {:.3f}".format(undersample_average_precision))
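
For reference, the average precision can be written as AP = sum over n of (R_n - R_{n-1}) * P_n, where P_n and R_n are the precision and recall at the n-th threshold. The sketch below implements that sum with NumPy on a tiny invented example; it assumes there are no tied scores and is not meant to replace `average_precision_score`.

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = sum_n (R_n - R_{n-1}) * P_n, assuming no tied scores."""
    order = np.argsort(-scores)                    # highest score first
    hits = y_true[order]
    tp = np.cumsum(hits)                           # true positives among the top n
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / hits.sum()
    recall_steps = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(recall_steps * precision))

# Invented labels and scores
y_true = np.array([1, 0, 1, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.1])

ap = average_precision(y_true, scores)
print(round(ap, 4))
```

Recall only increases at positions where a true fraud case is found, so the sum effectively averages the precision measured at each of those positions.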
