<h1> A Fraud Detection Case Study </h1>

<h2> Introduction </h2>
<p> In this case study, we work with a dataset of credit card transactions made over a period of two days. It contains 492 frauds out of 284,807 transactions, so it is highly imbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. </p>
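
<p> As a quick illustration of what this imbalance implies, the sketch below recomputes the fraud rate from the two figures quoted above and derives the accuracy of a trivial model that labels every transaction as legitimate. The variable names are illustrative only. </p>

```python
# Figures quoted in the introduction
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 'legitimate': {baseline_accuracy:.3%}")
```

<p> Because such a trivial model already scores above 99.8% accuracy, accuracy alone says little about how well frauds are detected. </p>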

<h2> Exploratory Analysis </h2>
<p> Let's start by importing the necessary libraries and loading the dataset. </p>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

In [None]:
df = pd.read_csv('creditcard.csv')
df.head()

In [None]:
print(df.shape)
print(df.columns)
df.info()  # prints its summary directly and returns None, so it is not wrapped in print()
df.describe()

In [None]:
# check for outliers and skewness in the data
df.hist(figsize=(20,20))
plt.show()

In [None]:
#check for missing values
df.isnull().sum()

<h3> Conclusion </h3>
<p> The data contains no missing values and is ready to be used for modeling. </p>

In [None]:
# correlation matrix
corrmat = df.corr()
fig = plt.figure(figsize = (12, 9))
sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()

<h3> Conclusion </h3>
<p> The heatmap shows no strong linear correlation between any individual feature and the target. </p>

<h2> Try the model </h2>
<p> Now that we have explored the dataset, let's train a model to predict the target. </p>

In [None]:
# scale Amount and Time with RobustScaler (less sensitive to outliers); the other features are left unchanged
from sklearn.preprocessing import RobustScaler
df['normalizedAmount'] = RobustScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df['normalizedTime'] = RobustScaler().fit_transform(df['Time'].values.reshape(-1,1))
df = df.drop(['Amount'], axis = 1)
df = df.drop(['Time'], axis = 1)
df.hist(figsize=(20,20))
plt.show()
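
<p> To make the scaling step concrete, the sketch below checks on a few made-up amounts that <code>RobustScaler</code> with its default settings subtracts the median and divides by the interquartile range. The sample values are hypothetical and are not taken from the dataset. </p>

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical amounts with one extreme value, standing in for the Amount column
amounts = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 5000.0]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(amounts)

# Manual equivalent: subtract the median, divide by the interquartile range
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

<p> Since the median and the quartiles barely move when a single extreme value is added, the bulk of the data keeps a comparable scale even in the presence of outliers. </p>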

In [None]:
#split the data into train and test
X = df.drop(['Class'], axis = 1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.3, random_state = 101)
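
<p> With so few frauds, a stratified split is one way to keep the fraud rate identical in each subset. The sketch below illustrates this on synthetic labels using the <code>stratify</code> parameter of <code>train_test_split</code>; the sample sizes and variable names are assumptions made for the example. </p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1,000 samples with 2% positives (illustrative numbers only)
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

# With stratify, both subsets keep the original positive rate
print("train positive rate:", y_tr.mean())
print("test positive rate :", y_te.mean())
```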

In [None]:

# grid search over a logistic regression pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('classifier', LogisticRegression())])

param_grid = [
    {'classifier': [LogisticRegression()],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__C': np.logspace(-4, 4, 20),
     'classifier__solver': ['liblinear']},  # liblinear supports both l1 and l2
]

# no scoring argument is given, so the grid search ranks candidates by accuracy
parm = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
best_model = parm.fit(X_train, y_train)
print("The best estimator for the model is:", best_model.best_estimator_)
print("The accuracy of the model on the test set is:", best_model.score(X_test, y_test))
print("The best parameters for the model are:", best_model.best_params_)
print("The best cross-validation score for the model is:", best_model.best_score_)
print("The best index for the model is:", best_model.best_index_)
print("The cross-validation results are:", best_model.cv_results_)
# print the confusion matrix
y_pred = best_model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

In [None]:
# print classification report
print(classification_report(y_test, y_pred))

In [None]:
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
plt.figure(figsize=(5, 5))
sns.heatmap(conf_mat, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.show()

<h3> Conclusion </h3>
<p> The model predicts the legitimate transactions very well, but it performs poorly on the fraudulent transactions. </p>
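
<p> The gap between overall accuracy and fraud detection can be seen with a small, entirely hypothetical set of labels and predictions (not the results of the model above). The sketch compares accuracy with precision and recall on the positive class. </p>

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set: 10,000 transactions, 20 of them fraudulent
y_true = np.array([1] * 20 + [0] * 9_980)

# Hypothetical predictions: only 8 of the 20 frauds are caught,
# and 2 legitimate transactions are flagged by mistake
y_hat = np.array([1] * 8 + [0] * 12 + [1] * 2 + [0] * 9_978)

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)

print("accuracy :", acc)
print("precision:", prec)
print("recall   :", rec)
```

<p> In this made-up case the accuracy is close to 1 even though more than half of the frauds are missed, which is why recall on the fraud class is the number to watch. </p>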

In [None]:
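# plot the ROC curve
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = roc_auc_score(y_test, y_pred_proba)  # named roc_auc to avoid shadowing sklearn.metrics.auc
plt.plot(fpr, tpr, label="auc=" + str(roc_auc))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc=4)
plt.show()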
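
<p> When positives are rare, the precision-recall curve and the average precision score are often a more informative summary than the ROC curve alone. The sketch below computes both on synthetic imbalanced data generated with <code>make_classification</code>; the data, the classifier settings, and the variable names are assumptions for illustration and do not reproduce the model above. </p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 1% positives), standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99], random_state=101
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=101
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Precision and recall at every decision threshold
precision, recall, thresholds = precision_recall_curve(y_te, scores)

roc_auc_demo = roc_auc_score(y_te, scores)
ap_demo = average_precision_score(y_te, scores)

print("ROC AUC          :", roc_auc_demo)
print("Average precision:", ap_demo)
```

<p> The <code>precision</code> and <code>recall</code> arrays can be passed to <code>plt.plot(recall, precision)</code> to draw the curve in the same way as the ROC plot. </p>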

<h2> Deal with the unbalanced data </h2>
<p> We will address the class imbalance by oversampling the minority class with SMOTE (Synthetic Minority Over-sampling Technique). </p>
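
<p> The core idea of SMOTE is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The NumPy sketch below is a simplified illustration of that idea, not the imbalanced-learn implementation; the helper <code>smote_like_samples</code> and the random points are hypothetical. </p>

```python
import numpy as np

rng = np.random.default_rng(101)

def smote_like_samples(X_min, n_new, k=5):
    """Hypothetical helper: build n_new synthetic points from minority samples X_min."""
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_new)                    # pick a minority sample
    nb = neighbours[base, rng.integers(0, k, size=n_new)]    # pick one of its neighbours
    lam = rng.random((n_new, 1))                             # interpolation factor in [0, 1)
    # New point lies on the segment between the sample and its neighbour
    return X_min[base] + lam * (X_min[nb] - X_min[base])

X_min = rng.normal(size=(20, 2))  # 20 hypothetical minority-class points
X_new = smote_like_samples(X_min, n_new=100)
print(X_new.shape)
```

<p> Every synthetic point lies between two real minority samples, so the minority class becomes denser without exact duplicates being added. </p>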

In [None]:
# import SMOTE to handle the imbalanced data
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=101)
# oversample the training set only; fit_resample replaces the older fit_sample method
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)

log = LogisticRegression()
log.fit(X_train_res, y_train_res)
y_pred = log.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

# plot the confusion matrix
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
plt.figure(figsize=(5, 5))
sns.heatmap(conf_mat, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
plt.show()


In [None]:
# Vectorized dense layer: performs the layer calculations for a matrix of examples using np.matmul()

def my_dense_v(A_in, W, b, g):
    """
    Computes dense layer
    Args:
      A_in (ndarray (m,n)) : Data, m examples, n features each
      W    (ndarray (n,j)) : Weight matrix, n features per unit, j units
      b    (ndarray (1,j)) : bias vector, j units
      g    activation function (e.g. sigmoid, relu..)
    Returns
      A_out (tf.Tensor or ndarray (m,j)) : m examples, j units
    """
    Z = np.matmul(A_in, W) + b  # linear part for all examples at once
    A_out = g(Z)                # element-wise activation
    return A_out
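
<p> A minimal usage sketch for the dense layer follows. The function body is repeated in compact form so the example runs on its own; the <code>sigmoid</code> helper, the input values, and the zero weights are assumptions chosen only to make the shapes and the output easy to check. </p>

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def my_dense_v(A_in, W, b, g):
    # Same computation as the function above, repeated so the example is self-contained
    return g(np.matmul(A_in, W) + b)

A_in = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 examples, 2 features
W = np.zeros((2, 4))  # 2 features, 4 units
b = np.zeros((1, 4))  # one bias per unit, broadcast over the examples

A_out = my_dense_v(A_in, W, b, sigmoid)
print(A_out.shape)
```

<p> With all weights and biases set to zero, every pre-activation is 0, so each entry of the output is sigmoid(0) = 0.5. </p>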