# Credit Card Fraud Detection Classification Model

### Logistic Regression

This project is a first attempt at building a classification model using the scikit-learn (`sklearn`) library.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()

In [None]:
df = pd.read_csv("creditcard.csv")[:40000]

In [None]:
# Number of fraud cases (the Class column is 1 for fraud, 0 otherwise)
fraud_cases = df["Class"].sum()
print("The number of identified cases of fraud is: {}".format(fraud_cases))

In [None]:
x = df.drop(columns=["Time", "Amount", "Class"]).values
y = df["Class"].values

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, make_scorer, roc_auc_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.33, random_state=2)
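With so few fraud cases, a purely random split can leave the test set with an unrepresentative share of them. The following is a minimal sketch, on made-up labels rather than the real data, of how the `stratify` argument of `train_test_split` keeps the class ratio similar in both parts. The names `X_demo` and `y_demo` are hypothetical, and using this option in the notebook would change the numbers reported further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 10 of them labelled as fraud (1%).
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo asks for roughly the same fraud ratio in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=2, stratify=y_demo
)

train_fraud = int(y_tr.sum())
test_fraud = int(y_te.sum())
print("Fraud cases in train / test:", train_fraud, "/", test_fraud)
```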

In [None]:
grid_reg = GridSearchCV(
        estimator=LogisticRegression(max_iter=1000, random_state=42),
        param_grid= {'class_weight': [{0:1, 1:v} for v in np.linspace(1,20,10)]},
        cv = 5,
        scoring={'Precision': make_scorer(precision_score), 'Recall': make_scorer(recall_score)},
        refit='Recall',
        n_jobs=-1
        )
model_reg = grid_reg.fit(X_train, y_train)
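Because the search is refit on recall, it can be useful to check which class weight was selected and what precision came with it. The sketch below does this on synthetic data generated with `make_classification`, not on the credit card data; the names `X_demo`, `y_demo`, and `demo_grid` and the smaller weight grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 5% positives) standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"class_weight": [{0: 1, 1: v} for v in (1, 5, 20)]},
    cv=5,
    scoring={
        "Precision": make_scorer(precision_score, zero_division=0),
        "Recall": make_scorer(recall_score),
    },
    refit="Recall",  # the best candidate is chosen by mean cross-validated recall
)
demo_grid.fit(X_demo, y_demo)

# Weight given to the positive class by the selected candidate.
best_weight = demo_grid.best_params_["class_weight"][1]
# Mean cross-validated recall and precision of that candidate.
best_recall = demo_grid.best_score_
best_precision = demo_grid.cv_results_["mean_test_Precision"][demo_grid.best_index_]

print("Selected weight for class 1:", best_weight)
print("Mean CV recall:", best_recall, "mean CV precision:", best_precision)
```

On the fitted search from this notebook, the same attributes (`best_params_`, `best_score_`, `best_index_`) would be read from `grid_reg`.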

In [None]:
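plt.figure(figsize=(12,4))
cv_results = pd.DataFrame(grid_reg.cv_results_)
# Weight assigned to the fraud class (label 1) by each grid candidate
fraud_weights = [w[1] for w in cv_results['param_class_weight']]
for score in ['mean_test_Precision','mean_test_Recall']:
    plt.plot(fraud_weights, cv_results[score], label=score)
plt.xlabel("Fraud class weight")
plt.ylabel("Mean cross-validated score")
plt.legend()
plt.show()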

The resulting confusion matrix shows that the model successfully spots most fraud cases and misses only 5 of them (false negatives). In addition, the model produces 21 false positives, a number low enough that a human operator could easily review them and correctly classify them as non-fraud.


In [None]:
prediction = grid_reg.predict(X_test)
conf_matrix = confusion_matrix(y_test, prediction)
ConfusionMatrixDisplay(conf_matrix, display_labels=["Not fraud", "Fraud"]).plot()
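Since it is easy to mix up false positives and false negatives when reading the plot, here is a small sketch of how scikit-learn lays out a binary confusion matrix: rows are the true classes and columns are the predicted classes. The labels `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraud, 0 = not fraud.
y_true_demo = [0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1]

# Layout for binary labels 0/1:
# [[TN, FP],
#  [FN, TP]]
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

print("False positives (legitimate transactions flagged as fraud):", fp)
print("False negatives (fraud cases that were missed):", fn)
```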

The sensitivity for the logistic regression is 0.85294. We are interested in the sensitivity because it measures the fraction of actual fraud cases that the model correctly identifies.

In [None]:
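# Unpack the 2x2 confusion matrix: true negatives, false positives, false negatives, true positives
tn1, fp1, fn1, tp1 = conf_matrix.ravel()
sensitivity1 = tp1 / (tp1+fn1)
print("The sensitivity is: {}".format(sensitivity1))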
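As a cross-check, sensitivity is the same quantity as recall on the fraud class, so `recall_score` should give the same value as the manual formula. Calling it with `pos_label=0` gives the recall on the non-fraud class, which is the specificity. A minimal sketch on made-up labels (the `_demo` names are hypothetical):

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = fraud, 0 = not fraud.
y_true_demo = [0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1]

# Sensitivity: share of actual fraud cases predicted as fraud (2 of 3 here).
sensitivity_demo = recall_score(y_true_demo, y_pred_demo)
# Specificity: share of actual non-fraud cases predicted as non-fraud (3 of 4 here).
specificity_demo = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print("Sensitivity:", sensitivity_demo)
print("Specificity:", specificity_demo)
```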

The ROC-AUC score for the logistic regression is 0.925. Note that it is computed from the hard class predictions rather than from predicted probabilities.

In [None]:
roc_auc_score_reg = roc_auc_score(y_test, prediction)
print("The roc-auc score of the logistic regression is: {}".format(roc_auc_score_reg))
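When `roc_auc_score` receives hard 0/1 predictions, the ROC curve has a single operating point and the area reduces to the average of sensitivity and specificity (the balanced accuracy). Passing continuous scores instead uses the full ranking. The sketch below illustrates the difference on made-up scores; the arrays and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical true labels and model scores (higher = more likely fraud).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])
# Hard predictions obtained by thresholding the scores at 0.5.
labels_demo = (scores_demo >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true_demo, labels_demo)
auc_from_scores = roc_auc_score(y_true_demo, scores_demo)
balanced_acc = balanced_accuracy_score(y_true_demo, labels_demo)

print("AUC from hard labels:", auc_from_labels)
print("Balanced accuracy:   ", balanced_acc)
print("AUC from scores:     ", auc_from_scores)
```

For the model in this notebook, scores of this kind could presumably be obtained with `grid_reg.predict_proba(X_test)[:, 1]`, which would give a different number from the one reported above.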

### Decision Tree Classifier

A different classification model, such as a decision tree, can be fitted to the same data in an attempt to obtain a better model.


In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
grid_tree = GridSearchCV(
        estimator=DecisionTreeClassifier(),
        param_grid= { 'class_weight': [{0:1, 1:v} for v in np.linspace(1,20,10)],
                      'criterion':['gini'],
                      'random_state':[42]},
        cv = 4,
        scoring={'Precision': make_scorer(precision_score), 'Recall': make_scorer(recall_score)},
        refit='Recall',
        n_jobs=-1
        )
model = grid_tree.fit(X_train, y_train)

In [None]:
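plt.figure(figsize=(12,4))
cv_results = pd.DataFrame(grid_tree.cv_results_)
# Weight assigned to the fraud class (label 1) by each grid candidate
fraud_weights = [w[1] for w in cv_results['param_class_weight']]
for score in ["mean_test_Precision", "mean_test_Recall"]:
    plt.plot(fraud_weights, cv_results[score], label=score)
plt.xlabel("Fraud class weight")
plt.ylabel("Mean cross-validated score")
plt.legend()
plt.show()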

The resulting confusion matrix shows that the model successfully spots most fraud cases and misses only 4 of them (false negatives). In addition, the model produces 9 false positives, a number low enough that a human operator could easily review them and correctly classify them as non-fraud.


In [None]:
prediction = grid_tree.predict(X_test)
conf_matrix2 = confusion_matrix(y_test, prediction)
ConfusionMatrixDisplay(conf_matrix2, display_labels=["Not fraud", "Fraud"]).plot()

It appears that the decision tree classifier performs slightly better than the logistic regression, possibly because the latter overfits.
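The comparison can be made concrete by turning the counts quoted in the text into recall and precision for each model. This is only a sketch: the total of 34 fraud cases in the test set is not stated anywhere and is inferred from the reported logistic regression sensitivity (29 / 34 = 0.85294), and the dictionary layout is made up for illustration.

```python
# Assumption: 34 fraud cases in the test set, inferred from 0.85294 = 29 / 34.
n_fraud = 34

# Counts quoted in the text: fraud cases missed and legitimate
# transactions flagged for review, per model.
counts = {
    "logistic regression": {"missed": 5, "false_alarms": 21},
    "decision tree": {"missed": 4, "false_alarms": 9},
}

summary = {}
for name, c in counts.items():
    tp = n_fraud - c["missed"]                   # fraud cases caught
    recall = tp / n_fraud                        # sensitivity
    precision = tp / (tp + c["false_alarms"])    # share of alerts that are real fraud
    summary[name] = (round(recall, 5), round(precision, 3))
    print("{}: recall={:.5f}, precision={:.3f}".format(name, recall, precision))
```

Under these assumptions the decision tree comes out ahead on both recall and precision, which is consistent with the observation above.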

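The sensitivity for the decision tree classifier is about 0.882, given the 4 missed fraud cases reported above and the 34 fraud cases in the test set implied by the logistic regression figures (29 / 34 = 0.85294).

In [None]:
# Unpack the decision tree's confusion matrix (conf_matrix2, not the logistic regression's conf_matrix)
tn2, fp2, fn2, tp2 = conf_matrix2.ravel()
sensitivity2 = tp2 / (tp2+fn2)
print("The sensitivity is: {}".format(sensitivity2))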