It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
This is a very interesting exercise, especially because we have no idea what the features represent, so there is no real business knowledge we can apply to them. To make things worse, there is an extreme difference between the number of "Fraud" and "No Fraud" cases, as we will see later.
To begin, we know that the dataset we are dealing with is unbalanced. Just for fun, we will show how different our correlation heatmap can be if we use the whole dataset rather than an undersampled version. After properly understanding the relations between the variables, we delete those features that share a high correlation with each other, since keeping both brings no value to our model. From the resulting group of features we select those that have a decent correlation with our target class. As a final step, a logistic regression model is created and its performance reviewed.
Link to Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud
- Heatmap Analysis
- Undersampling Data
- Heatmap Analysis with Undersampled Data
- Removing Redundant Features
- Boxplot Analysis
- Feature Selection
- Distribution Analysis
- Oversampling Data with SMOTE
- Model Creation
- Model Scoring
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.read_csv("creditcard.csv")
print(df.shape)
df.head()
(284807, 31)
Index | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows × 31 columns
df_X = df.drop(['Class'], axis=1)
df_y = df['Class']
f, ax = plt.subplots(figsize=(7, 5))
# value_counts() lists the majority class (0, "No Fraud") first
sns.barplot(x=['No Fraud', 'Fraud'], y=df_y.value_counts().tolist(), palette="muted", ax=ax)
ax.axhline(0, color="k", clip_on=False)
ax.set_title("Fraud vs No Fraud")
ax.set_ylabel("Number of cases")
f, ax = plt.subplots(figsize=(20, 20))
corrs = df.corr()
corrs = corrs.round(2)
sns.heatmap(corrs, annot=True, linewidths=.5, ax=ax, cmap="YlGnBu")
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=42)
X_und, y_und = cc.fit_resample(df_X, df_y)
print('Resampled dataset shape %s' % X_und.shape[0])
Resampled dataset shape 984
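ClusterCentroids balances the classes by replacing majority-class points with the centroids of a KMeans fit over them, which is why the resampled set shrinks to 984 rows (twice the 492 fraud cases). As a minimal sketch of a simpler alternative, assuming the same imblearn package, plain random undersampling drops majority rows at random and avoids the KMeans fit entirely:

from imblearn.under_sampling import RandomUnderSampler

# Sketch: random undersampling as a faster alternative to ClusterCentroids.
rus = RandomUnderSampler(random_state=42)
X_rnd, y_rnd = rus.fit_resample(df_X, df_y)
print('Resampled dataset shape %s' % X_rnd.shape[0])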
f, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x=['No Fraud', 'Fraud'], y=y_und.value_counts().tolist(), palette="muted", ax=ax)
ax.axhline(0, color="k", clip_on=False)
ax.set_title("Fraud vs No Fraud")
ax.set_ylabel("Number of cases")
f, ax = plt.subplots(figsize=(20, 20))
corrs = (X_und.join(y_und)).corr()
corrs = corrs.round(2).abs()
sns.heatmap(corrs, annot=True, linewidths=.5, ax=ax, cmap="YlGnBu")
As seen in the heatmap, certain variables overlap to the point of sharing almost 95% correlation. We decided that any feature holding 85%+ correlation with another feature would be removed, since keeping both does not provide much value.
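For reference, the pair search in the loop below can also be written with the pandas upper-triangle idiom. This is only a sketch; it flags the second member of each pair (while the loop below flags the first), so the surviving features may differ slightly. It assumes corrs is the absolute, rounded correlation matrix computed above:

import numpy as np

# Sketch: inspect each feature pair once via the upper triangle of the
# correlation matrix, then flag any column correlated >= 0.85 with an
# earlier column.
upper = corrs.where(np.triu(np.ones(corrs.shape, dtype=bool), k=1))
flagged = [col for col in upper.columns if (upper[col] >= 0.85).any()]
print(flagged)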
features = X_und.columns.tolist()
to_remove = []
# Flag the first feature of every pair whose absolute correlation is >= 0.85
for i in range(len(features) - 1):
    for j in range(i + 1, len(features)):
        if corrs.loc[features[i], features[j]] >= 0.85:
            print(features[i], " -> ", features[j], " = ", corrs.loc[features[i], features[j]])
            to_remove.append(features[i])
            break
for i in to_remove:
    features.remove(i)
V2 -> V3 = 0.9
V5 -> V10 = 0.86
V15 -> V10 = 0.89
V18 -> V12 = 0.89
V19 -> V12 = 0.94
V20 -> V14 = 0.88
V24 -> V17 = 0.95
V25 -> V18 = 0.97
f, ax = plt.subplots(figsize=(15, 15))
corrs = (X_und[features].join(y_und)).corr()
corrs = corrs.round(2)
sns.heatmap(corrs, annot=True, linewidths=.5, ax=ax, cmap="YlGnBu")
We use boxplots to further understand the statistics of our features. As seen below, the raw data contains a huge number of outliers that might affect the training of our model. After removing outliers using the IQR, our data is much more usable. We have to be careful not to remove too many instances of the minority class, in order to safeguard enough information for our model.
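As a minimal sketch of the fences used below, computed for a single, arbitrarily chosen feature and class (V14, "No Fraud"): a value is treated as an outlier when it falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].

# Sketch: IQR fences for one feature/class combination.
q1, q3 = df.loc[df['Class'] == 0, 'V14'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('Keep "No Fraud" V14 values in [%.3f, %.3f]' % (lower, upper))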
sns.set(style="whitegrid")
for fea in features:
    f, ax = plt.subplots(figsize=(6, 6))
    sns.boxplot(x="Class", y=fea, palette=["r", "g"], data=df, ax=ax)
    plt.show()
new_df = df.copy()
outlier_limit = 1.5
for fea in features:
    plotData = df[[fea, 'Class']]
    # IQR fences for the "No Fraud" class
    nofraud = plotData[plotData['Class'] == 0]
    nofraud_q1 = nofraud[fea].quantile(0.25)
    nofraud_q3 = nofraud[fea].quantile(0.75)
    nofraud_iqr = nofraud_q3 - nofraud_q1
    nofraud_il = nofraud_q1 - outlier_limit * nofraud_iqr
    nofraud_ml = nofraud_q3 + outlier_limit * nofraud_iqr
    # IQR fences for the "Fraud" class
    fraud = plotData[plotData['Class'] == 1]
    fraud_q1 = fraud[fea].quantile(0.25)
    fraud_q3 = fraud[fea].quantile(0.75)
    fraud_iqr = fraud_q3 - fraud_q1
    fraud_il = fraud_q1 - outlier_limit * fraud_iqr
    fraud_ml = fraud_q3 + outlier_limit * fraud_iqr
    # Keep only the rows that fall inside the fences of their own class
    plotData = plotData.loc[((plotData[fea] >= nofraud_il) & (plotData[fea] <= nofraud_ml) & (plotData['Class'] == 0)) |
                            ((plotData[fea] >= fraud_il) & (plotData[fea] <= fraud_ml) & (plotData['Class'] == 1))]
    bad_df = df.index.isin(plotData.index.tolist())
    new_df = new_df.drop(df[~bad_df].index.tolist(), errors="ignore")
for fea in features:
    f, ax = plt.subplots(figsize=(6, 6))
    sns.boxplot(x="Class", y=fea, palette=["r", "g"], data=new_df, ax=ax)
    plt.show()
df_X = new_df.drop(['Class'], axis=1)
df_y = new_df['Class']
For the feature selection we based our decision on both the boxplots and the correlation matrix. Both sources provide similar insight and show that there is a connection between our features and our class, and those chosen have no extreme overlap in values. Our selected features are V2, V4, V6, V7, V14, and V18.
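As a quick, hedged sanity check of this manual pick, the surviving features can be ranked by their absolute correlation with the target on the undersampled frame from earlier:

# Sketch: rank remaining features by |correlation| with the class label.
target_corr = X_und[features].join(y_und).corr()['Class'].abs()
print(target_corr.drop('Class').sort_values(ascending=False).head(10))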
final_features = ['V2', 'V4', 'V6', 'V7', 'V14', 'V18']
for i in final_features:
    sns.distplot(df_X[i], hist=False, color="g", kde_kws={"shade": True})
    plt.show()
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_smote, y_smote = sm.fit_resample(df_X[final_features], df_y)
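SMOTE balances the classes the opposite way: instead of dropping majority rows, it synthesizes new minority rows by interpolating between each fraud sample and its nearest minority-class neighbors. A quick check that the classes are now even:

# Sanity check: both classes should have the same count after SMOTE.
print(y_smote.value_counts())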
f, ax = plt.subplots(figsize=(7, 5))
sns.barplot(x=['No Fraud', 'Fraud'], y=y_smote.value_counts().tolist(), palette="muted", ax=ax)
ax.axhline(0, color="k", clip_on=False)
ax.set_title("Fraud vs No Fraud")
ax.set_ylabel("Number of cases")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X_smote, y_smote, test_size=0.30, random_state=42)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_result = clf.predict(X_test)
As seen below in the confusion matrix and the PR curve, our model holds good performance overall, so we can be satisfied with it. Just a reminder: the PR curve can be used in cases like this where the dataset is not balanced, since it focuses on the positive (minority) class rather than on true negatives.
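As a worked check, the class-1 precision and recall in the report below can be recomputed by hand from the confusion matrix ([[tn, fp], [fn, tp]]):

# Worked example using the confusion-matrix values printed below.
tn, fp, fn, tp = 45414, 1521, 3870, 43142
print('precision = %.2f' % (tp / (tp + fp)))  # 43142 / 44663 ~ 0.97
print('recall    = %.2f' % (tp / (tp + fn)))  # 43142 / 47012 ~ 0.92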
from sklearn.metrics import explained_variance_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve
# Accuracy over the full resampled dataset (note: this includes the training rows)
print(clf.score(X_smote, y_smote))
print(explained_variance_score(y_test, y_result))
print('Confusion matrix')
print(confusion_matrix(y_test, y_result))
print(classification_report(y_test, y_result))
y_score = clf.decision_function(X_test)
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, y_score)
print('Average precision-recall score: {0:0.2f}'.format(
average_precision))
from sklearn.metrics import plot_precision_recall_curve
disp = plot_precision_recall_curve(clf, X_test, y_test)
disp.ax_.set_title('2-class Precision-Recall curve: '
'AP={0:0.2f}'.format(average_precision))
0.9425490333829362
0.7729668661737412
Confusion matrix
[[45414 1521]
[ 3870 43142]]
precision recall f1-score support
0 0.92 0.97 0.94 46935
1 0.97 0.92 0.94 47012
accuracy 0.94 93947
macro avg 0.94 0.94 0.94 93947
weighted avg 0.94 0.94 0.94 93947
Average precision-recall score: 0.99