# **This notebook contains a binary classifier for Credit Card Fraud detection**

Several models are used in this notebook to make predictions for fraudulent (class: 1) or non-fraudulent (class: 0) credit card transactions.

In this notebook the data's quality is reported first.

The data is split into train = 70% and test = 30%.

Algorithms used are SVM, Random Forest, and a Voting Classifier with Naïve Bayes, SVM, Logistic Regression & Random Forests used as estimators with hard voting.

Algorithms are evaluated on Accuracy, Precision, Recall, F1 score & R2 score. These metrics are evaluated on testing data.

## **According to the results of this notebook**

## SVM

Is the fastest algorithm (about 3 seconds), with high accuracy (0.94) and precision (0.90) but low recall (0.40).

## Random Forest

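Is the second-fastest algorithm. It was suspected of overfitting the training data, although the train and test accuracies reported later in this notebook are nearly identical.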

## Voting Classifier

Is the slowest of the algorithms and has the most satisfying results without overfitting. Voting classifiers are an effective way to classify data because they take several models into account and vote on the most commonly predicted class.



# **Chosen Dataset**

Credit Card Fraud Dataset from Kaggle

https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud

The dataset contains 1 million entries (rows) and 7 features (columns), plus a binary target column that indicates whether a transaction is fraudulent (1) or not (0).

## **Dataset Features:**

**distance_from_home** - the distance from home to where the transaction happened.

**distance_from_last_transaction** - the distance from the last transaction that happened.

**ratio_to_median_purchase_price** - the ratio of the transaction's purchase price to the median purchase price.

**repeat_retailer** - whether the transaction happened at the same retailer as before.

**used_chip** - whether the transaction was made with the chip (credit card).

**used_pin_number** - whether the transaction was made using a PIN.

**online_order** - whether the transaction was an online order.

### **Target:**

**fraud** - is the transaction fraudulent.

In [24]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, r2_score, recall_score




In [25]:
# load the dataset
df = pd.read_csv('card_transdata.csv')
df

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.311140,1.945940,1.0,1.0,0.0,0.0,0.0
1,10.829943,0.175592,1.294219,1.0,0.0,0.0,0.0,0.0
2,5.091079,0.805153,0.427715,1.0,0.0,0.0,1.0,0.0
3,2.247564,5.600044,0.362663,1.0,1.0,0.0,1.0,0.0
4,44.190936,0.566486,2.222767,1.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...
999995,2.207101,0.112651,1.626798,1.0,1.0,0.0,0.0,0.0
999996,19.872726,2.683904,2.778303,1.0,1.0,0.0,0.0,0.0
999997,2.914857,1.472687,0.218075,1.0,1.0,0.0,1.0,0.0
999998,4.258729,0.242023,0.475822,1.0,0.0,0.0,1.0,0.0


# **Data Quality & Cleaning**

## Missing Values

Replace any possible "?" entries with NaN. Check the number of missing values by counting the number of NaN entries. No missing values are found.
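
The cell that performs this check is not shown in the notebook, so the following is only a minimal sketch of how it could look. The toy DataFrame and its values are hypothetical; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the "?" placeholder mimics a missing entry.
toy = pd.DataFrame({
    "distance_from_home": [57.9, "?", 5.1],
    "used_chip": [1.0, 0.0, "?"],
})

# Replace the placeholder with NaN, then count NaN entries per column.
toy = toy.replace("?", np.nan)
missing_counts = toy.isna().sum()
print(missing_counts)
```

A column total of zero for every feature corresponds to the "no missing values" result reported above.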

## Cardinality

Use cardinality (the number of unique values per column) to distinguish continuous features from categorical ones.

### Continuous Features
The data quality report shows some important statistics: count, missing values percentage, minimum, maximum, mean, first quartile, median, third quartile, and standard deviation


1.   distance_from_home
2.   distance_from_last_transaction
3.   ratio_to_median_purchase_price


### Categorical Features

The data quality report shows some important statistics: count, missing values percentage, mode and mode percentage

It is common to also report the second mode and its percentage, but that is unnecessary for this dataset: every categorical feature has only 2 categories, so the second mode is simply the other category.


1.   repeat_retailer
2.   used_chip
3.   used_pin_number
4.   online_order

## Outliers

Roughly 8% to 12% of the values in each of the 3 continuous features are outliers under the 1.5 * IQR rule (84,386 to 124,367 of 1 million rows). All outliers lie above the upper bound of their feature; none fall below the lower bound. Although some outliers are far from the upper bound, the data appears reliable and free of errors. For this reason, the outliers are kept in the data.

In [26]:
# Data Quality Report

# Checking cardinality to distinguish categorical & continuous features
print(df.nunique())

# Continuous
cont_features = ['distance_from_home', 'distance_from_last_transaction', 'ratio_to_median_purchase_price']
cont_stats = df[cont_features].describe()  # compute the summary statistics once
data_quality_report_continuous = {
    'Features': cont_features,
    'Count': cont_stats.loc['count'].tolist(),
    # percentage (not raw count) of missing entries per feature
    '% Missing': (df[cont_features].isna().mean() * 100).tolist(),
    'Min.': cont_stats.loc['min'].tolist(),
    'Max.': cont_stats.loc['max'].tolist(),
    'Mean': cont_stats.loc['mean'].tolist(),
    '1st Qrt.': cont_stats.loc['25%'].tolist(),
    'Median': cont_stats.loc['50%'].tolist(),
    '3rd Qrt.': cont_stats.loc['75%'].tolist(),
    'Std. Dev.': cont_stats.loc['std'].tolist()
    }

data_quality_report_continuous_table = pd.DataFrame(data=data_quality_report_continuous)
data_quality_report_continuous_table





distance_from_home                1000000
distance_from_last_transaction    1000000
ratio_to_median_purchase_price    1000000
repeat_retailer                         2
used_chip                               2
used_pin_number                         2
online_order                            2
fraud                                   2
dtype: int64


Unnamed: 0,Features,Count,% Missing,Min.,Max.,Mean,1st Qrt.,Median,3rd Qrt.,Std. Dev.
0,distance_from_home,1000000.0,0,0.004874,10632.723672,26.628792,3.878008,9.96776,25.743985,65.390784
1,distance_from_last_transaction,1000000.0,0,0.000118,11851.104565,5.036519,0.296671,0.99865,3.355748,25.843093
2,ratio_to_median_purchase_price,1000000.0,0,0.004399,267.802942,1.824182,0.475673,0.997717,2.09637,2.799589


In [27]:
# Categorical

cat_features = ["repeat_retailer", "used_chip", "used_pin_number", "online_order"]
data_quality_report_categorical = {
    'Features': cat_features,
    'Count': df[cat_features].describe().loc['count'].tolist(),
    # percentage (not raw count) of missing entries per feature
    '% Missing': (df[cat_features].isna().mean() * 100).tolist(),
    'Mode': df[cat_features].mode().iloc[0].tolist(),
    # value_counts sorts in descending order, so the first entry is the mode's share;
    # this avoids hard-coding which category is the mode for each feature
    'Mode %': [df[f].value_counts(normalize=True).iloc[0] * 100 for f in cat_features]
    }

data_quality_report_categorical_table = pd.DataFrame(data=data_quality_report_categorical)
data_quality_report_categorical_table

Unnamed: 0,Features,Count,% Missing,Mode,Mode %
0,repeat_retailer,1000000.0,0,1.0,88.1536
1,used_chip,1000000.0,0,0.0,64.9601
2,used_pin_number,1000000.0,0,0.0,89.9392
3,online_order,1000000.0,0,1.0,65.0552


In [28]:
# Outliers (1.5 * IQR rule), computed for each continuous feature in turn

for feature in cont_features:
    column = df[feature]
    Q1 = np.quantile(column, 0.25)
    Q3 = np.quantile(column, 0.75)
    IQR = Q3 - Q1

    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR

    print(column[column < lower_range].count())  # outliers below the lower bound
    print(column[column > upper_range].count())  # outliers above the upper bound

0
103631
0
124367
0
84386


# **Train-Test Split**

## Train
### 70% of the dataset


## Test
### 30% of the dataset


## Random State
### set to 42 to reproduce the same result with each run
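
As a small illustration of the settings above, the sketch below runs the same 70/30 split twice on hypothetical toy arrays (`X_toy` and `y_toy` are made up for this example) and checks that a fixed `random_state` selects the same rows both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 1 feature, binary labels.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Two splits with the same seed select exactly the same rows.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
print(len(X_train_a), len(X_test_a))

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```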

In [29]:
# Splitting the dataset to Test & Train

y = df.pop('fraud')
X = df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# **Algorithm**

There are many algorithms used for binary classification problems. Considering the size of the dataset, I've excluded some of the slower algorithms, such as K-Nearest Neighbors and Decision Trees (with bagging or boosting), from consideration.


## Considered Algorithms


*   ### Support Vector Machine (SVM)
####  Hyperparameters

  **penalty (regularization)**: l2 regularization gives better results than l1 regularization

  **dual**: set dual to False when n_samples > n_features which is true for this dataset since n_samples = 1 million & n_features = 7

  **C (Regularization parameter)**: through some trial and error C = 1.0 is the best regularization parameter choice for this data

  ####  Time: 3 seconds



*   ### Random Forest
####  Hyperparameters

  **criterion**: the function used to determine the best split, here entropy for the Shannon information gain is used

  **n_estimators**: the number of trees used in the forest for estimation, 30 is chosen

  **max_depth**: is reduced to the minimum possible depth that reduces overfit on training data while resulting in high accuracy on test data

  ####  Time: 23 seconds



*   ### Voting Classifier
####  Hyperparameters

  **estimators**: some of the most common binary classifiers Naïve Bayes, SVM, Logistic Regression & Random Forests. Each of these models has its pros & cons

   **voting**: using a voting classifier with hard voting means all these models are used to predict a class for test data & use a majority vote to determine the final decision or classification

  ####  Time: 33 seconds
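
To make the hard-voting rule described above concrete, here is a minimal NumPy sketch of a majority vote. The prediction matrix is hypothetical and `hard_vote` is an illustrative helper, not part of scikit-learn; ties are broken in favour of the lower class label, which matches the ascending-order tie-breaking that scikit-learn documents for `VotingClassifier`.

```python
import numpy as np

# Hypothetical predictions from four classifiers for five transactions
# (rows = classifiers, columns = transactions; 1 = fraud, 0 = not fraud).
predictions = np.array([
    [0, 1, 1, 0, 1],   # e.g. Naive Bayes
    [0, 1, 0, 0, 1],   # e.g. SVM
    [0, 0, 1, 0, 1],   # e.g. Logistic Regression
    [1, 1, 0, 0, 1],   # e.g. Random Forest
])

def hard_vote(preds):
    """Return the most frequently predicted class for each column."""
    # np.bincount counts the votes per class; argmax picks the class with the
    # most votes and resolves ties in favour of the lower class label.
    return np.array([np.bincount(col, minlength=2).argmax() for col in preds.T])

majority = hard_vote(predictions)
print(majority)
```

With an even number of estimators, as here, a 2-2 tie is possible (third transaction), so the tie-breaking rule can affect the final class.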




In [30]:
### Support Vector Machine ###

svm = LinearSVC(penalty='l2', dual=False, C = 1.0)
svm.fit(X_train, y_train)
svm_prediction = svm.predict(X_test)
print("-- SVM -- \n")
print("Accuracy on test: " + str(svm.score(X_test, y_test)))
print("Accuracy on train: "+ str(svm.score(X_train, y_train)) + "\n")


### Random Forest ###

rf = RandomForestClassifier(criterion='entropy', n_estimators=30, max_depth=5)
rf.fit(X_train, y_train)
rf_prediction = rf.predict(X_test)
print("-- Random Forest -- \n")
print("Accuracy on test: " + str(rf.score(X_test, y_test)))
print("Accuracy on train: "+ str(rf.score(X_train, y_train)) + "\n")


### Voting Classifier (Naïve Bayes, SVM, Logistic Regression, Random Forests) ###

# Logistic Regression
lr=LogisticRegression(max_iter=1000)

# Naïve Bayes
mnb = MultinomialNB()

evc=VotingClassifier(estimators=[('mnb',mnb),('lr',lr),('rf',rf),('svm',svm)],voting='hard')
evc.fit(X_train, y_train)
voting_prediction = evc.predict(X_test)
print("-- Voting Classifier -- \n")
print("Accuracy on test: " + str(evc.score(X_test, y_test)))
print("Accuracy on train: "+ str(evc.score(X_train, y_train)))

-- SVM -- 

Accuracy on test: 0.9441233333333333
Accuracy on train: 0.9440671428571429

-- Random Forest -- 

Accuracy on test: 0.9953966666666667
Accuracy on train: 0.9953342857142857

-- Voting Classifier -- 

Accuracy on test: 0.9478466666666666
Accuracy on train: 0.9478985714285715


# **Algorithm Evaluation**

Metrics commonly used to evaluate classification models are

**Accuracy -** the percentage of correct classifications that a trained machine learning model achieves (calculated above)

**Precision -** the quality of a positive prediction made by the model. The number of true positives divided by the total number of positive predictions

**Recall -** ability of a model to find all the relevant cases within a data set. The number of true positives divided by the number of true positives plus the number of false negatives

**F1 score -** the harmonic mean of precision and recall. It is useful when optimizing precision alone or recall alone would be a poor trade-off, since the former can leave many false negatives and the latter many false positives

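**R2 score -** the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is primarily a regression metric, so for a binary classifier it is best read as a supplementary figure alongside the metrics above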


All these metrics are used alongside accuracy because accuracy alone is not enough to evaluate an algorithm's performance.
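
The definitions above can be worked through by hand. The sketch below uses hypothetical confusion-matrix counts (not taken from this notebook's results) to show how a model can reach high accuracy while still missing most fraudulent transactions.

```python
# Hypothetical confusion-matrix counts for 1,000 transactions,
# of which 100 are fraudulent (tp + fn = 100).
tp, fp, fn, tn = 40, 5, 60, 895

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted frauds that are real
recall = tp / (tp + fn)                      # share of real frauds that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
print('F1 score: {:.2f}'.format(f1))
```

In this made-up case, accuracy and precision are both high even though only 40 of the 100 frauds are detected, which is the kind of gap that recall and F1 expose.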

In [31]:
# Evaluation Metrics

def evaluation(y_test, y_pred):

  # Accuracy
  accuracy = accuracy_score(y_test, y_pred)
  print('Accuracy: {:.2f}'.format(accuracy))

  # Precision
  precision = precision_score(y_test, y_pred)
  print('Precision: {:.2f}'.format(precision))

  # Recall
  recall = recall_score(y_test, y_pred)
  print('Recall: {:.2f}'.format(recall))

  # F1 score
  f1 = f1_score(y_test, y_pred)
  print('F1 score: {:.2f}'.format(f1))

  # r2
  r2 = r2_score(y_test, y_pred)
  print('R2 Score: {:.2f}'.format(r2))

print("-- SVM --\n")
evaluation(y_test, svm_prediction)
print()

print("-- Random Forest --\n")
evaluation(y_test, rf_prediction)
print()

print("-- Voting Classifier --\n")
evaluation(y_test, voting_prediction)
print()

-- SVM --

Accuracy: 0.94
Precision: 0.90
Recall: 0.40
F1 score: 0.56
R2 Score: 0.30

-- Random Forest --

Accuracy: 1.00
Precision: 1.00
Recall: 0.95
F1 score: 0.97
R2 Score: 0.94

-- Voting Classifier --

Accuracy: 0.95
Precision: 0.91
Recall: 0.45
F1 score: 0.60
R2 Score: 0.34

