## The Dataset
The dataset can be accessed from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud?resource=download. It consists of genuine bank transactions conducted by cardholders in Europe in 2013. For confidentiality reasons, the original variables have not been disclosed and have instead been transformed through PCA (Principal Component Analysis). The file has 31 columns in total: 30 feature columns (Time, the PCA components V1 to V28, and Amount) and 1 column, Class, holding the final class.

## Importing Necessary Libraries
All necessary libraries are imported in one place. Because most of the dataset's features are already PCA components, no separate feature selection step is performed.

In [1]:
#Packages related to general operating system & warnings
import os 
import warnings
warnings.filterwarnings('ignore')
#Packages related to data importing, manipulation, exploratory data analysis, data understanding
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import csv
#from termcolor import colored as cl # text customization
#Packages related to data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
#Setting plot sizes and type of plot
plt.rc("font", size=14)
plt.rcParams['axes.grid'] = True
plt.figure(figsize=(6,3))
plt.gray()
from matplotlib.backends.backend_pdf import PdfPages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import  PolynomialFeatures, KBinsDiscretizer, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, OrdinalEncoder
import statsmodels.formula.api as smf
import statsmodels.tsa as tsa
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import BaggingClassifier, BaggingRegressor,RandomForestClassifier,RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier,GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor 
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

<Figure size 432x216 with 0 Axes>

We first read the CSV file "creditcard.csv":

In [2]:
data = pd.read_csv(r"D:\creditcard.csv")  # raw string so the backslash is not treated as an escape

In [3]:
Total_transactions = len(data)
normal = len(data[data.Class == 0])
fraudulent = len(data[data.Class == 1])
fraud_percentage = round(fraudulent/Total_transactions*100, 2)  # share of all transactions
print('Total number of Transactions are {}'.format(Total_transactions))
print('Number of Normal Transactions are {}'.format(normal))
print('Number of fraudulent Transactions are {}'.format(fraudulent))
print('Percentage of fraud Transactions is {}'.format(fraud_percentage))

Total number of Transactions are 284807
Number of Normal Transactions are 284315
Number of fraudulent Transactions are 492
Percentage of fraud Transactions is 0.17


Only 0.17% of transactions are fraudulent.

Check for null values:

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

No column contains null values. Because most features are already PCA components, feature selection is not applied in this use case, although it may be worth trying feature selection mechanisms to see whether they improve results. Of the 30 features in our data, 28 (V1 to V28) are transformed through PCA, while Time and Amount keep their original values. Checking the minimum and maximum of the Amount feature shows a range wide enough to potentially skew our results.

In [5]:
data.Amount.min(), data.Amount.max()

(0.0, 25691.16)

Scaling this variable using a standard scaler:

In [6]:
sc = StandardScaler()
amount = data['Amount'].values
data['Amount'] = sc.fit_transform(amount.reshape(-1, 1))

The fit_transform() method of StandardScaler first fits the scaler to the data by computing the mean and standard deviation of the input values, and then scales the values using these parameters to obtain the standardized values. The reshaping of the amount array is necessary because fit_transform() expects a 2D array as input. Finally, the scaled values are assigned back to the Amount column in the data dataframe.
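
To make the standardization step concrete, the following minimal sketch reproduces the same computation in plain Python on a handful of made-up amounts. The values and the helper name `standardize` are illustrative assumptions, not part of the dataset or of scikit-learn. StandardScaler divides by the population standard deviation (n rather than n - 1 in the denominator), which `statistics.pstdev` matches.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in values]

# Hypothetical transaction amounts, chosen only for illustration.
amounts = [0.0, 10.0, 20.0, 30.0, 40.0]
scaled = standardize(amounts)

# After scaling, the values have mean 0 and standard deviation 1.
print([round(z, 4) for z in scaled])
```

Whatever the original range of the amounts, the scaled column ends up centred on zero with unit spread, which keeps it on a scale comparable to the PCA components.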

The one remaining untransformed variable is Time. It could be an external deciding factor, but we drop it for our modelling process.

In [7]:
data.drop(['Time'], axis=1, inplace=True)

The dataset may also contain duplicate transactions. Removing them will reduce the original 284807 rows. First, we check the current shape:

In [8]:
data.shape

(284807, 30)

Now, removing any duplicates.

In [19]:
data.drop_duplicates(inplace=True)

Performing a count verification:

In [10]:
data.shape

(275663, 30)

The dataset contained 9144 duplicated rows (284807 - 275663). The data is now preprocessed: Amount is scaled, duplicates are removed, and no values are missing. We can proceed to partition the data for use in constructing our machine learning models.
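
As a small illustration of what drop_duplicates() does, the sketch below counts and removes duplicate rows in a toy DataFrame. The column values are invented for the example. duplicated() marks every repeat of an earlier row, so its sum equals the number of rows that drop_duplicates() removes.

```python
import pandas as pd

# Toy frame with invented values; rows 0/2 and rows 1/4 are identical.
toy = pd.DataFrame({
    "V1": [0.1, 0.5, 0.1, 0.9, 0.5],
    "Amount": [1.0, 2.0, 1.0, 3.0, 2.0],
    "Class": [0, 0, 0, 1, 0],
})

n_duplicates = int(toy.duplicated().sum())  # repeats of an earlier row
deduplicated = toy.drop_duplicates()        # keeps the first occurrence

print(n_duplicates)        # 2
print(deduplicated.shape)  # (3, 3)
```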

## Train & Test Split
To split the data into training and testing sets, we first establish the independent and dependent variables. The independent variables, commonly referred to as "X", are the inputs to a machine learning model, whereas the dependent variable, also known as "y", is the output that the model predicts.

We first create the input feature matrix X and target vector y for the machine learning models as:

In [17]:
X = data.drop('Class', axis = 1).values
y = data['Class'].values

The drop() method of a dataframe removes one or more rows or columns. Here, axis=1 tells it to drop the 'Class' column (axis=0 would refer to rows), and the resulting dataframe is converted to a NumPy array using the values attribute. This array is assigned to the variable X. The 'Class' column values are extracted as a NumPy array using the values attribute and assigned to the variable y.

Now, we split our train and test data.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

The train_test_split() function randomly splits the data into training and test sets. The X and y arrays are provided as the first two arguments to the function. The test_size parameter is set to 0.25, which means that 25% of the data will be used for testing and 75% for training. The random_state parameter is set to 1, which ensures that the same random split is generated every time the code is run. The resulting training and test sets are assigned to the variables X_train, X_test, y_train, and y_test.
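
With only about 0.17% fraud cases, a purely random split can leave the test set with a somewhat different fraud rate than the training set. One option, not used in the cell above, is the stratify argument of train_test_split(), which keeps the class proportions the same in both parts. The sketch below uses synthetic features and labels (an assumption for illustration) rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 20 positives among 1000 samples.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Both parts keep the 2% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```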

## Model Building
We evaluate several machine learning algorithms one after another, setting a few of their parameters by hand. The models are then compared to find the one with the highest accuracy and F1-score for fraud detection.

F1-score is a metric that combines the precision and recall of a model into a single score that summarizes its performance. It is calculated as the harmonic mean of precision and recall. By default, f1_score() reports the score for the positive class (here fraud, Class = 1), so in credit card fraud detection it measures how well the model finds fraudulent transactions without raising too many false alarms, something accuracy alone hides when the classes are this imbalanced.
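
To see why accuracy alone is not enough on this dataset, consider a hypothetical baseline that labels every transaction as normal. The sketch below uses the class counts printed earlier (284315 normal, 492 fraudulent) and computes both metrics by hand. The baseline is an illustrative assumption, not one of the models evaluated in this article.

```python
# Class counts reported earlier in this article.
n_normal, n_fraud = 284315, 492

# Hypothetical baseline: predict "normal" (0) for every transaction.
tp = 0          # fraud correctly flagged
fp = 0          # normal wrongly flagged as fraud
fn = n_fraud    # fraud missed
tn = n_normal   # normal correctly passed

accuracy = (tp + tn) / (tp + tn + fp + fn)

# F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0.
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print(f"accuracy = {accuracy:.4f}, F1 = {f1:.1f}")
```

The baseline reaches more than 99.8% accuracy while detecting no fraud at all, which is why each model below is also judged by its F1-score.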

## 1. Decision Tree

We fit the Decision Tree model on the training set with a maximum depth of 4, using entropy as the splitting criterion. We then predict the labels of the test set using the predict() method and store the result in dt_yhat. Here's the code:

In [13]:
DT = DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
DT.fit(X_train, y_train)
dt_yhat = DT.predict(X_test)

Checking the accuracy of our decision tree model.

In [14]:
print('Accuracy score of the Decision Tree model is {}'.format(accuracy_score(y_test, dt_yhat)))


Accuracy score of the Decision Tree model is 0.9991438853096524


Checking F1-Score for the decision tree model.

In [15]:
print('F1 score of the Decision Tree model is {}'.format(f1_score(y_test, dt_yhat)))


F1 score of the Decision Tree model is 0.7467811158798283


Checking the confusion matrix:

In [16]:
confusion_matrix(y_test, dt_yhat, labels = [0, 1])

array([[68770,    18],
       [   41,    87]], dtype=int64)

With labels = [0, 1], the rows of the matrix are the actual classes and the columns are the predicted classes. Treating fraud (Class = 1) as the positive class, the first row holds the true negatives (68770 normal transactions correctly predicted as normal) and the false positives (18 normal transactions predicted as fraud). The second row holds the false negatives (41 fraud transactions predicted as normal) and the true positives (87 fraud transactions correctly detected).

This indicates that out of 68788 normal transactions in the test set, 18 were falsely flagged as fraudulent, and out of 128 fraudulent transactions, 87 were detected while 41 were missed.
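
The counts in the confusion matrix are enough to recompute the scores reported above. The following sketch copies the matrix from the output (rows are actual classes, columns are predicted classes, in the order [0, 1]) and derives precision, recall, F1, and accuracy by hand. The variable names are chosen for illustration.

```python
# Confusion matrix from the Decision Tree output above,
# laid out as [[TN, FP], [FN, TP]] for labels [0, 1].
cm = [[68770, 18],
      [41, 87]]
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud transactions that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

The results agree with the accuracy and F1 values printed by accuracy_score() and f1_score() for the Decision Tree, and they show that the model misses roughly a third of the fraud cases despite its high accuracy.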

## 2. K-Nearest Neighbors

In [20]:
n = 7
# fit a K-Nearest Neighbors model to the training data
knn = KNeighborsClassifier(n_neighbors = n)
knn.fit(X_train, y_train)
# make predictions on the test set
knn_yhat = knn.predict(X_test)
acc_score = accuracy_score(y_test, knn_yhat)

A K-Nearest Neighbors model with 7 neighbors is fit to the training data using KNeighborsClassifier(). Predictions are made on the test set using predict(), and the accuracy score is calculated using accuracy_score().

Checking accuracy of K-Nearest Neighbors model:

In [21]:
print('Accuracy score of the K-Nearest Neighbors model is {}'.format(acc_score))


Accuracy score of the K-Nearest Neighbors model is 0.999288989494457


Checking F1-Score for the K-Nearest Neighbors model.

In [22]:
f1 = f1_score(y_test, knn_yhat)
print('F1 score of the K-Nearest Neighbors model is {}'.format(f1))


F1 score of the K-Nearest Neighbors model is 0.7949790794979079


## 3. Logistic Regression

In [25]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_yhat = lr.predict(X_test)
acc_score = accuracy_score(y_test, lr_yhat)


A Logistic Regression model is fit to the training data using LogisticRegression(). Predictions are made on the test set using predict(), and the accuracy score is calculated using accuracy_score().

Checking the accuracy of the Logistic Regression model:

In [26]:
print('Accuracy score of the Logistic Regression model is {}'.format(acc_score))

Accuracy score of the Logistic Regression model is 0.9989552498694062


Checking F1-Score for the Logistic Regression model.

In [28]:
print('F1 score of the Logistic Regression model is {}'.format(f1_score(y_test, lr_yhat)))


F1 score of the Logistic Regression model is 0.6666666666666666


## 4. Support Vector Machines

In [20]:
svm = SVC()
svm.fit(X_train, y_train)
svm_yhat = svm.predict(X_test)
acc_score = accuracy_score(y_test, svm_yhat)
f1 = f1_score(y_test, svm_yhat)

An SVM model is fit to the training data using SVC(). Predictions are made on the test set using predict(), and the accuracy and F1 scores are calculated using accuracy_score() and f1_score(), respectively. Both scores are printed in the cells that follow.

Checking the accuracy of our SVM model.

In [21]:
print('Accuracy score of the Support Vector Machines model is {}'.format(acc_score))

Accuracy score of the Support Vector Machines model is 0.999318010331418


Checking F1-Score for the SVM model.

In [22]:
print('F1 score of the Support Vector Machines model is {}'.format(f1))

F1 score of the Support Vector Machines model is 0.7813953488372093


## 5. Random Forest

In [23]:
rf = RandomForestClassifier(max_depth = 4)
rf.fit(X_train, y_train)
rf_yhat = rf.predict(X_test)
acc_score = accuracy_score(y_test, rf_yhat)

A Random Forest Classifier with a maximum depth of 4 is fit to the training data using RandomForestClassifier(). Predictions are made on the test set using predict(), and the accuracy score is calculated using accuracy_score().

Checking the accuracy of the Random Forest model:

In [24]:
print('Accuracy score of the Random Forest model is {}'.format(acc_score))

Accuracy score of the Random Forest model is 0.9991293748911718


## 6. XGBoost

In [25]:
xgb = XGBClassifier(max_depth = 4)
xgb.fit(X_train, y_train)
xgb_yhat = xgb.predict(X_test)
acc_score = accuracy_score(y_test, xgb_yhat)
f1 = f1_score(y_test, xgb_yhat)

An XGBoost Classifier with a maximum depth of 4 is fit to the training data using XGBClassifier(). Predictions are made on the test set using predict(), and the accuracy score and F1 score are calculated using accuracy_score() and f1_score(), respectively. Both scores are printed in the cells that follow.

Checking accuracy of our XGBoost model:

In [26]:
print('Accuracy score of the XGBoost model is {}'.format(acc_score))

Accuracy score of the XGBoost model is 0.999506645771664


Checking F1-Score for the XGBoost model.

In [27]:
print('F1 score of the XGBoost model is {}'.format(f1))

F1 score of the XGBoost model is 0.8495575221238937
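
For a side-by-side view, the scores printed in the sections above can be collected and ranked. The sketch below simply copies those reported numbers, rounded to four decimals, into a dictionary. No F1-score was printed for Random Forest, so it is left as None rather than guessed.

```python
# (accuracy, F1) as printed in the sections above, rounded to 4 decimals.
# Random Forest has no reported F1-score, hence None.
scores = {
    "Decision Tree": (0.9991, 0.7468),
    "K-Nearest Neighbors": (0.9993, 0.7950),
    "Logistic Regression": (0.9990, 0.6667),
    "Support Vector Machines": (0.9993, 0.7814),
    "Random Forest": (0.9991, None),
    "XGBoost": (0.9995, 0.8496),
}

# Rank by F1-score (highest first), placing models without one last.
ranked = sorted(
    scores.items(),
    key=lambda item: (item[1][1] is not None, item[1][1] or 0.0),
    reverse=True,
)

for name, (acc, f1) in ranked:
    f1_text = f"{f1:.4f}" if f1 is not None else "n/a"
    print(f"{name:<24} accuracy={acc:.4f}  F1={f1_text}")
```

The accuracies differ only in the fourth decimal place, while the F1-scores spread from about 0.67 to 0.85, so the ranking is driven almost entirely by F1.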


## Conclusion
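XGBoost achieves the highest accuracy (99.95%) and the highest F1-score (about 0.85) among the models evaluated. Every model scores above 99.8% accuracy, largely because of the class imbalance in the data: with only 0.17% fraudulent transactions, predicting "normal" for every case would already be right almost all the time, so the F1-score is the more informative measure. The confusion matrix shows which kinds of errors a model makes on the test set, but it cannot by itself show whether the model is overfitting; that requires comparing training and test performance.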