# Title: AIDI 1002 Final Term Project Report

#### Name: Priya Jogani and Aaron Strasser 

####  Emails: 200523921@student.georgianc.on.ca and 200591429@student.georgian.on.ca

# Introduction:

#### Problem Description:

Credit card fraud has emerged as a major problem in the electronic payment sector. We study the particularities of data-driven credit card fraud detection and several machine learning methods that address its challenges, with the goal of identifying fraudulent transactions that were issued illegitimately on behalf of the rightful card owner.

#### Context of the Problem:

Credit card fraud detection using machine learning is critical due to the rising incidence of fraud, substantial financial losses incurred by both institutions and cardholders, and the erosion of customer trust. Traditional methods struggle to keep pace with evolving fraud tactics, making real-time detection essential. Machine learning's ability to analyze vast amounts of data in real-time, adapt to emerging threats, and reduce false positives enhances its effectiveness. Furthermore, it ensures compliance with regulatory standards, addresses the global nature of fraud, and allows for continuous improvement over time, making it an indispensable tool in safeguarding financial systems and maintaining the integrity of digital transactions.

#### Limitations of Other Approaches:

The passage explores modifications of Breiman's random forests, with a specific focus on Mondrian random forests, which simplify the partition process using the Mondrian process independently of data. While highlighting the minimax optimality of Mondrian random forests for point estimation over smooth regression functions, it underscores the limited understanding of their formal statistical properties, especially in the context of statistical inference.

#### Solution:

Our solution adds a Support Vector Machine (SVM) model for credit card fraud detection, tuned with a randomized hyperparameter search inside a pipeline. The pipeline incorporates feature scaling, the search is scored with a custom metric (the Matthews correlation coefficient, MCC), and AUC-ROC is also calculated, so the model is evaluated on both the training and test sets.
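To illustrate why a custom metric such as MCC is preferred over plain accuracy on heavily imbalanced fraud data, here is a minimal, dependency-free sketch. It uses the standard definition of the Matthews correlation coefficient; the helper names (`confusion_counts`, `mcc`, `accuracy`) and the toy label vectors are illustrative assumptions and are not taken from the project notebook.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 990 legitimate transactions followed by 10 frauds.
y_true = [0] * 990 + [1] * 10

# A useless model that labels every transaction as legitimate.
always_legit = [0] * 1000
acc_naive = accuracy(y_true, always_legit)
mcc_naive = mcc(y_true, always_legit)

# A detector that raises 5 false alarms and catches 8 of the 10 frauds.
detector = [1] * 5 + [0] * 985 + [1] * 8 + [0] * 2
acc_detector = accuracy(y_true, detector)
mcc_detector = mcc(y_true, detector)

print(f"always legit: accuracy={acc_naive:.3f}, MCC={mcc_naive:.3f}")
print(f"detector:     accuracy={acc_detector:.3f}, MCC={mcc_detector:.3f}")
```

Both models reach roughly 99% accuracy in this toy setting, but only the detector obtains an MCC clearly above zero, which is the behaviour that makes MCC a reasonable scoring choice for the hyperparameter searches.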

# Background

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Z. Salimi | Learning is performed through two approaches: 1) clustering and binary classification with OCSVM method for positive data | SQUAD dataset for QA | Only 78% accuracy |
| Yogesh Patel | As time elapses and the sequence of events grows larger, the LSTM model performance improves significantly | SQUAD V2 dataset for QA | LSTM model achieves an F1 score of 97.7% whereas the SVM and Markov model achieve 93.5% and 95.0% |
| Open Data Commons | The dataset contains transactions made by credit cards in September 2013 by European cardholders. | Dataset for credit card fraud | Features representing various transaction attributes are anonymized to protect privacy, making some analysis difficult |
| Nidula Elgiriyewithana | Dataset containing credit card transactions made by European cardholders in the year 2023 | Dataset for credit card fraud | Features representing various transaction attributes are anonymized to protect privacy, making some analysis difficult |

# Methodology

The existing research paper implements the following models:
1) Random Forest Classifier:
The Random Forest Classifier is chosen for its effectiveness in handling complex and imbalanced datasets, common characteristics in credit card fraud detection. The ensemble nature of random forests, coupled with the ability to handle a large number of features and provide feature importance, makes it well-suited for the nuanced patterns and potential outliers associated with fraudulent transactions.

2) Logistic Regression:
Logistic Regression is chosen for its interpretability, simplicity, and effectiveness in binary classification tasks like credit card fraud detection. The grid search explores different regularization penalties and class weights to optimize the model's performance, making Logistic Regression a suitable choice for this context.

3) KNeighborsClassifier:
K-Nearest Neighbors (KNN) is chosen for its simplicity and flexibility, making it suitable for scenarios where underlying patterns in the data may not be linear or well-defined, which is common in credit card fraud detection. The grid search optimizes KNN hyperparameters, such as the distance metric and neighbor weights, to enhance its performance in identifying fraudulent transactions.
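As a small illustration of the feature importance property mentioned for the Random Forest Classifier, the sketch below fits a forest on synthetic imbalanced data and ranks the features by importance. The data comes from scikit-learn's `make_classification`, and the names `X_demo`, `y_demo`, and `rf_demo` are hypothetical; the project's anonymized transaction features are not used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for transaction data (assumption): 8 features, about 5% positives.
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    weights=[0.95, 0.05],
    random_state=1,
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based importance score per input feature.
importances = rf_demo.feature_importances_
ranked_features = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)

for feature_index, score in ranked_features:
    print(f"feature {feature_index}: {score:.3f}")
```

With anonymized features, such a ranking cannot be interpreted in business terms, but it can still indicate which columns the model relies on most.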

Additional model added by us:
1) Support Vector Machine:
Support Vector Machine (SVM) is chosen for its capability to handle complex decision boundaries and nonlinear relationships, which can be valuable in credit card fraud detection where patterns may not be easily discernible. The randomized search optimizes SVM hyperparameters, such as the choice of kernel, regularization parameter (C), and gamma, aiming to enhance the model's ability to accurately identify fraudulent transactions. The use of scaled features and the calculation of probability for AUC-ROC further contribute to its effectiveness in this context.
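The SVM setup described above can be sketched roughly as follows. This is a minimal example under stated assumptions, not the project's actual cell: the data is synthetic (`make_classification`), the search ranges, `n_iter=5`, and `cv=3` are arbitrary small values chosen so the example runs quickly, and all variable names are hypothetical. It also swaps one detail: AUC-ROC is computed from `decision_function` scores rather than predicted probabilities, which `roc_auc_score` accepts equally and which avoids the extra cost of probability calibration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the transactions (assumption).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# MCC wrapped as a scorer so the search can optimize it.
mcc_scorer = make_scorer(matthews_corrcoef)

# Scaling happens inside the pipeline, so it is re-fitted on each training fold.
pipeline_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('model', SVC(random_state=1)),
])

# Illustrative search space: kernel choice plus log-uniform ranges for C and gamma.
param_distributions_svm = {
    'model__kernel': ['linear', 'rbf'],
    'model__C': loguniform(1e-2, 1e2),
    'model__gamma': loguniform(1e-3, 1e0),
}

search_svm = RandomizedSearchCV(
    estimator=pipeline_svm,
    param_distributions=param_distributions_svm,
    n_iter=5,
    scoring=mcc_scorer,
    cv=3,
    random_state=1,
)
search_svm.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out test split.
test_mcc = matthews_corrcoef(y_test, search_svm.predict(X_test))
test_auc = roc_auc_score(y_test, search_svm.decision_function(X_test))

print("best params:", search_svm.best_params_)
print(f"test MCC={test_mcc:.3f}, test AUC-ROC={test_auc:.3f}")
```

Randomized search samples a fixed number of candidates instead of trying every combination, which matters for SVMs because each fit scales poorly with the number of training samples.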

# Implementation

In [2]:
%%time
# %%time is a cell magic, so it must be the first line of the cell.
# MCC_scorer, X_res and y_res are assumed to be defined in earlier cells of the notebook.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# We have changed CV to 5 to get quicker results.
# We have expanded the values that the grid search explores for the 'n_estimators' hyperparameter.
# The grid search fits the model with each candidate value and evaluates it with MCC,
# which helps identify the best setting.
pipeline_rf = Pipeline([
    ('model', RandomForestClassifier(n_jobs=-1, random_state=1))
])
param_grid_rf = {'model__n_estimators': [50, 75, 100, 150]}
# To search the tree depth instead, the grid can be swapped for:
# param_grid_rf = {'model__n_estimators': [75], 'model__max_depth': [None, 10, 20, 30]}
grid_rf = GridSearchCV(estimator=pipeline_rf, param_grid=param_grid_rf, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)
grid_rf.fit(X_res, y_res)

In [4]:
%%time
# %%time is a cell magic, so it must be the first line of the cell.
# MCC_scorer, X_res and y_res are assumed to be defined in earlier cells of the notebook.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# We have changed CV to 5 to get quicker results.
# The grid explores three regularization types ('l1', 'l2', 'elasticnet') and two class weight
# options (None and 'balanced'), scored with MCC_scorer.
# The default 'lbfgs' solver does not support 'l1' or 'elasticnet', so those candidates would
# fail to fit; 'saga' supports all three. max_iter=1000 and l1_ratio=0.5 are assumed values.
pipeline_lr = Pipeline([
    ('model', LogisticRegression(solver='saga', max_iter=1000, random_state=1))
])
param_grid_lr = [
    {'model__penalty': ['l1', 'l2'],
     'model__class_weight': [None, 'balanced']},
    # 'elasticnet' additionally requires l1_ratio to be set.
    {'model__penalty': ['elasticnet'], 'model__l1_ratio': [0.5],
     'model__class_weight': [None, 'balanced']},
]
#param_grid_lr = {'model__penalty': ['l2'],
#                 'model__class_weight': [None, 'balanced', {0: 1, 1: 5}]}
grid_lr = GridSearchCV(estimator=pipeline_lr, param_grid=param_grid_lr, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)
grid_lr.fit(X_res, y_res)

In [3]:
%%time
# %%time is a cell magic, so it must be the first line of the cell.
# MCC_scorer, X_res and y_res are assumed to be defined in earlier cells of the notebook.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# We have changed CV to 5 to get quicker results.
# 'model__p': Minkowski distance power parameter; only Euclidean distance (p=2) is searched.
# 'model__weights': tests two weight functions used during prediction, 'uniform' (equal
# weighting) and 'distance' (weighting by the inverse of the distance).
# The grid search looks for the combination that gives the best MCC for the KNN model.
pipeline_knn = Pipeline([
    ('model', KNeighborsClassifier(n_neighbors=8))
])
param_grid_knn = {'model__p': [2], 'model__weights': ['uniform', 'distance']}
grid_knn = GridSearchCV(estimator=pipeline_knn, param_grid=param_grid_knn, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)
grid_knn.fit(X_res, y_res)
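Once several grid searches have been fitted, their cross-validated results can be compared side by side. The sketch below shows one possible way to do this; `summarize_searches` is a hypothetical helper, and the small synthetic dataset and reduced grids are assumptions chosen only to keep the example fast and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def summarize_searches(searches):
    """Return one row per fitted search, sorted by best cross-validated score."""
    rows = [
        {'model': name, 'best_mcc': search.best_score_, 'best_params': search.best_params_}
        for name, search in searches.items()
    ]
    return sorted(rows, key=lambda row: row['best_mcc'], reverse=True)


# Synthetic stand-in for the resampled training data (assumption).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=1
)
mcc_scorer = make_scorer(matthews_corrcoef)

# Deliberately tiny grids so that the example finishes quickly.
candidates = {
    'random_forest': (RandomForestClassifier(random_state=1), {'n_estimators': [10, 25]}),
    'logistic_regression': (LogisticRegression(max_iter=1000), {'class_weight': [None, 'balanced']}),
    'knn': (KNeighborsClassifier(n_neighbors=8), {'weights': ['uniform', 'distance']}),
}

searches = {}
for name, (estimator, grid) in candidates.items():
    searches[name] = GridSearchCV(estimator, grid, scoring=mcc_scorer, cv=3).fit(X_toy, y_toy)

summary = summarize_searches(searches)
for row in summary:
    print(f"{row['model']}: MCC={row['best_mcc']:.3f}, params={row['best_params']}")
```

With the objects from the cells above, the equivalent call would presumably be `summarize_searches({'rf': grid_rf, 'lr': grid_lr, 'knn': grid_knn})`. Note that `best_score_` is a cross-validation score on the training data, so a final comparison should still use a held-out test set.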

# Conclusion and Future Direction

For task one, after changing the dataset to a new one sourced from 2023, Random Forest remained robust in detecting fraud, while Logistic Regression and KNN demonstrated limited effectiveness, emphasizing the impact of dataset variations on model performance. Careful model selection and adaptation are crucial when dealing with diverse datasets for fraud detection.

For task two, hyperparameter tuning notably impacted model performances. Random Forest demonstrated consistent improvements, emphasizing its robustness in fraud detection. However, KNN's performance fluctuated despite tuning, highlighting the model's sensitivity to parameter changes. Continuous parameter refinement remains crucial for enhancing fraud detection accuracy.

For task three, despite efforts to introduce an additional model like SVM, it became apparent that its implementation wasn't as effective as the originally proposed models. SVM not only exhibited slower training but also showcased inferior fraud detection capabilities compared to the established models. This outcome underscores the importance of not only model selection but also the nuanced impact that different algorithms can have on fraud detection performance.

Throughout this project, we've discovered the pivotal role of model selection and parameter tuning in credit card fraud detection. The diverse impact of different models, evident in the robustness of Random Forest compared to the limited efficacy of Logistic Regression and KNN, underscores the sensitivity of fraud detection to model choice and dataset variations. This highlights the need for continuous adaptation and refinement in model architectures to accommodate evolving fraud patterns. Moreover, the limitations encountered with introducing additional models like SVM emphasize the intricate interplay between algorithmic choices and performance outcomes, signaling the necessity for a nuanced understanding of algorithmic behavior in fraud detection contexts. As we move forward, leveraging more advanced algorithms and fine-tuning parameters tailored to specific dataset characteristics will be crucial for advancing the accuracy and adaptability of fraud detection systems.

# References:

1) Yvan Lucas, Credit Card Fraud Detection using Machine Learning, 13 October, 2020.  https://arxiv.org/abs/2010.06479

2) Johannes Jurgovsky, Credit Card Fraud Detection using Machine Learning, 13 October, 2020.

3) Open Data Commons, Credit Card Fraud Detection,  May 3, 2021. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

4) Sagnik Ghosh, Credit-Card-Fraud-Detection, Oct 8, 2020. https://github.com/sagnikghoshcr7/Credit-Card-Fraud-Detection

5) Nidula Elgiriyewithana, Credit Card Fraud Detection Dataset 2023, Sept 2023. https://statics.teams.cdn.office.net/evergreen-assets/safelinks/1/atp-safelinks.html