# Fraud Detection Project Summary

## Brief Intro
This project addresses fraud detection in financial transactions, a critical issue for many businesses, especially in the banking and e-commerce sectors. The primary goal is to identify fraudulent activity effectively without causing undue inconvenience to customers conducting legitimate transactions. To this end, it is crucial to design a model that maximizes recall, the share of fraudulent transactions that are caught, without overly sacrificing precision, the share of flagged transactions that are actually fraudulent. Therefore, my primary measure for performance evaluation is recall and my secondary measure is precision. Recall is True Positives divided by (True Positives + False Negatives). Precision is True Positives divided by (True Positives + False Positives).
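In symbols, writing TP, FP, and FN for the counts of true positives, false positives, and false negatives, the two measures are given below. The F1 score reported alongside each model's results is the harmonic mean of the two.

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$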

## Data Cleaning

I examined the dataset, checking its dimensions, columns, and null values. The dataset had 416789 rows and 24 columns and no null values. When examining the columns, I noticed some variables that I did not need and a few values that I wanted to convert to different formats. For example, I kept only the hour from the transaction timestamp column because time of day may play an important role in fraud: fraudulent activity may be more likely to occur at unusual hours. For date of birth, I extracted only the year as a proxy for the customer's age, since a person's age may affect the likelihood of their card information being stolen or misused. I dropped the customer's personal information except their latitude and longitude, since transactions that take place far from where a person lives may be more likely to be fraudulent.
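The two conversions described above can be illustrated on single values. This is a minimal sketch using only the standard library; the notebook itself applies pandas' `to_datetime` to whole columns. The helper names `transaction_hour` and `birth_year` are made up for this example, and the input strings are taken from the first row of the data preview.

```python
from datetime import datetime


def transaction_hour(timestamp: str) -> int:
    """Return the hour of day from a 'dd/mm/YYYY HH:MM' timestamp."""
    return datetime.strptime(timestamp, "%d/%m/%Y %H:%M").hour


def birth_year(date_of_birth: str) -> int:
    """Return the year from a 'dd/mm/YYYY' date of birth."""
    return datetime.strptime(date_of_birth, "%d/%m/%Y").year


# Values copied from the first row shown by data.head()
print(transaction_hour("02/08/2020 07:55"))  # 7
print(birth_year("17/11/1964"))              # 1964
```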

I used the following 9 features to predict fraud:
['amount', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'hour', 'year_of_birth', 'category']

To convert categorical variables to numerical ones, I used one-hot encoding (scikit-learn's OneHotEncoder), which converts each unique value into its own binary column. Among my final features, the only variable that needed encoding was the purchase category. I also tried encoding the merchant, but it did not improve my models, so I left it out to keep them more efficient.
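To show what one-hot encoding produces, here is a small pure-Python sketch. It is an illustration only, not the `OneHotEncoder` used in the notebook; the function name `one_hot` and the `category_` column prefix are assumptions made for the example.

```python
def one_hot(values, prefix="category"):
    """Turn a list of category labels into one binary column per unique label."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


# Category labels as they appear in the data preview
rows = one_hot(["shopping_pos", "kids_pets", "kids_pets", "gas_transport"])
for row in rows:
    print(row)
```

Each row ends up with exactly one column set to 1, the one matching its original category.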

## Classification Model Selection

I tested various classification models, including Logistic Regression, Support Vector Machines, and Gaussian Naive Bayes. The three best models were Decision Tree, Random Forest, and Gradient Boosted Trees. My decision tree performed best on nearly all performance measures, with a recall of 70.4%, meaning it would detect about 70% of fraudulent transactions, and a precision of 92.64%. Ranking second place was trickier: compared with gradient boosting, my random forest model had a much higher precision but a slightly lower recall. Given the significant difference in precision, I ranked the random forest second, with a 60.4% recall and a 96.6% precision. Gradient boosting was third, with a recall of about 64.1% and a precision of 80.4%.
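The rates quoted above can be recomputed from the confusion matrices printed by the model cells in the notebook. The sketch below assumes scikit-learn's layout `[[TN, FP], [FN, TP]]`; the helper name `scores` is made up for this example.

```python
# Confusion-matrix counts copied from the notebook output,
# in scikit-learn's layout [[TN, FP], [FN, TP]].
confusion = {
    "Decision Tree": [[103745, 24], [127, 302]],
    "Random Forest": [[103760, 9], [170, 259]],
    "Gradient Boosted Trees": [[103702, 67], [154, 275]],
}


def scores(matrix):
    """Return (precision, recall, f1) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


results = {name: scores(matrix) for name, matrix in confusion.items()}
for name, (precision, recall, f1) in results.items():
    print(f"{name}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```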

## Description of Classifiers and Hyperparameters of Each Model
Hyperparameters used:
1. max_depth: specifies the maximum depth of the tree. Limiting the depth can prevent overfitting and improve generalizability.
2. min_samples_split: sets the minimum number of samples a node must contain before it can be split. This stops the tree from splitting excessively on very small groups of samples.
3. min_samples_leaf: sets the minimum number of samples a leaf node must have. With a value of 2, for example, no leaf in the tree represents fewer than two samples, which keeps leaves from being hyperspecific.
4. random_state: seeds the random number generator so that results are reproducible. It ensures that the random choices made during training are the same on every run, and it is used in all the models.
5. n_estimators: specifies the number of trees in a forest (Random Forest) or the number of boosting stages (Gradient Boosted Trees). Generally, more estimators improve accuracy up to a point; the returns then diminish, and too many can eventually have a negative impact on the model.
6. learning_rate: scales how much each new tree contributes to the ensemble (gradient boosting). Lower learning rates make the model learn more slowly and cautiously, which can be beneficial when searching for the optimal solution.
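As a rough illustration of how the first three hyperparameters limit tree growth, the sketch below checks whether a candidate split would be allowed. It is a simplified model of the rules described in the list, not scikit-learn's actual code; the function name `split_allowed` is made up, and the default values are the ones used for the decision tree in this project.

```python
def split_allowed(n_left, n_right, depth,
                  max_depth=10, min_samples_split=9, min_samples_leaf=2):
    """Return True if a node at `depth` may be split into the two given children."""
    n_node = n_left + n_right
    if depth >= max_depth:                      # tree is already as deep as allowed
        return False
    if n_node < min_samples_split:              # too few samples to consider a split
        return False
    if min(n_left, n_right) < min_samples_leaf: # a child would be too small
        return False
    return True


print(split_allowed(5, 4, depth=3))    # True: 9 samples, both children >= 2
print(split_allowed(4, 4, depth=3))    # False: only 8 samples in the node
print(split_allowed(8, 1, depth=3))    # False: one child would hold 1 sample
print(split_allowed(50, 50, depth=10)) # False: maximum depth reached
```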

To find the best hyperparameters, I had to adjust the models manually due to the limitations of my computer. I chose the final parameters by analysing recall, precision, accuracy, F1 score, and how well each model generalized to the test set.

1. Decision Tree: This is a classification model that consists of a root node with no incoming branches, followed by internal nodes that have both incoming and outgoing branches, and ends in leaf nodes, hence the term decision "tree". Each internal node runs a test on a feature of the data with a binary outcome. The pros of decision trees are their simplicity and interpretability, their ability to work with both categorical and numerical data (though scikit-learn's implementation requires categorical features to be encoded numerically first), and generally fast training and prediction. The cons are that overly complex trees can overfit and that small variations in the data can produce very different trees.

Decision Tree Scores:
Accuracy: 0.9986
Precision: 0.9264
Recall: 0.7040
F1 Score: 0.8000

hyperparameters = (max_depth = 10, min_samples_split = 9, min_samples_leaf = 2, random_state = 42)
Meaning: no more than 10 levels of nodes, at least 9 samples to split a node, at least 2 samples to create a leaf node and an arbitrary random_state of 42 to ensure reproducibility of results.
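The root, internal node, and leaf structure of a decision tree can be sketched with nested dictionaries. The tree below is hand-built and hypothetical: its features and thresholds are made up for illustration and are not taken from the fitted model.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, leaves carry a class label (1 = fraud, 0 = not fraud).
tree = {
    "feature": "amount", "threshold": 500.0,
    "left": {"label": 0},
    "right": {
        "feature": "hour", "threshold": 5,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when feature <= threshold."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(predict(tree, {"amount": 42.0, "hour": 14}))   # 0
print(predict(tree, {"amount": 900.0, "hour": 3}))   # 1
```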

2. Random Forest: A random forest classification model combines multiple decision trees to reach a single decision. Each decision tree is trained on a sample of the dataset drawn with replacement, meaning the same data points can be used more than once. One of the key benefits of random forest models is that they are fairly robust to overfitting. A couple of downsides are that they are time-consuming to train and that it is harder to interpret how decisions were made.

Random Forest Scores:
Accuracy: 0.9983
Precision: 0.9664
Recall: 0.6037
F1 Score: 0.7432

hyperparameters = (random_state = 42)
Meaning: Here, I have only set random_state because the default values produced the best result. By default, 100 decision trees are used for averaging predictions (n_estimators), at least 2 samples are required to split a node (min_samples_split), and at least 1 sample is required per leaf node (min_samples_leaf).
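The two ideas behind a random forest, sampling with replacement and averaging the trees' predictions, can be sketched in a few lines. This is a simplified illustration rather than scikit-learn's implementation; the helper names and the example probabilities are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so some rows can repeat."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(tree_probabilities, threshold=0.5):
    """Average the fraud probabilities from each tree and apply a threshold."""
    mean_probability = sum(tree_probabilities) / len(tree_probabilities)
    return int(mean_probability > threshold)


rng = random.Random(42)            # fixed seed, in the spirit of random_state
rows = list(range(10))             # stand-in for row indices of the training set
sample = bootstrap_sample(rows, rng)
print(sample)                      # same length as rows, usually with repeats

# Made-up probabilities from three hypothetical trees
print(forest_predict([0.9, 0.4, 0.7]))  # 1, since the mean is above 0.5
print(forest_predict([0.1, 0.2, 0.9]))  # 0, since the mean is 0.4
```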
  
3. Gradient Boosted Trees: This is an ensemble technique that builds a series of decision trees sequentially. Each tree corrects the errors of the previous ones, which improves the model incrementally. The trees therefore depend on the ones that come before them, unlike in a random forest. Like random forests, gradient boosted trees can be time-consuming to train, and they are also more prone to overfitting.

Gradient-Boosted Tree Scores:
Accuracy: 0.9979
Precision: 0.8041
Recall: 0.6410
F1 Score: 0.7134

hyperparameters = (learning_rate = 0.1, max_depth = 3, random_state = 42)
Meaning: each tree's contribution is scaled by 0.1 (learning rate), so every iteration corrects only a fraction of the remaining error, and each tree has at most 3 levels of nodes. n_estimators is set to 100, which is also the default value.
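To make the sequential correction and the role of learning_rate concrete, here is a minimal boosting sketch. It makes several simplifying assumptions: it uses squared-error regression on a single feature with depth-1 trees (stumps), whereas the project uses `GradientBoostingClassifier` with max_depth=3. The names `fit_stump` and `boost` and the toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one threshold and one mean per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Each new stump is fitted to the errors left by the stumps before it."""
    base = sum(ys) / len(ys)                 # start from the mean of the targets
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Only a learning_rate-sized fraction of the correction is applied
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data: the target switches from 0 to 1 between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

one_stage = boost(xs, ys, n_estimators=1)
model = boost(xs, ys, n_estimators=100)
print(round(one_stage(2), 4))  # 0.45: one stage moves only 10% of the way from 0.5 to 0
print(round(model(2), 4), round(model(5), 4))
```

With a learning rate of 0.1, a single stage closes only a tenth of the gap to the target, which is why a small learning rate generally needs many stages.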

## Importing packages

In [243]:
import pandas as pd
import polars as pl
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier

# Data Cleaning

In [244]:
# using one hot encoder to convert categorical variables to be used in models that require numeric variables
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False).set_output(transform = 'pandas')
# read in data
data = pd.read_csv("mis501_fraud.csv")
# first 5 rows with columns
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,transaction_timestamp,credit_card_num,merchant,category,amount,first_name,last_name,gender,...,lat,long,city_pop,job_title,date_of_birth,transaction,POSIX_time,merch_lat,merch_long,is_fraud
0,119106,119106,02/08/2020 07:55,377896000000000.0,"fraud_Bahringer, Schoen and Corkery",shopping_pos,1.07,Kimberly,Myers,F,...,41.4682,-72.5751,5438,"Librarian, academic",17/11/1964,cf581d75ccc9ba838a05dec8bfa78b5b,1375430128,41.240083,-71.837788,0
1,179292,179292,23/08/2020 14:05,30364100000000.0,"fraud_Romaguera, Wehner and Tromp",kids_pets,94.99,Samuel,Sandoval,M,...,35.8896,-96.0887,7163,Fitness centre manager,05/02/1982,b1bfaf13224da41f422db483fd810dd7,1377266716,35.156537,-95.806648,0
2,540729,540729,28/12/2020 16:22,30328400000000.0,fraud_Berge-Hills,kids_pets,31.28,Helen,Campbell,F,...,40.029,-93.1607,602,Cytogeneticist,14/07/1954,cde9fc0136873645778d0ad8817db655,1388247749,39.888665,-93.106804,0
3,374360,374360,14/11/2020 10:44,30364100000000.0,"fraud_Connelly, Reichert and Fritsch",gas_transport,73.06,Samuel,Sandoval,M,...,35.8896,-96.0887,7163,Fitness centre manager,05/02/1982,90b8429191e5c83df1afba4e5db4d61e,1384425890,36.734101,-96.737345,0
4,314574,314574,19/10/2020 01:50,4198470000000.0,fraud_Kuphal-Predovic,misc_net,9.99,Christie,Williamson,F,...,41.4768,-95.3509,2036,Engineering geologist,20/08/1971,e4893795b6b3e41667129b9ed13b9650,1382147409,40.922072,-94.899388,0


In [245]:
# data dimensions
print(data.shape)
# get column names
print(data.columns)
#no null_values
print(data.isnull().sum())

#remove "fraud_" from start of each merchant name
data['merchant'] = data['merchant'].str.replace('^fraud_', '', regex=True)

#adjust format of data in certain columns: get hour from time of transaction and year from date of birth
data['hour'] = pd.to_datetime(data['transaction_timestamp'], format='%d/%m/%Y %H:%M').dt.hour
data['year_of_birth']=pd.to_datetime(data['date_of_birth'], format='%d/%m/%Y').dt.year

#removing variables we do not want
new_data = data.drop(["transaction_timestamp", "job_title", "merchant", "date_of_birth", "credit_card_num", "POSIX_time", "zip", "Unnamed: 0.1", "Unnamed: 0", "first_name", "last_name", "gender", "street", "city", "state", "transaction"], axis = "columns")

# encoding the category column using one hot encoder to convert the values to a numeric, in this case a binary value
ohetransform = ohe.fit_transform(new_data[['category']])
# create new df
updated_data = pd.concat([new_data, ohetransform], axis=1).drop(columns = ['category'])

updated_data.head()

# define target and feature variables
predictors = updated_data[[column for column in updated_data.columns if column != 'is_fraud']]

# Split the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(predictors, updated_data["is_fraud"], test_size=0.25, random_state=42)

(416789, 24)
Index(['Unnamed: 0.1', 'Unnamed: 0', 'transaction_timestamp',
       'credit_card_num', 'merchant', 'category', 'amount', 'first_name',
       'last_name', 'gender', 'street', 'city', 'state', 'zip', 'lat', 'long',
       'city_pop', 'job_title', 'date_of_birth', 'transaction', 'POSIX_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')
Unnamed: 0.1             0
Unnamed: 0               0
transaction_timestamp    0
credit_card_num          0
merchant                 0
category                 0
amount                   0
first_name               0
last_name                0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job_title                0
date_of_birth            0
transaction              0
POSIX_time               0
merch_lat                0
merch_long               0
is_f

# 1. Decision Tree Classifier

In [252]:
try:
    clf_dt = DecisionTreeClassifier(max_depth = 10, min_samples_split = 9, min_samples_leaf = 2, random_state = 42) # shift-tab here to show how much can be changed
    dt_predicted = clf_dt.fit(X_train, y_train).predict(X_test)
except ValueError as e:
    print(f"ValueError occurred: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

try:
    accuracy = accuracy_score(y_test, dt_predicted)
    precision = precision_score(y_test, dt_predicted)
    recall = recall_score(y_test, dt_predicted)
    f1 = f1_score(y_test, dt_predicted)

    print(f'Accuracy: {accuracy:.4f}')
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
except Exception as e:
    print(f"An error occurred: {e}")

try:
    cm = confusion_matrix(y_test, dt_predicted)
    print(cm)
except Exception as e:
    print(f"An error occurred: {e}")

Accuracy: 0.9986
Precision: 0.9264
Recall: 0.7040
F1 Score: 0.8000
[[103745     24]
 [   127    302]]


# 2. Random Forest

In [253]:
try:
    rf_clf = RandomForestClassifier(random_state = 42)
    rf_predicted = rf_clf.fit(X_train, y_train).predict(X_test)
except ValueError as e:
    print(f"ValueError occurred: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

In [254]:
try:
    accuracy = accuracy_score(y_test, rf_predicted)
    precision = precision_score(y_test, rf_predicted)
    recall = recall_score(y_test, rf_predicted)
    f1 = f1_score(y_test, rf_predicted)

    print(f'Accuracy: {accuracy:.4f}')
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
except Exception as e:
    print(f"An error occurred: {e}")
try:
    cm = confusion_matrix(y_test, rf_predicted)
    print(cm)
except Exception as e:
    print(f"An error occurred: {e}")

Accuracy: 0.9983
Precision: 0.9664
Recall: 0.6037
F1 Score: 0.7432
[[103760      9]
 [   170    259]]


# 3. Gradient Boosting

In [250]:
try:
    gbc_clf = GradientBoostingClassifier(n_estimators=100, learning_rate = 0.1, max_depth=3, random_state=42)
    gbc_pred = gbc_clf.fit(X_train, y_train).predict(X_test)
except ValueError as e:
    print(f"ValueError occurred: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

In [251]:
try:
    accuracy = accuracy_score(y_test, gbc_pred)
    precision = precision_score(y_test, gbc_pred)
    recall = recall_score(y_test, gbc_pred)
    f1 = f1_score(y_test, gbc_pred)

    print(f'Accuracy: {accuracy:.4f}')
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
except Exception as e:
    print(f"An error occurred: {e}")

try:
    cm = confusion_matrix(y_test, gbc_pred)
    print(cm)
except Exception as e:
    print(f"An error occurred: {e}")

Accuracy: 0.9979
Precision: 0.8041
Recall: 0.6410
F1 Score: 0.7134
[[103702     67]
 [   154    275]]
