# CREDIT CARD FRAUD DETECTION

This study compares credit card transactions from the same region across different time periods to determine whether fraud detection in that region has improved.


### Literature Review

Credit card fraud is a critical issue for any banking firm, with a large impact on both the customer and the bank. Employing fraud detection and prevention measures to eliminate such losses has therefore become essential.

The difficulty with credit card data is that it is highly skewed: non-fraudulent transactions far outnumber fraudulent ones.

To tackle the problem of class imbalance, the data is re-sampled using the Synthetic Minority Over-sampling Technique.

The Synthetic Minority Over-sampling Technique (SMOTE) is among the most dominant techniques used to address the class imbalance found in datasets such as those used to build ML-based credit card fraud detection models [1].
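
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in [1]; the function name `smote_sketch`, its parameters, and the toy `X_fraud` array are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances inside the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # starting sample for each new row
    pick = rng.integers(0, k, size=n_new)   # which of its neighbours to use
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X[base] + gap * (X[nb] - X[base])

# Hypothetical minority-class (fraud) feature rows.
X_fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_fraud, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region already occupied by the minority class. Only the training split should be re-sampled this way, so that the test data keeps its original class ratio.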

The SMOTE method, when coupled with the Adaptive Boosting (AdaBoost) technique, increases classification quality.

AdaBoost (Adaptive Boosting) is an ensemble method in machine learning that uses the boosting technique. Each instance is assigned a weight, and the weights are updated so that instances that were misclassified receive greater weight. During training, N decision trees are generated in sequence. When each new tree is built, priority is given to the records that the previous model classified incorrectly, so the input to the second model depends on the results of the first [2].
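
The re-weighting step described above can be sketched with one-split decision trees (stumps) in NumPy. This is a minimal illustration of discrete AdaBoost under stated assumptions, not the setup used in [2]: labels are encoded as -1 (legitimate) and +1 (fraud), and the helper names `fit_stump`, `reweight`, `adaboost_fit`, and `adaboost_predict` are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    # Predict -sign at or below the threshold on feature j, +sign above it.
    return np.where(X[:, j] <= thr, -sign, sign)

def fit_stump(X, y, w):
    """Return (error, feature, threshold, sign) with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def reweight(w, y, pred):
    """Raise the weights of misclassified instances, lower the rest."""
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # voting strength of this model
    w = w * np.exp(-alpha * y * pred)
    return w / w.sum(), alpha

def adaboost_fit(X, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))       # start with equal weights
    model = []
    for _ in range(n_rounds):
        _, j, thr, sign = fit_stump(X, y, w)
        w, alpha = reweight(w, y, stump_predict(X, j, thr, sign))
        model.append((alpha, j, thr, sign))
    return model

def adaboost_predict(model, X):
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in model)
    return np.sign(score)

# Toy data that no single stump can separate.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost_fit(X, y, n_rounds=3)
print(adaboost_predict(model, X))
```

After each round the misclassified instances together carry half of the total weight, which is what forces the next stump to focus on them.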

Machine learning plays an important role in processing financial data efficiently. Several studies have employed supervised, unsupervised, and hybrid machine learning models to detect fraudulent transactions. Because most transactions are legitimate, a high prediction accuracy can be obtained without correctly identifying the fraudulent transactions [3].
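
A short calculation makes the last point concrete. The sketch below uses the row count and the mean of `is_fraud` reported by `describe()` on the training data in this notebook, and scores a hypothetical baseline that labels every transaction as legitimate; the baseline itself is an assumption for illustration and is not a model from [3].

```python
# Figures taken from the describe() output for fraudTrain.csv.
n_total = 1_296_675
fraud_rate = 0.005788652            # mean of the is_fraud column

n_fraud = round(n_total * fraud_rate)
n_legit = n_total - n_fraud

# Hypothetical baseline: predict "legitimate" for every transaction.
true_pos, false_pos = 0, 0
true_neg, false_neg = n_legit, n_fraud

accuracy = (true_pos + true_neg) / n_total
recall = true_pos / (true_pos + false_neg)   # share of frauds that were caught

print(f"fraud rows: {n_fraud}")
print(f"accuracy:   {accuracy:.4f}")
print(f"recall:     {recall:.4f}")
```

The baseline scores above 99% accuracy while catching no fraud at all, which is why metrics such as recall, precision, and F1 are more informative than accuracy on this dataset.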


Research on credit card fraud detection based on SMOTE-GAN is widespread, and several effective models have been developed over the years to tackle the issue. Maram Alamri and Mourad Ykhlef proposed a credit card fraud detection method based on sampling techniques [4][5].

Their method uses the SMOTE algorithm to ensure a balanced representation of positive and negative samples in the training dataset. The paper describes the different sampling techniques and how they are implemented. It also explains the imbalance in the data and its impact on algorithm performance in terms of inaccuracies, incorrect results, and F1 values. Finally, the authors discuss the significance of sampling techniques in addressing the challenge of imbalanced data in credit card fraud detection.

[1] Ileberi, Emmanuel, Yanxia Sun, and Zenghui Wang. "Performance evaluation of machine learning methods for credit card fraud detection using SMOTE and AdaBoost." IEEE Access 9 (2021): 165286-165294.

[2] "Credit Card Fraud Detection using AdaBoost Algorithm in Comparison with Various Machine Learning Algorithms to Measure Accuracy, Sensitivity, Specificity, Precision and F-score." 2022 International Conference on Business Analytics for Technology and Security (ICBATS). IEEE, 2022. DOI: 10.1109/ICBATS54253.2022.9759022.

[3] Trivedi, Naresh Kumar, et al. "An efficient credit card fraud detection model based on machine learning methods." International Journal of Advanced Science and Technology 29.5 (2020): 3414-3424.

[4] Alamri, Maram, and Mourad Ykhlef. "Survey of credit card anomaly and fraud detection using sampling techniques." Electronics 11.23 (2022): 4003.

[5] Du, HaiChao, et al. "A novel method for detecting credit card fraud problems." PLOS ONE 19.3 (2024): e0294537.

In [3]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

In [4]:
fraud_train_df = pd.read_csv("~/Documents/GitHub/Independent-Study-Fall-2024/Independent-Study-Fall-2024/data/fraudTrain.csv")
fraud_test_df = pd.read_csv("~/Documents/GitHub/Independent-Study-Fall-2024/Independent-Study-Fall-2024/data/fraudTest.csv")
fraud_train_df.head()

| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.262 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.0 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |


In [5]:
fraud_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [6]:
fraud_train_df.describe()

| | Unnamed: 0 | cc_num | amt | zip | lat | long | city_pop | unix_time | merch_lat | merch_long | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 | 1296675.0 |
| mean | 648337.0 | 4.17192e+17 | 70.35104 | 48800.67 | 38.53762 | -90.22634 | 88824.44 | 1349244000.0 | 38.53734 | -90.22646 | 0.005788652 |
| std | 374318.0 | 1.308806e+18 | 160.316 | 26893.22 | 5.075808 | 13.75908 | 301956.4 | 12841280.0 | 5.109788 | 13.77109 | 0.07586269 |
| min | 0.0 | 60416210000.0 | 1.0 | 1257.0 | 20.0271 | -165.6723 | 23.0 | 1325376000.0 | 19.02779 | -166.6712 | 0.0 |
| 25% | 324168.5 | 180042900000000.0 | 9.65 | 26237.0 | 34.6205 | -96.798 | 743.0 | 1338751000.0 | 34.73357 | -96.89728 | 0.0 |
| 50% | 648337.0 | 3521417000000000.0 | 47.52 | 48174.0 | 39.3543 | -87.4769 | 2456.0 | 1349250000.0 | 39.36568 | -87.43839 | 0.0 |
| 75% | 972505.5 | 4642255000000000.0 | 83.14 | 72042.0 | 41.9404 | -80.158 | 20328.0 | 1359385000.0 | 41.95716 | -80.2368 | 0.0 |
| max | 1296674.0 | 4.992346e+18 | 28948.9 | 99783.0 | 66.6933 | -67.9503 | 2906700.0 | 1371817000.0 | 67.51027 | -66.9509 | 1.0 |


## Null Values in data

In [7]:
print(f"Null Values : \n{fraud_train_df.isnull().sum()}")

Null Values : 
Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64


In [8]:
print(f"Null Values : \n{fraud_test_df.isnull().sum()}")

Null Values : 
Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64


- There are no null values in the training or testing datasets.
- The `Unnamed: 0` column is dropped because it duplicates the index.

### Formatting Date to `yyyy-mm-dd` format

In [9]:
# Parse the timestamp and date-of-birth columns as datetimes.
fraud_train_df['trans_date_trans_time'] = pd.to_datetime(fraud_train_df['trans_date_trans_time'])
# Keep only the calendar date (time reset to midnight) in a separate column.
fraud_train_df['trans_date'] = fraud_train_df['trans_date_trans_time'].dt.normalize()
fraud_train_df['dob'] = pd.to_datetime(fraud_train_df['dob'])

# Apply the same conversions to the test set.
fraud_test_df['trans_date_trans_time'] = pd.to_datetime(fraud_test_df['trans_date_trans_time'])
fraud_test_df['trans_date'] = fraud_test_df['trans_date_trans_time'].dt.normalize()
fraud_test_df['dob'] = pd.to_datetime(fraud_test_df['dob'])
fraud_test_df.trans_date.head(), fraud_test_df.dob.head(), fraud_train_df.trans_date.head(), fraud_train_df.dob.head()

(0   2020-06-21
 1   2020-06-21
 2   2020-06-21
 3   2020-06-21
 4   2020-06-21
 Name: trans_date, dtype: datetime64[ns],
 0   1968-03-19
 1   1990-01-17
 2   1970-10-21
 3   1987-07-25
 4   1955-07-06
 Name: dob, dtype: datetime64[ns],
 0   2019-01-01
 1   2019-01-01
 2   2019-01-01
 3   2019-01-01
 4   2019-01-01
 Name: trans_date, dtype: datetime64[ns],
 0   1988-03-09
 1   1978-06-21
 2   1962-01-19
 3   1967-01-12
 4   1986-03-28
 Name: dob, dtype: datetime64[ns])

In [10]:
fraud_train_df.drop(columns = "Unnamed: 0",inplace = True)
fraud_test_df.drop(columns = "Unnamed: 0",inplace = True)
fraud_train_df.columns

Index(['trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt',
       'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat',
       'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time', 'merch_lat',
       'merch_long', 'is_fraud', 'trans_date'],
      dtype='object')

In [24]:
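from datetime import datetime

def calculate_age(dob):
    """Return the age in whole years as of today's date (not the transaction date)."""
    today = datetime.today()
    age = today.year - dob.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (dob.month, dob.day):
        age -= 1
    return age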

In [26]:
fraud_train_df['age'] = fraud_train_df['dob'].apply(calculate_age)
fraud_test_df['age'] = fraud_test_df['dob'].apply(calculate_age)
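
Because `calculate_age` measures age against today's date, the `age` column changes depending on when the notebook is run. A possible variant, sketched below, takes the reference date as a parameter so that age can be measured at the time of the transaction. The function name `age_on` is hypothetical and this is not part of the original analysis; the example dates come from the first row of the training data shown earlier.

```python
from datetime import date

def age_on(dob, reference):
    """Whole years between dob and reference (any objects with year, month, day)."""
    age = reference.year - dob.year
    # Subtract one year if the birthday has not occurred yet in the reference year.
    if (reference.month, reference.day) < (dob.month, dob.day):
        age -= 1
    return age

# dob and trans_date of the first training row.
print(age_on(date(1988, 3, 9), date(2019, 1, 1)))  # 30
```

On the DataFrames this could be applied row-wise, for example `fraud_train_df.apply(lambda r: age_on(r['dob'], r['trans_date']), axis=1)`, although a row-wise `apply` over roughly 1.3 million rows will be slow.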

In [28]:
fraud_train_df.sample(3)

| | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | ... | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | trans_date | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 648905 | 2019-10-03 15:11:59 | 372509258176510 | fraud_Lynch-Wisozk | home | 115.73 | Kristen | Hanson | F | 26544 Andrea Glen | Goodrich | ... | 6951 | Learning disability nurse | 1985-06-18 | 5c045bd365a41ea45e61801b906ed5a2 | 1349277119 | 43.12382 | -83.79827 | 0 | 2019-10-03 | 39 |
| 823415 | 2019-12-09 08:21:59 | 180031190491743 | fraud_Kutch Group | grocery_net | 53.98 | Becky | Mckinney | F | 250 Benjamin Hill Apt. 026 | Mobile | ... | 270712 | Surveyor, land/geomatics | 1972-01-05 | 3c6c9647077bff361130e4405d9c7f91 | 1355041319 | 30.016878 | -87.409524 | 0 | 2019-12-09 | 52 |
| 480421 | 2019-07-29 12:43:15 | 342952484382519 | fraud_Hintz, Bauch and Smith | health_fitness | 57.54 | Kayla | Jones | F | 6033 Young Track Suite 804 | East Canaan | ... | 647 | Comptroller | 1987-09-26 | 61b61023cda070f78a293e59d00457ec | 1343565795 | 42.518261 | -73.37119 | 0 | 2019-07-29 | 37 |
