# Credit Card Fraud Detection using Machine Learning

## Introduction:

This notebook walks through building a credit card fraud detection model with machine learning. We load and clean a labeled transaction dataset, encode and scale its features, and train tree-based classifiers to flag fraudulent transactions, explaining each step along the way.

## Dataset:

The fraud detection model is trained and evaluated on a dataset of labeled transactions. Each row carries a `Fraud` label (0 or 1) indicating whether the transaction is fraudulent.

Link to the dataset: https://www.kaggle.com/datasets/anurag629/credit-card-fraud-transaction-data/data

Below are all the libraries that you will need for this project.

In [14]:
import pandas as pd
import numpy as np
import sklearn 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

Let's begin by loading the dataset.

In [4]:
data = pd.read_csv("CreditCardData.csv")
data.head()

|  | Transaction ID | Date | Day of Week | Time | Type of Card | Entry Mode | Amount | Type of Transaction | Merchant Group | Country of Transaction | Shipping Address | Country of Residence | Gender | Age | Bank | Fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | #3577 209 | 14-Oct-20 | Wednesday | 19 | Visa | Tap | £5 | POS | Entertainment | United Kingdom | United Kingdom | United Kingdom | M | 25.2 | RBS | 0 |
| 1 | #3039 221 | 14-Oct-20 | Wednesday | 17 | MasterCard | PIN | £288 | POS | Services | USA | USA | USA | F | 49.6 | Lloyds | 0 |
| 2 | #2694 780 | 14-Oct-20 | Wednesday | 14 | Visa | Tap | £5 | POS | Restaurant | India | India | India | F | 42.2 | Barclays | 0 |
| 3 | #2640 960 | 13-Oct-20 | Tuesday | 14 | Visa | Tap | £28 | POS | Entertainment | United Kingdom | India | United Kingdom | F | 51.0 | Barclays | 0 |
| 4 | #2771 031 | 13-Oct-20 | Tuesday | 23 | Visa | CVC | £91 | Online | Electronics | USA | USA | United Kingdom | M | 38.0 | Halifax | 1 |


## EDA

First, check each column for missing values. Only a handful of entries are missing, so the affected rows are simply dropped.

In [27]:
data.isna().sum()

Transaction ID             0
Date                       0
Day of Week                0
Time                       0
Type of Card               0
Entry Mode                 0
Amount                     6
Type of Transaction        0
Merchant Group            10
Country of Transaction     0
Shipping Address           5
Country of Residence       0
Gender                     4
Age                        0
Bank                       0
Fraud                      0
dtype: int64

In [5]:
data.dropna(inplace=True)

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 99977 entries, 0 to 99999
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Day of Week             99977 non-null  object 
 1   Time                    99977 non-null  int64  
 2   Type of Card            99977 non-null  object 
 3   Entry Mode              99977 non-null  object 
 4   Amount                  99977 non-null  float64
 5   Type of Transaction     99977 non-null  object 
 6   Merchant Group          99977 non-null  object 
 7   Country of Transaction  99977 non-null  object 
 8   Shipping Address        99977 non-null  object 
 9   Country of Residence    99977 non-null  object 
 10  Gender                  99977 non-null  object 
 11  Age                     99977 non-null  float64
 12  Bank                    99977 non-null  object 
 13  Fraud                   99977 non-null  int64  
 14  Day                     99977 non-null  int

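In the raw file, `Amount` is stored as text with a `£` prefix (see the first preview), so we strip the symbol and convert the column to float. The `data.info()` output above already lists `Amount` as `float64` because that cell was run after the conversion.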

In [6]:
data['Amount'] = data['Amount'].str.replace('£', '').astype(float)
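
As a minimal sketch of the same conversion on a toy Series (the sample values are made up for illustration), passing `regex=False` makes it explicit that `£` is matched as a literal string rather than as a pattern:

```python
import pandas as pd

# Hypothetical sample values in the same "£<number>" style as the Amount column.
amounts = pd.Series(["£5", "£288", "£91"])

# regex=False treats "£" as a literal string, not a regular expression.
cleaned = amounts.str.replace("£", "", regex=False).astype(float)

print(cleaned.dtype)
print(cleaned.tolist())
```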

Next, parse `Date` as a datetime, extract the day, month, and year into separate columns, and drop the `Date` and `Transaction ID` columns.

In [7]:
data['Date'] = pd.to_datetime(data['Date'])
data['Day'] = data['Date'].dt.day
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
data.drop(['Date', 'Transaction ID'], axis=1, inplace=True)
data.head()

|  | Day of Week | Time | Type of Card | Entry Mode | Amount | Type of Transaction | Merchant Group | Country of Transaction | Shipping Address | Country of Residence | Gender | Age | Bank | Fraud | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wednesday | 19 | Visa | Tap | 5.0 | POS | Entertainment | United Kingdom | United Kingdom | United Kingdom | M | 25.2 | RBS | 0 | 14 | 10 | 2020 |
| 1 | Wednesday | 17 | MasterCard | PIN | 288.0 | POS | Services | USA | USA | USA | F | 49.6 | Lloyds | 0 | 14 | 10 | 2020 |
| 2 | Wednesday | 14 | Visa | Tap | 5.0 | POS | Restaurant | India | India | India | F | 42.2 | Barclays | 0 | 14 | 10 | 2020 |
| 3 | Tuesday | 14 | Visa | Tap | 28.0 | POS | Entertainment | United Kingdom | India | United Kingdom | F | 51.0 | Barclays | 0 | 13 | 10 | 2020 |
| 4 | Tuesday | 23 | Visa | CVC | 91.0 | Online | Electronics | USA | USA | United Kingdom | M | 38.0 | Halifax | 1 | 13 | 10 | 2020 |
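
The dates in this file look like `14-Oct-20`, so one option is to state the format explicitly instead of letting pandas infer it. The sketch below assumes every value follows the `%d-%b-%y` pattern seen in the preview and that month abbreviations are in English; the two sample dates are taken from the rows shown above.

```python
import pandas as pd

# Two date strings in the same style as the Date column.
dates = pd.Series(["14-Oct-20", "13-Oct-20"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%y")

# Pull the calendar components out into separate integer columns.
parts = pd.DataFrame({
    "Day": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(parts)
```

With an explicit format, a value that does not match raises an error rather than being parsed in an unexpected way.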


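Next, let's one-hot encode all categorical variables into 0/1 dummy columns. With `drop_first=True`, the first level of each variable is dropped and acts as the baseline.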

In [8]:
cat_cols=['Day of Week','Type of Card','Entry Mode','Type of Transaction','Merchant Group','Country of Transaction','Shipping Address','Country of Residence','Gender','Bank']
data = pd.get_dummies(data, columns=cat_cols, drop_first=True,dtype=float)
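
To see what `drop_first=True` does, here is a small sketch on a made-up three-row frame (the rows are illustrative, not taken from the dataset). Levels are ordered alphabetically, so `MasterCard` and `F` become the implicit baselines and only `Type of Card_Visa` and `Gender_M` remain as columns.

```python
import pandas as pd

# Toy frame with two categorical columns (values are made up).
toy = pd.DataFrame({
    "Type of Card": ["Visa", "MasterCard", "Visa"],
    "Gender": ["M", "F", "F"],
})

# One 0/1 column per level, minus the first level of each variable.
encoded = pd.get_dummies(
    toy, columns=["Type of Card", "Gender"], drop_first=True, dtype=float
)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros in every dummy column of a variable therefore belongs to that variable's dropped baseline level.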

In [47]:
data.head()

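|  | Time | Amount | Age | Fraud | Day | Month | Year | Day of Week_Thursday | Day of Week_Tuesday | Day of Week_Wednesday | ... | Country of Residence_USA | Country of Residence_United Kingdom | Gender_M | Bank_Barlcays | Bank_HSBC | Bank_Halifax | Bank_Lloyds | Bank_Metro | Bank_Monzo | Bank_RBS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 5.0 | 25.2 | 0 | 14 | 10 | 2020 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 17 | 288.0 | 49.6 | 0 | 14 | 10 | 2020 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 14 | 5.0 | 42.2 | 0 | 14 | 10 | 2020 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 14 | 28.0 | 51.0 | 0 | 13 | 10 | 2020 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 23 | 91.0 | 38.0 | 1 | 13 | 10 | 2020 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |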


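Now we standardise the numerical columns `Amount` and `Age` to zero mean and unit variance. Tree-based models are largely insensitive to feature scale, so this step matters mainly if scale-sensitive models are added later.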

In [10]:
numerical_columns = ['Amount', 'Age']
scaler = StandardScaler()
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])
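
One caveat: fitting the scaler on the full dataset lets statistics from the future test rows influence the training features. A common alternative, sketched here on synthetic data (the `toy` frame and its value ranges are made up for illustration), is to split first, fit the scaler on the training rows only, and reuse those statistics for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two numerical columns (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(5, 400, size=200),
    "Age": rng.uniform(18, 80, size=200),
})
numerical_columns = ["Amount", "Age"]

# Split first, so the test rows never influence the scaler.
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
train[numerical_columns] = scaler.fit_transform(train[numerical_columns])
# Apply the training statistics to the test rows without refitting.
test[numerical_columns] = scaler.transform(test[numerical_columns])

print(train[numerical_columns].describe().loc[["mean", "std"]])
```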

With the features prepared, let's start modelling. We separate the `Fraud` target from the features and hold out 20% of the data for testing.

In [11]:
X = data.drop('Fraud', axis=1)
y = data['Fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
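
Fraud cases are a small minority here (1,461 of the 19,996 test rows in the reports below, roughly 7%), so it can help to keep that proportion identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels; the 10% positive rate is chosen only to make the arithmetic easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 10% positive rate (made up for illustration).
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```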

First, we train a random forest classifier.

In [15]:
rfc = RandomForestClassifier(n_estimators=10, max_depth=3, min_samples_split=2, random_state=0)
rfc.fit(X_train,y_train)
preds_rfc= rfc.predict(X_test)
rfc_acc = accuracy_score(y_test, preds_rfc)

print("Random Forest Accuracy:", rfc_acc)
print(classification_report(y_test,preds_rfc))

Random Forest Accuracy: 0.9477895579115824
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     18535
           1       1.00      0.29      0.44      1461

    accuracy                           0.95     19996
   macro avg       0.97      0.64      0.71     19996
weighted avg       0.95      0.95      0.93     19996
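
Accuracy alone is misleading on data this imbalanced. As a quick worked check using the support counts from the report above, a model that always predicts "not fraud" would already score about 0.927, so the random forest's 0.948 is only a small improvement, and its recall of 0.29 on class 1 shows that most fraud cases are missed.

```python
# Support counts taken from the classification report above.
n_legit = 18535
n_fraud = 1461
total = n_legit + n_fraud

# Accuracy of a trivial model that labels every transaction as class 0.
baseline_accuracy = n_legit / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```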



Next, we train an Extra Trees classifier.

In [16]:
etc = ExtraTreesClassifier(n_estimators=100, random_state=0)
etc.fit(X_train, y_train)
preds_etc=etc.predict(X_test)
etc_acc = accuracy_score(y_test, preds_etc)

print("Extra tree Accuracy:", etc_acc)
print(classification_report(y_test,preds_etc))

Extra tree Accuracy: 0.9772954590918184
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     18535
           1       0.90      0.77      0.83      1461

    accuracy                           0.98     19996
   macro avg       0.94      0.88      0.91     19996
weighted avg       0.98      0.98      0.98     19996
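
The two reports differ most on the fraud class: recall on class 1 rises from 0.29 for the random forest to 0.77 for Extra Trees. To look at those numbers directly, one option is to compute the confusion matrix together with precision and recall for the positive class. The sketch below uses short made-up label lists rather than the real predictions, so the values are illustrative only.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (1 = fraud), made up for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of actual fraud cases caught. Precision: share of fraud alerts that are correct.
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In the notebook, the same calls could be applied to `y_test` with `preds_rfc` or `preds_etc`.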

