# Credit Card Fraud Detection

### <font color='blue'>*Author: Ali Chehrazi*</font>

##  Description

This project compares classification models (K-nearest neighbors and random forest) for detecting fraudulent credit card transactions. The data is available on Kaggle:

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4f/Credit-cards.jpg" width=400 />

### About Dataset

This dataset logs credit card transactions made by European cardholders over two days in September 2013. It contains 492 fraud cases among 284,807 transactions, so the classes are highly imbalanced: frauds account for only 0.172% of all transactions. Most features (V1 through V28) are numerical variables produced by a PCA transformation; the remaining columns are 'Time', 'Amount', and the response variable 'Class', which marks fraud (1) or non-fraud (0). For confidentiality reasons, the original features and background information are not disclosed.


## Importing Libraries
First, let's import the required libraries.

In [1]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # Matplotlib is a plotting library for python
import seaborn as sns #Seaborn is a Python data visualization library based on matplotlib.
from sklearn.model_selection import train_test_split # Allows us to split our data into training and testing data
from sklearn.model_selection import GridSearchCV # Allows us to test parameters of classification algorithms and find the best one
from sklearn.linear_model import LogisticRegression # Logistic Regression classification algorithm
from sklearn.neighbors import KNeighborsClassifier # K Nearest Neighbors classification algorithm
from sklearn.ensemble import RandomForestClassifier # Random Forest classification algorithm
     


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/creditcardfraud/creditcard.csv


Let's read the dataset into a data frame.

In [2]:
df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv', delimiter=',')
df

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


Let's check whether there are any NaN values in the data frame.

In [3]:
df.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Let's check how many fraudulent transactions are in the dataset.

In [4]:
df.Class.value_counts()

Class
0    284315
1       492
Name: count, dtype: int64

In [5]:
# Share of fraud among all transactions (fraud count / total count)
fraud_pct = df.Class.mean() * 100
print(f'The percentage of fraudulent transactions is {fraud_pct:.4f}%.')

The percentage of fraudulent transactions is 0.1727%.
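
With fraud this rare, accuracy alone says little about a model. As a rough illustration (a sketch that uses only the class counts printed above and is not part of the original analysis), a hypothetical baseline that labels every transaction as non-fraud reaches roughly 99.8% accuracy while catching no fraud at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Class counts taken from the value_counts() output above
n_normal, n_fraud = 284315, 492

# True labels: 0 = non-fraud, 1 = fraud
y_true = np.concatenate([np.zeros(n_normal, dtype=int),
                         np.ones(n_fraud, dtype=int)])

# Hypothetical baseline: predict "non-fraud" for every transaction
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline, zero_division=0)
baseline_f1 = f1_score(y_true, y_baseline, zero_division=0)

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"recall:   {baseline_recall:.4f}")
print(f"F1:       {baseline_f1:.4f}")
```

This is one reason to score models on F1 (or precision and recall) rather than accuracy for this dataset.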


## Train/test split

Let's split the data into training and test sets.

In [6]:
y=df.pop('Class')
X=df

In [7]:
X_train, X_test, Y_train, Y_test=train_test_split(X,y,test_size=0.2,random_state=0)
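
Because fraud cases are so rare, a plain random split can leave the training and test sets with slightly different fraud rates. One common option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below shows its effect on a small synthetic stand-in dataset; `X_demo`, `y_demo`, and the other names are illustrative and not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 50 positives (0.5%)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 5))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:50] = 1

# stratify=y_demo keeps the class ratio the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate in train: {train_rate:.4f}")
print(f"positive rate in test:  {test_rate:.4f}")
```

In the notebook this would amount to passing `stratify=y` to the existing call. Doing so changes which rows land in each set, so the scores reported later would change as well.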

## Classification Models

Let's train KNN and Random Forest classifiers to see which one performs better on this dataset.

### KNN

In [8]:
parameters = {'n_neighbors': [8,12,16],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

KNN = KNeighborsClassifier()
# With scoring='f1', best_score_ and score() both report F1, not accuracy
knn_cv = GridSearchCV(KNN, parameters, scoring='f1', cv=4)
knn_cv.fit(X_train, Y_train)
print("Best hyperparameters:", knn_cv.best_params_)
print("Mean cross-validated F1:", knn_cv.best_score_)
print("Test-set F1:", knn_cv.score(X_test, Y_test))

Best hyperparameters: {'algorithm': 'auto', 'n_neighbors': 8, 'p': 1}
Mean cross-validated F1: 0.12813894705670834
Test-set F1: 0.14678899082568805


For reference, an earlier run of this cell, apparently scored on accuracy rather than F1, printed:

tuned hyperparameters :(best parameters)  {'algorithm': 'auto', 'n_neighbors': 8, 'p': 1}
accuracy : 0.99840242269955
accuracy test set: 0.9983673326077034
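
KNN classifies by distance, so features on large scales can dominate the neighbor search. In this dataset 'Time' and 'Amount' take much larger values than the PCA components, which may contribute to the low F1. One option, sketched here on synthetic stand-in data and not something the original notebook ran, is to standardize the features inside a `Pipeline` so that the scaler is refit on each cross-validation fold. The step names `scaler` and `knn`, the reduced parameter grid, and the `_demo` variables are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
# Put one feature on a much larger scale, loosely mimicking 'Time'
X_demo[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [3, 5, 8], "knn__p": [1, 2]}

knn_scaled_cv = GridSearchCV(pipe, param_grid, scoring="f1", cv=4)
knn_scaled_cv.fit(X_tr, y_tr)

print("Best hyperparameters:", knn_scaled_cv.best_params_)
print("Mean cross-validated F1:", knn_scaled_cv.best_score_)
print("Test-set F1:", knn_scaled_cv.score(X_te, y_te))
```

Whether scaling actually helps on the real data would need to be checked by rerunning the search with `X_train` and `Y_train`.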


### Random Forest

In [9]:
parameters= {"max_depth": [3,5,7],
              "n_estimators":[3,5,10],
              "max_features": [5,6,7,8]}

# Creating the classifier
RF = RandomForestClassifier()
# With scoring='f1', best_score_ and score() both report F1, not accuracy
RF_cv = GridSearchCV(RF, parameters, scoring='f1', cv=4)
RF_cv.fit(X_train, Y_train)
print("Best hyperparameters:", RF_cv.best_params_)
print("Mean cross-validated F1:", RF_cv.best_score_)
print("Test-set F1:", RF_cv.score(X_test, Y_test))

Best hyperparameters: {'max_depth': 7, 'max_features': 7, 'n_estimators': 10}
Mean cross-validated F1: 0.8415926717497526
Test-set F1: 0.8216216216216216
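
A single F1 number hides how a model trades precision against recall. A minimal sketch of a fuller test-set evaluation follows. It uses synthetic stand-in data and a small random forest with arbitrary settings, so the names (`X_demo`, `rf_demo`) and the resulting numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Small forest with arbitrary, illustrative settings
rf_demo = RandomForestClassifier(n_estimators=10, max_depth=7, random_state=0)
rf_demo.fit(X_tr, y_tr)
y_pred = rf_demo.predict(X_te)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_te, y_pred, digits=4, zero_division=0))
```

In the notebook, the same two calls could be applied to `RF_cv.predict(X_test)` and `knn_cv.predict(X_test)` to compare the models on missed frauds and false alarms.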
