## Credit Card Fraud Detection Model

Scenario: A financial firm is moving its fraud detection from traditional methods to machine learning. As a first step, it needs a simple model that can be put to use quickly to support the anti-fraud team. This project is an initial step in a larger effort to improve the system over time.

Data source: Dummy data from Kaggle. Feature definitions:
- Time: Number of seconds elapsed between this transaction and the first transaction in the dataset
- V1-V28: Components produced by PCA dimensionality reduction
- Amount: Transaction amount
- Class: 1 for fraudulent transactions, 0 otherwise (dependent variable)

Project complexity level: Low

Implementation steps:
1. Data Preprocessing
2. Training the Model
3. Evaluating the Model
4. Findings

---

In [84]:
import numpy as np
import pandas as pd
#import seaborn as sns
from matplotlib import pyplot as plt
from collections import Counter
import itertools

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, confusion_matrix

### 1. Data preprocessing

In [85]:
# Load the dataset; adjust the path to wherever creditcard.csv is stored
df = pd.read_csv("creditcard.csv")
df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [86]:
# Check the dataset shape and count the null values in each column
df.shape, df.isnull().sum()

((284807, 31),
 Time      0
 V1        0
 V2        0
 V3        0
 V4        0
 V5        0
 V6        0
 V7        0
 V8        0
 V9        0
 V10       0
 V11       0
 V12       0
 V13       0
 V14       0
 V15       0
 V16       0
 V17       0
 V18       0
 V19       0
 V20       0
 V21       0
 V22       0
 V23       0
 V24       0
 V25       0
 V26       0
 V27       0
 V28       0
 Amount    0
 Class     0
 dtype: int64)

In [87]:
df['Class'].value_counts(normalize=True)

Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64

The dataset is heavily imbalanced: only about 0.17% of transactions are fraudulent, which can make accuracy a misleading measure. To deal with the imbalance without resampling (oversampling or undersampling), I take the following approach:
1. Evaluation metrics: I use precision, recall, and F1-score, which give a better picture of model performance on an imbalanced dataset.
2. Ensemble methods: I use Random Forest, which can be effective on imbalanced datasets.

In addition, I also apply a resampling technique to create a new dataset, and I include the Decision Tree algorithm and the accuracy score in the evaluation so the models can be compared more broadly.
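To see why accuracy alone can mislead here, consider a baseline that labels every transaction as legitimate. The sketch below uses synthetic labels with roughly the same fraud rate as this dataset; the 10,000-row size and the variable names are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic labels (assumption): 10,000 transactions, 17 of them fraud (about 0.17%)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# A "model" that always predicts the majority class (no fraud)
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist
baseline_precision = precision_score(y_true, y_pred, zero_division=0)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)
baseline_f1 = f1_score(y_true, y_pred, zero_division=0)

print(baseline_accuracy, baseline_precision, baseline_recall, baseline_f1)
```

The baseline is right on all 9,983 legitimate rows, so its accuracy is high, yet it catches no fraud at all, which is what recall and F1 expose.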

In [88]:
# Standardize the Amount column with StandardScaler
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1,1))
# Drop features that are not needed
df.drop('Time', inplace = True, axis = 1)
df.head()

| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 0.244964 | 0 |
| 1 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | -0.342475 | 0 |
| 2 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 1.160686 | 0 |
| 3 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 0.140534 | 0 |
| 4 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | -0.073403 | 0 |
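The scaling cell above fits `StandardScaler` on every row, so the mean and variance it learns also include rows that later land in the test set. A possible variant, sketched here on a synthetic stand-in for the `Amount` column (the random data and the `amount_*` names are assumptions, not notebook code), fits the scaler on the training rows only and reuses those statistics for the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Amount column (assumption, not the real data)
rng = np.random.default_rng(0)
amount = rng.exponential(scale=90.0, size=(1000, 1))

# Split first, then scale
amount_train, amount_test = train_test_split(amount, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Mean and variance are learned from the training rows only
amount_train_scaled = scaler.fit_transform(amount_train)
# The same statistics are reused for the held-out rows
amount_test_scaled = scaler.transform(amount_test)

print(amount_train_scaled.mean(), amount_train_scaled.std())
```

For a single standardized column the effect on tree-based models is likely small, but the split-then-fit order keeps the test rows fully unseen during preprocessing.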


In [89]:
# Separate the label (y) from the features (x)
y = df.Class
x = df.drop(['Class'],axis=1)


In [90]:
# Resample the dataset with SMOTE so both classes have the same number of rows

from imblearn.over_sampling import SMOTE

x_re,y_re = SMOTE().fit_resample(x,y)
print("shape of resampled x: ", x_re.shape)
print("shape of resampled y: ", y_re.shape)

value_counts = Counter(y_re)
print(value_counts)

shape of resampled x:  (568630, 29)
shape of resampled y:  (568630,)
Counter({0: 284315, 1: 284315})
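The cell above resamples the full dataset before it is split, so any test set drawn from `x_re` is balanced and contains synthetic rows. An alternative order is to split first and oversample only the training split. The sketch below illustrates that order with plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE, on synthetic data from `make_classification`; the data, sizes, and variable names are all assumptions. With imbalanced-learn installed, `SMOTE().fit_resample(X_train, y_train)` could take the place of the oversampling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the credit card features (assumption)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# 1. Split first, so the test set keeps the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Oversample the minority class in the training split only
majority = X_train[y_train == 0]
minority = X_train[y_train == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

X_train_bal = np.vstack([majority, minority_up])
y_train_bal = np.concatenate(
    [np.zeros(len(majority), dtype=int), np.ones(len(minority_up), dtype=int)]
)

print("balanced train counts:", np.bincount(y_train_bal))
print("test fraud rate:", y_test.mean())
```

A model fitted on `X_train_bal` can then be scored on `X_test`, which still reflects the imbalance the model would face in practice.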


In [91]:
# Split the original and the resampled data into 80% train and 20% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x_train_re, x_test_re, y_train_re, y_test_re = train_test_split(x_re, y_re, test_size=0.2, random_state=42)
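With so few fraud rows, a plain random split can leave the train and test sets with slightly different fraud rates. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A minimal sketch on synthetic labels (the 1,000-row size and the `*_demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1,000 rows, 20 of them labeled as fraud (2%)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the 2% fraud rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train fraud rate:", y_tr.mean())
print("test fraud rate:", y_te.mean())
```

In the notebook this would amount to passing `stratify=y` in the first split, which is a suggestion rather than what the cell above does.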

### 2. Training the model
#### 2.1. Random Forest

In [92]:
model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)
y_pred_rf = model_rf.predict(x_test)

rf_score = model_rf.score(x_test, y_test)*100

rf_precision_score = precision_score(y_test,y_pred_rf )
rf_recall_score = recall_score(y_test,y_pred_rf )
rf_f1_score = f1_score(y_test,y_pred_rf )
rf_accuracy_score = accuracy_score(y_test,y_pred_rf )

rf_precision_score, rf_recall_score, rf_f1_score, rf_accuracy_score

(0.9743589743589743,
 0.7755102040816326,
 0.8636363636363635,
 0.9995786664794073)

#### 2.2. Decision Tree

In [93]:
model_dt = DecisionTreeClassifier()
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict(x_test)

dt_score = model_dt.score(x_test, y_test)*100

dt_precision_score = precision_score(y_test,y_pred_dt )
dt_recall_score = recall_score(y_test,y_pred_dt )
dt_f1_score = f1_score(y_test,y_pred_dt )
dt_accuracy_score = accuracy_score(y_test,y_pred_dt )

dt_precision_score, dt_recall_score, dt_f1_score, dt_accuracy_score

(0.7264150943396226,
 0.7857142857142857,
 0.7549019607843137,
 0.9991222218320986)

### 3. Evaluating model

Comparing the models with each other gives the results below.

In [107]:
# score_frame = pd.DataFrame('Model', 'Precision Score', 'Recall Score', 'F1 Score', 'Accuracy Score')
score_frame = {
    'Model': ['Decision tree', 'Random Forest'],
    'Precision Score': [dt_precision_score,rf_precision_score],
    'Recall Score': [dt_recall_score,rf_recall_score],
    'F1 Score': [dt_f1_score, rf_f1_score],
    'Accuracy Score': [dt_accuracy_score,rf_accuracy_score]}

score_frame = pd.DataFrame(score_frame)
score_frame

| | Model | Precision Score | Recall Score | F1 Score | Accuracy Score |
|---|---|---|---|---|---|
| 0 | Decision tree | 0.726415 | 0.785714 | 0.754902 | 0.999122 |
| 1 | Random Forest | 0.974359 | 0.77551 | 0.863636 | 0.999579 |


In both models, the accuracy score is close to 1, which means a high percentage of predictions are correct overall. However, with only about 0.17% of transactions being fraudulent, accuracy stays high even when many frauds are missed, so it is not a reliable metric for evaluating these models.

Therefore, to compare the models, we focus on the remaining three scores. Random Forest has much higher precision (0.974 vs 0.726) and F1 score (0.864 vs 0.755), while recall is about the same for both models (0.776 vs 0.786). As predicted, the Random Forest model performs better overall on the dataset with class-imbalance issues.
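To make the link between the four scores concrete, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the reported Random Forest scores and are not notebook output; they assume the test split holds 56,962 rows (20% of 284,807, rounded up).

```python
# Counts inferred from the reported Random Forest scores (assumption, not notebook output)
tp, fp, fn = 76, 2, 22          # true positives, false positives, false negatives
n_test = 56_962                 # assumed size of the 20% test split
tn = n_test - tp - fp - fn      # everything else is a true negative

precision = tp / (tp + fp)      # share of flagged transactions that are fraud
recall = tp / (tp + fn)         # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n_test   # dominated by the huge number of true negatives

print(f"precision={precision:.6f} recall={recall:.6f} "
      f"f1={f1:.6f} accuracy={accuracy:.6f}")
```

Under these assumed counts, only 24 of 56,962 predictions are wrong, which is why accuracy looks nearly perfect even though 22 of the 98 frauds are missed.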

So, what about the resampling technique?

In [96]:
# Random Forest on the resampled dataset
model_rf_re = RandomForestClassifier()
model_rf_re.fit(x_train_re, y_train_re)
y_pred_rf_re = model_rf_re.predict(x_test_re)

rf_score_re = model_rf_re.score(x_test_re, y_test_re)*100

rf_precision_score_re = precision_score(y_test_re,y_pred_rf_re )
rf_recall_score_re = recall_score(y_test_re,y_pred_rf_re )
rf_f1_score_re = f1_score(y_test_re,y_pred_rf_re )
rf_accuracy_score_re = accuracy_score(y_test_re,y_pred_rf_re )


In [97]:
# Decision Tree on the resampled dataset
model_dt_re = DecisionTreeClassifier()
model_dt_re.fit(x_train_re, y_train_re)
y_pred_dt_re = model_dt_re.predict(x_test_re)

dt_score_re = model_dt_re.score(x_test_re, y_test_re)*100

dt_precision_score_re = precision_score(y_test_re,y_pred_dt_re )
dt_recall_score_re = recall_score(y_test_re,y_pred_dt_re )
dt_f1_score_re = f1_score(y_test_re,y_pred_dt_re )
dt_accuracy_score_re = accuracy_score(y_test_re,y_pred_dt_re )

In [108]:
re_frame = {
    'Model': ['Decision tree resam', 'Random Forest resam'],
    'Precision Score': [dt_precision_score_re,rf_precision_score_re],
    'Recall Score': [dt_recall_score_re,rf_recall_score_re],
    'F1 Score': [dt_f1_score_re, rf_f1_score_re],
    'Accuracy Score': [dt_accuracy_score_re,rf_accuracy_score_re]}

re_frame = pd.DataFrame(re_frame)

score_frame = pd.concat([score_frame,re_frame], ignore_index=True)
score_frame

| | Model | Precision Score | Recall Score | F1 Score | Accuracy Score |
|---|---|---|---|---|---|
| 0 | Decision tree | 0.726415 | 0.785714 | 0.754902 | 0.999122 |
| 1 | Random Forest | 0.974359 | 0.77551 | 0.863636 | 0.999579 |
| 2 | Decision tree resam | 0.997406 | 0.998807 | 0.998106 | 0.998101 |
| 3 | Random Forest resam | 0.999789 | 1.0 | 0.999895 | 0.999894 |


### 4. Findings

- On the resampled dataset, both the Decision Tree and Random Forest models score much higher on precision, recall, and F1 than the models trained on the original imbalanced dataset. Because SMOTE was applied before the train/test split, these scores are measured on a balanced test set that also contains synthetic samples, so they are not directly comparable with the scores on the original test set.
- The Random Forest model performs best on the resampled dataset, with near-perfect accuracy and F1 score, and it also beats the Decision Tree on precision and F1 on the original data. This suggests that Random Forest is well suited to this task and can learn from the minority class (fraud) examples, although the resampled results should be confirmed on a test set that was not resampled.