# Fraud Detection: Modelling and Reporting

Author: Li Zhao-Zhi

**Background**

Data-driven fraud detection reveals fraud patterns and predicts future fraud to safeguard customer assets in banking. This research analyses anonymised real-world fraud datasets, reports insights via dashboards, and models fraud patterns to predict future fraudulent activity.

The project covers the full lifecycle: data cleansing, exploratory data analysis (EDA), modelling, and evaluation of performance metrics. Key steps are explained in the accompanying markdown documentation to illustrate the model design.

**Datasets**

The IEEE-CIS Fraud Detection datasets contain anonymised real-world card transactions labelled as fraudulent or legitimate. The source data is available from [IEEE-CIS Fraud Detection](https://www.kaggle.com/competitions/ieee-fraud-detection/data) on Kaggle, and the column definitions are described in [this discussion thread](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203).

**Keywords**

Fraud Detection, Logistic Regression, Random Forest, Isolation Forest, XGBoost, LightGBM

## Setup: Libraries and Datasets

Import the data science libraries needed for data cleaning, exploratory data analysis, and modelling, then read in the datasets.

In [1]:
# import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# scikit-learn machine learning


In [2]:
# read the data files
train_identity = pd.read_csv(r"D:\My Docs\Strategic Planning\Career Growth\Analytics Projects\IEEE CIS Fraud Detection\datasets\train_identity.csv")
train_identity.head()

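| | TransactionID | id_01 | id_02 | id_03 | id_04 | id_05 | id_06 | id_07 | id_08 | id_09 | ... | id_31 | id_32 | id_33 | id_34 | id_35 | id_36 | id_37 | id_38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987004 | 0.0 | 70787.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | samsung browser 6.2 | 32.0 | 2220x1080 | match_status:2 | T | F | T | T | mobile | SAMSUNG SM-G892A Build/NRD90M |
| 1 | 2987008 | -5.0 | 98945.0 | NaN | NaN | 0.0 | -5.0 | NaN | NaN | NaN | ... | mobile safari 11.0 | 32.0 | 1334x750 | match_status:1 | T | F | F | T | mobile | iOS Device |
| 2 | 2987010 | -5.0 | 191631.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 62.0 | NaN | NaN | NaN | F | F | T | T | desktop | Windows |
| 3 | 2987011 | -5.0 | 221832.0 | NaN | NaN | 0.0 | -6.0 | NaN | NaN | NaN | ... | chrome 62.0 | NaN | NaN | NaN | F | F | T | T | desktop | NaN |
| 4 | 2987016 | 0.0 | 7460.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 62.0 | 24.0 | 1280x800 | match_status:2 | T | F | T | T | desktop | MacOS |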


In [3]:
train_identity.shape

(144233, 41)

In [4]:
train_transaction = pd.read_csv(r"D:\My Docs\Strategic Planning\Career Growth\Analytics Projects\IEEE CIS Fraud Detection\datasets\train_transaction.csv")
train_transaction.head()

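| | TransactionID | isFraud | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2987000 | 0 | 86400 | 68.5 | W | 13926 | NaN | 150.0 | discover | 142.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2987001 | 0 | 86401 | 29.0 | W | 2755 | 404.0 | 150.0 | mastercard | 102.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2987002 | 0 | 86469 | 59.0 | W | 4663 | 490.0 | 150.0 | visa | 166.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2987003 | 0 | 86499 | 50.0 | W | 18132 | 567.0 | 150.0 | mastercard | 117.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2987004 | 0 | 86506 | 50.0 | H | 4497 | 514.0 | 150.0 | mastercard | 102.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |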


In [5]:
train_transaction.shape

(590540, 394)

The shapes above show that `train_identity` has 144,233 rows and 41 columns, whilst `train_transaction` has 590,540 rows and 394 columns.
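
Both training tables carry a `TransactionID` column, and the row counts suggest that only some transactions have an identity record. The sketch below shows one way the two tables could be combined, assuming a left join on `TransactionID` is appropriate (the notebook has not performed a merge at this point). The toy frames and their values are made-up stand-ins for the real data.

```python
import pandas as pd

# Toy stand-ins for train_transaction and train_identity (illustrative values only)
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 0, 1, 0],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction; rows without identity data get NaN.
# indicator=True adds a "_merge" column recording where each row came from.
merged = transaction.merge(identity, on="TransactionID", how="left", indicator=True)

# Share of transactions that have a matching identity record
identity_coverage = (merged["_merge"] == "both").mean()
print(merged.shape, identity_coverage)
```

If `TransactionID` is unique in both real tables and every identity row matches a transaction, 144,233 of 590,540 transactions (roughly 24%) would carry identity features after such a join.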

--- 

In [6]:
test_identity = pd.read_csv(r"D:\My Docs\Strategic Planning\Career Growth\Analytics Projects\IEEE CIS Fraud Detection\datasets\test_identity.csv")
test_identity.head()

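| | TransactionID | id-01 | id-02 | id-03 | id-04 | id-05 | id-06 | id-07 | id-08 | id-09 | ... | id-31 | id-32 | id-33 | id-34 | id-35 | id-36 | id-37 | id-38 | DeviceType | DeviceInfo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3663586 | -45.0 | 280290.0 | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | ... | chrome 67.0 for android | NaN | NaN | NaN | F | F | T | F | mobile | MYA-L13 Build/HUAWEIMYA-L13 |
| 1 | 3663588 | 0.0 | 3579.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 67.0 for android | 24.0 | 1280x720 | match_status:2 | T | F | T | T | mobile | LGLS676 Build/MXB48T |
| 2 | 3663597 | -5.0 | 185210.0 | NaN | NaN | 1.0 | 0.0 | NaN | NaN | NaN | ... | ie 11.0 for tablet | NaN | NaN | NaN | F | T | T | F | desktop | Trident/7.0 |
| 3 | 3663601 | -45.0 | 252944.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | ... | chrome 67.0 for android | NaN | NaN | NaN | F | F | T | F | mobile | MYA-L13 Build/HUAWEIMYA-L13 |
| 4 | 3663602 | -95.0 | 328680.0 | NaN | NaN | 7.0 | -33.0 | NaN | NaN | NaN | ... | chrome 67.0 for android | NaN | NaN | NaN | F | F | T | F | mobile | SM-G9650 Build/R16NW |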


In [7]:
test_identity.shape

(141907, 41)

In [8]:
test_transaction = pd.read_csv(r"D:\My Docs\Strategic Planning\Career Growth\Analytics Projects\IEEE CIS Fraud Detection\datasets\test_transaction.csv")
test_transaction.head()

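| | TransactionID | TransactionDT | TransactionAmt | ProductCD | card1 | card2 | card3 | card4 | card5 | card6 | ... | V330 | V331 | V332 | V333 | V334 | V335 | V336 | V337 | V338 | V339 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3663549 | 18403224 | 31.95 | W | 10409 | 111.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3663550 | 18403263 | 49.0 | W | 4272 | 111.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3663551 | 18403310 | 171.0 | W | 4476 | 574.0 | 150.0 | visa | 226.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3663552 | 18403310 | 284.95 | W | 10989 | 360.0 | 150.0 | visa | 166.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3663553 | 18403317 | 67.95 | W | 18018 | 452.0 | 150.0 | mastercard | 117.0 | debit | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |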


In [9]:
test_transaction.shape

(506691, 393)
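
The previews show that `test_identity` names its identity columns with hyphens (`id-01` to `id-38`), while `train_identity` uses underscores (`id_01` to `id_38`). A minimal sketch of one way to align the names before the two sets are processed together, using empty toy frames; `harmonise_id_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def harmonise_id_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'id-XX' column names rewritten as 'id_XX' (hypothetical helper)."""
    return df.rename(
        columns=lambda c: c.replace("id-", "id_", 1) if c.startswith("id-") else c
    )

# Toy stand-ins that mimic the column naming seen in the previews
train_like = pd.DataFrame(columns=["TransactionID", "id_01", "id_02", "DeviceType"])
test_like = pd.DataFrame(columns=["TransactionID", "id-01", "id-02", "DeviceType"])

test_fixed = harmonise_id_columns(test_like)

# After renaming, both frames expose the same column names
print(list(test_fixed.columns) == list(train_like.columns))
```

Renaming on a copy leaves the loaded frame untouched, which makes it easy to compare the column sets before and after.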

## Data Cleansing

Before any analysis, the data must be cleansed to handle missing values, outliers, inconsistent data types, and other technical issues such as memory usage.
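
As a rough illustration of two such checks, the sketch below computes the share of missing values per column and downcasts numeric columns to smaller types to reduce memory. The 0.8 threshold, the toy data, and the helper names `missing_ratio` and `downcast_numeric` are assumptions for illustration, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

def missing_ratio(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast to smaller dtypes where safe."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame loosely modelled on the transaction table (illustrative values only)
df = pd.DataFrame({
    "TransactionID": np.arange(2987000, 2987010, dtype="int64"),
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0, 50.0, 49.0, 159.0, 422.5, 15.0, 117.0],
    "V330": [np.nan] * 9 + [0.0],
})

ratios = missing_ratio(df)
mostly_missing = ratios[ratios > 0.8].index.tolist()  # assumed threshold
cleaned = downcast_numeric(df.drop(columns=mostly_missing))

print(mostly_missing)
print(cleaned.dtypes)
```

Whether a mostly empty column should be dropped or kept as a signal depends on the model; tree-based learners such as XGBoost and LightGBM can handle missing values directly, so the threshold here is only a placeholder.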