# Imports

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style("dark")

# Data

In [6]:
fraud_df = pd.read_csv("data/fraud_detection_dataset.csv")

# Data dictionary

- Available in **`README.md`**

# EDA

In [7]:
# checking the target distribution
fraud_df["isFraud"].value_counts(normalize=True)

0    0.998709
1    0.001291
Name: isFraud, dtype: float64

As in most fraud detection problems, non-fraudulent records (0) vastly outnumber fraudulent ones (1). About 99.87% of the rows are labelled as not fraud, so a model that always predicts "not fraud" would be right 99.87% of the time while catching no fraud at all.
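To make the baseline argument concrete, here is a minimal sketch on made-up labels (`toy_labels` is a hypothetical stand-in for `fraud_df["isFraud"]`, not the real data). It compares the accuracy of an "always predict 0" model with its recall on the fraud class.

```python
import pandas as pd

# Toy labels standing in for fraud_df["isFraud"]; the values are made up for illustration.
toy_labels = pd.Series([0] * 998 + [1] * 2, name="isFraud")

# A trivial "model" that always predicts the majority class (0 = not fraud).
predictions = pd.Series(0, index=toy_labels.index)

# Accuracy: share of rows where the prediction matches the label.
baseline_accuracy = (predictions == toy_labels).mean()

# Recall on the fraud class: share of actual frauds that were predicted as fraud.
caught = ((predictions == 1) & (toy_labels == 1)).sum()
fraud_recall = caught / (toy_labels == 1).sum()

print(f"accuracy: {baseline_accuracy:.3f}, fraud recall: {fraud_recall:.3f}")
```

High accuracy paired with zero recall is the reason accuracy alone is a poor metric for a target this imbalanced.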

Let's rename the columns to snake case, purely as a matter of preference.

In [8]:
fraud_df.columns = ["step", "type", "amount", "name_orig", "old_balance_orig", "new_balance_orig", "name_dest", "old_balance_dest", "new_balance_dest", "is_fraud", "is_flagged_fraud"]

# reorder the columns so that the target is the last one
fraud_df = fraud_df[['step', 'type', 'amount', 'name_orig', 'old_balance_orig',
       'new_balance_orig', 'name_dest', 'old_balance_dest', 'new_balance_dest', 'is_flagged_fraud',
       'is_fraud']]
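Assigning to `fraud_df.columns` depends on the order of the columns in the file. A possible alternative, sketched here on a toy frame, is to rename through an explicit mapping and then move the target to the end. The original names `nameOrig` and `isFlaggedFraud` are assumptions (only `isFraud` appears in the cells above), and `toy_df` is a hypothetical stand-in.

```python
import pandas as pd

# Toy frame; "nameOrig" and "isFlaggedFraud" are assumed original column names.
toy_df = pd.DataFrame({
    "step": [1, 1],
    "nameOrig": ["C1", "C2"],
    "isFraud": [0, 1],
    "isFlaggedFraud": [0, 0],
})

# Explicit old -> new mapping, independent of column order in the file.
rename_map = {
    "nameOrig": "name_orig",
    "isFraud": "is_fraud",
    "isFlaggedFraud": "is_flagged_fraud",
}
toy_df = toy_df.rename(columns=rename_map)

# Move the target to the last position without listing every column by hand.
target = "is_fraud"
toy_df = toy_df[[col for col in toy_df.columns if col != target] + [target]]

print(list(toy_df.columns))
```

A mapping keeps working if the file's column order changes, while positional assignment would silently mislabel columns.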

Let's split the dataset into categorical and numerical variables. With only 11 variables, it is easy to tell which is which:

- categorical: ["type", "name_orig", "name_dest", "is_flagged_fraud", "is_fraud"]
- numerical: ["step", "amount", "old_balance_orig", "new_balance_orig", "old_balance_dest", "new_balance_dest"]

In [9]:
# separating the dataset
fraud_categorical_df = fraud_df[["type", "name_orig", "name_dest", "is_flagged_fraud", "is_fraud"]]
fraud_numerical_df = fraud_df[["step", "amount", "old_balance_orig", "new_balance_orig", "old_balance_dest", "new_balance_dest", "is_fraud"]]  # keeps the target for the exploration step
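The manual split works well for 11 columns. As a rough cross-check, the same grouping can be approximated from the dtypes: non-numeric columns plus numeric columns with at most two distinct values (binary flags) are treated as categorical. This is a sketch on a hypothetical `toy_df`, and the two-value threshold is an assumption, not a rule from the notebook.

```python
import pandas as pd

# Toy frame mimicking a few of the notebook's columns (values are illustrative).
toy_df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "step": [1, 2, 3, 4],
    "amount": [9839.64, 181.00, 181.00, 11668.14],
    "is_fraud": [0, 1, 1, 0],
})

# Non-numeric columns are categorical by construction.
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()

# Assumption: numeric columns with at most 2 distinct values are binary flags.
flag_cols = [col for col in numeric_cols if toy_df[col].nunique() <= 2]

categorical_cols = non_numeric_cols + flag_cols
numerical_cols = [col for col in numeric_cols if col not in flag_cols]

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```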

The resulting dataframes:

In [10]:
fraud_categorical_df

Unnamed: 0,type,name_orig,name_dest,is_flagged_fraud,is_fraud
0,PAYMENT,C1231006815,M1979787155,0,0
1,PAYMENT,C1666544295,M2044282225,0,0
2,TRANSFER,C1305486145,C553264065,0,1
3,CASH_OUT,C840083671,C38997010,0,1
4,PAYMENT,C2048537720,M1230701703,0,0
...,...,...,...,...,...
6362615,CASH_OUT,C786484425,C776919290,0,1
6362616,TRANSFER,C1529008245,C1881841831,0,1
6362617,CASH_OUT,C1162922333,C1365125890,0,1
6362618,TRANSFER,C1685995037,C2080388513,0,1


In [11]:
fraud_numerical_df

Unnamed: 0,step,amount,old_balance_orig,new_balance_orig,old_balance_dest,new_balance_dest,is_fraud
0,1,9839.64,170136.00,160296.36,0.00,0.00,0
1,1,1864.28,21249.00,19384.72,0.00,0.00,0
2,1,181.00,181.00,0.00,0.00,0.00,1
3,1,181.00,181.00,0.00,21182.00,0.00,1
4,1,11668.14,41554.00,29885.86,0.00,0.00,0
...,...,...,...,...,...,...,...
6362615,743,339682.13,339682.13,0.00,0.00,339682.13,1
6362616,743,6311409.28,6311409.28,0.00,0.00,0.00,1
6362617,743,6311409.28,6311409.28,0.00,68488.84,6379898.11,1
6362618,743,850002.52,850002.52,0.00,0.00,0.00,1


The dataset is highly imbalanced, so it is important to identify relationships between the fraud cases (is_fraud = 1) and the other variables. Let's investigate.

## EDA Categorical

I'll drop the name columns for now. They appear to be anonymized identifiers and probably carry little information. I may come back to them later if needed.
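One quick way to sanity-check that assumption is to look at how many distinct values each name column has relative to the number of rows. The sketch below uses a hypothetical `toy_names` frame with made-up identifiers. A ratio close to 1 suggests the column behaves like a row identifier, although this is only a heuristic and does not prove the column is uninformative.

```python
import pandas as pd

# Toy identifiers in the style of the name columns (values are made up).
toy_names = pd.DataFrame({
    "name_orig": ["C1", "C2", "C3", "C4"],
    "name_dest": ["M1", "M1", "C9", "M2"],
})

# Fraction of distinct values per column; 1.0 means every row has its own ID.
unique_ratio = toy_names.nunique() / len(toy_names)

print(unique_ratio)
```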

In [12]:
fraud_categorical_dropped_df = fraud_categorical_df.drop(["name_orig", "name_dest"], axis=1)
fraud_categorical_dropped_df

Unnamed: 0,type,is_flagged_fraud,is_fraud
0,PAYMENT,0,0
1,PAYMENT,0,0
2,TRANSFER,0,1
3,CASH_OUT,0,1
4,PAYMENT,0,0
...,...,...,...
6362615,CASH_OUT,0,1
6362616,TRANSFER,0,1
6362617,CASH_OUT,0,1
6362618,TRANSFER,0,1


In [21]:
# fraud by type
fraud_by_type_df = fraud_categorical_dropped_df[["type", "is_fraud"]].value_counts().reset_index()
fraud_by_type_df.columns = ["type", "is_fraud", "qtt"]
fraud_by_type_df.sort_values(["type"])

Unnamed: 0,type,is_fraud,qtt
2,CASH_IN,0,1399284
0,CASH_OUT,0,2233384
5,CASH_OUT,1,4116
4,DEBIT,0,41432
1,PAYMENT,0,2151495
3,TRANSFER,0,528812
6,TRANSFER,1,4097


A few observations:

- **CASH_IN, DEBIT, PAYMENT: none of these categories contains any fraud**
- CASH_OUT: about 99.82% of these transactions are not fraud (4,116 frauds out of 2,237,500)
- TRANSFER: about 99.23% of these transactions are not fraud (4,097 frauds out of 532,909)
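The non-fraud share per type can be recomputed directly from the counts in the table above. The sketch below copies those counts into a small frame (`counts` is a hypothetical name) and pivots them so that each type has one column per class.

```python
import pandas as pd

# Counts copied from the fraud-by-type table above.
counts = pd.DataFrame({
    "type": ["CASH_IN", "CASH_OUT", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER", "TRANSFER"],
    "is_fraud": [0, 0, 1, 0, 0, 0, 1],
    "qtt": [1399284, 2233384, 4116, 41432, 2151495, 528812, 4097],
})

# One row per type, one column per class; types with no fraud get 0.
wide = counts.pivot(index="type", columns="is_fraud", values="qtt").fillna(0)

# Percentage of non-fraud transactions within each type.
not_fraud_pct = (wide[0] / wide.sum(axis=1) * 100).round(2)

print(not_fraud_pct)
```

On the full dataframe, `fraud_categorical_dropped_df.groupby("type")["is_fraud"].mean()` should give the fraud rate per type in a single step.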
