# Introduction

As the financial sector moves toward ubiquitous digitization, the risk of fraud keeps growing, posing significant challenges to both financial institutions and customers. As a result, robust fraud detection systems capable of identifying and mitigating fraudulent activity are increasingly important.

## Project Description
This notebook provides an exploratory data analysis of the Base variant of the Bank Account Fraud dataset, published at NeurIPS 2022.

## Dataset Description
The dataset is available at https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022/data.

This synthetic tabular dataset comprises 1M instances, where each instance represents a bank account application. The dataset contains 31 features and a binary target variable indicating whether the application is fraudulent. The features describe the applicant or the application and are a mix of numerical and categorical types. The raw file contains no explicit NaN values; however, according to the datasheet, missing values in some features are encoded as negative numbers. The dataset is highly imbalanced, with only ~1% of the instances labeled as fraudulent. Because the data is synthetic, generated from real-world data, the privacy of the original applicants is protected.

A detailed description of the dataset can be found in the datasheet at https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf.

# Imports and Data Loading

In [25]:
# Import libraries
import numpy as np
import pandas as pd

# Set the maximum number of columns and rows to display
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

import matplotlib.pyplot as plt
import matplotlib.colors

import seaborn as sns
sns.set_style('darkgrid')
# Use a colorblind-friendly palette
sns.set_palette('colorblind')

from sklearn.model_selection import train_test_split

import statsmodels.api as sm

In [27]:
# Load the dataset
total_df = pd.read_csv('./Data/Base.csv')

# Define features and target
X = total_df.drop(columns='fraud_bool')
y = total_df['fraud_bool']

# Perform stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

train = pd.concat([X_train, y_train], axis=1).copy()
test = pd.concat([X_test, y_test], axis=1).copy()
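
The effect of `stratify=y` can be checked on a small example. The following is a minimal sketch on hypothetical toy data (the names `toy_X` and `toy_y` and the 1% positive rate are assumptions chosen for illustration, not the real dataset); it shows that a stratified split keeps the positive rate the same in both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10,000 rows with a 1% positive rate (not the real dataset)
rng = np.random.default_rng(42)
toy_y = pd.Series(
    np.r_[np.ones(100, dtype=int), np.zeros(9_900, dtype=int)], name="fraud_bool"
)
toy_X = pd.DataFrame({"feature": rng.normal(size=len(toy_y))})

# Same split settings as the cell above: 80/20, stratified on the target
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# The mean of a 0/1 target is the share of positives
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"Train positive rate: {train_rate:.2%}")
print(f"Test positive rate:  {test_rate:.2%}")
```

Without stratification, a rare class can end up noticeably over- or under-represented in the smaller partition purely by chance.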

# Exploratory Data Analysis

## Target

In [47]:
print("Target: 'fraud_bool'")
print(f"Data type: {y_train.dtype}")
print(f"Unique values: {y_train.dropna().unique()}")
print(f"NaN values: {y_train.isna().sum()}")
# isnull() is an alias of isna(), so this count always matches the one above
print(f"Null values: {y_train.isnull().sum()}")

# Get count and distribution
count_distribution = y_train.value_counts()
proportion_distribution = y_train.value_counts(normalize=True)

print("\nCount and Distribution of 'fraud_bool':")
for value in count_distribution.index:
    count = count_distribution[value]
    proportion = proportion_distribution[value]
    print(f"Value {value}: {count} ({proportion:.2%})")

Target: 'fraud_bool'
Data type: int64
Unique values: [0 1]
NaN values: 0
Null values: 0

Count and Distribution of 'fraud_bool':
Value 0: 791177 (98.90%)
Value 1: 8823 (1.10%)
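
To put the imbalance in perspective, the counts above can be turned into two simple figures. This is a rough arithmetic sketch; the variable names are illustrative, and the counts are copied from the training-set output above.

```python
# Counts copied from the training-set output above
n_legit = 791_177
n_fraud = 8_823

# Number of legitimate applications per fraudulent one
imbalance_ratio = n_legit / n_fraud

# Accuracy of a trivial model that always predicts "not fraud"
baseline_accuracy = n_legit / (n_legit + n_fraud)

print(f"Imbalance ratio: {imbalance_ratio:.1f} : 1")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Because a model that never flags fraud already reaches this accuracy, plain accuracy is unlikely to be a useful metric for this target.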


## Features

In [56]:
# Shows the first 5 observations of the training data
X_train.head()

Unnamed: 0,income,name_email_similarity,prev_address_months_count,current_address_months_count,customer_age,days_since_request,intended_balcon_amount,payment_type,zip_count_4w,velocity_6h,velocity_24h,velocity_4w,bank_branch_count_8w,date_of_birth_distinct_emails_4w,employment_status,credit_risk_score,email_is_free,housing_status,phone_home_valid,phone_mobile_valid,bank_months_count,has_other_cards,proposed_credit_limit,foreign_request,source,session_length_in_minutes,device_os,keep_alive_session,device_distinct_emails_8w,device_fraud_count,month
39111,0.7,0.229712,-1,63,50,0.02472,50.674001,AA,1305,12764.326278,6418.672862,5998.527006,7,13,CA,97,1,BC,0,1,24,1,1500.0,0,INTERNET,3.58055,linux,0,1,0,0
822700,0.2,0.928428,199,24,70,0.014153,15.631407,AA,833,9717.635327,6342.913428,4814.609668,1,5,CC,144,0,BD,1,0,20,0,500.0,0,INTERNET,7.087779,other,1,1,0,6
914415,0.1,0.65863,95,2,40,0.045801,-1.410133,AB,237,2201.833206,2753.815567,3076.055489,14,5,CA,87,1,BC,1,1,26,0,200.0,0,INTERNET,0.547804,other,1,1,0,7
581307,0.8,0.774858,-1,122,30,0.005569,-0.539938,AB,895,5377.25466,4551.599208,4223.827504,0,5,CA,206,0,BE,0,1,1,1,500.0,0,INTERNET,4.671407,other,1,1,0,4
603136,0.9,0.99346,103,9,20,0.010832,-0.501067,AB,4105,7428.775954,4872.930234,4250.760719,14,11,CB,114,1,BC,1,1,1,1,200.0,0,INTERNET,9.293206,linux,0,1,0,4


In [55]:
X_train.shape

(800000, 31)

In [53]:
X_train.dtypes

income                              float64
name_email_similarity               float64
prev_address_months_count             int64
current_address_months_count          int64
customer_age                          int64
days_since_request                  float64
intended_balcon_amount              float64
payment_type                         object
zip_count_4w                          int64
velocity_6h                         float64
velocity_24h                        float64
velocity_4w                         float64
bank_branch_count_8w                  int64
date_of_birth_distinct_emails_4w      int64
employment_status                    object
credit_risk_score                     int64
email_is_free                         int64
housing_status                       object
phone_home_valid                      int64
phone_mobile_valid                    int64
bank_months_count                     int64
has_other_cards                       int64
proposed_credit_limit           

In [61]:
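# Separate numeric and non-numeric (categorical) features
num_feats = X_train.select_dtypes(include='number').columns
cat_feats = X_train.select_dtypes(exclude='number').columns

# Numeric features with at least `thresh` unique values are treated as continuous
thresh = 25

cont_feats = []
disc_feats = []

# Note: unique values are counted on the full dataset (total_df), not only the training split
for feat in num_feats: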
    if total_df[feat].nunique() >= thresh:
        cont_feats.append(feat)
    else:
        disc_feats.append(feat)

print("Total Features:", X_train.shape[1])
print("\nContinuous Features ({}): {}".format(len(cont_feats), cont_feats))
print("\nDiscrete Features ({}): {}".format(len(disc_feats), disc_feats))
print("\nCategorical Features ({}): {}".format(len(cat_feats), cat_feats))

Total Features: 31

Continuous Features (14): ['name_email_similarity', 'prev_address_months_count', 'current_address_months_count', 'days_since_request', 'intended_balcon_amount', 'zip_count_4w', 'velocity_6h', 'velocity_24h', 'velocity_4w', 'bank_branch_count_8w', 'date_of_birth_distinct_emails_4w', 'credit_risk_score', 'bank_months_count', 'session_length_in_minutes']

Discrete Features (12): ['income', 'customer_age', 'email_is_free', 'phone_home_valid', 'phone_mobile_valid', 'has_other_cards', 'proposed_credit_limit', 'foreign_request', 'keep_alive_session', 'device_distinct_emails_8w', 'device_fraud_count', 'month']

Categorical Features (5): Index(['payment_type', 'employment_status', 'housing_status', 'source',
       'device_os'],
      dtype='object')
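
The same split can be wrapped in a reusable helper. The following is a minimal sketch, assuming the unique-value threshold used above; the function name `split_feature_types` and the toy frame are hypothetical and only serve as an illustration.

```python
import pandas as pd


def split_feature_types(df, thresh=25):
    """Split columns into continuous, discrete, and categorical lists.

    Numeric columns with at least `thresh` unique values are treated as
    continuous, the remaining numeric columns as discrete, and all
    non-numeric columns as categorical.
    """
    num_cols = df.select_dtypes(include="number").columns
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    n_unique = df[num_cols].nunique()
    continuous = n_unique[n_unique >= thresh].index.tolist()
    discrete = n_unique[n_unique < thresh].index.tolist()
    return continuous, discrete, categorical


# Hypothetical toy frame with one column of each kind
toy = pd.DataFrame({
    "amount": [i * 1.5 for i in range(30)],                     # 30 unique values
    "flag": [i % 2 for i in range(30)],                         # 2 unique values
    "channel": ["web" if i % 3 else "app" for i in range(30)],  # non-numeric
})

cont_cols, disc_cols, cat_cols = split_feature_types(toy)
print("Continuous:", cont_cols)
print("Discrete:", disc_cols)
print("Categorical:", cat_cols)
```

The threshold is a heuristic: a count-like integer feature with many distinct values lands in the continuous group, as several `*_count_*` features do above.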


In [65]:
# Per the datasheet, negative values in the following features represent missing values
cols_missing = [
    'prev_address_months_count', 'current_address_months_count',
    'bank_months_count', 'session_length_in_minutes',
    'device_distinct_emails_8w', 'intended_balcon_amount'
]

# Replace all negative values with NaN
X_train[cols_missing] = X_train[cols_missing].mask(X_train[cols_missing] < 0, np.nan)

# Calculate missing values percentage and display as a table
missing_values = (X_train.isna().sum() / len(X_train) * 100).loc[lambda x: x > 0]
missing_table = pd.DataFrame(missing_values, columns=["Missing %"]).sort_values(by="Missing %")

# Print the missing values table
print("Missing Values Table:\n", missing_table)


Missing Values Table:
                               Missing %
device_distinct_emails_8w      0.036250
session_length_in_minutes      0.202875
current_address_months_count   0.423875
bank_months_count             25.325500
prev_address_months_count     71.315625
intended_balcon_amount        74.233500
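
The sentinel-to-NaN step can also be written as a small function that leaves its input untouched. This is a sketch on hypothetical toy data; the helper name `negatives_to_nan` is an assumption, while the column names are borrowed from the dataset for readability.

```python
import pandas as pd


def negatives_to_nan(df, cols):
    """Return a copy of `df` where negative values in `cols` become NaN."""
    out = df.copy()
    # where() keeps values that satisfy the condition and sets the rest to NaN
    out[cols] = out[cols].where(out[cols] >= 0)
    return out


# Hypothetical toy frame; -1 marks a missing value in the first two columns
toy = pd.DataFrame({
    "bank_months_count": [-1, 5, 12, -1],
    "session_length_in_minutes": [3.2, -1.0, 7.5, 0.4],
    "customer_age": [30, 40, 50, 20],
})

cleaned = negatives_to_nan(toy, ["bank_months_count", "session_length_in_minutes"])

# Share of missing values per column, in percent
missing_pct = cleaned.isna().mean() * 100
print(missing_pct)
```

Working on a copy keeps the original sentinel values available, which matters when later cells still expect the raw encoding.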


In [13]:
# Show summary statistics for numerical columns
# (`train` was copied before the negative sentinels were replaced, so minimums of -1 still appear)
print(train.describe())

          fraud_bool         income  name_email_similarity  \
count  800000.000000  800000.000000          800000.000000   
mean        0.011029       0.562860               0.493798   
std         0.104437       0.290343               0.289099   
min         0.000000       0.100000               0.000001   
25%         0.000000       0.300000               0.225325   
50%         0.000000       0.600000               0.492314   
75%         0.000000       0.800000               0.755595   
max         1.000000       0.900000               0.999999   

       prev_address_months_count  current_address_months_count   customer_age  \
count              800000.000000                 800000.000000  800000.000000   
mean                   16.700988                     86.614125      33.700075   
std                    44.017921                     88.391093      12.028264   
min                    -1.000000                     -1.000000      10.000000   
25%                    -1.000000    

In [19]:
# Count explicit NaN values per column (negative sentinel values in `train` are not counted)
print(train.isnull().sum())

fraud_bool                          0
income                              0
name_email_similarity               0
prev_address_months_count           0
current_address_months_count        0
customer_age                        0
days_since_request                  0
intended_balcon_amount              0
payment_type                        0
zip_count_4w                        0
velocity_6h                         0
velocity_24h                        0
velocity_4w                         0
bank_branch_count_8w                0
date_of_birth_distinct_emails_4w    0
employment_status                   0
credit_risk_score                   0
email_is_free                       0
housing_status                      0
phone_home_valid                    0
phone_mobile_valid                  0
bank_months_count                   0
has_other_cards                     0
proposed_credit_limit               0
foreign_request                     0
source                              0
session_leng

In [21]:
# Get the distribution of fraud_bool
distribution = train['fraud_bool'].value_counts(normalize=True)
print("\nDistribution of fraud_bool:")
for value, proportion in distribution.items():
    print(f"Value {value}: {proportion:.2%}")


Distribution of fraud_bool:
Value 0: 98.90%
Value 1: 1.10%


In [22]:
# Number of unique values in each float column
train.select_dtypes(include=['float64']).nunique()

income                            9
name_email_similarity        799289
days_since_request           793121
intended_balcon_amount       796805
velocity_6h                  799150
velocity_24h                 799310
velocity_4w                  798908
proposed_credit_limit            12
session_length_in_minutes    796391
dtype: int64

In [23]:
# Numerical dataframe: float columns, excluding the two that take only a few distinct values
num_df = train.select_dtypes(include=['float64']).drop(columns=['income', 'proposed_credit_limit'])

# Categorical dataframe: integer and object columns, plus the two low-cardinality float columns
cat_df = train.select_dtypes(include=['int64', 'object']).copy()
cat_df[['income', 'proposed_credit_limit']] = train[['income', 'proposed_credit_limit']]
cat_df['income'] = cat_df['income'].round(1)
cat_df['proposed_credit_limit'] = cat_df['proposed_credit_limit'].round(0).astype('int64')

