# Bank Fraud Detection

*About this Dataset:*
- This dataset contains synthetic transaction data designed to help build and evaluate fraud detection models. The data represents realistic financial transactions, capturing important details about users, their transactions, and various risk factors. It's perfect for training machine learning models, especially for binary classification tasks like identifying fraudulent transactions.

- The dataset has a mix of numerical, categorical, and temporal data, making it ideal for testing various models, including decision trees, gradient boosting machines like XGBoost, and other machine learning algorithms.

- Key features of the dataset include transaction amounts, user details, transaction types, and fraud labels, which can be used to develop fraud detection systems, analyze transaction patterns, and detect potential anomalies in financial transactions.

### Columns

- Transaction_ID: Unique identifier for each transaction.
- User_ID: Unique identifier for the user.
- Transaction_Amount: Amount of money involved in the transaction.
- Transaction_Type: Type of transaction (e.g., POS, Online, Bank Transfer, ATM Withdrawal).
- Timestamp: Date and time of the transaction.
- Account_Balance: User's account balance before the transaction.
- Device_Type: Type of device used for the transaction (e.g., Mobile, Laptop, Tablet).
- Location: Geographical location of the transaction.
- Merchant_Category: Type of merchant (e.g., Travel, Clothing, Restaurants, Electronics).
- IP_Address_Flag: Indicates if the IP address was flagged as suspicious (0 or 1).
- Previous_Fraudulent_Activity: Number of previous fraudulent activities by the user.
- Daily_Transaction_Count: Number of transactions made by the user that day.
- Avg_Transaction_Amount_7d: User's average transaction amount over the past 7 days.
- Failed_Transaction_Count_7d: Number of failed transactions in the last 7 days.
- Card_Type: Payment card network (e.g., Visa, Mastercard, Amex).
- Card_Age: Age of the card in months.
- Transaction_Distance: Distance between the user's usual location and transaction location.
- Authentication_Method: Method used by the user to authenticate (e.g., PIN, Biometric).
- Risk_Score: Fraud risk score calculated for the transaction.
- Is_Weekend: Indicates whether the transaction occurred on a weekend (0 or 1).
- Fraud_Label: Target variable: 0 = Not Fraud, 1 = Fraud.

### Importing Libraries and Modules

If a library is not installed yet, install it with *pip* first.
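
Before running the import cell, a quick check can show which packages still need to be installed. The following is a minimal sketch using only the standard library; the `required` list mirrors the imports used in this notebook and assumes each top-level module name matches its pip package name.

```python
import importlib.util

# Assumption: these top-level module names match the pip package names
required = ["numpy", "pandas", "matplotlib", "seaborn", "numerize"]

# find_spec returns None when a top-level module cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Install these with pip:", " ".join(missing))
else:
    print("All required packages are available.")
```

Any name printed in the first message can then be passed to *pip* before running the cells below.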

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# To convert large numbers into human-readable format
from numerize.numerize import numerize

# For suppressing warnings
from warnings import filterwarnings
filterwarnings('ignore')

In [2]:
# Setting the default size and dpi level of the plot
plt.rcParams["figure.figsize"] = (12, 5)
plt.rcParams["figure.dpi"] = 250

# Setting the default size of the plot title and axis labels
plt.rcParams["axes.titlesize"] = 15
plt.rcParams["axes.labelsize"] = 12

# Setting the default weight of the plot title and axis labels
plt.rcParams["axes.titleweight"] = 'bold'
plt.rcParams["axes.labelweight"] = 'bold'

# To show all columns
pd.set_option("display.max_columns", None)

### Reading the Data

In [10]:
df = pd.read_csv("C:/Users/[Username]/Documents/GitHub/Fraud-Detection-Transactions/dataset/synthetic_fraud_dataset.csv")
df.head()

| | Transaction_ID | User_ID | Transaction_Amount | Transaction_Type | Timestamp | Account_Balance | Device_Type | Location | Merchant_Category | IP_Address_Flag | Previous_Fraudulent_Activity | Daily_Transaction_Count | Avg_Transaction_Amount_7d | Failed_Transaction_Count_7d | Card_Type | Card_Age | Transaction_Distance | Authentication_Method | Risk_Score | Is_Weekend | Fraud_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN_33553 | USER_1834 | 39.79 | POS | 2023-08-14 19:30:00 | 93213.17 | Laptop | Sydney | Travel | 0 | 0 | 7 | 437.63 | 3 | Amex | 65 | 883.17 | Biometric | 0.8494 | 0 | 0 |
| 1 | TXN_9427 | USER_7875 | 1.19 | Bank Transfer | 2023-06-07 04:01:00 | 75725.25 | Mobile | New York | Clothing | 0 | 0 | 13 | 478.76 | 4 | Mastercard | 186 | 2203.36 | Password | 0.0959 | 0 | 1 |
| 2 | TXN_199 | USER_2734 | 28.96 | Online | 2023-06-20 15:25:00 | 1588.96 | Tablet | Mumbai | Restaurants | 0 | 0 | 14 | 50.01 | 4 | Visa | 226 | 1909.29 | Biometric | 0.84 | 0 | 1 |
| 3 | TXN_12447 | USER_2617 | 254.32 | ATM Withdrawal | 2023-12-07 00:31:00 | 76807.2 | Tablet | New York | Clothing | 0 | 0 | 8 | 182.48 | 4 | Visa | 76 | 1311.86 | OTP | 0.7935 | 0 | 1 |
| 4 | TXN_39489 | USER_2014 | 31.28 | POS | 2023-11-11 23:44:00 | 92354.66 | Mobile | Mumbai | Electronics | 0 | 1 | 14 | 328.69 | 4 | Mastercard | 140 | 966.98 | Password | 0.3819 | 1 | 1 |
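
The absolute Windows path in the read cell only works on one machine. As a hedged alternative, the sketch below builds the path with `pathlib` and fails early if the file is missing. The relative location `dataset/synthetic_fraud_dataset.csv` assumes the notebook is run from the repository root, and `load_transactions` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# Assumption: the working directory is the repository root
DATA_PATH = Path("dataset") / "synthetic_fraud_dataset.csv"


def load_transactions(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the transactions CSV, raising a clear error if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found at {path.resolve()}")
    return pd.read_csv(path)


# Usage (uncomment once the file is in place):
# df = load_transactions()
# df.head()
```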


#### Renaming Columns
Why rename the columns? The new snake_case names are consistent and more descriptive (for example, `Card_Age` becomes `card_age_months`), which makes the attributes easier to understand when analyzing the data and making predictions later on.

In [11]:
df.columns = [
    'transaction_id', 'user_id', 'transaction_amount', 'transaction_method', 'transaction_timestamp', 'account_balance', 
    'device_type', 'transaction_location', 'merchant_category', 'ip_address_flag', 'previous_fraudulent_activities', 
    'daily_transaction_count', 'avg_transaction_amount_7d', 'failed_transaction_count_7d', 'card_type', 'card_age_months', 
    'transaction_distance', 'authentication_method', 'fraud_risk_score', 'is_weekend', 'fraud_label'
]

# After renaming, we now describe the statistics of each numerical column
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| transaction_amount | 50000.0 | 99.411012 | 98.687292 | 0.0 | 28.6775 | 69.66 | 138.8525 | 1174.14 |
| account_balance | 50000.0 | 50294.065981 | 28760.458557 | 500.48 | 25355.995 | 50384.43 | 75115.135 | 99998.31 |
| ip_address_flag | 50000.0 | 0.0502 | 0.21836 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| previous_fraudulent_activities | 50000.0 | 0.0984 | 0.297858 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| daily_transaction_count | 50000.0 | 7.48524 | 4.039637 | 1.0 | 4.0 | 7.0 | 11.0 | 14.0 |
| avg_transaction_amount_7d | 50000.0 | 255.271924 | 141.382279 | 10.0 | 132.0875 | 256.085 | 378.0325 | 500.0 |
| failed_transaction_count_7d | 50000.0 | 2.00354 | 1.414273 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| card_age_months | 50000.0 | 119.99994 | 68.985817 | 1.0 | 60.0 | 120.0 | 180.0 | 239.0 |
| transaction_distance | 50000.0 | 2499.164155 | 1442.013834 | 0.25 | 1256.4975 | 2490.785 | 3746.395 | 4999.93 |
| fraud_risk_score | 50000.0 | 0.501556 | 0.287774 | 0.0001 | 0.254 | 0.50225 | 0.749525 | 1.0 |
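
Assigning a list to `df.columns` is positional, so it silently mislabels columns if the CSV column order ever changes. As a hedged alternative, the sketch below renames by an explicit old-to-new mapping with `DataFrame.rename`. Only a few of the 21 columns are shown, and `RENAME_MAP` and `toy` are illustrative names with one row copied from the `df.head()` output.

```python
import pandas as pd

# Partial mapping for illustration; the full notebook would list all 21 columns
RENAME_MAP = {
    "Transaction_ID": "transaction_id",
    "Transaction_Type": "transaction_method",
    "Timestamp": "transaction_timestamp",
    "Card_Age": "card_age_months",
    "Risk_Score": "fraud_risk_score",
}

# Toy frame with the columns deliberately out of their original order
toy = pd.DataFrame({
    "Timestamp": ["2023-08-14 19:30:00"],
    "Transaction_ID": ["TXN_33553"],
    "Risk_Score": [0.8494],
    "Card_Age": [65],
    "Transaction_Type": ["POS"],
})

# rename() ignores unknown keys by default, so check for them explicitly
missing = set(RENAME_MAP) - set(toy.columns)
if missing:
    raise KeyError(f"Columns not found in the data: {sorted(missing)}")

renamed = toy.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```

Because the mapping is keyed by the original names, each column receives the intended new name regardless of its position.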


In [13]:
# Summary statistics for each categorical (object) column
# The renamed columns appear here as well
df.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| transaction_id | 50000 | 50000 | TXN_5311 | 1 |
| user_id | 50000 | 8963 | USER_6599 | 16 |
| transaction_method | 50000 | 4 | POS | 12549 |
| transaction_timestamp | 50000 | 47724 | 2023-06-04 06:35:00 | 4 |
| device_type | 50000 | 3 | Tablet | 16779 |
| transaction_location | 50000 | 5 | Tokyo | 10208 |
| merchant_category | 50000 | 5 | Clothing | 10033 |
| card_type | 50000 | 4 | Mastercard | 12693 |
| authentication_method | 50000 | 4 | Biometric | 12591 |
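
The summary above lists `transaction_timestamp` among the object columns, which means the timestamps are stored as plain strings. A minimal sketch of converting them with `pd.to_datetime` follows, using three timestamps copied from the `df.head()` output. The derived column names `hour` and `derived_is_weekend` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Three rows copied from the df.head() output, with the dataset's own weekend flag
sample = pd.DataFrame({
    "transaction_timestamp": [
        "2023-08-14 19:30:00",
        "2023-06-07 04:01:00",
        "2023-11-11 23:44:00",
    ],
    "is_weekend": [0, 0, 1],
})

# Parse the strings into datetime64 values; the format matches the sample rows
sample["transaction_timestamp"] = pd.to_datetime(
    sample["transaction_timestamp"], format="%Y-%m-%d %H:%M:%S"
)

# Temporal features become available through the .dt accessor
sample["hour"] = sample["transaction_timestamp"].dt.hour

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
sample["derived_is_weekend"] = (
    sample["transaction_timestamp"].dt.dayofweek >= 5
).astype(int)

print(sample)
```

On these three rows the derived weekend flag agrees with the dataset's own `is_weekend` column. Whether that holds across the full dataset would need to be checked.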


In [14]:
# check datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   transaction_id                  50000 non-null  object 
 1   user_id                         50000 non-null  object 
 2   transaction_amount              50000 non-null  float64
 3   transaction_method              50000 non-null  object 
 4   transaction_timestamp           50000 non-null  object 
 5   account_balance                 50000 non-null  float64
 6   device_type                     50000 non-null  object 
 7   transaction_location            50000 non-null  object 
 8   merchant_category               50000 non-null  object 
 9   ip_address_flag                 50000 non-null  int64  
 10  previous_fraudulent_activities  50000 non-null  int64  
 11  daily_transaction_count         50000 non-null  int64  
 12  avg_transaction_amount_7d       

In [15]:
# check the number of null values in each column
df.isnull().sum()

transaction_id                    0
user_id                           0
transaction_amount                0
transaction_method                0
transaction_timestamp             0
account_balance                   0
device_type                       0
transaction_location              0
merchant_category                 0
ip_address_flag                   0
previous_fraudulent_activities    0
daily_transaction_count           0
avg_transaction_amount_7d         0
failed_transaction_count_7d       0
card_type                         0
card_age_months                   0
transaction_distance              0
authentication_method             0
fraud_risk_score                  0
is_weekend                        0
fraud_label                       0
dtype: int64

In [16]:
# check the number of duplicate rows
df.duplicated().sum()

np.int64(0)

As the outputs above show, the dataset contains no null values and no duplicate rows. This is expected for a synthetic dataset and means no imputation or deduplication is needed before the analysis.
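
The null and duplicate checks can be bundled into one small helper so they are easy to rerun after later transformations. This is a sketch, not part of the original notebook: `quality_report` is a hypothetical function name, and the extra check for repeated key values (for example `transaction_id`) is an added assumption about what counts as a duplicate.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame, key: str) -> dict:
    """Count null cells, fully duplicated rows, and repeated key values."""
    return {
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }


# Toy data with one missing amount and one repeated transaction ID
toy = pd.DataFrame({
    "transaction_id": ["TXN_1", "TXN_2", "TXN_2"],
    "transaction_amount": [10.0, None, 5.0],
})

report = quality_report(toy, key="transaction_id")
print(report)
```

On the full DataFrame, `quality_report(df, key="transaction_id")` should report zero null cells and zero duplicate rows, matching the outputs above.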