---
title: "Fraud Detection in Bank Transactions"
author: "Darwhin Gomez"
date: "`r Sys.Date()`"
output:
  pdf_document:
    toc: true
    toc_depth: 4

---


# Project Proposal: Fraud Detection in Bank Transactions

## 1. **Research Question**
- **Null Hypothesis (H₀):**  
  *"Anomaly detection techniques, specifically clustering and isolation-based models, do not significantly identify outliers in transaction data that correspond to fraudulent activities."*

- **Alternative Hypothesis (H₁):**  
  *"Anomaly detection techniques, specifically clustering and isolation-based models, can effectively identify outliers in transaction data that correspond to fraudulent activities."*


## 2. **Justification**
Fraud detection remains a critical challenge in the financial industry. Because fraudulent activity often shows up as abnormal transaction behavior, detecting outliers in large transaction datasets is key to identifying potential fraud. This project uses unsupervised learning techniques such as **clustering** and **anomaly detection** to flag suspicious transactions. Since the dataset contains no fraud labels, the task is framed as an **unsupervised anomaly detection** problem in which outliers are presumed to represent fraudulent transactions.



## 3. **Data Source**
https://www.kaggle.com/datasets/valakhorasani/bank-transaction-dataset-for-fraud-detection/data

The dataset, **bank_transactions_data_2.csv**, contains transaction records for multiple customer accounts and includes the following features:

- **TransactionID**: Unique alphanumeric identifier for each transaction.
- **AccountID**: Unique identifier for each account, with multiple transactions per account.
- **TransactionAmount**: Monetary value of each transaction.
- **TransactionDate**: Timestamp of each transaction.
- **TransactionType**: Categorical field indicating 'Credit' or 'Debit' transactions.
- **Location**: Geographic location of the transaction, represented by U.S. city names.
- **DeviceID**: Alphanumeric identifier for devices used to perform the transaction.
- **IP Address**: IPv4 address associated with the transaction.
- **MerchantID**: Unique identifier for merchants.
- **AccountBalance**: Balance in the account post-transaction.
- **PreviousTransactionDate**: Timestamp of the last transaction for the account.
- **Channel**: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
- **CustomerAge**: Age of the account holder.
- **CustomerOccupation**: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired).
- **TransactionDuration**: Duration of the transaction in seconds.
- **LoginAttempts**: Number of login attempts before the transaction, with higher values indicating potential anomalies.



## 4. **Tools and Libraries**
- **Pandas**: For data manipulation and cleaning.
- **NumPy**: For numerical operations.
- **Matplotlib** and **Seaborn**: For data visualization (distributions, scatter plots, box plots, etc.).
- **Scikit-Learn**:
  - **Isolation Forest**: For unsupervised anomaly detection.
  - **DBSCAN**: For density-based clustering and outlier detection.
  - **KMeans**: For clustering and identifying potential fraud patterns.
  - **StandardScaler**: For feature scaling and normalization.
- **Datetime**: For handling and extracting features from the `TransactionDate` and `PreviousTransactionDate` columns.





## 5. **Proposed Methodology**

### **Data Preprocessing**
- **Handle missing values**: Check for missing values and, if any are found, apply appropriate imputation or removal.
- **Categorical feature encoding**: Convert categorical variables (e.g., `TransactionType`, `Location`, `Channel`) into numerical values via **One-Hot Encoding** or **Label Encoding**. One-hot encoding is generally the safer choice for nominal features, since label encoding imposes an arbitrary ordering that distance-based models would treat as meaningful.
- **Datetime feature extraction**: Convert `TransactionDate` and `PreviousTransactionDate` into `datetime` format, and extract useful features such as the transaction hour, day, and time difference between consecutive transactions.
- **Standardize numerical features**: Standardize continuous variables such as `TransactionAmount`, `AccountBalance`, and `TransactionDuration` so that features on larger scales do not dominate the distance calculations used by DBSCAN and KMeans.
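
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration on a made-up four-row frame that borrows the dataset's column names; the helper name `preprocess` and the derived columns `TransactionHour`, `TransactionDayOfWeek`, and `HoursSincePrevious` are assumptions for this example, not part of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame using the dataset's column names (all values are made up).
toy = pd.DataFrame({
    "TransactionAmount": [12.5, 250.0, 40.0, 1900.0],
    "AccountBalance": [5000.0, 1200.0, 300.0, 8000.0],
    "TransactionDuration": [30.0, 120.0, 45.0, 290.0],
    "TransactionType": ["Debit", "Credit", "Debit", "Debit"],
    "Channel": ["ATM", "Online", "Branch", "Online"],
    "TransactionDate": ["2023-04-11 16:29:14", "2023-06-27 16:44:19",
                        "2023-07-10 18:16:08", "2023-05-05 16:32:11"],
    "PreviousTransactionDate": ["2023-04-10 08:08:08", "2023-06-27 16:40:19",
                                "2023-07-01 18:16:08", "2023-05-05 10:32:11"],
})


def preprocess(df):
    """Hypothetical helper: datetime features, one-hot encoding, scaling."""
    out = df.copy()

    # Parse the timestamp strings and derive time-based features.
    out["TransactionDate"] = pd.to_datetime(out["TransactionDate"])
    out["PreviousTransactionDate"] = pd.to_datetime(out["PreviousTransactionDate"])
    out["TransactionHour"] = out["TransactionDate"].dt.hour
    out["TransactionDayOfWeek"] = out["TransactionDate"].dt.dayofweek
    gap = out["TransactionDate"] - out["PreviousTransactionDate"]
    out["HoursSincePrevious"] = gap.dt.total_seconds() / 3600.0
    out = out.drop(columns=["TransactionDate", "PreviousTransactionDate"])

    # One-hot encode the nominal categorical columns.
    out = pd.get_dummies(out, columns=["TransactionType", "Channel"], dtype=float)

    # Standardize the continuous columns to zero mean and unit variance.
    numeric_cols = ["TransactionAmount", "AccountBalance",
                    "TransactionDuration", "HoursSincePrevious"]
    out[numeric_cols] = StandardScaler().fit_transform(out[numeric_cols])
    return out


features = preprocess(toy)
print(features.head())
```

On the real data the same function would be applied to the loaded frame after dropping identifier columns; whether high-cardinality fields such as `Location` should be one-hot encoded is a separate decision left open here.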

### **Exploratory Data Analysis (EDA)**
- **Univariate analysis**: Visualize distributions for features like `TransactionAmount`, `CustomerAge`, `AccountBalance`, and `TransactionType`.
- **Bivariate analysis**: Investigate relationships between transaction features and account-level variables, such as how `TransactionAmount` varies with `CustomerAge`, `TransactionType`, or `AccountBalance`.
- **Proxy indicators**: The dataset lacks a fraud flag, so there is no class distribution to inspect; instead, examining the distributions of features such as `TransactionAmount` and `LoginAttempts` can provide insight into potential fraud patterns.
- **Outlier detection**: Visualize features such as `TransactionAmount` and `TransactionDuration` to identify outliers that might represent suspicious activity.
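
As a numeric companion to the visual outlier check, one common option is the 1.5 × IQR rule, which is also the default whisker rule in box plots. The sketch below applies it to a made-up series; the function name `iqr_outlier_mask` and the sample values are assumptions for illustration only.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up transaction amounts with one clearly extreme value.
amounts = pd.Series(
    [20.0, 35.0, 40.0, 55.0, 60.0, 75.0, 80.0, 1500.0],
    name="TransactionAmount",
)
mask = iqr_outlier_mask(amounts)
print(amounts[mask])  # only the 1500.0 entry falls outside the fences
```

This rule looks at one feature at a time, so it complements rather than replaces the multivariate models proposed in the next section.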

### **Modeling**
- **Isolation Forest**: Use the **Isolation Forest** algorithm, an unsupervised anomaly detection method, to identify transactions that differ significantly from the majority of data points. These outliers are treated as candidate fraudulent transactions.
- **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**: Cluster the data using DBSCAN, identifying dense regions of transactions. Points not assigned to any cluster are treated as "noise" (potential fraud).
- **KMeans Clustering**: Apply **KMeans** to segment transactions into groups based on similar behavior. Because KMeans assigns every point to a cluster, flag the transactions that lie unusually far from their assigned cluster centroid.
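
The three approaches above might be wired up along these lines. This is a sketch on synthetic two-feature data, not the project's final pipeline: the values `contamination=0.02`, `eps=0.5`, `min_samples=5`, `n_clusters=2`, and the 98th-percentile distance cutoff are placeholder assumptions that would need tuning on the real features.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 "ordinary" points (amount, duration) plus 3 extreme ones.
rng = np.random.default_rng(42)
normal = rng.normal(loc=[100.0, 60.0], scale=[20.0, 10.0], size=(200, 2))
extreme = np.array([[2000.0, 290.0], [1800.0, 5.0], [5.0, 300.0]])
X = StandardScaler().fit_transform(np.vstack([normal, extreme]))

# Isolation Forest: fit_predict returns -1 for anomalies and 1 for inliers.
iso = IsolationForest(contamination=0.02, random_state=42)
iso_flags = iso.fit_predict(X) == -1

# DBSCAN: points that belong to no dense region receive the label -1 (noise).
db_flags = DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1

# KMeans: every point gets a cluster, so flag the largest centroid distances.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
km_flags = dist > np.quantile(dist, 0.98)

print("Isolation Forest flags:", int(iso_flags.sum()))
print("DBSCAN noise points:   ", int(db_flags.sum()))
print("KMeans distance flags: ", int(km_flags.sum()))
```

One caveat with the KMeans variant: an extreme point can end up as the sole member of its own cluster, in which case its centroid distance is zero and it goes unflagged. Comparing the three sets of flags helps catch this.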

### **Evaluation**
- **Fraud flagging**: Based on the output of the models, transactions that are flagged as outliers or noise will be considered as potential fraud.
- **Visual inspection**: Visualize flagged anomalies on scatter plots and boxplots to check whether they align with typical fraud characteristics (e.g., large transaction amounts or multiple login attempts).
- **Transaction patterns**: Assess the patterns of flagged transactions, such as unusual amounts or rapid transaction frequency, to evaluate the model's effectiveness in identifying suspicious behavior.
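
Without fraud labels, one way to make this evaluation concrete is to count how many models flag each transaction and compare flagged and unflagged groups on a few features. The sketch below uses a made-up results table; the column names `iso_flag`, `dbscan_flag`, `kmeans_flag`, `votes`, and `consensus`, and the two-vote threshold, are assumptions for illustration.

```python
import pandas as pd

# Made-up model outputs for six transactions (True = flagged by that model).
results = pd.DataFrame({
    "TransactionAmount": [25.0, 40.0, 1800.0, 60.0, 1500.0, 35.0],
    "LoginAttempts": [1, 1, 4, 1, 1, 5],
    "iso_flag": [False, False, True, False, True, False],
    "dbscan_flag": [False, False, True, False, True, True],
    "kmeans_flag": [False, False, True, False, False, False],
})

# Count how many of the three models flagged each transaction.
flag_cols = ["iso_flag", "dbscan_flag", "kmeans_flag"]
results["votes"] = results[flag_cols].sum(axis=1)

# Treat a transaction as a consensus anomaly when at least two models agree.
results["consensus"] = results["votes"].ge(2).map(
    {True: "flagged", False: "unflagged"}
)

# Compare typical feature values between the two groups.
summary = results.groupby("consensus")[["TransactionAmount", "LoginAttempts"]].median()
print(results[["votes", "consensus"]])
print(summary)
```

A large gap between the group medians suggests the flags track the expected fraud characteristics, although it cannot confirm that the flagged transactions are actually fraudulent.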



## 6. **Exploratory Data Analysis**

In [None]:
# Data manipulation and analysis
import pandas as pd  # For data manipulation and cleaning (loading data, filtering, etc.)
import numpy as np   # For numerical operations (e.g., mathematical calculations, arrays)

# Data visualization
import matplotlib.pyplot as plt  # For basic plotting (e.g., histograms, scatter plots)
import seaborn as sns          # For advanced statistical plots and better aesthetics

# Machine learning models
from sklearn.ensemble import IsolationForest  # For anomaly detection (Isolation Forest)
from sklearn.cluster import DBSCAN           # For density-based clustering (DBSCAN)
from sklearn.cluster import KMeans           # For KMeans clustering
from sklearn.preprocessing import StandardScaler  # For standardizing features (important for clustering and anomaly detection)

# For working with dates and times
import datetime  # For handling and manipulating datetime objects (e.g., converting transaction date columns)

# For handling warnings (optional)
import warnings
warnings.filterwarnings('ignore')  # To ignore warnings during model training (can be helpful for large datasets)

# Load the dataset with a meaningful variable name
bank_trans_data = pd.read_csv("bank_transactions_data_2.csv")

# Display the first few rows of the dataset to understand its structure
print("Dataset Preview:")
print(bank_trans_data.head())


# Check the basic info to understand data types and identify any missing values
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("\nDataset Information:")
bank_trans_data.info()


# Display summary statistics to get an overview of the data distributions
print("\nSummary Statistics:")
print(bank_trans_data.describe())



# Check for any missing values in the dataset
missing_values = bank_trans_data.isnull().sum()
print("\nMissing Values in Each Column:")
print(missing_values[missing_values > 0])


# Count the unique values in each column
unique_counts = bank_trans_data.nunique()
print("\nUnique Counts for Each Column:")
print(unique_counts)

# Check the data types of all columns
print("\nData Types:")
print(bank_trans_data.dtypes)

# Define columns to exclude from categorical analysis
excluded_cols = ['IP Address','TransactionID', 'AccountID', 'MerchantID', 'TransactionDate', 'DeviceID', 'PreviousTransactionDate']

# Select categorical columns, excluding the identifier-like columns
categorical_cols = bank_trans_data.select_dtypes(include=['object']).columns.difference(excluded_cols)

# Display value counts for each remaining categorical column
for col in categorical_cols:
    print(f"\nValue Counts for {col}:")
    print(bank_trans_data[col].value_counts())



# Initial Visualization of Transaction Amount Distribution
sns.histplot(bank_trans_data['TransactionAmount'], kde=True)
plt.title('Transaction Amount Distribution')
plt.show()


# Boxplot to detect outliers in Transaction Amount
plt.figure(figsize=(10, 6))
sns.boxplot(x=bank_trans_data['TransactionAmount'], color='orange')
plt.title('Boxplot of Transaction Amount')
plt.xlabel('Transaction Amount')
plt.show()


# Histogram of Account Balance
plt.figure(figsize=(10, 6))
sns.histplot(bank_trans_data['AccountBalance'], kde=True, color='red', bins=50)
plt.title('Account Balance Distribution')
plt.xlabel('Account Balance')
plt.ylabel('Frequency')
plt.show()


# Boxplot for Account Balance
plt.figure(figsize=(10, 6))
sns.boxplot(x=bank_trans_data['AccountBalance'], color='cyan')
plt.title('Boxplot of Account Balance')
plt.xlabel('Account Balance')
plt.show()


# Histogram of Login Attempts
plt.figure(figsize=(10, 6))
sns.histplot(bank_trans_data['LoginAttempts'], kde=True, color='brown', bins=30)
plt.title('Login Attempts Distribution')
plt.xlabel('Login Attempts')
plt.ylabel('Frequency')
plt.show()

# Boxplot for Login Attempts
plt.figure(figsize=(10, 6))
sns.boxplot(x=bank_trans_data['LoginAttempts'], color='pink')
plt.title('Boxplot of Login Attempts')
plt.xlabel('Login Attempts')
plt.show()


# Histogram of Transaction Duration
plt.figure(figsize=(10, 6))
sns.histplot(bank_trans_data['TransactionDuration'], kde=True, color='teal', bins=30)
plt.title('Transaction Duration Distribution')
plt.xlabel('Transaction Duration (seconds)')
plt.ylabel('Frequency')
plt.show()


# Boxplot for Transaction Duration
plt.figure(figsize=(10, 6))
sns.boxplot(x=bank_trans_data['TransactionDuration'], color='grey')
plt.title('Boxplot of Transaction Duration')
plt.xlabel('Transaction Duration (seconds)')
plt.show()


# Histogram of Customer Age
plt.figure(figsize=(10, 6))
sns.histplot(bank_trans_data['CustomerAge'], kde=True, color='green', bins=30)
plt.title('Customer Age Distribution')
plt.xlabel('Customer Age')
plt.ylabel('Frequency')
plt.show()


# Boxplot for Customer Age
plt.figure(figsize=(10, 6))
sns.boxplot(x=bank_trans_data['CustomerAge'], color='purple')
plt.title('Boxplot of Customer Age')
plt.xlabel('Customer Age')
plt.show()