# Financial Transactions Fraud Detection: Exploratory Data Analysis

## 1. Introduction

This notebook explores the Financial Transactions Dataset to identify patterns in fraudulent transactions. The dataset includes transaction details, fraud labels, and anomaly scores, which makes it well suited to fraud detection modeling.

## 2. Data Loading and Inspection

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats

# Scikit-learn imports
from sklearn.experimental import enable_halving_search_cv  # Required for HalvingRandomSearchCV
from sklearn.model_selection import train_test_split, HalvingRandomSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, roc_auc_score, 
                            average_precision_score, confusion_matrix)
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Load the dataset
data = pd.read_csv('/kaggle/input/financial-transactions-dataset-for-fraud-detection/financial_fraud_detection_dataset.csv')

# Display the first few rows
data.head()

### Dataset Overview

Based on data.head(), the dataset has the following columns:

1. transaction_id: Unique identifier for each transaction (string).
2. timestamp: Transaction datetime (e.g., “2023-08-22T09:22:43.516168”).
3. sender_account: Sender’s account ID (string).
4. receiver_account: Receiver’s account ID (string).
5. amount: Transaction amount (float).
6. transaction_type: Type of transaction (e.g., “withdrawal”, “deposit”, “transfer”).
7. merchant_category: Merchant category (e.g., “utilities”, “online”, “other”).
8. location: Transaction location (e.g., “Tokyo”, “Toronto”).
9. device_used: Device used (e.g., “mobile”, “atm”, “pos”).
10. is_fraud: Fraud label (boolean: False/True).
11. fraud_type: Type of fraud (string, NaN for non-fraud cases).
12. time_since_last_transaction: Time since the last transaction (float, likely in seconds or normalized).
13. spending_deviation_score: Score indicating deviation from typical spending (float).
14. velocity_score: Transaction velocity score (integer).
15. geo_anomaly_score: Score for geographic anomalies (float).
16. payment_channel: Payment method (e.g., “card”, “ACH”, “wire_transfer”).
17. ip_address: IP address of the transaction (string).
18. device_hash: Unique device identifier (string).

### Key Observations

1. Target Variable: is_fraud (boolean) is the label for fraud detection.
2. Numerical Features: amount, time_since_last_transaction, spending_deviation_score, velocity_score, geo_anomaly_score.
3. Categorical Features: transaction_type, merchant_category, location, device_used, payment_channel.
4. Temporal Feature: timestamp (needs conversion to datetime).
5. Potential Features for EDA: fraud_type (for fraud cases), ip_address, device_hash (for patterns, though likely high cardinality).
6. Missing Values: fraud_type is NaN for non-fraud cases, which is expected. We’ll check others with data.isnull().sum().
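
The high-cardinality remark in item 5 can be checked before deciding how to treat ip_address and device_hash. The sketch below is only illustrative: cardinality_report is a hypothetical helper that is not part of this notebook, and the toy frame (including the fraud_type value) is made up rather than taken from the dataset.

```python
import pandas as pd


def cardinality_report(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Summarize unique counts and missing share for the given columns."""
    rows = []
    for col in columns:
        n_unique = df[col].nunique(dropna=True)
        rows.append({
            "column": col,
            "n_unique": n_unique,
            # Ratio near 1.0 means almost every row has its own value
            "unique_ratio": n_unique / len(df),
            "missing_pct": df[col].isnull().mean() * 100,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy frame standing in for the real data (all values are invented)
toy = pd.DataFrame({
    "device_used": ["mobile", "atm", "mobile", "pos"],
    "ip_address": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    "fraud_type": [None, "example_fraud_type", None, None],
})

report = cardinality_report(toy, ["device_used", "ip_address", "fraud_type"])
print(report)
```

On the real data, the same call would be cardinality_report(data, ['ip_address', 'device_hash', 'location']). A unique_ratio close to 1 suggests the column behaves like an identifier and is a poor candidate for one-hot encoding.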

In [None]:
# Display basic information (data.info() prints directly and returns None,
# so wrapping it in print() would add a stray "None" line)
print("\nDataset Info:")
data.info()

In [None]:
# Check dataset shape
print("\nDataset Shape:", data.shape)

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Summary statistics for numerical columns
print("\nSummary Statistics:")
data.describe()

### Summary Statistics (numerical columns):
1. amount: Ranges from 0.01 to 3,520.57, mean ~358.93, right-skewed (std ~469.93 exceeds the mean). Fraud transactions may involve extreme amounts.
2. time_since_last_transaction: Ranges from -8,777.81 to 8,757.76, mean ~1.53. Negative values are unusual (possibly data errors or reversed time differences). Missing values need handling.
3. spending_deviation_score: Mean ~0, std ~1, range -5.26 to 5.02. Likely a normalized score, useful for fraud detection.
4. velocity_score: Integer, range 1 to 20, mean ~10.5. Higher values may indicate rapid transactions, a fraud signal.
5. geo_anomaly_score: Range 0 to 1, mean ~0.5. Likely a probability-like score, critical for fraud analysis.
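
The observations about negative and missing values in time_since_last_transaction can be quantified rather than read off describe(). This is a minimal sketch under stated assumptions: describe_signed_feature is a hypothetical helper, and the toy series is invented for illustration.

```python
import numpy as np
import pandas as pd


def describe_signed_feature(values: pd.Series) -> dict:
    """Report skewness plus the share of negative and missing entries."""
    return {
        "skew": values.skew(),                           # sample skewness, NaNs ignored
        "negative_pct": (values < 0).mean() * 100,       # NaN < 0 evaluates to False
        "missing_pct": values.isnull().mean() * 100,
    }


# Toy series standing in for data['time_since_last_transaction'] (invented values)
toy_gap = pd.Series([-120.0, 35.5, np.nan, 4000.0, -8.2, 15.0, np.nan, 60.0])
summary = describe_signed_feature(toy_gap)
print(summary)
```

Applied to the real column, a large negative_pct would point to a systematic cause such as reversed time differences rather than a handful of data-entry errors.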

In [None]:
# Investigate missing time_since_last_transaction
print("\nMissing time_since_last_transaction Analysis:")
print("Fraud rate in rows with missing time_since_last_transaction:")

missing_time_fraud = data[data['time_since_last_transaction'].isnull()]['is_fraud'].mean() * 100
non_missing_time_fraud = data[data['time_since_last_transaction'].notnull()]['is_fraud'].mean() * 100

In [None]:
print(f"Missing: {missing_time_fraud:.2f}%")
print(f"Non-Missing: {non_missing_time_fraud:.2f}%")

In [None]:
# Convert timestamp to datetime
data['timestamp'] = pd.to_datetime(data['timestamp'], format='ISO8601')

# Extract hour
data['hour'] = data['timestamp'].dt.hour
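
The dtype of the extracted hour column matters later if numeric columns are selected by dtype. In pandas 2.x the .dt accessors are expected to return int32 rather than int64, so a filter that lists only int64 and float64 may skip such columns. The sketch below uses an invented two-row frame to show the difference; it is an illustration, not part of the original pipeline.

```python
import pandas as pd

# Invented two-row frame mirroring a few dataset columns
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-08-22T09:22:43", "2023-08-23T23:10:00"]),
    "amount": [12.5, 340.0],
    "device_used": ["mobile", "atm"],
})
toy["hour"] = toy["timestamp"].dt.hour

# Inspect the dtype actually produced by the installed pandas version
print(toy.dtypes)

# Strict list: may miss 'hour' when it is int32
strict = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()
# Generic selector: covers every integer and float width
broad = toy.select_dtypes(include="number").columns.tolist()

print("strict:", strict)
print("broad: ", broad)
```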

## 3. Exploratory Data Analysis

In [None]:
def plot_amount_histogram():
    plt.figure(figsize=(12, 6))
    sns.histplot(data=data[data['is_fraud'] == False], x='amount', kde=True, 
                 label='Non-Fraud', stat='density', bins=50, log_scale=True, alpha=0.5)
    sns.histplot(data=data[data['is_fraud'] == True], x='amount', kde=True, 
                 label='Fraud', stat='density', bins=50, log_scale=True, alpha=0.5)
    sns.rugplot(data=data[data['is_fraud'] == True], x='amount', color='red', alpha=0.1)
    
    plt.title('Distribution of Transaction Amounts (Log Scale)', fontsize=14, pad=10)
    plt.xlabel('Transaction Amount (USD, Log Scale)', fontsize=12)
    plt.ylabel('Density', fontsize=12)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_amount_boxplot():
    plt.figure(figsize=(10, 6))
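    # Fix the category order so the fraud-only strip plot is drawn over the Fraud box;
    # without `order`, its single category would be placed at position 0 (Non-Fraud).
    order = [False, True]
    sns.boxplot(x='is_fraud', y='amount', data=data, order=order, showfliers=False)
    sns.stripplot(x='is_fraud', y='amount', data=data[data['is_fraud'] == True],
                  order=order, color='red', alpha=0.2, size=3, jitter=True)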
    
    plt.yscale('log')
    plt.title('Transaction Amounts by Fraud Status (Log Scale)', fontsize=14, pad=10)
    plt.xlabel('Fraud Status', fontsize=12)
    plt.ylabel('Transaction Amount (USD, Log Scale)', fontsize=12)
    plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_hourly_fraud_rate():
    hourly_fraud = data.groupby('hour')['is_fraud'].mean() * 100
    hourly_volume = data.groupby('hour').size()
    
    # Calculate confidence intervals (binomial)
    ci = []
    for hour in hourly_fraud.index:
        n = hourly_volume[hour]
        p = hourly_fraud[hour] / 100
        se = np.sqrt(p * (1 - p) / n)
        ci.append(stats.norm.interval(0.95, loc=p * 100, scale=se * 100))
    ci = np.array(ci)
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=hourly_fraud.index, y=hourly_fraud.values, 
                             mode='lines+markers', name='Fraud Rate',
                             line=dict(color='royalblue')))
    fig.add_trace(go.Scatter(x=hourly_fraud.index, y=ci[:, 0], 
                             mode='lines', name='95% CI Lower', 
                             line=dict(color='lightblue', dash='dash'), showlegend=False))
    fig.add_trace(go.Scatter(x=hourly_fraud.index, y=ci[:, 1], 
                             mode='lines', name='95% CI Upper', 
                             line=dict(color='lightblue', dash='dash'), fill='tonexty'))
    
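    # Annotate the observed peak and dip instead of hard-coding their positions
    # (the original run reported a 3.66% peak at hour 12 and a 3.51% dip at hour 20)
    peak_hour, dip_hour = hourly_fraud.idxmax(), hourly_fraud.idxmin()
    fig.add_annotation(x=peak_hour, y=hourly_fraud[peak_hour],
                       text=f'Peak ({hourly_fraud[peak_hour]:.2f}%)', showarrow=True, arrowhead=2)
    fig.add_annotation(x=dip_hour, y=hourly_fraud[dip_hour],
                       text=f'Dip ({hourly_fraud[dip_hour]:.2f}%)', showarrow=True, arrowhead=2)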
    
    # Add secondary y-axis for volume
    fig.add_trace(go.Bar(x=hourly_volume.index, y=hourly_volume.values, 
                        name='Transaction Volume', opacity=0.3, yaxis='y2'))
    
    fig.update_layout(
        title='Hourly Fraud Rate with Transaction Volume',
        xaxis_title='Hour of Day',
        yaxis_title='Fraud Rate (%)',
        yaxis2=dict(title='Transaction Volume', overlaying='y', side='right'),
        legend=dict(x=0.01, y=0.99),
        template='plotly_white'
    )
    fig.show()
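
The confidence band in plot_hourly_fraud_rate relies on the normal (Wald) approximation, which can be unreliable when the fraud rate is close to 0 or the hourly count is small. A Wilson score interval is a common alternative. The sketch below is a hand-written version under that assumption; wilson_interval is a hypothetical helper and is not used elsewhere in this notebook.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clip to [0, 1] to guard against tiny floating-point overshoot
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: 36 fraud cases out of 1,000 transactions in one hour (invented counts)
low, high = wilson_interval(36, 1000)
print(f"{low * 100:.2f}% to {high * 100:.2f}%")

# Unlike the Wald interval, zero observed frauds still gives a non-degenerate interval
zero_low, zero_high = wilson_interval(0, 100)
```

Inside the plotting function, the per-hour counts would be data.groupby('hour')['is_fraud'].sum() and the volumes hourly_volume, with the results multiplied by 100 to match the percentage axis.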

In [None]:
def plot_categorical_fraud():
    fraud_by_type = data.groupby(['transaction_type', 'is_fraud']).size().unstack().fillna(0)
    fraud_by_type = fraud_by_type.div(fraud_by_type.sum(axis=1), axis=0) * 100
    
    # DataFrame.plot creates its own figure, so a separate plt.figure() call
    # would only leave an empty figure behind
    fraud_by_type.plot(kind='bar', stacked=True, figsize=(12, 6))
    plt.title('Fraud Proportion by Transaction Type', fontsize=14, pad=10)
    plt.xlabel('Transaction Type', fontsize=12)
    plt.ylabel('Proportion (%)', fontsize=12)
    plt.legend(['Non-Fraud', 'Fraud'], title='Fraud Status')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_location_device_heatmap():
    heatmap_data = data.pivot_table(values='is_fraud', index='location', 
                                   columns='device_used', aggfunc='mean') * 100
    plt.figure(figsize=(10, 8))
    sns.heatmap(heatmap_data, annot=True, fmt='.2f', cmap='RdBu_r', 
                cbar_kws={'label': 'Fraud Rate (%)'})
    plt.title('Fraud Rate by Location and Device Used', fontsize=14, pad=10)
    plt.xlabel('Device Used', fontsize=12)
    plt.ylabel('Location', fontsize=12)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_correlation_heatmap():
    numerical_cols = ['amount', 'spending_deviation_score', 'velocity_score', 
                     'geo_anomaly_score', 'time_since_last_transaction']
    corr_matrix = data[numerical_cols].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
                vmin=-1, vmax=1, center=0)
    plt.title('Correlation Matrix of Numerical Features', fontsize=14, pad=10)
    plt.tight_layout()
    plt.show()

In [None]:
def plot_fraud_volume_timeseries():
    data['date'] = data['timestamp'].dt.date
    fraud_volume = data[data['is_fraud'] == True].groupby('date').size()
    fraud_volume = fraud_volume.rolling(window=7, min_periods=1).mean()
    
    plt.figure(figsize=(12, 6))
    fraud_volume.plot()
    plt.title('7-Day Rolling Average of Fraudulent Transactions', fontsize=14, pad=10)
    plt.xlabel('Date', fontsize=12)
    plt.ylabel('Number of Fraudulent Transactions', fontsize=12)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
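
One caveat for the rolling average above: rolling(window=7) counts rows, so if some calendar days have no fraud at all, those days are absent from the grouped series and the window spans more than seven days. A possible remedy, sketched here with invented timestamps, is to reindex onto a complete daily range first. daily_fraud_rolling is a hypothetical helper and assumes at least one fraud row exists.

```python
import pandas as pd


def daily_fraud_rolling(timestamps: pd.Series, is_fraud: pd.Series, window: int = 7) -> pd.Series:
    """Rolling mean of daily fraud counts, treating days without fraud as zero."""
    fraud_times = pd.to_datetime(timestamps[is_fraud.astype(bool)])
    daily = fraud_times.dt.floor("D").value_counts().sort_index()
    # Fill calendar gaps so each window covers exactly `window` days
    full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
    daily = daily.reindex(full_range, fill_value=0)
    return daily.rolling(window=window, min_periods=1).mean()


# Invented example: frauds on Jan 1 (twice) and Jan 4, none on Jan 2 and Jan 3
toy_ts = pd.Series(["2023-01-01 08:00", "2023-01-01 21:30", "2023-01-02 10:00", "2023-01-04 12:00"])
toy_fraud = pd.Series([True, True, False, True])

rolled = daily_fraud_rolling(toy_ts, toy_fraud, window=2)
print(rolled)
```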

In [None]:
plot_amount_histogram()

In [None]:
plot_amount_boxplot()

In [None]:
plot_hourly_fraud_rate()

In [None]:
plot_categorical_fraud()

In [None]:
plot_location_device_heatmap()

In [None]:
plot_correlation_heatmap()

In [None]:
plot_fraud_volume_timeseries()

## 4. Feature Engineering and Modeling

In [None]:
# Modified Feature Engineering and Pipeline
def create_features(data):
    """Enhanced feature engineering with categorical handling"""
    # Time features
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data['hour'] = data['timestamp'].dt.hour
    data['day_of_week'] = data['timestamp'].dt.dayofweek
    data['day_of_month'] = data['timestamp'].dt.day
    data['month'] = data['timestamp'].dt.month
    data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)
    data['is_night'] = ((data['hour'] >= 22) | (data['hour'] <= 6)).astype(int)
    
    # Transaction features
    sender_freq = data['sender_account'].value_counts().to_dict()
    data['sender_transaction_count'] = data['sender_account'].map(sender_freq)
    data['amount_log'] = np.log1p(data['amount'])
    data['amount_to_avg'] = data['amount'] / data.groupby('sender_account')['amount'].transform('mean')
    
    # Anomaly scores
    data['combined_anomaly_score'] = data['spending_deviation_score'] * data['geo_anomaly_score']
    data['velocity_geo_score'] = data['velocity_score'] * data['geo_anomaly_score']
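    # Assign the result back: fillna(inplace=True) on a column selection is
    # deprecated in recent pandas and may not modify `data`.
    # Note: -1 is not a unique sentinel here, because the column also holds negative values.
    data['time_since_last_transaction'] = data['time_since_last_transaction'].fillna(-1)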
    data['time_since_last_transaction_log'] = np.log1p(data['time_since_last_transaction'].clip(0))
    
    # Identify categorical columns (excluding those we'll drop)
    cat_cols = ['transaction_type', 'merchant_category', 'location', 
               'device_used', 'payment_channel']
    
    # Columns to drop
    drop_cols = ['transaction_id', 'timestamp', 'sender_account', 
                'receiver_account', 'fraud_type', 'ip_address', 
                'device_hash', 'date']
    
    # Separate features before encoding
    X = data.drop(columns=['is_fraud'] + drop_cols, errors='ignore')
    y = data['is_fraud']
    
    return X, y, cat_cols

# Get features and target
X, y, categorical_cols = create_features(data)

# Preprocessing Pipeline ===================================================
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

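# Identify numeric columns. include='number' also captures the int32 columns that the
# .dt accessors (hour, dayofweek, day, month) return in pandas 2.x; listing only
# int64/float64 would let the ColumnTransformer silently drop them.
numeric_cols = X.select_dtypes(include='number').columns.tolist()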

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# 2. Data Resampling ======================================================
def balance_data(X, y):
    """Modified resampling that preserves column structure"""
    X_fraud = X[y == 1]
    X_nonfraud = X[y == 0]
    
    n_samples = int(len(X_nonfraud) * 0.1)
    X_fraud_upsampled = resample(X_fraud, n_samples=n_samples, random_state=42)
    X_nonfraud_downsampled = resample(X_nonfraud, 
                                    n_samples=int(len(X_nonfraud)*0.5), 
                                    random_state=42)
    
    X_resampled = pd.concat([X_nonfraud_downsampled, X_fraud_upsampled])
    y_resampled = pd.Series([0]*len(X_nonfraud_downsampled) + [1]*len(X_fraud_upsampled))
    
    # Shuffle with a seeded generator so the resampled order is reproducible
    idx = np.random.default_rng(42).permutation(len(X_resampled))
    return X_resampled.iloc[idx], y_resampled.iloc[idx]

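# 3. Model Training =======================================================
# Split first, then resample only the training portion. Resampling before the
# split would place copies of the same upsampled fraud rows in both train and
# test, which inflates the test metrics; the test set should also keep the
# original class distribution.
X_train_raw, X_test, y_train_raw, y_test = train_test_split(
    X, y.astype(int), test_size=0.3, stratify=y, random_state=42)

X_train, y_train = balance_data(X_train_raw, y_train_raw)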

# Create full pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(
        warm_start=True,
        random_state=42,
        class_weight='balanced',
        n_jobs=1
    ))
])

# Hyperparameter tuning
param_distributions = {
    'classifier__n_estimators': [80, 100, 120],
    'classifier__max_depth': [10, 15, None],
    'classifier__min_samples_split': [2, 3]
}

search = HalvingRandomSearchCV(
    model_pipeline,
    param_distributions,
    factor=2,
    cv=3,
    scoring='average_precision',
    n_jobs=1,
    verbose=2,
    random_state=42,
    aggressive_elimination=True
).fit(X_train, y_train)

# 4. Evaluation ==========================================================
best_model = search.best_estimator_

# Evaluate on test set
print("\n=== Test Set Evaluation ===")
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print("Best Parameters:", search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("Average Precision:", average_precision_score(y_test, y_prob))

# Feature Importance (requires getting feature names from preprocessor)
onehot_columns = best_model.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_cols)
all_features = numeric_cols + list(onehot_columns)

importances = best_model.named_steps['classifier'].feature_importances_
feat_imp = pd.DataFrame({'Feature': all_features, 'Importance': importances}).sort_values('Importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feat_imp.head(20))
plt.title("Top 20 Feature Importances")
plt.tight_layout()
plt.show()
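
The evaluation above uses predict(), which applies a fixed 0.5 probability cut-off, and confusion_matrix is imported but never called. For fraud detection the cut-off is usually a business decision, so it can help to inspect the confusion matrix at several thresholds. The sketch below uses invented labels and scores; confusion_at_threshold is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_at_threshold(y_true, y_prob, threshold: float) -> dict:
    """Confusion-matrix counts plus precision and recall at a probability cut-off."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
            "precision": float(precision), "recall": float(recall)}


# Invented labels and predicted fraud probabilities
y_true_toy = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob_toy = [0.05, 0.20, 0.40, 0.10, 0.35, 0.80, 0.60, 0.55]

at_default = confusion_at_threshold(y_true_toy, y_prob_toy, 0.5)
at_lower = confusion_at_threshold(y_true_toy, y_prob_toy, 0.3)
print(at_default)
print(at_lower)
```

With the notebook's variables, the call would be confusion_at_threshold(y_test, y_prob, 0.3). Lowering the threshold trades precision for recall, which is visible in the toy output.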