# Evaluating Fraud Detection Techniques in Banking and Insurance Using Data Science

## Comprehensive Analysis Using Machine Learning Approaches

**Author:** [Your Name]  
**Institution:** [Your Institution]  
**Date:** August 2025  
**Dataset:** Credit Card Fraud Detection Dataset (Kaggle MLG-ULB)

---

## Table of Contents

1. [Introduction and Literature Review](#introduction)
2. [Data Loading and Exploration](#data-loading)
3. [Data Preprocessing and Feature Engineering](#preprocessing)
4. [Model Implementation](#models)
5. [Model Evaluation and Comparison](#evaluation)
6. [Hyperparameter Tuning](#tuning)
7. [Ethical Considerations](#ethics)
8. [Conclusions](#conclusions)
9. [References](#references)

---

<a id="introduction"></a>

## 1. Introduction and Literature Review

### Problem Statement

Fraud detection is a critical challenge in the banking and insurance sectors, where global fraud losses run to billions annually. This analysis evaluates several machine learning approaches for detecting fraudulent credit card transactions, contributing to the growing body of research on automated fraud detection systems.

### Research Objectives

1. Compare the performance of traditional and ensemble machine learning algorithms for fraud detection
2. Evaluate feature engineering techniques for imbalanced financial datasets
3. Assess model interpretability and ethical implications in fraud detection systems
4. Provide recommendations for practical implementation in banking environments

### Literature Context

Recent systematic reviews highlight the effectiveness of ensemble methods and neural networks in fraud detection (Zareapoor et al., 2024). Support Vector Machines and Artificial Neural Networks have emerged as particularly effective approaches for credit card fraud detection (Ahmad et al., 2022). The challenge of class imbalance in fraud datasets has driven research toward advanced sampling techniques and cost-sensitive learning approaches (Borketey, 2024).


In [None]:
# Cell 1: Import Required Libraries and Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
import sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score, 
                           roc_curve, precision_recall_curve, f1_score, accuracy_score,
                           precision_score, recall_score)
from sklearn.pipeline import Pipeline
from sklearn.utils import resample
import xgboost as xgb
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline

# Deep Learning Libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Statistical Libraries
from scipy import stats
from scipy.stats import chi2_contingency

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Configure plotting
# Use Matplotlib's default style; the old 'seaborn' style names were deprecated
# in Matplotlib 3.6 and renamed with a 'seaborn-v0_8' prefix
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"XGBoost version: {xgb.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print("All required packages are now available!")


<a id="data-loading"></a>

## 2. Data Loading and Exploration

**Dataset Information:**

- Dataset: Credit Card Fraud Detection (Kaggle, MLG-ULB)
- Source: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- Contents: credit card transactions made by European cardholders over two days in September 2013
- Features V1-V28 are principal components produced by a PCA transformation applied to protect user privacy; `Time` and `Amount` are left untransformed

The following analysis will load the dataset and perform comprehensive exploratory data analysis to understand the structure, distribution, and characteristics of the fraud detection dataset.
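
Because fraud is so rare in this dataset, plain accuracy says very little about a model. The sketch below assumes the class counts published with the Kaggle dataset (284,807 transactions, 492 of them fraudulent) and scores a hypothetical baseline that labels every transaction as normal. The function name `baseline_metrics` is illustrative and is not used elsewhere in this notebook.

```python
def baseline_metrics(n_total, n_fraud):
    """Score a baseline that predicts 'normal' for every transaction."""
    true_negatives = n_total - n_fraud   # every normal transaction is correct
    false_negatives = n_fraud            # every fraud is missed
    true_positives = 0                   # nothing is ever flagged

    accuracy = (true_positives + true_negatives) / n_total
    recall = true_positives / (true_positives + false_negatives)
    return accuracy, recall


# Counts assumed from the published dataset description
accuracy, recall = baseline_metrics(n_total=284_807, n_fraud=492)
print(f"Accuracy: {accuracy:.4%}")
print(f"Fraud recall: {recall:.1%}")
```

The baseline scores above 99.8% accuracy while catching no fraud at all, which is why the later sections lean on recall, precision, and related metrics rather than accuracy.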


In [None]:
# Cell 2: Data Loading and Initial Exploration

# Load the dataset
# Note: Adjust the path according to your local setup
try:
    df = pd.read_csv('creditcard.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Dataset not found. Please ensure 'creditcard.csv' is in your working directory.")
    print("Download from: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud")
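    # Create a sample dataset for demonstration purposes.
    # Labels are assigned at random, so this fallback only exercises the
    # pipeline; metrics computed on it carry no real fraud signal.
    np.random.seed(42)
    n_samples = 10000
    n_features = 28  # V1-V28; Time and Amount are generated separately below
    
    # Generate synthetic data similar to the original dataset
    data = np.random.randn(n_samples, n_features)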
    
    # Create realistic 'Time' and 'Amount' features
    time_data = np.random.uniform(0, 172800, n_samples)  # 48 hours in seconds
    amount_data = np.random.lognormal(3, 1.5, n_samples)  # Log-normal distribution for amounts
    
    # Create highly imbalanced target (0.17% fraud rate similar to original)
    fraud_indices = np.random.choice(n_samples, size=int(0.0017 * n_samples), replace=False)
    target = np.zeros(n_samples)
    target[fraud_indices] = 1
    
    # Combine features
    feature_columns = [f'V{i}' for i in range(1, 29)]
    df = pd.DataFrame(data[:, :28], columns=feature_columns)
    df['Time'] = time_data
    df['Amount'] = amount_data
    df['Class'] = target.astype(int)
    
    print("Using synthetic dataset for demonstration purposes.")

# Display basic information about the dataset
print(f"\nDataset Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nColumn names:")
print(df.columns.tolist())

In [None]:
# Cell 3: Basic Dataset Information and Statistical Summary

# Basic dataset information
print("=== DATASET OVERVIEW ===")
print(f"Dataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Total features: {len(df.columns) - 1}")
print("Target variable: Class (0: Normal, 1: Fraud)")

# Check for missing values
print("\nMissing values per column:")
missing_values = df.isnull().sum()
if missing_values.sum() == 0:
    print("No missing values detected.")
else:
    print(missing_values[missing_values > 0])

# Target variable distribution
fraud_count = df['Class'].value_counts()
fraud_percentage = df['Class'].value_counts(normalize=True) * 100

print(f"\n=== CLASS DISTRIBUTION ===")
print(f"Normal transactions: {fraud_count[0]:,} ({fraud_percentage[0]:.3f}%)")
print(f"Fraudulent transactions: {fraud_count[1]:,} ({fraud_percentage[1]:.3f}%)")
print(f"Imbalance ratio: {fraud_count[0]/fraud_count[1]:.1f}:1")

# Statistical summary
print(f"\n=== STATISTICAL SUMMARY ===")
print(df.describe())

# Data types
print(f"\n=== DATA TYPES ===")
print(df.dtypes.value_counts())

In [None]:
# Cell 4: Advanced Exploratory Data Analysis with Visualizations

# Create comprehensive visualizations to understand data distribution,
# feature relationships, and fraud patterns.

# Create figure with multiple subplots
fig = plt.figure(figsize=(16, 12))

# 1. Class distribution
plt.subplot(2, 3, 1)
fraud_counts = df['Class'].value_counts()
colors = ['#2ecc71', '#e74c3c']
plt.pie(fraud_counts.values, labels=['Normal', 'Fraud'], autopct='%1.2f%%', 
        colors=colors, startangle=90)
plt.title('Transaction Distribution', fontsize=14, fontweight='bold')

# 2. Amount distribution by class
plt.subplot(2, 3, 2)
normal_amounts = df[df['Class'] == 0]['Amount']
fraud_amounts = df[df['Class'] == 1]['Amount']

plt.hist(normal_amounts, bins=50, alpha=0.7, label='Normal', color='#2ecc71', density=True)
plt.hist(fraud_amounts, bins=50, alpha=0.7, label='Fraud', color='#e74c3c', density=True)
plt.xlabel('Transaction Amount')
plt.ylabel('Density')
plt.title('Amount Distribution by Class')
plt.legend()
plt.yscale('log')

# 3. Time distribution analysis
plt.subplot(2, 3, 3)
plt.hist(df['Time'], bins=50, alpha=0.7, color='#3498db')
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency')
plt.title('Transaction Time Distribution')

# 4. Amount vs Time scatter plot
plt.subplot(2, 3, 4)
# Fixed random_state so the sampled background points are reproducible
normal_data = df[df['Class'] == 0].sample(n=min(5000, len(df[df['Class'] == 0])), random_state=42)
fraud_data = df[df['Class'] == 1]

plt.scatter(normal_data['Time'], normal_data['Amount'], alpha=0.5, 
           label='Normal', color='#2ecc71', s=1)
plt.scatter(fraud_data['Time'], fraud_data['Amount'], alpha=0.8, 
           label='Fraud', color='#e74c3c', s=10)
plt.xlabel('Time (seconds)')
plt.ylabel('Amount')
plt.title('Amount vs Time by Class')
plt.legend()
plt.yscale('log')

# 5. Correlation heatmap for key features
plt.subplot(2, 3, 5)
# Select a subset of features for correlation analysis
key_features = ['Time', 'Amount'] + [f'V{i}' for i in range(1, 11)] + ['Class']
corr_matrix = df[key_features].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix')

# 6. Box plot for amount by class
plt.subplot(2, 3, 6)
df_plot = df.copy()
df_plot['Class_Label'] = df_plot['Class'].map({0: 'Normal', 1: 'Fraud'})
sns.boxplot(data=df_plot, x='Class_Label', y='Amount')
plt.title('Amount Distribution by Class')
plt.yscale('log')

plt.tight_layout()
plt.show()

# Print statistical insights
print("=== KEY INSIGHTS FROM EDA ===")
print(f"Average transaction amount (Normal): ${normal_amounts.mean():.2f}")
print(f"Average transaction amount (Fraud): ${fraud_amounts.mean():.2f}")
print(f"Median transaction amount (Normal): ${normal_amounts.median():.2f}")
print(f"Median transaction amount (Fraud): ${fraud_amounts.median():.2f}")
print(f"Maximum transaction amount: ${df['Amount'].max():.2f}")
print(f"Time span: {df['Time'].max() - df['Time'].min():.0f} seconds ({(df['Time'].max() - df['Time'].min())/3600:.1f} hours)")

<a id="preprocessing"></a>

## 3. Data Preprocessing and Feature Engineering

This section implements comprehensive preprocessing including:

- Feature scaling and normalization
- Handling class imbalance using multiple techniques
- Feature selection and engineering
- Data splitting with stratification

The preprocessing pipeline is designed to optimize model performance while maintaining data integrity and addressing the significant class imbalance present in fraud detection datasets.
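
To illustrate why robust scaling can suit transaction amounts, the sketch below compares the two scaling rules on a handful of made-up values. It uses only the standard library so the arithmetic stays visible; the helper names and the sample amounts are hypothetical, and the quartiles are computed with `statistics.quantiles`, not with scikit-learn's `RobustScaler` itself.

```python
import statistics


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Hypothetical amounts: four ordinary transactions and one extreme outlier
amounts = [10.0, 12.0, 14.0, 16.0, 1000.0]

standard = standard_scale(amounts)
robust = robust_scale(amounts)

# Spread of the four ordinary values under each scaling
standard_spread = standard[3] - standard[0]
robust_spread = robust[3] - robust[0]
print(f"Standard scaling spread: {standard_spread:.4f}")
print(f"Robust scaling spread:   {robust_spread:.4f}")
```

With mean and standard deviation, the single outlier compresses the ordinary amounts into a very narrow band. Median and IQR keep them well separated.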


In [None]:
# Cell 5: Feature Engineering and Data Preprocessing

class FraudDetectionPreprocessor:
    """
    Custom preprocessing class for fraud detection with comprehensive
    feature engineering and data preparation capabilities.
    """
    
    def __init__(self, scaling_method='robust'):
        self.scaling_method = scaling_method
        self.scaler = None
        self.feature_names = None
        
    def create_time_features(self, df):
        """Extract time-based features from the Time column"""
        df_processed = df.copy()
        
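        # Approximate hour of day, assuming Time is in seconds. In the Kaggle
        # dataset Time counts seconds since the first transaction, so hour 0
        # is that first transaction rather than midnight.
        df_processed['Time_Hour'] = (df_processed['Time'] % (24 * 3600)) / 3600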
        
        # Cyclical encoding for hour
        df_processed['Time_Hour_Sin'] = np.sin(2 * np.pi * df_processed['Time_Hour'] / 24)
        df_processed['Time_Hour_Cos'] = np.cos(2 * np.pi * df_processed['Time_Hour'] / 24)
        
        return df_processed
    
    def create_amount_features(self, df):
        """Engineer features from the Amount column"""
        df_processed = df.copy()
        
        # Log transformation for amount (log1p handles zero values)
        df_processed['Amount_Log'] = np.log1p(df_processed['Amount'])
        
        # Z-score for amount, using the mean and standard deviation learned
        # in fit_transform so that new data is scored against the same reference
        df_processed['Amount_Zscore'] = (
            (df_processed['Amount'] - self.amount_mean_) / self.amount_std_
        )
        
        # Boolean indicator: amount above the fitted 95th percentile
        df_processed['Is_High_Amount'] = (df_processed['Amount'] > self.amount_q95_).astype(int)
        
        return df_processed
    
    def create_pca_features(self, df):
        """Create additional features from PCA components"""
        df_processed = df.copy()
        
        # Select the original PCA components (V1, V2, ...), excluding any
        # derived columns whose names also start with 'V'
        v_features = [col for col in df.columns
                      if col.startswith('V') and col[1:].isdigit()]
        
        # Create aggregate features
        df_processed['V_Sum'] = df_processed[v_features].sum(axis=1)
        df_processed['V_Mean'] = df_processed[v_features].mean(axis=1)
        df_processed['V_Std'] = df_processed[v_features].std(axis=1)
        df_processed['V_Max'] = df_processed[v_features].max(axis=1)
        df_processed['V_Min'] = df_processed[v_features].min(axis=1)
        df_processed['V_Range'] = df_processed['V_Max'] - df_processed['V_Min']
        
        # Count of extreme values
        threshold = 3
        df_processed['V_Extreme_Count'] = (np.abs(df_processed[v_features]) > threshold).sum(axis=1)
        
        return df_processed
    
    def fit_transform(self, X, y=None):
        """Fit the preprocessor and transform the data"""
        # Learn Amount statistics from the data being fitted
        # (population standard deviation, ddof=0)
        self.amount_mean_ = X['Amount'].mean()
        self.amount_std_ = X['Amount'].std(ddof=0)
        self.amount_q95_ = X['Amount'].quantile(0.95)
        
        # Feature engineering
        X_processed = self.create_time_features(X)
        X_processed = self.create_amount_features(X_processed)
        X_processed = self.create_pca_features(X_processed)
        
        # Select numerical features for scaling
        numerical_features = X_processed.select_dtypes(include=[np.number]).columns
        numerical_features = [col for col in numerical_features if col != 'Class']
        
        # Initialize and fit scaler
        if self.scaling_method == 'robust':
            self.scaler = RobustScaler()
        else:
            self.scaler = StandardScaler()
        
        X_scaled = X_processed[numerical_features].copy()
        X_scaled[numerical_features] = self.scaler.fit_transform(X_scaled[numerical_features])
        
        self.feature_names = numerical_features
        
        return X_scaled
    
    def transform(self, X):
        """Transform new data using fitted preprocessor"""
        # Feature engineering
        X_processed = self.create_time_features(X)
        X_processed = self.create_amount_features(X_processed)
        X_processed = self.create_pca_features(X_processed)
        
        # Scale features
        X_scaled = X_processed[self.feature_names].copy()
        X_scaled[self.feature_names] = self.scaler.transform(X_scaled[self.feature_names])
        
        return X_scaled

# Apply preprocessing
print("=== FEATURE ENGINEERING AND PREPROCESSING ===")

# Initialize preprocessor
preprocessor = FraudDetectionPreprocessor(scaling_method='robust')

# Separate features and target
X = df.drop('Class', axis=1)
y = df['Class']

# Apply preprocessing
X_processed = preprocessor.fit_transform(X)

print(f"Original features: {X.shape[1]}")
print(f"Engineered features: {X_processed.shape[1]}")
print(f"Feature names: {list(X_processed.columns)}")

# Display feature engineering results
print(f"\nNew features created:")
new_features = [col for col in X_processed.columns if col not in X.columns]
print(new_features)
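
The cyclical hour encoding used in `create_time_features` can be checked in isolation. The sketch below is a minimal standard-library illustration with hypothetical helper names (`encode_hour`, `circle_distance`). It shows that two times on either side of the day boundary end up close together after the sine/cosine mapping, even though their raw hour values are far apart.

```python
import math


def encode_hour(hour):
    """Map an hour in [0, 24) onto the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def circle_distance(hour_a, hour_b):
    """Euclidean distance between two hours after cyclical encoding."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


# Hours 23.5 and 0.5 are one hour apart on the clock but 23 apart as raw numbers
raw_gap = abs(23.5 - 0.5)
near_boundary = circle_distance(23.5, 0.5)
one_hour_midday = circle_distance(11.5, 12.5)
print(f"Raw gap: {raw_gap}, encoded gap: {near_boundary:.4f}")
print(f"Encoded gap for 11.5 vs 12.5: {one_hour_midday:.4f}")
```

Any two times one hour apart get the same encoded distance, wherever they fall in the 24-hour cycle. Since `Time` in this dataset is measured from the first transaction, the cycle position is an offset from that start, not a confirmed wall-clock hour.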

In [None]:
# Cell 6: Data Splitting and Class Imbalance Handling

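# Stratified train-test split on the raw features to maintain class distribution
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Refit the preprocessor on the training split only, then apply it to the
# test split, so that no test-set statistics leak into the scaler
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)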

print("=== DATA SPLITTING RESULTS ===")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training fraud rate: {y_train.mean():.4f}")
print(f"Test fraud rate: {y_test.mean():.4f}")

# Class imbalance handling strategies
print(f"\n=== CLASS IMBALANCE HANDLING ===")

# 1. SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Original training set shape: {X_train.shape}")
print(f"SMOTE resampled shape: {X_train_smote.shape}")
print(f"Original fraud rate: {y_train.mean():.4f}")
print(f"SMOTE fraud rate: {y_train_smote.mean():.4f}")

# 2. Random Undersampling
undersampler = RandomUnderSampler(random_state=42, sampling_strategy=0.5)
X_train_under, y_train_under = undersampler.fit_resample(X_train, y_train)

print(f"Undersampled shape: {X_train_under.shape}")
print(f"Undersampled fraud rate: {y_train_under.mean():.4f}")

# Visualization of class distributions
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

datasets = [
    (y_train, "Original Training", '#3498db'),
    (y_train_smote, "SMOTE Resampled", '#2ecc71'),
    (y_train_under, "Undersampled", '#e74c3c'),
    (y_test, "Test Set", '#9b59b6')
]

for idx, (y_data, title, color) in enumerate(datasets):
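    # Fix the order to [Normal, Fraud] so labels and colors always match,
    # including when both classes have equal counts after SMOTE
    counts = pd.Series(y_data).value_counts().reindex([0, 1], fill_value=0)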
    axes[idx].pie(counts.values, labels=['Normal', 'Fraud'], autopct='%1.1f%%',
                  colors=['lightblue', color], startangle=90)
    axes[idx].set_title(title)

plt.tight_layout()
plt.show()

# Store different training sets for model comparison
training_sets = {
    'original': (X_train, y_train),
    'smote': (X_train_smote, y_train_smote),
    'undersampled': (X_train_under, y_train_under)
}

print(f"\nTraining sets prepared for model evaluation:")
for name, (X_set, y_set) in training_sets.items():
    print(f"- {name}: {X_set.shape[0]} samples, fraud rate: {y_set.mean():.4f}")
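
For intuition about what SMOTE adds to the training set, the sketch below reproduces only its core interpolation idea in plain Python: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation. The function name `smote_like_samples`, the neighbour count, and the sample points are all hypothetical.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature fraud samples
fraud_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote_like_samples(fraud_points, n_new=5)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Every synthetic point lies between two real fraud samples, so oversampling fills in the region the minority class already occupies without extending beyond it. This is also why resampling is applied to the training split only: synthetic points derived from test data would contaminate the evaluation.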

<a id="models"></a>

## 4. Model Implementation and Evaluation

This section implements and evaluates multiple machine learning approaches:

### 4.1 Random Forest Classifier

- Ensemble of decision trees, widely used for fraud detection
- Can counter class imbalance through class weighting (`class_weight` in scikit-learn)
- Provides feature importance scores
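
As a rough illustration of class weighting, the sketch below applies the heuristic behind scikit-learn's `class_weight='balanced'` option, `n_samples / (n_classes * class_count)`, to a made-up label vector. The helper name `balanced_class_weights` and the label counts are hypothetical. In practice the option would be passed to the classifier directly, without computing the weights by hand.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 9,983 normal and 17 fraud (about 0.17% fraud)
labels = [0] * 9_983 + [1] * 17
weights = balanced_class_weights(labels)
print(f"Normal weight: {weights[0]:.3f}")
print(f"Fraud weight:  {weights[1]:.1f}")
```

Under this heuristic, misclassifying a fraud sample costs several hundred times more than misclassifying a normal one, which offsets the imbalance without resampling the data.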

### 4.2 XGBoost Classifier

- Gradient boosting algorithm optimized for structured (tabular) data
- Can counter class imbalance by up-weighting the positive class (`scale_pos_weight`)
- Built-in regularization and support for early stopping

### 4.3 Support Vector Machine

- Effective for high-dimensional data
- Can capture complex non-linear patterns with RBF kernel
- Good generalization capabilities
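
To make the RBF kernel concrete, the sketch below evaluates `exp(-gamma * ||x - y||^2)` for a few hypothetical feature vectors and `gamma` values. It is a standard-library illustration of the kernel function only; the vectors and the `gamma` grid are assumptions chosen for readability, not values used in this notebook.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical scaled feature vectors
reference = (0.0, 0.0)
nearby = (0.1, -0.1)
distant = (3.0, 4.0)

for gamma in (0.01, 0.1, 1.0):
    print(
        f"gamma={gamma}: nearby={rbf_kernel(reference, nearby, gamma):.4f}, "
        f"distant={rbf_kernel(reference, distant, gamma):.4f}"
    )
```

Larger `gamma` values make similarity fall off faster with distance, giving a more local and more flexible decision boundary. Because the kernel depends on distances between feature vectors, it is sensitive to feature scale, which is one reason the features are scaled before an SVM is fitted.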

### 4.4 Neural Network

- Deep learning approach with multiple hidden layers
- Batch normalization and dropout for regularization
- Suitable for large-scale fraud detection systems
