# Credit Card Fraud Detection
## Using Unsupervised Anomaly Detection

**Author:** Farès HAMDI  
**Date:** 2025  

---

### About this project

This notebook explores how **unsupervised learning** can help detect fraudulent credit card transactions. The idea is simple: frauds are rare and unusual, so they should stand out as **anomalies** in the data.

We use two classic anomaly detection algorithms:
- **Isolation Forest**: isolates anomalies by randomly partitioning the data
- **Local Outlier Factor (LOF)**: compares each point's local density with that of its neighbors

The `Class` column (0 = legit, 1 = fraud) is only used to **evaluate** how well the algorithms perform. In a real scenario, we often don't have reliable labels.
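To make the "labels only for evaluation" idea concrete, here is a minimal sketch on synthetic data (not the real dataset). The toy arrays, the `contamination` value, and the variable names are assumptions chosen for illustration. The detector is fit on features alone, and the labels only come in afterwards to score the result.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 2,000 "normal" points and 20 far-away "frauds".
X_normal = rng.normal(loc=0.0, scale=1.0, size=(2000, 5))
X_fraud = rng.normal(loc=6.0, scale=1.0, size=(20, 5))
X_toy = np.vstack([X_normal, X_fraud])
y_toy = np.r_[np.zeros(2000, dtype=int), np.ones(20, dtype=int)]

# Fit WITHOUT labels: the detector only ever sees X_toy.
detector = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
raw_pred = detector.fit_predict(X_toy)  # scikit-learn convention: +1 = inlier, -1 = outlier

# Map to the notebook's convention (1 = flagged) and only now bring in the labels.
pred_toy = (raw_pred == -1).astype(int)
print(f"Recall on toy data: {recall_score(y_toy, pred_toy):.2f}")
```

The toy frauds are deliberately easy to spot. On the real data the separation is much weaker, which is what the rest of the notebook investigates.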

## 1. Setup and Imports

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Import our modules
from data_loading import load_data, print_dataset_summary, get_amount_statistics, get_top_correlations
from preprocessing import preprocess_pipeline
from models import train_all_models
from evaluation import evaluate_model, print_evaluation_report, compare_models, save_predictions
from visualization import (
    setup_plot_style,
    plot_eda_overview,
    plot_pca_projection,
    plot_score_distributions,
    plot_precision_recall_curves,
    plot_detection_comparison,
    plot_results_summary
)

# Settings
warnings.filterwarnings('ignore')
setup_plot_style()

# Output directory for figures
FIGURES_DIR = Path.cwd().parent / 'outputs' / 'figures'
FIGURES_DIR.mkdir(parents=True, exist_ok=True)

print("Setup complete!")

## 2. Load the Data

The dataset contains 284,807 transactions made by European cardholders over two days in September 2013.  
Only 492 of them are frauds, about **0.17%**: a needle in a haystack.
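The imbalance is easy to quantify with a few lines of arithmetic. The row count of 284,807 used below comes from the dataset's public description rather than from this notebook's output, so treat it as an assumption until `print_dataset_summary` confirms it.

```python
n_transactions = 284_807  # assumed total row count (from the dataset's public description)
n_frauds = 492

fraud_rate = n_frauds / n_transactions
print(f"Fraud rate: {fraud_rate:.4%}")                              # 0.1727%
print(f"About 1 fraud per {round(1 / fraud_rate)} transactions")    # 579

# A model that labels everything "legit" would still look excellent on accuracy,
# which is why accuracy is not a useful metric here.
print(f"Accuracy of always predicting 'legit': {1 - fraud_rate:.2%}")  # 99.83%
```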

In [None]:
# Load the dataset
DATA_PATH = Path.cwd().parent / 'data' / 'creditcard.csv'

df = load_data(DATA_PATH)

In [None]:
# Dataset summary
print_dataset_summary(df)

In [None]:
# First look at the data
df.head()

## 3. Exploratory Data Analysis

Let's explore the data to understand its characteristics.

In [None]:
# Transaction amount statistics
stats = get_amount_statistics(df)

print("Transaction Amounts:")
print(f"  Overall - Mean: €{stats['overall']['mean']:.2f}, Median: €{stats['overall']['median']:.2f}")
print(f"  Legitimate - Mean: €{stats['legitimate']['mean']:.2f}")
print(f"  Fraudulent - Mean: €{stats['fraudulent']['mean']:.2f}")

In [None]:
# Top features correlated with fraud
correlations = get_top_correlations(df, n=5)

print("Features most correlated with fraud:")
print(correlations.head(10))

In [None]:
# EDA visualizations
plot_eda_overview(df, save_path=FIGURES_DIR / '01_eda_overview.png')

**Observations:**
- Extreme class imbalance: ~284k legitimate transactions vs. 492 frauds
- Most transactions are for small amounts
- Transactions are spread across 48 hours
- Fraudulent transactions tend to have lower amounts
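The helpers `get_amount_statistics` and `get_top_correlations` live in the project's `src` folder and are not shown here. As a rough idea of the kind of computation behind these observations, the sketch below uses plain pandas on a tiny made-up frame. The data and the names `toy`, `amount_by_class`, and `corr_with_class` are purely illustrative and not taken from the real helpers.

```python
import pandas as pd

# Tiny made-up frame with the same column layout as the real data (values are invented).
toy = pd.DataFrame({
    "V1": [0.1, -0.3, 0.2, -2.5, 0.0, -3.1],
    "V2": [1.0, 0.8, 1.1, 0.9, 1.2, 1.0],
    "Amount": [12.0, 80.0, 5.5, 1.0, 40.0, 99.0],
    "Class": [0, 0, 0, 1, 0, 1],
})

# Amount statistics split by class (0 = legit, 1 = fraud).
amount_by_class = toy.groupby("Class")["Amount"].agg(["count", "mean", "median"])
print(amount_by_class)

# Absolute Pearson correlation of each feature with the label, strongest first.
corr_with_class = (
    toy.drop(columns="Class")
       .corrwith(toy["Class"])
       .abs()
       .sort_values(ascending=False)
)
print(corr_with_class)
```

Comparing both the mean and the median per class is useful because transaction amounts are heavily skewed, so the two can tell different stories.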

## 4. Data Preprocessing

The V1-V28 features come out of a PCA transformation and are already on comparable scales. `Time` and `Amount` are still in raw units (seconds and currency), so we standardize them.
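The actual `preprocess_pipeline` is defined in `src/preprocessing.py` and is not reproduced in this notebook. The following is only a sketch of what such a pipeline might do, assuming it standardizes `Time` and `Amount` with `StandardScaler` and builds a 2D PCA projection for plotting. The toy data and the `_toy` variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Made-up data with a subset of the real columns.
toy = pd.DataFrame({
    "Time": rng.uniform(0, 172_800, size=500),               # seconds over two days
    "V1": rng.normal(size=500),
    "V2": rng.normal(size=500),
    "Amount": rng.lognormal(mean=3.0, sigma=1.0, size=500),  # skewed, like real amounts
    "Class": rng.integers(0, 2, size=500),
})

X_toy = toy.drop(columns="Class")
y_toy = toy["Class"]

# Standardize only the two raw-unit columns; leave the PCA features untouched.
X_scaled_toy = X_toy.copy()
X_scaled_toy[["Time", "Amount"]] = StandardScaler().fit_transform(X_toy[["Time", "Amount"]])

# 2D projection used purely for visualization.
pca_toy = PCA(n_components=2, random_state=42)
X_2d_toy = pca_toy.fit_transform(X_scaled_toy)
print(X_2d_toy.shape, pca_toy.explained_variance_ratio_)
```

Standardizing matters most for LOF, which is distance-based: an unscaled `Time` column in the tens of thousands would otherwise dominate every neighbor computation.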

In [None]:
# Run preprocessing pipeline
preprocessed = preprocess_pipeline(df)

X = preprocessed['X']
X_scaled = preprocessed['X_scaled']
y = preprocessed['y']
X_2d = preprocessed['X_2d']
pca = preprocessed['pca']

In [None]:
# Visualize 2D projection
plot_pca_projection(X_2d, y, pca, save_path=FIGURES_DIR / '02_pca_projection.png')

**Notice:** Frauds (red) don't form a clear cluster. They're scattered around, which makes detection tricky.

## 5. Train Anomaly Detection Models

We'll train two unsupervised models:
1. **Isolation Forest**: Isolates anomalies using random partitioning
2. **Local Outlier Factor**: Compares each point's local density with that of its neighbors
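For readers without access to `src/models.py`, here is a hedged sketch of how the two models could be trained with scikit-learn. The function `train_both_sketch`, its hyperparameters, and the score orientation (higher = more anomalous) are assumptions for illustration. The real `train_all_models` may differ, for example in how it uses `y`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def train_both_sketch(X, contamination=0.002, random_state=42):
    """Hypothetical stand-in for train_all_models (not the project's code)."""
    # Isolation Forest: anomalies need fewer random splits to be isolated.
    iso = IsolationForest(n_estimators=100, contamination=contamination,
                          random_state=random_state)
    iso_pred = (iso.fit_predict(X) == -1).astype(int)   # 1 = flagged as anomaly
    iso_scores = -iso.score_samples(X)                  # negate so higher = more anomalous

    # LOF: anomalies sit in regions sparser than their neighbors' regions.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    lof_pred = (lof.fit_predict(X) == -1).astype(int)
    lof_scores = -lof.negative_outlier_factor_          # negate so higher = more anomalous

    return {
        "isolation_forest": {"predictions": iso_pred, "scores": iso_scores},
        "lof": {"predictions": lof_pred, "scores": lof_scores},
    }


# Toy check: 1,000 ordinary points plus 5 planted outliers at the end.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(size=(1000, 4)), rng.normal(loc=8.0, size=(5, 4))])
demo = train_both_sketch(X_demo, contamination=0.005)
print({name: int(res["predictions"].sum()) for name, res in demo.items()})
```

Both scikit-learn estimators report scores where lower means more abnormal, so the sketch flips the sign to get a single "higher is more suspicious" convention for plotting and for average precision.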

In [None]:
# Train all models
model_results = train_all_models(X_scaled, y)

# Extract results
iso_results = model_results['isolation_forest']
lof_results = model_results['lof']

pred_iso = iso_results['predictions']
pred_lof = lof_results['predictions']
scores_iso = iso_results['scores']
scores_lof = lof_results['scores']

## 6. Model Evaluation

Now let's evaluate how well each model performs at detecting actual frauds.

**Important metrics for imbalanced data:**
- **Recall**: Of all actual frauds, how many did we catch?
- **Precision**: Of all predicted anomalies, how many are actual frauds?
- **Average Precision**: Summarizes the precision-recall curve in a single number (approximately the area under it)
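The three metrics above can be computed directly with scikit-learn. The example below uses ten hand-made transactions so the numbers can be checked by hand. The labels, scores, and the 0.5 threshold are invented for illustration, and `evaluate_model` may compute more than this.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score

# Ten made-up transactions: 3 frauds (label 1), 7 legitimate (label 0).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
# Anomaly scores, higher = more suspicious.
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.80, 0.25, 0.90, 0.70, 0.22, 0.05])

# Flag everything at or above an (arbitrary) threshold of 0.5.
y_pred = (scores >= 0.5).astype(int)

recall = recall_score(y_true, y_pred)        # 2 of the 3 frauds are flagged
precision = precision_score(y_true, y_pred)  # 2 of the 3 flagged items are frauds
ap = average_precision_score(y_true, scores) # threshold-free, uses the raw scores

print(f"Recall: {recall:.3f}  Precision: {precision:.3f}  Average precision: {ap:.3f}")
```

Recall and precision depend on the chosen threshold, while average precision is computed from the full ranking of scores, which makes it the fairer number for comparing two detectors.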

In [None]:
# Evaluate Isolation Forest
results_iso = evaluate_model(y, pred_iso, scores_iso, "Isolation Forest")
print_evaluation_report(results_iso)

In [None]:
# Evaluate LOF
results_lof = evaluate_model(y, pred_lof, scores_lof, "Local Outlier Factor")
print_evaluation_report(results_lof)

In [None]:
# Compare models
compare_models([results_iso, results_lof])

## 7. Visualizations

In [None]:
# Score distributions
plot_score_distributions(y, scores_iso, scores_lof, 
                         save_path=FIGURES_DIR / '03_score_distributions.png')

In [None]:
# Precision-Recall curves
plot_precision_recall_curves(y, scores_iso, scores_lof, results_iso, results_lof,
                             save_path=FIGURES_DIR / '04_precision_recall.png')

In [None]:
# Detection comparison in 2D
plot_detection_comparison(X_2d, y, pred_iso, pred_lof,
                          save_path=FIGURES_DIR / '05_detection_comparison.png')

In [None]:
# Results summary
plot_results_summary(results_iso, results_lof,
                     save_path=FIGURES_DIR / '06_results_summary.png')

## 8. Save Predictions

In [None]:
# Save predictions to CSV
save_predictions(
    y,
    {'isolation_forest': pred_iso, 'lof': pred_lof},
    {'isolation_forest': scores_iso, 'lof': scores_lof},
    filepath=Path.cwd().parent / 'outputs' / 'predictions.csv'
)

## 9. Conclusion

### Key Findings

Both algorithms face the classic **precision-recall tradeoff**:
- To catch more frauds (higher recall), we inevitably flag more legitimate transactions as suspicious (more false positives)
- Both models achieve similar performance (~70% recall, ~7% precision)
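It helps to translate those percentages into alert counts. The short calculation below takes the approximate figures quoted above (70% recall, 7% precision) together with the 492 known frauds. The results are back-of-the-envelope estimates, not outputs of the models.

```python
n_frauds = 492
recall = 0.70
precision = 0.07

true_positives = recall * n_frauds          # frauds actually caught
flagged = true_positives / precision        # total alerts raised
false_positives = flagged - true_positives  # legitimate transactions flagged

print(f"Frauds caught:   {true_positives:.0f}")
print(f"Alerts raised:   {flagged:.0f}")
print(f"False alarms:    {false_positives:.0f}")
print(f"False alarms per caught fraud: {false_positives / true_positives:.1f}")
```

In other words, catching roughly 344 frauds costs on the order of 4,900 alerts, or about 13 false alarms for every real fraud. Whether that is acceptable depends entirely on how expensive a manual review is compared with a missed fraud.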

### Challenges

1. **Extreme class imbalance** (0.17% frauds)
2. **Frauds don't cluster together**: they're spread throughout the feature space
3. **Anonymized features** limit interpretability

### Possible Improvements

- **Ensemble methods**: Combine multiple algorithms
- **Threshold tuning**: Adjust based on business costs
- **Feature engineering**: Create new features from existing ones
- **Supervised learning**: If reliable labels are available
- **Temporal validation**: Train on past data, test on future data
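As one example of the ensemble idea, the scores of several detectors can be combined by averaging their ranks, which sidesteps the fact that Isolation Forest and LOF scores live on different scales. This is a sketch of one common approach, not something the notebook implements. The helper `rank_average` is hypothetical and assumes all score arrays are oriented so that higher means more anomalous.

```python
import numpy as np
from scipy.stats import rankdata


def rank_average(*score_arrays):
    """Average the normalized ranks of several score arrays (higher = more anomalous)."""
    n = len(score_arrays[0])
    # rankdata assigns 1 to the smallest score; dividing by n maps ranks into (0, 1].
    ranks = [rankdata(s) / n for s in score_arrays]
    return np.mean(ranks, axis=0)


# Two detectors scoring the same four transactions on very different scales.
scores_a = np.array([0.10, 0.90, 0.30, 0.20])
scores_b = np.array([1.0, 5.0, 9.0, 2.0])

combined = rank_average(scores_a, scores_b)
print(combined)  # [0.25  0.875 0.875 0.5 ]
```

The second and third transactions tie for most suspicious: each detector ranks one of them first and the other second, and rank averaging treats the two opinions equally.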

### Production Considerations

In production, you'd need to:
- Set the threshold based on the cost of missing a fraud vs. blocking a legitimate transaction
- Continuously monitor and retrain as fraud patterns evolve
- Implement real-time scoring capabilities
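The first point can be sketched as a small search over candidate thresholds. The function `pick_threshold`, the cost values, and the toy data are assumptions for illustration. The approach also needs at least a small labeled validation sample, which may not exist in every deployment.

```python
import numpy as np


def pick_threshold(y_true, scores, cost_fn=100.0, cost_fp=1.0):
    """Return (threshold, total_cost) minimizing cost; a transaction is flagged when score >= threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):  # every observed score is a candidate threshold
        flagged = scores >= t
        fp = np.sum(flagged & (y_true == 0))   # legitimate transactions blocked
        fn = np.sum(~flagged & (y_true == 1))  # frauds missed
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy validation sample: higher score = more suspicious.
y_val = [0, 0, 0, 0, 1, 0, 1]
s_val = [0.1, 0.2, 0.3, 0.6, 0.5, 0.4, 0.9]

# Missing a fraud is 100x worse than a false alarm: accept one false alarm to catch both frauds.
print(pick_threshold(y_val, s_val, cost_fn=100.0, cost_fp=1.0))
# False alarms are 100x worse than a missed fraud: only flag the most extreme score.
print(pick_threshold(y_val, s_val, cost_fn=1.0, cost_fp=100.0))
```

The same scores lead to very different operating points depending on the cost ratio, which is why the threshold is a business decision as much as a modeling one.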

In [None]:
# Summary of generated files
print("\n" + "=" * 60)
print("GENERATED FILES")
print("=" * 60)
print("\nFigures (in outputs/figures/):")
print("  - 01_eda_overview.png")
print("  - 02_pca_projection.png")
print("  - 03_score_distributions.png")
print("  - 04_precision_recall.png")
print("  - 05_detection_comparison.png")
print("  - 06_results_summary.png")
print("\nData (in outputs/):")
print("  - predictions.csv")
print("\n" + "=" * 60)
print("Done!")
print("=" * 60)