# Task 2.2: Model Building - Credit Card Fraud Detection

## Objective
Build and evaluate classification models for bank credit card fraud detection:
1. Train a **baseline** model (Logistic Regression)
2. Train an **ensemble** model (Random Forest)
3. Compare models and select the best

## Dataset Characteristics
- Features are **anonymized PCA components** (V1-V28) plus `Time` and `Amount`
- All features are **numeric** (simpler preprocessing)
- Highly imbalanced (typical for credit card fraud)
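
To make the imbalance point concrete, here is a small sketch using made-up labels (not the real dataset). It shows why plain accuracy is a poor yardstick here: a "model" that never flags fraud still scores near-perfect accuracy while catching nothing. The counts and the helper name `imbalance_summary` are illustrative assumptions.

```python
import numpy as np

def imbalance_summary(y):
    """Summarize class imbalance and score a predictor that always says 'not fraud'."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    y_pred = np.zeros_like(y)  # a "model" that never predicts fraud
    return {
        "fraud_rate": n_pos / len(y),
        "ratio": n_neg / n_pos,
        "all_negative_accuracy": float((y_pred == y).mean()),
        "all_negative_recall": float(((y_pred == 1) & (y == 1)).sum() / n_pos),
    }

# Synthetic labels: 5 fraud cases among 1,000 transactions (illustrative only)
y_demo = np.array([1] * 5 + [0] * 995)
summary = imbalance_summary(y_demo)
print(summary)
```

This is why the rest of the notebook reports precision, recall, F1, and AUC-PR rather than accuracy.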

In [None]:
# Standard imports
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

# Project imports
from src.data.loader import load_creditcard_data
from src.modeling.pipelines import build_creditcard_pipeline, get_model_name
from src.modeling.train import train_and_evaluate, cross_validate_model, save_model, compare_models
from src.modeling.metrics import (
    plot_confusion_matrix,
    plot_precision_recall_curve,
    get_classification_report_df,
)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

RANDOM_STATE = 42

## 1. Load Credit Card Data

In [None]:
# Load the credit card dataset
DATA_PATH = project_root / "data" / "raw" / "creditcard.csv"

df = load_creditcard_data(DATA_PATH)
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Check class distribution
print("Class Distribution:")
class_counts = df['Class'].value_counts()
print(class_counts)
print(f"\nFraud rate: {df['Class'].mean() * 100:.4f}%")
print(f"Imbalance ratio: 1:{class_counts[0] / class_counts[1]:.0f}")

In [None]:
# Define features and target
# V1-V28 are PCA components; Time and Amount are the only directly interpretable features
FEATURE_COLS = [col for col in df.columns if col != 'Class']
TARGET = 'Class'

print(f"Features: {len(FEATURE_COLS)} columns")
print(f"Feature names: {FEATURE_COLS}")

In [None]:
# Prepare X and y
X = df[FEATURE_COLS].copy()
y = df[TARGET].copy()

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

In [None]:
from IPython.display import Markdown, display

fraud_rate = df['Class'].mean() * 100
imbalance_ratio = class_counts[0] / class_counts[1]

display(
    Markdown(
        f"""
### Interpretation: Credit Card Dataset Overview

- **Dataset size**: `{df.shape[0]:,}` transactions × `{df.shape[1]}` columns
- **Fraud rate**: `{fraud_rate:.4f}%` (highly imbalanced)
- **Imbalance ratio**: approximately **1:{imbalance_ratio:.0f}**
- **Features**: `V1`-`V28` (PCA components, anonymized), `Time`, `Amount`

**Note on interpretability**
- The V1-V28 features are the output of a PCA transformation applied for privacy.
- Only `Time` and `Amount` have direct business meaning.
- SHAP analysis can still show which components drive predictions, but the components themselves cannot be mapped back to business concepts.
"""
    )
)

## 2. Train-Test Split (Stratified)
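With so few positives, an unstratified split can leave the test set with a noticeably different fraud rate from the training set. The sketch below uses synthetic labels (an assumption for illustration, not the project's data) to show that `stratify=y` keeps the class ratio the same in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives in 10,000 rows (illustrative only)
rng = np.random.default_rng(42)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = rng.normal(size=(10_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both partitions keep the 0.2% positive rate
print(f"train fraud rate: {y_tr.mean():.4%}")
print(f"test fraud rate:  {y_te.mean():.4%}")
```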

In [None]:
# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"\nTraining class distribution:")
print(y_train.value_counts())
print(f"\nTest class distribution:")
print(y_test.value_counts())

## 3. Baseline Model: Logistic Regression
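Both pipelines in this notebook are built with `use_smote=True`. How `build_creditcard_pipeline` wires SMOTE in is not shown here, and imbalanced-learn's `SMOTE` is the usual implementation. The block below is only a simplified NumPy illustration of the core idea: each synthetic minority row is an interpolation between a real minority row and one of its nearest minority neighbours. The function name `smote_like_oversample` and the toy points are assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating each sampled row
    toward a randomly chosen one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, len(X_min), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random(size=(n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base_idx] + gap * (X_min[nb_idx] - X_min[base_idx])

# Four toy minority points on the corners of the unit square
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synth = smote_like_oversample(X_minority, n_new=6, k=2)
print(X_synth.shape)
```

Whatever implementation is used, resampling should touch only the training data, so that the test set keeps the real-world class ratio.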

In [None]:
# Build logistic regression pipeline with SMOTE
lr_pipeline = build_creditcard_pipeline(
    model_type="logistic",
    use_smote=True,
    random_state=RANDOM_STATE,
)

print("Logistic Regression Pipeline:")
print(lr_pipeline)

In [None]:
# Train and evaluate
print("Training Logistic Regression...")
lr_model, lr_metrics, lr_threshold = train_and_evaluate(
    lr_pipeline,
    X_train, y_train,
    X_test, y_test,
    threshold=0.5,
)

print("\nLogistic Regression Results:")
for key, value in lr_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Confusion matrix (hard labels from predict() correspond to the default 0.5 cut-off)
y_pred_lr = lr_model.predict(X_test)

fig, ax = plt.subplots(figsize=(6, 5))
plot_confusion_matrix(y_test.values, y_pred_lr, title="Logistic Regression - Confusion Matrix", ax=ax)
plt.tight_layout()
plt.show()

In [None]:
# Classification report
print("Classification Report - Logistic Regression:")
display(get_classification_report_df(y_test.values, y_pred_lr))

## 4. Ensemble Model: Random Forest
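Like the baseline, the Random Forest below is evaluated with `threshold=0.5`. How `train_and_evaluate` uses that value is not shown in this notebook. A common pattern, sketched here as an assumption, is to compare the positive-class probability against the threshold and compute precision and recall from the resulting labels. The probabilities and the helper `metrics_at_threshold` are made up for illustration.

```python
import numpy as np

def metrics_at_threshold(y_true, y_proba, threshold=0.5):
    """Turn probabilities into hard labels at `threshold`, then score them."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

# Toy labels and scores (illustrative only)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_proba_demo = [0.05, 0.10, 0.60, 0.41, 0.90, 0.45, 0.70, 0.42]

at_050 = metrics_at_threshold(y_true_demo, y_proba_demo, threshold=0.5)
at_040 = metrics_at_threshold(y_true_demo, y_proba_demo, threshold=0.4)
print("threshold 0.5:", at_050)
print("threshold 0.4:", at_040)
```

In this toy data, lowering the threshold from 0.5 to 0.4 catches every fraud case but flags more legitimate transactions, which is the trade-off to keep in mind when reading the confusion matrices.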

In [None]:
# Build Random Forest pipeline with SMOTE
rf_pipeline = build_creditcard_pipeline(
    model_type="random_forest",
    use_smote=True,
    random_state=RANDOM_STATE,
    n_estimators=100,
    max_depth=10,
)

print("Random Forest Pipeline:")
print(rf_pipeline)

In [None]:
# Train and evaluate
print("Training Random Forest...")
rf_model, rf_metrics, rf_threshold = train_and_evaluate(
    rf_pipeline,
    X_train, y_train,
    X_test, y_test,
    threshold=0.5,
)

print("\nRandom Forest Results:")
for key, value in rf_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Confusion matrix (hard labels from predict() correspond to the default 0.5 cut-off)
y_pred_rf = rf_model.predict(X_test)

fig, ax = plt.subplots(figsize=(6, 5))
plot_confusion_matrix(y_test.values, y_pred_rf, title="Random Forest - Confusion Matrix", ax=ax)
plt.tight_layout()
plt.show()

In [None]:
# Classification report
print("Classification Report - Random Forest:")
display(get_classification_report_df(y_test.values, y_pred_rf))

## 5. Model Comparison
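`compare_models` is a project helper whose internals are not shown; the label printed below indicates that it sorts by AUC-PR. As a rough sketch of what such a ranking involves, the block computes average precision (a common AUC-PR estimate) with scikit-learn for two hypothetical score vectors and sorts them in a DataFrame. The scores and model names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import average_precision_score

y_true_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = {
    # model_a ranks every positive above every negative
    "model_a": np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.1, 0.4, 0.7]),
    # model_b ranks one positive below two negatives
    "model_b": np.array([0.6, 0.2, 0.9, 0.3, 0.4, 0.1, 0.5, 0.7]),
}

rows = {
    name: {"auc_pr": average_precision_score(y_true_demo, s)}
    for name, s in scores.items()
}
ranking = pd.DataFrame.from_dict(rows, orient="index").sort_values(
    "auc_pr", ascending=False
)
print(ranking)
```

Unlike accuracy or ROC-AUC, this metric depends only on how well positives are ranked relative to the positive rate, which makes it a sensible sort key for heavily imbalanced data.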

In [None]:
# Compare all models
results = {
    "Logistic Regression + SMOTE": lr_metrics,
    "Random Forest + SMOTE": rf_metrics,
}

comparison_df = compare_models(results)
print("Model Comparison (sorted by AUC-PR):")
display(comparison_df.round(4))

In [None]:
# Precision-Recall curves comparison
fig, ax = plt.subplots(figsize=(10, 6))

y_proba_lr = lr_model.predict_proba(X_test)[:, 1]
y_proba_rf = rf_model.predict_proba(X_test)[:, 1]

plot_precision_recall_curve(y_test.values, y_proba_lr, model_name="Logistic Regression", ax=ax)
plot_precision_recall_curve(y_test.values, y_proba_rf, model_name="Random Forest", ax=ax)

ax.legend(loc="best")
ax.set_title("Precision-Recall Curves - Credit Card Models")
plt.tight_layout()
plt.show()

In [None]:
# Side-by-side confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plot_confusion_matrix(y_test.values, y_pred_lr, title="Logistic Regression", ax=axes[0])
plot_confusion_matrix(y_test.values, y_pred_rf, title="Random Forest", ax=axes[1])

plt.tight_layout()
plt.show()

In [None]:
from IPython.display import Markdown, display

best_model_name = comparison_df.index[0]
best_auc_pr = comparison_df.iloc[0]['auc_pr']

lr_auc = lr_metrics.get('auc_pr', 0)
rf_auc = rf_metrics.get('auc_pr', 0)

display(
    Markdown(
        f"""
### Interpretation: Model Comparison (Credit Card)

| Model | AUC-PR | F1 | Precision | Recall |
|-------|--------|----|-----------|---------|
| Logistic Regression | {lr_metrics.get('auc_pr', 0):.4f} | {lr_metrics.get('f1', 0):.4f} | {lr_metrics.get('precision', 0):.4f} | {lr_metrics.get('recall', 0):.4f} |
| Random Forest | {rf_metrics.get('auc_pr', 0):.4f} | {rf_metrics.get('f1', 0):.4f} | {rf_metrics.get('precision', 0):.4f} | {rf_metrics.get('recall', 0):.4f} |

**Best Model**: `{best_model_name}` with AUC-PR = `{best_auc_pr:.4f}`

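**Observations**
- Logistic regression is often a strong baseline on PCA-transformed fraud data, so compare the AUC-PR values above to judge how much the ensemble adds here.
- AUC-PR difference (Random Forest minus Logistic Regression): `{rf_auc - lr_auc:+.4f}`.
- Both pipelines apply SMOTE to address the extreme class imbalance. Its benefit is not measured in this notebook because no run without SMOTE is included.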
"""
    )
)

## 6. Feature Importance (Random Forest)
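The extraction cell below relies on two assumptions worth spelling out: the pipeline's final step is named `classifier`, and no earlier step reorders or drops columns, so `feature_importances_` lines up with `FEATURE_COLS`. A self-contained sketch with synthetic data and a plain scikit-learn `Pipeline` (standing in for the project pipeline, whose steps are not shown) illustrates the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends only on the "signal" column
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 3)), columns=["signal", "noise_1", "noise_2"]
)
y_demo = (X_demo["signal"] > 0.5).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Importances come back in the same order as the input columns
demo_importances = demo_pipeline.named_steps["classifier"].feature_importances_
demo_importance_df = (
    pd.DataFrame({"feature": X_demo.columns, "importance": demo_importances})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(demo_importance_df)
```

If a pipeline step does change the column set, the feature names should be taken from the transformed data rather than from the original column list.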

In [None]:
# Extract feature importance
rf_classifier = rf_model.named_steps['classifier']
importances = rf_classifier.feature_importances_

importance_df = pd.DataFrame({
    'feature': FEATURE_COLS,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Top 15 Features by Importance:")
display(importance_df.head(15))

In [None]:
# Plot top 15 features
top_n = 15
top_features = importance_df.head(top_n)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(top_features['feature'], top_features['importance'], color='#e74c3c')
ax.set_xlabel('Importance')
ax.set_ylabel('Feature')
ax.set_title(f'Top {top_n} Feature Importances (Random Forest - Credit Card)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
from IPython.display import Markdown, display

top5 = importance_df.head(5)['feature'].tolist()

display(
    Markdown(
        f"""
### Interpretation: Feature Importance (Credit Card)

**Top 5 most important features**:
1. `{top5[0]}`
2. `{top5[1]}`
3. `{top5[2]}`
4. `{top5[3]}`
5. `{top5[4]}`

**Note on interpretation**
- Since V1-V28 are PCA-transformed, we cannot directly interpret what they represent.
- If `Amount` or `Time` rank highly, those are actionable features.
- SHAP analysis in Task 3 will provide more granular insights into prediction drivers.
"""
    )
)

## 7. Save Models
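`save_model` is a project helper. Judging from the file names listed in the summary it writes `.joblib` files, but its exact behavior is not shown here. A minimal sketch of the general pattern, assuming joblib for the model and a JSON sidecar for the metrics (the sidecar file and the name `save_model_sketch` are assumptions), could look like this.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

def save_model_sketch(model, name, metrics, models_dir):
    """Persist a fitted model with joblib and write its metrics next to it."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    model_path = models_dir / f"{name}.joblib"
    joblib.dump(model, model_path)
    (models_dir / f"{name}_metrics.json").write_text(json.dumps(metrics, indent=2))
    return model_path

# Round trip in a temporary directory with a tiny toy model
with tempfile.TemporaryDirectory() as tmp:
    demo_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    # The metric value is a placeholder, not a real result
    path = save_model_sketch(demo_model, "demo_model", {"auc_pr": 0.5}, tmp)
    reloaded = joblib.load(path)
    probe = [[0.5], [2.5]]
    same_predictions = bool((reloaded.predict(probe) == demo_model.predict(probe)).all())
    saved_name = path.name

print(saved_name, same_predictions)
```

Saving the whole fitted pipeline, rather than only the classifier, keeps preprocessing and model together so the reloaded object can be applied to raw feature rows.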

In [None]:
# Save models
models_dir = project_root / "models"

# Save Random Forest
rf_path = save_model(rf_model, "creditcard_random_forest", rf_metrics, models_dir)
print(f"Random Forest saved to: {rf_path}")

# Save Logistic Regression
lr_path = save_model(lr_model, "creditcard_logistic_regression", lr_metrics, models_dir)
print(f"Logistic Regression saved to: {lr_path}")

## 8. Summary: Task 2 (Credit Card) Complete

In [None]:
from IPython.display import Markdown, display

best_model_name = comparison_df.index[0]
best_metrics = results[best_model_name]

display(
    Markdown(
        f"""
## Summary: Task 2 - Credit Card Fraud Model

### Models Trained
1. **Logistic Regression + SMOTE** (baseline)
2. **Random Forest + SMOTE** (ensemble)

### Best Model: `{best_model_name}`
- **AUC-PR**: `{best_metrics.get('auc_pr', 0):.4f}`
- **F1-Score**: `{best_metrics.get('f1', 0):.4f}`
- **Precision**: `{best_metrics.get('precision', 0):.4f}`
- **Recall**: `{best_metrics.get('recall', 0):.4f}`

### Dataset Characteristics
- `{df.shape[0]:,}` transactions with `{fraud_rate:.4f}%` fraud rate
- Features are PCA-transformed (V1-V28) plus Time and Amount
- Extreme imbalance handled with SMOTE

### Files Saved
- `models/creditcard_random_forest.joblib`
- `models/creditcard_logistic_regression.joblib`

### Next Steps
- Proceed to Task 3 for SHAP explainability analysis
- Focus SHAP analysis on the e-commerce model (more interpretable features)
"""
    )
)