# Task 2.1: Model Building - E-commerce Fraud Detection

## Objective
Build, train, and evaluate classification models to detect fraudulent e-commerce transactions:
1. Train a **baseline** interpretable model (Logistic Regression)
2. Train an **ensemble** model (Random Forest)
3. Compare models using metrics appropriate for imbalanced data
4. Select the best model with justification

## Evaluation Metrics
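- **AUC-PR (Average Precision)**: Primary metric. It summarizes the precision-recall curve and is not inflated by the large number of legitimate transactions (true negatives).
- **F1-Score**: Harmonic mean of precision and recall at the chosen threshold
- **Precision / Recall**: Trade-off between false alarms and missed fraud
- **Confusion Matrix**: Counts of TP, FP, FN, and TN at the chosen threshold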
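
As a minimal sketch of how these metrics relate, the snippet below computes each of them with scikit-learn on a tiny hand-made example. The labels and scores are illustrative toy values, not project data, and the 0.5 threshold simply mirrors the default used later in this notebook.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels and scores (illustrative only): 2 fraud cases out of 10.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90])

# AUC-PR (average precision) is threshold-free: it depends only on the ranking of scores.
auc_pr = average_precision_score(y_true, y_score)

# The remaining metrics need hard predictions at a chosen threshold.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"AUC-PR={auc_pr:.4f}  precision={precision:.4f}  recall={recall:.4f}  F1={f1:.4f}")
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Note that changing `threshold` moves precision, recall, F1, and the confusion matrix, while AUC-PR stays the same.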

In [None]:
# Standard imports
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from IPython.display import Markdown, display  # used by the report cells below
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

# Project imports
from src.modeling.pipelines import build_fraud_pipeline
from src.modeling.train import train_and_evaluate, cross_validate_model, save_model, compare_models
from src.modeling.metrics import (
    plot_confusion_matrix,
    plot_precision_recall_curve,
    get_classification_report_df,
)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

RANDOM_STATE = 42

## 1. Load Feature-Engineered Data

In [None]:
# Load the feature-engineered dataset from Task 1
DATA_PATH = project_root / "data" / "processed" / "fraud_featured.parquet"

if DATA_PATH.exists():
    df = pd.read_parquet(DATA_PATH)
    print(f"Loaded feature-engineered data: {df.shape}")
else:
    raise FileNotFoundError(f"Please run Task 1 notebooks first to create: {DATA_PATH}")

df.head()

In [None]:
# Define features and target
NUMERIC_FEATURES = [
    'purchase_value', 'age', 'hour_of_day', 'day_of_week', 'is_weekend',
    'time_since_signup', 'tx_count_user_id_1h', 'tx_count_user_id_24h',
    'user_total_transactions'
]

CATEGORICAL_FEATURES = ['source', 'browser', 'sex', 'country']

TARGET = 'class'

# Prepare X and y
X = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES].copy()
y = df[TARGET].copy()

print(f"Features: {X.shape[1]} columns")
print(f"Target distribution:\n{y.value_counts()}")

## 2. Train-Test Split (Stratified)

In [None]:
# Stratified split to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")
print(f"\nTraining class distribution:")
print(y_train.value_counts())
print(f"\nTest class distribution:")
print(y_test.value_counts())

## 3. Baseline Model: Logistic Regression

In [None]:
# Build logistic regression pipeline with SMOTE
lr_pipeline = build_fraud_pipeline(
    model_type="logistic",
    use_smote=True,
    numeric_features=NUMERIC_FEATURES,
    categorical_features=CATEGORICAL_FEATURES,
    random_state=RANDOM_STATE,
)

print("Logistic Regression Pipeline:")
print(lr_pipeline)

In [None]:
# Train and evaluate
print("Training Logistic Regression...")
lr_model, lr_metrics, lr_threshold = train_and_evaluate(
    lr_pipeline,
    X_train, y_train,
    X_test, y_test,
    threshold=0.5,
)

print("\nLogistic Regression Results:")
for key, value in lr_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Confusion matrix for baseline
y_pred_lr = lr_model.predict(X_test)

fig, ax = plt.subplots(figsize=(6, 5))
plot_confusion_matrix(y_test.values, y_pred_lr, title="Logistic Regression - Confusion Matrix", ax=ax)
plt.tight_layout()
plt.show()

In [None]:
# Classification report
print("Classification Report - Logistic Regression:")
display(get_classification_report_df(y_test.values, y_pred_lr))

In [None]:
from IPython.display import Markdown, display

display(
    Markdown(
        f"""
### Interpretation: Baseline Model (Logistic Regression)

- **AUC-PR**: `{lr_metrics.get('auc_pr', 0):.4f}` (primary metric for imbalanced data)
- **F1-Score**: `{lr_metrics.get('f1', 0):.4f}`
- **Precision**: `{lr_metrics.get('precision', 0):.4f}` | **Recall**: `{lr_metrics.get('recall', 0):.4f}`
- **Confusion Matrix**: TP={lr_metrics.get('tp', 0):,}, FP={lr_metrics.get('fp', 0):,}, FN={lr_metrics.get('fn', 0):,}, TN={lr_metrics.get('tn', 0):,}

**Assessment**
- Logistic Regression provides an interpretable baseline with coefficients we can analyze.
- SMOTE oversamples the minority (fraud) class during training to counter class imbalance.
- We'll compare this with an ensemble model to see if added complexity improves performance.
"""
    )
)
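
To illustrate what "coefficients we can analyze" could look like in practice, here is a hedged sketch on synthetic data. It assumes the pipeline exposes its estimator under a step named `classifier` (the name used for the Random Forest pipeline later in this notebook); the data, the feature names `f0`..`f3`, and `demo_pipeline` are placeholders, not the project's objects.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; feature names are placeholders, not the project's.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=42,
)
feature_names = ["f0", "f1", "f2", "f3"]

# Step names are an assumption about how the project pipeline is laid out.
demo_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_demo, y_demo)

# One coefficient per (scaled) feature; exp(coef) is the odds ratio per unit change.
coefs = demo_pipeline.named_steps["classifier"].coef_.ravel()
coef_df = (
    pd.DataFrame({"feature": feature_names, "coefficient": coefs})
    .assign(odds_ratio=lambda d: np.exp(d["coefficient"]))
    .sort_values("coefficient", key=np.abs, ascending=False)
    .reset_index(drop=True)
)
print(coef_df)
```

Because the inputs are standardized, coefficient magnitudes are comparable across features. With one-hot encoded categoricals, each category gets its own coefficient.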

## 4. Ensemble Model: Random Forest

In [None]:
# Build Random Forest pipeline with SMOTE
rf_pipeline = build_fraud_pipeline(
    model_type="random_forest",
    use_smote=True,
    numeric_features=NUMERIC_FEATURES,
    categorical_features=CATEGORICAL_FEATURES,
    random_state=RANDOM_STATE,
    n_estimators=100,
    max_depth=10,
)

print("Random Forest Pipeline:")
print(rf_pipeline)

In [None]:
# Train and evaluate
print("Training Random Forest...")
rf_model, rf_metrics, rf_threshold = train_and_evaluate(
    rf_pipeline,
    X_train, y_train,
    X_test, y_test,
    threshold=0.5,
)

print("\nRandom Forest Results:")
for key, value in rf_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Confusion matrix for ensemble
y_pred_rf = rf_model.predict(X_test)

fig, ax = plt.subplots(figsize=(6, 5))
plot_confusion_matrix(y_test.values, y_pred_rf, title="Random Forest - Confusion Matrix", ax=ax)
plt.tight_layout()
plt.show()

In [None]:
# Classification report
print("Classification Report - Random Forest:")
display(get_classification_report_df(y_test.values, y_pred_rf))

In [None]:
from IPython.display import Markdown, display

display(
    Markdown(
        f"""
### Interpretation: Ensemble Model (Random Forest)

- **AUC-PR**: `{rf_metrics.get('auc_pr', 0):.4f}`
- **F1-Score**: `{rf_metrics.get('f1', 0):.4f}`
- **Precision**: `{rf_metrics.get('precision', 0):.4f}` | **Recall**: `{rf_metrics.get('recall', 0):.4f}`
- **Confusion Matrix**: TP={rf_metrics.get('tp', 0):,}, FP={rf_metrics.get('fp', 0):,}, FN={rf_metrics.get('fn', 0):,}, TN={rf_metrics.get('tn', 0):,}

**Assessment**
- Random Forest can capture non-linear relationships and feature interactions.
- Tree ensembles often outperform linear models when fraud patterns are non-linear; the comparison in Section 5 checks whether that holds here.
- We can extract feature importances for interpretability (see Section 6).
"""
    )
)

## 5. Model Comparison

In [None]:
# Compare all models
results = {
    "Logistic Regression + SMOTE": lr_metrics,
    "Random Forest + SMOTE": rf_metrics,
}

comparison_df = compare_models(results)
print("Model Comparison (sorted by AUC-PR):")
display(comparison_df.round(4))
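
The project's `compare_models` helper is not shown in this notebook. As a rough sketch of what such a helper might do (assuming it returns a DataFrame indexed by model name and sorted by AUC-PR, which is what the later `comparison_df.index[0]` lookup relies on), consider the following. The function name `compare_models_sketch` and the metric values are hypothetical placeholders, not real results.

```python
import pandas as pd


def compare_models_sketch(results, sort_by="auc_pr"):
    """Hypothetical stand-in: one row per model, best `sort_by` value first."""
    return pd.DataFrame.from_dict(results, orient="index").sort_values(
        sort_by, ascending=False
    )


# Placeholder numbers for illustration only; they are not measured results.
demo_results = {
    "Model A": {"auc_pr": 0.40, "f1": 0.35},
    "Model B": {"auc_pr": 0.55, "f1": 0.50},
}

demo_comparison = compare_models_sketch(demo_results)
best_demo_model = demo_comparison.index[0]
print(demo_comparison)
print(f"Best by AUC-PR: {best_demo_model}")
```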

In [None]:
# Precision-Recall curves comparison
fig, ax = plt.subplots(figsize=(10, 6))

y_proba_lr = lr_model.predict_proba(X_test)[:, 1]
y_proba_rf = rf_model.predict_proba(X_test)[:, 1]

plot_precision_recall_curve(y_test.values, y_proba_lr, model_name="Logistic Regression", ax=ax)
plot_precision_recall_curve(y_test.values, y_proba_rf, model_name="Random Forest", ax=ax)

ax.legend(loc="best")
ax.set_title("Precision-Recall Curves - Model Comparison")
plt.tight_layout()
plt.show()

In [None]:
# Side-by-side confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

plot_confusion_matrix(y_test.values, y_pred_lr, title="Logistic Regression", ax=axes[0])
plot_confusion_matrix(y_test.values, y_pred_rf, title="Random Forest", ax=axes[1])

plt.tight_layout()
plt.show()

In [None]:
from IPython.display import Markdown, display

# Determine best model
best_model_name = comparison_df.index[0]
best_auc_pr = comparison_df.iloc[0]['auc_pr']
best_f1 = comparison_df.iloc[0]['f1']

lr_auc = lr_metrics.get('auc_pr', 0)
rf_auc = rf_metrics.get('auc_pr', 0)
improvement = ((rf_auc - lr_auc) / lr_auc * 100) if lr_auc > 0 else 0

display(
    Markdown(
        f"""
### Interpretation: Model Comparison

| Model | AUC-PR | F1 | Precision | Recall |
|-------|--------|----|-----------|---------|
| Logistic Regression | {lr_metrics.get('auc_pr', 0):.4f} | {lr_metrics.get('f1', 0):.4f} | {lr_metrics.get('precision', 0):.4f} | {lr_metrics.get('recall', 0):.4f} |
| Random Forest | {rf_metrics.get('auc_pr', 0):.4f} | {rf_metrics.get('f1', 0):.4f} | {rf_metrics.get('precision', 0):.4f} | {rf_metrics.get('recall', 0):.4f} |

**Best Model**: `{best_model_name}` with AUC-PR = `{best_auc_pr:.4f}`

**Key Observations**
- Relative to the baseline, the ensemble model's AUC-PR is {'higher' if rf_auc > lr_auc else 'lower' if rf_auc < lr_auc else 'unchanged'} ({improvement:+.1f}%).
- Both models were trained with SMOTE to handle class imbalance.
- The PR curves show the precision-recall trade-off at different thresholds.
"""
    )
)
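
Both models above are evaluated at a fixed threshold of 0.5, but the PR curve can also be used to pick a different operating point. The sketch below selects the threshold that maximizes F1 on toy data. The labels and scores are illustrative, and in a real workflow the threshold should be chosen on a validation split rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.55, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision/recall have one more entry than thresholds (a final point with no
# threshold attached), so drop the last element to align the arrays.
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_scores = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best_idx = int(np.argmax(f1_scores))
best_threshold = float(thresholds[best_idx])
best_f1 = float(f1_scores[best_idx])

print(f"Best threshold by F1: {best_threshold:.2f} (F1={best_f1:.4f})")
```

If missed fraud is costlier than false alarms, a recall-weighted criterion would be a reasonable alternative to plain F1.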

## 6. Feature Importance (Random Forest)

In [None]:
# Extract feature importance from Random Forest
# Get feature names after preprocessing
preprocessor = rf_model.named_steps['preprocessor']
cat_encoder = preprocessor.named_transformers_['cat']
cat_feature_names = list(cat_encoder.get_feature_names_out(CATEGORICAL_FEATURES))
all_feature_names = NUMERIC_FEATURES + cat_feature_names

# Get importances
rf_classifier = rf_model.named_steps['classifier']
importances = rf_classifier.feature_importances_

# Create DataFrame
importance_df = pd.DataFrame({
    'feature': all_feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Top 15 Features by Importance:")
display(importance_df.head(15))

In [None]:
# Plot top 15 features
top_n = 15
top_features = importance_df.head(top_n)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(top_features['feature'], top_features['importance'], color='#3498db')
ax.set_xlabel('Importance')
ax.set_ylabel('Feature')
ax.set_title(f'Top {top_n} Feature Importances (Random Forest)')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
from IPython.display import Markdown, display

top5 = importance_df.head(5)['feature'].tolist()

display(
    Markdown(
        f"""
### Interpretation: Feature Importance

**Top 5 most important features**:
1. `{top5[0]}`
2. `{top5[1]}`
3. `{top5[2]}`
4. `{top5[3]}`
5. `{top5[4]}`

**Insights**
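- Engineered features from Task 1 include the time-based (`hour_of_day`, `day_of_week`, `time_since_signup`) and velocity (`tx_count_user_id_1h`, `tx_count_user_id_24h`) features; if they appear in the top 5 above, that supports the Task 1 feature engineering.
- Impurity-based importances can favor continuous and high-cardinality features, so treat this ranking as indicative rather than definitive.
- We'll explore these relationships further with SHAP in Task 3.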
"""
    )
)

## 7. Cross-Validation

In [None]:
# Stratified 5-fold CV for more reliable estimates
print("Running 5-fold Stratified Cross-Validation for Random Forest...")
print("(This may take a few minutes)")

cv_results = cross_validate_model(
    rf_pipeline,
    X, y,
    cv=5,
    random_state=RANDOM_STATE,
)

print("\nCross-Validation Results (5 folds):")
print(f"  AUC-PR: {cv_results['mean_metrics']['auc_pr']:.4f} ± {cv_results['std_metrics']['auc_pr']:.4f}")
print(f"  F1:     {cv_results['mean_metrics']['f1']:.4f} ± {cv_results['std_metrics']['f1']:.4f}")
print(f"  ROC-AUC: {cv_results['mean_metrics']['roc_auc']:.4f} ± {cv_results['std_metrics']['roc_auc']:.4f}")

In [None]:
# Show per-fold results
cv_df = pd.DataFrame(cv_results['fold_metrics'])
display(cv_df[['fold', 'auc_pr', 'f1', 'precision', 'recall', 'roc_auc']].round(4))
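
The project's `cross_validate_model` helper is not shown here. A comparable result can be sketched with scikit-learn's own `cross_validate` and `StratifiedKFold`, as below. The synthetic data and the plain `RandomForestClassifier` (without SMOTE or preprocessing) are simplifications for illustration; with resampling, the sampler should sit inside the pipeline so that it only ever sees the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data as a stand-in for the fraud dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the fraud rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring={"auc_pr": "average_precision", "f1": "f1", "roc_auc": "roc_auc"},
)

# Mean and standard deviation across the 5 folds for each metric.
summary = {
    name: (scores[f"test_{name}"].mean(), scores[f"test_{name}"].std())
    for name in ("auc_pr", "f1", "roc_auc")
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```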

## 8. Save Best Model

In [None]:
# Save both trained pipelines; comparison_df shows which one scored higher on AUC-PR
models_dir = project_root / "models"

# Save Random Forest
rf_path = save_model(rf_model, "fraud_random_forest", rf_metrics, models_dir)
print(f"Random Forest saved to: {rf_path}")

# Also save Logistic Regression (for comparison / interpretability)
lr_path = save_model(lr_model, "fraud_logistic_regression", lr_metrics, models_dir)
print(f"Logistic Regression saved to: {lr_path}")
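
For readers without access to the project's `save_model` helper, the sketch below shows one plausible way to persist a fitted model with `joblib` next to its metrics. The function name `save_model_sketch` and the `_metrics.json` file are hypothetical; only the `.joblib` extension is taken from this notebook. It also assumes the metrics are plain Python numbers that `json` can serialize.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression


def save_model_sketch(model, name, metrics, models_dir):
    """Hypothetical stand-in: write <name>.joblib and <name>_metrics.json."""
    models_dir = Path(models_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    model_path = models_dir / f"{name}.joblib"
    joblib.dump(model, model_path)
    (models_dir / f"{name}_metrics.json").write_text(json.dumps(metrics, indent=2))
    return model_path


# Round-trip check on a tiny toy model inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
    saved_path = save_model_sketch(toy_model, "demo_model", {"auc_pr": 0.5}, tmp)
    reloaded = joblib.load(saved_path)
    saved_name = saved_path.name
    same_predictions = bool(
        (reloaded.predict([[0.0], [3.0]]) == toy_model.predict([[0.0], [3.0]])).all()
    )

print(saved_name, same_predictions)
```

Saving the whole fitted pipeline, rather than only the classifier, keeps preprocessing and model together so that new data can be scored with a single `predict` call.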

## 9. Summary: Task 2 (E-commerce) Complete

In [None]:
from IPython.display import Markdown, display

best_model_name = comparison_df.index[0]
best_metrics = results[best_model_name]

cv_auc_mean = cv_results['mean_metrics']['auc_pr']
cv_auc_std = cv_results['std_metrics']['auc_pr']

display(
    Markdown(
        f"""
## Summary: Task 2 - E-commerce Fraud Model

### Models Trained
1. **Logistic Regression + SMOTE** (baseline, interpretable)
2. **Random Forest + SMOTE** (ensemble, captures non-linearity)

### Best Model: `{best_model_name}`
- **Hold-out Test AUC-PR**: `{best_metrics.get('auc_pr', 0):.4f}`
- **Hold-out Test F1**: `{best_metrics.get('f1', 0):.4f}`
- **5-Fold CV AUC-PR (Random Forest pipeline)**: `{cv_auc_mean:.4f} ± {cv_auc_std:.4f}`

### Model Selection Justification
- `{best_model_name}` was selected because it achieved the highest hold-out AUC-PR, the primary metric for imbalanced data.
- Cross-validation gives a fold-to-fold AUC-PR standard deviation of `{cv_auc_std:.4f}`; a small value relative to the mean indicates stable performance.
- Feature importances (Section 6) show which features the Random Forest relies on most.

### Files Saved
- `models/fraud_random_forest.joblib`: Random Forest (ensemble) model
- `models/fraud_logistic_regression.joblib`: Logistic Regression (baseline) model

### Next Steps
- Run `05_modeling_creditcard.ipynb` for the credit card dataset
- Proceed to Task 3 for SHAP explainability analysis
"""
    )
)