# IEEE-CIS Fraud Detection - Model Training with MLflow

## Overview
This notebook trains and evaluates multiple models for fraud detection with MLflow experiment tracking.
We implement production-grade practices including time-based splits, class imbalance handling,
and comprehensive metric logging.

## Modeling Strategy

### Why Time-Based Split (Not Random)
In production, models are trained on historical data and predict future transactions.
Random splits would leak future information and give overly optimistic results.
Time-based splits simulate real deployment conditions.

### Handling Class Imbalance
With a fraud rate of ~3.5%, we handle the imbalance as follows:
- **Class weights** (used): Scale the loss function to penalize minority class errors more
- **SMOTE** (considered, not used): Avoided due to its potential for creating unrealistic samples
- **Threshold tuning** (used): Adjust the classification threshold for the desired precision-recall trade-off

### Models Compared
1. **LightGBM**: Fast, handles categorical features natively, good for large datasets
2. **XGBoost**: Robust, excellent regularization, widely used in production
3. **Random Forest**: Baseline ensemble, interpretable feature importance

### Evaluation Metrics
- **Primary**: Precision-Recall AUC, computed as average precision (more informative than ROC-AUC under class imbalance)
- **Secondary**: F1-score, ROC-AUC, Recall at high precision
- **Business Context**: False positives cost customer friction; false negatives cost fraud losses
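
"Recall at high precision" is listed above but has no dedicated helper in this notebook. The following is a minimal sketch of one way to compute it, assuming a hypothetical helper name `recall_at_precision` and an illustrative precision floor of 0.9; the toy labels and scores are made up for demonstration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, y_score, min_precision=0.9):
    """Return (recall, threshold) with the highest recall whose precision >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    ok = precision >= min_precision
    if not ok.any():
        return 0.0, None
    # Mask out thresholds below the precision floor, then take the best recall.
    best = np.argmax(np.where(ok, recall, -1.0))
    return float(recall[best]), float(thresholds[best])

# Toy data: 4 frauds among 10 transactions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.80, 0.40, 0.70, 0.90, 0.95])

rec, thr = recall_at_precision(y_true, y_score, min_precision=0.9)
print(f"Recall at precision >= 0.9: {rec:.2f} (threshold {thr:.2f})")
```

In this toy case one legitimate transaction scores 0.80, so only the two highest-scoring frauds can be flagged without dropping below the precision floor.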

In [None]:
# Standard library imports
import os
import sys
import warnings
import pickle
import json
from pathlib import Path
from datetime import datetime

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    roc_auc_score, precision_recall_curve, average_precision_score,
    f1_score, precision_score, recall_score, confusion_matrix,
    classification_report, roc_curve
)
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import xgboost as xgb

# MLflow
import mlflow
import mlflow.sklearn
import mlflow.lightgbm
import mlflow.xgboost

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Define paths
BASE_PATH = Path('..').resolve()
PROCESSED_PATH = BASE_PATH / 'Data' / 'processed'
FEATURES_PATH = BASE_PATH / 'Data' / 'features'
OUTPUT_PATH = BASE_PATH / 'outputs'
MODELS_PATH = OUTPUT_PATH / 'models'
MLRUNS_PATH = BASE_PATH / 'mlruns'

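# Create output directories (plots and metrics are written to these later)
for path in (MODELS_PATH, OUTPUT_PATH / 'visuals', OUTPUT_PATH / 'metrics'):
    path.mkdir(parents=True, exist_ok=True)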

# Add src to path
sys.path.insert(0, str(BASE_PATH / 'src'))

print(f"Base Path: {BASE_PATH}")
print(f"MLflow Tracking URI: {MLRUNS_PATH}")

In [None]:
# Configure MLflow
# Path.as_uri() builds a well-formed file:// URI on both POSIX and Windows
mlflow.set_tracking_uri(MLRUNS_PATH.as_uri())
experiment_name = "fraud_detection_ieee"

# Create or get experiment
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = mlflow.create_experiment(
        experiment_name,
        artifact_location=str(MLRUNS_PATH / 'artifacts')
    )
else:
    experiment_id = experiment.experiment_id

mlflow.set_experiment(experiment_name)
print(f"MLflow Experiment: {experiment_name} (ID: {experiment_id})")

## 1. Load Processed Data

In [None]:
# Load processed data
print("Loading processed data...")
train_df = pd.read_parquet(PROCESSED_PATH / 'train_processed.parquet')
print(f"Training data shape: {train_df.shape}")

# Load feature artifacts
with open(FEATURES_PATH / 'feature_artifacts.pkl', 'rb') as f:
    feature_artifacts = pickle.load(f)

feature_cols = feature_artifacts['feature_cols']
print(f"Number of features: {len(feature_cols)}")

In [None]:
# Verify target distribution
print("\nTarget Distribution:")
print(train_df['isFraud'].value_counts())
print(f"\nFraud Rate: {train_df['isFraud'].mean()*100:.2f}%")

## 2. Time-Based Train/Validation Split

### Why Time-Based Split?
In production fraud detection:
1. Models are trained on historical data
2. Models predict on future, unseen transactions
3. Random splits would include future transactions in training (data leakage)
4. Time-based splits simulate real deployment and reveal concept drift issues

We use TransactionDT (time delta) to create a temporal split:
- Training: First 80% of transactions (by time)
- Validation: Last 20% of transactions (by time)
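
As a quick sanity check on the idea above, the sketch below builds a toy frame (made-up data, not the IEEE-CIS set) and confirms that a time-ordered 80/20 split keeps every training timestamp before every validation timestamp, while a random split of the same rows does not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy frame with a shuffled time column (illustrative data only)
toy = pd.DataFrame({
    "TransactionDT": rng.permutation(np.arange(1000)),
    "isFraud": rng.integers(0, 2, size=1000),
})

# Time-based split: sort by time, then cut at 80%
toy_sorted = toy.sort_values("TransactionDT").reset_index(drop=True)
split_idx = int(len(toy_sorted) * 0.8)
toy_train, toy_val = toy_sorted.iloc[:split_idx], toy_sorted.iloc[split_idx:]

# Every training timestamp precedes every validation timestamp.
no_overlap = toy_train["TransactionDT"].max() < toy_val["TransactionDT"].min()

# Random split of the same rows: the two periods interleave.
shuffled = toy.sample(frac=1.0, random_state=0)
rand_train, rand_val = shuffled.iloc[:split_idx], shuffled.iloc[split_idx:]
leaks = rand_train["TransactionDT"].max() > rand_val["TransactionDT"].min()

print(f"Time-based split free of overlap: {no_overlap}")
print(f"Random split overlaps in time:    {leaks}")
```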

In [None]:
def time_based_split(df, time_col='TransactionDT', train_ratio=0.8):
    """
    Split data based on time to simulate production conditions.
    
    This is critical for fraud detection because:
    1. Fraud patterns evolve over time (concept drift)
    2. Production models always predict on future data
    3. Random splits give overly optimistic performance estimates
    
    Args:
        df: DataFrame with time column
        time_col: column containing temporal information
        train_ratio: proportion of data for training
    
    Returns:
        train_df, val_df DataFrames
    """
    # Sort by time
    df_sorted = df.sort_values(time_col).reset_index(drop=True)
    
    # Calculate split point
    split_idx = int(len(df_sorted) * train_ratio)
    
    train = df_sorted.iloc[:split_idx]
    val = df_sorted.iloc[split_idx:]
    
    print(f"Time-based split:")
    print(f"  Training: {len(train):,} samples ({len(train)/len(df)*100:.1f}%)")
    print(f"  Validation: {len(val):,} samples ({len(val)/len(df)*100:.1f}%)")
    print(f"  Training fraud rate: {train['isFraud'].mean()*100:.2f}%")
    print(f"  Validation fraud rate: {val['isFraud'].mean()*100:.2f}%")
    
    return train, val

# Apply time-based split
train_data, val_data = time_based_split(train_df)

In [None]:
# Prepare features and target
# Keep only the listed features that are actually present in the data
available_features = [c for c in feature_cols if c in train_data.columns]
print(f"Available features: {len(available_features)}")

X_train = train_data[available_features]
y_train = train_data['isFraud']
X_val = val_data[available_features]
y_val = val_data['isFraud']

print(f"\nX_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")

## 3. Class Weight Calculation

### Why Class Weights Instead of SMOTE?

**SMOTE (Synthetic Minority Over-sampling Technique)**:
- Creates synthetic samples by interpolating between existing minority samples
- Risk: Can create unrealistic fraud patterns that don't exist in real data
- Memory intensive for large datasets
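
To make the interpolation risk concrete, here is a simplified sketch of the core SMOTE step on two hypothetical fraud rows. Real SMOTE picks the second point from the k nearest minority neighbours; that selection is omitted here, and the feature values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical fraud rows: [transaction amount, label-encoded card type]
x_a = np.array([120.0, 1.0])
x_b = np.array([480.0, 4.0])

# SMOTE-style synthesis: a random point on the segment between the two rows
lam = rng.uniform(0.0, 1.0)
x_new = x_a + lam * (x_b - x_a)

print(f"Synthetic row: amount={x_new[0]:.2f}, card type code={x_new[1]:.2f}")
```

The synthetic card type code lands between 1 and 4 and is generally not an integer, so it does not correspond to any real category. This is the kind of unrealistic pattern the bullet above refers to.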

**Class Weights**:
- Adjusts the loss function to penalize minority class errors more heavily
- No synthetic data creation, preserves data integrity
- Computationally efficient
- Preferred for fraud detection where false patterns can be dangerous

We calculate balanced class weights: `weight = n_samples / (n_classes * n_class_samples)`
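
As a worked example of the formula, the sketch below uses a toy target with 3.5% fraud (965 legitimate, 35 fraud; illustrative numbers only), computes the weights by hand, and compares them with scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 965 legitimate and 35 fraud rows (illustrative only)
y = np.array([0] * 965 + [1] * 35)

n_samples, n_classes = len(y), 2
# weight_i = n_samples / (n_classes * n_samples_i)
manual = {c: n_samples / (n_classes * np.sum(y == c)) for c in (0, 1)}
sk = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(f"Manual weights:       {manual}")
print(f"scikit-learn weights: {dict(zip([0, 1], sk))}")
# The ratio of the two weights reduces to n_negative / n_positive
print(f"Weight ratio: {manual[1] / manual[0]:.4f} vs 965/35 = {965 / 35:.4f}")
```

The ratio of the fraud weight to the legitimate weight simplifies to `n_negative / n_positive`, which is the quantity passed as `scale_pos_weight` to the boosted models.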

In [None]:
def calculate_class_weights(y):
    """
    Calculate balanced class weights for imbalanced classification.
    
    Formula: weight_i = n_samples / (n_classes * n_samples_i)
    This gives higher weight to minority class.
    
    Args:
        y: target array
    
    Returns:
        dict mapping class labels to weights
    """
    from sklearn.utils.class_weight import compute_class_weight
    
    classes = np.unique(y)
    weights = compute_class_weight('balanced', classes=classes, y=y)
    class_weights = dict(zip(classes, weights))
    
    print(f"Class weights:")
    print(f"  Class 0 (Legitimate): {class_weights[0]:.4f}")
    print(f"  Class 1 (Fraud): {class_weights[1]:.4f}")
    
    return class_weights

class_weights = calculate_class_weights(y_train)

# scale_pos_weight for LightGBM/XGBoost: this ratio equals n_negative / n_positive
scale_pos_weight = class_weights[1] / class_weights[0]
print(f"\nScale pos weight for LightGBM/XGBoost: {scale_pos_weight:.4f}")

## 4. Evaluation Metrics Functions

Comprehensive metrics for fraud detection evaluation.

In [None]:
def evaluate_model(y_true, y_pred_proba, y_pred=None, threshold=0.5):
    """
    Comprehensive model evaluation for fraud detection.
    
    Args:
        y_true: actual labels
        y_pred_proba: predicted probabilities
        y_pred: predicted labels (optional, computed from threshold if not provided)
        threshold: classification threshold
    
    Returns:
        dict of metrics
    """
    if y_pred is None:
        y_pred = (y_pred_proba >= threshold).astype(int)
    
    metrics = {
        'roc_auc': roc_auc_score(y_true, y_pred_proba),
        'pr_auc': average_precision_score(y_true, y_pred_proba),
        'f1': f1_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'threshold': threshold
    }
    
    # Confusion matrix values
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics['true_negatives'] = int(tn)
    metrics['false_positives'] = int(fp)
    metrics['false_negatives'] = int(fn)
    metrics['true_positives'] = int(tp)
    
    # Business metrics
    # False Positive Rate: legitimate transactions flagged as fraud (customer friction)
    metrics['fpr'] = fp / (fp + tn) if (fp + tn) > 0 else 0
    # False Negative Rate: fraud transactions missed (fraud loss)
    metrics['fnr'] = fn / (fn + tp) if (fn + tp) > 0 else 0
    
    return metrics

def find_optimal_threshold(y_true, y_pred_proba, metric='f1'):
    """
    Find optimal classification threshold based on specified metric.
    
    For fraud detection:
    - Higher threshold = fewer false positives, more false negatives
    - Lower threshold = more false positives, fewer false negatives
    
    Args:
        y_true: actual labels
        y_pred_proba: predicted probabilities
        metric: optimization target:
            'f1' maximizes F1;
            'precision' maximizes recall subject to precision >= 0.5
    
    Returns:
        optimal_threshold, best_score
        (falls back to (0.5, 0.0) if no threshold qualifies)
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
    
    # precision/recall have one more entry than thresholds: the final point
    # (precision=1, recall=0) has no threshold, so drop it to keep indices aligned
    precision, recall = precision[:-1], recall[:-1]
    
    if metric == 'f1':
        # Calculate F1 for each threshold (epsilon avoids division by zero)
        f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
        best_idx = np.argmax(f1_scores)
        return thresholds[best_idx], f1_scores[best_idx]
    elif metric == 'precision':
        # Among thresholds with precision >= 0.5, pick the one with best recall
        valid_idx = precision >= 0.5
        if valid_idx.sum() > 0:
            best_idx = np.where(valid_idx)[0][np.argmax(recall[valid_idx])]
            return thresholds[best_idx], precision[best_idx]
    
    return 0.5, 0.0

In [None]:
def plot_evaluation_curves(y_true, y_pred_proba, model_name, save_path=None):
    """
    Plot ROC and Precision-Recall curves.
    
    For imbalanced fraud detection:
    - ROC AUC can look strong even when precision on the fraud class is poor,
      because the abundant true negatives keep the false positive rate low
    - PR AUC is more informative for the minority class
    
    Args:
        y_true: actual labels
        y_pred_proba: predicted probabilities
        model_name: name for title/legend
        save_path: optional path to save figure
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_pred_proba)
    roc_auc = roc_auc_score(y_true, y_pred_proba)
    
    axes[0].plot(fpr, tpr, color='darkorange', lw=2, 
                 label=f'ROC curve (AUC = {roc_auc:.4f})')
    axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
    axes[0].set_xlim([0.0, 1.0])
    axes[0].set_ylim([0.0, 1.05])
    axes[0].set_xlabel('False Positive Rate', fontsize=12)
    axes[0].set_ylabel('True Positive Rate', fontsize=12)
    axes[0].set_title(f'{model_name} - ROC Curve', fontsize=14, fontweight='bold')
    axes[0].legend(loc='lower right')
    
    # Precision-Recall Curve
    precision, recall, _ = precision_recall_curve(y_true, y_pred_proba)
    pr_auc = average_precision_score(y_true, y_pred_proba)
    baseline = y_true.sum() / len(y_true)
    
    axes[1].plot(recall, precision, color='green', lw=2,
                 label=f'PR curve (AUC = {pr_auc:.4f})')
    axes[1].axhline(y=baseline, color='navy', linestyle='--', 
                    label=f'Baseline = {baseline:.4f}')
    axes[1].set_xlim([0.0, 1.0])
    axes[1].set_ylim([0.0, 1.05])
    axes[1].set_xlabel('Recall', fontsize=12)
    axes[1].set_ylabel('Precision', fontsize=12)
    axes[1].set_title(f'{model_name} - Precision-Recall Curve', fontsize=14, fontweight='bold')
    axes[1].legend(loc='upper right')
    
    plt.tight_layout()
    
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
    
    plt.show()
    
    return fig

## 5. Model Training with MLflow Tracking

### MLflow Benefits for Fraud Detection:
1. **Experiment Tracking**: Compare multiple model configurations
2. **Reproducibility**: Log parameters, code versions, data versions
3. **Model Registry**: Version and stage models for production
4. **Artifact Storage**: Save models, plots, feature importance

In [None]:
def train_and_log_model(model, model_name, X_train, y_train, X_val, y_val, 
                        params, model_type='sklearn'):
    """
    Train model and log everything to MLflow.
    
    Args:
        model: model instance
        model_name: name for logging
        X_train, y_train: training data
        X_val, y_val: validation data
        params: model parameters dict
        model_type: 'sklearn', 'lightgbm', or 'xgboost'
    
    Returns:
        trained model, metrics dict, run_id
    """
    with mlflow.start_run(run_name=model_name) as run:
        run_id = run.info.run_id
        print(f"\n{'='*60}")
        print(f"Training: {model_name}")
        print(f"MLflow Run ID: {run_id}")
        print(f"{'='*60}")
        
        # Log parameters
        mlflow.log_params(params)
        mlflow.log_param('n_features', X_train.shape[1])
        mlflow.log_param('n_train_samples', X_train.shape[0])
        mlflow.log_param('n_val_samples', X_val.shape[0])
        mlflow.log_param('fraud_rate_train', y_train.mean())
        mlflow.log_param('fraud_rate_val', y_val.mean())
        
        # Train model
        start_time = datetime.now()
        
        if model_type == 'lightgbm':
            # LightGBM with early stopping
            model.fit(
                X_train, y_train,
                eval_set=[(X_val, y_val)],
                callbacks=[lgb.early_stopping(100, verbose=False)]
            )
        elif model_type == 'xgboost':
            # XGBoost with early stopping
            model.fit(
                X_train, y_train,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
        else:
            model.fit(X_train, y_train)
        
        training_time = (datetime.now() - start_time).total_seconds()
        mlflow.log_metric('training_time_seconds', training_time)
        print(f"Training time: {training_time:.2f} seconds")
        
        # Predictions
        y_pred_proba = model.predict_proba(X_val)[:, 1]
        
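        # Find the F1-optimal threshold.
        # Note: it is tuned on the same validation set used for evaluation (and
        # for early stopping), so the *_optimal metrics are optimistic; a separate
        # holdout period would give an unbiased estimate.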
        optimal_threshold, best_f1 = find_optimal_threshold(y_val, y_pred_proba)
        print(f"Optimal threshold: {optimal_threshold:.4f}")
        
        # Evaluate with default and optimal thresholds
        metrics_default = evaluate_model(y_val, y_pred_proba, threshold=0.5)
        metrics_optimal = evaluate_model(y_val, y_pred_proba, threshold=optimal_threshold)
        
        # Log metrics (np.number also covers NumPy scalars such as float32 thresholds)
        for key, value in metrics_default.items():
            if isinstance(value, (int, float, np.number)):
                mlflow.log_metric(f'{key}_default', float(value))
        
        for key, value in metrics_optimal.items():
            if isinstance(value, (int, float, np.number)):
                mlflow.log_metric(f'{key}_optimal', float(value))
        
        # Print metrics
        print(f"\nMetrics (threshold=0.5):")
        print(f"  ROC-AUC: {metrics_default['roc_auc']:.4f}")
        print(f"  PR-AUC: {metrics_default['pr_auc']:.4f}")
        print(f"  F1: {metrics_default['f1']:.4f}")
        print(f"  Precision: {metrics_default['precision']:.4f}")
        print(f"  Recall: {metrics_default['recall']:.4f}")
        
        print(f"\nMetrics (optimal threshold={optimal_threshold:.4f}):")
        print(f"  F1: {metrics_optimal['f1']:.4f}")
        print(f"  Precision: {metrics_optimal['precision']:.4f}")
        print(f"  Recall: {metrics_optimal['recall']:.4f}")
        
        # Generate and log plots (make sure the target directory exists first)
        curves_path = OUTPUT_PATH / 'visuals' / f'{model_name.lower().replace(" ", "_")}_curves.png'
        curves_path.parent.mkdir(parents=True, exist_ok=True)
        fig = plot_evaluation_curves(y_val, y_pred_proba, model_name, save_path=curves_path)
        mlflow.log_artifact(str(curves_path))
        
        # Log model
        if model_type == 'lightgbm':
            mlflow.lightgbm.log_model(model, 'model')
        elif model_type == 'xgboost':
            mlflow.xgboost.log_model(model, 'model')
        else:
            mlflow.sklearn.log_model(model, 'model')
        
        # Add model to results
        metrics_optimal['model'] = model
        metrics_optimal['run_id'] = run_id
        metrics_optimal['model_name'] = model_name
        metrics_optimal['optimal_threshold'] = optimal_threshold
        
        return model, metrics_optimal, run_id

### 5.1 LightGBM Model

**Why LightGBM for Fraud Detection:**
- Handles large datasets efficiently (histogram-based split finding and leaf-wise tree growth; gradient-based one-side sampling is available as an option)
- Native categorical feature support
- Excellent handling of sparse features
- Fast training and inference

In [None]:
# LightGBM parameters
lgb_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'n_estimators': 1000,
    'learning_rate': 0.05,
    'max_depth': 8,
    'num_leaves': 64,
    'min_child_samples': 100,
    'scale_pos_weight': scale_pos_weight,  # Handle class imbalance
    'subsample': 0.8,
    'subsample_freq': 1,  # Row subsampling only takes effect when subsample_freq > 0
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'random_state': RANDOM_SEED,
    'n_jobs': -1,
    'verbose': -1
}

lgb_model = lgb.LGBMClassifier(**lgb_params)

lgb_model, lgb_metrics, lgb_run_id = train_and_log_model(
    lgb_model, 'LightGBM',
    X_train, y_train, X_val, y_val,
    lgb_params, model_type='lightgbm'
)

### 5.2 XGBoost Model

**Why XGBoost:**
- Strong regularization (prevents overfitting on noisy fraud data)
- Handles missing values natively
- Production-proven in many fraud detection systems
- GPU acceleration available for scaling

In [None]:
# XGBoost parameters
xgb_params = {
    'objective': 'binary:logistic',
    'n_estimators': 1000,
    'learning_rate': 0.05,
    'max_depth': 8,
    'min_child_weight': 100,
    'scale_pos_weight': scale_pos_weight,  # Handle class imbalance
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'random_state': RANDOM_SEED,
    'n_jobs': -1,
    'early_stopping_rounds': 100,
    'eval_metric': 'auc'
}

xgb_model = xgb.XGBClassifier(**xgb_params)

xgb_model, xgb_metrics, xgb_run_id = train_and_log_model(
    xgb_model, 'XGBoost',
    X_train, y_train, X_val, y_val,
    xgb_params, model_type='xgboost'
)

### 5.3 Random Forest Model

**Why Random Forest:**
- Interpretable feature importance
- Robust to outliers and noise
- Good baseline for comparison
- Handles non-linear relationships

In [None]:
# Random Forest parameters
rf_params = {
    'n_estimators': 200,
    'max_depth': 12,
    'min_samples_split': 100,
    'min_samples_leaf': 50,
    'max_features': 'sqrt',
    'class_weight': 'balanced',  # Handle class imbalance
    'random_state': RANDOM_SEED,
    'n_jobs': -1
}

rf_model = RandomForestClassifier(**rf_params)

rf_model, rf_metrics, rf_run_id = train_and_log_model(
    rf_model, 'Random Forest',
    X_train, y_train, X_val, y_val,
    rf_params, model_type='sklearn'
)

## 6. Model Comparison and Selection

In [None]:
# Compare all models
results = pd.DataFrame([
    {
        'Model': name,
        'ROC-AUC': m['roc_auc'],
        'PR-AUC': m['pr_auc'],
        'F1': m['f1'],
        'Precision': m['precision'],
        'Recall': m['recall'],
        'Threshold': m['optimal_threshold'],
        'Run_ID': run_id
    }
    for name, m, run_id in [
        ('LightGBM', lgb_metrics, lgb_run_id),
        ('XGBoost', xgb_metrics, xgb_run_id),
        ('Random Forest', rf_metrics, rf_run_id),
    ]
])

print("\n" + "="*80)
print("MODEL COMPARISON RESULTS")
print("="*80)
print(results.to_string(index=False))

In [None]:
# Visualization: Model Comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of key metrics
metrics_to_plot = ['ROC-AUC', 'PR-AUC', 'F1', 'Precision', 'Recall']
x = np.arange(len(metrics_to_plot))
width = 0.25

for i, model in enumerate(['LightGBM', 'XGBoost', 'Random Forest']):
    values = results[results['Model'] == model][metrics_to_plot].values[0]
    axes[0].bar(x + i*width, values, width, label=model)

axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_xticks(x + width)
axes[0].set_xticklabels(metrics_to_plot)
axes[0].legend()
axes[0].set_ylim(0, 1)

# PR-AUC comparison (primary metric)
colors = ['#3498db', '#2ecc71', '#e74c3c']
axes[1].bar(results['Model'], results['PR-AUC'], color=colors, edgecolor='black')
axes[1].set_ylabel('Precision-Recall AUC', fontsize=12)
axes[1].set_title('Primary Metric: PR-AUC Comparison', fontsize=14, fontweight='bold')
for i, (model, pr_auc) in enumerate(zip(results['Model'], results['PR-AUC'])):
    axes[1].text(i, pr_auc + 0.01, f'{pr_auc:.4f}', ha='center', fontsize=11)

plt.tight_layout()
plt.savefig(OUTPUT_PATH / 'visuals' / 'model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Select best model based on PR-AUC (primary metric for imbalanced classification)
best_model_idx = results['PR-AUC'].idxmax()
best_model_name = results.loc[best_model_idx, 'Model']
best_model_pr_auc = results.loc[best_model_idx, 'PR-AUC']
best_run_id = results.loc[best_model_idx, 'Run_ID']

# Get the actual best model object
if best_model_name == 'LightGBM':
    best_model = lgb_model
    best_metrics = lgb_metrics
elif best_model_name == 'XGBoost':
    best_model = xgb_model
    best_metrics = xgb_metrics
else:
    best_model = rf_model
    best_metrics = rf_metrics

print(f"\nBest Model: {best_model_name}")
print(f"PR-AUC: {best_model_pr_auc:.4f}")
print(f"Run ID: {best_run_id}")

## 7. Save Best Model

Save the best model with all artifacts needed for deployment.

In [None]:
# Save best model locally
model_artifacts = {
    'model': best_model,
    'model_name': best_model_name,
    'optimal_threshold': best_metrics['optimal_threshold'],
    'metrics': best_metrics,
    'feature_cols': available_features,
    'run_id': best_run_id,
    'training_date': datetime.now().isoformat()
}

with open(MODELS_PATH / 'best_model.pkl', 'wb') as f:
    pickle.dump(model_artifacts, f)

print(f"Best model saved to: {MODELS_PATH / 'best_model.pkl'}")

In [None]:
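# Save comparison results (make sure the metrics directory exists first)
metrics_dir = OUTPUT_PATH / 'metrics'
metrics_dir.mkdir(parents=True, exist_ok=True)
results.to_csv(metrics_dir / 'model_comparison.csv', index=False)
print(f"Results saved to: {metrics_dir / 'model_comparison.csv'}")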

## 8. Cross-Validation Analysis

Perform stratified cross-validation on the best model to estimate performance variance.
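
For comparison, a time-aware alternative can be sketched with scikit-learn's `TimeSeriesSplit`, which trains on an expanding window and tests on the block that follows it. The array below is a stand-in for rows already sorted by `TransactionDT`; it is illustrative only and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for rows already sorted by TransactionDT (illustrative only)
n_rows = 1000
X_sorted = np.arange(n_rows).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_sorted))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Each fold trains only on rows that come before its test block.
    print(f"Fold {fold}: train rows 0-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```

Because the combined train and validation frames in this notebook are concatenated in time order, a splitter like this could in principle be passed as the `cv` argument of `cross_val_score` in place of the shuffled stratified folds.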

In [None]:
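# Stratified K-Fold cross-validation
# Note: shuffled folds mix past and future transactions, so these scores are
# optimistic relative to the time-based split above.
# In production, we'd use time-series cross-validation;
# here we demonstrate stratified CV for completeness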

print("Performing 5-fold stratified cross-validation on best model...")

# Combine train and val for CV
X_full = pd.concat([X_train, X_val], axis=0)
y_full = pd.concat([y_train, y_val], axis=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED)

# Use a fresh model instance for CV
if best_model_name == 'LightGBM':
    cv_model = lgb.LGBMClassifier(**lgb_params)
elif best_model_name == 'XGBoost':
    cv_model = xgb.XGBClassifier(**{k:v for k,v in xgb_params.items() if k != 'early_stopping_rounds'})
else:
    cv_model = RandomForestClassifier(**rf_params)

# Cross-validation scores
cv_scores = cross_val_score(cv_model, X_full, y_full, cv=cv, scoring='roc_auc', n_jobs=-1)

print(f"\nCross-Validation ROC-AUC Scores:")
print(f"  Scores: {cv_scores}")
print(f"  Mean: {cv_scores.mean():.4f}")
print(f"  Std: {cv_scores.std():.4f}")

## 9. Summary and Next Steps

### Key Findings

1. **Model Selection**: Selected best model based on PR-AUC (most appropriate for imbalanced fraud detection)

2. **Class Imbalance Handling**: Used class weights to adjust for ~3.5% fraud rate

3. **Time-Based Split**: Simulates production conditions and reveals temporal patterns

4. **MLflow Tracking**: All experiments logged for reproducibility and comparison

### Business Implications

- **Precision vs Recall Trade-off**: 
  - Higher threshold = fewer false positives (better customer experience)
  - Lower threshold = fewer false negatives (catch more fraud)
  - Optimal threshold depends on business costs of each error type
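
One way to make the cost dependence concrete is to pick the threshold that minimizes total expected cost. The sketch below assumes a hypothetical helper `min_cost_threshold` and made-up per-error costs; real costs would have to come from the business.

```python
import numpy as np

def min_cost_threshold(y_true, y_score, cost_fp, cost_fn, grid=None):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in grid:
        y_pred = y_score >= t
        fp = np.sum(y_pred & (y_true == 0))   # legitimate flagged as fraud
        fn = np.sum(~y_pred & (y_true == 1))  # fraud that was missed
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return float(grid[best]), float(costs[best])

# Toy data: 4 frauds among 10 transactions (illustrative only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.80, 0.40, 0.70, 0.90, 0.95])

# Hypothetical costs: a missed fraud is 20x as costly as a false alarm
thr_fn_heavy, cost_fn_heavy = min_cost_threshold(y_true, y_score, cost_fp=1, cost_fn=20)
# Hypothetical costs: a false alarm is 20x as costly as a missed fraud
thr_fp_heavy, cost_fp_heavy = min_cost_threshold(y_true, y_score, cost_fp=20, cost_fn=1)

print(f"Missed fraud expensive: threshold {thr_fn_heavy:.2f}, cost {cost_fn_heavy:.0f}")
print(f"False alarm expensive:  threshold {thr_fp_heavy:.2f}, cost {cost_fp_heavy:.0f}")
```

When missed fraud dominates the cost, the selected threshold drops so that more fraud is caught; when false alarms dominate, it rises.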

### Next Steps

1. Proceed to model interpretation notebook for explainability
2. Deploy model using MLflow model registry
3. Implement monitoring for concept drift

In [None]:
print("\n" + "="*60)
print("MODELING COMPLETE")
print("="*60)
print(f"\nBest Model: {best_model_name}")
print(f"PR-AUC: {best_model_pr_auc:.4f}")
print(f"Optimal Threshold: {best_metrics['optimal_threshold']:.4f}")
print(f"\nModel artifacts saved to: {MODELS_PATH}")
print(f"MLflow experiments at: {MLRUNS_PATH}")
print(f"\nTo view MLflow UI, run: mlflow ui --backend-store-uri {MLRUNS_PATH}")
print("\nNext steps: Proceed to 04_model_interpretation.ipynb")