# Fraud Detection ML Pipeline - Complete Workflow

## Overview
This notebook walks through the end-to-end machine learning pipeline for fraud detection, consolidating the project's modules and workflows in one place.

### Project Structure:
```
1.Fraud_Detection/
├── config/             # Configuration parameters
├── src/
│   ├── data/           # Data loading and splitting
│   ├── preprocessing/  # Feature engineering and preprocessing
│   ├── models/         # Model training
│   ├── evaluation/     # Model evaluation
│   └── utils/          # Utilities
├── scripts/            # Training and prediction scripts
├── api/                # Flask API for deployment
├── models/             # Saved model artifacts
├── reports/            # Evaluation reports and plots
└── notebooks/          # Analysis notebooks
```

### Workflow Sections:
1. **Configuration Setup** - Project paths and hyperparameters
2. **Data Ingestion** - Load data from SQLite database
3. **Exploratory Data Analysis (EDA)** - Understand the data
4. **Data Splitting** - Train/eval/test splits
5. **Feature Engineering** - Create derived features
6. **Data Preprocessing** - Encoding, scaling, imputation
7. **Model Training** - XGBoost with class imbalance handling
8. **Model Evaluation** - Comprehensive metrics and visualizations
9. **Predictions** - Batch and real-time predictions
10. **Model Persistence** - Save and load artifacts
11. **API Deployment** - Flask API structure

## Section 1: Configuration Setup
**Source File**: `config/config.py`

Set up project paths, the database location, and hyperparameters.

In [None]:
# ============================================================================
# CONFIGURATION SETUP
# Source: config/config.py
# ============================================================================

import sys
from pathlib import Path
import pandas as pd
import numpy as np
import sqlite3
import joblib
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    precision_score, recall_score, f1_score, accuracy_score,
    precision_recall_curve, roc_curve, average_precision_score
)
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Models
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Class imbalance
from imblearn.over_sampling import SMOTE

# Model interpretation
try:
    import shap
    SHAP_AVAILABLE = True
except ImportError:
    SHAP_AVAILABLE = False
    print("SHAP not available")

# Hyperparameter tuning
try:
    import optuna
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False
    print("Optuna not available")

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Project Configuration
PROJECT_ROOT = Path.cwd()
DATABASE_PATH = PROJECT_ROOT.parent.parent.parent / "Database.db"

# Data split configuration
TRAIN_SIZE = 4_000_000  # First 4M records for training
EVAL_SIZE = 1_000_000   # Next 1M records for evaluation

# Directory paths
DIRECTORIES = {
    "models": PROJECT_ROOT / "models",
    "logs": PROJECT_ROOT / "logs",
    "reports": PROJECT_ROOT / "reports",
    "notebooks": PROJECT_ROOT / "notebooks",
    "data": PROJECT_ROOT / "data",
}

# Model file paths
MODEL_PATHS = {
    "preprocessor": DIRECTORIES["models"] / "preprocessor.pkl",
    "model": DIRECTORIES["models"] / "model.pkl",
    "feature_names": DIRECTORIES["models"] / "feature_names.pkl",
}

# Column type mappings
COLUMN_TYPES = {
    "step": "int64",
    "type": "category",
    "amount": "float64",
    "nameOrig": "string",
    "oldbalanceOrg": "float64",
    "newbalanceOrig": "float64",
    "nameDest": "string",
    "oldbalanceDest": "float64",
    "newbalanceDest": "float64",
    "isFraud": "int64",
    "isFlaggedFraud": "int64",
}

# Database table name
DB_TABLE_NAME = "Fraud_detection"

# Model hyperparameters
MODEL_CONFIG = {
    "primary_model": "xgboost",
    "use_smote": True,
    "random_state": 42,
    "test_size": 0.2,
    "cv_folds": 5,
    "scoring_metric": "roc_auc",
}

# XGBoost hyperparameters
XGBOOST_PARAMS = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 6,
    "learning_rate": 0.1,
    "n_estimators": 100,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
    "scale_pos_weight": 100,  # For class imbalance
    "random_state": 42,
}

# Feature engineering configuration
FEATURE_CONFIG = {
    "use_account_frequency": False,
    "use_time_features": True,
    "use_balance_features": True,
    "use_transaction_features": True,
}

# SMOTE configuration
SMOTE_CONFIG = {
    "k_neighbors": 5,
    "random_state": 42,
}

# Evaluation configuration
EVALUATION_CONFIG = {
    "optimize_threshold": True,
    "target_recall": 0.90,  # Target recall for fraud detection
    "precision_weight": 0.3,
    "recall_weight": 0.7,
}

# Ensure directories exist
for dir_path in DIRECTORIES.values():
    dir_path.mkdir(parents=True, exist_ok=True)

print("Configuration loaded!")
print(f"Project Root: {PROJECT_ROOT}")
print(f"Database Path: {DATABASE_PATH}")
print(f"Database exists: {DATABASE_PATH.exists()}")
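
`XGBOOST_PARAMS` hard-codes `scale_pos_weight` at 100. The XGBoost documentation suggests the ratio of negative to positive samples as a typical starting value, so it may be worth deriving the weight from the training labels instead. A minimal sketch, assuming binary 0/1 labels; `suggest_scale_pos_weight` is a hypothetical helper and not part of `config/config.py`:

```python
from collections import Counter

def suggest_scale_pos_weight(labels):
    """Return n_negative / n_positive for binary 0/1 labels (hypothetical helper)."""
    counts = Counter(int(v) for v in labels)
    n_pos, n_neg = counts[1], counts[0]
    if n_pos == 0:
        raise ValueError("no positive (fraud) samples in labels")
    return n_neg / n_pos

# Toy labels: 990 legitimate transactions and 10 fraudulent ones
toy_labels = [0] * 990 + [1] * 10
print(suggest_scale_pos_weight(toy_labels))  # 99.0
```

Because the configuration also enables SMOTE, the weight would probably be best computed on the labels the model actually sees after resampling; otherwise the two corrections may compound.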

## Section 2: Data Ingestion
**Source File**: `src/data/data_loader.py`

Functions to load data from the SQLite database in chunks, convert column types, and validate the result.

In [None]:
# ============================================================================
# DATA LOADING FUNCTIONS
# Source: src/data/data_loader.py
# ============================================================================

def load_data_from_db(
    db_path=None,
    table_name=None,
    chunk_size=100000,
    max_rows=None
):
    """
    Load data from SQLite database with chunking for large datasets.
    """
    if db_path is None:
        db_path = DATABASE_PATH
    if table_name is None:
        table_name = DB_TABLE_NAME
    
    print(f"Loading data from {db_path} table {table_name}")
    
    db_path = Path(db_path)  # accept plain string paths as well as Path objects
    if not db_path.exists():
        raise FileNotFoundError(f"Database file not found: {db_path}")
    
    conn = sqlite3.connect(str(db_path))
    try:
        # Get total row count
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        total_rows = cursor.fetchone()[0]
        print(f"Total rows in table: {total_rows:,}")

        # Determine how many rows to load
        rows_to_load = min(total_rows, max_rows) if max_rows else total_rows

        # Load data in chunks
        chunks = []
        offset = 0

        while offset < rows_to_load:
            current_chunk_size = min(chunk_size, rows_to_load - offset)
            query = f"SELECT * FROM {table_name} LIMIT {current_chunk_size} OFFSET {offset}"

            chunk = pd.read_sql_query(query, conn)
            chunks.append(chunk)

            offset += current_chunk_size
            if offset % 500000 == 0:
                print(f"Loaded {offset:,}/{rows_to_load:,} rows")

        # Concatenate all chunks
        df = pd.concat(chunks, ignore_index=True)
        print(f"Successfully loaded {len(df):,} rows")

        return df

    except Exception as e:
        print(f"Error loading data: {e}")
        raise
    finally:
        # Close the connection even if a query fails
        conn.close()


def convert_column_types(df, column_types=None):
    """Convert column types according to schema."""
    if column_types is None:
        column_types = COLUMN_TYPES
    
    print("Converting column types")
    df_converted = df.copy()
    
    for column, dtype in column_types.items():
        if column in df_converted.columns:
            try:
                if dtype == "float64":
                    df_converted[column] = pd.to_numeric(df_converted[column], errors="coerce")
                elif dtype == "int64":
                    df_converted[column] = pd.to_numeric(df_converted[column], errors="coerce").astype("Int64")
                elif dtype == "category":
                    df_converted[column] = df_converted[column].astype("category")
                elif dtype == "string":
                    df_converted[column] = df_converted[column].astype("string")
                else:
                    df_converted[column] = df_converted[column].astype(dtype)
                
            except Exception as e:
                print(f"Warning: Failed to convert {column} to {dtype}: {str(e)}")
    
    return df_converted


def validate_data(df):
    """Perform basic data validation checks."""
    print("Validating data")
    
    # Check required columns
    required_columns = ["step", "type", "amount", "nameOrig", "oldbalanceOrg", 
                        "newbalanceOrig", "nameDest", "oldbalanceDest", 
                        "newbalanceDest", "isFraud", "isFlaggedFraud"]
    
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")
    
    # Check for empty dataframe
    if df.empty:
        raise ValueError("DataFrame is empty")
    
    # Log basic statistics
    print(f"Data shape: {df.shape}")
    print(f"Missing values per column:\n{df.isnull().sum()}")
    print(f"Fraud class distribution:\n{df['isFraud'].value_counts()}")
    
    return df


def load_and_prepare_data(
    db_path=None,
    table_name=None,
    chunk_size=100000,
    max_rows=None,
    convert_types=True,
    validate=True
):
    """Complete data loading pipeline: load, convert types, and validate."""
    print("Starting data loading pipeline")
    
    # Load data
    df = load_data_from_db(db_path, table_name, chunk_size, max_rows)
    
    # Convert types
    if convert_types:
        df = convert_column_types(df)
    
    # Validate
    if validate:
        df = validate_data(df)
    
    print("Data loading pipeline completed successfully")
    return df

# Load sample data for demonstration (use max_rows to limit for faster execution)
print("Loading data sample...")
df = load_and_prepare_data(max_rows=50000)  # Limit for notebook demo
print("\nData loaded successfully!")
print(f"Shape: {df.shape}")
df.head()
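
`load_data_from_db` pages through the table with `LIMIT ... OFFSET ...`. Without an `ORDER BY`, SQLite does not guarantee the same row order from one query to the next, and a large offset makes the engine step over all preceding rows on every call. One alternative is keyset pagination on `rowid`. The sketch below assumes an ordinary rowid table (not `WITHOUT ROWID`) with positive rowids, and uses a throwaway in-memory table named `tx`; `iter_chunks` is a hypothetical helper, not a function from `src/data/data_loader.py`:

```python
import sqlite3

def iter_chunks(conn, table, chunk_size):
    """Yield lists of rows in rowid order, chunk_size rows at a time.

    The table name is interpolated into the SQL, so it must come from
    trusted configuration, never from user input.
    """
    last_rowid = 0  # assumes positive rowids, which is what SQLite assigns by default
    while True:
        rows = conn.execute(
            f"SELECT rowid, * FROM {table} WHERE rowid > ? ORDER BY rowid LIMIT ?",
            (last_rowid, chunk_size),
        ).fetchall()
        if not rows:
            break
        last_rowid = rows[-1][0]          # resume after the last row seen
        yield [row[1:] for row in rows]   # drop the helper rowid column

# Build a small in-memory table to demonstrate the paging
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (step INTEGER, amount REAL)")
conn.executemany("INSERT INTO tx VALUES (?, ?)", [(i, i * 10.0) for i in range(1, 8)])

chunks = list(iter_chunks(conn, "tx", chunk_size=3))
conn.close()
print([len(c) for c in chunks])  # [3, 3, 1]
```

Each chunk could then be turned into a DataFrame and concatenated exactly as in the loader above.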

## Section 3: Exploratory Data Analysis (EDA)

Understanding the data distribution, patterns, and fraud characteristics.

In [None]:
# ============================================================================
# EXPLORATORY DATA ANALYSIS
# ============================================================================

def perform_eda(df):
    """Perform comprehensive exploratory data analysis."""
    print("=" * 60)
    print("EXPLORATORY DATA ANALYSIS")
    print("=" * 60)
    
    # Basic information
    print("\n1. Basic Data Information:")
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    print(f"Data types:\n{df.dtypes}")
    
    # Missing values
    print("\n2. Missing Values:")
    missing = df.isnull().sum()
    print(missing[missing > 0] if missing.any() else "No missing values")
    
    # Class distribution
    print("\n3. Fraud Class Distribution:")
    fraud_counts = df['isFraud'].value_counts()
    fraud_pct = df['isFraud'].value_counts(normalize=True) * 100
    print(f"Counts:\n{fraud_counts}")
    print(f"Percentages:\n{fraud_pct.round(2)}")
    print(f"Imbalance ratio: {fraud_counts.min() / fraud_counts.max():.6f}")
    
    # Transaction types
    print("\n4. Transaction Types:")
    type_counts = df['type'].value_counts()
    print(type_counts)
    
    # Fraud by transaction type
    print("\n5. Fraud by Transaction Type:")
    fraud_by_type = df.groupby('type')['isFraud'].agg(['count', 'sum', 'mean'])
    fraud_by_type.columns = ['Total', 'Fraud', 'Fraud_Rate']
    fraud_by_type['Fraud_Rate'] = (fraud_by_type['Fraud_Rate'] * 100).round(2)
    print(fraud_by_type)
    
    # Amount statistics
    print("\n6. Amount Statistics:")
    print(df['amount'].describe())
    
    # Amount by fraud status
    print("\n7. Amount by Fraud Status:")
    amount_stats = df.groupby('isFraud')['amount'].agg(['mean', 'median', 'std', 'max'])
    print(amount_stats)
    
    return fraud_counts, fraud_by_type, amount_stats


def create_eda_visualizations(df):
    """Create EDA visualizations."""
    reports_dir = DIRECTORIES["reports"]
    reports_dir.mkdir(parents=True, exist_ok=True)
    
    # Set up the figure
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Fraud Detection - Exploratory Data Analysis', fontsize=16, y=1.02)
    
    # 1. Class distribution
    ax1 = axes[0, 0]
    sns.countplot(data=df, x='isFraud', ax=ax1)
    ax1.set_title('Fraud Class Distribution')
    ax1.set_xlabel('Is Fraud')
    ax1.set_ylabel('Count')
    
    # Add percentage labels
    total = len(df)
    for p in ax1.patches:
        percentage = f'{100 * p.get_height() / total:.2f}%'
        ax1.annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='bottom')
    
    # 2. Transaction types
    ax2 = axes[0, 1]
    type_counts = df['type'].value_counts()
    ax2.pie(type_counts.values, labels=type_counts.index, autopct='%1.1f%%')
    ax2.set_title('Transaction Types Distribution')
    
    # 3. Amount distribution (log scale)
    ax3 = axes[0, 2]
    sns.histplot(data=df, x='amount', log_scale=True, bins=50, ax=ax3)
    ax3.set_title('Transaction Amount Distribution (Log Scale)')
    ax3.set_xlabel('Amount (log scale)')
    ax3.set_ylabel('Count')
    
    # 4. Amount by fraud status
    ax4 = axes[1, 0]
    sns.boxplot(data=df, x='isFraud', y='amount', ax=ax4)
    ax4.set_title('Transaction Amount by Fraud Status')
    ax4.set_xlabel('Is Fraud')
    ax4.set_ylabel('Amount')
    ax4.set_yscale('log')
    
    # 5. Fraud rate by transaction type
    ax5 = axes[1, 1]
    fraud_rate_by_type = df.groupby('type')['isFraud'].mean() * 100
    fraud_rate_by_type.sort_values(ascending=False).plot(kind='bar', ax=ax5)
    ax5.set_title('Fraud Rate by Transaction Type')
    ax5.set_xlabel('Transaction Type')
    ax5.set_ylabel('Fraud Rate (%)')
    ax5.tick_params(axis='x', rotation=45)
    
    # 6. Time series (step) analysis
    ax6 = axes[1, 2]
    fraud_by_step = df.groupby('step')['isFraud'].mean().rolling(window=24).mean()
    ax6.plot(fraud_by_step.index, fraud_by_step.values)
    ax6.set_title('Fraud Rate Over Time (24-hour rolling average)')
    ax6.set_xlabel('Step (Hour)')
    ax6.set_ylabel('Fraud Rate')
    
    plt.tight_layout()
    plt.savefig(reports_dir / "eda_analysis.png", dpi=300, bbox_inches='tight')
    plt.show()
    
    # Additional detailed analysis
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Balance analysis
    ax1 = axes[0]
    df_sample = df.sample(min(10000, len(df)), random_state=42)
    sns.scatterplot(data=df_sample, x='oldbalanceOrg', y='newbalanceOrig', 
                    hue='isFraud', alpha=0.6, ax=ax1)
    ax1.set_title('Origin Account Balance Changes')
    ax1.set_xlabel('Old Balance Origin')
    ax1.set_ylabel('New Balance Origin')
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    
    # Hour of day analysis
    ax2 = axes[1]
    if 'step' in df.columns:
        # Derive the hour locally instead of adding a column to the caller's DataFrame
        hour_of_day = (df['step'] % 24).rename('hour_of_day')
        fraud_by_hour = df['isFraud'].groupby(hour_of_day).mean() * 100
        fraud_by_hour.plot(kind='bar', ax=ax2)
        ax2.set_title('Fraud Rate by Hour of Day')
        ax2.set_xlabel('Hour of Day')
        ax2.set_ylabel('Fraud Rate (%)')
    
    plt.tight_layout()
    plt.savefig(reports_dir / "eda_detailed.png", dpi=300, bbox_inches='tight')
    plt.show()


# Perform EDA
fraud_counts, fraud_by_type, amount_stats = perform_eda(df)

# Create visualizations
create_eda_visualizations(df)
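
The imbalance ratio printed in step 3 of `perform_eda` has a direct consequence for metric choice: a model that never predicts fraud already reaches an accuracy equal to the majority-class share. A small worked example with made-up counts (the numbers are illustrative, not taken from the dataset):

```python
# Illustrative counts only; substitute the values printed by perform_eda
n_legit, n_fraud = 49_900, 100
total = n_legit + n_fraud

imbalance_ratio = n_fraud / n_legit   # same definition as in perform_eda: min / max
baseline_accuracy = n_legit / total   # accuracy of always predicting "not fraud"
baseline_recall = 0 / n_fraud         # that baseline catches no fraud at all

print(f"imbalance ratio:   {imbalance_ratio:.6f}")    # 0.002004
print(f"baseline accuracy: {baseline_accuracy:.4f}")  # 0.9980
print(f"baseline recall:   {baseline_recall:.1f}")    # 0.0
```

This is consistent with the configuration scoring models by `roc_auc` and targeting recall rather than accuracy.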

## Section 4: Data Splitting
**Source File**: `src/data/data_splitter.py`

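Split the dataset into train, evaluation, and test sets by row position, either in temporal order (sorted by `step`) or after a random shuffle. The split itself is not stratified; the class distribution of each split is printed so it can be checked.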

In [None]:
# ============================================================================
# DATA SPLITTING FUNCTIONS
# Source: src/data/data_splitter.py
# ============================================================================

def split_data(
    df,
    train_size=TRAIN_SIZE,
    eval_size=EVAL_SIZE,
    random_state=42,
    preserve_order=False
):
    """
    Split dataset into train, evaluation, and test sets.
    """
    print(f"Splitting data: train={train_size:,}, eval={eval_size:,}")
    
    total_rows = len(df)
    print(f"Total rows: {total_rows:,}")
    
    if preserve_order and 'step' in df.columns:
        # Sort by step to preserve temporal order
        df_sorted = df.sort_values('step').reset_index(drop=True)
        print("Preserving temporal order based on 'step' column")
    else:
        df_sorted = df.copy()
        if not preserve_order:
            # Shuffle for random split
            df_sorted = df_sorted.sample(frac=1, random_state=random_state).reset_index(drop=True)
            print("Randomly shuffling data")
    
    # Calculate split indices
    train_end = min(train_size, total_rows)
    eval_end = min(train_size + eval_size, total_rows)
    
    # Split data
    train_df = df_sorted.iloc[:train_end].copy()
    eval_df = df_sorted.iloc[train_end:eval_end].copy()
    test_df = df_sorted.iloc[eval_end:].copy()
    
    print(f"Train set: {len(train_df):,} rows")
    print(f"Eval set: {len(eval_df):,} rows")
    print(f"Test set: {len(test_df):,} rows")
    
    # Log class distribution for each split
    if 'isFraud' in train_df.columns:
        print(f"\nTrain fraud distribution:\n{train_df['isFraud'].value_counts()}")
        print(f"Eval fraud distribution:\n{eval_df['isFraud'].value_counts()}")
        print(f"Test fraud distribution:\n{test_df['isFraud'].value_counts()}")
    
    return train_df, eval_df, test_df


def split_features_target(df, target_column="isFraud"):
    """
    Split DataFrame into features and target.
    """
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found in DataFrame")
    
    # Drop target and other non-feature columns
    columns_to_drop = [
        target_column,
        'isFlaggedFraud',  # Business flag, not a feature
    ]
    
    # Keep only columns that exist
    columns_to_drop = [col for col in columns_to_drop if col in df.columns]
    
    X = df.drop(columns=columns_to_drop).copy()
    y = df[target_column].copy()
    
    print(f"Features shape: {X.shape}, Target shape: {y.shape}")
    print(f"Feature columns: {list(X.columns)}")
    
    return X, y


def get_stratified_split_info(df, target_column="isFraud"):
    """
    Get information about class distribution for stratified splitting.
    """
    if target_column not in df.columns:
        raise ValueError(f"Target column '{target_column}' not found")
    
    class_counts = df[target_column].value_counts().sort_index()
    class_proportions = df[target_column].value_counts(normalize=True).sort_index()
    
    info = {
        "class_counts": class_counts.to_dict(),
        "class_proportions": class_proportions.to_dict(),
        "total_samples": len(df),
        "n_classes": len(class_counts),
        "imbalance_ratio": class_counts.min() / class_counts.max() if len(class_counts) > 1 else 1.0
    }
    
    print(f"Class distribution: {info['class_counts']}")
    print(f"Imbalance ratio: {info['imbalance_ratio']:.6f}")
    
    return info


# Split the loaded data
print("=" * 60)
print("DATA SPLITTING")
print("=" * 60)

# Get class distribution info
split_info = get_stratified_split_info(df)

# Split data (adjust sizes for demo)
train_df, eval_df, test_df = split_data(
    df,
    train_size=int(len(df) * 0.6),  # Adjusted for demo
    eval_size=int(len(df) * 0.2),
    preserve_order=True
)

# Split features and target
X_train, y_train = split_features_target(train_df)
X_eval, y_eval = split_features_target(eval_df)
X_test, y_test = split_features_target(test_df)

print("\nFinal splits:")
print(f"Training set: X={X_train.shape}, y={y_train.shape}")
print(f"Evaluation set: X={X_eval.shape}, y={y_eval.shape}")
print(f"Test set: X={X_test.shape}, y={y_test.shape}")
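
`split_data` cuts the frame by row position, so the fraud share in each split is whatever the ordering happens to produce. Where temporal order is not required, a stratified split keeps the class proportions roughly equal across splits. The sketch below uses only the standard library so it runs on its own; inside the notebook, scikit-learn's `train_test_split(..., stratify=y)` would be the more usual tool. `stratified_indices` is a hypothetical helper:

```python
import random
from collections import defaultdict

def stratified_indices(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split row indices into len(fractions) groups, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, frac in enumerate(fractions):
            if i == len(fractions) - 1:
                end = len(indices)  # last split takes the remainder, so no row is dropped
            else:
                end = start + round(frac * len(indices))
            splits[i].extend(indices[start:end])
            start = end
    return splits

# Toy labels with a 10% fraud share
labels = [0] * 90 + [1] * 10
train_idx, eval_idx, test_idx = stratified_indices(labels)
for name, idx in [("train", train_idx), ("eval", eval_idx), ("test", test_idx)]:
    n_fraud = sum(labels[i] for i in idx)
    print(f"{name}: {len(idx)} rows, {n_fraud} fraud")
```

With these toy labels every split keeps the 10% fraud share (6 of 60, 2 of 20, 2 of 20). For a time-ordered evaluation, the positional split with `preserve_order=True` remains the appropriate choice.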

## Section 5: Feature Engineering
**Source File**: `src/preprocessing/feature_engineering.py`

Create derived features from raw transaction data including balance, transaction, time, and account features.

In [None]:
# ============================================================================
# FEATURE ENGINEERING FUNCTIONS
# Source: src/preprocessing/feature_engineering.py
# ============================================================================

def create_balance_features(df):
    """Create balance-related features."""
    df_features = df.copy()
    
    # Balance differences
    df_features['balance_diff_orig'] = (
        df_features['oldbalanceOrg'] - df_features['newbalanceOrig']
    )
    df_features['balance_diff_dest'] = (
        df_features['newbalanceDest'] - df_features['oldbalanceDest']
    )
    
    # Zero balance flags
    df_features['balance_orig_zero'] = (df_features['oldbalanceOrg'] == 0).astype(int)
    df_features['balance_dest_zero'] = (df_features['oldbalanceDest'] == 0).astype(int)
    
    # Zero balance after transaction
    df_features['zero_balance_after_transaction'] = (
        df_features['newbalanceOrig'] == 0
    ).astype(int)
    
    # Balance ratios
    df_features['balance_orig_ratio'] = np.where(
        df_features['oldbalanceOrg'] > 0,
        df_features['amount'] / df_features['oldbalanceOrg'],
        0
    )
    
    df_features['balance_dest_ratio'] = np.where(
        df_features['oldbalanceDest'] > 0,
        df_features['amount'] / (df_features['oldbalanceDest'] + 1),
        0
    )
    
    return df_features


def create_transaction_features(df):
    """Create transaction-related features."""
    df_features = df.copy()
    
    # Log amount (handle zero and negative)
    df_features['amount_log'] = np.log1p(df_features['amount'].clip(lower=0))
    
    # Amount per original balance
    df_features['amount_per_balance_orig'] = (
        df_features['amount'] / (df_features['oldbalanceOrg'] + 1)
    )
    
    # Check if transaction empties origin account
    df_features['empties_origin'] = (
        (df_features['oldbalanceOrg'] > 0) & 
        (df_features['newbalanceOrig'] == 0)
    ).astype(int)
    
    # Check if transaction creates new destination balance
    df_features['creates_dest_balance'] = (
        (df_features['oldbalanceDest'] == 0) & 
        (df_features['newbalanceDest'] > 0)
    ).astype(int)
    
    # Amount categories (buckets)
    df_features['amount_category'] = pd.cut(
        df_features['amount'],
        bins=[0, 100, 1000, 10000, 100000, float('inf')],
        labels=['very_small', 'small', 'medium', 'large', 'very_large'],
        include_lowest=True
    )
    
    return df_features


def create_time_features(df):
    """Create time-related features from step column."""
    df_features = df.copy()
    
    if 'step' not in df_features.columns:
        print("Warning: 'step' column not found, skipping time features")
        return df_features
    
    # Hour of day (step represents hours)
    df_features['hour_of_day'] = df_features['step'] % 24
    
    # Day of week (assuming step 0 is start of week)
    df_features['day_of_week'] = (df_features['step'] // 24) % 7
    
    # Is weekend
    df_features['is_weekend'] = (
        (df_features['day_of_week'] == 5) | (df_features['day_of_week'] == 6)
    ).astype(int)
    
    # Is business hours (9-17)
    df_features['is_business_hours'] = (
        (df_features['hour_of_day'] >= 9) & (df_features['hour_of_day'] < 17)
    ).astype(int)
    
    # Is night (22-6)
    df_features['is_night'] = (
        (df_features['hour_of_day'] >= 22) | (df_features['hour_of_day'] < 6)
    ).astype(int)
    
    return df_features


def create_account_features(df, use_frequency=False):
    """Create account-related features."""
    df_features = df.copy()
    
    # Same account transfer flag
    df_features['same_account_transfer'] = (
        df_features['nameOrig'] == df_features['nameDest']
    ).astype(int)
    
    # Account name prefixes (C = customer, M = merchant)
    df_features['orig_is_customer'] = (
        df_features['nameOrig'].str.startswith('C', na=False)
    ).astype(int)
    df_features['dest_is_customer'] = (
        df_features['nameDest'].str.startswith('C', na=False)
    ).astype(int)
    
    if use_frequency:
        print("Computing account frequency features (this may take time)")
        # Frequency of origin account
        orig_counts = df_features['nameOrig'].value_counts()
        df_features['orig_account_frequency'] = df_features['nameOrig'].map(orig_counts)
        
        # Frequency of destination account
        dest_counts = df_features['nameDest'].value_counts()
        df_features['dest_account_frequency'] = df_features['nameDest'].map(dest_counts)
    
    return df_features


def create_all_features(df, feature_config=None):
    """Create all engineered features."""
    if feature_config is None:
        feature_config = FEATURE_CONFIG
    
    print("Starting feature engineering pipeline")
    df_features = df.copy()
    
    # Balance features
    if feature_config.get("use_balance_features", True):
        df_features = create_balance_features(df_features)
    
    # Transaction features
    if feature_config.get("use_transaction_features", True):
        df_features = create_transaction_features(df_features)
    
    # Time features
    if feature_config.get("use_time_features", True):
        df_features = create_time_features(df_features)
    
    # Account features
    df_features = create_account_features(
        df_features,
        use_frequency=feature_config.get("use_account_frequency", False)
    )
    
    print(f"Feature engineering complete. Shape: {df_features.shape}")
    new_features = [col for col in df_features.columns if col not in df.columns]
    print(f"New feature columns: {len(new_features)} features created")
    
    return df_features


def analyze_feature_importance_raw(X_original, X_featured, y):
    """Analyze the importance of engineered features."""
    print("\nFeature Engineering Analysis:")
    print(f"Original features: {len(X_original.columns)}")
    print(f"After engineering: {len(X_featured.columns)}")
    
    new_features = [col for col in X_featured.columns if col not in X_original.columns]
    print(f"New engineered features: {len(new_features)}")
    print(f"Engineered features: {new_features}")
    
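    # Quick correlation analysis (numeric engineered features only;
    # categorical ones such as amount_category are skipped)
    numeric_new = X_featured[new_features].select_dtypes(include=[np.number])
    if numeric_new.shape[1] > 0:
        engineered_corr = numeric_new.corrwith(y).abs().sort_values(ascending=False)
        print("\nTop engineered features by correlation with target:")
        print(engineered_corr.head(10))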


# Apply feature engineering to training data
print("=" * 60)
print("FEATURE ENGINEERING")
print("=" * 60)
X_train_features = create_all_features(X_train)

# Analyze feature importance
analyze_feature_importance_raw(X_train, X_train_features, y_train)

print(f"\nOriginal columns: {len(X_train.columns)}")
print(f"Features after engineering: {len(X_train_features.columns)}")
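
The time features rest on simple modular arithmetic over `step`, which the code treats as an hour counter. A worked example for a single value, mirroring the expressions in `create_time_features` (`time_features_for_step` is a hypothetical helper for illustration only):

```python
def time_features_for_step(step):
    """Recompute the time features for one step value."""
    hour_of_day = step % 24
    day_of_week = (step // 24) % 7
    return {
        "hour_of_day": hour_of_day,
        "day_of_week": day_of_week,
        "is_weekend": int(day_of_week in (5, 6)),
        "is_business_hours": int(9 <= hour_of_day < 17),
        "is_night": int(hour_of_day >= 22 or hour_of_day < 6),
    }

# step 130 = 5 full days (120 hours) + 10 hours
features = time_features_for_step(130)
print(features)
# {'hour_of_day': 10, 'day_of_week': 5, 'is_weekend': 1, 'is_business_hours': 1, 'is_night': 0}
```

Whether day index 5 really is a Saturday depends on which weekday step 0 falls on; the code assumes the week starts at step 0.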

## Section 6: Data Preprocessing
**Source File**: `src/preprocessing/preprocessor.py`

Sklearn-compatible preprocessing pipeline that handles feature engineering, encoding, scaling, and imputation.
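
A fit/transform preprocessor of this kind is normally fitted on the training split only and then reused for evaluation and test data, so that no statistics leak from held-out rows. The standard-library sketch below reproduces that pattern for one numeric column: median imputation followed by standardization with the population standard deviation, which is what `SimpleImputer(strategy='median')` and `StandardScaler` compute. `fit_column` and `transform_column` are hypothetical names, not part of `src/preprocessing/preprocessor.py`:

```python
import statistics

def fit_column(train_values):
    """Learn median, mean and population std from training values (None = missing)."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in train_values]
    # Mean and std are computed after imputation, because the scaler sees imputed data
    return {
        "median": median,
        "mean": statistics.fmean(imputed),
        "std": statistics.pstdev(imputed),
    }

def transform_column(values, params):
    """Impute with the training median, then standardize with training mean/std."""
    scale = params["std"] or 1.0  # guard against constant columns
    return [
        ((params["median"] if v is None else v) - params["mean"]) / scale
        for v in values
    ]

train = [1.0, 2.0, None, 3.0, 4.0]
params = fit_column(train)
print(params)                       # {'median': 2.5, 'mean': 2.5, 'std': 1.0}

# Held-out values are transformed with the training statistics only
held_out = transform_column([None, 5.0], params)
print(held_out)                     # [0.0, 2.5]
```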

In [None]:
# ============================================================================
# PREPROCESSING PIPELINE CLASS
# Source: src/preprocessing/preprocessor.py
# ============================================================================

from sklearn.base import BaseEstimator, TransformerMixin

class FraudDetectionPreprocessor(BaseEstimator, TransformerMixin):
    """
    Sklearn-compatible preprocessor for fraud detection.
    Handles feature engineering, encoding, scaling, and imputation.
    """
    
    def __init__(
        self,
        feature_config=None,
        categorical_columns=None,
        numerical_columns=None,
        use_one_hot=True,
        use_scaling=True
    ):
        """Initialize preprocessor."""
        self.feature_config = feature_config or FEATURE_CONFIG
        self.categorical_columns = categorical_columns or []
        self.numerical_columns = numerical_columns or []
        self.use_one_hot = use_one_hot
        self.use_scaling = use_scaling
        
        # Transformers
        self.label_encoder = LabelEncoder()
        self.one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='first')
        self.scaler = StandardScaler()
        self.imputer = SimpleImputer(strategy='median')
        
        # Feature names after transformation
        self.feature_names_ = None
        self.is_fitted_ = False
        
    def _identify_columns(self, X):
        """Identify categorical and numerical columns if not provided."""
        if not self.categorical_columns and not self.numerical_columns:
            for col in X.columns:
                if X[col].dtype == 'object' or X[col].dtype.name == 'category':
                    if col not in self.categorical_columns:
                        self.categorical_columns.append(col)
                elif pd.api.types.is_numeric_dtype(X[col]):
                    # is_numeric_dtype also matches pandas nullable dtypes such as
                    # Int64, which convert_column_types produces for integer columns
                    if col not in self.numerical_columns:
                        self.numerical_columns.append(col)
    
    def _get_feature_names(self, X_features):
        """Get feature names after transformation."""
        feature_names = []
        
        # Numerical features
        for col in self.numerical_columns:
            if col in X_features.columns:
                feature_names.append(col)
        
        # Categorical features
        if self.use_one_hot and self.categorical_columns:
            # Read the one-hot names from the fitted encoder so they match the
            # columns built in transform(); drop='first' removes the first
            # category in sorted order, not the first value seen in the data
            feature_names.extend(
                self.one_hot_encoder.get_feature_names_out(self.categorical_columns)
            )
        else:
            # Keep original categorical names
            for col in self.categorical_columns:
                if col in X_features.columns:
                    feature_names.append(col)
        
        return feature_names
    
    def fit(self, X, y=None):
        """Fit the preprocessor on training data."""
        print("Fitting preprocessor")
        
        # Create features
        X_features = create_all_features(X, self.feature_config)
        
        # Identify columns if not already done
        self._identify_columns(X_features)
        
        # Handle missing values
        X_imputed = X_features.copy()
        if self.numerical_columns:
            numerical_data = X_imputed[self.numerical_columns].select_dtypes(include=[np.number])
            self.imputer.fit(numerical_data)
        
        # Fit encoders
        if self.categorical_columns:
            categorical_data = X_imputed[self.categorical_columns]
            
            if self.use_one_hot:
                self.one_hot_encoder.fit(categorical_data)
            else:
                if len(self.categorical_columns) == 1:
                    self.label_encoder.fit(categorical_data[self.categorical_columns[0]])
        
        # Fit scaler
        if self.use_scaling and self.numerical_columns:
            numerical_data = X_imputed[self.numerical_columns].select_dtypes(include=[np.number])
            if len(numerical_data.columns) > 0:
                self.scaler.fit(self.imputer.transform(numerical_data))
        
        # Store feature names
        self.feature_names_ = self._get_feature_names(X_features)
        self.is_fitted_ = True
        
        print(f"Preprocessor fitted. Output features: {len(self.feature_names_)}")
        return self
    
    def transform(self, X):
        """Transform data using fitted preprocessor."""
        if not self.is_fitted_:
            raise ValueError("Preprocessor must be fitted before transform")
        
        # Create features
        X_features = create_all_features(X, self.feature_config)
        
        # Handle missing values
        X_processed = X_features.copy()
        
        # Impute numerical columns
        if self.numerical_columns:
            numerical_data = X_processed[self.numerical_columns].select_dtypes(include=[np.number])
            if len(numerical_data.columns) > 0:
                numerical_imputed = self.imputer.transform(numerical_data)
                
                # Scale if enabled
                if self.use_scaling:
                    numerical_scaled = self.scaler.transform(numerical_imputed)
                else:
                    numerical_scaled = numerical_imputed
                
                # Update DataFrame
                for i, col in enumerate(numerical_data.columns):
                    X_processed[col] = numerical_scaled[:, i]
        
        # Encode categorical columns
        if self.categorical_columns:
            categorical_data = X_processed[self.categorical_columns]
            
            if self.use_one_hot:
                categorical_encoded = self.one_hot_encoder.transform(categorical_data)
                categorical_encoded_df = pd.DataFrame(
                    categorical_encoded,
                    columns=self.one_hot_encoder.get_feature_names_out(self.categorical_columns),
                    index=X_processed.index
                )
                X_processed = X_processed.drop(columns=self.categorical_columns)
                X_processed = pd.concat([X_processed, categorical_encoded_df], axis=1)
            else:
                if len(self.categorical_columns) == 1:
                    X_processed[self.categorical_columns[0]] = self.label_encoder.transform(
                        categorical_data[self.categorical_columns[0]]
                    )
        
        # Select features in correct order
        if self.feature_names_ is not None:
            available_features = [f for f in self.feature_names_ if f in X_processed.columns]
            X_processed = X_processed[available_features]
        
        return X_processed
    
    def fit_transform(self, X, y=None):
        """Fit and transform in one step."""
        return self.fit(X, y).transform(X)
    
    def save(self, filepath):
        """Save preprocessor to file."""
        print(f"Saving preprocessor to {filepath}")
        filepath.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(self, filepath)
    
    @classmethod
    def load(cls, filepath):
        """Load preprocessor from file."""
        print(f"Loading preprocessor from {filepath}")
        return joblib.load(filepath)


# Fit and transform the training data
print("=" * 60)
print("DATA PREPROCESSING")
print("=" * 60)
preprocessor = FraudDetectionPreprocessor()
X_train_transformed = preprocessor.fit_transform(X_train)
X_eval_transformed = preprocessor.transform(X_eval)

print(f"\nOriginal shape: {X_train.shape}")
print(f"Transformed shape: {X_train_transformed.shape}")
print(f"Number of features: {len(X_train_transformed.columns)}")
print(f"Feature names: {list(X_train_transformed.columns)}")

## Section 7: Model Training
**Source File**: `src/models/model_trainer.py`

Train XGBoost model with class imbalance handling using SMOTE.
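
To make the resampling step less of a black box, the sketch below shows the core idea behind SMOTE in plain Python: each synthetic minority sample is a random interpolation between a real minority sample and one of its nearest minority neighbours. The function name `smote_like_oversample` and its parameters are illustrative assumptions and are not part of imbalanced-learn or of this project; the trainer in the next cell uses the library's `SMOTE` class instead.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative only: a simplified stand-in for what SMOTE does internally.
    """
    if len(minority) <= k:
        # Same kind of guard the trainer applies before calling SMOTE
        raise ValueError("need more than k minority samples")

    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(a + lam * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


fraud_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(fraud_points, n_new=5, k=2)
print(len(new_points))
```

Every synthetic point lies on a segment between two real minority points, which is why SMOTE needs more minority samples than `k_neighbors` before it can run.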

In [None]:
# ============================================================================
# MODEL TRAINING CLASS
# Source: src/models/model_trainer.py
# ============================================================================

class FraudDetectionModelTrainer:
    """Model trainer for fraud detection with class imbalance handling."""
    
    def __init__(
        self,
        model_type="xgboost",
        use_smote=True,
        smote_config=None,
        random_state=42
    ):
        """Initialize model trainer."""
        self.model_type = model_type.lower()
        self.use_smote = use_smote
        self.smote_config = smote_config or SMOTE_CONFIG
        self.random_state = random_state
        
        self.model = None
        self.smote = None
        self.best_params_ = None
        
        print(f"Initialized trainer with model_type={model_type}, use_smote={use_smote}")
    
    def _create_model(self, params=None):
        """Create model instance based on model_type."""
        if params is None:
            params = {}
        
        if self.model_type == "xgboost":
            default_params = XGBOOST_PARAMS.copy()
            default_params.update(params)
            default_params['random_state'] = self.random_state
            return XGBClassifier(**default_params)
        
        elif self.model_type == "lightgbm":
            default_params = {
                'objective': 'binary',
                'metric': 'auc',
                'boosting_type': 'gbdt',
                'num_leaves': 31,
                'learning_rate': 0.1,
                'n_estimators': 100,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'min_child_samples': 20,
                'scale_pos_weight': 100,
                'random_state': self.random_state,
            }
            default_params.update(params)
            return LGBMClassifier(**default_params)
        
        elif self.model_type == "random_forest":
            rf_params = {
                'n_estimators': 100,
                'max_depth': 20,
                'min_samples_split': 5,
                'min_samples_leaf': 2,
                'class_weight': 'balanced',
                'random_state': self.random_state,
                **params
            }
            return RandomForestClassifier(**rf_params)
        
        elif self.model_type == "logistic":
            lr_params = {
                'max_iter': 1000,
                'class_weight': 'balanced',
                'random_state': self.random_state,
                **params
            }
            return LogisticRegression(**lr_params)
        
        else:
            raise ValueError(f"Unknown model_type: {self.model_type}")
    
    def _create_smote(self):
        """Create SMOTE instance."""
        k_neighbors = self.smote_config.get('k_neighbors', 5)
        random_state = self.smote_config.get('random_state', self.random_state)
        # n_jobs is not passed: imbalanced-learn deprecated it on SMOTE in 0.10
        return SMOTE(
            k_neighbors=k_neighbors,
            random_state=random_state
        )
    
    def fit(self, X, y, tune_hyperparameters=False, n_trials=20):
        """Train the model (tune_hyperparameters and n_trials are currently unused)."""
        print(f"Training {self.model_type} model")
        print(f"Training data shape: {X.shape}, Target distribution: {y.value_counts().to_dict()}")
        
        # Handle class imbalance with SMOTE
        if self.use_smote:
            print("Applying SMOTE for class imbalance")
            self.smote = self._create_smote()
            
            min_class_count = y.value_counts().min()
            k_neighbors = self.smote_config.get('k_neighbors', 5)
            
            if min_class_count <= k_neighbors:
                print(
                    f"Only {min_class_count} minority samples (need more than "
                    f"k_neighbors={k_neighbors}). Skipping SMOTE and relying on "
                    "the model's class weighting."
                )
                self.use_smote = False
            else:
                try:
                    X_resampled, y_resampled = self.smote.fit_resample(X, y)
                    print(f"After SMOTE: {X_resampled.shape}, Target distribution: {pd.Series(y_resampled).value_counts().to_dict()}")
                    X, y = X_resampled, y_resampled
                except Exception as e:
                    print(f"SMOTE failed: {str(e)}. Using class_weight instead.")
                    self.use_smote = False
        
        # Create and train model
        self.model = self._create_model()
        
        print("Training model...")
        self.model.fit(X, y)
        print("Model training completed")
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        return self.model.predict(X)
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        return self.model.predict_proba(X)
    
    def save(self, filepath):
        """Save model to file."""
        print(f"Saving model to {filepath}")
        filepath.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(self.model, filepath)
    
    @classmethod
    def load(cls, filepath):
        """Load model from file."""
        print(f"Loading model from {filepath}")
        return joblib.load(filepath)


# Train the model
print("=" * 60)
print("MODEL TRAINING")
print("=" * 60)
trainer = FraudDetectionModelTrainer(
    model_type=MODEL_CONFIG.get("primary_model", "xgboost"),
    use_smote=MODEL_CONFIG.get("use_smote", True),
    random_state=MODEL_CONFIG.get("random_state", 42)
)

trainer.fit(X_train_transformed, y_train, tune_hyperparameters=False)
print("\nModel trained successfully!")

# Save model
trainer.save(MODEL_PATHS["model"])
print(f"Model saved to {MODEL_PATHS['model']}")
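
When SMOTE is skipped, the imbalance has to be handled through weighting instead. For gradient-boosted models, one widely used heuristic is to set `scale_pos_weight` (a parameter accepted by both `XGBClassifier` and `LGBMClassifier`) to the ratio of negative to positive training samples. The helper below is a hypothetical sketch of that calculation; its name is an assumption, and the project's own `XGBOOST_PARAMS` may already set this value.

```python
from collections import Counter


def negative_to_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) labels, usable as scale_pos_weight."""
    counts = Counter(int(label) for label in labels)
    positives = counts.get(1, 0)
    negatives = counts.get(0, 0)
    if positives == 0:
        raise ValueError("no positive samples: the ratio is undefined")
    return negatives / positives


# Example: 990 legitimate transactions and 10 fraudulent ones
labels = [0] * 990 + [1] * 10
weight = negative_to_positive_ratio(labels)
print(weight)  # prints 99.0
```

A value computed this way could be passed through the `params` argument of `_create_model`, for example `{'scale_pos_weight': weight}`.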

## Section 8: Model Evaluation
**Source File**: `src/evaluation/model_evaluator.py`

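Evaluate the trained model on the evaluation split: compute accuracy, precision, recall, F1, ROC-AUC, and PR-AUC, tune the decision threshold toward a target recall, and save diagnostic plots and a classification report.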
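
Threshold tuning toward a target recall can be illustrated without any libraries. The sketch below scans candidate thresholds in ascending order and keeps the highest one whose recall still meets the target, which usually gives the best precision available at that recall. The function name `threshold_for_recall` and the toy data are assumptions for illustration; the evaluator class in the next cell works from scikit-learn's `precision_recall_curve`.

```python
def threshold_for_recall(y_true, scores, target_recall=0.90):
    """Highest score threshold whose recall is still >= target_recall.

    Returns None if no threshold qualifies (only possible when
    target_recall is greater than 1).
    """
    positives = sum(1 for y in y_true if y == 1)
    if positives == 0:
        raise ValueError("y_true contains no positive samples")

    best = None
    for threshold in sorted(set(scores)):  # candidate cut-offs, ascending
        true_positives = sum(
            1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold
        )
        if true_positives / positives >= target_recall:
            best = threshold  # recall still meets the target, keep raising
        else:
            break  # recall can only fall further as the threshold rises
    return best


y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
print(threshold_for_recall(y_true, scores, target_recall=0.8))  # prints 0.35
```

At a threshold of 0.35, five of the six positives are still flagged (recall about 0.83); raising it to 0.4 would drop recall to about 0.67, below the target.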

In [None]:
# ============================================================================
# MODEL EVALUATION CLASS
# Source: src/evaluation/model_evaluator.py
# ============================================================================

class ModelEvaluator:
    """Comprehensive model evaluation with metrics and visualizations."""
    
    def __init__(self, model, preprocessor=None, optimize_threshold=True, target_recall=0.90):
        """Initialize evaluator."""
        self.model = model
        self.preprocessor = preprocessor
        self.optimize_threshold = optimize_threshold
        self.target_recall = target_recall
        self.optimal_threshold_ = None
        self.metrics_ = {}
    
    def evaluate(self, X, y_true, save_plots=True):
        """Comprehensive model evaluation."""
        print("Starting model evaluation")
        
        # Get predictions and probabilities
        y_pred = self.model.predict(X)
        y_proba = self.model.predict_proba(X)
        
        # If binary classification, get probabilities for positive class
        if y_proba.shape[1] == 2:
            y_proba_positive = y_proba[:, 1]
        else:
            y_proba_positive = y_proba[:, -1]
        
        # Calculate metrics with default threshold (0.5)
        metrics = self._calculate_metrics(y_true, y_pred, y_proba_positive)
        self.metrics_ = metrics
        
        # Optimize threshold if requested
        if self.optimize_threshold:
            optimal_threshold = self._optimize_threshold(y_true, y_proba_positive)
            self.optimal_threshold_ = optimal_threshold
            
            y_pred_optimal = (y_proba_positive >= optimal_threshold).astype(int)
            metrics_optimal = self._calculate_metrics(y_true, y_pred_optimal, y_proba_positive)
            metrics['optimal_threshold'] = optimal_threshold
            metrics['metrics_at_optimal_threshold'] = metrics_optimal
        
        # Generate visualizations
        if save_plots:
            self._generate_plots(X, y_true, y_pred, y_proba_positive, metrics)
        
        print("Model evaluation completed")
        return metrics
    
    def _calculate_metrics(self, y_true, y_pred, y_proba):
        """Calculate all evaluation metrics."""
        metrics = {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, zero_division=0),
            'recall': recall_score(y_true, y_pred, zero_division=0),
            'f1_score': f1_score(y_true, y_pred, zero_division=0),
            'roc_auc': roc_auc_score(y_true, y_proba),
            'pr_auc': average_precision_score(y_true, y_proba),
        }
        return metrics
    
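    def _optimize_threshold(self, y_true, y_proba):
        """Return the highest threshold whose recall still meets target_recall."""
        precision_vals, recall_vals, thresholds = precision_recall_curve(y_true, y_proba)
        
        # recall_vals has one more entry than thresholds and does not increase
        # as the threshold rises, so recall_vals[:-1] lines up with thresholds.
        meets_target = np.where(recall_vals[:-1] >= self.target_recall)[0]
        
        if meets_target.size > 0:
            # Last qualifying index = largest threshold that keeps recall >= target
            optimal_threshold = float(thresholds[meets_target[-1]])
        else:
            optimal_threshold = 0.5
        
        return optimal_threshold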
    
    def _generate_plots(self, X, y_true, y_pred, y_proba, metrics):
        """Generate evaluation plots."""
        reports_dir = DIRECTORIES["reports"]
        reports_dir.mkdir(parents=True, exist_ok=True)
        
        # 1. Confusion Matrix
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        cm = confusion_matrix(y_true, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
        axes[0, 0].set_title('Confusion Matrix')
        axes[0, 0].set_xlabel('Predicted')
        axes[0, 0].set_ylabel('Actual')
        
        cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Blues', ax=axes[0, 1])
        axes[0, 1].set_title('Normalized Confusion Matrix')
        axes[0, 1].set_xlabel('Predicted')
        axes[0, 1].set_ylabel('Actual')
        
        # 2. ROC Curve
        fpr, tpr, _ = roc_curve(y_true, y_proba)
        axes[1, 0].plot(fpr, tpr, label=f'ROC Curve (AUC = {metrics["roc_auc"]:.4f})')
        axes[1, 0].plot([0, 1], [0, 1], 'k--', label='Random')
        axes[1, 0].set_xlabel('False Positive Rate')
        axes[1, 0].set_ylabel('True Positive Rate')
        axes[1, 0].set_title('ROC Curve')
        axes[1, 0].legend()
        axes[1, 0].grid(True)
        
        # 3. Precision-Recall Curve
        precision, recall, _ = precision_recall_curve(y_true, y_proba)
        axes[1, 1].plot(recall, precision, label=f'PR Curve (AUC = {metrics["pr_auc"]:.4f})')
        axes[1, 1].set_xlabel('Recall')
        axes[1, 1].set_ylabel('Precision')
        axes[1, 1].set_title('Precision-Recall Curve')
        axes[1, 1].legend()
        axes[1, 1].grid(True)
        
        plt.tight_layout()
        plt.savefig(reports_dir / "evaluation_metrics.png", dpi=300, bbox_inches='tight')
        plt.show()
        plt.close()
        
        # 4. Feature Importance
        if hasattr(self.model, 'feature_importances_'):
            plt.figure(figsize=(12, 8))
            feature_importance = pd.DataFrame({
                'feature': X.columns,
                'importance': self.model.feature_importances_
            }).sort_values('importance', ascending=False).head(20)
            
            sns.barplot(data=feature_importance, y='feature', x='importance')
            plt.xlabel('Importance')
            plt.ylabel('Feature')
            plt.title('Top 20 Feature Importance')
            plt.tight_layout()
            plt.savefig(reports_dir / "feature_importance.png", dpi=300, bbox_inches='tight')
            plt.show()
            plt.close()
    
    def generate_classification_report(self, X, y_true, threshold=0.5):
        """Generate detailed classification report."""
        y_proba = self.model.predict_proba(X)
        
        if y_proba.shape[1] == 2:
            y_proba_positive = y_proba[:, 1]
        else:
            y_proba_positive = y_proba[:, -1]
        
        # Prefer the optimized threshold when available; otherwise honor the argument
        if self.optimal_threshold_ is not None:
            threshold = self.optimal_threshold_
        y_pred = (y_proba_positive >= threshold).astype(int)
        
        report = classification_report(y_true, y_pred, target_names=['Not Fraud', 'Fraud'])
        print(f"Classification Report (threshold={threshold:.4f}):")
        print(report)
        
        return report


# Evaluate the model
print("=" * 60)
print("MODEL EVALUATION")
print("=" * 60)
evaluator = ModelEvaluator(
    model=trainer.model,
    preprocessor=preprocessor,
    optimize_threshold=True,
    target_recall=0.90
)

metrics = evaluator.evaluate(X_eval_transformed, y_eval, save_plots=True)

print("\n" + "=" * 60)
print("EVALUATION METRICS")
print("=" * 60)
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1-Score: {metrics['f1_score']:.4f}")
print(f"ROC-AUC: {metrics['roc_auc']:.4f}")
print(f"PR-AUC: {metrics['pr_auc']:.4f}")

if 'optimal_threshold' in metrics:
    print(f"\nOptimal Threshold: {metrics['optimal_threshold']:.4f}")
    print("\nMetrics at Optimal Threshold:")
    opt_metrics = metrics['metrics_at_optimal_threshold']
    print(f"  Accuracy: {opt_metrics['accuracy']:.4f}")
    print(f"  Precision: {opt_metrics['precision']:.4f}")
    print(f"  Recall: {opt_metrics['recall']:.4f}")
    print(f"  F1-Score: {opt_metrics['f1_score']:.4f}")

# Generate classification report
evaluator.generate_classification_report(X_eval_transformed, y_eval)

## Section 9: Predictions
**Source File**: `scripts/predict.py`

Make predictions on new data using the trained model and preprocessor.
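
Besides the binary flag, the single-transaction helper in the next cell reports a confidence band derived from the fraud probability. The same mapping can be written as a lookup over sorted band edges, which keeps the cut-offs in one place. This is an illustrative alternative under the assumption that the band edges stay at 0.2, 0.4, 0.6, and 0.8; the names `BAND_EDGES`, `BAND_LABELS`, and `confidence_band` are hypothetical.

```python
from bisect import bisect_left

# Band edges and labels taken from the if/elif chain in predict_single
BAND_EDGES = [0.2, 0.4, 0.6, 0.8]
BAND_LABELS = [
    "high_legitimate",
    "medium_legitimate",
    "low_confidence",
    "medium_fraud",
    "high_fraud",
]


def confidence_band(probability):
    """Map a fraud probability to one of five confidence labels."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_left counts the edges strictly below the probability, which
    # reproduces the chain of `probability > edge` comparisons.
    return BAND_LABELS[bisect_left(BAND_EDGES, probability)]


print(confidence_band(0.95))  # prints high_fraud
print(confidence_band(0.8))   # prints medium_fraud
```

A probability that sits exactly on an edge falls into the lower band, matching the strict `>` comparisons used by `predict_single`.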

In [None]:
# ============================================================================
# PREDICTION FUNCTIONS
# Source: scripts/predict.py
# ============================================================================

def predict_batch(X, model, preprocessor, threshold=0.5):
    """
    Make batch predictions on new data.
    
    Args:
        X: Feature DataFrame
        model: Trained model
        preprocessor: Fitted preprocessor
        threshold: Classification threshold
    
    Returns:
        DataFrame with predictions and probabilities
    """
    print("Making predictions...")
    
    # Preprocess
    X_transformed = preprocessor.transform(X)
    
    # Predict
    predictions = model.predict(X_transformed)
    probabilities = model.predict_proba(X_transformed)
    
    # Create results dataframe
    results = X.copy()
    results['predicted_fraud'] = predictions
    results['fraud_probability'] = probabilities[:, 1] if probabilities.shape[1] > 1 else probabilities[:, 0]
    results['is_fraud'] = (results['fraud_probability'] >= threshold).astype(int)
    
    print(f"Predictions complete: {len(results)} transactions, {results['is_fraud'].sum()} flagged as fraud")
    
    return results


def predict_single(transaction_data, model, preprocessor, threshold=0.5):
    """
    Make prediction on a single transaction.
    
    Args:
        transaction_data: Dictionary with transaction features
        model: Trained model
        preprocessor: Fitted preprocessor
        threshold: Classification threshold
    
    Returns:
        Dictionary with prediction results
    """
    # Convert to DataFrame
    X = pd.DataFrame([transaction_data])
    
    # Preprocess
    X_transformed = preprocessor.transform(X)
    
    # Predict
    prediction = model.predict(X_transformed)[0]
    probability = model.predict_proba(X_transformed)[0, 1]
    
    # Determine confidence level
    if probability > 0.8:
        confidence = "high_fraud"
    elif probability > 0.6:
        confidence = "medium_fraud"
    elif probability > 0.4:
        confidence = "low_confidence"
    elif probability > 0.2:
        confidence = "medium_legitimate"
    else:
        confidence = "high_legitimate"
    
    result = {
        "prediction": int(prediction),
        "fraud_probability": float(probability),
        "is_fraud": bool(probability >= threshold),
        "confidence": confidence,
        "threshold_used": float(threshold)
    }
    
    return result


def analyze_predictions(predictions_df):
    """Analyze prediction results."""
    print("\nPrediction Analysis:")
    print(f"Total transactions: {len(predictions_df)}")
    print(f"Predicted fraud: {predictions_df['is_fraud'].sum()}")
    print(f"Fraud rate: {predictions_df['is_fraud'].mean()*100:.2f}%")
    
    # Probability distribution
    print("\nFraud Probability Distribution:")
    print(predictions_df['fraud_probability'].describe())
    
    # High-risk transactions
    high_risk = predictions_df[predictions_df['fraud_probability'] > 0.8]
    print(f"\nHigh-risk transactions (prob > 0.8): {len(high_risk)}")
    
    if len(high_risk) > 0:
        print("High-risk transaction characteristics:")
        print(high_risk[['amount', 'type', 'oldbalanceOrg', 'newbalanceOrig']].describe())


# Make predictions on test set
print("=" * 60)
print("MAKING PREDICTIONS")
print("=" * 60)

# The preprocessor applies feature engineering itself, so pass the raw test set
X_test_transformed = preprocessor.transform(X_test)

# Predictions
test_predictions = predict_batch(X_test, trainer.model, preprocessor, threshold=0.5)

print("\nSample predictions:")
print(test_predictions[['step', 'type', 'amount', 'fraud_probability', 'is_fraud']].head(10))

# Analyze predictions
analyze_predictions(test_predictions)

# Test single prediction
print("\n" + "=" * 40)
print("SINGLE PREDICTION EXAMPLE")
print("=" * 40)

sample_transaction = {
    'step': 100,
    'type': 'TRANSFER',
    'amount': 5000.0,
    'nameOrig': 'C123456789',
    'oldbalanceOrg': 10000.0,
    'newbalanceOrig': 5000.0,
    'nameDest': 'C987654321',
    'oldbalanceDest': 2000.0,
    'newbalanceDest': 7000.0
}

single_result = predict_single(
    sample_transaction, 
    trainer.model, 
    preprocessor, 
    threshold=evaluator.optimal_threshold_ or 0.5
)

print("Single transaction prediction:")
for key, value in single_result.items():
    print(f"  {key}: {value}")

## Section 10: Model Persistence
**Source Files**: `scripts/train_model.py`, `scripts/predict.py`

Save and load model artifacts for production use.
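
Because the model, preprocessor, and threshold are saved as separate files, it can be useful to confirm at load time that they still belong together. One possible approach, sketched below with the standard library only, is to write a small JSON manifest of SHA-256 checksums next to the artifacts and compare against it before loading. The helper names and the manifest layout are assumptions for illustration and are not part of this project or of joblib.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_manifest(artifact_paths, manifest_path):
    """Record one checksum per artifact so a later load can detect changes."""
    manifest = {Path(p).name: file_sha256(p) for p in artifact_paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def find_mismatches(manifest_path):
    """Return names of artifacts that are missing or whose checksum changed.

    Assumes the artifacts sit in the same directory as the manifest.
    """
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text())
    mismatched = []
    for name, expected in manifest.items():
        artifact = manifest_path.parent / name
        if not artifact.exists() or file_sha256(artifact) != expected:
            mismatched.append(name)
    return sorted(mismatched)
```

In this notebook the manifest could be written right after the artifacts are saved, and `find_mismatches` called before the load example; an empty list means every recorded file is present and unchanged.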

In [None]:
# ============================================================================
# SAVE MODEL ARTIFACTS
# Source: scripts/train_model.py
# ============================================================================

print("=" * 60)
print("SAVING MODEL ARTIFACTS")
print("=" * 60)

# Save preprocessor
preprocessor.save(MODEL_PATHS["preprocessor"])
print(f"✓ Preprocessor saved to {MODEL_PATHS['preprocessor']}")

# Save model
trainer.save(MODEL_PATHS["model"])
print(f"✓ Model saved to {MODEL_PATHS['model']}")

# Save feature names
joblib.dump(preprocessor.feature_names_, MODEL_PATHS["feature_names"])
print(f"✓ Feature names saved to {MODEL_PATHS['feature_names']}")

# Save optimal threshold
if evaluator.optimal_threshold_ is not None:
    threshold_path = DIRECTORIES["models"] / "optimal_threshold.pkl"
    joblib.dump(evaluator.optimal_threshold_, threshold_path)
    print(f"✓ Optimal threshold saved to {threshold_path}")

# Save evaluation metrics
metrics_path = DIRECTORIES["models"] / "evaluation_metrics.pkl"
joblib.dump(evaluator.metrics_, metrics_path)
print(f"✓ Evaluation metrics saved to {metrics_path}")

print("\nAll artifacts saved successfully!")

# ============================================================================
# LOAD MODEL ARTIFACTS (Example for production use)
# Source: scripts/predict.py
# ============================================================================

print("\n" + "=" * 60)
print("LOADING MODEL ARTIFACTS (Example)")
print("=" * 60)

# Load preprocessor
loaded_preprocessor = FraudDetectionPreprocessor.load(MODEL_PATHS["preprocessor"])
print("✓ Preprocessor loaded")

# Load model
loaded_model = FraudDetectionModelTrainer.load(MODEL_PATHS["model"])
print("✓ Model loaded")

# Load feature names
loaded_feature_names = joblib.load(MODEL_PATHS["feature_names"])
print(f"✓ Feature names loaded ({len(loaded_feature_names)} features)")

# Load optimal threshold
if (DIRECTORIES["models"] / "optimal_threshold.pkl").exists():
    loaded_threshold = joblib.load(DIRECTORIES["models"] / "optimal_threshold.pkl")
    print(f"✓ Optimal threshold loaded: {loaded_threshold:.4f}")
else:
    loaded_threshold = 0.5
    print("Using default threshold: 0.5")

print("\nAll artifacts loaded successfully!")

# Verify loaded model works
test_sample = X_test.iloc[:5]
predictions_sample = predict_batch(test_sample, loaded_model, loaded_preprocessor, threshold=loaded_threshold)
print("\nTest prediction with loaded model:")
print(predictions_sample[['step', 'type', 'amount', 'fraud_probability', 'is_fraud']])

## Section 11: API Deployment
**Source Files**: `api/app.py`, `api/predict_endpoint.py`

Flask API structure for production deployment (code reference only).
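
The `/predict` endpoint in the reference code checks only that each required field is present. A slightly stricter check, sketched below, also verifies basic types before the payload reaches the preprocessor. The field names come from the endpoint's `required_fields` list; the expected types and the helper name `validate_transaction` are assumptions for illustration.

```python
NUMERIC = (int, float)

# Field names mirror required_fields in the API code; the types are assumed
EXPECTED_TYPES = {
    "step": int,
    "type": str,
    "amount": NUMERIC,
    "nameOrig": str,
    "oldbalanceOrg": NUMERIC,
    "newbalanceOrig": NUMERIC,
    "nameDest": str,
    "oldbalanceDest": NUMERIC,
    "newbalanceDest": NUMERIC,
}


def validate_transaction(payload):
    """Return a list of problems; an empty list means the payload looks usable."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
            continue
        value = payload[field]
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems
```

An endpoint could return a 400 response with this list whenever it is non-empty, which also covers the case where the request body is not valid JSON and the parsed payload is `None`.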

In [None]:
# ============================================================================
# FLASK API CODE (Reference Only - Not Executable in Notebook)
# Source: api/app.py, api/predict_endpoint.py
# ============================================================================

flask_api_code = '''
# api/app.py
from flask import Flask, request, jsonify
from flask_cors import CORS
import sys
from pathlib import Path
import joblib
import pandas as pd

# Add project root to path
sys.path.insert(0, str(Path(__file__).parent.parent))

from config.config import MODEL_PATHS, DIRECTORIES
from src.preprocessing.preprocessor import FraudDetectionPreprocessor
from src.models.model_trainer import FraudDetectionModelTrainer
from api.predict_endpoint import predict_single, predict_batch

app = Flask(__name__)
CORS(app)

# Load model artifacts at startup
try:
    preprocessor = FraudDetectionPreprocessor.load(MODEL_PATHS["preprocessor"])
    model = FraudDetectionModelTrainer.load(MODEL_PATHS["model"])
    
    # Load optimal threshold if available
    threshold_path = DIRECTORIES["models"] / "optimal_threshold.pkl"
    if threshold_path.exists():
        optimal_threshold = joblib.load(threshold_path)
    else:
        optimal_threshold = 0.5
    
    print("✓ Model artifacts loaded successfully")
    
except Exception as e:
    print(f"✗ Error loading model artifacts: {str(e)}")
    preprocessor = None
    model = None
    optimal_threshold = 0.5


@app.route('/health', methods=['GET'])
def health():
    """Health check endpoint."""
    status = "healthy" if model is not None else "unhealthy"
    return jsonify({
        "status": status,
        "model_loaded": model is not None,
        "preprocessor_loaded": preprocessor is not None
    }), 200 if status == "healthy" else 503


@app.route('/predict', methods=['POST'])
def predict_single_endpoint():
    """Predict single transaction (named so it does not shadow the predict_single helper)."""
    if model is None or preprocessor is None:
        return jsonify({"error": "Model not loaded"}), 503
    
    try:
        data = request.get_json()
        
        # Validate required fields
        required_fields = ['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 
                          'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest']
        
        missing_fields = [field for field in required_fields if field not in data]
        if missing_fields:
            return jsonify({"error": f"Missing fields: {missing_fields}"}), 400
        
        # Make prediction
        result = predict_single(data, model, preprocessor, optimal_threshold)
        
        return jsonify({
            "transaction_id": data.get('step'),
            "prediction": result["prediction"],
            "is_fraud": result["is_fraud"],
            "fraud_probability": result["fraud_probability"],
            "confidence": result["confidence"],
            "threshold_used": result["threshold_used"]
        }), 200
        
    except Exception as e:
        return jsonify({"error": str(e)}), 500


@app.route('/predict_batch', methods=['POST'])
def predict_batch_endpoint():
    """Predict multiple transactions."""
    if model is None or preprocessor is None:
        return jsonify({"error": "Model not loaded"}), 503
    
    try:
        data = request.get_json()
        transactions = data.get('transactions', [])
        
        if not transactions:
            return jsonify({"error": "No transactions provided"}), 400
        
        # Convert to DataFrame
        df = pd.DataFrame(transactions)
        
        # Make predictions
        results = predict_batch(df, model, preprocessor, optimal_threshold)
        
        # Convert to list of dicts
        predictions = results.to_dict('records')
        
        return jsonify({
            "count": len(predictions),
            "predictions": predictions
        }), 200
        
    except Exception as e:
        return jsonify({"error": str(e)}), 500


@app.route('/model_info', methods=['GET'])
def model_info():
    """Get model information."""
    if model is None:
        return jsonify({"error": "Model not loaded"}), 503
    
    return jsonify({
        "model_type": "XGBoost",
        "version": "1.0",
        "threshold": optimal_threshold,
        "features_count": len(preprocessor.feature_names_) if preprocessor else 0,
        "target_recall": 0.90
    })


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000, debug=False)


# api/predict_endpoint.py (helper functions)
import pandas as pd
def predict_single(transaction_data, model, preprocessor, threshold=0.5):
    """Make prediction on a single transaction."""
    # Convert to DataFrame
    X = pd.DataFrame([transaction_data])
    
    # Preprocess
    X_transformed = preprocessor.transform(X)
    
    # Predict
    prediction = model.predict(X_transformed)[0]
    probability = model.predict_proba(X_transformed)[0, 1]
    
    # Determine confidence level
    if probability > 0.8:
        confidence = "high_fraud"
    elif probability > 0.6:
        confidence = "medium_fraud"
    elif probability > 0.4:
        confidence = "low_confidence"
    elif probability > 0.2:
        confidence = "medium_legitimate"
    else:
        confidence = "high_legitimate"
    
    return {
        "prediction": int(prediction),
        "fraud_probability": float(probability),
        "is_fraud": bool(probability >= threshold),
        "confidence": confidence,
        "threshold_used": float(threshold)
    }


def predict_batch(df, model, preprocessor, threshold=0.5):
    """Make batch predictions."""
    # Preprocess
    X_transformed = preprocessor.transform(df)
    
    # Predict
    predictions = model.predict(X_transformed)
    probabilities = model.predict_proba(X_transformed)
    
    # Create results
    results = df.copy()
    results['predicted_fraud'] = predictions
    results['fraud_probability'] = probabilities[:, 1]
    results['is_fraud'] = (results['fraud_probability'] >= threshold).astype(int)
    
    return results
'''

print("Flask API code structure:")
print("\n📁 API Files (reference only):")
print("  - api/app.py: Main Flask application")
print("  - api/predict_endpoint.py: Prediction helper functions")

print("\n🚀 API Endpoints:")
print("  - GET /health: Health check")
print("  - POST /predict: Single transaction prediction")
print("  - POST /predict_batch: Batch predictions")
print("  - GET /model_info: Model information")

print("\n💡 Usage Example:")
print("""
# Single prediction
import requests

response = requests.post('http://localhost:5000/predict', json={
    "step": 1,
    "type": "TRANSFER",
    "amount": 181.0,
    "nameOrig": "C123456789",
    "oldbalanceOrg": 181.0,
    "newbalanceOrig": 0.0,
    "nameDest": "C987654321",
    "oldbalanceDest": 0.0,
    "newbalanceDest": 181.0
})

result = response.json()
print(f"Fraud probability: {result['fraud_probability']:.4f}")
print(f"Is fraud: {result['is_fraud']}")
""")

print("\n🔧 To run the API:")
print("1. Save model artifacts (completed above)")
print("2. Install Flask: pip install flask flask-cors")
print("3. Run: python api/app.py")
print("4. API will be available at http://localhost:5000")

## Summary

This notebook demonstrates the complete end-to-end ML pipeline for fraud detection:

### ✅ **Completed Workflow Sections:**

1. **Configuration Setup** - Project paths, hyperparameters, and settings
2. **Data Ingestion** - SQLite database loading with chunking for large datasets
3. **Exploratory Data Analysis** - Comprehensive data understanding and visualization
4. **Data Splitting** - Train/eval/test splits preserving temporal order
5. **Feature Engineering** - Balance, transaction, time, and account features
6. **Data Preprocessing** - Sklearn-compatible pipeline with encoding and scaling
7. **Model Training** - XGBoost with SMOTE (falling back to class weights) for class imbalance handling
8. **Model Evaluation** - Comprehensive metrics, visualizations, and threshold optimization
9. **Predictions** - Batch and single transaction predictions
10. **Model Persistence** - Save and load model artifacts for production
11. **API Deployment** - Flask API structure for real-time predictions

### 📁 **Project Structure Understanding:**

- **`config/`** - Centralized configuration management
- **`src/data/`** - Data loading and splitting utilities
- **`src/preprocessing/`** - Feature engineering and preprocessing pipeline
- **`src/models/`** - Model training with class imbalance handling
- **`src/evaluation/`** - Comprehensive model evaluation and metrics
- **`scripts/`** - Standalone training and prediction scripts
- **`api/`** - Flask API for production deployment
- **`models/`** - Saved model artifacts and preprocessor
- **`reports/`** - Evaluation reports, plots, and visualizations
- **`notebooks/`** - Analysis and workflow notebooks

### 🎯 **Key Features Implemented:**

- **Class Imbalance Handling**: SMOTE with fallback to class weights
- **Feature Engineering**: Balance ratios, time features, transaction patterns
- **Threshold Optimization**: Decision threshold tuned to reach a target recall on fraud cases
- **Comprehensive Evaluation**: ROC-AUC, PR-AUC, confusion matrices, feature importance
- **Production Ready**: Model persistence, API endpoints, batch processing
- **Scalable Design**: Chunked data loading, sklearn-compatible transformers
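
The threshold optimization listed above can be illustrated with a minimal, dependency-free sketch. It assumes the same decision rule as the API helpers (`fraud_probability >= threshold`). The function names `threshold_for_target_recall` and `recall_and_precision`, the 0.75 target, and the toy labels and scores are hypothetical and are not taken from `src/evaluation/`.

```python
import math


def threshold_for_target_recall(y_true, y_prob, target_recall=0.90):
    """Return the highest threshold t for which flagging prob >= t reaches target_recall."""
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    # Scores of the actual fraud cases, highest first.
    fraud_scores = sorted(
        (p for y, p in zip(y_true, y_prob) if y == 1), reverse=True
    )
    if not fraud_scores:
        raise ValueError("y_true contains no fraud cases")
    # Number of fraud cases that must be flagged to reach the target recall.
    needed = math.ceil(target_recall * len(fraud_scores))
    # Using the needed-th highest fraud score as the cut-off flags at least
    # `needed` fraud cases, and any higher cut-off would flag fewer.
    return fraud_scores[needed - 1]


def recall_and_precision(y_true, y_prob, threshold):
    """Compute recall and precision for the rule prob >= threshold."""
    flagged = [y for y, p in zip(y_true, y_prob) if p >= threshold]
    true_positives = sum(flagged)
    total_fraud = sum(y_true)
    recall = true_positives / total_fraud if total_fraud else 0.0
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Toy validation set (illustrative values only): 1 = fraud, 0 = legitimate.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.20, 0.35, 0.40, 0.60, 0.80, 0.10, 0.95]

threshold = threshold_for_target_recall(y_true, y_prob, target_recall=0.75)
recall, precision = recall_and_precision(y_true, y_prob, threshold)
print(f"threshold={threshold:.2f} recall={recall:.2f} precision={precision:.2f}")
# prints: threshold=0.60 recall=0.75 precision=1.00
```

Choosing the highest threshold that still meets the recall target keeps precision as high as possible for that recall. The threshold should be selected on the evaluation split and only confirmed on the test split, so the reported test metrics are not biased by the selection.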

### 🚀 **Next Steps for Production:**

1. **Hyperparameter Tuning**: Use Optuna for automated optimization
2. **Model Monitoring**: Implement drift detection and performance tracking
3. **A/B Testing**: Compare multiple models in production
4. **Real-time Processing**: Implement streaming data processing
5. **Explainability**: Add SHAP values for model interpretation
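
As one possible starting point for the drift detection step, the sketch below computes the Population Stability Index (PSI) between a training-time sample of a feature and a recent production sample. The notebook does not prescribe a drift metric, so PSI is an assumption here, as are the function name `population_stability_index`, the bin count, and the `eps` floor. The 0.1 and 0.25 cut-offs mentioned in the comments are a commonly cited rule of thumb, not values from this project.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a new sample (actual).

    Bins are equal-width over the range of the baseline sample.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline sample has no spread; cannot build bins")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative data only: a uniform baseline and a sample shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print(f"PSI (same data): {population_stability_index(baseline, baseline):.4f}")
print(f"PSI (shifted):   {population_stability_index(baseline, shifted):.4f}")
# Rule of thumb: below 0.1 is usually read as stable, above 0.25 as a
# significant shift that warrants investigation or retraining.
```

In practice the same function could be applied per feature (for example to `amount`) and to the model's `fraud_probability` output on a schedule, with alerts raised when the index crosses the chosen cut-off.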

This consolidated workflow provides a foundation for fraud detection that can be extended with the steps above and deployed through the Flask API.