# **Model Selection & Evaluation**

* Establish baselines; connect metrics to business trade-offs.

**Note:** This modelling uses a processed stratified sample (100k rows) of a larger dataset (1M rows) to ensure efficient training while maintaining the fraud class balance. Details are provided in 01_ETL.ipynb.

## Inputs

* Processed dataset data/processed/card_transdata_processed.csv (derived from 100k stratified sample)

## Outputs

* Baseline metrics, plots, and a predictions CSV for later comparison.



---

# Change working directory

I need to change the working directory from the current folder to its parent folder, because the notebook is being run from inside the jupyter notebooks subfolder.  
* I access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Section 1: Quick Load and Check of Dataframe

Quick steps to:
- Load processed file
- Check Dataframe Shape is as expected (100000, 17)
- Display first 5 rows to check loads as expected.


In [None]:
# =============================================================================
# Import all libraries needed for the notebook
# =============================================================================

# Core data manipulation and path handling
import pandas as pd  # Data manipulation and analysis
import numpy as np    # Numerical operations and array handling
from pathlib import Path  # Cross-platform file path handling

# Data visualisation
import matplotlib.pyplot as plt  # Static plotting
import seaborn as sns # Enhanced visual styling

# Display tools
from IPython.display import display  # Pretty display of DataFrames in Jupyter

# =============================================================================
# Machine Learning: Model preparation and evaluation
# =============================================================================
from sklearn.model_selection import train_test_split, GridSearchCV  # Data splitting and tuning
from sklearn.preprocessing import StandardScaler  # Feature scaling for numeric models
from sklearn.linear_model import LogisticRegression  # Linear baseline classifier
from sklearn.tree import DecisionTreeClassifier  # Simple tree model for interpretability
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier  # Ensemble models
from xgboost import XGBClassifier  # Gradient boosting with high performance
from imblearn.over_sampling import SMOTE


# =============================================================================
# Model evaluation metrics
# =============================================================================
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,  # Core classification metrics
    precision_recall_curve, average_precision_score,  # PR curve metrics (preferred for imbalance)
    roc_auc_score, roc_curve, confusion_matrix,  # ROC and confusion matrix
    classification_report  # Summary table
)

# =============================================================================
# Load processed dataset (created from 100k stratified sample in 01_ETL.ipynb)
# Display the shape of the dataframe

df = pd.read_csv("data/processed/card_transdata_processed.csv") # Load processed data

df.shape # (rows, columns)


In [None]:
df.head() # Display first few rows of the dataframe

# Section 1.1: Data Validation and Preparation

## Class Balance Verification

**Purpose:** Verify that the dataset maintains the expected fraud rate (8.74%) from the stratified sampling performed in the ETL pipeline.

**Expected Outcome:**
- Fraud rate: 8.74% 
- Non-fraud rate: 91.26%
- Imbalance ratio: approximately 1:10.4 (fraud to non-fraud)

## Feature and Target Split

**Purpose:** Prepare the dataset for modeling by separating features (X) and target variable (y).

**Dataset Composition:**
The dataset contains 17 columns total, split into features and target:

**Excluded Features from Modelling:**

1. **fraud** - Target variable (what we're predicting)

2. **online_chip_category** - String categorical variable
   - Contains labels: "online_no_chip", "online_chip", "offline_no_chip", "offline_chip"
   - Redundant: underlying binary features (online_order and used_chip) already included
   - Created for hypothesis testing visualisation (H8) only

3. **Binned Variables** - Created for EDA visualisation only:
   - log_distance_from_home_bin - Categorical distance ranges
   - log_purchase_price_bin - Categorical price ratio ranges  
   - log_distance_from_last_transaction_bin - Categorical transaction distance ranges
   
   **Why excluded:** Binning loses information; continuous and log-transformed versions provide superior predictive power for models

**Features Retained**

**Original Features:**
- distance_from_home - Geographic distance from customer's home address (km)
- distance_from_last_transaction - Distance from previous transaction location (km)
- ratio_to_median_purchase_price - Current purchase relative to customer's median spending
- repeat_retailer - Whether customer previously transacted with retailer (binary: 0/1)
- used_chip - Chip card authentication used (binary: 0/1)
- used_pin_number - PIN verification used (binary: 0/1)
- online_order - Transaction channel: online vs in-store (binary: 0/1)

**Log-Transformed Features:**
- log_distance_from_home - Reduces right skew in distance distribution
- log_distance_from_last_transaction - Handles extreme distance outliers
- log_ratio_to_median_purchase_price - Normalises purchase ratio distribution

*Rationale: Both original and log-transformed versions retained. Tree-based models (Random Forest, XGBoost) can select the most predictive representation through natural feature selection.*

**Engineered Interaction Features:**
- online_high_distance - Binary flag: online transaction far from home (combines channel + distance risk)
- online_and_chip - Binary flag: online transaction using chip authentication

**Output:**
- **X**: Feature matrix (n_samples × 12 features) - all numeric predictors
- **y**: Binary target vector (n_samples) - fraud indicator (0 = legitimate, 1 = fraud)

**Feature Strategy:**
This approach balances information richness with model efficiency:
- Retains both raw and transformed features for model flexibility
- Excludes redundant encoded categories
- Removes low-information binned variables
- Total of 12 features provides sufficient signal while keeping overfitting risk low

In [None]:
# =============================================================================
# DATA VALIDATION AND PREPARATION
# =============================================================================
# Validates class balance and prepares features for modeling
# Ensures dataset integrity before model training begins
# =============================================================================

# ─────────────────────────────────────────────────────────────────────────────
# Class Balance Verification
# ─────────────────────────────────────────────────────────────────────────────
# Confirm fraud rate matches expected 8.74% from stratified sampling in ETL
# Any mismatch indicates potential data loading or processing errors

# Calculate fraud statistics
fraud_count = int(df['fraud'].sum())        # Total number of fraudulent transactions
fraud_rate = df['fraud'].mean()             # Proportion of fraud (0 to 1)
total_rows = df.shape[0]                    # Total number of transactions

# Display fraud distribution clearly
print(f"Fraud rate: {fraud_rate:.2%} ({fraud_count:,} fraud cases out of {total_rows:,} total transactions)")

# ─────────────────────────────────────────────────────────────────────────────
# Verify against expected rate from ETL pipeline
# ─────────────────────────────────────────────────────────────────────────────
# Expected rate comes from sample_log.json created during stratified sampling
expected_fraud_rate = 0.0874  # 8.74% fraud rate from ETL stratified sampling
rate_difference = abs(fraud_rate - expected_fraud_rate)  # Absolute difference

# Check if observed rate matches expected (allow tiny floating point errors)
if rate_difference < 0.0001:
    print(f"✓ Class balance verified: matches expected rate ({expected_fraud_rate:.2%})")
else:
    print(f"Issue: Fraud rate {fraud_rate:.4%} differs from expected {expected_fraud_rate:.2%}")
    print(f"  Difference: {rate_difference:.4%}")

# Calculate and display imbalance ratio
imbalance_ratio = (1 - fraud_rate) / fraud_rate  # Ratio of non-fraud to fraud
print(f"Imbalance ratio: 1:{imbalance_ratio:.1f} (fraud : non-fraud)")

print("\n" + "="*90 + "\n")

# ─────────────────────────────────────────────────────────────────────────────
# Feature and Target Split
# ─────────────────────────────────────────────────────────────────────────────
# Separate features (X) from target variable (y)
# Exclude features that cannot be used for modeling

target = "fraud"  # Target variable name

# Define columns to exclude from feature matrix
# These are either:
# 1. The target variable itself
# 2. String categorical variables created for EDA visualization
# 3. Binned variables created for hypothesis testing/visualization only
exclude_cols = [
    target,                                      # Target variable (what we're predicting)
    'online_chip_category',                      # String labels: "online_no_chip", "online_chip", etc.
                                                 # (underlying features online_order + used_chip already included)
    'log_distance_from_home_bin',                # Categorical bins for EDA (continuous log_distance_from_home included)
    'log_purchase_price_bin',                    # Categorical bins for EDA (continuous log_ratio_to_median_purchase_price included)
    'log_distance_from_last_transaction_bin'     # Categorical bins for EDA (continuous log_distance_from_last_transaction included)
]

# Create feature matrix X (drop excluded columns if they exist)
X = df.drop(columns=[col for col in exclude_cols if col in df.columns])

# Create target vector y (ensure binary encoding: 0 or 1)
y = df[target].astype(int)  # 0 = legitimate transaction, 1 = fraudulent transaction

# Display shape and confirmation
print(f"Feature matrix (X) shape: {X.shape[0]:,} samples x {X.shape[1]} features")
print(f"Target vector (y) shape: {y.shape[0]:,} samples")
print(f"\nTarget distribution:")
print(f"  Class 0 (legitimate): {(y == 0).sum():,} ({(y == 0).mean():.2%})")
print(f"  Class 1 (fraud):      {(y == 1).sum():,} ({(y == 1).mean():.2%})")

# ─────────────────────────────────────────────────────────────────────────────
# Display features included in modeling
# ─────────────────────────────────────────────────────────────────────────────
print(f"\n{'='*90}")
print("Features Included in Modelling:")
print(f"{'='*90}\n")

print("Original Features (7 from dataset):")
print("  distance_from_home")
print("  distance_from_last_transaction")
print("  ratio_to_median_purchase_price")
print("  repeat_retailer")
print("  used_chip")
print("  used_pin_number")
print("  online_order")

print("\nLog-Transformed Features (3 engineered for skew reduction):")
print("  log_distance_from_home")
print("  log_distance_from_last_transaction")
print("  log_ratio_to_median_purchase_price")

print("\nInteraction Features (2 engineered for domain insights):")
print("  online_high_distance")
print("  online_and_chip")

print(f"\nTotal: {X.shape[1]} features")
print(f"\n{'='*90}\n")

---

# Section 2: Data Splitting and Class Balance

**Goal:** 
- Divide the dataset into three stratified subsets for model development, hyperparameter tuning, and final evaluation. This approach prevents data leakage and provides unbiased performance estimates.

**Approach:**
- Test set: 20% of the data, held back for final unbiased evaluation.
- Validation set: 20% of the data, used for hyperparameter tuning and threshold selection.
- Training set: 60% of the data, used to fit the models.

Splitting was done in two steps because the test set must be isolated first to prevent any information leakage during model development. If all three sets were split simultaneously, there would be a risk of inadvertently using test set characteristics during validation or training decisions.

Stratification is required to preserve the class distribution across all splits, so that each subset is representative of the full dataset. Without stratification, random chance could create splits with, for example, 7% fraud in training and 10% in test. This would make validation metrics unreliable predictors of test performance. 
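
To illustrate the effect of stratification, here is a minimal sketch that splits a synthetic label vector with and without the stratify argument and compares the resulting fraud rates. The data, the names X_demo and y_demo, and the 10,000-row size are assumptions made for illustration only; this is not the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels only (assumption): 10,000 rows with 8.74% positives
n_rows = 10_000
n_fraud = 874
y_demo = np.zeros(n_rows, dtype=int)
y_demo[:n_fraud] = 1
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature column

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Unstratified split: class proportions are left to chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print(f"Stratified   train/test fraud rate: {y_tr_s.mean():.4f} / {y_te_s.mean():.4f}")
print(f"Unstratified train/test fraud rate: {y_tr_r.mean():.4f} / {y_te_r.mean():.4f}")
```

With stratification, both parts stay within rounding distance of 8.74%. The unstratified rates can drift, and the drift tends to be larger on smaller samples.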

**Results of Split:**
- Train: 60,000 rows (60%)
- Validation: 20,000 rows (20%)
- Test: 20,000 rows (20%)
- Fraud prevalence is 8.7% across all splits, consistent with the overall sampled dataset.

**Insights:**
- The class imbalance is preserved, meaning that validation and test performance will reflect the same challenge as the overall data.

In [None]:
# =============================================================================
# Train - Validate - Test Split
# =============================================================================
# Split dataset into three stratified subsets for model development and evaluation
# Target split: 60% train / 20% validation / 20% test
# Stratification preserves 8.74% fraud rate across all splits
# =============================================================================

# ─────────────────────────────────────────────────────────────────────────────
# Split Strategy: Two-Step Process
# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Hold out test set first (20%)
#   - Test set remains completely untouched until final evaluation
#   - Provides unbiased estimate of model performance on unseen data
#   - No information leakage from training or validation
#
# Step 2: Split remaining data into train (60%) and validation (20%)
#   - Training set: Used to fit model parameters
#   - Validation set: Used for hyperparameter tuning and threshold optimisation
#   - Keeps test set isolated throughout entire development process
# ─────────────────────────────────────────────────────────────────────────────

# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Hold Out Test Set (20% of total data)
# ─────────────────────────────────────────────────────────────────────────────
# Split off 20% for final testing, keeping 80% for train/validation split
# Test set is locked away and will not be touched until final model evaluation

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y,
    test_size=0.20,          # Reserve 20% of total data for final testing
    stratify=y,              # Maintain 8.74% fraud rate in both temp and test sets
    random_state=42          # Ensures reproducibility across runs
)

# ─────────────────────────────────────────────────────────────────────────────
# Step 2: Split Remaining 80% into Train (60%) and Validation (20%)
# ─────────────────────────────────────────────────────────────────────────────
# From the 80% remaining after test split:
# - 75% of temp becomes Training (0.75 × 0.80 = 0.60 of original)
# - 25% of temp becomes Validation (0.25 × 0.80 = 0.20 of original)
# Result: 60% train, 20% validation, 20% test (from Step 1)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=0.25,          # 25% of remaining 80% = 20% of original dataset
    stratify=y_temp,         # Maintain fraud rate in both train and validation sets
    random_state=42          # Same seed ensures consistent splits
)

# ─────────────────────────────────────────────────────────────────────────────
# Verify Split Sizes
# ─────────────────────────────────────────────────────────────────────────────
# Confirm that split proportions match expected 60/20/20 distribution

print("Shapes:")
print("  Train:", X_train.shape)  # Expected: approximately 60,000 rows
print("  Valid:", X_val.shape)    # Expected: approximately 20,000 rows
print("  Test :", X_test.shape)   # Expected: approximately 20,000 rows

# ─────────────────────────────────────────────────────────────────────────────
# Verify Stratification: Fraud Rate Consistency
# ─────────────────────────────────────────────────────────────────────────────
# Confirm that fraud prevalence is consistent across all splits
# All splits should maintain approximately 8.74% fraud rate

# Verify fraud rate is consistent across all splits (should all be 8.7%)
for name, yt in [("Overall", y), ("Train", y_train), ("Valid", y_val), ("Test", y_test)]:
    print(f"{name} fraud prevalence: {yt.mean()*100:.3f}%")

# Section 2.1: Class Balance Strategy

**The Challenge**

Fraud represents only 8.74% of transactions (8,740 fraud cases out of 100,000 total transactions). This class imbalance presents a significant modelling challenge:

Without adjustment, machine learning models optimise for overall accuracy by favouring the majority class. A baseline model could achieve 91.3% accuracy by simply predicting "not fraud" for every transaction, whilst failing to detect any actual fraud cases. This renders the model useless for fraud detection despite appearing highly accurate.

Standard loss functions treat all misclassification errors equally. With 91.3% non-fraud cases, the model minimises overall error by predicting the majority class, treating rare fraud cases as statistical noise rather than critical signals.
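
The accuracy problem described above can be checked with a few lines of arithmetic. This sketch rebuilds only the label counts quoted in this section (8,740 fraud cases in 100,000 rows) and scores a hypothetical "model" that predicts not-fraud for every transaction. The variable names are illustrative.

```python
import numpy as np

# Counts taken from the figures quoted above: 8,740 fraud cases in 100,000 rows
n_total = 100_000
n_fraud = 8_740

y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A hypothetical "model" that predicts not-fraud for every transaction
y_pred_all_negative = np.zeros(n_total, dtype=int)

accuracy = (y_true == y_pred_all_negative).mean()
true_positives = int(((y_true == 1) & (y_pred_all_negative == 1)).sum())
recall = true_positives / n_fraud

print(f"Accuracy: {accuracy:.2%}")  # 91.26%
print(f"Recall:   {recall:.2%}")    # 0.00%
```

Accuracy looks strong while recall is zero, which is why accuracy is not used as the primary metric here.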

**Objective**

Test and compare three approaches to handling class imbalance, evaluating their effectiveness for fraud detection:

1. No weighting (baseline)
2. Class weighting (class_weight="balanced")
3. SMOTE oversampling

This systematic comparison should:
- Demonstrate the impact of class imbalance on model performance
- Identify the optimal approach for this dataset's moderate imbalance (8.74% fraud rate)
- Provide empirical evidence for modelling decisions

**Approach Selection Rationale**

**Primary Approach: Class Weighting**
- Selected as primary due to:
1. Moderate imbalance: At 8.74% fraud (1:10.4 ratio), the imbalance is significant but not extreme
2. Computational efficiency: Class weighting requires no data augmentation or resampling
3. Preserves data distribution: Works with original samples rather than synthetic data
4. Standard industry practice: Widely used and well-understood approach
5. Model-native implementation: Built into sklearn models, ensuring proper integration

**How it works:**
- Assigns higher penalty to misclassifying minority class (fraud)
- class_weight="balanced" automatically calculates: weight = n_samples / (n_classes × n_samples_per_class)
- For fraud: weight ≈ 60,000 / (2 × 5,244) ≈ 5.72
- For non-fraud: weight ≈ 60,000 / (2 × 54,756) ≈ 0.55
- **Relative penalty:** Fraud errors cost approximately 10× more than non-fraud errors (5.72 / 0.55 ≈ 10.4)
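
The weight arithmetic above can be reproduced directly from the formula. The sketch below is a plain-Python version using the training-set counts quoted in this section; the helper name balanced_class_weights is hypothetical and is not part of scikit-learn.

```python
def balanced_class_weights(class_counts):
    """Return {label: weight} using n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Training-set counts quoted above (60,000 rows at 8.74% fraud)
train_counts = {0: 54_756, 1: 5_244}
weights = balanced_class_weights(train_counts)

print(f"Non-fraud weight: {weights[0]:.2f}")                   # 0.55
print(f"Fraud weight:     {weights[1]:.2f}")                   # 5.72
print(f"Relative penalty: {weights[1] / weights[0]:.1f}x")     # 10.4x
```

The same values should be obtainable from sklearn.utils.class_weight.compute_class_weight with class_weight="balanced" applied to y_train, which is a convenient cross-check against the actual split.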

**Secondary Approach: SMOTE Oversampling**
- Synthetic Minority Oversampling Technique (SMOTE), selected as secondary due to:
1. More appropriate for extreme imbalance: Most beneficial when fraud <2% of data
2. Creates synthetic data: May not generalise as well as real samples
3. Computational overhead: Increases training set size and fitting time
4. Risk of overfitting: Synthetic samples might not represent true fraud patterns
5. Better suited for severe imbalance (when insufficient minority class examples exist)

**How it works:**
- Generates synthetic fraud cases by interpolating between existing fraud samples
- Creates new samples along line segments connecting k-nearest fraud neighbours
- Balances training set to 50:50 or custom ratio
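
The interpolation step can be sketched in NumPy as follows. This is a simplified illustration of the idea, not the imbalanced-learn implementation: the function name smote_like_samples, the toy points in X_fraud_demo, and the parameter defaults are all assumptions.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating towards a random k-nearest neighbour."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot use more neighbours than there are other samples

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per row

    base_idx = rng.integers(0, n, size=n_new)   # starting sample for each new row
    nb_col = rng.integers(0, k, size=n_new)     # which of its neighbours to use
    nb_idx = neighbours[base_idx, nb_col]
    gap = rng.random((n_new, 1))                # position along the segment, in [0, 1)

    # New point = base + gap * (neighbour - base)
    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])

# Toy minority-class points (illustrative only)
X_fraud_demo = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [2, 2]])
synthetic = smote_like_samples(X_fraud_demo, n_new=4, k=3)
print(synthetic)
```

Every synthetic row lies on a line segment between two real minority samples, so it cannot fall outside the range the existing fraud cases already cover.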

**Expected Outcomes**

Test 1: No Weighting (Baseline)
- High overall accuracy (≈91%) from predicting majority class
- Poor fraud detection: Low recall, likelihood of many missed fraud cases
- Demonstrates the problem I am trying to solve

Test 2: Class Weighting
- Improved fraud detection: Higher recall at cost of more false positives
- Better balance between catching fraud and minimising false alarms
- Expected to be optimal for this moderate imbalance

Test 3: SMOTE
- Expected to perform similarly to (or slightly better than) class weighting
- May show diminishing returns compared to class weighting
- Testing to validate if class weighting is sufficient for this imbalance level

**Success Criteria**

The aim of these tests is to identify the optimal model. This will be one that:
1. Maximises fraud detection (recall) whilst maintaining acceptable precision
2. Achieves high PR AUC (primary metric for imbalanced data)
3. Provides actionable probability thresholds for business decisions
4. Balances fraud detection against investigation resource constraints and cost
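
One way to connect thresholds to the business trade-off in the criteria above is to attach a cost to each error type and compare thresholds on total cost. The sketch below assumes hypothetical costs (a missed fraud counted as 10 times a false alarm) and uses toy labels and scores. None of these numbers come from the project data, and threshold_report is an illustrative helper name.

```python
import numpy as np

def threshold_report(y_true, y_score, thresholds, cost_fp=1.0, cost_fn=10.0):
    """Return one dict per threshold with precision, recall and a simple total cost."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rows = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        rows.append({
            "threshold": t, "precision": precision, "recall": recall,
            "fp": fp, "fn": fn,
            "cost": cost_fp * fp + cost_fn * fn,  # hypothetical cost model
        })
    return rows

# Toy labels and scores (illustrative only, not model output)
y_toy = [0, 0, 0, 0, 1, 0, 1, 1]
s_toy = [0.05, 0.20, 0.35, 0.55, 0.60, 0.65, 0.80, 0.95]

report = threshold_report(y_toy, s_toy, thresholds=[0.30, 0.50, 0.70])
for row in report:
    print(row)

best = min(report, key=lambda r: r["cost"])
print("Lowest-cost threshold:", best["threshold"])
```

In practice the two costs would need to come from the business (for example, average fraud loss versus investigation cost per alert), and the chosen threshold will move as those costs change.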

In [None]:
# =============================================================================
# Test 1: No Class Weighting (Baseline without imbalance handling)
# =============================================================================
# Demonstrates model behaviour without handling class imbalance
# Expected outcome: High accuracy but poor fraud detection
# Included as it establishes the baseline problem that needs solving
# =============================================================================

print("=" * 90)
print("TEST 1: NO CLASS WEIGHTING (Baseline without imbalance handling)")
print("=" * 90)

# ─────────────────────────────────────────────────────────────────────────────
# Train Logistic Regression WITHOUT Class Weighting
# ─────────────────────────────────────────────────────────────────────────────
# Standard logistic regression treats all misclassifications equally
# With 91.3% non-fraud cases, the model learns to predict "not fraud" frequently
# This minimises overall error but fails at fraud detection

logit_unweighted = LogisticRegression(
    max_iter=2000,    # Raised from the default (100) to ensure convergence; 1000 was enough in initial testing, later set to 2000 to match other models (no convergence warning)
    random_state=42   # Fixed seed for reproducibility
    # No class_weight parameter - all classes weighted equally
)

# Fit model on training data
# Model optimises standard logistic loss without fraud-specific penalties

logit_unweighted.fit(X_train, y_train)

# ─────────────────────────────────────────────────────────────────────────────
# Generate Probability Predictions on Validation Set
# ─────────────────────────────────────────────────────────────────────────────
# Get predicted probabilities for fraud class (positive class = 1)
# predict_proba() returns array with shape (n_samples, 2):
#   - Column 0: Probability of non-fraud (class 0)
#   - Column 1: Probability of fraud (class 1)
# [:, 1] extracts only fraud probabilities for evaluation

y_val_unweighted = logit_unweighted.predict_proba(X_val)[:, 1]

# ─────────────────────────────────────────────────────────────────────────────
# Calculate Performance Metrics
# ─────────────────────────────────────────────────────────────────────────────

# PR AUC (Precision-Recall Area Under Curve) - Primary Metric
#   Also called Average Precision (AP)
#   - Focuses specifically on minority class performance (fraud detection)
#   - Summarises precision-recall trade-off across all probability thresholds
#   - Range: 0 to 1, where higher is better
#   - More informative than ROC AUC for imbalanced datasets
#   - Baseline (random classifier): ≈ 0.087 (proportion of fraud cases)

ap_unweighted = average_precision_score(y_val, y_val_unweighted)

# ROC AUC (Receiver Operating Characteristic) - Secondary Metric
#   - Measures overall ability to separate fraud from non-fraud
#   - Can be misleadingly high on imbalanced data
#   - Range: 0 to 1, where higher is better
#   - Baseline (random classifier): 0.5
#   - Included for comparison with standard literature

roc_unweighted = roc_auc_score(y_val, y_val_unweighted)

print(f"\nValidation Performance Metrics:")
print(f"  PR AUC (Average Precision): {ap_unweighted:.3f}")
print(f"  ROC AUC:                    {roc_unweighted:.3f}")

# ─────────────────────────────────────────────────────────────────────────────
# Evaluate Performance at Multiple Probability Thresholds
# ─────────────────────────────────────────────────────────────────────────────
# Default threshold (0.5) is often suboptimal for imbalanced problems (doesn't account for costs and risks)
# Test common thresholds to understand precision-recall trade-off:
#   - Lower threshold (0.5): Catch more fraud but more false alarms
#   - Higher threshold (0.7): Fewer false alarms but miss more fraud

print("\nThreshold Performance:")

for t in [0.50, 0.70]:
    # ─────────────────────────────────────────────────────────────────────────
    # Convert probabilities to binary predictions using threshold t
    # ─────────────────────────────────────────────────────────────────────────
    # Decision rule: If prob_fraud >= t, predict fraud (1), else not fraud (0)

    y_pred = (y_val_unweighted >= t).astype(int)
    
    # ─────────────────────────────────────────────────────────────────────────
    # Compute Confusion Matrix Components
    # ─────────────────────────────────────────────────────────────────────────
    # Confusion matrix layout:
    #                      Predicted
    #                 Not Fraud  |  Fraud
    #     Actual  ─────────────────────────
    #     Not Fraud    TN       |   FP
    #     Fraud        FN       |   TP
    #
    # TN (True Negative):  Correctly predicted non-fraud
    # FP (False Positive): Incorrectly flagged as fraud (Type I error)
    #                      - False alarm, wastes investigation resources
    # FN (False Negative): Missed fraud case (Type II error)
    #                      - Most costly: undetected fraud causes financial loss
    # TP (True Positive):  Correctly detects fraud cases
     
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    
    # ─────────────────────────────────────────────────────────────────────────
    # Calculate Key Business Metrics
    # ─────────────────────────────────────────────────────────────────────────
    
    # Precision (Positive Predictive Value):
    #   = TP / (TP + FP)
    #   = Of all fraud alerts, what proportion are genuine fraud?
    #   - Low precision = many false alarms = wasted investigation effort
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    
    # Recall (Sensitivity, True Positive Rate):
    #   = TP / (TP + FN)
    #   = Of all actual fraud cases, what proportion did it catch?
    #   - Low recall = missing fraud = direct financial losses
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0 
    
    # Display results for this threshold
    print(f"\n  Threshold {t:.2f}:")
    print(f"    Recall   : {recall:.3f} ({tp}/{tp+fn} fraud cases caught)") # High recall = catch most fraud
    print(f"    Precision: {precision:.3f} ({tp}/{tp+fp} predictions correct)") # High precision = few false alarms
    print(f"    FP={fp:,}, FN={fn:,}")

print("\n" + "=" * 90 + "\n")


# Section 2.2 Test 2 - Balanced Class Weights 

**AIM**
- Assess the effectiveness of class weighting for managing moderate class imbalance, and determine whether synthetic oversampling methods offer measurable performance gains.

**What class_weight="balanced" does:**
- Automatically calculates weights inversely proportional to class frequencies
- Formula: weight = n_samples / (n_classes × class_count)
- For fraud class (8.7%): weight ≈ 5.72 (roughly 10× the non-fraud weight)
- For non-fraud class (91.3%): weight ≈ 0.55 (reduced importance)

**How it works:**
- Loss function penalises misclassified fraud cases more heavily than non-fraud
- Forces model to prioritise catching fraud over overall accuracy
- No data resampling required - just adjusts the optimisation objective
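
To make the effect on the optimisation objective concrete, the sketch below computes per-sample weighted log-loss terms for two equally confident mistakes: one missed fraud and one false alarm. It is a simplified illustration that ignores regularisation and is not scikit-learn's internal code. The function name weighted_log_loss_terms is hypothetical, and the weights of roughly 5.72 and 0.55 are the balanced weights implied by the training-set counts.

```python
import numpy as np

def weighted_log_loss_terms(y_true, p_fraud, weight_fraud=1.0, weight_legit=1.0):
    """Per-sample binary cross-entropy, scaled by a class-dependent weight."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_fraud, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    sample_weight = np.where(y_true == 1, weight_fraud, weight_legit)
    return sample_weight * per_sample

# Two equally confident mistakes (toy values):
#   row 0: real fraud scored 0.1 (missed fraud)
#   row 1: legitimate transaction scored 0.9 (false alarm)
y_demo = [1, 0]
p_demo = [0.1, 0.9]

terms_unweighted = weighted_log_loss_terms(y_demo, p_demo)
terms_weighted = weighted_log_loss_terms(y_demo, p_demo, weight_fraud=5.72, weight_legit=0.55)

print("Unweighted loss terms:", terms_unweighted)
print("Weighted loss terms:  ", terms_weighted)
```

Without weights the two mistakes contribute the same loss. With balanced weights the missed fraud contributes about ten times more, so the optimiser has a stronger incentive to move the decision boundary towards catching fraud.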

**Expected improvement:**
- Significantly higher recall (90-95%) compared to no weighting
- Precision drops slightly (more false positives) but acceptable
- Better balance between catching fraud and managing false alerts  



In [None]:
# =============================================================================
# Test 2: Balanced Class Weights (Primary Imbalance Handling Approach)
# =============================================================================
# Adjusts loss function to penalise fraud misclassification more heavily
# Expected outcome: Higher recall (catch more fraud) with acceptable precision
# This is the preferred approach for moderate class imbalance (8.74%)
# =============================================================================

print("=" * 90)
print("TEST 2: CLASS_WEIGHT='BALANCED' (Primary imbalance handling approach)")
print("=" * 90)

# ─────────────────────────────────────────────────────────────────────────────
# Train Logistic Regression WITH Balanced Class Weights
# ─────────────────────────────────────────────────────────────────────────────
# class_weight="balanced" automatically calculates inverse-frequency weights
#
# Formula: weight_class_i = n_samples / (n_classes × n_samples_class_i)
#
# For this dataset (8.74% fraud, 91.26% non-fraud):
#   Fraud weight     ≈ 60,000 / (2 × 5,244)  ≈ 5.72
#   Non-fraud weight ≈ 60,000 / (2 × 54,756) ≈ 0.55
#
# Effect on loss function:
#   - Misclassifying fraud costs 5.72× more than baseline
#   - Misclassifying non-fraud costs 0.55× baseline
#   - Relative penalty: Fraud errors penalised approximately 10× more than non-fraud (5.72 / 0.55 ≈ 10.4)
#   - Model is forced to prioritise fraud detection over overall accuracy

logit_balanced = LogisticRegression(
    class_weight="balanced",  # Automatic inverse-frequency weighting
    max_iter=2000,            # Increased to ensure convergence
    random_state=42           # Fixed seed for reproducibility
)

# Fit model with weighted loss function
# During training, fraud misclassifications contribute more to total loss
# This forces model to learn better fraud detection patterns
logit_balanced.fit(X_train, y_train)

# ─────────────────────────────────────────────────────────────────────────────
# Generate Probability Predictions on Validation Set
# ─────────────────────────────────────────────────────────────────────────────
# Get fraud probabilities (positive class = 1)
# predict_proba() returns [prob_non_fraud, prob_fraud]
# [:, 1] extracts only fraud probabilities for evaluation

y_val_balanced = logit_balanced.predict_proba(X_val)[:, 1]

# ─────────────────────────────────────────────────────────────────────────────
# Calculate Performance Metrics
# ─────────────────────────────────────────────────────────────────────────────
# Expected changes compared to Test 1 (no weighting):
#   - PR AUC: Should remain similar or improve slightly
#   - Recall: Should increase significantly (catch more fraud)
#   - Precision: May decrease slightly (more false positives)
#   - Overall: Better fraud detection at cost of more false alarms

ap_balanced = average_precision_score(y_val, y_val_balanced)
roc_balanced = roc_auc_score(y_val, y_val_balanced)

print(f"\nValidation Performance Metrics:")
print(f"  PR AUC (Average Precision): {ap_balanced:.3f}")
print(f"  ROC AUC:                    {roc_balanced:.3f}")

# ─────────────────────────────────────────────────────────────────────────────
# Evaluate Performance at Multiple Probability Thresholds
# ─────────────────────────────────────────────────────────────────────────────
# Test same thresholds as Test 1 for direct comparison
# Class weighting shifts probability distribution, affecting threshold performance

print("\nThreshold Performance:")

for t in [0.50, 0.70]:
    # ─────────────────────────────────────────────────────────────────────────
    # Convert probabilities to binary predictions using threshold t
    # ─────────────────────────────────────────────────────────────────────────
    # Decision rule: If prob_fraud >= t, predict fraud (1), else not fraud (0)
    
    y_pred = (y_val_balanced >= t).astype(int)
    
    # ─────────────────────────────────────────────────────────────────────────
    # Compute Confusion Matrix Components
    # ─────────────────────────────────────────────────────────────────────────
    # TN (True Negative):  Correctly predicted non-fraud
    # FP (False Positive): False alarm - wastes investigation resources
    # FN (False Negative): Missed fraud - most costly error (financial loss)
    # TP (True Positive):  Correctly detected fraud - prevented loss
    
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    
    # ─────────────────────────────────────────────────────────────────────────
    # Calculate Key Business Metrics
    # ─────────────────────────────────────────────────────────────────────────
    
    # Precision: Of all fraud alerts, what proportion are genuine?
    #   = TP / (TP + FP)
    #   Expected: May decrease vs Test 1 (more false alarms acceptable trade-off)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    
    # Recall: Of all actual fraud, what proportion did it catch?
    #   = TP / (TP + FN)
    #   Expected: Should increase significantly vs Test 1 (primary goal)
    
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    # Display results for this threshold
    print(f"\n  Threshold {t:.2f}:")
    print(f"    Recall   : {recall:.3f} ({tp}/{tp+fn} fraud cases caught)")
    print(f"    Precision: {precision:.3f} ({tp}/{tp+fp} predictions correct)")
    print(f"    FP={fp:,}, FN={fn:,}")

# Section 2.4 Test 3 - Synthetic Minority Oversampling Technique (SMOTE)

**AIM** 
- Test SMOTE as an alternative approach to handling class imbalance. This validates whether class weighting (Test 2) is sufficient or if data augmentation provides additional benefits.

**What SMOTE does:**
- Creates synthetic fraud cases by interpolating between existing fraud samples
- Balances class distribution in training data (≈ 50/50 split after SMOTE)
- Training set size increases from 60,000 to ≈ 110,000
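
As a rough check on these figures, the resampled size follows directly from the training class counts. The sketch below assumes the 60,000-row training split at 8.74% fraud quoted in this notebook, and SMOTE's default behaviour of oversampling the minority class until it matches the majority class.

```python
# Back-of-envelope check of the SMOTE training-set size.
# Assumes 60,000 training rows at 8.74% fraud (figures quoted in this notebook).
n_train = 60_000
n_fraud = round(n_train * 0.0874)      # real fraud rows in the training split
n_non_fraud = n_train - n_fraud        # real non-fraud rows

# Default SMOTE oversamples fraud until it matches the majority class count
n_synthetic = n_non_fraud - n_fraud    # synthetic fraud rows added
n_after_smote = 2 * n_non_fraud        # balanced 50/50 training set

print(f"Real fraud rows:            {n_fraud:,}")
print(f"Synthetic fraud rows added: {n_synthetic:,}")
print(f"Training rows after SMOTE:  {n_after_smote:,}")
```

The exact counts printed by the SMOTE cell depend on the actual split, so treat these as approximate.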

**How it works:**
1. For each fraud case, find its k nearest fraud neighbours (default k=5)
2. Draw lines between the fraud case and its neighbours
3. Generate new synthetic samples along those lines
4. Result: More diverse fraud examples for the model to learn from
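
The interpolation step can be illustrated with a minimal sketch. This is not the imbalanced-learn implementation, which also handles the neighbour search and decides how many samples to generate. The helper name `smote_like_sample` and the two example points are hypothetical.

```python
import numpy as np

def smote_like_sample(x, neighbour, rng):
    """Return one synthetic point on the line segment between x and neighbour."""
    lam = rng.uniform(0.0, 1.0)        # random position along the segment
    return x + lam * (neighbour - x)   # linear interpolation in feature space

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])               # hypothetical fraud sample
neighbour = np.array([3.0, 6.0])       # hypothetical nearest fraud neighbour

synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)                        # lies somewhere between x and neighbour
```

Because every synthetic point sits between two existing fraud cases, SMOTE cannot create fraud patterns outside the region already covered by the real minority samples.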

**Trade-offs:**
- **Pro:** Can improve recall by giving model more fraud patterns
- **Pro:** No manual class weighting needed
- **Con:** Increases training time (more data to process)
- **Con:** May introduce artificial patterns not present in real data
- **Con:** Risk of overfitting if synthetic samples are too similar

In [None]:
# =============================================================================
# TEST 3: SMOTE Oversampling (Secondary Imbalance Handling Approach)
# =============================================================================
# Creates synthetic fraud samples to balance training data
# Expected outcome: Similar or slightly better than class weighting
# This is the secondary approach - validates that class weighting is sufficient
# =============================================================================

print("=" * 90)
print("TEST 3: SMOTE OVERSAMPLING (Synthetic minority oversampling)")
print("=" * 90)

# ─────────────────────────────────────────────────────────────────────────────
# Apply SMOTE to Training Data Only
# ─────────────────────────────────────────────────────────────────────────────
# SMOTE (Synthetic Minority Oversampling Technique):
#
# How it works:
#   1. For each fraud case in training set, identify k=5 nearest fraud neighbours
#   2. In feature space, draw line segments connecting the case to its neighbours
#   3. Generate new synthetic fraud samples at random points along these lines
#   4. Repeat until training set reaches 50:50 balance (fraud:non-fraud)
#
# Mathematical approach:
#   For fraud sample x and neighbour n:
#   synthetic_sample = x + λ × (n - x), where λ ∈ [0, 1] is random
#   This places each synthetic fraud case between two existing fraud patterns
#
# Key characteristics:
#   - Only training data is resampled (validation/test remain unchanged)
#   - Validation performance tests generalisation to real (not synthetic) data
#   - Creates ≈50k synthetic fraud samples so that fraud matches the ≈55k non-fraud samples

smote = SMOTE(
    random_state=42,      # Fixed seed for reproducibility
    k_neighbors=5         # Default: use 5 nearest fraud neighbours
)

# fit_resample() generates synthetic samples and returns balanced dataset

X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# ─────────────────────────────────────────────────────────────────────────────
# Verify SMOTE Transformation
# ─────────────────────────────────────────────────────────────────────────────
# Original training set: 60,000 samples (8.74% fraud = 5,244 fraud cases)
# After SMOTE: ≈110,000 samples (50% fraud ≈ 55,000 fraud cases)
# SMOTE added: ≈50,000 synthetic fraud samples

print(f"\nSMOTE Transformation Summary:")
print(f"  Original training shape: {X_train.shape[0]:,} samples × {X_train.shape[1]} features")
print(f"  SMOTE training shape:    {X_train_smote.shape[0]:,} samples × {X_train_smote.shape[1]} features")
print(f"  Samples added:           {X_train_smote.shape[0] - X_train.shape[0]:,} synthetic fraud cases")
print(f"\n  Original fraud rate: {y_train.mean()*100:.2f}%")
print(f"  SMOTE fraud rate:    {y_train_smote.mean()*100:.2f}% (balanced)")

# ─────────────────────────────────────────────────────────────────────────────
# Train Logistic Regression on SMOTE-Balanced Data
# ─────────────────────────────────────────────────────────────────────────────
# Key differences from Test 2:
#   - Training on ≈110k samples instead of ≈60k (increased computational cost)
#   - No class_weight needed (data already balanced at 50:50)
#   - Model sees many synthetic fraud patterns during training
#   - Validation tests whether synthetic samples improve real-world performance

logit_smote = LogisticRegression(
    max_iter=2000,      # Increased: larger dataset may need more iterations
    random_state=42     # Fixed seed for reproducibility
    # NOTE: No class_weight parameter - data is balanced via oversampling
)

# Fit model on SMOTE-resampled training data
# Training time will be longer due to increased sample size
logit_smote.fit(X_train_smote, y_train_smote)

# ─────────────────────────────────────────────────────────────────────────────
# Evaluate on Original (Non-Resampled) Validation Set
# ─────────────────────────────────────────────────────────────────────────────
# CRITICAL: Validation set retains original 8.74% fraud distribution
# This tests whether learning from synthetic samples generalises to real data
# If SMOTE validation performance matches Test 2, class weighting is sufficient
# If SMOTE is worse, synthetic samples may have introduced noise/overfitting

# Get fraud probabilities on validation set
# [:, 1] extracts probability of positive class (fraud)

y_val_smote = logit_smote.predict_proba(X_val)[:, 1]

# ─────────────────────────────────────────────────────────────────────────────
# Calculate Performance Metrics
# ─────────────────────────────────────────────────────────────────────────────
# Expected outcomes compared to Test 2 (class weighting):
#   - PR AUC: Similar or slightly better (if synthetic samples add value)
#   - Recall: Similar (both approaches prioritise fraud detection)
#   - Precision: Similar trade-off between detection and false alarms
#   - If results similar: validates that class weighting is sufficient

ap_smote = average_precision_score(y_val, y_val_smote)
roc_smote = roc_auc_score(y_val, y_val_smote)

print(f"\nValidation Performance Metrics:")
print(f"  PR AUC (Average Precision): {ap_smote:.3f}")
print(f"  ROC AUC:                    {roc_smote:.3f}")

# ─────────────────────────────────────────────────────────────────────────────
# Evaluate Performance at Multiple Probability Thresholds
# ─────────────────────────────────────────────────────────────────────────────
# Same thresholds as Tests 1 and 2 for direct comparison
# SMOTE may shift probability calibration due to training on synthetic data

print("\nThreshold Performance:")

for t in [0.50, 0.70]:
    # ─────────────────────────────────────────────────────────────────────────
    # Convert probabilities to binary predictions using threshold t
    # ─────────────────────────────────────────────────────────────────────────
    # Decision rule: If prob_fraud >= t, predict fraud (1), else not fraud (0)
    y_pred = (y_val_smote >= t).astype(int)
    
    # ─────────────────────────────────────────────────────────────────────────
    # Compute Confusion Matrix Components
    # ─────────────────────────────────────────────────────────────────────────
    # TN (True Negative):  Correctly predicted non-fraud
    # FP (False Positive): False alarm - wastes investigation resources
    # FN (False Negative): Missed fraud - most costly error (financial loss)
    # TP (True Positive):  Correctly detected fraud - prevented loss
    
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    
    # ─────────────────────────────────────────────────────────────────────────
    # Calculate Key Business Metrics
    # ─────────────────────────────────────────────────────────────────────────
    
    # Precision: Of all fraud alerts, what proportion are genuine?
    #   = TP / (TP + FP)
    #   Compare to Test 2 to assess synthetic sample quality
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    
    # Recall: Of all actual fraud, what proportion did it catch?
    #   = TP / (TP + FN)
    #   Should be similar to Test 2 if class weighting alone is sufficient
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    # Display results for this threshold
    print(f"\n  Threshold {t:.2f}:")
    print(f"    Recall   : {recall:.3f} ({tp}/{tp+fn} fraud cases caught)") # High recall = catch most fraud
    print(f"    Precision: {precision:.3f} ({tp}/{tp+fp} predictions correct)") # High precision = few false alarms
    print(f"    FP={fp:,}, FN={fn:,}")

print("\n" + "=" * 90 + "\n")

# Section 2.5 Comparison of All Three Methods

## Class Imbalance Strategy Comparison

**Approach**

Three imbalance handling strategies were compared using the validation set (20,000 samples, 8.74% fraud):
- No weighting (baseline)
- Balanced class weights (class_weight='balanced')
- SMOTE oversampling
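
For reference, the weights implied by `class_weight='balanced'` can be worked out by hand. scikit-learn documents the heuristic as `n_samples / (n_classes * count_per_class)`. The sketch below assumes the training counts quoted in this notebook (60,000 rows, 5,244 of them fraud); the actual split may differ slightly.

```python
# 'balanced' heuristic: weight_c = n_samples / (n_classes * n_samples_in_class_c)
# Class counts are assumed from the training-set figures quoted in this notebook.
class_counts = {0: 54_756, 1: 5_244}   # 0 = non-fraud, 1 = fraud
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight = {weight:.3f}")

# Relative penalty for misclassifying a fraud row vs a non-fraud row
print(f"fraud / non-fraud weight ratio: {balanced_weights[1] / balanced_weights[0]:.2f}")
```

Under these counts each fraud row contributes roughly ten times as much to the loss as a non-fraud row, which is the inverse of the class frequency ratio.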

**Summary of Results**

| **Metric**           | **No Weighting**          | **Balanced Weights**      | **SMOTE**             |
| -------------------- | ------------------------- | ------------------------- | --------------------- |
| **PR AUC**           | **0.799**                 | 0.748                     | 0.754                 |
| **ROC AUC**          | 0.969                     | **0.977**                 | **0.977**             |
| **Recall @ 0.50**    | 0.658 (1,150 / 1,748)     | **0.946 (1,653 / 1,748)** | 0.944 (1,650 / 1,748) |
| **Precision @ 0.50** | **0.813 (1,150 / 1,414)** | 0.522 (1,653 / 3,167)     | 0.532 (1,650 / 3,103) |
| **Recall @ 0.70**    | 0.490 (857 / 1,748)       | **0.878 (1,534 / 1,748)** | 0.875 (1,530 / 1,748) |
| **Precision @ 0.70** | **0.897 (857 / 955)**     | 0.640 (1,534 / 2,397)     | 0.646 (1,530 / 2,369) |

# Key Findings

**Baseline (no weighting):**
- Highest precision (0.90 @ threshold 0.70) but lowest recall (0.49)
- Misses half of fraud cases - unacceptable for fraud prevention
- Strong baseline performance suggests well-engineered features and only moderate imbalance, although it may partly reflect the synthetic nature of the dataset:
    - Synthetic data are often generated with well-separated feature distributions, so the model may find it easier to distinguish the classes even without weighting. On a real-world dataset with noisier, overlapping fraud signals, the unweighted baseline would likely perform worse, and class weighting or SMOTE would have more impact.

**Balanced weights:**
- Major recall improvement: +39 percentage points (0.49 → 0.88 @ threshold 0.70)
- Acceptable precision trade-off: 0.64 (down from 0.90)
- Catches 677 additional fraud cases at the cost of 765 additional false positives (@ threshold 0.70)

**SMOTE:**
- Near-identical performance to balanced weights (recall: 0.875 vs 0.878)
- Nearly doubles the training set (60,000 → ≈ 110,000 rows), adding training cost with no performance gain
- Confirms class weighting is sufficient for this imbalance level (8.74%)
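
The headline deltas for balanced weights can be re-derived from the counts in the summary table. The sketch below uses only the threshold-0.70 validation figures reported above; the dictionary layout and helper name are illustrative.

```python
# Counts at threshold 0.70, taken from the summary table (validation set)
total_fraud = 1_748
baseline = {"tp": 857, "alerts": 955}      # no weighting
balanced = {"tp": 1_534, "alerts": 2_397}  # class_weight='balanced'

def false_positives(counts):
    """Alerts that were not genuine fraud."""
    return counts["alerts"] - counts["tp"]

extra_fraud_caught = balanced["tp"] - baseline["tp"]
extra_false_alarms = false_positives(balanced) - false_positives(baseline)
recall_gain_pp = 100 * extra_fraud_caught / total_fraud

print(f"Additional fraud caught:    {extra_fraud_caught}")
print(f"Additional false positives: {extra_false_alarms}")
print(f"Recall gain:                {recall_gain_pp:.1f} percentage points")
```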

**Recommendation**
- Primary approach: class_weight='balanced'
    - Efficient, interpretable, and achieves high fraud detection
    - Well suited to moderate imbalance (5-20% minority class)
    - No synthetic data generation required

- When SMOTE might be preferred:
    - Extreme class imbalance (<2% minority class) 
    - Very few minority class examples (<1,000 samples)
    - Complex fraud patterns requiring more diverse training examples
    - When the research question specifically examines the efficacy of synthetic oversampling

**Threshold recommendations (balanced-weights model):**
- Threshold 0.70 (recommended operational threshold):
    - Recall: 87.8% - catches 1,534 of 1,748 fraud cases (214 missed)
    - Precision: 64.0% - 863 false positives (manageable investigation load)
    - Best balance between fraud detection and resource efficiency

- Threshold 0.50 (maximum fraud detection):
    - Recall: 94.6% - catches 1,653 of 1,748 fraud cases (95 missed)
    - Precision: 52.2% - 1,514 false positives (high investigation burden)
    - Use when fraud prevention is paramount and resources permit

- Threshold 0.80+ (precision-focused):
    - Higher precision but significantly lower recall
    - Only suitable if investigation capacity is severely constrained

- Threshold selection guidance:
    - Business should determine optimal threshold based on:
        - Investigation team capacity (false positive tolerance)
        - Cost of missed fraud vs cost of false alarms
        - Risk appetite and regulatory requirements
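
One way to turn this guidance into a procedure is to sweep candidate thresholds and pick the one with the highest recall that stays within the investigation team's capacity. The sketch below uses small hypothetical labels and scores. The helper `sweep_thresholds` and the capacity limit `max_false_alerts` are illustrative names, not part of the notebook.

```python
import numpy as np

def sweep_thresholds(y_true, scores, thresholds):
    """Return precision, recall, FP and FN for each candidate threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rows = []
    for t in thresholds:
        y_pred = scores >= t                          # alert if score reaches t
        tp = int(np.sum(y_pred & (y_true == 1)))
        fp = int(np.sum(y_pred & (y_true == 0)))
        fn = int(np.sum(~y_pred & (y_true == 1)))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "fp": fp, "fn": fn})
    return rows

# Hypothetical validation labels and fraud probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.20, 0.40, 0.55, 0.60, 0.72, 0.75, 0.85, 0.90, 0.95]

results = sweep_thresholds(y_true, scores, thresholds=[0.50, 0.70, 0.80])

# Assumed capacity: the team can absorb at most this many false alerts
max_false_alerts = 1
feasible = [r for r in results if r["fp"] <= max_false_alerts]
chosen = max(feasible, key=lambda r: r["recall"])     # best recall within capacity

for r in results:
    print(r)
print("Chosen threshold:", chosen["threshold"])
```

In this notebook the same function could be called with `y_val` and `y_val_balanced` in place of the toy arrays, with the capacity limit supplied by the business.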

# Dataset Limitations

**Threshold Setting and Cost- and Risk-Based Weighting (Limitation)**

This analysis could not incorporate actual financial costs, which would enable more precise threshold optimisation. A complete cost-benefit framework would require:
1. **Cost of false positives** (investigation costs):
   - Can be estimated from assumed unit costs: average investigation time × hourly rate × FP count
   - Illustrative example: 30 min/case × £50/hour × 863 FP = £21,575

2. **Cost of false negatives** (missed fraud losses):
   - Cannot be estimated: dataset lacks transaction amounts
   - Critical missing information: actual financial loss per undetected fraud case
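
The false-positive side of this framework can be worked through as a short calculation. The unit costs below are the illustrative figures from the example above (30 minutes per case at £50 per hour), not measured values. The false-positive counts are those reported for the balanced-weights model.

```python
def investigation_cost_gbp(false_positives, minutes_per_case=30, hourly_rate_gbp=50):
    """Estimated cost of investigating false alerts, under assumed unit costs."""
    hours_per_case = minutes_per_case / 60
    return hours_per_case * hourly_rate_gbp * false_positives

# False-positive counts for the balanced-weights model on the validation set
cost_t70 = investigation_cost_gbp(863)     # threshold 0.70
cost_t50 = investigation_cost_gbp(1_514)   # threshold 0.50

print(f"Threshold 0.70: £{cost_t70:,.0f}")
print(f"Threshold 0.50: £{cost_t50:,.0f}")
```

The matching calculation for false negatives is not possible here because the loss per missed fraud case is unknown.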

Without reliable estimates of the financial loss from missed fraud or the operational cost of investigating false alerts, it was not possible to calculate truly cost-sensitive class weights.

If such data were available, model weights could be set using business and risk considerations rather than relying on the automatic class_weight='balanced' option. This would align the model with real financial impact, risk appetite, and operational constraints:
- A cost- and risk-based approach would directly link model weighting to business outcomes, penalising missed fraud more heavily and optimising recall according to real cost trade-offs.
- In this framework, the fraud class weight could be driven by stakeholder priorities, for example:
    - fraud_weight = (total impact of missed fraud ÷ total impact of false alert) × class imbalance ratio
- Such an approach would ensure the model reflects the true business cost of error, balancing fraud prevention effectiveness against investigation workload and customer impact.
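
A minimal sketch of the weighting formula above, assuming per-case costs were available. Both cost figures are hypothetical placeholders. The class imbalance ratio is read here as the non-fraud to fraud count ratio in the training set, which is an assumption about the intended definition.

```python
def cost_based_fraud_weight(cost_missed_fraud, cost_false_alert, n_non_fraud, n_fraud):
    """fraud_weight = (impact of missed fraud / impact of false alert) x imbalance ratio."""
    imbalance_ratio = n_non_fraud / n_fraud
    return (cost_missed_fraud / cost_false_alert) * imbalance_ratio

# Hypothetical costs: £500 lost per missed fraud, £25 per false alert investigated
fraud_weight = cost_based_fraud_weight(
    cost_missed_fraud=500.0,
    cost_false_alert=25.0,
    n_non_fraud=54_756,   # training counts assumed from this notebook
    n_fraud=5_244,
)

# Dictionary form accepted by scikit-learn's class_weight parameter
class_weight = {0: 1.0, 1: fraud_weight}
print(class_weight)
```

A dictionary like this could be passed as `LogisticRegression(class_weight=class_weight)` in place of `'balanced'`, although the resulting weights are only as reliable as the cost estimates behind them.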

In [None]:
# =============================================================================
# Class Imbalance Strategy Comparison
# =============================================================================
# Compare all three approaches side-by-side
# Evaluates trade-offs between fraud detection (recall) and false alarms (precision)
# Identifies optimal approach for deployment
# =============================================================================

print("=" * 90)
print("CLASS IMBALANCE STRATEGY COMPARISON (Validation Set)")
print("=" * 90)

# ─────────────────────────────────────────────────────────────────────────────
# Create High-Level Metrics Comparison Table
# ─────────────────────────────────────────────────────────────────────────────
# Compares PR AUC and ROC AUC across all three strategies
# PR AUC is primary metric (focuses on fraud detection performance)
# ROC AUC is secondary metric (shows overall class separability)

comparison = pd.DataFrame({
    'Strategy': ['No Weighting', 'Balanced Weights', 'SMOTE'],
    'PR_AUC': [ap_unweighted, ap_balanced, ap_smote],
    'ROC_AUC': [roc_unweighted, roc_balanced, roc_smote]
})

# Display formatted comparison table
print("\nMetrics Summary:")
print(comparison.to_string(index=False))

# ─────────────────────────────────────────────────────────────────────────────
# Detailed Recall and Precision Comparison at Operational Threshold
# ─────────────────────────────────────────────────────────────────────────────
# Threshold 0.70 selected as representative operational threshold
# This threshold typically provides good balance between:
#   - Catching fraud (recall)
#   - Minimising false alarms (precision)
# Business can adjust threshold based on cost/benefit analysis

print("\n" + "=" * 90)
print("DETAILED PERFORMANCE COMPARISON @ Threshold 0.70")
print("=" * 90)
print("\nThreshold 0.70 rationale:")
print("  - Balances fraud detection against investigation resource constraints")
print("  - Higher than default 0.50 to reduce false positive rate")
print("  - Final threshold should be set from business cost/benefit analysis")
print("=" * 90)

# Iterate through all three strategies for detailed comparison
for name, scores in [
    ('No Weighting', y_val_unweighted),  # No class weighting applied
    ('Balanced Weights', y_val_balanced),  # Balanced class weights
    ('SMOTE', y_val_smote) # Synthetic minority oversampling
]:
    # ─────────────────────────────────────────────────────────────────────────
    # Convert Probabilities to Binary Predictions at Threshold 0.70
    # ─────────────────────────────────────────────────────────────────────────
    # Decision rule: If prob_fraud >= 0.70, predict fraud (1), else not fraud (0)
    # Higher threshold = more conservative = fewer false alarms but might miss fraud
    y_pred = (scores >= 0.70).astype(int)
    
    # ─────────────────────────────────────────────────────────────────────────
    # Extract Confusion Matrix Components
    # ─────────────────────────────────────────────────────────────────────────
    # TN (True Negative):  Correctly predicted non-fraud
    # FP (False Positive): False alarm - investigation costs
    # FN (False Negative): Missed fraud - financial losses
    # TP (True Positive):  Correctly caught fraud - losses prevented

    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    
    # ─────────────────────────────────────────────────────────────────────────
    # Calculate Business-Critical Metrics
    # ─────────────────────────────────────────────────────────────────────────
    
    # Recall (Sensitivity): Of all actual fraud, what % did it catch?
    #   = TP / (TP + FN)
    #   Higher recall = fewer missed fraud cases = better fraud prevention
    #   Primary goal: maximise this metric
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    # Precision (Positive Predictive Value): Of all fraud alerts, what % are genuine?
    #   = TP / (TP + FP)
    #   Higher precision = fewer false alarms = efficient resource use
    #   Trade-off: Higher precision often means lower recall
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    
    # Display detailed results for this strategy
    print(f"\n{name}:")
    print(f"  Recall:    {recall:.3f} ({tp:,}/{tp+fn:,} fraud caught, {fn:,} missed)") # High recall = catch most fraud
    print(f"  Precision: {precision:.3f} ({fp:,} false positives out of {tp+fp:,} alerts)") # High precision = few false alarms
    print(f"  True Negatives: {tn:,} | False Positives: {fp:,}") # True Negatives: correctly identified non-fraud
    print(f"  False Negatives: {fn:,} | True Positives: {tp:,}") # False Negatives: missed fraud cases

print("\n" + "=" * 90)

# ─────────────────────────────────────────────────────────────────────────────
# Save Comparison Results for Reporting
# ─────────────────────────────────────────────────────────────────────────────
# Export results to CSV for use in:
#   - Stakeholder presentations
#   - README documentation
#   - Business intelligence tools (Power BI, Tableau)
#   - Model performance tracking

# Ensure reports directory exists
Path("reports").mkdir(exist_ok=True, parents=True)

# Save high-level metrics comparison
comparison.to_csv("reports/balance_model_comparison.csv", index=False)
print(f"\nHigh-level metrics saved to: reports/balance_model_comparison.csv")

# ─────────────────────────────────────────────────────────────────────────────
# Create Detailed Comparison Table at Threshold 0.70
# ─────────────────────────────────────────────────────────────────────────────
# Build comprehensive comparison including recall, precision, and confusion matrix

detailed_comparison = [] # List to hold detailed results

for name, scores in [ 
    ('No Weighting', y_val_unweighted), # No class weighting applied
    ('Balanced Weights', y_val_balanced), # Balanced class weights
    ('SMOTE', y_val_smote)
]:
    # Generate predictions at threshold 0.70
    y_pred = (scores >= 0.70).astype(int) # Predict fraud if prob >= 0.70
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel() # Confusion matrix components
    
    # Calculate metrics
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0 # Recall: Proportion of actual fraud caught
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0 # Precision: Proportion of alerts that are genuine fraud
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0  # F1 Score: Harmonic mean of precision and recall
    
    # Append to detailed comparison
    detailed_comparison.append({
        'Strategy': name,
        'Threshold': 0.70,
        'Recall': recall,
        'Precision': precision,
        'F1_Score': f1_score,
        'True_Positives': tp,
        'False_Positives': fp,
        'False_Negatives': fn,
        'True_Negatives': tn,
        'Total_Fraud_Cases': tp + fn,
        'Total_Alerts': tp + fp
    })

# Convert to DataFrame for easy viewing and export
detailed_df = pd.DataFrame(detailed_comparison) # Convert list of dicts to DataFrame

# Display detailed comparison
print("\n" + "=" * 90)
print("DETAILED COMPARISON TABLE (Threshold 0.70)")
print("=" * 90)
print(detailed_df.to_string(index=False))

# Save detailed comparison
detailed_df.to_csv("reports/balance_model_detailed_comparison_t70.csv", index=False) # Export to CSV
print("\nDetailed comparison saved to: reports/balance_model_detailed_comparison_t70.csv") # Confirm export location

# ─────────────────────────────────────────────────────────────────────────────
# Summary and Recommendation
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 90)
print("SUMMARY AND RECOMMENDATION")
print("=" * 90)

# Identify best strategy by PR AUC (primary metric)
best_strategy_idx = comparison['PR_AUC'].idxmax()
best_strategy = comparison.loc[best_strategy_idx, 'Strategy']
best_pr_auc = comparison.loc[best_strategy_idx, 'PR_AUC']

print(f"\nBest Strategy by PR AUC: {best_strategy} (PR AUC = {best_pr_auc:.3f})")

# Extract recall values at threshold 0.70 for comparison
recalls_t70 = {row['Strategy']: row['Recall'] for row in detailed_comparison}

print(f"\nRecall Comparison @ Threshold 0.70:")
for strategy, recall in recalls_t70.items():
    print(f"  {strategy:25s}: {recall:.1%}")

print("\nKey Findings:")
print("  1. No Weighting achieves highest PR AUC (0.799) with strong baseline performance")
print("  2. Balanced Weights and SMOTE significantly improve recall (88% vs 49% @ threshold 0.70)")
print("  3. Trade-off: Improved recall comes at cost of lower precision")
print("  4. SMOTE and Balanced Weights perform similarly, validating class weighting sufficiency")

print("\nRecommended Approach:")
print("  • PRIMARY: class_weight='balanced'")
print("    - Reason: Achieves high recall (88-95% depending on threshold)")
print("    - No synthetic data generation required")
print("    - Computationally efficient")
print("    - Similar performance to SMOTE without added complexity")
print("\n  • ALTERNATIVE: Consider no weighting if precision is critical")
print("    - Reason: Highest PR AUC and precision")
print("    - Accept lower recall (49-66%) for fewer false positives")
print("    - Suitable if investigation resources are highly constrained")

print("\n" + "=" * 90 + "\n")

---