# **Model Selection & Evaluation**

* Establish baselines; connect metrics to business trade-offs.

**Note:** This modelling uses a processed stratified sample (100k rows) drawn from a larger dataset (1M rows) to keep training fast while preserving the fraud class balance. Details are provided in 01_ETL.ipynb.

## Inputs

* Processed dataset data/processed/card_transdata_processed.csv (derived from 100k stratified sample)

## Outputs

* Baseline metrics, plots, and a predictions CSV for later comparison.



---

# Change working directory

The notebook is run from inside the jupyter_notebooks subfolder, so I need to move the working directory up to its parent folder (the project root) for relative paths such as data/processed/ to resolve.
* I access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\DA_Capstone\\Credit-Card-Fraud-Analysis\\jupyter_notebooks'

Next, I make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory:

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\DA_Capstone\\Credit-Card-Fraud-Analysis'
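
One caveat with the cells above is that re-running the `os.chdir` cell moves up one more level each time. The sketch below is one possible way to make the step safe to repeat. The helper name `ensure_project_root` is hypothetical, and it assumes the notebooks folder is called `jupyter_notebooks`, as in the path printed above.

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dirname: str = "jupyter_notebooks") -> Path:
    """Move up one level only when the cwd is the notebooks subfolder.

    `notebook_dirname` is an assumption taken from the path shown above.
    Calling this twice is harmless: the second call finds that the cwd
    is no longer the notebooks folder and leaves it unchanged.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dirname:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` at the top of the notebook would replace cells 1 to 3, and the returned path can be displayed to confirm the result.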

# Section 1: Load, Check Shape

Quick steps to:
- Load the processed file
- Check that the DataFrame shape is as expected: (100000, 17)
- Display the first 5 rows to confirm the data loaded correctly.


In [4]:
# =============================================================================
# Import all libraries needed for the notebook
# =============================================================================

# Core data manipulation and path handling
import pandas as pd  # Data manipulation and analysis
import numpy as np    # Numerical operations and array handling
from pathlib import Path  # Cross-platform file path handling

# Data visualisation
import matplotlib.pyplot as plt  # Static plotting
import seaborn as sns # Enhanced visual styling

# Display tools
from IPython.display import display  # Pretty display of DataFrames in Jupyter

# =============================================================================
# Machine Learning: Model preparation and evaluation
# =============================================================================
from sklearn.model_selection import train_test_split, GridSearchCV  # Data splitting and tuning
from sklearn.preprocessing import StandardScaler  # Feature scaling for numeric models
from sklearn.linear_model import LogisticRegression  # Linear baseline classifier
from sklearn.tree import DecisionTreeClassifier  # Simple tree model for interpretability
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier  # Ensemble models
from xgboost import XGBClassifier  # Gradient boosting with high performance

# =============================================================================
# Model evaluation metrics
# =============================================================================
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,  # Core classification metrics
    precision_recall_curve, average_precision_score,  # PR curve metrics (preferred for imbalance)
    roc_auc_score, roc_curve, confusion_matrix,  # ROC and confusion matrix
    classification_report  # Summary table
)

# =============================================================================
# Load processed dataset (created from the 100k stratified sample in 01_ETL.ipynb)
# and display the shape of the DataFrame
# =============================================================================

df = pd.read_csv("data/processed/card_transdata_processed.csv") # Load processed data

df.shape # (rows, columns)


(100000, 17)

In [5]:
df.head() # Display first few rows of the dataframe

| | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order | fraud | log_distance_from_home | log_distance_from_last_transaction | log_ratio_to_median_purchase_price | log_distance_from_home_bin | log_purchase_price_bin | log_distance_from_last_transaction_bin | online_high_distance | online_chip_category | online_and_chip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.179396 | 0.178758 | 2.979353 | 1 | 0 | 0 | 1 | 0 | 3.004662 | 0.164461 | 1.381119 | 3 | 4 | 0 | 1 | online_no_chip | 0 |
| 1 | 47.192898 | 1.224832 | 0.293538 | 1 | 1 | 0 | 1 | 0 | 3.875212 | 0.799681 | 0.257381 | 4 | 0 | 2 | 1 | online_with_chip | 1 |
| 2 | 54.389043 | 5.29091 | 4.492304 | 1 | 1 | 0 | 0 | 0 | 4.014382 | 1.839106 | 1.703348 | 4 | 4 | 4 | 0 | offline_with_chip | 0 |
| 3 | 3.129745 | 0.607212 | 0.357527 | 1 | 0 | 0 | 1 | 0 | 1.418216 | 0.474501 | 0.305665 | 1 | 0 | 1 | 0 | online_no_chip | 0 |
| 4 | 0.925275 | 2.238057 | 0.684942 | 0 | 0 | 0 | 0 | 0 | 0.655069 | 1.174974 | 0.521731 | 0 | 1 | 3 | 0 | offline_no_chip | 0 |
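
The load-and-check steps in this section could also be wrapped so that an unexpected shape stops the notebook instead of relying on a visual check. This is only a sketch: the function name `load_processed` is hypothetical, and the expected shape of (100000, 17) is taken from the output above.

```python
from pathlib import Path

import pandas as pd

EXPECTED_SHAPE = (100_000, 17)  # shape reported by df.shape above


def load_processed(path, expected_shape=None):
    """Read a processed CSV and optionally fail fast on a shape mismatch."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Processed file not found: {path}")
    df = pd.read_csv(path)
    if expected_shape is not None and df.shape != expected_shape:
        raise ValueError(
            f"Unexpected shape {df.shape}, expected {expected_shape}"
        )
    return df
```

In this notebook the call would be `load_processed("data/processed/card_transdata_processed.csv", EXPECTED_SHAPE)`, run after the working directory has been set to the project root.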


# Section 1.1: Data Validation and Preparation

## Class Balance Verification

**Purpose:** Verify that the dataset maintains the expected fraud rate (8.74%) from the stratified sampling performed in the ETL pipeline.

**Expected Outcome:**
- Fraud rate: 8.74% 
- Non-fraud rate: 91.26%
- Imbalance ratio: approximately 10.4 non-fraud transactions for every fraud transaction (printed by the validation cell as 1:10.4)
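
The expected figures above follow directly from the class counts reported by the validation cell in this notebook (8,740 fraud and 91,260 non-fraud rows). The short calculation below reproduces them. The last line is an added illustration, not a model from this notebook: it shows the accuracy a classifier would get by always predicting "non-fraud", which is why accuracy alone says little on data this imbalanced.

```python
# Class counts as reported in the notebook's target distribution output
fraud_count = 8_740
legit_count = 91_260
total = fraud_count + legit_count

fraud_rate = fraud_count / total              # share of fraud rows
legit_per_fraud = legit_count / fraud_count   # non-fraud rows per fraud row
majority_accuracy = legit_count / total       # accuracy of "always non-fraud"

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Non-fraud rows per fraud row: {legit_per_fraud:.1f}")
print(f"Always-non-fraud accuracy: {majority_accuracy:.2%}")
```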

## Feature and Target Split

**Purpose:** Prepare the dataset for modelling by separating features (X) and target variable (y).

**Dataset Composition:**
The dataset contains 17 columns in total: 12 features, the target, and 4 columns created only for EDA (excluded below).

**Columns Excluded from Modelling:**

1. **fraud** - Target variable (what we're predicting)

2. **online_chip_category** - String categorical variable
   - Contains labels: "online_no_chip", "online_with_chip", "offline_no_chip", "offline_with_chip"
   - Redundant: underlying binary features (online_order and used_chip) already included
   - Created for hypothesis testing visualisation (H8) only

3. **Binned Variables** - Created for EDA visualisation only:
   - log_distance_from_home_bin - Categorical distance ranges
   - log_purchase_price_bin - Categorical price ratio ranges  
   - log_distance_from_last_transaction_bin - Categorical transaction distance ranges
   
   **Why excluded:** Binning discards the variation within each bin; the continuous raw and log-transformed versions carry the same information at full resolution, so the bins add nothing for the models.
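
To illustrate the information loss from binning, the sketch below assigns two different log-distance values to bins. The bin edges and the two input values are hypothetical, chosen only for illustration, and are not the edges used in the ETL notebook.

```python
from bisect import bisect_right

# Hypothetical bin edges, for illustration only (not the ETL edges)
BIN_EDGES = [1.0, 2.0, 3.0, 4.0]


def to_bin(value, edges=BIN_EDGES):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(edges, value)


# Two clearly different continuous values (also hypothetical)
near, far = 3.1, 3.9

# Both land in the same bin, so a model that sees only the bin
# cannot tell them apart, while the continuous value still can.
print(to_bin(near), to_bin(far))
```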

**Features Retained**

**Original Features:**
- distance_from_home - Geographic distance from customer's home address (km)
- distance_from_last_transaction - Distance from previous transaction location (km)
- ratio_to_median_purchase_price - Current purchase relative to customer's median spending
- repeat_retailer - Whether customer previously transacted with retailer (binary: 0/1)
- used_chip - Chip card authentication used (binary: 0/1)
- used_pin_number - PIN verification used (binary: 0/1)
- online_order - Transaction channel: online vs in-store (binary: 0/1)

**Log-Transformed Features:**
- log_distance_from_home - Reduces right skew in distance distribution
- log_distance_from_last_transaction - Handles extreme distance outliers
- log_ratio_to_median_purchase_price - Normalises purchase ratio distribution

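*Rationale: Both the original and log-transformed versions are retained. Tree-based models (Random Forest, XGBoost) split on thresholds, so a monotonic log transform gives them no new information and they can use either version; the log versions mainly help scale-sensitive models such as Logistic Regression.*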
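
The transform itself is defined in the ETL notebook, not here. As a sanity check, the first row of the `df.head()` output is numerically consistent with a log(1 + x) transform, which the sketch below verifies. Treat the choice of `log1p` as an inference from those values, not as a confirmed description of the ETL code.

```python
import math

# (raw value, logged value) pairs copied from row 0 of df.head()
ROW0_PAIRS = {
    "distance_from_home": (19.179396, 3.004662),
    "distance_from_last_transaction": (0.178758, 0.164461),
    "ratio_to_median_purchase_price": (2.979353, 1.381119),
}

for name, (raw, logged) in ROW0_PAIRS.items():
    recomputed = math.log1p(raw)  # log(1 + x); stays finite at x = 0
    matches = math.isclose(recomputed, logged, abs_tol=1e-4)
    print(f"{name}: log1p={recomputed:.6f}, stored={logged}, match={matches}")
```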

**Engineered Interaction Features:**
- online_high_distance - Binary flag: online transaction far from home (combines channel + distance risk)
- online_and_chip - Binary flag: online transaction using chip authentication
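
As a rough illustration of how these two flags relate to their parent columns, the sketch below recomputes them for the five rows shown by `df.head()`. The distance cut-off is hypothetical: the real threshold is set in the ETL notebook and is not stated here. Any value between about 3.1 and 19.2 km would reproduce these five rows.

```python
# Columns: distance_from_home, online_order, used_chip,
#          online_high_distance, online_and_chip (from df.head())
HEAD_ROWS = [
    (19.179396, 1, 0, 1, 0),
    (47.192898, 1, 1, 1, 1),
    (54.389043, 0, 1, 0, 0),
    (3.129745, 1, 0, 0, 0),
    (0.925275, 0, 0, 0, 0),
]

HYPOTHETICAL_THRESHOLD_KM = 10.0  # assumption, not the ETL value


def online_and_chip(online_order, used_chip):
    """1 only when the transaction is online and a chip was used."""
    return int(online_order == 1 and used_chip == 1)


def online_high_distance(online_order, distance_from_home,
                         threshold=HYPOTHETICAL_THRESHOLD_KM):
    """1 only when the transaction is online and beyond the cut-off."""
    return int(online_order == 1 and distance_from_home > threshold)


for dist, online, chip, stored_high, stored_both in HEAD_ROWS:
    assert online_and_chip(online, chip) == stored_both
    assert online_high_distance(online, dist) == stored_high
print("Both flags reproduced for the five sample rows")
```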

**Output:**
- **X**: Feature matrix (n_samples × 12 features) - all numeric predictors
- **y**: Binary target vector (n_samples) - fraud indicator (0 = legitimate, 1 = fraud)

**Feature Strategy:**
This approach balances information richness with model efficiency:
- Retains both raw and transformed features for model flexibility
- Excludes the string category that duplicates online_order and used_chip
- Removes binned variables that repeat continuous features at lower resolution
- Keeps the feature set compact (12 features), which helps limit overfitting risk on a 100k-row sample

In [7]:
# =============================================================================
# DATA VALIDATION AND PREPARATION
# =============================================================================
# Validates class balance and prepares features for modeling
# Ensures dataset integrity before model training begins
# =============================================================================

# ─────────────────────────────────────────────────────────────────────────────
# Class Balance Verification
# ─────────────────────────────────────────────────────────────────────────────
# Confirm fraud rate matches expected 8.74% from stratified sampling in ETL
# Any mismatch indicates potential data loading or processing errors

# Calculate fraud statistics
fraud_count = int(df['fraud'].sum())        # Total number of fraudulent transactions
fraud_rate = df['fraud'].mean()             # Proportion of fraud (0 to 1)
total_rows = df.shape[0]                    # Total number of transactions

# Display fraud distribution clearly
print(f"Fraud rate: {fraud_rate:.2%} ({fraud_count:,} fraud cases out of {total_rows:,} total transactions)")

# ─────────────────────────────────────────────────────────────────────────────
# Verify against expected rate from ETL pipeline
# ─────────────────────────────────────────────────────────────────────────────
# Expected rate comes from sample_log.json created during stratified sampling
expected_fraud_rate = 0.0874  # 8.74% fraud rate from ETL stratified sampling
rate_difference = abs(fraud_rate - expected_fraud_rate)  # Absolute difference

# Check if observed rate matches expected (tolerance of 0.01 percentage points)
if rate_difference < 0.0001:
    print(f"✓ Class balance verified: matches expected rate ({expected_fraud_rate:.2%})")
else:
    print(f"Issue: Fraud rate {fraud_rate:.4%} differs from expected {expected_fraud_rate:.2%}")
    print(f"  Difference: {rate_difference:.4%}")

# Calculate and display imbalance ratio
imbalance_ratio = (1 - fraud_rate) / fraud_rate  # Ratio of non-fraud to fraud
print(f"Imbalance ratio: 1:{imbalance_ratio:.1f} (non-fraud : fraud)")

print("\n" + "="*90 + "\n")

# ─────────────────────────────────────────────────────────────────────────────
# Feature and Target Split
# ─────────────────────────────────────────────────────────────────────────────
# Separate features (X) from target variable (y)
# Exclude features that cannot be used for modeling

target = "fraud"  # Target variable name

# Define columns to exclude from feature matrix
# These are either:
# 1. The target variable itself
# 2. String categorical variables created for EDA visualization
# 3. Binned variables created for hypothesis testing/visualization only
exclude_cols = [
    target,                                      # Target variable (what we're predicting)
    'online_chip_category',                      # String labels: "online_no_chip", "online_with_chip", etc.
                                                 # (underlying features online_order + used_chip already included)
    'log_distance_from_home_bin',                # Categorical bins for EDA (continuous log_distance_from_home included)
    'log_purchase_price_bin',                    # Categorical bins for EDA (continuous log_ratio_to_median_purchase_price included)
    'log_distance_from_last_transaction_bin'     # Categorical bins for EDA (continuous log_distance_from_last_transaction included)
]

# Create feature matrix X (drop excluded columns if they exist)
X = df.drop(columns=[col for col in exclude_cols if col in df.columns])

# Create target vector y (ensure binary encoding: 0 or 1)
y = df[target].astype(int)  # 0 = legitimate transaction, 1 = fraudulent transaction

# Display shape and confirmation
print(f"Feature matrix (X) shape: {X.shape[0]:,} samples x {X.shape[1]} features")
print(f"Target vector (y) shape: {y.shape[0]:,} samples")
print(f"\nTarget distribution:")
print(f"  Class 0 (legitimate): {(y == 0).sum():,} ({(y == 0).mean():.2%})")
print(f"  Class 1 (fraud):      {(y == 1).sum():,} ({(y == 1).mean():.2%})")

# ─────────────────────────────────────────────────────────────────────────────
# Display features included in modeling
# ─────────────────────────────────────────────────────────────────────────────
print(f"\n{'='*90}")
print("Features Included in Modelling:")
print(f"{'='*90}\n")

print("Original Features (7 from dataset):")
print("  distance_from_home")
print("  distance_from_last_transaction")
print("  ratio_to_median_purchase_price")
print("  repeat_retailer")
print("  used_chip")
print("  used_pin_number")
print("  online_order")

print("\nLog-Transformed Features (3 engineered for skew reduction):")
print("  log_distance_from_home")
print("  log_distance_from_last_transaction")
print("  log_ratio_to_median_purchase_price")

print("\nInteraction Features (2 engineered for domain insights):")
print("  online_high_distance")
print("  online_and_chip")

print(f"\nTotal: {X.shape[1]} features")
print(f"\n{'='*90}\n")

Fraud rate: 8.74% (8,740 fraud cases out of 100,000 total transactions)
✓ Class balance verified: matches expected rate (8.74%)
Imbalance ratio: 1:10.4 (non-fraud : fraud)


Feature matrix (X) shape: 100,000 samples x 12 features
Target vector (y) shape: 100,000 samples

Target distribution:
  Class 0 (legitimate): 91,260 (91.26%)
  Class 1 (fraud):      8,740 (8.74%)

Features Included in Modelling:

Original Features (7 from dataset):
  distance_from_home
  distance_from_last_transaction
  ratio_to_median_purchase_price
  repeat_retailer
  used_chip
  used_pin_number
  online_order

Log-Transformed Features (3 engineered for skew reduction):
  log_distance_from_home
  log_distance_from_last_transaction
  log_ratio_to_median_purchase_price

Interaction Features (2 engineered for domain insights):
  online_high_distance
  online_and_chip

Total: 12 features
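
The feature names in the cell above are printed from hard-coded strings, so they could drift from the real contents of `X` if the exclusion list changes. The sketch below derives the expected feature list from the 17 column names shown by `df.head()` and the five excluded columns. The helper `check_feature_columns` is hypothetical and is not part of the notebook.

```python
ALL_COLUMNS = [
    "distance_from_home", "distance_from_last_transaction",
    "ratio_to_median_purchase_price", "repeat_retailer", "used_chip",
    "used_pin_number", "online_order", "fraud",
    "log_distance_from_home", "log_distance_from_last_transaction",
    "log_ratio_to_median_purchase_price", "log_distance_from_home_bin",
    "log_purchase_price_bin", "log_distance_from_last_transaction_bin",
    "online_high_distance", "online_chip_category", "online_and_chip",
]

EXCLUDE_COLS = {
    "fraud", "online_chip_category", "log_distance_from_home_bin",
    "log_purchase_price_bin", "log_distance_from_last_transaction_bin",
}

# 17 columns minus 5 exclusions leaves the 12 modelling features
EXPECTED_FEATURES = [c for c in ALL_COLUMNS if c not in EXCLUDE_COLS]


def check_feature_columns(columns, expected=EXPECTED_FEATURES):
    """Raise if `columns` differs from the expected feature names."""
    columns = list(columns)
    missing = [c for c in expected if c not in columns]
    unexpected = [c for c in columns if c not in expected]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return len(columns)


print(len(ALL_COLUMNS), len(EXPECTED_FEATURES))
```

In the notebook, `check_feature_columns(X.columns)` could run straight after the feature/target split.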




---

# Section 2: Data Splitting and Class Balance

---