## Imports

- `pathlib.Path` – to build file paths in a clean, OS-independent way  
- `numpy`, `pandas` – to work with numeric data and DataFrames  
- `XGBClassifier`, `plot_importance` – the XGBoost model and feature importance plotting  
- `sklearn.metrics` – precision, recall, F1, ROC-AUC, confusion matrix, classification report  
- `RandomizedSearchCV` – to try different XGBoost hyperparameter combinations and pick the best one

In [2]:
from pathlib import Path

import numpy as np
import pandas as pd

from xgboost import XGBClassifier, plot_importance

from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from sklearn.model_selection import RandomizedSearchCV


## Load processed train / validation / test splits

Here we load the preprocessed datasets produced by the Phase 1 pipeline:

- `train.csv` – SMOTE-balanced training data after preprocessing  
- `val.csv` – validation split (no SMOTE applied)  
- `test.csv` – final hold-out test split (no SMOTE applied)  

Each row is a transaction, and each column (except `Class`) is a numeric feature:

- PCA components from the original dataset  
- Engineered behavioral features  
- Device / network / geo features  
- One-hot encoded categorical variables  

The shapes:

- Train: **27,942 rows × 7,622 columns**  
- Val: **3,000 rows × 7,622 columns**  
- Test: **3,000 rows × 7,622 columns**  

Each split has 7,622 columns: 7,621 features plus the target. The matching column counts confirm that the preprocessing pipeline produced the same wide feature matrix for all three splits.
The target label is the column `Class` (0 = non-fraud, 1 = fraud).


In [6]:
from pathlib import Path

# Identify project root (one level up from the notebook directory)
NOTEBOOK_DIR = Path().resolve()
ROOT = NOTEBOOK_DIR.parents[0]


print("Notebook dir:", NOTEBOOK_DIR)
print("Project root:", ROOT)

DATA_DIR = ROOT / "data" / "processed"
print("DATA_DIR:", DATA_DIR)

TRAIN_PATH = DATA_DIR / "train.csv"
VAL_PATH   = DATA_DIR / "val.csv"
TEST_PATH  = DATA_DIR / "test.csv"

# Load files
train_df = pd.read_csv(TRAIN_PATH)
val_df = pd.read_csv(VAL_PATH)
test_df = pd.read_csv(TEST_PATH)

train_df.shape, val_df.shape, test_df.shape


Notebook dir: /Users/lavanyasrinivas/Documents/AI-First-Preauth-Fraud-Detection/AI-First-Preauth-Fraud-Detection/notebooks
Project root: /Users/lavanyasrinivas/Documents/AI-First-Preauth-Fraud-Detection/AI-First-Preauth-Fraud-Detection
DATA_DIR: /Users/lavanyasrinivas/Documents/AI-First-Preauth-Fraud-Detection/AI-First-Preauth-Fraud-Detection/data/processed


((27942, 7622), (3000, 7622), (3000, 7622))
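
As a quick illustration of how `parents` indexing picks the project root, here is a minimal sketch using `PurePosixPath` on a made-up path. The directory names are hypothetical and are not the project's real layout.

```python
from pathlib import PurePosixPath

# Hypothetical notebook location, used only for illustration.
notebook_dir = PurePosixPath("/home/user/project/notebooks")

# parents[0] is the immediate parent (one level up);
# parents[1] is the grandparent (two levels up).
one_up = notebook_dir.parents[0]
two_up = notebook_dir.parents[1]

# Building the data directory from the one-level-up root.
data_dir = one_up / "data" / "processed"

print(one_up)    # /home/user/project
print(two_up)    # /home/user
print(data_dir)  # /home/user/project/data/processed
```

Since the notebook sits directly inside the project's `notebooks` folder, `parents[0]` is the index that lands on the project root.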

## Train / validation / test label distribution

We now split the data into:

- `X_*` – all numeric features (after preprocessing and encoding)  
- `y_*` – the target label `Class` (0 = non-fraud, 1 = fraud)

In [7]:
TARGET_COL = "Class"

# Split into features (X) and labels (y)
X_train = train_df.drop(columns=[TARGET_COL])
y_train = train_df[TARGET_COL]

X_val = val_df.drop(columns=[TARGET_COL])
y_val = val_df[TARGET_COL]

X_test = test_df.drop(columns=[TARGET_COL])
y_test = test_df[TARGET_COL]

print("Train X:", X_train.shape, " y:", y_train.shape)
print("Val   X:", X_val.shape,   " y:", y_val.shape)
print("Test  X:", X_test.shape,  " y:", y_test.shape)

print("\nClass balance (train):")
print(y_train.value_counts(normalize=True).rename("proportion"))

print("\nClass balance (val):")
print(y_val.value_counts(normalize=True).rename("proportion"))

print("\nClass balance (test):")
print(y_test.value_counts(normalize=True).rename("proportion"))


Train X: (27942, 7621)  y: (27942,)
Val   X: (3000, 7621)  y: (3000,)
Test  X: (3000, 7621)  y: (3000,)

Class balance (train):
Class
0    0.5
1    0.5
Name: proportion, dtype: float64

Class balance (val):
Class
0    0.998333
1    0.001667
Name: proportion, dtype: float64

Class balance (test):
Class
0    0.999333
1    0.000667
Name: proportion, dtype: float64




The class proportions are:

- **Train:** 50% non-fraud, 50% fraud → this is expected because SMOTE was applied only to the training split to balance the classes.  
- **Validation:** ~0.17% fraud (5 of 3,000 rows) → highly imbalanced, closer to real-world fraud rates.  
- **Test:** ~0.07% fraud (2 of 3,000 rows) → even more imbalanced.

This setup is intentional:

- The model is **trained** on balanced data (to learn the minority class better).
- The model is **evaluated** on imbalanced validation and test sets, to see how it behaves under realistic fraud conditions.
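
To make the imbalance concrete, the proportions printed above can be converted back to row counts. This is a small arithmetic sketch; the dictionary layout and variable names are illustrative only.

```python
# Row counts and fraud proportions copied from the outputs above.
splits = {
    "val": {"rows": 3000, "fraud_rate": 0.001667},
    "test": {"rows": 3000, "fraud_rate": 0.000667},
}

# Convert each proportion back to a whole number of fraud rows.
fraud_counts = {
    name: round(info["rows"] * info["fraud_rate"])
    for name, info in splits.items()
}
print(fraud_counts)  # {'val': 5, 'test': 2}

# With n fraud cases, recall can only take the values 0/n, 1/n, ..., n/n,
# so one extra hit or miss moves recall by 1/n.
recall_step = {name: 1 / n for name, n in fraud_counts.items()}
print(recall_step)  # {'val': 0.2, 'test': 0.5}
```

With so few positives, a single misclassified fraud case shifts validation recall by 0.2 and test recall by 0.5, which is worth keeping in mind when reading the metrics later on.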


## Baseline XGBoost training

Here we train a first XGBoost model on the SMOTE-balanced training data.

Important:
- `scale_pos_weight` is computed from the class distribution (here it will be ~1.0 because the data is balanced).
- `eval_set` in XGBoost uses a list of `(X, y)` pairs, **not** named triples.  
  So we pass `[(X_val, y_val)]` instead of `("validation", X_val, y_val)`.
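
The `scale_pos_weight` ratio is simply negatives divided by positives. The sketch below uses a hypothetical helper, `compute_scale_pos_weight`, and a made-up imbalanced label list to show how the value would differ if the training data had not been resampled.

```python
from collections import Counter

def compute_scale_pos_weight(labels):
    """Return negatives / positives, or 1.0 when there are no positives."""
    counts = Counter(labels)
    neg, pos = counts.get(0, 0), counts.get(1, 0)
    return neg / pos if pos > 0 else 1.0

# Balanced labels: half of the 27,942 training rows in each class.
balanced = [0] * 13971 + [1] * 13971

# Hypothetical un-resampled labels with 5 frauds among 3,000 rows.
imbalanced = [0] * 2995 + [1] * 5

print(compute_scale_pos_weight(balanced))    # 1.0
print(compute_scale_pos_weight(imbalanced))  # 599.0
```

On balanced data the weight is 1.0, so here it has no effect; it would only matter if the model were trained on the original class distribution.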


In [9]:
from xgboost import XGBClassifier

# Compute scale_pos_weight from the (balanced) training labels
values, counts = np.unique(y_train, return_counts=True)
class_counts = dict(zip(values, counts))
neg = class_counts.get(0, 0)
pos = class_counts.get(1, 0)
scale_pos_weight = neg / pos if pos > 0 else 1.0

print("Class counts (train):", class_counts)
print("scale_pos_weight:", scale_pos_weight)

# Baseline XGBoost model
baseline_xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    scale_pos_weight=scale_pos_weight,
    tree_method="hist",     # fast on CPU
    eval_metric="logloss",  # avoids warnings
    n_jobs=-1,
)

# Fit on training data
baseline_xgb.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],  # list of (X, y) pairs
    verbose=False,
)

print("✅ Baseline XGBoost model trained.")


Class counts (train): {0: 13971, 1: 13971}
scale_pos_weight: 1.0
✅ Baseline XGBoost model trained.


## Evaluate baseline XGBoost on validation and test sets

Now that the baseline XGBoost model is trained on the SMOTE-balanced training data,  
we evaluate it on the **imbalanced** validation and test splits.

For each split we compute:

- **Precision** – of all predicted frauds, how many are truly fraud  
- **Recall** – of all true frauds, how many we correctly catch  
- **F1-score** – harmonic mean of precision and recall (more informative than accuracy on imbalanced data)  
- **ROC-AUC** – how well the model separates fraud vs non-fraud across thresholds  
- **Confusion matrix** – counts of TN, FP, FN, TP, printed by scikit-learn as `[[TN, FP], [FN, TP]]`  
- **Classification report** – per-class precision/recall/F1 and overall averages
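
The first three metrics follow directly from the confusion-matrix counts. Below is a minimal pure-Python sketch; `binary_metrics` is a hypothetical helper and the counts are made up for illustration, not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up counts: 6 frauds caught, 3 false alarms, 2 frauds missed.
precision, recall, f1 = binary_metrics(tn=989, fp=3, fn=2, tp=6)

print(round(precision, 4))  # 0.6667
print(round(recall, 4))     # 0.75
print(round(f1, 4))         # 0.7059
```

Note that `tn` never enters any of the three formulas, which is why these metrics stay informative even when non-fraud rows dominate the data.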


In [10]:
from sklearn.metrics import (
    precision_recall_fscore_support,
    roc_auc_score,
    confusion_matrix,
    classification_report,
)

def evaluate_split(model, X, y, split_name: str):
    print(f"\n===== {split_name.upper()} EVALUATION =====")

    # Predictions
    y_pred = model.predict(X)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X)[:, 1]
    else:
        y_proba = y_pred  # fallback, not ideal but safe

    # Binary metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, y_pred, average="binary", zero_division=0
    )

    try:
        roc_auc = roc_auc_score(y, y_proba)
    except ValueError:
        roc_auc = float("nan")

    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-score:  {f1:.4f}")
    print(f"ROC-AUC:   {roc_auc:.4f}")

    print("\nConfusion matrix:")
    print(confusion_matrix(y, y_pred))

    print("\nClassification report:")
    print(classification_report(y, y_pred, digits=4))

    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "roc_auc": float(roc_auc),
    }

# Evaluate on validation and test
val_metrics_baseline = evaluate_split(baseline_xgb, X_val, y_val, split_name="validation")
test_metrics_baseline = evaluate_split(baseline_xgb, X_test, y_test, split_name="test")

val_metrics_baseline, test_metrics_baseline



===== VALIDATION EVALUATION =====
Precision: 0.8000
Recall:    0.8000
F1-score:  0.8000
ROC-AUC:   0.9876

Confusion matrix:
[[2994    1]
 [   1    4]]

Classification report:
              precision    recall  f1-score   support

           0     0.9997    0.9997    0.9997      2995
           1     0.8000    0.8000    0.8000         5

    accuracy                         0.9993      3000
   macro avg     0.8998    0.8998    0.8998      3000
weighted avg     0.9993    0.9993    0.9993      3000


===== TEST EVALUATION =====
Precision: 1.0000
Recall:    1.0000
F1-score:  1.0000
ROC-AUC:   1.0000

Confusion matrix:
[[2998    0]
 [   0    2]]

Classification report:
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      2998
           1     1.0000    1.0000    1.0000         2

    accuracy                         1.0000      3000
   macro avg     1.0000    1.0000    1.0000      3000
weighted avg     1.0000    1.0000    1.0000      3000

({'precision': 0.8, 'recall': 0.8, 'f1': 0.8, 'roc_auc': 0.9876460767946578},
 {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'roc_auc': 1.0})

## Hyperparameter tuning with RandomizedSearchCV

The baseline XGBoost model already performs well (F1 ≈ 0.80 on validation, ROC-AUC ≈ 0.99).
To push performance further, we use `RandomizedSearchCV` to explore a small hyperparameter space:

- `max_depth` – tree depth (controls model complexity)  
- `learning_rate` – step size for boosting  
- `subsample` – fraction of rows used per boosting round  
- `colsample_bytree` – fraction of features used per tree  
- `n_estimators` – number of boosting trees  

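We optimize the cross-validated **F1-score**, since fraud detection cares about both precision and recall. Because cross-validation runs on the SMOTE-balanced training data, this score is measured on balanced folds rather than at real-world fraud rates.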
We keep the search space and number of iterations small to avoid very slow training.
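
To get a feel for how much work the randomized search saves, the sketch below counts the full grid for this search space and compares it with 10 sampled candidates on 3 folds. It uses the standard library's `random` module only to illustrate the idea; the candidates it draws will not match the ones scikit-learn samples.

```python
import itertools
import random

param_distributions = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
    "n_estimators": [200, 300, 400],
}

# Full grid: every combination of the listed values.
names = list(param_distributions)
grid = [
    dict(zip(names, combo))
    for combo in itertools.product(*param_distributions.values())
]
print(len(grid))  # 324

# A randomized search evaluates only n_iter candidates, each on cv folds.
n_iter, cv = 10, 3
rng = random.Random(42)
candidates = rng.sample(grid, n_iter)  # illustrative sampling, without replacement

print(len(candidates) * cv)  # 30 model fits
print(len(grid) * cv)        # 972 fits for an exhaustive grid search
```

Ten candidates cover about 3% of the 324 possible combinations, so the result is a reasonable configuration found cheaply rather than a guaranteed optimum.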


In [11]:
from sklearn.model_selection import RandomizedSearchCV

# Parameter search space (small but meaningful)
param_distributions = {
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.03, 0.05, 0.1],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
    "n_estimators": [200, 300, 400],
}

# Rebuild base model (same as baseline, but will be tuned)
base_xgb_for_search = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    scale_pos_weight=scale_pos_weight,
    tree_method="hist",
    eval_metric="logloss",
    n_jobs=-1,
)

# Optional: if training set becomes very large in future, we could subsample for tuning
X_train_search = X_train
y_train_search = y_train

print("Search training shape:", X_train_search.shape)

# RandomizedSearchCV for F1-score
search = RandomizedSearchCV(
    estimator=base_xgb_for_search,
    param_distributions=param_distributions,
    n_iter=10,          # number of parameter combinations to try
    scoring="f1",       # optimize F1 for fraud class
    cv=3,               # 3-fold cross-validation
    verbose=1,
    n_jobs=-1,
    random_state=42,
)

search.fit(X_train_search, y_train_search)

print("\n✅ Hyperparameter search finished.")
print("Best params:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.4f}")

best_xgb = search.best_estimator_


Search training shape: (27942, 7621)
Fitting 3 folds for each of 10 candidates, totalling 30 fits

✅ Hyperparameter search finished.
Best params: {'subsample': 0.9, 'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.03, 'colsample_bytree': 0.8}
Best CV F1: 0.9998


## Evaluate the tuned XGBoost model

Now that RandomizedSearchCV returned the best hyperparameters,  
we evaluate the tuned model on the **validation** and **test** sets.

This shows how the tuned model behaves on the imbalanced splits. With only 5 fraud cases in validation and 2 in test, the metrics are coarse and should be read with caution.


In [12]:
print("\n=== Evaluating Tuned Model on Validation and Test ===")

val_metrics_tuned = evaluate_split(best_xgb, X_val, y_val, split_name="validation (tuned)")
test_metrics_tuned = evaluate_split(best_xgb, X_test, y_test, split_name="test (tuned)")

val_metrics_tuned, test_metrics_tuned



=== Evaluating Tuned Model on Validation and Test ===

===== VALIDATION (TUNED) EVALUATION =====
Precision: 0.8000
Recall:    0.8000
F1-score:  0.8000
ROC-AUC:   0.9958

Confusion matrix:
[[2994    1]
 [   1    4]]

Classification report:
              precision    recall  f1-score   support

           0     0.9997    0.9997    0.9997      2995
           1     0.8000    0.8000    0.8000         5

    accuracy                         0.9993      3000
   macro avg     0.8998    0.8998    0.8998      3000
weighted avg     0.9993    0.9993    0.9993      3000


===== TEST (TUNED) EVALUATION =====
Precision: 1.0000
Recall:    1.0000
F1-score:  1.0000
ROC-AUC:   1.0000

Confusion matrix:
[[2998    0]
 [   0    2]]

Classification report:
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      2998
           1     1.0000    1.0000    1.0000         2

    accuracy                         1.0000      3000
   macro avg     1.0000    1.0000    1.0000      3000
weighted avg     1.0000    1.0000    1.0000      3000

({'precision': 0.8, 'recall': 0.8, 'f1': 0.8, 'roc_auc': 0.9957929883138564},
 {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'roc_auc': 1.0})
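
A short comparison of the validation metrics for the baseline and tuned models, copied from the outputs in this notebook, shows where tuning actually made a difference. The pair-count reading of ROC-AUC below relies on the standard interpretation of AUC as the fraction of (fraud, non-fraud) pairs ranked correctly, with ties counted as half; the variable names are illustrative.

```python
# Validation metrics copied from the baseline and tuned evaluation outputs.
val_baseline = {"precision": 0.8, "recall": 0.8, "f1": 0.8, "roc_auc": 0.9876460767946578}
val_tuned = {"precision": 0.8, "recall": 0.8, "f1": 0.8, "roc_auc": 0.9957929883138564}

# Per-metric change from baseline to tuned.
delta = {name: round(val_tuned[name] - val_baseline[name], 4) for name in val_baseline}
print(delta)  # {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'roc_auc': 0.0081}

# ROC-AUC as a pair count: 5 frauds x 2,995 non-frauds in validation.
n_pos, n_neg = 5, 2995
n_pairs = n_pos * n_neg
misranked_baseline = round((1 - val_baseline["roc_auc"]) * n_pairs)
misranked_tuned = round((1 - val_tuned["roc_auc"]) * n_pairs)
print(n_pairs, misranked_baseline, misranked_tuned)  # 14975 185 63
```

At the default 0.5 threshold the two models make identical predictions on validation; tuning only improved how the probabilities rank fraud above non-fraud.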

## Threshold tuning for maximum F1-score

XGBoost outputs probabilities, and the default decision threshold (0.50) is not necessarily the best choice for imbalanced fraud data.
We sweep thresholds from 0.01 to 0.99 in steps of 0.01 and compute the validation F1-score at each one.
The threshold that gives the highest F1 is selected as the optimal operating point.
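
The sweep can be sketched in pure Python on a handful of toy scores. The labels, probabilities, and the helper `f1_at_threshold` are invented for illustration and are unrelated to the notebook's data.

```python
def f1_at_threshold(y_true, y_proba, threshold):
    """F1 for the positive class when predicting 1 if proba >= threshold."""
    preds = [1 if p >= threshold else 0 for p in y_proba]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels and predicted fraud probabilities (made up).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_proba = [0.05, 0.10, 0.20, 0.55, 0.40, 0.62, 0.80, 0.95]

thresholds = [round(0.01 * i, 2) for i in range(1, 100)]  # 0.01 .. 0.99
scores = [f1_at_threshold(y_true, y_proba, t) for t in thresholds]

# max() returns the first maximum, matching np.argmax tie-breaking.
best_idx = max(range(len(scores)), key=scores.__getitem__)
best_threshold, best_f1 = thresholds[best_idx], scores[best_idx]

print(best_threshold, round(best_f1, 4))                 # 0.63 0.8
print(round(f1_at_threshold(y_true, y_proba, 0.5), 4))   # 0.5714
```

In this toy case, raising the threshold just above the highest-scoring non-fraud row removes both false positives at the cost of one missed fraud, which lifts F1 from about 0.57 to 0.80.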


In [13]:
import numpy as np
from sklearn.metrics import f1_score

# Get validation probabilities (not labels)
val_proba = best_xgb.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.01, 0.99, 99)
f1_scores = []

for t in thresholds:
    preds = (val_proba >= t).astype(int)
    f1_scores.append(f1_score(y_val, preds, zero_division=0))

best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]

print("Best threshold:", best_threshold)
print("Best F1 score:", best_f1)


Best threshold: 0.67
Best F1 score: 0.8888888888888888


## Final evaluation using the tuned probability threshold

Here we apply the threshold found in the search above to convert predicted
probabilities into fraud labels, then report F1, precision, recall, and the confusion matrix
for the tuned model. Because the threshold was chosen on the validation set, the validation
numbers are optimistic; the test split gives the less biased estimate.


In [14]:
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score

def evaluate_with_threshold(model, X, y, threshold, split_name="validation"):
    print(f"\n===== FINAL EVALUATION ({split_name.upper()}) =====")
    
    proba = model.predict_proba(X)[:, 1]
    preds = (proba >= threshold).astype(int)

    precision = precision_score(y, preds, zero_division=0)
    recall = recall_score(y, preds, zero_division=0)
    f1 = f1_score(y, preds, zero_division=0)

    print(f"Threshold: {threshold:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-score:  {f1:.4f}")

    print("\nConfusion matrix:")
    print(confusion_matrix(y, preds))

    print("\nClassification report:")
    print(classification_report(y, preds, digits=4))

    return {"precision": precision, "recall": recall, "f1": f1}


# Evaluate on validation and test
final_val_metrics = evaluate_with_threshold(best_xgb, X_val, y_val, best_threshold, split_name="validation")
final_test_metrics = evaluate_with_threshold(best_xgb, X_test, y_test, best_threshold, split_name="test")

final_val_metrics, final_test_metrics



===== FINAL EVALUATION (VALIDATION) =====
Threshold: 0.6700
Precision: 1.0000
Recall:    0.8000
F1-score:  0.8889

Confusion matrix:
[[2995    0]
 [   1    4]]

Classification report:
              precision    recall  f1-score   support

           0     0.9997    1.0000    0.9998      2995
           1     1.0000    0.8000    0.8889         5

    accuracy                         0.9997      3000
   macro avg     0.9998    0.9000    0.9444      3000
weighted avg     0.9997    0.9997    0.9996      3000


===== FINAL EVALUATION (TEST) =====
Threshold: 0.6700
Precision: 1.0000
Recall:    1.0000
F1-score:  1.0000

Confusion matrix:
[[2998    0]
 [   0    2]]

Classification report:
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000      2998
           1     1.0000    1.0000    1.0000         2

    accuracy                         1.0000      3000
   macro avg     1.0000    1.0000    1.0000      3000
weighted avg     1.0000    1.0000    1.0000      3000

({'precision': 1.0, 'recall': 0.8, 'f1': 0.8888888888888888},
 {'precision': 1.0, 'recall': 1.0, 'f1': 1.0})