2.0 Modeling

The fraud detection approach implemented in this analysis uses a two-stage modeling process. The intent of this design is to apply a broadly applicable decision threshold to the majority of transactions in the first stage, while reserving a more detailed and targeted model for transactions that fall near the decision boundary. This allows the system to maintain high throughput and low customer impact for most transactions, while focusing modeling complexity where it is most valuable.

Given the nature of the dataset, fraud is a rare event, with fraudulent transactions representing approximately 0.17% of all observations. This level of class imbalance is a key consideration in both model selection and evaluation. In this context, traditional accuracy metrics are not informative, as a naïve model that predicts all transactions as non-fraudulent would achieve high accuracy while providing no practical value. As a result, Precision–Recall AUC is used as the primary evaluation metric, reflecting both the rarity of fraud and the operational cost associated with false positives.
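To make the accuracy point concrete, the sketch below scores a hypothetical "always legitimate" baseline using the class counts that appear later in this notebook (492 fraud cases out of 284,807 transactions). The function name is illustrative only.

```python
def baseline_metrics(n_total: int, n_fraud: int) -> dict:
    """Metrics for a model that labels every transaction as non-fraud."""
    true_negatives = n_total - n_fraud   # every legitimate transaction is "correct"
    true_positives = 0                   # no fraud case is ever flagged
    return {
        "accuracy": (true_negatives + true_positives) / n_total,
        "recall": true_positives / n_fraud,
        "fraud_rate": n_fraud / n_total,
    }

m = baseline_metrics(n_total=284_807, n_fraud=492)
print(f"accuracy={m['accuracy']:.4%}  recall={m['recall']:.0%}  fraud_rate={m['fraud_rate']:.3%}")
```

The baseline is right on almost every transaction and still catches no fraud, which is why accuracy is set aside here.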

2.1 Stage 1: High-Recall Screening

The objective of the first-stage model is to act as a high-recall screening filter, identifying transactions that warrant further examination while allowing the majority of low-risk transactions to pass without additional friction. This stage is intentionally conservative, prioritizing recall over precision to minimize the likelihood of fraudulent transactions being missed early in the process.

By design, this stage applies a single threshold across the full dataset and is optimized to capture the vast majority of fraud cases, accepting that some legitimate transactions will be forwarded to the second stage as a result. Transactions not flagged at this stage are treated as low risk and are not evaluated further.

2.2 Stage 2: Boundary Refinement

Transactions flagged by Stage 1 are passed to a second model focused on refining decisions near the classification boundary. This stage operates on a significantly smaller subset of data and is optimized for precision, with the goal of separating high-confidence fraud from legitimate transactions that were conservatively flagged in the first stage.

Multiple operating thresholds were evaluated for the second-stage model in order to understand the tradeoff between fraud capture and customer impact. Based on a review of the resulting confusion matrices, an operating point corresponding to approximately 80% Stage-2 recall was selected for automatic blocking.

At this operating point:

Approximately 0.002% of legitimate transactions would be incorrectly blocked automatically.

An additional 0.019% of legitimate transactions would be flagged for customer contact or step-up verification rather than outright blocking.

When combined with the first-stage screening, this configuration results in the system identifying approximately 95.3% of fraudulent transactions overall, while keeping false-positive impact on legitimate customers very low. This reflects a deliberate balance between fraud prevention effectiveness and customer experience, rather than an attempt to maximize any single metric in isolation.
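The routing described in this section can be sketched as a small decision function. This is an illustration under assumptions, not the deployed logic: the function name, argument names, and action labels are hypothetical, and the default thresholds are simply the values printed later in this notebook (Stage-1 gate of about 0.000626, Stage-2 thresholds of about 0.871 at the 80% recall point and about 0.161 at the 95% recall point).

```python
from typing import Optional

def route_transaction(
    p_stage1: float,
    p_stage2: Optional[float],
    t_gate: float = 0.000626,   # Stage-1 screening threshold (assumed from notebook output)
    t_block: float = 0.871,     # Stage-2 threshold for automatic blocking
    t_flag: float = 0.161,      # Stage-2 threshold for customer contact / step-up
) -> str:
    """Map Stage-1 and Stage-2 fraud probabilities to an action label."""
    if p_stage1 < t_gate:
        return "approve"        # low risk: never reaches Stage 2
    if p_stage2 is None:
        raise ValueError("Stage-2 probability is required once Stage 1 flags a transaction")
    if p_stage2 >= t_block:
        return "block"
    if p_stage2 >= t_flag:
        return "verify"         # contact the customer or request step-up verification
    return "approve"

print(route_transaction(0.0001, None))   # below the Stage-1 gate
print(route_transaction(0.5, 0.95))      # high Stage-2 score
print(route_transaction(0.5, 0.30))      # between the two Stage-2 thresholds
```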

In [1]:
# Load data

# Connect to Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/Scikit learn/Fraud Classification/creditcard.csv'

Mounted at /content/drive


In [17]:
from functools import partial

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import (
    train_test_split, StratifiedKFold, GridSearchCV,
    cross_val_score, cross_val_predict
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    average_precision_score, roc_auc_score, precision_recall_curve,
    classification_report, confusion_matrix, make_scorer
)
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier

In [4]:
# Add an hour column showing the hour of the day a transaction occurred.
df = pd.read_csv(file_path)
df["hour"] = (df["Time"] // 3600) % 24
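As a quick check of the hour derivation, and assuming `Time` holds seconds elapsed since the first transaction in the file (as in the public creditcard dataset), the same arithmetic behaves as follows on a few hand-picked values. Under that assumption the result is an hour offset from the start of recording, which matches clock time only if recording began at midnight. The helper name is illustrative.

```python
def hour_of_day(seconds_elapsed: float) -> int:
    """Same arithmetic as the hour column: whole hours elapsed, wrapped to 0-23."""
    return int(seconds_elapsed // 3600) % 24

for s in (0, 3599, 3600, 86_400, 90_000):
    print(s, "->", hour_of_day(s))
```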

In [6]:
# Create training and validation data sets

X = df.drop(columns=["Class", "Time"])
y = df["Class"]  # 1 = fraud, 0 = not fraud

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Logistic regression models and an HGB (histogram-based gradient boosting) classifier are fit next, and their results are compared to determine which models are more effective for this data set.

In [7]:
# Create pipelines to examine logistic regression models with balanced class weights and with no class-weight correction

def fit_eval(pipe, X_train, y_train, X_valid, y_valid, name="model"):
    pipe.fit(X_train, y_train)
    p = pipe.predict_proba(X_valid)[:, 1]
    print(f"\n{name}")
    print("PR AUC:", average_precision_score(y_valid, p))
    print("ROC AUC:", roc_auc_score(y_valid, p))
    return p

pipe_lr_none = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight=None))
])

pipe_lr_bal = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000, class_weight="balanced"))
])

p_none = fit_eval(pipe_lr_none, X_train, y_train, X_valid, y_valid, "LR (no class_weight)")
p_bal  = fit_eval(pipe_lr_bal,  X_train, y_train, X_valid, y_valid, "LR (balanced)")


LR (no class_weight)
PR AUC: 0.7424445995487957
ROC AUC: 0.959559615207929

LR (balanced)
PR AUC: 0.7197876670919009
ROC AUC: 0.9717294504323958


In [9]:
# HGB model

hgb = HistGradientBoostingClassifier(
    max_depth=6,
    learning_rate=0.05,
    max_iter=300,
    random_state=42
)

hgb.fit(X_train, y_train)
p_hgb = hgb.predict_proba(X_valid)[:, 1]

print("HGB PR AUC:", average_precision_score(y_valid, p_hgb))
print("HGB ROC AUC:", roc_auc_score(y_valid, p_hgb))

HGB PR AUC: 0.7051514270163073
HGB ROC AUC: 0.901171107863517


In [11]:
# Compare 3 models


def threshold_for_recall(y_true, p, target_recall= 0.9):
  prec, rec, thr = precision_recall_curve(y_true, p)

  rec_t = rec[1:]
  prec_t = prec[1:]

  ok = np.where(rec_t >= target_recall)[0]
  if len(ok) ==0:
    return 1.0, prec_t[-1], rec_t[-1]

  j = ok[-1]
  t = thr[j]
  return t, prec_t[j], rec_t[j]

def eval_at_threshold(y_true, p, t):
  y_pred = (p >= t).astype(int)
  print("Threshold:", t)
  # Use the y_true argument here rather than the global y_valid
  print(confusion_matrix(y_true, y_pred))
  print(classification_report(y_true, y_pred, digits=4, zero_division=0))

for name, p in {
    "LR (no class_weight)": p_none,
    "LR (balanced)": p_bal,
    "HGB": p_hgb,
}.items():
  t, pr, rc = threshold_for_recall(y_valid, p, target_recall= 0.90)
  print(f"\n{name}")
  print("threshold:", t)
  print("precision:", pr)


LR (no class_weight)
threshold: 0.001200000791970843
precision: 0.019123334765792865

LR (balanced)
threshold: 0.5557683216815212
precision: 0.06732223903177005

HGB
threshold: 0.0010229736602741576
precision: 0.0016354258397282676
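
The threshold search in this cell amounts to finding the highest score cut-off whose recall still meets the target. The dependency-free sketch below spells that out on toy data; the helper name and the toy scores are made up for illustration. For reference, scikit-learn documents `precision[i]` and `recall[i]` from `precision_recall_curve` as the values for predictions with score >= `thresholds[i]`, plus one final point (precision 1, recall 0) that has no threshold, so a `[:-1]` slice pairs each threshold with its own precision and recall. Pairing thresholds with a `[1:]` slice reports the neighbouring point instead, which is one possible reason a printed precision can differ slightly from the value in the matching classification report.

```python
def highest_threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, precision, recall) for the largest threshold with recall >= target.

    A transaction is predicted positive when score >= threshold.
    Returns None if there are no positive labels or the target cannot be met.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return None
    # Walk candidate thresholds from highest to lowest; recall only grows as the threshold drops.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        recall = tp / n_pos
        if recall >= target_recall:
            return t, tp / (tp + fp), recall
    return None

# Toy labels and scores (hypothetical)
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
s_toy = [0.05, 0.10, 0.20, 0.30, 0.60, 0.70, 0.80, 0.90]

result_75 = highest_threshold_for_recall(y_toy, s_toy, target_recall=0.75)
result_100 = highest_threshold_for_recall(y_toy, s_toy, target_recall=1.0)
print(result_75)
print(result_100)
```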


It appears that the balanced logistic regression has the greatest precision at the 0.90 recall target. This model will be tuned with a recall target of 0.95.

In [13]:
# LR parameter tuning

# ------------------------------------------------------------
# Preprocessing: log1p(Amount) + scale (all numeric features)
# ------------------------------------------------------------
def make_preprocessor(feature_cols, log_amount=True, amount_col="Amount"):
    """
    Builds a ColumnTransformer that:
      - imputes missing numeric values with median
      - (optionally) log1p-transforms Amount
      - scales all numeric features
    Assumes all features are numeric (creditcard classic dataset).
    """
    if log_amount and amount_col in feature_cols:
        amount_idx = [feature_cols.index(amount_col)]
        other_idx = [i for i, c in enumerate(feature_cols) if c != amount_col]

        pre = ColumnTransformer(
            transformers=[
                ("amount_log", Pipeline([
                    ("imputer", SimpleImputer(strategy="median")),
                    ("log1p", FunctionTransformer(np.log1p, feature_names_out="one-to-one")),
                    ("scaler", StandardScaler()),
                ]), amount_idx),
                ("other_num", Pipeline([
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]), other_idx),
            ],
            remainder="drop"
        )
    else:
        pre = Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ])

    return pre


# ------------------------------------------------------------
# Precision at (at least) a target recall
# ------------------------------------------------------------
def precision_at_recall(y_true, p, target_recall=0.90):
    prec, rec, thr = precision_recall_curve(y_true, p)
    prec_t = prec[1:]
    rec_t  = rec[1:]

    ok = np.where(rec_t >= target_recall)[0]
    if len(ok) == 0:
        return 0.0

    j = ok[-1]
    val = float(prec_t[j])

    # Guard against NaN/inf
    if not np.isfinite(val):
        return 0.0
    return val


def _precision_at_recall_scorer_func(y_true, y_score, target_recall):
    """Internal helper for make_precision_at_recall_scorer."""
    return precision_at_recall(y_true, y_score, target_recall)

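def make_precision_at_recall_scorer(target_recall=0.90):
    # A module-level helper keeps the scorer picklable when n_jobs > 1.
    # response_method replaces needs_proba, which is deprecated since
    # scikit-learn 1.4; with the old flag on a recent release the scorer
    # can fail and return NaN for every fold.
    return make_scorer(
        _precision_at_recall_scorer_func,
        response_method="predict_proba",
        target_recall=target_recall
    )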


# ------------------------------------------------------------
# FAST LR Optimization (L2 only) with small grid + 3-fold CV
# ------------------------------------------------------------
def optimize_logistic_regression_fast(
    df: pd.DataFrame,
    target_col: str = "Class",
    drop_cols: list | None = None,
    log_amount: bool = True,
    target_recall_for_report: float = 0.90,
    random_state: int = 42
):
    if drop_cols is None:
        drop_cols = []

    X = df.drop(columns=[target_col] + drop_cols).copy()
    y = df[target_col].astype(int).copy()

    feature_cols = list(X.columns)
    pre = make_preprocessor(feature_cols, log_amount=log_amount, amount_col="Amount")


    pipe = Pipeline([
        ("preprocess", pre),
        ("model", LogisticRegression(
            solver="lbfgs",
            penalty="l2",
            max_iter=2000,
            random_state=random_state
        ))
    ])

    # Small grid = fast
    param_grid = [{
        "model__C": [0.05, 0.1, 0.5, 1.0],
        "model__class_weight": [None, "balanced"],
    }]

    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)

    scorers = {
        "pr_auc": "average_precision",
        "roc_auc": "roc_auc",
        f"precision_at_recall_{target_recall_for_report:.2f}": make_precision_at_recall_scorer(target_recall_for_report),
    }

    gs = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid,
        scoring=scorers,
        refit="pr_auc",
        cv=cv,
        n_jobs=2,
        verbose=2
    )

    gs.fit(X, y)
    return gs


# ------------------------------------------------------------
# Evaluate best estimator at a target recall using OOF predictions
# ------------------------------------------------------------
def evaluate_at_target_recall(best_estimator, X, y, target_recall=0.90, cv_splits=3, random_state=42):
    cv = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=random_state)

    # Out-of-fold predicted probabilities for an unbiased threshold choice
    p_oof = cross_val_predict(
        best_estimator, X, y,
        cv=cv, method="predict_proba", n_jobs=2
    )[:, 1]

    print(f"OOF PR AUC: {average_precision_score(y, p_oof):.6f}")
    print(f"OOF ROC AUC: {roc_auc_score(y, p_oof):.6f}")

    prec, rec, thr = precision_recall_curve(y, p_oof)
    prec_t, rec_t = prec[1:], rec[1:]

    ok = np.where(rec_t >= target_recall)[0]
    if len(ok) == 0:
        print(f"Could not reach recall >= {target_recall:.2f} with this model.")
        return

    j = ok[-1]  # highest threshold meeting recall
    t = thr[j]

    y_pred = (p_oof >= t).astype(int)

    print(f"Chosen threshold for recall≥{target_recall:.2f}: {t:.6g}")
    print(f"Precision at that point: {prec_t[j]:.6f}")
    print(f"Recall at that point: {rec_t[j]:.6f}")
    print(confusion_matrix(y, y_pred))
    print(classification_report(y, y_pred, digits=4, zero_division=0))

gs = optimize_logistic_regression_fast(
    df=df,
    target_col="Class",
    drop_cols=["Time"],          # drop if present; remove if you already dropped it
    log_amount=True,
    target_recall_for_report=0.95,
    random_state=42
)

print("Best params (by PR AUC):", gs.best_params_)
print("Best CV PR AUC:", gs.best_score_)

# Evaluate operating point with out-of-fold predictions
X_all = df.drop(columns=["Class", "Time"]) if "Time" in df.columns else df.drop(columns=["Class"])
y_all = df["Class"].astype(int)

evaluate_at_target_recall(gs.best_estimator_, X_all, y_all, target_recall=0.95, cv_splits=3, random_state=42)

Fitting 3 folds for each of 8 candidates, totalling 24 fits




Best params (by PR AUC): {'model__C': 0.5, 'model__class_weight': None}
Best CV PR AUC: 0.756405814595492
OOF PR AUC: 0.755582
OOF ROC AUC: 0.977520
Chosen threshold for recall≥0.95: 0.000626186
Precision at that point: 0.011053
Recall at that point: 0.951220
[[242442  41873]
 [    24    468]]
              precision    recall  f1-score   support

           0     0.9999    0.8527    0.9205    284315
           1     0.0111    0.9512    0.0219       492

    accuracy                         0.8529    284807
   macro avg     0.5055    0.9020    0.4712    284807
weighted avg     0.9982    0.8529    0.9189    284807



In [14]:
# Best parameters

stage1 = gs.best_estimator_

X_all = df.drop(columns=["Class", "Time"])
y_all = df["Class"].astype(int)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

p_oof = cross_val_predict(
    stage1, X_all, y_all,
    cv=cv, method="predict_proba", n_jobs=2
)[:, 1]

prec, rec, thr = precision_recall_curve(y_all, p_oof)
prec_t, rec_t = prec[1:], rec[1:]   # align with thr

target_recall = 0.95
ok = np.where(rec_t >= target_recall)[0]
j = ok[-1]                          # highest threshold meeting recall
t_gate = thr[j]

print("Stage-1 gate threshold (recall≥0.95):", t_gate)
print("OOF precision at gate:", prec_t[j])
print("OOF recall at gate:", rec_t[j])

Stage-1 gate threshold (recall≥0.95): 0.0006261859674817748
OOF precision at gate: 0.011053377420878602
OOF recall at gate: 0.9512195121951219


In [16]:
# Remove transactions that were not classified as fraud at 95% recall threshold

flag = (p_oof >= t_gate)   # use OOF flagging to build stage-2 training data

X_stage2 = X_all.loc[flag].copy()
y_stage2 = y_all.loc[flag].copy()

stage2_df = X_stage2.copy()
stage2_df["Class"] = y_stage2.values

print("Stage-2 rows:", X_stage2.shape[0])
print("Stage-2 fraud rate:", y_stage2.mean())
print("Percent routed to Stage-2:", flag.mean())

Stage-2 rows: 42341
Stage-2 fraud rate: 0.01105311636475284
Percent routed to Stage-2: 0.14866558757333914
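
The Stage-2 figures printed above follow directly from the Stage-1 out-of-fold confusion matrix shown earlier ([[242442, 41873], [24, 468]]). The short check below recomputes them from those four counts; the variable names are illustrative.

```python
# Stage-1 out-of-fold confusion matrix at the recall >= 0.95 gate
tn, fp = 242_442, 41_873   # legitimate: passed / routed to Stage 2
fn, tp = 24, 468           # fraud: missed by the gate / routed to Stage 2

total = tn + fp + fn + tp
stage2_rows = fp + tp                    # everything the gate flags
stage2_fraud_rate = tp / stage2_rows     # equals Stage-1 precision at the gate
share_routed = stage2_rows / total
stage1_recall = tp / (tp + fn)

print("Stage-2 rows:", stage2_rows)
print("Stage-2 fraud rate:", round(stage2_fraud_rate, 6))
print("Share routed to Stage 2:", round(share_routed, 6))
print("Stage-1 recall:", round(stage1_recall, 6))
```

Stage 2 therefore trains on a set where fraud is roughly 1.1% of rows instead of about 0.17%, and the 24 fraud cases below the gate are never seen by Stage 2.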


In [18]:
# Evaluate logistic regression, random forest, and HGB models to determine the best model type for the Stage-2 training set

def precision_at_recall(y_true, p, target_recall=0.80):
    prec, rec, thr = precision_recall_curve(y_true, p)
    prec_t, rec_t = prec[1:], rec[1:]
    ok = np.where(rec_t >= target_recall)[0]
    if len(ok) == 0:
        return 0.0
    return float(prec_t[ok[-1]])


def eval_model_cv(model, X, y, cv, target_recall=0.80):
    p = cross_val_predict(model, X, y, cv=cv, method="predict_proba", n_jobs=2)[:, 1]
    return {
        "PR_AUC": average_precision_score(y, p),
        "ROC_AUC": roc_auc_score(y, p),
        f"Prec@Rec{target_recall:.2f}": precision_at_recall(y, p, target_recall=target_recall),
        "Stage2_FraudRate": float(y.mean()),
        "Pred_Prob_Mean": float(np.mean(p)),
    }


cv2 = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

models = {
    "LR_l2": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000))
    ]),
    "LR_balanced_l2": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=2000, class_weight="balanced"))
    ]),
    "HGB": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("clf", HistGradientBoostingClassifier(max_depth=3, learning_rate=0.1, random_state=42))
    ]),
    "RF": Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("clf", RandomForestClassifier(
            n_estimators=300,
            random_state=42,
            n_jobs=2,
            class_weight=None
        ))
    ]),
}

results = []
for name, model in models.items():
    metrics = eval_model_cv(model, X_stage2, y_stage2, cv=cv2, target_recall=0.80)
    metrics["Model"] = name
    results.append(metrics)

res_df = pd.DataFrame(results).sort_values("PR_AUC", ascending=False)
res_df


     PR_AUC   ROC_AUC  Prec@Rec0.80  Stage2_FraudRate  Pred_Prob_Mean           Model
3  0.885800  0.970732      0.951777          0.011053        0.011197              RF
2  0.870918  0.974860      0.930521          0.011053        0.010748             HGB
0  0.796134  0.959786      0.801282          0.011053        0.011103           LR_l2
1  0.777568  0.966507      0.822368          0.011053        0.093121  LR_balanced_l2


The HGB model is tuned next. It performed only slightly worse than RF in the initial comparison, and it is expected to work more effectively with the threshold-based operating points used here. A Stage-2 recall target of 0.80 is used for immediate transaction cancellation, and higher recall targets are used to flag transactions for investigation.

In [21]:
import warnings
# Precision at target recall
# -----------------------------
def precision_at_recall(y_true, p, target_recall=0.80):
    prec, rec, thr = precision_recall_curve(y_true, p)
    prec_t, rec_t = prec[1:], rec[1:]  # align with thr
    ok = np.where(rec_t >= target_recall)[0]
    if len(ok) == 0:
        return 0.0
    val = float(prec_t[ok[-1]])
    return 0.0 if not np.isfinite(val) else val

# Helper function for make_scorer that receives y_true and y_score (probabilities)
def _precision_at_recall_scorer_func_stage2(y_true, y_score, target_recall):
    return precision_at_recall(y_true, y_score, target_recall)

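def make_precision_at_recall_scorer_stage2(target_recall=0.80):
    # response_method replaces needs_proba, which is deprecated since
    # scikit-learn 1.4. With the old flag on a recent release the scorer
    # can fail and return NaN for every candidate, so the refit metric
    # no longer ranks the grid.
    return make_scorer(
        _precision_at_recall_scorer_func_stage2,
        response_method="predict_proba",
        target_recall=target_recall
    )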

# -----------------------------
# OOF evaluation helper
# -----------------------------
def oof_eval(estimator, X, y, cv, target_recall=0.80, n_jobs=2):
    p_oof = cross_val_predict(estimator, X, y, cv=cv, method="predict_proba", n_jobs=n_jobs)[:, 1]

    pr_auc = average_precision_score(y, p_oof)
    roc_auc = roc_auc_score(y, p_oof)

    prec, rec, thr = precision_recall_curve(y, p_oof)
    prec_t, rec_t = prec[1:], rec[1:]
    ok = np.where(rec_t >= target_recall)[0]
    j = ok[-1] if len(ok) else None

    thr_star = float(thr[j]) if j is not None else np.nan
    prec_star = float(prec_t[j]) if j is not None else 0.0
    rec_star = float(rec_t[j]) if j is not None else 0.0

    y_pred = (p_oof >= thr_star).astype(int) if np.isfinite(thr_star) else np.zeros_like(y, dtype=int)

    return {
        "OOF_PR_AUC": pr_auc,
        "OOF_ROC_AUC": roc_auc,
        f"OOF_Prec@Rec{target_recall:.2f}": prec_star,
        f"OOF_Thr@Rec{target_recall:.2f}": thr_star,
        f"OOF_Rec@Thr": rec_star,
        "confusion_matrix": confusion_matrix(y, y_pred),
        "classification_report": classification_report(y, y_pred, digits=4, zero_division=0),
    }

# -----------------------------
# FAST HGB tuning (Colab-slow friendly)
# -----------------------------
def tune_hgb_stage2(
    X_stage2,
    y_stage2,
    target_recall=0.80,
    random_state=42
):
    cv2 = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)

    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("clf", HistGradientBoostingClassifier(
            random_state=random_state
        ))
    ])

    # Small grid: informative but fast
    param_grid = {
        "clf__learning_rate": [0.05, 0.1],
        "clf__max_depth": [3, 4],
        "clf__max_iter": [200, 400],
        "clf__min_samples_leaf": [50, 100],
        "clf__l2_regularization": [0.0, 1.0],
    }

    scorers = {
        "pr_auc": "average_precision",
        "roc_auc": "roc_auc",
        f"prec_at_rec_{target_recall:.2f}": make_precision_at_recall_scorer_stage2(target_recall),
    }

    # Suppress the specific UserWarning about non-finite scores from GridSearchCV
    warnings.filterwarnings(
        'ignore',
        message='One or more of the test scores are non-finite:.*',
        category=UserWarning,
        module='sklearn.model_selection._search'
    )

    gs = GridSearchCV(
        pipe,
        param_grid,
        scoring=scorers,
        refit=f"prec_at_rec_{target_recall:.2f}",   # choose best for your operating goal
        cv=cv2,
        n_jobs=2,
        verbose=2
    )

    gs.fit(X_stage2, y_stage2)

    # Reset warnings to default behavior after GridSearchCV
    warnings.filterwarnings('default')

    return gs, cv2

# -----------------------------
# RUN (Stage 2)
# -----------------------------
# Assumes you already built these:
# X_stage2, y_stage2

target_recall_stage2 = 0.80

hgb_gs, cv2 = tune_hgb_stage2(
    X_stage2=X_stage2,
    y_stage2=y_stage2,
    target_recall=target_recall_stage2,
    random_state=42
)

print("Best params (Stage-2 HGB):", hgb_gs.best_params_)
print("Best CV Prec@Rec:", hgb_gs.best_score_)

# OOF evaluation of tuned winner
hgb_best = hgb_gs.best_estimator_
metrics = oof_eval(hgb_best, X_stage2, y_stage2, cv=cv2, target_recall=target_recall_stage2, n_jobs=2)

print("OOF PR AUC:", metrics["OOF_PR_AUC"])
print("OOF ROC AUC:", metrics["OOF_ROC_AUC"])
print(f"OOF Prec@Rec{target_recall_stage2:.2f}:", metrics[f"OOF_Prec@Rec{target_recall_stage2:.2f}"])
print(f"OOF Thr@Rec{target_recall_stage2:.2f}:", metrics[f"OOF_Thr@Rec{target_recall_stage2:.2f}"])
print(metrics["confusion_matrix"])
print(metrics["classification_report"])


Fitting 3 folds for each of 32 candidates, totalling 96 fits
Best params (Stage-2 HGB): {'clf__l2_regularization': 0.0, 'clf__learning_rate': 0.05, 'clf__max_depth': 3, 'clf__max_iter': 200, 'clf__min_samples_leaf': 50}
Best CV Prec@Rec: nan




OOF PR AUC: 0.8949282010943328
OOF ROC AUC: 0.9779109490827066
OOF Prec@Rec0.80: 0.9566326530612245
OOF Thr@Rec0.80: 0.698406579799198
[[41855    18]
 [   93   375]]
              precision    recall  f1-score   support

           0     0.9978    0.9996    0.9987     41873
           1     0.9542    0.8013    0.8711       468

    accuracy                         0.9974     42341
   macro avg     0.9760    0.9004    0.9349     42341
weighted avg     0.9973    0.9974    0.9973     42341





In [22]:
# Examine threshold and confusion matrix for different recall levels

def eval_stage2_at_recall(y_true, p, target_recall):
    prec, rec, thr = precision_recall_curve(y_true, p)
    prec_t, rec_t = prec[1:], rec[1:]  # align with thr

    ok = np.where(rec_t >= target_recall)[0]
    if len(ok) == 0:
        print(f"Cannot reach recall ≥ {target_recall:.2f}")
        return

    j = ok[-1]
    t = thr[j]

    y_pred = (p >= t).astype(int)
    cm = confusion_matrix(y_true, y_pred)

    print(f"\nTarget recall ≥ {target_recall:.2f}")
    print(f"Threshold: {t:.6g}")
    print(f"Precision: {prec_t[j]:.4f} | Recall: {rec_t[j]:.4f}")
    print("Confusion matrix:\n", cm)
    print(classification_report(y_true, y_pred, digits=4, zero_division=0))

# Get Stage-2 probabilities. Out-of-fold probabilities are preferable; the line below
# scores the tuned estimator on the same Stage-2 rows it was fit on, so these are
# in-sample probabilities and the resulting metrics are optimistic.
p_stage2 = hgb_best.predict_proba(X_stage2)[:, 1]

for r in [0.80, 0.90, 0.95, 0.99]:
    eval_stage2_at_recall(y_stage2, p_stage2, r)


Target recall ≥ 0.80
Threshold: 0.871056
Precision: 0.9894 | Recall: 0.8013
Confusion matrix:
 [[41868     5]
 [   93   375]]
              precision    recall  f1-score   support

           0     0.9978    0.9999    0.9988     41873
           1     0.9868    0.8013    0.8844       468

    accuracy                         0.9977     42341
   macro avg     0.9923    0.9006    0.9416     42341
weighted avg     0.9977    0.9977    0.9976     42341


Target recall ≥ 0.90
Threshold: 0.377944
Precision: 0.9420 | Recall: 0.9017
Confusion matrix:
 [[41847    26]
 [   45   423]]
              precision    recall  f1-score   support

           0     0.9989    0.9994    0.9992     41873
           1     0.9421    0.9038    0.9226       468

    accuracy                         0.9983     42341
   macro avg     0.9705    0.9516    0.9609     42341
weighted avg     0.9983    0.9983    0.9983     42341


Target recall ≥ 0.95
Threshold: 0.161162
Precision: 0.8918 | Recall: 0.9509
Confusion matri



In [31]:
# Fallout of false positives at the 80% and 95% Stage-2 recall operating points

# 5 legitimate transactions blocked at the 80% recall threshold, out of 284,315
percent_80 = 5/284315*100

# 55 legitimate transactions flagged at the 95% recall threshold, out of 284,315
percent_95 = 55/284315*100

# 23 fraud cases left unflagged by Stage 2 at the 95% recall threshold, out of 492 in total
percent_95_fallout = 23/492*100

print("The false positives blocking transactions for the 0.8 threshold is:", round(percent_80, 3), "%")

print("The flagged/contact customer rate for standard transactions for the 0.95 threshold is:", round(percent_95, 3), "%")

print("The fraudulent transactions that would be missed after blocking and flagging are:", round(percent_95_fallout, 3), "%")

The false positives blocking transactions for the 0.8 threshold is: 0.002 %
The flagged/contact customer rate for standard transactions for the 0.95 threshold is: 0.019 %
The fraudulent transactions that would be missed after blocking and flagging are: 4.675 %
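
The 4.675% figure divides the 23 fraud cases that Stage 2 leaves below the flagging threshold by all 492 fraud cases. It does not count the 24 fraud cases that scored below the Stage-1 gate in the out-of-fold confusion matrix and so never reach Stage 2. The sketch below redoes the end-to-end arithmetic using only counts printed in this notebook. The Stage-2 counts come from in-sample probabilities, so they are likely optimistic, and the variable names are illustrative.

```python
total_fraud = 492
stage1_missed = 24            # fraud below the Stage-1 gate (OOF confusion matrix)
stage2_missed_at_flag = 23    # fraud below the Stage-2 95% recall threshold
stage2_missed_at_block = 93   # fraud below the Stage-2 80% recall threshold

reached_stage2 = total_fraud - stage1_missed
blocked_or_flagged = reached_stage2 - stage2_missed_at_flag
auto_blocked = reached_stage2 - stage2_missed_at_block

end_to_end_recall = blocked_or_flagged / total_fraud
auto_block_recall = auto_blocked / total_fraud

print(f"Fraud blocked or flagged end to end: {blocked_or_flagged} of {total_fraud} ({end_to_end_recall:.1%})")
print(f"Fraud blocked automatically:         {auto_blocked} of {total_fraud} ({auto_block_recall:.1%})")
```

If the Stage-1 misses are counted, these numbers suggest roughly 90% of fraud is blocked or flagged end to end, with about 76% blocked automatically; the 95.3% figure corresponds to Stage-2 misses alone.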


After reviewing the confusion matrices at each operating threshold, setting the automatic block threshold at the 80% Stage-2 recall point would result in approximately 0.002% of legitimate transactions being incorrectly blocked. An additional 0.019% of legitimate transactions would be flagged for customer contact or step-up verification. With this configuration, the two-stage system identifies approximately 95.3% of fraudulent transactions overall, while keeping customer impact from false positives very low.