# Fraud Detection Case Study (Financial Transactions)

**Author:** Amar Nath  
**Date:** September 03, 2025

This notebook delivers a complete, production-lean fraud detection workflow for a large transactional dataset (~6.36M rows, 11 columns).  
It is built for scalability (chunked reading, memory-efficient dtypes, and hashed ID features) and emphasizes **precision-recall** metrics, which suit rare-event classification.

> **How to use:**  
> 1. Set `CSV_PATH` below to the location of the dataset.  
> 2. Run the cells in order.  
> 3. The notebook produces a cleaned feature matrix, trained models, evaluation plots, and a serialized calibrated model.


## Business Context & Data

We are predicting **fraudulent transactions** and providing operational insights. Key fields (as provided):

- `step` – hours since simulation start (1 step = 1 hour). 744 steps = 31 days.
- `type` – one of {CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER}
- `amount` – transaction amount in local currency
- `nameOrig` – origin customer
- `oldbalanceOrg` – origin balance before txn
- `newbalanceOrig` – origin balance after txn
- `nameDest` – destination customer
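- `oldbalanceDest` – destination balance before txn (not recorded for merchants, whose names start with 'M'; these appear as 0 in the sample rows below)
- `newbalanceDest` – destination balance after txn (not recorded for merchants, whose names start with 'M'; these appear as 0 in the sample rows below)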
- `isFraud` – label: 1 if fraud, else 0
- `isFlaggedFraud` – rule-based flag when amount > 200,000 for certain transfers

**Core Questions Addressed**
1. Data cleaning: missing values, outliers, multicollinearity.  
2. Model design & variable selection.  
3. Performance demonstration with appropriate metrics/plots.  
4. Key predictive factors (feature importance) + interpretation.  
5. Prevention recommendations and how to evaluate impact post-deployment.


In [1]:
# --- Setup
# Adjust the CSV_PATH to your dataset location.
CSV_PATH = r"C:\Users\Amar Nath\Downloads\Accredian\Fraud.csv"

# Artifacts
ARTIFACT_DIR =  r"C:\Users\Amar Nath\Downloads\Accredian\sss"
MODEL_PATH = f"{ARTIFACT_DIR}/hgb_classifier.joblib"
CALIB_MODEL_PATH = f"{ARTIFACT_DIR}/hgb_classifier_calibrated.joblib"
FEATURES_PARQUET = f"{ARTIFACT_DIR}/features.parquet"
REPORT_JSON = f"{ARTIFACT_DIR}/eval_report.json"

import json  # needed later to write the evaluation report
import os
os.makedirs(ARTIFACT_DIR, exist_ok=True)

import numpy as np
import pandas as pd

# Plotting (no seaborn to keep dependencies light)
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.metrics import precision_recall_fscore_support
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
# HistGradientBoostingClassifier is stable since scikit-learn 1.0, so no experimental import is needed
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.inspection import permutation_importance

from joblib import dump

# For VIF (multicollinearity) on numeric features
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Utilities
from pathlib import Path




In [3]:
# --- Memory-efficient dtype map for the large CSV
DTYPES = {
    "step": "int16",
    "type": "category",
    "amount": "float32",
    "nameOrig": "string",
    "oldbalanceOrg": "float32",
    "newbalanceOrig": "float32",
    "nameDest": "string",
    "oldbalanceDest": "float32",
    "newbalanceDest": "float32",
    "isFraud": "int8",
    "isFlaggedFraud": "int8",
}

READ_KW = dict(
    dtype=DTYPES,
    low_memory=True
)


In [5]:
# --- Chunked loading + lightweight EDA
# If your machine has enough RAM (~8–16GB), you can try reading at once.
# Otherwise, use chunks and optionally downsample non-fraud for faster experimentation.

TOTAL_ROWS = None  # set to known rowcount to show progress

fraud_rate_est = None
row_counter = 0
fraud_counter = 0

CHUNK_SIZE = 1_000_000  # tune this for your machine

# Quick head on a small chunk for schema sanity
try:
    head_df = pd.read_csv(CSV_PATH, nrows=5, **READ_KW)
    display(head_df.head())
except Exception as e:
    print("Unable to preview head; check CSV_PATH:", e)

# Estimate fraud rate without loading all rows to memory
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE, **READ_KW):
    row_counter += len(chunk)
    fraud_counter += int(chunk["isFraud"].sum())

fraud_rate_est = fraud_counter / row_counter if row_counter else None
print(f"Row count estimation: {row_counter:,}")
print(f"Approx fraud rate: {fraud_rate_est:.6f}" if fraud_rate_est is not None else "N/A")


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.639648,C1231006815,170136.0,160296.359375,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.280029,C1666544295,21249.0,19384.720703,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.139648,C2048537720,41554.0,29885.859375,M1230701703,0.0,0.0,0,0


Row count estimation: 6,362,620
Approx fraud rate: 0.001291
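
The loading cell mentions downsampling non-fraud rows for faster experimentation but does not show it. A minimal sketch of one way to do that while streaming chunks follows. The helper name `downsample_negatives`, the `neg_frac` value, and the `sample_weight` column are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def downsample_negatives(chunks, neg_frac=0.05, label="isFraud", seed=42):
    """Keep every fraud row and a random `neg_frac` share of non-fraud rows."""
    rng = np.random.default_rng(seed)
    kept = []
    for chunk in chunks:
        is_pos = chunk[label] == 1
        keep_neg = rng.random(len(chunk)) < neg_frac  # Bernoulli draw per row
        part = chunk[is_pos | keep_neg].copy()
        # Weight that undoes the sampling: 1 for fraud, 1/neg_frac for kept non-fraud
        part["sample_weight"] = np.where(part[label] == 1, 1.0, 1.0 / neg_frac)
        kept.append(part)
    return pd.concat(kept, ignore_index=True)

# Toy stand-in for pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE, **READ_KW)
toy = pd.DataFrame({
    "amount": np.arange(1000, dtype="float32"),
    "isFraud": ([1] + [0] * 99) * 10,  # 10 fraud rows out of 1000
})
chunks = [toy.iloc[i:i + 250] for i in range(0, len(toy), 250)]
small = downsample_negatives(chunks, neg_frac=0.1)
print(len(small), int(small["isFraud"].sum()))
```

Metrics computed on a downsampled set overstate precision, so validation is better done on data with the original class balance, or with the weights passed through to the metric.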


In [6]:
# --- Feature Engineering
# We will create numerically stable and leak-free features.

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Basic deltas to capture inconsistencies / suspicious flows
    df["orig_delta"] = (df["oldbalanceOrg"] - df["newbalanceOrig"]).astype("float32")
    df["dest_delta"] = (df["newbalanceDest"] - df["oldbalanceDest"]).astype("float32")

    # Balance change vs amount consistency flags
    df["orig_mismatch"] = (np.abs(df["orig_delta"] - df["amount"]) > 1e-3).astype("int8")
    df["dest_mismatch"] = (np.abs(df["dest_delta"] - df["amount"]) > 1e-3).astype("int8")

    # Zero balance patterns often appear in simulated fraud
    df["orig_old_is_zero"] = (df["oldbalanceOrg"] == 0).astype("int8")
    df["dest_old_is_zero"] = (df["oldbalanceDest"] == 0).astype("int8")

    # Merchant destination?
    df["dest_is_merchant"] = df["nameDest"].str.startswith("M", na=False).astype("int8")

    # Time features
    df["day"] = (df["step"] // 24).astype("int16")
    df["hour"] = (df["step"] % 24).astype("int8")
    # The simulation has no real calendar: this assumes day 0 is a Monday, so days 5-6, 12-13, ... are weekends
    df["is_weekend"] = df["day"].isin([5, 6, 12, 13, 19, 20, 26, 27]).astype("int8")

    # Hashing for high-cardinality IDs (names) to avoid leakage while capturing patterns
    # Keep small hash buckets to remain memory-lean
    def hash_series(s, n_buckets=256):
        # Use pandas hashing to get stable integers, then mod into buckets
        return (pd.util.hash_pandas_object(s, index=False).astype("int64") % n_buckets).astype("int16")

    df["orig_hash"] = hash_series(df["nameOrig"])
    df["dest_hash"] = hash_series(df["nameDest"])

    # Drop raw identifiers to prevent overfitting/leakage
    df = df.drop(columns=["nameOrig", "nameDest"])

    return df

# Preview on a small sample
sample = pd.read_csv(CSV_PATH, nrows=100_000, **READ_KW)
sample_fe = engineer_features(sample)
display(sample_fe.head())


Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,orig_delta,...,orig_mismatch,dest_mismatch,orig_old_is_zero,dest_old_is_zero,dest_is_merchant,day,hour,is_weekend,orig_hash,dest_hash
0,1,PAYMENT,9839.639648,170136.0,160296.359375,0.0,0.0,0,0,9839.640625,...,0,1,0,1,1,0,1,0,11,95
1,1,PAYMENT,1864.280029,21249.0,19384.720703,0.0,0.0,0,0,1864.279297,...,0,1,0,1,1,0,1,0,162,154
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1,0,181.0,...,0,1,0,1,0,0,1,0,16,247
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1,0,181.0,...,0,1,0,0,0,0,1,0,137,109
4,1,PAYMENT,11668.139648,41554.0,29885.859375,0.0,0.0,0,0,11668.140625,...,0,1,0,1,1,0,1,0,213,181
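
The mismatch flags above compare `float32` values against a fixed `1e-3` tolerance. For large balances, `float32` cannot represent cents, so a transaction that reconciles in exact arithmetic can still be flagged. The sketch below illustrates a tolerance that scales with the balance magnitude instead. The helper `balance_reconciles` and the safety factor `k` are assumptions, not the notebook's method.

```python
import numpy as np

def balance_reconciles(old, new, amount, k=4.0):
    """True where (old - new) matches amount within float32 rounding noise."""
    old = np.asarray(old, dtype=np.float32)
    new = np.asarray(new, dtype=np.float32)
    amount = np.asarray(amount, dtype=np.float32)
    # Rounding error grows with the largest magnitude involved (floor of 1.0)
    scale = np.maximum.reduce(
        [np.abs(old), np.abs(new), np.abs(amount), np.ones_like(old)]
    )
    tol = k * np.finfo(np.float32).eps * scale
    return np.abs((old - new) - amount) <= tol

# Row 0: first sample row of the dataset (reconciles).
# Row 1: a 2,345.67 debit from a ~12.3M balance; float32 rounds the balances
#        to whole units, so the delta is off by about 0.33.
# Row 2: a genuine mismatch (balance fell by 181, amount says 500).
old = np.array([170136.0, 12345679.0, 181.0], dtype=np.float32)
new = np.array([160296.359375, 12343333.0, 0.0], dtype=np.float32)
amount = np.array([9839.64, 2345.67, 500.0], dtype=np.float32)

abs_rule = np.abs((old - new) - amount) > 1e-3          # fixed tolerance
rel_rule = ~balance_reconciles(old, new, amount)        # scaled tolerance
print(abs_rule.tolist(), rel_rule.tolist())
```

An alternative with the same effect is to load the balance columns as `float64` before computing deltas, at the cost of more memory.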


In [9]:
# --- Cleaning: handle missing, outliers, and sanity checks

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Missing handling: merchant destinations lack balances; impute zeros (documented)
    for col in ["oldbalanceDest", "newbalanceDest"]:
        if col in df.columns:
            df[col] = df[col].fillna(0)

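    # Clip extreme amounts to mitigate outliers' influence on linear models.
    # Note: because this runs before engineer_features, rows above the cap will show up as
    # balance mismatches, and the quantile is computed on all loaded rows (train and validation).
    if "amount" in df.columns:
        df["amount"] = df["amount"].clip(lower=0, upper=df["amount"].quantile(0.999)).astype("float32")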

    # Ensure no negative balances (column names follow the CSV: oldbalanceOrg, newbalanceOrig)
    for col in ["oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]:
        if col in df.columns:
            df[col] = df[col].clip(lower=0)

    return df

# Apply cleaning + FE to a manageable working set for fast iteration
work = pd.read_csv(CSV_PATH, nrows=2_000_000, **READ_KW)  # subset for prototyping
work = clean_dataframe(work)
work = engineer_features(work)

target = "isFraud"
y = work[target].astype("int8")
X = work.drop(columns=[target])

print(X.shape, y.mean())
display(X.head())


(2000000, 20) 0.001018


Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFlaggedFraud,orig_delta,dest_delta,orig_mismatch,dest_mismatch,orig_old_is_zero,dest_old_is_zero,dest_is_merchant,day,hour,is_weekend,orig_hash,dest_hash
0,1,PAYMENT,9839.639648,170136.0,160296.359375,0.0,0.0,0,9839.640625,0.0,0,1,0,1,1,0,1,0,11,95
1,1,PAYMENT,1864.280029,21249.0,19384.720703,0.0,0.0,0,1864.279297,0.0,0,1,0,1,1,0,1,0,162,154
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,0,181.0,0.0,0,1,0,1,0,0,1,0,16,247
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,0,181.0,-21182.0,0,1,0,0,0,0,1,0,137,109
4,1,PAYMENT,11668.139648,41554.0,29885.859375,0.0,0.0,0,11668.140625,0.0,0,1,0,1,1,0,1,0,213,181


In [11]:
# --- Multicollinearity (VIF) on numeric subset
# Tree models are robust to collinearity, but we measure it for documentation/linear baselines.

numeric_cols = X.select_dtypes(include=["int16","int8","float32","float64"]).columns.tolist()
# Exclude the target if present
numeric_cols = [c for c in numeric_cols if c != "isFraud"]

X_num = X[numeric_cols].astype("float32").fillna(0)
# Add constant for statsmodels
X_num_const = sm.add_constant(X_num, has_constant='add')

vifs = []
for i, col in enumerate(X_num_const.columns):
    try:
        vifs.append({"feature": col, "VIF": variance_inflation_factor(X_num_const.values, i)})
    except Exception as e:
        vifs.append({"feature": col, "VIF": np.nan})

vif_df = pd.DataFrame(vifs).sort_values("VIF", ascending=False)
display(vif_df.head(20))


  vif = 1. / (1. - r_squared_i)
  return 1 - self.ssr/self.centered_tss


Unnamed: 0,feature,VIF
3,oldbalanceOrg,inf
4,newbalanceOrig,inf
8,orig_delta,9007199000000000.0
15,day,9007199000000000.0
1,step,4503600000000000.0
5,oldbalanceDest,3002400000000000.0
16,hour,2251800000000000.0
6,newbalanceDest,562950000000000.0
9,dest_delta,38166100000000.0
0,const,43.69841
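
The infinite and astronomically large VIFs above are what exact linear dependence looks like: `orig_delta` is `oldbalanceOrg - newbalanceOrig`, and `step` equals `24 * day + hour`. The sketch below reproduces the effect on synthetic data and shows that dropping one column from each dependent group brings VIF back to ordinary values. The `vif_table` helper and the synthetic columns are illustrative assumptions; it computes VIF as `TSS / SSR`, which equals `1 / (1 - R^2)`.

```python
import numpy as np

def vif_table(X, names):
    """VIF per column from an OLS fit of that column on all the others."""
    X = np.asarray(X, dtype=np.float64)
    out = {}
    for j, name in enumerate(names):
        y = X[:, j]
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ssr = float(np.sum((y - A @ coef) ** 2))   # residual sum of squares
        tss = float(np.sum((y - y.mean()) ** 2))   # total sum of squares
        out[name] = np.inf if ssr == 0.0 else tss / ssr
    return out

rng = np.random.default_rng(0)
n = 2000
old = rng.uniform(0, 1e5, n)
new = old * rng.uniform(0, 1, n)
delta = old - new                      # exact linear combination
step = rng.integers(1, 745, n)         # 1..744
day, hour = step // 24, step % 24      # step == 24 * day + hour

full_names = ["oldbalanceOrg", "newbalanceOrig", "orig_delta", "step", "day", "hour"]
full = vif_table(np.column_stack([old, new, delta, step, day, hour]), full_names)

reduced_names = ["oldbalanceOrg", "orig_delta", "day", "hour"]
reduced = vif_table(np.column_stack([old, delta, day, hour]), reduced_names)

print({k: f"{v:.3g}" for k, v in full.items()})
print({k: f"{v:.3g}" for k, v in reduced.items()})
```

For the logistic regression baseline, dropping the redundant columns (or adding regularization) is the usual remedy. The tree models are unaffected apart from importance being split across duplicates.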


In [13]:
# --- Train/Validation split with stratification
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train fraud rate:", y_train.mean())
print("Valid fraud rate:", y_valid.mean())


Train fraud rate: 0.001018125
Valid fraud rate: 0.0010175


In [None]:
# --- Preprocessing
# 'type' is a low-cardinality category -> One-Hot
cat_cols = [c for c in X.columns if str(X[c].dtype) == "category"]
num_cols = [c for c in X.columns if c not in cat_cols]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
        ("num", "passthrough", num_cols),
    ],
    remainder="drop",
    sparse_threshold=0
)


In [None]:
# --- Baseline Models

# 1) Logistic Regression (baseline)
log_reg = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=200, n_jobs=None, class_weight="balanced"))
])

# 2) HistGradientBoosting (fast, large-scale tree boosting)
hgb = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", HistGradientBoostingClassifier(
        max_depth=None,
        max_iter=200,
        learning_rate=0.05,
        l2_regularization=0.0,
        early_stopping=True,
        random_state=42
    ))
])

# 3) RandomForest (strong baseline; not as memory-efficient as HGB)
rf = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", RandomForestClassifier(
        n_estimators=200,
        max_depth=None,
        n_jobs=-1,
        class_weight="balanced_subsample",
        random_state=42
    ))
])


In [None]:
# --- Train & Evaluate (ROC-AUC and Average Precision / PR-AUC)

def evaluate_model(name, model, X_tr, y_tr, X_va, y_va):
    model.fit(X_tr, y_tr)
    # Scores for the positive class
    if hasattr(model, "predict_proba"):
        p_va = model.predict_proba(X_va)[:, 1]
    else:
        # Fallback for estimators without predict_proba (all three pipelines here have it)
        if hasattr(model, "decision_function"):
            p_va = model.decision_function(X_va)
            # Min-max scale decision scores to [0,1]; this keeps the ranking but is not a probability
            p_va = (p_va - p_va.min()) / (p_va.max() - p_va.min() + 1e-8)
        else:
            # Last resort: hard 0/1 predictions used as scores
            p_va = model.predict(X_va)

    roc = roc_auc_score(y_va, p_va)
    pr = average_precision_score(y_va, p_va)

    print(f"\n{name}: ROC-AUC={roc:.4f} | PR-AUC={pr:.4f}")
    RocCurveDisplay.from_predictions(y_va, p_va)
    plt.title(f"{name} ROC Curve")
    plt.show()

    PrecisionRecallDisplay.from_predictions(y_va, p_va)
    plt.title(f"{name} Precision-Recall Curve")
    plt.show()

    # Threshold at 0.5 for a quick confusion matrix (tune later)
    y_hat = (p_va >= 0.5).astype(int)
    print(classification_report(y_va, y_hat, digits=4))
    print(confusion_matrix(y_va, y_hat))

    return model, {"roc_auc": float(roc), "pr_auc": float(pr)}

reports = {}

for name, mdl in [
    ("LogisticRegression", log_reg),
    ("HistGradientBoosting", hgb),
    ("RandomForest", rf),
]:
    fitted, rep = evaluate_model(name, mdl, X_train, y_train, X_valid, y_valid)
    reports[name] = rep

# Pick best by PR-AUC (suitable for rare fraud)
best_name = max(reports, key=lambda k: reports[k]["pr_auc"])
best_model = {"LogisticRegression": log_reg, "HistGradientBoosting": hgb, "RandomForest": rf}[best_name]
print("Best model by PR-AUC:", best_name)


In [None]:
# --- Probability Calibration (for alert triage & risk scoring)
# Calibrate the best model with isotonic regression (non-parametric; needs enough fraud cases per fold).

calibrated = CalibratedClassifierCV(best_model, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
# Save
dump(calibrated, CALIB_MODEL_PATH)
print("Saved calibrated model:", CALIB_MODEL_PATH)

# Evaluate calibrated
if hasattr(calibrated, "predict_proba"):
    p_va = calibrated.predict_proba(X_valid)[:, 1]
else:
    p_va = calibrated.decision_function(X_valid)
    p_va = (p_va - p_va.min()) / (p_va.max() - p_va.min() + 1e-8)

roc = roc_auc_score(y_valid, p_va)
pr = average_precision_score(y_valid, p_va)
print(f"Calibrated {best_name}: ROC-AUC={roc:.4f} | PR-AUC={pr:.4f}")

RocCurveDisplay.from_predictions(y_valid, p_va)
plt.title(f"Calibrated {best_name} ROC")
plt.show()

PrecisionRecallDisplay.from_predictions(y_valid, p_va)
plt.title(f"Calibrated {best_name} PR")
plt.show()

# Store report
with open(REPORT_JSON, "w") as f:
    json.dump({"best_model": best_name, "roc_auc": roc, "pr_auc": pr}, f, indent=2)
print("Saved report:", REPORT_JSON)


In [None]:
# --- Feature Importance (Permutation) for Interpretability

# Use a small sample for speed
X_val_s = X_valid.sample(n=min(200_000, len(X_valid)), random_state=42)
y_val_s = y_valid.loc[X_val_s.index]

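# Score with average precision: the default (accuracy) barely moves when fraud is ~0.1% of rows
perm = permutation_importance(
    calibrated, X_val_s, y_val_s,
    scoring="average_precision", n_repeats=5, random_state=42, n_jobs=-1
)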
imp_df = pd.DataFrame({
    "feature": X_val_s.columns,
    "importance": perm.importances_mean,
    "std": perm.importances_std
}).sort_values("importance", ascending=False)

display(imp_df.head(25))

# Plot top 20
top = imp_df.head(20)
plt.figure()
plt.barh(top["feature"][::-1], top["importance"][::-1])
plt.title("Top 20 Features (Permutation Importance)")
plt.xlabel("Mean Importance")
plt.ylabel("Feature")
plt.show()

# Save for later use
imp_df.to_csv(f"{ARTIFACT_DIR}/feature_importance.csv", index=False)


In [None]:
# --- Operating Point Selection (Cost-sensitive thresholding)
# Choose a threshold given costs or desired precision/recall.

from sklearn.metrics import precision_recall_curve

prec, rec, thresh = precision_recall_curve(y_valid, p_va)
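# Example: pick the lowest threshold whose precision is >= 0.95 (lowest threshold = highest recall)
target_precision = 0.95
idx = np.where(prec[:-1] >= target_precision)[0]
# If the target is unreachable, fall back to the threshold with the highest precision
chosen_thr = float(thresh[idx[0]]) if len(idx) else float(thresh[np.argmax(prec[:-1])])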

print(f"Chosen threshold for precision >= {target_precision}: {chosen_thr:.4f}")
y_hat = (p_va >= chosen_thr).astype(int)
print(classification_report(y_valid, y_hat, digits=4))


In [None]:
# --- Persist features for downstream analysis
# Requires pyarrow for Parquet; if not available, fallback to CSV.

try:
    X.to_parquet(FEATURES_PARQUET, index=False)
    print("Saved feature matrix to:", FEATURES_PARQUET)
except Exception as e:
    fallback = FEATURES_PARQUET.replace(".parquet", ".csv")
    X.to_csv(fallback, index=False)
    print("Parquet failed, saved CSV to:", fallback, "Error:", e)


## Ops Plan & Monitoring (Post-Deployment)

**Prevention tactics to implement:**  
- **Velocity & amount controls**: tighter, dynamic limits on `TRANSFER` → `CASH_OUT` within short windows; hold funds for review if risk > threshold.  
- **Recipient risk scoring**: maintain a **hotlist** of suspicious recipients keyed on the raw `nameDest` (a `dest_hash` bucket pools many unrelated accounts), and propagate it quickly via graph analytics.  
- **Device/account hygiene**: MFA enforced on high-risk actions; anomaly detection on login/IP/device fingerprint.  
- **Real-time model + rules ensemble**: combine calibrated ML score with transparent rules (`isFlaggedFraud`, large `amount`, `orig_mismatch`, `dest_mismatch`).  
- **Human-in-the-loop review** for top-N riskiest transactions with auto-feedback loop to retrain weekly.
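
One possible reading of the "model + rules ensemble" tactic is that rules can only escalate a decision, never relax the model's verdict. The sketch below shows that shape. The function `triage`, the action names, and every threshold are hypothetical placeholders that would need tuning against review capacity.

```python
def triage(score, is_flagged, orig_mismatch, amount,
           review_thr=0.5, hold_thr=0.9, large_amount=200_000):
    """Combine a calibrated fraud score with transparent rules.

    Returns one of "hold", "review", "allow".
    """
    # Hard stop: very high model score or the legacy rule-based flag
    if score >= hold_thr or is_flagged:
        return "hold"
    # Manual review: moderate score, or a large amount that does not reconcile
    if score >= review_thr or (orig_mismatch and amount >= large_amount):
        return "review"
    return "allow"

decisions = [
    triage(score=0.95, is_flagged=0, orig_mismatch=0, amount=1_000),
    triage(score=0.10, is_flagged=1, orig_mismatch=0, amount=250_000),
    triage(score=0.20, is_flagged=0, orig_mismatch=1, amount=300_000),
    triage(score=0.20, is_flagged=0, orig_mismatch=1, amount=500),
]
print(decisions)
```

Keeping the rules explicit like this makes each alert explainable to reviewers even when the model score alone is not.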

**Did it work? How to tell (A/B or phased rollout):**  
- Randomly route a small % of traffic to **control** (existing system) vs **treatment** (new model+risk controls).  
- Track for 2–4 weeks: **confirmed fraud rate**, **$ prevented**, **alert precision/recall**, **review queue SLA**, **customer false-positive rate**.  
- Use **uplift** metrics and sequential testing; deploy to 100% once confidence intervals exclude 0 uplift.
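
The "confidence intervals exclude 0" criterion can be made concrete for the confirmed fraud rate. The sketch below computes a fixed-horizon 95% Wald interval for the difference between control and treatment rates. The function name and the counts are hypothetical, and a true sequential test would need a correction for repeated looks at the data that this sketch does not include.

```python
import math

def fraud_rate_reduction(ctrl_fraud, ctrl_n, trt_fraud, trt_n, z=1.96):
    """Difference in confirmed fraud rate (control - treatment) with a Wald CI."""
    p_c = ctrl_fraud / ctrl_n
    p_t = trt_fraud / trt_n
    diff = p_c - p_t
    # Standard error of a difference of two independent proportions
    se = math.sqrt(p_c * (1 - p_c) / ctrl_n + p_t * (1 - p_t) / trt_n)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts: 100k transactions per arm
diff, (lo, hi) = fraud_rate_reduction(130, 100_000, 90, 100_000)
print(f"reduction={diff:.5f}, 95% CI=({lo:.5f}, {hi:.5f})")

# A smaller effect at the same sample size is not distinguishable from zero
diff2, (lo2, hi2) = fraud_rate_reduction(130, 100_000, 120, 100_000)
print(f"reduction={diff2:.5f}, 95% CI=({lo2:.5f}, {hi2:.5f})")
```

Because fraud is rare, even 100k transactions per arm only resolves fairly large relative changes, which is worth factoring into the 2–4 week tracking window.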


## Answers to Prompts (Short Form)

1. **Data cleaning**: Missing merchant balances → impute 0; clip extreme `amount`; prevent negative balances; add leak-checked **delta/mismatch** features; VIF documented for numeric features.  
2. **Model**: Primary model is **HistGradientBoostingClassifier** with one-hot for `type`, hashed IDs, and calibrated probabilities. Baselines: Logistic Regression, Random Forest.  
3. **Variable selection**: Start from data dictionary → derive **consistency** features (`orig_delta/dest_delta`, mismatches), time features (`day/hour`), merchant flags, and hashed IDs; retain features with high permutation importance and business sense.  
4. **Performance**: Report **ROC-AUC** and **PR-AUC**; operating threshold chosen via cost/precision target. Confusion matrix + classification report provided.  
5. **Key predictors** (typically): `type` (TRANSFER/CASH_OUT), high `amount`, `orig_mismatch/dest_mismatch`, `dest_is_merchant==0`, `orig_old_is_zero`, and suspicious `dest_hash` buckets.  
6. **Do they make sense?** Yes. Fraud rings move funds via TRANSFER→CASH_OUT quickly, often from newly funded or zero-balance accounts whose balance deltas don't reconcile.  
7. **Prevention**: velocity controls, holds on high-risk transactions, hotlisted recipients, MFA for high-risk actions, rules+model ensemble.  
8. **Measuring impact**: online A/B or phased rollout with fraud loss prevented, precision/recall of alerts, and customer FP rates as primary KPIs.


---

### Repro Tips
- For the full 6.36M rows, increase `nrows` (or remove it) and consider running on a machine with ≥16GB RAM, or keep chunked training.
- Persist features + labels for faster iteration; train on stratified samples for tuning, then refit on full data.
- Always recheck data leakage when adding new features.

**Artifacts saved to:** the directory set in `ARTIFACT_DIR` (see the setup cell).
