
# Fraud Transaction Detection — End‑to‑End Notebook

This notebook walks you through a full, **reproducible** workflow to detect fraudulent transactions using the provided dataset.
It follows the expected steps: data cleaning, feature engineering, model training, evaluation, interpretation, and actionable recommendations.

> **Tip:** Run each cell from top to bottom. If your machine has limited RAM, use the built-in sampling options below.


In [11]:

# === 1) Setup & Imports ===
import os, gc, math, json, warnings
from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (roc_auc_score, roc_curve, precision_recall_curve,
                             average_precision_score, classification_report,
                             confusion_matrix)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Path to the dataset. This is the author's local path; adjust it for your machine.
DATA_PATH = Path("C:/Users/anjan/OneDrive/Documents/Desktop/fraud_detection/Fraud.csv")



## 2) Data Loading Strategy

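The full dataset has about 6.36M rows, which may not fit comfortably in memory on a typical machine. Choose **one** option:

- **Option A (recommended for most PCs):** Load the **first N rows** (e.g., 200k). This is fast, but `nrows` reads from the top of the file, so the sample covers only the earliest `step` values and is not random.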
- **Option B:** Chunked load with random sampling across the full file.
- **Option C:** Full load (only if you have lots of RAM).
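If memory is the main constraint, declaring column dtypes at load time can help with any of the options above. The sketch below is one way to do that; the `DTYPES` map and the `load_transactions` helper are illustrative assumptions (not part of the original notebook), and the two CSV rows are copied from the sample output so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical dtype map for the columns shown in this notebook.
# Balances stay float64 to avoid rounding large amounts; the ID columns
# (nameOrig, nameDest) are left at the pandas default (object).
DTYPES = {
    "step": "int32",
    "type": "category",
    "amount": "float64",
    "oldbalanceOrg": "float64",
    "newbalanceOrig": "float64",
    "oldbalanceDest": "float64",
    "newbalanceDest": "float64",
    "isFraud": "int8",
    "isFlaggedFraud": "int8",
}


def load_transactions(source, nrows=None):
    """Read the CSV with explicit dtypes; `source` is a path or file-like object."""
    return pd.read_csv(source, dtype=DTYPES, nrows=nrows)


# Tiny in-memory stand-in for Fraud.csv (two rows from the sample output).
csv_text = (
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0\n"
    "1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0\n"
)
small = load_transactions(io.StringIO(csv_text))
print(small.dtypes)
```

With the real file you would call `load_transactions(DATA_PATH, nrows=SAMPLE_N)`. Compare `df.memory_usage(deep=True).sum()` with and without the dtype map to see whether the saving matters on your machine.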


In [12]:

# === 2A) Simple Sample Load (fast & easy) ===
SAMPLE_N = 200_000  # tweak if you have more/less RAM
df = pd.read_csv(DATA_PATH, nrows=SAMPLE_N)
print(df.shape)
df.head()


(200000, 11)


|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [None]:

# === 2B) Chunked Random Sampling (across entire file) ===
# Uncomment to use. This samples approximately a fraction (SAMPLE_FRAC) of rows from every chunk.
# The sample is random (not stratified by isFraud), but it spans the whole file,
# so it is more representative than reading only the first rows.

# SAMPLE_FRAC = 0.05  # 5% of 6.36M ~ 318k rows
# reader = pd.read_csv(DATA_PATH, chunksize=200_000)
# chunks = []
# for i, ch in enumerate(reader):
#     # Vary the seed per chunk so the same row positions are not picked in every chunk.
#     ch_sample = ch.sample(frac=SAMPLE_FRAC, random_state=RANDOM_STATE + i)
#     chunks.append(ch_sample)
# df = pd.concat(chunks, ignore_index=True)
# print(df.shape)
# df.head()


In [None]:

# === 2C) Full Load (ONLY if you have lots of RAM) ===
# df = pd.read_csv(DATA_PATH)
# print(df.shape)
# df.head()



## 3) Quick EDA & Data Quality Checks
We’ll explore data types, class balance, missing values, and basic distributions.


In [14]:

df.info()
display(df.head(10))
print("\nClass balance:")
print(df['isFraud'].value_counts())
print("\nTransaction types:")
print(df['type'].value_counts())

print("\nMissing values per column:")
print(df.isna().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   step            200000 non-null  int64  
 1   type            200000 non-null  object 
 2   amount          200000 non-null  float64
 3   nameOrig        200000 non-null  object 
 4   oldbalanceOrg   200000 non-null  float64
 5   newbalanceOrig  200000 non-null  float64
 6   nameDest        200000 non-null  object 
 7   oldbalanceDest  200000 non-null  float64
 8   newbalanceDest  200000 non-null  float64
 9   isFraud         200000 non-null  int64  
 10  isFlaggedFraud  200000 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 16.8+ MB


|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
| 5 | 1 | PAYMENT | 7817.71 | C90045638 | 53860.0 | 46042.29 | M573487274 | 0.0 | 0.0 | 0 | 0 |
| 6 | 1 | PAYMENT | 7107.77 | C154988899 | 183195.0 | 176087.23 | M408069119 | 0.0 | 0.0 | 0 | 0 |
| 7 | 1 | PAYMENT | 7861.64 | C1912850431 | 176087.23 | 168225.59 | M633326333 | 0.0 | 0.0 | 0 | 0 |
| 8 | 1 | PAYMENT | 4024.36 | C1265012928 | 2671.0 | 0.0 | M1176932104 | 0.0 | 0.0 | 0 | 0 |
| 9 | 1 | DEBIT | 5337.77 | C712410124 | 41720.0 | 36382.23 | C195600860 | 41898.0 | 40348.79 | 0 | 0 |



Class balance:
isFraud
0    199853
1       147
Name: count, dtype: int64

Transaction types:
type
PAYMENT     73427
CASH_OUT    66488
CASH_IN     41579
TRANSFER    16836
DEBIT        1670
Name: count, dtype: int64

Missing values per column:
step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64
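The class balance and the type counts are printed separately above, but the more useful view is usually the fraud rate per transaction type. The sketch below shows one way to compute it; the `fraud_rate_by_type` helper and the six-row `toy` frame are illustrative assumptions, not results from the dataset.

```python
import pandas as pd


def fraud_rate_by_type(frame):
    """Row count, fraud count, and fraud rate for each transaction type."""
    return (
        frame.groupby("type")["isFraud"]
        .agg(n="count", frauds="sum", fraud_rate="mean")
        .sort_values("fraud_rate", ascending=False)
    )


# Toy data for illustration only (NOT taken from the real file).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "TRANSFER", "DEBIT"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})
by_type = fraud_rate_by_type(toy)
print(by_type)
```

On the real data, call `fraud_rate_by_type(df)`. If fraud turns out to be concentrated in a few types, that supports the type-based indicator features built in the next section.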



## 4) Business-Driven Feature Engineering

From the data dictionary:

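- `step`: hour in the simulation timeline (1 step = 1 hour, up to 744).
- `type`: `CASH_IN`, `CASH_OUT`, `DEBIT`, `PAYMENT`, `TRANSFER` (spelled as they appear in the data).
- `amount`, `oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`: transaction amount and the origin/destination balances before and after the transaction.
- `nameOrig`, `nameDest`: origin and destination account IDs; destination IDs starting with `M` are treated as merchants below.
- `isFraud`: target label (1 = fraud).
- `isFlaggedFraud`: rule-based flag for attempts to transfer more than 200,000 in a single transaction.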

We'll create features that capture **balance inconsistencies**, **behavioral signals**, and **type indicators**.


In [15]:

# === Feature Engineering ===
df_fe = df.copy()

# High-cardinality IDs usually don't generalize well; we drop them.
drop_cols = ['nameOrig', 'nameDest']
df_fe = df_fe.drop(columns=drop_cols)

# Balance change features
df_fe['deltaOrig'] = df_fe['oldbalanceOrg'] - df_fe['newbalanceOrig'] - df_fe['amount']
df_fe['deltaDest'] = df_fe['newbalanceDest'] - df_fe['oldbalanceDest'] - df_fe['amount']

# Ratios (guard against divide-by-zero)
df_fe['amt_over_oldOrg']  = df_fe['amount'] / (df_fe['oldbalanceOrg'].replace(0, np.nan))
df_fe['amt_over_oldDest'] = df_fe['amount'] / (df_fe['oldbalanceDest'].replace(0, np.nan))

# Indicators
df_fe['isMerchantDest'] = df['nameDest'].str.startswith('M').astype(int)
df_fe['isTransferOrCashOut'] = df['type'].isin(['TRANSFER','CASH_OUT']).astype(int)

# One-hot for 'type'
df_fe = pd.get_dummies(df_fe, columns=['type'], drop_first=True)

# Ratios are NaN where the denominator was 0; fill them with 0 (this treats an undefined ratio like a true ratio of 0)
df_fe = df_fe.fillna(0.0)

# Separate features and target
y = df_fe['isFraud'].astype(int)
X = df_fe.drop(columns=['isFraud'])
X.columns, X.shape


(Index(['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
        'newbalanceDest', 'isFlaggedFraud', 'deltaOrig', 'deltaDest',
        'amt_over_oldOrg', 'amt_over_oldDest', 'isMerchantDest',
        'isTransferOrCashOut', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT',
        'type_TRANSFER'],
       dtype='object'),
 (200000, 17))
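To make the balance features concrete, the sketch below applies the same formulas to three rows copied from the `df.head()` output (indices 0, 2, and 3). The `rows` frame is a hand-built stand-in for illustration; nothing here goes beyond the feature-engineering cell.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head() output (indices 0, 2, 3).
rows = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "oldbalanceOrg": [170136.0, 181.0, 181.0],
    "newbalanceOrig": [160296.36, 0.0, 0.0],
    "oldbalanceDest": [0.0, 0.0, 21182.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
    "isFraud": [0, 1, 1],
})

# Same formulas as the feature-engineering cell.
rows["deltaOrig"] = rows["oldbalanceOrg"] - rows["newbalanceOrig"] - rows["amount"]
rows["deltaDest"] = rows["newbalanceDest"] - rows["oldbalanceDest"] - rows["amount"]

# Ratio with a zero-denominator guard, then filled with 0 as in the notebook.
rows["amt_over_oldDest"] = (
    rows["amount"] / rows["oldbalanceDest"].replace(0, np.nan)
).fillna(0.0)

print(rows[["type", "isFraud", "deltaOrig", "deltaDest", "amt_over_oldDest"]].round(4))
```

In these three rows `deltaOrig` is zero (up to floating-point error), so the origin side balances. `deltaDest` is -9839.64, -181.0, and -21363.0, so the destination balance does not reflect the incoming amount. The legitimate merchant payment shows the same pattern because its destination balances are recorded as 0, which is why `deltaDest` alone should not be read as a fraud signal.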


## 5) Train/Validation Split

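To avoid **temporal leakage**, we split by time: earlier `step` values go to training, later ones to validation.
Because `step` is discrete, the 0.8 quantile cutoff can put more than 80% of rows in training (about 91% in the output below).
If you prefer a random split, switch to `train_test_split` with `stratify=y`.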
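For reference, a stratified random split could look like the sketch below. It runs on synthetic data (`X_demo`, `y_demo` are made up for illustration, with 2% positives) so that the effect of `stratify` is easy to check; with the notebook's data you would pass `X` and `y` instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 20 positives (2%).
rng = np.random.default_rng(42)
n_rows = 1000
X_demo = pd.DataFrame({
    "amount": rng.exponential(1000.0, n_rows),
    "step": rng.integers(1, 13, n_rows),
})
y_demo = pd.Series(np.r_[np.ones(20, dtype=int), np.zeros(980, dtype=int)])

# stratify=y keeps the positive rate the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("valid positives:", int(y_va.sum()), "of", len(y_va))
```

A random split mixes early and late steps, so it gives up the leakage protection of the time-based split in exchange for a validation set whose fraud rate matches training.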


In [16]:

# Time-based split (e.g., 80% earliest steps for train, remainder for validation)
cutoff = np.quantile(df_fe['step'], 0.8)
train_idx = df_fe['step'] <= cutoff
X_train, y_train = X[train_idx], y[train_idx]
X_val,   y_val   = X[~train_idx], y[~train_idx]

print("Cutoff step:", cutoff)
print("Train size:", X_train.shape, "Fraud rate:", y_train.mean())
print("Valid size:", X_val.shape, "Fraud rate:", y_val.mean())


Cutoff step: 12.0
Train size: (182111, 17) Fraud rate: 0.0007742530654381119
Valid size: (17889, 17) Fraud rate: 0.000335401643468053



## 6) Baselines: Logistic Regression and Random Forest

We’ll train a **Logistic Regression (with class_weight='balanced')** and a **Random Forest**.
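As a rough guide to what `class_weight='balanced'` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that arithmetic with NumPy. The `balanced_class_weights` helper is illustrative, and the label vector is rebuilt from the printed training split (182,111 rows; the printed fraud rate implies 141 fraud cases, which is a derived figure rather than a printed one).

```python
import numpy as np


def balanced_class_weights(y):
    """Weights per class following n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Labels shaped like the training split: 181,970 legitimate + 141 fraud rows
# (141 is derived from the printed fraud rate, so treat it as an assumption).
y_like_train = np.r_[np.zeros(181_970, dtype=int), np.ones(141, dtype=int)]
weights = balanced_class_weights(y_like_train)
print(weights)
```

Each fraud row therefore counts several hundred times more than a legitimate row in the loss. This raises recall, but it also inflates predicted probabilities, so treat the scores as rankings rather than calibrated fraud probabilities.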


In [17]:

num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Scale numeric features for logistic regression
ct = ColumnTransformer(
    transformers=[('scale', StandardScaler(with_mean=False), num_cols)],
    remainder='passthrough'
)

logreg = Pipeline(steps=[
    ('prep', ct),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', n_jobs=None, solver='lbfgs'))
])

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    n_jobs=-1,
    class_weight='balanced_subsample',
    random_state=RANDOM_STATE
)

models = {'LogReg': logreg, 'RandomForest': rf}

fitted = {}
for name, mdl in models.items():
    mdl.fit(X_train, y_train)
    fitted[name] = mdl
    print(f"{name} trained.")


LogReg trained.
RandomForest trained.



## 7) Evaluation: ROC‑AUC and PR‑AUC, Confusion Matrix at a Business Threshold

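Fraud is rare, so PR‑AUC is often more informative than ROC‑AUC.
We also pick a **business threshold** (e.g., the 0.90 quantile of predicted scores, which flags the top 10% as alerts) and report the confusion matrix at that cutoff.
Note that the validation split above holds only about 6 fraud cases (17,889 rows at a fraud rate of roughly 0.0335%), so every metric here is noisy on the 200k-row sample.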
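The quantile-based threshold can be read as an alert budget: flag the top share of scores and see what precision and recall that buys. The sketch below illustrates the idea; `alerts_at_budget` and the ten toy scores are hypothetical and not part of the notebook.

```python
import numpy as np


def alerts_at_budget(scores, y_true, alert_rate=0.10):
    """Flag roughly the top `alert_rate` share of scores and summarize the result."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true).astype(int)

    # Same rule as the notebook: threshold at the (1 - alert_rate) quantile.
    threshold = float(np.quantile(scores, 1.0 - alert_rate))
    flagged = scores >= threshold

    n_alerts = int(flagged.sum())
    true_pos = int((flagged & (y_true == 1)).sum())
    n_pos = int(y_true.sum())
    return {
        "threshold": threshold,
        "n_alerts": n_alerts,
        "precision": true_pos / n_alerts if n_alerts else 0.0,
        "recall": true_pos / n_pos if n_pos else 0.0,
    }


# Toy scores and labels for illustration only.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.95]
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
summary = alerts_at_budget(toy_scores, toy_labels, alert_rate=0.20)
print(summary)
```

With a 20% budget the two highest scores are flagged; one of them is fraud, and one fraud case is missed. When many rows share the same score (common with tree ensembles), `>=` can flag more rows than the budget allows, so check `n_alerts` against the capacity of the review team.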


In [None]:

def evaluate_model(name, mdl, X_val, y_val, threshold=None):
    # Get scores in [0, 1]; predict_proba is available on sklearn classifiers and pipelines
    if hasattr(mdl, "predict_proba"):
        proba = mdl.predict_proba(X_val)[:, 1]
    elif hasattr(mdl, "decision_function"):
        # Fallback: min-max scale decision scores (guard against a constant score)
        scores = mdl.decision_function(X_val)
        span = scores.max() - scores.min()
        proba = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores, dtype=float)
    else:
        # Last resort: hard 0/1 predictions
        proba = mdl.predict(X_val).astype(float)

    roc = roc_auc_score(y_val, proba)
    prc = average_precision_score(y_val, proba)
    print(f"Model: {name} | ROC-AUC: {roc:.4f} | PR-AUC: {prc:.4f}")

    # Plots
    fpr, tpr, _ = roc_curve(y_val, proba)
    plt.figure()
    plt.plot(fpr, tpr, label=f'{name} (AUC={roc:.3f})')
    plt.plot([0,1],[0,1], linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve — {name}')
    plt.legend()
    plt.show()

    prec, rec, thr = precision_recall_curve(y_val, proba)
    plt.figure()
    plt.plot(rec, prec, label=f'{name} (AP={prc:.3f})')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'Precision-Recall — {name}')
    plt.legend()
    plt.show()

    if threshold is None:
        threshold = np.quantile(proba, 0.90)  # top 10% as alerts (tune per budget)
    y_pred = (proba >= threshold).astype(int)
    print(f"Threshold used: {threshold:.5f}")
    print(classification_report(y_val, y_pred, digits=4))
    print("Confusion Matrix (val):\n", confusion_matrix(y_val, y_pred))

for n, m in fitted.items():
    evaluate_model(n, m, X_val, y_val, threshold=None)



## 8) Interpreting the Model: Feature Effects

- **Logistic Regression:** inspect coefficients (after scaling).
- **Random Forest:** feature importances.


In [20]:

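# Helper: recover feature names after the ColumnTransformer (scaled columns first, then passthrough).
# Note: it relies on the global `num_cols` and is not called below; it is kept for reference.
def get_feature_names(column_transformer, input_features):
    # Works for a simple ColumnTransformer with remainder='passthrough'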
    out = []
    for name, trans, cols in column_transformer.transformers_:
        if name == 'remainder' and trans == 'drop':
            continue
        if name == 'remainder' and trans == 'passthrough':
            # Identify non-specified columns
            passthrough_cols = [c for c in input_features if c not in num_cols]
            out.extend(passthrough_cols)
        else:
            if isinstance(cols, list):
                out.extend(cols)
            else:
                out.append(cols)
    return out

# LogReg coefficients
lr = fitted['LogReg']
if isinstance(lr, Pipeline):
    prep = lr.named_steps['prep']
    clf  = lr.named_steps['clf']
    # numeric features are scaled; remainder (categoricals already one-hot) passes through
    feature_names = num_cols + [c for c in X_train.columns if c not in num_cols]
    coefs = pd.Series(clf.coef_.ravel(), index=feature_names)
    print("Top positive signals (LogReg):")
    print(coefs.sort_values(ascending=False).head(15))
    print("\nTop negative signals (LogReg):")
    print(coefs.sort_values(ascending=True).head(15))

# RandomForest importances
rf = fitted['RandomForest']
rf_imps = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print("\nTop features (RandomForest):")
print(rf_imps.head(20))


Top positive signals (LogReg):
isTransferOrCashOut     5.030092
deltaOrig               4.341852
amount                  3.150452
type_TRANSFER           2.127176
type_CASH_OUT           0.349856
amt_over_oldDest        0.085514
isFlaggedFraud          0.000000
newbalanceDest         -0.112441
deltaDest              -0.176815
oldbalanceDest         -0.512374
step                   -0.928529
type_DEBIT             -1.440835
type_PAYMENT           -2.081550
isMerchantDest         -4.306886
oldbalanceOrg         -14.746285
dtype: float64

Top negative signals (LogReg):
amt_over_oldOrg    -23.955314
newbalanceOrig     -15.453153
oldbalanceOrg      -14.746285
isMerchantDest      -4.306886
type_PAYMENT        -2.081550
type_DEBIT          -1.440835
step                -0.928529
oldbalanceDest      -0.512374
deltaDest           -0.176815
newbalanceDest      -0.112441
isFlaggedFraud       0.000000
amt_over_oldDest     0.085514
type_CASH_OUT        0.349856
type_TRANSFER        2.127176
amount               3.150452
dtype: float64
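Logistic regression coefficients are easier to discuss as odds ratios. The sketch below exponentiates three coefficients copied from the LogReg output above. Because the numeric features pass through `StandardScaler(with_mean=False)`, each ratio is the multiplicative change in the odds of fraud per one standard deviation of the feature, holding the others fixed. This reading is an interpretation aid and assumes the scaled-feature setup of this notebook.

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the LogReg output above.
coefs = pd.Series({
    "isTransferOrCashOut": 5.030092,
    "deltaOrig": 4.341852,
    "step": -0.928529,
})

# exp(coef) = multiplicative change in the odds per one scaled unit of the feature.
odds_ratio = np.exp(coefs)
print(odds_ratio.sort_values(ascending=False))
```

Ratios above 1 push the odds of fraud up and ratios below 1 push them down. Because the model was fit with `class_weight='balanced'`, compare the ratios with each other rather than reading them as real-world odds.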


## 9) Answering the Business Questions (Template)

1. **Data cleaning:** checked for missing values (none in the 200k-row sample), dropped high-cardinality IDs (`nameOrig`, `nameDest`), engineered balance checks, and one-hot encoded `type`. A correlation check is not shown above; add one if you report it.
2. **Model description:** Logistic Regression (interpretable) and Random Forest (non-linear), both class-weighted for imbalance.
3. **Variable selection:** business intuition plus validation performance; redundant or highly correlated features can be pruned by checking correlations/VIF.
4. **Performance:** report ROC‑AUC, PR‑AUC, and the confusion matrix at the business threshold; adjust the threshold to the investigation budget.
5. **Key predictors:** in the LogReg output above, the strongest positive signals are `isTransferOrCashOut`, `deltaOrig`, `amount`, and `type_TRANSFER`; the strongest negative ones are `amt_over_oldOrg`, `newbalanceOrig`, `oldbalanceOrg`, and `isMerchantDest`. Compare these with the Random Forest importances.
6. **Do they make sense?** Yes: fraud rings often **transfer** and then **cash out**, leaving balance inconsistencies, and payments to merchants are rarely fraudulent in this data.
7. **Prevention while updating infra:** stricter velocity/risk rules, step-up authentication for TRANSFER/CASH_OUT, device/IP fingerprinting, anomaly scoring, and real-time holds for high‑risk patterns.
8. **Measuring impact:** track **fraud $ prevented**, **precision/recall**, **false positive rate**, **chargeback rate**, and **manual review SLA** before and after the change.

> Adapt this section with your actual numbers from Sections 6–8.
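For the before/after comparison in item 8, the core rates can be computed directly from confusion-matrix counts. The sketch below shows the arithmetic; `review_metrics` is an illustrative helper, and the counts are invented placeholders to be replaced with figures from your own monitoring.

```python
def review_metrics(tp, fp, fn, tn):
    """Precision, recall, and false positive rate from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# HYPOTHETICAL counts for illustration only; substitute your own numbers.
before = review_metrics(tp=40, fp=960, fn=60, tn=98_940)
after = review_metrics(tp=70, fp=630, fn=30, tn=99_270)

# Positive change is an improvement for precision/recall; negative is better for FPR.
change = {metric: after[metric] - before[metric] for metric in before}
for metric, diff in change.items():
    print(f"{metric}: {before[metric]:.4f} -> {after[metric]:.4f} ({diff:+.4f})")
```

Compare periods of similar length and transaction volume, and keep the alert threshold fixed within each period, so that a change in the rates reflects the new controls rather than a shift in traffic.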
