# Week 10 — Day 5: Error Analysis

### Imports and Load

In [1]:
import joblib
import numpy as np
import pandas as pd
from pathlib import Path

In [2]:
ARTIFACTS_DIR = Path("..") / "models"
REPORTS_DIR = Path("..") / "reports"
REPORTS_DIR.mkdir(exist_ok=True)

X_train, X_test, y_train, y_test = joblib.load(ARTIFACTS_DIR / "split_v1.joblib")
rf = joblib.load(ARTIFACTS_DIR / "rf_fe_v1.joblib")
threshold = joblib.load(ARTIFACTS_DIR / "rf_threshold_v1.joblib")  # should be 0.01

print("Loaded threshold:", threshold)
print("Test shape:", X_test.shape, y_test.shape)

Loaded threshold: 0.01
Test shape: (56962, 30) (56962,)


### Feature Engineering Function

In [3]:
def add_features(df):
    df = df.copy()
    df["log_amount"] = np.log1p(df["Amount"])
    df["hour"] = (df["Time"] // 3600).astype(int)
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df

In [4]:
X_test_fe = add_features(X_test)
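
To see what these features look like in practice, the sketch below applies the same transformations to three hypothetical rows. It assumes `Time` is measured in seconds, so `hour` counts elapsed hours and can exceed 23, while the sine/cosine pair wraps those values back onto a 24-hour cycle. The `toy` frame and its values are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def add_features(df):
    # Same transformations as the notebook's add_features
    df = df.copy()
    df["log_amount"] = np.log1p(df["Amount"])
    df["hour"] = (df["Time"] // 3600).astype(int)
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df

# Hypothetical rows: elapsed hour 3, elapsed hour 27, and a zero-amount row at Time 0
toy = pd.DataFrame({"Time": [10800.0, 97200.0, 0.0], "Amount": [1.0, 89.99, 0.0]})
toy_fe = add_features(toy)
print(toy_fe[["hour", "hour_sin", "hour_cos", "log_amount"]])
```

Rows with `hour` 3 and 27 receive the same `hour_sin`/`hour_cos` pair, but the raw `hour` column still distinguishes them, so a tree-based model can split on either representation.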

### Predict with tuned threshold

In [5]:
y_prob = rf.predict_proba(X_test_fe)[:, 1]
y_pred_tuned = (y_prob >= threshold).astype(int)

print("Predicted fraud count:", y_pred_tuned.sum())

Predicted fraud count: 491


### Build an “error analysis table”

In [6]:
analysis_df = X_test_fe.copy()
analysis_df["y_true"] = y_test.values if hasattr(y_test, "values") else y_test
analysis_df["y_prob"] = y_prob
analysis_df["y_pred"] = y_pred_tuned

analysis_df.head()

| index | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V27 | V28 | Amount | log_amount | hour | hour_sin | hour_cos | y_true | y_prob | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 263020 | 160760.0 | -0.674466 | 1.408105 | -1.110622 | -1.328366 | 1.388996 | -1.308439 | 1.885879 | -0.614233 | 0.311652 | ... | 0.533837 | 0.291319 | 23.0 | 3.178054 | 44 | -0.8660254 | 0.5 | 0 | 0.0 | 0 |
| 11378 | 19847.0 | -2.829816 | -2.765149 | 2.537793 | -1.07458 | 2.842559 | -2.153536 | -1.795519 | -0.25002 | 3.073504 | ... | 0.110802 | -0.511938 | 11.85 | 2.553344 | 5 | 0.9659258 | 0.258819 | 0 | 0.0 | 0 |
| 147283 | 88326.0 | -3.576495 | 2.318422 | 1.306985 | 3.263665 | 1.127818 | 2.865246 | 1.444125 | -0.718922 | 1.874046 | ... | 0.552411 | 0.509764 | 76.07 | 4.344714 | 24 | -2.449294e-16 | 1.0 | 0 | 0.003333 | 0 |
| 219439 | 141734.0 | 2.060386 | -0.015382 | -1.082544 | 0.386019 | -0.024331 | -1.074935 | 0.207792 | -0.33814 | 0.455091 | ... | -0.063621 | -0.060077 | 0.99 | 0.688135 | 39 | -0.7071068 | -0.707107 | 0 | 0.0 | 0 |
| 36939 | 38741.0 | 1.209965 | 1.384303 | -1.343531 | 1.763636 | 0.662351 | -2.113384 | 0.854039 | -0.475963 | -0.629658 | ... | 0.046884 | 0.104527 | 1.5 | 0.916291 | 10 | 0.5 | -0.866025 | 0 | 0.0 | 0 |


### Filtering FP and FN

In [7]:
false_positives = analysis_df[(analysis_df["y_true"] == 0) & (analysis_df["y_pred"] == 1)].copy()
false_negatives = analysis_df[(analysis_df["y_true"] == 1) & (analysis_df["y_pred"] == 0)].copy()

print("False Positives:", len(false_positives))
print("False Negatives:", len(false_negatives))

False Positives: 402
False Negatives: 9
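
The counts reported in this notebook are enough to recover recall and precision at the tuned threshold. The short calculation below is a sketch that hard-codes those printed numbers (491 flagged, 402 false positives, 9 false negatives, 98 fraud rows in the test set); the variable names are chosen here for illustration and do not touch the notebook's DataFrames.

```python
# Counts reported in this notebook (test set, threshold = 0.01)
n_flagged = 491  # rows with y_pred == 1
n_fp = 402       # false positives
n_fn = 9         # false negatives
n_fraud = 98     # "Fraud (all)" count in the Amount summary

true_positives = n_flagged - n_fp
# Cross-check: fraud rows minus missed fraud must give the same number
assert true_positives == n_fraud - n_fn

recall = true_positives / n_fraud
precision = true_positives / n_flagged
print(f"TP={true_positives}, recall={recall:.3f}, precision={precision:.3f}")
# TP=89, recall=0.908, precision=0.181
```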


In [8]:
# Inspect the highest-scoring false positives (top 15 by y_prob)
fp_top = false_positives.sort_values("y_prob", ascending=False).head(15)
fp_top[["y_prob", "Amount", "Time", "log_amount", "hour"]]

| index | y_prob | Amount | Time | log_amount | hour |
|---|---|---|---|---|---|
| 16110 | 0.97 | 1.0 | 27524.0 | 0.693147 | 7 |
| 14920 | 0.863333 | 1.0 | 26217.0 | 0.693147 | 7 |
| 190263 | 0.65 | 0.76 | 128759.0 | 0.565314 | 35 |
| 153398 | 0.49 | 0.0 | 98904.0 | 0.0 | 27 |
| 153457 | 0.336667 | 45.51 | 99129.0 | 3.839667 | 27 |
| 9643 | 0.263333 | 39.0 | 14446.0 | 3.688879 | 4 |
| 17592 | 0.26 | 89.99 | 28818.0 | 4.51075 | 8 |
| 8464 | 0.256667 | 1.0 | 11347.0 | 0.693147 | 3 |
| 19145 | 0.23 | 89.99 | 30047.0 | 4.51075 | 8 |
| 22640 | 0.226667 | 1.0 | 32354.0 | 0.693147 | 8 |


In [9]:
# Inspect the false negatives (missed fraud), highest y_prob first
fn_top = false_negatives.sort_values("y_prob", ascending=False).head(15)
fn_top[["y_prob", "Amount", "Time", "log_amount", "hour"]]

| index | y_prob | Amount | Time | log_amount | hour |
|---|---|---|---|---|---|
| 157585 | 0.003333 | 1.0 | 110087.0 | 0.693147 | 30 |
| 68067 | 0.003333 | 519.9 | 52814.0 | 6.255558 | 14 |
| 50537 | 0.0 | 1.0 | 44532.0 | 0.693147 | 12 |
| 623 | 0.0 | 529.0 | 472.0 | 6.272877 | 0 |
| 96341 | 0.0 | 98.01 | 65728.0 | 4.595221 | 18 |
| 219025 | 0.0 | 4.49 | 141565.0 | 1.702928 | 39 |
| 72757 | 0.0 | 1.79 | 54846.0 | 1.026042 | 15 |
| 119714 | 0.0 | 29.95 | 75556.0 | 3.432373 | 20 |
| 245347 | 0.0 | 2.47 | 152710.0 | 1.244155 | 42 |


### Compare Distributions

In [10]:
# compare amount stats
summary = pd.DataFrame({
    "Legit (all)": analysis_df[analysis_df["y_true"]==0]["Amount"].describe(),
    "Fraud (all)": analysis_df[analysis_df["y_true"]==1]["Amount"].describe(),
    "False Positives": false_positives["Amount"].describe(),
    "False Negatives": false_negatives["Amount"].describe(),
})
summary

| statistic | Legit (all) | Fraud (all) | False Positives | False Negatives |
|---|---|---|---|---|
| count | 56864.0 | 98.0 | 402.0 | 9.0 |
| mean | 89.009154 | 108.621735 | 184.780323 | 131.956667 |
| std | 247.72122 | 233.271878 | 490.670009 | 224.710475 |
| min | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 5.52 | 1.0 | 1.0 | 1.79 |
| 50% | 22.0 | 10.345 | 22.705 | 4.49 |
| 75% | 76.4575 | 99.99 | 133.3875 | 98.01 |
| max | 12910.93 | 1809.68 | 4500.0 | 529.0 |
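
An alternative to building one column per filtered frame is to label every row with its outcome (TN, FP, FN, TP) and group on that label, which also gives true positives and true negatives for comparison. The sketch below shows the idea on a hypothetical stand-in for `analysis_df`; the `toy` frame, its values, and the `outcome` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for analysis_df with only the columns needed here
toy = pd.DataFrame({
    "Amount": [5.0, 20.0, 300.0, 1.0, 99.0, 2.5],
    "y_true": [0, 0, 0, 1, 1, 1],
    "y_pred": [0, 1, 1, 1, 1, 0],
})

# One boolean mask per confusion-matrix cell, in the same order as the labels
conditions = [
    (toy["y_true"] == 0) & (toy["y_pred"] == 0),
    (toy["y_true"] == 0) & (toy["y_pred"] == 1),
    (toy["y_true"] == 1) & (toy["y_pred"] == 0),
    (toy["y_true"] == 1) & (toy["y_pred"] == 1),
]
toy["outcome"] = np.select(conditions, ["TN", "FP", "FN", "TP"], default="unknown")

# Amount statistics for each outcome in a single call
amount_by_outcome = toy.groupby("outcome")["Amount"].describe()
print(amount_by_outcome[["count", "mean", "50%", "max"]])
```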


In [11]:
# compare hour bucket counts
fp_hours = false_positives["hour"].value_counts().sort_index()
fn_hours = false_negatives["hour"].value_counts().sort_index()

fp_hours.head(), fn_hours.head()

(hour
 0    1
 1    4
 2    2
 3    2
 4    3
 Name: count, dtype: int64,
 hour
 0     1
 12    1
 14    1
 15    1
 18    1
 Name: count, dtype: int64)
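
Raw counts per hour bucket depend on how many transactions occurred in that hour, so a busy hour can show more false positives without being riskier. One way to account for this, sketched below on hypothetical data, is to divide the false-positive count by the number of legitimate transactions in the same bucket. The `toy` frame and the `per_hour` column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for analysis_df: hour bucket, true label, prediction
toy = pd.DataFrame({
    "hour":   [7, 7, 7, 7, 27, 27, 35, 35, 35, 35],
    "y_true": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "y_pred": [1, 0, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Among legitimate rows, y_pred == 1 marks a false positive,
# so the sum of y_pred per hour is the false-positive count
legit = toy[toy["y_true"] == 0]
per_hour = legit.groupby("hour")["y_pred"].agg(legit_count="size", fp_count="sum")
per_hour["fp_rate"] = per_hour["fp_count"] / per_hour["legit_count"]
print(per_hour.sort_values("fp_rate", ascending=False))
```

In this toy example each hour has exactly one false positive, yet the rates differ (0.25, 0.5 and about 0.33) because the legitimate volume differs.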

### Save Error Report

In [12]:
# Keep the row index so each error can be traced back to the original data
false_positives.to_csv(REPORTS_DIR / "rf_errors_false_positives.csv", index_label="row_id")
false_negatives.to_csv(REPORTS_DIR / "rf_errors_false_negatives.csv", index_label="row_id")

print("Saved error CSVs to reports/")

Saved error CSVs to reports/


## Error Analysis (Random Forest + Feature Engineering, tuned threshold = 0.01)

### Setup 
- **Model:** Random Forest + Feature Engineering (log_amount, hour, hour_sin/cos)
- **Decision rule:** Tuned probability threshold = **0.01**
- **Goal of tuning:** Increase **Recall** (catch more fraud)

We analysed:
- **False Positives (FP):** legitimate transactions flagged as fraud
- **False Negatives (FN):** fraud cases that the model missed

---

### 1) False Positives (why legitimate transactions were flagged as fraud)
Ranking the false positives by the model’s fraud probability (`y_prob`) revealed three patterns:

**A) “Fraud-like” transactions even with small amounts**
- The four highest-scoring false positives all had amounts of 1.00 or less (1.00, 1.00, 0.76 and 0.00), yet their `y_prob` values ranged from 0.49 to **0.97**.
- This suggests the model is not relying on Amount alone: it is responding to combinations of features that look fraud-like.

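**B) Time-window effect (hour buckets appear in top false positives)**
- Top false positives clustered in certain **hour buckets** such as **7–8**, **27**, **32–35**, **41**, and even **46**. Because `hour` is computed as `Time // 3600`, it counts elapsed hours rather than hour of day, so values above 23 belong to the second 24-hour period (for example, 27 sits at the same point in the daily cycle as 3).
- This suggests that some time windows are associated with fraud in the training data, so legitimate transactions in those windows can be misclassified. The evidence is limited to the top 15 rows, and raw counts per bucket also reflect how many transactions occurred in that hour.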

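**C) Extreme/rare-looking transactions are more likely to be flagged**
- One large-amount false positive (Amount ≈ 1059.28) received a fraud probability of about 0.19, well below the top scores but far above the 0.01 threshold.
- The Amount summary points the same way: false positives have a higher mean (184.78 vs 89.01) and 75th percentile (133.39 vs 76.46) than legitimate transactions overall, although the medians are nearly identical (22.705 vs 22.0). High amounts or unusual patterns can trigger false positives because they resemble rare fraud behaviour.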

**Summary (FP):**
- At a threshold of 0.01, any transaction scoring 0.01 or higher is flagged, so many borderline cases are swept in: 402 of the 491 flagged test transactions are false positives.
- False positives often look “fraud-like” due to **feature combinations** and/or **time-window patterns**, not just Amount.
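
The trade-off described in the summary can be illustrated with a small threshold sweep. The scores and labels below are hypothetical and are not the notebook's data, and `confusion_at` is a helper name introduced only for this sketch. It applies the same decision rule as the notebook (`y_prob >= threshold`) at several thresholds and counts the resulting errors.

```python
import numpy as np

# Hypothetical labels and scores, not the notebook's data
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.0, 0.0, 0.003, 0.02, 0.15, 0.6, 0.0, 0.008, 0.4, 0.9])

def confusion_at(threshold):
    """Return (false positives, false negatives) for the rule y_prob >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return fp, fn

for t in (0.5, 0.1, 0.01, 0.005):
    fp, fn = confusion_at(t)
    print(f"threshold={t}: FP={fp}, FN={fn}")
```

As the threshold drops, false negatives fall while false positives rise; the fraud row with a score of exactly 0.0 stays missed at every positive threshold.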

---

### 2) False Negatives (why some fraud was missed)
All nine missed fraud cases (false negatives) received **extremely low** fraud probabilities: seven scored exactly **0.0** and the other two scored about **0.0033**, below the 0.01 threshold. Two patterns stand out:

**A) These fraud cases look normal to the model**
- Some missed fraud transactions had **ordinary-looking amounts** (e.g., Amount = 1.00, 2.47, 4.49, 29.95).
- Even the two largest amounts (519.90 and 529.00) received near-zero probabilities, suggesting their other feature values looked legitimate.

**B) Missed fraud appears across many time windows**
- The nine false negatives fell in nine different hour buckets (0, 12, 14, 15, 18, 20, 30, 39, 42).
- No single “time bucket” explains the missed fraud; feature overlap with legitimate transactions is the more likely cause.

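**Summary (FN):**
- The remaining missed fraud cases appear **hard to separate** from legitimate behaviour using the current features.
- Lowering the threshold further would recover at most the two cases scoring about 0.0033; the seven cases with `y_prob` = 0.0 cannot be flagged by any positive threshold.
- This indicates **feature overlap** between fraud and legit, and/or fraud patterns not captured well by our engineered features.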
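
One way to probe the feature-overlap explanation is to compare missed fraud with caught fraud feature by feature. The sketch below does this on hypothetical rows by scaling the gap between group means by the spread across all fraud rows. The `fraud` frame, its values, and the choice of this particular gap measure are assumptions for illustration, and with only nine false negatives in the real test set any such comparison would be noisy.

```python
import numpy as np
import pandas as pd

# Hypothetical fraud rows: two anonymised features plus the model's decision
fraud = pd.DataFrame({
    "V1": [-4.0, -3.5, -5.0, 1.2, 0.8],
    "V2": [2.0, 2.5, 1.5, 2.2, 1.8],
    "y_pred": [1, 1, 1, 0, 0],
})
features = ["V1", "V2"]
caught = fraud.loc[fraud["y_pred"] == 1, features]
missed = fraud.loc[fraud["y_pred"] == 0, features]

# Gap between group means, scaled by the standard deviation over all fraud rows
gap = (missed.mean() - caught.mean()) / fraud[features].std()
print(gap.sort_values(key=np.abs, ascending=False))
```

Features with a large absolute gap are those on which missed fraud differs most from caught fraud; in this toy example that is `V1`, while `V2` shows no gap.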

---

### 3) Key Notes
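- **Threshold tuning improved recall**, but introduced more false positives as expected: at threshold 0.01 the model catches 89 of the 98 fraud cases in the test set (recall ≈ 0.91) while flagging 402 legitimate transactions (precision ≈ 0.18).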
- **False positives** often occur when legit transactions share similar feature patterns with fraud (including certain time windows).
- **False negatives** tend to be fraud cases that look “normal” to the model (very low predicted probability), indicating difficult or subtle fraud patterns.