# Outlier Detection: From IQR and Z-scores to K-means Distance

**Goal:** learn quick, explainable methods to flag outliers, and a clustering-based approach for messy cases.

## 1. What’s an Outlier?
- A data point that is implausible given the bulk of the distribution or business rules.
- Treat "outlier" as a *hypothesis* to investigate, not a truth.

## 2. Visual First
- Box plots and histograms to see tails
- Scatter to check suspicious clusters or isolated points

> Exercise: plot hist + box for 2–3 columns, and mark the suspected outliers.

## 3. IQR Rule (distribution-agnostic)
- Compute Q1, Q3, IQR = Q3 - Q1
- Fence: `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` (raising the multiplier from 1.5 toward 3 widens the fence and flags fewer points)
- Flag values outside the fence

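**Pros:** no normality assumption; quartiles barely move when a few values are extreme. **Cons:** the fence is symmetric, so strongly skewed data gets over-flagged on the long tail; univariate, so applying it per feature still misses points that are unusual only in combination.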
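
The rule above can be sketched in a few lines of NumPy. The function name `iqr_outliers` and the sample values are made up for illustration; `np.percentile` uses linear interpolation by default, and other quartile conventions can shift the fence slightly.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fence."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])  # default: linear interpolation
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

# Made-up sample: nine ordinary readings and one suspicious one.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 45]
mask = iqr_outliers(data)
print(np.asarray(data)[mask])  # -> [45]
```

Here Q1 = 11 and Q3 = 12.75, so the fence is `[8.375, 15.375]` and only 45 falls outside it.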

## 4. Z-score (parametric)
- `z = (x - mean)/std` and flag `|z| > k` (common k = 3)

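**Pros:** simple, good when the data are roughly normal. **Cons:** the mean and std are themselves pulled by outliers, which shrinks the z-scores of the very points you want to catch (masking).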
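
A minimal sketch of the z-score rule, using the sample standard deviation (`ddof=1`); the function name and the data are illustrative assumptions. The sample is chosen to show the weakness listed under Cons.

```python
import numpy as np

def zscore_outliers(values, k=3.0):
    """Return (mask, z): mask is True where |z| exceeds k."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # sample standard deviation
    return np.abs(z) > k, z

# Made-up sample: nine ordinary readings and one suspicious one.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 45]
mask, z = zscore_outliers(data)
print(round(float(z[-1]), 2))  # z-score of the value 45 -> 2.83
print(int(mask.sum()))         # number of flagged points -> 0
```

The value 45 inflates the standard deviation enough to keep its own z-score under 3, so nothing is flagged. With the sample standard deviation, `|z|` can never exceed `(n - 1)/sqrt(n)`, which means k = 3 cannot flag anything in a sample of 10 or fewer points.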

## 5. Robust Z via Median and MAD
- `mad = median(|x - median(x)|)` and `z_robust = 0.6745*(x - median(x))/mad` (0.6745 rescales MAD so that `z_robust` is comparable to an ordinary z-score on normal data)
- Flag `|z_robust| > k` (k ≈ 3.5 common)

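**Pros:** median and MAD barely move when a few values are extreme, so the score holds up under heavy tails. **Cons:** like IQR, univariate; undefined when MAD is 0, which happens when more than half the values are identical.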
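
The same idea as a short NumPy sketch. The function name and the sample are illustrative, and raising an error when MAD is 0 is one possible choice, not the only reasonable one.

```python
import numpy as np

def robust_z_outliers(values, k=3.5):
    """Return (mask, z_robust) based on the median and MAD."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # Assumption: fail loudly instead of dividing by zero.
        raise ValueError("MAD is zero; robust z is undefined for this sample")
    z_robust = 0.6745 * (x - med) / mad
    return np.abs(z_robust) > k, z_robust

# Made-up sample: nine ordinary readings and one suspicious one.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 45]
mask, z_robust = robust_z_outliers(data)
print(round(float(z_robust[-1]), 2))  # robust z of the value 45 -> 22.26
print(np.asarray(data)[mask])         # -> [45]
```

The median is 12 and the MAD is 1, so the value 45 scores 0.6745 * 33 ≈ 22.26 while every other point stays below 1.35.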

## 6. Multivariate Angle: K-means Distance
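- Scale the features so no single one dominates the distance, then fit K-means and compute each point's distance to its assigned centroid
- Points whose distance falls in the extreme tail (for example, the top 1%) are outlier candidates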
- Choose K via the elbow method on SSE (sum of squared errors)
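
The steps above can be sketched with scikit-learn. This assumes `scikit-learn` is installed; the synthetic data, the planted points, and `chosen_k = 2` are assumptions made for illustration, and in practice K would be read off the SSE curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: two blobs plus three planted far-away points (rows 300-302).
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(150, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(150, 2))
planted = np.array([[10.0, -4.0], [-5.0, 9.0], [12.0, 12.0]])
X = np.vstack([blob_a, blob_b, planted])

# Put features on a common scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

# SSE for a range of K; scikit-learn exposes it as `inertia_`.
sse = {}
for k in range(1, 7):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
for k, value in sse.items():
    print(k, round(value, 1))

chosen_k = 2  # assumption: picked by eye from the SSE values above
model = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit(X_scaled)

# Distance of each point to the centroid of its assigned cluster.
dist = np.linalg.norm(X_scaled - model.cluster_centers_[model.labels_], axis=1)

# Flag the top 1% farthest points as candidates for review.
threshold = np.quantile(dist, 0.99)
flagged = np.where(dist > threshold)[0]
print(flagged)
```

A percentile cutoff always flags about 1% of the rows, whether or not they are real anomalies, so the flagged rows are a review queue and not a verdict.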

> Exercise: fit K-means for a range of K values, plot SSE vs K, and pick the elbow; then flag the 1% of points farthest from their centroids.

## 7. Practical Playbook
- Start univariate (IQR/robust-z) to catch obvious issues
- Move to multivariate (distance in embedding space, clustering) when points look normal feature by feature but unusual in combination
- Always review a sample of flagged points; never auto-drop without context
- After cleaning, re-check distribution and downstream metrics (training stability, convergence)

## 8. Reporting
- Summarize: how many flagged, by which rule, percent of data removed/edited
- Keep a reversible log of changes (row ids, old value → new value, reason)
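
One possible shape for such a log, sketched with the standard library only. The `Change` record, the helper names, and the sample rows are hypothetical; a real pipeline would more likely persist the log to a file or table.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Change:
    row_id: int
    column: str
    old_value: Optional[float]
    new_value: Optional[float]
    reason: str

def apply_change(rows, log, row_id, column, new_value, reason):
    """Edit one cell and record the old value so the edit can be undone."""
    old_value = rows[row_id][column]
    rows[row_id][column] = new_value
    log.append(Change(row_id, column, old_value, new_value, reason))

def revert_all(rows, log):
    """Undo logged changes in reverse order, restoring the original values."""
    while log:
        change = log.pop()
        rows[change.row_id][change.column] = change.old_value

# Made-up rows keyed by row id.
rows = {101: {"amount": 12.0}, 102: {"amount": 4500.0}}
log = []
apply_change(rows, log, 102, "amount", None, "outside IQR fence (k=1.5)")

edited_rows = {change.row_id for change in log}
summary = f"{len(edited_rows)} of {len(rows)} rows edited ({100 * len(edited_rows) / len(rows):.0f}%)"
print(summary)  # -> 1 of 2 rows edited (50%)

revert_all(rows, log)  # the original value 4500.0 is back in place
```

Storing the old value next to the new one is what makes the log reversible; the reason field feeds the "by which rule" part of the summary.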

**Takeaways**
- Use simple, explainable rules first.
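- Robust statistics (quartiles, median, MAD) keep extreme values from distorting the thresholds meant to catch them.
- Distance to the assigned K-means centroid extends the idea to multi-feature data, where a point can be unusual only in combination.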