# Outlier Detection with Cleaned Dataset
We’ll use **`cleaned_dataset.csv`**, which has intentional outliers in numeric columns:

- `yearly_income` (called income below): extremely high values planted as outliers, plus some negative values
- `score`: corrupted values (negative or >100)

Covered:
- **Rule-based detection** (flag impossible values such as negative scores)
- **Visual inspection** (histogram and scatter plot)
- **Univariate detection** (IQR, z-scores, robust z-scores) on income (`yearly_income`) and `score`

In [None]:
import os, sys, glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Set paths
ROOT_ = os.getcwd()
DATA_  = os.path.join(ROOT_, "data")

# Obtain all CSV file paths in the data directory
csv_paths = glob.glob(os.path.join(DATA_, "*.csv*"))
assert len(csv_paths) > 0, f"No CSV files found in {DATA_}"

## 1. Rule-based Detection (Score sanity checks)
Some columns have *hard logical limits* rather than distribution-based limits.

**Example:**
- `score` should be between 0 and 100 (like a percentage/grade scale)
- Any value outside this range is automatically an outlier

This shows how domain rules complement statistical detection.


In [None]:
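# Load the cleaned dataset produced in pandas_basics
data_name = "cleaned"
# Match on the file name only, so a directory called "cleaned" cannot produce a false hit
matches = [p for p in csv_paths if data_name in os.path.basename(p)]
assert matches, f"No CSV file containing '{data_name}' found in {DATA_}"
df_cleaned_dataset = pd.read_csv(matches[0])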

In [None]:
# Rule-based detection: `score` must lie in [0, 100], `yearly_income` must not be negative
bad_scores = df_cleaned_dataset[(df_cleaned_dataset['score'] < 0) | (df_cleaned_dataset['score'] > 100)]
bad_income = df_cleaned_dataset[df_cleaned_dataset['yearly_income'] < 0]

if not bad_income.empty:
    print("Out-of-range values detected for income:")
    print(bad_income[['yearly_income']].head())

if not bad_scores.empty:
    print("\nOut-of-range values detected for score:")
    print(bad_scores[['score']].head())
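
The same range check can also be written with `Series.between`. The sketch below uses a small made-up Series rather than the real dataset, and the name `bad_scores_demo` is only illustrative. Note that `between` returns `False` for missing values, so negating it alone would also flag NaNs, which the comparison-based mask in the cell above does not.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only (not taken from cleaned_dataset.csv)
scores = pd.Series([88, -5, 101, 100, 0, np.nan], name="score")

# True where the value lies in the closed interval [0, 100]; NaN yields False
in_range = scores.between(0, 100)

# Keep only present values that break the rule, so NaNs are not reported as out of range
bad_scores_demo = scores[~in_range & scores.notna()]
print(bad_scores_demo)
```

Whether missing values should count as rule violations is a separate decision; here they are left for a dedicated missing-data check.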

## 2. Visual Observations
- Box plots and histograms to see tails
- Scatter to check suspicious clusters or isolated points

In [None]:
# Print the value counts for a specific column
column = 'score'
value_counts = df_cleaned_dataset[column].value_counts(dropna=False)
print(f"\nValue counts in '{column}' column:\n{value_counts}")
print('-' * 50)

# Visualize the distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram with 20 bins
ax1.hist(df_cleaned_dataset[column], bins=20, color='skyblue', edgecolor='black', alpha=0.8)
ax1.set_xlabel(column.capitalize())
ax1.set_ylabel('Frequency')
ax1.set_title(f'Histogram of {column.capitalize()}')

# Scatter plot
ax2.scatter(df_cleaned_dataset.index, df_cleaned_dataset[column], alpha=0.7, color='skyblue', edgecolor='black')
ax2.set_xlabel('Index')
ax2.set_ylabel(column.capitalize())
ax2.set_title(f'Scatter Plot of {column.capitalize()}')

plt.tight_layout()
plt.show()
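
The cell above draws a histogram and a scatter plot; the box plot mentioned in the list is not shown. A minimal sketch of one follows, using synthetic values as a stand-in for a numeric column (the data and the two planted extremes are assumptions, not values from the dataset).

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for a numeric column, with two planted extremes (assumed data)
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(70, 10, 200), [-15.0, 130.0]])

fig, ax = plt.subplots(figsize=(4, 5))
# By default the whiskers reach at most 1.5*IQR past the quartiles;
# points beyond the whiskers are drawn individually as "fliers"
box = ax.boxplot(values)
ax.set_ylabel("Value")
ax.set_title("Box plot of a synthetic column")
plt.tight_layout()
plt.show()

# The flier positions can be read back from the returned artists
fliers = box["fliers"][0].get_ydata()
print(fliers)
```

Because the default whiskers use the same 1.5*IQR convention as the IQR rule, the fliers give a quick visual preview of what that rule would flag.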

## 3. Interquartile Range (IQR) Rule
---
**Example column:** `yearly_income`

- Compute Q1, Q3, and IQR = Q3 - Q1
- Define fences: `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`
- Flag values outside the fences as potential outliers

We’ll apply this to `df_cleaned_dataset['yearly_income']` to highlight the planted extreme values.

In [None]:
# Obtain column values
column = df_cleaned_dataset['yearly_income'].to_numpy()

quartiles = np.percentile(column, [25, 75])
Q1, Q3 = quartiles
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"IQR Method: Lower Bound = {lower_bound}, Upper Bound = {upper_bound}")
print('-' * 50)

outliers_iqr = df_cleaned_dataset[(column < lower_bound) | (column > upper_bound)]
print(f"IQR Method: Detected {len(outliers_iqr)} outliers in column.\n")
print(outliers_iqr.head())
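
As a hedged variant of the cell above, the fence computation can be wrapped in a small helper. The function name `iqr_fences`, its `k` parameter, and the toy `incomes` array are all hypothetical; the helper uses `np.nanpercentile` so that missing values, if any, do not turn the quartiles into NaN.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences; NaNs are ignored when computing quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data for illustration only: seven ordinary values and one extreme
incomes = np.array([10, 12, 13, 15, 18, 20, 22, 500], dtype=float)
lower, upper = iqr_fences(incomes)
flagged = incomes[(incomes < lower) | (incomes > upper)]
print(lower, upper, flagged)
```

For this toy array Q1 = 12.75 and Q3 = 20.5, so the fences are 1.125 and 32.125 and only the value 500 is flagged.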

## 4. Z-score
---

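- Compute z = (x - mean)/std
- Flag values with |z| > threshold (3 is a common default; the cell below uses 1.75)
- Caveat: the mean and std are themselves pulled by extreme values, which can hide outliers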


In [None]:
# Obtain column values
column = df_cleaned_dataset['score'].to_numpy()

# Calculate Z-Score for each value in column
mean = np.mean(column)
std_dev = np.std(column)
z_scores = (column - mean) / std_dev

# If the absolute value of Z-score > threshold, it's an outlier
threshold = 1.75
outliers_zscore = df_cleaned_dataset[np.abs(z_scores) > threshold]
print(f"\nZ-Score Method: Detected {len(outliers_zscore)} outliers in column.")
print('-' * 50)
print(outliers_zscore)
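
The choice of threshold matters because an extreme value inflates the standard deviation it is measured against. The sketch below illustrates this on toy data; the helper `zscore_flags`, its parameters, and the `scores` array are hypothetical and not part of the notebook.

```python
import numpy as np

def zscore_flags(values, threshold=3.0, ddof=0):
    """Return a boolean mask of values whose |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    std = np.nanstd(values, ddof=ddof)
    if std == 0:
        # All values identical: nothing can be flagged
        return np.zeros(values.shape, dtype=bool)
    z = (values - mean) / std
    return np.abs(z) > threshold

# Toy data for illustration only: seven similar values and one extreme
scores = np.array([50, 52, 49, 51, 50, 48, 53, 500], dtype=float)
flags_at_3 = zscore_flags(scores, threshold=3.0)
flags_at_175 = zscore_flags(scores, threshold=1.75)
print(flags_at_3.sum(), flags_at_175.sum())
```

With only eight points, the single extreme value drags the mean and std so far that its own z-score stays below 3, so the stricter threshold flags nothing while 1.75 catches it.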

## 5. Robust Z via Median and MAD
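- `mad = median(|x - median(x)|)` and `z_robust = 0.6745 * (x - median(x)) / mad`
- The factor 0.6745 rescales the MAD so that `z_robust` is comparable to an ordinary z-score when the data are normally distributed
- Flag `|z_robust| > k`
- More reliable than the ordinary z-score on skewed or contaminated data, because the median and MAD are barely moved by extreme values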

In [None]:
# Obtain column values
column = df_cleaned_dataset['score'].to_numpy()

# Calculate the robust z-score for each value in column
median = np.median(column)
mad = np.median(np.abs(column - median))
z_robust = 0.6745 * (column - median) / mad

# If the absolute value of z_robust > threshold, it's an outlier
threshold = 1.75
outliers_z_robust = df_cleaned_dataset[np.abs(z_robust) > threshold]
print(f"\nRobust Z-Score Method: Detected {len(outliers_z_robust)} outliers in column.\n")
print(outliers_z_robust)
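
To see why the median/MAD version is less affected by the outliers it is trying to find, the sketch below compares both scores on the same toy data. The helper `robust_z`, the `scores` array, and the cutoff of 3.5 (a commonly cited choice for this statistic) are assumptions for illustration. Returning NaN when the MAD is zero is also an assumed convention, chosen so that nothing is flagged in that case.

```python
import numpy as np

def robust_z(values):
    """Median/MAD-based z-score; returns NaNs if the MAD is zero."""
    values = np.asarray(values, dtype=float)
    median = np.nanmedian(values)
    mad = np.nanmedian(np.abs(values - median))
    if mad == 0:
        # More than half the values equal the median: the score is undefined
        return np.full(values.shape, np.nan)
    return 0.6745 * (values - median) / mad

# Toy data for illustration only: seven similar values and one extreme
scores = np.array([50, 52, 49, 51, 50, 48, 53, 500], dtype=float)
z_plain = (scores - scores.mean()) / scores.std()
z_rob = robust_z(scores)

print(np.abs(z_plain).max())   # stays below 3 because the extreme inflates the std
print(np.abs(z_rob).max())     # very large, since median and MAD ignore the extreme
flagged = scores[np.abs(z_rob) > 3.5]
print(flagged)
```

On this data the ordinary z-score never exceeds 3, while the robust score of the extreme value is in the hundreds and the remaining values stay close to 1 in absolute value.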