# **Local Outlier Factor (LOF) - Anomaly Detection**
# **CMPT 459 Course Project**

This notebook demonstrates **Local Outlier Factor** for outlier detection:
* Preprocessing pipeline
* LOF algorithm
* **2D scatter plot visualization** using PCA
* LOF score distribution
* Interpretation of results

**Reference**: `outlier_detection.py`


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

## **1. Data Preprocessing**


In [2]:
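def load_and_preprocess(path):
    df = pd.read_csv(path)
    print(f"Original shape: {df.shape}")
    # The raw file marks missing values with "?"
    df = df.replace("?", np.nan)
    # Keep only columns with at least 50% non-missing values
    threshold = int(0.5 * len(df))
    df = df.dropna(thresh=threshold, axis=1)
    # Drop the ID columns and the `readmitted` label so they do not influence distances
    df = df.drop(columns=["encounter_id", "patient_nbr", "readmitted"], errors="ignore")
    cat_cols = df.select_dtypes(include="object").columns
    for col in cat_cols:
        if df[col].nunique() > 10:
            # High-cardinality categoricals: one-hot encode (this inflates the column count)
            dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
            df = pd.concat([df.drop(columns=[col]), dummies], axis=1)
        else:
            # Low-cardinality categoricals: integer codes, with missing values as "Unknown"
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].fillna("Unknown"))
    # Impute remaining missing values with the column median
    df = df.fillna(df.median())
    # Standardize so that no single feature dominates the distance computation
    scaler = StandardScaler()
    X = scaler.fit_transform(df)
    print(f"Final shape: {X.shape}")
    return X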

X = load_and_preprocess("data/diabetic_data.csv")

Original shape: (101766, 50)
Final shape: (101766, 2376)


## **2. Local Outlier Factor Model**

**How it works**:
* Compares local density of each point to its neighbors
* Points in sparse regions are flagged as outliers
* LOF ≈ 1 indicates density similar to the neighbors (normal); LOF well above 1 indicates an outlier
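
As a rough illustration of the points above, the sketch below fits LOF on made-up toy data (a tight 2D cluster plus one isolated point; `X_toy` and the other names are hypothetical and not part of the project pipeline). It flips the sign of scikit-learn's `negative_outlier_factor_` to recover the LOF values themselves.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical toy data: a tight 2D cluster plus one isolated point
cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
isolated = np.array([[6.0, 6.0]])
X_toy = np.vstack([cluster, isolated])

lof_toy = LocalOutlierFactor(n_neighbors=20)
lof_toy.fit(X_toy)

# scikit-learn stores the negated LOF; flip the sign to recover LOF itself
lof_values = -lof_toy.negative_outlier_factor_

print(f"Median LOF inside the cluster: {np.median(lof_values[:-1]):.2f}")
print(f"LOF of the isolated point:     {lof_values[-1]:.2f}")
```

Points inside the cluster should score close to 1, while the isolated point should score far above 1.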

**Note**: LOF can be slow on large datasets, so the next cell fits on a random sample of at most `max_samples` rows when the dataset is larger.


In [None]:
n_neighbors = 20
contamination = 0.01
max_samples = 10000

# Sample if dataset is large
if len(X) > max_samples:
    print(f"⚠ Sampling {max_samples} points for LOF (dataset has {len(X)} points)")
    np.random.seed(42)
    indices = np.random.choice(len(X), max_samples, replace=False)
    X_sample = X[indices]
else:
    X_sample = X
    indices = np.arange(len(X))

print(f"\nFitting LOF (n_neighbors={n_neighbors}, contamination={contamination})...")
lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination, n_jobs=-1)
predictions = lof.fit_predict(X_sample)
scores = lof.negative_outlier_factor_

outlier_indices = np.where(predictions == -1)[0]
n_outliers = len(outlier_indices)

print(f"\n✓ Detected {n_outliers} outliers ({n_outliers/len(X_sample)*100:.2f}%)")
print(f"Score range: [{scores.min():.4f}, {scores.max():.4f}]")
print("\nInterpretation: More negative scores = stronger outlier signals")

⚠ Sampling 10000 points for LOF (dataset has 101766 points)


## **3. Visualization with PCA**


In [None]:
print("Applying PCA for visualization...")
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_sample)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Outliers highlighted
inlier_mask = predictions == 1
outlier_mask = predictions == -1

axes[0].scatter(X_pca[inlier_mask, 0], X_pca[inlier_mask, 1],
               c="blue", alpha=0.3, s=20, label="Inliers")
axes[0].scatter(X_pca[outlier_mask, 0], X_pca[outlier_mask, 1],
               c="red", alpha=0.8, s=50, marker="x", label="Outliers")
axes[0].set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
axes[0].set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
axes[0].set_title("LOF: Outliers vs Inliers")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Score distribution
axes[1].hist(scores, bins=50, edgecolor="black", alpha=0.7)
threshold = lof.offset_  # cutoff applied by fit_predict: scores below it are labeled outliers
axes[1].axvline(threshold, color="red", linestyle="--", linewidth=2,
               label="Outlier threshold")
axes[1].set_xlabel("Negative Outlier Factor")
axes[1].set_ylabel("Frequency")
axes[1].set_title("LOF Score Distribution")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## **4. Interpretation & Discussion**

### **How LOF Works**

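1. **Local Density**: Estimate each point's local reachability density from the distances to its k nearest neighbors
2. **Compare Densities**: Take the ratio of the neighbors' average density to the point's own density
3. **LOF Score**:
   - LOF ≈ 1: Similar density to neighbors (normal)
   - LOF well above 1: Lower density than neighbors (outlier)
   - LOF < 1: Higher density than neighbors (inside a dense region)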
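
The steps above can be sketched from scratch. This is a minimal illustration, not the notebook's pipeline: `lof_from_scratch` is a hypothetical helper, the data is random toy data, and the code assumes there are no duplicate rows (duplicates can produce zero distances and a division by zero).

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def lof_from_scratch(X, k):
    """Illustrative LOF computation; assumes no duplicate rows in X."""
    # Calling kneighbors() without a query excludes each point from its own neighbor list
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors()

    k_distance = dist[:, -1]                        # distance to the k-th neighbor
    reach_dist = np.maximum(dist, k_distance[idx])  # max(k-distance(o), d(p, o))
    lrd = 1.0 / reach_dist.mean(axis=1)             # local reachability density
    return lrd[idx].mean(axis=1) / lrd              # mean neighbor density / own density

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))

manual = lof_from_scratch(X_toy, k=20)
reference = -LocalOutlierFactor(n_neighbors=20).fit(X_toy).negative_outlier_factor_
print("Matches scikit-learn:", np.allclose(manual, reference, rtol=1e-5))
```

The comparison against `LocalOutlierFactor` is only a sanity check that the sketch follows the same definition; the two should agree up to floating-point tolerance.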

### **Key Findings**

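* **Local vs Global**: LOF detects local anomalies, not just global outliers
* **Score Interpretation**:
  - The plots show scikit-learn's `negative_outlier_factor_`, which is the negated LOF: inliers sit near -1, and more negative values indicate stronger outliers
  - Points in regions sparser than their neighbors' regions have more negative scores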
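
To make the link between scores and labels concrete, the sketch below reproduces the labels from `fit_predict` by comparing `negative_outlier_factor_` against the fitted `offset_` attribute. The data (`X_demo`) and the contamination value are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))  # hypothetical data, not the project dataset

lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof_demo.fit_predict(X_demo)
nof = lof_demo.negative_outlier_factor_

# Points whose score falls below offset_ are labeled -1 (outlier), the rest +1
manual_labels = np.where(nof < lof_demo.offset_, -1, 1)

print(f"offset_ (score cutoff): {lof_demo.offset_:.4f}")
print(f"Outliers flagged:       {(labels == -1).sum()}")
print(f"Labels reproduced:      {np.array_equal(labels, manual_labels)}")
```

With a numeric `contamination`, the cutoff is set so that roughly that fraction of the training points falls below it.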

### **Comparison: LOF vs Isolation Forest vs Elliptic Envelope**

| Aspect | LOF | Isolation Forest | Elliptic Envelope |
|--------|-----|------------------|-------------------|
| **Method** | Density-based | Tree-based | Gaussian/Covariance-based |
| **Speed** | Slower (O(n²)) | Faster (O(n log n)) | Fast (O(n) prediction) |
| **Outlier Type** | Local anomalies | Global anomalies | Global anomalies |
| **High Dimensions** | Suffers from curse of dimensionality | Handles well | Needs PCA/reduction |
| **Interpretability** | More intuitive | Less interpretable | Very interpretable |
| **Assumption** | Local density | None | Gaussian distribution |

### **Strengths**
* Detects local outliers in clustered data
* Intuitive: based on neighborhood density
* No assumptions about data distribution

### **Limitations**
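* Slow on large datasets (up to O(n²) with brute-force neighbor search)
* Sensitive to k (number of neighbors)
* Curse of dimensionality makes distances less informative in high-dimensional data
* Requires choosing a contamination level (or a score threshold) to turn scores into outlier labels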
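
One way to gauge the sensitivity to k is to refit with several `n_neighbors` values and compare which points get flagged. The sketch below does this on made-up data (two clusters of different density plus scattered points); the data, the k values, and the contamination level are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# Hypothetical data: a dense cluster, a diffuse cluster, and a few scattered points
X_k = np.vstack([
    rng.normal(0.0, 0.3, size=(300, 2)),
    rng.normal(5.0, 1.5, size=(300, 2)),
    rng.uniform(-4.0, 10.0, size=(15, 2)),
])

flagged = {}
for k in (5, 20, 50):
    labels_k = LocalOutlierFactor(n_neighbors=k, contamination=0.03).fit_predict(X_k)
    flagged[k] = set(np.where(labels_k == -1)[0])

# Jaccard overlap of each flagged set with the k=20 result
baseline = flagged[20]
for k, idx_set in flagged.items():
    jaccard = len(idx_set & baseline) / len(idx_set | baseline)
    print(f"k={k:>2}: {len(idx_set)} flagged, Jaccard overlap with k=20: {jaccard:.2f}")
```

A low overlap between settings suggests that the flagged set depends heavily on k and that conclusions should be checked across several values.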

### **Medical Context**

LOF-detected outliers may represent:
1. **Unusual patient profiles**: Rare combinations of features
2. **Data quality issues**: Measurement or recording errors
3. **Special populations**: Subgroups requiring different treatment

**Recommendation**:
* Use **LOF** when the data forms clusters of varying density and anomalies are defined relative to their local neighborhood
* Use **Isolation Forest** for faster computation on large, high-dimensional datasets
* Use **Elliptic Envelope** when data is roughly Gaussian and you want interpretable global outliers
* Prioritize points flagged by more than one method for review, since agreement across detectors gives the highest confidence
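
A minimal sketch of the multi-method review, assuming toy Gaussian data with a few planted extreme points (all names and parameter values here are hypothetical): each detector votes, and points flagged by all three form the consensus set.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
# Hypothetical data: 500 Gaussian inliers plus 5 planted points far from the bulk
X_inliers = rng.normal(size=(500, 4))
signs = rng.choice([-1.0, 1.0], size=(5, 4))
X_planted = signs * rng.uniform(6.0, 9.0, size=(5, 4))
X_all = np.vstack([X_inliers, X_planted])

contamination = 0.02
detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=contamination),
    "Isolation Forest": IsolationForest(contamination=contamination, random_state=42),
    "Elliptic Envelope": EllipticEnvelope(contamination=contamination, random_state=42),
}

# Count how many detectors label each point as an outlier (-1)
votes = np.zeros(len(X_all), dtype=int)
for name, detector in detectors.items():
    flagged_mask = detector.fit_predict(X_all) == -1
    votes += flagged_mask
    print(f"{name}: {flagged_mask.sum()} flagged")

consensus = np.where(votes == len(detectors))[0]
print(f"Flagged by all three detectors: {len(consensus)} points")
```

Requiring agreement from all detectors is a conservative choice; a majority vote (`votes >= 2`) is a looser alternative.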
