## Local Outlier Factor (LOF) algorithm for outlier detection when the dataset size and the number of outliers are unknown.

 LOF is a density-based method that compares the local density of a point to the local densities of its neighbors. Points with a significantly lower density than their neighbors are considered outliers.
 
 Steps:
1. Generate a dataset (we'll create one with inliers and outliers, but without knowing the exact proportion).

2. Use `LocalOutlierFactor` with `novelty=False` (the default). In this mode the model scores the same data it is fitted on, so no separate training set is needed.
    Note: With `novelty=False`, LOF in scikit-learn acts as an unsupervised outlier detector. After fitting, `negative_outlier_factor_` holds one score per point, which we can use to rank the outliers.

3. Determine a threshold for the negative outlier factor to classify outliers. We can use:
    - A fixed threshold (e.g., -1.5 or -2) but that might not be robust.
    - Percentile of the scores (e.g., the bottom 5%).
    - Visual inspection of the scores to set a threshold.
 However, since we don't know the proportion of outliers, we can:
    a) Use the LOF scores and look for a natural break (elbow) in the sorted scores.
    b) Use the IQR method on the negative outlier factors (but note: the scores are negative, so we are looking for very negative values).

4. Plot the results.


<h3 style='color:black;'>Implementation</h3>

We'll generate a dataset with two normally distributed clusters (300 and 100 points) and 20 outliers drawn from a uniform distribution, then proceed as if the proportion of outliers were unknown.
 
Step 1: Data Generation
 
Step 2: Fit LOF and get the negative outlier factors.
 
Step 3: Set a threshold, for example with the IQR method on the negative outlier factors. Note that the more negative the score, the more outlying the point.
 
Alternatively, we can use the `contamination` parameter of LOF to specify the expected proportion of outliers. If we don't know it, we can leave the parameter alone and threshold the scores ourselves.
 Whatever `contamination` is set to, the raw scores remain available in `negative_outlier_factor_`, so we can set a threshold based on:
 
   threshold = np.percentile(scores, 100 * (proportion))   # if we set proportion to 5%, then 5th percentile.
   
But without knowing the proportion, we can:
   - Plot the scores and look for an elbow.
   - Use the IQR method: 
        Q1 = np.percentile(scores, 25)
        Q3 = np.percentile(scores, 75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR   # for the negative scores, we are interested in the lower tail.
     
Since the scores are negative, and outliers are the ones with the smallest (most negative) scores, we can set:
        outlier_mask = scores < (Q1 - 1.5 * IQR)
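The IQR rule above can be sketched as runnable code. This is a minimal illustration on toy data; the names `X_demo`, `nof`, and `iqr_outliers` are assumptions made for the example, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed for illustration): one Gaussian blob plus a few uniform points
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))])

lof_demo = LocalOutlierFactor(n_neighbors=20).fit(X_demo)
nof = lof_demo.negative_outlier_factor_  # near -1 for inliers, much lower for outliers

# Lower Tukey fence: outliers sit in the lower (most negative) tail
q1, q3 = np.percentile(nof, [25, 75])
lower_bound = q1 - 1.5 * (q3 - q1)
iqr_outliers = nof < lower_bound

print(f"lower bound: {lower_bound:.3f}, flagged: {iqr_outliers.sum()} of {len(X_demo)}")
```

Because every flagged point lies below Q1, this rule can never flag more than a quarter of the data, whatever the true proportion of outliers turns out to be.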
     
Step 4: Plot.

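Alternatively, we can use `LocalOutlierFactor` with `contamination='auto'` (the default), which applies the fixed cutoff from the original LOF paper: `offset_` is set to -1.5, so points whose `negative_outlier_factor_` is below -1.5 (LOF above 1.5) are labeled outliers. No particular percentage is assumed, but a fixed cutoff may not suit every dataset.
 
Note: Before scikit-learn 0.22 the default was `contamination=0.1`, i.e. 10% of the points were flagged. Passing a float still flags that proportion, which is useful if we have a better estimate.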
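Rather than relying on memory of what the default does, it can help to inspect the fitted estimator directly. The sketch below uses toy data, and the names `X_demo`, `lof_auto`, and `manual` are assumptions for illustration. It prints `offset_`, the cutoff applied to `negative_outlier_factor_`, and the fraction of points actually flagged. In recent scikit-learn releases `offset_` is -1.5 under `contamination='auto'`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed for illustration)
rng = np.random.RandomState(1)
X_demo = np.vstack([rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))])

lof_auto = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof_auto.fit_predict(X_demo)  # -1 = outlier, 1 = inlier

print("offset_:", lof_auto.offset_)
print(f"flagged fraction: {(labels == -1).mean():.1%}")

# The labels are simply the scores compared against offset_
manual = np.where(lof_auto.negative_outlier_factor_ < lof_auto.offset_, -1, 1)
print("labels match manual thresholding:", np.array_equal(labels, manual))
```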

<h3 style='color:black;'>Detecting Outliers with Local Outlier Factor (LOF) for Unknown Datasets</h3>

When dealing with datasets of unknown size and unknown outlier proportion, Local Outlier Factor (LOF) is a useful density-based method. It produces a continuous score for every point, so the contamination does not have to be fixed upfront: the cutoff can be chosen after inspecting the scores. Here's how to approach it:

<h3 style='color:black;'>Key Concept: Local Density Deviation</h3>

LOF compares the local density of a point to its neighbors:

- LOF ≈ 1: Similar density to neighbors (inlier)

- LOF > 1: Lower density than neighbors (potential outlier; the further above 1, the stronger the evidence)

- LOF < 1: Higher density than neighbors (core point)

No prior knowledge of outlier count needed!
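To make the interpretation concrete, here is a small sketch on assumed toy data: a Gaussian cluster plus a single isolated point. The names `cluster_pts`, `far_point`, and `lof_values` are illustrative only. Points inside the cluster should score close to 1, while the isolated point should score well above it.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
cluster_pts = rng.randn(100, 2)        # dense cluster around the origin
far_point = np.array([[8.0, 8.0]])     # one clearly isolated point
X_demo = np.vstack([cluster_pts, far_point])

# scikit-learn stores the negated LOF, so flip the sign to recover LOF itself
lof_values = -LocalOutlierFactor(n_neighbors=20).fit(X_demo).negative_outlier_factor_

print(f"median LOF inside the cluster: {np.median(lof_values[:-1]):.2f}")
print(f"LOF of the isolated point:     {lof_values[-1]:.2f}")
```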

<h3 style='color:black;'>Step-by-Step Methodology</h3>

<h3 style='color:black;'>1. Compute LOF Scores</h3>

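- Use `LocalOutlierFactor` with the default `n_neighbors=20` as a starting point (the number of neighbors is not selected automatically)

- Extract negative outlier factors (lower = more outlier-like); the code below negates them so that higher = more anomalous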

<h3 style='color:black;'>2. Dynamic Thresholding</h3>

- Method 1: Percentile-based (e.g., flag top 5% as outliers)

- Method 2: Statistical cutoff (IQR rule)

- Method 3: Knee-point detection (useful when the outlier proportion is unknown)

<h3 style='color:black;'>3. Visual Validation</h3>

- Plot LOF score distribution

- Visualize spatial outlier distribution

<h3 style='color:black;'>Complete Implementation with Knee-Point Detection</h3>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from kneed import KneeLocator  # For automatic threshold detection; requires `pip install kneed`

In [2]:
# Generate dataset with unknown outlier ratio
np.random.seed(42)
inliers = 0.7 * np.random.randn(300, 2)  # Core cluster
cluster = 0.3 * np.random.randn(100, 2) + [3, 3]  # Secondary cluster
outliers = np.random.uniform(-4, 8, (20, 2))  # Unknown outlier count
X = np.vstack([inliers, cluster, outliers])

In [3]:
# 1. Compute LOF scores (n_neighbors=20 is the scikit-learn default)
lof = LocalOutlierFactor(n_neighbors=20, novelty=False)
lof.fit(X)
scores = -lof.negative_outlier_factor_  # Higher = more anomalous

In [4]:
# 2. Automatic threshold detection using knee-point
sorted_scores = np.sort(scores)[::-1]  # Descending order
knee = KneeLocator(
    range(len(sorted_scores)), 
    sorted_scores,
    curve="convex",
    direction="decreasing"
)
# knee.knee is None when no knee is found; in that case fall back to the 95th percentile
threshold = sorted_scores[knee.knee] if knee.knee is not None else np.percentile(scores, 95)

In [5]:
# 3. Identify outliers
outlier_mask = scores > threshold


In [None]:
# 4. Visualize results (one cell, so all three panels share a single figure)
plt.figure(figsize=(15, 5))

# Score distribution
plt.subplot(131)
plt.hist(scores, bins=50, alpha=0.7, color="skyblue")
plt.axvline(threshold, color="red", linestyle="--", 
            label=f'Threshold: {threshold:.2f}')
plt.title("LOF Score Distribution")
plt.xlabel("LOF Score")
plt.ylabel("Frequency")
plt.legend()

# Sorted scores with knee-point
plt.subplot(132)
plt.plot(sorted_scores, 'b-', label="LOF Scores")
if knee.knee is not None:  # nothing to mark when the percentile fallback was used
    plt.axvline(knee.knee, color="red", linestyle="--", 
                label=f'Knee Point (n={knee.knee})')
plt.xlabel("Points (sorted by score)")
plt.ylabel("LOF Score")
plt.title("Knee-Point Detection")
plt.legend()
plt.grid(True)

# Spatial distribution
plt.subplot(133)
plt.scatter(X[~outlier_mask, 0], X[~outlier_mask, 1], 
            c='blue', alpha=0.5, label="Inliers")
plt.scatter(X[outlier_mask, 0], X[outlier_mask, 1], 
            c='red', s=60, edgecolor='k', label="Outliers")
plt.title("Spatial Outlier Distribution")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()

plt.tight_layout()
plt.show()

print(f"Detected outliers: {outlier_mask.sum()} ({outlier_mask.sum()/len(X):.1%})")

<h3 style='color:black;'>Key Components Explained</h3>

<h3 style='color:black;'>1. Neighbor Selection (n_neighbors=20)</h3>

- Start with default min(20, n_samples-1)

- Critical parameter: Controls locality scale

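- Tuning Tip: Choose n_neighbors larger than the smallest cluster you want treated as normal (so its members are not flagged) and smaller than the largest group of nearby points that could themselves be outliers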

<h3 style='color:black;'>2. Knee-Point Detection</h3>

- Automatically finds the "elbow" in sorted LOF scores

- Uses the kneed package to locate the cutoff

- Needs no assumed outlier proportion, but can be sensitive to noise in the score curve, so check the result against the plot
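If the kneed package is not available, the same idea can be approximated with NumPy alone. The sketch below is a simplified stand-in, not kneed's implementation: the hypothetical helper `knee_by_chord_distance` normalizes the descending score curve and returns the index where it falls furthest below the straight line joining its endpoints. The demo scores are made up for illustration.

```python
import numpy as np

def knee_by_chord_distance(sorted_desc):
    """Index where a descending curve drops furthest below the chord joining its ends.

    Hypothetical helper: a simplified take on knee detection, without smoothing.
    Returns None when the curve is too short or flat.
    """
    y = np.asarray(sorted_desc, dtype=float)
    n = len(y)
    if n < 3 or y[0] == y[-1]:
        return None
    x = np.linspace(0.0, 1.0, n)
    y_norm = (y - y[-1]) / (y[0] - y[-1])  # rescale to 1 at the start, 0 at the end
    chord = 1.0 - x                        # straight line between the two endpoints
    return int(np.argmax(chord - y_norm))  # largest gap between chord and curve

# Made-up scores: 95 points with LOF of 1 and five clearly larger values
demo_scores = np.sort(np.concatenate([np.full(95, 1.0), [5.0, 4.0, 3.5, 3.0, 2.5]]))[::-1]
idx = knee_by_chord_distance(demo_scores)
threshold_demo = demo_scores[idx]
n_flagged = int((demo_scores > threshold_demo).sum())
print(f"knee index: {idx}, threshold: {threshold_demo}, flagged: {n_flagged}")
```

As with `KneeLocator`, the value at the knee index serves as the threshold, and points scoring strictly above it are flagged.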

<h3 style='color:black;'>3. Alternative Thresholding Methods</h3>

In [None]:
# Method 1: Fixed percentile (conservative)
threshold = np.percentile(scores, 95)  # Flag top 5%

# Method 2: IQR rule (Tukey's upper fence; needs no assumed proportion)
Q1, Q3 = np.percentile(scores, [25, 75])
threshold = Q3 + 1.5 * (Q3 - Q1)

<h3 style='color:black;'>Interpreting Results</h3>

<h3 style='color:black;'>1. LOF Score Distribution (Left Plot)</h3>

- Shows score histogram with automatic threshold

- Expect right-skewed distribution with outlier "tail"

<h3 style='color:black;'>2. Knee-Point Plot (Middle)</h3>

- Red line marks the detected knee, which serves as the cutoff

- Points left of line are flagged as outliers

<h3 style='color:black;'>3. Spatial Plot (Right)</h3>

- Visual confirmation of outlier detection

- Colors show inliers (blue) vs. outliers (red)

<h3 style='color:black;'>Practical Recommendations</h3>

<h3 style='color:black;'>1. Neighbor Tuning:</h3>

In [None]:
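# Compare neighborhood sizes: a stable outlier count across k suggests a robust result
for k in [5, 10, 20, 30, 40]:
    lof_k = LocalOutlierFactor(n_neighbors=k)
    scores_k = -lof_k.fit(X).negative_outlier_factor_  # separate names keep `scores` intact
    q1, q3 = np.percentile(scores_k, [25, 75])
    n_flagged = (scores_k > q3 + 1.5 * (q3 - q1)).sum()  # IQR rule, as in Method 2
    print(f"k={k:2d}: max LOF={scores_k.max():.2f}, flagged by IQR rule={n_flagged}")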
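One way to reduce the dependence on a single `n_neighbors` value is a heuristic described in the original LOF paper: compute LOF over a range of neighborhood sizes and keep each point's maximum. The sketch below assumes toy data, and `max_lof_over_k` is a hypothetical helper name; the range of k values is an illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def max_lof_over_k(X, k_values=(10, 20, 30, 40)):
    """Per-point maximum LOF across several neighborhood sizes (hypothetical helper)."""
    all_scores = np.column_stack([
        -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_
        for k in k_values
    ])
    return all_scores.max(axis=1)

# Toy data (assumed): a Gaussian blob plus one isolated point at index 200
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(200, 2), [[7.0, 7.0]]])

combined = max_lof_over_k(X_demo)
print("most anomalous index:", int(np.argmax(combined)))
```

Taking the maximum favors recall: a point is ranked as anomalous if it looks isolated at any of the scales tried.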

<h3 style='color:black;'>2. Validation:</h3>

- Combine with DBSCAN: db = DBSCAN(eps=0.3).fit(X)

- Consensus outliers = Points flagged by both LOF and DBSCAN
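The consensus idea can be sketched as follows. The data mirrors the notebook's generator but uses its own random state. `eps=0.3` comes from the text above, while `min_samples=5` (the DBSCAN default) and the use of LOF's built-in `fit_predict` labels are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

# Same shape of data as the notebook: two clusters plus uniform outliers
rng = np.random.RandomState(42)
X_demo = np.vstack([
    0.7 * rng.randn(300, 2),
    0.3 * rng.randn(100, 2) + [3, 3],
    rng.uniform(-4, 8, (20, 2)),
])

# LOF flags: fit_predict returns -1 for outliers
lof_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(X_demo) == -1

# DBSCAN flags: points left unassigned to any cluster are labeled -1 (noise)
db = DBSCAN(eps=0.3, min_samples=5).fit(X_demo)
dbscan_flags = db.labels_ == -1

consensus = lof_flags & dbscan_flags
print(f"LOF: {lof_flags.sum()}, DBSCAN noise: {dbscan_flags.sum()}, both: {consensus.sum()}")
```

The intersection is deliberately conservative: it trades some recall for fewer false alarms, and the DBSCAN side remains sensitive to the choice of `eps`.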

<h3 style='color:black;'>3. Scalability:</h3>

- For large datasets (>10k points), the neighbor search dominates the runtime; consider fitting on a subsample, possibly with a larger n_neighbors such as 50
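One possible way to apply subsampling is to fit LOF in novelty mode on a random subset and then score every point against that subset. This is a sketch under assumptions: the data, the subsample size of 2000, and the names `X_big`, `lof_sub`, and `scores_big` are illustrative. Points that were part of the subsample are scored as if they were new observations, so their scores are slightly optimistic.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy "large" dataset (assumed): 20,000 inliers followed by 200 uniform points
rng = np.random.RandomState(0)
X_big = np.vstack([rng.randn(20000, 2), rng.uniform(-8, 8, (200, 2))])

# Fit on a random subsample; novelty=True enables scoring of unseen points
sample_idx = rng.choice(len(X_big), size=2000, replace=False)
lof_sub = LocalOutlierFactor(n_neighbors=50, novelty=True).fit(X_big[sample_idx])

# score_samples returns the negated LOF, so flip the sign (higher = more anomalous)
scores_big = -lof_sub.score_samples(X_big)

print(f"mean score, Gaussian part: {scores_big[:20000].mean():.2f}")
print(f"mean score, uniform part:  {scores_big[-200:].mean():.2f}")
```

The resulting scores can be thresholded with any of the methods above, such as a percentile, the IQR rule, or a knee point.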

LOF is well suited to detecting local density anomalies without prior knowledge of the outlier proportion, which makes it a good fit for exploratory analysis of unknown datasets.

