# Report 2: Basic Implementation
## Anomaly Detection - LOF and PCA

**Project 7**: Anomalies and machine learning  
**Academic year**: 2025/2026

---

This notebook demonstrates a basic implementation of two algorithms:
1. **LOF** (Local Outlier Factor)
2. **PCA** (Principal Component Analysis) for anomaly detection


In [None]:
# Imports
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons

# Add parent directory to path
sys.path.insert(0, os.path.abspath('..'))

from src.algorithms.lof import LOF
from src.algorithms.pca_anomaly import PCAAnomaly

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

print("✓ All imports successful")


## 1. Local Outlier Factor (LOF)

### 1.1 Algorithm

LOF identifies local anomalies by comparing the density around a point with the density around its neighbors.

**Key concepts**:
- **k-distance**: distance to the k-th nearest neighbor
- **Reachability distance**: `reach-dist(p, o) = max(k-distance(o), d(p, o))`
- **Local Reachability Density** (LRD): inverse of the mean reachability distance from a point to its k nearest neighbors
- **LOF score**: `LOF(p) = mean(LRD of neighbors) / LRD(p)`

**Interpretation**:
- LOF ≈ 1: normal point (density similar to that of its neighbors)
- LOF > 1: potential anomaly (lower density than its neighbors)
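
The definitions above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not the project's `src.algorithms.lof.LOF` class: the function name `lof_scores` is hypothetical, ties at the k-distance are ignored (exactly k neighbors are kept), and the data is assumed to contain no duplicate points, so no mean reachability distance is zero.

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF scores (illustrative sketch, O(n^2) time and memory)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of every point.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # k-distance: distance to the k-th nearest neighbor.
    k_dist = dist[np.arange(n), neighbors[:, -1]]
    # reach-dist(p, o) = max(k-distance(o), d(p, o)) for each neighbor o of p.
    d_po = np.take_along_axis(dist, neighbors, axis=1)
    reach = np.maximum(k_dist[neighbors], d_po)
    # LRD: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: mean LRD of the neighbors divided by the point's own LRD.
    return lrd[neighbors].mean(axis=1) / lrd

# Small cluster plus one isolated point (index 7).
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1],
              [0.5, 0.5], [1, 0.5], [0.5, 1], [5, 5]])
scores = lof_scores(X, k=3)
print(np.round(scores, 2))
```

On this data the isolated point at (5, 5) receives by far the largest score, while the cluster points stay close to 1.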


In [None]:
# Test 1: Simple 2D example with a single outlier
print("Test 1: LOF on simple 2D data")
print("=" * 50)

# Create data: cluster + 1 outlier
X_cluster = np.array([
    [0, 0],
    [1, 1],
    [1, 0],
    [0, 1],
    [0.5, 0.5],
    [1, 0.5],
    [0.5, 1]
])
X_outlier = np.array([[5, 5]])
X = np.vstack([X_cluster, X_outlier])

# Fit LOF
lof = LOF(n_neighbors=3)
scores = lof.fit_predict(X)

print(f"Data shape: {X.shape}")
print(f"\nLOF scores:")
for i, score in enumerate(scores):
    label = "OUTLIER" if score > 1.5 else "NORMAL"
    print(f"  Point {i}: LOF = {score:.3f} [{label}]")

# Visualization
fig, ax = plt.subplots(1, 1, figsize=(8, 6))

# Plot points
scatter = ax.scatter(X[:, 0], X[:, 1], c=scores, s=200, 
                    cmap='RdYlGn_r', edgecolors='black', linewidths=2)

# Annotate with LOF scores
for i, (x, y) in enumerate(X):
    ax.annotate(f'{scores[i]:.2f}', (x, y), 
               textcoords="offset points", xytext=(0,10), 
               ha='center', fontsize=10, fontweight='bold')

ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.set_title('LOF: Simple 2D Example (k=3)', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=ax, label='LOF Score')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Test 2: Gaussian cluster with outliers
print("\nTest 2: LOF on Gaussian data")
print("=" * 50)

np.random.seed(42)

# Generate Gaussian cluster
X_inliers = np.random.randn(100, 2) * 0.5

# Add outliers
X_outliers = np.array([
    [3, 3],
    [-3, 3],
    [3, -3],
    [-3, -2.5]
])

X = np.vstack([X_inliers, X_outliers])
y_true = np.hstack([np.zeros(100), np.ones(4)])  # Ground truth labels

# Fit LOF
lof = LOF(n_neighbors=20)
scores = lof.fit_predict(X)

# Threshold
threshold = 1.5
y_pred = (scores > threshold).astype(int)

# Calculate metrics
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"\nMetrics (threshold={threshold}):")
print(f"  Precision: {precision:.3f}")
print(f"  Recall: {recall:.3f}")
print(f"  F1-score: {f1:.3f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: LOF scores
scatter1 = axes[0].scatter(X[:, 0], X[:, 1], c=scores, s=50, 
                          cmap='RdYlGn_r', edgecolors='black', linewidths=0.5)
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('LOF Scores (k=20)', fontsize=14, fontweight='bold')
plt.colorbar(scatter1, ax=axes[0], label='LOF Score')
axes[0].grid(True, alpha=0.3)

# Plot 2: Predictions vs Ground Truth
axes[1].scatter(X[y_true == 0, 0], X[y_true == 0, 1], 
               c='blue', s=50, alpha=0.6, label='True Inliers', edgecolors='black', linewidths=0.5)
axes[1].scatter(X[y_true == 1, 0], X[y_true == 1, 1], 
               c='red', s=200, marker='*', label='True Outliers', edgecolors='black', linewidths=1)
axes[1].scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], 
               facecolors='none', s=300, marker='o', 
               edgecolors='orange', linewidths=3, label='Detected Outliers')
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title(f'Detection Results (F1={f1:.3f})', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test 3: Effect of the parameter k
print("\nTest 3: Effect of the parameter k on LOF results")
print("=" * 50)

k_values = [5, 10, 20, 30]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    lof = LOF(n_neighbors=k)
    scores = lof.fit_predict(X)
    
    scatter = axes[idx].scatter(X[:, 0], X[:, 1], c=scores, s=50, 
                               cmap='RdYlGn_r', edgecolors='black', linewidths=0.5)
    axes[idx].set_xlabel('Feature 1', fontsize=10)
    axes[idx].set_ylabel('Feature 2', fontsize=10)
    axes[idx].set_title(f'LOF with k={k}', fontsize=12, fontweight='bold')
    plt.colorbar(scatter, ax=axes[idx], label='LOF Score')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("  - Small k: more sensitive to local anomalies")
print("  - Large k: detects global anomalies")

## 2. Principal Component Analysis (PCA)

### 2.1 Algorithm

PCA reduces the dimensionality of the data by projecting it onto the directions of maximum variance.

**Algorithm steps**:
1. Standardize the data
2. Compute the covariance matrix
3. Compute its eigenvectors (the principal components)
4. Project the data onto the retained PCs

**Anomaly detection**:
- **Reconstruction Error**: `||x - x_reconstructed||²`
- **Mahalanobis Distance**: distance in PC space, with each component scaled by its variance
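
The four steps and both scores can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the project's `PCAAnomaly` class: the function name `pca_anomaly_scores` is hypothetical, the Mahalanobis distance is computed on the retained components only, and their eigenvalues are assumed to be strictly positive.

```python
import numpy as np

def pca_anomaly_scores(X, n_components):
    """Return (reconstruction error, Mahalanobis distance in PC space) per sample."""
    X = np.asarray(X, dtype=float)
    # 1. Standardize (constant features are left unscaled).
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - mean) / std
    # 2. Covariance matrix of the standardized data.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigenvectors; eigh returns eigenvalues in ascending order, so reorder.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, components = eigvals[order], eigvecs[:, order]
    # 4. Project onto the retained components and map back.
    T = Z @ components
    Z_hat = T @ components.T
    recon_error = np.sum((Z - Z_hat) ** 2, axis=1)
    mahalanobis = np.sqrt(np.sum(T ** 2 / eigvals, axis=1))
    return recon_error, mahalanobis

# Points along a line plus one point far off the line (index 30).
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 30)
X_line = np.column_stack([t, t + 0.3 * rng.standard_normal(30)])
X = np.vstack([X_line, [[5.0, -2.0]]])

recon, maha = pca_anomaly_scores(X, n_components=1)
print("index of largest reconstruction error:", int(np.argmax(recon)))
```

With one retained component the off-line point has the largest reconstruction error. Keeping every component makes the reconstruction error vanish, which is why that score needs `n_components` smaller than the number of features.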


In [None]:
# Test 4: PCA - Reconstruction Error
print("\nTest 4: PCA - Reconstruction Error Method")
print("=" * 50)

# Create data along main axis with outlier perpendicular
np.random.seed(42)
X_line = np.column_stack([
    np.linspace(0, 10, 30),
    np.linspace(0, 10, 30) + np.random.randn(30) * 0.3
])
X_outliers_pca = np.array([
    [5, -2],
    [2, 8]
])
X_pca = np.vstack([X_line, X_outliers_pca])

# Fit PCA
pca = PCAAnomaly(n_components=1, method='reconstruction')
pca.fit(X_pca)

# Get reconstruction errors
errors = pca.reconstruction_error(X_pca)
scores = pca.score_samples(X_pca)

print(f"\nExplained variance ratio: {pca.explained_variance_ratio_[0]:.3f}")
print(f"Mean reconstruction error (inliers): {np.mean(errors[:30]):.4f}")
print(f"Reconstruction errors (outliers): {errors[30:]}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Data with PC direction
axes[0].scatter(X_pca[:30, 0], X_pca[:30, 1], c='blue', s=50, 
               alpha=0.6, label='Inliers', edgecolors='black', linewidths=0.5)
axes[0].scatter(X_pca[30:, 0], X_pca[30:, 1], c='red', s=200, 
               marker='*', label='Outliers', edgecolors='black', linewidths=1)

# Plot principal component
mean = pca.mean_
pc = pca.components_[0] * pca.std_  # Unstandardize for plotting
axes[0].arrow(mean[0], mean[1], pc[0]*3, pc[1]*3, 
             head_width=0.3, head_length=0.3, fc='green', ec='green', linewidth=3)
axes[0].plot([mean[0] - pc[0]*3, mean[0] + pc[0]*3],
            [mean[1] - pc[1]*3, mean[1] + pc[1]*3],
            'g--', linewidth=2, label='1st Principal Component', alpha=0.7)

axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('PCA: Data and Principal Component', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Reconstruction errors
scatter2 = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=errors, s=100, 
                          cmap='RdYlGn_r', edgecolors='black', linewidths=1)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title('Reconstruction Error', fontsize=14, fontweight='bold')
plt.colorbar(scatter2, ax=axes[1], label='Reconstruction Error')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test 5: PCA - Explained Variance
print("\nTest 5: PCA - Explained Variance Analysis")
print("=" * 50)

# Generate 5D data
np.random.seed(42)
n_samples = 200

# Create correlated features
z1 = np.random.randn(n_samples)
z2 = np.random.randn(n_samples)
z3 = np.random.randn(n_samples) * 0.1

X_5d = np.column_stack([
    z1 + z2 + np.random.randn(n_samples) * 0.1,  # Highly correlated
    z1 - z2 + np.random.randn(n_samples) * 0.1,  # Highly correlated
    z1 * 0.5 + np.random.randn(n_samples) * 0.3,  # Moderately correlated
    z3,  # Low variance
    np.random.randn(n_samples) * 0.05  # Very low variance
])

# Fit PCA with all components
pca_all = PCAAnomaly(n_components=5)
pca_all.fit(X_5d)

print(f"\nExplained variance by each component:")
for i, var in enumerate(pca_all.explained_variance_ratio_):
    print(f"  PC{i+1}: {var:.4f} ({var*100:.2f}%)")

cumsum = np.cumsum(pca_all.explained_variance_ratio_)
print(f"\nCumulative explained variance:")
for i, var in enumerate(cumsum):
    print(f"  First {i+1} components: {var:.4f} ({var*100:.2f}%)")

# Plot explained variance
fig = pca_all.plot_explained_variance()
plt.show()

In [None]:
# Test 6: PCA - Mahalanobis Distance
print("\nTest 6: PCA - Mahalanobis Distance Method")
print("=" * 50)

# Generate elongated cluster
np.random.seed(42)
X_elongated = np.random.randn(100, 2)
X_elongated[:, 0] *= 3  # Stretch along x-axis

# Add outliers
X_outliers_maha = np.array([[8, 8], [-8, -7]])
X_maha = np.vstack([X_elongated, X_outliers_maha])

# Fit PCA with Mahalanobis
pca_maha = PCAAnomaly(n_components=2, method='mahalanobis', contamination=0.05)
pca_maha.fit(X_maha)

distances = pca_maha.mahalanobis_distance(X_maha)
labels = pca_maha.predict(X_maha)

print(f"\nMahalanobis distances:")
print(f"  Mean (inliers): {np.mean(distances[:100]):.3f}")
print(f"  Outliers: {distances[100:]}")
print(f"  Threshold: {-pca_maha.threshold_:.3f}")
print(f"  Detected outliers: {np.sum(labels)}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Mahalanobis distances
scatter1 = axes[0].scatter(X_maha[:, 0], X_maha[:, 1], c=distances, s=80, 
                          cmap='RdYlGn_r', edgecolors='black', linewidths=1)
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Mahalanobis Distance', fontsize=14, fontweight='bold')
plt.colorbar(scatter1, ax=axes[0], label='Distance')
axes[0].grid(True, alpha=0.3)

# Plot 2: Detection results
axes[1].scatter(X_maha[labels == 0, 0], X_maha[labels == 0, 1], 
               c='blue', s=50, alpha=0.6, label='Inliers', edgecolors='black', linewidths=0.5)
axes[1].scatter(X_maha[labels == 1, 0], X_maha[labels == 1, 1], 
               c='red', s=200, marker='*', label='Detected Outliers', 
               edgecolors='black', linewidths=1)
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title('Detection Results', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Comparison: LOF vs PCA

Let's see how both methods handle the same dataset.

In [None]:
# Test 7: LOF vs PCA comparison
print("\nTest 7: LOF vs PCA comparison")
print("=" * 50)

# Generate test data
np.random.seed(42)
X_test = np.random.randn(150, 2) * 0.8
X_test_outliers = np.array([
    [4, 4],
    [-4, 3],
    [3, -4],
    [-3, -4],
    [0, 5]
])
X_test_combined = np.vstack([X_test, X_test_outliers])
y_test_true = np.hstack([np.zeros(150), np.ones(5)])

# LOF
lof_comp = LOF(n_neighbors=15)
lof_scores = lof_comp.fit_predict(X_test_combined)
lof_labels = (lof_scores > 1.5).astype(int)

# PCA
pca_comp = PCAAnomaly(n_components=1, method='reconstruction', contamination=0.05)
pca_comp.fit(X_test_combined)
pca_errors = pca_comp.reconstruction_error(X_test_combined)
pca_labels = pca_comp.predict(X_test_combined)

# Metrics
from sklearn.metrics import precision_score, recall_score, f1_score

lof_precision = precision_score(y_test_true, lof_labels, zero_division=0)
lof_recall = recall_score(y_test_true, lof_labels)
lof_f1 = f1_score(y_test_true, lof_labels)

pca_precision = precision_score(y_test_true, pca_labels, zero_division=0)
pca_recall = recall_score(y_test_true, pca_labels)
pca_f1 = f1_score(y_test_true, pca_labels)

print(f"\nLOF Results:")
print(f"  Precision: {lof_precision:.3f}")
print(f"  Recall: {lof_recall:.3f}")
print(f"  F1-score: {lof_f1:.3f}")

print(f"\nPCA Results:")
print(f"  Precision: {pca_precision:.3f}")
print(f"  Recall: {pca_recall:.3f}")
print(f"  F1-score: {pca_f1:.3f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Plot 1: Ground truth
axes[0].scatter(X_test_combined[y_test_true == 0, 0], 
               X_test_combined[y_test_true == 0, 1], 
               c='blue', s=50, alpha=0.6, label='True Inliers', 
               edgecolors='black', linewidths=0.5)
axes[0].scatter(X_test_combined[y_test_true == 1, 0], 
               X_test_combined[y_test_true == 1, 1], 
               c='red', s=200, marker='*', label='True Outliers', 
               edgecolors='black', linewidths=1)
axes[0].set_xlabel('Feature 1', fontsize=12)
axes[0].set_ylabel('Feature 2', fontsize=12)
axes[0].set_title('Ground Truth', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: LOF results
scatter2 = axes[1].scatter(X_test_combined[:, 0], X_test_combined[:, 1], 
                          c=lof_scores, s=80, cmap='RdYlGn_r', 
                          edgecolors='black', linewidths=1)
axes[1].scatter(X_test_combined[lof_labels == 1, 0], 
               X_test_combined[lof_labels == 1, 1], 
               facecolors='none', s=300, marker='o', 
               edgecolors='orange', linewidths=3, label='Detected')
axes[1].set_xlabel('Feature 1', fontsize=12)
axes[1].set_ylabel('Feature 2', fontsize=12)
axes[1].set_title(f'LOF (F1={lof_f1:.3f})', fontsize=14, fontweight='bold')
plt.colorbar(scatter2, ax=axes[1], label='LOF Score')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Plot 3: PCA results
scatter3 = axes[2].scatter(X_test_combined[:, 0], X_test_combined[:, 1], 
                          c=pca_errors, s=80, cmap='RdYlGn_r', 
                          edgecolors='black', linewidths=1)
axes[2].scatter(X_test_combined[pca_labels == 1, 0], 
               X_test_combined[pca_labels == 1, 1], 
               facecolors='none', s=300, marker='o', 
               edgecolors='orange', linewidths=3, label='Detected')
axes[2].set_xlabel('Feature 1', fontsize=12)
axes[2].set_ylabel('Feature 2', fontsize=12)
axes[2].set_title(f'PCA (F1={pca_f1:.3f})', fontsize=14, fontweight='bold')
plt.colorbar(scatter3, ax=axes[2], label='Reconstruction Error')
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Summary

### 4.1 LOF (Local Outlier Factor)

**Advantages**:
- Detects local anomalies
- Robust to varying densities within the dataset
- Makes no assumptions about the data distribution

**Disadvantages**:
- High computational cost: O(n²) with brute-force neighbor search, or roughly O(n log n) with a KD-tree in low dimensions
- Sensitive to the choice of k
- May struggle with high-dimensional data
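
The neighbor search dominates the cost of LOF. As a rough sketch, assuming scikit-learn's `NearestNeighbors` (the notebook already imports scikit-learn, but the project's own `LOF` class may search differently), a KD-tree index returns the same neighbors as a brute-force search. The helper name `knn` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn(X, k, algorithm):
    """k nearest neighbors of every point in X, excluding the point itself."""
    # Ask for k + 1 neighbors because each point comes back as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm=algorithm).fit(X)
    dist, idx = nn.kneighbors(X)
    return dist[:, 1:], idx[:, 1:]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
k = 20

dist_brute, idx_brute = knn(X, k, "brute")    # compares all pairs
dist_tree, idx_tree = knn(X, k, "kd_tree")    # uses a spatial index
print("same neighbor distances:", np.allclose(dist_brute, dist_tree))
```

The tree pays off mainly in low dimensions. As the number of features grows, its queries tend toward brute-force cost.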

### 4.2 PCA (Principal Component Analysis)

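**Advantages**:
- Dimensionality reduction
- Interpretable transformation
- Effective for data with linear dependencies
- Lower complexity: O(nd² + d³)

**Disadvantages**:
- Assumes linearity
- Sensitive to the scale of the data
- May miss nonlinear anomalies
- Each score has a blind spot: reconstruction error misses anomalies that lie along the retained components, while a Mahalanobis distance computed only on the retained components misses anomalies in the discarded low-variance directions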
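
The sensitivity to scale can be seen in a small NumPy sketch. The helper `first_pc` and the synthetic data are assumptions made for illustration and are not part of the project code.

```python
import numpy as np

def first_pc(X):
    """Eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = 0.8 * a + 0.6 * rng.standard_normal(500)  # correlated with a
X = np.column_stack([a * 1000.0, b])          # first feature in much larger units

pc_raw = first_pc(X)
pc_std = first_pc((X - X.mean(axis=0)) / X.std(axis=0))
print("raw:         ", np.round(np.abs(pc_raw), 3))
print("standardized:", np.round(np.abs(pc_std), 3))
```

Without standardization the first component is almost entirely the large-unit feature. After standardization both features contribute equally.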

### 4.3 When to use which method?

**LOF**:
- Data with varying densities
- Local anomalies
- Small datasets (< 10k samples)

**PCA**:
- High-dimensional data that needs reduction
- Linear dependencies between features
- Large datasets
- Anomalies along the main directions of variance (Mahalanobis score) or off the principal subspace (reconstruction error)


In [None]:
print("=" * 70)
print("REPORT 2: Basic Implementation - COMPLETED")
print("=" * 70)
print("\n✓ LOF implementation - DONE")
print("✓ PCA implementation - DONE")
print("✓ Unit tests - DONE")
print("✓ Algorithm demonstration - DONE")
print("\nNext step: Report 3 - Isolation Forest and optimizations")