# PCA-Based Gross Error Detection

### 🎯 Objective:
Detect gross sensor errors with Principal Component Analysis (PCA) by flagging multivariate outliers whose reconstruction error is unusually large.

---

### 🧠 Why PCA?
In physical systems like gas compressors, features such as pressure, temperature, flow rate, and vibration are often correlated.  
If one or more sensors fail or behave abnormally, the structure of this correlation is disrupted.

PCA captures this correlation structure:
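- It projects data onto a lower-dimensional space that retains the dominant patterns
- Reconstruction error measures how far each data point deviates from those learned patterns; it is only informative when fewer components than features are kept, since a full-rank PCA reconstructs every point exactly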
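
To make the idea concrete, here is a minimal sketch on synthetic data (not the compressor dataset). It assumes three hypothetical signals driven by a single shared latent factor, with a gross error injected into one reading; all names and values are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, illustrative data: three signals driven by one latent factor
latent = rng.normal(size=500)
X = np.column_stack([
    latent + 0.05 * rng.normal(size=500),
    2.0 * latent + 0.05 * rng.normal(size=500),
    -latent + 0.05 * rng.normal(size=500),
])

# Inject a gross error into one sample of the second signal
faulty_row = 10
X[faulty_row, 1] += 8.0

# Keep fewer components than features so the reconstruction is lossy
pca_demo = PCA(n_components=1)
X_hat = pca_demo.inverse_transform(pca_demo.fit_transform(X))
demo_error = np.mean((X - X_hat) ** 2, axis=1)

# The corrupted sample breaks the learned correlation and stands out
worst_row = int(np.argmax(demo_error))
print(f"Row with the largest reconstruction error: {worst_row}")
```

The corrupted row no longer lies along the direction shared by the three signals, so the part of it that the single retained component cannot explain shows up as a large residual.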


In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Assume df is your dataset with compressor features
features = ["Pressure_In", "Temperature_In", "Flow_Rate", "Efficiency", "Vibration"]
df_clean = df[features].dropna()

# Standardize the features (zero mean, unit variance) so no sensor dominates by scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clean)

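# Apply PCA, keeping fewer components than features.
# With n_components equal to the number of features (5), the reconstruction
# is exact and every error is ~0. The value 2 is an assumption; tune it for your data.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Reconstruct the input from the retained components
X_reconstructed = pca.inverse_transform(X_pca)
# Per-sample mean squared error across features (in standardized units)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2, axis=1)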

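# Threshold at the 95th percentile.
# Note: this flags roughly 5% of samples by construction, faulty or not.
threshold = np.percentile(reconstruction_error, 95)
# Positional indices into df_clean (not the original df index labels)
outlier_indices = np.where(reconstruction_error > threshold)[0]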

print(f"Number of suspected gross errors: {len(outlier_indices)}")
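
The number of retained components controls how lossy the reconstruction is, so it is worth choosing deliberately. One common heuristic, sketched below on synthetic data, is to keep the smallest number of components whose cumulative explained variance reaches a target. The 95% target and the synthetic five-channel data are assumptions for illustration, not values taken from the compressor dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for five correlated sensor channels (two latent factors)
factors = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 5))
X_demo = factors @ mixing + 0.1 * rng.normal(size=(1000, 5))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit with all components only to inspect the variance spectrum
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
target = 0.95  # assumed target, tune for your data
k = int(np.searchsorted(cumulative, target) + 1)
print(f"Components needed for {target:.0%} variance: {k}")
```

If `k` comes out equal to the number of features, the reconstruction is exact and the error carries no information, so a lower target (or a fixed smaller `k`) would be needed.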

## Plotting

In [None]:
# Plot reconstruction error
plt.figure(figsize=(10, 4))
plt.hist(reconstruction_error, bins=100, color='skyblue')
plt.axvline(threshold, color='red', linestyle='--', label="95th percentile threshold")
plt.title("PCA Reconstruction Error Distribution")
plt.xlabel("Reconstruction error (MSE, standardized units)")
plt.ylabel("Frequency")
plt.legend()
plt.show()

### further changes...