# Anomaly Detection — Invisible City

**Goal**: Find facilities that behave very differently from others in their industry (quiet anomalies).

- Use Isolation Forest on facility-level features
- Label anomalies; compare to high-risk facilities
- Why anomalies matter: they are statistical outliers, not legal judgments

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

BASE = os.getcwd()
PROC = os.path.join(BASE, 'data', 'processed')
OUT = os.path.join(BASE, 'output')
os.makedirs(OUT, exist_ok=True)
os.makedirs(os.path.join(OUT, 'figures'), exist_ok=True)

In [None]:
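df = pd.read_csv(os.path.join(PROC, 'facilities_with_risk.csv'))

# Encode industry as integers. LabelEncoder assigns codes in sorted label order,
# so the numbers are arbitrary identifiers, not a meaningful scale.
le = LabelEncoder()
df['industry_enc'] = le.fit_transform(df['industry'].astype(str))

features_anom = ['total_release_kg', 'release_count', 'n_chemicals', 'industry_enc', 'risk_score']
X = df[features_anom].copy()
# log1p compresses very large values and maps 0 to 0, so zero-release rows stay valid.
X['total_release_kg'] = np.log1p(X['total_release_kg'])
X['release_count'] = np.log1p(X['release_count'])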
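
The encoded industry column gives the model an arbitrary integer per industry. One alternative, which this notebook does not use, is to express each feature relative to industry peers before fitting. The sketch below is a minimal illustration on a made-up table (the `toy` frame, its values, and the `release_z_in_industry` column name are all hypothetical): it z-scores the log release within each industry.

```python
import numpy as np
import pandas as pd

# Hypothetical facilities; values are invented for illustration only.
toy = pd.DataFrame({
    'industry': ['chem', 'chem', 'chem', 'food', 'food', 'food'],
    'total_release_kg': [1000.0, 1200.0, 50000.0, 10.0, 12.0, 11.0],
})

toy['log_release'] = np.log1p(toy['total_release_kg'])

# Z-score within industry: how far each facility sits from its own peers.
# Note: std is NaN for an industry with a single facility.
grp = toy.groupby('industry')['log_release']
toy['release_z_in_industry'] = (
    (toy['log_release'] - grp.transform('mean')) / grp.transform('std')
)
print(toy)
```

A column like this could replace or sit alongside the raw feature, so that a facility is judged against its own industry and not against the whole dataset.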

## Train Isolation Forest

In [None]:
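# contamination=0.08 sets the decision threshold so that roughly 8% of facilities are flagged.
iso = IsolationForest(contamination=0.08, random_state=42)
# fit_predict returns -1 for anomalies and 1 for inliers; recode to 1 = anomaly, 0 = normal.
df['anomaly'] = (iso.fit_predict(X) == -1).astype(int)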

print('Anomalies:', df['anomaly'].sum())
print('Anomaly facilities:')
display(df[df['anomaly'] == 1][['facility_id', 'industry', 'total_release_kg', 'risk_score']])
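
The 0/1 label does not show how strongly each flagged facility stands out. A minimal sketch of ranking by the continuous score from `score_samples` (lower means more anomalous) follows. It assumes a synthetic table with one planted extreme row; the `toy` frame and its values are hypothetical and are not drawn from `facilities_with_risk.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for facility features (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'total_release_kg': rng.lognormal(mean=5, sigma=1, size=200),
    'release_count': rng.integers(1, 20, size=200),
})
# Plant one extreme facility so there is something to find.
toy.loc[0, 'total_release_kg'] = 1e9
toy.loc[0, 'release_count'] = 500

X_toy = np.log1p(toy[['total_release_kg', 'release_count']])
iso_toy = IsolationForest(contamination=0.08, random_state=42).fit(X_toy)

# score_samples: the lower the score, the more anomalous the row.
toy['anomaly_score'] = iso_toy.score_samples(X_toy)
toy['anomaly'] = (iso_toy.predict(X_toy) == -1).astype(int)

# Most unusual facilities first.
ranked = toy.sort_values('anomaly_score')
print(ranked.head())
```

Sorting by score gives reviewers an ordering to work through, which the binary flag alone cannot provide.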

## Why anomalies matter

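Anomalies are facilities whose combination of features (release volume, release count, number of chemicals, industry, and risk score) is *unusual* compared with the rest of the dataset. The model sees industry only as an encoded feature, so the comparison to industry peers is approximate. Anomalies are not necessarily the "worst" by risk score; they are statistical outliers. Because `contamination=0.08` fixes the flagged share at roughly 8%, the number of anomalies is a modeling choice, not a finding. Highlighting them helps regulators and the public notice outliers worth a closer look.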
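
To compare anomalies with high-risk facilities, one option is a cross-tabulation of the anomaly flag against a top-5% risk flag. The sketch below uses a made-up table (the `toy` frame, its values, and the `high_risk` column are hypothetical) and the same 0.95 quantile cutoff that the plot in this notebook uses.

```python
import pandas as pd

# Hypothetical facilities: risk scores and anomaly flags are invented.
toy = pd.DataFrame({
    'facility_id': ['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10'],
    'risk_score': [10, 12, 15, 18, 20, 22, 25, 30, 35, 90],
    'anomaly': [0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
})

# Flag facilities at or above the 95th percentile of risk score.
q95 = toy['risk_score'].quantile(0.95)
toy['high_risk'] = (toy['risk_score'] >= q95).astype(int)

# Rows: anomaly flag; columns: high-risk flag.
overlap = pd.crosstab(toy['anomaly'], toy['high_risk'],
                      rownames=['anomaly'], colnames=['high_risk'])
print(overlap)

# "Quiet" anomalies: flagged as unusual but below the high-risk cutoff.
quiet = toy[(toy['anomaly'] == 1) & (toy['high_risk'] == 0)]
print(quiet['facility_id'].tolist())
```

The off-diagonal cells are the interesting ones: anomalies that a risk ranking alone would miss, and high-risk facilities that the forest considers typical.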

In [None]:
df.to_csv(os.path.join(PROC, 'facilities_with_risk_and_anomaly.csv'), index=False)
print('Saved facilities_with_risk_and_anomaly.csv')

## Distribution: risk scores with top 5% and anomalies

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
q95 = df['risk_score'].quantile(0.95)
normal = df[(df['anomaly'] == 0) & (df['risk_score'] < q95)]
top5 = df[df['risk_score'] >= q95]
anoms = df[df['anomaly'] == 1]
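# Shared bin edges so bar heights in the two histograms are directly comparable.
bins = np.histogram_bin_edges(df['risk_score'].dropna(), bins=30)
ax.hist(normal['risk_score'], bins=bins, color='steelblue', alpha=0.7, label='Normal')
ax.hist(top5['risk_score'], bins=bins, color='coral', alpha=0.8, label='Top 5% risk')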
ax.scatter(anoms['risk_score'], [0]*len(anoms), color='black', s=80, zorder=5, marker='^', label='Anomalies')
ax.set_xlabel('Risk score')
ax.set_ylabel('Count')
ax.set_title('Risk score distribution: top 5% and anomalies highlighted')
ax.legend()
plt.tight_layout()
plt.savefig(os.path.join(OUT, 'figures', 'risk_distribution_with_anomalies.png'), dpi=150)
plt.show()