# 03: M1/S1 - Unsupervised Anomaly Detection (Token Level)

**Goal:** Detect subjective words (LJMPNIK) with unsupervised anomaly detectors: Mahalanobis distance, Isolation Forest, and One-Class SVM (OCSVM).

**Methodology:**
1. **Training:** Fit on purely neutral tokens (L0) from the `gold` dataset.
2. **Validation:** Score mixed data (L0 + L1) to find the optimal decision threshold.
3. **Testing:** Evaluate on held-out mixed data (document-level split).

**Note:** Uses the new `data_splitting` module to prevent data leakage between splits.
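
The train-on-neutral idea can be illustrated with a minimal NumPy sketch of Mahalanobis scoring. This is not the project's `models` implementation: the helper names `fit_mahalanobis` and `mahalanobis_scores`, the regularization constant, and the synthetic 8-dimensional data are all assumptions made for illustration.

```python
import numpy as np

def fit_mahalanobis(X_neutral, reg=1e-6):
    """Estimate mean and regularized inverse covariance from neutral samples only."""
    mu = X_neutral.mean(axis=0)
    cov = np.cov(X_neutral, rowvar=False) + reg * np.eye(X_neutral.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_scores(X, mu, cov_inv):
    """Squared Mahalanobis distance per row; larger means more anomalous."""
    diff = X - mu
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Synthetic stand-ins for token embeddings (hypothetical data, not the thesis data)
rng = np.random.default_rng(0)
X_neutral = rng.normal(0.0, 1.0, size=(500, 8))     # plays the role of L0
X_subjective = rng.normal(3.0, 1.0, size=(50, 8))   # plays the role of L1

mu, cov_inv = fit_mahalanobis(X_neutral)            # fit on neutral data only
neutral_scores = mahalanobis_scores(X_neutral, mu, cov_inv)
subjective_scores = mahalanobis_scores(X_subjective, mu, cov_inv)
print(f"mean neutral score:    {neutral_scores.mean():.2f}")
print(f"mean subjective score: {subjective_scores.mean():.2f}")
```

Because the detector only ever sees neutral samples, anything far from that distribution receives a high score; labels are needed only later, for threshold selection and evaluation.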


## 1. Setup & Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from itables import show
import sys
import os

# Auto-reload modules for development
%load_ext autoreload
%autoreload 2
%matplotlib inline

# Add src to path
current_dir = os.getcwd()
src_dir = os.path.abspath(os.path.join(current_dir, '..', 'src'))
if src_dir not in sys.path:
    sys.path.append(src_dir)

# Import custom modules
import config
import data_splitting
import models
import visualization
import experiments
import evaluation


# Setup visualization style
visualization.setup_style()

print(f"✅ Setup complete. Data dir: {config.DATA_DIR}")


⚙️ Configuration loaded. Device: cpu
✅ Setup complete. Data dir: C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data


## 2. Data Check

In [5]:
# Load sample data (Aggressive filter) just to check stats

data_sample = data_splitting.get_unsupervised_splits(
    scenario='baseline', 
    level='token', 
    filter_type='aggressive'
)

print(f"🔹 TRAIN set (Neutral only): {data_sample['X_train'].shape}")
print(f"🔹 VAL set (Mixed):          {data_sample['X_val'].shape}")
print(f"🔹 TEST set (Mixed):         {data_sample['X_test'].shape}")

# Verify Train contains only L0
train_anomalies = data_sample['y_train'].sum()
print(f"⚠️ Anomalies in Train: {train_anomalies} (Should be 0)")


2026-02-03 20:56:57,097 - INFO - 🔄 Preparing UNSUPERVISED splits for scenario: baseline (Training strictly on L0)
2026-02-03 20:56:57,097 - INFO - 📊 Preparing scenario: baseline (token level, aggressive filter)
2026-02-03 20:56:57,616 - INFO - ✅ Loaded 17557 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\gold_tokens.pkl
2026-02-03 20:57:00,073 - INFO - ✅ Loaded 78991 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\silver_tokens.pkl
2026-02-03 20:57:00,131 - INFO - Splitting 520 documents: 104 test, 41 val, 375 train
2026-02-03 20:57:00,139 - INFO - ✅ Document-level split completed:
2026-02-03 20:57:00,141 - INFO -    Train: 376 docs, 2585 samples
2026-02-03 20:57:00,142 - INFO -    Val:   41 docs, 270 samples
2026-02-03 20:57:00,143 - INFO -    Test:  103 docs, 741 samples
2026-02-03 20:57:00,146 - INFO -    ✓ No document leakage detected between splits
2026-02-03 20:57:00,148 - INFO - ✅ Scen

🔹 TRAIN set (Neutral only): (900, 768)
🔹 VAL set (Mixed):          (270, 768)
🔹 TEST set (Mixed):         (741, 768)
⚠️ Anomalies in Train: 0 (Should be 0)
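
The log above reports a document-level split with no leakage between splits. A minimal sketch of how such a split can be built is shown here. It is not the `data_splitting` implementation: the function name `split_by_document`, the split fractions, and the toy document IDs are assumptions made for illustration.

```python
import numpy as np

def split_by_document(doc_ids, test_frac=0.2, val_frac=0.1, seed=42):
    """Assign whole documents to train/val/test and return boolean sample masks."""
    doc_ids = np.asarray(doc_ids)
    unique_docs = np.unique(doc_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_docs)  # shuffle documents, not individual tokens

    n_test = int(round(len(unique_docs) * test_frac))
    n_val = int(round(len(unique_docs) * val_frac))
    test_docs = unique_docs[:n_test]
    val_docs = unique_docs[n_test:n_test + n_val]

    test_mask = np.isin(doc_ids, test_docs)
    val_mask = np.isin(doc_ids, val_docs)
    train_mask = ~(test_mask | val_mask)
    return train_mask, val_mask, test_mask

# Toy data: 20 documents with 5 tokens each (hypothetical)
doc_ids = np.repeat(np.arange(20), 5)
train_mask, val_mask, test_mask = split_by_document(doc_ids)

# Leakage check: no document ID may appear in more than one split
train_docs = set(doc_ids[train_mask])
val_docs = set(doc_ids[val_mask])
test_docs = set(doc_ids[test_mask])
leak_free = not (train_docs & val_docs or train_docs & test_docs or val_docs & test_docs)
print(f"train/val/test samples: {train_mask.sum()}/{val_mask.sum()}/{test_mask.sum()}")
print(f"leak-free: {leak_free}")
```

Splitting by document rather than by token matters because tokens from the same document share context; a token-level random split would let near-duplicate contexts appear on both sides of the train/test boundary.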


## 3. Experimental Loop

In [3]:
# Define scenarios to run

scenarios = []
filters = ['aggressive', 'mild', 'none']
models_list = ['Mahalanobis', 'IsolationForest', 'OCSVM']

for m in models_list:
    for f in filters:
        scenarios.append({
            'model': m,
            'filter': f,
            'level': 'token' # This notebook is S1 (Token)
        })

print(f"🚀 Defined {len(scenarios)} scenarios.")


🚀 Defined 9 scenarios.


In [4]:
# RUN EXPERIMENTS (This uses the new src.experiments module)

df_results = experiments.run_unsupervised_benchmark(scenarios)

# Save results
save_path = config.RESULTS_DIR / 'M1_S1_results.csv'
df_results.to_csv(save_path, index=False)
print(f"💾 Results saved to {save_path}")


Running M1 Experiments:   0%|          | 0/9 [00:00<?, ?it/s]
2026-02-03 20:56:05,016 - INFO - 🔄 Preparing UNSUPERVISED splits for scenario: baseline (Training strictly on L0)
2026-02-03 20:56:05,016 - INFO - 📊 Preparing scenario: baseline (token level, aggressive filter)
2026-02-03 20:56:05,449 - INFO - ✅ Loaded 17557 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\gold_tokens.pkl
2026-02-03 20:56:08,050 - INFO - ✅ Loaded 78991 rows from C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\data\processed\silver_tokens.pkl
2026-02-03 20:56:08,103 - INFO - Splitting 520 documents: 104 test, 41 val, 375 train
2026-02-03 20:56:08,114 - INFO - ✅ Document-level split completed:
2026-02-03 20:56:08,114 - INFO -    Train: 376 docs, 2585 samples
2026-02-03 20:56:08,114 - INFO -    Val:   41 docs, 270 samples
2026-02-03 20:56:08,114 - INFO -    Test:  103 docs, 741 samples
2026-02-03 20:56:08,114 - INFO -    ✓ No document leakage dete

💾 Results saved to C:\Users\dobes\Documents\UniversityCodingProject\ThesisCoding\results\M1_S1_results.csv





## 4. Results Analysis

In [None]:
# Show interactive table
show(df_results.sort_values('auprc', ascending=False), classes="display compact")

In [None]:
# Barplot of AUPRC Scores
plt.figure(figsize=(10, 6))
sns.barplot(data=df_results, x='model', y='auprc', hue='filter', palette='viridis')
plt.title('M1/S1 Performance (AUPRC) by Model and Filter')
plt.ylim(0, 1.0)
plt.legend(title='POS Filter', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
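
For reading the AUPRC values, it may help to see how an area under the precision-recall curve can be computed. The sketch below uses the average-precision formulation; whether `experiments` computes the `auprc` column exactly this way is an assumption, and the function name `average_precision` and the toy labels are hypothetical. Tied scores are not grouped here, so results can differ slightly from library implementations when ties occur.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind='stable')   # highest score first
    y_sorted = y_true[order]
    tp_cum = np.cumsum(y_sorted)                 # true positives up to each rank
    precision = tp_cum / np.arange(1, len(y_sorted) + 1)
    return float(np.sum(precision * y_sorted) / y_sorted.sum())

# Toy example: higher score = more anomalous (hypothetical values)
y_toy = np.array([1, 0, 1, 0, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.3, 0.1])

ap = average_precision(y_toy, scores_toy)
prevalence = y_toy.mean()  # expected score of a random ranking
print(f"AP: {ap:.4f}, random baseline: {prevalence:.4f}")
```

The positive-class prevalence is the natural baseline for this metric, so AUPRC values should be judged against the share of L1 tokens in the evaluated split rather than against 0.5.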


## 5. Deep Dive: Best Model Analysis

In [None]:
# 1. Get Winner
best_run = df_results.sort_values('auprc', ascending=False).iloc[0]
print(f"🏆 WINNER: {best_run['model']} ({best_run['filter']})")
print(f"   AUPRC: {best_run['auprc']:.4f}")
print(f"   F1:    {best_run['f1']:.4f}")

# 2. Reload Data for Winner
data_best = data_splitting.get_train_val_test_splits(
    scenario='baseline', 
    level='token', 
    filter_type=best_run['filter']
)

# 3. Retrain (to get the object)
model = models.get_unsupervised_model(best_run['model'])
model.fit(data_best['X_train'][data_best['y_train'] == 0])

# 4. Get Scores
scores_val = model.decision_function(data_best['X_val'])
scores_test = model.decision_function(data_best['X_test'])

# 5. Optimal Threshold (from Val)
threshold, _ = evaluation.find_optimal_threshold(data_best['y_val'], scores_val, metric='f1')
print(f"⚙️ Optimal Threshold (from Val): {threshold:.4f}")
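
The body of `evaluation.find_optimal_threshold` is not shown in this notebook. One way such a helper could work is sketched below, assuming it scans candidate thresholds and keeps the one with the highest validation F1. The names `f1_at_threshold` and `best_f1_threshold` and the toy scores are hypothetical, and the sketch assumes higher scores mean more anomalous.

```python
import numpy as np

def f1_at_threshold(y_true, scores, threshold):
    """F1 of the rule 'predict anomaly when score > threshold'."""
    y_pred = scores > threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold; return (threshold, best F1)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    candidates = np.unique(scores)
    f1_values = [f1_at_threshold(y_true, scores, t) for t in candidates]
    best = int(np.argmax(f1_values))
    return float(candidates[best]), f1_values[best]

# Toy validation set (hypothetical values)
y_val_toy = np.array([0, 0, 0, 1, 0, 1, 1])
scores_val_toy = np.array([0.1, 0.2, 0.35, 0.4, 0.5, 0.8, 0.9])

toy_threshold, toy_f1 = best_f1_threshold(y_val_toy, scores_val_toy)
print(f"threshold: {toy_threshold:.2f}, F1: {toy_f1:.4f}")
```

Selecting the threshold on validation data and then applying it unchanged to the test set keeps the reported test metrics free of threshold tuning.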


### Visualizations

In [None]:
# 1. Anomaly Score Histogram
visualization.plot_anomaly_histogram(
    scores_test, 
    data_best['y_test'], 
    threshold=threshold,
    title=f"Anomaly Scores: {best_run['model']} ({best_run['filter']})"
)


In [None]:
# 2. Precision-Recall Curve
visualization.plot_pr_curve(
    data_best['y_test'], 
    scores_test, 
    title=f"PR Curve: {best_run['model']}"
)

In [None]:
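# 3. Confusion Matrix
# Assumes the project's model wrappers return scores where higher = more anomalous.
# Raw scikit-learn decision_function values use the opposite convention (lower = more abnormal).
y_pred_test = (scores_test > threshold).astype(int)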
visualization.plot_confusion_matrix_heatmap(
    data_best['y_test'], 
    y_pred_test, 
    normalize=False
)

## 6. Qualitative Analysis (Export)

In [None]:
# Create comparison dataframe
df_qual = pd.DataFrame({
    'token_text': data_best['meta_test']['analyzed_token'],
    'full_text': data_best['meta_test']['text'],
    'true_label': data_best['y_test'],
    'pred_label': y_pred_test,
    'anomaly_score': scores_test,
    'document_id': data_best['meta_test']['document_id']
})

# Add Error Category
conditions = [
    (df_qual.true_label == 1) & (df_qual.pred_label == 1),
    (df_qual.true_label == 0) & (df_qual.pred_label == 0),
    (df_qual.true_label == 0) & (df_qual.pred_label == 1),
    (df_qual.true_label == 1) & (df_qual.pred_label == 0)
]
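# Explicit string default: the implicit default (integer 0) shares no dtype with the
# string labels, which can raise a TypeError on recent NumPy versions.
df_qual['category'] = np.select(conditions, ['TP', 'TN', 'FP', 'FN'], default='UNK')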

# Save
qual_path = config.RESULTS_DIR / f"M1_S1_qualitative_{best_run['model']}.csv"
df_qual.to_csv(qual_path, index=False)
print(f"📝 Qualitative analysis saved to: {qual_path}")

# Show top False Positives (High score, but neutral)
print("\n❌ Top False Positives (Model thought it was anomaly, but it's not):")
show(df_qual[df_qual.category == 'FP'].sort_values('anomaly_score', ascending=False).head(10))