# 03. Anomaly Detection & Theft Modeling (Champion-Challenger)

## Overview
In this notebook, we implement a **Champion-Challenger** framework to select the best electricity theft detection model. We train two supervised models (Champion: XGBoost, Challenger: Random Forest) and automatically keep whichever achieves the higher AUPRC (Area Under the Precision-Recall Curve) on the validation set. AUPRC focuses on the positive (theft) class, which keeps it informative when positives are rare.

We also train an Unsupervised Isolation Forest as a safety net for detecting novel anomaly patterns.
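
To make the selection metric concrete, here is a minimal sketch of average precision computed by hand. This is the quantity `average_precision_score` reports as AUPRC: the precision at each ranked positive, averaged over all positives. The helper name `average_precision` and the toy labels and scores are assumptions for illustration, and the sketch assumes no two scores are tied.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming no tied scores."""
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank; each positive adds a recall step of 1 / n_pos.
            total += hits / rank
    return total / n_pos


# Toy labels and scores, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"Toy average precision: {toy_ap:.4f}")
```

A model that ranks every theft case above every normal case scores 1.0, while a random ranking scores roughly the theft rate, so the baseline depends on class balance.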

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_curve, auc, confusion_matrix
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, IsolationForest

# Set Style
plt.style.use('dark_background')
pd.set_option('display.max_columns', None)

PROJECT_ROOT = Path('..')
DATA_DIR = PROJECT_ROOT / 'data'
ARTIFACTS_DIR = PROJECT_ROOT / 'artifacts'
ARTIFACTS_DIR.mkdir(exist_ok=True)

## 1. Load Preprocessed Data
Loading the feature-engineered dataset from the pipeline.

In [None]:
df_features = pd.read_csv(ARTIFACTS_DIR / 'preprocessed.csv')
print(f"Data Shape: {df_features.shape}")
df_features.head()

## 2. Champion-Challenger Training
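We split the data by row position: the first 80% of rows are used for training and the last 20% for validation. This acts as a temporal split (training on the past, validating on the future) and prevents leakage only if the rows of `preprocessed.csv` are in chronological order.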

In [None]:
# Prepare Data
X = df_features.drop(columns=['CONS_NO', 'FLAG'])
y = df_features['FLAG']

# Temporal Split (First 80% Train, Last 20% Valid)
train_size = int(len(df_features) * 0.8)
X_train, X_val = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_val = y.iloc[:train_size], y.iloc[train_size:]

print(f"Train Samples: {len(X_train)}, Validation Samples: {len(X_val)}")
print(f"Theft Rate Train: {y_train.mean():.2%}, Val: {y_val.mean():.2%}")

### Train Champion: XGBoost

In [None]:
model_xgb = XGBClassifier(
    n_estimators=100, 
    max_depth=6, 
    learning_rate=0.1, 
    eval_metric='logloss',
    random_state=42
)
model_xgb.fit(X_train, y_train)

y_pred_xgb = model_xgb.predict_proba(X_val)[:, 1]
auprc_xgb = average_precision_score(y_val, y_pred_xgb)
print(f"Champion (XGBoost) AUPRC: {auprc_xgb:.4f}")

### Train Challenger: Random Forest (Surgical Ensemble)

In [None]:
model_rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
model_rf.fit(X_train, y_train)

y_pred_rf = model_rf.predict_proba(X_val)[:, 1]
auprc_rf = average_precision_score(y_val, y_pred_rf)
print(f"Challenger (Random Forest) AUPRC: {auprc_rf:.4f}")

## 3. Model Arena: The Conclusion
Comparing performance to select the winner.

In [None]:
plt.figure(figsize=(10, 6))

# Data for both models
precision_xgb, recall_xgb, _ = precision_recall_curve(y_val, y_pred_xgb)
precision_rf, recall_rf, _ = precision_recall_curve(y_val, y_pred_rf)

# The legend reports AUPRC (average precision), not ROC AUC
plt.plot(recall_xgb, precision_xgb, label=f'XGBoost (AUPRC={auprc_xgb:.3f})', color='cyan')
plt.plot(recall_rf, precision_rf, label=f'Random Forest (AUPRC={auprc_rf:.3f})', color='magenta')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('The Model Arena: Precision-Recall Comparison')
plt.legend()
plt.grid(True, alpha=0.2)
plt.show()
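
The selection cell that follows keeps the incumbent unless the challenger scores a strictly higher AUPRC. As a sketch only, the same rule can be written as a small helper that also handles more than one challenger. The function name `select_champion`, the `min_gain` margin, and the example scores are hypothetical and not part of the notebook; with `min_gain=0.0` the rule matches the notebook's comparison.

```python
def select_champion(auprc_by_model, incumbent, min_gain=0.0):
    """Return the name of the model to keep.

    The best challenger replaces the incumbent only if its AUPRC exceeds
    the incumbent's by more than `min_gain`; ties keep the incumbent.
    """
    challengers = {
        name: score for name, score in auprc_by_model.items() if name != incumbent
    }
    if not challengers:
        return incumbent

    top = max(challengers, key=challengers.get)
    if challengers[top] > auprc_by_model[incumbent] + min_gain:
        return top
    return incumbent


# Illustrative scores only, not results from this notebook.
example_scores = {"xgboost": 0.61, "random_forest": 0.64}
winner = select_champion(example_scores, incumbent="xgboost")
print(f"Selected model: {winner}")
```

A positive `min_gain` is one way to avoid swapping models over differences that may be noise on a single validation split.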

In [None]:
if auprc_rf > auprc_xgb:
    print("🏆 NEW CHAMPION: Random Forest Wins!")
    best_model = model_rf
else:
    print("🏆 CHAMPION RETAINED: XGBoost Wins!")
    best_model = model_xgb

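# Save the winner to the main model path. The file name stays 'model_xgb.joblib'
# even when Random Forest wins, so check the model type after loading.
joblib.dump(best_model, ARTIFACTS_DIR / 'model_xgb.joblib')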
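
The overview mentions an unsupervised Isolation Forest as a safety net for novel anomaly patterns. The following is a minimal sketch of how such a detector could be fit and scored. It runs on synthetic data so that it is self-contained; the arrays, the `contamination` value, and the variable names are assumptions rather than the notebook's settings. In the notebook, the model would presumably be fit on `X_train` instead.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: 500 "normal" rows plus 10 shifted outliers.
rng = np.random.default_rng(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
X_odd = rng.normal(loc=6.0, scale=1.0, size=(10, 4))
X_demo = np.vstack([X_normal, X_odd])

# contamination is an assumed share of anomalies, not a tuned value.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
iso.fit(X_demo)

# score_samples returns lower values for more abnormal rows,
# so negate it to get a score where higher means more anomalous.
anomaly_score = -iso.score_samples(X_demo)

# predict returns -1 for anomalies and 1 for inliers.
is_anomaly = iso.predict(X_demo) == -1
print(f"Flagged {is_anomaly.sum()} of {len(X_demo)} rows as anomalous")
```

Because no labels are used, the flagged rows are candidates for review rather than confirmed theft, which is why this model complements the supervised winner instead of competing with it.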