# Complete Robust Anomaly Detection System
## Self-Contained End-to-End Workflow

**This notebook is 100% self-contained - no external files needed!**

Features:
- Multiple anomaly detection algorithms (Gaussian, Isolation Forest, One-Class SVM, LOF, Elliptic Envelope)
- Class-imbalance handling (SMOTE, ADASYN, Borderline-SMOTE, SMOTE-Tomek, SMOTE-ENN, random undersampling)
- Per-feature statistical analysis (Mann-Whitney U test, Cohen's d effect size)
- Model comparison and evaluation with 10 metrics
- Visualizations (confusion matrix, ROC and precision-recall curves, model and feature comparisons)
- Model persistence with joblib for later deployment

---

## 1. Install Required Packages

In [None]:
# Install required packages (run this once)
!pip install -q numpy pandas scipy scikit-learn imbalanced-learn matplotlib seaborn

## 2. Import Libraries and Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from scipy.stats import (shapiro, normaltest, ttest_ind, mannwhitneyu, 
                         pearsonr, spearmanr, multivariate_normal, zscore)

# Preprocessing
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Model selection
from sklearn.model_selection import train_test_split

# Anomaly detection models
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

# Imbalance handling
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek, SMOTEENN

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    roc_curve, precision_recall_curve, average_precision_score,
    matthews_corrcoef, cohen_kappa_score, balanced_accuracy_score
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)
%matplotlib inline

print("✅ All libraries imported successfully!")
print(f"   NumPy: {np.__version__}")
print(f"   Pandas: {pd.__version__}")

## 3. Define All Helper Functions

All the anomaly detection functionality is defined here in one place.

In [None]:
# ============================================================================
# STATISTICAL ANALYSIS FUNCTIONS
# ============================================================================

def cohens_d(group1, group2):
    """Calculate Cohen's d for effect size"""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (np.mean(group1) - np.mean(group2)) / pooled_std

def interpret_cohens_d(d):
    """Interpret Cohen's d effect size"""
    abs_d = abs(d)
    if abs_d < 0.2:
        return 'negligible'
    elif abs_d < 0.5:
        return 'small'
    elif abs_d < 0.8:
        return 'medium'
    else:
        return 'large'

def perform_statistical_analysis(X, y, feature_names, alpha=0.05):
    """Perform comprehensive statistical analysis"""
    print("\n" + "="*80)
    print("COMPREHENSIVE STATISTICAL ANALYSIS")
    print("="*80)
    
    # Dataset overview
    print(f"\nDataset Shape: {X.shape}")
    print(f"Number of Features: {X.shape[1]}")
    print(f"Number of Samples: {X.shape[0]}")
    
    # Target distribution
    unique, counts = np.unique(y, return_counts=True)
    print(f"\nTarget Distribution:")
    for val, count in zip(unique, counts):
        pct = count / len(y) * 100
        print(f"  Class {val}: {count} ({pct:.2f}%)")
    
    imbalance_ratio = max(counts) / min(counts)
    print(f"  Imbalance Ratio: {imbalance_ratio:.2f}:1")
    
    if imbalance_ratio > 1.5:
        print("  ⚠️  WARNING: Significant class imbalance detected!")
    
    # Feature significance tests
    significance_results = []
    
    for i, fname in enumerate(feature_names):
        class_0 = X[y == 0, i]
        class_1 = X[y == 1, i]
        
        # Mann-Whitney U test
        stat_mw, p_mw = mannwhitneyu(class_0, class_1, alternative='two-sided')
        
        # Effect size
        d = cohens_d(class_0, class_1)
        
        significance_results.append({
            'feature': fname,
            'mannwhitney_p': p_mw,
            'cohens_d': d,
            'effect_size': interpret_cohens_d(d),
            'is_significant': p_mw < alpha
        })
    
    sig_df = pd.DataFrame(significance_results)
    
    print("\nTop 10 Most Significant Features:")
    top_sig = sig_df.nsmallest(10, 'mannwhitney_p')
    print(top_sig.to_string(index=False))
    
    n_significant = sig_df['is_significant'].sum()
    print(f"\nStatistically significant features: {n_significant}/{len(sig_df)} (p < {alpha})")
    
    return sig_df

print("✅ Statistical analysis functions loaded")
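
As a quick sanity check on the effect-size formula used by `cohens_d`, the sketch below works through a small case that can be verified by hand. The two groups are made-up values for illustration and are not drawn from the notebook's dataset.

```python
import numpy as np

# Two illustrative groups (hypothetical values, chosen so the arithmetic is easy)
group1 = np.array([2.0, 4.0, 6.0])   # mean 4, sample variance 4
group2 = np.array([1.0, 2.0, 3.0])   # mean 2, sample variance 1

n1, n2 = len(group1), len(group2)
var1 = np.var(group1, ddof=1)
var2 = np.var(group2, ddof=1)

# Pooled standard deviation: sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

# Cohen's d: difference in means expressed in pooled standard deviations
d = (group1.mean() - group2.mean()) / pooled_std
print(f"pooled_std = {pooled_std:.4f}, d = {d:.4f}")
```

Here the pooled standard deviation is sqrt(2.5) and d comes out at roughly 1.26, which `interpret_cohens_d` would label `'large'` because it is above the 0.8 cutoff.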

In [None]:
# ============================================================================
# PREPROCESSING FUNCTIONS
# ============================================================================

def preprocess_data(X_train, X_test, method='robust'):
    """Preprocess data with scaling"""
    scaler_map = {
        'standard': StandardScaler(),
        'robust': RobustScaler(),
        'minmax': MinMaxScaler()
    }
    
    scaler = scaler_map.get(method, RobustScaler())
    
    # Handle DataFrames
    if isinstance(X_train, pd.DataFrame):
        feature_names = X_train.columns.tolist()
        X_train = X_train.values
        X_test = X_test.values
    else:
        feature_names = [f'feature_{i}' for i in range(X_train.shape[1])]
    
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled, scaler, feature_names

def handle_imbalance(X_train, y_train, method='smote', random_state=42):
    """Handle class imbalance"""
    if method == 'none':
        return X_train, y_train
    
    print(f"\nApplying {method.upper()} for class imbalance...")
    print(f"Original distribution: {np.bincount(y_train)}")
    
    resamplers = {
        'smote': SMOTE(random_state=random_state),
        'adasyn': ADASYN(random_state=random_state),
        'borderline_smote': BorderlineSMOTE(random_state=random_state),
        'smote_tomek': SMOTETomek(random_state=random_state),
        'smote_enn': SMOTEENN(random_state=random_state),
        'undersample': RandomUnderSampler(random_state=random_state)
    }
    
    resampler = resamplers.get(method, SMOTE(random_state=random_state))
    X_resampled, y_resampled = resampler.fit_resample(X_train, y_train)
    
    print(f"Resampled distribution: {np.bincount(y_resampled)}")
    
    return X_resampled, y_resampled

print("✅ Preprocessing functions loaded")
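
To illustrate why `'robust'` is a reasonable default scaling method when the data may contain extreme values, the sketch below compares what `RobustScaler` and `StandardScaler` learn from a single feature with one outlier. The five data points are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value (illustrative data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

robust = RobustScaler().fit(x)
standard = StandardScaler().fit(x)

# RobustScaler centers on the median and scales by the interquartile range
print("median:", robust.center_[0], "IQR:", robust.scale_[0])
# StandardScaler centers on the mean, which the outlier pulls upward
print("mean:", standard.mean_[0])

# The four ordinary points stay in a narrow band; the outlier remains far out
x_robust = robust.transform(x).ravel()
print(x_robust)
```

The median (3) and interquartile range (2) are unaffected by the value 100, while the mean is dragged to 22, so the robust scaling keeps the ordinary points on a comparable scale.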

In [None]:
# ============================================================================
# ANOMALY DETECTION MODEL FUNCTIONS
# ============================================================================

def fit_gaussian_model(X_train, y_train):
    """Fit Gaussian anomaly detection model"""
    X_normal = X_train[y_train == 0]
    mu = np.mean(X_normal, axis=0)
    sigma = np.cov(X_normal, rowvar=False)
    sigma += np.eye(sigma.shape[0]) * 1e-6  # Regularization
    return {'mu': mu, 'sigma': sigma, 'type': 'gaussian'}

def predict_gaussian(X_test, model, epsilon=None):
    """Predict using Gaussian model"""
    mu = model['mu']
    sigma = model['sigma']
    probs = multivariate_normal(mean=mu, cov=sigma, allow_singular=True).pdf(X_test)
    
    if epsilon is None:
        epsilon = np.percentile(probs, 5)
    
    predictions = (probs < epsilon).astype(int)
    return predictions, probs

def select_gaussian_threshold(X_train, y_train, mu, sigma):
    """Select optimal threshold for Gaussian model"""
    probs = multivariate_normal(mean=mu, cov=sigma, allow_singular=True).pdf(X_train)
    
    best_epsilon = 0
    best_f1 = 0
    epsilons = np.linspace(np.min(probs), np.max(probs), 1000)
    
    for eps in epsilons:
        preds = (probs < eps).astype(int)
        f1 = f1_score(y_train, preds, zero_division=0)
        if f1 > best_f1:
            best_f1 = f1
            best_epsilon = eps
    
    return best_epsilon, best_f1

def fit_isolation_forest(X_train, y_train, random_state=42):
    """Fit Isolation Forest"""
    contamination = np.sum(y_train == 1) / len(y_train)
    contamination = max(0.01, min(0.5, contamination))
    
    model = IsolationForest(
        contamination=contamination,
        random_state=random_state,
        n_estimators=200
    )
    model.fit(X_train)
    return model

def fit_one_class_svm(X_train, y_train):
    """Fit One-Class SVM"""
    nu = np.sum(y_train == 1) / len(y_train)
    nu = max(0.01, min(0.5, nu))
    
    model = OneClassSVM(nu=nu, kernel='rbf', gamma='auto')
    model.fit(X_train)
    return model

def fit_lof(X_train, y_train):
    """Fit Local Outlier Factor"""
    contamination = np.sum(y_train == 1) / len(y_train)
    contamination = max(0.01, min(0.5, contamination))
    
    model = LocalOutlierFactor(
        n_neighbors=20,
        contamination=contamination,
        novelty=True
    )
    model.fit(X_train[y_train == 0])
    return model

def fit_elliptic_envelope(X_train, y_train, random_state=42):
    """Fit Elliptic Envelope"""
    contamination = np.sum(y_train == 1) / len(y_train)
    contamination = max(0.01, min(0.5, contamination))
    
    model = EllipticEnvelope(
        contamination=contamination,
        random_state=random_state
    )
    model.fit(X_train[y_train == 0])
    return model

print("✅ Anomaly detection model functions loaded")
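
The Gaussian detector flags a point when its density under the fitted normal distribution falls below a threshold. The sketch below shows the idea on 2-D toy data. As a variation that the notebook itself does not use, it thresholds the log-density (`logpdf`) instead of the raw density, on the assumption that this is numerically safer when there are many features and the raw densities become very small. The 5th-percentile threshold mirrors the fallback in `predict_gaussian`.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2))          # illustrative "normal" data

# Fit mean and covariance, with the same small ridge used in fit_gaussian_model
mu = X_normal.mean(axis=0)
sigma = np.cov(X_normal, rowvar=False) + np.eye(2) * 1e-6

dist = multivariate_normal(mean=mu, cov=sigma, allow_singular=True)

# Threshold at the 5th percentile of the training log-densities
log_eps = np.percentile(dist.logpdf(X_normal), 5)

X_query = np.array([[0.0, 0.0],    # near the center of the training data
                    [6.0, 6.0]])   # far from anything seen in training
is_anomaly = (dist.logpdf(X_query) < log_eps).astype(int)
print(is_anomaly)
```

The central point is kept as normal and the distant point is flagged, which is the same decision rule as `probs < epsilon` applied on a log scale.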

In [None]:
# ============================================================================
# EVALUATION FUNCTIONS
# ============================================================================

def evaluate_model(y_true, y_pred, y_prob=None):
    """Comprehensive model evaluation"""
    results = {}
    
    # Basic metrics
    results['accuracy'] = accuracy_score(y_true, y_pred)
    results['precision'] = precision_score(y_true, y_pred, zero_division=0)
    results['recall'] = recall_score(y_true, y_pred, zero_division=0)
    results['f1'] = f1_score(y_true, y_pred, zero_division=0)
    results['balanced_accuracy'] = balanced_accuracy_score(y_true, y_pred)
    results['matthews_corrcoef'] = matthews_corrcoef(y_true, y_pred)
    results['cohen_kappa'] = cohen_kappa_score(y_true, y_pred)
    
    # Confusion matrix metrics
    cm = confusion_matrix(y_true, y_pred)
    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        results['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
    
    # Probabilistic metrics (y_prob: higher score = more likely anomaly)
    if y_prob is not None:
        try:
            results['roc_auc'] = roc_auc_score(y_true, y_prob)
            results['pr_auc'] = average_precision_score(y_true, y_prob)
        except ValueError:
            # Raised, for example, when y_true contains only one class
            results['roc_auc'] = None
            results['pr_auc'] = None
    else:
        results['roc_auc'] = None
        results['pr_auc'] = None
    
    return results

def plot_confusion_matrix(y_true, y_pred, title='Confusion Matrix'):
    """Plot confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
               xticklabels=['Normal', 'Anomaly'],
               yticklabels=['Normal', 'Anomaly'],
               cbar_kws={'label': 'Count'})
    plt.title(title, fontsize=14, fontweight='bold')
    plt.ylabel('True Label', fontsize=12)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.tight_layout()
    plt.show()

def plot_roc_pr_curves(y_true, y_prob, model_name='Model'):
    """Plot ROC and PR curves"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    roc_auc = roc_auc_score(y_true, y_prob)
    
    ax1.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC (AUC = {roc_auc:.4f})')
    ax1.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    ax1.set_xlabel('False Positive Rate', fontsize=12)
    ax1.set_ylabel('True Positive Rate', fontsize=12)
    ax1.set_title(f'ROC Curve - {model_name}', fontsize=14, fontweight='bold')
    ax1.legend()
    ax1.grid(alpha=0.3)
    
    # PR Curve
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    pr_auc = average_precision_score(y_true, y_prob)
    
    ax2.plot(recall, precision, color='blue', lw=2, label=f'PR (AUC = {pr_auc:.4f})')
    ax2.set_xlabel('Recall', fontsize=12)
    ax2.set_ylabel('Precision', fontsize=12)
    ax2.set_title(f'Precision-Recall Curve - {model_name}', fontsize=14, fontweight='bold')
    ax2.legend()
    ax2.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()

print("✅ Evaluation functions loaded")
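
`roc_auc_score` and `average_precision_score` treat a larger score as stronger evidence for the positive class, which here is the anomaly. A density has the opposite orientation, since anomalies get low values. The toy example below, with made-up labels and densities, shows how much the orientation matters.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1])                 # last sample is the anomaly
density = np.array([0.9, 0.8, 0.7, 0.1])        # the anomaly has the lowest density

auc_raw = roc_auc_score(y_true, density)        # density passed as-is
auc_flipped = roc_auc_score(y_true, -density)   # negated: higher = more anomalous
print(auc_raw, auc_flipped)
```

With the raw density the perfectly separable anomaly yields an AUC of 0.0, and negating the score yields 1.0, so any score passed as `y_prob` to `evaluate_model` or `plot_roc_pr_curves` should increase with anomalousness.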

In [None]:
# ============================================================================
# MAIN PIPELINE FUNCTION
# ============================================================================

def run_anomaly_detection_pipeline(X, y, 
                                   test_size=0.25,
                                   scaling_method='robust',
                                   imbalance_method='smote',
                                   random_state=42):
    """
    Complete anomaly detection pipeline
    
    Parameters:
    -----------
    X : array-like or DataFrame
        Feature matrix
    y : array-like
        Target vector (0=normal, 1=anomaly)
    test_size : float
        Proportion for test set
    scaling_method : str
        'standard', 'robust', or 'minmax'
    imbalance_method : str
        'smote', 'adasyn', 'borderline_smote', 'smote_tomek', 'smote_enn', 'undersample', 'none'
    random_state : int
        Random seed
    
    Returns:
    --------
    results : dict
        Complete results including models, predictions, and evaluations
    """
    
    print("\n" + "="*80)
    print("ROBUST ANOMALY DETECTION PIPELINE")
    print("="*80)
    
    # Step 1: Train-Test Split
    print("\n[1/6] Splitting data...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    print(f"  Training: {len(X_train)}, Test: {len(X_test)}")
    
    # Step 2: Statistical Analysis
    print("\n[2/6] Statistical analysis...")
    X_train_arr = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
    feature_names = X_train.columns.tolist() if isinstance(X_train, pd.DataFrame) else \
                   [f'feature_{i}' for i in range(X_train_arr.shape[1])]
    
    sig_df = perform_statistical_analysis(X_train_arr, y_train, feature_names)
    
    # Step 3: Preprocessing
    print("\n[3/6] Preprocessing...")
    X_train_scaled, X_test_scaled, scaler, feature_names = preprocess_data(
        X_train, X_test, method=scaling_method
    )
    
    # Step 4: Handle Imbalance
    print("\n[4/6] Handling imbalance...")
    X_train_balanced, y_train_balanced = handle_imbalance(
        X_train_scaled, y_train, method=imbalance_method, random_state=random_state
    )
    
    # Step 5: Train Models
    print("\n[5/6] Training models...")
    models = {}
    
    # Gaussian
    print("  Training Gaussian...")
    gaussian_model = fit_gaussian_model(X_train_balanced, y_train_balanced)
    epsilon, _ = select_gaussian_threshold(
        X_train_balanced, y_train_balanced, 
        gaussian_model['mu'], gaussian_model['sigma']
    )
    gaussian_model['epsilon'] = epsilon
    models['gaussian'] = gaussian_model
    
    # Isolation Forest
    print("  Training Isolation Forest...")
    models['isolation_forest'] = fit_isolation_forest(
        X_train_balanced, y_train_balanced, random_state
    )
    
    # One-Class SVM
    print("  Training One-Class SVM...")
    models['one_class_svm'] = fit_one_class_svm(X_train_balanced, y_train_balanced)
    
    # LOF
    print("  Training LOF...")
    models['lof'] = fit_lof(X_train_balanced, y_train_balanced)
    
    # Elliptic Envelope
    print("  Training Elliptic Envelope...")
    models['elliptic_envelope'] = fit_elliptic_envelope(
        X_train_balanced, y_train_balanced, random_state
    )
    
    # Step 6: Predictions and Evaluation
    print("\n[6/6] Evaluating models...")
    predictions = {}
    probabilities = {}
    results_list = []
    
    for model_name, model in models.items():
        if model_name == 'gaussian':
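            preds, probs = predict_gaussian(
                X_test_scaled, model, model['epsilon']
            )
            predictions[model_name] = preds
            # Low density means anomalous, so negate the density to get a
            # score where higher = more anomalous (what the metrics expect)
            scores = -probs
            probabilities[model_name] = scores
            eval_results = evaluate_model(y_test, preds, scores)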
        else:
            preds = model.predict(X_test_scaled)
            preds = (preds == -1).astype(int)
            predictions[model_name] = preds
            
            # Get scores
            if hasattr(model, 'score_samples'):
                scores = -model.score_samples(X_test_scaled)
                probabilities[model_name] = scores
                eval_results = evaluate_model(y_test, preds, scores)
            elif hasattr(model, 'decision_function'):
                scores = -model.decision_function(X_test_scaled)
                probabilities[model_name] = scores
                eval_results = evaluate_model(y_test, preds, scores)
            else:
                eval_results = evaluate_model(y_test, preds)
        
        eval_results['model'] = model_name
        results_list.append(eval_results)
    
    # Create comparison DataFrame
    comparison_df = pd.DataFrame(results_list)
    
    # Select best model
    best_idx = comparison_df['f1'].idxmax()
    best_model_name = comparison_df.loc[best_idx, 'model']
    
    print("\n" + "="*80)
    print("MODEL COMPARISON")
    print("="*80)
    display_cols = ['model', 'precision', 'recall', 'f1', 'balanced_accuracy', 'roc_auc']
    display_cols = [c for c in display_cols if c in comparison_df.columns]
    print(comparison_df[display_cols].to_string(index=False))
    print(f"\n🏆 Best Model: {best_model_name.upper()} (F1: {comparison_df.loc[best_idx, 'f1']:.4f})")
    
    # Package results
    results = {
        'models': models,
        'scaler': scaler,
        'predictions': predictions,
        'probabilities': probabilities,
        'comparison': comparison_df,
        'best_model': best_model_name,
        'y_test': y_test,
        'y_pred_best': predictions[best_model_name],
        'y_prob_best': probabilities.get(best_model_name),
        'significance_analysis': sig_df,
        'feature_names': feature_names
    }
    
    print("\n✅ Pipeline complete!")
    return results

print("✅ Main pipeline function loaded")
print("\n" + "="*80)
print("ALL FUNCTIONS LOADED - READY TO USE!")
print("="*80)
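
The evaluation step of the pipeline relies on the scikit-learn convention that outlier detectors return `+1` for inliers and `-1` for outliers from `predict`, and that `score_samples` is higher for inliers. The sketch below shows both conversions on toy data with an `IsolationForest`. The data and the `contamination=0.05` setting are assumptions made for the example only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))             # illustrative training data

model = IsolationForest(contamination=0.05, random_state=0).fit(X_train)

X_query = np.array([[0.0, 0.0],                 # typical point
                    [8.0, 8.0]])                # far from the training cloud
raw = model.predict(X_query)                    # values in {+1, -1}
labels = (raw == -1).astype(int)                # 1 = anomaly, 0 = normal

# score_samples is higher for inliers, so negate it for an anomaly score
anomaly_score = -model.score_samples(X_query)
print(raw, labels, anomaly_score)
```

After the conversion the labels match the notebook's target encoding (0 = normal, 1 = anomaly), and the negated score is larger for the distant point.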

---
## 4. Generate or Load Data

In [None]:
from sklearn.datasets import make_classification

# Generate synthetic anomaly detection dataset
print("Generating synthetic dataset...")

X, y = make_classification(
    n_samples=2000,
    n_features=25,
    n_informative=18,
    n_redundant=4,
    n_classes=2,
    weights=[0.88, 0.12],  # 12% anomalies
    flip_y=0.03,
    class_sep=0.8,
    random_state=RANDOM_STATE
)

# Create feature names
feature_names = [
    'temperature', 'humidity', 'wind_speed', 'pressure',
    'visibility', 'precipitation', 'cloud_cover', 'uv_index',
    'air_quality_pm25', 'air_quality_pm10', 'air_quality_o3',
    'noise_level', 'traffic_density', 'pedestrian_count',
    'vegetation_index', 'soil_moisture', 'solar_radiation',
    'gas_sensor_1', 'gas_sensor_2', 'gas_sensor_3',
    'thermal_sensor_1', 'thermal_sensor_2', 'motion_sensor',
    'vibration_sensor', 'light_sensor'
]

X_df = pd.DataFrame(X, columns=feature_names)

print(f"✅ Dataset created!")
print(f"   Shape: {X_df.shape}")
print(f"   Normal: {np.sum(y==0)} ({np.sum(y==0)/len(y)*100:.1f}%)")
print(f"   Anomaly: {np.sum(y==1)} ({np.sum(y==1)/len(y)*100:.1f}%)")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

unique, counts = np.unique(y, return_counts=True)
axes[0].bar(['Normal', 'Anomaly'], counts, color=['#2ecc71', '#e74c3c'])
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution', fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

percentages = counts / len(y) * 100
axes[1].pie(percentages, labels=['Normal', 'Anomaly'], autopct='%1.1f%%',
           colors=['#2ecc71', '#e74c3c'], startangle=90)
axes[1].set_title('Class Distribution (%)', fontweight='bold')

plt.tight_layout()
plt.show()

---
## 5. Run the Complete Pipeline

In [None]:
# Run the complete anomaly detection pipeline
results = run_anomaly_detection_pipeline(
    X=X_df,
    y=y,
    test_size=0.25,
    scaling_method='robust',
    imbalance_method='smote',
    random_state=RANDOM_STATE
)

---
## 6. Detailed Evaluation of Best Model

In [None]:
# Get best model info
best_model = results['best_model']
y_test = results['y_test']
y_pred = results['y_pred_best']
y_prob = results['y_prob_best']

print(f"\nBest Model: {best_model.upper()}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Anomaly']))

In [None]:
# Plot confusion matrix
plot_confusion_matrix(y_test, y_pred, f'Confusion Matrix - {best_model}')

In [None]:
# Plot ROC and PR curves (if probabilities available)
if y_prob is not None:
    plot_roc_pr_curves(y_test, y_prob, best_model)
else:
    print("Probability scores not available for this model")

---
## 7. Compare All Models

In [None]:
# Visualize model comparison
comparison_df = results['comparison']
metrics = ['precision', 'recall', 'f1', 'balanced_accuracy']

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for i, metric in enumerate(metrics):
    models = comparison_df['model'].values
    values = comparison_df[metric].values
    
    colors = ['#e74c3c' if m == best_model else '#3498db' for m in models]
    
    axes[i].barh(models, values, color=colors)
    axes[i].set_xlabel(metric.title(), fontweight='bold')
    axes[i].set_title(f'{metric.title()} by Model', fontweight='bold')
    axes[i].set_xlim([0, 1])
    axes[i].grid(axis='x', alpha=0.3)
    
    for j, v in enumerate(values):
        axes[i].text(v + 0.02, j, f'{v:.3f}', va='center')

plt.tight_layout()
plt.show()

---
## 8. Feature Importance Analysis

In [None]:
# Show top significant features
sig_df = results['significance_analysis']

print("\nTop 15 Most Significant Features:")
top_15 = sig_df.nsmallest(15, 'mannwhitney_p')
print(top_15[['feature', 'cohens_d', 'effect_size', 'mannwhitney_p']].to_string(index=False))

In [None]:
# Visualize feature significance
top_20 = sig_df.nsmallest(20, 'mannwhitney_p')

fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# P-values
axes[0].barh(range(len(top_20)), -np.log10(top_20['mannwhitney_p']))
axes[0].set_yticks(range(len(top_20)))
axes[0].set_yticklabels(top_20['feature'])
axes[0].set_xlabel('-log10(p-value)', fontweight='bold')
axes[0].set_title('Feature Significance (p-values)', fontweight='bold')
axes[0].axvline(-np.log10(0.05), color='red', linestyle='--', label='α=0.05')
axes[0].legend()
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# Effect sizes
axes[1].barh(range(len(top_20)), np.abs(top_20['cohens_d']))
axes[1].set_yticks(range(len(top_20)))
axes[1].set_yticklabels(top_20['feature'])
axes[1].set_xlabel("|Cohen's d|", fontweight='bold')
axes[1].set_title('Feature Effect Sizes', fontweight='bold')
axes[1].axvline(0.8, color='green', linestyle='--', label='Large')
axes[1].axvline(0.5, color='orange', linestyle='--', label='Medium')
axes[1].legend()
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

---
## 9. Save Model for Production

In [None]:
import joblib
from datetime import datetime

# Save model and metadata
model_package = {
    'model': results['models'][best_model],
    'scaler': results['scaler'],
    'model_name': best_model,
    'feature_names': results['feature_names'],
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'performance': results['comparison'][results['comparison']['model'] == best_model].iloc[0].to_dict()
}

joblib.dump(model_package, 'anomaly_detector.pkl')

print("✅ Model saved to 'anomaly_detector.pkl'")
print(f"   Model: {best_model}")
print(f"   F1-Score: {model_package['performance']['f1']:.4f}")
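
For completeness, here is a minimal sketch of the load side of this step: a package dictionary is written with `joblib.dump`, read back with `joblib.load`, and checked to reproduce the original predictions. The toy data, the temporary directory, and the file name `demo_detector.pkl` are stand-ins chosen for the example, so nothing here touches `anomaly_detector.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # stand-in for training data

scaler = RobustScaler().fit(X)
model = IsolationForest(random_state=0).fit(scaler.transform(X))
package = {'model': model, 'scaler': scaler, 'model_name': 'isolation_forest'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_detector.pkl')   # hypothetical file name
    joblib.dump(package, path)
    loaded = joblib.load(path)

# The reloaded scaler and model should reproduce the original predictions
X_new = rng.normal(size=(5, 3))
before = model.predict(scaler.transform(X_new))
after = loaded['model'].predict(loaded['scaler'].transform(X_new))
print(np.array_equal(before, after))
```

If the saved best model is the Gaussian one, `model_package['model']` is a plain dictionary rather than an estimator, so a loader would need to check `model_name` and call `predict_gaussian` instead of `.predict`.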

---
## 10. Example: Making Predictions on New Data

In [None]:
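# Generate some "new" data. A different random_state draws a different synthetic
# distribution, so this cell demonstrates the prediction API, not model accuracy.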
X_new, y_new = make_classification(
    n_samples=100,
    n_features=25,
    n_informative=18,
    n_redundant=4,
    n_classes=2,
    weights=[0.88, 0.12],
    random_state=999
)

X_new_df = pd.DataFrame(X_new, columns=feature_names)

print(f"New data shape: {X_new_df.shape}")

# Scale new data
# The scaler was fitted on a NumPy array, so pass the underlying values
X_new_scaled = results['scaler'].transform(X_new_df.values)

# Make predictions
if best_model == 'gaussian':
    predictions, probs = predict_gaussian(
        X_new_scaled, 
        results['models'][best_model],
        results['models'][best_model]['epsilon']
    )
else:
    predictions = results['models'][best_model].predict(X_new_scaled)
    predictions = (predictions == -1).astype(int)

print(f"\nPredictions:")
print(f"  Normal: {np.sum(predictions == 0)}")
print(f"  Anomaly: {np.sum(predictions == 1)}")

# Display sample predictions
results_df = X_new_df.copy()
results_df['prediction'] = ['Anomaly' if p == 1 else 'Normal' for p in predictions]

print("\nSample predictions (first 10):")
results_df[['temperature', 'humidity', 'wind_speed', 'prediction']].head(10)

---
## Summary

### ✅ What We Accomplished:

1. **Statistical Analysis**: Ran a Mann-Whitney U test and computed Cohen's d for every feature
2. **Preprocessing**: Scaled data using RobustScaler
3. **Imbalance Handling**: Applied SMOTE to balance classes in the training set
4. **Model Training**: Trained 5 different anomaly detection algorithms
5. **Evaluation**: Compared models using 10 metrics
6. **Best Model Selection**: Automatically selected the model with the highest test-set F1 score
7. **Visualization**: Plotted the confusion matrix, ROC and precision-recall curves, and model and feature comparisons
8. **Model Persistence**: Saved the best model, its scaler, and metadata with joblib

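### 🎯 Key Results:

- **Best Model**: Reported at the end of the pipeline output in Section 5 (selected by highest F1)
- **Performance**: Precision, recall, F1, balanced accuracy, and ROC AUC are listed per model in the comparison table
- **Significant Features**: Ranked by Mann-Whitney p-value and Cohen's d in Section 8
- **Ready for Deployment**: Model saved to `anomaly_detector.pkl` and loadable with `joblib.load`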

### 📝 To Use With Your Own Data:

Simply replace the data generation code with:

```python
# Load your data
X_df = pd.read_csv('your_data.csv')
y = pd.read_csv('your_labels.csv').values.ravel()

# Run the pipeline
results = run_anomaly_detection_pipeline(X_df, y)
```

**That's it! The pipeline handles splitting, scaling, resampling, training, and evaluation automatically.** 🚀
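
Before passing your own data in, it may help to confirm that it matches what the pipeline assumes: labels encoded as 0 (normal) and 1 (anomaly), numeric feature columns, and no missing values. The helper below, `check_pipeline_inputs`, is a hypothetical pre-flight check written for this example and is not part of the pipeline itself.

```python
import numpy as np
import pandas as pd

def check_pipeline_inputs(X_df, y):
    """Hypothetical pre-flight check; not part of the pipeline above."""
    y = np.asarray(y).ravel()
    if len(X_df) != len(y):
        raise ValueError(f"X has {len(X_df)} rows but y has {len(y)} labels")
    # The pipeline masks on y == 0 / y == 1 and calls np.bincount on y
    if not set(np.unique(y)) <= {0, 1}:
        raise ValueError("y must contain only 0 (normal) and 1 (anomaly)")
    # The scalers and detectors need purely numeric features
    non_numeric = X_df.select_dtypes(exclude='number').columns.tolist()
    if non_numeric:
        raise ValueError(f"Non-numeric columns: {non_numeric}")
    if X_df.isna().any().any():
        raise ValueError("X contains missing values; impute or drop them first")
    return X_df, y.astype(int)

# Tiny illustrative frame
X_demo = pd.DataFrame({'a': [0.1, 0.2, 0.3], 'b': [1.0, 2.0, 3.0]})
X_ok, y_ok = check_pipeline_inputs(X_demo, [0, 0, 1])
print(X_ok.shape, y_ok)
```

A call such as `run_anomaly_detection_pipeline(*check_pipeline_inputs(X_df, y))` would then fail early with a readable message instead of partway through training.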