# Complete Model Comparison on CICIDS2017 Dataset

This notebook trains and compares **Random Forest**, **KNN**, and **Decision Tree** classifiers on the CICIDS2017 intrusion detection dataset.

## Models Tested:
1. **Random Forest** - Ensemble of decision trees
2. **K-Nearest Neighbors (KNN)** - Instance-based learning
3. **Decision Tree** - Single tree classifier

## Features:
- Unified preprocessing pipeline
- SMOTE for class balancing
- Cross-validation for all models
- Comprehensive performance comparison
- Feature importance analysis
- Visual comparison of results

## 1. Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..')))

# Import dataset
from CICIDS2017.preprocessing.dataset import CICIDS2017

# Import shared utilities
from scripts.models.model_utils import (
    prepare_data,
    evaluate_model,
    check_data_leakage,
    get_feature_importance,
    balance_classes_info,
    remove_rare_classes,
    print_performance_summary
)

# Import model-specific modules
from scripts.models.random_forest import create_rf_pipeline, train_random_forest
from scripts.models.knn import create_knn_pipeline, train_knn, find_optimal_k
from scripts.models.decision_tree import (
    create_dt_pipeline, 
    train_decision_tree,
    analyze_tree_complexity
)

# Import sklearn utilities
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

# Import logger
from scripts.logger import LoggerManager

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ All imports successful")

## 2. Configuration

In [None]:
# Configuration
CONFIG = {
    'sample_size': 100000,  # Adjust based on your resources
    'test_size': 0.25,
    'cv_folds': 5,
    'random_state': 0,
    'use_smote': True,
    'variance_threshold': 0.01,
    'leakage_features': ['Attack Number']  # Known leakage features
}

# Model configurations
MODEL_CONFIGS = {
    'random_forest': {
        'n_estimators': 10,
        'max_depth': 3,
        'min_samples_split': 5,
        'min_samples_leaf': 2,
        'max_features': 'sqrt',
        'class_weight': 'balanced'
    },
    'knn': {
        'n_neighbors': 5,  # Will be optimized
        'weights': 'distance',
        'metric': 'minkowski',
        'p': 2
    },
    'decision_tree': {
        'max_depth': 3,
        'min_samples_split': 10,
        'min_samples_leaf': 5,
        'criterion': 'gini',
        'class_weight': 'balanced'
    }
}

print("Configuration:")
print(f"  Sample size: {CONFIG['sample_size']:,}")
print(f"  Test size: {CONFIG['test_size']}")
print(f"  CV folds: {CONFIG['cv_folds']}")
print(f"  SMOTE: {CONFIG['use_smote']}")

## 3. Initialize Logger

In [None]:
logger = LoggerManager(log_name="model_comparison").get_logger()
logger.info("Starting complete model comparison notebook")

## 4. Load and Prepare Data

In [None]:
# Load dataset
logger.info("Loading CICIDS2017 dataset...")
dataset = CICIDS2017(logger=logger)
dataset.encode().optimize_memory()
data = dataset.data

print(f"Original dataset shape: {data.shape}")
print(f"Columns: {len(data.columns)}")
print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# Sample data
logger.info(f"Sampling {CONFIG['sample_size']} rows...")
data_sample = data.sample(n=min(CONFIG['sample_size'], len(data)), 
                          random_state=CONFIG['random_state'])

print(f"Sampled data shape: {data_sample.shape}")

In [None]:
# Prepare data using shared utility
X, y, removed_features = prepare_data(
    data_sample,
    target_column='Attack Type',
    leakage_features=CONFIG['leakage_features'],
    remove_low_var=True,
    var_threshold=CONFIG['variance_threshold'],
    logger=logger
)

print(f"\nRemoved features:")
if removed_features is not None:
    leakage = removed_features.get('leakage', [])
    low_var = removed_features.get('low_variance', [])
    print(f"  Leakage: {len(leakage)}")
    if leakage:
        print(f"    Names: {list(leakage)}")
    print(f"  Low variance: {len(low_var)}")
    if low_var:
        print(f"    Names: {list(low_var)}")

## 5. Data Analysis

In [None]:
# Check class balance
balance_info = balance_classes_info(y, logger=logger)

# Plot class distribution
plt.figure(figsize=(14, 6))
y.value_counts().plot(kind='bar', color='steelblue', edgecolor='black')
plt.title('Class Distribution Before SMOTE', fontsize=14, fontweight='bold')
plt.xlabel('Attack Type', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Data leakage diagnostics
diagnostics = check_data_leakage(X, y, logger=logger)

## 6. Train/Test Split

In [None]:
# Remove rare classes for stratified split
X, y, removed_classes = remove_rare_classes(X, y, min_samples=2, logger=logger)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=CONFIG['test_size'],
    random_state=CONFIG['random_state'],
    stratify=y
)

print(f"\nTrain set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: {X_train.shape[1]}")

## 7. Model Training and Evaluation

We'll train all three models and collect their results.

In [None]:
# Storage for results
results = {}
training_times = {}
prediction_times = {}

### 7.1 Random Forest

In [None]:
print("=" * 70)
print("TRAINING RANDOM FOREST")
print("=" * 70)

# Create pipeline
rf_pipeline = create_rf_pipeline(
    **MODEL_CONFIGS['random_forest'],
    random_state=CONFIG['random_state'],
    use_smote=CONFIG['use_smote'],
    use_scaler=True
)

# Cross-validation
logger.info("Random Forest: Cross-validation...")
start_time = time()
rf_cv_scores = cross_val_score(rf_pipeline, X_train, y_train, 
                                cv=CONFIG['cv_folds'], n_jobs=-1)
cv_time = time() - start_time

print(f"CV Time: {cv_time:.2f}s")
print(f"CV Scores: {rf_cv_scores}")
print(f"Mean CV: {rf_cv_scores.mean():.4f} (+/- {rf_cv_scores.std():.4f})")

# Train final model
logger.info("Random Forest: Training final model...")
start_time = time()
rf_pipeline.fit(X_train, y_train)
training_times['Random Forest'] = time() - start_time

# Evaluate
start_time = time()
rf_results = evaluate_model(rf_pipeline, X_test, y_test, logger=logger)
prediction_times['Random Forest'] = time() - start_time

results['Random Forest'] = {
    'cv_scores': rf_cv_scores,
    'test_accuracy': rf_results['accuracy'],
    'report': rf_results['report'],
    'confusion_matrix': rf_results['confusion_matrix'],
    'model': rf_pipeline
}

print(f"\n✓ Random Forest completed")
print(f"  Training time: {training_times['Random Forest']:.2f}s")
print(f"  Test accuracy: {rf_results['accuracy']:.4f}")

### 7.2 K-Nearest Neighbors
#### Why we use pipelines instead of the `train_knn` function

In this notebook, we use scikit-learn pipelines for all models, including KNN, instead of the standalone `train_knn` function. A pipeline chains the preprocessing steps (such as scaling and SMOTE) with the model, so during cross-validation every step is fitted on the training folds only: the scaler learns its statistics from them, and SMOTE resamples them without touching the held-out fold. This prevents data leakage and makes the workflow more robust and reproducible. The `train_knn` function does not integrate preprocessing or handle cross-validation in the same way, so pipelines are the safer choice for reliable model evaluation.
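
A minimal sketch of this idea, assuming synthetic data from `make_classification` and a plain scikit-learn `Pipeline`. The project's `create_knn_pipeline` helper is not shown here and its internals may differ. Because the scaler is a pipeline step, `cross_val_score` refits it on each training fold. With SMOTE, the same pattern would typically use the pipeline class from imbalanced-learn so that resampling is also restricted to the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline, so it is refit on every training fold
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance")),
])

# One accuracy score per fold; the held-out fold never influences the scaler
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print(f"Mean CV accuracy: {demo_scores.mean():.4f}")
```

Scaling the whole training set once before cross-validation would instead let every fold's scaler see the held-out rows.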

In [None]:
print("\n" + "=" * 70)
print("TRAINING K-NEAREST NEIGHBORS")
print("=" * 70)

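# Scale data for k-finding
# Note: the scaler is fit on all of X_train here, so the k search itself sees
# scaling statistics from its validation folds; the final pipeline below
# rescales inside each fold and is the one used for the reported scores.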
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Find optimal k
logger.info("KNN: Finding optimal k...")
k_results = find_optimal_k(
    X_train_scaled, 
    y_train, 
    k_range=range(3, 16, 2),
    cv=CONFIG['cv_folds'],
    logger=logger
)

optimal_k = k_results['optimal_k']
print(f"\nOptimal k: {optimal_k}")

# Create pipeline with optimal k
knn_pipeline = create_knn_pipeline(
    n_neighbors=optimal_k,
    **{k: v for k, v in MODEL_CONFIGS['knn'].items() if k != 'n_neighbors'},
    random_state=CONFIG['random_state'],
    use_smote=CONFIG['use_smote'],
    use_scaler=True  # CRITICAL for KNN
)

# Cross-validation
logger.info("KNN: Cross-validation...")
start_time = time()
knn_cv_scores = cross_val_score(knn_pipeline, X_train, y_train, 
                                 cv=CONFIG['cv_folds'], n_jobs=-1)
cv_time = time() - start_time

print(f"CV Time: {cv_time:.2f}s")
print(f"CV Scores: {knn_cv_scores}")
print(f"Mean CV: {knn_cv_scores.mean():.4f} (+/- {knn_cv_scores.std():.4f})")

# Train final model
logger.info("KNN: Training final model...")
start_time = time()
knn_pipeline.fit(X_train, y_train)
training_times['KNN'] = time() - start_time

# Evaluate
start_time = time()
knn_results = evaluate_model(knn_pipeline, X_test, y_test, logger=logger)
prediction_times['KNN'] = time() - start_time

results['KNN'] = {
    'cv_scores': knn_cv_scores,
    'test_accuracy': knn_results['accuracy'],
    'report': knn_results['report'],
    'confusion_matrix': knn_results['confusion_matrix'],
    'model': knn_pipeline,
    'optimal_k': optimal_k
}

print(f"\n✓ KNN completed")
print(f"  Training time: {training_times['KNN']:.2f}s")
print(f"  Test accuracy: {knn_results['accuracy']:.4f}")

### 7.3 Decision Tree

In [None]:
print("\n" + "=" * 70)
print("TRAINING DECISION TREE")
print("=" * 70)

# Create pipeline
dt_pipeline = create_dt_pipeline(
    **MODEL_CONFIGS['decision_tree'],
    random_state=CONFIG['random_state'],
    use_smote=CONFIG['use_smote'],
    use_scaler=False  # Not needed for DT
)

# Cross-validation
logger.info("Decision Tree: Cross-validation...")
start_time = time()
dt_cv_scores = cross_val_score(dt_pipeline, X_train, y_train, 
                                cv=CONFIG['cv_folds'], n_jobs=-1)
cv_time = time() - start_time

print(f"CV Time: {cv_time:.2f}s")
print(f"CV Scores: {dt_cv_scores}")
print(f"Mean CV: {dt_cv_scores.mean():.4f} (+/- {dt_cv_scores.std():.4f})")

# Train final model
logger.info("Decision Tree: Training final model...")
start_time = time()
dt_pipeline.fit(X_train, y_train)
training_times['Decision Tree'] = time() - start_time

# Analyze tree complexity
tree_complexity = analyze_tree_complexity(dt_pipeline, logger=logger)

# Evaluate
start_time = time()
dt_results = evaluate_model(dt_pipeline, X_test, y_test, logger=logger)
prediction_times['Decision Tree'] = time() - start_time

results['Decision Tree'] = {
    'cv_scores': dt_cv_scores,
    'test_accuracy': dt_results['accuracy'],
    'report': dt_results['report'],
    'confusion_matrix': dt_results['confusion_matrix'],
    'model': dt_pipeline,
    'tree_complexity': tree_complexity
}

print(f"\n✓ Decision Tree completed")
print(f"  Training time: {training_times['Decision Tree']:.2f}s")
print(f"  Test accuracy: {dt_results['accuracy']:.4f}")
print(f"  Tree nodes: {tree_complexity['n_nodes']}")
print(f"  Tree depth: {tree_complexity['max_depth']}")

## 8. Results Comparison

In [None]:
# Create comparison DataFrame
comparison_data = []
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    comparison_data.append({
        'Model': model_name,
        'Mean CV Score': results[model_name]['cv_scores'].mean(),
        'CV Std': results[model_name]['cv_scores'].std(),
        'Test Accuracy': results[model_name]['test_accuracy'],
        'Training Time (s)': training_times[model_name],
        'Prediction Time (s)': prediction_times[model_name]
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df['CV-Test Gap'] = abs(comparison_df['Mean CV Score'] - comparison_df['Test Accuracy'])

print("\n" + "=" * 80)
print("MODEL COMPARISON SUMMARY")
print("=" * 80)
print(comparison_df.to_string(index=False))

# Find best model
best_model = comparison_df.loc[comparison_df['Test Accuracy'].idxmax(), 'Model']
print(f"\n🏆 Best Model: {best_model}")
print(f"   Test Accuracy: {comparison_df.loc[comparison_df['Test Accuracy'].idxmax(), 'Test Accuracy']:.4f}")

## 9. Visual Comparison

In [None]:
# Create comprehensive comparison plots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Accuracy Comparison
ax1 = axes[0, 0]
x_pos = np.arange(len(comparison_df))
ax1.bar(x_pos - 0.2, comparison_df['Mean CV Score'], 0.4, 
        label='CV Score', color='steelblue', alpha=0.8)
ax1.bar(x_pos + 0.2, comparison_df['Test Accuracy'], 0.4, 
        label='Test Accuracy', color='coral', alpha=0.8)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(comparison_df['Model'])
ax1.set_ylabel('Accuracy', fontsize=11)
ax1.set_title('Model Accuracy Comparison', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
ax1.set_ylim([0.8, 1.0])

# 2. Training Time Comparison
ax2 = axes[0, 1]
ax2.bar(comparison_df['Model'], comparison_df['Training Time (s)'], 
        color='lightgreen', edgecolor='black')
ax2.set_ylabel('Time (seconds)', fontsize=11)
ax2.set_title('Training Time Comparison', fontsize=13, fontweight='bold')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(axis='y', alpha=0.3)

# 3. CV Score Distribution
ax3 = axes[1, 0]
cv_data = [results[model]['cv_scores'] for model in ['Random Forest', 'KNN', 'Decision Tree']]
ax3.boxplot(cv_data, labels=['RF', 'KNN', 'DT'])
ax3.set_ylabel('CV Accuracy', fontsize=11)
ax3.set_title('Cross-Validation Score Distribution', fontsize=13, fontweight='bold')
ax3.grid(axis='y', alpha=0.3)

# 4. CV-Test Gap
ax4 = axes[1, 1]
colors = ['green' if gap < 0.05 else 'orange' for gap in comparison_df['CV-Test Gap']]
ax4.bar(comparison_df['Model'], comparison_df['CV-Test Gap'], color=colors, edgecolor='black')
ax4.axhline(y=0.05, color='red', linestyle='--', label='Threshold (0.05)')
ax4.set_ylabel('Gap', fontsize=11)
ax4.set_title('CV-Test Gap (Overfitting Check)', fontsize=13, fontweight='bold')
ax4.tick_params(axis='x', rotation=45)
ax4.legend()
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 10. Confusion Matrices

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, model_name in enumerate(['Random Forest', 'KNN', 'Decision Tree']):
    cm = results[model_name]['confusion_matrix']
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, 
                                   display_labels=results[model_name]['model'].classes_)
    disp.plot(ax=axes[idx], cmap='Blues', xticks_rotation=45)
    axes[idx].set_title(f'{model_name}\n(Acc: {results[model_name]["test_accuracy"]:.4f})', 
                       fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 11. Feature Importance (Tree-based Models)
KNN does not provide built-in feature importances because it makes predictions from distances in feature space rather than from learned splits or weights, so this section covers only the tree-based models.
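
If a feature ranking for KNN is still wanted, one model-agnostic option is permutation importance. It measures how much the score drops when a single feature's values are shuffled. The sketch below is not part of the original workflow. It assumes synthetic data and a small stand-in pipeline (`knn_demo`), and uses scikit-learn's `permutation_importance`.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Hypothetical KNN pipeline standing in for the notebook's knn_pipeline
knn_demo = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_demo.fit(X_tr, y_tr)

# Shuffle each feature several times and record the mean drop in accuracy
perm = permutation_importance(knn_demo, X_te, y_te, n_repeats=5, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]
for i in ranking[:3]:
    print(f"feature {i}: mean accuracy drop {perm.importances_mean[i]:.4f}")
```

Permutation importance costs many extra predictions, which is slow for KNN on large test sets, so a subsample of the test data may be preferable.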

In [None]:
# Compare feature importance for tree-based models
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Random Forest Feature Importance
rf_features = get_feature_importance(
    results['Random Forest']['model'],
    feature_names=list(X.columns),
    top_n=15
)
features, importances = zip(*rf_features)
axes[0].barh(range(len(features)), importances, color='steelblue')
axes[0].set_yticks(range(len(features)))
axes[0].set_yticklabels(features)
axes[0].set_xlabel('Importance', fontsize=11)
axes[0].set_title('Random Forest - Top 15 Features', fontsize=13, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# Decision Tree Feature Importance
dt_features = get_feature_importance(
    results['Decision Tree']['model'],
    feature_names=list(X.columns),
    top_n=15
)
features, importances = zip(*dt_features)
axes[1].barh(range(len(features)), importances, color='lightgreen')
axes[1].set_yticks(range(len(features)))
axes[1].set_yticklabels(features)
axes[1].set_xlabel('Importance', fontsize=11)
axes[1].set_title('Decision Tree - Top 15 Features', fontsize=13, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNote: KNN does not provide feature importance.")

## 12. Performance Summaries

In [None]:
# Print detailed summaries for each model
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    print("\n" + "=" * 70)
    print(f"{model_name.upper()} DETAILED SUMMARY")
    print("=" * 70)
    
    model_results = results[model_name]
    
    print(f"\nCross-Validation:")
    print(f"  Scores: {model_results['cv_scores']}")
    print(f"  Mean: {model_results['cv_scores'].mean():.4f}")
    print(f"  Std: {model_results['cv_scores'].std():.4f}")
    
    print(f"\nTest Set:")
    print(f"  Accuracy: {model_results['test_accuracy']:.4f}")
    print(f"  CV-Test Gap: {abs(model_results['cv_scores'].mean() - model_results['test_accuracy']):.4f}")
    
    print(f"\nTiming:")
    print(f"  Training: {training_times[model_name]:.2f}s")
    print(f"  Prediction: {prediction_times[model_name]:.2f}s")
    print(f"  Time per sample: {prediction_times[model_name]/len(X_test)*1000:.3f}ms")
    
    # Model-specific info
    if model_name == 'KNN':
        print(f"\nKNN Specific:")
        print(f"  Optimal k: {model_results['optimal_k']}")
        print(f"  Memory usage: ~{X_train.memory_usage(deep=True).sum() / 1024**2:.2f} MB (stores all training data)")
    elif model_name == 'Decision Tree':
        print(f"\nTree Complexity:")
        print(f"  Total nodes: {model_results['tree_complexity']['n_nodes']}")
        print(f"  Leaf nodes: {model_results['tree_complexity']['n_leaves']}")
        print(f"  Max depth: {model_results['tree_complexity']['max_depth']}")
        print(f"  Features used: {model_results['tree_complexity']['n_features_used']}/{X.shape[1]}")
    elif model_name == 'Random Forest':
        print(f"\nRandom Forest Specific:")
        print(f"  Number of trees: {MODEL_CONFIGS['random_forest']['n_estimators']}")
        print(f"  Max depth per tree: {MODEL_CONFIGS['random_forest']['max_depth']}")

## 13. Final Recommendations

In [None]:
print("\n" + "=" * 70)
print("FINAL RECOMMENDATIONS")
print("=" * 70)

# Determine best model by accuracy
best_by_accuracy = comparison_df.loc[comparison_df['Test Accuracy'].idxmax()]
best_by_speed = comparison_df.loc[comparison_df['Prediction Time (s)'].idxmin()]
best_by_training = comparison_df.loc[comparison_df['Training Time (s)'].idxmin()]

print(f"\n🏆 Best Overall Accuracy:")
print(f"   Model: {best_by_accuracy['Model']}")
print(f"   Accuracy: {best_by_accuracy['Test Accuracy']:.4f}")

print(f"\n⚡ Fastest Prediction:")
print(f"   Model: {best_by_speed['Model']}")
print(f"   Time: {best_by_speed['Prediction Time (s)']:.2f}s")

print(f"\n🚀 Fastest Training:")
print(f"   Model: {best_by_training['Model']}")
print(f"   Time: {best_by_training['Training Time (s)']:.2f}s")

print("\n📊 Model Characteristics:")
print("\n  Random Forest:")
print("    + Best accuracy (usually)")
print("    + Robust to overfitting")
print("    + Provides feature importance")
print("    - Slower training and prediction")
print("    - Less interpretable than single tree")

print("\n  K-Nearest Neighbors:")
print("    + Simple and intuitive")
print("    + No training time (lazy learner)")
print("    + Good for non-linear boundaries")
print("    - Slow prediction (stores all data)")
print("    - Requires feature scaling")
print("    - No feature importance")

print("\n  Decision Tree:")
print("    + Very interpretable")
print("    + Fast training and prediction")
print("    + No feature scaling needed")
print("    + Provides decision rules")
print("    - Prone to overfitting")
print("    - Less accurate than RF")

print("\n💡 Use Case Recommendations:")
print("  - Production system (accuracy priority): Random Forest")
print("  - Real-time detection (speed priority): Decision Tree")
print("  - Explainability needed: Decision Tree")
print("  - Research/prototyping: Try all and compare")

# Check for overfitting
print("\n⚠️  Overfitting Check:")
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    gap = comparison_df[comparison_df['Model'] == model_name]['CV-Test Gap'].values[0]
    if gap > 0.05:
        print(f"  {model_name}: Large gap ({gap:.4f}) - may be overfitting")
    else:
        print(f"  {model_name}: Good generalization (gap: {gap:.4f})")

logger.info("Final recommendations generated")

## 14. Export Results (Optional)

In [None]:
# Export comparison results to CSV
output_dir = 'results'
os.makedirs(output_dir, exist_ok=True)

# Save comparison DataFrame
comparison_df.to_csv(f'{output_dir}/model_comparison.csv', index=False)
print(f"✓ Results saved to {output_dir}/model_comparison.csv")

# Save detailed results for each model
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    report = results[model_name]['report']
    filename = model_name.lower().replace(' ', '_')
    with open(f'{output_dir}/{filename}_report.txt', 'w') as f:
        f.write(f"{model_name} Classification Report\n")
        f.write("=" * 50 + "\n\n")
        f.write(report)
    print(f"✓ {model_name} report saved")

print("\n✓ All results exported successfully!")

## 15. Model-Specific Analysis

In [None]:
# Additional analysis for specific models
print("=" * 70)
print("MODEL-SPECIFIC INSIGHTS")
print("=" * 70)

# Random Forest: Feature importance distribution
print("\n📊 Random Forest Feature Importance Distribution:")
rf_model = results['Random Forest']['model'].named_steps['rf']
importances = rf_model.feature_importances_
print(f"  Features with >1% importance: {np.sum(importances > 0.01)}")
print(f"  Top 10 features contribute: {np.sum(sorted(importances, reverse=True)[:10]):.1%}")
print(f"  Mean importance: {np.mean(importances):.4f}")

# KNN: Distance analysis
print("\n📏 KNN Analysis:")
print(f"  Optimal k: {results['KNN']['optimal_k']}")
print(f"  Training samples stored: {len(X_train):,}")
print(f"  Features per sample: {X_train.shape[1]}")
print(f"  Avg prediction time per sample: {prediction_times['KNN']/len(X_test)*1000:.2f}ms")

# Decision Tree: Depth analysis
print("\n🌳 Decision Tree Structure:")
dt_complexity = results['Decision Tree']['tree_complexity']
print(f"  Depth ratio: {dt_complexity['max_depth']}/{MODEL_CONFIGS['decision_tree']['max_depth']} (actual/max)")
print(f"  Node efficiency: {dt_complexity['n_leaves']/dt_complexity['n_nodes']:.1%} leaves")
print(f"  Feature usage: {dt_complexity['n_features_used']}/{X.shape[1]} features used")

if dt_complexity['max_depth'] >= MODEL_CONFIGS['decision_tree']['max_depth']:
    print("  ⚠️ Tree reached max_depth - consider increasing or pruning")
else:
    print("  ✓ Tree stopped before max_depth - good regularization")

## 16. Prediction Speed Analysis

In [None]:
# Detailed timing analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training time
models = list(training_times.keys())
train_times = list(training_times.values())
colors = ['steelblue', 'coral', 'lightgreen']

axes[0].barh(models, train_times, color=colors, edgecolor='black')
axes[0].set_xlabel('Time (seconds)', fontsize=11)
axes[0].set_title('Training Time Comparison', fontsize=13, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)
for i, v in enumerate(train_times):
    axes[0].text(v + 0.5, i, f'{v:.2f}s', va='center')

# Prediction time per sample
pred_per_sample = [prediction_times[m]/len(X_test)*1000 for m in models]
axes[1].barh(models, pred_per_sample, color=colors, edgecolor='black')
axes[1].set_xlabel('Time per sample (ms)', fontsize=11)
axes[1].set_title('Prediction Speed (per sample)', fontsize=13, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)
for i, v in enumerate(pred_per_sample):
    axes[1].text(v + 0.01, i, f'{v:.2f}ms', va='center')

plt.tight_layout()
plt.show()

print("\nThroughput Analysis (samples per second):")
for model_name in models:
    throughput = len(X_test) / prediction_times[model_name]
    print(f"  {model_name}: {throughput:.0f} samples/sec")

## 17. Cross-Validation Stability

In [None]:
# Plot CV score stability
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, model_name in enumerate(['Random Forest', 'KNN', 'Decision Tree']):
    cv_scores = results[model_name]['cv_scores']
    folds = range(1, len(cv_scores) + 1)
    
    axes[idx].plot(folds, cv_scores, marker='o', markersize=8, 
                   linewidth=2, color=colors[idx], label='CV Scores')
    axes[idx].axhline(y=cv_scores.mean(), color='red', linestyle='--', 
                      label=f'Mean: {cv_scores.mean():.4f}')
    axes[idx].axhline(y=results[model_name]['test_accuracy'], 
                      color='green', linestyle=':', 
                      label=f'Test: {results[model_name]["test_accuracy"]:.4f}')
    axes[idx].set_xlabel('Fold', fontsize=11)
    axes[idx].set_ylabel('Accuracy', fontsize=11)
    axes[idx].set_title(f'{model_name}', fontsize=12, fontweight='bold')
    axes[idx].legend(loc='lower right', fontsize=9)
    axes[idx].grid(True, alpha=0.3)
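    # Include the test accuracy in the y-range so its reference line stays visible
    test_acc = results[model_name]['test_accuracy']
    axes[idx].set_ylim([min(cv_scores.min(), test_acc) - 0.01,
                        max(cv_scores.max(), test_acc) + 0.01])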

plt.suptitle('Cross-Validation Stability Analysis', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nCV Stability (lower std = more stable):")
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    std = results[model_name]['cv_scores'].std()
    stability = "High" if std < 0.01 else "Medium" if std < 0.02 else "Low"
    print(f"  {model_name}: std={std:.4f} ({stability})")

## 18. Per-Class Performance Analysis

In [None]:
# Analyze performance per class using the reports stored during evaluation

print("=" * 70)
print("PER-CLASS PERFORMANCE COMPARISON")
print("=" * 70)

# Print each model's classification report (per-class precision, recall, F1)
for model_name in ['Random Forest', 'KNN', 'Decision Tree']:
    print(f"\n{model_name}:")
    print(results[model_name]['report'])
    print("-" * 70)

## 19. Memory and Scalability Analysis

In [None]:
import pickle

print("=" * 70)
print("MEMORY AND SCALABILITY ANALYSIS")
print("=" * 70)

# Estimate model sizes from the length of their pickled bytes.
# (sys.getsizeof would only report the shallow size of the Pipeline object.)
def model_size_mb(model):
    return len(pickle.dumps(model)) / 1024**2

print("\n💾 Model Memory Usage:")

# Random Forest
rf_model = results['Random Forest']['model']
rf_size = model_size_mb(rf_model)
print(f"  Random Forest: ~{rf_size:.2f} MB")
print(f"    ({MODEL_CONFIGS['random_forest']['n_estimators']} trees × complexity)")

# KNN (stores all training data)
knn_size = X_train.memory_usage(deep=True).sum() / 1024**2
print(f"  KNN: ~{knn_size:.2f} MB")
print(f"    (Stores all {len(X_train):,} training samples)")

# Decision Tree
dt_model = results['Decision Tree']['model']
dt_size = model_size_mb(dt_model)
print(f"  Decision Tree: ~{dt_size:.2f} MB")
print(f"    ({dt_complexity['n_nodes']} nodes)")

print("\n📈 Scalability to Larger Datasets:")
print("  Random Forest:")
print("    ✓ Scales well (parallel trees)")
print("    ✓ Can handle millions of samples")
print("    ~ Training time: O(n × log(n) × trees × features)")

print("\n  KNN:")
print("    ⚠️ Poor scalability")
print("    ⚠️ Prediction time grows with dataset size O(n)")
print("    ⚠️ Memory usage grows linearly with samples")
print("    💡 Consider approximate KNN (FAISS) for >1M samples")

print("\n  Decision Tree:")
print("    ✓ Good scalability for training")
print("    ✓ Fast prediction O(log(n))")
print("    ⚠️ May overfit on large datasets without pruning")

## 20. Production Deployment Recommendations

In [None]:
print("=" * 70)
print("PRODUCTION DEPLOYMENT RECOMMENDATIONS")
print("=" * 70)

# Determine best model for different scenarios
best_accuracy = comparison_df.loc[comparison_df['Test Accuracy'].idxmax(), 'Model']
fastest_pred = comparison_df.loc[comparison_df['Prediction Time (s)'].idxmin(), 'Model']
fastest_train = comparison_df.loc[comparison_df['Training Time (s)'].idxmin(), 'Model']
most_stable = comparison_df.loc[comparison_df['CV Std'].idxmin(), 'Model']

print("\n🎯 Scenario-Based Recommendations:\n")

print("1️⃣ High-Throughput Real-Time System (e.g., IDS)")
print(f"   Recommended: {fastest_pred}")
print(f"   Reason: Fastest prediction ({comparison_df[comparison_df['Model']==fastest_pred]['Prediction Time (s)'].values[0]:.2f}s for {len(X_test):,} samples)")
print(f"   Accuracy: {comparison_df[comparison_df['Model']==fastest_pred]['Test Accuracy'].values[0]:.4f}")
print("   Deployment: Save model, load at startup, minimal latency")

print("\n2️⃣ Batch Processing / Offline Analysis")
print(f"   Recommended: {best_accuracy}")
print(f"   Reason: Best accuracy ({comparison_df[comparison_df['Model']==best_accuracy]['Test Accuracy'].values[0]:.4f})")
print("   Deployment: Can afford longer prediction time for better results")

print("\n3️⃣ Frequent Model Retraining")
print(f"   Recommended: {fastest_train}")
print(f"   Reason: Fastest training ({comparison_df[comparison_df['Model']==fastest_train]['Training Time (s)'].values[0]:.2f}s)")
print("   Use Case: Models retrained hourly/daily with new data")

print("\n4️⃣ Explainable AI / Regulatory Compliance")
print("   Recommended: Decision Tree")
print("   Reason: Fully interpretable decision rules")
print("   Use Case: Need to explain why alerts were triggered")

print("\n5️⃣ Mobile/Edge Deployment")
print("   Recommended: Decision Tree")
print(f"   Reason: Smallest model size (~{dt_size:.2f} MB)")
print("   Use Case: Embedded systems, IoT devices")

print("\n💡 General Production Checklist:")
print("   ✅ Serialize model: Use joblib or pickle")
print("   ✅ Version control: Track model versions with metrics")
print("   ✅ Input validation: Check feature ranges, handle missing values")
print("   ✅ Monitoring: Log predictions, accuracy, latency")
print("   ✅ A/B testing: Compare new models against baseline")
print("   ✅ Fallback: Have backup model if primary fails")
print("   ✅ Update strategy: Plan for model retraining schedule")

## 21. Save Best Model

In [None]:
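import joblib
from datetime import datetime

# Select the best model by test accuracy
best_model_name = comparison_df.loc[comparison_df['Test Accuracy'].idxmax(), 'Model']
best_model = results[best_model_name]['model']
best_accuracy_val = results[best_model_name]['test_accuracy']

# Persist the fitted pipeline (the 'results' directory and filename pattern are assumptions)
os.makedirs('results', exist_ok=True)
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
model_path = os.path.join('results', f"{best_model_name.lower().replace(' ', '_')}_{timestamp}.joblib")
joblib.dump(best_model, model_path)
print(f"Saved {best_model_name} (test accuracy {best_accuracy_val:.4f}) to {model_path}")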

## 22. Final Summary and Next Steps

In [None]:
print("\n" + "="*70)
print("🎉 ANALYSIS COMPLETE!")
print("="*70)

print("\n📊 Results Summary:")
print(comparison_df.to_string(index=False))

print(f"\n🏆 Winner: {best_model_name}")
print(f"   Test Accuracy: {best_accuracy_val:.4f}")
print(f"   CV Score: {results[best_model_name]['cv_scores'].mean():.4f}")


print("\n🚀 Next Steps:")
print("   1. Hyperparameter tuning for best model")
print("   2. Feature engineering to improve performance")
print("   3. Try ensemble methods (VotingClassifier)")
print("   4. Deploy to production environment")
print("   5. Set up monitoring and retraining pipeline")

print("\n💡 Improvement Ideas:")
if best_accuracy_val < 0.95:
    print("   - Accuracy <0.95: Try XGBoost or Neural Networks")
    print("   - Increase training data size")
    print("   - Perform feature engineering")
elif best_accuracy_val > 0.99:
    print("   - Accuracy >0.99: Double-check for data leakage!")
    print("   - Verify SMOTE is applied correctly")
else:
    print("   - Good accuracy achieved!")
    print("   - Focus on deployment and monitoring")

gap = abs(results[best_model_name]['cv_scores'].mean() - best_accuracy_val)
if gap > 0.05:
    print(f"\n   ⚠️ Large CV-Test gap ({gap:.4f}): Model may be overfitting")
    print("   - Increase regularization")
    print("   - Reduce model complexity")
    print("   - Get more training data")

print("\n" + "="*70)
print("Thank you for using this model comparison framework!")
print("="*70)

logger.info("Model comparison completed successfully!")

## Summary

This notebook provides a comprehensive comparison of three machine learning models:
- **Random Forest**: Ensemble method, usually the most accurate of the three
- **K-Nearest Neighbors**: Instance-based, simple but slow
- **Decision Tree**: Interpretable, fast, but prone to overfitting

### Key Takeaways:
1. All models use SMOTE within CV pipeline to prevent data leakage
2. Feature scaling is critical for KNN but not for tree-based models
3. Random Forest usually provides the best accuracy
4. A single Decision Tree is typically the fastest at prediction and cheap to train; confirm against the measured timings
5. Check CV-Test gap to detect overfitting
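
The gap check in the last takeaway can be expressed as a small helper. This is a sketch: the function name `cv_test_gap` and the sample scores are hypothetical, and the 0.05 threshold mirrors the one used in the plots above.

```python
from statistics import mean

def cv_test_gap(cv_scores, test_accuracy, threshold=0.05):
    """Return (gap, flagged), where gap = |mean CV accuracy - test accuracy|."""
    gap = abs(mean(cv_scores) - test_accuracy)
    return gap, gap > threshold

# Hypothetical scores, for illustration only
gap, flagged = cv_test_gap([0.91, 0.93, 0.92, 0.90, 0.94], test_accuracy=0.85)
print(f"gap={gap:.4f}, flagged={flagged}")
```

A large gap is a prompt to investigate, not proof of overfitting: it can also come from an unlucky split or from very small classes in the test set.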

### Next Steps:
- Try hyperparameter tuning with GridSearchCV
- Experiment with other models (XGBoost, Neural Networks)
- Perform feature engineering for better performance
- Deploy the best model in a production environment
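
As a starting point for the tuning step, the sketch below runs `GridSearchCV` over a one-step pipeline on synthetic data. The step name `dt` and the grid values are illustrative assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

tree_pipeline = Pipeline([("dt", DecisionTreeClassifier(random_state=0))])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "dt__max_depth": [3, 5, 10],
    "dt__min_samples_leaf": [1, 5],
}

# Every parameter combination is scored with 5-fold cross-validation
search = GridSearchCV(tree_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Searching over the whole pipeline keeps any preprocessing steps inside each fold, which matches the leakage-avoidance approach used throughout this notebook.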