# 02 - Machine Learning Baselines

This notebook demonstrates how to train and evaluate machine learning models for network anomaly detection.

## Overview
- Run data preprocessing pipeline
- Train baseline ML models
- Evaluate model performance
- Visualize results with ROC curves and confusion matrices

## Models
We train and compare several ML algorithms, including Random Forest, SVM, and XGBoost; the training script saves the resulting model to `../models/ml_best.pkl`.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import subprocess
import joblib
import warnings
warnings.filterwarnings('ignore')

# Add src to path for imports
sys.path.append('../src')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")


## 1. Data Preprocessing

First, let's run the preprocessing pipeline to prepare our data for machine learning.


In [None]:
# Check if processed data exists
processed_data_path = Path("../data/processed/processed.csv")

if not processed_data_path.exists():
    print("Processed data not found. Running preprocessing...")
    
    # Run preprocessing script
    try:
        result = subprocess.run([
            sys.executable, "../src/preprocess.py",  # same interpreter as the notebook kernel
            "--input", "../data/raw/sample.csv",
            "--output", "../data/processed/processed.csv"
        ], capture_output=True, text=True, check=True)
        
        print("✅ Preprocessing completed successfully!")
        print("Output:", result.stdout)
        
    except subprocess.CalledProcessError as e:
        print("❌ Preprocessing failed:")
        print("Error:", e.stderr)
        print("Please run preprocessing manually:")
        print("python ../src/preprocess.py --input ../data/raw/sample.csv --output ../data/processed/processed.csv")
else:
    print("✅ Processed data already exists!")


In [None]:
# Load processed data
df = pd.read_csv(processed_data_path)
print(f"Loaded processed data: {df.shape}")

# Display basic info
print(f"Features: {df.shape[1] - 1}")  # Excluding label column
print("Label distribution:")
print(df['label'].value_counts())


## 2. Train Machine Learning Models

Now let's train our baseline ML models using the training script.


In [None]:
# Train ML models using the training script
print("Training ML models...")

try:
    result = subprocess.run([
        sys.executable, "../src/train_ml.py",  # same interpreter as the notebook kernel
        "--data", "../data/processed/processed.csv",
        "--out-model", "../models/ml_best.pkl",
        "--mode", "quick"
    ], capture_output=True, text=True, check=True)
    
    print("✅ ML training completed successfully!")
    print("Training output:")
    print(result.stdout)
    
except subprocess.CalledProcessError as e:
    print("❌ ML training failed:")
    print("Error:", e.stderr)
    print("Please run training manually:")
    print("python ../src/train_ml.py --data ../data/processed/processed.csv --out-model ../models/ml_best.pkl --mode quick")
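
The internals of `train_ml.py` are not shown in this notebook, so the following is only a rough sketch of what a baseline comparison could look like. It uses synthetic data from `make_classification` as a stand-in for the processed dataset, compares a Random Forest and an SVM on a held-out split, and picks the one with the higher ROC AUC. XGBoost is left out so the sketch depends only on scikit-learn. All names and hyperparameters here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the processed dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Candidate baselines; hyperparameters are illustrative, not tuned.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
}

# Fit each candidate and score it by ROC AUC on the held-out split.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

best_name = max(scores, key=scores.get)
print(scores, "-> best:", best_name)
```

Wrapping the SVM in a pipeline with `StandardScaler` keeps the scaling step attached to the model, which matters for margin-based classifiers far more than for tree ensembles.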


## 3. Load and Evaluate Trained Model

Let's load the trained model and evaluate its performance on a test split.


In [None]:
# Load the trained model
model_path = Path("../models/ml_best.pkl")

if model_path.exists():
    model = joblib.load(model_path)
    print("✅ Model loaded successfully!")
    print(f"Model type: {type(model).__name__}")
    
    # Load metadata if available
    metadata_path = Path("../models/ml_best_metadata.joblib")
    if metadata_path.exists():
        metadata = joblib.load(metadata_path)
        print(f"Model metadata: {metadata}")
else:
    print("❌ Model file not found. Please train the model first.")


In [None]:
# Prepare data for evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# Separate features and target
X = df.drop(columns=['label'])
y = df['label']

# Split data (assumes train_ml.py uses the same test_size, random_state, and stratification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print("Test set class distribution:")
print(y_test.value_counts())
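
To see what `stratify=y` buys on imbalanced data, here is a small self-contained check on toy labels (the arrays are made up for illustration). With a 10% anomaly rate, the stratified split keeps that rate in both the training and the test partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 90 normal (0) and 10 anomalous (1) samples.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Stratification preserves the class proportions in both partitions.
print("train anomaly rate:", y_tr_toy.mean())
print("test anomaly rate:", y_te_toy.mean())
```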


In [None]:
# Load preprocessing components
scaler_path = Path("../models/scaler.joblib")
label_encoder_path = Path("../models/label_encoder_label.joblib")

scaler = None
label_encoder = None

if scaler_path.exists():
    scaler = joblib.load(scaler_path)
    print("✅ Scaler loaded")
    
if label_encoder_path.exists():
    label_encoder = joblib.load(label_encoder_path)
    print("✅ Label encoder loaded")

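# Apply the saved scaler to the test data.
# Note: if preprocess.py already scaled the features in processed.csv,
# transforming again would scale them twice; skip this step in that case.
if scaler is not None:
    X_test_scaled = scaler.transform(X_test)
else:
    X_test_scaled = X_test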

print(f"Test data shape after scaling: {X_test_scaled.shape}")
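
The cell above only calls `transform` on a previously saved scaler. As a minimal sketch of why that matters, the example below fits a `StandardScaler` on training data only and reuses the fitted statistics for the test data, so no information from the test set leaks into preprocessing. The random arrays are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices drawn from the same distribution.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only, then reuse the fitted mean and scale.
demo_scaler = StandardScaler().fit(X_train_demo)
X_train_demo_scaled = demo_scaler.transform(X_train_demo)
X_test_demo_scaled = demo_scaler.transform(X_test_demo)

# Training columns are centered exactly; test columns only approximately.
print(X_train_demo_scaled.mean(axis=0).round(6))
print(X_test_demo_scaled.mean(axis=0).round(2))
```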


In [None]:
# Make predictions
if model_path.exists():
    # Predictions
    y_pred = model.predict(X_test_scaled)
    
    # Prediction probabilities (if available)
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test_scaled)
        y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of positive class
    else:
        # No predict_proba: fall back to hard labels (not true probabilities);
        # the ROC cells below skip this case
        y_pred_proba_positive = y_pred
    
    print("✅ Predictions made successfully!")
    print(f"Prediction shape: {y_pred.shape}")
else:
    print("❌ Cannot make predictions - model not loaded")
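
Some classifiers (for example `LinearSVC`) expose `decision_function` instead of `predict_proba`. One possible way to get a continuous score in either case is sketched below; `positive_class_scores` is a hypothetical helper, not part of this project, and it assumes binary labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC


def positive_class_scores(fitted_model, features):
    """Return one continuous score per sample for the positive class."""
    if hasattr(fitted_model, "predict_proba"):
        return fitted_model.predict_proba(features)[:, 1]
    if hasattr(fitted_model, "decision_function"):
        # Signed distance to the decision boundary; valid input for ROC curves.
        return fitted_model.decision_function(features)
    # Last resort: hard labels, which give only a single ROC operating point.
    return fitted_model.predict(features)


# Synthetic binary data as a stand-in (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)
svm_demo = LinearSVC(max_iter=5000).fit(X_demo, y_demo)

rf_scores = positive_class_scores(rf_demo, X_demo)    # probabilities in [0, 1]
svm_scores = positive_class_scores(svm_demo, X_demo)  # unbounded margin scores
print(rf_scores[:3], svm_scores[:3])
```

ROC AUC depends only on the ranking of the scores, so margin values from `decision_function` work as well as probabilities for that metric.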


## 4. Model Evaluation and Visualization

Let's evaluate the model performance and create visualizations.


In [None]:
# Calculate evaluation metrics
if model_path.exists():
    # Classification report
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)
    
    # ROC AUC score
    if hasattr(model, 'predict_proba'):
        roc_auc = roc_auc_score(y_test, y_pred_proba_positive)
        print(f"\nROC AUC Score: {roc_auc:.4f}")
    else:
        print("\nROC AUC not available (model doesn't support predict_proba)")
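
As a worked example of how the numbers in the classification report relate to the confusion matrix, the toy labels below (invented for illustration) yield 5 true negatives, 1 false positive, 1 false negative, and 3 true positives. Precision, recall, and the false alarm rate follow directly from those four counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy ground truth and predictions: 0 = normal, 1 = anomaly.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)         # 3 / 4: share of alerts that are real anomalies
recall = tp / (tp + fn)            # 3 / 4: share of anomalies that were caught
false_alarm_rate = fp / (fp + tn)  # 1 / 6: share of normal traffic flagged

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"false_alarm_rate={false_alarm_rate:.3f}")
```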


In [None]:
# Plot confusion matrix
if model_path.exists():
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Normal', 'Anomaly'], 
                yticklabels=['Normal', 'Anomaly'])
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()


In [None]:
# Plot ROC curve
if model_path.exists() and hasattr(model, 'predict_proba'):
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba_positive)
    
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.grid(True, alpha=0.3)
    plt.show()
else:
    print("ROC curve not available - model is missing or doesn't support probability predictions")
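
To make the link between the ROC curve and the AUC value concrete, the sketch below computes the curve for four toy scores and integrates it with the trapezoidal rule. The result should match `roc_auc_score` on the same inputs. The labels and scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Four toy samples: two normal (0) and two anomalous (1).
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy)

# Area under the piecewise-linear curve via the trapezoidal rule.
auc_manual = float(np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2))
auc_sklearn = roc_auc_score(y_true_toy, scores_toy)

print(f"manual AUC = {auc_manual:.2f}, roc_auc_score = {auc_sklearn:.2f}")
```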


## 5. Feature Importance Analysis

Let's examine which features are most important for the model's predictions.


In [None]:
# Feature importance (if available)
if model_path.exists() and hasattr(model, 'feature_importances_'):
    feature_importance = model.feature_importances_
    feature_names = X.columns
    
    # Create feature importance DataFrame
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': feature_importance
    }).sort_values('importance', ascending=False)
    
    print("Top 10 Most Important Features:")
    print(importance_df.head(10))
    
    # Plot feature importance
    plt.figure(figsize=(10, 8))
    top_features = importance_df.head(15)
    plt.barh(range(len(top_features)), top_features['importance'])
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Feature Importance')
    plt.title('Top 15 Most Important Features')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
    
elif model_path.exists():
    print("Feature importance not available for this model type")
else:
    print("Model not loaded - cannot analyze feature importance")
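
When the saved model has no `feature_importances_` attribute (an SVM, for instance), permutation importance is one model-agnostic alternative. It is a different technique from the impurity-based importances above: it measures how much a score drops when a single feature column is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the dataset and model settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with 3 informative features out of 6 (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

svc_demo = SVC().fit(X_tr, y_tr)

# Shuffle each feature column several times and record the accuracy drop.
perm_result = permutation_importance(
    svc_demo, X_te, y_te, n_repeats=5, random_state=0
)

# Features ordered from most to least important.
ranking = perm_result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {perm_result.importances_mean[idx]:.3f} "
          f"+/- {perm_result.importances_std[idx]:.3f}")
```

Computing the importances on held-out data, as done here, reflects what the model relies on for generalization rather than what it memorized during training.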


## 6. Summary and Next Steps

### Key Findings:
1. **Model Performance**: [To be filled based on actual results]
2. **Feature Importance**: [To be filled based on actual results]
3. **ROC AUC Score**: [To be filled based on actual results]

### Next Steps:
1. **Hyperparameter Tuning**: Optimize model parameters for better performance
2. **Ensemble Methods**: Try combining multiple models for improved accuracy
3. **Feature Engineering**: Create new features based on domain knowledge
4. **Deep Learning**: Explore neural network approaches
5. **Cross-Validation**: Implement k-fold cross-validation for robust evaluation
6. **Model Deployment**: Prepare the best model for production use
7. **Real-time Monitoring**: Set up monitoring for model performance in production
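
As a starting point for the cross-validation step listed above, here is a minimal sketch of stratified 5-fold evaluation with scikit-learn. It runs on synthetic data and a Random Forest with illustrative settings; neither is taken from the project's training script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the processed dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the anomaly rate roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="roc_auc",
)

print(f"ROC AUC per fold: {cv_scores.round(3)}")
print(f"mean={cv_scores.mean():.3f} std={cv_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how sensitive the score is to the particular split, which a single 80/20 holdout cannot show.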
