# Network Model 2 - Multiclass IDS Training

This notebook trains a production-grade multiclass Intrusion Detection System (IDS) model using network flow features from `train.parquet`. The model detects the following attack classes:

- BENIGN
- DoS Hulk
- DDoS
- PortScan
- DoS GoldenEye
- FTP-Patator
- DoS slowloris
- DoS Slowhttptest
- SSH-Patator
- Bot
- Web Attack ‚Äì Brute Force
- Web Attack ‚Äì XSS

**Author**: Senior ML Engineer + Cybersecurity Data Scientist
**Date**: January 2025
**Model**: LightGBM Multiclass Classifier

## 1. Setup & Install

Install required libraries and verify versions. This ensures reproducibility across different Colab environments.

In [1]:
# Install required libraries
%pip install lightgbm scikit-learn pandas pyarrow joblib matplotlib seaborn plotly

# Optional: Install optuna for hyperparameter optimization
%pip install optuna

print("Libraries installed successfully!")

Collecting optuna
  Downloading optuna-4.6.0-py3-none-any.whl.metadata (17 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading optuna-4.6.0-py3-none-any.whl (404 kB)
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m404.7/404.7 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading colorlog-6.10.1-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, optuna
Successfully installed colorlog-6.10.1 optuna-4.6.0
Libraries installed successfully!


In [2]:
# Import statements and version check
import pandas as pd
import numpy as np
import lightgbm as lgb
import joblib
import json
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
import warnings
warnings.filterwarnings('ignore')

# Optional: Optuna for hyperparameter tuning
try:
    import optuna
    OPTUNA_AVAILABLE = True
    print("Optuna available for hyperparameter optimization")
except ImportError:
    OPTUNA_AVAILABLE = False
    print("Optuna not available, using default hyperparameters")

# Version information
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"lightgbm: {lgb.__version__}")
print(f"scikit-learn: {pd.__version__}")  # sklearn version

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print(f"Random seed set to: {RANDOM_SEED}")

Optuna available for hyperparameter optimization
pandas: 2.2.2
numpy: 2.0.2
lightgbm: 4.6.0
scikit-learn: 2.2.2
Random seed set to: 42


## 2. Load Data

Load the training data from `train.parquet`. You have two options:
- **Option A**: Upload the file directly to Colab
- **Option B**: Mount Google Drive and load from a specified path

Choose one option below and comment out the other.

In [4]:
# Option A: Upload file directly to Colab
# Uncomment the lines below if using direct upload

from google.colab import files
uploaded = files.upload()
data_path = list(uploaded.keys())[0]  # Assumes train.parquet is uploaded
print(f"File uploaded: {data_path}")

# Option B: Mount Google Drive (recommended for large files)
# Uncomment the lines below if using Google Drive

# from google.colab import drive
# drive.mount('/content/drive')

# # Specify the path to your train.parquet file in Google Drive
# # Update this path to match your Drive structure
# data_path = '/content/drive/MyDrive/train.parquet'  # Change this to your actual path

# print(f"Using data from: {data_path}")

KeyboardInterrupt: 

In [None]:
# Load the parquet file
print("Loading train.parquet...")
df = pd.read_parquet(data_path)

print("Data loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Number of columns: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display basic information
print("\nColumn information:")
print(df.dtypes.value_counts())
print("\nFirst few columns:")
print(df.dtypes.head(10))

# Check for label column (common names)
label_candidates = ['Label', 'label', 'class', 'target', 'attack_type']
label_col = None
for col in label_candidates:
    if col in df.columns:
        label_col = col
        break

if label_col:
    print(f"\nDetected label column: '{label_col}'")
    print(f"Unique labels: {df[label_col].nunique()}")
    print(f"Sample labels: {df[label_col].unique()[:10]}")
else:
    print("\nWarning: Could not automatically detect label column.")
    print("Available columns:", list(df.columns))
    # Manually set if needed
    label_col = 'Label'  # Change this if your label column has a different name

## 3. Label Filtering & Distribution

Filter the dataset to include only the target attack classes. We need to handle potential unicode variations in "Web Attack" labels.

In [None]:
# Define target labels (normalized)
TARGET_LABELS = [
    'BENIGN',
    'DoS Hulk',
    'DDoS',
    'PortScan',
    'DoS GoldenEye',
    'FTP-Patator',
    'DoS slowloris',
    'DoS Slowhttptest',
    'SSH-Patator',
    'Bot',
    'Web Attack ‚Äì Brute Force',  # Note: en-dash
    'Web Attack ‚Äì XSS'  # Note: en-dash
]

# Alternative variations (handle different unicode dashes)
LABEL_MAPPINGS = {
    'Web Attack - Brute Force': 'Web Attack ‚Äì Brute Force',  # hyphen to en-dash
    'Web Attack ‚Äî Brute Force': 'Web Attack ‚Äì Brute Force',  # em-dash to en-dash
    'Web Attack - XSS': 'Web Attack ‚Äì XSS',
    'Web Attack ‚Äî XSS': 'Web Attack ‚Äì XSS',
    'WEB ATTACK ‚Äì BRUTE FORCE': 'Web Attack ‚Äì Brute Force',
    'WEB ATTACK ‚Äì XSS': 'Web Attack ‚Äì XSS',
    'Web Attack Brute Force': 'Web Attack ‚Äì Brute Force',
    'Web Attack XSS': 'Web Attack ‚Äì XSS',
}

print("Target labels to keep:")
for label in TARGET_LABELS:
    print(f"  - {label}")

# Check original labels in dataset
original_labels = df[label_col].unique()
print(f"\nOriginal labels in dataset ({len(original_labels)}):")
for label in sorted(original_labels):
    print(f"  - '{label}'")

# Normalize labels using mapping
df[label_col] = df[label_col].map(LABEL_MAPPINGS).fillna(df[label_col])

# Filter to target labels only
df_filtered = df[df[label_col].isin(TARGET_LABELS)].copy()

print(f"\nAfter filtering:")
print(f"  Original dataset: {df.shape[0]} rows")
print(f"  Filtered dataset: {df_filtered.shape[0]} rows")
print(f"  Removed: {df.shape[0] - df_filtered.shape[0]} rows")

# Check if all target labels are present
present_labels = df_filtered[label_col].unique()
missing_labels = set(TARGET_LABELS) - set(present_labels)

if missing_labels:
    print(f"\nWarning: Missing target labels: {missing_labels}")
else:
    print("\nAll target labels present in filtered dataset.")

In [None]:
# Analyze class distribution
label_counts = df_filtered[label_col].value_counts()
label_percentages = df_filtered[label_col].value_counts(normalize=True) * 100

print("Class distribution:")
print("=" * 50)
for label in TARGET_LABELS:
    if label in label_counts.index:
        count = label_counts[label]
        percentage = label_percentages[label]
        print(f"{label:<25} {count:>8} ({percentage:>6.2f}%)")
    else:
        print(f"{label:<25} {0:>8} ({0.00:>6.2f}%)")

# Create distribution plot
plt.figure(figsize=(12, 6))
ax = label_counts.plot(kind='bar', color='skyblue')
plt.title('Class Distribution After Filtering', fontsize=14, fontweight='bold')
plt.xlabel('Attack Type', fontsize=12)
plt.ylabel('Number of Samples', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(label_counts):
    ax.text(i, v + max(label_counts) * 0.01, f'{v:,}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Check for class imbalance
max_class = label_counts.max()
min_class = label_counts.min()
imbalance_ratio = max_class / min_class

print(f"\nClass imbalance analysis:")
print(f"  Largest class: {max_class:,} samples")
print(f"  Smallest class: {min_class:,} samples")
print(f"  Imbalance ratio: {imbalance_ratio:.1f}x")

if imbalance_ratio > 10:
    print("  ‚ö†Ô∏è  Severe class imbalance detected - will use class weights")
elif imbalance_ratio > 5:
    print("  ‚ö†Ô∏è  Moderate class imbalance detected - will use class weights")
else:
    print("  ‚úì Class distribution is relatively balanced")

## 4. Data Cleaning (must do all)

Perform comprehensive data cleaning including:
- Clip invalid negatives to 0 for: FlowDuration, FlowBytes/s, FlowPackets/s (only if columns exist)
- Replace inf/-inf with NaN
- Drop fully-constant columns automatically
- Drop duplicate columns automatically (example: FwdHeaderLength.1 vs FwdHeaderLength if present)
- Impute missing values (median)
- Convert numeric features to float32 for memory efficiency

In [None]:
# Work with filtered dataset
df_clean = df_filtered.copy()
print(f"Starting with {df_clean.shape[0]} rows and {df_clean.shape[1]} columns")

# Step 1: Clip invalid negatives to 0
columns_to_clip = ['Flow Duration', 'Flow Bytes/s', 'Flow Packets/s']
clipped_cols = []

for col in columns_to_clip:
    if col in df_clean.columns:
        negative_count = (df_clean[col] < 0).sum()
        if negative_count > 0:
            df_clean[col] = df_clean[col].clip(lower=0)
            clipped_cols.append(col)
            print(f"Clipped {negative_count} negative values in '{col}'")

if not clipped_cols:
    print("No negative values found in flow-related columns")

# Step 2: Replace inf/-inf with NaN
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
inf_count_before = df_clean[numeric_cols].isin([np.inf, -np.inf]).sum().sum()

df_clean[numeric_cols] = df_clean[numeric_cols].replace([np.inf, -np.inf], np.nan)

inf_count_after = df_clean[numeric_cols].isin([np.inf, -np.inf]).sum().sum()
print(f"Replaced {inf_count_before} infinite values with NaN")

# Step 3: Drop fully-constant columns
constant_cols = []
for col in df_clean.columns:
    if col != label_col and df_clean[col].nunique() == 1:
        constant_cols.append(col)

if constant_cols:
    df_clean = df_clean.drop(columns=constant_cols)
    print(f"Dropped {len(constant_cols)} constant columns: {constant_cols}")
else:
    print("No constant columns found")

# Step 4: Drop duplicate columns (exact duplicates)
duplicate_cols = []
cols_to_check = [col for col in df_clean.columns if col != label_col]

for i, col1 in enumerate(cols_to_check):
    for col2 in cols_to_check[i+1:]:
        if df_clean[col1].equals(df_clean[col2]):
            duplicate_cols.append(col2)

if duplicate_cols:
    df_clean = df_clean.drop(columns=duplicate_cols)
    print(f"Dropped {len(duplicate_cols)} duplicate columns: {duplicate_cols}")
else:
    print("No duplicate columns found")

# Step 5: Impute missing values with median
missing_before = df_clean.isnull().sum().sum()
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns

for col in numeric_cols:
    if df_clean[col].isnull().any():
        median_val = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(median_val)

missing_after = df_clean.isnull().sum().sum()
print(f"Imputed {missing_before} missing values with column medians")

# Step 6: Convert numeric features to float32 for memory efficiency
for col in numeric_cols:
    if col != label_col:  # Don't convert label column
        df_clean[col] = df_clean[col].astype(np.float32)

print(f"Converted {len(numeric_cols)} numeric columns to float32")

# Final summary
print("\nData cleaning completed:")
print(f"  Final shape: {df_clean.shape}")
print(f"  Memory usage: {df_clean.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"  Missing values remaining: {df_clean.isnull().sum().sum()}")

## 5. Feature Set

Define the X columns automatically as all numeric columns excluding label columns.

In [None]:
# Define feature columns (all numeric columns except label)
feature_cols = [col for col in df_clean.select_dtypes(include=[np.number]).columns if col != label_col]

print(f"Identified {len(feature_cols)} feature columns")
print("\nFeature columns:")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i:2d}. {col}")

# Verify no label column in features
if label_col in feature_cols:
    raise ValueError(f"Label column '{label_col}' should not be in feature columns")

# Check for any remaining non-numeric columns
non_numeric = [col for col in df_clean.columns if col not in feature_cols and col != label_col]
if non_numeric:
    print(f"\nWarning: {len(non_numeric)} non-numeric columns will be excluded: {non_numeric}")

# Store feature list for later use
FEATURE_LIST = feature_cols

# Display feature statistics
X_summary = df_clean[feature_cols].describe().T
print(f"\nFeature summary (showing first 10):")
print(X_summary.head(10)[['count', 'mean', 'std', 'min', 'max']].round(3))

## 6. Train/Val/Test Split

Perform stratified split maintaining class distribution across splits. Using 70/15/15 split for train/validation/test.

In [None]:
# Prepare data for splitting
X = df_clean[feature_cols]
y = df_clean[label_col]

print(f"Feature matrix shape: {X.shape}")
print(f"Label vector shape: {y.shape}")

# First split: separate test set (15%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y,
    test_size=0.15,
    stratify=y,
    random_state=RANDOM_SEED
)

# Second split: separate validation from remaining (15% of total = 17.65% of remaining)
val_size = 0.1765  # 15% / 85% ‚âà 0.1765
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp,
    test_size=val_size,
    stratify=y_temp,
    random_state=RANDOM_SEED
)

print("\nSplit sizes:")
print(f"  Train: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  Validation: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"  Test: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"  Total: {len(X)} samples")

# Verify stratification
def print_class_distribution(y_data, title):
    counts = y_data.value_counts()
    percentages = (counts / len(y_data) * 100).round(2)
    print(f"\n{title} class distribution:")
    for label in TARGET_LABELS:
        if label in counts.index:
            print(f"  {label:<25} {counts[label]:>6} ({percentages[label]:>5.2f}%)")
        else:
            print(f"  {label:<25} {0:>6} ({0.00:>5.2f}%)")

print_class_distribution(y_train, "Train set")
print_class_distribution(y_val, "Validation set")
print_class_distribution(y_test, "Test set")

# Verify no data leakage
train_indices = set(X_train.index)
val_indices = set(X_val.index)
test_indices = set(X_test.index)

assert len(train_indices & val_indices) == 0, "Data leakage between train and validation"
assert len(train_indices & test_indices) == 0, "Data leakage between train and test"
assert len(val_indices & test_indices) == 0, "Data leakage between validation and test"

print("\n‚úì No data leakage detected between splits")

## 7. Modeling (Primary: LightGBM multiclass)

Train a LightGBM multiclass classifier with class weights to handle imbalance and early stopping for optimal performance.

In [None]:
# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)

# Create label mapping
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
reverse_label_mapping = dict(zip(label_encoder.transform(label_encoder.classes_), label_encoder.classes_))

print("Label encoding:")
for label, code in label_mapping.items():
    print(f"  {code}: {label}")

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train_encoded), y=y_train_encoded)
class_weight_dict = dict(zip(np.unique(y_train_encoded), class_weights))

print(f"\nClass weights: {class_weight_dict}")

# Convert to sample weights for LightGBM
sample_weights = np.array([class_weight_dict[class_] for class_ in y_train_encoded])

print(f"Sample weights shape: {sample_weights.shape}")
print(f"Sample weights range: {sample_weights.min():.3f} - {sample_weights.max():.3f}")

In [None]:
# Optional: Hyperparameter tuning with Optuna (fast version)
DO_HYPERPARAMETER_TUNING = OPTUNA_AVAILABLE and X_train.shape[0] > 100000  # Only if dataset is large enough

if DO_HYPERPARAMETER_TUNING:
    print("Performing hyperparameter tuning with Optuna...")
    
    # Sample a subset for faster tuning
    sample_size = min(200000, len(X_train))
    sample_indices = np.random.choice(len(X_train), sample_size, replace=False)
    X_sample = X_train.iloc[sample_indices]
    y_sample = y_train_encoded[sample_indices]
    weights_sample = sample_weights[sample_indices]
    
    print(f"Using {sample_size} samples for hyperparameter tuning")
    
    def objective(trial):
        params = {
            'objective': 'multiclass',
            'num_class': len(TARGET_LABELS),
            'metric': 'multi_logloss',
            'boosting_type': 'gbdt',
            'verbosity': -1,
            'seed': RANDOM_SEED,
            'num_leaves': trial.suggest_int('num_leaves', 20, 100),
            'max_depth': trial.suggest_int('max_depth', 6, 15),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'n_estimators': 1000,  # Will be controlled by early stopping
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 1.0, log=True),
            'reg_lambda': trial.suggest_float('reg_alpha', 1e-5, 1.0, log=True),
        }
        
        # Create datasets
        train_dataset = lgb.Dataset(X_sample, y_sample, weight=weights_sample)
        val_dataset = lgb.Dataset(X_val, y_val_encoded, reference=train_dataset)
        
        # Train with early stopping
        model = lgb.train(
            params,
            train_dataset,
            valid_sets=[val_dataset],
            callbacks=[
                lgb.early_stopping(50, verbose=False),
                lgb.log_evaluation(0)
            ]
        )
        
        # Return best score
        return model.best_score['valid_0']['multi_logloss']
    
    # Run optimization
    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=20, timeout=600)  # 20 trials, max 10 minutes
    
    best_params = study.best_params
    print(f"Best hyperparameters: {best_params}")
    
else:
    print("Using default hyperparameters (tuning skipped)")
    best_params = {}

In [None]:
# Set final model parameters
default_params = {
    'objective': 'multiclass',
    'num_class': len(TARGET_LABELS),
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'verbosity': 1,
    'seed': RANDOM_SEED,
    'num_leaves': 50,
    'max_depth': 10,
    'learning_rate': 0.1,
    'n_estimators': 1000,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
}

# Merge with tuned parameters if available
final_params = {**default_params, **best_params}

print("Final model parameters:")
for key, value in final_params.items():
    print(f"  {key}: {value}")

# Create LightGBM datasets
train_dataset = lgb.Dataset(X_train, y_train_encoded, weight=sample_weights)
val_dataset = lgb.Dataset(X_val, y_val_encoded, reference=train_dataset)

print("\nTraining LightGBM model...")

# Train the model
model = lgb.train(
    final_params,
    train_dataset,
    valid_sets=[train_dataset, val_dataset],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(50, verbose=True),
        lgb.log_evaluation(50)
    ]
)

print(f"\nTraining completed after {model.best_iteration} iterations")
print(f"Best validation score: {model.best_score['valid']['multi_logloss']:.4f}")

## 8. Evaluation (must include)

Evaluate the trained model on the test set using multiple metrics including macro and weighted F1 scores, per-class metrics, and confusion matrix.

In [None]:
# Make predictions on test set
print("Generating predictions on test set...")
y_pred_encoded = model.predict(X_test, num_iteration=model.best_iteration)
y_pred = np.argmax(y_pred_encoded, axis=1)
y_pred_labels = label_encoder.inverse_transform(y_pred)

# Get prediction probabilities
y_pred_proba = y_pred_encoded

# Calculate overall metrics
macro_f1 = f1_score(y_test_encoded, y_pred, average='macro')
weighted_f1 = f1_score(y_test_encoded, y_pred, average='weighted')

print(f"\nOverall Metrics:")
print(f"  Macro F1 Score: {macro_f1:.4f}")
print(f"  Weighted F1 Score: {weighted_f1:.4f}")

# Generate classification report
print("\nDetailed Classification Report:")
report = classification_report(y_test_encoded, y_pred, target_names=TARGET_LABELS, output_dict=True)

# Convert to DataFrame for better display
report_df = pd.DataFrame(report).transpose()
report_df = report_df.round(4)
print(report_df.to_string())

# Store metrics for export
evaluation_metrics = {
    'macro_f1': macro_f1,
    'weighted_f1': weighted_f1,
    'accuracy': report['accuracy'],
    'per_class_metrics': {}
}

# Add per-class metrics
for i, label in enumerate(TARGET_LABELS):
    if label in report:
        evaluation_metrics['per_class_metrics'][label] = {
            'precision': report[label]['precision'],
            'recall': report[label]['recall'],
            'f1-score': report[label]['f1-score'],
            'support': report[label]['support']
        }

In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test_encoded, y_pred)

# Plot confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=TARGET_LABELS, yticklabels=TARGET_LABELS)
plt.title('Confusion Matrix - Test Set', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Analyze confusion matrix
print("\nConfusion Matrix Analysis:")

# Most confused pairs
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
np.fill_diagonal(cm_normalized, 0)  # Remove diagonal

# Find most confused pairs
max_conf_idx = np.unravel_index(np.argmax(cm_normalized), cm_normalized.shape)
true_class = TARGET_LABELS[max_conf_idx[0]]
pred_class = TARGET_LABELS[max_conf_idx[1]]
conf_rate = cm_normalized[max_conf_idx] * 100

print(f"Most confused pair: {true_class} ‚Üí {pred_class} ({conf_rate:.1f}% of {true_class} samples)")

# Per-class accuracy
class_accuracy = np.diag(cm) / np.sum(cm, axis=1)
print("\nPer-class accuracy:")
for i, label in enumerate(TARGET_LABELS):
    acc = class_accuracy[i] * 100
    print(f"  {label:<25} {acc:>6.2f}%")

In [None]:
# Performance interpretation
print("\nModel Performance Interpretation:")
print("=" * 50)

# Identify hardest classes
f1_scores = [report[label]['f1-score'] for label in TARGET_LABELS if label in report]
min_f1_idx = np.argmin(f1_scores)
max_f1_idx = np.argmax(f1_scores)

hardest_class = TARGET_LABELS[min_f1_idx]
easiest_class = TARGET_LABELS[max_f1_idx]

print(f"Best performing class: {easiest_class} (F1 = {f1_scores[max_f1_idx]:.4f})")
print(f"Worst performing class: {hardest_class} (F1 = {f1_scores[min_f1_idx]:.4f})")

# Analyze class imbalance impact
support_values = [report[label]['support'] for label in TARGET_LABELS if label in report]
min_support = min(support_values)
max_support = max(support_values)

print(f"\nClass imbalance analysis:")
print(f"  Sample size range: {min_support} - {max_support} ({max_support/min_support:.1f}x ratio)")

if max_support / min_support > 10:
    print("  ‚ö†Ô∏è  Severe class imbalance likely contributes to poor performance on minority classes")
elif max_support / min_support > 5:
    print("  ‚ö†Ô∏è  Moderate class imbalance may affect minority class performance")
else:
    print("  ‚úì Class distribution is relatively balanced")

# Overall assessment
if macro_f1 > 0.9:
    print("\nüéâ Excellent model performance!")
elif macro_f1 > 0.8:
    print("\n‚úÖ Good model performance")
elif macro_f1 > 0.7:
    print("\n‚ö†Ô∏è  Acceptable model performance - may need improvement")
else:
    print("\n‚ùå Poor model performance - requires significant improvement")

## 9. Feature Importance

Analyze and export the most important features for model interpretability.

In [None]:
# Get feature importance
feature_importance = model.feature_importance(importance_type='gain')
feature_names = FEATURE_LIST

# Create DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
})

# Sort by importance
importance_df = importance_df.sort_values('importance', ascending=False).reset_index(drop=True)

print(f"Top 20 most important features:")
print("=" * 50)
for i, row in importance_df.head(20).iterrows():
    print(f"{i+1:2d}. {row['feature']:<30} {row['importance']:>10.2f}")

# Plot top 20 features
plt.figure(figsize=(12, 8))
top_20 = importance_df.head(20)
plt.barh(range(len(top_20)), top_20['importance'][::-1])
plt.yticks(range(len(top_20)), top_20['feature'][::-1])
plt.xlabel('Feature Importance (Gain)')
plt.ylabel('Features')
plt.title('Top 20 Most Important Features', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# Export top 50 features
top_50_features = importance_df.head(50)
print(f"\nExporting top {len(top_50_features)} features for analysis")

## 10. Export Artifacts (to Drive and to local download)

Save all model artifacts to both Google Drive and local download.

In [None]:
# Create artifacts directory in Drive
artifacts_dir = '/content/drive/MyDrive/ids_artifacts/'
!mkdir -p "$artifacts_dir"

print(f"Saving artifacts to: {artifacts_dir}")

# 1. Save model
model_path = f"{artifacts_dir}model.joblib"
joblib.dump(model, model_path)
print(f"‚úì Model saved: {model_path}")

# 2. Save label mapping
label_map_path = f"{artifacts_dir}label_map.json"
with open(label_map_path, 'w') as f:
    json.dump(label_mapping, f, indent=2)
print(f"‚úì Label mapping saved: {label_map_path}")

# 3. Save feature list
feature_list_path = f"{artifacts_dir}feature_list.json"
with open(feature_list_path, 'w') as f:
    json.dump(FEATURE_LIST, f, indent=2)
print(f"‚úì Feature list saved: {feature_list_path}")

# 4. Save evaluation metrics
metrics_path = f"{artifacts_dir}metrics.json"
with open(metrics_path, 'w') as f:
    json.dump(evaluation_metrics, f, indent=2)
print(f"‚úì Metrics saved: {metrics_path}")

# 5. Save feature importance
importance_path = f"{artifacts_dir}feature_importance.csv"
importance_df.to_csv(importance_path, index=False)
print(f"‚úì Feature importance saved: {importance_path}")

# Also save to local Colab files for download
local_artifacts_dir = '/content/artifacts/'
!mkdir -p "$local_artifacts_dir"

joblib.dump(model, f"{local_artifacts_dir}model.joblib")
with open(f"{local_artifacts_dir}label_map.json", 'w') as f:
    json.dump(label_mapping, f, indent=2)
with open(f"{local_artifacts_dir}feature_list.json", 'w') as f:
    json.dump(FEATURE_LIST, f, indent=2)
with open(f"{local_artifacts_dir}metrics.json", 'w') as f:
    json.dump(evaluation_metrics, f, indent=2)
importance_df.to_csv(f"{local_artifacts_dir}feature_importance.csv", index=False)

print(f"\n‚úì All artifacts also saved locally to: {local_artifacts_dir}")

# Create zip file for easy download
zip_path = '/content/ids_artifacts.zip'
!zip -r "$zip_path" "$local_artifacts_dir"
print(f"‚úì Created zip archive: {zip_path}")

# List files
print("\nArtifacts created:")
!ls -la "$local_artifacts_dir"

## 11. Inference Demo Cell

Demonstrate how to load the saved model and use it for predictions.

In [None]:
# Load saved artifacts (simulate production usage)
print("Loading saved model and artifacts for inference demo...")

# Load model
loaded_model = joblib.load(f"{local_artifacts_dir}model.joblib")
print("‚úì Model loaded")

# Load mappings
with open(f"{local_artifacts_dir}label_map.json", 'r') as f:
    loaded_label_map = json.load(f)
print("‚úì Label mapping loaded")

with open(f"{local_artifacts_dir}feature_list.json", 'r') as f:
    loaded_feature_list = json.load(f)
print("‚úì Feature list loaded")

# Create reverse mapping for predictions
reverse_mapping = {v: k for k, v in loaded_label_map.items()}

# Prepare sample data for prediction
sample_size = min(100, len(X_test))
X_sample = X_test.head(sample_size)
y_sample_true = y_test.head(sample_size)

print(f"\nRunning inference on {sample_size} sample instances...")

# Make predictions
predictions_proba = loaded_model.predict(X_sample, num_iteration=loaded_model.best_iteration)
predictions_encoded = np.argmax(predictions_proba, axis=1)
predictions_labels = [reverse_mapping[pred] for pred in predictions_encoded]

# Create results DataFrame
results_df = pd.DataFrame({
    'True_Label': y_sample_true.values,
    'Predicted_Label': predictions_labels,
    'Correct': y_sample_true.values == predictions_labels
})

# Add confidence scores (probability of predicted class)
confidence_scores = np.max(predictions_proba, axis=1)
results_df['Confidence'] = confidence_scores

print("\nPrediction Results:")
print("=" * 50)
print(results_df.head(10).to_string(index=False))

# Show probability distribution for first sample
print(f"\nProbability distribution for first sample:")
print(f"True label: {results_df.iloc[0]['True_Label']}")
print(f"Predicted: {results_df.iloc[0]['Predicted_Label']} (confidence: {results_df.iloc[0]['Confidence']:.4f})")
print("\nClass probabilities:")
for i, prob in enumerate(predictions_proba[0]):
    class_name = reverse_mapping[i]
    marker = " ‚Üê" if i == predictions_encoded[0] else ""
    print(f"  {class_name:<25} {prob:>7.4f}{marker}")

# Calculate accuracy on sample
sample_accuracy = results_df['Correct'].mean()
print(f"\nSample accuracy: {sample_accuracy:.4f} ({results_df['Correct'].sum()}/{len(results_df)} correct)")

print("\n‚úì Inference demo completed successfully!")
print("\nTo use this model in production:")
print("1. Load the model: joblib.load('model.joblib')")
print("2. Load feature list and ensure input data has these columns")
print("3. Load label mapping to convert predictions to class names")
print("4. Use model.predict() for predictions or model.predict_proba() for probabilities")

# Training Complete!

**Summary:**
- ‚úÖ Trained multiclass LightGBM model for IDS
- ‚úÖ Handles 12 attack classes including BENIGN
- ‚úÖ Uses class weights to handle imbalance
- ‚úÖ Includes early stopping and hyperparameter tuning
- ‚úÖ Comprehensive evaluation with multiple metrics
- ‚úÖ All artifacts exported and ready for deployment

**Next Steps:**
1. Download the artifacts zip file
2. Deploy model to production environment
3. Set up monitoring and retraining pipeline
4. Consider model compression for edge deployment

**Model Performance:**
- Macro F1: {macro_f1:.4f}
- Weighted F1: {weighted_f1:.4f}
- Best class: {easiest_class}
- Needs improvement: {hardest_class}

**Files Generated:**
- `model.joblib` - Trained LightGBM model
- `feature_list.json` - List of features used
- `label_map.json` - Class name to ID mapping
- `metrics.json` - Evaluation metrics
- `feature_importance.csv` - Feature importance rankings