# Random Forest Training on CICIDS2017 Dataset

This notebook trains a Random Forest classifier on the CICIDS2017 intrusion detection dataset.

**Key Features:**
- SMOTE balancing applied within CV pipeline (prevents data leakage)
- Comprehensive preprocessing and feature engineering
- Data leakage diagnostics
- Feature importance analysis
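
The first item is the easiest one to get wrong, so here is a minimal sketch of what "balancing within the CV pipeline" means. It is not the project's implementation: plain random duplication stands in for SMOTE (which would interpolate new minority samples instead), and `oversample` and `balanced_folds` are hypothetical helper names used only for illustration. The point is that resampling touches the training part of each fold only, so no resampled copy of a validation row can end up in the training data.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size.

    Stand-in for SMOTE: SMOTE would synthesize interpolated points,
    here we only duplicate existing rows to keep the sketch short.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_folds(rows, labels, k=5):
    """Yield (train_rows, train_labels, val_rows, val_labels) per fold.

    Resampling is applied to the training part only; the validation
    part keeps its original, imbalanced distribution.
    """
    n = len(rows)
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        train_rows, train_labels = oversample(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            seed=fold,
        )
        val_rows = [rows[i] for i in val_idx]
        val_labels = [labels[i] for i in val_idx]
        yield train_rows, train_labels, val_rows, val_labels


# Toy data: 4 "attack" rows and 16 "benign" rows; each row is just an ID.
rows = list(range(20))
labels = ["attack" if i < 4 else "benign" for i in range(20)]

for train_rows, train_labels, val_rows, val_labels in balanced_folds(rows, labels):
    print(Counter(train_labels), Counter(val_labels))
```

With imbalanced-learn, the same effect is usually obtained by placing the sampler inside an `imblearn.pipeline.Pipeline`, which applies samplers during fitting only. Whether this project does so is not shown in this notebook.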

## 1. Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path
root_dir = os.getcwd().split("AdversarialNIDS")[0] + "AdversarialNIDS"
sys.path.append(root_dir)

from CICIDS2017.preprocessing.dataset import CICIDS2017
from UNSWNB15.preprocessing.dataset import UNSWNB15
from scripts.models.model_utils import (
    check_data_leakage,
    get_tree_feature_importance,
)

# Import model-specific modules
from scripts.models.random_forest.random_forest import train_random_forest

from scripts.logger import LoggerManager
from scripts.analysis.model_analysis import perform_model_analysis

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Imports successful")

## 2. Initialize Logger

In [None]:
logger = LoggerManager(log_name="rf_notebook").get_logger()
logger.info("Starting Random Forest training notebook")

## 3. Load and Preprocess Data

In [None]:
logger.info("Loading CICIDS2017 dataset...")
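# Build the dataset step by step: memory optimization, encoding, scaling,
# then a 100,000-row multi-class subset.
dataset = (
    CICIDS2017(logger=logger)
    .optimize_memory()
    .encode()
    .scale()
    .subset(size=100000, multi_class=True)
)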

## 4. Visualize Class Distribution

In [None]:
# TODO: plot the class distribution of the sampled subset

## 5. Data Leakage Diagnostics

In [None]:
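# Check for potential data leakage.
# Currently disabled: X and y are not defined anywhere in this notebook,
# so the call below would fail if uncommented as-is.
#diagnostics = check_data_leakage(X, y, logger=logger)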

## 6. Train/Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = dataset.split(test_size=0.2, apply_smote=True)
print(f"Train set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

## 7. Cross-Validation

### What does Step 7 (Cross-Validation) do?

In this step, we use 5-fold cross-validation to estimate how well our Random Forest pipeline will generalize to new, unseen data. The process works by splitting the training data into 5 parts (folds), training the model on 4 parts, and validating it on the remaining part. This is repeated so each part is used as a validation set once. The average accuracy across all folds gives us a robust measure of model performance before we evaluate on the true test set.

**Why is this important?**
- It helps detect overfitting or underfitting.
- It allows us to compare different model settings fairly.
- It provides a realistic estimate of how the model will perform in practice, using only the training data.
- **Importantly, the real test set is never touched during cross-validation.** This ensures our final evaluation is unbiased and reflects true out-of-sample performance.
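
The fold rotation described above can be sketched in a few lines of plain Python. This is an illustration only: `k_fold_indices` and `cross_validate` are hypothetical names, the folds are contiguous and unshuffled, and `score_fold` is an assumed callback that fits a model on the training indices and returns a validation score. How `train_random_forest` actually builds its folds is not shown in this notebook.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(score_fold, n_samples, k=5):
    """Return (mean score, per-fold scores).

    score_fold(train_idx, val_idx) is a hypothetical callback that trains
    on train_idx and returns a score measured on val_idx.
    """
    scores = [score_fold(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores), scores


# Every sample lands in exactly one validation fold.
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

In practice, shuffled and stratified folds are usually preferable for imbalanced intrusion-detection data, so that each fold keeps roughly the same class proportions.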

In [None]:
# Train Random Forest using standalone function and plot CV scores
logger.info("Training Random Forest with cross-validation using train_random_forest...")
rf_model, cv_scores = train_random_forest(
    X_train,
    y_train,
    n_estimators=10,
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=0,
    cv=5,
    class_weight='balanced',
    logger=None
)
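# Use an identity check: comparing a NumPy array to None with != is elementwise
# and makes the `if` raise on arrays with more than one element.
if cv_scores is not None: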
    print("\n" + "="*50)
    print("CROSS-VALIDATION RESULTS")
    print("="*50)
    print(f"CV Scores: {cv_scores}")
    print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
    # Plot CV scores
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(cv_scores)+1), cv_scores, marker='o', markersize=10, linewidth=2, color='green')
    plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', 
                label=f'Mean: {cv_scores.mean():.4f}')
    plt.xlabel('Fold', fontsize=12)
    plt.ylabel('Accuracy', fontsize=12)
    plt.title('Random Forest Cross-Validation Scores', fontsize=14, fontweight='bold')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 8. Evaluate on Test Set

In [None]:
# Evaluate model
cm, cr = perform_model_analysis(
    model=rf_model,
    X_test=X_test,
    y_test=y_test,
    logger=logger,
    model_name="RandomForest",
    dir=os.getcwd(),
    plot=True
)

## 9. Feature Importance Analysis

In [None]:
# Get feature importance
top_features = get_tree_feature_importance(
    rf_model,
    feature_names=list(dataset.data.columns),
    top_n=15,
    logger=logger
)

# Plot feature importance
features, importances = zip(*top_features)

plt.figure(figsize=(12, 8))
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(features)))
plt.barh(range(len(features)), importances, color=colors)
plt.yticks(range(len(features)), features)
plt.xlabel('Importance', fontsize=12)
plt.title('Top 15 Feature Importances', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()