# Random Forest Training on CICIDS2017 Dataset

This notebook trains a Random Forest classifier on the CICIDS2017 intrusion detection dataset.

**Key Features:**
- SMOTE balancing applied to the training split only (the test set is never resampled)
- Comprehensive preprocessing and feature engineering
- Data leakage diagnostics
- Feature importance analysis

## 1. Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..')))

from CICIDS2017.preprocessing.dataset import CICIDS2017
from scripts.models.model_utils import (
    evaluate_model,
    check_data_leakage,
    get_feature_importance,
    remove_low_variance_features    
)

# Import model-specific modules
from scripts.models.random_forest.random_forest import train_random_forest

from scripts.logger import LoggerManager

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")

## 2. Initialize Logger

In [None]:
logger = LoggerManager(log_name="rf_notebook").get_logger()
logger.info("Starting Random Forest training notebook")

## 3. Load and Preprocess Data

In [None]:
# Load dataset
logger.info("Loading CICIDS2017 dataset...")
dataset = CICIDS2017(logger=logger)
dataset.encode().optimize_memory()
data = dataset.data

print(f"Dataset shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nFirst few rows:")
data.head()

## 4. Sample Data

In [None]:
# Sample data for faster training
SAMPLE_SIZE = 200000

logger.info(f"Sampling {SAMPLE_SIZE} rows from dataset...")
data_sample = data.sample(n=min(SAMPLE_SIZE, len(data)), random_state=0)

print(f"Sampled data shape: {data_sample.shape}")

## 5. Prepare Features and Labels

In [None]:
# Split features and labels
X = data_sample.drop('Attack Type', axis=1)
y = data_sample['Attack Type']

# Remove known leakage features
leakage_features = ['Attack Number']  # Add other suspicious features here
existing_leakage = [f for f in leakage_features if f in X.columns]

if existing_leakage:
    logger.warning(f"🚨 REMOVING LEAKAGE FEATURES: {existing_leakage}")
    X = X.drop(columns=existing_leakage)

# Convert to numeric
X = X.apply(pd.to_numeric, errors='coerce')

# Handle missing values
n_missing = int(X.isnull().sum().sum())  # count once, reuse below
if n_missing > 0:
    logger.info(f"Filling {n_missing} missing values with 0")
    X = X.fillna(0)

print(f"Feature matrix shape: {X.shape}")
print(f"\nClass distribution:")
print(y.value_counts())

## 6. Visualize Class Distribution

In [None]:
# Plot class distribution
plt.figure(figsize=(12, 6))
y.value_counts().plot(kind='bar')
plt.title('Class Distribution (Before SMOTE)', fontsize=14, fontweight='bold')
plt.xlabel('Attack Type', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Class imbalance ratio
class_counts = y.value_counts()
imbalance_ratio = class_counts.min() / class_counts.max()
print(f"\nClass imbalance ratio: {imbalance_ratio:.4f}")
print(f"Most common class: {class_counts.idxmax()} ({class_counts.max()} samples)")
print(f"Least common class: {class_counts.idxmin()} ({class_counts.min()} samples)")

## 7. Data Leakage Diagnostics

In [None]:
# Check for potential data leakage
diagnostics = check_data_leakage(X, y, logger=logger)

# Display diagnostics
print("\n" + "="*50)
print("DATA LEAKAGE SUMMARY")
print("="*50)
print(f"Duplicates: {diagnostics['duplicates']} ({diagnostics['duplicate_pct']:.2f}%)")
print(f"High correlation features: {len(diagnostics['high_correlation_features'])}")
print(f"Constant features: {len(diagnostics['constant_features'])}")
print(f"Class balance ratio: {diagnostics['class_balance_ratio']:.4f}")
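
The internals of `check_data_leakage` are not shown in this notebook. As a rough illustration only, the sketch below computes the same kinds of signals with plain pandas. The function name `quick_leakage_checks`, the 0.95 threshold, and the choice to correlate features with an integer-encoded label are all assumptions, not the project's implementation.

```python
import pandas as pd


def quick_leakage_checks(X: pd.DataFrame, y: pd.Series, corr_threshold: float = 0.95) -> dict:
    """Hypothetical helper: cheap signals that often accompany data leakage."""
    # Exact duplicate feature rows can land on both sides of a train/test split.
    n_dup = int(X.duplicated().sum())

    # Columns with a single unique value carry no information.
    constant = [c for c in X.columns if X[c].nunique(dropna=False) <= 1]

    # Features almost perfectly correlated with the integer-encoded label (crude heuristic).
    y_codes = pd.Series(pd.factorize(y)[0], index=X.index)
    corr = X.drop(columns=constant).corrwith(y_codes).abs()
    suspicious = corr[corr > corr_threshold].index.tolist()

    counts = y.value_counts()
    return {
        "duplicates": n_dup,
        "duplicate_pct": 100.0 * n_dup / len(X),
        "constant_features": constant,
        "high_correlation_features": suspicious,
        "class_balance_ratio": counts.min() / counts.max(),
    }


# Tiny made-up frame (not CICIDS2017 data) to exercise the checks.
demo_X = pd.DataFrame({
    "a": [1, 2, 3, 4, 1],
    "b": [0, 0, 0, 0, 0],      # constant column
    "leak": [0, 0, 1, 1, 0],   # mirrors the label exactly
})
demo_y = pd.Series(["BENIGN", "BENIGN", "DoS", "DoS", "BENIGN"])
report = quick_leakage_checks(demo_X, demo_y)
print(report)
```

A flagged feature is a prompt for manual review, not proof of leakage: a legitimately strong predictor can also correlate highly with the label.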

## 8. Remove Low Variance Features

In [None]:
# Remove low variance features
X, removed_features = remove_low_variance_features(X, threshold=0.01, logger=logger)

print(f"\nRemoved {len(removed_features)} low variance features")
print(f"Remaining features: {X.shape[1]}")
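
For readers without access to `remove_low_variance_features`, the idea can be sketched in a few lines of pandas. This is a hypothetical stand-in (`drop_low_variance` is an illustrative name) that assumes the helper drops columns whose variance does not exceed the threshold; the real helper may behave differently.

```python
import pandas as pd


def drop_low_variance(X: pd.DataFrame, threshold: float = 0.01):
    """Hypothetical stand-in: drop columns whose variance is <= threshold."""
    variances = X.var(axis=0, ddof=0)  # population variance per column
    removed = variances[variances <= threshold].index.tolist()
    return X.drop(columns=removed), removed


# Made-up values for illustration only.
demo = pd.DataFrame({
    "flow_duration": [10.0, 250.0, 30.0, 900.0],
    "always_zero": [0.0, 0.0, 0.0, 0.0],
    "nearly_const": [1.00, 1.01, 1.00, 1.00],
})
kept, removed_cols = drop_low_variance(demo, threshold=0.01)
print(f"Removed: {removed_cols}")
print(f"Kept: {list(kept.columns)}")
```

Raw variance depends on each feature's scale, so a fixed threshold such as 0.01 treats a feature measured in microseconds very differently from one measured as a ratio in [0, 1].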

## 9. Train/Test Split

In [None]:
# Remove classes with fewer than 2 samples (needed for stratified split)
class_counts = y.value_counts()
valid_classes = class_counts[class_counts >= 2].index
keep_mask = y.isin(valid_classes)  # compute the row mask once and reuse it
X = X[keep_mask]
y = y[keep_mask]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0,
    stratify=y
)

print(f"Train set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nTrain class distribution:")
print(y_train.value_counts())

## 10. Create Pipeline with SMOTE

In [None]:
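# Apply SMOTE to the training split only (outside the pipeline);
# the test set is left untouched so it keeps its original class distribution
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(f"Resampled train set shape: {X_train_res.shape}")
print(f"\nResampled class distribution:")
print(y_train_res.value_counts())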

## 11. Cross-Validation

### What does Step 11 (Cross-Validation) do?

This step uses 5-fold cross-validation to estimate how well the Random Forest will generalize to unseen data. The training data is split into 5 parts (folds); the model is trained on 4 folds and validated on the remaining one, and this is repeated until each fold has served as the validation set once. The mean accuracy across folds, together with its standard deviation, summarizes performance before the model is evaluated on the held-out test set. Because SMOTE was applied in Step 10 before the folds were created, the validation folds include synthetic samples, so these scores may be optimistic compared with the test set results.

**Why is this important?**
- It helps detect overfitting or underfitting.
- It allows us to compare different model settings fairly.
- It provides a realistic estimate of how the model will perform in practice, using only the training data.
- **Importantly, the real test set is never touched during cross-validation.** This ensures our final evaluation (in Step 13) is unbiased and reflects true out-of-sample performance.
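
The fold procedure described above can be sketched with scikit-learn alone. The example uses synthetic data and illustrative variable names; it is a minimal demonstration of 5-fold cross-validation, not the body of `train_random_forest`, whose implementation is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training split (not CICIDS2017 data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Stratified folds keep class proportions similar in every fold;
# each of the 5 folds serves as the validation set exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)

fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {np.round(fold_scores, 4)}")
print(f"Mean: {fold_scores.mean():.4f} (+/- {fold_scores.std():.4f})")
```

A large spread between folds is itself informative: it suggests the estimate depends heavily on which rows land in the validation set.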

In [None]:
# Train Random Forest using standalone function and plot CV scores
logger.info("Training Random Forest with cross-validation using train_random_forest...")
rf_model, cv_scores = train_random_forest(
    X_train_res,
    y_train_res,
    n_estimators=10,
    max_depth=3,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=0,
    cv=5,
    class_weight='balanced',
    logger=None
)

print("\n" + "="*50)
print("CROSS-VALIDATION RESULTS")
print("="*50)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Plot CV scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_scores)+1), cv_scores, marker='o', markersize=10, linewidth=2)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', label=f'Mean: {cv_scores.mean():.4f}')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Cross-Validation Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 12. Train Final Model

In [None]:
# Train on full training set
logger.info("Training final model on full training set...")
rf_model.fit(X_train_res, y_train_res)

print("✓ Model training completed")

## 13. Evaluate on Test Set

In [None]:
# Evaluate model
results = evaluate_model(rf_model, X_test, y_test, logger=logger)

print("\n" + "="*50)
print("TEST SET RESULTS")
print("="*50)
print(f"Test Accuracy: {results['accuracy']:.4f}")
print(f"\nClassification Report:")
print(results['report'])

## 14. Confusion Matrix

In [None]:
# Plot confusion matrix
plt.figure(figsize=(12, 10))
disp = ConfusionMatrixDisplay(confusion_matrix=results['confusion_matrix'],
                               display_labels=rf_model.classes_)
disp.plot(cmap='Blues', xticks_rotation=45)
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 15. Feature Importance Analysis

In [None]:
# Get feature importance
top_features = get_feature_importance(
    rf_model,
    feature_names=list(X.columns),
    top_n=15,
    logger=logger
)

# Plot feature importance
features, importances = zip(*top_features)

plt.figure(figsize=(12, 8))
plt.barh(range(len(features)), importances)
plt.yticks(range(len(features)), features)
plt.xlabel('Importance', fontsize=12)
plt.title('Top 15 Feature Importances', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
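
The cell above relies on `get_feature_importance` returning `(name, importance)` pairs. A minimal sketch of how such pairs can be derived from a fitted forest is shown below. The helper name `top_importances` and the synthetic data are assumptions for illustration, and the project helper may differ (for example, in how it unwraps a pipeline).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with generic feature names (not CICIDS2017 columns).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"f{i}" for i in range(X_demo.shape[1])]
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)


def top_importances(model, feature_names, top_n=3):
    """Hypothetical helper: (name, importance) pairs, highest importance first."""
    pairs = zip(feature_names, model.feature_importances_)
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:top_n]


top = top_importances(forest, names, top_n=3)
for name, score in top:
    print(f"{name}: {score:.4f}")
```

Impurity-based importances sum to 1 across all features and tend to favor high-cardinality features, so they are best read as a ranking rather than as effect sizes.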

## 16. Performance Summary

In [None]:
# Summary
print("\n" + "="*70)
print("FINAL PERFORMANCE SUMMARY")
print("="*70)
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Test Accuracy: {results['accuracy']:.4f}")
print(f"\nModel Configuration:")
print(f"  - Number of trees: {rf_model.named_steps['rf'].n_estimators}")
print(f"  - Max depth: {rf_model.named_steps['rf'].max_depth}")
print(f"  - Max features: {rf_model.named_steps['rf'].max_features}")
print(f"  - SMOTE: Enabled")
print(f"  - Feature Scaling: Enabled")

# Performance indicators
if cv_scores.mean() > 0.99:
    print("\n⚠️  WARNING: CV score > 0.99 may indicate data leakage!")
    print("   Review feature engineering and data preprocessing.")
elif cv_scores.mean() >= 0.95:
    print("\n✓ Excellent performance achieved (CV score ≥ 0.95)")
elif cv_scores.mean() >= 0.90:
    print("\n✓ Good performance achieved (CV score ≥ 0.90)")
else:
    print("\n⚠️  Performance below 0.90")
    print("   Consider:")
    print("   - Increasing n_estimators or max_depth")
    print("   - Adding feature selection")
    print("   - Using more training data")
    print("   - Trying other algorithms (XGBoost, Neural Networks)")

logger.info("Notebook execution completed successfully!")