# 🛡️ NetGuardian-AI: Complete IDS Pipeline

**Comprehensive All-in-One Google Colab Notebook**

This notebook consolidates the entire NetGuardian-AI Intrusion Detection System workflow:
1. Dataset Construction & Cleaning
2. MITRE ATT&CK Analysis
3. Data Preparation & Feature Engineering
4. Hybrid Model Training (Binary + Multi-Class)
5. Model Evaluation & Robustness Testing
6. Real-Time Simulation
7. Model Comparison Benchmark

---

## 📋 Table of Contents

- [Phase 0: Environment Setup](#phase0)
- [Phase 1: Dataset Construction](#phase1)
- [Phase 2: MITRE Analysis](#phase2)
- [Phase 3: Data Preparation](#phase3)
- [Phase 4: Hybrid Model Training](#phase4)
- [Phase 5: Model Evaluation](#phase5)
- [Phase 6: Real-Time Simulation](#phase6)
- [Phase 7: Model Comparison](#phase7)

<a id='phase0'></a>
## 📦 Phase 0: Environment Setup

**Purpose**: Install required libraries and configure the environment.

**What this does**:
- Installs machine learning libraries (XGBoost, imbalanced-learn)
- Imports all necessary Python packages
- Configures visualization settings

In [None]:
# Install required packages
!pip install -q xgboost imbalanced-learn

print("✅ Packages installed successfully!")

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
import joblib
from collections import Counter

# Scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    accuracy_score, f1_score, precision_score, 
    recall_score, roc_auc_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# XGBoost and imbalanced-learn
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

# TensorFlow for Autoencoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Configure settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ All libraries imported!")
print(f"TensorFlow version: {tf.__version__}")

<a id='phase1'></a>
## 🔍 Phase 1: Dataset Construction & Cleaning

**Purpose**: Load and clean the CICIDS2017 dataset.

**Dataset**: CICIDS2017 from Kaggle
- URL: https://www.kaggle.com/datasets/cicdataset/cicids2017

**What this phase does**:
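1. Loads the CICIDS2017 dataset (the pre-cleaned Kaggle version by default)
2. Explores data structure and statistics
3. Detects and fixes data quality issues (NaN, infinite values, duplicates)
4. Cleans column names
5. Examines the attack-type categories and their distribution (categories are merged into final classes in Phase 3)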

### Step 1.1: Load Dataset

**Explanation**: We start by loading the pre-cleaned CICIDS2017 CSV to explore its structure (a single raw day file can be substituted, as noted in the code comment). The dataset contains network flow features extracted from captured network traffic.
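
CICIDS2017 CSVs are large, so it can help to shrink numeric columns right after loading. The sketch below shows one possible way to do that on a toy frame. `downcast_numeric` is a hypothetical helper that is not part of this notebook, and the two columns are made-up stand-ins for the real data.

```python
import numpy as np
import pandas as pd


def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast to smaller dtypes."""
    out = frame.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy frame standing in for the loaded dataset (values are made up)
toy = pd.DataFrame({
    "Destination Port": np.array([80, 443, 22], dtype="int64"),
    "Flow Duration": np.array([1.5, 2.5, 3.5], dtype="float64"),
})
small = downcast_numeric(toy)

print(small.dtypes)
print(small.memory_usage(deep=True).sum(), "bytes vs", toy.memory_usage(deep=True).sum())
```

Downcasting floats to 32 bits loses precision, so skip that step if exact values matter for your analysis.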

In [None]:
# Load the pre-cleaned dataset (this is a Kaggle input path; adjust it when running on Colab)
# If using raw CICIDS2017, use: /kaggle/input/cicids2017/Monday-WorkingHours.pcap_ISCX.csv
df = pd.read_csv('/kaggle/input/cicids2017-cleaned-and-preprocessed/cicids2017_cleaned.csv')

print(f"📊 Dataset Shape: {df.shape}")
print(f"💾 Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"📝 Number of Columns: {len(df.columns)}")

### Step 1.2: Initial Exploration

**Explanation**: Understanding the data structure helps us identify potential issues and plan our preprocessing strategy.

In [None]:
# Display first few rows
print("First 5 rows:")
display(df.head())

# Show data types and non-null counts
print("\nDataset Info:")
df.info()

In [None]:
# Statistical summary
print("Statistical Summary:")
display(df.describe())

### Step 1.3: Analyze Attack Distribution

**Explanation**: Understanding the distribution of attack types helps us identify class imbalance issues that we'll need to address during training.
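
To put a number on class imbalance, one simple option is the ratio between the largest and smallest class. The sketch below uses made-up label counts in place of `df['Attack Type']`. The variable names are illustrative only.

```python
import pandas as pd

# Toy labels standing in for df['Attack Type']; the counts are made up
labels = pd.Series(["Normal"] * 90 + ["DoS"] * 8 + ["Bots"] * 2)

counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size
minority_share = counts.min() / counts.sum()    # fraction held by the rarest class

print(counts)
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Rarest class share: {minority_share:.1%}")
```

A high ratio is a hint that accuracy alone will be misleading and that per-class metrics deserve attention later.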

In [None]:
# Distribution of attack types
print("Attack Type Distribution:")
print(df['Attack Type'].value_counts())
print("\nPercentages:")
print(df['Attack Type'].value_counts(normalize=True) * 100)

In [None]:
# Visualize distribution
plt.figure(figsize=(14, 6))
df['Attack Type'].value_counts().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Distribution of Attack Types - CICIDS2017', fontsize=16, fontweight='bold')
plt.xlabel('Attack Type', fontsize=12)
plt.ylabel('Number of Instances', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### Step 1.4: Data Quality Checks

**Explanation**: We check for common data quality issues:
- **NaN values**: Missing data that needs imputation
- **Infinite values**: Result of division by zero or overflow
- **Duplicates**: Redundant rows that can bias the model
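
As a quick illustration of the three checks, the sketch below runs them on a hypothetical four-row frame that contains exactly one NaN, one infinite value, and one duplicate row. The column names only mimic the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature flow table with one of each problem
toy = pd.DataFrame({
    "Flow Bytes/s": [100.0, np.inf, np.nan, 100.0],
    "Total Fwd Packets": [3, 5, 7, 3],
})

numeric = toy.select_dtypes(include=[np.number])
report = {
    "nan": int(toy.isnull().sum().sum()),         # inf is not counted as NaN
    "inf": int(np.isinf(numeric).sum().sum()),    # NaN is not counted as inf
    "duplicates": int(toy.duplicated().sum()),    # last row repeats the first
}
print(report)
```

Note that pandas treats NaN and infinity separately, which is why both checks are needed.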

In [None]:
# Check for missing values
print("="*50)
print("1. MISSING VALUES (NaN)")
print("="*50)
nan_counts = df.isnull().sum()
if nan_counts.sum() > 0:
    print(nan_counts[nan_counts > 0])
else:
    print("✅ No NaN values detected")
print(f"\nTotal NaN: {nan_counts.sum()}")

In [None]:
# Check for infinite values
print("="*50)
print("2. INFINITE VALUES")
print("="*50)

numeric_cols = df.select_dtypes(include=[np.number]).columns
inf_counts = {}

for col in numeric_cols:
    inf_count = np.isinf(df[col]).sum()
    if inf_count > 0:
        inf_counts[col] = inf_count

if inf_counts:
    for col, count in sorted(inf_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"{col}: {count:,}")
else:
    print("✅ No infinite values detected")

print(f"\nTotal columns with infinites: {len(inf_counts)}")

In [None]:
# Check for duplicates
print("="*50)
print("3. DUPLICATE ROWS")
print("="*50)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates:,}")
print(f"Percentage: {duplicates / len(df) * 100:.2f}%")

### Step 1.5: Data Cleaning Function

**Explanation**: This function performs the following cleaning steps:
1. **Clean column names**: Strips leading/trailing spaces and replaces inner spaces with underscores
2. **Remove duplicates**: Eliminates redundant rows
3. **Handle infinites**: Replaces inf/-inf with NaN
4. **Impute NaN**: Fills missing numeric values with the column median (robust to outliers)
5. **Fix negative values**: Sets impossible negative values to 0 in duration and packet-count features

Note that the function does not drop non-predictive identifier columns (IPs, timestamps). If your copy of the dataset contains them, drop them before Phase 3, where every non-label column is used as a feature.
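
To see why the median is preferred over the mean for imputation, consider a toy column with one extreme outlier and one infinite value. This is a minimal sketch with made-up numbers. It mirrors the inf-to-NaN and median-fill steps but is not the cleaning function itself.

```python
import numpy as np
import pandas as pd

# Toy column: four ordinary values, one extreme outlier, one inf to repair
rate = pd.Series([10.0, 12.0, 11.0, 13.0, 10_000.0, np.inf])

rate = rate.replace([np.inf, -np.inf], np.nan)   # inf -> NaN
median_fill = rate.median()                      # NaN is skipped by default
mean_fill = rate.mean()                          # pulled up by the outlier

filled = rate.fillna(median_fill)                # impute with the median
print(f"median = {median_fill}, mean = {mean_fill}")
print(filled.tolist())
```

The median stays close to the typical values, while the mean is dominated by the single outlier and would inject an unrepresentative value into every repaired cell.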

In [None]:
def clean_cicids2017(df):
    """
    Clean CICIDS2017 dataset
    
    Args:
        df: Raw DataFrame
    
    Returns:
        Cleaned DataFrame
    """
    print("🧹 Starting data cleaning...")
    print(f"Initial shape: {df.shape}")
    
    # 1. Clean column names (remove spaces)
    df.columns = df.columns.str.strip().str.replace(' ', '_')
    print("✅ Column names cleaned")
    
    # 2. Remove duplicates
    initial_rows = len(df)
    df = df.drop_duplicates()
    duplicates_removed = initial_rows - len(df)
    print(f"✅ Duplicates removed: {duplicates_removed:,}")
    
    # 3. Replace infinite values with NaN
    # (reassign rather than use inplace=True so df is never modified through a copy)
    df = df.replace([np.inf, -np.inf], np.nan)
    print("✅ Infinite values replaced with NaN")
    
    # 4. Fill NaN with the column median for numeric columns
    nan_before = df.isnull().sum().sum()
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        if df[col].isnull().any():
            # Assign back: df[col].fillna(..., inplace=True) is chained assignment,
            # which is deprecated and does not update df under pandas Copy-on-Write
            df[col] = df[col].fillna(df[col].median())
    nan_after = df.isnull().sum().sum()
    print(f"✅ NaN handled: {nan_before:,} → {nan_after:,}")
    
    # 5. Set negative values to 0 in features that must be non-negative
    positive_cols = ['Flow_Duration', 'Total_Fwd_Packets', 'Total_Backward_Packets']
    for col in positive_cols:
        if col in df.columns:
            negative_count = (df[col] < 0).sum()
            if negative_count > 0:
                df.loc[df[col] < 0, col] = 0
                print(f"✅ {col}: {negative_count:,} negative values fixed")
    
    print("\n🎉 Cleaning complete!")
    print(f"Final shape: {df.shape}")
    
    return df

In [None]:
# Apply cleaning
df_clean = clean_cicids2017(df.copy())

### Step 1.6: Verify Cleaning Results

**Explanation**: We verify that all data quality issues have been resolved.
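
Printed counts are easy to overlook, so one option is to turn the verification into hard assertions that stop the notebook if anything slipped through. `assert_clean` below is a hypothetical helper, shown on toy frames rather than on `df_clean`.

```python
import numpy as np
import pandas as pd


def assert_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if NaN, infinite values, or duplicate rows remain."""
    numeric = frame.select_dtypes(include=[np.number])
    assert frame.isnull().sum().sum() == 0, "NaN values remain"
    assert not np.isinf(numeric.to_numpy(dtype=float)).any(), "infinite values remain"
    assert not frame.duplicated().any(), "duplicate rows remain"


good = pd.DataFrame({"a": [1.0, 2.0], "b": [3, 4]})
assert_clean(good)  # passes silently

bad = pd.DataFrame({"a": [1.0, np.inf], "b": [3, 4]})
try:
    assert_clean(bad)
    caught = None
except AssertionError as err:
    caught = str(err)
print(caught)
```

In the notebook this would be called as `assert_clean(df_clean)` right after cleaning.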

In [None]:
# Verification
print("Post-cleaning verification:")
print(f"Shape: {df_clean.shape}")
print(f"NaN count: {df_clean.isnull().sum().sum()}")
print(f"Infinite count: {np.isinf(df_clean.select_dtypes(include=[np.number])).sum().sum()}")
print(f"Duplicate count: {df_clean.duplicated().sum()}")

<a id='phase2'></a>
## 🕵️ Phase 2: MITRE ATT&CK Analysis

**Purpose**: Map attacks to the MITRE ATT&CK framework for better understanding.

**What this phase does**:
- Maps CICIDS2017 attacks to MITRE ATT&CK tactics and techniques
- Analyzes network characteristics (ports, protocols) of each attack type
- Provides cybersecurity context for the detected attacks

### MITRE ATT&CK Mapping

**Explanation**: The MITRE ATT&CK framework is a globally accessible knowledge base of adversary tactics and techniques. Mapping our attacks to it helps us understand:
- **What the attacker is trying to achieve** (Tactic)
- **How they're doing it** (Technique)

| Attack Type | MITRE Tactic | Technique ID | Description |
|-------------|--------------|--------------|-------------|
| **Brute Force (FTP/SSH)** | Credential Access | T1110 | Trying multiple passwords to gain access |
| **DoS / DDoS** | Impact | T1498 | Overwhelming network resources to cause unavailability |
| **Port Scanning** | Discovery | T1046 | Searching for open ports and services |
| **Web Attacks (SQL/XSS)** | Initial Access | T1190 | Exploiting web application vulnerabilities |
| **Botnet** | Command and Control | T1071 | Compromised hosts controlled remotely over application-layer protocols (the earlier mapping, T1043, is deprecated in current ATT&CK) |
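
The tactic column of the table can also be applied to the data programmatically. The sketch below is a minimal example on toy rows. The tactics come from the table, but the label spellings in `TACTIC_BY_ATTACK` are assumptions and must be adjusted to match the values in your `Attack_Type` column.

```python
import pandas as pd

# Tactics copied from the table; the attack label spellings are assumed
TACTIC_BY_ATTACK = {
    "Brute Force": "Credential Access",
    "DoS": "Impact",
    "DDoS": "Impact",
    "Port Scanning": "Discovery",
    "Web Attacks": "Initial Access",
    "Bots": "Command and Control",
}

# Toy flows standing in for the attack rows of the dataset
toy = pd.DataFrame({"Attack_Type": ["DoS", "DDoS", "Bots", "Port Scanning", "DoS"]})

# Labels missing from the dictionary are flagged instead of silently dropped
toy["Tactic"] = toy["Attack_Type"].map(TACTIC_BY_ATTACK).fillna("Unmapped")
tactic_counts = toy["Tactic"].value_counts()
print(tactic_counts)
```

Any row reported as `Unmapped` points to a label spelling that differs from the dictionary keys.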

### Port Analysis

**Explanation**: Different attacks target specific ports:
- **SSH Brute Force** → Port 22
- **FTP Brute Force** → Port 21
- **Web Attacks** → Ports 80 (HTTP) and 443 (HTTPS)

This analysis helps validate that our dataset contains realistic attack patterns.
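
One way to make this validation concrete is to measure, per attack type, the share of flows that hit the expected ports. The sketch below uses made-up rows and hypothetical label names, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Expected ports from the list above; labels and rows are toy data
EXPECTED_PORTS = {
    "SSH Brute Force": {22},
    "FTP Brute Force": {21},
    "Web Attack": {80, 443},
}

flows = pd.DataFrame({
    "Attack_Type": ["SSH Brute Force"] * 3 + ["FTP Brute Force"] * 2 + ["Web Attack"] * 3,
    "Destination_Port": [22, 22, 8080, 21, 21, 80, 443, 80],
})

# Flag each flow whose destination port is one of the expected ports
flows["On_Expected_Port"] = [
    port in EXPECTED_PORTS.get(attack, set())
    for attack, port in zip(flows["Attack_Type"], flows["Destination_Port"])
]

# The mean of a boolean column is the fraction of True values
share = flows.groupby("Attack_Type")["On_Expected_Port"].mean()
print(share)
```

A share well below 1.0 for an attack type would be worth investigating before trusting the labels.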

In [None]:
# Analyze port targeting by attack type
# Note: This requires 'Destination_Port' column in the dataset
if 'Destination_Port' in df_clean.columns:
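    # Cleaning renames columns only, so label values may still read 'Normal Traffic';
    # normalize spaces to underscores before comparing
    is_attack = df_clean['Attack_Type'].str.replace(' ', '_') != 'Normal_Traffic'
    attacks = df_clean[is_attack]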
    top_ports = attacks.groupby(['Attack_Type', 'Destination_Port']).size().reset_index(name='count')
    top_ports = top_ports.sort_values(['Attack_Type', 'count'], ascending=[True, False]).groupby('Attack_Type').head(3)
    
    print("Top 3 ports targeted by each attack type:")
    display(top_ports)
else:
    print("⚠️ Port information not available in this dataset version")

<a id='phase3'></a>
## 🎯 Phase 3: Data Preparation

**Purpose**: Prepare data for machine learning.

**What this phase does**:
1. Creates binary labels (Normal vs Attack)
2. Creates multi-class labels (specific attack types)
3. Encodes categorical labels to numeric
4. Separates features from labels
5. Normalizes features using StandardScaler
6. Saves the fitted scaler and label encoder

### Step 3.1: Create Binary Labels

**Explanation**: Binary classification is simpler and faster. It answers: "Is this traffic malicious?"
- **0** = Normal Traffic
- **1** = Attack (any type)

In [None]:
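# Create binary labels
# Cleaning renamed columns only, so label values may still contain spaces
# ('Normal Traffic'); normalize them before comparing
normal_mask = df_clean['Attack_Type'].str.replace(' ', '_') == 'Normal_Traffic'
df_clean['Binary_Label'] = (~normal_mask).astype(int)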

print("Binary Label Distribution:")
print(df_clean['Binary_Label'].value_counts())
print("\nPercentages:")
print(df_clean['Binary_Label'].value_counts(normalize=True) * 100)

### Step 3.2: Create Multi-Class Labels

**Explanation**: Multi-class classification identifies the specific attack type. We merge DoS and DDoS since they're similar attacks (both aim to overwhelm resources).

In [None]:
# Merge DoS and DDoS into one category
df_clean['Attack_Merged'] = df_clean['Attack_Type'].replace({
    'DoS': 'DoS_DDoS',
    'DDoS': 'DoS_DDoS'
})

print("Attack Type Distribution (after merging):")
print(df_clean['Attack_Merged'].value_counts())

### Step 3.3: Encode Labels

**Explanation**: Machine learning models require numeric labels. LabelEncoder converts text labels to integers (0, 1, 2, ...).
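
A small example makes the encoding behaviour visible. `LabelEncoder` stores classes in sorted order, so the integer assigned to a label depends on alphabetical position, not on order of appearance. The labels below are made up.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Normal", "DoS", "Bots", "DoS"]   # made-up labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

print(list(encoder.classes_))                   # classes in sorted order
print(encoded.tolist())                         # integer code per input label
print(list(encoder.inverse_transform([2, 0])))  # integers back to text
```

`inverse_transform` is what turns model predictions back into readable attack names at inference time.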

In [None]:
# Encode multi-class labels
le = LabelEncoder()
df_clean['Multiclass_Label'] = le.fit_transform(df_clean['Attack_Merged'])

# Display mapping
print("Label Encoding Mapping:")
for i, label in enumerate(le.classes_):
    count = (df_clean['Multiclass_Label'] == i).sum()
    print(f"{i}: {label:20s} ({count:,} instances)")

### Step 3.4: Visualize Label Distributions

**Explanation**: Visualizing helps us understand class imbalance, which we'll address with SMOTE during training.

In [None]:
# Create side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Binary distribution (sort_index keeps the bars in 0, 1 order so they match the tick labels)
df_clean['Binary_Label'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Binary Classification', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Label (0=Normal, 1=Attack)')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Normal', 'Attack'], rotation=0)

# Multi-class distribution
df_clean['Attack_Merged'].value_counts().plot(kind='bar', ax=axes[1], color='skyblue')
axes[1].set_title('Multi-Class Classification', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Attack Type')
axes[1].set_ylabel('Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### Step 3.5: Prepare Features

**Explanation**: We separate features (X) from labels (y). Features are the network flow characteristics used for prediction.

In [None]:
# Separate features from labels
label_cols = ['Attack_Type', 'Binary_Label', 'Attack_Merged', 'Multiclass_Label']
feature_cols = [col for col in df_clean.columns if col not in label_cols]

X = df_clean[feature_cols]
y_binary = df_clean['Binary_Label']
y_multiclass = df_clean['Multiclass_Label']

print(f"Features shape: {X.shape}")
print(f"Binary labels shape: {y_binary.shape}")
print(f"Multi-class labels shape: {y_multiclass.shape}")

### Step 3.6: Train/Test Split

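**Explanation**: We split data into training (80%) and testing (20%) sets. Stratifying on the binary label keeps the Normal/Attack ratio the same in both sets. The multi-class labels are then looked up by row index, so individual attack types are only approximately balanced across the two sets.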
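
The effect of `stratify` is easy to check on toy data. In this sketch, 20% of 100 made-up rows are attacks, and the split keeps that share in both parts. All names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of them labelled as attacks (1)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"train attack share: {y_tr.mean():.2f}")
print(f"test attack share:  {y_te.mean():.2f}")
```

Without `stratify`, the test share would vary with the random seed, which matters most for small or rare classes.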

In [None]:
# Stratified split to maintain class distribution
X_train, X_test, y_binary_train, y_binary_test = train_test_split(
    X, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

# Get corresponding multi-class labels
y_multi_train = y_multiclass.loc[X_train.index]
y_multi_test = y_multiclass.loc[X_test.index]

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining set distribution:")
print(y_binary_train.value_counts())

### Step 3.7: Feature Normalization

**Explanation**: StandardScaler rescales each feature to mean=0 and std=1. The scaler is fitted on the training set only and then applied to the test set, which avoids leaking test statistics into training. Scaling matters for:
- **Faster convergence** in gradient-based algorithms
- **Comparable feature scales** (prevents features with large values from dominating)
- **Better performance** in distance-based algorithms (KNN, SVM)
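
The transformation itself is just z = (x - mean) / std, with mean and std taken from the training data. The sketch below checks that by hand on a toy column. The arrays are made up and `toy_scaler` is a separate object from the notebook's `scaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])   # toy training column
test = np.array([[4.0]])                  # unseen value

toy_scaler = StandardScaler().fit(train)  # learns mean and std from train only

# Manual z-score using the training statistics (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(toy_scaler.transform(train).ravel())
print(toy_scaler.transform(test).ravel(), manual.ravel())
```

The test value is scaled with the training mean and std, so it can legitimately fall outside the range seen during training.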

In [None]:
# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Features normalized")
print(f"Mean: {X_train_scaled.mean():.6f}")
print(f"Std: {X_train_scaled.std():.6f}")

### Step 3.8: Save Preprocessed Data

**Explanation**: We save the scaler and label encoder to ensure consistent preprocessing during deployment.
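
To confirm that a saved scaler behaves identically after reloading, a round trip can be tested as sketched below. The example writes to a temporary directory, not to the notebook's `scaler.pkl`, and fits on made-up values.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy values (two features, two rows)
fitted = StandardScaler().fit(np.array([[1.0, 10.0], [3.0, 30.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")   # temporary stand-in path
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored scaler should transform new data exactly like the original
sample = np.array([[2.0, 20.0]])
same = np.allclose(fitted.transform(sample), restored.transform(sample))
print(same)
```

Pickled estimators are tied to the library version that created them, so it is safest to load them with the same scikit-learn version used for training.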

In [None]:
# Save the fitted scaler and label encoder
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(le, 'label_encoder.pkl')

print("✅ Scaler and encoder saved")
print("\n" + "="*70)
print("PREPROCESSING SUMMARY")
print("="*70)
print(f"Total samples: {len(df_clean):,}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {len(le.classes_)}")
print(f"\nClass distribution:")
for i, label in enumerate(le.classes_):
    count = (y_multiclass == i).sum()
    pct = (count / len(y_multiclass)) * 100
    print(f"  {i}: {label:20s} {count:8,} ({pct:5.2f}%)")
print("="*70)