# **Anomaly Detection in Network Traffic Using the NSL-KDD Dataset**
## **Final Project Write-Up**
### **Author: Mehran Tajbakhsh**
---

## **1. Introduction and Motivation**

Modern digital infrastructures are increasingly dependent on interconnected systems, distributed architectures, and cloud environments that generate massive volumes of network traffic. With the rapid escalation of cyber threats, early detection of malicious activity has become a foundational requirement for cybersecurity operations. Intrusion Detection Systems (IDS) are designed to identify anomalous behavior within these traffic streams, allowing defenders to detect attacks before they escalate into large-scale incidents.

However, building an effective IDS is challenging due to several factors: the high dimensionality of network features, the imbalance between normal and attack instances, and the evolving nature of adversarial behavior. Traditional rule-based IDSs often struggle to generalize beyond known attack signatures. In contrast, machine learning methods, especially anomaly detection, offer a promising alternative by identifying deviations from learned patterns rather than relying solely on signatures.

The goal of this project is to analyze, model, and evaluate multiple anomaly detection approaches using the **NSL-KDD dataset**, an enhanced version of the original KDD Cup 1999 dataset that removes redundant records. The project progresses systematically from statistical anomaly detectors to machine learning–based models, culminating in explainability analysis using SHAP.

This write-up provides a cohesive overview of the methodology, experiments, and insights gained throughout Weeks 01–06 of the project. It includes code excerpts, figures (referenced but not embedded), results, and conclusions relevant to developing interpretable and effective intrusion detection systems.


---
## **2. Methods, Analysis, and Results**

### **2.1 Data Preparation (Week 01)**
The NSL-KDD dataset includes 41 features per traffic record, consisting of numerical attributes, categorical protocol attributes, and derived behavioral features. Preprocessing was required to ensure model compatibility:

- Handling missing or corrupted rows
- Label binarization: normal = 0, attack = 1
- One-hot encoding of categorical values
- Standardization of continuous features
- Splitting into train/test sets

Artifacts such as `X_train.npy`, `X_test.npy`, encoders, and scalers were saved for reproducibility.
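
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the toy `DataFrame`, its values, and the variable names are assumptions made for the example, and only three NSL-KDD-style columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for a few NSL-KDD columns (values are made up for illustration).
df = pd.DataFrame({
    "duration": [0, 2, 0, 5, 0, 1, 0, 0, 3],
    "src_bytes": [181, 239, 0, 1032, 0, 45, np.nan, 0, 520],
    "protocol_type": ["tcp", "tcp", "icmp", "udp", "tcp", "udp", "icmp", "icmp", "tcp"],
    "label": ["normal", "normal", "smurf", "normal", "neptune",
              "normal", "smurf", "smurf", "neptune"],
})

# 1. Drop rows with missing values.
df = df.dropna()

# 2. Binarize the label: normal = 0, any attack = 1.
y = (df["label"] != "normal").astype(int).to_numpy()
features = df.drop(columns="label")

# 3. Split before fitting any transformer so test data never influences it.
X_train_df, X_test_df, y_train, y_test = train_test_split(
    features, y, test_size=0.25, stratify=y, random_state=42
)

# 4. Standardize continuous columns and one-hot encode categorical ones.
numeric_cols = ["duration", "src_bytes"]
categorical_cols = ["protocol_type"]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_train = preprocessor.fit_transform(X_train_df)
X_test = preprocessor.transform(X_test_df)

print(X_train.shape, X_test.shape)
```

Fitting the scaler and encoder on the training split only, then reusing them on the test split, is one common way to avoid leakage; `handle_unknown="ignore"` keeps the transform from failing if a category appears only in the test split.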


In [4]:
import warnings
warnings.filterwarnings("ignore")

In [5]:
import os, sys
sys.path.append(os.path.abspath(".."))  # from notebooks/ to project root

from src.utils import set_global_seed, Paths

set_global_seed()
print("Import OK.", Paths)


Import OK. <class 'src.utils.Paths'>


In [6]:
# Load processed data
import numpy as np
X_train = np.load('./data/processed/X_train.npy')
X_test = np.load('./data/processed/X_test.npy')
y_train = np.load('./data/processed/y_train.npy')
y_test = np.load('./data/processed/y_test.npy')
X_train.shape, X_test.shape

((395216, 115), (98805, 115))

---
### **2.2 Exploratory Data Analysis (Week 02)**

The EDA phase revealed key statistical insights:

- Several features exhibit heavy skew, especially duration-based metrics.
- Attack classes tend to cluster in PCA space, indicating separability.
- t-SNE and UMAP visualizations with stratified sampling showed clear separation between normal and malicious traffic.

These observations motivated the use of both supervised and unsupervised models.
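
The stratified sampling used before the nonlinear embeddings can be sketched as below. The example runs on synthetic data from `make_classification`, not on NSL-KDD, and the sample size and t-SNE settings are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the processed feature matrix (not NSL-KDD).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.8, 0.2], random_state=0,
)

# Draw a 300-record subsample whose class proportions match the full set.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=300, stratify=y, random_state=0
)

# Embed the subsample in 2-D; perplexity must be smaller than the sample size.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_sample)

print("attack share (full):  ", round(float(y.mean()), 3))
print("attack share (sample):", round(float(y_sample.mean()), 3))
print("embedding shape:", embedding.shape)
```

Stratifying keeps the minority class visible in the plot; a purely random subsample of an imbalanced set can under-represent attacks by chance.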


In [None]:
# Example PCA analysis
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)

plt.scatter(X_pca[:,0], X_pca[:,1], c=y_train, s=2, cmap='coolwarm')
plt.title('PCA Projection of Training Data')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

---
### **2.3 Statistical and Unsupervised Models (Week 03)**

Week 03 focused on classical anomaly detection techniques that do not rely on complex model architectures. These methods were applied to the processed NSL-KDD features and evaluated against the binary intrusion labels to understand how far simple detectors can go before introducing heavier machine learning models.

#### **Z-Score Detection**
Z-score thresholding was applied to selected continuous features after standardization. Records with absolute z-scores above a fixed cutoff (e.g., 3 standard deviations) were flagged as anomalies. This method is easy to implement and interpret, but it assumes approximately Gaussian behavior and struggles with heavy-tailed or multimodal feature distributions common in network traffic.
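
A minimal sketch of per-feature z-score thresholding is shown below. It uses synthetic Gaussian data with a few injected extreme values; the data, the number of features, and the rule of flagging a record when any feature exceeds the cutoff are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for three continuous features (not NSL-KDD data).
X = rng.normal(size=(1000, 3))
# Inject extreme values into the second feature of the first five records.
X[:5, 1] = [8.0, -9.0, 10.0, 12.0, -11.0]

# Z-score of each value relative to its column mean and standard deviation.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

# Flag a record if any feature lies more than `cutoff` standard deviations out.
cutoff = 3.0
is_anomaly = (np.abs(Z) > cutoff).any(axis=1)

print("flagged records:", int(is_anomaly.sum()), "of", X.shape[0])
```

Because the mean and standard deviation are themselves affected by extreme values, heavy contamination can mask outliers, which is one reason the method struggles on heavy-tailed features.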

#### **Elliptic Envelope**
The Elliptic Envelope model fits a multivariate Gaussian distribution to the training data and identifies points lying outside a learned contour as outliers. In the NSL-KDD setting, this corresponds to learning an "elliptical" region of normal traffic. The method provided some separation between normal and attack records, but its performance was sensitive to covariance estimation and to the presence of non-Gaussian features.
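
The idea can be illustrated with scikit-learn's `EllipticEnvelope` on synthetic two-dimensional data. The data and the `contamination` value are assumptions for this sketch and are not the settings used in the project.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)

# Synthetic "normal" traffic: one correlated Gaussian blob.
X_normal = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=500
)
# Synthetic "attack" records placed far from the blob.
X_attack = rng.uniform(low=6.0, high=8.0, size=(20, 2))

# contamination sets the share of training points left outside the envelope.
detector = EllipticEnvelope(contamination=0.05, random_state=0)
detector.fit(X_normal)

# predict returns +1 for inliers and -1 for outliers.
pred_normal = detector.predict(X_normal)
pred_attack = detector.predict(X_attack)

print("normal flagged:", float((pred_normal == -1).mean()))
print("attack flagged:", float((pred_attack == -1).mean()))
```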

#### **Mahalanobis Distance (Robust Covariance)**
Mahalanobis distance was computed using a robust covariance estimator to reduce the influence of extreme points. Under a Gaussian assumption, the squared distance approximately follows a chi-square distribution with degrees of freedom equal to the number of features, so a chi-square quantile (e.g., the 0.99 quantile) was used as the decision threshold: points whose squared distance exceeded it were labeled as anomalies. This approach produced clearer separation than the plain Elliptic Envelope, particularly because the robust covariance estimate down-weighted outliers when modeling normal traffic.
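
A sketch of this thresholding rule, assuming scikit-learn's `MinCovDet` as the robust estimator and SciPy's `chi2` for the quantile, is given below. The write-up does not say which estimator was used, so both choices, along with the synthetic data, are assumptions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# Synthetic data: 500 "normal" records plus 20 shifted "attack" records.
X_normal = rng.normal(size=(500, 3))
X_attack = rng.normal(loc=6.0, size=(20, 3))
X = np.vstack([X_normal, X_attack])

# Robust location and covariance via Minimum Covariance Determinant.
mcd = MinCovDet(random_state=0).fit(X)

# MinCovDet.mahalanobis returns squared Mahalanobis distances.
d2 = mcd.mahalanobis(X)

# 0.99 quantile of chi-square with one degree of freedom per feature.
threshold = chi2.ppf(0.99, df=X.shape[1])
is_anomaly = d2 > threshold

print("threshold:", round(float(threshold), 3))
print("normal flagged:", float(is_anomaly[:500].mean()))
print("attack flagged:", float(is_anomaly[500:].mean()))
```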

#### **Local Outlier Factor (LOF)**
LOF is a density-based method that compares the local density of each point to that of its neighbors. Points whose local density is much lower than that of their neighbors receive high LOF scores and are flagged as anomalies. Applied to NSL-KDD, LOF highlighted regions of low-density attack traffic embedded within predominantly normal flows. It captured some subtle attacks but also produced false positives in naturally sparse regions of feature space.
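
The density comparison can be sketched with scikit-learn's `LocalOutlierFactor`. The synthetic data, `n_neighbors`, and `contamination` values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# One dense cluster plus ten widely scattered points.
X_dense = rng.normal(scale=0.5, size=(300, 2))
X_sparse = rng.uniform(low=4.0, high=10.0, size=(10, 2))
X = np.vstack([X_dense, X_sparse])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

# fit_predict returns +1 for inliers and -1 for outliers.
labels = lof.fit_predict(X)

# negative_outlier_factor_ stores the negated LOF; flip the sign so that
# larger values mean "more anomalous".
scores = -lof.negative_outlier_factor_

print("median LOF, dense cluster:   ", round(float(np.median(scores[:300])), 2))
print("median LOF, scattered points:", round(float(np.median(scores[300:])), 2))
```

Scores near 1 indicate a point about as dense as its neighborhood; the scattered points should score well above that.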

#### **DBSCAN**
DBSCAN clusters points based on density and labels points that do not belong to any cluster as noise. When applied to the training data, large high-density clusters corresponded roughly to normal traffic, while noise points often aligned with attack records. However, DBSCAN required careful tuning of `eps` and `min_samples`, and a single global density threshold could not perfectly accommodate all modes of network behavior.
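
The noise-labeling behavior can be sketched with scikit-learn's `DBSCAN` on synthetic data. The `eps` and `min_samples` values are assumptions tuned to this toy example and would not transfer to the NSL-KDD features.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense clusters standing in for modes of normal traffic.
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2))
# Five isolated points standing in for anomalous records.
isolated = np.array([[2.5, 2.5], [10.0, 0.0], [-5.0, 6.0], [8.0, -4.0], [2.5, -6.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_

# DBSCAN assigns the label -1 to points that belong to no cluster.
is_noise = labels == -1
n_clusters = len(set(labels.tolist()) - {-1})

print("clusters found:", n_clusters)
print("noise points:", int(is_noise.sum()))
```

Shrinking `eps` or raising `min_samples` turns more fringe points into noise, which illustrates the tuning sensitivity described above.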

Overall, these Week 03 models provided a baseline view of anomaly structure in NSL-KDD: they revealed that attacks often occupy lower-density regions or lie far from the robust normal manifold, but they also showed that purely unsupervised detectors have limited precision when used alone in an operational IDS.


---
### **2.4 Supervised Machine Learning Models (Week 04)**

Three main models were trained:

#### **Logistic Regression**
- Fast, linear baseline
- Performs poorly on nonlinear attack behavior

#### **Random Forest Classifier**
- High performance
- Provides interpretable feature importances

#### **SVM with RBF Kernel**
- Very strong performance but computationally expensive
- Requires scaling and careful hyperparameter tuning
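
A minimal sketch of training these three model families with scikit-learn is shown below. It uses synthetic data from `make_classification`, and every hyperparameter is an assumption for the example, not the configuration used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (not NSL-KDD).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    class_sep=1.5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling is wrapped into the pipelines of the scale-sensitive models.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm_rbf": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")
    ),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracy[name] = model.score(X_te, y_te)
    print(f"{name:20s} test accuracy = {accuracy[name]:.3f}")
```

Putting the scaler inside the pipeline means it is fit on the training split only, which matters for the SVM in particular.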


---
### **2.5 Model Evaluation and Comparison (Week 05)**

Metrics such as F1-score, precision, recall, and runtime were compared across all models. Key observations:

- Supervised machine learning models significantly outperformed the statistical detectors.
- Random Forest achieved the best balance of performance and interpretability.
- SVM achieved slightly higher precision but required considerably more training time.
- Unsupervised models like LOF provided useful signal but lacked consistency.
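
One way such a comparison could be assembled is sketched below. The helper `evaluate` is a hypothetical function written for this example, the data are synthetic, and the numbers it prints say nothing about the project's actual results.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit a model, time the fit, and score the positive (attack) class."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, model.predict(X_te), average="binary", zero_division=0
    )
    return {
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "fit_seconds": fit_seconds,
    }


# Synthetic, imbalanced data: roughly 20% of records in the positive class.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {
    name: evaluate(model, X_tr, y_tr, X_te, y_te)
    for name, model in candidates.items()
}

for name, m in results.items():
    print(
        f"{name:20s} P={m['precision']:.3f} R={m['recall']:.3f} "
        f"F1={m['f1']:.3f} fit={m['fit_seconds']:.3f}s"
    )
```

Reporting precision and recall on the attack class, rather than accuracy alone, keeps the comparison meaningful when normal traffic dominates.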


---
### **2.6 Explainability and SHAP Analysis (Week 06)**

SHAP values were computed for the Random Forest and SVM models to analyze feature contributions:

- Features such as **src_bytes**, **service_count**, and **duration** consistently showed strong importance.
- Attack records exhibited distinct SHAP patterns, assisting analysts in understanding why alerts were triggered.
- Permutation importance complemented SHAP with a global view of feature relevance.
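
The permutation-importance step can be sketched with scikit-learn's `permutation_importance`. The example covers only that step, on synthetic data in which a single column drives the label; the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature names; only "signal" determines the label.
feature_names = ["signal", "noise_a", "noise_b"]
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on held-out data and record the mean drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Rank features from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]:8s} mean importance = {result.importances_mean[idx]:.3f}")
```

Computing the importances on held-out data reflects what the model relies on for generalization rather than what it memorized during training.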

---
## **3. Discussion**

The experiments demonstrate that supervised and unsupervised methods have distinct strengths for intrusion detection. Statistical detectors, while simple, are too rigid for complex network traffic. Density-based methods like LOF and DBSCAN can detect structural anomalies without labels but cannot achieve the consistency required for operational environments.

Supervised models, especially Random Forest and SVM, provide superior accuracy. However, they rely heavily on labeled training data and may degrade over time due to concept drift. In security contexts where attack patterns evolve, frequent retraining is essential.

Explainability plays a vital role in IDS adoption. SHAP values reveal which network features most strongly influence decisions, enabling analysts to validate alerts, investigate false positives, and refine defenses. The alignment between SHAP patterns and intuitive network behavior supports model trustworthiness.


---
## **4. Conclusion**

This project developed a comprehensive anomaly detection pipeline for the NSL-KDD dataset. By combining statistical techniques, classical machine learning algorithms, and modern explainability tools, the analysis highlights both the opportunities and challenges in building effective intrusion detection systems.

Key conclusions:

- Random Forest and SVM remain strong supervised baselines for attack detection.
- Density-based unsupervised models add complementary insight.
- Statistical methods alone are insufficient for modern threat landscapes.
- Explainability, especially SHAP, strengthens analyst trust and supports operational deployment.

The methodology developed here forms a foundation for future improvements, including deep learning, streaming detection, and continuous model retraining to address concept drift.