This repository contains the materials, scripts, and documentation for Laboratory 3 of the AI and Cybersecurity course.
The goal of this project is to apply shallow and deep learning methods for anomaly detection and data representation to the Intrusion Detection Systems (IDS) task. The main objective is to evaluate whether unsupervised algorithms can automatically detect anomalous patterns and to analyze how their performance evolves as our knowledge of the data increases. The steps performed are:
- Dataset characterization: Examine the dataset to understand the number of categorical and numerical features. Check how the attack labels and the binary label are distributed.
- Shallow anomaly detection: Use One-Class SVM in a supervised and unsupervised setting.
- Deep anomaly detection and representation: Use an autoencoder for anomaly detection and compare the data representation it learns with the one produced by PCA.
- Unsupervised detection and interpretation: Use clustering results and visualization techniques to handle the situation where no labels are available to learn the patterns.
Laboratory3/
├── lab/ # Data, notebooks and support material
├── report/ # LaTeX source files for the lab report
├── resources/ # Additional resources (e.g., links, PDFs, images)
└── README.md # This file
The detailed lab report, including all experimental results and analysis, can be found here. For a runnable summary of the experiments and step-by-step code, open lab/notebooks/.
The main learning objectives for the lab were:
- Learn possible strategies to analyze datasets composed of normal and anomalous traffic.
- Understand the impact of different assumptions in the anomaly detection process, e.g., knowing the class label or not.
- Experiment and compare different anomaly detection methods.
- Use linear and non-linear data representation techniques to i) visualize cybersecurity anomalies, ii) reduce the number of available features and iii) evaluate changes in anomaly detection performance.
We used a standard Python data-science stack. The notebooks are compatible with recent Python 3.8+ environments. Recommended packages (install with pip):
# create and activate virtual environment (zsh)
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install jupyterlab notebook pandas numpy scikit-learn matplotlib seaborn torch torchvision tqdm
Notes:
- If you plan to train large models and have an NVIDIA GPU, install the CUDA-enabled PyTorch build for faster training.
- For reproducibility, set the same random seeds for numpy, torch and sklearn; the notebooks include seed-setting cells.
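
A seed-setting cell could look roughly like the sketch below. The helper name `set_seed` and the seed value are illustrative and not taken from the notebooks. scikit-learn has no global seed of its own: estimators fall back on NumPy's global generator unless an explicit `random_state` is passed, which is usually the safer option.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)       # also covers scikit-learn when random_state is None
    torch.manual_seed(seed)    # PyTorch generators


# Re-seeding reproduces the same random draws.
set_seed(42)
a = np.random.rand(3)
t1 = torch.rand(2)

set_seed(42)
b = np.random.rand(3)
t2 = torch.rand(2)
```

Bitwise-identical results on GPU may additionally depend on deterministic kernel settings, which this sketch does not cover.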
- Each sample is labelled as Normal, DoS, Probe or R2L. Anomalies make up 28% of the training set and 63% of the test set.
- Each sample is characterized by 41 features:
  - 3 categorical: preprocessed with one-hot encoding to obtain a numerical representation.
  - 38 numerical: normalized with z-score standardization.
- From the heatmaps, we noticed that some features are strongly correlated with specific attacks (e.g., DoS attacks).
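
The preprocessing described above can be sketched with scikit-learn as follows. The toy data frame and its column names are hypothetical stand-ins for the real dataset, and `handle_unknown="ignore"` is an assumption about how categories unseen during training might be handled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames standing in for the real data; column names are hypothetical.
train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "duration": [0.0, 12.0, 3.0, 1.0],
    "src_bytes": [181.0, 239.0, 0.0, 1032.0],
})
test = pd.DataFrame({
    "protocol_type": ["tcp", "gre"],   # "gre" never appears in training
    "duration": [2.0, 5.0],
    "src_bytes": [100.0, 50.0],
})

categorical = ["protocol_type"]
numerical = ["duration", "src_bytes"]

preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numerical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted statistics on test.
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)
```

Fitting the encoder and scaler on the training split alone avoids leaking test-set statistics into the model.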
The results are reported on the test set, and the models were evaluated on the full set (anomalies + normal data). While One-Class SVM models trained on mixed data (normal data + some anomalies) can perform reasonably well if ν is carefully tuned to the contamination rate, they generally exhibit lower stability and performance than the model trained on clean data.
| Model | % of anomalies in training | Accuracy | Macro F1 score |
|---|---|---|---|
| OC-SVM | 0% | 74% | 73% |
| OC-SVM | 10% | 46% | 45% |
| OC-SVM | 100% | 67% | 65% |
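
As an illustration of the clean-data setting (0% anomalies in training), the sketch below fits a One-Class SVM on synthetic normal points and scores a mixed test set. The data, the RBF kernel, and the value of `nu` are assumptions chosen for the example, not the configuration used in the lab.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: normal points near the origin, anomalies far away.
X_train_normal = rng.normal(0.0, 1.0, size=(500, 5))
X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 5)),   # normal
    rng.normal(6.0, 1.0, size=(200, 5)),   # anomalous
])
y_test = np.array([0] * 200 + [1] * 200)   # 1 = anomaly

# nu upper-bounds the fraction of training points treated as outliers.
# With clean training data a small value is a natural choice; with
# contaminated data it would be set close to the expected anomaly rate.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train_normal)

# predict() returns +1 for inliers and -1 for outliers; map to 0/1 labels.
y_pred = (model.predict(X_test) == -1).astype(int)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.2f}  macro F1={macro_f1:.2f}")
```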
The architecture consists of a symmetric Encoder-Decoder structure:
- Encoder: linear layers that expand the input to 128 dimensions, then reduce it to 64 and finally to a bottleneck of 16. Each intermediate layer is followed by Batch Normalization, ReLU activation, and Dropout (0.2) to prevent overfitting.
- Decoder: mirrors the encoder, expanding dimensions from 16 to 64, 128, and finally back to the original input dimension.
- Training: learning rate = 0.0005, 50 epochs.
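
A minimal PyTorch sketch of this architecture is shown below. Several details are assumptions that the description does not settle: the bottleneck and output layers are left without activation, the optimizer is Adam, the loss is mean squared error, the input dimension of 20 is a placeholder, and training runs full-batch on random data.

```python
import torch
from torch import nn


class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int, dropout: float = 0.2):
        super().__init__()

        def block(n_in: int, n_out: int) -> list:
            # Linear layer followed by BatchNorm, ReLU and Dropout.
            return [nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                    nn.ReLU(), nn.Dropout(dropout)]

        self.encoder = nn.Sequential(
            *block(input_dim, 128),
            *block(128, 64),
            nn.Linear(64, 16),           # bottleneck (no activation: assumption)
        )
        self.decoder = nn.Sequential(
            *block(16, 64),
            *block(64, 128),
            nn.Linear(128, input_dim),   # back to the input dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


torch.manual_seed(0)
X = torch.randn(256, 20)                 # placeholder for the preprocessed features
model = AutoEncoder(input_dim=20)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = nn.MSELoss()

model.train()
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

# Per-sample reconstruction error, usable as an anomaly score.
model.eval()
with torch.no_grad():
    errors = ((model(X) - X) ** 2).mean(dim=1)
    codes = model.encoder(X)             # 16-dimensional representation
```

Samples whose reconstruction error exceeds a chosen threshold would be flagged as anomalous; how that threshold is selected is not specified here.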
| Model | Accuracy | Macro F1 score |
|---|---|---|
| Auto-Encoder with reconstruction error | 78% | 75% |
| Auto-Encoder + OC-SVM | 74% | 73% |
| PCA + OC-SVM | 74% | 74% |
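
The PCA + OC-SVM row can be approximated with a scikit-learn pipeline like the one below. The number of components (16, chosen here only to match the autoencoder bottleneck), the `nu` value, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(size=(400, 30))               # stand-in for normal traffic
X_test = np.vstack([
    rng.normal(size=(50, 30)),                     # normal
    rng.normal(loc=5.0, size=(50, 30)),            # anomalous
])

# Reduce dimensionality with PCA, then fit the one-class model
# on the projected data.
detector = make_pipeline(
    PCA(n_components=16, random_state=0),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
detector.fit(X_train)

# -1 marks samples the detector considers outliers.
is_anomaly = detector.predict(X_test) == -1
```

The Auto-Encoder + OC-SVM variant follows the same pattern, with the encoder output taking the place of the PCA projection.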
- K-Means (4 clusters):
  - Two clusters are predominantly normal, while the others are mainly composed of DoS and Probe samples.
  - The clusters with lower silhouette scores contain all types of attacks.
  - The best t-SNE representation is the one with perplexity 30; the most misinterpreted points are those close to an area populated by samples of another attack.
- DBSCAN:
  - min_points = 10 and ϵ = 0.25: 100 clusters generated, 7031 points detected as noise (74% of which are normal samples).
  - min_points = 1600, ϵ = 0.32, and metric = cosine: 3 clusters + noise (29% of points).
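
The clustering analysis above can be sketched as follows on synthetic blobs. The data and the DBSCAN parameter values are chosen for the toy example and differ from those listed above; in scikit-learn, `min_points` corresponds to the `min_samples` argument, and a cosine distance could be selected with `metric="cosine"`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)

# Four well-separated synthetic blobs standing in for the real feature matrix.
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# K-Means with 4 clusters and the mean silhouette score of each cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
sil = silhouette_samples(X, kmeans.labels_)
per_cluster_sil = {
    int(k): float(sil[kmeans.labels_ == k].mean()) for k in range(4)
}

# DBSCAN: the label -1 marks points treated as noise.
db = DBSCAN(eps=1.0, min_samples=10).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
noise_fraction = float(np.mean(db.labels_ == -1))

print(per_cluster_sil)
print(f"DBSCAN clusters={n_clusters}  noise={noise_fraction:.1%}")
```

When ground-truth labels are available for inspection, cross-tabulating them against the cluster labels shows which clusters are dominated by normal traffic and which by a given attack type.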
| Name |
|---|
| Renato Mignone |
| Claudia Sanna |
| Chiara Iorio |
