Laboratory 3 for the AIC Course

This repository contains all the materials, scripts, and documentation for Laboratory 3, conducted as part of the AI and Cybersecurity course.

Overview

The goal of this project is to apply shallow and deep learning methods for anomaly detection and data representation to the Intrusion Detection System (IDS) task. The main objective is to evaluate whether unsupervised algorithms can automatically detect anomalous patterns and to analyze how their performance evolves as our knowledge of the data increases. The steps performed are:

Dataset characterization: Examine the dataset to understand the number of categorical and numerical features. Check how the attack labels and the binary label are distributed.
Shallow anomaly detection: Use One-Class SVM in a supervised and unsupervised setting.
Deep anomaly detection and representation: Use an Autoencoder for anomaly detection and compare its ability to learn a meaningful data representation with that of PCA.
Unsupervised detection and interpretation: Use clustering and visualization techniques to handle the case where no labels are available for learning anomalous patterns.

Repository Structure

Laboratory3/
├── lab/            # Data, notebooks and support material
├── report/         # LaTeX source files for the lab report
├── resources/      # Additional resources (e.g., links, PDFs, images)
└── README.md       # This file

The detailed lab report, including all experimental results and analysis, can be found here. For a runnable summary of the experiments and step-by-step code, open lab/notebooks/.

Lab Objectives & Requirements

The main learning objectives for the lab were:

Learn possible strategies to analyze datasets composed of normal and anomalous traffic.
Understand the impact of different assumptions in the anomaly detection process, e.g., knowing the class label or not.
Experiment and compare different anomaly detection methods.
Use linear and non-linear data representation techniques to i) visualize cybersecurity anomalies, ii) reduce the number of available features and iii) evaluate changes in anomaly detection performance.

Requirements

We used a standard Python data-science stack. The notebooks are compatible with recent Python 3.8+ environments. Recommended packages (install with pip):

# create and activate virtual environment (zsh)
python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install jupyterlab notebook pandas numpy scikit-learn matplotlib seaborn torch torchvision tqdm

Notes:

If you plan to train large models and have an NVIDIA GPU, install the CUDA-enabled PyTorch build for faster training.
For reproducibility, set the same random seeds for numpy, torch and sklearn; the notebooks include seed-setting cells.
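The seed-setting step mentioned above can be sketched as follows. This is a minimal illustration, not the notebooks' actual cell: the helper name `set_seed` and the default value 42 are assumptions. Note that scikit-learn has no global seed of its own; estimators either fall back on NumPy's global generator or take an explicit `random_state` argument.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch random number generators.

    Hypothetical helper: the name and default seed are illustrative only.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Silently ignored when CUDA is not available.
    torch.cuda.manual_seed_all(seed)


def draw() -> tuple:
    """Draw one value from each generator to show the effect of seeding."""
    return (random.random(), float(np.random.rand()), torch.rand(1).item())


set_seed(0)
first = draw()
set_seed(0)
second = draw()
```

After re-seeding, `first` and `second` hold the same values. For scikit-learn estimators, passing `random_state=seed` explicitly is usually the more robust choice.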

Summary of the results

Task 1: Dataset Characterization and Preprocessing

The samples can be labelled as Normal, DoS, Probe or R2L. Anomalies make up 28% of the training set and 63% of the test set.
Each sample is characterized by 41 features:
- 3 categorical: preprocessed using one-hot encoding to transform categorical features into numerical representations.
- 38 numerical: we used z-score standardization to normalize numerical features.
From the heatmaps, we noticed that some features are strongly correlated with specific attacks (e.g., DoS attacks).
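The preprocessing described above (one-hot encoding for categorical features, z-score standardization for numerical ones) could look roughly like this with scikit-learn. The toy data and the column names `protocol_type`, `duration` and `src_bytes` are hypothetical, and ignoring categories unseen during training is an assumption, not something the lab states.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the real dataset (hypothetical column names).
train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "duration": [0.0, 2.0, 5.0, 1.0],
    "src_bytes": [100.0, 20.0, 0.0, 300.0],
})
test = pd.DataFrame({
    "protocol_type": ["tcp", "gre"],  # "gre" never appears in training
    "duration": [1.0, 3.0],
    "src_bytes": [50.0, 10.0],
})

categorical = ["protocol_type"]
numerical = ["duration", "src_bytes"]

preprocess = ColumnTransformer(
    [
        # Assumption: unseen test categories are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        # z-score: subtract the training mean, divide by the training std.
        ("num", StandardScaler(), numerical),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training set only, then reuse the fitted statistics on the test set.
X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)
```

Fitting the transformer on the training split alone keeps test-set statistics from leaking into the scaling step.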

Task 2: Shallow Anomaly Detection - Supervised vs Unsupervised

The results are reported on the test set, and the models were evaluated on the full set (normal data + anomalies). While One-Class SVM models trained on mixed data (normal samples plus some anomalies) can perform reasonably well if ν is carefully tuned to the contamination rate, they generally exhibit lower stability and performance than the model trained on clean data.

Model	% of anomalies in training	Accuracy	Macro F1 score
OC-SVM	0%	74%	73%
OC-SVM	10%	46%	45%
OC-SVM	100%	67%	65%
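The comparison above can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for illustration: the Gaussian toy data, the RBF kernel, the ν value used for the clean model, and the helper name `fit_and_score`. Only the idea of setting ν close to the contamination rate comes from the text, and the numbers this produces are unrelated to the table.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for the traffic data: normal points near the origin,
# attack points shifted far away.
normal_train = rng.normal(0.0, 1.0, size=(500, 5))
normal_test = rng.normal(0.0, 1.0, size=(200, 5))
attack_test = rng.normal(6.0, 1.0, size=(200, 5))

X_test = np.vstack([normal_test, attack_test])
y_test = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]  # 1 = anomaly


def fit_and_score(X_train, nu):
    """Fit a One-Class SVM and return (accuracy, macro F1) on the test set."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)
    # predict() returns +1 for inliers and -1 for outliers; map to 0/1.
    y_pred = (model.predict(X_test) == -1).astype(int)
    acc = accuracy_score(y_test, y_pred)
    macro_f1 = f1_score(y_test, y_pred, average="macro")
    return acc, macro_f1


# Clean setting: train on normal data only, with a small nu.
clean_acc, clean_f1 = fit_and_score(normal_train, nu=0.01)

# Mixed setting: about 10% anomalies in training, nu set to that rate.
attack_train = rng.normal(6.0, 1.0, size=(56, 5))
mixed_train = np.vstack([normal_train, attack_train])
contamination = len(attack_train) / len(mixed_train)
mixed_acc, mixed_f1 = fit_and_score(mixed_train, nu=contamination)
```

Here ν acts as an upper bound on the fraction of training points treated as outliers, which is why matching it to the contamination rate matters in the mixed setting.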

Task 3: Deep Anomaly Detection and Data Representation

The architecture consists of a symmetric Encoder-Decoder structure:

Encoder: linear layers that expand the input to 128 dimensions, then reduce it to 64 and finally to a bottleneck of 16. Each intermediate layer is followed by Batch Normalization, ReLU activation, and Dropout (0.2) to prevent overfitting.
Decoder: mirrors the encoder, expanding from 16 to 64, then 128, and finally back to the original input dimension.
Training: learning rate = 0.0005, 50 epochs.
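A PyTorch sketch of an architecture matching this description is given below. Several details are assumptions because the text does not specify them: the Adam optimizer, the MSE loss, the absence of an activation on the bottleneck and output layers, the input size of 40, and the random training data. The loop runs 3 epochs only to keep the sketch fast.

```python
import torch
from torch import nn


def block(n_in: int, n_out: int, p_drop: float = 0.2) -> nn.Sequential:
    """Linear layer followed by BatchNorm, ReLU and Dropout."""
    return nn.Sequential(
        nn.Linear(n_in, n_out),
        nn.BatchNorm1d(n_out),
        nn.ReLU(),
        nn.Dropout(p_drop),
    )


class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int, bottleneck: int = 16):
        super().__init__()
        # input -> 128 -> 64 -> 16 (assumption: plain linear bottleneck)
        self.encoder = nn.Sequential(
            block(input_dim, 128), block(128, 64), nn.Linear(64, bottleneck)
        )
        # 16 -> 64 -> 128 -> input (assumption: plain linear output)
        self.decoder = nn.Sequential(
            block(bottleneck, 64), block(64, 128), nn.Linear(128, input_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


torch.manual_seed(0)
input_dim = 40                     # hypothetical feature count
X = torch.randn(256, input_dim)    # random stand-in for normal traffic

model = AutoEncoder(input_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # optimizer assumed
loss_fn = nn.MSELoss()                                      # loss assumed

model.train()
for epoch in range(3):  # the lab reports 50 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    optimizer.step()

model.eval()  # disables Dropout and uses BatchNorm running statistics
with torch.no_grad():
    codes = model.encoder(X)                       # 16-dimensional representation
    errors = ((model(X) - X) ** 2).mean(dim=1)     # per-sample reconstruction error
```

The `codes` tensor is the kind of representation that can be fed to a downstream detector, while `errors` is the per-sample score used for reconstruction-based detection.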

Model	Accuracy	Macro F1 score
Auto-Encoder with reconstruction error	78%	75%
Auto-Encoder + OC-SVM	74%	73%
PCA + OC-SVM	74%	74%
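For the reconstruction-error detector, a sample is flagged when its error exceeds a threshold. The text does not say how that threshold was chosen; the sketch below assumes one common option, a high percentile of the errors observed on normal training data. The helper names, the 95th percentile and the synthetic error values are all illustrative.

```python
import numpy as np


def fit_threshold(train_errors, percentile: float = 95.0) -> float:
    """Pick the threshold as a percentile of errors on normal training data."""
    return float(np.percentile(train_errors, percentile))


def flag_anomalies(errors, threshold: float) -> np.ndarray:
    """Return 1 for samples whose reconstruction error exceeds the threshold."""
    return (np.asarray(errors) > threshold).astype(int)


rng = np.random.default_rng(0)
# Synthetic reconstruction errors for normal samples (small, right-skewed).
normal_errors = rng.gamma(shape=2.0, scale=0.05, size=1000)
# Two well-reconstructed samples followed by two poorly reconstructed ones.
test_errors = np.array([0.01, 0.05, 2.0, 5.0])

threshold = fit_threshold(normal_errors)
flags = flag_anomalies(test_errors, threshold)
```

With this choice, roughly 5% of normal training samples fall above the threshold by construction, so the percentile directly trades false positives against missed attacks.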

Task 4: Unsupervised Anomaly Detection and Interpretation

K-Means (4 clusters):
- Two clusters are predominantly normal, while the other two are mainly composed of DoS and Probe samples.
- The clusters with lower silhouette scores contain all types of attacks.
- The best t-SNE representation is the one with perplexity 30; the most misinterpreted points are those close to an area populated by samples of another attack.
DBSCAN:
- min_points = 10 and ϵ = 0.25: 100 clusters generated, 7031 points detected as noise (74% of which are normal samples).
- min_points = 1600, ϵ = 0.32, and metric = cosine: 3 clusters + noise (29% of points).
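The two clustering analyses can be sketched with scikit-learn as shown below. The blob data and the `eps` value are assumptions chosen for the toy example, not the lab's settings. In scikit-learn the parameter called min_points above is named `min_samples`, and a cosine distance can be requested with `metric="cosine"`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the preprocessed traffic features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# K-Means with 4 clusters, then the mean silhouette score of each cluster.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
sil = silhouette_samples(X, kmeans.labels_)
per_cluster_sil = {
    int(c): float(sil[kmeans.labels_ == c].mean())
    for c in np.unique(kmeans.labels_)
}

# DBSCAN labels noise points with -1; every other label is a cluster id.
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
noise_mask = db.labels_ == -1
n_clusters = len(set(db.labels_)) - (1 if noise_mask.any() else 0)
noise_fraction = float(noise_mask.mean())
```

When ground-truth labels are available for interpretation, cross-tabulating them against `kmeans.labels_` or `noise_mask` gives the kind of per-cluster composition reported above.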

Authors

Name	GitHub	LinkedIn	Email
Renato Mignone
Claudia Sanna
Chiara Iorio
