This project implements a Network Intrusion Detection System (NIDS) using a Deep Learning Autoencoder architecture. It is designed to detect anomalies in network traffic using the UNSW-NB15 dataset. The system learns the pattern of "normal" network traffic and flags any traffic with high reconstruction error as an attack/anomaly.
Traditional signature-based IDS requires a database of known threats, so it cannot recognize attacks that have no signature yet. This project uses an anomaly detection approach instead:
- Training: The Autoencoder is trained only on normal network traffic data. It learns to compress (encode) and reconstruct (decode) these normal patterns efficiently.
- Detection: When new data comes in, the model attempts to reconstruct it.
  - Normal Traffic: Low reconstruction error (the model knows this pattern).
  - Attack Traffic: High reconstruction error (the model has never seen this pattern before).
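The detection step above can be illustrated with a minimal, dependency-free sketch. It assumes the reconstruction error is the per-flow mean squared error between the input features and the model's output; the function names `reconstruction_errors` and `flag_anomalies` and the toy values are hypothetical, not taken from this repository.

```python
from typing import List, Sequence


def reconstruction_errors(
    original: Sequence[Sequence[float]],
    reconstructed: Sequence[Sequence[float]],
) -> List[float]:
    """Return the mean squared error of each row against its reconstruction."""
    errors = []
    for x_row, r_row in zip(original, reconstructed):
        if len(x_row) != len(r_row):
            raise ValueError("row lengths differ")
        squared = [(x - r) ** 2 for x, r in zip(x_row, r_row)]
        errors.append(sum(squared) / len(squared))
    return errors


def flag_anomalies(errors: Sequence[float], threshold: float) -> List[bool]:
    """Mark a flow as anomalous when its error exceeds the threshold."""
    return [e > threshold for e in errors]


# Toy values (assumed): the first flow is reconstructed closely, the second is not.
flows = [[0.10, 0.20, 0.30], [0.90, 0.10, 0.80]]
rebuilt = [[0.10, 0.25, 0.30], [0.20, 0.60, 0.10]]
errors = reconstruction_errors(flows, rebuilt)
flags = flag_anomalies(errors, threshold=0.05)
```

In a real run the `rebuilt` rows would come from the trained autoencoder's predictions, and the same computation would normally be vectorized with NumPy.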
- Robust Data Preprocessing: Handles categorical data (One-Hot Encoding), scaling (MinMax/Standardization), and cleaning of the complex UNSW-NB15 dataset.
- Deep Autoencoder Architecture: Designed with TensorFlow/Keras, using symmetrical encoder-decoder layers to capture latent features of network flow.
- Dynamic Thresholding: Automatically calculates the optimal threshold for anomaly detection based on statistical analysis of reconstruction errors (e.g., Mean + 2*Std, 95th/99th Percentile, or optimized F1-score balance).
- Comprehensive Visualization:
  - Loss Curves: To monitor training progress and check for overfitting.
  - Reconstruction Error Histograms: To visually inspect the separation between normal and attack traffic distributions.
  - Confusion Matrix: To evaluate the classification performance (TP, FP, TN, FN).
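The preprocessing named in the feature list (one-hot encoding and min-max scaling) might look roughly like the sketch below. It is a plain-Python illustration, not the code in `src/data_loader.py`: the helper names are hypothetical, and the `proto` and `dur` columns are used only as example inputs. The key assumption is that the category vocabulary and the scaling bounds are fitted on training data and then reused unchanged on test data.

```python
from typing import List, Sequence, Tuple


def fit_one_hot(values: Sequence[str]) -> List[str]:
    """Learn the category vocabulary from training data only."""
    return sorted(set(values))


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode one value; categories unseen in training map to all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]


def fit_min_max(column: Sequence[float]) -> Tuple[float, float]:
    """Learn the (min, max) bounds of a numeric column from training data."""
    return min(column), max(column)


def min_max(value: float, bounds: Tuple[float, float]) -> float:
    """Scale a value to [0, 1] relative to the training bounds (not clipped)."""
    lo, hi = bounds
    if hi == lo:
        return 0.0  # constant column: avoid division by zero
    return (value - lo) / (hi - lo)


# Toy training columns (assumed values).
proto_train = ["tcp", "udp", "tcp"]
dur_train = [0.0, 2.0, 4.0]

categories = fit_one_hot(proto_train)
bounds = fit_min_max(dur_train)
encoded = [
    one_hot(p, categories) + [min_max(d, bounds)]
    for p, d in zip(proto_train, dur_train)
]
```

With scikit-learn available, `OneHotEncoder` and `MinMaxScaler` cover the same two steps; the fit-on-train, transform-on-test split is the part that matters for an autoencoder trained on normal traffic only.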
├── main.py # Main entry point: manages the full pipeline (load -> train -> evaluate).
├── preprocess_and_save.py # Standalone script for data preprocessing and saving to CSV.
├── requirements.txt # List of Python dependencies.
├── src/
│ ├── data_loader.py # Functions for loading raw CSVs and saving processed data.
│ ├── model.py # Definition of the Autoencoder class/model architecture.
│ └── utils.py # Helper functions for plotting and metrics calculation.
├── outputs/ # Directory where generated graphs and models are saved.
├── data/ # Directory for processed intermediate data (excluded from git).
└── UNSW_NB15/ # Directory for raw dataset files (excluded from git).
- Clone the repository:
  git clone https://github.com/PCopath/Autoencoder-NIDS-Project.git
  cd Autoencoder-NIDS-Project
- Install dependencies (a virtual environment such as conda or venv is recommended):
  pip install -r requirements.txt
- Download the dataset:
  - Download the UNSW-NB15 dataset (CSV files).
  - Place the following files inside the UNSW_NB15/ folder in the project root:
    - UNSW_NB15_training-set.csv
    - UNSW_NB15_testing-set.csv
  - Note: The dataset is too large to be included in this repository.
Run the main script to load data, train the model, and evaluate results:
python main.py
This script checks whether processed data already exists; if not, it runs preprocessing automatically.
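The "preprocess only if needed" behaviour can be sketched as a small guard function. This is an assumption about how such a check could be written, not a copy of `main.py`: `ensure_processed`, its parameters, and the idea of a single processed file are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def ensure_processed(processed_path: Path, preprocess: Callable[[Path], None]) -> bool:
    """Run `preprocess` only when the processed file is missing.

    Returns True when preprocessing ran, False when cached data was reused.
    """
    if processed_path.exists():
        return False
    # Create the output directory (for example data/) before writing into it.
    processed_path.parent.mkdir(parents=True, exist_ok=True)
    preprocess(processed_path)
    return True
```

The first call with a missing file triggers preprocessing; later calls return immediately and reuse the file already on disk.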
If you want to prepare the data without training (useful for debugging or preparing large datasets):
python preprocess_and_save.py
After running the model, check the outputs/ directory for:
- loss_curve.png: Shows the training and validation loss (Mean Squared Error) over epochs. A decreasing curve indicates the model is learning to reconstruct normal traffic.
- reconstruction_error_hist.png: A histogram showing the distribution of errors. You should ideally see two distinct peaks: one for Normal traffic (left, low error) and one for Attacks (right, high error). The vertical line represents the calculated threshold.
- confusion_matrix.png: Displays the counts of correct and incorrect classifications (TP, FP, TN, FN) on the test set.
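For reference, the four cells of the confusion matrix and the rates usually derived from them can be computed as in the sketch below. It assumes labels are encoded as 1 for attack and 0 for normal; the function names and toy labels are hypothetical and are not taken from `src/utils.py`.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN with 1 = attack and 0 = normal."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["FN" if truth == 1 else "TN"] += 1
    return counts


def rates(c: Dict[str, int]) -> Dict[str, float]:
    """Derive precision, recall, and false positive rate from the counts."""
    tp, fp, tn, fn = c["TP"], c["FP"], c["TN"], c["FN"]
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy labels (assumed): three normal flows followed by three attacks.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = rates(counts)
```

Here `y_pred` would be obtained by comparing each flow's reconstruction error against the chosen threshold.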
- Unsupervised/Semi-supervised Learning: The labels (Attack/Normal) are stripped during training. The model is trained purely on X_train (Normal data). Labels are only used at the end to evaluate how well the anomaly detection worked.
- Threshold Selection: The system calculates multiple potential thresholds. The "Balanced" threshold attempts to maximize the F1-Score while keeping the False Positive Rate (FPR) low.
- Git Ignore: Large files (raw CSVs in UNSW_NB15/ and processed CSVs in data/) are ignored to keep the repository lightweight.
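The threshold selection described in the notes above can be sketched as follows. This is one plausible reading, not the project's exact procedure: the statistical candidates are computed from normal-traffic errors, and the "balanced" choice is assumed to be the candidate with the highest F1-score among those whose FPR stays at or below a cap. All names, the nearest-rank percentile, and the `max_fpr` parameter are assumptions made for illustration.

```python
import math
import statistics
from typing import Dict, Optional, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def statistical_thresholds(normal_errors: Sequence[float]) -> Dict[str, float]:
    """Candidate thresholds derived from errors on normal traffic only."""
    mean = statistics.mean(normal_errors)
    std = statistics.pstdev(normal_errors)  # population std, like NumPy's default
    return {
        "mean_plus_2std": mean + 2 * std,
        "p95": percentile(normal_errors, 95),
        "p99": percentile(normal_errors, 99),
    }


def balanced_threshold(
    errors: Sequence[float], labels: Sequence[int], max_fpr: float = 0.05
) -> Optional[Tuple[float, float]]:
    """Return (threshold, f1) with the best F1 among thresholds whose FPR <= max_fpr."""
    negatives = sum(1 for y in labels if y == 0)
    best = None
    for t in sorted(set(errors)):
        tp = fp = fn = 0
        for e, y in zip(errors, labels):
            flagged = e > t
            if flagged and y == 1:
                tp += 1
            elif flagged and y == 0:
                fp += 1
            elif not flagged and y == 1:
                fn += 1
        fpr = fp / negatives if negatives else 0.0
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if fpr <= max_fpr and (best is None or f1 > best[1]):
            best = (t, f1)
    return best


# Toy data (assumed): label 1 = attack, 0 = normal.
thresholds = statistical_thresholds([0.01, 0.02, 0.03, 0.04, 0.05])
best = balanced_threshold(
    errors=[0.01, 0.02, 0.03, 0.04, 0.50, 0.60, 0.05, 0.70],
    labels=[0, 0, 0, 0, 1, 1, 0, 1],
    max_fpr=0.25,
)
```

Note that the balanced search needs labels, so it should be run on a labelled validation split rather than on the normal-only training data, while the statistical candidates need no labels at all.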
This project is open-source. Feel free to modify and use it for educational or research purposes.