MLAD is designed to detect anomalies in system logs across multiple systems by combining a Transformer with a Gaussian Mixture Model (GMM).
- Multi-System Anomaly Detection: Detects anomalies across multiple systems, overcoming limitations of traditional one-model-per-system methods.
- Hybrid Transformer-GMM Architecture: Integrates Transformers with GMMs, jointly learning semantic log representations while maintaining clear separation between normal and abnormal events.
- Alpha-Entmax Attention: Uses a sparse attention mechanism to better identify important keywords in log sequences (see the sketch after this list).
- "Identical Shortcut" Problem Solver: Mitigates the identical shortcut problem by transforming the vector space, effectively separating abnormal samples from normal ones.
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/MLAD.git
  cd MLAD
  ```
- Create a virtual environment:
  ```bash
  conda create -n mlad python=3.8
  conda activate mlad
  ```
- Install the dependencies:
  ```bash
  pip install -r requirements.txt
  ```
The implementation uses three public datasets:
- BGL: Blue Gene/L supercomputer logs from Lawrence Livermore National Laboratory.
- HDFS: Hadoop Distributed File System logs from Amazon EC2 nodes.
- Thunderbird: System service messages from Sandia National Labs' Thunderbird supercomputer.
To obtain these datasets:
- BGL and Thunderbird: USENIX CFDR Data
- HDFS: LogHub on GitHub
After downloading, place the log files in their respective directories:
```
data/
├── BGL/
│   └── BGL.log
├── HDFS/
│   └── HDFS.log
└── Thunderbird/
    └── Thunderbird.log
```
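
As a convenience (not part of the repository), a short snippet to confirm the expected layout before training:

```python
from pathlib import Path

# Verify that each expected raw log file is in place.
for name in ("BGL", "HDFS", "Thunderbird"):
    log_file = Path("data") / name / f"{name}.log"
    status = "found" if log_file.is_file() else "MISSING"
    print(f"{log_file}: {status}")
```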
```
MLAD/
├── data/                     # Datasets
├── models/                   # Model implementations
│   ├── alpha_entmax.py       # Alpha-entmax implementation
│   ├── feed_forward.py       # Feed-forward network with CELU
│   ├── gmm.py                # Gaussian Mixture Model
│   └── mlad.py               # Complete MLAD model
├── utils/                    # Utility functions
│   ├── data_loader.py        # Data loading utilities
│   └── log_preprocessing.py  # Log preprocessing functions
├── saved_models/             # Saved models directory
├── results/                  # Evaluation results directory
├── main.py                   # Main script to run the pipeline
├── train.py                  # Training script
├── evaluate.py               # Evaluation script
├── requirements.txt          # Dependencies
└── README.md                 # This file
```
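
For orientation, a minimal sketch of what a position-wise feed-forward block with CELU (as in models/feed_forward.py) might look like; names and dimensions are illustrative, not the repository's actual code:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block with CELU activation (hypothetical sketch)."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the hidden dimension
            nn.CELU(),                  # smooth, non-saturating activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back to the model dimension
        )

    def forward(self, x):
        return self.net(x)
```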
To run the complete pipeline (train, evaluate, visualize) on all datasets:
```bash
python main.py --visualize
```
To check if datasets are available:
```bash
python main.py --download_only
```
To train on specific datasets:
```bash
python main.py --train_only --datasets BGL HDFS
```
To evaluate on specific datasets (requires pre-trained models):
```bash
python main.py --eval_only --datasets BGL HDFS
```
To run transfer learning experiments between BGL and Thunderbird:
```bash
python main.py --transfer_learning --datasets BGL Thunderbird
```
To run an ablation study on the alpha parameter:
```bash
python main.py --alpha_ablation --datasets BGL
```
Customize model parameters:
```bash
python main.py --d_model 100 --n_heads 4 --n_layers 2 --alpha 1.5 --n_components 5
```
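The `--n_components` flag sets the number of Gaussian mixture components used to score embeddings. As a hedged illustration of GMM-based anomaly scoring (using scikit-learn as a stand-in; the repository's models/gmm.py may differ):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative only: fit a GMM on embeddings of normal logs,
# then flag low-likelihood samples as anomalies.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1000, 16))   # stand-in for normal log embeddings
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(normal)

scores = -gmm.score_samples(normal)              # negative log-likelihood as an energy score
threshold = np.percentile(scores, 99)            # e.g. flag the top 1% as anomalous
outlier = rng.normal(6.0, 1.0, size=(5, 16))     # stand-in for abnormal embeddings
print(-gmm.score_samples(outlier) > threshold)   # -> mostly True
```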
Customize training parameters:
```bash
python main.py --batch_size 512 --lr 0.001 --epochs 30
```
MLAD achieves the following precision, recall, and F1 scores across the datasets:
| Dataset | Precision | Recall | F1 Score |
|---|---|---|---|
| BGL | 0.9492 | 0.8932 | 0.9184 |
| HDFS | 0.9296 | 0.8656 | 0.8946 |
| Thunderbird | 0.8824 | 0.9066 | 0.8962 |
Visualization examples will be saved in the results/ directory when running with the --visualize flag.
This project is licensed under the MIT License - see the LICENSE file for details.
- Sparse Sequence-to-Sequence Models (Peters et al., 2019) for the alpha-entmax implementation