## Setup CoLog

CoLog can be set up in two ways: **Local Setup** or **Online Setup (Google Colab)**.

---

### Option 1: Local Setup

If you already have CoLog cloned on your local machine:

1. **Navigate to your CoLog directory** (e.g., `C:\Users\user\Documents\GitHub\colog`)
2. **No need to run `git clone`** — you're already working in the repository
3. **Install dependencies** using your local Python environment:
   ```bash
   python -m pip install -r requirements.txt
   ```
4. **Ensure the Python environment is configured** in VS Code (select your Python interpreter)
5. **Start working** with the existing dataset and groundtruth folders

**Advantages:**
- Faster access to files and datasets
- Better control over environment and dependencies
- Can use local GPU/CPU resources
- All changes are immediately saved to your local repository

---

### Option 2: Online Setup (Google Colab)

If you want to run CoLog in Google Colab or need a fresh clone:

1. **Clone the repository** from GitHub:
   ```bash
   !git clone https://github.com/NasirzadehMoh/CoLog.git
   ```
2. **Change directory** to the cloned folder:
   ```bash
   %cd CoLog
   ```
3. **Install dependencies** using Colab's Python environment:
   ```bash
   !pip install -r requirements.txt
   ```
4. **Upload or extract (unrar) the datasets** if needed, or download them programmatically
5. **Run preprocessing and training** using the provided scripts

**Advantages:**
- Free GPU access (typically a Tesla T4; the assigned hardware varies with availability)
- No local setup required
- Easy sharing and collaboration
- Isolated environment for experiments

---

### For This Notebook

**We use the Online Setup (Google Colab Pro)** to take advantage of GPU resources and cloud-based execution. The cells below clone the repository, install dependencies, and set up the environment in Colab.

In [None]:
!git clone https://github.com/NasirzadehMoh/CoLog.git

### Install Requirements

This step installs all Python dependencies required to run CoLog, including:

**Core Dependencies:**
- **PyTorch** - Deep learning framework for model training
- **Transformers** - Hugging Face library for transformer models
- **SentenceTransformers** - For generating log message embeddings
- **scikit-learn** - Machine learning utilities and metrics
- **imbalanced-learn** - Tools for handling class imbalance (Tomek Links, SMOTE, etc.)
- **NumPy & Pandas** - Data manipulation and numerical operations
- **tqdm** - Progress bars for long-running operations

**Additional Libraries:**
- **TensorBoard** - Visualization for training metrics and hyperparameter tuning
- **spaCy** - NLP library for NER-based log parsing
- Various parser dependencies (Drain, etc.)

---

**Installation Options:**

#### For Google Colab (Online Setup):
```bash
!pip install -r requirements.txt
```

#### For Local Environment:
```bash
python -m pip install -r requirements.txt
```

#### For Virtual Environment (Recommended for Local):
```bash
# Create virtual environment first
python -m venv venv
# Activate it (Windows)
.\venv\Scripts\activate
# Or on Linux/Mac
source venv/bin/activate
# Then install
pip install -r requirements.txt
```
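
Before installing into a local environment, it can help to confirm that the virtual environment is actually active. The check below is a minimal sketch using only the standard library; `in_virtualenv` is a helper name chosen for illustration and is not part of CoLog.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtualenv())
```

This covers environments created with `python -m venv`; conda environments are not detected by this comparison.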

---

**Expected Installation Time:**
- ~5-10 minutes depending on internet speed and whether PyTorch needs to be downloaded
- Colab is usually faster because PyTorch and many other packages are pre-installed

**GPU Support:**
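- If you have a CUDA-enabled GPU, check that the installed PyTorch build includes CUDA support (`torch.cuda.is_available()` returns `True` when it does); depending on the platform, pip may install a CPU-only build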
- Use `--device cuda` in subsequent commands to leverage GPU acceleration
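
To see which device is actually usable, the sketch below resolves an `auto` setting to `cuda` or `cpu`. How CoLog handles `--device auto` internally is not shown in this notebook, so `resolve_device` is a hypothetical helper that only illustrates the usual pattern. The import is guarded so the check also runs before PyTorch is installed.

```python
import importlib.util


def resolve_device(requested: str = "auto") -> str:
    """Map 'auto' to 'cuda' when a usable GPU is visible, else 'cpu'."""
    if requested != "auto":
        return requested  # explicit choices are passed through unchanged
    # Without PyTorch there is nothing to query, so fall back to the CPU.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("Resolved device:", resolve_device("auto"))
```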

In [None]:
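# The working directory is still /content at this point, so point pip at the cloned repository
!pip install -r CoLog/requirements.txt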

### Change to CoLog Directory

After cloning the repository, we need to navigate into the CoLog directory to access all scripts and datasets.

---

**Why this is necessary:**
- All relative file paths in the scripts assume you're working from the CoLog root directory
- The `groundtruth/`, `datasets/`, `neuralnetwork/`, and other folders are located here
- Running scripts without being in the correct directory will cause "file not found" errors

---

**For Google Colab (Online Setup):**

The default working directory in Colab is `/content/`. After cloning, CoLog will be at `/content/CoLog/`.

```python
%cd /content/CoLog
```

The `%cd` magic command changes the current working directory for all subsequent cells.

---

**For Local Setup:**

If you're running locally (not in Colab), you would navigate to your local CoLog path:

```python
# Example for Windows
%cd C:/Users/user/Documents/GitHub/CoLog

# Example for Linux/Mac
%cd ~/Documents/CoLog
```

Or you can skip this if you opened the notebook from within the CoLog directory already.

---

**Verification:**

After running the next cell, you should see output like:
```
/content/CoLog
```

You can verify you're in the right place by listing files:
```python
!ls -la  # Linux/Mac/Colab
!dir     # Windows
```

You should see folders like `groundtruth/`, `datasets/`, `neuralnetwork/`, `transformer/`, etc.
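
The same check can be done from Python, which behaves identically on Windows, Linux, macOS, and Colab. This is a small sketch: the folder names come from the list above, and `missing_dirs` is an illustrative helper rather than part of the repository.

```python
from pathlib import Path

# Folder names taken from the text above; adjust if the repository layout differs.
EXPECTED_DIRS = ("groundtruth", "datasets", "neuralnetwork", "transformer")


def missing_dirs(root=".", expected=EXPECTED_DIRS):
    """Return the expected folder names that are absent under `root`."""
    root_path = Path(root)
    return [name for name in expected if not (root_path / name).is_dir()]


missing = missing_dirs()
if missing:
    print("Not in the CoLog root? Missing folders:", ", ".join(missing))
else:
    print("All expected folders found in", Path.cwd())
```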

In [None]:
%cd /content/CoLog

### Extract Ground Truth (All-in-One Pipeline)

This section demonstrates how to run `groundtruth/main.py` - an integrated pipeline that combines **preprocessing**, **ground truth extraction**, and **class imbalance resampling** in a single command.

---

**What it does (3 steps in one):**

1. **Preprocessing (Automatic):**
   - Parses raw logs using either Drain (rule-based) or NER-based parser
   - Extracts unique log messages across all parsed logs
   - Generates embeddings using SentenceTransformers and saves them as pickle files

2. **Ground Truth Extraction:**
   - Reads parsed logs (CSV from Drain or pickle files from NER parser)
   - Constructs sequences for each log message using a configurable sliding window
   - Assigns labels (normal/anomaly) based on dataset-specific labeling strategies
   - Splits data into train/validation/test sets and saves ground truth files

3. **Class Imbalance Resampling (Optional):**
   - Applies resampling techniques (e.g., Tomek Links, RandomUnderSampler) when `--resample` flag is used
   - Balances the training data to improve model performance
   - Creates resampled datasets in subdirectories
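
To make the resampling step concrete, here is the idea behind random undersampling written in plain Python. It is a sketch of the general technique only: the pipeline presumably delegates to imbalanced-learn, whose `RandomUnderSampler` has more options and may select samples differently. The function name and toy data are made up for illustration; the seed of 100 mirrors the `--random-seed` default.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=100):
    """Randomly drop majority-class samples until every class is equally sized."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    kept.sort()  # keep the original ordering of the surviving samples

    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: eight normal messages (0) and two anomalies (1).
toy_samples = [f"msg{i}" for i in range(10)]
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
balanced_samples, balanced_labels = random_undersample(toy_samples, toy_labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # prints: 2 2
```

Only the training split should be resampled; validation and test data need to keep their original class distribution so the reported metrics stay meaningful.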

---

**Key Concepts:**

- **sequence_type**: 
  - `background`: Uses current message + previous messages as context `[prev_msg, current_msg]`
  - `context`: Uses previous + current + following messages `[prev_msg, current_msg, next_msg]`
  
- **window_size**: Number of messages on each side (left/right) of the current message
  - Example: `window_size=1` with `context` type → `[msg_{i-1}, msg_i, msg_{i+1}]`

- **Labeling Strategies** (automatic based on dataset):
  - Type 1 (hadoop, zookeeper): Level column-based labeling
  - Type 2 (spark, windows): Wordlist heuristic
  - Type 3 (bgl): Label '-' indicates normal
  - Type 4 (casper-rw, dfrws*, honeynet*): NER parser stored pickle

---

**Available Command-Line Arguments:**

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--dataset` | str | `hadoop` | Dataset name to process (choices: hadoop, spark, bgl, windows, zookeeper, casper-rw, etc.) |
| `--dataset-dir` | str | `datasets/` | Path to the root datasets directory |
| `--sequence-type` | str | `context` | Sequence type: `background` or `context` |
| `--window-size` | int | `1` | Number of messages on each side of current message |
| `--train-ratio` | float | `0.6` | Fraction of non-test data for training |
| `--valid-ratio` | float | `0.2` | Fraction of non-test data for validation |
| `--model` | str | `all-MiniLM-L6-v2` | SentenceTransformer model for embeddings |
| `--batch-size` | int | `64` | Batch size for embeddings computation |
| `--device` | str | `auto` | Device for processing: `auto`, `cpu`, or `cuda` |
| `--force` | flag | False | Force re-processing even if files exist |
| `--verbose` | flag | False | Enable detailed debug-level logging |
| `--groundbreaking` | flag | False | Enable multi-label/sequence-level labeling |
| `--resample` | flag | False | Automatically run resampling after extraction |
| `--resample-method` | str | None | Resampling method (e.g., `TomekLinks`, `RandomUnderSampler`, `SMOTE`) |
| `--random-seed` | int | `100` | Random seed for reproducibility |
| `--dry-run` | flag | False | Preview what would be done without actually processing files |

---

**Output Files:**

Saved in `datasets/<dataset>/groundtruth/<sequence_type>_<window_size>/`:

**Basic Ground Truth Files:**
- `messages.p`: Message ID → tokenized message (NumPy array)
- `sequences.p`: Message ID → list of surrounding messages ('UNK' for padding)
- `labels.p`: Message ID → label integer (0=normal, 1=anomaly)
- `keys.p`: Ordered list of all message IDs
- `train_set.p`, `valid_set.p`, `test_set.p`: Split datasets for model training

**Resampled Files** (if `--resample` used):
- Saved in `resampled_groundtruth/<method>/` subdirectory
- Same structure as above but with balanced class distributions
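
A quick way to sanity-check the output is to load `labels.p` and count the classes. The sketch below assumes the file is a pickled dictionary mapping message IDs to integer labels, as described above; `groundtruth_dir` and `label_counts` are illustrative helpers, not part of CoLog.

```python
import pickle
from collections import Counter
from pathlib import Path


def groundtruth_dir(dataset, sequence_type="context", window_size=1, root="datasets"):
    """Build datasets/<dataset>/groundtruth/<sequence_type>_<window_size>/."""
    return Path(root) / dataset / "groundtruth" / f"{sequence_type}_{window_size}"


def label_counts(directory):
    """Load labels.p and count how many messages carry each label."""
    with open(Path(directory) / "labels.p", "rb") as handle:
        labels = pickle.load(handle)  # assumed: dict of message ID -> int label
    return Counter(labels.values())


target = groundtruth_dir("hadoop")
if (target / "labels.p").exists():
    print(label_counts(target))
else:
    print("No ground truth found at", target, "(run groundtruth/main.py first)")
```

Pickle files can execute arbitrary code when loaded, so only open files that your own pipeline run produced.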

---

**Usage Examples:**

```bash
# Basic usage - preprocessing + ground truth extraction only
!python groundtruth/main.py --dataset hadoop --sequence-type context --window-size 1

# With all three steps (preprocessing + extraction + resampling)
!python groundtruth/main.py --dataset spark \
    --sequence-type context --window-size 1 \
    --model all-MiniLM-L6-v2 --batch-size 128 --device auto \
    --resample --resample-method TomekLinks \
    --verbose

# Force regeneration with GPU acceleration
!python groundtruth/main.py --dataset bgl \
    --sequence-type background --window-size 2 \
    --batch-size 256 --device cuda \
    --force --verbose

# Custom train/valid split ratios with resampling
!python groundtruth/main.py --dataset windows \
    --train-ratio 0.7 --valid-ratio 0.15 \
    --resample --resample-method RandomUnderSampler

# Dry run to preview without processing
!python groundtruth/main.py --dataset zookeeper \
    --sequence-type context --window-size 1 \
    --resample --resample-method SMOTE \
    --dry-run
```

---

**Advantages of Unified Pipeline:**

✓ **Simplified workflow** - One command instead of three separate scripts  
✓ **Automatic preprocessing** - No need to run preprocessing separately  
✓ **Consistent parameters** - Same settings used across all steps  
✓ **Efficient execution** - Embeddings computed once and reused  
✓ **Reproducible** - Single random seed for all randomized operations  

---

**Note:** The script automatically detects which parser (Drain vs NER) to use based on the dataset type, so you don't need to worry about parser selection.

In [None]:
# Example: Complete pipeline with preprocessing, extraction, and resampling
!python groundtruth/main.py \
    --dataset hadoop \
    --sequence-type context \
    --window-size 1 \
    --model all-MiniLM-L6-v2 \
    --batch-size 128 \
    --device auto \
    --resample \
    --resample-method TomekLinks \
    --verbose

# Other datasets (uncomment to run):
# !python groundtruth/main.py --dataset spark --sequence-type context --window-size 1 --resample --resample-method TomekLinks --verbose
# !python groundtruth/main.py --dataset zookeeper --sequence-type context --window-size 1 --resample --resample-method TomekLinks --verbose
# !python groundtruth/main.py --dataset bgl --sequence-type context --window-size 1 --resample --resample-method TomekLinks --verbose
# !python groundtruth/main.py --dataset windows --sequence-type context --window-size 1 --resample --resample-method RandomUnderSampler --verbose
# !python groundtruth/main.py --dataset casper-rw --sequence-type context --window-size 1 --resample --resample-method TomekLinks --verbose

### Train CoLog

After extracting ground truth data, we can now train the CoLog model using the prepared dataset. The training script handles the entire neural network training process.

---

**Why this is necessary:**
- Trains the collaborative transformer model on your processed log sequences
- Learns patterns from normal and anomalous log sequences
- Creates model checkpoints that can be used for anomaly detection
- Automatically handles train/validation splits and optimization

---

**Available Command-Line Arguments:**

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| **Dataset Arguments** |
| `--dataset` | str | `hadoop` | Dataset to train on (choices: hadoop, spark, bgl, windows, zookeeper, casper-rw, etc.) |
| `--sequence-type` | str | auto-detect | Sequence construction type: `background` or `context` |
| `--window-size` | int | auto-detect | Number of messages on each side of current message in sequences |
| **Training Configuration** |
| `--name` | str | `CoLog` | Name identifier for this training run (used for organizing outputs) |
| `--batch-size` | int | `32` | Number of samples processed before model update |
| `--max-epoch` | int | `100` | Maximum number of training epochs |
| `--warmup-epoch` | float | `5` | Warmup epochs with gradually increasing learning rate |
| `--early-stop` | int | `10` | Early stopping patience (epochs without improvement) |
| `--evaluation-start` | int | `0` | Epoch after which evaluation starts |
| `--random-seed` | int | random | Random seed for reproducibility |
| `--device` | str | `auto` | Device for training: `cpu`, `cuda`, or `auto` |
| `--output` | str | `runs/` | Path to output results directory |
| `--train-ratio` | float | auto-detect | Fraction of dataset for training |
| **Optimizer Arguments** |
| `--optimizer` | str | `adam` | Optimizer function name |
| `--optimizer-params` | str | `{"weight_decay": 0}` | Optimizer parameters as string dict |
| `--learning-rate` | float | `0.001` | Initial learning rate |
| `--lr-decay` | float | `0.1` | Learning rate decay coefficient |
| `--decay-times` | int | `3` | Number of learning rate reductions |
| `--grad-clip` | float | `5.0` | Gradient clipping max norm; -1 disables clipping |
| **Model Architecture** |
| `--embedding-size` | int | `300` | Dimension of word embeddings (input to message adapter LSTM) |
| `--sequences-fsize` | int | `384` | Dimension of sequence features (input to sequence adapter) |
| `--layers` | int | `2` | Number of collaborative transformer layers |
| `--heads` | int | `8` | Number of attention heads in collaborative transformer |
| `--hidden-size` | int | `256` | Size of collaborative transformer hidden layers |
| `--dropout-rate` | float | `0.1` | Dropout probability for regularization |
| `--projection-size` | int | `128` | Size of projection layer for modality fusion |
| **Data Processing** |
| `--len-messages` | int | `50` | Maximum length of log messages (tokens) |
| `--len-sequences` | int | `50` | Maximum length of log sequences |
| **Class Imbalance** |
| `--resample-method` | str | `Original` | Method for handling class imbalance (TomekLinks, RandomUnderSampler, SMOTE, etc.) |
| `--groundbreaking` | flag | False | Enable groundbreaking mode with 4 classes instead of 2 |
| **Hyperparameter Tuning** |
| `--tuning` | flag | False | Enable hyperparameter tuning mode using Ray Tune |
| `--train-bmodel` | flag | False | Train with best configuration after tuning (only with `--tuning`) |
| `--tuner-samples` | int | `10` | Number of tuning configurations to sample |
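
`--optimizer-params` takes a dictionary encoded as a string. How `train.py` parses it is not shown here (it may use a different mechanism), but a string such as the default `{"weight_decay": 0}` is valid JSON, so one way to inspect or validate a value before passing it on the command line is the hypothetical helper below.

```python
import json


def parse_optimizer_params(text):
    """Parse a JSON-style dict string such as '{"weight_decay": 0}'."""
    params = json.loads(text)
    if not isinstance(params, dict):
        raise ValueError("optimizer parameters must be a dictionary")
    return params


print(parse_optimizer_params('{"weight_decay": 0}'))  # prints: {'weight_decay': 0}
```

On the command line, wrap the value in single quotes so the shell leaves the inner double quotes intact.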

---

**How Training Works:**

1. **Loads Prepared Data**: Reads the ground truth data from `datasets/<dataset>/groundtruth/<sequence_type>_<window_size>/`
2. **Initializes Model**: Creates the collaborative transformer architecture with specified hyperparameters
3. **Training Loop**: Runs for specified epochs, optimizing on the training set with backpropagation
4. **Validation**: Periodically evaluates on validation set to prevent overfitting
5. **Checkpointing**: Saves best models to `runs/<name>/models/`
6. **Logging**: Records training metrics to `runs/<name>/logs/` for TensorBoard visualization

---

**Output Location:**

After training, you'll find:
- Model checkpoints: `runs/<dataset_name>/models/best<seed>.pkl`
- Training logs: `runs/<dataset_name>/logs/` (TensorBoard events)
- Configuration file: `runs/<dataset_name>/<name>_config.json`
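
To check which checkpoints a run produced, the models directory can be listed from Python. This sketch assumes the `best*.pkl` naming shown above and treats "newest" as most recently modified; `newest_checkpoints` is an illustrative helper, and the ordering CoLog itself applies may differ.

```python
from pathlib import Path


def newest_checkpoints(models_dir, count=1):
    """Return up to `count` best*.pkl files, most recently modified first."""
    candidates = sorted(
        Path(models_dir).glob("best*.pkl"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )
    return candidates[:count]


# Prints nothing if the directory does not exist or holds no checkpoints yet.
for checkpoint in newest_checkpoints("runs/hadoop/models", count=3):
    print(checkpoint.name)
```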

---

**Training Tips:**

- Start with `--max-epoch 20` for initial experiments
- Increase epochs if validation loss is still decreasing
- Monitor training progress in the output logs
- With the default `--device auto`, a CUDA GPU is used automatically when one is available, which is typically much faster than training on the CPU
- Use `--early-stop` to prevent overfitting and save time
- For hyperparameter tuning, use `--tuning --tuner-samples 20` to search optimal configuration
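
The patience mechanism behind `--early-stop` can be illustrated with a few lines of Python. This is a generic sketch of early stopping, not CoLog's implementation: the `EarlyStopper` class is hypothetical, and the real script may monitor a metric other than validation loss.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def should_stop(self, validation_loss):
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.stale_epochs = 0  # an improvement resets the counter
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.5]):
    if stopper.should_stop(loss):
        print(f"stopping after epoch {epoch}")  # prints: stopping after epoch 3
        break
```

In this toy run the loss would have improved again at epoch 4, which shows the trade-off: a small patience saves time but can stop before a late improvement.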

---

**Usage Examples:**

```bash
# Basic training with default parameters
!python train.py --name hadoop --dataset hadoop --max-epoch 20

# Training with custom hyperparameters
!python train.py --name spark_custom --dataset spark \
    --batch-size 64 --learning-rate 0.0005 --max-epoch 50 \
    --layers 3 --heads 12 --hidden-size 512

# Training with resampled data for imbalanced datasets
!python train.py --name windows_balanced --dataset windows \
    --resample-method RandomUnderSampler --max-epoch 30

# Hyperparameter tuning mode
!python train.py --name bgl_tuned --dataset bgl \
    --tuning --tuner-samples 20 --train-bmodel

# Training with specific device and random seed
!python train.py --name zookeeper_reproducible --dataset zookeeper \
    --device cuda --random-seed 42 --max-epoch 25
```

---

**Multiple Datasets:**

The cell below includes commented-out examples for all supported datasets. Uncomment the line for the dataset you want to train on. You can train on multiple datasets sequentially or in separate runs.

In [None]:
!python train.py --name hadoop --dataset hadoop --max-epoch 20
# !python train.py --name spark --dataset spark --max-epoch 20
# !python train.py --name zookeeper --dataset zookeeper --max-epoch 20
# !python train.py --name bgl --dataset bgl --max-epoch 20
# !python train.py --name windows --dataset windows --max-epoch 20
# !python train.py --name casper-rw --dataset casper-rw --max-epoch 20
# !python train.py --name dfrws-2009-jhuisi --dataset dfrws-2009-jhuisi --max-epoch 20
# !python train.py --name dfrws-2009-nssal --dataset dfrws-2009-nssal --max-epoch 20
# !python train.py --name honeynet-challenge7 --dataset honeynet-challenge7 --max-epoch 20
# !python train.py --name honeynet-challenge5 --dataset honeynet-challenge5 --max-epoch 20

### Evaluate CoLog

After training the CoLog model, we evaluate its performance on the test dataset to measure anomaly detection accuracy and generate comprehensive metrics reports.

---

**Why this is necessary:**
- Assesses the trained model's performance on unseen test data
- Generates detailed classification metrics (precision, recall, F1-score)
- Creates visualization plots (confusion matrix, ROC curve, PR curve)
- Supports ensemble evaluation using multiple checkpoints for improved accuracy
- Enables cross-dataset generalizability testing

---

**Available Command-Line Arguments:**

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| **Evaluation Configuration** |
| `--checkpoints-path` | str | auto-generated | Path to checkpoints directory containing trained model files (best*.pkl) |
| `--eval-sets` | list | `['test_set']` | List of evaluation sets to use: `valid_set`, `test_set`, or both |
| `--ensemble` | flag | False | Enable ensemble predictions by averaging outputs from multiple checkpoints |
| `--num-ckpts` | int | `1` | Number of checkpoints to use for evaluation/ensembling (newest first) |
| **Model Configuration** |
| `--name` | str | `CoLog` | Name identifier for the model (used for generalizability testing) |
| `--dataset` | str | `hadoop` | Dataset name for evaluation (choices: hadoop, spark, bgl, windows, etc.) |
| **Generalizability Testing** |
| `--eval-generalizability` | flag | False | Enable cross-dataset generalization evaluation on unseen datasets |
| **Reporting Options** |
| `--plot-metrics` | flag | False | Generate full evaluation report with plots (confusion matrix, ROC, PR curves) |

---

**How Evaluation Works:**

1. **Load Model Checkpoint(s)**: Loads best trained model(s) from `runs/<dataset>/models/`
2. **Load Test Data**: Reads test set from ground truth directory
3. **Generate Predictions**: Runs model inference on test sequences
4. **Calculate Metrics**: Computes accuracy, precision, recall, F1-score, AUC, etc.
5. **Visualization** (if `--plot-metrics`): Creates plots and saves as SVG/EPS files
6. **Export Reports**: Saves classification reports to CSV files

---

**Evaluation Modes:**

**1. Standard Evaluation (Single Checkpoint):**
```bash
!python test.py --checkpoints-path runs/hadoop/models/
```
- Uses the single best checkpoint
- Fast evaluation
- Good for quick performance checks

**2. Ensemble Evaluation (Multiple Checkpoints):**
```bash
!python test.py --checkpoints-path runs/hadoop/models/ --ensemble --num-ckpts 5
```
- Averages predictions from the 5 newest checkpoints
- Can improve robustness and accuracy
- Reduces the variance of single-model predictions
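
Averaging works on the class probabilities each checkpoint produces, not on the final labels. The sketch below shows the idea with plain lists; `ensemble_predict` and the probabilities are made up for illustration, and `test.py` may combine outputs differently (for example, by averaging logits).

```python
def ensemble_predict(per_model_probs):
    """Average class probabilities across models, then take the argmax."""
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    predictions = []
    for s in range(n_samples):
        n_classes = len(per_model_probs[0][s])
        mean = [
            sum(model[s][c] for model in per_model_probs) / n_models
            for c in range(n_classes)
        ]
        # The predicted class is the index with the highest mean probability.
        predictions.append(max(range(n_classes), key=mean.__getitem__))
    return predictions


# Three hypothetical checkpoints scoring two samples: [P(normal), P(anomaly)].
probs = [
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.4, 0.6], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]
print(ensemble_predict(probs))  # prints: [0, 1]
```

The second checkpoint alone would label the first sample an anomaly; the average overrules it, which is the variance reduction the ensemble provides.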

**3. Generalizability Testing (Cross-Dataset):**
```bash
# Train on hadoop, test on spark
!python test.py --checkpoints-path runs/hadoop/models/ \
    --eval-generalizability --name spark --dataset spark
```
- Evaluates how well a model trained on one dataset performs on another
- Tests how well the learned representations transfer to a new dataset without retraining
- Assesses model generalization

**4. Full Report with Visualizations:**
```bash
!python test.py --checkpoints-path runs/hadoop/models/ --plot-metrics
```
- Generates all metrics plots
- Creates confusion matrices (regular and normalized)
- Produces ROC and Precision-Recall curves
- Exports detailed CSV reports

---

**Output Files:**

When using `--plot-metrics`, the following files are generated in the checkpoints directory:

**Visualization Files:**
- `confusion_matrix.svg` / `confusion_matrix.eps` - Visual confusion matrix
- `confusion_matrix_normalized.svg` / `confusion_matrix_normalized.eps` - Normalized confusion matrix
- `roc_curve.svg` / `roc_curve.eps` - Receiver Operating Characteristic curve
- `precision_recall_curve.svg` / `precision_recall_curve.eps` - Precision-Recall curve

**Report Files:**
- `df_report.csv` - Standard classification report (precision, recall, F1-score per class)
- `df_report_imb.csv` - Imbalanced classification metrics (additional metrics for imbalanced datasets)

**Console Output:**
- Accuracy, Precision, Recall, F1-Score
- AUC-ROC, AUC-PR
- Per-class performance metrics
- Confusion matrix values

---

**Key Metrics Explained:**

- **Accuracy**: Overall correctness of predictions
- **Precision**: Fraction of predicted anomalies that are actual anomalies (higher means fewer false positives)
- **Recall**: Fraction of actual anomalies that are detected (higher means fewer false negatives)
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
- **AUC-ROC**: Area Under ROC Curve (measures separability at all thresholds)
- **AUC-PR**: Area Under Precision-Recall Curve (better for imbalanced datasets)
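
The threshold-based metrics follow directly from the four confusion-matrix counts. The example below is a self-contained sketch with made-up counts; `binary_metrics` is an illustrative helper, whereas the evaluation script presumably relies on scikit-learn for its reports.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 100 messages, 12 true anomalies: a detector that flags nothing would still
# reach 88% accuracy, which is why precision, recall and F1 matter here.
print(binary_metrics(tp=8, fp=2, fn=4, tn=86))
```

With these counts, accuracy is 0.94 while recall is only about 0.67, so a third of the anomalies are missed despite the high accuracy.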

---

**Usage Examples:**

```bash
# Basic evaluation on test set
!python test.py --checkpoints-path runs/hadoop/models/

# Evaluation with full metrics visualization
!python test.py --checkpoints-path runs/hadoop/models/ --plot-metrics

# Ensemble evaluation with top 3 checkpoints
!python test.py --checkpoints-path runs/spark/models/ --ensemble --num-ckpts 3 --plot-metrics

# Evaluate on both validation and test sets
!python test.py --checkpoints-path runs/zookeeper/models/ \
    --eval-sets valid_set test_set --plot-metrics

# Cross-dataset generalizability (trained on BGL, tested on Hadoop)
!python test.py --checkpoints-path runs/bgl/models/ \
    --eval-generalizability --name hadoop --dataset hadoop \
    --plot-metrics

# Auto-generated checkpoints path from dataset name
!python test.py --dataset hadoop --plot-metrics
```

---

**Evaluation Tips:**

- Always use `--plot-metrics` for comprehensive analysis and publication-ready figures
- Use `--ensemble` with `--num-ckpts` set to a value between 3 and 5 for more robust predictions
- For imbalanced datasets, focus on F1-score and AUC-PR rather than just accuracy
- Test generalizability by evaluating on different datasets to assess model robustness
- Compare performance on validation vs test sets to detect overfitting

---

**Multiple Datasets:**

The cell below includes commented-out examples for all supported datasets. Uncomment the line for the dataset you want to evaluate.

In [None]:
!python test.py --checkpoints-path runs/hadoop/models/ --plot-metrics
# !python test.py --checkpoints-path runs/spark/models/ --plot-metrics
# !python test.py --checkpoints-path runs/zookeeper/models/ --plot-metrics
# !python test.py --checkpoints-path runs/bgl/models/ --plot-metrics
# !python test.py --checkpoints-path runs/windows/models/ --plot-metrics
# !python test.py --checkpoints-path runs/casper-rw/models/ --plot-metrics
# !python test.py --checkpoints-path runs/dfrws-2009-jhuisi/models/ --plot-metrics
# !python test.py --checkpoints-path runs/dfrws-2009-nssal/models/ --plot-metrics
# !python test.py --checkpoints-path runs/honeynet-challenge7/models/ --plot-metrics
# !python test.py --checkpoints-path runs/honeynet-challenge5/models/ --plot-metrics

### Run TensorBoard for Hyperparameter Tuning Results

The following cells allow you to visualize the hyperparameter tuning results using TensorBoard:

1. **Load TensorBoard Extension**: The first cell loads the TensorBoard extension into the notebook
2. **Launch TensorBoard**: The second cell starts TensorBoard and points it to the tuning results directory

**Instructions**:
- Replace `tuned_folder_path` with the actual path to your hyperparameter tuning logs
- TensorBoard will display metrics, parameter comparisons, and training progress for all tuning trials
- You can compare different hyperparameter configurations to find the optimal settings

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir tuned_folder_path