GAVEL: Towards Rule-Based Safety Through Activation Monitoring

GAVEL Banner

GAVEL is a framework for rule-based activation safety in large language models. It decomposes model behavior into Cognitive Elements (CEs)—fine-grained, interpretable factors like "making a threat" or "payment processing"—and enforces safeguards through logical rules over these elements. This enables modular, auditable, and adaptive safety without retraining models or detectors.

📄 Paper: GAVEL: Towards Rule-Based Safety Through Activation Monitoring (Accepted to ICLR 2026)

🖥️ GAVEL Studio: A no-code interactive application for extending, experimenting with, and deploying GAVEL safety systems — define new CEs and rules, train detectors, and evaluate end-to-end.

✨ Key Features

  • Cognitive Elements: Decompose LLM behavior into interpretable, reusable factors
  • Rule-Based Detection: Compose CEs into logical rules for precise safety enforcement
  • No Retraining Required: Update safeguards by editing rules, not retraining models
  • Modular & Auditable: Transparent rule definitions support accountability and debugging
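
The composition idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the gavel API: the CE names, scores, thresholds, and the rule itself are hypothetical examples.

```python
# Toy sketch of rule-based detection over Cognitive Elements (CEs).
# CE names, scores, thresholds, and the rule are hypothetical examples,
# not part of the actual gavel package.

def fires(scores, thresholds):
    """Binarize per-CE probe scores against calibrated thresholds."""
    return {ce: scores[ce] >= thresholds[ce] for ce in scores}

def unsafe_payment_threat(active):
    # Rule: flag dialogues that combine a threat with payment processing.
    return active["making_a_threat"] and active["payment_processing"]

thresholds = {"making_a_threat": 0.7, "payment_processing": 0.5}
scores = {"making_a_threat": 0.91, "payment_processing": 0.62}

active = fires(scores, thresholds)
print(unsafe_payment_threat(active))  # prints True for this example
```

Updating a safeguard then amounts to editing the rule (or a threshold), with no retraining of the underlying probes.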

🚀 Quick Start

Installation

  1. Clone the repository:

     ```bash
     git clone https://github.com/Offensive-AI-Lab/gavel.git
     cd gavel
     ```

  2. Create an environment and install dependencies:

     Option A: Using uv (recommended):

     ```bash
     uv sync
     ```

     Option B: Using venv:

     ```bash
     python -m venv .venv
     source .venv/bin/activate
     pip install -r requirements.txt
     ```

  3. Install the package in editable mode:

     ```bash
     pip install -e .
     ```

  4. Configure the environment: copy config.template.json to config.json and update it with your model path and dataset locations.

🏃 End-to-End Execution

The easiest way to run the full pipeline is the provided shell script:

```bash
./run.sh
```

This script sequentially executes:

  1. Training: Trains RNN probes on Cognitive Elements (scripts/train.py)
  2. Unified Evaluation: Runs calibration and evaluation in a single pass (scripts/evaluate.py)

🛠️ Components & Usage

All scripts support a --verbose flag for detailed debug output (e.g., python scripts/evaluate.py --verbose).

1. Training (scripts/train.py)

Trains lightweight RNN probes on top of a frozen LLM to detect Cognitive Elements.

```bash
python scripts/train.py --config config.json
```

  • Input: Training dataset defined in config.json
  • Output: Trained model checkpoint at {base_dir}/model/trained_model_rnn.pth

2. Unified Evaluation (scripts/evaluate.py)

Runs the full evaluation pipeline in-memory:

  • Extracts activations from the evaluation dataset
  • Calibrates thresholds using Youden's J statistic
  • Evaluates performance against the ruleset
  • Computes metrics (TPR, FPR, AUC)

```bash
python scripts/evaluate.py --config config.json
```

Flags:

  • --force-recalibrate: Force recalibration even if thresholds exist
  • --skip-calibration: Skip calibration and use existing thresholds
  • --calibration-only: Run only the calibration phase and exit
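
The calibration step can be illustrated with a minimal pure-Python sketch of threshold selection via Youden's J statistic (J = TPR − FPR). The scores and labels below are made up; the actual implementation lives in the gavel package.

```python
# Sketch of per-CE threshold calibration via Youden's J statistic
# (J = TPR - FPR). Illustrative only; data below is made up.

def youden_threshold(scores, labels):
    """Pick the score threshold that maximizes TPR - FPR."""
    pos = sum(labels)                # number of positive examples
    neg = len(labels) - pos          # number of negative examples
    best_t, best_j = 0.0, float("-inf")
    for t in sorted(set(scores)):    # candidate thresholds = observed scores
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg      # Youden's J at this threshold
        if j > best_j:
            best_t, best_j = t, j
    return best_t

scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0,   0,   1,    1,   1,   0]
print(youden_threshold(scores, labels))  # prints 0.35
```

Each CE probe gets its own calibrated threshold, which the rules then consume as boolean CE activations.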

📓 Notebooks

We provide Jupyter notebooks in the notebooks/ directory for interactive exploration:

| Notebook | Description |
| --- | --- |
| train_classifier.ipynb | End-to-end training of the RNN classifier on Cognitive Elements. Covers data preparation, representation extraction, and model training with W&B logging. |
| evaluate_classifier.ipynb | Complete evaluation pipeline: calibration, threshold optimization, and metrics computation. Demonstrates in-memory processing for fast iteration. |

To run the notebooks:

```bash
pip install jupyterlab
jupyter lab
```

⚙️ Configuration

GAVEL uses a JSON configuration file. See config.template.json for all options.

Key parameters:

| Parameter | Description |
| --- | --- |
| model.name_or_path | Path to the base LLM (e.g., mistralai/Mistral-7B-Instruct-v0.2) |
| model.selected_layers_range | Transformer layers to extract activations from (e.g., [13, 27]) |
| paths.base_dir | Base directory for model outputs |
| paths.train_dataset | Path to the training data |
| paths.eval_dataset | Path to the evaluation data |
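
For illustration, a config.json using these keys might look like the following. The values are hypothetical; consult config.template.json for the authoritative schema.

```json
{
  "model": {
    "name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
    "selected_layers_range": [13, 27]
  },
  "paths": {
    "base_dir": "outputs/",
    "train_dataset": "data/train.jsonl",
    "eval_dataset": "data/eval.jsonl"
  }
}
```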

💻 Hardware Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB | 24+ GB |
| RAM | 32 GB | 64 GB |
| Storage | 5 GB | 10+ GB |

Tested on: NVIDIA RTX 6000 Ada (48 GB)



📂 Project Structure

├── gavel/                  # Core package
│   ├── config.py           # Configuration management
│   ├── models/             # RNN probe definitions
│   ├── training/           # Training loops and dataloaders
│   ├── preprocessing/      # Dialogue extraction utilities
│   ├── evaluation/         # Metrics and calibration logic
│   └── utils/              # Logging and helper functions
├── scripts/                # Execution scripts
│   ├── train.py
│   ├── evaluate_unified.py
│   ├── calibrate.py
│   └── evaluate.py
├── notebooks/              # Interactive tutorials
│   ├── train_classifier.ipynb
│   └── evaluate_classifier.ipynb
├── assets/                 # Images and assets
├── rulesets/               # Safety rules definitions
├── config.json             # Global configuration
├── config.template.json    # Configuration template
├── run.sh                  # Master execution script
└── README.md

📜 Citation

If you use GAVEL in your research, please cite our paper:

```bibtex
@inproceedings{rozenfeld2026gavel,
  title={GAVEL: Towards Rule-Based Safety Through Activation Monitoring},
  author={Rozenfeld, Shir and Pankajakshan, Rahul and Zloczower, Itay and Lenga, Eyal and Gressel, Gilad and Mirsky, Yisroel},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

📄 License

This project is licensed under the Apache 2.0 license - see the LICENSE file for details.

Acknowledgment

This work was funded by the European Union and supported by ERC grant AGI-Safety (101222135). Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency; neither the European Union nor the granting authority can be held responsible for them. This work was also supported by the Israeli Ministry of Innovation, Science and Technology (grant number 1001948211).
