This repository contains the code, configurations, and scripts for reproducing the experiments in our paper.
Note: models and data are hosted on HuggingFace under `decaf-usenix/` (anonymized) due to their size: `Decaf-Gen-{1.3b, 6.7b, 22b}`, two 32B rerankers, the ExeBench test sets, and the Juliet vulnerability-detection dataset. The provided download script fetches whichever subset you need (the full set is over 200 GB).
We recommend opening the repository as a Dev Container in VS Code.
- Edit `.devcontainer/devcontainer.json` to mount your storage directory on the host machine.
- After building the container, run `.devcontainer/conda_install.sh` to install all dependencies.
- Create an `HF_TOKEN` on HuggingFace and add it both to the environment and to the project's `.dotenv` file to authorize HuggingFace access, as sketched below.
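A minimal sketch of the token setup, assuming the `.dotenv` file uses plain `KEY=VALUE` lines (the token value below is a placeholder):

```bash
# Export the token for the current shell session (placeholder value).
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Persist it in the project's .dotenv file (assumed KEY=VALUE format).
echo "HF_TOKEN=${HF_TOKEN}" >> .dotenv
```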
| Script | Description |
|---|---|
| `scripts/download.sh` | Download models and data from HuggingFace |
| `scripts/inference.py` | LLM inference (via vLLM) to sample decompilations from the LLMs |
| `scripts/evaluator.py` | Evaluation of LLM-generated decompilations: compilation, execution, edit distance, etc. |
| `scripts/eval_rerank.py` | Reranking using the different reranking methodologies |
| `src/utils/exebench.py` | Core utility for compiling, executing, disassembling, and evaluating ExeBench examples |
| `src/training/trainer.py` | Supervised fine-tuning (SFT) training |
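These scripts naturally chain: `inference.py` samples candidate decompilations, `evaluator.py` scores them, and `eval_rerank.py` selects among them. A hedged sketch of one such chain; the `--config` flag and config name are illustrative assumptions, and the shipped drivers under `scripts/merged_test_set_experiments/` show the real invocations:

```bash
# Illustrative chain only; flag and config names are assumptions,
# see the shipped driver scripts for the real invocations.
python scripts/inference.py   --config configs/test_experiments/example.yaml  # sample decompilations (vLLM)
python scripts/evaluator.py   --config configs/test_experiments/example.yaml  # compile, execute, and score them
python scripts/eval_rerank.py --config configs/test_experiments/example.yaml  # rerank the scored candidates
```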
The `scripts/merged_test_set_experiments/` directory contains shell scripts for running the merged-test-set experiments (inference, evaluation, reranking) across the different models.
For the Juliet vulnerability-recovery experiment:
```bash
# End-to-end (GPUs needed): Ghidra prepare -> infer -> evaluate -> rerank -> analysis
bash scripts/juliet/run_juliet_pipeline.sh

# Re-run just the analysis side (no GPUs) against the shipped reranked_results.jsonl
bash scripts/juliet/run_juliet_pipeline.sh analyze
```

Individual stages are also addressable: `prepare`, `infer`, `evaluate`, `rerank`, `populate`, `codeql`, `summarize`. The `analyze` shortcut composes `populate -> codeql -> summarize`.
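Each stage can also be run on its own by passing its name, just as with `analyze`:

```bash
# Run single stages of the Juliet pipeline (stage names from the list above).
bash scripts/juliet/run_juliet_pipeline.sh prepare   # Ghidra preparation only
bash scripts/juliet/run_juliet_pipeline.sh rerank    # reranking only
```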
| Path | Description |
|---|---|
| `configs/test_experiments/` | Inference / evaluation / reranking configs (merged test sets) |
| `configs/decaf_batch_juliet_ceiling_O2_funceval.yaml` | Juliet O2 batch pipeline config (used by `run_juliet_full_pipeline.sh`) |
To verify the environment setup, run:
```bash
pytest tests/ --ignore=tests/models/ -v
```

This validates core functionality (compilation, execution, distance metrics) without requiring GPU resources.
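While debugging, pytest's standard `-k` filter can narrow the run; the keyword below is illustrative and should be matched to the actual test names:

```bash
# Run only tests whose names match "distance" (illustrative keyword).
pytest tests/ --ignore=tests/models/ -k distance -v
```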
- Download models and data:

  ```bash
  ./scripts/download.sh --all
  ```

  This fetches:

  - `Decaf-Gen-1.3b` / `-6.7b` / `-22b`: LLM generators at three scales
  - `Decaf-ReRanker-32b-stripped` / `-unstripped`: neural rerankers
  - `Decaf-Test-Sets`: Real and Synth merged-test-set evaluation JSONL files
  - `Decaf-Juliet-Funceval`: Juliet binaries, manifests, alignments, and a 3.2 GB `reranked_results.jsonl` so the analysis pipeline can run without GPUs

  You can also fetch selectively:

  ```bash
  ./scripts/download.sh --models-only   # all generators + rerankers
  ./scripts/download.sh --data-only     # test sets + Juliet
  ./scripts/download.sh --juliet-only   # just the Juliet dataset (~707 MB)
  ```
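  As a quick sanity check after downloading, compare on-disk sizes against the figures above; the paths here are assumptions, so adjust them to wherever `download.sh` places its artifacts:

  ```bash
  # Assumed layout; adjust paths to match where download.sh writes.
  du -sh models/ data/                        # full set is over 200 GB; Juliet alone ~707 MB
  ls -lh data/juliet/reranked_results.jsonl   # expected around 3.2 GB
  ```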
- Merged-test-set experiments: run the drivers under `scripts/merged_test_set_experiments/` or the configs under `configs/test_experiments/ours_base_{1.3b,6.7b}/`:

  ```bash
  bash scripts/merged_test_set_experiments/ours_base_v2_n32_inference.sh
  bash scripts/merged_test_set_experiments/ours_base_v2_n32_eval.sh
  bash scripts/merged_test_set_experiments/ours_base_v2_n32_rerank_comprehensive.sh
  ```
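  The three drivers appear intended to run in that order; a small sketch that chains them using the script names above:

  ```bash
  # Inference, then evaluation, then comprehensive reranking.
  for stage in inference eval rerank_comprehensive; do
    bash "scripts/merged_test_set_experiments/ours_base_v2_n32_${stage}.sh"
  done
  ```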
- Juliet vulnerability-recovery experiment: after `download.sh --juliet-only`, run either:

  ```bash
  # Analysis-only (no GPU needed): uses the shipped reranked_results.jsonl
  bash scripts/juliet/run_juliet_analysis_pipeline.sh

  # Full end-to-end (GPUs needed): re-runs inference / evaluate / rerank
  bash scripts/juliet/run_juliet_full_pipeline.sh
  ```