[Paper] | [Citation] | [Code]
Safe Denoiser for Discrete Diffusion Language Models
Amman Yusuf, Zhejun Jiang, Mijung Park
ICML 2026
Code: The full implementation is maintained at code/ (submodule → ammanyusuf/SAD). Clone with --recurse-submodules to include it.
We introduce SAD (Safe Annealed Denoising), a training-free safety mechanism for Masked Discrete Language Models (MDLMs) such as LLaDA, Dream, and MDLM. SAD steers token predictions away from unsafe content during sampling — no retraining, no fine-tuning, no classifier in the loop.
The core idea: at each denoising step within a guidance window, we compute the expected denoised embedding under the base model and under a pre-built unsafe negation set, then push the output embedding away from the unsafe direction with strength β*.
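The repulsion step can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's equation: the update rule `e + beta * (e - u)`, the use of a plain mean over the negation set, and the helper names `expected_embedding`, `mean_embedding`, and `repel` are all hypothetical and chosen only for illustration. `beta` stands in for β*.

```python
from typing import List, Sequence

Vector = List[float]


def expected_embedding(probs: Sequence[float],
                       embeddings: Sequence[Sequence[float]]) -> Vector:
    """Probability-weighted average of token embeddings: sum_v p(v) * E[v]."""
    dim = len(embeddings[0])
    out = [0.0] * dim
    for p, emb in zip(probs, embeddings):
        for i in range(dim):
            out[i] += p * emb[i]
    return out


def mean_embedding(token_ids: Sequence[int],
                   embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average embedding of the tokens in an (assumed) unsafe negation set."""
    dim = len(embeddings[0])
    out = [0.0] * dim
    for tok in token_ids:
        for i in range(dim):
            out[i] += embeddings[tok][i]
    return [x / len(token_ids) for x in out]


def repel(base: Vector, unsafe: Vector, beta: float) -> Vector:
    """Assumed update: move `base` away from `unsafe` by strength `beta`."""
    return [b + beta * (b - u) for b, u in zip(base, unsafe)]


# Toy vocabulary of three tokens with 2-d embeddings (illustrative values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
base = expected_embedding([0.5, 0.25, 0.25], embeddings)   # [0.25, 0.25]
unsafe = mean_embedding([2, 2], embeddings)                 # [-1.0, 0.0]
guided = repel(base, unsafe, beta=0.5)
print(guided)  # [0.875, 0.375]
```

With `beta = 0` the base embedding is returned unchanged; larger values move the result further from the unsafe mean.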
SAD applies during a window of diffusion timesteps (t ∈ C). The base sampler produces toxic outputs; SAD steers towards high-likelihood safe continuations without any model modification.
SAD blocks jailbreak attacks that fool the baseline defense (DiffuGuard), while the unguarded model complies.
Case Study 01 — WildJailbreakBench attack: SAD refuses. DiffuGuard partially complies. Vanilla fully complies.
Case Study 02 — DIJA structured mask attack: SAD deflects into an innocuous story. DiffuGuard produces borderline content. Vanilla produces explicit content.
The safe denoiser operates as a logit-level hook during the reverse diffusion process:
- Unsafe reference set — a pre-tokenized tensor of unsafe completions, built once per tokenizer from BeaverTails / RealToxicityPrompts / ToxiGen.
- Guidance window — SAD is active only within timesteps [t_start, t_end], leaving early and late denoising unaffected.
- Repellency — at each active step, token log-probabilities are shifted to reduce the likelihood of tokens that appear in the unsafe reference distribution.
- Semantic gating (optional) — a lightweight gate suppresses guidance when the prompt is already on a safe trajectory, preserving generation quality on benign inputs.
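The four components above can be sketched as a single hook. This is a rough illustration, not the repository's code: the function names, the token-frequency form of the unsafe reference distribution, the linear penalty `beta * gate * q`, and the treatment of `t` as an integer step index with `t_start <= t <= t_end` are all assumptions. The `gate` argument is a hypothetical scalar in [0, 1] standing in for the optional semantic gate.

```python
import math
from collections import Counter
from typing import List, Sequence


def unsafe_token_distribution(unsafe_completions: Sequence[Sequence[int]],
                              vocab_size: int) -> List[float]:
    """Empirical token frequencies over pre-tokenized unsafe completions."""
    counts = Counter(tok for seq in unsafe_completions for tok in seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("unsafe reference set is empty")
    return [counts.get(v, 0) / total for v in range(vocab_size)]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def safe_denoiser_hook(logits: Sequence[float], t: int,
                       unsafe_dist: Sequence[float],
                       t_start: int, t_end: int,
                       beta: float, gate: float = 1.0) -> List[float]:
    """Return log-probabilities, penalized only inside the guidance window."""
    log_probs = log_softmax(logits)
    if not (t_start <= t <= t_end):
        return log_probs  # outside the window: base model is untouched
    # Assumed penalty: subtract beta * gate * q(v) from each log-probability,
    # then renormalize so the result is again a valid distribution.
    shifted = [lp - beta * gate * q for lp, q in zip(log_probs, unsafe_dist)]
    return log_softmax(shifted)


# Toy usage: vocabulary of 3 tokens, token 2 dominates the unsafe set.
dist = unsafe_token_distribution([[2, 2], [2, 1]], vocab_size=3)
guided = safe_denoiser_hook([0.0, 0.0, 0.0], t=5, unsafe_dist=dist,
                            t_start=3, t_end=8, beta=2.0)
```

Setting `gate=0.0` inside the window reproduces the base log-probabilities, which is how a gate of this kind would leave benign prompts unaffected.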
The method is model-agnostic: the same hook interface works for LLaDA, Dream, and MDLM. See src/third_party/mdlm/repellency/README.md for the full derivation and configuration reference.
Full implementation, quick start, configuration reference, and Slurm pipeline docs are in the code/ submodule (ammanyusuf/SAD).
git clone --recurse-submodules https://github.com/ParkLabML/SAD.git

If you find this work useful, please cite:
@misc{yusuf2026safetyawaredenoisertextdiffusion,
title={The Safety-Aware Denoiser for Text Diffusion Models},
author={Amman Yusuf and Zhejun Jiang and Mijung Park},
year={2026},
eprint={2605.08116},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.08116},
}