Safe Denoiser for Discrete Diffusion Language Models

Amman Yusuf, Zhejun Jiang, Mijung Park. ICML 2026.

Code: The full implementation is maintained at code/ (submodule → ammanyusuf/SAD). Clone with --recurse-submodules to include it.

Overview

We introduce SAD (Safe Annealed Denoising), a training-free safety mechanism for masked diffusion language models (MDLMs) such as LLaDA, Dream, and MDLM. SAD steers token predictions away from unsafe content during sampling, with no retraining, no fine-tuning, and no classifier in the loop.

The core idea: at each denoising step within a guidance window, we compute the expected denoised embedding under the base model (E_D) and the corresponding estimate from a pre-built unsafe negation set (Ê_D,unsafe), then push the output embedding away from the unsafe estimate with strength β*:

$$E_\text{safe} = E_D + \beta^* (E_D - \hat{E}_{D,\text{unsafe}})$$

SAD is applied only during a window of diffusion timesteps (t ∈ C). Where the base sampler produces toxic outputs, SAD steers towards high-likelihood safe continuations without modifying the model.
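The update rule above can be checked numerically with a small NumPy sketch. This is an illustration under stated assumptions, not the repository's implementation: the toy vocabulary, the embedding matrix `embed`, the helper `safe_embedding`, and in particular the choice to estimate the unsafe embedding as the mean embedding of unsafe reference tokens are all hypothetical.

```python
import numpy as np

def safe_embedding(e_d, e_unsafe, beta):
    """E_safe = E_D + beta * (E_D - E_unsafe)."""
    return e_d + beta * (e_d - e_unsafe)

# Toy setup (all values hypothetical): 5-token vocabulary, 3-dim embeddings.
rng = np.random.default_rng(0)
embed = rng.normal(size=(5, 3))

# Base model's predicted token distribution at one masked position.
p_base = np.array([0.1, 0.4, 0.2, 0.2, 0.1])
e_d = p_base @ embed  # expected denoised embedding E_D

# Assumption: the unsafe estimate is the mean embedding of tokens
# taken from the unsafe reference set (here token ids 1, 3, 1).
unsafe_token_ids = np.array([1, 3, 1])
e_unsafe = embed[unsafe_token_ids].mean(axis=0)

beta = 0.5
e_safe = safe_embedding(e_d, e_unsafe, beta)

# The update scales the offset from the unsafe estimate by (1 + beta).
dist_before = np.linalg.norm(e_d - e_unsafe)
dist_after = np.linalg.norm(e_safe - e_unsafe)
```

Because E_safe − Ê_D,unsafe = (1 + β*)(E_D − Ê_D,unsafe), the guided embedding sits on the same line through the unsafe estimate, only further from it; β* = 0 recovers the base embedding.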

Case Studies

SAD blocks jailbreak attacks that fool the baseline defense (DiffuGuard), while the unguarded model complies.

Case Study 01 — WildJailbreakBench attack: SAD refuses. DiffuGuard partially complies. Vanilla fully complies.

Case Study 02 — DIJA structured mask attack: SAD deflects into an innocuous story. DiffuGuard produces borderline content. Vanilla produces explicit content.

Method

The safe denoiser operates as a logit-level hook during the reverse diffusion process:

Unsafe reference set — a pre-tokenized tensor of unsafe completions, built once per tokenizer from BeaverTails / RealToxicityPrompts / ToxiGen.
Guidance window — SAD is active only within timesteps [t_start, t_end], leaving early and late denoising unaffected.
Repellency — at each active step, token log-probabilities are shifted to reduce the likelihood of tokens that appear in the unsafe reference distribution.
Semantic gating (optional) — a lightweight gate suppresses guidance when the prompt is already on a safe trajectory, preserving generation quality on benign inputs.
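The four components above can be sketched as a single per-step function. This is a minimal illustration, not the hook shipped in the code/ submodule: the function names, the use of empirical token frequencies as the "unsafe reference distribution", the linear penalty `strength * unsafe_probs`, the inclusive window test, and the boolean `gate_open` flag are all assumptions made for the example.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def unsafe_token_distribution(unsafe_ids, vocab_size):
    """Empirical token frequencies over a pre-tokenized unsafe set (assumed form)."""
    counts = np.bincount(np.asarray(unsafe_ids).ravel(), minlength=vocab_size)
    return counts / counts.sum()

def repel_step(logits, t, unsafe_probs, t_start, t_end, strength, gate_open=True):
    """Return log-probabilities, shifted away from unsafe tokens inside the window."""
    log_probs = log_softmax(logits)
    in_window = t_start <= t <= t_end
    if not (in_window and gate_open):
        return log_probs  # outside the window or gated off: unchanged
    shifted = log_probs - strength * unsafe_probs
    return log_softmax(shifted)  # renormalise after the shift

# Toy example (hypothetical values): 6-token vocabulary, uniform base prediction.
vocab_size = 6
unsafe_ids = np.array([[2, 2, 5], [2, 5, 5]])  # stand-in for the reference tensor
unsafe_probs = unsafe_token_distribution(unsafe_ids, vocab_size)
logits = np.zeros((1, vocab_size))

kwargs = dict(unsafe_probs=unsafe_probs, t_start=3, t_end=7, strength=4.0)
inside = np.exp(repel_step(logits, t=5, **kwargs))
outside = np.exp(repel_step(logits, t=9, **kwargs))
gated = np.exp(repel_step(logits, t=5, gate_open=False, **kwargs))
```

In this toy run, tokens 2 and 5 lose probability mass at the in-window step, while the out-of-window and gated-off calls return the base distribution untouched. A real gate would be computed from the prompt rather than passed in as a flag.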

The method is model-agnostic: the same hook interface works for LLaDA, Dream, and MDLM. See src/third_party/mdlm/repellency/README.md for the full derivation and configuration reference.

Code

Full implementation, quick start, configuration reference, and Slurm pipeline docs are in the code/ submodule (ammanyusuf/SAD).

git clone --recurse-submodules https://github.com/ParkLabML/SAD.git

Citation

If you find this work useful, please cite:

@misc{yusuf2026safetyawaredenoisertextdiffusion,
      title={The Safety-Aware Denoiser for Text Diffusion Models}, 
      author={Amman Yusuf and Zhejun Jiang and Mijung Park},
      year={2026},
      eprint={2605.08116},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.08116}, 
}

[Paper] | [Code]
