Official implementation of "Flexible-length Text Infilling for Discrete Diffusion Models" (EMNLP 2025).
This release includes the DDOT training, sampling, and evaluation code built on top of the SEDD backbone. Pretrained checkpoints are not bundled in the repository. Configs that previously referenced private warm-start checkpoints have been normalized to use the public louaaron/sedd-medium backbone by default.
The provided environment.yml targets Linux with CUDA 11.8.
```bash
conda env create -f environment.yml
conda activate ddot
```

The main compatibility constraint is that torch and flash-attn must be installed against the same CUDA version.
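A quick way to confirm the two agree is to check the CUDA build torch reports and that flash-attn imports cleanly. This is a minimal sanity-check sketch, not part of the repository:

```python
# Minimal sanity check (not part of the repo): the CUDA build torch was
# compiled against should match the toolkit flash-attn was built with.
import torch

print(torch.version.cuda)         # expected: "11.8" for this environment
print(torch.cuda.is_available())  # True on a working GPU setup

import flash_attn                 # a CUDA/ABI mismatch usually fails here

print(flash_attn.__version__)
```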
The configs expect datasets from Hugging Face, including:
- azhang42/one-billion-words
- azhang42/one-billion-words-test
- azhang42/yelp
- azhang42/yelp-test
- codeparrot/codeparrot-clean
- azhang42/codeparrot-clean-random-test
- azhang42/codeparrot-clean-block-test
- HuggingFaceFW/fineweb-edu
Dataset caching is controlled by data.cache_dir in the config files.
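To pre-download a dataset into the same cache the configs read from, the standard datasets API works. A minimal sketch; the cache path is a placeholder that should match `data.cache_dir` in your config:

```python
# Sketch: warm the Hugging Face cache that data.cache_dir points to.
# "/path/to/cache" is a placeholder; set it to your config's data.cache_dir.
from datasets import load_dataset

ds = load_dataset("azhang42/one-billion-words", cache_dir="/path/to/cache")
print(ds)
```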
The default training entrypoint is:
```bash
torchrun --nproc_per_node=4 train.py
```

To select another configuration:

```bash
torchrun --nproc_per_node=4 train.py --config-name position-one-billion-words-block
torchrun --nproc_per_node=4 train.py --config-name yelp-block
torchrun --nproc_per_node=4 train.py --config-name code-random
```

Hydra outputs are written under exp_local/<dataset>/<date>/<time>/.
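If you need the newest run directory programmatically (for example, to pass it to `run_sample.py` below), a small sketch; the `one-billion-words-block` directory name is an assumed `<dataset>` value based on the layout above:

```python
# Hypothetical helper: pick the most recent Hydra run directory.
# Assumes the exp_local/<dataset>/<date>/<time>/ layout described above;
# "one-billion-words-block" is an assumed <dataset> value.
from pathlib import Path

runs = sorted(Path("exp_local/one-billion-words-block").glob("*/*"))
latest_run = runs[-1]  # lexicographic sort orders date/time directory names
print(latest_run)
```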
run_sample.py accepts either a run directory or a checkpoint file:
```bash
python run_sample.py \
  --model_path /path/to/run_or_checkpoint \
  --prompt "The masked sentence goes here." \
  --steps 64 \
  --lex
```

Add --perplexity to score generated samples with gpt2-large.
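Conceptually, `--perplexity` performs standard gpt2-large language-model scoring. The sketch below shows the general idea using plain transformers APIs; it is not the repository's exact implementation:

```python
# Sketch of gpt2-large perplexity scoring (standard transformers usage;
# the repo's eval code may batch, truncate, or aggregate differently).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token NLL
    return torch.exp(loss).item()

print(perplexity("The masked sentence goes here."))
```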
eval.py is designed for distributed evaluation with torchrun:
```bash
torchrun --nproc_per_node=4 --master-port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --batch_size 64 \
  --steps 64 \
  --lex
```

METEOR scoring is optional. If you have a local METEOR jar, pass it explicitly:
```bash
torchrun --nproc_per_node=4 --master-port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --meteor-jar /path/to/meteor-1.5.jar \
  --lex
```

- `scripts/train.sh` and `scripts/eval.sh` are SLURM examples and likely need cluster-specific edits.
- Several configs in the original research workflow used private intermediate checkpoints. The public defaults now point to `louaaron/sedd-medium`, so the configs are runnable without private paths, but paper-exact reproduction may still require the original checkpoints.
- `train.py` defaults to the `one-billion-words-block` config. Override it with `--config-name ...` as needed.
```bibtex
@inproceedings{zhang-etal-2025-flexible,
  title={Flexible-length Text Infilling for Discrete Diffusion Models},
  author={Zhang, Andrew and Sivakumar, Anushka and Tang, Chia-Wei and Thomas, Chris},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://aclanthology.org/2025.emnlp-main.1597/},
  doi={10.18653/v1/2025.emnlp-main.1597}
}
```

This repository builds on the public SEDD implementation and related open-source diffusion model tooling.