Official implementation of "Flexible-length Text Infilling for Discrete Diffusion Models" (EMNLP 2025).
This release includes the DDOT training, sampling, and evaluation code built on top of the SEDD backbone. Pretrained checkpoints are not bundled in the repository. Configs that previously referenced private warm-start checkpoints have been normalized to use the public louaaron/sedd-medium backbone by default.
The provided environment.yml targets Linux with CUDA 11.8.
```bash
conda env create -f environment.yml
conda activate ddot
```

The main compatibility constraint is that torch and flash-attn must be installed against the same CUDA version.
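A quick way to confirm the two agree is to check the CUDA build torch reports and that flash-attn imports cleanly. This is a minimal sanity-check sketch, not part of the repository:

```python
# Minimal sanity check (not part of the repo): the CUDA build torch was
# compiled against should match the toolkit flash-attn was built with.
import torch

print(torch.version.cuda)         # expected: "11.8" for this environment
print(torch.cuda.is_available())  # True on a working GPU setup

import flash_attn                 # a CUDA/ABI mismatch usually fails here

print(flash_attn.__version__)
```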
The configs expect datasets from Hugging Face, including:
- azhang42/one-billion-words
- azhang42/one-billion-words-test
- azhang42/yelp
- azhang42/yelp-test
- codeparrot/codeparrot-clean
- azhang42/codeparrot-clean-random-test
- azhang42/codeparrot-clean-block-test
- HuggingFaceFW/fineweb-edu
Dataset caching is controlled by data.cache_dir in the config files.
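To pre-download a dataset into the same cache the configs read from, the standard datasets API works. A minimal sketch; the cache path is a placeholder that should match `data.cache_dir` in your config:

```python
# Sketch: warm the Hugging Face cache that data.cache_dir points to.
# "/path/to/cache" is a placeholder; set it to your config's data.cache_dir.
from datasets import load_dataset

ds = load_dataset("azhang42/one-billion-words", cache_dir="/path/to/cache")
print(ds)
```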
The default training entrypoint is:
```bash
torchrun --nproc_per_node=4 train.py
```

To select another configuration:

```bash
torchrun --nproc_per_node=4 train.py --config-name position-one-billion-words-block
torchrun --nproc_per_node=4 train.py --config-name yelp-block
torchrun --nproc_per_node=4 train.py --config-name code-random
```

Hydra outputs are written under exp_local/<dataset>/<date>/<time>/.
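If you need the newest run directory programmatically (for example, to pass it to `run_sample.py` below), a small sketch; the `one-billion-words-block` directory name is an assumed `<dataset>` value based on the layout above:

```python
# Hypothetical helper: pick the most recent Hydra run directory.
# Assumes the exp_local/<dataset>/<date>/<time>/ layout described above;
# "one-billion-words-block" is an assumed <dataset> value.
from pathlib import Path

runs = sorted(Path("exp_local/one-billion-words-block").glob("*/*"))
latest_run = runs[-1]  # lexicographic sort orders date/time directory names
print(latest_run)
```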
run_sample.py accepts either a run directory or a checkpoint file:
```bash
python run_sample.py \
  --model_path /path/to/run_or_checkpoint \
  --prompt "The masked sentence goes here." \
  --steps 64 \
  --lex
```

Add --perplexity to score generated samples with gpt2-large.
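Conceptually, `--perplexity` performs standard gpt2-large language-model scoring. The sketch below shows the general idea using plain transformers APIs; it is not the repository's exact implementation:

```python
# Sketch of gpt2-large perplexity scoring (standard transformers usage;
# the repo's eval code may batch, truncate, or aggregate differently).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token NLL
    return torch.exp(loss).item()

print(perplexity("The masked sentence goes here."))
```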
eval.py is designed for distributed evaluation with torchrun:
```bash
torchrun --nproc_per_node=4 --master-port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --batch_size 64 \
  --steps 64 \
  --lex
```

METEOR scoring is optional. If you have a local METEOR jar, pass it explicitly:
```bash
torchrun --nproc_per_node=4 --master-port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --meteor-jar /path/to/meteor-1.5.jar \
  --lex
```

- `scripts/train.sh` and `scripts/eval.sh` are SLURM examples and likely need cluster-specific edits.
- Several configs in the original research workflow used private intermediate checkpoints. The public defaults now point to `louaaron/sedd-medium`, so the configs are runnable without private paths, but paper-exact reproduction may still require the original checkpoints.
- `train.py` defaults to the `one-billion-words-block` config. Override it with `--config-name ...` as needed.
```bibtex
@inproceedings{zhang-etal-2025-flexible,
  title={Flexible-length Text Infilling for Discrete Diffusion Models},
  author={Zhang, Andrew and Sivakumar, Anushka and Tang, Chia-Wei and Thomas, Chris},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://aclanthology.org/2025.emnlp-main.1597/},
  doi={10.18653/v1/2025.emnlp-main.1597}
}
```

This repository builds on the public SEDD implementation and related open-source diffusion model tooling.