DDOT

Official implementation of "Flexible-length Text Infilling for Discrete Diffusion Models" (EMNLP 2025).

This release includes the DDOT training, sampling, and evaluation code built on top of the SEDD backbone. Pretrained checkpoints are not bundled in the repository. Configs that previously referenced private warm-start checkpoints have been normalized to use the public louaaron/sedd-medium backbone by default.

Setup

The provided environment.yml targets Linux with CUDA 11.8.

conda env create -f environment.yml
conda activate ddot

The main compatibility constraint is that torch and flash-attn must be installed against the same CUDA version.

Data

The configs expect datasets from Hugging Face, including:

  • azhang42/one-billion-words
  • azhang42/one-billion-words-test
  • azhang42/yelp
  • azhang42/yelp-test
  • codeparrot/codeparrot-clean
  • azhang42/codeparrot-clean-random-test
  • azhang42/codeparrot-clean-block-test
  • HuggingFaceFW/fineweb-edu

Dataset caching is controlled by data.cache_dir in the config files.
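Since the configs are Hydra-based, cache locations can presumably also be overridden on the command line without editing the config files (the scratch path below is a placeholder, not part of the repo):

```shell
# Redirect Hugging Face dataset caching for a single run.
# /scratch/$USER/hf_cache is a hypothetical path; substitute your own.
torchrun --nproc_per_node=4 train.py data.cache_dir=/scratch/$USER/hf_cache
```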

Training

The default training entrypoint is:

torchrun --nproc_per_node=4 train.py

To select another configuration:

torchrun --nproc_per_node=4 train.py --config-name position-one-billion-words-block
torchrun --nproc_per_node=4 train.py --config-name yelp-block
torchrun --nproc_per_node=4 train.py --config-name code-random

Hydra outputs are written under exp_local/<dataset>/<date>/<time>/.

Sampling

run_sample.py accepts either a run directory or a checkpoint file:

python run_sample.py \
  --model_path /path/to/run_or_checkpoint \
  --prompt "The masked sentence goes here." \
  --steps 64 \
  --lex

Add --perplexity to score generated samples with gpt2-large.
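Perplexity is the exponential of the mean per-token negative log-likelihood; a minimal sketch of that aggregation, independent of gpt2-large (the helper name is ours, and the script's exact pooling may differ):

```python
import math

def corpus_perplexity(nll_sums: list[float], token_counts: list[int]) -> float:
    """exp(total negative log-likelihood / total tokens), pooled over samples.

    Pooling totals, rather than averaging per-sample perplexities, weights
    every token equally regardless of sample length.
    """
    return math.exp(sum(nll_sums) / sum(token_counts))
```

For example, a sample of 10 tokens each scored at NLL ln 2 yields a perplexity of 2.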

Evaluation

eval.py is designed for distributed evaluation with torchrun:

torchrun --nproc_per_node=4 --master_port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --batch_size 64 \
  --steps 64 \
  --lex

METEOR scoring is optional. If you have a local METEOR jar, pass it explicitly:

torchrun --nproc_per_node=4 --master_port=29501 eval.py \
  --model_path /path/to/run_or_checkpoint \
  --meteor-jar /path/to/meteor-1.5.jar \
  --lex

Notes

  • scripts/train.sh and scripts/eval.sh are SLURM examples and likely need cluster-specific edits.
  • Several configs in the original research workflow used private intermediate checkpoints. The public defaults now point to louaaron/sedd-medium so the configs are runnable without private paths, but paper-exact reproduction may still require the original warm-start checkpoints.
  • train.py defaults to the one-billion-words-block config. Override it with --config-name ... as needed.

Citation

@inproceedings{zhang-etal-2025-flexible,
  title={Flexible-length Text Infilling for Discrete Diffusion Models},
  author={Zhang, Andrew and Sivakumar, Anushka and Tang, Chia-Wei and Thomas, Chris},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year={2025},
  url={https://aclanthology.org/2025.emnlp-main.1597/},
  doi={10.18653/v1/2025.emnlp-main.1597}
}

Acknowledgements

This repository builds on the public SEDD implementation and related open-source diffusion model tooling.
