PUMA (Progressive Unmasking) is a simple modification to the forward process of Masked Diffusion Models (MDMs). Instead of training on randomly masked sequences, PUMA aligns the training-time masking patterns with the inference-time ones, thereby focusing training on inference-aligned masks and speeding up training.
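For intuition, the sketch below contrasts the vanilla MDM forward process, where each training example receives an independently random mask, with an inference-aligned alternative in which masks are taken from a progressive unmasking trajectory. This is a schematic illustration of the idea only, not the repository's implementation (see `progressive.py` for that); the mask token id, the per-step reveal count, and the unmasking order are placeholder assumptions.

```python
import torch

MASK_ID = 0          # placeholder mask-token id (illustrative, not the repo's value)
seq_len, T = 16, 4   # sequence length and number of inference steps (assumed divisible)

def random_mask(tokens: torch.Tensor) -> torch.Tensor:
    """Vanilla MDM forward process: pick a masking level uniformly at random
    and mask each position independently at that level."""
    ratio = torch.rand(())                      # masking level t ~ U(0, 1)
    keep = torch.rand(tokens.shape) >= ratio    # True = position stays visible
    return torch.where(keep, tokens, torch.full_like(tokens, MASK_ID))

def progressive_masks(tokens: torch.Tensor, order: torch.Tensor) -> list[torch.Tensor]:
    """Inference-aligned masks: reveal positions in the same order a sampler
    would unmask them, yielding one training input per inference step."""
    inputs, per_step = [], seq_len // T
    for step in range(T):
        revealed = order[: step * per_step]           # positions unmasked so far
        x = torch.full_like(tokens, MASK_ID)
        x[revealed] = tokens[revealed]
        inputs.append(x)
    return inputs

tokens = torch.randint(1, 100, (seq_len,))
order = torch.randperm(seq_len)                        # stand-in for the sampler's order
vanilla_input = random_mask(tokens)                    # one random mask level
puma_style_inputs = progressive_masks(tokens, order)   # a whole unmasking trajectory
```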
```bash
# Create and activate conda environment
conda env create -f environment.yml
```

The SLURM training script (see below) activates the environment automatically. To activate manually, run:

```bash
conda activate puma
```

We provide PUMA codebases for Sudoku and TinyGSM.
- Sudoku: Download `sudoku-train-data.npy` and `sudoku-test-data.npy` from here and put them in the `data/sudoku_new` folder.
- TinyGSM: Run `data/tiny_gsm.py`, which produces the files `labels.bin`, `meta.json`, and `prompt_mask.bin` for pretraining in the desired `out_dir` directory (a loading sketch follows after this list).
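The exact binary layout of these files is defined by `data/tiny_gsm.py`; the snippet below is only a rough sketch of how such `.bin` files can be inspected, assuming flat arrays of token ids and flags. The dtypes and the `out_dir` path used here are guesses, so check the generation script for the real format.

```python
import json
import numpy as np

out_dir = "data/tiny_gsm"  # wherever you pointed out_dir when running data/tiny_gsm.py

# meta.json typically records vocabulary/shape information for the tokenized data.
with open(f"{out_dir}/meta.json") as f:
    meta = json.load(f)
print(meta)

# Assumed dtypes: token ids as uint16, prompt flags as uint8 (verify against data/tiny_gsm.py).
labels = np.memmap(f"{out_dir}/labels.bin", dtype=np.uint16, mode="r")
prompt_mask = np.memmap(f"{out_dir}/prompt_mask.bin", dtype=np.uint8, mode="r")
print(labels.shape, prompt_mask.shape)
```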
Submit a job using the SLURM script:

```bash
sbatch job.sh
```

The SLURM script calls `train.py`, which handles the training loop.
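If you need to adapt the submission to your own cluster, a minimal SLURM file along these lines is usually enough; the resource requests and the `train.py` invocation below are placeholders (the provided `job.sh` defines the real command and arguments).

```bash
#!/bin/bash
#SBATCH --job-name=puma
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x-%j.out

# Activate the conda environment (you may need to source your conda setup first)
conda activate puma

# Placeholder invocation; see job.sh and train.py for the actual arguments
python train.py yaml_files/<your_config>.yaml
```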
Config files are located in yaml_files/. Edit these YAML files to adjust:
- Model architecture
- Training hyperparameters
- Dataset settings
- Logging options
We provide one config for PUMA and one for the baseline in each of three settings: Sudoku, TinyGSM (standard), and TinyGSM (block diffusion).
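As a rough illustration of how such a config might be organized (every key below is a hypothetical placeholder; the actual schema is defined by the files in `yaml_files/` and by `train.py`):

```yaml
# Hypothetical layout; consult the provided YAML files for the real keys and values.
model:
  n_layers: 12
  n_heads: 12
  hidden_size: 768
training:
  batch_size: 256
  learning_rate: 3.0e-4
  max_steps: 100000
data:
  dataset: sudoku          # or tiny_gsm
  data_dir: data/sudoku_new
logging:
  wandb_project: puma
  ckpt_dir: checkpoints/sudoku_puma
```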
Training logs and checkpoints are saved to the paths specified in your config file. The training script also logs results to wandb.
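If you want to add metrics of your own, the wandb calls involved look roughly like this; the project and metric names below are made up, and `train.py` defines the actual ones.

```python
import wandb

# Hypothetical project/metric names; the YAML config and train.py set the real ones.
run = wandb.init(project="puma", config={"dataset": "sudoku"})
for step in range(10):
    # Dummy values standing in for the real training loss / eval accuracy.
    wandb.log({"train/loss": 1.0 / (step + 1), "eval/accuracy": 0.05 * step}, step=step)
run.finish()
```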
- `train.py`: unified file that handles MDM pretraining (including vanilla MDM pretraining) and also logs evaluation accuracy.
- `sampling.py`: sampling for a given MDM (a generic sampling sketch follows after this list).
- `progressive.py`: PUMA via the batch-streaming implementation; implementation details are in Section 3.2 and Appendix B.1.
- `progressive_block.py`: PUMA implementation for block diffusion.
- `model`: our Qwen2-style attention implementation.
- `eval`: eval util functions for Sudoku / TinyGSM.
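For reference, the inner loop of MDM sampling generally looks like the sketch below: start from an all-mask sequence and iteratively commit predictions for a few positions at a time. This is a generic illustration, not the code in `sampling.py`; the model interface, the confidence-based selection rule, and the fixed per-step budget are assumptions.

```python
import torch

@torch.no_grad()
def mdm_sample(model, seq_len: int, steps: int, mask_id: int) -> torch.Tensor:
    """Generic MDM sampling sketch: unmask a fixed number of positions per step,
    committing the model's most confident predictions (an assumption; sampling.py
    may use a different schedule or selection rule). Assumes seq_len % steps == 0."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_step = seq_len // steps
    for _ in range(steps):
        logits = model(x)                              # expected shape: (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(x != mask_id, -1.0)    # only consider still-masked positions
        top = conf[0].topk(per_step).indices           # most confident masked positions
        x[0, top] = pred[0, top]                       # commit those predictions
    return x
```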
If you find this repository helpful, please consider citing our paper:
```bibtex
@misc{kim2026stoptrainingworstprogressive,
  title={{S}top {T}raining for the {W}orst: {P}rogressive {U}nmasking {A}ccelerates {M}asked {D}iffusion {T}raining},
  author={Jaeyeon Kim and Jonathan Geuter and David Alvarez-Melis and Sham Kakade and Sitan Chen},
  year={2026},
  eprint={2602.10314},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.10314},
}
```