Minimal diffusion language modeling toolkit inspired by ZHZisZZ/dllm.
This repository keeps the same big-picture layout as the reference project:
- `dllm/core/schedulers`: diffusion masking schedules
- `dllm/core/samplers`: iterative denoising samplers
- `dllm/core/trainers`: masked diffusion training loop
- `dllm/pipelines/toy`: a small bidirectional transformer language model
- `dllm/data`: toy corpora and dataset helpers
- `examples/toy`: runnable train and sample entry points
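For intuition, the samplers above denoise iteratively: start from a partially masked sequence and commit the model's most confident predictions a few positions at a time. A minimal pure-Python sketch under assumed names (`denoise`, `predict`, and the list-of-characters representation are all hypothetical; the real sampler operates on tensors of token ids):

```python
import random

MASK = "<mask>"

def denoise(tokens, predict, steps=4):
    """Iteratively fill masked positions, committing the most
    confident predictions first (hypothetical API, for intuition)."""
    tokens = list(tokens)
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Ask the model for a (char, confidence) guess at each masked slot.
        guesses = {i: predict(tokens, i) for i in masked}
        # Unmask a fraction of positions per step, best scores first.
        budget = max(1, len(masked) // 2)
        for i in sorted(masked, key=lambda i: -guesses[i][1])[:budget]:
            tokens[i] = guesses[i][0]
    return "".join(tokens)

# Stub "model": always proposes "f" with a random confidence.
out = denoise(list("di") + [MASK, MASK] + list("usion"),
              lambda toks, i: ("f", random.random()))
# out == "diffusion"
```

With more than one masked position, the confidence ranking decides the unmasking order, which is what distinguishes this from single-pass masked-LM decoding.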
The implementation is intentionally compact: an educational DLLM that can be trained end-to-end on a tiny character-level corpus with no dependencies beyond PyTorch.
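The training loop follows the usual masked diffusion recipe: corrupt each sequence at a randomly drawn mask ratio (the diffusion "time") and supervise only the masked positions. A pure-Python sketch of that corruption step (the function name and `-100` ignore convention are illustrative assumptions, not the repository's API):

```python
import random

def mask_for_training(ids, mask_id, rng=None):
    """Build one masked-diffusion training example (sketch):
    draw a mask ratio t, hide that fraction of tokens, and keep
    the originals as targets at the masked positions only."""
    rng = rng or random.Random(0)
    t = rng.uniform(0.05, 1.0)        # diffusion "time" / mask ratio
    noisy, targets = [], []
    for tok in ids:
        if rng.random() < t:
            noisy.append(mask_id)     # corrupted model input
            targets.append(tok)       # supervised target
        else:
            noisy.append(tok)
            targets.append(-100)      # position ignored by the loss
    return noisy, targets
```

In a real PyTorch loop, `noisy` feeds the bidirectional transformer and `targets` goes to a cross-entropy loss that skips the `-100` entries.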
Set up a virtual environment and install the package:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Train a toy diffusion LM:
```bash
python examples/toy/train.py --output artifacts/toy_dllm.pt
```

Sample from the saved checkpoint:
```bash
python examples/toy/sample.py \
    --checkpoint artifacts/toy_dllm.pt \
    --prompt "diffusion"
```

Infill masked text:
```bash
python examples/toy/sample.py \
    --checkpoint artifacts/toy_dllm.pt \
    --infill "diff<mask><mask>ion"
```

Run tests:
```bash
pytest
```

- This is not a drop-in clone of the reference repository.
- The public API and folder structure are similar on purpose so the codebase is easy to extend toward larger DLLM experiments later.
- The tokenizer is character-level and treats `<mask>` as a single special token, which keeps sampling and infilling easy to inspect.
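The core of such a tokenizer is making `<mask>` survive as one token while everything else splits into characters. A self-contained sketch (class and method names are hypothetical, not the repository's actual API):

```python
import re

class CharTokenizer:
    """Character-level tokenizer that treats "<mask>" as one token."""
    MASK = "<mask>"

    def __init__(self, corpus):
        # Mask token first, then every distinct character in the corpus.
        self.vocab = [self.MASK] + sorted(set(corpus))
        self.ids = {tok: i for i, tok in enumerate(self.vocab)}

    def encode(self, text):
        # Capturing group keeps each "<mask>" as its own piece.
        out = []
        for piece in re.split(r"(<mask>)", text):
            if piece == self.MASK:
                out.append(self.ids[piece])
            else:
                out.extend(self.ids[c] for c in piece)
        return out

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

tok = CharTokenizer("diffusion language model")
ids = tok.encode("diff<mask><mask>ion")
# decode(encode(x)) round-trips, with both masks intact
```

Because every vocabulary entry is visible in plain text, you can read intermediate denoising states directly, which is the point of keeping the tokenizer this simple.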