nanoDiff

A minimal, clean, hackable implementation of a state-of-the-art diffusion language model — built to learn, understand, train, and improve dLLMs.

Think of it as nanoGPT / nanochat, but for diffusion language models instead of autoregressive ones. It distills the LLaDA recipe (the simplest formulation that has scaled to 8B–100B) down to a small modular package you can read in an afternoon.

Based on LLaDA — Large Language Diffusion Models (Nie et al. 2025, arXiv:2502.09992). The broader lineage (D3PM → LLaDA 2.0) is in References.

The idea in 60 seconds

An autoregressive LLM writes text left-to-right: p(x) = Π_i p(x_i | x_<i), where i indexes token positions.

A masked diffusion LM instead learns to un-corrupt text:

Forward process (nanodiff/diffusion.py): pick a mask ratio t ~ U(0,1), then replace each token independently with a [MASK] token with probability t. That's the entire "noising" process — no Gaussians, no latents.
Model (nanodiff/model.py): a LLaMA-style transformer (RMSNorm, SwiGLU, RoPE) with one change — attention is bidirectional, so it can use right-context to fill in masks.
Loss (nanodiff/diffusion.py): cross-entropy on the masked positions only, weighted by 1/t. That weight is what makes the loss a real upper bound on the negative log-likelihood (not just a heuristic).
Sampling (nanodiff/sampler.py): start fully masked, repeatedly predict all tokens but only commit the most confident ones, re-mask the rest, iterate. A block_length knob smoothly interpolates between pure-diffusion and autoregressive decoding.
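
The forward process and loss described above can be sketched in plain Python. This is a toy, list-based illustration, not the code in nanodiff/diffusion.py: the `MASK` id, the `eps` clamp on t, and the normalisation by sequence length are assumptions made for this sketch.

```python
import math
import random

MASK = -1  # hypothetical mask id, chosen so it cannot collide with a real token


def forward_process(x0, mask_id, rng, eps=1e-3):
    """Sample a mask ratio t, then mask each token independently with prob. t."""
    # Keep t away from 0 so the 1/t weight stays finite (eps is an assumption).
    t = eps + (1.0 - eps) * rng.random()
    mask = [rng.random() < t for _ in x0]
    x_t = [mask_id if m else tok for tok, m in zip(x0, mask)]
    return x_t, mask, t


def diffusion_loss(log_probs, x0, mask, t):
    """1/t-weighted cross-entropy over masked positions only.

    log_probs[i][v] is the model's log-probability of token v at position i.
    """
    total = 0.0
    for i, (target, is_masked) in enumerate(zip(x0, mask)):
        if is_masked:  # unmasked positions contribute nothing
            total -= log_probs[i][target]
    return total / (t * len(x0))


rng = random.Random(0)
x0 = [3, 1, 2, 0, 1, 3]
x_t, mask, t = forward_process(x0, MASK, rng)
uniform = [[math.log(0.25)] * 4 for _ in x0]  # a "model" that knows nothing
print(x_t, round(t, 3), round(diffusion_loss(uniform, x0, mask, t), 3))
```

With a uniform predictor over four tokens, every masked position costs log 4, so the loss reduces to (number of masked tokens) × log 4 / (t × length), which makes a convenient sanity check.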

Why care: parallel generation, bidirectional context, native infilling, and a tunable quality↔speed dial.

        t=1  ████████████████████   (all [MASK])
             ██ the ███ of ████ is
             ██ the meaning of life is
        t=0  the meaning of life is   (clean text)
                ▲ each step: predict all, commit the confident ones, repeat
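
The loop in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in nanodiff/sampler.py: `toy_predict` is a hypothetical stand-in for a trained model, the confidences are made up, and `steps` is interpreted here as the number of refinement steps per block.

```python
import math

MASK = -1  # hypothetical mask id for this toy example
TARGET = "the meaning of life is".split()


def toy_predict(x):
    # Stand-in for a trained model: always proposes the right word, with a
    # made-up confidence that favours shorter words.
    return [(word, 1.0 / len(word)) for word in TARGET]


def sample(predict, length, steps, block_length=None):
    """Confidence-based iterative unmasking, processed block by block.

    predict(x) must return one (token, confidence) pair per position.
    """
    block_length = block_length or length
    x = [MASK] * length
    for start in range(0, length, block_length):
        block = range(start, min(start + block_length, length))
        # Commit roughly the same number of tokens on every step.
        per_step = math.ceil(len(block) / steps)
        while any(x[i] == MASK for i in block):
            preds = predict(x)  # predict every position in one pass
            masked = [i for i in block if x[i] == MASK]
            # Commit the most confident predictions; the rest stay masked.
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:per_step]:
                x[i] = preds[i][0]
    return x


print(sample(toy_predict, len(TARGET), steps=3))                  # one block
print(sample(toy_predict, len(TARGET), steps=3, block_length=1))  # left to right
```

In this sketch, `block_length` equal to the sequence length lets the sampler fill positions in any order, while `block_length=1` forces strictly left-to-right commitment, which is the autoregressive end of the dial.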

Quickstart

uv sync                       # create .venv, install deps + the nanodiff package
source .venv/bin/activate     # (or prefix the commands below with `uv run`)
python tests/smoke_test.py    # optional: verify the core stack works (~2 min, CPU)

# 1. tokenize a pretraining corpus (streams FineWeb-Edu)
python scripts/prepare_data.py --out-dir data/fineweb_edu --num-tokens 2_000_000_000

# 2. pretrain a base model (single GPU)
python pretrain/train.py --config pretrain/configs/50m.py
#    ...or multi-GPU:
torchrun --standalone --nproc_per_node=8 pretrain/train.py --config pretrain/configs/50m.py

# 3. sample / evaluate
python sample.py --ckpt checkpoints/50m/ckpt.pt --prompt "The meaning of life is"
python eval.py --ckpt checkpoints/50m/ckpt.pt --iters 500

# 4. (optional) instruction-tune the base on Alpaca-cleaned
python scripts/prepare_sft_data.py --out-dir data/alpaca_sft
python sft/train.py --config sft/configs/50m_alpaca.py

Scaling

Scaling is a one-file change — copy a config and edit the model/optimizer fields:

# pretrain/configs/350m.py
from nanodiff.config import Config
config = Config(name="nanodiff-350m", n_layer=16, n_embd=1280, n_head=20,
                batch_size=16, grad_accum_steps=16, out_dir="checkpoints/350m")

Everything reads from the Config dataclass, so model code never changes.
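
Before launching a run with a new config, a quick back-of-the-envelope size check can catch typos. The helper below is hypothetical and not part of nanodiff: it assumes roughly 12·d² weights per transformer block (4·d² for attention plus about 8·d² for a SwiGLU MLP with an 8/3 expansion), a single tied embedding matrix, and a padded vocabulary of 50304. The real model may differ on all three.

```python
def describe(n_layer, n_embd, n_head, vocab_size=50304):
    """Rough size estimate for a transformer config (assumptions in comments)."""
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    head_dim = n_embd // n_head
    # Assumed ~12 * d^2 weights per block: 4*d^2 attention + ~8*d^2 MLP.
    block_params = 12 * n_layer * n_embd ** 2
    # Assumed one embedding matrix shared with the output head.
    embed_params = vocab_size * n_embd
    return {"head_dim": head_dim, "params": block_params + embed_params}


info = describe(n_layer=16, n_embd=1280, n_head=20)
print(info["head_dim"], f'{info["params"] / 1e6:.0f}M parameters (estimate)')
```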

Pretrained models

Two pretrained checkpoints are on the Hugging Face Hub:

| Model | What it is |
| --- | --- |
| Sebasdi/nanodiff-50m-base | the 50M base, pretrained on ~2B tokens of FineWeb-Edu (val perplexity ~50) |
| Sebasdi/nanodiff-50m-sft-alpaca | the base, instruction-tuned on Alpaca-cleaned (~51k examples) |

# base model — continues text, document-style
hf download Sebasdi/nanodiff-50m-base nanodiff-50m-base.pt --local-dir checkpoints/
python chat.py --ckpt checkpoints/nanodiff-50m-base.pt

# SFT model — follows instructions (note the --sft flag)
hf download Sebasdi/nanodiff-50m-sft-alpaca nanodiff-50m-sft-alpaca.pt --local-dir checkpoints/
python chat.py --ckpt checkpoints/nanodiff-50m-sft-alpaca.pt --sft

⚠️ Set your expectations. These are 50M-parameter models trained on ~2B tokens — on the order of 1/100th the data a model like GPT-2 saw. They are learning artifacts, not usable assistants:

The base model continues text — prompt it document-style ("The history of Rome is"), not question-style.

The SFT model follows instructions (chat.py --sft), but at 50M params it confabulates freely — fluent English, unreliable facts. SFT taught it to answer, not to know.

Both need the repetition penalty. Small diffusion LMs collapse into repetition loops under the default sampler; chat.py and sample.py enable a frequency repetition penalty (--rep-penalty 3.0) by default.
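
To make the idea of a frequency penalty concrete, here is one common subtractive form: each token's logit is lowered in proportion to how often it has already been committed. This is an illustrative assumption. The exact formula behind `--rep-penalty` in nanoDiff may differ, and `apply_frequency_penalty` is a hypothetical name.

```python
from collections import Counter


def apply_frequency_penalty(logits, generated, penalty=3.0):
    """Subtract penalty * count from the logit of every already-used token."""
    counts = Counter(generated)
    return [logit - penalty * counts.get(tok, 0) for tok, logit in enumerate(logits)]


logits = [2.0, 1.0, 0.5]   # toy vocabulary of three tokens
generated = [0, 0, 2]      # token 0 was used twice, token 2 once
print(apply_frequency_penalty(logits, generated))
```

Token 0 starts as the favourite, but after two uses its logit is pushed well below token 1, which is how the penalty breaks a repetition loop.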

The next step in this learning project is to scale up, with more training data and larger models; these small-scale runs were calibration for that.

How the training step works

The entire learning signal, from pretrain/train.py:

x0           = train_data.get_batch(...)               # clean tokens  (B, T)
x_t, mask, t = forward_process(x0, mask_token_id)      # corrupt them
logits       = model(x_t)                              # predict every token
loss         = diffusion_loss(logits, x0, mask, t)     # 1/t-weighted CE on masks
loss.backward()

That's it. No noise schedule, no timestep embedding (we use LLaDA's time-free parameterization — see the comment at the top of model.py), no ELBO bookkeeping.
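
For reference, the objective that the snippet above computes can be written out as in the LLaDA paper. The notation is chosen here for illustration: T is the sequence length (matching the `(B, T)` shape in the snippet), x_t is the corrupted sequence at mask ratio t, and the indicator selects masked positions.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{t \sim U(0,1),\; x_0,\; x_t}
    \left[ \frac{1}{t} \sum_{i=1}^{T}
      \mathbf{1}\!\left[x_t^{i} = \texttt{[MASK]}\right]
      \log p_\theta\!\left(x_0^{i} \mid x_t\right) \right],
\qquad
-\,\mathbb{E}_{x_0}\!\left[\log p_\theta(x_0)\right] \;\le\; \mathcal{L}(\theta).
```

The inequality on the right is the upper bound on the negative log-likelihood that the 1/t weight provides.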

Training customization

Dataset — fully swappable. The pipeline only ever sees a flat uint16 token array on disk, so it is dataset-agnostic. Either point prepare_data.py at any Hugging Face text dataset (it just needs a "text" field):

python scripts/prepare_data.py --dataset <hf-name> --subset <config> --out-dir data/mine

or produce your own train.bin / val.bin (any uint16 token dump) and set data_dir in your config — the model never knows the difference.
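
A flat uint16 token dump can be produced with the standard library alone. The sketch below is an illustration, not the project's own tooling: `write_token_file` and `read_token_file` are hypothetical helpers, the token ids are made up, and the 90/10 train/val split is an arbitrary choice. It writes native-endian 16-bit values.

```python
import os
import tempfile
from array import array


def write_token_file(path, token_ids):
    """Dump token ids to disk as a flat array of unsigned 16-bit integers."""
    buf = array("H", token_ids)  # raises OverflowError for ids outside 0..65535
    if buf.itemsize != 2:
        raise RuntimeError("unsigned short is not 16 bits on this platform")
    with open(path, "wb") as f:
        buf.tofile(f)


def read_token_file(path):
    """Read a flat uint16 token dump back into a list of ints."""
    buf = array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    return list(buf)


tokens = [464, 3616, 286, 1204, 318, 50256] * 10  # made-up ids for illustration
split = int(0.9 * len(tokens))                     # arbitrary 90/10 split
out_dir = tempfile.mkdtemp()
write_token_file(os.path.join(out_dir, "train.bin"), tokens[:split])
write_token_file(os.path.join(out_dir, "val.bin"), tokens[split:])
print(out_dir, len(read_token_file(os.path.join(out_dir, "train.bin"))))
```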

Tokenizer — coupled, but in known places. The GPT-2 BPE is wired in as the default working path. Swapping it means updating these spots:

| File(s) | What to change |
| --- | --- |
| `scripts/prepare_data.py`, `sample.py`, `pretrain/train.py` | `tiktoken.get_encoding("gpt2")` |
| `nanodiff/config.py` | `vocab_size`, `mask_token_id` (= last real id + 1, then pad) |
| `scripts/prepare_data.py`, `sample.py`, `pretrain/train.py` | `EOT`, the document-separator id |
| `nanodiff/data.py` | `uint16` dtype caps the vocab at 65536; use `uint32` above that |
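
The arithmetic behind those settings can be sketched as below. `tokenizer_constants` is a hypothetical helper, and padding to a multiple of 64 is an assumption made for this example. The GPT-2 figure of 50257 tokens (ids 0 to 50256) is the tokenizer's real size.

```python
def tokenizer_constants(n_real_tokens, pad_to=64):
    """Derive mask_token_id, padded vocab_size and on-disk dtype for a tokenizer."""
    mask_token_id = n_real_tokens      # last real id + 1
    needed = mask_token_id + 1         # real tokens plus the [MASK] token
    vocab_size = -(-needed // pad_to) * pad_to  # round up to a multiple of pad_to
    # Only real token ids are stored on disk, so uint16 works up to id 65535.
    dtype = "uint16" if n_real_tokens <= 65536 else "uint32"
    return {"mask_token_id": mask_token_id, "vocab_size": vocab_size, "dtype": dtype}


print(tokenizer_constants(50257))    # GPT-2 BPE
print(tokenizer_constants(100_000))  # a hypothetical larger tokenizer
```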

Model size — a one-file config change. See Scaling above.

References

The recipe nanoDiff implements is LLaDA; here is the lineage:

| Paper | Year | arXiv |
| --- | --- | --- |
| D3PM: Structured Denoising Diffusion in Discrete State-Spaces | 2021 | 2107.03006 |
| SEDD: Discrete Diffusion by Estimating Data-Distribution Ratios | 2024 | 2310.16834 |
| MDLM: Simple and Effective Masked Diffusion Language Models | 2024 | 2406.07524 |
| BD3-LM: Block Diffusion (interpolating AR ↔ diffusion) | 2025 | 2503.09573 |
| LLaDA: Large Language Diffusion Models (primary reference) | 2025 | 2502.09992 |
| Dream 7B: Diffusion Large Language Models | 2025 | 2508.15487 |
| LLaDA 2.0: Scaling Diffusion Language Models to 100B | 2025 | 2512.15745 |
| A Survey on Diffusion Language Models | 2025 | 2508.10875 |
