Distributed Training From Scratch — Roadmap Checklist

Step 0 — Single-GPU Baseline ✅

Step 1 — PyTorch DDP Golden Baseline ✅

Step 2 — Sharded Data Loading ✅

Deterministic dataset sharding per rank
Same number of steps per rank
Per-epoch seed control
Verify no sample overlap across ranks
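
The sharding rules above can be sketched in plain Python. This is a minimal illustration, not the repository's sampler: `shard_indices` and its `seed + epoch` seeding scheme are assumptions made for the example. Every rank shuffles with the same seed, drops the uneven tail, and then takes a strided slice, so shards are disjoint and equally long.

```python
import random


def shard_indices(dataset_len, rank, world_size, epoch, seed=0):
    """Return the sample indices that `rank` reads during `epoch`."""
    # The RNG depends on (seed, epoch) only, never on the rank, so every
    # rank builds the identical permutation. (Hypothetical seeding scheme.)
    rng = random.Random(seed + epoch)
    order = list(range(dataset_len))
    rng.shuffle(order)
    # Drop the tail so the length divides evenly: same step count per rank.
    usable = dataset_len - dataset_len % world_size
    # Strided slice: rank r takes positions r, r + world_size, ...
    return order[rank:usable:world_size]


world_size = 4
shards = [shard_indices(10, r, world_size, epoch=0) for r in range(world_size)]
flat = [i for shard in shards for i in shard]
```

With 10 samples and 4 ranks, each rank gets 2 indices and 2 samples are dropped for that epoch; checking `len(flat) == len(set(flat))` is one way to verify there is no overlap.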

Step 3 — Naive Data Parallelism (Correctness) ✅

Broadcast model parameters from rank 0
Backward pass on each rank
All-reduce gradients
Divide gradients by world_size
Optimizer step after synchronization
Weights match PyTorch DDP within tolerance (rtol=1e-5, atol=1e-6)
param_norm diff < 1e-6
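
A single-process sketch of the gradient synchronization and the tolerance check listed above. The ranks are simulated with nested lists and the helper names are hypothetical; in a real run the sum would come from a collective such as `torch.distributed.all_reduce`. The `allclose` helper uses the same criterion as `torch.allclose`, namely `|a - b| <= atol + rtol * |b|`.

```python
def all_reduce_sum(per_rank_grads):
    """Simulate all-reduce(SUM): every rank ends up with the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank_grads)]
    return [list(total) for _ in per_rank_grads]


def average_gradients(per_rank_grads):
    """All-reduce, then divide by world_size so each rank holds the mean."""
    world_size = len(per_rank_grads)
    summed = all_reduce_sum(per_rank_grads)
    return [[g / world_size for g in grads] for grads in summed]


def allclose(a, b, rtol=1e-5, atol=1e-6):
    """Elementwise check: |a - b| <= atol + rtol * |b|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


grads = [[1.0, 2.0], [3.0, 6.0]]  # local gradients on rank 0 and rank 1
averaged = average_gradients(grads)
```

After averaging, every rank holds `[2.0, 4.0]`, so an identical optimizer step on each rank keeps the replicas in sync.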

Step 4 — Autograd Hooks + Bucketing (Real DDP) ✅

Register backward hooks per parameter
Gradient bucketing
Async all-reduce per bucket
Overlap backward compute with communication
Single sync point before optimizer step
no_sync() context manager for gradient accumulation
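
The bucketing logic above can be sketched without any GPU or process group. This is an assumed design, not the repository's bucketer: `build_buckets`, `Bucketer`, and the `sync_enabled` flag are hypothetical names. Parameters are grouped in reverse order (backward tends to produce gradients for the last layers first), and a bucket is "launched" once all of its gradients are ready. A real implementation would start an asynchronous all-reduce at that point and wait on all handles once before the optimizer step.

```python
def build_buckets(param_sizes, cap):
    """Group parameter indices, in reverse order, into buckets of <= cap elements."""
    buckets, current, current_size = [], [], 0
    for idx in reversed(range(len(param_sizes))):
        if current and current_size + param_sizes[idx] > cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)  # an oversized parameter gets a bucket of its own
        current_size += param_sizes[idx]
    if current:
        buckets.append(current)
    return buckets


class Bucketer:
    def __init__(self, param_sizes, cap):
        self.buckets = build_buckets(param_sizes, cap)
        self.owner = {p: b for b, ps in enumerate(self.buckets) for p in ps}
        self.pending = [len(ps) for ps in self.buckets]
        self.launched = []        # bucket ids in the order they were launched
        self.sync_enabled = True  # set to False to mimic no_sync()

    def on_grad_ready(self, param_idx):
        """Would be called from the per-parameter backward hook."""
        if not self.sync_enabled:
            return  # accumulate locally, skip communication
        bucket = self.owner[param_idx]
        self.pending[bucket] -= 1
        if self.pending[bucket] == 0:
            # Real code: start an async all-reduce for this bucket here.
            self.launched.append(bucket)


bucketer = Bucketer(param_sizes=[4, 2, 2, 3], cap=4)
for param_idx in (3, 2, 1, 0):  # gradients arrive last layer first
    bucketer.on_grad_ready(param_idx)
```

Because buckets fill in roughly the order that backward produces gradients, communication for early buckets can overlap with the remaining backward compute.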

Step 5 — ZeRO Stage 1 (Optimizer State Partitioning)

Partition optimizer state (e.g. Adam momentum, variance) across ranks
Each rank holds only 1 / world_size of optimizer state per parameter
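Each rank updates only the parameter shard it owns, then all-gathers the updated parameters
Cut optimizer-state memory per rank by a factor of world_size (up to ~4× total model-state savings for mixed-precision Adam at large world_size)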
Verify training matches DDP baseline (same loss trajectory, checkpoint parity)
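
A single-process sketch of the parity check for this step, under stated assumptions: the flat parameter buffer is split into contiguous shards, gradients are already averaged, and `shard_bounds` and `zero1_step` are hypothetical helpers. Each simulated rank keeps Adam state only for its shard and updates only that shard; concatenating the shards stands in for the all-gather. Since Adam is elementwise, the result should equal an unsharded update.

```python
import math


def adam_step(params, grads, m, v, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One in-place Adam update over parallel lists of floats."""
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** step)
        v_hat = v[i] / (1 - b2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def shard_bounds(n, rank, world_size):
    """Half-open [start, end) range of the flat buffer owned by `rank`."""
    per_rank = (n + world_size - 1) // world_size
    start = min(rank * per_rank, n)
    return start, min(start + per_rank, n)


def zero1_step(params, grads, states, step, world_size):
    """`states[r]` holds only rank r's (m, v) shard of the optimizer state."""
    gathered = []
    for rank in range(world_size):
        lo, hi = shard_bounds(len(params), rank, world_size)
        shard = params[lo:hi]
        m, v = states[rank]
        adam_step(shard, grads[lo:hi], m, v, step)
        gathered.extend(shard)  # stands in for all-gather of updated shards
    return gathered


params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.1, -0.2, 0.3, -0.4, 0.5]
world_size = 2

# Baseline: one replica holding the full optimizer state.
baseline = list(params)
adam_step(baseline, grads, [0.0] * 5, [0.0] * 5, step=1)

# ZeRO-1 style: each rank allocates state only for the slice it owns.
states = []
for rank in range(world_size):
    lo, hi = shard_bounds(len(params), rank, world_size)
    states.append(([0.0] * (hi - lo), [0.0] * (hi - lo)))
sharded = zero1_step(params, grads, states, step=1, world_size=world_size)
```

Here rank 0 keeps state for 3 elements and rank 1 for 2, rather than 5 each, and `sharded` equals `baseline`.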

Step 6 — ZeRO Stage 2 (Gradient + Optimizer State Partitioning)

Partition gradients so each rank owns 1 / world_size of gradients
Reduce-scatter gradients so each rank receives only the averaged slice it owns (instead of a full all-reduce)
Combine with ZeRO-1: partition both gradients and optimizer state
Further reduce memory (gradient buffers no longer replicated)
Verify correctness vs DDP / ZeRO-1
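
A small simulation of the reduce-scatter step, assuming a flat gradient buffer whose length is already padded to a multiple of world_size. `reduce_scatter_mean` is a hypothetical helper operating on nested lists; in a real run a collective from `torch.distributed` would do this work. Concatenating the per-rank outputs gives the same values an averaged all-reduce would produce, which is one way to check against the DDP baseline.

```python
def reduce_scatter_mean(per_rank_grads):
    """Simulate reduce-scatter: rank r receives only the averaged chunk r."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world_size == 0, "pad the flat gradient buffer first"
    chunk = n // world_size
    owned = []
    for rank in range(world_size):
        lo, hi = rank * chunk, (rank + 1) * chunk
        owned.append([sum(g[i] for g in per_rank_grads) / world_size
                      for i in range(lo, hi)])
    return owned


grads = [[1.0, 2.0, 3.0, 4.0],  # rank 0's local gradients
         [3.0, 6.0, 5.0, 0.0]]  # rank 1's local gradients
owned = reduce_scatter_mean(grads)
full_mean = [sum(vals) / len(grads) for vals in zip(*grads)]
```

Each rank ends up holding 2 of the 4 averaged values, so the gradient buffer it has to keep is 1 / world_size of the full size.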

Step 7 — ZeRO Stage 3 (Full Parameter Sharding)

Partition model parameters across ranks; each rank holds 1 / world_size
All-gather parameters (or submodules) on demand before each forward and backward use
Free gathered parameters as soon as that layer's forward or backward finishes (or prefetch/cache them on a separate stream)
Combine with ZeRO-1 + ZeRO-2 for full redundancy removal
Enable training models that do not fit on a single GPU
Verify correctness and memory scaling with world size
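
The gather-and-free lifecycle above can be illustrated with a toy forward pass. Everything here is an assumption made for the sketch: `ShardedLayer`, `run_forward`, and the stand-in computation `y = x * sum(weights)` are hypothetical, and the list concatenation stands in for an all-gather. The point is that only one layer's full weights are resident at a time; a backward pass would re-gather each layer in reverse order and free it again.

```python
class ShardedLayer:
    """One layer whose flat weights are split evenly across ranks."""

    def __init__(self, weights, world_size):
        per_rank = (len(weights) + world_size - 1) // world_size
        # shards[r] is the only slice rank r keeps between uses.
        self.shards = [weights[r * per_rank:(r + 1) * per_rank]
                       for r in range(world_size)]
        self.full = None  # populated only while the layer is in use

    def gather(self):
        # Stands in for an all-gather across ranks.
        self.full = [w for shard in self.shards for w in shard]

    def free(self):
        self.full = None


def run_forward(layers, x):
    """Apply y = x * sum(weights) per layer, gathering one layer at a time."""
    peak_gathered = 0
    for layer in layers:
        layer.gather()
        gathered_now = sum(len(l.full) for l in layers if l.full is not None)
        peak_gathered = max(peak_gathered, gathered_now)
        x = x * sum(layer.full)
        layer.free()  # release before touching the next layer
    return x, peak_gathered


layers = [ShardedLayer([1.0, 1.0, 0.0, 0.0], world_size=2),
          ShardedLayer([0.5, 0.5, 1.0, 1.0], world_size=2)]
out, peak = run_forward(layers, 2.0)
```

The peak number of gathered weights is 4 (one layer) rather than 8 (the whole model), while each rank permanently stores only its 2-element shard per layer.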

Repository files

ddp
utils
.gitignore
README.md
compare_checkpoints.py
requirements.txt
test_bucketer.py
test_sampler.py
test_torch_sampler.py
train.py
train_ddp.py
train_ddp_torch.py
zero1.py
zero2.py
