feat: delta weight sync (disk + nccl transports) by nanjiangwill · Pull Request #1806 · THUDM/slime

nanjiangwill · 2026-04-05T04:32:01Z

Summary

Non-colocated weight sync that ships only changed positions + values instead of every parameter. The motivating use case is training/inference disaggregation — trainer and rollout engines in different datacenters over a shared filesystem with bandwidth on the order of 100s of MB/s, where a full broadcast is infeasible but a sparse delta (~3% density, ~5 GB for a 355B model) is. Two transports share one wire layout and one receiver-side decoder:

- `disk`: writes one safetensors file per flush to `--update-weight-delta-dir`; one HTTP push per sync wakes the rollout engines to read and apply. For cross-DC training/inference disaggregation.
- `nccl`: broadcasts each per-flush bucket directly. Intra-DC validation baseline.

Receiver applies via NaN-masked overwrite — no arithmetic on either side. Lossless by construction; no drift, no periodic base syncs needed.
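
As a rough illustration of the overwrite idea, here is a dependency-free sketch on plain Python lists. It is not the PR's code: the helper names (`encode_delta`, `to_nan_masked`, `masked_overwrite`) are hypothetical, and it assumes changed positions are found by comparing bit patterns against a snapshot. A weight whose new value is itself NaN would be skipped by this toy; how the real implementation treats that case is not stated here.

```python
import math
import struct


def bits(x: float) -> bytes:
    """Bit pattern of a float, so comparison is exact (0.0 and -0.0 differ)."""
    return struct.pack("<d", x)


def encode_delta(snapshot, current):
    """Trainer side: positions whose bit pattern changed, plus the new values."""
    positions = [
        i for i, (a, b) in enumerate(zip(snapshot, current)) if bits(a) != bits(b)
    ]
    values = [current[i] for i in positions]
    return positions, values


def to_nan_masked(positions, values, numel):
    """Receiver side: scatter values into a dense buffer that is NaN elsewhere."""
    dense = [math.nan] * numel
    for pos, val in zip(positions, values):
        dense[pos] = val
    return dense


def masked_overwrite(dest, src):
    """Overwrite dest in place wherever src is not the NaN sentinel."""
    for i, val in enumerate(src):
        if not math.isnan(val):
            dest[i] = val


old = [0.5, -1.25, 3.0, 0.0, 2.0]
new = [0.5, -1.5, 3.0, 0.0, 2.125]

positions, values = encode_delta(old, new)
replica = list(old)  # the rollout engine's copy of the weights
masked_overwrite(replica, to_nan_masked(positions, values, len(replica)))
```

Because the receiver copies the new values rather than adding a difference, the replica ends up bit-identical to the trainer's weights, which is why no periodic base sync is needed.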

Inspired by arXiv:2509.19128 (selective overwrite); cross-DC disaggregation motivation from Fireworks AI — Frontier RL Is Cheaper Than You Think.

Note: Examples, docs, SGLang patch, and performance comparison will follow in a separate PR.

CLI surface

Trainer side (slime):

| flag | values | applies to |
| --- | --- | --- |
| `--update-weight-mode` | `full` / `delta` | universal; picks the strategy |
| `--update-weight-transport` | `nccl` / `disk` | delta only; per-flush carrier |
| `--update-weight-encoding` | `indices` / `deltas` / `deltas_zstd` | delta only; position encoding |
| `--update-weight-delta-dir` | path | delta + disk; shared-FS root for per-sync directories |
| `--update-weight-delta-keep-files` | bool flag | delta + disk; debug aid (skip post-apply cleanup) |
| `--custom-delta-pre-push-path` | `module.fn` | delta + disk; trainer-side hook (e.g., shared-FS commit) |
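
The table does not spell out what the position encodings contain. One plausible reading, sketched below, is that `indices` ships sorted absolute positions while `deltas` ships the gap between consecutive positions. This is an assumption, and `to_gaps` / `from_gaps` are hypothetical helpers rather than the PR's `encode_indices` / `encode_deltas`.

```python
from itertools import accumulate


def to_gaps(positions):
    """Sorted absolute positions -> first position, then gaps between neighbours."""
    return [b - a for a, b in zip([0] + positions[:-1], positions)]


def from_gaps(gaps):
    """A running sum restores the absolute positions."""
    return list(accumulate(gaps))


positions = [7, 9, 10, 4096, 4100]
gaps = to_gaps(positions)
restored = from_gaps(gaps)
```

Under this reading, gaps are mostly small integers when changed positions cluster, which would make them a better input for a general-purpose compressor such as the one implied by `deltas_zstd`.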

SGLang side (auto-mirrored via --sglang- prefix):

| flag | purpose |
| --- | --- |
| `--sglang-update-weight-delta-chunk-bytes` | byte cap per `model.load_weights` call on apply |
| `--sglang-update-weight-delta-read-workers` | max parallel I/O threads for reading delta files (disk only) |
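
To make the byte cap concrete, the sketch below groups parameters into batches so that each batch could feed one `model.load_weights` call. The function `chunk_by_bytes`, the parameter names, and the sizes are illustrative assumptions; the actual batching logic lives in the SGLang patch and may differ.

```python
def chunk_by_bytes(named_sizes, cap_bytes):
    """Group (name, nbytes) pairs into batches whose total is at most cap_bytes.

    An entry larger than the cap still gets a batch of its own.
    """
    batch, used = [], 0
    for name, nbytes in named_sizes:
        if batch and used + nbytes > cap_bytes:
            yield batch
            batch, used = [], 0
        batch.append(name)
        used += nbytes
    if batch:
        yield batch


# Hypothetical parameter names and byte sizes.
params = [("embed", 600), ("layer0.attn", 300), ("layer0.mlp", 500), ("lm_head", 1200)]
batches = list(chunk_by_bytes(params, cap_bytes=1000))
```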

`--update-weight-mode=delta` is rejected when combined with `--colocate`: the colocated path goes over CUDA IPC, so there is no wire to compress.
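
A minimal sketch of how such a check could look with `argparse`. The flag names and choices come from the table above; the defaults, the `validate` helper, and the second check (requiring a delta directory for the disk transport) are assumptions made for illustration, not slime's actual argument handling.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Defaults below are illustrative guesses, not slime's real defaults.
    parser.add_argument("--update-weight-mode", choices=["full", "delta"], default="full")
    parser.add_argument("--update-weight-transport", choices=["nccl", "disk"], default="nccl")
    parser.add_argument(
        "--update-weight-encoding",
        choices=["indices", "deltas", "deltas_zstd"],
        default="indices",
    )
    parser.add_argument("--update-weight-delta-dir", default=None)
    parser.add_argument("--colocate", action="store_true")
    return parser


def validate(args):
    """Reject flag combinations that cannot work together."""
    is_delta = args.update_weight_mode == "delta"
    if is_delta and args.colocate:
        raise ValueError("--update-weight-mode=delta cannot be combined with --colocate")
    # Assumed rule: the disk transport needs somewhere to write its files.
    if is_delta and args.update_weight_transport == "disk" and not args.update_weight_delta_dir:
        raise ValueError("--update-weight-transport=disk requires --update-weight-delta-dir")
    return args


args = validate(
    build_parser().parse_args(
        [
            "--update-weight-mode", "delta",
            "--update-weight-transport", "disk",
            "--update-weight-delta-dir", "/shared/deltas",
        ]
    )
)
```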

Code shape

Slime: the new `UpdateWeightFromDistributedDelta` extends the refactored `UpdateWeightFromDistributed` (a template-method base with `_iter_non_expert_chunks` / `_iter_expert_chunks` / `_on_chunk` hooks; bucketing lives in the iterators). One class serves both transports: the `if self.transport == "nccl"` branch lives only in `_flush_bucket` and `_finalize_sync`. The encoders (`encode_indices`, `encode_deltas`), snapshot state (`DeltaState`), and the disk-only writer (`AsyncSafetensorsWriter`) are module-level.
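
The shape described above can be sketched as a toy template-method pair. Only the class and hook names are taken from the description; the constructor signatures, the `update_weights` driver method, the `log` list, and every method body are invented for illustration and stand in for real broadcasts, file writes, and HTTP calls.

```python
class UpdateWeightFromDistributed:
    """Base class: a fixed sync loop; subclasses customise the hooks."""

    def __init__(self, non_expert_chunks, expert_chunks):
        self._non_expert = non_expert_chunks
        self._expert = expert_chunks

    def _iter_non_expert_chunks(self):
        yield from self._non_expert  # bucketing would happen here

    def _iter_expert_chunks(self):
        yield from self._expert

    def _on_chunk(self, chunk):
        raise NotImplementedError

    def _finalize_sync(self):
        pass

    def update_weights(self):  # hypothetical driver name
        for chunk in self._iter_non_expert_chunks():
            self._on_chunk(chunk)
        for chunk in self._iter_expert_chunks():
            self._on_chunk(chunk)
        self._finalize_sync()


class UpdateWeightFromDistributedDelta(UpdateWeightFromDistributed):
    """One subclass for both transports; only two methods branch on transport."""

    def __init__(self, non_expert_chunks, expert_chunks, transport):
        super().__init__(non_expert_chunks, expert_chunks)
        self.transport = transport
        self.log = []  # records what a real implementation would send

    def _on_chunk(self, chunk):
        self._flush_bucket(chunk)

    def _flush_bucket(self, bucket):
        if self.transport == "nccl":
            self.log.append(("broadcast", bucket))
        else:
            self.log.append(("write_file", bucket))

    def _finalize_sync(self):
        if self.transport == "nccl":
            self.log.append(("sync_done", None))
        else:
            self.log.append(("http_push", None))


disk = UpdateWeightFromDistributedDelta(["dense0"], ["experts0"], transport="disk")
disk.update_weights()
nccl = UpdateWeightFromDistributedDelta(["dense0"], ["experts0"], transport="nccl")
nccl.update_weights()
```

Keeping the transport branch inside two methods means the chunk iteration and encoding paths are exercised identically by both transports, which fits the role of `nccl` as a validation baseline.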

SGLang patch (to follow in next PR):

- Wire protocol: `DeltaSpec` + `DeltaEncoding` + `DeltaParam` in `io_struct.py`. The receiver dispatches on `load_format == "delta"`.
- Two receive entry points, both converging on `_apply_delta_payload`:
  - NCCL: `update_weights_from_distributed(load_format="delta", delta=spec)` broadcasts `(__positions__, __values__)` tensors, then applies.
  - Disk: `update_weights_from_disk(load_format="delta", model_path=..., files=[...])` reads safetensors files (batched parallel prefetch), then applies.
- Apply: `_delta_apply_context` patches `Tensor.copy_` / `fill_`, scoped to writes whose destination is inside model param storage. A NaN sentinel at unchanged positions triggers the masked overwrite. No arithmetic, lossless.
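
The scoped-patch idea can be illustrated without PyTorch. The sketch below uses a hypothetical `ToyTensor` class and tracks model parameters by object identity; the real patch targets `Tensor.copy_` / `fill_` and checks parameter storage, so treat this only as an analogy for the control flow.

```python
import math
from contextlib import contextmanager


class ToyTensor:
    """Stand-in for a tensor: a flat list of floats with an in-place copy_."""

    def __init__(self, data):
        self.data = list(data)

    def copy_(self, src):
        self.data[:] = src.data
        return self


@contextmanager
def delta_apply_context(params):
    """Temporarily make copy_ skip NaN entries, only for destinations in params."""
    targets = {id(p) for p in params}
    original = ToyTensor.copy_

    def masked_copy_(self, src):
        if id(self) not in targets:
            return original(self, src)  # unrelated tensors keep normal behaviour
        for i, val in enumerate(src.data):
            if not math.isnan(val):
                self.data[i] = val
        return self

    ToyTensor.copy_ = masked_copy_
    try:
        yield
    finally:
        ToyTensor.copy_ = original  # always restore the unpatched method


weight = ToyTensor([1.0, 2.0, 3.0])   # a model parameter
scratch = ToyTensor([0.0, 0.0, 0.0])  # some unrelated buffer
update = ToyTensor([math.nan, 9.0, math.nan])

with delta_apply_context([weight]):
    weight.copy_(update)   # masked: only position 1 is overwritten
    scratch.copy_(update)  # plain copy: NaNs are written through
```

Patching the copy primitive rather than the loader is what lets an existing `load_weights` path run unchanged while still leaving untouched positions alone.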

🤖 Generated with Claude Code

Adds delta weight sync: ship only changed positions + values instead of full parameters. Two transports (disk for cross-DC disaggregation, NCCL for intra-DC), three encodings (indices, deltas, deltas_zstd), lossless selective overwrite via NaN sentinel. Examples, docs, and performance comparison to follow in a separate PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

nanjiangwill marked this pull request as draft April 5, 2026 04:33

nanjiangwill force-pushed the delta-compression-feature branch 5 times, most recently from d2aa1c0 to 5fb928d Compare April 18, 2026 01:26

nanjiangwill marked this pull request as ready for review April 18, 2026 01:29

zhuzilin added the run-ci-megatron label Apr 23, 2026

nanjiangwill force-pushed the delta-compression-feature branch from 23b9059 to b056c46 Compare April 30, 2026 22:27

nanjiangwill force-pushed the delta-compression-feature branch from b1c4ae1 to db09918 Compare May 13, 2026 02:41

nanjiangwill changed the title ~~feat: delta compression for weight sync~~ feat: delta-compression weight sync May 13, 2026

nanjiangwill changed the title ~~feat: delta-compression weight sync~~ feat: partial weight sync (delta + selective) May 13, 2026

nanjiangwill force-pushed the delta-compression-feature branch 2 times, most recently from 4245bfa to 0a664bc Compare May 20, 2026 16:41

nanjiangwill changed the title ~~feat: partial weight sync (delta + selective)~~ feat: delta weight sync (disk + nccl transports) May 20, 2026

nanjiangwill force-pushed the delta-compression-feature branch from 0a664bc to 8021717 Compare May 22, 2026 05:59

nanjiangwill force-pushed the delta-compression-feature branch from 8021717 to b88fc27 Compare May 22, 2026 06:01

nanjiangwill added 2 commits May 25, 2026 18:16

add example

4e90292

Merge branch 'main' into nan/delta-sync

35837bc

nanjiangwill closed this May 25, 2026

nanjiangwill force-pushed the delta-compression-feature branch from b88fc27 to f0bce74 Compare May 25, 2026 18:17

nanjiangwill reopened this May 25, 2026

zhuzilin approved these changes May 26, 2026

zhuzilin merged commit 987b314 into main May 26, 2026
35 of 84 checks passed

zhuzilin deleted the delta-compression-feature branch May 26, 2026 03:52

zhuzilin merged 3 commits into `main` from `delta-compression-feature`
