
feat: delta weight sync (disk + nccl transports) #1806

Merged: zhuzilin merged 3 commits into main from delta-compression-feature on May 26, 2026


nanjiangwill (Collaborator) commented Apr 5, 2026

Summary

Non-colocated weight sync that ships only the changed positions and their values instead of every parameter. The motivating use case is training/inference disaggregation: trainer and rollout engines sit in different datacenters and share a filesystem with bandwidth in the hundreds of MB/s, where a full broadcast is infeasible but a sparse delta (~3% density, ~5 GB for a 355B model) is feasible. Two transports share one wire layout and one receiver-side decoder:

  • disk — write one safetensors file per flush to --update-weight-delta-dir; one HTTP push per sync wakes the rollout engines to read+apply. For cross-DC training/inference disaggregation.
  • nccl — broadcast each per-flush bucket directly. Intra-DC validation baseline.
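
The disk transport's per-sync flow (one file per flush on the trainer, then a parallel read on the receiver) might look roughly like the sketch below. It is an illustration under stated assumptions, not slime's code: JSON files stand in for safetensors, and `write_flush`, `read_sync_dir`, the file naming scheme, and the directory name are all hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_flush(sync_dir: Path, flush_id: int, positions: list[int], values: list[float]) -> Path:
    """Trainer side: persist one flush bucket as its own file (JSON stands in for safetensors)."""
    path = sync_dir / f"flush_{flush_id:05d}.json"
    path.write_text(json.dumps({"positions": positions, "values": values}))
    return path


def read_sync_dir(files: list[Path], read_workers: int = 4) -> list[dict]:
    """Receiver side: read every file of one sync in parallel, preserving file order."""
    with ThreadPoolExecutor(max_workers=read_workers) as pool:
        return list(pool.map(lambda p: json.loads(p.read_text()), files))


with tempfile.TemporaryDirectory() as root:
    sync_dir = Path(root) / "sync_000001"  # hypothetical per-sync directory name
    sync_dir.mkdir()
    files = [
        write_flush(sync_dir, 0, [1, 7], [0.5, -2.0]),
        write_flush(sync_dir, 1, [3], [4.25]),
    ]
    # In the PR, a single HTTP push per sync tells the engines which files to read;
    # here the file list is simply passed to the reader directly.
    payloads = read_sync_dir(files, read_workers=2)
```

The worker count plays the role that --sglang-update-weight-delta-read-workers appears to play on the SGLang side.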

Receiver applies via NaN-masked overwrite — no arithmetic on either side. Lossless by construction; no drift, no periodic base syncs needed.
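
A minimal sketch of the selective-overwrite idea, using plain Python lists in place of tensors. The function names are hypothetical, and two assumptions are baked in: changed positions are found by exact comparison, and no updated weight is itself NaN.

```python
import math


def encode_changed(old: list[float], new: list[float]) -> tuple[list[int], list[float]]:
    """Sender: collect positions whose value changed, plus the new values there."""
    positions = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    return positions, [new[i] for i in positions]


def nan_masked_overwrite(dest: list[float], positions: list[int], values: list[float]) -> None:
    """Receiver: stage values in a NaN-filled buffer, then overwrite non-NaN slots only."""
    staged = [math.nan] * len(dest)
    for pos, val in zip(positions, values):
        staged[pos] = val
    for i, val in enumerate(staged):
        if not math.isnan(val):  # NaN marks "unchanged, keep the receiver's value"
            dest[i] = val


trainer_old = [1.0, 2.0, 3.0, 4.0]
trainer_new = [1.0, 2.5, 3.0, -4.0]
rollout = list(trainer_old)  # receiver starts in sync with the old weights

positions, values = encode_changed(trainer_old, trainer_new)
nan_masked_overwrite(rollout, positions, values)
```

Because the receiver only copies values and never adds a difference to its own weights, rounding error cannot accumulate across syncs, which is what makes the scheme drift-free.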

Inspired by arXiv:2509.19128 (selective overwrite); cross-DC disaggregation motivation from Fireworks AI — Frontier RL Is Cheaper Than You Think.

Note: Examples, docs, SGLang patch, and performance comparison will follow in a separate PR.

CLI surface

Trainer side (slime):

| flag | values | applies to |
| --- | --- | --- |
| --update-weight-mode | full / delta | universal: picks the strategy |
| --update-weight-transport | nccl / disk | delta only: per-flush carrier |
| --update-weight-encoding | indices / deltas / deltas_zstd | delta only: position encoding |
| --update-weight-delta-dir | path | delta + disk: shared-FS root for per-sync directories |
| --update-weight-delta-keep-files | bool flag | delta + disk: debug aid (skip post-apply cleanup) |
| --custom-delta-pre-push-path | module.fn | delta + disk: trainer-side hook (e.g., shared-FS commit) |
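
The three --update-weight-encoding values suggest a trade-off in how changed positions are serialized. The sketch below is an assumption about what they mean, not something the PR text confirms: `indices` as absolute positions, `deltas` as gaps between consecutive sorted positions, and `deltas_zstd` as compressed gaps. `zlib` from the standard library stands in for zstd, and every function name here is hypothetical.

```python
import struct
import zlib


def positions_to_gaps(positions: list[int]) -> list[int]:
    """Gap-encode sorted positions: first entry is absolute, the rest are differences."""
    gaps, prev = [], 0
    for pos in positions:
        gaps.append(pos - prev)
        prev = pos
    return gaps


def gaps_to_positions(gaps: list[int]) -> list[int]:
    """Invert positions_to_gaps with a running sum."""
    positions, total = [], 0
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions


def pack_u32(numbers: list[int]) -> bytes:
    """Serialize as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(numbers)}I", *numbers)


positions = [1_000_000 + 32 * i for i in range(1000)]  # regular spacing, for illustration
gaps = positions_to_gaps(positions)

raw_indices = pack_u32(positions)
compressed_gaps = zlib.compress(pack_u32(gaps))  # zlib used as a stand-in for zstd
```

Gaps are small and repetitive compared with absolute indices, so a general-purpose compressor tends to shrink them far more, which would explain pairing compression with the gap form rather than the index form.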

SGLang side (auto-mirrored via --sglang- prefix):

| flag | purpose |
| --- | --- |
| --sglang-update-weight-delta-chunk-bytes | byte cap per model.load_weights call on apply |
| --sglang-update-weight-delta-read-workers | max parallel I/O threads for reading delta files (disk only) |

--update-weight-mode=delta is rejected with --colocate: colocated sync goes through CUDA IPC, so there is no wire traffic to compress.
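
The rejection rule can be sketched with `argparse`. The flag names and choices come from the tables above; the defaults, the `build_parser` / `validate` helpers, and the error message are assumptions made for illustration and do not reflect how slime actually wires its arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the flags listed in this PR."""
    parser = argparse.ArgumentParser()
    # Defaults below are assumed, not taken from the PR.
    parser.add_argument("--update-weight-mode", choices=["full", "delta"], default="full")
    parser.add_argument("--update-weight-transport", choices=["nccl", "disk"], default="nccl")
    parser.add_argument("--update-weight-delta-dir", default=None)
    parser.add_argument("--colocate", action="store_true")
    return parser


def validate(args: argparse.Namespace) -> None:
    """Reject the one combination the PR says is invalid."""
    if args.update_weight_mode == "delta" and args.colocate:
        raise ValueError("--update-weight-mode=delta cannot be combined with --colocate")


ok_args = build_parser().parse_args(
    ["--update-weight-mode", "delta", "--update-weight-transport", "disk",
     "--update-weight-delta-dir", "/shared/deltas"]  # example path, hypothetical
)
validate(ok_args)  # passes: delta mode without --colocate
```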

Code shape

Slime: new UpdateWeightFromDistributedDelta extends the refactored UpdateWeightFromDistributed (template-method base with _iter_non_expert_chunks / _iter_expert_chunks / _on_chunk hook; bucketing lives in the iterators). One class for both transports — the if self.transport == "nccl" branch lives only in _flush_bucket and _finalize_sync. Encoders (encode_indices, encode_deltas), snapshot state (DeltaState), and the disk-only writer (AsyncSafetensorsWriter) are module-level.
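
The class layout described above follows the template-method pattern. The sketch below shows the general shape only: the class names are hypothetical, the two chunk iterators are collapsed into one `_iter_chunks`, and placing the HTTP push in `_finalize_sync` is an assumption. Sends are recorded in a list instead of touching NCCL or the filesystem.

```python
class FullSyncSketch:
    """Base class: fixed sync skeleton; subclasses customize the per-chunk hook."""

    def __init__(self, bucket_size: int):
        self.bucket_size = bucket_size
        self.sent = []  # log of actions, standing in for real transports

    def _iter_chunks(self, params: dict):
        """Bucketing lives in the iterator: yield lists of (name, tensor) pairs."""
        bucket = []
        for name, tensor in params.items():
            bucket.append((name, tensor))
            if len(bucket) == self.bucket_size:
                yield bucket
                bucket = []
        if bucket:
            yield bucket

    def _on_chunk(self, chunk):
        self.sent.append(("full", [name for name, _ in chunk]))

    def _finalize_sync(self):
        pass

    def update_weights(self, params: dict):
        """Template method: iteration order is fixed, hooks vary by subclass."""
        for chunk in self._iter_chunks(params):
            self._on_chunk(chunk)
        self._finalize_sync()


class DeltaSyncSketch(FullSyncSketch):
    """One class for both transports; the branch lives only in two methods."""

    def __init__(self, bucket_size: int, transport: str):
        super().__init__(bucket_size)
        self.transport = transport

    def _on_chunk(self, chunk):
        self._flush_bucket(chunk)

    def _flush_bucket(self, chunk):
        names = [name for name, _ in chunk]
        action = "broadcast" if self.transport == "nccl" else "write_file"
        self.sent.append((action, names))

    def _finalize_sync(self):
        if self.transport == "disk":
            self.sent.append(("http_push", []))  # one push per sync


sync = DeltaSyncSketch(bucket_size=2, transport="disk")
sync.update_weights({"a": [1.0], "b": [2.0], "c": [3.0]})
```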

SGLang patch (to follow in next PR):

  • Wire protocol: DeltaSpec + DeltaEncoding + DeltaParam in io_struct.py. Receiver dispatches on load_format == "delta".
  • Two receive entry points, both converge on _apply_delta_payload:
    • NCCL: update_weights_from_distributed(load_format="delta", delta=spec) broadcasts (__positions__, __values__) tensors, then applies.
    • Disk: update_weights_from_disk(load_format="delta", model_path=..., files=[...]) reads safetensors files (batched parallel prefetch), then applies.
  • Apply: _delta_apply_context patches Tensor.copy_ / fill_ scoped to writes whose destination is inside model param storage. NaN sentinel at unchanged positions triggers masked overwrite. No arithmetic, lossless.
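
The scoped-patch idea in the Apply bullet can be illustrated with a toy class. This is a sketch under loud assumptions, not the SGLang patch: `Buffer` stands in for a tensor, `delta_apply_sketch` is a hypothetical name, and membership in model storage is approximated with `id()` rather than a storage-pointer check.

```python
import math
from contextlib import contextmanager


class Buffer:
    """Toy stand-in for a tensor with an in-place copy_ method."""

    def __init__(self, data):
        self.data = list(data)

    def copy_(self, src):
        self.data[:] = src
        return self


@contextmanager
def delta_apply_sketch(model_params):
    """Temporarily make copy_ skip NaN slots, but only for registered model params."""
    param_ids = {id(p) for p in model_params}
    original_copy = Buffer.copy_

    def masked_copy(self, src):
        if id(self) not in param_ids:
            return original_copy(self, src)  # outside model storage: normal write
        for i, val in enumerate(src):
            if not math.isnan(val):  # NaN sentinel: leave this position untouched
                self.data[i] = val
        return self

    Buffer.copy_ = masked_copy
    try:
        yield
    finally:
        Buffer.copy_ = original_copy  # always restore the unpatched method


weight = Buffer([1.0, 2.0, 3.0])   # registered as a model parameter
scratch = Buffer([0.0, 0.0, 0.0])  # unrelated buffer, must behave normally
nan = math.nan

with delta_apply_sketch([weight]):
    weight.copy_([nan, 9.0, nan])
    scratch.copy_([nan, 9.0, nan])
```

Patching the write primitive instead of the loader means the existing weight-loading code path can run unchanged while only the final write into parameter storage becomes selective.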

🤖 Generated with Claude Code

@nanjiangwill nanjiangwill marked this pull request as draft April 5, 2026 04:33
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch 5 times, most recently from d2aa1c0 to 5fb928d Compare April 18, 2026 01:26
@nanjiangwill nanjiangwill marked this pull request as ready for review April 18, 2026 01:29
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from 23b9059 to b056c46 Compare April 30, 2026 22:27
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from b1c4ae1 to db09918 Compare May 13, 2026 02:41
@nanjiangwill nanjiangwill changed the title feat: delta compression for weight sync feat: delta-compression weight sync May 13, 2026
@nanjiangwill nanjiangwill changed the title feat: delta-compression weight sync feat: partial weight sync (delta + selective) May 13, 2026
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch 2 times, most recently from 4245bfa to 0a664bc Compare May 20, 2026 16:41
@nanjiangwill nanjiangwill changed the title feat: partial weight sync (delta + selective) feat: delta weight sync (disk + nccl transports) May 20, 2026
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from 0a664bc to 8021717 Compare May 22, 2026 05:59
Adds delta weight sync: ship only changed positions + values instead of
full parameters. Two transports (disk for cross-DC disaggregation, NCCL
for intra-DC), three encodings (indices, deltas, deltas_zstd), lossless
selective overwrite via NaN sentinel.

Examples, docs, and performance comparison to follow in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from 8021717 to b88fc27 Compare May 22, 2026 06:01
@nanjiangwill nanjiangwill force-pushed the delta-compression-feature branch from b88fc27 to f0bce74 Compare May 25, 2026 18:17
@nanjiangwill nanjiangwill reopened this May 25, 2026
@zhuzilin zhuzilin merged commit 987b314 into main May 26, 2026
35 of 84 checks passed
@zhuzilin zhuzilin deleted the delta-compression-feature branch May 26, 2026 03:52

2 participants