feat: delta weight sync (disk + nccl transports)#1806
Merged
Adds delta weight sync: ship only changed positions + values instead of full parameters. Two transports (disk for cross-DC disaggregation, NCCL for intra-DC), three encodings (indices, deltas, deltas_zstd), lossless selective overwrite via NaN sentinel.

Examples, docs, and performance comparison to follow in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zhuzilin
approved these changes
May 26, 2026
Summary
Non-colocated weight sync that ships only changed positions + values instead of every parameter. The motivating use case is training/inference disaggregation: trainer and rollout engines sit in different datacenters and communicate over a shared filesystem with bandwidth on the order of hundreds of MB/s, where a full broadcast is infeasible but a sparse delta (~3% density, ~5 GB for a 355B model) is feasible. Two transports share one wire layout and one receiver-side decoder:
- disk: write one safetensors file per flush to --update-weight-delta-dir; one HTTP push per sync wakes the rollout engines to read+apply. For cross-DC training/inference disaggregation.
- nccl: broadcast each per-flush bucket directly. Intra-DC validation baseline.

Receiver applies via NaN-masked overwrite, with no arithmetic on either side. Lossless by construction; no drift, no periodic base syncs needed.
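A minimal sketch of the round trip described above, assuming float32 weights that never legitimately contain NaN. The helper names `encode_changed` and `apply_masked_overwrite` are hypothetical and are not taken from this PR; NumPy stands in for the real tensor library.

```python
import numpy as np


def encode_changed(old: np.ndarray, new: np.ndarray):
    """Sender side: flat positions and new values where the weights changed.

    Bit patterns are compared (not float values) so that, e.g., a 0.0 -> -0.0
    change is still shipped. This sketch only handles float32.
    """
    assert old.dtype == np.float32 and new.dtype == np.float32
    old_bits = np.ascontiguousarray(old).reshape(-1).view(np.uint32)
    new_bits = np.ascontiguousarray(new).reshape(-1).view(np.uint32)
    positions = np.flatnonzero(old_bits != new_bits)
    values = np.ascontiguousarray(new).reshape(-1)[positions]
    return positions, values


def apply_masked_overwrite(param: np.ndarray, positions: np.ndarray, values: np.ndarray) -> None:
    """Receiver side: overwrite in place only where the payload is not NaN."""
    assert param.flags.c_contiguous  # so reshape(-1) below is a view, not a copy
    # Dense payload with the NaN sentinel at every unchanged position.
    payload = np.full(param.size, np.nan, dtype=param.dtype)
    payload[positions] = values
    # Copy only non-sentinel entries; no add/subtract, so values arrive bit-exact.
    np.copyto(param.reshape(-1), payload, where=~np.isnan(payload))
```

Because the receiver copies new values rather than adding differences, repeated syncs cannot accumulate rounding error. The one caveat in this sketch is that a genuinely NaN weight would be indistinguishable from the sentinel.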
Inspired by arXiv:2509.19128 (selective overwrite); cross-DC disaggregation motivation from Fireworks AI — Frontier RL Is Cheaper Than You Think.
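To put the bandwidth figures from the summary in perspective, here is a back-of-envelope calculation. The 200 MB/s link speed and the bf16 (2 bytes per parameter) full-sync size are assumptions chosen for illustration; only the 355B parameter count and the ~5 GB delta size come from the text above.

```python
def transfer_seconds(payload_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: ignores latency, compression, and protocol overhead."""
    return payload_bytes / bandwidth_bytes_per_s


GB = 1e9
MB = 1e6

full_bytes = 355e9 * 2   # assumption: 355B parameters stored as bf16
delta_bytes = 5 * GB     # ~5 GB sparse delta, as stated in the summary
bandwidth = 200 * MB     # assumption: one point within "hundreds of MB/s"

full_s = transfer_seconds(full_bytes, bandwidth)    # 3550 s, roughly an hour
delta_s = transfer_seconds(delta_bytes, bandwidth)  # 25 s
```

Under these assumptions a full sync costs about an hour per update while the delta costs under half a minute, which is the gap that makes the cross-DC setup workable.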
CLI surface
Trainer side (slime):
- --update-weight-mode: full / delta
- --update-weight-transport: nccl / disk
- --update-weight-encoding: indices / deltas / deltas_zstd
- --update-weight-delta-dir
- --update-weight-delta-keep-files
- --custom-delta-pre-push-path: module.fn

SGLang side (auto-mirrored via --sglang- prefix):
- --sglang-update-weight-delta-chunk-bytes: … model.load_weights call on apply
- --sglang-update-weight-delta-read-workers

--update-weight-mode=delta is rejected with --colocate: CUDA IPC has no wire to compress.

Code shape
Slime: new UpdateWeightFromDistributedDelta extends the refactored UpdateWeightFromDistributed (template-method base with _iter_non_expert_chunks / _iter_expert_chunks / _on_chunk hook; bucketing lives in the iterators). One class for both transports: the if self.transport == "nccl" branch lives only in _flush_bucket and _finalize_sync. Encoders (encode_indices, encode_deltas), snapshot state (DeltaState), and the disk-only writer (AsyncSafetensorsWriter) are module-level.

SGLang patch (to follow in next PR):
- DeltaSpec + DeltaEncoding + DeltaParam in io_struct.py.
- Receiver dispatches on load_format == "delta".
- _apply_delta_payload:
  - update_weights_from_distributed(load_format="delta", delta=spec) broadcasts (__positions__, __values__) tensors, then applies.
  - update_weights_from_disk(load_format="delta", model_path=..., files=[...]) reads safetensors files (batched parallel prefetch), then applies.
- _delta_apply_context patches Tensor.copy_ / fill_ scoped to writes whose destination is inside model param storage. NaN sentinel at unchanged positions triggers masked overwrite. No arithmetic, lossless.

🤖 Generated with Claude Code
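The Slime class layout described under Code shape can be pictured roughly as follows. This is a toy sketch and not the PR's code: `SyncBase` and `DeltaSync` are hypothetical stand-ins, parameters are plain lists, and "sending" just appends to a log. Only the hook names (`_iter_non_expert_chunks`, `_iter_expert_chunks`, `_on_chunk`, `_flush_bucket`, `_finalize_sync`) and the placement of the transport branch follow the description.

```python
from typing import Dict, Iterator, List, Tuple

Chunk = Tuple[str, List[float]]  # (parameter name, values): toy stand-in for tensors


class SyncBase:
    """Hypothetical template-method base: fixed traversal, overridable hooks."""

    def __init__(self, params: Dict[str, List[float]]) -> None:
        self.params = params
        self.log: List[str] = []  # records what would have been sent

    def _iter_non_expert_chunks(self) -> Iterator[Chunk]:
        for name, values in self.params.items():
            if "expert" not in name:
                yield name, values

    def _iter_expert_chunks(self) -> Iterator[Chunk]:
        for name, values in self.params.items():
            if "expert" in name:
                yield name, values

    def _on_chunk(self, chunk: Chunk) -> None:
        self.log.append(f"full:{chunk[0]}")  # base behaviour: ship the whole chunk

    def _finalize_sync(self) -> None:
        pass

    def sync(self) -> None:
        # The traversal order is fixed here; subclasses only override the hooks.
        for chunk in self._iter_non_expert_chunks():
            self._on_chunk(chunk)
        for chunk in self._iter_expert_chunks():
            self._on_chunk(chunk)
        self._finalize_sync()


class DeltaSync(SyncBase):
    """One class for both transports; the branch lives only in two methods."""

    def __init__(self, params: Dict[str, List[float]], transport: str) -> None:
        super().__init__(params)
        if transport not in ("nccl", "disk"):
            raise ValueError(f"unknown transport: {transport}")
        self.transport = transport

    def _on_chunk(self, chunk: Chunk) -> None:
        self._flush_bucket(chunk)  # toy: one chunk per bucket

    def _flush_bucket(self, chunk: Chunk) -> None:
        if self.transport == "nccl":
            self.log.append(f"broadcast:{chunk[0]}")
        else:
            self.log.append(f"write_file:{chunk[0]}")

    def _finalize_sync(self) -> None:
        if self.transport == "disk":
            self.log.append("http_push")  # one push per sync wakes the receivers
```

Keeping the transport branch inside `_flush_bucket` and `_finalize_sync` means the iteration and bucketing logic is shared, so a new transport would only need to touch those two methods.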