Fix CUDA IPC cache leaks during weight updates by zhuzilin · Pull Request #1731 · THUDM/slime

zhuzilin · 2026-03-17T02:37:42Z

**Root cause:** `ForkingPickler` calls `storage._share_cuda_()` on GPU tensors when it pickles them, creating persistent entries in the CUDA IPC cache that hold strong references to GPU memory. These entries are released only when `torch.cuda.ipc_collect()` detects that the consumer has closed its IPC handle.
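The behavior described above can be illustrated with a toy model. This is only a sketch of the semantics as stated in this description, not PyTorch's implementation: `ToyIpcCache` and all of its methods are hypothetical names, and a `bytearray` stands in for GPU memory.

```python
class ToyIpcCache:
    """Toy stand-in for a producer-side IPC cache (hypothetical, not PyTorch code)."""

    def __init__(self):
        # handle -> [buffer, consumer_still_open]
        self._entries = {}

    def share(self, handle, buffer):
        # Sharing stores a strong reference, so a later `del` in the caller
        # does not free the underlying buffer.
        self._entries[handle] = [buffer, True]

    def consumer_close(self, handle):
        # The consumer signals that it has closed its handle.
        self._entries[handle][1] = False

    def collect(self):
        # Analogue of an explicit collect call: drop entries whose consumer
        # has closed its handle, and report how many were dropped.
        closed = [h for h, (_, is_open) in self._entries.items() if not is_open]
        for h in closed:
            del self._entries[h]
        return len(closed)

    def held_bytes(self):
        return sum(len(buf) for buf, _ in self._entries.values())


cache = ToyIpcCache()
chunk = bytearray(1024)
cache.share("chunk-0", chunk)

del chunk                                # producer drops its own reference
held_after_del = cache.held_bytes()      # the cache still holds the buffer

cache.consumer_close("chunk-0")
held_after_close = cache.held_bytes()    # unchanged until collect() runs

freed = cache.collect()
held_after_collect = cache.held_bytes()  # the entry is gone now
```

The point of the model is that neither `del` on the producer side nor the consumer closing its handle frees anything by itself; memory is returned only once a collect pass runs after the consumer has closed.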

**Fix (in `update_weight_from_tensor.py`):**

1. `hf_named_tensors` is now deleted alongside `long_lived_tensors` to break the overlap between consecutive chunks.
2. `torch.cuda.ipc_collect()` is called after each chunk's `ray.get()` and `del`, releasing IPC cache entries for completed chunks.
3. `torch.cuda.ipc_collect()` is called after the post-loop barrier, releasing the last chunk's IPC entries on non-source ranks (which don't wait for `ray.get()`).
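The ordering of these three steps can be sketched as follows. This is not the code from the PR: `update_in_chunks` and its parameters are hypothetical, and the side-effecting calls are injected as plain callables so that the sketch runs without a GPU, Ray, or a process group.

```python
def update_in_chunks(chunks, send, wait, barrier, ipc_collect):
    """Send weight chunks one at a time, collecting IPC entries as each completes.

    All parameters are illustrative stand-ins:
      send(named)   -> starts the transfer and returns handles to wait on
      wait(refs)    -> blocks until the consumer is done with the chunk
      barrier()     -> synchronizes all ranks after the loop
      ipc_collect() -> releases IPC cache entries whose consumer has closed
    """
    for chunk in chunks:
        hf_named_tensors = list(chunk)
        # Tensors that must stay alive while the chunk is in flight.
        long_lived_tensors = [t for _, t in hf_named_tensors]
        refs = send(hf_named_tensors)
        wait(refs)
        # Step 1: drop both local references before the next chunk is built.
        del hf_named_tensors, long_lived_tensors
        # Step 2: release cache entries for the chunk that just completed.
        ipc_collect()
    barrier()
    # Step 3: covers the last chunk on ranks whose wait() returned immediately.
    ipc_collect()


# Record the call order with fake callables.
trace = []
update_in_chunks(
    chunks=[[("w0", b"a")], [("w1", b"b")]],
    send=lambda named: trace.append("send") or ["ref"],
    wait=lambda refs: trace.append("wait"),
    barrier=lambda: trace.append("barrier"),
    ipc_collect=lambda: trace.append("collect"),
)
```

In a real setup the callables would presumably be `ray.get` for `wait`, `torch.distributed.barrier` for `barrier`, and `torch.cuda.ipc_collect` for `ipc_collect`; how the actual file structures its loop is not shown in this description.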


zhuzilin merged commit 183e525 into main Mar 17, 2026
1 check passed

zhuzilin deleted the memory_opt branch March 17, 2026 02:37
