Fix(checkpoint): add resume/pause in save_model() for offload_train (fixes #1886) by Procrastinatorrrr · Pull Request #1888 · THUDM/slime

Procrastinatorrrr · 2026-05-04T14:52:11Z

🐛 Bug Summary

Checkpoint save crashes with --colocate and offload_train=True (regression from #1856, fixes #1886)

Error

Traceback (most recent call last):
  File "/root/slime/train.py", line 107, in <module>
    train(args)
  File "/root/slime/train.py", line 88, in train
    save(rollout_id)
  File "/root/slime/train.py", line 54, in save
    actor_model.save_model(
  File "/root/slime/slime/ray/actor_group.py", line 133, in save_model
    return ray.get([actor.save_model.remote(rollout_id, force_sync=force_sync) for actor in self._actor_handlers])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2980, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1023, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AcceleratorError): �[36mray::MegatronTrainRayActor.save_model()�[39m (pid=12545, ip=172.17.0.2, actor_id=7b15b4911ec1f78e875ffaec02000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7123dfe3d640>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/utils/timer.py", line 78, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 525, in save_model
    save(rollout_id, self.model, self.optimizer, self.opt_param_scheduler)
  File "/root/slime/slime/backends/megatron_utils/model.py", line 765, in save
    save_checkpoint(
  File "/root/Megatron-LM/megatron/training/checkpointing.py", line 635, in save_checkpoint
    async_save_request = dist_checkpointing.save(state_dict, checkpoint_name, save_strategy,
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megatron-LM/megatron/core/dist_checkpointing/serialization.py", line 407, in save
    sharded_state_dict, state_dict = save_preprocess(
                                     ^^^^^^^^^^^^^^^^
  File "/root/Megatron-LM/megatron/core/dist_checkpointing/state_dict_utils.py", line 56, in save_preprocess
    determine_global_metadata(sharded_part)[1],
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megatron-LM/megatron/core/dist_checkpointing/validation.py", line 518, in determine_global_metadata
    torch.distributed.all_gather_object(global_metadata, local_metadata)
  File "/root/slime/slime/utils/reloadable_process_group.py", line 67, in new_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3167, in all_gather_object
    input_tensor, local_size = _object_to_tensor(obj, current_device, group)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3076, in _object_to_tensor
    byte_tensor = torch.ByteTensor(byte_storage).to(device)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid argument
Search for `cudaErrorInvalidValue' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

Root Cause

PR #1856 introduced unconditional self.sleep() at the end of train(), which calls
torch_memory_saver.pause() to release GPU memory. However, save_model() only calls
reload_process_groups() but missing torch_memory_saver.resume(), causing it to
operate on a paused model state.

Call stack when crash occurs:

train.py:85   → actor_model.async_train()
                 ↓
actor.py:371  → self.sleep()              # Model paused here (pause #1)
                 ↓
train.py:87-88 → should_run_periodic_action() → True (save interval reached)
                 ↓
train.py:89   → save(rollout_id)         # Tries to save paused model
                 ↓
actor.py:518  → reload_process_groups()   # ✅ Process groups rebuilt
                 ↓ ❌ MISSING: torch_memory_saver.resume()
actor.py:525  → save(paused_model)        # 💥 CRASH: CUDA error or double-pause

Fix

Complete the save_model() offload lifecycle by adding proper resume/pause:

  def save_model(self, rollout_id, force_sync=False):
      if self.args.debug_rollout_only:
          return
  
-     # torch dist may trigger nccl communication during saving.
      if self.args.offload_train:
+         # [FIX-#1856] Resume model before saving when using offload_train
          reload_process_groups()
+         torch_memory_saver.resume()       # Wake up paused model
+         clear_memory()                    # Ensure sufficient memory available
  
      if self.args.async_save:
          from megatron.training.async_utils import maybe_finalize_async_save
          maybe_finalize_async_save(blocking=True)
  
      save(rollout_id, self.model, self.optimizer, self.opt_param_scheduler)
  
      if force_sync and self.args.async_save:
          maybe_finalize_async_save(blocking=True)
  
      if self.args.save_hf is not None and self.role == "actor":
          from slime.backends.megatron_utils.model import save_hf_model
          save_hf_model(self.args, rollout_id, self.model)
  
-     if self.args.offload_train:
-         destroy_process_groups()
+     # [FIX-#1856] Pause model again after saving to free GPU memory
+     if self.args.offload_train:
+         clear_memory(clear_host_memory=True)
+         destroy_process_groups()
+         torch_memory_saver.pause()        # Release GPU memory again

Verification

Test environment:

Model: Qwen3.5-4B (32 layers, 2560 hidden, TP=2)
Hardware: 2× NVIDIA H200 (140GB each)
Config: --colocate --advantage-estimator grpo --save-interval 1 --offload
Result: ✅ Checkpoint saved successfully at every rollout without errors

Log excerpt:

[36m(MegatronTrainRayActor pid=12281)[0m   [2026-05-04 13:58:44.399834] successfully saved checkpoint from iteration       1 to /data/output/checkpoints/ [ t 1/2, p 1/1 ]

Files Changed

slime/backends/megatron_utils/actor.py: Complete save_model() offload lifecycle (+6 lines)

Impact Analysis

✅ Minimal change: Only modifies save_model() method, no changes to train() or offload_train()
✅ Preserves all optimizations: Auto-sleep in train() remains active for memory efficiency
✅ No side effects: Only affects offload_train=True scenario with periodic saves
✅ Backwards compatible: All existing configurations work unchanged

Testing

Required: Please add run-ci-ckpt label (validates checkpoint save path)
Optional: run-ci-short for broader coverage

📚 Related Issues

Fixes: [Question] Checkpoint save fails with --colocate + --save-interval after #1856 — am I missing anything? #1886 (Checkpoint save fails with --colocate + --save-interval after refactor/ppo #1856)
Regression from: PR refactor/ppo #1856 (75af5297) - "refactor/ppo"
Affects: All users of --colocate with offload_train=True who need periodic checkpoint saves
Impact severity: High (training progress lost, cannot resume from checkpoints)
Workaround: Disable --colocate or remove --save-interval (not practical for long training runs)

…HUDM#1856 regression) When using --colocate with offload_train=True, train() calls self.sleep() which pauses the model via torch_memory_saver.pause(). The subsequent save_model() only rebuilds process groups but fails to resume the model, causing CUDA error during checkpoint save. This fix completes the save_model() offload lifecycle: - Add torch_memory_saver.resume() before save to wake up paused model - Add clear_memory() before save to ensure sufficient memory - Add clear_memory(clear_host_memory=True) after save for cleanup - Add torch_memory_saver.pause() after save to free GPU memory Fixes checkpoint save crash for all users of --colocate with --save-interval. Tested: GRPO training on Qwen3.5-4B with 2x H200 GPUs, checkpoints save successfully.

feji3769 · 2026-05-04T15:35:37Z

Thanks for the fix! This solution lets me save successfully.

lilei199908 · 2026-05-05T13:21:20Z

Thanks for the fix!

Procrastinatorrrr · 2026-05-05T14:32:54Z

CI Failure Summary

Both CI failures are pre-existing issues unrelated to this PR's changes.

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

Error: ValueError: Cannot merge two lists with different lengths (81 and 52)
Location: actor.py:84 → init() → load_checkpoint() (checkpoint loading, not saving)

Why Unrelated:
This PR only modifies:

save_model() — adds torch_memory_saver.resume/pause around save operation
train() — moves sleep() into critic branch

Neither change affects init(), load_checkpoint(), or optimizer state structure.

The error occurs during Phase 2 (checkpoint loading) of test_qwen3_4B_ckpt.py.

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`

Error: assert 'self.rollout_data_postprocess(self.args)' in <actor.py source> — assertion failed

Root Cause: Signature mismatch between test expectation and actual code:

	Test Expects	Actual Code
Call	`self.rollout_data_postprocess(self.args)`	`self.rollout_data_postprocess(self.args, rollout_id, rollout_data)`
Params	`(args,)`	`(args, rollout_id, rollout_data)`

Why Unrelated: This PR does not modify rollout_data_postprocess or its call site. The actual code at actor.py:467 already passes 3 parameters, suggesting this hook's signature was changed in a prior commit (possibly PR #1856 or later) without updating the contract test.

Request

@maintainer Could you please verify if both tests pass on the current main branch? If they also fail on main, this confirms these are pre-existing regressions.

lilei199908 · 2026-05-05T14:34:53Z

CI Failure Summary

Both CI failures are pre-existing issues unrelated to this PR's changes.

1. run-ci-ckpt (Checkpoint Save/Load Test)

Error: ValueError: Cannot merge two lists with different lengths (81 and 52) Location: actor.py:84 → init() → load_checkpoint() (checkpoint loading, not saving)

Why Unrelated: This PR only modifies:

save_model() — adds torch_memory_saver.resume/pause around save operation

train() — moves sleep() into critic branch

Neither change affects init(), load_checkpoint(), or optimizer state structure.

The error occurs during Phase 2 (checkpoint loading) of test_qwen3_4B_ckpt.py.

2. test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]

Error: assert 'self.rollout_data_postprocess(self.args)' in <actor.py source> — assertion failed

Root Cause: Signature mismatch between test expectation and actual code:

Test Expects Actual Code
Call self.rollout_data_postprocess(self.args) self.rollout_data_postprocess(self.args, rollout_id, rollout_data)
Params (args,) (args, rollout_id, rollout_data)
Why Unrelated: This PR does not modify rollout_data_postprocess or its call site. The actual code at actor.py:467 already passes 3 parameters, suggesting this hook's signature was changed in a prior commit (possibly PR #1856 or later) without updating the contract test.

Request

@maintainer Could you please verify if both tests pass on the current main branch? If they also fail on main, this confirms these are pre-existing regressions.

yes，these are pre-existing regressions and i'am trying to slove it. thanks

* temp save rfc Signed-off-by: SamitHuang <285365963@qq.com> * add plan Signed-off-by: SamitHuang <285365963@qq.com> * update Signed-off-by: SamitHuang <285365963@qq.com> * [docker] remove true on policy patches (THUDM#1661) Co-authored-by: Copilot <copilot@github.com> * [fix]: Qwen3.5-35B-A3B 8-GPU: set TP size to 2 for num_query_groups=2 (THUDM#1662) * Remove FSDP support (THUDM#1664) Co-authored-by: Copilot <copilot@github.com> * docs: add OpenClaw-RL to projects built upon slime (THUDM#1635) * qwen2.5 0.5b non-colocate (first attempt ok, but nccl error later) Signed-off-by: samithuang <285365963@qq.com> * add convert script * add setup doc * Support setting update weights in sglang_config (THUDM#1665) Co-authored-by: Copilot <copilot@github.com> * fix nccl error by NcclBridge subprocess * eliminate gpu to cpu weight transfer Signed-off-by: samithuang <285365963@qq.com> * Revise weight synchronization strategy in goal plan Reorder weight synchronization support for colocate and non-colocate scenarios in the goal plan. * [fix] Fix numerical accuracy issue in dynamic sampling filter (THUDM#1674) * sync from internal (THUDM#1677) Co-authored-by: Copilot <copilot@github.com> * bugfixes from community (THUDM#1678) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: yueming-yuan <yym022502@gmail.com> Co-authored-by: coding-famer <chenhegu0109@gmail.com> * Fix: pass return_tensors in text_kwargs for transformers>=5.0.0 compatibility (THUDM#1648) * Fix missing packed_seq_params in bshd qkv_format (THUDM#1649) * [Multimodal][Model] Qwen3.5 VL training example/support (THUDM#1676) * update docs (THUDM#1680) Co-authored-by: Copilot <copilot@github.com> * update docs (THUDM#1681) Co-authored-by: Copilot <copilot@github.com> * support offloading non-updatable server (THUDM#1668) Co-authored-by: Copilot <copilot@github.com> * bugfix (THUDM#1685) Co-authored-by: Copilot <copilot@github.com> * fix: handle Qwen3.5 in quantize_params_fp8 (THUDM#1683) * bugfix (THUDM#1687) Co-authored-by: Copilot <copilot@github.com> * Fix Qwen3.5 & Qwen3-Next linear attention cu_seqlens missing (THUDM#1686) Co-authored-by: benyi <huangliangmeng.hlm@alibaba-inc.com> * fix: use semantic version comparison for PyTorch >= 2.6 detection (THUDM#1667) * [Fix] Minor fix for properly finishing / flushing wandb logging metrics at exit (THUDM#1592) Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> * Autofix/issue 1578 hf2megatron arg suffix (THUDM#1636) * bugfix (THUDM#1688) Co-authored-by: Copilot <copilot@github.com> * fix(examples): update strands_sglang example to v0.3.x API (THUDM#1684) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * [docker] cherry pick qwen3.5 bugfix (THUDM#1691) Co-authored-by: Copilot <copilot@github.com> * bugfix/fix Qwen3.5 dense model precision bug in TP_SIZE>1 from sglang (THUDM#1705) * Fix/qwen3 5 mtp bridge (THUDM#1702) Co-authored-by: benyi <huangliangmeng.hlm@alibaba-inc.com> * support epd for glm4.6v (THUDM#1704) * [docker] support epd for glm4.6v (THUDM#1707) Co-authored-by: Copilot <copilot@github.com> * remove script * [docker] store v0.5.9 patch (THUDM#1710) Co-authored-by: Copilot <copilot@github.com> * Add GLM-4.7-Flash MTP training support (THUDM#1712) * [release] bump to v0.2.3 (THUDM#1682) Co-authored-by: Copilot <copilot@github.com> * feat: add GLM-4.6V MoE VL bridge with CP support (THUDM#1715) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix: resolve rope_theta from rope_parameters dict in HF config validation (THUDM#1720) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * [docker] patches for glm4.6v, kimi k2.5 and dsa cp only (THUDM#1722) Co-authored-by: Copilot <copilot@github.com> * [docker] support IndexCache * Fix CUDA IPC cache leaks during weight updates (THUDM#1731) Co-authored-by: Copilot <copilot@github.com> * [docker] update megatron (THUDM#1729) Co-authored-by: Copilot <copilot@github.com> * [docker] Fix IndexCache with mla model (THUDM#1736) Co-authored-by: Copilot <copilot@github.com> * [slime-router] support pd disaggregation and remove radix tree middleware (THUDM#1735) * Fix glm4v megatron bridge (THUDM#1738) Co-authored-by: Copilot <copilot@github.com> * [docker] update sglang patch (THUDM#1743) Co-authored-by: Copilot <copilot@github.com> * feat: GLM4V multimodal support improvements (THUDM#1745) Co-authored-by: Copilot <copilot@github.com> * feat: placeholder worker type, metrics router, and GPQA letter range (THUDM#1746) Co-authored-by: Copilot <copilot@github.com> * always enable_metrics and remove dp context (THUDM#1747) Co-authored-by: Copilot <copilot@github.com> * fix: resolve SP/CP gradient inflation in FLA (linear attention) layers (THUDM#1748) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Update MTP example configs, rename GLM-4.5 to GLM-4.7, clean scripts (THUDM#1749) Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Support qwen3.5 loss mask for multi-turn SFT (THUDM#1742) Co-authored-by: benyi <huangliangmeng.hlm@alibaba-inc.com> * fix: propagate moe_token_dispatcher_type in bridge model provider (THUDM#1737) * fix: resolve rope_theta from rope_parameters in DeepseekV32Bridge (THUDM#1734) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * chore: translate remaining Chinese comments to English (THUDM#1726) * feat: add Qwen3.5-4B model support (THUDM#1721) * fix: http_utils. disable system proxy for internal SGLang httpx clients (THUDM#1714) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: auto-detect GPUs in qwen3-4b script (THUDM#1700) * fix: quote `$MOE_LAYER_FREQ` (THUDM#1689) * disable router health_check and allow prompt_data is None (THUDM#1751) Co-authored-by: Copilot <copilot@github.com> * Router for vllm (#5) * Draft router design Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Add vllm router Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Add router to script Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix gpu memory utilization Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix output token ids Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Add more nccl flag Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix bug Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> --------- Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * small fix on qwen3-235b-a22b launch script (THUDM#1719) * sync internal bugfix (THUDM#1765) Co-authored-by: Copilot <copilot@github.com> * Fix uploading sglang metrics to wandb (THUDM#1768) Co-authored-by: Copilot <copilot@github.com> * use zhuzilin/sgl-router for sglang-router (THUDM#1770) Co-authored-by: Copilot <copilot@github.com> * [docker] update sgl-router (THUDM#1772) Co-authored-by: Copilot <copilot@github.com> * [Multimodal] Add Multimodal OPD support (THUDM#1760) * refactor: remove slime router (THUDM#1773) Co-authored-by: Copilot <copilot@github.com> * Add rollout trace timeline viewer (THUDM#1776) Co-authored-by: Hanyu Zhang <hanyu.zhang@aminer.cn> * [Fix] Fix duplicate Megatron LR scheduler resume when optimizer state is not loaded (THUDM#1775) * Support FP8 conversion for Qwen3.5 (THUDM#1769) * fix typo (THUDM#1759) Co-authored-by: shiqirui <shiqirui@kupasai.com> * [Fix]Fix some bugs/clean up (THUDM#1756) * (fix):not have encoder_only attr cause run failed (THUDM#1741) Co-authored-by: wangch <wangch@wangchdeMacBook-Air.local> * update docs * remove redundant envvar * some minor cleanup * [release] bump to v0.2.4 (THUDM#1777) Co-authored-by: Copilot <copilot@github.com> * Plan refactor vllm/sglang Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Code implemented Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix bug Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix bug Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix bug Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix port Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix config Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix bug MOE weight sync * Fix bug vllm transfer weight Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix weight sync Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix config Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Change name config Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * pass critic role through to create RayTrainGroup (THUDM#1797) * fix qwen3.5 397B converting error when enable expert parallel (THUDM#1799) Co-authored-by: 周鹤云 <zhouheyun@xiaohongshu.com> * fix(geo3k-vlm-sft): remove --apply-chat-template from SFT launch script (THUDM#1791) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Add host memory metrics to available_memory function (THUDM#1764) * [WIP] fix loss oom (THUDM#1788) Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * sync from internal (THUDM#1805) * sync from internal (THUDM#1807) * feat: add npu patch for qwen3-vl-8b grpo & ppo (THUDM#1750) Signed-off-by: cjy0x <isjunyi.chen@gmail.com> Co-authored-by: shiyuan680 <917935075@qq.com> Co-authored-by: PengchengShi00 <spc117369@gmail.com> * fix missing position_ids in log-prob forward step (THUDM#1809) * feat: add support for including missing weights from origin HF checkp… (THUDM#1812) * [Fix] Initialize grad_norm before found_inf skip path (THUDM#1762) * [conda] Add install custom sgl-router to build_conda.sh (THUDM#1813) * Revert no_grad for entropy to prevent comm stuck in dsa (THUDM#1822) * Add fallback for get_seqlen_balanced_partitions (THUDM#1823) * Resolve review Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Try colocated vllm weight Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * docs: add Relax to notable projects in README (THUDM#1834) * Bugfix: use cpu instead of cuda in convert_torch_dist_to_hf.py when --add-missing-from-origin-hf is set (THUDM#1828) * [fix] eval sample logging when sample is a list (THUDM#1836) * [Draft] Local runable dev Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * [Fix] Fix cuda-python pin in build_conda.sh (THUDM#1827) * fix entropy bug and update code (THUDM#1846) * Revert "Add fallback for get_seqlen_balanced_partitions" (THUDM#1848) * fix (THUDM#1849) * Fix offload train Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Add support for NVIDIA DGX Spark (GB10 / sm_121a, arm64) (THUDM#1835) * Fix offload train Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix offload_rollout Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix vllm offload Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix offload traing Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix offload weight Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix offload weight Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * refactor/ppo (THUDM#1856) * [docker] cleanup sglang patch (THUDM#1859) * [docker] update v0.5.9 patch * Rename critic config to megatron config (THUDM#1866) * [Fix] Use Ray ObjectRef await instead of asyncio.to_thread in distributed POST (THUDM#1873) * chore: include length context in slice_log_prob_with_cp assert (THUDM#1862) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [docker] upgrade megatron to 1dcf0dafa (THUDM#1867) * fix ppo value head load bugs (THUDM#1878) * [docker] upgrade sglang to v0.5.10.post1 (THUDM#1874) * [docs] update docs * [docker] update megatron-bridge and add qwen3.6 tests (THUDM#1884) * fix lint * Fix(checkpoint): add resume/pause in save_model() for offload_train (fixes THUDM#1886) (THUDM#1888) * fix ppo value offload bugs (THUDM#1882) * fix qwen3.6 hf config validation bug (THUDM#1889) * Add missing metrics to log (THUDM#1890) * fix(qwen3_next): use torch.get_default_dtype() — get_current_dtype do… (THUDM#1883) Co-authored-by: yeqinghe <yeqinghe@MacBook-Pro-6.local> * Fix location error in install script (THUDM#1877) * Only allow --allgather-cp for DSA model (THUDM#1891) * Migrate internal feature (THUDM#1897) * [Fix] Fix distributed POST actor concurrency split (THUDM#1880) Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> * Fix CI: update rollout_data_postprocess plugin contract for new call site (THUDM#1902) Co-authored-by: jingshenghang <shenghang.jing@aminer.cn> * Patch Megatron TP grad coalesce to chunked all-reduce (THUDM#1899) * fix: harden retool rollout against multi-turn / retry desync (THUDM#1861) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix log file Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix import engine group Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix rebase code Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> --------- Signed-off-by: SamitHuang <285365963@qq.com> Signed-off-by: samithuang <285365963@qq.com> Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> Signed-off-by: cjy0x <isjunyi.chen@gmail.com> Co-authored-by: SamitHuang <285365963@qq.com> Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com> Co-authored-by: Copilot <copilot@github.com> Co-authored-by: none0663 <none0663@outlook.com> Co-authored-by: Yinjie Wang <yinjie@uchicago.edu> Co-authored-by: Fengqing Jiang <43953876+Django-Jiang@users.noreply.github.com> Co-authored-by: yueming-yuan <yym022502@gmail.com> Co-authored-by: coding-famer <chenhegu0109@gmail.com> Co-authored-by: Lawrence Wu <lawrence.wu@harmonic.fun> Co-authored-by: huang3eng <huang3eng@gmail.com> Co-authored-by: benyi <huangliangmeng.hlm@alibaba-inc.com> Co-authored-by: Aaron Batilo <AaronBatilo@gmail.com> Co-authored-by: Silun Wang <igeekwang@gmail.com> Co-authored-by: Chengxing Xie <91449279+yitianlian@users.noreply.github.com> Co-authored-by: Yuan He <33579950+Lawhy@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Mor Zusman <mor.zusmann@gmail.com> Co-authored-by: append-only <shw20010329@163.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Nan Jiang <59716405+nanjiangwill@users.noreply.github.com> Co-authored-by: Xuan Wang <49010704+stevewx@users.noreply.github.com> Co-authored-by: Hubert Wang <huberthyw@gmail.com> Co-authored-by: Hou Shihao <shhou007@gmail.com> Co-authored-by: DongzhuoranZhou <110855293+DongzhuoranZhou@users.noreply.github.com> Co-authored-by: Ailuntz <130897222+ailuntz@users.noreply.github.com> Co-authored-by: Zhuohao Li <garrick0508@gmail.com> Co-authored-by: Hanyu Zhang <hanyu.zhang@aminer.cn> Co-authored-by: Kang Yu <kangy.me@gmail.com> Co-authored-by: peterjc123 <peter_jiachen@163.com> Co-authored-by: qrskannbara <94727257+albaNnaksqr@users.noreply.github.com> Co-authored-by: shiqirui <shiqirui@kupasai.com> Co-authored-by: wangyufak <wangch9@xiaopeng.com> Co-authored-by: wangch <wangch@wangchdeMacBook-Air.local> Co-authored-by: Xintong Li <znculee@gmail.com> Co-authored-by: TM <tianmingxu.tmxu@gmail.com> Co-authored-by: 周鹤云 <zhouheyun@xiaohongshu.com> Co-authored-by: LiLei <77353389+lilei199908@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: cjy0x <isjunyi.chen@gmail.com> Co-authored-by: shiyuan680 <917935075@qq.com> Co-authored-by: PengchengShi00 <spc117369@gmail.com> Co-authored-by: 杨睿 <595403043@qq.com> Co-authored-by: Mathew Han <49226490+mathewjhan@users.noreply.github.com> Co-authored-by: haoxuanJIA <116806014+boots-coder@users.noreply.github.com> Co-authored-by: ryang <38470282+ryang-max@users.noreply.github.com> Co-authored-by: Leo Fan <84952531+leofan-lab@users.noreply.github.com> Co-authored-by: Long Yijun <156500868+Procrastinatorrrr@users.noreply.github.com> Co-authored-by: HeatherLiuzh <heather996lzh@gmail.com> Co-authored-by: yeqinghe <yeqinghe@MacBook-Pro-6.local> Co-authored-by: tao W <122036357+selfanti@users.noreply.github.com> Co-authored-by: jingshenghang <48083555+jingshenghang@users.noreply.github.com> Co-authored-by: jingshenghang <shenghang.jing@aminer.cn>

lilei199908 added the run-ci-ckpt label May 5, 2026

lilei199908 reviewed May 5, 2026

View reviewed changes

Comment thread slime/backends/megatron_utils/actor.py Outdated

lilei199908 reviewed May 5, 2026

View reviewed changes

Comment thread slime/backends/megatron_utils/actor.py Outdated

refactor: use wake_up()/sleep() for cleaner save_model

d304ad2

zhuzilin approved these changes May 6, 2026

View reviewed changes

zhuzilin merged commit 16924b6 into THUDM:main May 6, 2026

FortPercent mentioned this pull request May 8, 2026

[Bug] CUresult error 1 (invalid argument) func=free line=81 in actor.sleep() during offload_train (regression from #1856) #1895

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix(checkpoint): add resume/pause in save_model() for offload_train (fixes #1886)#1888

Fix(checkpoint): add resume/pause in save_model() for offload_train (fixes #1886)#1888
zhuzilin merged 2 commits into
THUDM:mainfrom
Procrastinatorrrr:fix/checkpoint-save-crash-offload

Procrastinatorrrr commented May 4, 2026

Uh oh!

feji3769 commented May 4, 2026

Uh oh!

lilei199908 commented May 5, 2026

Uh oh!

Procrastinatorrrr commented May 5, 2026

Uh oh!

lilei199908 commented May 5, 2026

CI Failure Summary

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`

Request

CI Failure Summary

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`

Request

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Procrastinatorrrr commented May 4, 2026

🐛 Bug Summary

Error

Root Cause

Fix

Verification

Files Changed

Impact Analysis

Testing

📚 Related Issues

Uh oh!

feji3769 commented May 4, 2026

Uh oh!

lilei199908 commented May 5, 2026

Uh oh!

Procrastinatorrrr commented May 5, 2026

CI Failure Summary

1. run-ci-ckpt (Checkpoint Save/Load Test)

2. test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]

Request

Uh oh!

lilei199908 commented May 5, 2026

CI Failure Summary

1. run-ci-ckpt (Checkpoint Save/Load Test)

2. test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]

Request

CI Failure Summary

1. run-ci-ckpt (Checkpoint Save/Load Test)

2. test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]

Request

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`

1. `run-ci-ckpt` (Checkpoint Save/Load Test)

2. `test_plugin_runtime_hook_contracts::test_runtime_hook_callsite_is_stable[rollout_data_postprocess]`