Rebase to latest slime#10
Conversation
Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: samithuang <285365963@qq.com>
Co-authored-by: Copilot <copilot@github.com>
Add rollout backend client and test qwen2.5-0.5b non-colocate training
Signed-off-by: samithuang <285365963@qq.com>
Eliminate intermediate CPU tensors
Reorder weight synchronization support for colocate and non-colocate scenarios in the goal plan.
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com> Co-authored-by: yueming-yuan <yym022502@gmail.com> Co-authored-by: coding-famer <chenhegu0109@gmail.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
…#1862) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
THUDM#1883) Co-authored-by: yeqinghe <yeqinghe@MacBook-Pro-6.local>
Co-authored-by: Zilin Zhu <zhuzilinallen@gmail.com>
…site (THUDM#1902) Co-authored-by: jingshenghang <shenghang.jing@aminer.cn>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces support for NVIDIA GB10 and NPU backends, refactors the rollout system to integrate vLLM, and optimizes loss computation for context-parallel training. It also includes extensive documentation for role-based training overrides and advanced engine deployments. Feedback points to a potential TypeError when using non-reentrant checkpointing, a possible crash in advantage estimation if all log-probability sources are null, and an opportunity to reduce memory pressure by avoiding global tensor allocations during token redistribution.
|
|
||
| if args.recompute_loss_function: | ||
| loss, log = checkpoint(func, args, batch, logits, sum_of_sample_mean) | ||
| loss, log = checkpoint(func, args, batch, logits, sum_of_sample_mean, use_reentrant=False) |
There was a problem hiding this comment.
The checkpoint function from megatron.core.tensor_parallel (which is typically what is used here) does not support use_reentrant as a keyword argument in many versions of Megatron-LM. Passing it here may cause a TypeError. If you are using a custom wrapper or a specific version that supports it, please ensure compatibility. Otherwise, it is safer to pass arguments positionally or verify the signature of the imported checkpoint function.
| xs = log_probs or rollout_log_probs or values | ||
| kl = [torch.zeros_like(x, dtype=torch.float32, device=x.device) for x in xs] |
There was a problem hiding this comment.
If log_probs, rollout_log_probs, and values are all None, xs will be None, and the list comprehension on the next line will raise a TypeError. While there is an early return check at the beginning of compute_advantages_and_returns, it only checks log_probs and values. If args.use_rollout_logprobs is true and rollout_log_probs is available but log_probs and values are None, this path might still be reached. Consider adding an explicit check for xs or ensuring the early return covers all cases.
References
- Enforce defensive programming: ensure appropriate null/nil/None checks exist before object property accesses or iterations.
| def _build_shifted_tokens( | ||
| T: int, | ||
| device: torch.device, | ||
| unconcat_tokens: list[torch.Tensor], | ||
| total_lengths: list[int], | ||
| response_lengths: list[int], | ||
| qkv_format: str, | ||
| max_seq_lens: list[int] | None, | ||
| allgather_cp: bool, | ||
| ) -> torch.Tensor: |
There was a problem hiding this comment.
When allgather_cp is enabled, T_global represents the total sequence length across all CP ranks. Allocating full_tokens with T_global on every rank may lead to high memory pressure or OOM if the global sequence length is extremely large (e.g., millions of tokens), even though it only stores int64. Since this is immediately sliced to the local chunk size T, consider if there is a way to construct only the local chunk of shifted tokens directly to save memory.
No description provided.