Rebase to latest slime#12
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Add rollout backend client and test qwen2.5-0.5b non-colocate training
Eliminate intermediate CPU tensors
Reorder weight synchronization support for colocate and non-colocate scenarios in the goal plan.
* Draft router design
* Add vllm router
* Add router to script
* Fix gpu memory utilization
* Fix output token ids
* Add more nccl flag
* Fix bug

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
…site (THUDM#1902) Co-authored-by: jingshenghang <shenghang.jing@aminer.cn>
Code Review
This pull request introduces vLLM as a first-class rollout backend for Slime, refactoring the codebase to support backend-agnostic requests and responses through new adapter interfaces. Significant additions include a managed vLLM engine lifecycle, a translation sidecar for protocol compatibility, specialized weight synchronization mechanisms for both colocated and distributed deployments, and a Megatron patch for chunked gradient coalescence. Review feedback highlights the need to restore configuration validation logic and remove an unconditional override of the rollout offloading setting. Furthermore, the reviewer suggests replacing hardcoded file paths in the new shell scripts with variables to enhance environment portability.
    # if hasattr(hf_config, hf_config_name) and hasattr(args, megatron_config_name):
    #     if not compare_fn(getattr(hf_config, hf_config_name), getattr(args, megatron_config_name)):
    #         errors.append(
    #             f"{hf_config_name} in hf config {getattr(hf_config, hf_config_name)} is not equal to "
    #             f"{megatron_config_name} {getattr(args, megatron_config_name)}, please check the config."
    #         )
The validation logic comparing Hugging Face config values with Megatron arguments has been commented out. This is risky as it can lead to silent misconfigurations that are hard to debug at runtime. If the validation was causing issues, it should be fixed or adapted rather than completely disabled. Please consider restoring this validation to ensure configuration correctness.
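A minimal sketch of what a restored check could look like, reusing the error message from the commented-out lines. The helper name `collect_config_mismatches`, the `checks` list of `(hf_config_name, megatron_config_name, compare_fn)` triples, and the stand-in attribute values are assumptions for illustration, not the PR's actual code.

```python
import operator
from types import SimpleNamespace


def collect_config_mismatches(hf_config, args, checks):
    """Return one error message per HF/Megatron value pair that disagrees.

    `checks` holds (hf_config_name, megatron_config_name, compare_fn) triples.
    This helper and its signature are hypothetical.
    """
    errors = []
    for hf_config_name, megatron_config_name, compare_fn in checks:
        # Only compare pairs that both sides define, as the original condition does.
        if not (hasattr(hf_config, hf_config_name) and hasattr(args, megatron_config_name)):
            continue
        hf_value = getattr(hf_config, hf_config_name)
        megatron_value = getattr(args, megatron_config_name)
        if not compare_fn(hf_value, megatron_value):
            errors.append(
                f"{hf_config_name} in hf config {hf_value} is not equal to "
                f"{megatron_config_name} {megatron_value}, please check the config."
            )
    return errors


# Stand-in objects; these attribute names and values are examples only.
hf_config = SimpleNamespace(hidden_size=896, num_hidden_layers=24)
args = SimpleNamespace(hidden_size=896, num_layers=12)
checks = [
    ("hidden_size", "hidden_size", operator.eq),
    ("num_hidden_layers", "num_layers", operator.eq),
]
mismatches = collect_config_mismatches(hf_config, args, checks)
```

Collecting all mismatches before failing lets the caller report every disagreement at once. If a specific pair was producing false positives, it could be given a more tolerant `compare_fn` instead of disabling the whole check.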
    args.offload_train = False
    if args.offload_rollout is None:
        args.offload_rollout = False
    args.offload_rollout = False
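The last line sets args.offload_rollout to False unconditionally, which overrides the conditional default directly above it. As noted in the review summary, this override should be removed.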
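A minimal sketch of the behavior the review asks for on the offloading setting: fill in a default only when the value is unset, so that an explicit user choice is kept. The helper name `apply_offload_defaults` and the `SimpleNamespace` stand-ins are assumptions; the surrounding condition in the real argument handling is not shown in the hunk and is not reproduced here.

```python
from types import SimpleNamespace


def apply_offload_defaults(args):
    """Fill in offload settings only where the user left them unset (None).

    Hypothetical helper; it illustrates the defaulting pattern only.
    """
    if args.offload_train is None:
        args.offload_train = False
    if args.offload_rollout is None:
        args.offload_rollout = False
    # No unconditional assignment follows, so an explicit True survives.
    return args


explicit = apply_offload_defaults(SimpleNamespace(offload_train=None, offload_rollout=True))
unset = apply_offload_defaults(SimpleNamespace(offload_train=None, offload_rollout=None))
```

With the unconditional assignment in place, `explicit.offload_rollout` would end up False regardless of what the user passed.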
    PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
        ${MODEL_ARGS[@]} \
        --hf-checkpoint /root/Qwen2.5-0.5B-Instruct \
        --save /root/Qwen2.5-0.5B-Instruct_torch_dist/
This script contains hardcoded paths (e.g., /root/Megatron-LM, /root/Qwen2.5-0.5B-Instruct). This reduces portability and makes it difficult for other users to run the script in different environments. Consider using environment variables or command-line arguments to specify these paths to improve reusability.
    --hf-checkpoint /root/Qwen2.5-0.5B-Instruct/
    --ref-load /root/Qwen2.5-0.5B-Instruct_torch_dist/




No description provided.