elegantly prevent data re-put within DP #50
Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Pull request overview
This PR refactors the data collection logic in the distributed parallelism (DP) framework to prevent redundant data re-put operations. It moves the logic for determining whether to collect data from individual workers to the decorator level, making it more centralized and elegant.
- Introduces `_compute_need_collect()` function to determine collection necessity based on dispatch mode and worker state
- Removes the `collect_from_rank` kwarg-based approach in favor of inspecting dispatch mode configuration
- Simplifies `collect_nd_compute_dataproto` by removing the `collect_mask` parameter
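The decorator-level decision described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ToyWorker`, `EmptyMeta`, `collect_lazy`, `should_collect`, and `bridge` are hypothetical stand-ins, and the sketch assumes the dispatch mode is a dict whose `collect_fn` entry is a `functools.partial` carrying the mesh name as its first bound argument.

```python
import functools


class EmptyMeta:
    """Hypothetical placeholder returned when a worker should not collect."""


class ToyWorker:
    """Hypothetical worker that knows which meshes it collects for."""

    def __init__(self, collect_meshes):
        self.collect_meshes = set(collect_meshes)

    def should_collect(self, mesh_name):
        return mesh_name in self.collect_meshes


def collect_lazy(mesh_name, outputs):
    """Hypothetical lazy collect function; the mesh name is bound via partial."""
    return outputs


def compute_need_collect(dispatch_mode, args):
    # Without a dict-style dispatch mode, always collect.
    if not isinstance(dispatch_mode, dict):
        return True
    collect_fn = dispatch_mode["collect_fn"]
    base_fn = getattr(collect_fn, "func", collect_fn)
    # Only the lazy collect path on a worker method can skip collection.
    if base_fn is not collect_lazy or not args or not isinstance(args[0], ToyWorker):
        return True
    mesh_name = collect_fn.args[0]
    return args[0].should_collect(mesh_name)


def bridge(dispatch_mode=None):
    """Decorator that decides centrally whether the output is collected."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            need_collect = compute_need_collect(dispatch_mode, args)
            output = func(*args, **kwargs)
            return output if need_collect else EmptyMeta()

        return wrapper

    return decorator


mode = {"collect_fn": functools.partial(collect_lazy, "actor")}


class Demo(ToyWorker):
    @bridge(dispatch_mode=mode)
    def step(self, x):
        return x * 2


collecting = Demo(["actor"]).step(3)
skipping = Demo([]).step(3)
```

In this sketch the worker method no longer needs a per-call flag: the decision depends only on the dispatch mode and on the worker instance passed as the first argument.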
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| verl/utils/transferqueue_utils.py | Adds _compute_need_collect() helper function, updates tqbridge decorator to accept dispatch_mode parameter, removes manual collect_from_rank kwarg handling, and changes BatchMeta() to BatchMeta.empty() |
| verl/single_controller/base/decorator.py | Removes collect_mask parameter from dispatch/collect functions, passes dispatch_mode to tqbridge, and simplifies the lazy compute dispatch/collect logic |
Comments suppressed due to low confidence (1)
verl/single_controller/base/decorator.py:274
- Call to function collect_nd_compute with too few arguments; should be no fewer than 3.
output = collect_nd_compute(worker_group, output)
    def collect_nd_compute_dataproto(worker_group, output):
        output = collect_nd_compute(worker_group, output)
The function collect_nd_compute at line 254 expects collect_mask as its first parameter, but the call at line 274 only passes worker_group and output. This will cause a TypeError when the function is invoked. The function signature of collect_nd_compute needs to be updated to remove the collect_mask parameter, or you need to retrieve and pass the collect_mask here.
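The failure mode described in this comment can be reproduced in isolation. The two functions below are hypothetical stand-ins for the old and new shapes of the helper, not the code in `decorator.py`; the sketch only shows that calling a three-parameter function with two positional arguments raises `TypeError`.

```python
def collect_nd_compute_old(collect_mask, worker_group, output):
    """Hypothetical pre-refactor shape: keep outputs whose mask entry is true."""
    return [item for item, keep in zip(output, collect_mask) if keep]


def collect_nd_compute_new(worker_group, output):
    """Hypothetical post-refactor shape: no mask parameter."""
    return list(output)


def call_with_two_args(fn):
    # Mirrors the two-argument call site flagged in the review.
    try:
        return fn("worker_group", ["a", "b"])
    except TypeError as exc:
        return f"TypeError: {exc}"


old_result = call_with_two_args(collect_nd_compute_old)
new_result = call_with_two_args(collect_nd_compute_new)
```

Either fix named in the comment resolves the mismatch: drop the parameter from the signature, or pass a mask at the call site.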
    Args:
        dispatch_mode: For controlling data collection logic. If None,
            _compute_need_collect will always return True.
The documentation should explain what happens when dispatch_mode is None vs when it's provided, and clarify what types are accepted for dispatch_mode (should it be a dict or can it also be a Dispatch enum?). The current description is incomplete.
    collect_fn_name = dispatch_mode["collect_fn"].func.__name__
    if collect_fn_name != "collect_lazy_compute_data_proto" or len(args) < 1 or not isinstance(args[0], Worker):
        return True

    collect_mesh_name = dispatch_mode["collect_fn"].args[0]
The condition checks if collect_fn_name != "collect_lazy_compute_data_proto" to return True early. However, this hardcoded function name check is brittle. If the function is renamed or if there are other similar collect functions added in the future, this logic will break. Consider using a more robust approach, such as checking a property/attribute on the collect function or using a registry pattern.
    - collect_fn_name = dispatch_mode["collect_fn"].func.__name__
    - if collect_fn_name != "collect_lazy_compute_data_proto" or len(args) < 1 or not isinstance(args[0], Worker):
    -     return True
    - collect_mesh_name = dispatch_mode["collect_fn"].args[0]
    + collect_fn = dispatch_mode["collect_fn"]
    + base_fn = getattr(collect_fn, "func", collect_fn)
    + # Prefer an explicit attribute on the collect function, fall back to name-based check.
    + is_lazy_collect = getattr(
    +     base_fn,
    +     "is_lazy_compute_data_proto",
    +     base_fn.__name__ == "collect_lazy_compute_data_proto",
    + )
    + if not is_lazy_collect or len(args) < 1 or not isinstance(args[0], Worker):
    +     return True
    + collect_mesh_name = collect_fn.args[0]
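For the attribute-based check to take effect, something has to set the attribute on the collect function. One way to do that, sketched here under assumptions, is a small marker decorator. The names `lazy_collect`, `collect_lazy_renamed`, and `is_lazy_collect` are hypothetical; only the attribute name `is_lazy_compute_data_proto` comes from the suggestion.

```python
import functools


def lazy_collect(fn):
    """Hypothetical marker decorator: tags a collect function as lazy."""
    fn.is_lazy_compute_data_proto = True
    return fn


@lazy_collect
def collect_lazy_renamed(mesh_name, worker_group, output):
    """Stand-in collect function whose name no longer matches the old string."""
    return output


def is_lazy_collect(collect_fn):
    # Unwrap functools.partial (which exposes .func) before reading the marker.
    base_fn = getattr(collect_fn, "func", collect_fn)
    return getattr(base_fn, "is_lazy_compute_data_proto", False)


bound = functools.partial(collect_lazy_renamed, "actor")
marked = is_lazy_collect(bound)
unmarked = is_lazy_collect(len)
```

With a marker like this, renaming the collect function does not change the outcome of the check, which is the brittleness the comment raises.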
    - elif collect_from_rank == False:
    -     return BatchMeta()
    + elif not need_collect:
    +     return BatchMeta.empty()
The code now uses BatchMeta.empty() instead of BatchMeta(). Verify that the BatchMeta class from the transfer_queue library actually has an empty() class method. If it doesn't exist, this will cause an AttributeError at runtime.
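A check along these lines could be used to confirm the method before relying on it. The sketch uses two hypothetical stand-in classes rather than the real `BatchMeta`, because whether `transfer_queue` provides `BatchMeta.empty()` is exactly what the comment asks the author to verify.

```python
class MetaWithEmpty:
    """Hypothetical class that provides an empty() class method."""

    @classmethod
    def empty(cls):
        return cls()


class MetaWithoutEmpty:
    """Hypothetical class that only supports the bare constructor."""


def make_empty(meta_cls):
    # Use empty() when the class offers it, otherwise fall back to the constructor.
    factory = getattr(meta_cls, "empty", None)
    if callable(factory):
        return factory()
    return meta_cls()


with_empty = make_empty(MetaWithEmpty)
without_empty = make_empty(MetaWithoutEmpty)
```

A fallback like this hides a version mismatch instead of surfacing it, so a plain `hasattr` assertion in a test may be the better choice if the library version is pinned.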
    return BatchMeta()
    return output
    elif not need_collect:
    return BatchMeta.empty()
The code now uses BatchMeta.empty() instead of BatchMeta(). Verify that the BatchMeta class from the transfer_queue library actually has an empty() class method. If it doesn't exist, this will cause an AttributeError at runtime.
df06822 into TransferQueue:han/optimize_tq_collect