fix: auto-detect Ray fanout stages #2025
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile Summary
This PR auto-detects Ray "fanout" stages by inspecting the return annotation of
Confidence Score: 4/5
Safe to merge; the auto-detection logic is correct for all annotated stages in the repo, and the removed override in URLGenerationStage is genuinely redundant.
The core annotation-inspection logic works correctly for the list[T], Union[..., list[T]], and T | list[T] cases tested. A minor silent-failure path exists when get_type_hints raises and the fallback returns a raw string annotation, but no existing stage in the repo triggers that path.
File to watch: nemo_curator/stages/base.py, specifically the annotations fallback in _process_returns_list.
Reviews (1): Last reviewed commit: "fix: auto-detect Ray fanout stages"
    origin = get_origin(annotation)
    if origin is list:
        return True
    if origin in (UnionType, Union) or isinstance(annotation, UnionType):
The isinstance(annotation, UnionType) branch is unreachable: for any X | Y expression, get_origin(annotation) returns types.UnionType, so origin in (UnionType, Union) is already True before isinstance is evaluated. The isinstance guard can be removed without changing behaviour.
Suggested change:
-    if origin in (UnionType, Union) or isinstance(annotation, UnionType):
+    if origin in (UnionType, Union):
    try:
        return_annotation = get_type_hints(cls.process).get("return")
    except (NameError, TypeError):
        return_annotation = cls.process.__annotations__.get("return")
String annotation fallback silently loses type information
When get_type_hints() raises NameError or TypeError (e.g. when from __future__ import annotations is active and a referenced type is not importable at inspection time), the fallback reads cls.process.__annotations__["return"] which is a raw string such as "list[OutputTask]". get_origin("list[OutputTask]") returns None, so _annotation_includes_list returns False and the stage is not auto-detected as a fanout stage even though it is one. The failure is silent — ray_stage_spec() simply returns {}. Stages that hit this path would need to manually override ray_stage_spec() as before.
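The silent failure described here can be reproduced outside the repo. This is a sketch under stated assumptions: `SketchStage`, `MissingTask`, and `read_return_annotation` are hypothetical names, and the try/except mirrors only the fallback quoted in this comment, not the full implementation in base.py.

```python
from typing import get_origin, get_type_hints


class SketchStage:
    # The quoted annotation stands in for what `from __future__ import annotations`
    # would store; MissingTask is deliberately never defined anywhere.
    def process(self, task: object) -> "list[MissingTask]":
        return []


def read_return_annotation(func):
    """Resolve the return hint, falling back to the raw annotation on failure."""
    try:
        return get_type_hints(func).get("return")
    except (NameError, TypeError):
        # The fallback yields whatever was stored, here an unevaluated string.
        return func.__annotations__.get("return")


annotation = read_return_annotation(SketchStage.process)

# A plain string has no generic origin, so a list check based on
# get_origin() cannot recognise it.
print(type(annotation).__name__, get_origin(annotation))
```

One possible mitigation, not proposed in the PR, would be to log a warning when the fallback produces a string so that the missed detection is at least visible.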
sarahyurick left a comment
Thanks @nightcityblade!
@@ -77,11 +77,6 @@ def process(self, task: _EmptyTask) -> list[FileGroupTask]:
             for i, url in enumerate(urls)
         ]
 
-    def ray_stage_spec(self) -> dict[str, Any]:
-        return {
-            "is_fanout_stage": True,
Can you do this for all stages to make sure that this PR works for all of them? And ensure that each existing stage has a pytest to check that it is being set?
    name = "MaybeFanoutProcessingStage"

    def process(self, task: MockTask) -> MockTask | list[MockTask]:
I am undecided what should happen in this case. Maybe it should be up to the user?
Description
Closes #1613.
Automatically marks Ray Data stages as fanout stages when their `process` return annotation is `list[...]` or a union that includes `list[...]`. This lets stages like `URLGenerationStage` rely on the base `ProcessingStage` default instead of manually setting `is_fanout_stage`.
Usage
Checklist
Tests:
uv run --python 3.12 ruff check nemo_curator/stages/base.py nemo_curator/stages/text/download/base/url_generation.py tests/stages/common/test_base.py
uv run --python 3.12 --with pytest --with ray --with loguru --with pandas --with pyarrow --with fsspec python -m pytest tests/stages/common/test_base.py tests/stages/text/download/base/test_url_generation.py (blocked on macOS by NeMo-Curator's Linux-only runtime check)