fix: auto-detect Ray fanout stages #2025

Open
nightcityblade wants to merge 1 commit into NVIDIA-NeMo:main from nightcityblade:fix/issue-1613-auto-fanout

Conversation

@nightcityblade

Description

Closes #1613.

Automatically marks Ray Data stages as fanout stages when their process return annotation is list[...] or a union that includes list[...]. This lets stages like URLGenerationStage rely on the base ProcessingStage default instead of manually setting is_fanout_stage.

Usage

class MyFanoutStage(ProcessingStage[InputTask, OutputTask]):
    def process(self, task: InputTask) -> list[OutputTask]:
        ...

assert MyFanoutStage().ray_stage_spec() == {"is_fanout_stage": True}

Checklist

  • I am familiar with the Contributing Guide.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Tests:

  • uv run --python 3.12 ruff check nemo_curator/stages/base.py nemo_curator/stages/text/download/base/url_generation.py tests/stages/common/test_base.py
  • uv run --python 3.12 --with pytest --with ray --with loguru --with pandas --with pyarrow --with fsspec python -m pytest tests/stages/common/test_base.py tests/stages/text/download/base/test_url_generation.py (blocked on macOS by NeMo-Curator's Linux-only runtime check)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@nightcityblade nightcityblade requested a review from a team as a code owner May 24, 2026 13:27
@nightcityblade nightcityblade requested review from meatybobby and removed request for a team May 24, 2026 13:27
@copy-pr-bot

copy-pr-bot Bot commented May 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented May 24, 2026

Greptile Summary

This PR auto-detects Ray "fanout" stages by inspecting the return annotation of ProcessingStage.process at class creation time, removing the need for manual ray_stage_spec() overrides on list-returning stages. URLGenerationStage is the first beneficiary, losing its now-redundant override.

  • ProcessingStage.ray_stage_spec() now calls _process_returns_list(), which uses get_type_hints (with a __annotations__ fallback) and _annotation_includes_list to recognise list[T], Union[..., list[T], ...], and T | list[T] return types.
  • URLGenerationStage.ray_stage_spec() is deleted; the base-class logic correctly inherits is_fanout_stage: True from the list[FileGroupTask] annotation.
  • Three new TestProcessingStageRaySpec tests cover single-task, pure-list, and union-list return types.

Confidence Score: 4/5

Safe to merge; the auto-detection logic is correct for all annotated stages in the repo, and the removed override in URLGenerationStage is genuinely redundant.

The core annotation-inspection logic works correctly for the list[T], Union[..., list[T]], and T | list[T] cases tested. A minor silent-failure path exists when get_type_hints throws and the fallback returns a raw string annotation, but no existing stage in the repo triggers that path.

nemo_curator/stages/base.py — specifically the annotations fallback in _process_returns_list.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/base.py | Adds _process_returns_list and _annotation_includes_list helpers; ray_stage_spec() now auto-returns is_fanout_stage: True when the concrete process annotation includes list[...]. Minor redundancy in the union-type guard and a silent-failure edge case in the string-annotation fallback. |
| nemo_curator/stages/text/download/base/url_generation.py | Removes the now-redundant explicit ray_stage_spec override; auto-detection in the base class correctly handles process(...) -> list[FileGroupTask]. |
| tests/stages/common/test_base.py | Adds FanoutProcessingStage, MaybeFanoutProcessingStage, and TestProcessingStageRaySpec covering single-task, list, and union-return cases. |

Reviews (1): Last reviewed commit: "fix: auto-detect Ray fanout stages"

origin = get_origin(annotation)
if origin is list:
    return True
if origin in (UnionType, Union) or isinstance(annotation, UnionType):

P2 The isinstance(annotation, UnionType) branch is unreachable: for any X | Y expression, get_origin(annotation) returns types.UnionType, so origin in (UnionType, Union) is already True before isinstance is evaluated. The isinstance guard can be removed without changing behaviour.

Suggested change
- if origin in (UnionType, Union) or isinstance(annotation, UnionType):
+ if origin in (UnionType, Union):

Comment on lines +313 to +316
try:
    return_annotation = get_type_hints(cls.process).get("return")
except (NameError, TypeError):
    return_annotation = cls.process.__annotations__.get("return")

P2 String annotation fallback silently loses type information

When get_type_hints() raises NameError or TypeError (e.g. when from __future__ import annotations is active and a referenced type is not importable at inspection time), the fallback reads cls.process.__annotations__["return"] which is a raw string such as "list[OutputTask]". get_origin("list[OutputTask]") returns None, so _annotation_includes_list returns False and the stage is not auto-detected as a fanout stage even though it is one. The failure is silent — ray_stage_spec() simply returns {}. Stages that hit this path would need to manually override ray_stage_spec() as before.
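
The failure mode described in this comment can be reproduced in isolation. The sketch below assumes a hypothetical `Stage` class whose return annotation is a string naming an undefined type (`MissingTask`), which forces `get_type_hints` to raise `NameError`; it is not code from the PR.

```python
from typing import get_origin, get_type_hints


class Stage:
    # "MissingTask" is deliberately undefined so that resolution fails.
    def process(self, task: "int") -> "list[MissingTask]":
        return []


try:
    resolved = get_type_hints(Stage.process).get("return")
    resolution_failed = False
except NameError:
    # Same fallback shape as the snippet under review: read the raw annotation.
    resolved = Stage.process.__annotations__.get("return")
    resolution_failed = True

# The fallback value is an unresolved string, and get_origin() returns None
# for a plain string, so a list check based on the origin reports "not a list".
fallback_is_string = isinstance(resolved, str)
origin_of_fallback = get_origin(resolved)
```

Here the fallback yields the string `"list[MissingTask]"` rather than a generic alias, which is why an origin-based check would quietly report a non-fanout stage.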

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers (Waiting on maintainers to respond) label May 26, 2026

@sarahyurick sarahyurick left a comment


Thanks @nightcityblade !

@@ -77,11 +77,6 @@ def process(self, task: _EmptyTask) -> list[FileGroupTask]:
            for i, url in enumerate(urls)
        ]

    def ray_stage_spec(self) -> dict[str, Any]:
        return {
            "is_fanout_stage": True,

Can you do this for all stages to make sure that this PR works for all of them? And ensure that each existing stage has a pytest to check that it is being set?


name = "MaybeFanoutProcessingStage"

def process(self, task: MockTask) -> MockTask | list[MockTask]:

I am undecided what should happen in this case. Maybe it should be up to the user?

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer (Waiting on the original author to respond) and removed waiting-on-maintainers (Waiting on maintainers to respond) waiting-on-customer (Waiting on the original author to respond) labels Jun 2, 2026

Labels

community-request, waiting-on-customer (Waiting on the original author to respond)


Development

Successfully merging this pull request may close these issues.

Automatically detect when IS_FANOUT_STAGE should be set to True

3 participants