
Add ray data for image #1610

Merged
oyilmaz-nvidia merged 6 commits into NVIDIA-NeMo:main from oyilmaz-nvidia:onur/ray-data-for-image
Mar 23, 2026

Conversation

@oyilmaz-nvidia
Contributor

@oyilmaz-nvidia oyilmaz-nvidia commented Mar 16, 2026

Description

This PR adds a few changes to test and benchmark Ray Data for image workflows.

Fanouts

Why IS_FANOUT_STAGE on ImageReaderStage:

ImageReaderStage.process() returns list[ImageBatch] — for each .tar file it reads, DALI may produce multiple batches. In Ray Data, all those batches from one tar end up in the same block after map_batches. Without IS_FANOUT_STAGE, all of them get sent to the same downstream embedding actor, killing parallelism. The flag triggers repartition(target_num_rows_per_block=1) in the adapter, splitting them into individual blocks so each ImageBatch can be picked up by any available ImageEmbeddingStage actor independently.

It's the same reason FilePartitioningStage has it — it also returns a list[FileGroupTask].
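Conceptually, the adapter's handling of the flag works like this (a minimal pure-Python sketch; the `map_batches`/`repartition` behavior is simulated and all helper names here are hypothetical stand-ins, not the real adapter API):

```python
# Simulation of why IS_FANOUT_STAGE matters. Ray Data itself is not used;
# these helpers only mimic its blocking behavior.

def map_batches(blocks, fn):
    # Like Ray Data's map_batches: all outputs produced from one input
    # block stay together in the same output block.
    return [fn(block) for block in blocks]

def repartition_one_row_per_block(blocks):
    # Like repartition(target_num_rows_per_block=1): split every block
    # into single-item blocks so each can be scheduled independently.
    return [[item] for block in blocks for item in block]

def read_tar(tar_name):
    # Stand-in for ImageReaderStage.process(): one tar -> N batches.
    return [f"{tar_name}-batch{i}" for i in range(4)]

tars = ["a.tar", "b.tar"]
blocks = map_batches(tars, read_tar)
assert len(blocks) == 2  # without fanout: only 2 downstream tasks

blocks = repartition_one_row_per_block(blocks)
assert len(blocks) == 8  # with fanout: 8 independently schedulable batches
```

Without the repartition step, at most one embedding actor per tar file gets work; with it, every `ImageBatch` becomes its own block.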

Should You Add More Fanouts?

For the standard pipeline, no.

Every other image stage returns a single task — confirmed:

| Stage | process() return type | Fanout needed? |
| --- | --- | --- |
| ImageReaderStage | list[ImageBatch] | ✅ added |
| ImageEmbeddingStage | ImageBatch | no |
| ImageAestheticFilterStage | ImageBatch | no |
| ImageNSFWFilterStage | ImageBatch | no |
| ImageWriterStage | FileGroupTask | no |
| ConvertImageBatchToDocumentBatch | DocumentBatch | no |
| DeduplicationRemovalStage | ImageBatch | no |

ImageReaderStage is the only image stage that fans out (1 tar → N batches), so it's the only one that needs the flag.
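Stage-side, opting in is a one-method override. The sketch below models the pattern with a stand-in enum so it runs standalone; the real `RayStageSpecKeys` lives in `nemo_curator.backends.experimental.utils`, and the real `ImageReaderStage` of course does far more than this:

```python
from enum import Enum
from typing import Any

# Hypothetical stand-in for nemo_curator's RayStageSpecKeys; only the
# one key relevant to this PR is modeled.
class RayStageSpecKeys(str, Enum):
    IS_FANOUT_STAGE = "is_fanout_stage"

class ImageReaderStage:
    """Sketch of the override this PR adds."""

    def ray_stage_spec(self) -> dict[str, Any]:
        # Tell the Ray Data adapter to repartition after this stage,
        # because process() returns list[ImageBatch] (1 tar -> N batches).
        return {RayStageSpecKeys.IS_FANOUT_STAGE: True}

assert ImageReaderStage().ray_stage_spec()[RayStageSpecKeys.IS_FANOUT_STAGE] is True
```

Any future stage whose `process()` returns a list of tasks would follow the same pattern.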

Benchmarking:

The Xenna and Ray Data runtimes for the image curation benchmark are almost the same.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test ae3bf46

@greptile-apps
Contributor

greptile-apps bot commented Mar 17, 2026

Greptile Summary

This PR enables Ray Data support for the image curation pipeline by adding a ray_stage_spec() override on ImageReaderStage that signals the Ray Data adapter to repartition after each tar is processed, restoring inter-stage parallelism that would otherwise collapse when multiple ImageBatch objects from one tar land in the same Ray block. A matching nightly benchmark entry (image_curation_raydata) is added alongside the existing Xenna benchmark to track Ray Data performance parity.

Key changes:

  • ImageReaderStage.ray_stage_spec() returns {IS_FANOUT_STAGE: True}, mirroring the pattern already established in FilePartitioningStage, so the Ray Data adapter calls repartition(target_num_rows_per_block=1) after this stage.
  • benchmarking/nightly-benchmark.yaml: old image_curation entry renamed to image_curation_xenna; new image_curation_raydata entry added with --executor=ray_data, identical resource config (64 CPUs, 4 GPUs), and the same correctness thresholds (exact_value: 3800, min_value: 3.0 images/sec).
  • Minor: the multi-line id_prefix ternary in _read_tars_with_dali was condensed to one line (this was noted in a prior review thread).

Confidence Score: 4/5

  • Safe to merge with one minor gap: the new ray_stage_spec() method has no unit test, leaving the fanout contract unverified by the test suite.
  • The IS_FANOUT_STAGE pattern is well-established in the codebase (FilePartitioningStage uses it identically) and the adapter's consumption of the flag is already tested there. The benchmark YAML changes are additive and low-risk. The only gap is that the test file for ImageReaderStage doesn't assert the new method's return value, despite the checklist claiming test coverage.
  • tests/stages/image/io/test_image_reader.py — missing assertion for the new ray_stage_spec() fanout flag.

Important Files Changed

Filename Overview
nemo_curator/stages/image/io/image_reader.py Adds ray_stage_spec() override returning IS_FANOUT_STAGE: True to enable proper parallelism when running under the Ray Data executor; also reformats the id_prefix ternary to a single line. The fanout logic follows the existing pattern in FilePartitioningStage. No test covers the new ray_stage_spec() return value.
benchmarking/nightly-benchmark.yaml Renames the existing image_curation entry to image_curation_xenna and adds an equivalent image_curation_raydata entry that passes --executor=ray_data. Both entries share identical resource requests, thresholds, and correctness requirements (min_value: 3.0, exact_value: 3800).

Sequence Diagram

sequenceDiagram
    participant RDE as RayDataExecutor
    participant Adapter as RayDataStageAdapter
    participant IRS as ImageReaderStage
    participant DS as Ray Dataset

    RDE->>Adapter: process_dataset(dataset)
    Adapter->>IRS: ray_stage_spec()
    IRS-->>Adapter: {IS_FANOUT_STAGE: True}
    Adapter->>DS: map_batches(ImageReaderStage.process)<br/>(1 FileGroupTask → N ImageBatches)
    DS-->>Adapter: dataset with N batches<br/>in same block
    Adapter->>DS: repartition(target_num_rows_per_block=1)
    DS-->>Adapter: N separate blocks (one per ImageBatch)
    Adapter->>RDE: return repartitioned dataset
    Note over DS: Each block dispatched<br/>independently to any<br/>ImageEmbeddingStage actor

Comments Outside Diff (1)

  1. tests/stages/image/io/test_image_reader.py, lines 124-130 (link)

    P2 Missing test for ray_stage_spec()

    The new ray_stage_spec() method is the core behavioral change in this PR — it marks ImageReaderStage as a fanout stage so the Ray Data adapter triggers repartition() after it. The existing test suite covers inputs(), outputs(), resources, and process(), but there is no test asserting that ray_stage_spec() returns {RayStageSpecKeys.IS_FANOUT_STAGE: True}.

    Since the PR checklist states "New or Existing tests cover these changes," a simple assertion like the following would satisfy that requirement:

    def test_ray_stage_spec_is_fanout() -> None:
        from nemo_curator.backends.experimental.utils import RayStageSpecKeys
        from nemo_curator.stages.image.io.image_reader import ImageReaderStage
    
        with patch("torch.cuda.is_available", return_value=False):
            stage = ImageReaderStage(dali_batch_size=2, verbose=False)
    
        spec = stage.ray_stage_spec()
        assert spec.get(RayStageSpecKeys.IS_FANOUT_STAGE) is True

    Without this, a future refactor that accidentally drops the flag (or the method) would go undetected.

Last reviewed commit: "Update nightly-bench..."

id_prefix = (
    tar_paths[0].stem
    if len(tar_paths) == 1
    else f"group_{tar_paths[0].stem}_x{len(tar_paths)}"
)
id_prefix = tar_paths[0].stem if len(tar_paths) == 1 else f"group_{tar_paths[0].stem}_x{len(tar_paths)}"
Contributor


P2 Long line reduces readability

The original multi-line ternary was reformatted into a single ~105-character line. This makes the line harder to scan and likely exceeds the project's line-length limit. The original multi-line form was clearer — consider reverting to it:

Suggested change
id_prefix = tar_paths[0].stem if len(tar_paths) == 1 else f"group_{tar_paths[0].stem}_x{len(tar_paths)}"
id_prefix = (
    tar_paths[0].stem
    if len(tar_paths) == 1
    else f"group_{tar_paths[0].stem}_x{len(tar_paths)}"
)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread benchmarking/nightly-benchmark.yaml Outdated
Comment on lines +580 to +581
- metric: throughput_images_per_sec
min_value: 2.5
Contributor


P2 Throughput threshold inconsistency with PR description

The PR description states "The runtimes of Xenna and Ray Data for image curation benchmarking are almost the same," yet the min_value for throughput_images_per_sec is set 20% lower for Ray Data (2.5) than for Xenna (3.0). If the two executors truly perform comparably, consider aligning the floor values — or document in a comment why a lower bound is intentional (e.g., to account for Ray Data cold-start overhead in CI).

@oyilmaz-nvidia
Contributor Author

/ok to test 1c2df7e

Comment on lines +56 to +61
def ray_stage_spec(self) -> dict[str, Any]:
    """Ray stage specification for this stage."""
    return {
        RayStageSpecKeys.IS_FANOUT_STAGE: True,
    }

Contributor


This makes sense, thanks for adding it.

Contributor

@suiyoubi suiyoubi left a comment


Thanks @oyilmaz-nvidia, is the perf for Ray Data comparable to using Xenna? I see the min_value is set differently.

Signed-off-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
@oyilmaz-nvidia
Contributor Author

@suiyoubi Changed the min value to the default, and the runtime is similar to Xenna.

@oyilmaz-nvidia
Contributor Author

/ok to test 590d503


4 participants