Add ray data for image #1610
Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
/ok to test ae3bf46
Greptile Summary
This PR enables Ray Data support for the image curation pipeline by adding a
Key changes:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant RDE as RayDataExecutor
participant Adapter as RayDataStageAdapter
participant IRS as ImageReaderStage
participant DS as Ray Dataset
RDE->>Adapter: process_dataset(dataset)
Adapter->>IRS: ray_stage_spec()
IRS-->>Adapter: {IS_FANOUT_STAGE: True}
Adapter->>DS: map_batches(ImageReaderStage.process)<br/>(1 FileGroupTask → N ImageBatches)
DS-->>Adapter: dataset with N batches<br/>in same block
Adapter->>DS: repartition(target_num_rows_per_block=1)
DS-->>Adapter: N separate blocks (one per ImageBatch)
Adapter->>RDE: return repartitioned dataset
Note over DS: Each block dispatched<br/>independently to any<br/>ImageEmbeddingStage actor
| if len(tar_paths) == 1 | ||
| else f"group_{tar_paths[0].stem}_x{len(tar_paths)}" | ||
| ) | ||
| id_prefix = tar_paths[0].stem if len(tar_paths) == 1 else f"group_{tar_paths[0].stem}_x{len(tar_paths)}" |
The original multi-line ternary was reformatted into a single ~105-character line. This makes the line harder to scan and likely exceeds the project's line-length limit. The original multi-line form was clearer — consider reverting to it:
| id_prefix = tar_paths[0].stem if len(tar_paths) == 1 else f"group_{tar_paths[0].stem}_x{len(tar_paths)}" | |
| id_prefix = ( | |
| tar_paths[0].stem | |
| if len(tar_paths) == 1 | |
| else f"group_{tar_paths[0].stem}_x{len(tar_paths)}" | |
| ) |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| - metric: throughput_images_per_sec | ||
| min_value: 2.5 |
Throughput threshold inconsistency with PR description
The PR description states "The runtimes of Xenna and Ray Data for image curation benchmarking are almost the same," yet the min_value for throughput_images_per_sec is set about 17% lower for Ray Data (2.5) than for Xenna (3.0). If the two executors truly perform comparably, consider aligning the floor values, or document in a comment why a lower bound is intentional (e.g., to account for Ray Data cold-start overhead in CI).
/ok to test 1c2df7e
| def ray_stage_spec(self) -> dict[str, Any]: | ||
| """Ray stage specification for this stage.""" | ||
| return { | ||
| RayStageSpecKeys.IS_FANOUT_STAGE: True, | ||
| } | ||
This makes sense, thanks for adding it.
suiyoubi left a comment
Thanks @oyilmaz-nvidia, is the perf for Ray Data comparable to using Xenna? I see the min_value is set differently.
Signed-off-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
@suiyoubi Changed the min_value to the default; the runtime is similar to Xenna.
/ok to test 590d503
Description
This PR adds a few changes to test and benchmark Ray Data for image workflows.
Fanouts
Why IS_FANOUT_STAGE on ImageReaderStage:
ImageReaderStage.process() returns list[ImageBatch] — for each .tar file it reads, DALI may produce multiple batches. In Ray Data, all those batches from one tar end up in the same block after map_batches. Without IS_FANOUT_STAGE, all of them get sent to the same downstream embedding actor, killing parallelism. The flag triggers repartition(target_num_rows_per_block=1) in the adapter, splitting them into individual blocks so each ImageBatch can be picked up by any available ImageEmbeddingStage actor independently.
It's the same reason FilePartitioningStage has it — it also returns a list[FileGroupTask].
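The effect of the flag can be illustrated with a toy model in plain Python. This is only a sketch: it does not use Ray, the helper names (`read_tar`, `map_batches`, `repartition_one_row_per_block`, `dispatch`) are hypothetical, and the round-robin block-to-actor assignment is an assumption made for illustration rather than how Ray Data actually schedules work.

```python
from dataclasses import dataclass


@dataclass
class ImageBatch:
    """Hypothetical stand-in for the real ImageBatch task."""
    tar_name: str
    index: int


def read_tar(tar_name: str, num_batches: int) -> list[ImageBatch]:
    """Stand-in for ImageReaderStage.process(): one tar -> N batches."""
    return [ImageBatch(tar_name, i) for i in range(num_batches)]


def map_batches(tars: list[tuple[str, int]]) -> list[list[ImageBatch]]:
    """Toy model: every batch read from one tar lands in the same block."""
    return [read_tar(name, n) for name, n in tars]


def repartition_one_row_per_block(blocks: list[list[ImageBatch]]) -> list[list[ImageBatch]]:
    """Toy equivalent of repartition(target_num_rows_per_block=1)."""
    return [[row] for block in blocks for row in block]


def dispatch(blocks: list[list[ImageBatch]], num_actors: int) -> list[int]:
    """Assign whole blocks to actors round-robin (an assumption); return rows per actor."""
    load = [0] * num_actors
    for i, block in enumerate(blocks):
        load[i % num_actors] += len(block)
    return load


blocks = map_batches([("shard_000", 4)])
without_fanout = dispatch(blocks, num_actors=4)
with_fanout = dispatch(repartition_one_row_per_block(blocks), num_actors=4)

print(without_fanout)  # [4, 0, 0, 0]
print(with_fanout)     # [1, 1, 1, 1]
```

In this model a block is the unit of dispatch, so a single four-batch block keeps three of the four actors idle, while one-row blocks let every actor take a batch.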
Should You Add More Fanouts?
For the standard pipeline, no.
Every other image stage returns a single task — confirmed:
| Stage | process() return type | Fanout needed? |
| --- | --- | --- |
| ImageReaderStage | list[ImageBatch] | ✅ added |
| ImageEmbeddingStage | ImageBatch | no |
| ImageAestheticFilterStage | ImageBatch | no |
| ImageNSFWFilterStage | ImageBatch | no |
| ImageWriterStage | FileGroupTask | no |
| ConvertImageBatchToDocumentBatch | DocumentBatch | no |
| DeduplicationRemovalStage | ImageBatch | no |
ImageReaderStage is the only image stage that fans out (1 tar → N batches), so it's the only one that needs the flag.
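An audit like the one in the table could in principle be automated by inspecting the return annotation of each stage's `process()` method. The sketch below assumes stages annotate their return types; `ReaderLikeStage`, `EmbeddingLikeStage`, and `returns_list` are hypothetical names used for illustration, not classes or helpers from the repository.

```python
from typing import get_origin, get_type_hints


class ImageBatch:
    """Hypothetical placeholder task type."""


class FileGroupTask:
    """Hypothetical placeholder task type."""


class ReaderLikeStage:
    # Mirrors a stage that fans out: one input task, a list of output tasks.
    def process(self, task: FileGroupTask) -> list[ImageBatch]:
        return [ImageBatch()]


class EmbeddingLikeStage:
    # Mirrors a one-in, one-out stage.
    def process(self, task: ImageBatch) -> ImageBatch:
        return task


def returns_list(stage_cls: type) -> bool:
    """True if process() is annotated as returning list[...]."""
    hints = get_type_hints(stage_cls.process)
    return get_origin(hints.get("return")) is list


needs_fanout = {
    cls.__name__: returns_list(cls)
    for cls in (ReaderLikeStage, EmbeddingLikeStage)
}
print(needs_fanout)  # {'ReaderLikeStage': True, 'EmbeddingLikeStage': False}
```

A check like this only reflects the annotations; a stage annotated with a union type or left unannotated would still need to be reviewed by hand.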
Benchmarking:
The runtimes of Xenna and Ray Data for image curation benchmarking are almost the same.
Checklist