Expose a few more args to the exact dedup workflow#1561

Merged
ayushdg merged 3 commits into
NVIDIA-NeMo:mainfrom
ayushdg:exact-workflow-expose-args
Mar 2, 2026

Conversation

@ayushdg
Contributor

@ayushdg ayushdg commented Feb 27, 2026

Description

For larger runs on clusters, I often find myself setting total_nparts, rmm_pool_size, and spill_limit manually for fine-tuning. Longer term, it makes sense to revisit the defaults and choose better ones, but for now it would be nice to expose these arguments on the workflow class rather than having to create a custom stage.

Usage

workflow = ExactDeduplicationWorkflow(total_nparts=512, rmm_pool_size=72*1024*1024*1024,...)
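
Since rmm_pool_size is given in bytes in the usage line above, a small helper can make the intended size easier to read. This is only a sketch: `gib` and `tuning_kwargs` are hypothetical names introduced here for illustration and are not part of this PR, and the workflow call is left as a comment because its remaining arguments are elided in the original.

```python
# Hypothetical helper (not part of NeMo Curator): convert GiB to bytes.
def gib(n: int) -> int:
    return n * 1024 * 1024 * 1024


# The two keyword arguments shown in the usage line of this PR description.
tuning_kwargs = {
    "total_nparts": 512,
    "rmm_pool_size": gib(72),  # same value as 72*1024*1024*1024
}

print(tuning_kwargs["rmm_pool_size"])  # 77309411328

# Assumed usage, mirroring the PR description (other arguments elided there):
# workflow = ExactDeduplicationWorkflow(**tuning_kwargs, ...)
```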

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

…flow class

Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
Comment thread nemo_curator/stages/deduplication/exact/workflow.py Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
