Make inprocess the default implementation for the GraphIngestor, batc…#2170

Merged
jdye64 merged 2 commits into NVIDIA:26.05 from jdye64:inprocess-default-pipeline
May 29, 2026

Conversation

@jdye64
Collaborator

@jdye64 jdye64 commented May 29, 2026

…h can still be used with a configuration

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

@jdye64 jdye64 requested review from a team as code owners May 29, 2026 18:23
@jdye64 jdye64 requested review from jioffe502 and removed request for a team May 29, 2026 18:23
@copy-pr-bot

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented May 29, 2026

Greptile Summary

This PR makes "inprocess" (single-process pandas, no Ray) the default run_mode across GraphIngestor, the CLI commands (retriever ingest, retriever pipeline), resolve_ingest_plan, ingest_documents, and HarnessConfig. All tests are updated to assert the new default value.

  • Default flip everywhere: run_mode default changed from "batch" (Ray-distributed) to "inprocess" (pandas, no Ray) in every public entry point.
  • Docstrings and help text updated: Inline docstrings and CLI help text reordered to list inprocess first; module-level usage example updated accordingly.
  • Tests mechanically updated: All assertions updated to expect "inprocess"; one help-text assertion was relaxed and no longer verifies that the CLI help shows inprocess as the default.
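
To make the default flip concrete, here is a minimal sketch. `IngestorSketch` is a hypothetical stand-in, not the real `GraphIngestor`; it only mirrors the keyword-only `run_mode: str = "inprocess"` parameter shown in this PR's diff.

```python
class IngestorSketch:
    """Hypothetical stand-in that models only the run_mode default."""

    def __init__(self, *, run_mode: str = "inprocess") -> None:
        # Keyword-only, matching the `*,` in the diffed signature.
        self.run_mode = run_mode


# Omitting run_mode now selects single-process execution.
default_ingestor = IngestorSketch()

# Callers that want the previous behavior have to ask for it explicitly.
batch_ingestor = IngestorSketch(run_mode="batch")

print(default_ingestor.run_mode, batch_ingestor.run_mode)
```

Under these assumptions, pinning `run_mode` at each call site keeps behavior stable regardless of which default a given release ships.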

Confidence Score: 4/5

Merge with caution: any code constructing GraphIngestor(), calling resolve_ingest_plan/ingest_documents, or loading a HarnessConfig without an explicit run_mode will silently switch from Ray-distributed execution to single-process pandas at next deploy.

The PR changes the execution mode default across every public entry point. Callers that previously relied on the old default — including library users, YAML-based benchmark configs, and CI pipelines that omit --run-mode — will now run in single-process mode without any warning, deprecation notice, or version bump. For benchmark harness runs on large datasets this can mean severely degraded throughput or OOM failures.

nemo_retriever/src/nemo_retriever/graph_ingestor.py, nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py, and nemo_retriever/src/nemo_retriever/harness/config.py — all three change defaults that existing callers may be relying on implicitly.
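
One way to gauge exposure before upgrading is to list the configs that never set `run_mode` and therefore inherit the new default. The sketch below assumes configs have already been loaded into plain dictionaries; the function name, config names, and the `dataset` key are illustrative and not part of nemo_retriever.

```python
from typing import Any, Mapping


def find_implicit_run_mode(configs: Mapping[str, Mapping[str, Any]]) -> list[str]:
    """Return the names of configs that omit run_mode (they inherit the default)."""
    return sorted(name for name, cfg in configs.items() if "run_mode" not in cfg)


# Illustrative configs, e.g. parsed from benchmark YAML files.
configs = {
    "large_benchmark": {"dataset": "big"},
    "smoke_test": {"dataset": "small", "run_mode": "inprocess"},
    "nightly": {"dataset": "big", "run_mode": "batch"},
}

affected = find_implicit_run_mode(configs)
print(affected)
```

Any config reported here would need an explicit `run_mode: batch` entry to keep Ray-distributed execution.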

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Flips the `GraphIngestor.__init__` default run_mode from batch to inprocess; any code constructing GraphIngestor() without an explicit mode now runs single-process pandas instead of Ray-distributed. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | Defaults for resolve_ingest_plan and ingest_documents changed from batch to inprocess; docstrings updated to reflect new intent. |
| nemo_retriever/src/nemo_retriever/adapters/cli/main.py | CLI --run-mode option default and help text updated from batch to inprocess; change is correct and self-consistent. |
| nemo_retriever/src/nemo_retriever/pipeline/main.py | Pipeline CLI docstring examples and run_mode Typer option default swapped from batch to inprocess; help text and examples updated consistently. |
| nemo_retriever/src/nemo_retriever/harness/config.py | HarnessConfig.run_mode default changed from batch to inprocess; YAML benchmark configs that do not explicitly set run_mode will now execute single-process pandas instead of Ray-distributed. |
| nemo_retriever/tests/test_harness_run.py | Assertions updated from batch to inprocess to match new defaults; changes are mechanically correct. |
| nemo_retriever/tests/test_ingest_manifest.py | Single assertion updated from batch to inprocess to align with new default; straightforward and correct. |
| nemo_retriever/tests/test_root_cli_workflow.py | Several create_ingestor/dry-run assertions updated correctly; the help-text assertion at lines 428-429 was weakened and no longer verifies that inprocess is shown as the CLI default value. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Caller: GraphIngestor() / CLI / HarnessConfig"] --> B{"run_mode?"}
    B -->|"default (now): inprocess"| C["InprocessExecutor (single-process pandas)"]
    B -->|"explicit: batch"| D["RayDataExecutor (Ray Data distributed)"]
    B -->|"explicit: service"| E["Remote retriever service"]
    C --> F["Local execution, no Ray startup"]
    D --> G["Distributed scale-out, Ray cluster required"]
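
The routing in the flowchart can be sketched as a lookup table. The executor labels are copied from the diagram; `select_executor` and the table itself are hypothetical and say nothing about how nemo_retriever actually dispatches.

```python
# Labels taken from the flowchart above; the mapping is an illustrative assumption.
EXECUTORS = {
    "inprocess": "InprocessExecutor",
    "batch": "RayDataExecutor",
    "service": "Remote retriever service",
}


def select_executor(run_mode: str = "inprocess") -> str:
    """Return the executor label for a run_mode, rejecting unknown modes."""
    try:
        return EXECUTORS[run_mode]
    except KeyError:
        expected = ", ".join(sorted(EXECUTORS))
        raise ValueError(f"unknown run_mode {run_mode!r}; expected one of: {expected}") from None


print(select_executor())         # default path
print(select_executor("batch"))  # explicit opt-in to Ray Data
```

Failing loudly on an unknown mode avoids silently falling back to the default, which is the kind of quiet behavior change the review comments in this PR flag.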

Reviews (2): Last reviewed commit: "unit test fixes"

  self,
  *,
- run_mode: str = "batch",
+ run_mode: str = "inprocess",

P1 Silent breaking change on public API default

`GraphIngestor.__init__` is a user-facing contract (per `public-api-contract`). Silently flipping the default `run_mode` from `"batch"` to `"inprocess"` means any caller that relied on the default will now execute single-process pandas instead of distributed Ray without any warning or error. The same break is replicated in `resolve_ingest_plan` and `ingest_documents` in `sdk_workflow.py`. Any production pipeline constructed with `GraphIngestor()` (no explicit `run_mode`) will silently downgrade to single-process execution at next deploy. A deprecation warning on the old default, or at minimum a `CHANGELOG` entry / version bump, is needed before landing this.

Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 418

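
A warning of the kind requested in this comment could look roughly like the sketch below. It uses a `None` sentinel so the constructor can tell "omitted" apart from "explicitly chosen". `WarningIngestorSketch` is a hypothetical stand-in; the merged PR does not contain this code.

```python
import warnings
from typing import Optional


class WarningIngestorSketch:
    """Hypothetical stand-in that warns when run_mode is left implicit."""

    def __init__(self, *, run_mode: Optional[str] = None) -> None:
        if run_mode is None:
            # The caller relied on the default, so surface the behavior change.
            warnings.warn(
                'run_mode was not specified; the default is now "inprocess" '
                '(previously "batch"). Pass run_mode="batch" to keep '
                "Ray-distributed execution.",
                FutureWarning,
                stacklevel=2,
            )
            run_mode = "inprocess"
        self.run_mode = run_mode


# Record warnings to show that only the implicit call emits one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = WarningIngestorSketch()
    explicit = WarningIngestorSketch(run_mode="batch")

print(len(caught), implicit.run_mode, explicit.run_mode)
```

`FutureWarning` is used here because Python shows it to end users by default, whereas `DeprecationWarning` is hidden outside `__main__` and test runners.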

Comment on lines +428 to +429
assert "--run-mode" in result.output
assert "[inprocess|batch" in result.output

P2 The old assertion `"[default: batch]"` confirmed the CLI help text actually rendered the default value. The new assertion only checks that the option name and allowed-values string appear somewhere in the output, but no longer verifies the displayed default is `inprocess`. A typo or missing Typer `default=` argument could slip through undetected.

Suggested change
- assert "--run-mode" in result.output
- assert "[inprocess|batch" in result.output
+ assert "--run-mode" in result.output
+ assert "[inprocess|batch" in result.output
+ assert "[default: inprocess]" in result.output
Path: nemo_retriever/tests/test_root_cli_workflow.py
Line: 428-429

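
The same idea, asserting on the rendered default rather than only the option name, can be shown with the standard library. This sketch swaps in `argparse` for Typer so it runs without extra dependencies; `argparse` renders the default as `(default: inprocess)` rather than Typer's `[default: inprocess]`. The parser, its `prog` name, and the choice list are assumptions for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI with a --run-mode option that shows its default."""
    parser = argparse.ArgumentParser(
        prog="ingest-sketch",
        # This formatter appends "(default: ...)" to each option's help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--run-mode",
        choices=["inprocess", "batch", "service"],
        default="inprocess",
        help="Execution mode",
    )
    return parser


# Collapse whitespace so terminal-width line wrapping cannot split the match.
help_text = " ".join(build_parser().format_help().split())

assert "--run-mode" in help_text
assert "(default: inprocess)" in help_text
```

Checking the rendered default catches the case where the option still exists but its default was dropped or mistyped, which an option-name check alone would miss.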

@jdye64 jdye64 merged commit 2be38bd into NVIDIA:26.05 May 29, 2026
8 checks passed
2 participants