Make inprocess the default implementation for the GraphIngestor, batc…#2170
Conversation
…h can still be used with a configuration
Greptile SummaryThis PR makes
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Flips GraphIngestor.init default run_mode from batch to inprocess; any code constructing GraphIngestor() without an explicit mode now runs single-process pandas instead of Ray-distributed. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | Defaults for resolve_ingest_plan and ingest_documents changed from batch to inprocess; docstrings updated to reflect new intent. |
| nemo_retriever/src/nemo_retriever/adapters/cli/main.py | CLI --run-mode option default and help text updated from batch to inprocess; change is correct and self-consistent. |
| nemo_retriever/src/nemo_retriever/pipeline/main.py | Pipeline CLI docstring examples and run_mode Typer option default swapped from batch to inprocess; help text and examples updated consistently. |
| nemo_retriever/src/nemo_retriever/harness/config.py | HarnessConfig.run_mode default changed from batch to inprocess; YAML benchmark configs that do not explicitly set run_mode will now execute single-process pandas instead of Ray-distributed. |
| nemo_retriever/tests/test_harness_run.py | Assertions updated from batch to inprocess to match new defaults; changes are mechanically correct. |
| nemo_retriever/tests/test_ingest_manifest.py | Single assertion updated from batch to inprocess to align with new default; straightforward and correct. |
| nemo_retriever/tests/test_root_cli_workflow.py | Several create_ingestor/dry-run assertions updated correctly; the help-text assertion at line 428-429 was weakened and no longer verifies that inprocess is shown as the CLI default value. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Caller: GraphIngestor() / CLI / HarnessConfig"] --> B{"run_mode?"}
B -->|"default (now): inprocess"| C["InprocessExecutor (single-process pandas)"]
B -->|"explicit: batch"| D["RayDataExecutor (Ray Data distributed)"]
B -->|"explicit: service"| E["Remote retriever service"]
C --> F["Local execution, no Ray startup"]
D --> G["Distributed scale-out, Ray cluster required"]
Reviews (2): Last reviewed commit: "unit test fixes" | Re-trigger Greptile
| self, | ||
| *, | ||
| run_mode: str = "batch", | ||
| run_mode: str = "inprocess", |
There was a problem hiding this comment.
Silent breaking change on public API default
GraphIngestor.__init__ is a user-facing contract (per public-api-contract). Silently flipping the default run_mode from "batch" to "inprocess" means any caller that relied on the default will now execute single-process pandas instead of distributed Ray without any warning or error. The same break is replicated in resolve_ingest_plan and ingest_documents in sdk_workflow.py. Any production pipeline constructed with GraphIngestor() (no explicit run_mode) will downgrade to single-core execution silently at next deploy. A deprecation warning on the old default, or at minimum a CHANGELOG / version bump, is needed before landing this.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 418
Comment:
**Silent breaking change on public API default**
`GraphIngestor.__init__` is a user-facing contract (per `public-api-contract`). Silently flipping the default `run_mode` from `"batch"` to `"inprocess"` means any caller that relied on the default will now execute single-process pandas instead of distributed Ray without any warning or error. The same break is replicated in `resolve_ingest_plan` and `ingest_documents` in `sdk_workflow.py`. Any production pipeline constructed with `GraphIngestor()` (no explicit `run_mode`) will downgrade to single-core execution silently at next deploy. A deprecation warning on the old default, or at minimum a `CHANGELOG` / version bump, is needed before landing this.
How can I resolve this? If you propose a fix, please make it concise.| assert "--run-mode" in result.output | ||
| assert "[inprocess|batch" in result.output |
There was a problem hiding this comment.
The old assertion
"[default: batch]" confirmed the CLI help text actually rendered the default value. The new assertion only checks that the option name and allowed-values string appear somewhere in the output, but no longer verifies the displayed default is inprocess. A typo or missing Typer default= argument could slip through undetected.
| assert "--run-mode" in result.output | |
| assert "[inprocess|batch" in result.output | |
| assert "--run-mode" in result.output | |
| assert "[inprocess|batch" in result.output | |
| assert "[default: inprocess]" in result.output |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/tests/test_root_cli_workflow.py
Line: 428-429
Comment:
The old assertion `"[default: batch]"` confirmed the CLI help text actually rendered the default value. The new assertion only checks that the option name and allowed-values string appear somewhere in the output, but no longer verifies the displayed default is `inprocess`. A typo or missing Typer `default=` argument could slip through undetected.
```suggestion
assert "--run-mode" in result.output
assert "[inprocess|batch" in result.output
assert "[default: inprocess]" in result.output
```
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
…h can still be used with a configuration
Description
Checklist