Add input-aware retriever ingest routing#2068
Conversation
Greptile SummaryThis PR introduces input-aware routing for
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/utils/input_files.py | Adds INPUT_TYPE_EXTENSIONS, AUTO_INPUT_EXTENSIONS, PDF_DOCUMENT_INPUT_TYPES, and input_type_for_path(); centralises extension-to-type mapping. Adds .tif and .svg to image patterns (both were already supported by image/load.py). Clean. |
| nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py | Replaces bare AudioChunkParams()/ASRParams()/VideoFrameParams() constructors with default*() factory functions that read env vars, fixing silent misconfiguration for audio/video in auto-mode. |
| nemo_retriever/src/nemo_retriever/graph_ingestor.py | Adds _resolve_effective_extraction_inputs() for deferred mode selection. The audio-only and video-only branches in this new method still use bare constructors (flagged in prior review). Mixed-auto and explicit-mode paths are correct. |
| nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py | Adds _attach_extract_stage(), _resolve_effective_input_type(), and _validate_ingest_document_types(). Audio/video typed paths use _default_asr_params() correctly. Mixed-auto path correctly leaves asr/video params as None so MultiTypeExtractOperator factories activate. |
| nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py | Adds table-structure tuning override support (workers, batch_size, cpus, gpus). CPU total estimate now uses ts_cpus variable. Logic is correct. |
| nemo_retriever/src/nemo_retriever/params/models.py | Adds table_structure_workers, table_structure_batch_size, table_structure_cpus_per_actor, gpu_table_structure fields to BatchTuningParams. Backward-compatible additions with sensible defaults. |
| nemo_retriever/tests/test_ingest_interface.py | Adds tests for PDF-only, direct image, mixed PDF+image, explicit-mode rejection, and unknown-extension rejection. Test helper accesses private _resolve_effective_extraction_inputs() directly. |
| nemo_retriever/tests/test_root_cli_workflow.py | Good coverage of new routing paths. _FakeAsrParams returns a dict from model_copy(), weakening the ASRParams contract check. Mixed-directory test does not cover audio or video in mixed-auto mode. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["retriever ingest DOCUMENTS... --input-type X"] --> B["_validate_input_type()"]
B --> C["_expand_ingest_documents()"]
C --> D["_validate_ingest_document_types()"]
D --> E["_resolve_effective_input_type()"]
E -->|"doc"| F1["→ pdf"]
E -->|"single pdf"| F1
E -->|"single txt"| F2["→ txt"]
E -->|"single image"| F3["→ image"]
E -->|"single audio"| F4["→ audio"]
E -->|"single video"| F5["→ video"]
E -->|"mixed types"| F6["→ auto"]
F1 --> G1["ingestor.extract(params, extraction_mode=pdf)"]
F2 --> G2["ingestor.extract_txt(TextChunkParams)"]
F3 --> G3["ingestor.extract_image_files(params)"]
F4 --> G4["ingestor.extract_audio(AudioChunkParams, asr_params_from_env)"]
F5 --> G5["ingestor.extract_video(AudioChunkParams, VideoFrameParams)"]
F6 --> G6["ingestor.extract(params, extraction_mode=auto, text_params, html_params)"]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py:157-160
The `extract_params` argument is collected and potentially non-None when the caller provides batch-tuning options, but it is silently discarded for the `"txt"` and `"html"` branches. If `extract_params` is `None` this is harmless, but if a user supplies `--page-elements-workers` or another tuning flag alongside a text-only corpus, those settings are dropped without any warning. Consider forwarding `extract_params` (or at minimum its `batch_tuning` sub-object) so callers get consistent behaviour across all input types.
```suggestion
if input_type == "txt":
return ingestor.extract_txt(TextChunkParams(), extract_params=extract_params)
if input_type == "html":
return ingestor.extract_html(HtmlChunkParams(), extract_params=extract_params)
```
### Issue 2 of 2
nemo_retriever/tests/test_root_cli_workflow.py:1438-1442
**Hand-rolled mock diverges from ASRParams contract**
`_FakeAsrParams.model_copy()` returns a plain `dict` instead of an `ASRParams`-like object. Tests that use this mock (audio, video, and mixed-directory cases) then assert `fake_ingestor.extract_audio.call_args.kwargs["asr_params"] == {"segment_audio": False}`, checking dict equality rather than the `ASRParams` attribute. If `extract_audio`/`extract_video` were ever changed to call `.segment_audio` on the received object, this mock would silently pass where a real failure should be caught. Using `MagicMock(spec=ASRParams)` with a configured `model_copy` side-effect would keep the spec check alive.
Reviews (4): Last reviewed commit: "Remove temporary ingest architecture not..." | Re-trigger Greptile
| params: Optional[ExtractParams] = None, | ||
| *, | ||
| split_config: dict[str, Any] | None = None, | ||
| extraction_mode: str = "pdf", | ||
| extraction_mode: str | None = None, | ||
| text_params: Optional[TextChunkParams] = None, |
There was a problem hiding this comment.
Breaking default change on the public
extract() contract
extraction_mode changed its default from "pdf" to None. Code that calls .extract() without an explicit extraction_mode and passes non-PDF/DOC files into the pipeline will now be auto-detected and routed to a different sub-graph. Per the api-backward-compatibility rule, changing an existing parameter default is a breaking change and should be noted in the changelog or docstring.
Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 301-305
Comment:
**Breaking default change on the public `extract()` contract**
`extraction_mode` changed its default from `"pdf"` to `None`. Code that calls `.extract()` without an explicit `extraction_mode` and passes non-PDF/DOC files into the pipeline will now be auto-detected and routed to a different sub-graph. Per the `api-backward-compatibility` rule, changing an existing parameter default is a breaking change and should be noted in the changelog or docstring.
**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
tested and routing is in place
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Summary
GraphIngestor.extract()routing from input extensions immediately before graph construction.MultiTypeExtractOperatorand fail fast for unsupported concrete extensions.retriever ingest,GraphIngestor, andMultiTypeExtractOperator.Validation
nemo_retriever/.venv/bin/python -m pytest -q nemo_retriever/tests/test_ingest_interface.py nemo_retriever/tests/test_root_cli_workflow.py nemo_retriever/tests/test_ingest_plans.py nemo_retriever/tests/test_image_load.py nemo_retriever/tests/test_doc_to_pdf_actor.pygit diff --checkNotes