Skip to content

Add input-aware retriever ingest routing#2068

Merged
jperez999 merged 9 commits into
NVIDIA:mainfrom
jioffe502:codex/ingest-input-aware-modalities
May 20, 2026
Merged

Add input-aware retriever ingest routing#2068
jperez999 merged 9 commits into
NVIDIA:mainfrom
jioffe502:codex/ingest-input-aware-modalities

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

Summary

  • Resolve default GraphIngestor.extract() routing from input extensions immediately before graph construction.
  • Keep PDF/DOC/PPTX-only inputs on the dedicated PDF graph while routing direct images, text, HTML, audio, and video to their typed paths.
  • Route mixed supported inputs through MultiTypeExtractOperator and fail fast for unsupported concrete extensions.
  • Share the extension classifier between root retriever ingest, GraphIngestor, and MultiTypeExtractOperator.

Validation

  • nemo_retriever/.venv/bin/python -m pytest -q nemo_retriever/tests/test_ingest_interface.py nemo_retriever/tests/test_root_cli_workflow.py nemo_retriever/tests/test_ingest_plans.py nemo_retriever/tests/test_image_load.py nemo_retriever/tests/test_doc_to_pdf_actor.py
  • git diff --check

Notes

  • Opened as draft while the BO767/single-H100 PPS smoke and broader CI are still pending on another machine.

@jioffe502 jioffe502 marked this pull request as ready for review May 20, 2026 14:40
@jioffe502 jioffe502 requested review from a team as code owners May 20, 2026 14:40
@jioffe502 jioffe502 requested a review from jdye64 May 20, 2026 14:40
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR introduces input-aware routing for retriever ingest, extending it from PDF-only to support text, HTML, image, audio, video, and mixed-type directories through auto-detection of file extensions at ingestion time. A shared INPUT_TYPE_EXTENSIONS classifier in input_files.py is now the single source of truth for extension-to-type mapping across GraphIngestor, MultiTypeExtractOperator, and the root CLI.

  • GraphIngestor.extract() now defaults extraction_mode=None and defers extraction mode selection to _resolve_effective_extraction_inputs(), which inspects actual file extensions immediately before graph construction.
  • MultiTypeExtractOperator replaces bare AudioChunkParams() / ASRParams() / VideoFrameParams() constructors with proper _default_*() factory functions that read env vars, fixing silent misconfiguration for audio/video in mixed-auto directories.
  • ingest_documents() / _attach_extract_stage() add per-type routing in the SDK workflow layer, plus new CLI flags (--input-type, --table-output-format, --local-ingest-embed-backend, and GPU-per-actor options).

Confidence Score: 3/5

The mixed-directory auto-detection path is correctly implemented and the MultiTypeExtractOperator fix is solid, but the GraphIngestor library-mode audio-only and video-only paths still use bare Pydantic constructors that bypass env-var configuration.

The _resolve_effective_extraction_inputs() method added in this PR uses AudioChunkParams(), ASRParams(), and VideoFrameParams() bare constructors for the audio-only and video-only cases. Because these truthy instances are passed through to MultiTypeExtractOperator.init, the or default*() factory fallbacks that read AUDIO_GRPC_ENDPOINT and set correct split intervals are never reached. A user calling .extract() on a folder of .wav or .mp4 files via the library API would silently get wrong defaults, different from the CLI path that correctly uses _default_asr_params(). This gap was highlighted in a prior review round and remains open.

nemo_retriever/src/nemo_retriever/graph_ingestor.py — the audio-only and video-only branches in _resolve_effective_extraction_inputs() need the same default*() factory treatment that was correctly applied to MultiTypeExtractOperator in this PR.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/utils/input_files.py Adds INPUT_TYPE_EXTENSIONS, AUTO_INPUT_EXTENSIONS, PDF_DOCUMENT_INPUT_TYPES, and input_type_for_path(); centralises extension-to-type mapping. Adds .tif and .svg to image patterns (both were already supported by image/load.py). Clean.
nemo_retriever/src/nemo_retriever/graph/multi_type_extract_operator.py Replaces bare AudioChunkParams()/ASRParams()/VideoFrameParams() constructors with default*() factory functions that read env vars, fixing silent misconfiguration for audio/video in auto-mode.
nemo_retriever/src/nemo_retriever/graph_ingestor.py Adds _resolve_effective_extraction_inputs() for deferred mode selection. The audio-only and video-only branches in this new method still use bare constructors (flagged in prior review). Mixed-auto and explicit-mode paths are correct.
nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py Adds _attach_extract_stage(), _resolve_effective_input_type(), and _validate_ingest_document_types(). Audio/video typed paths use _default_asr_params() correctly. Mixed-auto path correctly leaves asr/video params as None so MultiTypeExtractOperator factories activate.
nemo_retriever/src/nemo_retriever/graph/ingestor_runtime.py Adds table-structure tuning override support (workers, batch_size, cpus, gpus). CPU total estimate now uses ts_cpus variable. Logic is correct.
nemo_retriever/src/nemo_retriever/params/models.py Adds table_structure_workers, table_structure_batch_size, table_structure_cpus_per_actor, gpu_table_structure fields to BatchTuningParams. Backward-compatible additions with sensible defaults.
nemo_retriever/tests/test_ingest_interface.py Adds tests for PDF-only, direct image, mixed PDF+image, explicit-mode rejection, and unknown-extension rejection. Test helper accesses private _resolve_effective_extraction_inputs() directly.
nemo_retriever/tests/test_root_cli_workflow.py Good coverage of new routing paths. _FakeAsrParams returns a dict from model_copy(), weakening the ASRParams contract check. Mixed-directory test does not cover audio or video in mixed-auto mode.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["retriever ingest DOCUMENTS... --input-type X"] --> B["_validate_input_type()"]
    B --> C["_expand_ingest_documents()"]
    C --> D["_validate_ingest_document_types()"]
    D --> E["_resolve_effective_input_type()"]
    E -->|"doc"| F1["→ pdf"]
    E -->|"single pdf"| F1
    E -->|"single txt"| F2["→ txt"]
    E -->|"single image"| F3["→ image"]
    E -->|"single audio"| F4["→ audio"]
    E -->|"single video"| F5["→ video"]
    E -->|"mixed types"| F6["→ auto"]
    F1 --> G1["ingestor.extract(params, extraction_mode=pdf)"]
    F2 --> G2["ingestor.extract_txt(TextChunkParams)"]
    F3 --> G3["ingestor.extract_image_files(params)"]
    F4 --> G4["ingestor.extract_audio(AudioChunkParams, asr_params_from_env)"]
    F5 --> G5["ingestor.extract_video(AudioChunkParams, VideoFrameParams)"]
    F6 --> G6["ingestor.extract(params, extraction_mode=auto, text_params, html_params)"]
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py:157-160
The `extract_params` argument is collected and potentially non-None when the caller provides batch-tuning options, but it is silently discarded for the `"txt"` and `"html"` branches. If `extract_params` is `None` this is harmless, but if a user supplies `--page-elements-workers` or another tuning flag alongside a text-only corpus, those settings are dropped without any warning. Consider forwarding `extract_params` (or at minimum its `batch_tuning` sub-object) so callers get consistent behaviour across all input types.

```suggestion
    if input_type == "txt":
        return ingestor.extract_txt(TextChunkParams(), extract_params=extract_params)
    if input_type == "html":
        return ingestor.extract_html(HtmlChunkParams(), extract_params=extract_params)
```

### Issue 2 of 2
nemo_retriever/tests/test_root_cli_workflow.py:1438-1442
**Hand-rolled mock diverges from ASRParams contract**

`_FakeAsrParams.model_copy()` returns a plain `dict` instead of an `ASRParams`-like object. Tests that use this mock (audio, video, and mixed-directory cases) then assert `fake_ingestor.extract_audio.call_args.kwargs["asr_params"] == {"segment_audio": False}`, checking dict equality rather than the `ASRParams` attribute. If `extract_audio`/`extract_video` were ever changed to call `.segment_audio` on the received object, this mock would silently pass where a real failure should be caught. Using `MagicMock(spec=ASRParams)` with a configured `model_copy` side-effect would keep the spec check alive.

Reviews (4): Last reviewed commit: "Remove temporary ingest architecture not..." | Re-trigger Greptile

Comment on lines 301 to +305
params: Optional[ExtractParams] = None,
*,
split_config: dict[str, Any] | None = None,
extraction_mode: str = "pdf",
extraction_mode: str | None = None,
text_params: Optional[TextChunkParams] = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Breaking default change on the public extract() contract

extraction_mode changed its default from "pdf" to None. Code that calls .extract() without an explicit extraction_mode and passes non-PDF/DOC files into the pipeline will now be auto-detected and routed to a different sub-graph. Per the api-backward-compatibility rule, changing an existing parameter default is a breaking change and should be noted in the changelog or docstring.

Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph_ingestor.py
Line: 301-305

Comment:
**Breaking default change on the public `extract()` contract**

`extraction_mode` changed its default from `"pdf"` to `None`. Code that calls `.extract()` without an explicit `extraction_mode` and passes non-PDF/DOC files into the pipeline will now be auto-detected and routed to a different sub-graph. Per the `api-backward-compatibility` rule, changing an existing parameter default is a breaking change and should be noted in the changelog or docstring.

**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested and routing is in place

Comment thread nemo_retriever/src/nemo_retriever/adapters/cli/sdk_workflow.py
@jperez999 jperez999 merged commit 26cf50c into NVIDIA:main May 20, 2026
6 checks passed
edknv pushed a commit that referenced this pull request May 21, 2026
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants