Skip to content

Helm nemotron parse#2092

Merged
jdye64 merged 3 commits into
NVIDIA:26.05from
jdye64:helm-nemotron-parse
May 22, 2026
Merged

Helm nemotron parse#2092
jdye64 merged 3 commits into
NVIDIA:26.05from
jdye64:helm-nemotron-parse

Conversation

@jdye64
Copy link
Copy Markdown
Collaborator

@jdye64 jdye64 commented May 22, 2026

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@jdye64 jdye64 requested review from a team as code owners May 22, 2026 00:26
@jdye64 jdye64 requested a review from edknv May 22, 2026 00:26
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jdye64 jdye64 changed the base branch from main to 26.05 May 22, 2026 00:27
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR updates Helm and library-mode configurations to support the Nemotron Parse v1.2 dual-path (legacy v1.0/v1.1 vs. newer) remote API and promotes the VL embedding model to the default across the stack.

  • nemotron_parse.py: Adds version detection logic to branch between the legacy tool-call payload (v1.0/v1.1 models, tools extra body, task_prompt=None) and the plain max_tokens payload with forwarded task_prompt for newer models.
  • model/__init__.py + text_embed/runtime.py: Promotes VL_EMBED_MODEL (nvidia/llama-nemotron-embed-vl-1b-v2) to the default embedding model, replacing the previous text-only default; tests updated accordingly.
  • client.py + service_ingestor.py: Adds early-exit guards that drop document SSE events with unexpected status values before they reach the event queue or handler loop.

Confidence Score: 4/5

Safe to merge with awareness that the default embedding model has changed to VL, which may affect existing deployments that relied on the text-only default.

The most impactful change is the silent promotion of VL_EMBED_MODEL as the default — callers that pass model_name=None will now get a heavier model with different GPU footprint and embedding dimensions, which could cause incompatible vector indices in existing deployments. The nemotron-parse branching logic works correctly for the current known model names, and the client/ingestor status guards are safe no-ops or conservative improvements.

nemo_retriever/src/nemo_retriever/model/init.py and nemo_retriever/src/nemo_retriever/text_embed/runtime.py carry the default model switch; any deployment using the previous text-only default without an explicit model_name argument will be silently affected.

Important Files Changed

Filename Overview
ci/scripts/validate_deployment_configs.py Removes VLM service mapping — VLM config parity between docker-compose and Helm values will no longer be validated if the service still exists in docker-compose.yaml.
nemo_retriever/src/nemo_retriever/model/init.py Changes _DEFAULT_EMBED_MODEL from the text-only embed-1b-v2 to VL_EMBED_MODEL (embed-vl-1b-v2), a breaking default that silently switches callers passing model_name=None to a heavier VL model with different dimensions and dependency requirements.
nemo_retriever/src/nemo_retriever/parse/nemotron_parse.py Adds version-based branching (legacy v1.0/v1.1 vs newer) for chat-completions payload construction: legacy models receive tools+max_tokens and task_prompt=None; non-legacy models receive plain max_tokens with task_prompt forwarded. Detection uses substring matching on the model name.
nemo_retriever/src/nemo_retriever/service/client.py Adds an early-return guard in _on_sse_event to discard document events with statuses other than "completed" or "failed", preventing unexpected statuses from reaching the event queue.
nemo_retriever/src/nemo_retriever/service_ingestor.py Adds a symmetric status guard in the document_complete event handler — functionally unreachable in practice since client.py already filters non-completed/failed statuses before enqueuing.
nemo_retriever/src/nemo_retriever/text_embed/runtime.py Updates the secondary fallback in _embed_group to VL_EMBED_MODEL, consistent with the init.py default change; carries the same silent model-switch impact for any code path where resolved_model_name is falsy.
nemo_retriever/tests/test_create_local_embedder.py Test suite updated to match the new VL default: mock fixtures and assertions repointed from text-model mocks to VL-model mocks; normalize/max_length kwarg assertions removed since the VL path doesn't forward them.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nemotron_parse_pages called] --> B{use_remote?}
    B -- No --> C[Local vLLM model\ntask_prompt forwarded]
    B -- Yes --> D{invoke_url contains\n/v1/chat/completions?}
    D -- No --> E[invoke_image_inference_batches\nlegacy image API]
    D -- Yes --> F[Resolve model name]
    F --> G{_is_legacy?\nsubstring v1.0/v1.1/v1_0/v1_1}
    G -- Yes --> H[extra_body: tools + max_tokens\ntask_prompt = None\nused_v1_api = True]
    G -- No --> I[extra_body: max_tokens only\ntask_prompt forwarded]
    H --> J[invoke_chat_completions_images]
    I --> J
    J --> K{nim_client provided?}
    K -- Yes --> L[nim_client.invoke_chat_completions_images]
    K -- No --> M[standalone invoke_chat_completions_images\nwith thread pool]
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
ci/scripts/validate_deployment_configs.py:48-58
**VLM service silently removed from deployment validation**

The `"vlm": "nemotron_nano_12b_v2_vl"` entry has been removed from `SERVICE_MAPPING`. On line 149 the script uses `if service_name not in SERVICE_MAPPING: continue`, so any `vlm` service still present in `docker-compose.yaml` will now be silently skipped — image-tag or environment-variable mismatches between Compose and Helm for that service will no longer be reported. Is VLM intentionally being dropped from the deployment, or should the mapping remain for coverage?

Reviews (3): Last reviewed commit: "test fixes" | Re-trigger Greptile

}

_DEFAULT_EMBED_MODEL = "nvidia/llama-nemotron-embed-1b-v2"
_DEFAULT_EMBED_MODEL = VL_EMBED_MODEL
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent breaking change to default embed model

_DEFAULT_EMBED_MODEL is now VL_EMBED_MODEL (nvidia/llama-nemotron-embed-vl-1b-v2), replacing the previous default of nvidia/llama-nemotron-embed-1b-v2. Any caller that passes model_name=None — including resolve_embed_model(None) and create_local_embedder(model_name=None) — will silently get the VL model, which has different dependency requirements (VLM packages), different GPU memory footprint, and different embedding dimensions than the text-only model. Existing pipelines that relied on the default will break or produce incompatible embeddings without any warning. Per the public-api-contract rule, this is a breaking change that requires a deprecation cycle or at minimum a clear release-note callout. The same implicit break is carried into text_embed/runtime.py line 78 where the fallback also changes to VL_EMBED_MODEL.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/__init__.py
Line: 39

Comment:
**Silent breaking change to default embed model**

`_DEFAULT_EMBED_MODEL` is now `VL_EMBED_MODEL` (`nvidia/llama-nemotron-embed-vl-1b-v2`), replacing the previous default of `nvidia/llama-nemotron-embed-1b-v2`. Any caller that passes `model_name=None` — including `resolve_embed_model(None)` and `create_local_embedder(model_name=None)` — will silently get the VL model, which has different dependency requirements (VLM packages), different GPU memory footprint, and different embedding dimensions than the text-only model. Existing pipelines that relied on the default will break or produce incompatible embeddings without any warning. Per the `public-api-contract` rule, this is a breaking change that requires a deprecation cycle or at minimum a clear release-note callout. The same implicit break is carried into `text_embed/runtime.py` line 78 where the fallback also changes to `VL_EMBED_MODEL`.

How can I resolve this? If you propose a fix, please make it concise.

if "/v1/chat/completions" in invoke_url:
used_v1_api = True
_model_name = nemotron_parse_model or NEMOTRON_PARSE_REMOTE_DEFAULT_MODEL
_is_legacy = any(v in _model_name.lower() for v in ("v1.0", "v1.1", "v1_0", "v1_1"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile model version detection via substring match

_is_legacy is determined by checking whether the lowercased model name contains "v1.0", "v1.1", "v1_0", or "v1_1" as a substring. A future model named, for example, "nvidia/nemotron-parse-v2-v1.1-finetuned" or a path like /models/v1.0-backup/nemotron-parse-v2 would be incorrectly classified as legacy, silently injecting the tool-call extra_body and suppressing task_prompt. A structured version comparison — parsing the version token from the model name, or using an explicit allowlist of known legacy model IDs — would be more robust.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/parse/nemotron_parse.py
Line: 317

Comment:
**Fragile model version detection via substring match**

`_is_legacy` is determined by checking whether the lowercased model name contains `"v1.0"`, `"v1.1"`, `"v1_0"`, or `"v1_1"` as a substring. A future model named, for example, `"nvidia/nemotron-parse-v2-v1.1-finetuned"` or a path like `/models/v1.0-backup/nemotron-parse-v2` would be incorrectly classified as legacy, silently injecting the tool-call `extra_body` and suppressing `task_prompt`. A structured version comparison — parsing the version token from the model name, or using an explicit allowlist of known legacy model IDs — would be more robust.

How can I resolve this? If you propose a fix, please make it concise.

}

_DEFAULT_EMBED_MODEL = "nvidia/llama-nemotron-embed-1b-v2"
_DEFAULT_EMBED_MODEL = VL_EMBED_MODEL
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent breaking change to the default embedding model

Every caller that omits model_name (i.e. passes None or "") now silently receives nvidia/llama-nemotron-embed-vl-1b-v2 instead of nvidia/llama-nemotron-embed-1b-v2. The VL model may produce embeddings with a different hidden dimension and requires different GPU memory, so any existing index built against the text-only default will silently produce mismatched vectors on lookup. This change lacks a deprecation warning and migration path, which violates the backward-compatibility contract for library users who relied on the previous default.

Rule Used: Changes to public API surfaces (FastAPI endpoints,... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/model/__init__.py
Line: 39

Comment:
**Silent breaking change to the default embedding model**

Every caller that omits `model_name` (i.e. passes `None` or `""`) now silently receives `nvidia/llama-nemotron-embed-vl-1b-v2` instead of `nvidia/llama-nemotron-embed-1b-v2`. The VL model may produce embeddings with a different hidden dimension and requires different GPU memory, so any existing index built against the text-only default will silently produce mismatched vectors on lookup. This change lacks a deprecation warning and migration path, which violates the backward-compatibility contract for library users who relied on the previous default.

**Rule Used:** Changes to public API surfaces (FastAPI endpoints,... ([source](https://app.greptile.com/review/custom-context?memory=api-backward-compatibility))

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +620 to +621
if status not in ("completed", "failed"):
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unexpected document statuses dropped without logging

When the server emits a document event with a status value that is neither "completed" nor "failed", the event is silently discarded. If the server introduces a new intermediate status (e.g. "processing", "partial") or emits an unexpected value due to a bug, the document will disappear from the client's accounting with no diagnostic trace. A logger.debug or logger.warning here would preserve observability without changing the filtering behaviour.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/service/client.py
Line: 620-621

Comment:
**Unexpected document statuses dropped without logging**

When the server emits a document event with a status value that is neither `"completed"` nor `"failed"`, the event is silently discarded. If the server introduces a new intermediate status (e.g. `"processing"`, `"partial"`) or emits an unexpected value due to a bug, the document will disappear from the client's accounting with no diagnostic trace. A `logger.debug` or `logger.warning` here would preserve observability without changing the filtering behaviour.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +974 to +975
if status not in ("completed", "failed"):
continue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant status guard — unreachable in practice

client.py's _on_sse_event already returns early (and never calls event_queue.put_nowait) when the status is not "completed" or "failed", so no document_complete event with another status can reach this branch. The guard is dead code in the current architecture. If the intent is purely defensive, a comment explaining that would help future readers; otherwise it can be removed to keep the event loop clear.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/service_ingestor.py
Line: 974-975

Comment:
**Redundant status guard — unreachable in practice**

`client.py`'s `_on_sse_event` already returns early (and never calls `event_queue.put_nowait`) when the status is not `"completed"` or `"failed"`, so no `document_complete` event with another status can reach this branch. The guard is dead code in the current architecture. If the intent is purely defensive, a comment explaining that would help future readers; otherwise it can be removed to keep the event loop clear.

How can I resolve this? If you propose a fix, please make it concise.

if "/v1/chat/completions" in invoke_url:
used_v1_api = True
_model_name = nemotron_parse_model or NEMOTRON_PARSE_REMOTE_DEFAULT_MODEL
_is_legacy = any(v in _model_name.lower() for v in ("v1.0", "v1.1", "v1_0", "v1_1"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Version detection relies on substring matching

_is_legacy is determined by whether the model name contains "v1.0", "v1.1", "v1_0", or "v1_1" as a substring. A model named e.g. "my-org/nemotron-parse-v1.0-finetuned" would be correctly flagged, but so would any future model whose path or organisation name happens to contain those strings. An explicit allowlist of known legacy model IDs (or a version-prefixed constant) would be more robust and easier to document.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/parse/nemotron_parse.py
Line: 317

Comment:
**Version detection relies on substring matching**

`_is_legacy` is determined by whether the model name contains `"v1.0"`, `"v1.1"`, `"v1_0"`, or `"v1_1"` as a substring. A model named e.g. `"my-org/nemotron-parse-v1.0-finetuned"` would be correctly flagged, but so would any future model whose path or organisation name happens to contain those strings. An explicit allowlist of known legacy model IDs (or a version-prefixed constant) would be more robust and easier to document.

How can I resolve this? If you propose a fix, please make it concise.

@jdye64 jdye64 merged commit 63dfa90 into NVIDIA:26.05 May 22, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants