Skip to content

Migrate VLM to Nemotron Omni as default VLM#509

Merged
nv-nikkulkarni merged 5 commits into
developfrom
dev/nikkulkarni/nemotron-omni-rc-nim-integration
May 7, 2026
Merged

Migrate VLM to Nemotron Omni as default VLM#509
nv-nikkulkarni merged 5 commits into
developfrom
dev/nikkulkarni/nemotron-omni-rc-nim-integration

Conversation

@nv-nikkulkarni
Copy link
Copy Markdown
Collaborator

@nv-nikkulkarni nv-nikkulkarni commented May 5, 2026

  • Switch VLM from nemotron-nano-12b-v2-vl to Nemotron Omni GA image (nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant)
  • Update all documentation references from nemotron-nano-12b-v2-vl to Nemotron Omni; add section in change-model.md for switching back to Nemotron Nano 12B VLM
  • Update Helm values.yaml with Nemotron Omni image and model names

Checklist

  • I am familiar with the Contributing Guidelines.
  • All commits are signed-off (git commit -s) and GPG signed (git commit -S).
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

Summary by CodeRabbit

  • New Features

    • Upgraded default Vision‑Language Model to a reasoning-capable Nemotron variant.
    • Added thinking-mode controls: enable/disable thinking, token budget, and reasoning-token filtering.
  • Behavior Changes

    • VLM generation defaults adjusted (max tokens, temperature, top-p) and memory use increased for the VLM service.
    • Multimodal answer behavior tightened to prefer input evidence and avoid fabrication.
  • Documentation

    • Updated VLM, image-captioning, and multimodal query guides; added rollback instructions.
  • Tests / Chores

    • Unit tests aligned to new VLM message format; dependency pins updated.

@nv-nikkulkarni nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 2 times, most recently from 1f10480 to a9335ff Compare May 5, 2026 14:15
@nv-nikkulkarni
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@nv-nikkulkarni nv-nikkulkarni changed the title Migrate VLM to Nemotron Omni and add VLM Reranker support Migrate VLM to Nemotron Omni ad default VLM May 5, 2026
@nv-nikkulkarni nv-nikkulkarni changed the title Migrate VLM to Nemotron Omni ad default VLM Migrate VLM to Nemotron Omni as default VLM May 5, 2026
@nv-nikkulkarni nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 2 times, most recently from 9250e81 to 1d0e966 Compare May 5, 2026 14:19
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 5, 2026

Review Change Stack

Warning

.coderabbit.yaml has a parsing error

The CodeRabbit configuration file in this repository has a parsing error and default settings were used instead. Please fix the error(s) in the configuration file. You can initialize chat with CodeRabbit to get help with the configuration file.

💥 Parsing errors (1)
Validation error: Invalid regex pattern for base branch. Received: "release-**" at "reviews.auto_review.base_branches[2]"
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

The PR upgrades the default Vision-Language Model from Nemotron Nano 12B to Nemotron Omni across deployment and Helm configs, adds VLM reasoning/token-streaming controls, refactors VLM to an OpenAI-style async client with dict messages, updates system prompts, tests, and documentation.

Changes

VLM Model Upgrade & Reasoning Controls

Layer / File(s) Summary
Deployment & Helm defaults
deploy/compose/docker-compose-ingestor-server.yaml, deploy/compose/docker-compose-rag-server.yaml, deploy/compose/nims.yaml, deploy/compose/nvdev.env, deploy/helm/nvidia-blueprint-rag/values.yaml
Model defaults switched to nvidia/nemotron-3-nano-30b-a3b-omni-reasoning; vlm-ms image/container updated; shm_size increased to 32GB; APP_FETCH_FULL_PAGE_CONTEXT casing normalized; new env vars added: VLM_FILTER_THINK_TOKENS, APP_VLM_ENABLE_THINKING, APP_VLM_THINKING_TOKEN_BUDGET; VLM generation defaults adjusted.
Configuration Model
src/nvidia_rag/utils/configuration.py
VLMConfig default model_name updated and fields added: enable_thinking (env: APP_VLM_ENABLE_THINKING) and thinking_token_budget (env: APP_VLM_THINKING_TOKEN_BUDGET).
VLM Core Implementation
src/nvidia_rag/rag_server/vlm.py
Refactor to construct OpenAI-style dict messages and use an async OpenAI-compatible client (AsyncOpenAI); add _build_extra_body to inject thinking controls; add organize_by_page param; streaming honors VLM_FILTER_THINK_TOKENS to filter reasoning tokens.
VLM System Prompt
src/nvidia_rag/rag_server/prompt.yaml, deploy/helm/nvidia-blueprint-rag/files/prompt.yaml
vlm_template.system rewritten with detailed operational rules: use provided text/images as authoritative, forbid fabrication, require answering on any relevant overlap, preserve exact values, extract table/list data precisely, prefer image evidence when conflicting, return all distinct values for multi-part queries, and format yes/no answers answer-first with evidence.
Dependency Manifest
pyproject.toml
Add openai>=1.0,<2.0 to optional dependency groups rag, ingest, and all.
Documentation
docs/change-model.md, docs/image_captioning.md, docs/multimodal-query.md, docs/vlm.md
Update VLM model references to the new default; add "Switch Back to Nemotron Nano 12B VLM" rollback instructions; update examples and Helm snippets.
Tests
tests/unit/test_rag_server/test_vlm.py
Tests updated to assert dict-based messages and async client usage; added tests for _build_extra_body, analyze/streaming behavior (think-token filtering), and adjusted redaction/image tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

enhancement, documentation

Suggested reviewers

  • nv-pranjald
  • smasurekar
  • shubhadeepd

Poem

🐰 I hopped through configs, docs, and code,

swapped Nano for Omni on the road,
thinking tokens now dance in a stream,
prompts sharpened like a carrot dream,
hooray—models ponder and answers beam!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Migrate VLM to Nemotron Omni as default VLM' clearly and specifically summarizes the main change: switching the default Vision-Language Model from Nemotron Nano 12B to Nemotron Omni. It is concise and directly reflects the primary objective evident across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dev/nikkulkarni/nemotron-omni-rc-nim-integration

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/compose/nims.yaml`:
- Around line 333-334: Update documentation references that still mention the
old container name "nemo-vlm-microservice" to the new container name
"nemotron-3-nano-omni-30b-a3b-reasoning" in the listed docs; specifically edit
docs/image_captioning.md (replace occurrences around the captioning examples at
the earlier mentions) and docs/service-port-gpu-reference.md (replace the
service/container name usages and any related commands or docker-compose
examples) so all references match the new container_name value used in the
compose diff.

In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 163-175: The extra_body dict is being passed nested inside
model_kwargs which causes LangChain to treat it as a top-level field; update the
ChatOpenAI instantiation in vlm.py to pass extra_body directly as the extra_body
parameter (use the existing extra_body variable) instead of including it in
model_kwargs, i.e., remove "model_kwargs={'extra_body': extra_body}" and add
"extra_body=extra_body" to the ChatOpenAI call so ChatOpenAI receives
provider-specific fields (chat_template_kwargs, thinking_token_budget)
correctly.
- Around line 928-979: Add an integration or verification step to ensure the
NVIDIA reasoning delta is actually propagated into langchain-openai streaming
chunks: call vlm.astream (the same async streaming path used in VLM streaming)
against a real or staging Nemotron Omni endpoint with enable_thinking enabled
and assert that each received chunk has getattr(chunk, "additional_kwargs",
{}).get("reasoning") populated when expected; if the field is missing, fall back
to parsing chunk.content for reasoning markers or log a clear warning (include
VLM_FILTER_THINK_TOKENS and enable_thinking in the log) and update tests to
exercise this end-to-end behavior rather than only mocking
chunk.additional_kwargs.

In `@src/nvidia_rag/utils/configuration.py`:
- Around line 970-986: The VLM reasoning defaults (enable_thinking=True,
thinking_token_budget) are Omni-specific but the class still defaults
VLMConfig.model_name to the legacy Nano model; update VLMConfig.model_name
default from "nvidia/nemotron-nano-12b-v2-vl" to
"nvidia/nemotron-3-nano-30b-a3b-omni-reasoning" (and adjust its description if
present) so the default model matches enable_thinking=True, or alternatively set
enable_thinking default to False for legacy fallback—modify the
VLMConfig.model_name and/or enable_thinking fields accordingly in the class
where those attributes are defined.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cd9add50-4e54-4124-a0dd-c9a97f39fcbf

📥 Commits

Reviewing files that changed from the base of the PR and between 487b8c5 and a9335ff.

📒 Files selected for processing (12)
  • deploy/compose/docker-compose-ingestor-server.yaml
  • deploy/compose/docker-compose-rag-server.yaml
  • deploy/compose/nims.yaml
  • deploy/compose/nvdev.env
  • deploy/helm/nvidia-blueprint-rag/values.yaml
  • docs/change-model.md
  • docs/image_captioning.md
  • docs/multimodal-query.md
  • docs/vlm.md
  • src/nvidia_rag/rag_server/prompt.yaml
  • src/nvidia_rag/rag_server/vlm.py
  • src/nvidia_rag/utils/configuration.py

Comment thread deploy/compose/nims.yaml
Comment thread src/nvidia_rag/rag_server/vlm.py Outdated
Comment thread src/nvidia_rag/rag_server/vlm.py Outdated
Comment thread src/nvidia_rag/utils/configuration.py
@nv-nikkulkarni nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 1d0e966 to 197bcf1 Compare May 6, 2026 14:42
@NVIDIA-AI-Blueprints NVIDIA-AI-Blueprints deleted a comment from coderabbitai Bot May 7, 2026
@nv-nikkulkarni
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/nvidia_rag/utils/configuration.py (1)

946-986: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the Python config default migration to Omni.

VLMConfig.model_name switches here, but the in-code defaults still don't match the rest of this PR: NvIngestConfig.caption_model_name above is still nvidia/nemotron-nano-12b-v2-vl, and VLMConfig still falls back to the old 4096 / 0.7 / 1.0 generation defaults. Any NvidiaRAGConfig() path without Compose/Helm env overrides will run a mixed Nano/Omni setup and different sampling behavior from the documented default deployment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/utils/configuration.py` around lines 946 - 986, The config
defaults are inconsistent: NvIngestConfig.caption_model_name still uses the old
Nano model and VLMConfig.{temperature,top_p,max_tokens} remain at legacy values,
causing a mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to
use the same Omni model string as VLMConfig.model_name
("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change
VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the
PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env
defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without
overrides uses a consistent Omni model and sampling settings.
🧹 Nitpick comments (3)
tests/unit/test_rag_server/test_vlm.py (1)

74-87: ⚡ Quick win

Add return annotations to the newly added test methods.

These new test signatures are missing -> None, which breaks the repo-wide Python typing rule for changed code. As per coding guidelines, "All function signatures must include type hints."

Also applies to: 273-433

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_rag_server/test_vlm.py` around lines 74 - 87, The three test
functions test_build_extra_body_includes_thinking_budget_when_enabled,
test_build_extra_body_omits_thinking_budget_when_disabled, and
test_build_extra_body_omits_zero_budget should include explicit return type
annotations; update each signature to add "-> None" (e.g., def
test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to
satisfy the repository typing rule for changed code that all function signatures
include type hints; ensure you apply the same change to any other new/changed
test functions in the same file range (lines ~273-433) as indicated.
src/nvidia_rag/rag_server/vlm.py (2)

462-463: 💤 Low value

Add debug logging for consistency with _extract_images_from_docs.

Same pattern as above—silently swallowing exceptions makes debugging NRL image extraction issues difficult.

♻️ Proposed fix
-            except Exception:
+            except Exception as e:
+                logger.debug("Failed to extract NRL image from doc: %s", e)
                 continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/rag_server/vlm.py` around lines 462 - 463, The silent except
block (except Exception: continue) should log the caught exception for debugging
consistency with _extract_images_from_docs; update the handler in the loop that
swallows NRL image extraction errors to capture the exception (e) and call the
module/class logger (e.g., self.logger.debug or logger.debug) including the
exception info and relevant context (doc id, page/index, or image metadata) so
failures are recorded but still continue processing.

402-423: ⚡ Quick win

Add debug logging for exceptions and use source_meta to avoid potential AttributeError.

  1. Line 403: Re-accessing doc.metadata.get("source") could return None, causing AttributeError on the chained .get(). Since source_meta is already safely extracted at line 381, reuse it.
  2. Lines 422-423: The bare except silently swallows all exceptions without logging, making production debugging difficult when image extraction fails unexpectedly. Static analysis also flags this (Ruff S112, BLE001).
♻️ Proposed fix
             try:
-                source_location = doc.metadata.get("source").get("source_location")
+                source_location = source_meta.get("source_location")
                 if source_location:
                     raw_content = get_object_store_operator().get_object_from_uri(
                         source_location
                     )
                     content_b64 = base64.b64encode(raw_content).decode("ascii")
                 else:
                     content_b64 = ""
                 if not content_b64:
                     continue
                 png_b64 = VLM._convert_image_url_to_png_b64(content_b64)
                 parts.append(
                     {
                         "type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{png_b64}"},
                     }
                 )
                 if remaining_image_budget is not None:
                     remaining_image_budget -= 1
-            except Exception:
+            except Exception as e:
+                logger.debug("Failed to extract image from doc: %s", e)
                 continue
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/rag_server/vlm.py` around lines 402 - 423, Replace the unsafe
chained access and silent except in the image extraction block: reuse the
already-extracted source_meta (instead of re-calling doc.metadata.get("source"))
to obtain "source_location", call
get_object_store_operator().get_object_from_uri(...) and
VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except
Exception as e and log the error (e.g., logger.exception or logging.exception)
including the exception and context (doc id or source_location) before
continuing; keep updating parts and remaining_image_budget the same and ensure
you only skip on actual failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/change-model.md`:
- Around line 305-360: When rolling back to Nemotron Nano 12B also reset
Omni-specific "thinking" knobs: set APP_VLM_ENABLE_THINKING=false and
APP_VLM_THINKING_BUDGET=0 in the vlm-ms service environment (where
APP_VLM_MODELNAME and APP_NVINGEST_CAPTIONMODELNAME are set) and mirror those
values in the Helm values under nimOperator -> envVars (APP_VLM_MODELNAME),
ingestor-server -> envVars (APP_NVINGEST_CAPTIONMODELNAME) and nv-ingest ->
envVars (VLM_CAPTION_MODEL_NAME); additionally note to switch any client prompts
back from Omni-style "/no_think" usage if Nano requires a different prompt
format.

In `@docs/vlm.md`:
- Around line 61-62: Update the docs in docs/vlm.md to reflect the new Nemotron
Omni defaults: replace references to the old default mode `/no_think` with the
reasoning-enabled default (remove or mark `/no_think` as non-default) and update
the tuning baseline values from `8192 / 0.1 / 1.0` to the new deployment
defaults by citing the environment variables APP_VLM_ENABLE_THINKING=true,
APP_VLM_THINKING_TOKEN_BUDGET=16384, APP_VLM_MAX_TOKENS=32768,
APP_VLM_TEMPERATURE=0.6, and APP_VLM_TOP_P=0.95 (also update the repeated
sections called out in the comment).

---

Outside diff comments:
In `@src/nvidia_rag/utils/configuration.py`:
- Around line 946-986: The config defaults are inconsistent:
NvIngestConfig.caption_model_name still uses the old Nano model and
VLMConfig.{temperature,top_p,max_tokens} remain at legacy values, causing a
mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to use the
same Omni model string as VLMConfig.model_name
("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change
VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the
PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env
defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without
overrides uses a consistent Omni model and sampling settings.

---

Nitpick comments:
In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 462-463: The silent except block (except Exception: continue)
should log the caught exception for debugging consistency with
_extract_images_from_docs; update the handler in the loop that swallows NRL
image extraction errors to capture the exception (e) and call the module/class
logger (e.g., self.logger.debug or logger.debug) including the exception info
and relevant context (doc id, page/index, or image metadata) so failures are
recorded but still continue processing.
- Around line 402-423: Replace the unsafe chained access and silent except in
the image extraction block: reuse the already-extracted source_meta (instead of
re-calling doc.metadata.get("source")) to obtain "source_location", call
get_object_store_operator().get_object_from_uri(...) and
VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except
Exception as e and log the error (e.g., logger.exception or logging.exception)
including the exception and context (doc id or source_location) before
continuing; keep updating parts and remaining_image_budget the same and ensure
you only skip on actual failure.

In `@tests/unit/test_rag_server/test_vlm.py`:
- Around line 74-87: The three test functions
test_build_extra_body_includes_thinking_budget_when_enabled,
test_build_extra_body_omits_thinking_budget_when_disabled, and
test_build_extra_body_omits_zero_budget should include explicit return type
annotations; update each signature to add "-> None" (e.g., def
test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to
satisfy the repository typing rule for changed code that all function signatures
include type hints; ensure you apply the same change to any other new/changed
test functions in the same file range (lines ~273-433) as indicated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ea4bed47-5ffb-4208-adf3-5a75deed1c7b

📥 Commits

Reviewing files that changed from the base of the PR and between a9335ff and db25300.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • deploy/compose/docker-compose-ingestor-server.yaml
  • deploy/compose/docker-compose-rag-server.yaml
  • deploy/compose/nims.yaml
  • deploy/compose/nvdev.env
  • deploy/helm/nvidia-blueprint-rag/files/prompt.yaml
  • deploy/helm/nvidia-blueprint-rag/values.yaml
  • docs/change-model.md
  • docs/image_captioning.md
  • docs/multimodal-query.md
  • docs/vlm.md
  • pyproject.toml
  • src/nvidia_rag/rag_server/prompt.yaml
  • src/nvidia_rag/rag_server/vlm.py
  • src/nvidia_rag/utils/configuration.py
  • tests/unit/test_rag_server/test_vlm.py
✅ Files skipped from review due to trivial changes (5)
  • docs/image_captioning.md
  • docs/multimodal-query.md
  • deploy/compose/nvdev.env
  • deploy/compose/docker-compose-ingestor-server.yaml
  • src/nvidia_rag/rag_server/prompt.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • deploy/compose/nims.yaml

Comment thread docs/change-model.md
Comment thread docs/vlm.md Outdated
@nv-nikkulkarni nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 5 times, most recently from c9aab4c to 828b8d1 Compare May 7, 2026 11:30
- Switch VLM from nemotron-nano-12b-v2-vl to Nemotron Omni GA image
  (nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant)
- Update all documentation references from nemotron-nano-12b-v2-vl to
  Nemotron Omni; add section in change-model.md for switching back to
  Nemotron Nano 12B VLM
- Update Helm values.yaml with Nemotron Omni image and model names
Drops langchain_core/langchain_openai from rag_server/vlm.py in favor of
the openai AsyncOpenAI client. The langchain-openai 0.2.8 streaming-delta
parser only propagated function_call/tool_calls into additional_kwargs,
silently dropping the reasoning delta from Nemotron Omni; reading
choices[0].delta directly restores `reasoning` and `reasoning_content`
when VLM_FILTER_THINK_TOKENS=false.

Other changes:
- VLMConfig.model_name default flipped to
  nvidia/nemotron-3-nano-30b-a3b-omni-reasoning to match
  enable_thinking=True (was legacy nemotron-nano-12b-v2-vl).
- extra_body passed as a first-class kwarg rather than nested in
  model_kwargs.
- @trace_function decorators on _normalize_messages,
  extract_and_process_messages, _extract_images_from_docs(_nrl),
  invoke_model_async, and analyze_with_messages; inline traced_span
  around the stream_with_messages async-generator body.
- openai>=1.0,<2.0 added to rag/ingest/all optional-dependency groups
  (was only transitive via langchain-openai).
- Tests rewritten against dict-based message format and the AsyncOpenAI
  streaming shape; new coverage for reasoning-trace surfacing.

Signed-off-by: Nikhil Kulkarni <nikkulkarni@nvidia.com>
@nv-pranjald nv-pranjald force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 097ca9c to 828b8d1 Compare May 7, 2026 13:03
- vlm.md: replace /no_think//think route framing with APP_VLM_ENABLE_THINKING
  env var; set reasoning mode as default and update deployment defaults
  (TEMPERATURE=0.6, TOP_P=0.95, MAX_TOKENS=32768, THINKING_TOKEN_BUDGET=16384)
- change-model.md: add APP_VLM_ENABLE_THINKING=false and
  APP_VLM_THINKING_TOKEN_BUDGET=0 to the Nano 12B rollback steps for both
  Docker Compose and Helm so Omni-specific knobs are reset on model switch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nv-pranjald nv-pranjald force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 76b7b9c to 8fd621d Compare May 7, 2026 18:53
nv-nikkulkarni and others added 2 commits May 7, 2026 22:58
…usage

Three changes to the VLM streaming path:

1. Routed model + usage in the SSE response.
   - main.py _rag_chain: pass model=vlm_model_cfg (was model=, the LLM
     name) so VLM responses advertise the VLM model. Capture and forward
     vlm_token_usage to generate_answer_async on both VLM call sites
     (_rag_chain and _vlm_direct_chain).
   - vlm.py stream_with_messages: accept token_usage dict; request
     stream_options.include_usage=True from the OpenAI SDK; populate the
     dict from the final usage chunk. generate_answer_async then emits
     usage on the closing SSE chunk like the LLM path already does.

2. Reasoning sentinels for streamed chain-of-thought.
   - vlm.py: when filter_think_tokens=False, wrap each reasoning span
     with [reasoning]...[/reasoning] markers so clients can structurally
     separate deliberation from the final answer. Square-bracket form is
     used because Message.content is sanitised via bleach, which strips
     angle-bracket tags.

3. Per-request reasoning controls on the API.
   - configuration.py: add VLMConfig.filter_think_tokens
     (env=VLM_FILTER_THINK_TOKENS, default True) so the filter is a
     first-class config field instead of an os.getenv read inside vlm.py.
   - server.py Prompt: add vlm_enable_thinking, vlm_thinking_token_budget
     (ge=0, le=65536), vlm_filter_thinking_tokens. All default None, so
     omitted requests fall back to server-side config.
   - main.py generate(): accept the three params, forward them through
     vlm_settings to both VLM call sites.
   - vlm.py stream_with_messages: accept the three kwargs; per-request
     value (when not None) overrides the config default.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two filter-related stream_with_messages tests were written against the
pre-PR behavior where vlm.py read VLM_FILTER_THINK_TOKENS via os.getenv
and reasoning was streamed unwrapped. After moving the filter to
VLMConfig.filter_think_tokens and wrapping reasoning in
[reasoning]...[/reasoning] sentinels, both tests needed updating:

- Drop the now-ineffective monkeypatch.setenv() calls.
- Set self.mock_config.vlm.filter_think_tokens explicitly (True/False).
- Update the filter-disabled test's expected output to
  ["[reasoning]", "step1", "[/reasoning]", "answer"].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@nv-nikkulkarni nv-nikkulkarni merged commit 3cc5ee7 into develop May 7, 2026
5 of 6 checks passed
@nv-nikkulkarni nv-nikkulkarni deleted the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch May 7, 2026 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants