Migrate VLM to Nemotron Omni as default VLM by nv-nikkulkarni · Pull Request #509 · NVIDIA-AI-Blueprints/rag

nv-nikkulkarni · 2026-05-05T14:10:31Z

Switch VLM from nemotron-nano-12b-v2-vl to Nemotron Omni GA image (nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant)
Update all documentation references from nemotron-nano-12b-v2-vl to Nemotron Omni; add section in change-model.md for switching back to Nemotron Nano 12B VLM
Update Helm values.yaml with Nemotron Omni image and model names

Checklist

I am familiar with the Contributing Guidelines.
All commits are signed-off (git commit -s) and GPG signed (git commit -S).
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

Summary by CodeRabbit

New Features
- Upgraded default Vision‑Language Model to a reasoning-capable Nemotron variant.
- Added thinking-mode controls: enable/disable thinking, token budget, and reasoning-token filtering.
Behavior Changes
- VLM generation defaults adjusted (max tokens, temperature, top-p) and memory use increased for the VLM service.
- Multimodal answer behavior tightened to prefer input evidence and avoid fabrication.
Documentation
- Updated VLM, image-captioning, and multimodal query guides; added rollback instructions.
Tests / Chores
- Unit tests aligned to new VLM message format; dependency pins updated.

nv-nikkulkarni · 2026-05-05T14:16:17Z

@coderabbitai review

coderabbitai · 2026-05-05T14:16:22Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-05T14:25:25Z

Warning

`.coderabbit.yaml` has a parsing error

The CodeRabbit configuration file in this repository has a parsing error and default settings were used instead. Please fix the error(s) in the configuration file. You can initialize chat with CodeRabbit to get help with the configuration file.

💥 Parsing errors (1)

Validation error: Invalid regex pattern for base branch. Received: "release-**" at "reviews.auto_review.base_branches[2]"

⚙️ Configuration instructions

Please see the configuration documentation for more information.
You can also validate your configuration using the online YAML validator.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

The PR upgrades the default Vision-Language Model from Nemotron Nano 12B to Nemotron Omni across deployment and Helm configs, adds VLM reasoning/token-streaming controls, refactors VLM to an OpenAI-style async client with dict messages, updates system prompts, tests, and documentation.

Changes

VLM Model Upgrade & Reasoning Controls

Layer / File(s)	Summary
Deployment & Helm defaults `deploy/compose/docker-compose-ingestor-server.yaml`, `deploy/compose/docker-compose-rag-server.yaml`, `deploy/compose/nims.yaml`, `deploy/compose/nvdev.env`, `deploy/helm/nvidia-blueprint-rag/values.yaml`	Model defaults switched to `nvidia/nemotron-3-nano-30b-a3b-omni-reasoning`; `vlm-ms` image/container updated; `shm_size` increased to `32GB`; `APP_FETCH_FULL_PAGE_CONTEXT` casing normalized; new env vars added: `VLM_FILTER_THINK_TOKENS`, `APP_VLM_ENABLE_THINKING`, `APP_VLM_THINKING_TOKEN_BUDGET`; VLM generation defaults adjusted.
Configuration Model `src/nvidia_rag/utils/configuration.py`	`VLMConfig` default `model_name` updated and fields added: `enable_thinking` (env: `APP_VLM_ENABLE_THINKING`) and `thinking_token_budget` (env: `APP_VLM_THINKING_TOKEN_BUDGET`).
VLM Core Implementation `src/nvidia_rag/rag_server/vlm.py`	Refactor to construct OpenAI-style dict messages and use an async OpenAI-compatible client (`AsyncOpenAI`); add `_build_extra_body` to inject thinking controls; add `organize_by_page` param; streaming honors `VLM_FILTER_THINK_TOKENS` to filter reasoning tokens.
VLM System Prompt `src/nvidia_rag/rag_server/prompt.yaml`, `deploy/helm/nvidia-blueprint-rag/files/prompt.yaml`	`vlm_template.system` rewritten with detailed operational rules: use provided text/images as authoritative, forbid fabrication, require answering on any relevant overlap, preserve exact values, extract table/list data precisely, prefer image evidence when conflicting, return all distinct values for multi-part queries, and format yes/no answers answer-first with evidence.
Dependency Manifest `pyproject.toml`	Add `openai>=1.0,<2.0` to optional dependency groups `rag`, `ingest`, and `all`.
Documentation `docs/change-model.md`, `docs/image_captioning.md`, `docs/multimodal-query.md`, `docs/vlm.md`	Update VLM model references to the new default; add "Switch Back to Nemotron Nano 12B VLM" rollback instructions; update examples and Helm snippets.
Tests `tests/unit/test_rag_server/test_vlm.py`	Tests updated to assert dict-based messages and async client usage; added tests for `_build_extra_body`, analyze/streaming behavior (think-token filtering), and adjusted redaction/image tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA-AI-Blueprints/rag#105: Related async conversions and VLM changes.
NVIDIA-AI-Blueprints/rag#145: Related VLM code and configuration adjustments.
NVIDIA-AI-Blueprints/rag#72: Related VLM configuration and reasoning-control work.

Suggested labels

enhancement, documentation

Suggested reviewers

nv-pranjald
smasurekar
shubhadeepd

Poem

🐰 I hopped through configs, docs, and code,

swapped Nano for Omni on the road,
thinking tokens now dance in a stream,
prompts sharpened like a carrot dream,
hooray—models ponder and answers beam!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Migrate VLM to Nemotron Omni as default VLM' clearly and specifically summarizes the main change: switching the default Vision-Language Model from Nemotron Nano 12B to Nemotron Omni. It is concise and directly reflects the primary objective evident across all modified files.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dev/nikkulkarni/nemotron-omni-rc-nim-integration

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deploy/compose/nims.yaml`:
- Around line 333-334: Update documentation references that still mention the
old container name "nemo-vlm-microservice" to the new container name
"nemotron-3-nano-omni-30b-a3b-reasoning" in the listed docs; specifically edit
docs/image_captioning.md (replace occurrences around the captioning examples at
the earlier mentions) and docs/service-port-gpu-reference.md (replace the
service/container name usages and any related commands or docker-compose
examples) so all references match the new container_name value used in the
compose diff.

In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 163-175: The extra_body dict is being passed nested inside
model_kwargs which causes LangChain to treat it as a top-level field; update the
ChatOpenAI instantiation in vlm.py to pass extra_body directly as the extra_body
parameter (use the existing extra_body variable) instead of including it in
model_kwargs, i.e., remove "model_kwargs={'extra_body': extra_body}" and add
"extra_body=extra_body" to the ChatOpenAI call so ChatOpenAI receives
provider-specific fields (chat_template_kwargs, thinking_token_budget)
correctly.
- Around line 928-979: Add an integration or verification step to ensure the
NVIDIA reasoning delta is actually propagated into langchain-openai streaming
chunks: call vlm.astream (the same async streaming path used in VLM streaming)
against a real or staging Nemotron Omni endpoint with enable_thinking enabled
and assert that each received chunk has getattr(chunk, "additional_kwargs",
{}).get("reasoning") populated when expected; if the field is missing, fall back
to parsing chunk.content for reasoning markers or log a clear warning (include
VLM_FILTER_THINK_TOKENS and enable_thinking in the log) and update tests to
exercise this end-to-end behavior rather than only mocking
chunk.additional_kwargs.

In `@src/nvidia_rag/utils/configuration.py`:
- Around line 970-986: The VLM reasoning defaults (enable_thinking=True,
thinking_token_budget) are Omni-specific but the class still defaults
VLMConfig.model_name to the legacy Nano model; update VLMConfig.model_name
default from "nvidia/nemotron-nano-12b-v2-vl" to
"nvidia/nemotron-3-nano-30b-a3b-omni-reasoning" (and adjust its description if
present) so the default model matches enable_thinking=True, or alternatively set
enable_thinking default to False for legacy fallback—modify the
VLMConfig.model_name and/or enable_thinking fields accordingly in the class
where those attributes are defined.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cd9add50-4e54-4124-a0dd-c9a97f39fcbf

📥 Commits

Reviewing files that changed from the base of the PR and between 487b8c5 and a9335ff.

📒 Files selected for processing (12)

deploy/compose/docker-compose-ingestor-server.yaml
deploy/compose/docker-compose-rag-server.yaml
deploy/compose/nims.yaml
deploy/compose/nvdev.env
deploy/helm/nvidia-blueprint-rag/values.yaml
docs/change-model.md
docs/image_captioning.md
docs/multimodal-query.md
docs/vlm.md
src/nvidia_rag/rag_server/prompt.yaml
src/nvidia_rag/rag_server/vlm.py
src/nvidia_rag/utils/configuration.py

nv-nikkulkarni · 2026-05-07T05:25:14Z

@coderabbitai review

coderabbitai · 2026-05-07T06:43:36Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/nvidia_rag/utils/configuration.py (1)
946-986: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the Python config default migration to Omni.

VLMConfig.model_name switches here, but the in-code defaults still don't match the rest of this PR: NvIngestConfig.caption_model_name above is still nvidia/nemotron-nano-12b-v2-vl, and VLMConfig still falls back to the old 4096 / 0.7 / 1.0 generation defaults. Any NvidiaRAGConfig() path without Compose/Helm env overrides will run a mixed Nano/Omni setup and different sampling behavior from the documented default deployment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/utils/configuration.py` around lines 946 - 986, The config
defaults are inconsistent: NvIngestConfig.caption_model_name still uses the old
Nano model and VLMConfig.{temperature,top_p,max_tokens} remain at legacy values,
causing a mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to
use the same Omni model string as VLMConfig.model_name
("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change
VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the
PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env
defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without
overrides uses a consistent Omni model and sampling settings.

🧹 Nitpick comments (3)

tests/unit/test_rag_server/test_vlm.py (1)

74-87: ⚡ Quick win

Add return annotations to the newly added test methods.

These new test signatures are missing -> None, which breaks the repo-wide Python typing rule for changed code. As per coding guidelines, "All function signatures must include type hints."

Also applies to: 273-433

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_rag_server/test_vlm.py` around lines 74 - 87, The three test
functions test_build_extra_body_includes_thinking_budget_when_enabled,
test_build_extra_body_omits_thinking_budget_when_disabled, and
test_build_extra_body_omits_zero_budget should include explicit return type
annotations; update each signature to add "-> None" (e.g., def
test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to
satisfy the repository typing rule for changed code that all function signatures
include type hints; ensure you apply the same change to any other new/changed
test functions in the same file range (lines ~273-433) as indicated.

src/nvidia_rag/rag_server/vlm.py (2)

462-463: 💤 Low value

Add debug logging for consistency with _extract_images_from_docs.

Same pattern as above—silently swallowing exceptions makes debugging NRL image extraction issues difficult.

♻️ Proposed fix

-            except Exception:
+            except Exception as e:
+                logger.debug("Failed to extract NRL image from doc: %s", e)
                 continue

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/rag_server/vlm.py` around lines 462 - 463, The silent except
block (except Exception: continue) should log the caught exception for debugging
consistency with _extract_images_from_docs; update the handler in the loop that
swallows NRL image extraction errors to capture the exception (e) and call the
module/class logger (e.g., self.logger.debug or logger.debug) including the
exception info and relevant context (doc id, page/index, or image metadata) so
failures are recorded but still continue processing.

402-423: ⚡ Quick win

Add debug logging for exceptions and use source_meta to avoid potential AttributeError.

Line 403: Re-accessing doc.metadata.get("source") could return None, causing AttributeError on the chained .get(). Since source_meta is already safely extracted at line 381, reuse it.
Lines 422-423: The bare except silently swallows all exceptions without logging, making production debugging difficult when image extraction fails unexpectedly. Static analysis also flags this (Ruff S112, BLE001).

♻️ Proposed fix

             try:
-                source_location = doc.metadata.get("source").get("source_location")
+                source_location = source_meta.get("source_location")
                 if source_location:
                     raw_content = get_object_store_operator().get_object_from_uri(
                         source_location
                     )
                     content_b64 = base64.b64encode(raw_content).decode("ascii")
                 else:
                     content_b64 = ""
                 if not content_b64:
                     continue
                 png_b64 = VLM._convert_image_url_to_png_b64(content_b64)
                 parts.append(
                     {
                         "type": "image_url",
                         "image_url": {"url": f"data:image/png;base64,{png_b64}"},
                     }
                 )
                 if remaining_image_budget is not None:
                     remaining_image_budget -= 1
-            except Exception:
+            except Exception as e:
+                logger.debug("Failed to extract image from doc: %s", e)
                 continue

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nvidia_rag/rag_server/vlm.py` around lines 402 - 423, Replace the unsafe
chained access and silent except in the image extraction block: reuse the
already-extracted source_meta (instead of re-calling doc.metadata.get("source"))
to obtain "source_location", call
get_object_store_operator().get_object_from_uri(...) and
VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except
Exception as e and log the error (e.g., logger.exception or logging.exception)
including the exception and context (doc id or source_location) before
continuing; keep updating parts and remaining_image_budget the same and ensure
you only skip on actual failure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/change-model.md`:
- Around line 305-360: When rolling back to Nemotron Nano 12B also reset
Omni-specific "thinking" knobs: set APP_VLM_ENABLE_THINKING=false and
APP_VLM_THINKING_BUDGET=0 in the vlm-ms service environment (where
APP_VLM_MODELNAME and APP_NVINGEST_CAPTIONMODELNAME are set) and mirror those
values in the Helm values under nimOperator -> envVars (APP_VLM_MODELNAME),
ingestor-server -> envVars (APP_NVINGEST_CAPTIONMODELNAME) and nv-ingest ->
envVars (VLM_CAPTION_MODEL_NAME); additionally note to switch any client prompts
back from Omni-style "/no_think" usage if Nano requires a different prompt
format.

In `@docs/vlm.md`:
- Around line 61-62: Update the docs in docs/vlm.md to reflect the new Nemotron
Omni defaults: replace references to the old default mode `/no_think` with the
reasoning-enabled default (remove or mark `/no_think` as non-default) and update
the tuning baseline values from `8192 / 0.1 / 1.0` to the new deployment
defaults by citing the environment variables APP_VLM_ENABLE_THINKING=true,
APP_VLM_THINKING_TOKEN_BUDGET=16384, APP_VLM_MAX_TOKENS=32768,
APP_VLM_TEMPERATURE=0.6, and APP_VLM_TOP_P=0.95 (also update the repeated
sections called out in the comment).

---

Outside diff comments:
In `@src/nvidia_rag/utils/configuration.py`:
- Around line 946-986: The config defaults are inconsistent:
NvIngestConfig.caption_model_name still uses the old Nano model and
VLMConfig.{temperature,top_p,max_tokens} remain at legacy values, causing a
mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to use the
same Omni model string as VLMConfig.model_name
("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change
VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the
PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env
defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without
overrides uses a consistent Omni model and sampling settings.

---

Nitpick comments:
In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 462-463: The silent except block (except Exception: continue)
should log the caught exception for debugging consistency with
_extract_images_from_docs; update the handler in the loop that swallows NRL
image extraction errors to capture the exception (e) and call the module/class
logger (e.g., self.logger.debug or logger.debug) including the exception info
and relevant context (doc id, page/index, or image metadata) so failures are
recorded but still continue processing.
- Around line 402-423: Replace the unsafe chained access and silent except in
the image extraction block: reuse the already-extracted source_meta (instead of
re-calling doc.metadata.get("source")) to obtain "source_location", call
get_object_store_operator().get_object_from_uri(...) and
VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except
Exception as e and log the error (e.g., logger.exception or logging.exception)
including the exception and context (doc id or source_location) before
continuing; keep updating parts and remaining_image_budget the same and ensure
you only skip on actual failure.

In `@tests/unit/test_rag_server/test_vlm.py`:
- Around line 74-87: The three test functions
test_build_extra_body_includes_thinking_budget_when_enabled,
test_build_extra_body_omits_thinking_budget_when_disabled, and
test_build_extra_body_omits_zero_budget should include explicit return type
annotations; update each signature to add "-> None" (e.g., def
test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to
satisfy the repository typing rule for changed code that all function signatures
include type hints; ensure you apply the same change to any other new/changed
test functions in the same file range (lines ~273-433) as indicated.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: ea4bed47-5ffb-4208-adf3-5a75deed1c7b

📥 Commits

Reviewing files that changed from the base of the PR and between a9335ff and db25300.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (15)

deploy/compose/docker-compose-ingestor-server.yaml
deploy/compose/docker-compose-rag-server.yaml
deploy/compose/nims.yaml
deploy/compose/nvdev.env
deploy/helm/nvidia-blueprint-rag/files/prompt.yaml
deploy/helm/nvidia-blueprint-rag/values.yaml
docs/change-model.md
docs/image_captioning.md
docs/multimodal-query.md
docs/vlm.md
pyproject.toml
src/nvidia_rag/rag_server/prompt.yaml
src/nvidia_rag/rag_server/vlm.py
src/nvidia_rag/utils/configuration.py
tests/unit/test_rag_server/test_vlm.py

✅ Files skipped from review due to trivial changes (5)

docs/image_captioning.md
docs/multimodal-query.md
deploy/compose/nvdev.env
deploy/compose/docker-compose-ingestor-server.yaml
src/nvidia_rag/rag_server/prompt.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

deploy/compose/nims.yaml

- Switch VLM from nemotron-nano-12b-v2-vl to Nemotron Omni GA image (nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant) - Update all documentation references from nemotron-nano-12b-v2-vl to Nemotron Omni; add section in change-model.md for switching back to Nemotron Nano 12B VLM - Update Helm values.yaml with Nemotron Omni image and model names

Drops langchain_core/langchain_openai from rag_server/vlm.py in favor of the openai AsyncOpenAI client. The langchain-openai 0.2.8 streaming-delta parser only propagated function_call/tool_calls into additional_kwargs, silently dropping the reasoning delta from Nemotron Omni; reading choices[0].delta directly restores `reasoning` and `reasoning_content` when VLM_FILTER_THINK_TOKENS=false. Other changes: - VLMConfig.model_name default flipped to nvidia/nemotron-3-nano-30b-a3b-omni-reasoning to match enable_thinking=True (was legacy nemotron-nano-12b-v2-vl). - extra_body passed as a first-class kwarg rather than nested in model_kwargs. - @trace_function decorators on _normalize_messages, extract_and_process_messages, _extract_images_from_docs(_nrl), invoke_model_async, and analyze_with_messages; inline traced_span around the stream_with_messages async-generator body. - openai>=1.0,<2.0 added to rag/ingest/all optional-dependency groups (was only transitive via langchain-openai). - Tests rewritten against dict-based message format and the AsyncOpenAI streaming shape; new coverage for reasoning-trace surfacing. Signed-off-by: Nikhil Kulkarni <nikkulkarni@nvidia.com>

- vlm.md: replace /no_think//think route framing with APP_VLM_ENABLE_THINKING env var; set reasoning mode as default and update deployment defaults (TEMPERATURE=0.6, TOP_P=0.95, MAX_TOKENS=32768, THINKING_TOKEN_BUDGET=16384) - change-model.md: add APP_VLM_ENABLE_THINKING=false and APP_VLM_THINKING_TOKEN_BUDGET=0 to the Nano 12B rollback steps for both Docker Compose and Helm so Omni-specific knobs are reset on model switch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…usage Three changes to the VLM streaming path: 1. Routed model + usage in the SSE response. - main.py _rag_chain: pass model=vlm_model_cfg (was model=, the LLM name) so VLM responses advertise the VLM model. Capture and forward vlm_token_usage to generate_answer_async on both VLM call sites (_rag_chain and _vlm_direct_chain). - vlm.py stream_with_messages: accept token_usage dict; request stream_options.include_usage=True from the OpenAI SDK; populate the dict from the final usage chunk. generate_answer_async then emits usage on the closing SSE chunk like the LLM path already does. 2. Reasoning sentinels for streamed chain-of-thought. - vlm.py: when filter_think_tokens=False, wrap each reasoning span with [reasoning]...[/reasoning] markers so clients can structurally separate deliberation from the final answer. Square-bracket form is used because Message.content is sanitised via bleach, which strips angle-bracket tags. 3. Per-request reasoning controls on the API. - configuration.py: add VLMConfig.filter_think_tokens (env=VLM_FILTER_THINK_TOKENS, default True) so the filter is a first-class config field instead of an os.getenv read inside vlm.py. - server.py Prompt: add vlm_enable_thinking, vlm_thinking_token_budget (ge=0, le=65536), vlm_filter_thinking_tokens. All default None, so omitted requests fall back to server-side config. - main.py generate(): accept the three params, forward them through vlm_settings to both VLM call sites. - vlm.py stream_with_messages: accept the three kwargs; per-request value (when not None) overrides the config default. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The two filter-related stream_with_messages tests were written against the pre-PR behavior where vlm.py read VLM_FILTER_THINK_TOKENS via os.getenv and reasoning was streamed unwrapped. After moving the filter to VLMConfig.filter_think_tokens and wrapping reasoning in [reasoning]...[/reasoning] sentinels, both tests needed updating: - Drop the now-ineffective monkeypatch.setenv() calls. - Set self.mock_config.vlm.filter_think_tokens explicitly (True/False). - Update the filter-disabled test's expected output to ["[reasoning]", "step1", "[/reasoning]", "answer"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 2 times, most recently from 1f10480 to a9335ff Compare May 5, 2026 14:15

nv-nikkulkarni changed the title ~~Migrate VLM to Nemotron Omni and add VLM Reranker support~~ Migrate VLM to Nemotron Omni ad default VLM May 5, 2026

nv-nikkulkarni changed the title ~~Migrate VLM to Nemotron Omni ad default VLM~~ Migrate VLM to Nemotron Omni as default VLM May 5, 2026

nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 2 times, most recently from 9250e81 to 1d0e966 Compare May 5, 2026 14:19

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Comment thread deploy/compose/nims.yaml

Comment thread src/nvidia_rag/rag_server/vlm.py Outdated

Comment thread src/nvidia_rag/rag_server/vlm.py Outdated

Comment thread src/nvidia_rag/utils/configuration.py

nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 1d0e966 to 197bcf1 Compare May 6, 2026 14:42

NVIDIA-AI-Blueprints deleted a comment from coderabbitai Bot May 7, 2026

nv-pranjald approved these changes May 7, 2026

View reviewed changes

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread docs/change-model.md

Comment thread docs/vlm.md Outdated

nv-nikkulkarni force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch 5 times, most recently from c9aab4c to 828b8d1 Compare May 7, 2026 11:30

nv-nikkulkarni added 2 commits May 7, 2026 11:31

nv-pranjald force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 097ca9c to 828b8d1 Compare May 7, 2026 13:03

nv-pranjald force-pushed the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch from 76b7b9c to 8fd621d Compare May 7, 2026 18:53

nv-nikkulkarni and others added 2 commits May 7, 2026 22:58

nv-nikkulkarni merged commit 3cc5ee7 into develop May 7, 2026
5 of 6 checks passed

nv-nikkulkarni deleted the dev/nikkulkarni/nemotron-omni-rc-nim-integration branch May 7, 2026 23:16

coderabbitai Bot mentioned this pull request Jun 1, 2026

[v2.6.0] Sync release-v2.6.0 to main #657

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate VLM to Nemotron Omni as default VLM#509

Migrate VLM to Nemotron Omni as default VLM#509
nv-nikkulkarni merged 5 commits into
developfrom
dev/nikkulkarni/nemotron-omni-rc-nim-integration

nv-nikkulkarni commented May 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

nv-nikkulkarni commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026 •

edited

Loading

`.coderabbit.yaml` has a parsing error

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nv-nikkulkarni commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nv-nikkulkarni commented May 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Summary by CodeRabbit

Uh oh!

nv-nikkulkarni commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

.coderabbit.yaml has a parsing error

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nv-nikkulkarni commented May 7, 2026

Uh oh!

coderabbitai Bot commented May 7, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nv-nikkulkarni commented May 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading

`.coderabbit.yaml` has a parsing error