Migrate VLM to Nemotron Omni as default VLM#509
Conversation
1f10480 to
a9335ff
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
9250e81 to
1d0e966
Compare
|
Warning
|
| Layer / File(s) | Summary |
|---|---|
Deployment & Helm defaults deploy/compose/docker-compose-ingestor-server.yaml, deploy/compose/docker-compose-rag-server.yaml, deploy/compose/nims.yaml, deploy/compose/nvdev.env, deploy/helm/nvidia-blueprint-rag/values.yaml |
Model defaults switched to nvidia/nemotron-3-nano-30b-a3b-omni-reasoning; vlm-ms image/container updated; shm_size increased to 32GB; APP_FETCH_FULL_PAGE_CONTEXT casing normalized; new env vars added: VLM_FILTER_THINK_TOKENS, APP_VLM_ENABLE_THINKING, APP_VLM_THINKING_TOKEN_BUDGET; VLM generation defaults adjusted. |
Configuration Model src/nvidia_rag/utils/configuration.py |
VLMConfig default model_name updated and fields added: enable_thinking (env: APP_VLM_ENABLE_THINKING) and thinking_token_budget (env: APP_VLM_THINKING_TOKEN_BUDGET). |
VLM Core Implementation src/nvidia_rag/rag_server/vlm.py |
Refactor to construct OpenAI-style dict messages and use an async OpenAI-compatible client (AsyncOpenAI); add _build_extra_body to inject thinking controls; add organize_by_page param; streaming honors VLM_FILTER_THINK_TOKENS to filter reasoning tokens. |
VLM System Prompt src/nvidia_rag/rag_server/prompt.yaml, deploy/helm/nvidia-blueprint-rag/files/prompt.yaml |
vlm_template.system rewritten with detailed operational rules: use provided text/images as authoritative, forbid fabrication, require answering on any relevant overlap, preserve exact values, extract table/list data precisely, prefer image evidence when conflicting, return all distinct values for multi-part queries, and format yes/no answers answer-first with evidence. |
Dependency Manifest pyproject.toml |
Add openai>=1.0,<2.0 to optional dependency groups rag, ingest, and all. |
Documentation docs/change-model.md, docs/image_captioning.md, docs/multimodal-query.md, docs/vlm.md |
Update VLM model references to the new default; add "Switch Back to Nemotron Nano 12B VLM" rollback instructions; update examples and Helm snippets. |
Tests tests/unit/test_rag_server/test_vlm.py |
Tests updated to assert dict-based messages and async client usage; added tests for _build_extra_body, analyze/streaming behavior (think-token filtering), and adjusted redaction/image tests. |
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
- NVIDIA-AI-Blueprints/rag#105: Related async conversions and VLM changes.
- NVIDIA-AI-Blueprints/rag#145: Related VLM code and configuration adjustments.
- NVIDIA-AI-Blueprints/rag#72: Related VLM configuration and reasoning-control work.
Suggested labels
enhancement, documentation
Suggested reviewers
- nv-pranjald
- smasurekar
- shubhadeepd
Poem
🐰 I hopped through configs, docs, and code,
swapped Nano for Omni on the road,
thinking tokens now dance in a stream,
prompts sharpened like a carrot dream,
hooray—models ponder and answers beam!
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Migrate VLM to Nemotron Omni as default VLM' clearly and specifically summarizes the main change: switching the default Vision-Language Model from Nemotron Nano 12B to Nemotron Omni. It is concise and directly reflects the primary objective evident across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
📝 Generate docstrings
- Create stacked PR
- Commit on current branch
🧪 Generate unit tests (beta)
- Create PR with unit tests
- Commit unit tests in branch
dev/nikkulkarni/nemotron-omni-rc-nim-integration
Tip
💬 Introducing Slack Agent: The best way for teams to turn conversations into code.
Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
- Generate code and open pull requests
- Plan features and break down work
- Investigate incidents and troubleshoot customer tickets together
- Automate recurring tasks and respond to alerts with triggers
- Summarize progress and report instantly
Built for teams:
- Shared memory across your entire org—no repeating context
- Per-thread sandboxes to safely plan and execute work
- Governance built-in—scoped access, auditability, and budget controls
One agent for your entire SDLC. Right inside Slack.
Comment @coderabbitai help to get the list of available commands and usage tips.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/compose/nims.yaml`:
- Around line 333-334: Update documentation references that still mention the
old container name "nemo-vlm-microservice" to the new container name
"nemotron-3-nano-omni-30b-a3b-reasoning" in the listed docs; specifically edit
docs/image_captioning.md (replace occurrences around the captioning examples at
the earlier mentions) and docs/service-port-gpu-reference.md (replace the
service/container name usages and any related commands or docker-compose
examples) so all references match the new container_name value used in the
compose diff.
In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 163-175: The extra_body dict is being passed nested inside
model_kwargs which causes LangChain to treat it as a top-level field; update the
ChatOpenAI instantiation in vlm.py to pass extra_body directly as the extra_body
parameter (use the existing extra_body variable) instead of including it in
model_kwargs, i.e., remove "model_kwargs={'extra_body': extra_body}" and add
"extra_body=extra_body" to the ChatOpenAI call so ChatOpenAI receives
provider-specific fields (chat_template_kwargs, thinking_token_budget)
correctly.
- Around line 928-979: Add an integration or verification step to ensure the
NVIDIA reasoning delta is actually propagated into langchain-openai streaming
chunks: call vlm.astream (the same async streaming path used in VLM streaming)
against a real or staging Nemotron Omni endpoint with enable_thinking enabled
and assert that each received chunk has getattr(chunk, "additional_kwargs",
{}).get("reasoning") populated when expected; if the field is missing, fall back
to parsing chunk.content for reasoning markers or log a clear warning (include
VLM_FILTER_THINK_TOKENS and enable_thinking in the log) and update tests to
exercise this end-to-end behavior rather than only mocking
chunk.additional_kwargs.
In `@src/nvidia_rag/utils/configuration.py`:
- Around line 970-986: The VLM reasoning defaults (enable_thinking=True,
thinking_token_budget) are Omni-specific but the class still defaults
VLMConfig.model_name to the legacy Nano model; update VLMConfig.model_name
default from "nvidia/nemotron-nano-12b-v2-vl" to
"nvidia/nemotron-3-nano-30b-a3b-omni-reasoning" (and adjust its description if
present) so the default model matches enable_thinking=True, or alternatively set
enable_thinking default to False for legacy fallback—modify the
VLMConfig.model_name and/or enable_thinking fields accordingly in the class
where those attributes are defined.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: cd9add50-4e54-4124-a0dd-c9a97f39fcbf
📒 Files selected for processing (12)
deploy/compose/docker-compose-ingestor-server.yamldeploy/compose/docker-compose-rag-server.yamldeploy/compose/nims.yamldeploy/compose/nvdev.envdeploy/helm/nvidia-blueprint-rag/values.yamldocs/change-model.mddocs/image_captioning.mddocs/multimodal-query.mddocs/vlm.mdsrc/nvidia_rag/rag_server/prompt.yamlsrc/nvidia_rag/rag_server/vlm.pysrc/nvidia_rag/utils/configuration.py
1d0e966 to
197bcf1
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/nvidia_rag/utils/configuration.py (1)
946-986:⚠️ Potential issue | 🟠 Major | ⚡ Quick winComplete the Python config default migration to Omni.
VLMConfig.model_nameswitches here, but the in-code defaults still don't match the rest of this PR:NvIngestConfig.caption_model_nameabove is stillnvidia/nemotron-nano-12b-v2-vl, andVLMConfigstill falls back to the old4096 / 0.7 / 1.0generation defaults. AnyNvidiaRAGConfig()path without Compose/Helm env overrides will run a mixed Nano/Omni setup and different sampling behavior from the documented default deployment.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nvidia_rag/utils/configuration.py` around lines 946 - 986, The config defaults are inconsistent: NvIngestConfig.caption_model_name still uses the old Nano model and VLMConfig.{temperature,top_p,max_tokens} remain at legacy values, causing a mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to use the same Omni model string as VLMConfig.model_name ("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without overrides uses a consistent Omni model and sampling settings.
🧹 Nitpick comments (3)
tests/unit/test_rag_server/test_vlm.py (1)
74-87: ⚡ Quick winAdd return annotations to the newly added test methods.
These new test signatures are missing
-> None, which breaks the repo-wide Python typing rule for changed code. As per coding guidelines, "All function signatures must include type hints."Also applies to: 273-433
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_rag_server/test_vlm.py` around lines 74 - 87, The three test functions test_build_extra_body_includes_thinking_budget_when_enabled, test_build_extra_body_omits_thinking_budget_when_disabled, and test_build_extra_body_omits_zero_budget should include explicit return type annotations; update each signature to add "-> None" (e.g., def test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to satisfy the repository typing rule for changed code that all function signatures include type hints; ensure you apply the same change to any other new/changed test functions in the same file range (lines ~273-433) as indicated.src/nvidia_rag/rag_server/vlm.py (2)
462-463: 💤 Low valueAdd debug logging for consistency with
_extract_images_from_docs.Same pattern as above—silently swallowing exceptions makes debugging NRL image extraction issues difficult.
♻️ Proposed fix
- except Exception: + except Exception as e: + logger.debug("Failed to extract NRL image from doc: %s", e) continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nvidia_rag/rag_server/vlm.py` around lines 462 - 463, The silent except block (except Exception: continue) should log the caught exception for debugging consistency with _extract_images_from_docs; update the handler in the loop that swallows NRL image extraction errors to capture the exception (e) and call the module/class logger (e.g., self.logger.debug or logger.debug) including the exception info and relevant context (doc id, page/index, or image metadata) so failures are recorded but still continue processing.
402-423: ⚡ Quick winAdd debug logging for exceptions and use
source_metato avoid potentialAttributeError.
- Line 403: Re-accessing
doc.metadata.get("source")could returnNone, causingAttributeErroron the chained.get(). Sincesource_metais already safely extracted at line 381, reuse it.- Lines 422-423: The bare
exceptsilently swallows all exceptions without logging, making production debugging difficult when image extraction fails unexpectedly. Static analysis also flags this (Ruff S112, BLE001).♻️ Proposed fix
try: - source_location = doc.metadata.get("source").get("source_location") + source_location = source_meta.get("source_location") if source_location: raw_content = get_object_store_operator().get_object_from_uri( source_location ) content_b64 = base64.b64encode(raw_content).decode("ascii") else: content_b64 = "" if not content_b64: continue png_b64 = VLM._convert_image_url_to_png_b64(content_b64) parts.append( { "type": "image_url", "image_url": {"url": f"data:image/png;base64,{png_b64}"}, } ) if remaining_image_budget is not None: remaining_image_budget -= 1 - except Exception: + except Exception as e: + logger.debug("Failed to extract image from doc: %s", e) continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nvidia_rag/rag_server/vlm.py` around lines 402 - 423, Replace the unsafe chained access and silent except in the image extraction block: reuse the already-extracted source_meta (instead of re-calling doc.metadata.get("source")) to obtain "source_location", call get_object_store_operator().get_object_from_uri(...) and VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except Exception as e and log the error (e.g., logger.exception or logging.exception) including the exception and context (doc id or source_location) before continuing; keep updating parts and remaining_image_budget the same and ensure you only skip on actual failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/change-model.md`:
- Around line 305-360: When rolling back to Nemotron Nano 12B also reset
Omni-specific "thinking" knobs: set APP_VLM_ENABLE_THINKING=false and
APP_VLM_THINKING_BUDGET=0 in the vlm-ms service environment (where
APP_VLM_MODELNAME and APP_NVINGEST_CAPTIONMODELNAME are set) and mirror those
values in the Helm values under nimOperator -> envVars (APP_VLM_MODELNAME),
ingestor-server -> envVars (APP_NVINGEST_CAPTIONMODELNAME) and nv-ingest ->
envVars (VLM_CAPTION_MODEL_NAME); additionally note to switch any client prompts
back from Omni-style "/no_think" usage if Nano requires a different prompt
format.
In `@docs/vlm.md`:
- Around line 61-62: Update the docs in docs/vlm.md to reflect the new Nemotron
Omni defaults: replace references to the old default mode `/no_think` with the
reasoning-enabled default (remove or mark `/no_think` as non-default) and update
the tuning baseline values from `8192 / 0.1 / 1.0` to the new deployment
defaults by citing the environment variables APP_VLM_ENABLE_THINKING=true,
APP_VLM_THINKING_TOKEN_BUDGET=16384, APP_VLM_MAX_TOKENS=32768,
APP_VLM_TEMPERATURE=0.6, and APP_VLM_TOP_P=0.95 (also update the repeated
sections called out in the comment).
---
Outside diff comments:
In `@src/nvidia_rag/utils/configuration.py`:
- Around line 946-986: The config defaults are inconsistent:
NvIngestConfig.caption_model_name still uses the old Nano model and
VLMConfig.{temperature,top_p,max_tokens} remain at legacy values, causing a
mixed Nano/Omni runtime; update NvIngestConfig.caption_model_name to use the
same Omni model string as VLMConfig.model_name
("nvidia/nemotron-3-nano-30b-a3b-omni-reasoning") and change
VLMConfig.temperature, VLMConfig.top_p and VLMConfig.max_tokens to match the
PR/Helm documented Omni deployment defaults (or wire them to the shared/helm env
defaults instead of hardcoding legacy numbers) so any NvidiaRAGConfig() without
overrides uses a consistent Omni model and sampling settings.
---
Nitpick comments:
In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 462-463: The silent except block (except Exception: continue)
should log the caught exception for debugging consistency with
_extract_images_from_docs; update the handler in the loop that swallows NRL
image extraction errors to capture the exception (e) and call the module/class
logger (e.g., self.logger.debug or logger.debug) including the exception info
and relevant context (doc id, page/index, or image metadata) so failures are
recorded but still continue processing.
- Around line 402-423: Replace the unsafe chained access and silent except in
the image extraction block: reuse the already-extracted source_meta (instead of
re-calling doc.metadata.get("source")) to obtain "source_location", call
get_object_store_operator().get_object_from_uri(...) and
VLM._convert_image_url_to_png_b64 as before, but wrap the work in a try/except
Exception as e and log the error (e.g., logger.exception or logging.exception)
including the exception and context (doc id or source_location) before
continuing; keep updating parts and remaining_image_budget the same and ensure
you only skip on actual failure.
In `@tests/unit/test_rag_server/test_vlm.py`:
- Around line 74-87: The three test functions
test_build_extra_body_includes_thinking_budget_when_enabled,
test_build_extra_body_omits_thinking_budget_when_disabled, and
test_build_extra_body_omits_zero_budget should include explicit return type
annotations; update each signature to add "-> None" (e.g., def
test_build_extra_body_includes_thinking_budget_when_enabled(...) -> None:) to
satisfy the repository typing rule for changed code that all function signatures
include type hints; ensure you apply the same change to any other new/changed
test functions in the same file range (lines ~273-433) as indicated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: ea4bed47-5ffb-4208-adf3-5a75deed1c7b
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (15)
deploy/compose/docker-compose-ingestor-server.yamldeploy/compose/docker-compose-rag-server.yamldeploy/compose/nims.yamldeploy/compose/nvdev.envdeploy/helm/nvidia-blueprint-rag/files/prompt.yamldeploy/helm/nvidia-blueprint-rag/values.yamldocs/change-model.mddocs/image_captioning.mddocs/multimodal-query.mddocs/vlm.mdpyproject.tomlsrc/nvidia_rag/rag_server/prompt.yamlsrc/nvidia_rag/rag_server/vlm.pysrc/nvidia_rag/utils/configuration.pytests/unit/test_rag_server/test_vlm.py
✅ Files skipped from review due to trivial changes (5)
- docs/image_captioning.md
- docs/multimodal-query.md
- deploy/compose/nvdev.env
- deploy/compose/docker-compose-ingestor-server.yaml
- src/nvidia_rag/rag_server/prompt.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- deploy/compose/nims.yaml
c9aab4c to
828b8d1
Compare
- Switch VLM from nemotron-nano-12b-v2-vl to Nemotron Omni GA image (nvcr.io/nim/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:1.7.0-variant) - Update all documentation references from nemotron-nano-12b-v2-vl to Nemotron Omni; add section in change-model.md for switching back to Nemotron Nano 12B VLM - Update Helm values.yaml with Nemotron Omni image and model names
Drops langchain_core/langchain_openai from rag_server/vlm.py in favor of the openai AsyncOpenAI client. The langchain-openai 0.2.8 streaming-delta parser only propagated function_call/tool_calls into additional_kwargs, silently dropping the reasoning delta from Nemotron Omni; reading choices[0].delta directly restores `reasoning` and `reasoning_content` when VLM_FILTER_THINK_TOKENS=false. Other changes: - VLMConfig.model_name default flipped to nvidia/nemotron-3-nano-30b-a3b-omni-reasoning to match enable_thinking=True (was legacy nemotron-nano-12b-v2-vl). - extra_body passed as a first-class kwarg rather than nested in model_kwargs. - @trace_function decorators on _normalize_messages, extract_and_process_messages, _extract_images_from_docs(_nrl), invoke_model_async, and analyze_with_messages; inline traced_span around the stream_with_messages async-generator body. - openai>=1.0,<2.0 added to rag/ingest/all optional-dependency groups (was only transitive via langchain-openai). - Tests rewritten against dict-based message format and the AsyncOpenAI streaming shape; new coverage for reasoning-trace surfacing. Signed-off-by: Nikhil Kulkarni <nikkulkarni@nvidia.com>
097ca9c to
828b8d1
Compare
- vlm.md: replace /no_think//think route framing with APP_VLM_ENABLE_THINKING env var; set reasoning mode as default and update deployment defaults (TEMPERATURE=0.6, TOP_P=0.95, MAX_TOKENS=32768, THINKING_TOKEN_BUDGET=16384) - change-model.md: add APP_VLM_ENABLE_THINKING=false and APP_VLM_THINKING_TOKEN_BUDGET=0 to the Nano 12B rollback steps for both Docker Compose and Helm so Omni-specific knobs are reset on model switch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
76b7b9c to
8fd621d
Compare
…usage
Three changes to the VLM streaming path:
1. Routed model + usage in the SSE response.
- main.py _rag_chain: pass model=vlm_model_cfg (was model=, the LLM
name) so VLM responses advertise the VLM model. Capture and forward
vlm_token_usage to generate_answer_async on both VLM call sites
(_rag_chain and _vlm_direct_chain).
- vlm.py stream_with_messages: accept token_usage dict; request
stream_options.include_usage=True from the OpenAI SDK; populate the
dict from the final usage chunk. generate_answer_async then emits
usage on the closing SSE chunk like the LLM path already does.
2. Reasoning sentinels for streamed chain-of-thought.
- vlm.py: when filter_think_tokens=False, wrap each reasoning span
with [reasoning]...[/reasoning] markers so clients can structurally
separate deliberation from the final answer. Square-bracket form is
used because Message.content is sanitised via bleach, which strips
angle-bracket tags.
3. Per-request reasoning controls on the API.
- configuration.py: add VLMConfig.filter_think_tokens
(env=VLM_FILTER_THINK_TOKENS, default True) so the filter is a
first-class config field instead of an os.getenv read inside vlm.py.
- server.py Prompt: add vlm_enable_thinking, vlm_thinking_token_budget
(ge=0, le=65536), vlm_filter_thinking_tokens. All default None, so
omitted requests fall back to server-side config.
- main.py generate(): accept the three params, forward them through
vlm_settings to both VLM call sites.
- vlm.py stream_with_messages: accept the three kwargs; per-request
value (when not None) overrides the config default.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two filter-related stream_with_messages tests were written against the pre-PR behavior where vlm.py read VLM_FILTER_THINK_TOKENS via os.getenv and reasoning was streamed unwrapped. After moving the filter to VLMConfig.filter_think_tokens and wrapping reasoning in [reasoning]...[/reasoning] sentinels, both tests needed updating: - Drop the now-ineffective monkeypatch.setenv() calls. - Set self.mock_config.vlm.filter_think_tokens explicitly (True/False). - Update the filter-disabled test's expected output to ["[reasoning]", "step1", "[/reasoning]", "answer"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Checklist
git commit -s) and GPG signed (git commit -S).Summary by CodeRabbit
New Features
Behavior Changes
Documentation
Tests / Chores