
fix(onboard): prompt for multimodal model inputs (Fixes #3850) #4333

Merged
cv merged 6 commits into NVIDIA:main from deepujain:fix/3850-multimodal-input-prompt on Jun 5, 2026

@deepujain
Contributor

@deepujain deepujain commented May 27, 2026

Summary

Fixes #3850.

Some providers only return a model id during discovery, so a vision-capable model can get baked into OpenClaw as text-only. NemoClaw already supports NEMOCLAW_INFERENCE_INPUTS, but that is easy to miss during interactive onboarding.

This adds a small interactive prompt for model names that strongly suggest multimodal support, for example names containing omni, vision, vl, image, or multimodal. The default remains text-only. Choosing Text + Image sets the existing NEMOCLAW_INFERENCE_INPUTS=text,image path before the sandbox config is generated.
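
A minimal sketch of the name heuristic described above, assuming a single case-insensitive pattern with non-alphanumeric boundaries around each hint. The pattern and the function name `looksMultimodal` are hypothetical; the PR's actual regex is not shown in this thread and may differ, and the model ids in the comments are only illustrative.

```typescript
// Hypothetical hint pattern (assumption): each keyword must be delimited by
// a non-alphanumeric character or by the start/end of the model id.
const MULTIMODAL_HINT = /(^|[^a-z0-9])(omni|vision|vl|image|multimodal)([^a-z0-9]|$)/i;

// Returns true when the model id contains one of the multimodal hints.
function looksMultimodal(modelId: string): boolean {
  return MULTIMODAL_HINT.test(modelId);
}

// Illustrative ids only:
//   looksMultimodal("qwen2-vl-7b")            -> hint "vl" is delimited by "-"
//   looksMultimodal("llama-3.1-8b-instruct")  -> no hint present
//   looksMultimodal("devlog-7b")              -> "vl" sits inside a word, so no match
```

Requiring delimiters keeps short hints such as `vl` from matching inside unrelated words, at the cost of missing ids that fuse the hint to other characters (for example a digit directly before `vl`).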

Changes

  • src/lib/onboard.ts: detects likely multimodal model names and asks for input capability during interactive onboarding.
  • src/lib/onboard.ts: preserves the existing env override for non-interactive and scripted flows.
  • test/onboard.test.ts: covers the model-name detection and accepted input capability values.

Testing

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/onboard.test.ts -t 'input capability'
  • npx vitest run test/onboard.test.ts

Evidence

The focused tests cover the multimodal-name gate and strict capability vocabulary. The full onboard helper suite passes with 57 tests.

Signed-off-by: Deepak Jain <dejain@nvidia.com>

Summary by CodeRabbit

  • New Features
    • Onboarding now detects likely multimodal models and interactively prompts users to choose inference input capability (text-only or text+image) during setup, respecting non-interactive mode and pre-set environment values.
    • Invalid inference-input values are validated and normalized to a safe default when needed.
  • Tests
    • Unit tests added for prompting logic, allowed-value validation, and normalization behavior.


@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds helpers to detect/validate inference-input overrides and prompts interactively for text-only vs text+image when a selected model appears multimodal; integrates the prompt into onboarding after model selection and adds tests.

Changes

Inference Input Capability Prompting for Multimodal Models

| Layer / File(s) | Summary |
| --- | --- |
| Inference input capability validation and detection: src/lib/onboard/inference-input-capability.ts | Adds constants/regex and three helpers: isValidInferenceInputsOverride() validates override strings, shouldPromptForInferenceInputCapability() detects likely multimodal model names, and maybePromptForInferenceInputCapability() prompts interactively and sets NEMOCLAW_INFERENCE_INPUTS when appropriate. |
| Onboard integration: src/lib/onboard.ts | Imports the helpers and invokes maybePromptForInferenceInputCapability(selectedModel, { isNonInteractive, prompt }) immediately after model selection. |
| Unit tests for helpers: test/onboard.test.ts | Adds imports and Vitest cases for prompt decision heuristics, override validation rules (accept text/text,image, reject whitespace-separated or unsupported values), and env normalization when choosing text-only. |

sequenceDiagram
  participant User
  participant Onboard
  participant InferenceInputCapability
  participant Environment
  User->>Onboard: select inference model
  Onboard->>InferenceInputCapability: maybePromptForInferenceInputCapability(model, deps)
  InferenceInputCapability->>InferenceInputCapability: shouldPromptForInferenceInputCapability(model)
  alt Multimodal & interactive
    InferenceInputCapability->>User: prompt (1: Text only, 2: Text+Image)
    User->>InferenceInputCapability: choose
    InferenceInputCapability->>Environment: set NEMOCLAW_INFERENCE_INPUTS accordingly
  else Skip (non-interactive or valid env)
    InferenceInputCapability->>Environment: no change
  end
  InferenceInputCapability-->>Onboard: continue with env state
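
The flow in the sequence diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: the function name `promptForInputsSketch`, the prompt wording, the `prompt` signature, the injected `env` map, and the two stand-in helpers are all hypothetical. Writing `text` whenever the user does not pick option 2 follows the review suggestion to normalize invalid prior values.

```typescript
// Sketch only: names and behavior below are assumptions, not the merged code.
type PromptDeps = {
  isNonInteractive: boolean;
  // Hypothetical prompt signature: resolves with the raw user answer.
  prompt: (question: string) => Promise<string>;
  // Injected env map so the sketch is testable; a CLI could pass process.env.
  env: Record<string, string | undefined>;
};

const INPUTS_ENV_KEY = "NEMOCLAW_INFERENCE_INPUTS";

// Simplified stand-ins for the PR's detection and validation helpers.
const hintsAtMultimodal = (model: string): boolean =>
  /(^|[^a-z0-9])(omni|vision|vl|image|multimodal)([^a-z0-9]|$)/i.test(model);
const hasValidOverride = (value: string | undefined): boolean =>
  value === "text" || value === "text,image" || value === "image,text";

async function promptForInputsSketch(model: string, deps: PromptDeps): Promise<void> {
  // Scripted runs and valid pre-set overrides are left untouched.
  if (deps.isNonInteractive || hasValidOverride(deps.env[INPUTS_ENV_KEY])) return;
  // Only models whose names hint at multimodal support trigger the prompt.
  if (!hintsAtMultimodal(model)) return;

  const answer = await deps.prompt("Model inputs: [1] Text only (default), [2] Text + Image: ");
  const choice = answer.trim() || "1";
  // Any answer other than "2" falls back to text-only, replacing an invalid prior value.
  deps.env[INPUTS_ENV_KEY] = choice === "2" ? "text,image" : "text";
}
```

Injecting `prompt` and `env` keeps the helper free of terminal I/O and global state, which is what makes the prompt decision and the normalization easy to unit test.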

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3966: Modifies the same Step 3 onboarding area; related changes to the inference-provider selection flow.

Suggested labels

enhancement: ui

Suggested reviewers

  • ericksoa
  • cv

Poem

🐰 I hopped through onboarding with a curious twitch,
Found a model that might paint or just stitch;
"Text or text+image?" I offered a choice,
A tap, a path—now the model has voice. 🎨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a prompt during onboarding to let users override multimodal model inputs. |
| Linked Issues check | ✅ Passed | The PR fully implements the proposed design from issue #3850: detecting multimodal model names, prompting users after model selection, and respecting environment-variable overrides for non-interactive flows. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the linked issue requirement to override model input capability during onboarding; no unrelated modifications are present. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/onboard.test.ts (1)

360-366: ⚡ Quick win

Add an explicit "image" acceptance assertion.

This test currently validates text, text,image, and image,text, but not the singleton image value. Adding that one assertion will lock the full accepted-input contract and prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard.test.ts` around lines 360 - 366, Update the unit test that
verifies accepted inference input overrides by adding an explicit assertion that
the singleton "image" value is accepted; specifically, inside the same test
block that calls isValidInferenceInputsOverride for "text", "text,image", and
"image,text", add expect(isValidInferenceInputsOverride("image")).toBe(true) so
the test covers the standalone "image" input case and prevents regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3964-3997: The onboarding file grew by adding the
inference-input-capability helpers; extract VALID_INFERENCE_INPUTS_PATTERN,
MULTIMODAL_MODEL_HINT_PATTERN, isValidInferenceInputsOverride,
shouldPromptForInferenceInputCapability and
maybePromptForInferenceInputCapability into a new module (e.g.,
src/lib/inferenceInputCapability.ts), export the functions/constants, then
import maybePromptForInferenceInputCapability (and any other needed symbols)
into src/lib/onboard.ts and replace the in-file definitions with the import so
onboard.ts no longer contains that block and stays within the entrypoint size
budget; ensure function signatures and behavior are unchanged and update any
references to the moved symbols accordingly.
- Around line 3975-3995: The prompt currently leaves an invalid
NEMOCLAW_INFERENCE_INPUTS value unchanged when the user selects "Text only"
(default). In maybePromptForInferenceInputCapability, normalize or clear the env
var when the choice is not "2": after resolving (choice || "1").trim(), set
process.env.NEMOCLAW_INFERENCE_INPUTS = "text" (or delete/clear it) for the
Text-only branch so invalid prior values cannot propagate; keep the existing
branch that sets "text,image" when the user explicitly selects "2".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 59abfcbd-6a40-4870-9269-fc49d667f6b3

📥 Commits

Reviewing files that changed from the base of the PR and between e139dbc and b2dd722.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/inference-input-capability.ts`:
- Line 10: The current regex constant VALID_INFERENCE_INPUTS_PATTERN allows
duplicate tokens (e.g., "text,text"); update the validation to ensure tokens are
unique by replacing or augmenting usage of VALID_INFERENCE_INPUTS_PATTERN with
logic that splits the input by comma, verifies each token is one of "text" or
"image", and then checks that new Set(tokens).size === tokens.length to reject
duplicates; update any code that relies on VALID_INFERENCE_INPUTS_PATTERN
(search for that constant) to use this uniqueness check so values like
"text,text" and "image,image" are considered invalid.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4bbd1676-b232-4e40-90e6-59cf0d72cd06

📥 Commits

Reviewing files that changed from the base of the PR and between b2dd722 and a282e89.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/inference-input-capability.ts
  • test/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/onboard.test.ts
  • src/lib/onboard.ts

Comment thread src/lib/onboard/inference-input-capability.ts Outdated
@deepujain
Contributor Author

Moved input-capability logic under src/lib/onboard, normalized invalid text-only overrides to text, rejected duplicate tokens, and kept onboard.ts budget-neutral. Local checks: build:cli, typecheck:cli, lint, and full test/onboard.test.ts.
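
The duplicate-token rejection mentioned above, following the split-and-Set approach suggested in review, might look like the sketch below. The function name `isValidInputsOverrideSketch` is hypothetical, and two behaviors are assumptions that the thread does not confirm: a standalone `image` value is accepted, and matching is case-sensitive with no whitespace trimming.

```typescript
// Allowed tokens per the review comment: only "text" and "image".
const ALLOWED_INPUT_TOKENS = new Set(["text", "image"]);

// Sketch of a strict validator (assumption: no trimming, case-sensitive).
function isValidInputsOverrideSketch(value: string): boolean {
  const tokens = value.split(",");
  // Every comma-separated token must be in the allowed vocabulary;
  // an empty string or a trailing comma yields an empty token and fails here.
  if (!tokens.every((token) => ALLOWED_INPUT_TOKENS.has(token))) return false;
  // A Set drops repeats, so a size mismatch means a duplicate such as "text,text".
  return new Set(tokens).size === tokens.length;
}
```

Compared with a single regex, the token check plus the Set comparison rejects duplicates without having to enumerate every valid ordering.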

@wscurran wscurran added enhancement (New capability or improvement request) and fix labels May 27, 2026
@wscurran
Contributor

✨ Thanks for submitting this detailed PR that addresses the issue with multimodal model inputs during onboarding. This proposes a fix for the problem where vision-capable models are treated as text-only and adds an interactive prompt to handle such cases, improving the overall onboarding experience.


Related open issues:

@deepujain deepujain force-pushed the fix/3850-multimodal-input-prompt branch from 6299944 to 5a3d64d Compare May 30, 2026 02:01
@deepujain
Contributor Author

Rebased on current main. build:cli and the focused multimodal input prompt test pass.

@wscurran wscurran added the v0.0.58 Release target label Jun 2, 2026
@wscurran wscurran added area: cli (Command line interface, flags, terminal UX, or output), bug-fix (PR fixes a bug or regression), feature (PR adds or expands user-visible functionality) and removed NemoClaw CLI labels Jun 3, 2026
Contributor

@prekshivyas prekshivyas left a comment

LGTM — narrow addition in setupNim with an isolated helper, env override preserved for non-interactive flows, tests cover detection, override vocabulary, and bad-override normalization. CI green on 5a3d64d.

Non-blocking follow-up: should non-NIM provider paths (e.g. setupOpenAI, the Ollama path) also surface this prompt for multimodal-hinting model names, or is NIM the only provider where the model id alone leaves capability ambiguous?

@prekshivyas
Contributor

@deepujain — DCO is failing on the merge commit 6751692 ("Merge branch 'main' into ...") that I created via GitHub's "Update branch" button; it lacks a Signed-off-by trailer. Your 4 commits on this branch are all signed off cleanly.

Could you rebase this branch onto current main and force-push? That'll drop the merge commit and DCO should go green. Sorry for the noise.

@cv cv removed the v0.0.58 Release target label Jun 3, 2026
deepujain added 4 commits June 4, 2026 12:35
Fixes NVIDIA#3850

Signed-off-by: Deepak Jain <deepujain@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
Signed-off-by: Deepak Jain <deepujain@gmail.com>
@deepujain deepujain force-pushed the fix/3850-multimodal-input-prompt branch from 6751692 to 409b372 Compare June 4, 2026 19:36
@deepujain
Contributor Author

Rebased onto current main and dropped the unsigned merge commit, so DCO should clear on the next run. The four PR commits are signed, and build:cli plus the focused multimodal input prompt test pass.

@cv cv enabled auto-merge (squash) June 4, 2026 20:56
@cv cv merged commit 5838349 into NVIDIA:main Jun 5, 2026
19 checks passed
miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • bug-fix (PR fixes a bug or regression)
  • enhancement (New capability or improvement request)
  • feature (PR adds or expands user-visible functionality)

Development

Successfully merging this pull request may close these issues.

Allow nemoclaw onboard to override model input capability for multimodal models

4 participants