fix(inference): tighten Ollama bootstrap fit and raise runtime context floor#4852
…t floor Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 Walkthrough
This PR fixes two linked dead-loop and context-window issues in Ollama onboarding by detecting compute-constrained platforms, expanding probe timeouts, filtering unsuitable models, preventing model re-selection failures, and enforcing a minimum 16K-token (16,384) context window throughout the runtime and systemd configuration stack.
Changes: Ollama Model Selection and Context Window Hardening
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Selective E2E Results —
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
PR Review Advisor
Findings: 0 needs attention, 4 worth checking, 0 nice ideas
This is an automated advisory review. A human maintainer must make the final merge decision.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3878-3888: Extract the new probe-failure state and thresholds into
a small helper (e.g., ProbeFailureTracker) instead of keeping
probeFailureCounts, excludedAfterRepeatFail, MAX_PROBE_FAILS_SAME_MODEL,
MAX_PROBE_FAILS_TOTAL, and totalProbeFailures inline in onboard.ts: create a
module that encapsulates the Map/Set and counters and exposes methods like
recordFailure(tag):boolean (returns whether tag is now excluded),
shouldExclude(tag):boolean, getTotalFailures():number, and reset(); then replace
the inline variables/logic in the onboarding orchestration with a lightweight
instance call to those methods (update the spots that currently reference
probeFailureCounts/excludedAfterRepeatFail/totalProbeFailures or apply the
thresholds) so the function remains orchestration-only and file growth is moved
to the new helper module.
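A minimal sketch of the helper this comment asks for, assuming the thresholds quoted in the PR description (2 failures on the same model, 3 failures in total). The class name and the four method names come from the comment; the field names, the `shouldFallBack()` method, and the rule that a tag becomes excluded once it reaches the per-model threshold are assumptions for illustration, not the PR's actual code.

```typescript
// Thresholds as described in the PR text; constant names are taken from the review comment.
const MAX_PROBE_FAILS_SAME_MODEL = 2;
const MAX_PROBE_FAILS_TOTAL = 3;

// Hypothetical helper: encapsulates the per-model counts, the excluded set,
// and the running total that the comment wants moved out of onboard.ts.
class ProbeFailureTracker {
  private readonly failureCounts = new Map<string, number>();
  private readonly excluded = new Set<string>();
  private totalFailures = 0;

  /** Records one failed probe for `tag`; returns true if the tag is now excluded. */
  recordFailure(tag: string): boolean {
    const count = (this.failureCounts.get(tag) ?? 0) + 1;
    this.failureCounts.set(tag, count);
    this.totalFailures += 1;
    // Assumption: a tag is excluded once it hits the per-model threshold.
    if (count >= MAX_PROBE_FAILS_SAME_MODEL) {
      this.excluded.add(tag);
    }
    return this.excluded.has(tag);
  }

  shouldExclude(tag: string): boolean {
    return this.excluded.has(tag);
  }

  getTotalFailures(): number {
    return this.totalFailures;
  }

  /** Assumed convenience method: true once the overall failure budget is spent. */
  shouldFallBack(): boolean {
    return this.totalFailures >= MAX_PROBE_FAILS_TOTAL;
  }

  reset(): void {
    this.failureCounts.clear();
    this.excluded.clear();
    this.totalFailures = 0;
  }
}
```

With a helper along these lines, the orchestration code would only call `recordFailure(tag)` after a failed probe and consult `shouldExclude(tag)` when building the model menu, which keeps the counters and thresholds in one testable place.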
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b0faa8bd-9884-4af2-a920-a9f312bcb232
📒 Files selected for processing (12)
- src/lib/inference/local.test.ts
- src/lib/inference/local.ts
- src/lib/inference/nim.ts
- src/lib/inference/ollama-model-registry.test.ts
- src/lib/inference/ollama-model-registry.ts
- src/lib/inference/ollama-runtime-context.test.ts
- src/lib/inference/ollama-runtime-context.ts
- src/lib/inference/ollama/proxy.test.ts
- src/lib/inference/ollama/proxy.ts
- src/lib/onboard.ts
- src/lib/onboard/ollama-systemd.test.ts
- src/lib/onboard/ollama-systemd.ts
Selective E2E Results —
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Selective E2E Results —
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification.
- #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests

## Summary by CodeRabbit
## Release Notes
* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.
* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.
* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Three compounding faults steer Ollama onboarding into a dead-loop on tight-VRAM dGPU hosts (L4 23 GB) and leave the agent with a 4096-token runtime context window that cannot fit the base prompt + tool catalogue.
Tightens the bootstrap-model registry, raises the auto-adopted runtime context window to a workable floor, extends the cold-load probe retry to non-Spark hosts, and breaks the model-selection re-prompt out of dead-loops after repeated probe failures.
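The context-window part of this summary can be illustrated with a small sketch. It assumes the 16,384-token floor and the roughly 7.4 k-token prompt size quoted elsewhere in this description; `clampContextWindow` and `promptFits` are hypothetical helper names, not the PR's `applyOllamaRuntimeContextWindow`, whose real signature is not shown here.

```typescript
// Floor value and constant name as given in the PR description.
const MIN_AUTODETECTED_OLLAMA_CONTEXT_WINDOW = 16_384;

// Approximate base prompt + tool catalogue size quoted in the PR (~7.4 k tokens).
// Illustrative only; the real figure depends on the agent configuration.
const APPROX_BASE_PROMPT_TOKENS = 7_400;

// Hypothetical helper: raise a daemon-reported context length to the floor,
// leaving larger values untouched.
function clampContextWindow(reportedContextLength: number): number {
  return Math.max(reportedContextLength, MIN_AUTODETECTED_OLLAMA_CONTEXT_WINDOW);
}

// Hypothetical helper: does the approximate base prompt fit in the window at all?
function promptFits(contextWindow: number): boolean {
  return APPROX_BASE_PROMPT_TOKENS <= contextWindow;
}
```

Under these assumptions, a reported 4096 is raised to 16384 and a reported 32768 is preserved, which matches the behaviour the description lists for the runtime context floor.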
Related Issue
Fixes #4812
Fixes #4813
Refs #3707
Changes
- Raises `requiredMemoryMB` so 30B-class entries no longer pass the fit check on L4-class 23 GB dGPUs: `nemotron-3-nano:30b` 22000 → 26000 and `qwen3.6:35b` 26000 → 30000. The original 22000 budget left ~1 GB headroom over the 19 GB on-disk weight, which is not enough for KV cache + activations + agent prompt at default context; the runner ended up spilling GPU→CPU during warm-up and the probe timed out.
- Adds `OllamaModelEntry.computeIntensive` and `GpuInfo.computeConstrained` so `fittableOllamaModelTags` / `modelFitsAvailableMemory` / `anyRegistryModelFits` skip 30B-class entries on integrated-GPU hosts (`platform === "jetson"`), where memory ostensibly fits but token-generation throughput cannot clear agent-loop timeouts.
- Removes the `sparkHost` guard on the 300 s probe retry in `validateOllamaModel`. Cold-loading a large model from disk can routinely exceed the default 120 s window on any tight-VRAM dGPU, not just Spark. Fast failures (connection refused) keep `timedOut === false` and surface immediately.
- Adds `MIN_AUTODETECTED_OLLAMA_CONTEXT_WINDOW = 16_384` and modifies `applyOllamaRuntimeContextWindow` to raise `NEMOCLAW_CONTEXT_WINDOW` to the floor when the daemon-reported `context_length` is below it. Ollama's stock `num_ctx=4096` cannot fit the OpenClaw agent base prompt + tool catalogue (~7.4 k tokens), so every turn previously hit `Context overflow: prompt too large for the model.`
- `mergeOllamaLoopbackSystemdOverride` now writes `Environment="OLLAMA_CONTEXT_LENGTH=16384"` alongside `OLLAMA_HOST=127.0.0.1`, so a daemon restarted through the override serves the workable context length. Preserves user-supplied values above the NemoClaw floor; strips stale below-floor lines.
- `selectAndValidateOllamaModel`: tracks per-model probe-failure counts, threads an `excludeModels` set through `promptOllamaModel`, and falls back to provider selection after 2 failures on the same model or 3 failures total. Replaces the previous dead-loop that re-offered the same broken installed model every round.
- `computeConstrained: true` iGPU excludes compute-intensive entries regardless of memory; runtime context floor raises 4096 → 16384 and preserves 32768; systemd override writes/preserves/strips `OLLAMA_CONTEXT_LENGTH`; `validateOllamaModel` retries on non-Spark when timed out; `promptOllamaModel` excludes failed tags from both menus.

Type of Change
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `npm run docs` builds without warnings (doc changes only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
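The registry fit filter described under Changes can be sketched roughly as follows. The `requiredMemoryMB`, `computeIntensive`, and `computeConstrained` field names and the two 30B-class budgets come from this description; the interface names, `tag` and `availableMemoryMB` fields, the `example-small:4b` entry, and the `fitsHost` / `fittableTags` functions are assumptions made for illustration and do not claim to match the signatures of `modelFitsAvailableMemory` or `fittableOllamaModelTags`.

```typescript
// Hypothetical shapes; only the three flagged field names are taken from the PR text.
interface ModelEntry {
  tag: string;
  requiredMemoryMB: number;
  computeIntensive?: boolean;
}

interface HostGpu {
  availableMemoryMB: number; // hypothetical field name
  computeConstrained?: boolean;
}

const exampleRegistry: ModelEntry[] = [
  // Budgets after this PR, as listed in the description.
  { tag: "nemotron-3-nano:30b", requiredMemoryMB: 26_000, computeIntensive: true },
  { tag: "qwen3.6:35b", requiredMemoryMB: 30_000, computeIntensive: true },
  // Hypothetical small entry so the filter has something to keep.
  { tag: "example-small:4b", requiredMemoryMB: 6_000 },
];

// A model fits when the host is not compute-constrained for it
// and its memory budget is within what the host reports.
function fitsHost(model: ModelEntry, gpu: HostGpu): boolean {
  if (gpu.computeConstrained && model.computeIntensive) {
    return false; // skipped regardless of memory, per the description
  }
  return model.requiredMemoryMB <= gpu.availableMemoryMB;
}

function fittableTags(models: ModelEntry[], gpu: HostGpu): string[] {
  return models.filter((m) => fitsHost(m, gpu)).map((m) => m.tag);
}
```

In this sketch a host reporting about 23,000 MB keeps only the small entry because both 30B-class budgets exceed it, and a compute-constrained host drops the 30B-class entries even when it reports far more memory.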