
refactor(onboard): extract modules from onboard.ts (Steps 1-3, rebased) #2495

Merged
jyaunches merged 5 commits into main from refactor/onboard-extract-modules-v3
Apr 27, 2026

Conversation


@jyaunches jyaunches commented Apr 27, 2026

⚡ Interim PR — Unblocks #2306

This is an interim extraction PR that lands the foundational module decomposition needed by #2306 (extract rebuild/recreate path + canonicalize credential resolution). It is intentionally scoped to pure code movement with no behavior changes.

Rebased onto current main. Supersedes #2087 and #2489.


Summary

Steps 1–3 of the original extraction plan (authored by @ericksoa in #2087):

  • Extract onboard-providers.ts — provider metadata, CRUD, labels, inference config
  • Extract onboard-ollama-proxy.ts — proxy lifecycle, model pull/validate
  • Extract onboard-inference-probes.ts — endpoint validation probes (OpenAI, Anthropic, Gemini)

Reduces onboard.ts by ~1,000 lines. All 3 new modules are under 400 lines. No behavior changes — all existing exports remain accessible via re-exports and thin wrappers.

Step 4 (dashboard extraction) was dropped — main already landed dashboard-contract.ts, dashboard-health.ts, and dashboard-recover.ts (#2398) which supersede that extraction.

Steps 5–10 (gateway, preflight, messaging, policies, inference-setup, sandbox) remain deferred per the original plan.

Downstream: #2306

#2306 (extract rebuild/recreate path + canonicalize credential resolution) builds directly on top of this PR. It imports REMOTE_PROVIDER_CONFIG, upsertProvider, and ensureOllamaAuthProxy from the new extracted modules via onboard.ts re-exports. Merging this PR first gives #2306 a clean module surface to build on.
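The re-export surface can be sketched as follows. This is a hypothetical, self-contained illustration — two inline objects stand in for the real onboard-providers.ts and onboard.ts files, and the endpoint value is made up — but the shape matches the pattern described here: the extracted module owns the implementation, and onboard.ts re-exports the same symbols so downstream importers such as #2306 are unaffected.

```typescript
// "onboard-providers.ts": owns the implementation after extraction.
// (Inline object used here so the sketch runs as a single file.)
const onboardProviders = {
  REMOTE_PROVIDER_CONFIG: { openai: { endpoint: "https://api.openai.com/v1" } },
  upsertProvider(name: string): string {
    return `upserted ${name}`;
  },
};

// "onboard.ts": re-exports the same symbols under the same names, so
// existing callers of onboard.ts keep working without any changes.
const onboard = {
  REMOTE_PROVIDER_CONFIG: onboardProviders.REMOTE_PROVIDER_CONFIG,
  upsertProvider: onboardProviders.upsertProvider,
};

console.log(onboard.upsertProvider("openai"));
```

The design choice is that the old import path stays valid during the migration; modules move first, and import sites migrate later at their own pace.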

CodeRabbit comments

All 6 comments addressed:

  • 1 resolved (deleted planning doc)
  • 4 are valid improvements on pre-existing code that was moved as-is — intentionally out of scope for a pure extraction PR, flagged as follow-up material
  • 1 is incorrect (upsertMessagingProviders was never exported on main)

Rebase notes

  • Ported all main-side refinements into the extracted modules (e.g., startOllamaAuthProxy now returns boolean, getProbeAuthMode always returns undefined)
  • Added type annotations for the CJS require() imports to satisfy strict TypeScript
  • Re-imported endpoint URL constants that onboard.ts still references after extraction

Test plan

  • npx tsc -p tsconfig.src.json --noEmit compiles cleanly
  • npx tsc -p tsconfig.cli.json --noEmit compiles cleanly (pre-existing failures only, same as main)
  • Affected tests pass: gemini-probe-auth, credential-exposure, wsl2-probe-timeout
  • CI: all 15 checks green
  • Manual smoke test: nemoclaw onboard completes successfully

Co-authored-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

  • Refactor

    • Reorganized onboarding logic into modular utilities for improved code maintainability and separation of concerns.
    • Separated provider management, inference endpoint validation, and Ollama proxy lifecycle into dedicated modules.
  • Tests

    • Updated tests to reflect new code organization structure.

Commit messages

Move provider metadata constants (REMOTE_PROVIDER_CONFIG, endpoint URLs,
LOCAL_INFERENCE_PROVIDERS, DISCORD_SNOWFLAKE_RE), lookup helpers
(getEffectiveProviderName, getNonInteractive*, getRequested*Hint),
gateway CRUD (upsertProvider, providerExistsInGateway,
upsertMessagingProviders, buildProviderArgs), and
getSandboxInferenceConfig into a dedicated onboard-providers module.

Add getProviderLabel() to consolidate the scattered if/else provider
name→label chains (used in printDashboard).

Functions that need runOpenshell accept it as a last parameter
(dependency injection) to avoid circular requires. onboard.ts retains
thin wrappers with the original signatures.
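The wrapper-plus-injection shape can be sketched as below. All names and signatures here are simplified stand-ins (the real ones live in onboard-providers.ts and onboard.ts); the point is only the pattern: the extracted function takes runOpenshell as its final parameter instead of require()-ing onboard.ts back, and onboard.ts keeps a thin wrapper with the original arity.

```typescript
// Shape of the injected dependency (hypothetical, simplified).
type RunOpenshell = (args: string[]) => { ok: boolean; stdout: string };

// Extracted module ("onboard-providers.ts"): accepts runOpenshell as the
// last parameter, so it has no import back into onboard.ts and therefore
// no circular require.
function upsertProviderImpl(name: string, runOpenshell: RunOpenshell): boolean {
  return runOpenshell(["providers", "upsert", name]).ok;
}

// "onboard.ts": thin wrapper that closes over the local runOpenshell and
// preserves the original one-argument signature for existing callers.
const runOpenshell: RunOpenshell = (args) => ({ ok: true, stdout: args.join(" ") });
function upsertProvider(name: string): boolean {
  return upsertProviderImpl(name, runOpenshell);
}

console.log(upsertProvider("ollama"));
```

A usage note: because the wrapper is the only place that binds the real runOpenshell, tests can call upsertProviderImpl directly with a stub runner.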

Update source-scanning tests to read onboard-providers.ts where the
code now lives.

Move the entire Ollama auth-proxy subsystem into a dedicated module:
token/PID persistence, proxy spawn/kill lifecycle, model prompt/pull,
and the exposure warning.

Zero coupling to other onboard logic — all dependencies (runner, ports,
local-inference, credentials, model-prompts) are imported directly.

Move inference endpoint probe functions (parseJsonObject,
hasResponsesToolCall, shouldRequireResponsesToolCalling,
getProbeAuthMode, getValidationProbeCurlArgs, probeResponsesToolCalling,
probeOpenAiLikeEndpoint, probeAnthropicEndpoint) into a dedicated
module.

All dependencies (http-probe, credentials, platform, validation) are
imported directly — no circular deps. The 4 validate* wrapper functions
remain in onboard.ts for now as they depend on promptValidationRecovery;
they'll consolidate when setupNim moves in a later step.

Update source-scanning tests to read from the new compiled file.

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This change refactors a large onboarding module into three focused, specialized modules. It extracts inference endpoint probing logic, Ollama proxy management, and provider configuration handling into separate files, while maintaining the same public API through re-exports from the main onboard module. Tests are updated to reference the new module structure.

Changes

  • Inference Probe Utilities — src/lib/onboard-inference-probes.ts: new module containing endpoint validation helpers for provider onboarding, including JSON parsing, tool-call detection, authentication mode selection, curl timeout configuration (with WSL-specific tuning), and three distinct probe implementations for the OpenAI Responses API, OpenAI-like endpoints, and Anthropic endpoints.
  • Ollama Proxy Management — src/lib/onboard-ollama-proxy.ts: new module managing the Ollama auth-proxy lifecycle, including token/PID persistence in ~/.nemoclaw, proxy startup with environment configuration, stale proxy detection and cleanup, model selection prompting, model pulling with a 10-minute timeout, and model warming/validation.
  • Provider Configuration — src/lib/onboard-providers.ts: new module centralizing provider endpoints, constants, and management functions, including provider label resolution, effective provider name mapping, non-interactive provider/model hints from environment variables, provider CRUD operations, gateway existence checks, messaging provider bulk upsert, and sandbox inference configuration mapping.
  • Onboarding Module Refactoring — src/lib/onboard.ts: refactored to delegate provider, probing, and Ollama logic to the new modules via imports and re-exports. Removes ~874 net lines of inline implementation while preserving the public API surface through thin wrapper functions and re-exports.
  • Test Updates — test/credential-exposure.test.ts, test/onboard.test.ts, test/wsl2-probe-timeout.test.ts: updated test targets to reference the new module structure. credential-exposure now scans both onboard.ts and onboard-providers.ts; onboard.test.ts reads from onboard-providers.ts; wsl2-probe-timeout.test.ts reads from onboard-inference-probes.js.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop with glee, the code's now free!
Three modules dance where once was one,
Probes, Ollama, Providers—neat as can be,
Refactored with care for the maintainer's fun!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 52.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly identifies the main change — a refactor that extracts functionality from onboard.ts into three new modules. It is specific, concise, and accurately reflects the primary objective of the changeset.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/refactor-onboard-plan.md:
- Around line 1-3: Add the required SPDX license header to the top of
.claude/refactor-onboard-plan.md using an HTML comment block — insert the
repository-mandated SPDX line (e.g., <!-- SPDX-License-Identifier: ... -->) as
the very first lines of the Markdown file so it complies with the rule that
Markdown files use HTML comments for license headers; ensure the header exactly
matches the project's SPDX formatting convention.

In `@src/lib/onboard-inference-probes.ts`:
- Around line 135-323: probeOpenAiLikeEndpoint is too complex and the eslint
complexity suppression must be removed; refactor by extracting smaller helpers:
(1) buildProbeRequests that returns responsesProbe and chatCompletionsProbe
(move the auth/header/appendKey logic and JSON bodies out of the main function),
(2) handleStreamingFallback that encapsulates the runStreamingEventProbe logic
for "openai-responses" and returns the streamResult handling, (3)
executeRetryWithDoubledTimeouts encapsulating the retry logic (currently using
isTimeoutOrConnFailure, getValidationProbeCurlArgs, normalizeCredentialValue and
runCurlProbe), and (4) summarizeFailures that builds baseMessage/wslHint and
detects NVCF using isNvcfFunctionNotFoundForAccount and
nvcfFunctionNotFoundMessage; update probeOpenAiLikeEndpoint to orchestrate these
helpers, remove the eslint-disable-next-line complexity, and keep existing
behavior for responsesProbe, chatCompletionsProbe, runCurlProbe,
runStreamingEventProbe, and failure handling.
- Around line 327-330: The Anthropic probe (probeAnthropicEndpoint) is using the
loose default getCurlTimingArgs() and should use the validation-timeout helper
used by the OpenAI probes; update the runCurlProbe call inside
probeAnthropicEndpoint to replace getCurlTimingArgs() with the validation
timeout helper function used elsewhere for API validation (i.e., the same helper
the OpenAI probe uses) so Anthropic endpoints honor the stricter validation
budget.
- Around line 272-289: The retry path currently always adds an Authorization
header and targets "/chat/completions", losing the original authMode and causing
query‑param auth to fail; update the retry in the isTimeoutOrConnFailure branch
(where retriedAfterTimeout, getValidationProbeCurlArgs, runCurlProbe, apiKey,
normalizeCredentialValue, and endpointUrl are used) to preserve authMode by:
when authMode === "query-param" append the normalized apiKey as the same query
parameter used by the initial probe to the retry URL (instead of adding an
Authorization header), and otherwise keep adding the Authorization: Bearer
header; also reuse the same base endpointUrl construction
(String(endpointUrl).replace(/\/+$/, "")) rather than hardcoding a different
path so the retry targets the same route as the original probe.

In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 172-188: ensureOllamaAuthProxy sets ollamaProxyToken before
verifying the restarted proxy; change the flow so you only assign
ollamaProxyToken after confirming the proxy came up successfully. Specifically:
in ensureOllamaAuthProxy, keep using loadPersistedProxyToken and
killStaleProxy/spawnOllamaAuthProxy, but after spawnOllamaAuthProxy(token) wait
and re-check with isOllamaProxyProcess (or any existing health check) and only
set ollamaProxyToken when that check passes; if the check fails, do not set
ollamaProxyToken and surface/log an error so callers don’t proceed assuming a
live proxy. Make these changes around the ensureOllamaAuthProxy,
spawnOllamaAuthProxy, and isOllamaProxyProcess interaction.

In `@src/lib/onboard.ts`:
- Around line 855-857: The wrapper function upsertMessagingProviders currently
delegates to onboardProviders.upsertMessagingProviders but is not exported,
causing require("./onboard").upsertMessagingProviders to be undefined; add it to
the module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).
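As a hypothetical single-file sketch of the failure mode this comment describes (which the author later notes does not actually apply, since upsertMessagingProviders was never exported on main): a wrapper that exists in module scope is still invisible to consumers unless it is listed in the export object.

```typescript
// Stand-in for the extracted module's implementation (names illustrative).
const onboardProviders = {
  upsertMessagingProviders: (tokenDefs: string[]): number => tokenDefs.length,
};

// The wrapper exists in module scope...
function upsertMessagingProviders(tokenDefs: string[]): number {
  return onboardProviders.upsertMessagingProviders(tokenDefs);
}

// ...but if the export object omits it, require("./onboard") consumers
// see undefined:
const exportsBefore: Record<string, unknown> = { /* other exports only */ };

// Listing it explicitly makes it reachable:
const exportsAfter = { upsertMessagingProviders };

console.log(typeof exportsBefore.upsertMessagingProviders); // "undefined"
console.log(typeof exportsAfter.upsertMessagingProviders); // "function"
```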

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25755052-5c8a-4d5e-b4ab-9e665fb62af4

📥 Commits

Reviewing files that changed from the base of the PR and between 1f615e2 and 06e04d7.

📒 Files selected for processing (8)
  • .claude/refactor-onboard-plan.md
  • src/lib/onboard-inference-probes.ts
  • src/lib/onboard-ollama-proxy.ts
  • src/lib/onboard-providers.ts
  • src/lib/onboard.ts
  • test/credential-exposure.test.ts
  • test/onboard.test.ts
  • test/wsl2-probe-timeout.test.ts

Comment thread .claude/refactor-onboard-plan.md Outdated
Comment on lines +135 to +323
// eslint-disable-next-line complexity
function probeOpenAiLikeEndpoint(endpointUrl, model, apiKey, options = {}) {
  const useQueryParam = options.authMode === "query-param";
  const normalizedKey = apiKey ? normalizeCredentialValue(apiKey) : "";
  const baseUrl = String(endpointUrl).replace(/\/+$/, "");
  const authHeader = !useQueryParam && normalizedKey
    ? ["-H", `Authorization: Bearer ${normalizedKey}`]
    : [];
  const appendKey = (urlPath) =>
    useQueryParam && normalizedKey
      ? `${baseUrl}${urlPath}?key=${encodeURIComponent(normalizedKey)}`
      : `${baseUrl}${urlPath}`;

  const responsesProbe =
    options.requireResponsesToolCalling === true
      ? {
          name: "Responses API with tool calling",
          api: "openai-responses",
          execute: () => probeResponsesToolCalling(endpointUrl, model, apiKey, { authMode: options.authMode }),
        }
      : {
          name: "Responses API",
          api: "openai-responses",
          execute: () =>
            runCurlProbe([
              "-sS",
              ...getValidationProbeCurlArgs(),
              "-H",
              "Content-Type: application/json",
              ...authHeader,
              "-d",
              JSON.stringify({
                model,
                input: "Reply with exactly: OK",
              }),
              appendKey("/responses"),
            ]),
        };

  const chatCompletionsProbe = {
    name: "Chat Completions API",
    api: "openai-completions",
    execute: () =>
      runCurlProbe([
        "-sS",
        ...getValidationProbeCurlArgs(),
        "-H",
        "Content-Type: application/json",
        ...authHeader,
        "-d",
        JSON.stringify({
          model,
          messages: [{ role: "user", content: "Reply with exactly: OK" }],
        }),
        appendKey("/chat/completions"),
      ]),
  };

  // NVIDIA Build does not expose /v1/responses; probing it always returns
  // "404 page not found" and only adds noise to error messages. Skip it
  // entirely for that provider. See issue #1601.
  const probes = options.skipResponsesProbe
    ? [chatCompletionsProbe]
    : [responsesProbe, chatCompletionsProbe];

  const failures = [];
  for (const probe of probes) {
    const result = probe.execute();
    if (result.ok) {
      // Streaming event validation — catch backends like SGLang that return
      // valid non-streaming responses but emit incomplete SSE events in
      // streaming mode. Only run for /responses probes on custom endpoints
      // where probeStreaming was requested.
      if (probe.api === "openai-responses" && options.probeStreaming === true) {
        const streamResult = runStreamingEventProbe([
          "-sS",
          ...getValidationProbeCurlArgs(),
          "-H",
          "Content-Type: application/json",
          ...authHeader,
          "-d",
          JSON.stringify({
            model,
            input: "Reply with exactly: OK",
            stream: true,
          }),
          appendKey("/responses"),
        ]);
        if (!streamResult.ok && streamResult.missingEvents.length > 0) {
          // Backend responds but lacks required streaming events — fall back
          // to /chat/completions silently.
          console.log(` ℹ ${streamResult.message}`);
          failures.push({
            name: probe.name + " (streaming)",
            httpStatus: 0,
            curlStatus: 0,
            message: streamResult.message,
            body: "",
          });
          continue;
        }
        if (!streamResult.ok) {
          // Transport or execution failure — surface as a hard error instead
          // of silently switching APIs.
          return {
            ok: false,
            message: `${probe.name} (streaming): ${streamResult.message}`,
            failures: [
              {
                name: probe.name + " (streaming)",
                httpStatus: 0,
                curlStatus: 0,
                message: streamResult.message,
                body: "",
              },
            ],
          };
        }
      }
      return { ok: true, api: probe.api, label: probe.name };
    }
    // Preserve the raw response body alongside the summarized message so the
    // NVCF "Function not found for account" detector below can fall back to
    // the raw body if summarizeProbeError ever stops surfacing the marker
    // through `message`.
    failures.push({
      name: probe.name,
      httpStatus: result.httpStatus,
      curlStatus: result.curlStatus,
      message: result.message,
      body: result.body,
    });
  }

  // Single retry with doubled timeouts on timeout/connection failure.
  // WSL2's virtualized network stack can cause the initial probe to time out
  // before the TLS handshake completes. See issue #987.
  const isTimeoutOrConnFailure = (cs) => cs === 28 || cs === 6 || cs === 7;
  let retriedAfterTimeout = false;
  if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
    retriedAfterTimeout = true;
    const baseArgs = getValidationProbeCurlArgs();
    const doubledArgs = baseArgs.map((arg) =>
      /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
    );
    const retryResult = runCurlProbe([
      "-sS",
      ...doubledArgs,
      "-H",
      "Content-Type: application/json",
      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
      "-d",
      JSON.stringify({
        model,
        messages: [{ role: "user", content: "Reply with exactly: OK" }],
      }),
      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
    ]);
    if (retryResult.ok) {
      return { ok: true, api: "openai-completions", label: "Chat Completions API" };
    }
  }

  // Detect the NVCF "Function not found for account" error and reframe it
  // with an actionable next step instead of dumping the raw NVCF body.
  // See issue #1601 (Bug 2).
  const accountFailure = failures.find(
    (failure) =>
      isNvcfFunctionNotFoundForAccount(failure.message) ||
      isNvcfFunctionNotFoundForAccount(failure.body),
  );
  if (accountFailure) {
    return {
      ok: false,
      message: nvcfFunctionNotFoundMessage(model),
      failures,
    };
  }

  const baseMessage = failures.map((failure) => `${failure.name}: ${failure.message}`).join(" | ");
  const wslHint =
    isWsl() && retriedAfterTimeout
      ? " · WSL2 detected \u2014 network verification may be slower than expected. " +
        "Run `nemoclaw onboard` with the `--skip-verify` flag if this endpoint is known to be reachable."
      : "";
  return {
    ok: false,
    message: baseMessage + wslHint,
    failures,
  };
}

@coderabbitai coderabbitai Bot Apr 27, 2026


🛠️ Refactor suggestion | 🟠 Major

Please split probeOpenAiLikeEndpoint instead of disabling the complexity check.

The new module carries forward a very large decision tree and suppresses the repo’s complexity gate with eslint-disable-next-line complexity. This is exactly the kind of logic that becomes brittle during extractions like this; extracting the request builders, streaming fallback, retry path, and failure summarization into helpers would keep the public behavior intact without bypassing the rule.

As per coding guidelines, {src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.


jyaunches (PR author) replied:

Valid suggestion but out of scope for this PR. This is a pure extraction — code was moved as-is from onboard.ts with no behavior changes. Complexity reduction in the extracted modules is good follow-up material but should be a separate PR to keep this one reviewable.

coderabbitai Bot replied:

@jyaunches, understood — keeping extraction PRs behavior-preserving and deferring quality improvements is a sound approach. Would you like me to open a follow-up issue to track the probeOpenAiLikeEndpoint complexity refactor (extracting buildProbeRequests, handleStreamingFallback, executeRetryWithDoubledTimeouts, and summarizeFailures, and removing the eslint-disable-next-line complexity suppression)?


✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard-inference-probes.ts:135-323
Timestamp: 2026-04-27T13:06:57.456Z
Learning: In the NemoClaw repository, PRs that extract code from `src/lib/onboard.ts` into new modules (e.g., `onboard-providers.ts`, `onboard-ollama-proxy.ts`, `onboard-inference-probes.ts`) are intentionally behavior-preserving with no refactoring, so complexity or style improvements in the extracted modules should be deferred to separate follow-up PRs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +272 to +289
if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
retriedAfterTimeout = true;
const baseArgs = getValidationProbeCurlArgs();
const doubledArgs = baseArgs.map((arg) =>
/^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
);
const retryResult = runCurlProbe([
"-sS",
...doubledArgs,
"-H",
"Content-Type: application/json",
...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
"-d",
JSON.stringify({
model,
messages: [{ role: "user", content: "Reply with exactly: OK" }],
}),
`${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Preserve authMode in the timeout retry path.

The retry always switches to Authorization: Bearer ... and a plain /chat/completions URL. For endpoints validated with authMode: "query-param", the initial probe can time out, then the retry will fail for the wrong reason because it drops the query-param auth scheme.

Suggested fix
   if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
     retriedAfterTimeout = true;
     const baseArgs = getValidationProbeCurlArgs();
     const doubledArgs = baseArgs.map((arg) =>
       /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
     );
+    const retryAuthHeader = !useQueryParam && normalizedKey
+      ? ["-H", `Authorization: Bearer ${normalizedKey}`]
+      : [];
+    const retryUrl =
+      useQueryParam && normalizedKey
+        ? `${baseUrl}/chat/completions?key=${encodeURIComponent(normalizedKey)}`
+        : `${baseUrl}/chat/completions`;
     const retryResult = runCurlProbe([
       "-sS",
       ...doubledArgs,
       "-H",
       "Content-Type: application/json",
-      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
+      ...retryAuthHeader,
       "-d",
       JSON.stringify({
         model,
         messages: [{ role: "user", content: "Reply with exactly: OK" }],
       }),
-      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
+      retryUrl,
     ]);

jyaunches (PR author) replied:

Pre-existing behavior, moved as-is. The authMode retry path works today because getProbeAuthMode now returns undefined for all providers (changed on main before this PR). Agree it could be more defensive — follow-up material.

coderabbitai Bot replied:

@jyaunches, understood — thanks for the context. Since getProbeAuthMode unconditionally returns undefined as of main, authMode is never "query-param" in practice, so the retry path is safe as-is. Happy to treat the hardening as follow-up.

Would you like me to open a follow-up GitHub issue to track making the retry path preserve authMode defensively (in case getProbeAuthMode ever returns a non-undefined value again)?

🐰 noted and filed away for later!


🧠 Learnings used
Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.

Comment on lines +327 to +330
function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getCurlTimingArgs(),

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Use the validation timeout helper for Anthropic probes too.

This path falls back to getCurlTimingArgs(), which the comment above explicitly says is the looser default budget. A broken Anthropic-compatible endpoint can therefore stall onboarding much longer than OpenAI-like validation.

Suggested fix
 function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
   const result = runCurlProbe([
     "-sS",
-    ...getCurlTimingArgs(),
+    ...getValidationProbeCurlArgs(),
     "-H",
     `x-api-key: ${normalizeCredentialValue(apiKey)}`,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getCurlTimingArgs(),

After:

function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getValidationProbeCurlArgs(),
    "-H",
    `x-api-key: ${normalizeCredentialValue(apiKey)}`,

jyaunches (PR author) replied:

Pre-existing behavior, moved as-is. Good catch on the timeout inconsistency — but not introduced by this extraction. Follow-up material.

coderabbitai Bot replied:

@jyaunches, understood — thanks for the context. Since this is a valid inconsistency worth tracking, would you like me to open a follow-up GitHub issue to capture the fix (replacing getCurlTimingArgs() with getValidationProbeCurlArgs() in probeAnthropicEndpoint) so it doesn't get lost?


🧠 Learnings used
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: src/lib/local-inference.ts:289-303
Timestamp: 2026-04-14T22:22:08.976Z
Learning: In `src/lib/local-inference.ts`, `getOllamaWarmupCommand()` intentionally returns `["bash", "-c", "nohup curl ... &"]` — a shell-wrapped argv array — because backgrounding (`nohup ... &`) and output redirection (`>/dev/null 2>&1`) require a shell and have no pure synchronous argv equivalent. The payload is safe: the model name is double-escaped via `JSON.stringify` then `shellQuote` (single-quoted). Switching to `spawn({ detached: true })` would change the caller contract to async and is out of scope. This is the one deliberate `bash -c` exception in the argv refactor (PR `#1886`), and the code has an explanatory comment documenting it.

Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: test/onboard-selection.test.ts:137-141
Timestamp: 2026-04-15T14:16:36.856Z
Learning: In `test/onboard-selection.test.ts` (NVIDIA/NemoClaw), the `runner.runCapture` (and `runner.run`) test stubs intentionally normalize commands with `const cmd = Array.isArray(command) ? command.join(" ") : command` because `onboard.ts` still sends legacy shell strings (28 `run()` + 10 `runCapture()` calls not yet migrated to argv), while `local-inference.ts` already sends argv arrays. This dual-form normalization is a deliberate, time-bounded shim for the migration period. Once `#1889` converts `onboard.ts` to argv, these stubs should be updated to assert `Array.isArray` and use structural per-argument checks instead. Do not flag the `join(" ")` normalization as weakening test coverage during this migration window.

Comment on lines +172 to +188
function ensureOllamaAuthProxy(): void {
  // Try to load persisted token first — if none, this isn't an Ollama setup.
  const token = loadPersistedProxyToken();
  if (!token) return;

  const pid = loadPersistedProxyPid();
  if (isOllamaProxyProcess(pid)) {
    ollamaProxyToken = token;
    return;
  }

  // Proxy not running — restart it with the persisted token.
  killStaleProxy();
  ollamaProxyToken = token;
  spawnOllamaAuthProxy(token);
  sleep(1);
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026

⚠️ Potential issue | 🟠 Major

Verify the restarted proxy before keeping the persisted token active.

If the restart fails here, ollamaProxyToken is still left in memory and the caller proceeds as if the proxy exists. After a reboot, that turns into opaque local-inference failures against a dead proxy endpoint.

Suggested fix
 function ensureOllamaAuthProxy(): void {
   const token = loadPersistedProxyToken();
   if (!token) return;
@@
   killStaleProxy();
   ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   sleep(1);
+  if (!isOllamaProxyProcess(restartedPid)) {
+    ollamaProxyToken = null;
+    clearPersistedProxyPid();
+    console.error(`  Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`);
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-ollama-proxy.ts` around lines 172 - 188,
ensureOllamaAuthProxy sets ollamaProxyToken before verifying the restarted
proxy; change the flow so you only assign ollamaProxyToken after confirming the
proxy came up successfully. Specifically: in ensureOllamaAuthProxy, keep using
loadPersistedProxyToken and killStaleProxy/spawnOllamaAuthProxy, but after
spawnOllamaAuthProxy(token) wait and re-check with isOllamaProxyProcess (or any
existing health check) and only set ollamaProxyToken when that check passes; if
the check fails, do not set ollamaProxyToken and surface/log an error so callers
don’t proceed assuming a live proxy. Make these changes around the
ensureOllamaAuthProxy, spawnOllamaAuthProxy, and isOllamaProxyProcess
interaction.
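The verify-before-keeping-the-token ordering described above can be sketched in isolation. This is a minimal illustration under stated assumptions: the helpers below are stubs standing in for the real ones in `onboard-ollama-proxy.ts`, the `spawnOllamaAuthProxy` return value (a pid) is an assumption not guaranteed by the real signature, and the stub `isOllamaProxyProcess` deliberately reports a dead proxy so the failure path runs.

```typescript
type Pid = number | null;

let ollamaProxyToken: string | null = null;

// --- stub stand-ins (assumptions, not the real implementations) ---
const loadPersistedProxyToken = (): string | null => "tok-123";
const killStaleProxy = (): void => {};
const spawnOllamaAuthProxy = (_token: string): Pid => 4242; // assume spawn yields a pid
const isOllamaProxyProcess = (_pid: Pid): boolean => false; // simulate a proxy that never comes up

function ensureOllamaAuthProxy(): boolean {
  const token = loadPersistedProxyToken();
  if (!token) return false;

  killStaleProxy();
  const pid = spawnOllamaAuthProxy(token);
  // Only keep the token in memory once the restarted proxy is verified alive.
  if (isOllamaProxyProcess(pid)) {
    ollamaProxyToken = token;
    return true;
  }
  ollamaProxyToken = null;
  console.error("Ollama auth proxy failed to restart");
  return false;
}

console.log(ensureOllamaAuthProxy()); // false: the stubbed proxy never comes up
console.log(ollamaProxyToken);        // null: token was not kept active
```

The key difference from the current code is that `ollamaProxyToken` is assigned only after the liveness check passes, so callers can no longer observe a token paired with a dead proxy.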

Contributor Author
Pre-existing behavior, moved as-is. The proxy restart-without-verify pattern exists on main today. Agree it should be hardened — follow-up material.

Contributor
@jyaunches, understood — thanks for the context. Since this is agreed-upon follow-up work, would you like me to open a GitHub issue to track the ensureOllamaAuthProxy restart-verification hardening? I can capture the proposed fix (verify the restarted PID via isOllamaProxyProcess, null-guard ollamaProxyToken on failure, and log an actionable error) as the issue body with a backlink to this PR.


🧠 Learnings used

Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.

Learnt from: jieunl24
Repo: NVIDIA/NemoClaw PR: 1104
File: bin/lib/onboard.js:2199-2200
Timestamp: 2026-03-30T17:32:16.517Z
Learning: In `bin/lib/onboard.js` (NemoClaw project), the WSL2 Ollama startup path intentionally does NOT clear an inherited `OLLAMA_HOST` environment variable (e.g., via `env -u OLLAMA_HOST`). The design decision is to respect any explicitly exported `OLLAMA_HOST` as an intentional user configuration choice. The fix only avoids *setting* `OLLAMA_HOST=0.0.0.0:11434` inline on WSL2; it does not override pre-existing env values. This is considered acceptable because neither the Ollama installer nor NemoClaw sets `OLLAMA_HOST` in shell profiles.

Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 2066
File: src/lib/shields.ts:162-180
Timestamp: 2026-04-19T14:03:06.546Z
Learning: In `src/lib/shields-timer.ts` (TypeScript/Node.js, NemoClaw), the re-lock and lock-verification logic is intentionally inlined (duplicated from `lockAgentConfig` in `shields.ts`) because the timer runs as a detached standalone script via `fork()` with `ts-nocheck` and cannot import TypeScript modules from the main codebase at runtime. Do not flag this duplication as a refactoring opportunity.

Comment thread src/lib/onboard.ts
Comment on lines +855 to 857
function upsertMessagingProviders(tokenDefs: MessagingTokenDef[]) {
  return onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell);
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026

⚠️ Potential issue | 🟠 Major

Re-export upsertMessagingProviders from this entry point.

Lines 855-857 add the wrapper, but module.exports never exposes it. That makes require("./onboard").upsertMessagingProviders resolve to undefined, which is a breaking API change for existing callers.

Suggested fix
 module.exports = {
   buildProviderArgs,
   buildGatewayBootstrapSecretsScript,
   buildSandboxConfigSyncScript,
   compactText,
@@
   summarizeCurlFailure,
   summarizeProbeFailure,
   hasResponsesToolCall,
+  upsertMessagingProviders,
   upsertProvider,
   hashCredential,
   detectMessagingCredentialRotation,
   hydrateCredentialEnv,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 855 - 857, The wrapper function
upsertMessagingProviders currently delegates to
onboardProviders.upsertMessagingProviders but is not exported, causing
require("./onboard").upsertMessagingProviders to be undefined; add it to the
module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).

Contributor Author

Not a regression — upsertMessagingProviders was never in module.exports on main either. It's an internal function used only within onboard.ts. The wrapper exists so internal call sites can reach the extracted implementation via runOpenshell injection.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if upsertMessagingProviders appears in module.exports on the main branch
git show main:src/lib/onboard.ts | grep -n "upsertMessagingProviders" | head -30

Length of output: 408


@jyaunches, you're right — confirmed on main, upsertMessagingProviders has no entry in module.exports and is only used at internal call sites within onboard.ts. My comment was incorrect; this is not a regression. Apologies for the noise.

(╥﹏╥)


✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard.ts:855-857
Timestamp: 2026-04-27T13:07:20.779Z
Learning: In `src/lib/onboard.ts` (NVIDIA/NemoClaw), `upsertMessagingProviders` is a module-private function that is intentionally NOT exported via `module.exports`. It is used only internally within `onboard.ts` to delegate to `onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell)` with `runOpenshell` injection. Do not flag its absence from `module.exports` as a missing export or regression.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.



The .claude/refactor-onboard-plan.md was an internal planning doc
for the full 11-step extraction. Since this PR only lands Steps 1-3,
the full plan doc is no longer relevant and causes markdownlint failures.
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/credential-exposure.test.ts`:
- Around line 68-70: The existing credential-leak assertions only scan
ONBOARD_JS for KEY=VALUE and interpolation patterns, leaving
src/lib/onboard-providers.ts unscanned; update the test to use the combined
source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS) for all
credential-leak checks (not just the ["--credential", credentialEnv] assertion)
so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 14663b86-3f99-4dc4-8f3f-27336c99f31d

📥 Commits

Reviewing files that changed from the base of the PR and between 06e04d7 and c1d9335.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/credential-exposure.test.ts

Comment on lines +68 to +70
// buildProviderArgs lives in onboard-providers.ts; scan both files.
const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
Contributor

⚠️ Potential issue | 🟠 Major

Keep the broader credential-leak scan on onboard-providers.ts too.

This only expands the exact ["--credential", credentialEnv] assertion. The earlier KEY=VALUE / interpolation leak checks still read src/lib/onboard.ts only, so a regression inside src/lib/onboard-providers.ts would now bypass the security test suite.

Suggested change
 const ONBOARD_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard.ts");
 const ONBOARD_PROVIDERS_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard-providers.ts");
+const ONBOARD_SRC =
+  fs.readFileSync(ONBOARD_JS, "utf-8") +
+  "\n" +
+  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
@@
   it("onboard.js must not pass KEY=VALUE to --credential", () => {
-    const src = fs.readFileSync(ONBOARD_JS, "utf-8");
+    const src = ONBOARD_SRC;
     const lines = src.split("\n");
@@
   it("onboard.js --credential flags pass env var names only", () => {
-    // buildProviderArgs lives in onboard-providers.ts; scan both files.
-    const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
-      fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
+    const src = ONBOARD_SRC;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/credential-exposure.test.ts` around lines 68 - 70, The existing
credential-leak assertions only scan ONBOARD_JS for KEY=VALUE and interpolation
patterns, leaving src/lib/onboard-providers.ts unscanned; update the test to use
the combined source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS)
for all credential-leak checks (not just the ["--credential", credentialEnv]
assertion) so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.
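The gap the review describes, a leak pattern that a single-file scan misses but a combined-source scan catches, can be shown with a toy version of the check. Everything here is illustrative: the two source strings stand in for the real `onboard.ts` / `onboard-providers.ts` file contents, and the regex is a simplified stand-in for the test suite's actual KEY=VALUE leak assertions.

```typescript
// Stand-ins for the two files' contents; the second contains a deliberate leak.
const onboardSrc = `args.push("--credential", credentialEnv);`;
const providersSrc = `args.push("--credential", "API_KEY=" + secret);`;

// Build one combined source so every credential-leak assertion covers both modules.
const combined = onboardSrc + "\n" + providersSrc;

// Simplified leak pattern: a literal KEY=VALUE string passed after --credential.
const leakPattern = /--credential",\s*"[A-Z_]+=/;

console.log(leakPattern.test(onboardSrc)); // false: scanning onboard.ts alone misses the leak
console.log(leakPattern.test(combined));   // true: the combined scan catches it
```

This is why the review asks that all the leak checks, not just the `["--credential", credentialEnv]` assertion, read from the concatenated source.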

@wscurran wscurran added refactor This is a refactor of the code and/or architecture. dependencies Pull requests that update a dependency file labels Apr 27, 2026
@jyaunches jyaunches requested a review from cv April 27, 2026 16:05
@jyaunches jyaunches merged commit 7da86fa into main Apr 27, 2026
20 checks passed