
refactor(onboard): extract modules from onboard.ts (Steps 1-3, rebased) #2495

Merged
jyaunches merged 5 commits into main from refactor/onboard-extract-modules-v3
Apr 27, 2026

Conversation


@jyaunches jyaunches commented Apr 27, 2026

⚡ Interim PR — Unblocks #2306

This is an interim extraction PR that lands the foundational module decomposition needed by #2306 (extract rebuild/recreate path + canonicalize credential resolution). It is intentionally scoped to pure code movement with no behavior changes.

Rebased onto current main. Supersedes #2087 and #2489.


Summary

Steps 1–3 of the original extraction plan (authored by @ericksoa in #2087):

  • Extract onboard-providers.ts — provider metadata, CRUD, labels, inference config
  • Extract onboard-ollama-proxy.ts — proxy lifecycle, model pull/validate
  • Extract onboard-inference-probes.ts — endpoint validation probes (OpenAI, Anthropic, Gemini)

Reduces onboard.ts by ~1,000 lines. All 3 new modules are under 400 lines. No behavior changes — all existing exports remain accessible via re-exports and thin wrappers.

Step 4 (dashboard extraction) was dropped — main already landed dashboard-contract.ts, dashboard-health.ts, and dashboard-recover.ts (#2398) which supersede that extraction.

Steps 5–10 (gateway, preflight, messaging, policies, inference-setup, sandbox) remain deferred per the original plan.

Downstream: #2306

#2306 (extract rebuild/recreate path + canonicalize credential resolution) builds directly on top of this PR. It imports REMOTE_PROVIDER_CONFIG, upsertProvider, and ensureOllamaAuthProxy from the new extracted modules via onboard.ts re-exports. Merging this PR first gives #2306 a clean module surface to build on.
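The re-export surface can be sketched as follows. This is a hypothetical, self-contained illustration — two inline objects stand in for the real onboard-providers.ts and onboard.ts files, and the endpoint value is made up — but the shape matches the pattern described here: the extracted module owns the implementation, and onboard.ts re-exports the same symbols so downstream importers such as #2306 are unaffected.

```typescript
// "onboard-providers.ts": owns the implementation after extraction.
// (Inline object used here so the sketch runs as a single file.)
const onboardProviders = {
  REMOTE_PROVIDER_CONFIG: { openai: { endpoint: "https://api.openai.com/v1" } },
  upsertProvider(name: string): string {
    return `upserted ${name}`;
  },
};

// "onboard.ts": re-exports the same symbols under the same names, so
// existing callers of onboard.ts keep working without any changes.
const onboard = {
  REMOTE_PROVIDER_CONFIG: onboardProviders.REMOTE_PROVIDER_CONFIG,
  upsertProvider: onboardProviders.upsertProvider,
};

console.log(onboard.upsertProvider("openai"));
```

The design choice is that the old import path stays valid during the migration; modules move first, and import sites migrate later at their own pace.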

CodeRabbit comments

All 6 comments addressed:

  • 1 resolved (deleted planning doc)
  • 4 are valid improvements on pre-existing code that was moved as-is — intentionally out of scope for a pure extraction PR, flagged as follow-up material
  • 1 is incorrect (upsertMessagingProviders was never exported on main)

Rebase notes

  • Ported all main-side refinements into the extracted modules (e.g., startOllamaAuthProxy now returns boolean, getProbeAuthMode always returns undefined)
  • Added type annotations for the CJS require() imports to satisfy strict TypeScript
  • Re-imported endpoint URL constants that onboard.ts still references after extraction

Test plan

  • npx tsc -p tsconfig.src.json --noEmit compiles cleanly
  • npx tsc -p tsconfig.cli.json --noEmit compiles cleanly (pre-existing failures only, same as main)
  • Affected tests pass: gemini-probe-auth, credential-exposure, wsl2-probe-timeout
  • CI: all 15 checks green
  • Manual smoke test: nemoclaw onboard completes successfully

Co-authored-by: Aaron Erickson aerickson@nvidia.com

Summary by CodeRabbit

  • Refactor

    • Reorganized onboarding logic into modular utilities for improved code maintainability and separation of concerns.
    • Separated provider management, inference endpoint validation, and Ollama proxy lifecycle into dedicated modules.
  • Tests

    • Updated tests to reflect new code organization structure.

Commit messages

Move provider metadata constants (REMOTE_PROVIDER_CONFIG, endpoint URLs,
LOCAL_INFERENCE_PROVIDERS, DISCORD_SNOWFLAKE_RE), lookup helpers
(getEffectiveProviderName, getNonInteractive*, getRequested*Hint),
gateway CRUD (upsertProvider, providerExistsInGateway,
upsertMessagingProviders, buildProviderArgs), and
getSandboxInferenceConfig into a dedicated onboard-providers module.

Add getProviderLabel() to consolidate the scattered if/else provider
name→label chains (used in printDashboard).

Functions that need runOpenshell accept it as a last parameter
(dependency injection) to avoid circular requires. onboard.ts retains
thin wrappers with the original signatures.
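The wrapper-plus-injection shape can be sketched as below. All names and signatures here are simplified stand-ins (the real ones live in onboard-providers.ts and onboard.ts); the point is only the pattern: the extracted function takes runOpenshell as its final parameter instead of require()-ing onboard.ts back, and onboard.ts keeps a thin wrapper with the original arity.

```typescript
// Shape of the injected dependency (hypothetical, simplified).
type RunOpenshell = (args: string[]) => { ok: boolean; stdout: string };

// Extracted module ("onboard-providers.ts"): accepts runOpenshell as the
// last parameter, so it has no import back into onboard.ts and therefore
// no circular require.
function upsertProviderImpl(name: string, runOpenshell: RunOpenshell): boolean {
  return runOpenshell(["providers", "upsert", name]).ok;
}

// "onboard.ts": thin wrapper that closes over the local runOpenshell and
// preserves the original one-argument signature for existing callers.
const runOpenshell: RunOpenshell = (args) => ({ ok: true, stdout: args.join(" ") });
function upsertProvider(name: string): boolean {
  return upsertProviderImpl(name, runOpenshell);
}

console.log(upsertProvider("ollama"));
```

A usage note: because the wrapper is the only place that binds the real runOpenshell, tests can call upsertProviderImpl directly with a stub runner.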

Update source-scanning tests to read onboard-providers.ts where the
code now lives.

Move the entire Ollama auth-proxy subsystem into a dedicated module:
token/PID persistence, proxy spawn/kill lifecycle, model prompt/pull,
and the exposure warning.

Zero coupling to other onboard logic — all dependencies (runner, ports,
local-inference, credentials, model-prompts) are imported directly.

Move inference endpoint probe functions (parseJsonObject,
hasResponsesToolCall, shouldRequireResponsesToolCalling,
getProbeAuthMode, getValidationProbeCurlArgs, probeResponsesToolCalling,
probeOpenAiLikeEndpoint, probeAnthropicEndpoint) into a dedicated
module.

All dependencies (http-probe, credentials, platform, validation) are
imported directly — no circular deps. The 4 validate* wrapper functions
remain in onboard.ts for now as they depend on promptValidationRecovery;
they'll consolidate when setupNim moves in a later step.

Update source-scanning tests to read from the new compiled file.

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This change refactors a large onboarding module into three focused, specialized modules. It extracts inference endpoint probing logic, Ollama proxy management, and provider configuration handling into separate files, while maintaining the same public API through re-exports from the main onboard module. Tests are updated to reference the new module structure.

Changes

  • Inference Probe Utilities — src/lib/onboard-inference-probes.ts: new module containing endpoint validation helpers for provider onboarding, including JSON parsing, tool-call detection, authentication mode selection, curl timeout configuration (with WSL-specific tuning), and three distinct probe implementations for the OpenAI Responses API, OpenAI-like endpoints, and Anthropic endpoints.
  • Ollama Proxy Management — src/lib/onboard-ollama-proxy.ts: new module managing the Ollama auth-proxy lifecycle, including token/PID persistence in ~/.nemoclaw, proxy startup with environment configuration, stale proxy detection and cleanup, model selection prompting, model pulling with a 10-minute timeout, and model warming/validation.
  • Provider Configuration — src/lib/onboard-providers.ts: new module centralizing provider endpoints, constants, and management functions, including provider label resolution, effective provider name mapping, non-interactive provider/model hints from environment variables, provider CRUD operations, gateway existence checks, messaging provider bulk upsert, and sandbox inference configuration mapping.
  • Onboarding Module Refactoring — src/lib/onboard.ts: refactored to delegate provider, probing, and Ollama logic to the new modules via imports and re-exports. Removes ~874 net lines of inline implementation while preserving the public API surface through thin wrapper functions and re-exports.
  • Test Updates — test/credential-exposure.test.ts, test/onboard.test.ts, test/wsl2-probe-timeout.test.ts: updated test targets to reference the new module structure. credential-exposure now scans both onboard.ts and onboard-providers.ts; onboard.test.ts reads from onboard-providers.ts; wsl2-probe-timeout.test.ts reads from onboard-inference-probes.js.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hop with glee, the code's now free!
Three modules dance where once was one,
Probes, Ollama, Providers—neat as can be,
Refactored with care for the maintainer's fun!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 52.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly identifies the main change — a refactor that extracts functionality from onboard.ts into three new modules. It is specific, concise, and accurately reflects the primary objective of the changeset.
  • Linked Issues Check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: check skipped because no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/refactor-onboard-plan.md:
- Around line 1-3: Add the required SPDX license header to the top of
.claude/refactor-onboard-plan.md using an HTML comment block — insert the
repository-mandated SPDX line (e.g., <!-- SPDX-License-Identifier: ... -->) as
the very first lines of the Markdown file so it complies with the rule that
Markdown files use HTML comments for license headers; ensure the header exactly
matches the project's SPDX formatting convention.

In `@src/lib/onboard-inference-probes.ts`:
- Around line 135-323: probeOpenAiLikeEndpoint is too complex and the eslint
complexity suppression must be removed; refactor by extracting smaller helpers:
(1) buildProbeRequests that returns responsesProbe and chatCompletionsProbe
(move the auth/header/appendKey logic and JSON bodies out of the main function),
(2) handleStreamingFallback that encapsulates the runStreamingEventProbe logic
for "openai-responses" and returns the streamResult handling, (3)
executeRetryWithDoubledTimeouts encapsulating the retry logic (currently using
isTimeoutOrConnFailure, getValidationProbeCurlArgs, normalizeCredentialValue and
runCurlProbe), and (4) summarizeFailures that builds baseMessage/wslHint and
detects NVCF using isNvcfFunctionNotFoundForAccount and
nvcfFunctionNotFoundMessage; update probeOpenAiLikeEndpoint to orchestrate these
helpers, remove the eslint-disable-next-line complexity, and keep existing
behavior for responsesProbe, chatCompletionsProbe, runCurlProbe,
runStreamingEventProbe, and failure handling.
- Around line 327-330: The Anthropic probe (probeAnthropicEndpoint) is using the
loose default getCurlTimingArgs() and should use the validation-timeout helper
used by the OpenAI probes; update the runCurlProbe call inside
probeAnthropicEndpoint to replace getCurlTimingArgs() with the validation
timeout helper function used elsewhere for API validation (i.e., the same helper
the OpenAI probe uses) so Anthropic endpoints honor the stricter validation
budget.
- Around line 272-289: The retry path currently always adds an Authorization
header and targets "/chat/completions", losing the original authMode and causing
query‑param auth to fail; update the retry in the isTimeoutOrConnFailure branch
(where retriedAfterTimeout, getValidationProbeCurlArgs, runCurlProbe, apiKey,
normalizeCredentialValue, and endpointUrl are used) to preserve authMode by:
when authMode === "query-param" append the normalized apiKey as the same query
parameter used by the initial probe to the retry URL (instead of adding an
Authorization header), and otherwise keep adding the Authorization: Bearer
header; also reuse the same base endpointUrl construction
(String(endpointUrl).replace(/\/+$/, "")) rather than hardcoding a different
path so the retry targets the same route as the original probe.

In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 172-188: ensureOllamaAuthProxy sets ollamaProxyToken before
verifying the restarted proxy; change the flow so you only assign
ollamaProxyToken after confirming the proxy came up successfully. Specifically:
in ensureOllamaAuthProxy, keep using loadPersistedProxyToken and
killStaleProxy/spawnOllamaAuthProxy, but after spawnOllamaAuthProxy(token) wait
and re-check with isOllamaProxyProcess (or any existing health check) and only
set ollamaProxyToken when that check passes; if the check fails, do not set
ollamaProxyToken and surface/log an error so callers don’t proceed assuming a
live proxy. Make these changes around the ensureOllamaAuthProxy,
spawnOllamaAuthProxy, and isOllamaProxyProcess interaction.

In `@src/lib/onboard.ts`:
- Around line 855-857: The wrapper function upsertMessagingProviders currently
delegates to onboardProviders.upsertMessagingProviders but is not exported,
causing require("./onboard").upsertMessagingProviders to be undefined; add it to
the module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).
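As a hypothetical single-file sketch of the failure mode this comment describes (which the author later notes does not actually apply, since upsertMessagingProviders was never exported on main): a wrapper that exists in module scope is still invisible to consumers unless it is listed in the export object.

```typescript
// Stand-in for the extracted module's implementation (names illustrative).
const onboardProviders = {
  upsertMessagingProviders: (tokenDefs: string[]): number => tokenDefs.length,
};

// The wrapper exists in module scope...
function upsertMessagingProviders(tokenDefs: string[]): number {
  return onboardProviders.upsertMessagingProviders(tokenDefs);
}

// ...but if the export object omits it, require("./onboard") consumers
// see undefined:
const exportsBefore: Record<string, unknown> = { /* other exports only */ };

// Listing it explicitly makes it reachable:
const exportsAfter = { upsertMessagingProviders };

console.log(typeof exportsBefore.upsertMessagingProviders); // "undefined"
console.log(typeof exportsAfter.upsertMessagingProviders); // "function"
```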

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25755052-5c8a-4d5e-b4ab-9e665fb62af4

📥 Commits

Reviewing files that changed from the base of the PR and between 1f615e2 and 06e04d7.

📒 Files selected for processing (8)
  • .claude/refactor-onboard-plan.md
  • src/lib/onboard-inference-probes.ts
  • src/lib/onboard-ollama-proxy.ts
  • src/lib/onboard-providers.ts
  • src/lib/onboard.ts
  • test/credential-exposure.test.ts
  • test/onboard.test.ts
  • test/wsl2-probe-timeout.test.ts

Comment thread .claude/refactor-onboard-plan.md Outdated
Comment on lines +135 to +323
// eslint-disable-next-line complexity
function probeOpenAiLikeEndpoint(endpointUrl, model, apiKey, options = {}) {
  const useQueryParam = options.authMode === "query-param";
  const normalizedKey = apiKey ? normalizeCredentialValue(apiKey) : "";
  const baseUrl = String(endpointUrl).replace(/\/+$/, "");
  const authHeader = !useQueryParam && normalizedKey
    ? ["-H", `Authorization: Bearer ${normalizedKey}`]
    : [];
  const appendKey = (urlPath) =>
    useQueryParam && normalizedKey
      ? `${baseUrl}${urlPath}?key=${encodeURIComponent(normalizedKey)}`
      : `${baseUrl}${urlPath}`;

  const responsesProbe =
    options.requireResponsesToolCalling === true
      ? {
          name: "Responses API with tool calling",
          api: "openai-responses",
          execute: () => probeResponsesToolCalling(endpointUrl, model, apiKey, { authMode: options.authMode }),
        }
      : {
          name: "Responses API",
          api: "openai-responses",
          execute: () =>
            runCurlProbe([
              "-sS",
              ...getValidationProbeCurlArgs(),
              "-H",
              "Content-Type: application/json",
              ...authHeader,
              "-d",
              JSON.stringify({
                model,
                input: "Reply with exactly: OK",
              }),
              appendKey("/responses"),
            ]),
        };

  const chatCompletionsProbe = {
    name: "Chat Completions API",
    api: "openai-completions",
    execute: () =>
      runCurlProbe([
        "-sS",
        ...getValidationProbeCurlArgs(),
        "-H",
        "Content-Type: application/json",
        ...authHeader,
        "-d",
        JSON.stringify({
          model,
          messages: [{ role: "user", content: "Reply with exactly: OK" }],
        }),
        appendKey("/chat/completions"),
      ]),
  };

  // NVIDIA Build does not expose /v1/responses; probing it always returns
  // "404 page not found" and only adds noise to error messages. Skip it
  // entirely for that provider. See issue #1601.
  const probes = options.skipResponsesProbe
    ? [chatCompletionsProbe]
    : [responsesProbe, chatCompletionsProbe];

  const failures = [];
  for (const probe of probes) {
    const result = probe.execute();
    if (result.ok) {
      // Streaming event validation — catch backends like SGLang that return
      // valid non-streaming responses but emit incomplete SSE events in
      // streaming mode. Only run for /responses probes on custom endpoints
      // where probeStreaming was requested.
      if (probe.api === "openai-responses" && options.probeStreaming === true) {
        const streamResult = runStreamingEventProbe([
          "-sS",
          ...getValidationProbeCurlArgs(),
          "-H",
          "Content-Type: application/json",
          ...authHeader,
          "-d",
          JSON.stringify({
            model,
            input: "Reply with exactly: OK",
            stream: true,
          }),
          appendKey("/responses"),
        ]);
        if (!streamResult.ok && streamResult.missingEvents.length > 0) {
          // Backend responds but lacks required streaming events — fall back
          // to /chat/completions silently.
          console.log(` ℹ ${streamResult.message}`);
          failures.push({
            name: probe.name + " (streaming)",
            httpStatus: 0,
            curlStatus: 0,
            message: streamResult.message,
            body: "",
          });
          continue;
        }
        if (!streamResult.ok) {
          // Transport or execution failure — surface as a hard error instead
          // of silently switching APIs.
          return {
            ok: false,
            message: `${probe.name} (streaming): ${streamResult.message}`,
            failures: [
              {
                name: probe.name + " (streaming)",
                httpStatus: 0,
                curlStatus: 0,
                message: streamResult.message,
                body: "",
              },
            ],
          };
        }
      }
      return { ok: true, api: probe.api, label: probe.name };
    }
    // Preserve the raw response body alongside the summarized message so the
    // NVCF "Function not found for account" detector below can fall back to
    // the raw body if summarizeProbeError ever stops surfacing the marker
    // through `message`.
    failures.push({
      name: probe.name,
      httpStatus: result.httpStatus,
      curlStatus: result.curlStatus,
      message: result.message,
      body: result.body,
    });
  }

  // Single retry with doubled timeouts on timeout/connection failure.
  // WSL2's virtualized network stack can cause the initial probe to time out
  // before the TLS handshake completes. See issue #987.
  const isTimeoutOrConnFailure = (cs) => cs === 28 || cs === 6 || cs === 7;
  let retriedAfterTimeout = false;
  if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
    retriedAfterTimeout = true;
    const baseArgs = getValidationProbeCurlArgs();
    const doubledArgs = baseArgs.map((arg) =>
      /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
    );
    const retryResult = runCurlProbe([
      "-sS",
      ...doubledArgs,
      "-H",
      "Content-Type: application/json",
      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
      "-d",
      JSON.stringify({
        model,
        messages: [{ role: "user", content: "Reply with exactly: OK" }],
      }),
      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
    ]);
    if (retryResult.ok) {
      return { ok: true, api: "openai-completions", label: "Chat Completions API" };
    }
  }

  // Detect the NVCF "Function not found for account" error and reframe it
  // with an actionable next step instead of dumping the raw NVCF body.
  // See issue #1601 (Bug 2).
  const accountFailure = failures.find(
    (failure) =>
      isNvcfFunctionNotFoundForAccount(failure.message) ||
      isNvcfFunctionNotFoundForAccount(failure.body),
  );
  if (accountFailure) {
    return {
      ok: false,
      message: nvcfFunctionNotFoundMessage(model),
      failures,
    };
  }

  const baseMessage = failures.map((failure) => `${failure.name}: ${failure.message}`).join(" | ");
  const wslHint =
    isWsl() && retriedAfterTimeout
      ? " · WSL2 detected \u2014 network verification may be slower than expected. " +
        "Run `nemoclaw onboard` with the `--skip-verify` flag if this endpoint is known to be reachable."
      : "";
  return {
    ok: false,
    message: baseMessage + wslHint,
    failures,
  };
}

@coderabbitai coderabbitai Bot Apr 27, 2026


🛠️ Refactor suggestion | 🟠 Major

Please split probeOpenAiLikeEndpoint instead of disabling the complexity check.

The new module carries forward a very large decision tree and suppresses the repo’s complexity gate with eslint-disable-next-line complexity. This is exactly the kind of logic that becomes brittle during extractions like this; extracting the request builders, streaming fallback, retry path, and failure summarization into helpers would keep the public behavior intact without bypassing the rule.

As per coding guidelines, {src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.


jyaunches (PR author) replied:

Valid suggestion but out of scope for this PR. This is a pure extraction — code was moved as-is from onboard.ts with no behavior changes. Complexity reduction in the extracted modules is good follow-up material but should be a separate PR to keep this one reviewable.

coderabbitai Bot replied:

@jyaunches, understood — keeping extraction PRs behavior-preserving and deferring quality improvements is a sound approach. Would you like me to open a follow-up issue to track the probeOpenAiLikeEndpoint complexity refactor (extracting buildProbeRequests, handleStreamingFallback, executeRetryWithDoubledTimeouts, and summarizeFailures, and removing the eslint-disable-next-line complexity suppression)?


✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard-inference-probes.ts:135-323
Timestamp: 2026-04-27T13:06:57.456Z
Learning: In the NemoClaw repository, PRs that extract code from `src/lib/onboard.ts` into new modules (e.g., `onboard-providers.ts`, `onboard-ollama-proxy.ts`, `onboard-inference-probes.ts`) are intentionally behavior-preserving with no refactoring, so complexity or style improvements in the extracted modules should be deferred to separate follow-up PRs.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +272 to +289
if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
retriedAfterTimeout = true;
const baseArgs = getValidationProbeCurlArgs();
const doubledArgs = baseArgs.map((arg) =>
/^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
);
const retryResult = runCurlProbe([
"-sS",
...doubledArgs,
"-H",
"Content-Type: application/json",
...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
"-d",
JSON.stringify({
model,
messages: [{ role: "user", content: "Reply with exactly: OK" }],
}),
`${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Preserve authMode in the timeout retry path.

The retry always switches to Authorization: Bearer ... and a plain /chat/completions URL. For endpoints validated with authMode: "query-param", the initial probe can time out, then the retry will fail for the wrong reason because it drops the query-param auth scheme.

Suggested fix
   if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
     retriedAfterTimeout = true;
     const baseArgs = getValidationProbeCurlArgs();
     const doubledArgs = baseArgs.map((arg) =>
       /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
     );
+    const retryAuthHeader = !useQueryParam && normalizedKey
+      ? ["-H", `Authorization: Bearer ${normalizedKey}`]
+      : [];
+    const retryUrl =
+      useQueryParam && normalizedKey
+        ? `${baseUrl}/chat/completions?key=${encodeURIComponent(normalizedKey)}`
+        : `${baseUrl}/chat/completions`;
     const retryResult = runCurlProbe([
       "-sS",
       ...doubledArgs,
       "-H",
       "Content-Type: application/json",
-      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
+      ...retryAuthHeader,
       "-d",
       JSON.stringify({
         model,
         messages: [{ role: "user", content: "Reply with exactly: OK" }],
       }),
-      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
+      retryUrl,
     ]);

jyaunches (PR author) replied:

Pre-existing behavior, moved as-is. The authMode retry path works today because getProbeAuthMode now returns undefined for all providers (changed on main before this PR). Agree it could be more defensive — follow-up material.

coderabbitai Bot replied:

@jyaunches, understood — thanks for the context. Since getProbeAuthMode unconditionally returns undefined as of main, authMode is never "query-param" in practice, so the retry path is safe as-is. Happy to treat the hardening as follow-up.

Would you like me to open a follow-up GitHub issue to track making the retry path preserve authMode defensively (in case getProbeAuthMode ever returns a non-undefined value again)?

🐰 noted and filed away for later!


🧠 Learnings used
Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.

Comment on lines +327 to +330
function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getCurlTimingArgs(),

@coderabbitai coderabbitai Bot Apr 27, 2026


⚠️ Potential issue | 🟠 Major

Use the validation timeout helper for Anthropic probes too.

This path falls back to getCurlTimingArgs(), which the comment above explicitly says is the looser default budget. A broken Anthropic-compatible endpoint can therefore stall onboarding much longer than OpenAI-like validation.

Suggested fix
 function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
   const result = runCurlProbe([
     "-sS",
-    ...getCurlTimingArgs(),
+    ...getValidationProbeCurlArgs(),
     "-H",
     `x-api-key: ${normalizeCredentialValue(apiKey)}`,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getCurlTimingArgs(),

After:

function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getValidationProbeCurlArgs(),
    "-H",
    `x-api-key: ${normalizeCredentialValue(apiKey)}`,

jyaunches (PR author) replied:

Pre-existing behavior, moved as-is. Good catch on the timeout inconsistency — but not introduced by this extraction. Follow-up material.

coderabbitai Bot replied:

@jyaunches, understood — thanks for the context. Since this is a valid inconsistency worth tracking, would you like me to open a follow-up GitHub issue to capture the fix (replacing getCurlTimingArgs() with getValidationProbeCurlArgs() in probeAnthropicEndpoint) so it doesn't get lost?


🧠 Learnings used
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: src/lib/local-inference.ts:289-303
Timestamp: 2026-04-14T22:22:08.976Z
Learning: In `src/lib/local-inference.ts`, `getOllamaWarmupCommand()` intentionally returns `["bash", "-c", "nohup curl ... &"]` — a shell-wrapped argv array — because backgrounding (`nohup ... &`) and output redirection (`>/dev/null 2>&1`) require a shell and have no pure synchronous argv equivalent. The payload is safe: the model name is double-escaped via `JSON.stringify` then `shellQuote` (single-quoted). Switching to `spawn({ detached: true })` would change the caller contract to async and is out of scope. This is the one deliberate `bash -c` exception in the argv refactor (PR `#1886`), and the code has an explanatory comment documenting it.

Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: test/onboard-selection.test.ts:137-141
Timestamp: 2026-04-15T14:16:36.856Z
Learning: In `test/onboard-selection.test.ts` (NVIDIA/NemoClaw), the `runner.runCapture` (and `runner.run`) test stubs intentionally normalize commands with `const cmd = Array.isArray(command) ? command.join(" ") : command` because `onboard.ts` still sends legacy shell strings (28 `run()` + 10 `runCapture()` calls not yet migrated to argv), while `local-inference.ts` already sends argv arrays. This dual-form normalization is a deliberate, time-bounded shim for the migration period. Once `#1889` converts `onboard.ts` to argv, these stubs should be updated to assert `Array.isArray` and use structural per-argument checks instead. Do not flag the `join(" ")` normalization as weakening test coverage during this migration window.

Comment on lines +172 to +188
function ensureOllamaAuthProxy(): void {
  // Try to load persisted token first — if none, this isn't an Ollama setup.
  const token = loadPersistedProxyToken();
  if (!token) return;

  const pid = loadPersistedProxyPid();
  if (isOllamaProxyProcess(pid)) {
    ollamaProxyToken = token;
    return;
  }

  // Proxy not running — restart it with the persisted token.
  killStaleProxy();
  ollamaProxyToken = token;
  spawnOllamaAuthProxy(token);
  sleep(1);
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026

⚠️ Potential issue | 🟠 Major

Verify the restarted proxy before keeping the persisted token active.

If the restart fails here, ollamaProxyToken is still left in memory and the caller proceeds as if the proxy exists. After a reboot, that turns into opaque local-inference failures against a dead proxy endpoint.

Suggested fix
 function ensureOllamaAuthProxy(): void {
   const token = loadPersistedProxyToken();
   if (!token) return;
@@
   killStaleProxy();
   ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   sleep(1);
+  if (!isOllamaProxyProcess(restartedPid)) {
+    ollamaProxyToken = null;
+    clearPersistedProxyPid();
+    console.error(`  Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`);
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-ollama-proxy.ts` around lines 172 - 188,
ensureOllamaAuthProxy sets ollamaProxyToken before verifying the restarted
proxy; change the flow so you only assign ollamaProxyToken after confirming the
proxy came up successfully. Specifically: in ensureOllamaAuthProxy, keep using
loadPersistedProxyToken and killStaleProxy/spawnOllamaAuthProxy, but after
spawnOllamaAuthProxy(token) wait and re-check with isOllamaProxyProcess (or any
existing health check) and only set ollamaProxyToken when that check passes; if
the check fails, do not set ollamaProxyToken and surface/log an error so callers
don’t proceed assuming a live proxy. Make these changes around the
ensureOllamaAuthProxy, spawnOllamaAuthProxy, and isOllamaProxyProcess
interaction.
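The verify-before-keeping-the-token ordering described above can be sketched in isolation. This is a minimal illustration under stated assumptions: the helpers below are stubs standing in for the real ones in `onboard-ollama-proxy.ts`, the `spawnOllamaAuthProxy` return value (a pid) is an assumption not guaranteed by the real signature, and the stub `isOllamaProxyProcess` deliberately reports a dead proxy so the failure path runs.

```typescript
type Pid = number | null;

let ollamaProxyToken: string | null = null;

// --- stub stand-ins (assumptions, not the real implementations) ---
const loadPersistedProxyToken = (): string | null => "tok-123";
const killStaleProxy = (): void => {};
const spawnOllamaAuthProxy = (_token: string): Pid => 4242; // assume spawn yields a pid
const isOllamaProxyProcess = (_pid: Pid): boolean => false; // simulate a proxy that never comes up

function ensureOllamaAuthProxy(): boolean {
  const token = loadPersistedProxyToken();
  if (!token) return false;

  killStaleProxy();
  const pid = spawnOllamaAuthProxy(token);
  // Only keep the token in memory once the restarted proxy is verified alive.
  if (isOllamaProxyProcess(pid)) {
    ollamaProxyToken = token;
    return true;
  }
  ollamaProxyToken = null;
  console.error("Ollama auth proxy failed to restart");
  return false;
}

console.log(ensureOllamaAuthProxy()); // false: the stubbed proxy never comes up
console.log(ollamaProxyToken);        // null: token was not kept active
```

The key difference from the current code is that `ollamaProxyToken` is assigned only after the liveness check passes, so callers can no longer observe a token paired with a dead proxy.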

Contributor Author
Pre-existing behavior, moved as-is. The proxy restart-without-verify pattern exists on main today. Agree it should be hardened — follow-up material.

Contributor
@jyaunches, understood — thanks for the context. Since this is agreed-upon follow-up work, would you like me to open a GitHub issue to track the ensureOllamaAuthProxy restart-verification hardening? I can capture the proposed fix (verify the restarted PID via isOllamaProxyProcess, null-guard ollamaProxyToken on failure, and log an actionable error) as the issue body with a backlink to this PR.


🧠 Learnings used

Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.

Learnt from: jieunl24
Repo: NVIDIA/NemoClaw PR: 1104
File: bin/lib/onboard.js:2199-2200
Timestamp: 2026-03-30T17:32:16.517Z
Learning: In `bin/lib/onboard.js` (NemoClaw project), the WSL2 Ollama startup path intentionally does NOT clear an inherited `OLLAMA_HOST` environment variable (e.g., via `env -u OLLAMA_HOST`). The design decision is to respect any explicitly exported `OLLAMA_HOST` as an intentional user configuration choice. The fix only avoids *setting* `OLLAMA_HOST=0.0.0.0:11434` inline on WSL2; it does not override pre-existing env values. This is considered acceptable because neither the Ollama installer nor NemoClaw sets `OLLAMA_HOST` in shell profiles.

Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 2066
File: src/lib/shields.ts:162-180
Timestamp: 2026-04-19T14:03:06.546Z
Learning: In `src/lib/shields-timer.ts` (TypeScript/Node.js, NemoClaw), the re-lock and lock-verification logic is intentionally inlined (duplicated from `lockAgentConfig` in `shields.ts`) because the timer runs as a detached standalone script via `fork()` with `ts-nocheck` and cannot import TypeScript modules from the main codebase at runtime. Do not flag this duplication as a refactoring opportunity.

Comment thread src/lib/onboard.ts
Comment on lines +855 to 857
function upsertMessagingProviders(tokenDefs: MessagingTokenDef[]) {
  return onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell);
}
Contributor

@coderabbitai coderabbitai Bot Apr 27, 2026

⚠️ Potential issue | 🟠 Major

Re-export upsertMessagingProviders from this entry point.

Lines 855-857 add the wrapper, but module.exports never exposes it. That makes require("./onboard").upsertMessagingProviders resolve to undefined, which is a breaking API change for existing callers.

Suggested fix
 module.exports = {
   buildProviderArgs,
   buildGatewayBootstrapSecretsScript,
   buildSandboxConfigSyncScript,
   compactText,
@@
   summarizeCurlFailure,
   summarizeProbeFailure,
   hasResponsesToolCall,
+  upsertMessagingProviders,
   upsertProvider,
   hashCredential,
   detectMessagingCredentialRotation,
   hydrateCredentialEnv,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 855 - 857, The wrapper function
upsertMessagingProviders currently delegates to
onboardProviders.upsertMessagingProviders but is not exported, causing
require("./onboard").upsertMessagingProviders to be undefined; add it to the
module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).

Contributor Author

Not a regression — upsertMessagingProviders was never in module.exports on main either. It's an internal function used only within onboard.ts. The wrapper exists so internal call sites can reach the extracted implementation via runOpenshell injection.

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if upsertMessagingProviders appears in module.exports on the main branch
git show main:src/lib/onboard.ts | grep -n "upsertMessagingProviders" | head -30

Length of output: 408


@jyaunches, you're right — confirmed on main, upsertMessagingProviders has no entry in module.exports and is only used at internal call sites within onboard.ts. My comment was incorrect; this is not a regression. Apologies for the noise.

(╥﹏╥)


✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard.ts:855-857
Timestamp: 2026-04-27T13:07:20.779Z
Learning: In `src/lib/onboard.ts` (NVIDIA/NemoClaw), `upsertMessagingProviders` is a module-private function that is intentionally NOT exported via `module.exports`. It is used only internally within `onboard.ts` to delegate to `onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell)` with `runOpenshell` injection. Do not flag its absence from `module.exports` as a missing export or regression.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.



The .claude/refactor-onboard-plan.md was an internal planning doc
for the full 11-step extraction. Since this PR only lands Steps 1-3,
the full plan doc is no longer relevant and causes markdownlint failures.
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/credential-exposure.test.ts`:
- Around line 68-70: The existing credential-leak assertions only scan
ONBOARD_JS for KEY=VALUE and interpolation patterns, leaving
src/lib/onboard-providers.ts unscanned; update the test to use the combined
source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS) for all
credential-leak checks (not just the ["--credential", credentialEnv] assertion)
so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 14663b86-3f99-4dc4-8f3f-27336c99f31d

📥 Commits

Reviewing files that changed from the base of the PR and between 06e04d7 and c1d9335.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/credential-exposure.test.ts

Comment on lines +68 to +70
// buildProviderArgs lives in onboard-providers.ts; scan both files.
const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
Contributor

⚠️ Potential issue | 🟠 Major

Keep the broader credential-leak scan on onboard-providers.ts too.

This only expands the exact ["--credential", credentialEnv] assertion. The earlier KEY=VALUE / interpolation leak checks still read src/lib/onboard.ts only, so a regression inside src/lib/onboard-providers.ts would now bypass the security test suite.

Suggested change
 const ONBOARD_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard.ts");
 const ONBOARD_PROVIDERS_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard-providers.ts");
+const ONBOARD_SRC =
+  fs.readFileSync(ONBOARD_JS, "utf-8") +
+  "\n" +
+  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
@@
   it("onboard.js must not pass KEY=VALUE to --credential", () => {
-    const src = fs.readFileSync(ONBOARD_JS, "utf-8");
+    const src = ONBOARD_SRC;
     const lines = src.split("\n");
@@
   it("onboard.js --credential flags pass env var names only", () => {
-    // buildProviderArgs lives in onboard-providers.ts; scan both files.
-    const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
-      fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
+    const src = ONBOARD_SRC;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/credential-exposure.test.ts` around lines 68 - 70, The existing
credential-leak assertions only scan ONBOARD_JS for KEY=VALUE and interpolation
patterns, leaving src/lib/onboard-providers.ts unscanned; update the test to use
the combined source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS)
for all credential-leak checks (not just the ["--credential", credentialEnv]
assertion) so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.
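The gap the review describes, a leak pattern that a single-file scan misses but a combined-source scan catches, can be shown with a toy version of the check. Everything here is illustrative: the two source strings stand in for the real `onboard.ts` / `onboard-providers.ts` file contents, and the regex is a simplified stand-in for the test suite's actual KEY=VALUE leak assertions.

```typescript
// Stand-ins for the two files' contents; the second contains a deliberate leak.
const onboardSrc = `args.push("--credential", credentialEnv);`;
const providersSrc = `args.push("--credential", "API_KEY=" + secret);`;

// Build one combined source so every credential-leak assertion covers both modules.
const combined = onboardSrc + "\n" + providersSrc;

// Simplified leak pattern: a literal KEY=VALUE string passed after --credential.
const leakPattern = /--credential",\s*"[A-Z_]+=/;

console.log(leakPattern.test(onboardSrc)); // false: scanning onboard.ts alone misses the leak
console.log(leakPattern.test(combined));   // true: the combined scan catches it
```

This is why the review asks that all the leak checks, not just the `["--credential", credentialEnv]` assertion, read from the concatenated source.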

@wscurran wscurran added refactor This is a refactor of the code and/or architecture. dependencies Pull requests that update a dependency file labels Apr 27, 2026
@jyaunches jyaunches requested a review from cv April 27, 2026 16:05
@jyaunches jyaunches merged commit 7da86fa into main Apr 27, 2026
20 checks passed