refactor(onboard): extract modules from onboard.ts (Steps 1-3, rebased) #2495
Move provider metadata constants (REMOTE_PROVIDER_CONFIG, endpoint URLs, LOCAL_INFERENCE_PROVIDERS, DISCORD_SNOWFLAKE_RE), lookup helpers (getEffectiveProviderName, getNonInteractive*, getRequested*Hint), gateway CRUD (upsertProvider, providerExistsInGateway, upsertMessagingProviders, buildProviderArgs), and getSandboxInferenceConfig into a dedicated onboard-providers module. Add getProviderLabel() to consolidate the scattered if/else provider name→label chains (used in printDashboard). Functions that need runOpenshell accept it as a last parameter (dependency injection) to avoid circular requires. onboard.ts retains thin wrappers with the original signatures. Update source-scanning tests to read onboard-providers.ts where the code now lives.
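To make the injection pattern concrete, here is a minimal sketch — the function and module names come from this PR, but the simplified signatures and the `RunOpenshell` type are assumptions, not the real API:

```ts
// onboard-providers.ts — sketch of the dependency-injection shape.
// The extracted helper takes the shell runner as its last parameter
// instead of importing it from onboard.ts, which would create a
// circular require between the two modules.
type RunOpenshell = (args: string[]) => { code: number; stdout: string };

export function upsertProvider(
  name: string,
  args: string[],
  runOpenshell: RunOpenshell,
): void {
  // The real implementation builds the full argv (see buildProviderArgs);
  // this sketch only shows the injection seam.
  runOpenshell(["provider", "upsert", name, ...args]);
}

// onboard.ts — the thin wrapper preserves the original signature, so
// existing internal call sites and tests are untouched:
//
//   function upsertProvider(name, args) {
//     return onboardProviders.upsertProvider(name, args, runOpenshell);
//   }
```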
Move the entire Ollama auth-proxy subsystem into a dedicated module: token/PID persistence, proxy spawn/kill lifecycle, model prompt/pull, and the exposure warning. Zero coupling to other onboard logic — all dependencies (runner, ports, local-inference, credentials, model-prompts) are imported directly.
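A rough sketch of the module surface this describes — the function names appear elsewhere in this thread, but the `declare` signatures below (especially the return types) are assumptions:

```ts
// onboard-ollama-proxy.ts — a self-contained subsystem: its imports
// point only at leaf modules (runner, ports, local-inference,
// credentials, model-prompts), never at onboard.ts, so no runOpenshell
// injection is needed here, unlike onboard-providers.ts.

// Persisted state — token and PID survive across runs so a later
// invocation can re-attach to a live proxy instead of respawning it.
export declare function loadPersistedProxyToken(): string | null;
export declare function loadPersistedProxyPid(): number | null;

// Lifecycle — spawn/kill plus the re-attach-or-restart entry point
// quoted in full further down in this review thread.
export declare function spawnOllamaAuthProxy(token: string): number;
export declare function killStaleProxy(): void;
export declare function ensureOllamaAuthProxy(): void;
```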
Move inference endpoint probe functions (parseJsonObject, hasResponsesToolCall, shouldRequireResponsesToolCalling, getProbeAuthMode, getValidationProbeCurlArgs, probeResponsesToolCalling, probeOpenAiLikeEndpoint, probeAnthropicEndpoint) into a dedicated module. All dependencies (http-probe, credentials, platform, validation) are imported directly — no circular deps. The 4 validate* wrapper functions remain in onboard.ts for now as they depend on promptValidationRecovery; they'll consolidate when setupNim moves in a later step. Update source-scanning tests to read from the new compiled file.
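As an illustration of that remaining coupling, a sketch of the wrapper shape — `validateOpenAiLikeEndpoint` is a hypothetical name for one of the four wrappers, and the signatures and return types are assumptions:

```ts
// onboard.ts — the probe moved out, but the validate* wrapper stays
// because it closes over promptValidationRecovery, which still lives
// in onboard.ts until setupNim moves in a later step.
import { probeOpenAiLikeEndpoint } from "./onboard-inference-probes";

// Defined elsewhere in onboard.ts; declared here to keep the sketch
// self-contained.
declare function promptValidationRecovery(message: string): boolean;

function validateOpenAiLikeEndpoint(
  endpointUrl: string,
  model: string,
  apiKey: string,
): boolean {
  const result = probeOpenAiLikeEndpoint(endpointUrl, model, apiKey, {});
  if (result.ok) return true;
  // The interactive recovery prompt is the dependency that keeps this
  // wrapper from moving into onboard-inference-probes.ts for now.
  return promptValidationRecovery(result.message);
}
```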
📝 Walkthrough
This change refactors a large onboarding module into three focused, specialized modules. It extracts inference endpoint probing logic, Ollama proxy management, and provider configuration handling into separate files, while maintaining the same public API through re-exports from the main onboard module. Tests are updated to reference the new module structure.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/refactor-onboard-plan.md:
- Around line 1-3: Add the required SPDX license header to the top of
.claude/refactor-onboard-plan.md using an HTML comment block — insert the
repository-mandated SPDX line (e.g., <!-- SPDX-License-Identifier: ... -->) as
the very first lines of the Markdown file so it complies with the rule that
Markdown files use HTML comments for license headers; ensure the header exactly
matches the project's SPDX formatting convention.
In `@src/lib/onboard-inference-probes.ts`:
- Around line 135-323: probeOpenAiLikeEndpoint is too complex and the eslint
complexity suppression must be removed; refactor by extracting smaller helpers:
(1) buildProbeRequests that returns responsesProbe and chatCompletionsProbe
(move the auth/header/appendKey logic and JSON bodies out of the main function),
(2) handleStreamingFallback that encapsulates the runStreamingEventProbe logic
for "openai-responses" and returns the streamResult handling, (3)
executeRetryWithDoubledTimeouts encapsulating the retry logic (currently using
isTimeoutOrConnFailure, getValidationProbeCurlArgs, normalizeCredentialValue and
runCurlProbe), and (4) summarizeFailures that builds baseMessage/wslHint and
detects NVCF using isNvcfFunctionNotFoundForAccount and
nvcfFunctionNotFoundMessage; update probeOpenAiLikeEndpoint to orchestrate these
helpers, remove the eslint-disable-next-line complexity, and keep existing
behavior for responsesProbe, chatCompletionsProbe, runCurlProbe,
runStreamingEventProbe, and failure handling.
- Around line 327-330: The Anthropic probe (probeAnthropicEndpoint) is using the
loose default getCurlTimingArgs() and should use the validation-timeout helper
used by the OpenAI probes; update the runCurlProbe call inside
probeAnthropicEndpoint to replace getCurlTimingArgs() with the validation
timeout helper function used elsewhere for API validation (i.e., the same helper
the OpenAI probe uses) so Anthropic endpoints honor the stricter validation
budget.
- Around line 272-289: The retry path currently always adds an Authorization
header and targets "/chat/completions", losing the original authMode and causing
query‑param auth to fail; update the retry in the isTimeoutOrConnFailure branch
(where retriedAfterTimeout, getValidationProbeCurlArgs, runCurlProbe, apiKey,
normalizeCredentialValue, and endpointUrl are used) to preserve authMode by:
when authMode === "query-param" append the normalized apiKey as the same query
parameter used by the initial probe to the retry URL (instead of adding an
Authorization header), and otherwise keep adding the Authorization: Bearer
header; also reuse the same base endpointUrl construction
(String(endpointUrl).replace(/\/+$/, "")) rather than hardcoding a different
path so the retry targets the same route as the original probe.
In `@src/lib/onboard-ollama-proxy.ts`:
- Around line 172-188: ensureOllamaAuthProxy sets ollamaProxyToken before
verifying the restarted proxy; change the flow so you only assign
ollamaProxyToken after confirming the proxy came up successfully. Specifically:
in ensureOllamaAuthProxy, keep using loadPersistedProxyToken and
killStaleProxy/spawnOllamaAuthProxy, but after spawnOllamaAuthProxy(token) wait
and re-check with isOllamaProxyProcess (or any existing health check) and only
set ollamaProxyToken when that check passes; if the check fails, do not set
ollamaProxyToken and surface/log an error so callers don’t proceed assuming a
live proxy. Make these changes around the ensureOllamaAuthProxy,
spawnOllamaAuthProxy, and isOllamaProxyProcess interaction.
In `@src/lib/onboard.ts`:
- Around line 855-857: The wrapper function upsertMessagingProviders currently
delegates to onboardProviders.upsertMessagingProviders but is not exported,
causing require("./onboard").upsertMessagingProviders to be undefined; add it to
the module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 25755052-5c8a-4d5e-b4ab-9e665fb62af4
📒 Files selected for processing (8)
.claude/refactor-onboard-plan.md
src/lib/onboard-inference-probes.ts
src/lib/onboard-ollama-proxy.ts
src/lib/onboard-providers.ts
src/lib/onboard.ts
test/credential-exposure.test.ts
test/onboard.test.ts
test/wsl2-probe-timeout.test.ts
```ts
// eslint-disable-next-line complexity
function probeOpenAiLikeEndpoint(endpointUrl, model, apiKey, options = {}) {
  const useQueryParam = options.authMode === "query-param";
  const normalizedKey = apiKey ? normalizeCredentialValue(apiKey) : "";
  const baseUrl = String(endpointUrl).replace(/\/+$/, "");
  const authHeader = !useQueryParam && normalizedKey
    ? ["-H", `Authorization: Bearer ${normalizedKey}`]
    : [];
  const appendKey = (urlPath) =>
    useQueryParam && normalizedKey ? `${baseUrl}${urlPath}?key=${encodeURIComponent(normalizedKey)}` : `${baseUrl}${urlPath}`;

  const responsesProbe =
    options.requireResponsesToolCalling === true
      ? {
          name: "Responses API with tool calling",
          api: "openai-responses",
          execute: () => probeResponsesToolCalling(endpointUrl, model, apiKey, { authMode: options.authMode }),
        }
      : {
          name: "Responses API",
          api: "openai-responses",
          execute: () =>
            runCurlProbe([
              "-sS",
              ...getValidationProbeCurlArgs(),
              "-H",
              "Content-Type: application/json",
              ...authHeader,
              "-d",
              JSON.stringify({
                model,
                input: "Reply with exactly: OK",
              }),
              appendKey("/responses"),
            ]),
        };

  const chatCompletionsProbe = {
    name: "Chat Completions API",
    api: "openai-completions",
    execute: () =>
      runCurlProbe([
        "-sS",
        ...getValidationProbeCurlArgs(),
        "-H",
        "Content-Type: application/json",
        ...authHeader,
        "-d",
        JSON.stringify({
          model,
          messages: [{ role: "user", content: "Reply with exactly: OK" }],
        }),
        appendKey("/chat/completions"),
      ]),
  };

  // NVIDIA Build does not expose /v1/responses; probing it always returns
  // "404 page not found" and only adds noise to error messages. Skip it
  // entirely for that provider. See issue #1601.
  const probes = options.skipResponsesProbe
    ? [chatCompletionsProbe]
    : [responsesProbe, chatCompletionsProbe];

  const failures = [];
  for (const probe of probes) {
    const result = probe.execute();
    if (result.ok) {
      // Streaming event validation — catch backends like SGLang that return
      // valid non-streaming responses but emit incomplete SSE events in
      // streaming mode. Only run for /responses probes on custom endpoints
      // where probeStreaming was requested.
      if (probe.api === "openai-responses" && options.probeStreaming === true) {
        const streamResult = runStreamingEventProbe([
          "-sS",
          ...getValidationProbeCurlArgs(),
          "-H",
          "Content-Type: application/json",
          ...authHeader,
          "-d",
          JSON.stringify({
            model,
            input: "Reply with exactly: OK",
            stream: true,
          }),
          appendKey("/responses"),
        ]);
        if (!streamResult.ok && streamResult.missingEvents.length > 0) {
          // Backend responds but lacks required streaming events — fall back
          // to /chat/completions silently.
          console.log(`  ℹ ${streamResult.message}`);
          failures.push({
            name: probe.name + " (streaming)",
            httpStatus: 0,
            curlStatus: 0,
            message: streamResult.message,
            body: "",
          });
          continue;
        }
        if (!streamResult.ok) {
          // Transport or execution failure — surface as a hard error instead
          // of silently switching APIs.
          return {
            ok: false,
            message: `${probe.name} (streaming): ${streamResult.message}`,
            failures: [
              {
                name: probe.name + " (streaming)",
                httpStatus: 0,
                curlStatus: 0,
                message: streamResult.message,
                body: "",
              },
            ],
          };
        }
      }
      return { ok: true, api: probe.api, label: probe.name };
    }
    // Preserve the raw response body alongside the summarized message so the
    // NVCF "Function not found for account" detector below can fall back to
    // the raw body if summarizeProbeError ever stops surfacing the marker
    // through `message`.
    failures.push({
      name: probe.name,
      httpStatus: result.httpStatus,
      curlStatus: result.curlStatus,
      message: result.message,
      body: result.body,
    });
  }

  // Single retry with doubled timeouts on timeout/connection failure.
  // WSL2's virtualized network stack can cause the initial probe to time out
  // before the TLS handshake completes. See issue #987.
  const isTimeoutOrConnFailure = (cs) => cs === 28 || cs === 6 || cs === 7;
  let retriedAfterTimeout = false;
  if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
    retriedAfterTimeout = true;
    const baseArgs = getValidationProbeCurlArgs();
    const doubledArgs = baseArgs.map((arg) =>
      /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
    );
    const retryResult = runCurlProbe([
      "-sS",
      ...doubledArgs,
      "-H",
      "Content-Type: application/json",
      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
      "-d",
      JSON.stringify({
        model,
        messages: [{ role: "user", content: "Reply with exactly: OK" }],
      }),
      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
    ]);
    if (retryResult.ok) {
      return { ok: true, api: "openai-completions", label: "Chat Completions API" };
    }
  }

  // Detect the NVCF "Function not found for account" error and reframe it
  // with an actionable next step instead of dumping the raw NVCF body.
  // See issue #1601 (Bug 2).
  const accountFailure = failures.find(
    (failure) =>
      isNvcfFunctionNotFoundForAccount(failure.message) ||
      isNvcfFunctionNotFoundForAccount(failure.body),
  );
  if (accountFailure) {
    return {
      ok: false,
      message: nvcfFunctionNotFoundMessage(model),
      failures,
    };
  }

  const baseMessage = failures.map((failure) => `${failure.name}: ${failure.message}`).join(" | ");
  const wslHint =
    isWsl() && retriedAfterTimeout
      ? " · WSL2 detected \u2014 network verification may be slower than expected. " +
        "Run `nemoclaw onboard` with the `--skip-verify` flag if this endpoint is known to be reachable."
      : "";
  return {
    ok: false,
    message: baseMessage + wslHint,
    failures,
  };
}
```
🛠️ Refactor suggestion | 🟠 Major
Please split probeOpenAiLikeEndpoint instead of disabling the complexity check.
The new module carries forward a very large decision tree and suppresses the repo’s complexity gate with eslint-disable-next-line complexity. This is exactly the kind of logic that becomes brittle during extractions like this; extracting the request builders, streaming fallback, retry path, and failure summarization into helpers would keep the public behavior intact without bypassing the rule.
As per coding guidelines, {src,nemoclaw/src,scripts}/**/*.{js,ts,tsx}: TypeScript files must maintain cyclomatic complexity limit of 20, ratcheting down to 15. Enforced by ESLint.
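For reference, an outline of what that decomposition could look like — the helper names come from the review comment, while the signatures and return shapes below are assumptions, not a drop-in patch:

```ts
interface ProbeResult { ok: boolean; api?: string; label?: string; message?: string }
interface ProbeFailure { name: string; httpStatus: number; curlStatus: number; message: string; body: string }
interface Probe { name: string; api: string; execute(): ProbeResult & Partial<ProbeFailure> }

// The bodies would move verbatim out of the current function; only the
// orchestrator below would be new code.
declare function buildProbeRequests(url: string, model: string, key: string, opts: Record<string, unknown>):
  { responsesProbe: Probe; chatCompletionsProbe: Probe };
declare function handleStreamingFallback(probe: Probe, opts: Record<string, unknown>, failures: ProbeFailure[]):
  ProbeResult | null;
declare function executeRetryWithDoubledTimeouts(failures: ProbeFailure[], url: string, model: string, key: string):
  ProbeResult | null;
declare function summarizeFailures(failures: ProbeFailure[], model: string, retried: boolean): ProbeResult;

function probeOpenAiLikeEndpoint(
  url: string,
  model: string,
  key: string,
  opts: Record<string, unknown> = {},
): ProbeResult {
  const { responsesProbe, chatCompletionsProbe } = buildProbeRequests(url, model, key, opts);
  const probes = opts.skipResponsesProbe ? [chatCompletionsProbe] : [responsesProbe, chatCompletionsProbe];

  const failures: ProbeFailure[] = [];
  for (const probe of probes) {
    const result = probe.execute();
    if (result.ok) {
      // Returns a final success/error result, or null when streaming
      // validation failed softly and the next probe should be tried.
      const outcome = handleStreamingFallback(probe, opts, failures);
      if (outcome) return outcome;
      continue;
    }
    failures.push({
      name: probe.name,
      httpStatus: result.httpStatus ?? 0,
      curlStatus: result.curlStatus ?? 0,
      message: result.message ?? "",
      body: result.body ?? "",
    });
  }

  const retry = executeRetryWithDoubledTimeouts(failures, url, model, key);
  if (retry?.ok) return retry;
  return summarizeFailures(failures, model, retry !== null);
}
```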
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-inference-probes.ts` around lines 135 - 323,
probeOpenAiLikeEndpoint is too complex and the eslint complexity suppression
must be removed; refactor by extracting smaller helpers: (1) buildProbeRequests
that returns responsesProbe and chatCompletionsProbe (move the
auth/header/appendKey logic and JSON bodies out of the main function), (2)
handleStreamingFallback that encapsulates the runStreamingEventProbe logic for
"openai-responses" and returns the streamResult handling, (3)
executeRetryWithDoubledTimeouts encapsulating the retry logic (currently using
isTimeoutOrConnFailure, getValidationProbeCurlArgs, normalizeCredentialValue and
runCurlProbe), and (4) summarizeFailures that builds baseMessage/wslHint and
detects NVCF using isNvcfFunctionNotFoundForAccount and
nvcfFunctionNotFoundMessage; update probeOpenAiLikeEndpoint to orchestrate these
helpers, remove the eslint-disable-next-line complexity, and keep existing
behavior for responsesProbe, chatCompletionsProbe, runCurlProbe,
runStreamingEventProbe, and failure handling.
Valid suggestion but out of scope for this PR. This is a pure extraction — code was moved as-is from onboard.ts with no behavior changes. Complexity reduction in the extracted modules is good follow-up material but should be a separate PR to keep this one reviewable.
@jyaunches, understood — keeping extraction PRs behavior-preserving and deferring quality improvements is a sound approach. Would you like me to open a follow-up issue to track the probeOpenAiLikeEndpoint complexity refactor (extracting buildProbeRequests, handleStreamingFallback, executeRetryWithDoubledTimeouts, and summarizeFailures, and removing the eslint-disable-next-line complexity suppression)?
✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard-inference-probes.ts:135-323
Timestamp: 2026-04-27T13:06:57.456Z
Learning: In the NemoClaw repository, PRs that extract code from `src/lib/onboard.ts` into new modules (e.g., `onboard-providers.ts`, `onboard-ollama-proxy.ts`, `onboard-inference-probes.ts`) are intentionally behavior-preserving with no refactoring, so complexity or style improvements in the extracted modules should be deferred to separate follow-up PRs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
```ts
  if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
    retriedAfterTimeout = true;
    const baseArgs = getValidationProbeCurlArgs();
    const doubledArgs = baseArgs.map((arg) =>
      /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
    );
    const retryResult = runCurlProbe([
      "-sS",
      ...doubledArgs,
      "-H",
      "Content-Type: application/json",
      ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
      "-d",
      JSON.stringify({
        model,
        messages: [{ role: "user", content: "Reply with exactly: OK" }],
      }),
      `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
```
Preserve authMode in the timeout retry path.
The retry always switches to Authorization: Bearer ... and a plain /chat/completions URL. For endpoints validated with authMode: "query-param", the initial probe can time out, then the retry will fail for the wrong reason because it drops the query-param auth scheme.
Suggested fix
```diff
 if (failures.length > 0 && isTimeoutOrConnFailure(failures[0].curlStatus)) {
   retriedAfterTimeout = true;
   const baseArgs = getValidationProbeCurlArgs();
   const doubledArgs = baseArgs.map((arg) =>
     /^\d+$/.test(arg) ? String(Number(arg) * 2) : arg,
   );
+  const retryAuthHeader = !useQueryParam && normalizedKey
+    ? ["-H", `Authorization: Bearer ${normalizedKey}`]
+    : [];
+  const retryUrl =
+    useQueryParam && normalizedKey
+      ? `${baseUrl}/chat/completions?key=${encodeURIComponent(normalizedKey)}`
+      : `${baseUrl}/chat/completions`;
   const retryResult = runCurlProbe([
     "-sS",
     ...doubledArgs,
     "-H",
     "Content-Type: application/json",
-    ...(apiKey ? ["-H", `Authorization: Bearer ${normalizeCredentialValue(apiKey)}`] : []),
+    ...retryAuthHeader,
     "-d",
     JSON.stringify({
       model,
       messages: [{ role: "user", content: "Reply with exactly: OK" }],
     }),
-    `${String(endpointUrl).replace(/\/+$/, "")}/chat/completions`,
+    retryUrl,
   ]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-inference-probes.ts` around lines 272 - 289, The retry path
currently always adds an Authorization header and targets "/chat/completions",
losing the original authMode and causing query‑param auth to fail; update the
retry in the isTimeoutOrConnFailure branch (where retriedAfterTimeout,
getValidationProbeCurlArgs, runCurlProbe, apiKey, normalizeCredentialValue, and
endpointUrl are used) to preserve authMode by: when authMode === "query-param"
append the normalized apiKey as the same query parameter used by the initial
probe to the retry URL (instead of adding an Authorization header), and
otherwise keep adding the Authorization: Bearer header; also reuse the same base
endpointUrl construction (String(endpointUrl).replace(/\/+$/, "")) rather than
hardcoding a different path so the retry targets the same route as the original
probe.
Pre-existing behavior, moved as-is. The authMode retry path works today because getProbeAuthMode now returns undefined for all providers (changed on main before this PR). Agree it could be more defensive — follow-up material.
@jyaunches, understood — thanks for the context. Since getProbeAuthMode unconditionally returns undefined as of main, authMode is never "query-param" in practice, so the retry path is safe as-is. Happy to treat the hardening as follow-up.
Would you like me to open a follow-up GitHub issue to track making the retry path preserve authMode defensively (in case getProbeAuthMode ever returns a non-undefined value again)?
🐰 noted and filed away for later!
🧠 Learnings used
Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.
```ts
function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getCurlTimingArgs(),
```
Use the validation timeout helper for Anthropic probes too.
This path falls back to getCurlTimingArgs(), which the comment above explicitly says is the looser default budget. A broken Anthropic-compatible endpoint can therefore stall onboarding much longer than OpenAI-like validation.
Suggested fix
```diff
 function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
   const result = runCurlProbe([
     "-sS",
-    ...getCurlTimingArgs(),
+    ...getValidationProbeCurlArgs(),
     "-H",
     `x-api-key: ${normalizeCredentialValue(apiKey)}`,
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```ts
function probeAnthropicEndpoint(endpointUrl, model, apiKey) {
  const result = runCurlProbe([
    "-sS",
    ...getValidationProbeCurlArgs(),
    "-H",
    `x-api-key: ${normalizeCredentialValue(apiKey)}`,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-inference-probes.ts` around lines 327 - 330, The Anthropic
probe (probeAnthropicEndpoint) is using the loose default getCurlTimingArgs()
and should use the validation-timeout helper used by the OpenAI probes; update
the runCurlProbe call inside probeAnthropicEndpoint to replace
getCurlTimingArgs() with the validation timeout helper function used elsewhere
for API validation (i.e., the same helper the OpenAI probe uses) so Anthropic
endpoints honor the stricter validation budget.
Pre-existing behavior, moved as-is. Good catch on the timeout inconsistency — but not introduced by this extraction. Follow-up material.
@jyaunches, understood — thanks for the context. Since this is a valid inconsistency worth tracking, would you like me to open a follow-up GitHub issue to capture the fix (replacing getCurlTimingArgs() with getValidationProbeCurlArgs() in probeAnthropicEndpoint) so it doesn't get lost?
🧠 Learnings used
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: src/lib/local-inference.ts:289-303
Timestamp: 2026-04-14T22:22:08.976Z
Learning: In `src/lib/local-inference.ts`, `getOllamaWarmupCommand()` intentionally returns `["bash", "-c", "nohup curl ... &"]` — a shell-wrapped argv array — because backgrounding (`nohup ... &`) and output redirection (`>/dev/null 2>&1`) require a shell and have no pure synchronous argv equivalent. The payload is safe: the model name is double-escaped via `JSON.stringify` then `shellQuote` (single-quoted). Switching to `spawn({ detached: true })` would change the caller contract to async and is out of scope. This is the one deliberate `bash -c` exception in the argv refactor (PR `#1886`), and the code has an explanatory comment documenting it.
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: test/onboard-selection.test.ts:137-141
Timestamp: 2026-04-15T14:16:36.856Z
Learning: In `test/onboard-selection.test.ts` (NVIDIA/NemoClaw), the `runner.runCapture` (and `runner.run`) test stubs intentionally normalize commands with `const cmd = Array.isArray(command) ? command.join(" ") : command` because `onboard.ts` still sends legacy shell strings (28 `run()` + 10 `runCapture()` calls not yet migrated to argv), while `local-inference.ts` already sends argv arrays. This dual-form normalization is a deliberate, time-bounded shim for the migration period. Once `#1889` converts `onboard.ts` to argv, these stubs should be updated to assert `Array.isArray` and use structural per-argument checks instead. Do not flag the `join(" ")` normalization as weakening test coverage during this migration window.
```ts
function ensureOllamaAuthProxy(): void {
  // Try to load persisted token first — if none, this isn't an Ollama setup.
  const token = loadPersistedProxyToken();
  if (!token) return;

  const pid = loadPersistedProxyPid();
  if (isOllamaProxyProcess(pid)) {
    ollamaProxyToken = token;
    return;
  }

  // Proxy not running — restart it with the persisted token.
  killStaleProxy();
  ollamaProxyToken = token;
  spawnOllamaAuthProxy(token);
  sleep(1);
}
```
Verify the restarted proxy before keeping the persisted token active.
If the restart fails here, ollamaProxyToken is still left in memory and the caller proceeds as if the proxy exists. After a reboot, that turns into opaque local-inference failures against a dead proxy endpoint.
Suggested fix
```diff
 function ensureOllamaAuthProxy(): void {
   const token = loadPersistedProxyToken();
   if (!token) return;
 @@
   killStaleProxy();
   ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   sleep(1);
+  if (!isOllamaProxyProcess(restartedPid)) {
+    ollamaProxyToken = null;
+    clearPersistedProxyPid();
+    console.error(`  Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`);
+  }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-ollama-proxy.ts` around lines 172 - 188,
ensureOllamaAuthProxy sets ollamaProxyToken before verifying the restarted
proxy; change the flow so you only assign ollamaProxyToken after confirming the
proxy came up successfully. Specifically: in ensureOllamaAuthProxy, keep using
loadPersistedProxyToken and killStaleProxy/spawnOllamaAuthProxy, but after
spawnOllamaAuthProxy(token) wait and re-check with isOllamaProxyProcess (or any
existing health check) and only set ollamaProxyToken when that check passes; if
the check fails, do not set ollamaProxyToken and surface/log an error so callers
don’t proceed assuming a live proxy. Make these changes around the
ensureOllamaAuthProxy, spawnOllamaAuthProxy, and isOllamaProxyProcess
interaction.
Pre-existing behavior, moved as-is. The proxy restart-without-verify pattern exists on main today. Agree it should be hardened — follow-up material.
@jyaunches, understood — thanks for the context. Since this is agreed-upon follow-up work, would you like me to open a GitHub issue to track the ensureOllamaAuthProxy restart-verification hardening? I can capture the proposed fix (verify the restarted PID via isOllamaProxyProcess, null-guard ollamaProxyToken on failure, and log an actionable error) as the issue body with a backlink to this PR.
🧠 Learnings used
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: src/lib/local-inference.ts:289-303
Timestamp: 2026-04-14T22:22:08.976Z
Learning: In `src/lib/local-inference.ts`, `getOllamaWarmupCommand()` intentionally returns `["bash", "-c", "nohup curl ... &"]` — a shell-wrapped argv array — because backgrounding (`nohup ... &`) and output redirection (`>/dev/null 2>&1`) require a shell and have no pure synchronous argv equivalent. The payload is safe: the model name is double-escaped via `JSON.stringify` then `shellQuote` (single-quoted). Switching to `spawn({ detached: true })` would change the caller contract to async and is out of scope. This is the one deliberate `bash -c` exception in the argv refactor (PR `#1886`), and the code has an explanatory comment documenting it.
Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 1967
File: test/e2e/test-token-rotation.sh:0-0
Timestamp: 2026-04-16T18:34:40.564Z
Learning: In `test/e2e/test-token-rotation.sh` (NemoClaw E2E Bash tests), the token-rotation E2E test intentionally uses fake/invalid Telegram bot tokens so no real Telegram accounts are required in CI. Because fake tokens always return HTTP 401 from `api.telegram.org/bot.../getMe`, in-sandbox `getMe` HTTP status checks cannot distinguish old vs new credentials and are therefore intentionally omitted. Rotation correctness is instead validated via control-flow assertions (grepping onboard output for "credential(s) rotated" and "Rebuilding sandbox"). Live L7 proxy verification with real tokens is a tracked follow-up in issue `#1903`.
Learnt from: jieunl24
Repo: NVIDIA/NemoClaw PR: 1104
File: bin/lib/onboard.js:2199-2200
Timestamp: 2026-03-30T17:32:16.517Z
Learning: In `bin/lib/onboard.js` (NemoClaw project), the WSL2 Ollama startup path intentionally does NOT clear an inherited `OLLAMA_HOST` environment variable (e.g., via `env -u OLLAMA_HOST`). The design decision is to respect any explicitly exported `OLLAMA_HOST` as an intentional user configuration choice. The fix only avoids *setting* `OLLAMA_HOST=0.0.0.0:11434` inline on WSL2; it does not override pre-existing env values. This is considered acceptable because neither the Ollama installer nor NemoClaw sets `OLLAMA_HOST` in shell profiles.
Learnt from: ericksoa
Repo: NVIDIA/NemoClaw PR: 2066
File: src/lib/shields.ts:162-180
Timestamp: 2026-04-19T14:03:06.546Z
Learning: In `src/lib/shields-timer.ts` (TypeScript/Node.js, NemoClaw), the re-lock and lock-verification logic is intentionally inlined (duplicated from `lockAgentConfig` in `shields.ts`) because the timer runs as a detached standalone script via `fork()` with `ts-nocheck` and cannot import TypeScript modules from the main codebase at runtime. Do not flag this duplication as a refactoring opportunity.
```ts
function upsertMessagingProviders(tokenDefs: MessagingTokenDef[]) {
  return onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell);
}
```
Re-export upsertMessagingProviders from this entry point.
Lines 855-857 add the wrapper, but module.exports never exposes it. That makes require("./onboard").upsertMessagingProviders resolve to undefined, which is a breaking API change for existing callers.
Suggested fix
```diff
 module.exports = {
   buildProviderArgs,
   buildGatewayBootstrapSecretsScript,
   buildSandboxConfigSyncScript,
   compactText,
 @@
   summarizeCurlFailure,
   summarizeProbeFailure,
   hasResponsesToolCall,
+  upsertMessagingProviders,
   upsertProvider,
   hashCredential,
   detectMessagingCredentialRotation,
   hydrateCredentialEnv,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard.ts` around lines 855 - 857, The wrapper function
upsertMessagingProviders currently delegates to
onboardProviders.upsertMessagingProviders but is not exported, causing
require("./onboard").upsertMessagingProviders to be undefined; add it to the
module's public exports (e.g., include upsertMessagingProviders in
module.exports or the existing export object) so callers can access it, ensuring
the exported symbol name matches upsertMessagingProviders and continues to call
onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell).
Not a regression — upsertMessagingProviders was never in module.exports on main either. It's an internal function used only within onboard.ts. The wrapper exists so internal call sites can reach the extracted implementation via runOpenshell injection.
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check if upsertMessagingProviders appears in module.exports on the main branch
git show main:src/lib/onboard.ts | grep -n "upsertMessagingProviders" | head -30
```

Length of output: 408
@jyaunches, you're right — confirmed on main, upsertMessagingProviders has no entry in module.exports and is only used at internal call sites within onboard.ts. My comment was incorrect; this is not a regression. Apologies for the noise.
(╥﹏╥)
✏️ Learnings added
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 2495
File: src/lib/onboard.ts:855-857
Timestamp: 2026-04-27T13:07:20.779Z
Learning: In `src/lib/onboard.ts` (NVIDIA/NemoClaw), `upsertMessagingProviders` is a module-private function that is intentionally NOT exported via `module.exports`. It is used only internally within `onboard.ts` to delegate to `onboardProviders.upsertMessagingProviders(tokenDefs, runOpenshell)` with `runOpenshell` injection. Do not flag its absence from `module.exports` as a missing export or regression.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: jyaunches
Repo: NVIDIA/NemoClaw PR: 1886
File: test/onboard-selection.test.ts:137-141
Timestamp: 2026-04-15T14:16:36.856Z
Learning: In `test/onboard-selection.test.ts` (NVIDIA/NemoClaw), the `runner.runCapture` (and `runner.run`) test stubs intentionally normalize commands with `const cmd = Array.isArray(command) ? command.join(" ") : command` because `onboard.ts` still sends legacy shell strings (28 `run()` + 10 `runCapture()` calls not yet migrated to argv), while `local-inference.ts` already sends argv arrays. This dual-form normalization is a deliberate, time-bounded shim for the migration period. Once `#1889` converts `onboard.ts` to argv, these stubs should be updated to assert `Array.isArray` and use structural per-argument checks instead. Do not flag the `join(" ")` normalization as weakening test coverage during this migration window.
The .claude/refactor-onboard-plan.md was an internal planning doc for the full 11-step extraction. Since this PR only lands Steps 1-3, the full plan doc is no longer relevant and causes markdownlint failures.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/credential-exposure.test.ts`:
- Around line 68-70: The existing credential-leak assertions only scan
ONBOARD_JS for KEY=VALUE and interpolation patterns, leaving
src/lib/onboard-providers.ts unscanned; update the test to use the combined
source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS) for all
credential-leak checks (not just the ["--credential", credentialEnv] assertion)
so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 14663b86-3f99-4dc4-8f3f-27336c99f31d
📒 Files selected for processing (2)
src/lib/onboard.ts
test/credential-exposure.test.ts
```ts
// buildProviderArgs lives in onboard-providers.ts; scan both files.
const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
```
Keep the broader credential-leak scan on onboard-providers.ts too.
This only expands the exact ["--credential", credentialEnv] assertion. The earlier KEY=VALUE / interpolation leak checks still read src/lib/onboard.ts only, so a regression inside src/lib/onboard-providers.ts would now bypass the security test suite.
Suggested change
```diff
 const ONBOARD_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard.ts");
 const ONBOARD_PROVIDERS_JS = path.join(import.meta.dirname, "..", "src", "lib", "onboard-providers.ts");
+const ONBOARD_SRC =
+  fs.readFileSync(ONBOARD_JS, "utf-8") +
+  "\n" +
+  fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
 @@
 it("onboard.js must not pass KEY=VALUE to --credential", () => {
-  const src = fs.readFileSync(ONBOARD_JS, "utf-8");
+  const src = ONBOARD_SRC;
   const lines = src.split("\n");
 @@
 it("onboard.js --credential flags pass env var names only", () => {
-  // buildProviderArgs lives in onboard-providers.ts; scan both files.
-  const src = fs.readFileSync(ONBOARD_JS, "utf-8") +
-    fs.readFileSync(ONBOARD_PROVIDERS_JS, "utf-8");
+  const src = ONBOARD_SRC;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/credential-exposure.test.ts` around lines 68 - 70, The existing
credential-leak assertions only scan ONBOARD_JS for KEY=VALUE and interpolation
patterns, leaving src/lib/onboard-providers.ts unscanned; update the test to use
the combined source (the const src built from ONBOARD_JS + ONBOARD_PROVIDERS_JS)
for all credential-leak checks (not just the ["--credential", credentialEnv]
assertion) so that functions like buildProviderArgs and any KEY=VALUE / ${...}
interpolation patterns in onboard-providers.ts are included in the same
assertions.
⚡ Interim PR — Unblocks #2306
This is an interim extraction PR that lands the foundational module decomposition needed by #2306 (extract rebuild/recreate path + canonicalize credential resolution). It is intentionally scoped to pure code movement with no behavior changes.
Rebased onto current main. Supersedes #2087 and #2489.
Summary
Steps 1–3 of the original extraction plan (authored by @ericksoa in #2087):
- `onboard-providers.ts` — provider metadata, CRUD, labels, inference config
- `onboard-ollama-proxy.ts` — proxy lifecycle, model pull/validate
- `onboard-inference-probes.ts` — endpoint validation probes (OpenAI, Anthropic, Gemini)

Reduces `onboard.ts` by ~1,000 lines. All 3 new modules are under 400 lines. No behavior changes — all existing exports remain accessible via re-exports and thin wrappers.

Step 4 (dashboard extraction) was dropped — main already landed `dashboard-contract.ts`, `dashboard-health.ts`, and `dashboard-recover.ts` (#2398), which supersede that extraction.

Steps 5–10 (gateway, preflight, messaging, policies, inference-setup, sandbox) remain deferred per the original plan.
Downstream: #2306
#2306 (extract rebuild/recreate path + canonicalize credential resolution) builds directly on top of this PR. It imports `REMOTE_PROVIDER_CONFIG`, `upsertProvider`, and `ensureOllamaAuthProxy` from the new extracted modules via onboard.ts re-exports. Merging this PR first gives #2306 a clean module surface to build on.

CodeRabbit comments
All 6 comments addressed:
- `upsertMessagingProviders` was never exported on main

Rebase notes

- `startOllamaAuthProxy` now returns `boolean`, `getProbeAuthMode` always returns `undefined`
- `require()` imports to satisfy strict TypeScript
- `onboard.ts` still references after extraction

Test plan

- `npx tsc -p tsconfig.src.json --noEmit` compiles cleanly
- `npx tsc -p tsconfig.cli.json --noEmit` compiles cleanly (pre-existing failures only, same as main)
- `nemoclaw onboard` completes successfully

Co-authored-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit
Refactor
Tests