fix(rebuild): reuse OpenShell gateway credential when host env is empty #3918

Open
ssam18 wants to merge 8 commits into NVIDIA:main from ssam18:fix/3895-rebuild-reuse-gateway-credential

Conversation

@ssam18 ssam18 commented May 20, 2026

Rebuild was demanding the provider credential from the shell even when onboard had already registered it with the OpenShell gateway, which broke the channel-add auto-rebuild flow whenever the operator no longer had the env var exported. The preflight now falls back to a gateway provider lookup before bailing, and setupInference no longer overwrites the stored credential when the gateway already holds one. Existing tests were tightened so the fail path still requires both the env var and the gateway provider to be missing, and new tests cover the gateway reuse path for nvidia, anthropic, openai, and gemini. Closes #3895

Summary by CodeRabbit

  • Bug Fixes

    • Sandbox rebuild now reuses credentials from the OpenShell gateway when local environment credentials are missing, avoiding unnecessary rebuild failures
    • Provider setup avoids redundant credential updates when gateway already has a provider; mutable-endpoint providers remain exempt and still require local credentials
    • Improved user-facing messaging and failure handling when gateway lookups fail
  • Tests

    • Expanded preflight tests covering gateway provider reuse, lookup failures, mutable-endpoint behavior, and related regressions

Review Change Stack

Rebuild preflight aborted when NVIDIA_API_KEY or any other remote provider credential was missing from the shell even though onboard had already stored it in the OpenShell gateway. This made the channel-add auto-rebuild flow fail for users who closed the shell between onboard and the channel change. Fall back to providerExistsInGateway when the env var is empty so the gateway-stored credential is reused, and add a defensive skip in setupInference so a missing host env cannot overwrite the stored value with empty.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
copy-pr-bot Bot commented May 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented May 20, 2026

Walkthrough

Rebuild and onboarding now consult the OpenShell gateway for provider credentials. When an expected env credential is missing, code checks gateway registration (except for mutable-endpoint providers), reuses gateway-stored credentials when present, and avoids upserting empty credentials during setupInference().
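
The preflight order described in this walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code in rebuild.ts: `decideRebuildPreflight`, `PreflightInput`, and the `PreflightDecision` shape are hypothetical names, and the gateway lookup is injected as a plain function so the sketch stays self-contained.

```typescript
// Sketch only: names below are hypothetical, not taken from rebuild.ts.
type GatewayLookup = "exists" | "missing" | "lookup_failed";

interface PreflightInput {
  provider: string;
  envValue: string | undefined; // credential hydrated from the host env, if any
  isMutableEndpoint: boolean; // true for operator-supplied endpoint providers
}

type PreflightDecision =
  | { action: "use_env" }
  | { action: "reuse_gateway" }
  | { action: "abort"; reason: "missing_credential" | "gateway_lookup_failed" };

function decideRebuildPreflight(
  input: PreflightInput,
  lookup: (provider: string) => GatewayLookup,
): PreflightDecision {
  // A non-empty host credential always wins; no gateway call is needed.
  if (input.envValue !== undefined && input.envValue.trim() !== "") {
    return { action: "use_env" };
  }
  // Mutable-endpoint providers are exempt from reuse and still need a local credential.
  if (input.isMutableEndpoint) {
    return { action: "abort", reason: "missing_credential" };
  }
  const result = lookup(input.provider);
  if (result === "exists") return { action: "reuse_gateway" };
  if (result === "lookup_failed") {
    // A gateway outage is reported separately from a truly missing credential.
    return { action: "abort", reason: "gateway_lookup_failed" };
  }
  return { action: "abort", reason: "missing_credential" };
}
```

Returning a decision value instead of logging and exiting inline keeps the three abort and reuse branches easy to cover in unit tests without spawning the OpenShell CLI.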

Changes

Gateway Provider Credential Reuse

| Layer / File(s) | Summary |
| --- | --- |
| Provider lookup & mutability helpers (src/lib/onboard/providers.ts, src/lib/actions/sandbox/rebuild.ts) | Adds mutable-endpoint allowlist and isMutableEndpointProvider(), introduces lookupProviderInGateway() to classify exists/missing/lookup_failed, and updates imports used by rebuild preflight. |
| Rebuild preflight gateway provider check (src/lib/actions/sandbox/rebuild.ts) | When a hydrated credential is missing, conditionally consults lookupProviderInGateway() (skipping mutable providers); clears the env requirement to reuse gateway credential if exists, aborts with lookup/RPC error on lookup_failed, or preserves prior abort messaging on missing. |
| Inference provider upsert wrapper (src/lib/onboard/inference-provider-upsert.ts) | New reuseGatewayOrUpsertInferenceProvider() short-circuits when no credentialValue and provider exists in gateway (and is not mutable), otherwise delegates to handlers.upsertProvider and maps failures to { kind: "retry" }, { kind: "selection" }, or exit codes. |
| setupInference() conditional provider upsert (src/lib/onboard.ts) | Replaces direct upsertProvider() calls with reuseGatewayOrUpsertInferenceProvider() and handles exit/retry/selection outcomes accordingly; passes empty env when no hydrated credential. |
| Rebuild preflight tests and fixtures (test/rebuild-credential-preflight.test.ts) | Fixture gains providerLookupErrors; adds tests for gateway reuse, lookup failures, mutable-endpoint exclusion, parameterized provider reuse, and scopes nonzero-exit assertions to gateway-missing cases. |
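
The short-circuit summarized for the inference provider upsert wrapper could look something like the sketch below. It is an illustration under stated assumptions, not the contents of inference-provider-upsert.ts: the handler signatures and the `UpsertRequest` shape are hypothetical, `reuseGatewayOrUpsert` is a simplified stand-in for the real wrapper, and the mapping of upsert failures to retry, selection, or exit outcomes is left out. The two mutable provider types are the ones named in the review thread.

```typescript
// Sketch only: handler signatures and request shape are assumptions.
const MUTABLE_ENDPOINT_PROVIDER_TYPES: ReadonlySet<string> = new Set([
  "compatible-endpoint",
  "compatible-anthropic-endpoint",
]);

function isMutableEndpointProvider(providerType: string): boolean {
  return MUTABLE_ENDPOINT_PROVIDER_TYPES.has(providerType);
}

interface UpsertRequest {
  provider: string;
  providerType: string;
  credentialValue: string | undefined;
}

interface UpsertHandlers {
  providerExistsInGateway: (provider: string) => boolean;
  upsertProvider: (request: UpsertRequest) => void;
}

function reuseGatewayOrUpsert(
  request: UpsertRequest,
  handlers: UpsertHandlers,
): "reused" | "upserted" {
  const hasCredential =
    request.credentialValue !== undefined &&
    request.credentialValue.trim() !== "";
  // Skip the upsert only when there is nothing new to write, the provider type
  // has a fixed endpoint, and the gateway already holds a registration.
  const canReuse =
    !hasCredential &&
    !isMutableEndpointProvider(request.providerType) &&
    handlers.providerExistsInGateway(request.provider);
  if (canReuse) return "reused";
  // Every other case delegates to the handler, as the summary describes.
  handlers.upsertProvider(request);
  return "upserted";
}
```

Checking the credential and the provider type before calling the gateway means the lookup only runs when reuse is actually possible.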

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

fix, OpenShell, Sandbox, NemoClaw CLI

Suggested reviewers

  • ericksoa
  • jyaunches
  • cv

Poem

🐰 A gateway key once hid away,
No env var to show the way.
We peek the gateway, careful and kind,
Reuse the secret that we find.
Rebuild hops forward, sandbox sings—hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(rebuild): reuse OpenShell gateway credential when host env is empty' directly and clearly describes the main change: enabling rebuild to reuse gateway-stored credentials when host environment variables are absent. |
| Linked Issues check | ✅ Passed | The code changes comprehensively address issue #3895: rebuild preflight now falls back to gateway credential lookup when host env is empty, setupInference avoids overwriting with empty credentials, and tests verify both gateway reuse and failure scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the linked issue: preflight lookup logic, credential reuse in setupInference, helper functions for gateway lookups and provider checks, and comprehensive test coverage for the new behavior. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/lib/actions/sandbox/rebuild.ts (1)

371-402: Please run the rebuild channel lifecycle E2E on this path.

This change is in the rebuild destroy/recreate control flow, so I’d still run gh workflow run nightly-e2e.yaml --ref <branch> -f jobs=channels-stop-start-e2e once before merge to catch any resume-time regression that only shows up with cached credentials and channel state. As per coding guidelines, src/lib/actions/sandbox/rebuild.ts changes should exercise channels-stop-start-e2e because this file controls disabled channel resolution used during onboard and rebuild.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/rebuild.ts` around lines 371 - 402, Run the rebuild
channel lifecycle E2E (channels-stop-start-e2e) against this branch before
merging: execute `gh workflow run nightly-e2e.yaml --ref <branch> -f
jobs=channels-stop-start-e2e` and verify the rebuild/destroy→recreate flow in
src/lib/actions/sandbox/rebuild.ts (especially the logic around
rebuildCredentialEnv and providerExistsInGateway/gateway provider path) to catch
resume-time regressions with cached credentials and channel state; only merge
after the job passes and you confirm the rebuild behavior is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/rebuild.ts`:
- Around line 371-397: The preflight currently treats any falsy return from
providerExistsInGateway(...) as "credential not found", which masks gateway/CLI
errors; change providerExistsInGateway (or add a new wrapper) to surface errors
(tri-state: registered / not-registered / lookup-failed or throw on
lookup-failure), then update this block (using rebuildProvider, runOpenshell,
gatewayHasProvider, rebuildCredentialEnv, bail) to: if lookup-failed -> log a
distinct connectivity/lookup-error message telling users to retry/check
OpenShell and bail with a different error code/message; if not-registered ->
keep the existing missing-credential flow; if registered -> keep current
skip-and-nullify behavior. Add/adjust unit/e2e tests (channels-stop-start-e2e)
to cover the lookup-failure path so the rebuild preflight distinguishes
transient gateway failures from true missing credentials.

In `@src/lib/onboard.ts`:
- Around line 7506-7515: The current skipUpsertReusingGatewayCredential guard
only checks credentialValue and providerExistsInGateway(provider) and therefore
incorrectly skips upsert for mutable provider types that carry a user-selected
resolvedEndpointUrl; update the guard so that you only skip the upsert when the
provider exists in the gateway AND the provider type is immutable (i.e., not
'compatible-endpoint' or 'compatible-anthropic-endpoint') and there is no
resolvedEndpointUrl change. Concretely, change the logic around
skipUpsertReusingGatewayCredential (the variable computed before calling
upsertProvider) to also inspect config.providerType and resolvedEndpointUrl (and
credentialValue) so that for mutable types (compatible-endpoint /
compatible-anthropic-endpoint) you do not skip calling upsertProvider(provider,
config.providerType, resolvedCredentialEnv, resolvedEndpointUrl, env).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 48f597b6-194e-4f17-a218-13c120e91243

📥 Commits

Reviewing files that changed from the base of the PR and between a438743 and 8cda768.

📒 Files selected for processing (3)
  • src/lib/actions/sandbox/rebuild.ts
  • src/lib/onboard.ts
  • test/rebuild-credential-preflight.test.ts

Comment thread src/lib/actions/sandbox/rebuild.ts
Comment thread src/lib/onboard.ts Outdated
ssam18 added 3 commits May 20, 2026 13:49
Keep src/lib/onboard.ts net-neutral by moving the inference provider upsert, with its gateway reuse short circuit, into src/lib/onboard/inference-provider-upsert.ts. Behavior is unchanged; this only addresses the onboard entrypoint line budget for the issue 3895 work.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
…points

Address review feedback on PR 3918. providerExistsInGateway collapsed every non-zero openshell exit to false, so a gateway outage looked identical to a missing provider and the user got a misleading missing credential message. Added a tristate lookupProviderInGateway and a dedicated lookup_failed error path that points at openshell status. Also stopped reusing the gateway credential for compatible-endpoint and compatible-anthropic-endpoint, since those carry an operator-supplied base URL that a rebuild must re-upsert. Added regression tests for both paths.

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
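
The difference between the old boolean check and the tri-state lookup described in this commit can be sketched as below. This is an illustration only: `CommandResult`, `classifyGatewayLookup`, and `legacyProviderExists` are hypothetical, and the use of a "not found" marker in stderr to recognize a missing provider is an assumption, since the thread does not show how the real `lookupProviderInGateway` tells the two failure cases apart.

```typescript
// Sketch only: the result shape and the "not found" heuristic are assumptions.
type GatewayLookup = "exists" | "missing" | "lookup_failed";

interface CommandResult {
  exitCode: number;
  stderr: string;
}

// Old behavior as described in the commit message: every non-zero exit
// collapses to false, so an outage looks the same as a missing provider.
function legacyProviderExists(result: CommandResult): boolean {
  return result.exitCode === 0;
}

// Tri-state classification: only an explicit "not found" signal counts as
// missing; any other failure is reported as a failed lookup.
function classifyGatewayLookup(result: CommandResult): GatewayLookup {
  if (result.exitCode === 0) return "exists";
  if (/not found/i.test(result.stderr)) return "missing"; // assumed marker
  return "lookup_failed";
}

const outage: CommandResult = { exitCode: 1, stderr: "connection refused" };
const absent: CommandResult = { exitCode: 1, stderr: "provider not found" };

console.log(legacyProviderExists(outage), legacyProviderExists(absent));
console.log(classifyGatewayLookup(outage), classifyGatewayLookup(absent));
```

With the boolean version both sample results are indistinguishable, which is the misleading "missing credential" message the review flagged; the tri-state version lets the caller point the user at the gateway status instead.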

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/lib/actions/sandbox/rebuild.ts (1)

303-429: Run the targeted rebuild/channel lifecycle E2E for regression confidence.

Given this file’s rebuild preflight changes, run channels-stop-start-e2e to confirm stop/start state remains stable across destroy/recreate with cached credentials.

As per coding guidelines: "src/lib/actions/sandbox/rebuild.ts ... E2E test recommendation: channels-stop-start-e2e."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/sandbox/rebuild.ts` around lines 303 - 429, Run the E2E
regression "channels-stop-start-e2e" to validate the rebuilt preflight logic
around credential handling; specifically exercise rebuild paths that hit
getRebuildCredentialEnvFromRegistry, the session-matching branch
(sessionMatchesTarget/rebuildProvider), the
hermesProviderAuth.HERMES_PROVIDER_NAME branch which calls
preflightHermesProviderCredentials, the hydrateCredentialEnv code path, and the
gateway lookup path via lookupProviderInGateway/runOpenshell to confirm
stop/start stability across destroy/recreate with cached credentials.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8a5936a-cff9-4c53-bd78-b4e6356a1d06

📥 Commits

Reviewing files that changed from the base of the PR and between 8cda768 and 57c111e.

📒 Files selected for processing (5)
  • src/lib/actions/sandbox/rebuild.ts
  • src/lib/onboard.ts
  • src/lib/onboard/inference-provider-upsert.ts
  • src/lib/onboard/providers.ts
  • test/rebuild-credential-preflight.test.ts

Development

Successfully merging this pull request may close these issues.

[macOS][Sandbox] nemohermes rebuild preflight fails with "provider credential not found" despite credential registered in gateway

1 participant