bug(cli): Gateway recreate invalidates CLI TLS trust — sandbox create fails with BadSignature, no auto-recovery #856

@jyaunches

Summary

After a gateway is destroyed and recreated (openshell gateway destroy -g nemoclaw && openshell gateway start --name nemoclaw), the openshell CLI retains a cached TLS certificate from the previous gateway instance. Subsequent openshell sandbox create calls fail with invalid peer certificate: BadSignature because the new gateway generated fresh TLS certs that don't match the cached trust material.

The failure surfaces only after the image has been built and uploaded to the gateway, so roughly 5 minutes of work are wasted before the error appears. That makes the impact worse than a fast failure.

Environment

  • Hardware: NVIDIA DGX Spark (Founders Edition), GB10, 128 GB unified memory
  • Architecture: aarch64
  • OS: Ubuntu 24.04.4 LTS (Noble Numbat)
  • Kernel: 6.8.0-57-generic
  • OpenShell version: 0.0.26
  • Docker: 29.2.1 with NVIDIA Container Toolkit 1.19.0

Reproduction Steps

  1. Create a gateway and sandbox successfully:

    openshell gateway start --name nemoclaw
    openshell sandbox create my-sandbox --image <image>
  2. Destroy the gateway (e.g., to fix containerd issues or reconfigure):

    openshell gateway destroy -g nemoclaw
  3. Recreate the gateway:

    openshell gateway start --name nemoclaw
  4. Attempt to create a new sandbox:

    openshell sandbox create percy --image <image>
  5. Result: Fails with:

    Error: × status: Unavailable, message: "invalid peer certificate: BadSignature",
    │ details: [], metadata: MetadataMap { headers: {} }
    ├─▶ transport error
    ├─▶ invalid peer certificate: BadSignature
    ╰─▶ invalid peer certificate: BadSignature
    

Expected Behavior

One of:

  • openshell gateway start should automatically update the CLI's trusted certificate after generating new certs
  • openshell sandbox create should detect the TLS mismatch and auto-retry after refreshing trust
  • At minimum, the error message should suggest running openshell gateway trust -g nemoclaw as a recovery step
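
The second option could look roughly like the sketch below: a wrapper that retries a command once when its output contains the BadSignature error. This is a minimal sketch, not OpenShell's implementation. The function names run_with_trust_refresh and refresh_trust are hypothetical, and the only openshell invocation used is the gateway trust command quoted in this report.

```shell
#!/usr/bin/env bash
# Sketch only: retry a command once if it fails with the TLS error from this report.
# Function names are hypothetical; the openshell invocation is the one quoted in this issue.

GATEWAY_NAME="${GATEWAY_NAME:-nemoclaw}"

# Refresh the CLI's trust material for the gateway (recovery step named in this issue).
refresh_trust() {
  openshell gateway trust -g "$GATEWAY_NAME"
}

# Run "$@"; if it fails and its output mentions BadSignature, refresh trust and retry once.
# Prints the output of the last attempt and returns its exit status.
run_with_trust_refresh() {
  local output status
  status=0
  output="$("$@" 2>&1)" || status=$?
  if [ "$status" -ne 0 ]; then
    case "$output" in
      *"invalid peer certificate: BadSignature"*)
        echo "TLS trust mismatch detected; refreshing trust for gateway '$GATEWAY_NAME'" >&2
        refresh_trust || return 1
        status=0
        output="$("$@" 2>&1)" || status=$?
        ;;
    esac
  fi
  printf '%s\n' "$output"
  return "$status"
}
```

It would be called as run_with_trust_refresh openshell sandbox create percy --image <image>. Given the behavior described under Workaround, where gateway trust triggered a gateway restart in our testing, a single immediate retry may not be sufficient in practice.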

Workaround

Running openshell gateway trust -g nemoclaw before sandbox create should re-establish trust. In our testing, however, this triggered a gateway restart, which created a chicken-and-egg problem during NemoClaw onboard.

Context

This was encountered during NemoClaw onboard on a DGX Spark. The gateway was recreated because of intermittent containerd layer export failures (see related issue). Each gateway recreate cycle generated new TLS certs, and NemoClaw's onboard flow does not detect or recover from this cert mismatch — it spends ~5 minutes building and uploading the sandbox image before failing at the sandbox create step.
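
One way an onboard flow could avoid losing those ~5 minutes is to run a cheap gateway call before the image build and stop early on a certificate error. The sketch below assumes such a probe exists. This report does not establish which openshell subcommand would be suitable, so the probe command is supplied by the caller, and preflight_tls is a hypothetical helper name.

```shell
# Sketch only: fail fast on a TLS trust mismatch before starting expensive work.
# preflight_tls is a hypothetical helper; the probe command is supplied by the caller.

# Run the probe "$@". Return 2 if its output shows the certificate error, so the caller
# can stop before building and uploading an image; otherwise return the probe's status.
preflight_tls() {
  local output status
  status=0
  output="$("$@" 2>&1)" || status=$?
  case "$output" in
    *"invalid peer certificate"*)
      echo "Gateway TLS trust is stale; refresh trust before building the sandbox image." >&2
      return 2
      ;;
  esac
  return "$status"
}
```

A caller would run preflight_tls with its probe command of choice and start the build only when the return status is 0.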

Related
