
fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3537

Merged
cv merged 1 commit into main from fix/2604-cloudflared-status-diagnostics-signed
May 14, 2026

Conversation

@prekshivyas
Contributor

@prekshivyas prekshivyas commented May 14, 2026

Re-signed replacement for #3534. Same code, single squashed commit, this time SSH-signed (the original branch had commit.gpgsign=false set as a local override on my clone — the prior PR's commits ended up unverified). Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3534 will be closed in favour of this one.

Summary

Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14) re-reported the same symptom on v0.0.38 / v0.0.41 — nemoclaw status prints ● cloudflared (stopped) in all three failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation. The bare command also omits the Connected: / Inference: labeled fields the original bug requested. Both symptoms are addressed here.

Root cause

  • Cloudflared diagnostic. src/lib/actions/sandbox/doctor.ts already distinguished three states and emitted matching hints. showStatus() in src/lib/tunnel/services.ts was written before that and only had isRunning() ? ok : "(stopped)" — every failure mode collapsed into the same un-actionable line.
  • Bare-status fields. PR #2884 (fix(status): require verified gateway before healthy inference) added Inference: / Connected: labels to the per-sandbox nemoclaw <name> status but left bare nemoclaw status showing only the model in parens.

What this PR does

  1. Share the cloudflared state machine. New readCloudflaredState(pidDir) in src/lib/tunnel/services.ts returns a discriminated union { kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }. showStatus() switches on it and emits a coloured marker + one-line remediation. cloudflaredDoctorCheck() consumes the same function and translates each state into its DoctorCheck, removing duplicated PID-file / /proc/<pid>/cmdline logic.

  2. Remediation wording — no cloudflared process; restart with .... All three failure modes lead with the cause and point at nemoclaw tunnel start. Both reporters asked for that exact shape. nemoclaw tunnel start already handles all three states because isRunning() returns false and startService proceeds and overwrites any stale PID file — so one command recovers in every case. The same wording is used in doctor.ts for consistency.

  3. Bare-status Inference: and Connected: lines. showStatusCommand in src/lib/inventory/index.ts now prints labeled Inference: <provider> / <model> and Connected: yes (N session) / no under each sandbox row. Provider/model prefer live gateway values for the default sandbox, consistent with the existing (model) rendering (see #2369). getActiveSessionCount is wired through buildStatusCommandDeps, mirroring the cached SSH-process probe already used by buildListCommandDeps.

  4. Tests (15 new). Three cases for showStatus failure-mode rendering, five cases for readCloudflaredState, and seven for bare-status Inference: / Connected: rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function.
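
As a rough illustration of item 1, the classification could look something like the sketch below. Only the four `kind` values come from this PR. The `pid` fields, the `Probes` interface, and the name `classifyCloudflared` are assumptions made so the example runs without touching the file system; the real `readCloudflaredState(pidDir)` reads the PID file and process command line itself.

```typescript
// Hypothetical shape: the PR names the four `kind` values; the `pid` fields
// are an assumption added for this sketch.
type CloudflaredState =
  | { kind: "running"; pid: number }
  | { kind: "stopped" }
  | { kind: "stale-pid-file" }
  | { kind: "stale-pid-process"; pid: number };

// Hypothetical injected probes, standing in for PID-file and process lookups.
interface Probes {
  // Raw PID-file contents, or null when the file does not exist.
  readPidFile(): string | null;
  // Command line of the process, or null when no such process is alive.
  readCommandLine(pid: number): string | null;
}

function classifyCloudflared(probes: Probes): CloudflaredState {
  const raw = probes.readPidFile();
  if (raw === null) return { kind: "stopped" };

  // Mirrors the Number()/isFinite check quoted later in the review thread.
  const pid = Number(raw.trim());
  if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" };

  const commandLine = probes.readCommandLine(pid);
  const namesCloudflared =
    commandLine !== null &&
    commandLine
      .split(/\0|\s+/)
      .filter(Boolean)
      .some((token) => token.split("/").pop() === "cloudflared");

  return namesCloudflared
    ? { kind: "running", pid }
    : { kind: "stale-pid-process", pid };
}
```

Because both `showStatus()` and the doctor check switch on the same union, neither caller needs its own PID-file parsing.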

Behavioural diff

Before (v0.0.41 baseline):

● cloudflared  (stopped)        # identical output for all three failure modes

After:

● cloudflared  (stopped)
    no cloudflared process; run `nemoclaw tunnel start` to start it

● cloudflared  (stale PID file)
    no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it

● cloudflared  (stale PID 999999999)
    no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it
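
The mapping from failure state to the lines above could be sketched roughly as follows. The wording is copied from this description; the function name `renderCloudflaredFailure` and the `pid` field are assumptions, and the coloured marker is reduced to a plain `●`.

```typescript
// Failure states only; the `pid` field is an assumption for this sketch.
type CloudflaredFailure =
  | { kind: "stopped" }
  | { kind: "stale-pid-file" }
  | { kind: "stale-pid-process"; pid: number };

// Returns the status header plus its one-line remediation (no colour codes).
function renderCloudflaredFailure(state: CloudflaredFailure): string[] {
  switch (state.kind) {
    case "stopped":
      return [
        "● cloudflared  (stopped)",
        "    no cloudflared process; run `nemoclaw tunnel start` to start it",
      ];
    case "stale-pid-file":
      return [
        "● cloudflared  (stale PID file)",
        "    no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it",
      ];
    case "stale-pid-process":
      return [
        "● cloudflared  (stale PID " + state.pid + ")",
        "    no cloudflared process (PID " + state.pid +
          " is dead or not cloudflared); run `nemoclaw tunnel start` to restart it",
      ];
    default: {
      // Compile-time exhaustiveness check: adding a new kind fails to type-check here.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```

The `never` branch is a design choice rather than something the PR states: it keeps the renderer in step with the union if a new state is added.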

Bare status sandbox row:

# Before
test-sandbox * (qwen2.5:7b) :18789

# After
test-sandbox * (qwen2.5:7b) :18789
  Inference: ollama-local / qwen2.5:7b
  Connected: no
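
A minimal sketch of how the two labeled lines could be produced per sandbox row is shown below. The `SandboxRow` shape and the name `statusDetailLines` are hypothetical, the two-space indent follows the sample above, and the plural form `sessions` is an assumption based on the "pluralized session counts" wording in the review walkthrough. A `null` session count stands for "probe unavailable", in which case the Connected line is omitted.

```typescript
// Hypothetical row shape for this sketch; not the PR's actual types.
interface SandboxRow {
  provider?: string;
  model?: string;
  activeSessions: number | null; // null = session probe unavailable
}

function statusDetailLines(row: SandboxRow): string[] {
  const lines: string[] = [];

  // Same join as the code under review: missing parts are dropped.
  const inference = [row.provider, row.model].filter(Boolean).join(" / ");
  if (inference) lines.push(`  Inference: ${inference}`);

  if (row.activeSessions !== null) {
    const n = row.activeSessions;
    lines.push(
      n > 0
        ? `  Connected: yes (${n} session${n === 1 ? "" : "s"})`
        : "  Connected: no",
    );
  }
  return lines;
}
```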

Brev reproduction — 3× per case, baseline and fix

(Run on the prior PR #3534 branch, same code as here.) Fresh n2d-standard-2 (Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked registry, three PID-dir states 3× each. Re-installed from this branch.

Baseline v0.0.41 — 9/9 runs reproduce the bug
### Case: stopped — no PID file
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-file — garbage PID contents
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)

### Case: stale-pid-process — PID is dead
--- Run 1 ---
  ● cloudflared  (stopped)
--- Run 2 ---
  ● cloudflared  (stopped)
--- Run 3 ---
  ● cloudflared  (stopped)
Fix branch — 9/9 runs emit a state-specific remediation

Same three-case 3× harness, with the new wording. Identical output across reruns within each case — no flake.
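
For anyone re-running the harness, the three PID-dir states could be staged along these lines. This is a sketch only: the PID-file name `cloudflared.pid`, the garbage contents, and the helper names are assumptions (the description does not spell them out). The dead PID 999999999 is the one shown in the output above.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type PidDirCase = "stopped" | "stale-pid-file" | "stale-pid-process";

// Assumption: the PID file is called cloudflared.pid inside the PID dir.
const PID_FILE = "cloudflared.pid";

// Creates a fresh temp directory representing one failure mode.
function makePidDir(kind: PidDirCase): string {
  const dir = mkdtempSync(join(tmpdir(), "nemoclaw-pid-"));
  if (kind === "stale-pid-file") {
    writeFileSync(join(dir, PID_FILE), "not-a-pid\n"); // garbage contents
  } else if (kind === "stale-pid-process") {
    writeFileSync(join(dir, PID_FILE), "999999999\n"); // PID assumed not to be alive
  }
  // "stopped": leave the directory empty, i.e. no PID file at all.
  return dir;
}

// Reads the PID file back, or returns null when it is absent.
function readPidFile(dir: string): string | null {
  const file = join(dir, PID_FILE);
  return existsSync(file) ? readFileSync(file, "utf8") : null;
}
```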

Out of scope (filed/to-file as follow-ups)

Test plan

  • npm run build:cli clean
  • npx tsc -p tsconfig.cli.json clean
  • vitest run src/lib/tunnel/services.test.ts — 23 pass
  • vitest run src/lib/inventory/index.test.ts — 31 pass
  • vitest run test/cli.test.ts -t doctor — 4/4 pass (covers refactored cloudflaredDoctorCheck)
  • Brev n2d-standard-2 repro: 9/9 baseline reproduces; 9/9 fix shows the new diagnostic + remediation lines, no flake
  • Commit SSH-signed and verified by GitHub (reason: valid)

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Summary by CodeRabbit

Release Notes

  • New Features

    • Status command now displays inference provider and model per sandbox.
    • Status command includes active SSH session count when available.
  • Bug Fixes

    • Enhanced cloudflared health checks with refined state detection (running, stopped, stale PID).
    • Improved remediation hints for cloudflared diagnostic messages.
  • Tests

    • Added tests for expanded status output with provider, model, and session information.
    • Added tests for cloudflared state detection across various scenarios.


…/Inference fields

`nemoclaw status` previously printed `● cloudflared (stopped)` in three
distinct failure modes (no PID file, garbage PID, dead/wrong-process
PID) with no cause and no remediation — exactly the symptom #2604
reported and re-reported by @wangericnv on 2026-05-11 and @cv on
2026-05-14. The doctor already distinguished the three modes; the
status renderer just never picked up the same logic.

Extract the shared check into a new `readCloudflaredState(pidDir)` in
src/lib/tunnel/services.ts that returns a discriminated union, and
have both `showStatus()` and the doctor's `cloudflaredDoctorCheck`
consume it. Each mode now emits a coloured marker plus a one-line
remediation in the shape both reporters asked for ("no cloudflared
process; run `nemoclaw tunnel start` ..."):

  ● cloudflared  (stopped)
      no cloudflared process; run `nemoclaw tunnel start` to start it

  ● cloudflared  (stale PID file)
      no cloudflared process (stored PID is invalid); run
      `nemoclaw tunnel start` to restart it

  ● cloudflared  (stale PID 999999999)
      no cloudflared process (PID 999999999 is dead or not cloudflared);
      run `nemoclaw tunnel start` to restart it

`nemoclaw tunnel start` already handles all three states — `isRunning`
returns false, `startService` proceeds and overwrites any stale PID
file — so one command recovers in every case. Doctor's hints mirror
the same wording so the two diagnostic paths stay consistent.

Bare `nemoclaw status` also surfaces the configured Inference
(provider / model) and Connected (active-session count) as labeled
fields under each sandbox row, matching what was previously only
available via the per-sandbox `nemoclaw <name> status`.
`getActiveSessionCount` is wired through `buildStatusCommandDeps`,
mirroring the cached SSH-process probe already used by
`buildListCommandDeps`.

Fixes #2604

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

Refactors cloudflared health detection to use a discriminated-union CloudflaredState type distinguishing running, stopped, stale PID file, and stale PID process conditions. Integrates SSH session counting into the status command output to display active connection counts and inference provider/model, and updates doctor and tunnel status checks to report state-specific remediation guidance.

Changes

Status and Health Reporting

| Layer / File(s) | Summary |
|---|---|
| cloudflared state detection type and function (src/lib/tunnel/services.ts) | CloudflaredState discriminated union type and readCloudflaredState(pidDir) function classify cloudflared health via platform-specific process-command inspection into running, stopped, stale-pid-file, and stale-pid-process branches. |
| cloudflared state detection tests (src/lib/tunnel/services.test.ts) | Test suite for readCloudflaredState() covers all outcome kinds and validates showStatus() emits nemoclaw tunnel start remediation for each failure mode. |
| showStatus() refactored to use state-based detection (src/lib/tunnel/services.ts) | showStatus() switches on readCloudflaredState().kind to emit state-specific status text and remediation hints; gates URL/log inspection on state.kind === "running". |
| SSH session counting infrastructure (src/lib/status-command-deps.ts) | buildStatusCommandDeps() adds getActiveSessionCount function that resolves OpenShell, caches SSH process output once per invocation, and parses output to return active session count or null on unavailability. |
| Status command output expansion for Inference and Connected (src/lib/inventory/index.ts, src/lib/inventory/index.test.ts) | ShowStatusCommandDeps gains optional getActiveSessionCount method; showStatusCommand now displays Inference: provider / model lines (preferring live gateway values) and conditional Connected: lines with pluralized session counts. |
| Doctor command refactoring to use state detection (src/lib/actions/sandbox/doctor.ts) | cloudflaredDoctorCheck() calls readCloudflaredState() and switches on state.kind to return appropriate checks; updates hint messages to instruct restart via tunnel start. |

Sequence Diagram(s)

sequenceDiagram
    participant Status as nemoclaw status
    participant SessionCount as getActiveSessionCount
    participant SSH as SSH processes
    participant Gateway as Live gateway
    Status->>Gateway: get live provider/model
    Status->>Status: render Inference line
    Status->>SessionCount: fetch session count per sandbox
    SessionCount->>SSH: read cached SSH process output
    alt output available
        SessionCount->>SessionCount: parse with parseSshProcesses
        SessionCount-->>Status: return count or null
    else unavailable
        SessionCount-->>Status: return null
    end
    alt count is not null
        Status->>Status: log Connected: yes/no with pluralization
    end
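
The "probe once, answer per sandbox" flow in the diagram could be sketched as below. This is not the PR's implementation: `makeSessionCounter` and `countLinesMentioning` are hypothetical names, and `countLinesMentioning` is only a stand-in for the real `parseSshProcesses`, whose signature is not shown in this thread.

```typescript
type SessionCounter = (sandboxName: string) => number | null;

// Hypothetical stand-in parser: counts process lines that mention the sandbox.
function countLinesMentioning(psOutput: string, sandboxName: string): number {
  return psOutput.split("\n").filter((line) => line.includes(sandboxName)).length;
}

// Wraps a process-listing probe so it runs at most once per status invocation.
function makeSessionCounter(
  listSshProcesses: () => string | null, // null = probe unavailable
  countSessions: (psOutput: string, sandboxName: string) => number,
): SessionCounter {
  let cached: string | null | undefined; // undefined = not probed yet
  return (sandboxName) => {
    if (cached === undefined) cached = listSshProcesses();
    // An unavailable probe yields null so the caller can skip the Connected line.
    return cached === null ? null : countSessions(cached, sandboxName);
  };
}
```

Caching the raw output rather than the parsed counts keeps the probe cost constant however many sandbox rows are rendered.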

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3402: Both PRs extend the nemoclaw status command pipeline by modifying ShowStatusCommandDeps and integrating new optional dependencies (getActiveSessionCount vs getGatewayHealth), though targeting different observability outcomes.

Suggested labels

NemoClaw CLI, fix, observability, v0.0.40

Suggested reviewers

  • ericksoa
  • cv
  • jyaunches

Poem

🐰 The cloudflared state unfolds with care—
Running, stopped, or stale PID pair,
Sessions counted, providers shown,
Now status speaks the truth outgrown!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: fixing cloudflared-stopped diagnostics and adding Connected/Inference fields to status output, which aligns with the primary objectives of the changeset. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses issue #2604 by implementing readCloudflaredState for diagnostics, updating showStatus with state-specific remediation text, and adding Connected/Inference fields to bare nemoclaw status output. |
| Out of Scope Changes check | ✅ Passed | All code changes directly support the stated objectives: cloudflared diagnostics improvements, remediation messaging, and status output enhancements; no unrelated changes detected. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@github-actions
Contributor

E2E Advisor Recommendation

Required E2E: deployment-services-e2e, diagnostics-e2e
Optional E2E: sandbox-operations-e2e

Dispatch hint: deployment-services-e2e,diagnostics-e2e

Auto-dispatched E2E: deployment-services-e2e, diagnostics-e2e via nightly-e2e.yaml at 43a8b2c21b495e547f458a50b498afbd58adacb6

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • deployment-services-e2e (high; installs/onboards a sandbox and may install/use cloudflared, timeout 60 minutes): Exercises nemoclaw tunnel start, bare nemoclaw status polling for the cloudflared public URL, and nemoclaw tunnel stop. This is the closest existing E2E coverage for the changed tunnel service status logic and cloudflared lifecycle user flow.
  • diagnostics-e2e (medium-high; installs/onboards a sandbox and runs diagnostic archive/status checks, timeout 45 minutes): Covers diagnostic/status user flows after onboarding, including status output containing model/provider information. The PR changes status/diagnostic presentation and doctor-local-service logic, so this should be merge-blocking confidence for user-facing diagnostics.

Optional E2E

  • sandbox-operations-e2e (high; multi-sandbox lifecycle and recovery flow, timeout 60 minutes): Useful adjacent coverage for sandbox list/connect/status/logs and status-triggered recovery. It may expose regressions from the new OpenShell SSH-session process probing and status dependency wiring, although it does not specifically assert bare nemoclaw status Connected lines.

New E2E recommendations

  • bare-status-connected-inference-output (high): Existing E2E coverage appears to validate per-sandbox status and tunnel URL presence, but not the new bare nemoclaw status Inference: and Connected: lines backed by real SSH process discovery.
    • Suggested test: Extend test/e2e/test-sandbox-operations.sh or add a focused status E2E step that opens an SSH/connect session, runs bare nemoclaw status, and asserts provider/model plus Connected: yes/Connected: no behavior.
  • doctor-cloudflared-stale-pid-remediation (medium): No existing E2E was found that runs nemoclaw <sandbox> doctor or validates stale cloudflared PID-file remediation. The PR changes doctor output and state classification for stopped/stale/running cloudflared.
    • Suggested test: Add deployment-services/diagnostics coverage that creates missing, invalid, and dead cloudflared PID-file states, then asserts nemoclaw tunnel status and nemoclaw <sandbox> doctor --json report the expected stopped/stale state and nemoclaw tunnel start remediation.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: deployment-services-e2e,diagnostics-e2e

@prekshivyas prekshivyas requested a review from cv May 14, 2026 18:49
@prekshivyas prekshivyas self-assigned this May 14, 2026
@prekshivyas prekshivyas added the v0.0.42 Release target label May 14, 2026
@prekshivyas prekshivyas requested a review from jyaunches May 14, 2026 18:49
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inventory/index.ts`:
- Around line 425-427: The Inference log currently builds parts via [provider,
model].filter(Boolean).join(" / "), which collapses to a single ambiguous value
when one side is missing; change the construction so it always emits two parts
separated by " / " (e.g., `${provider ?? ""} / ${model ?? ""}` or equivalent)
and pass that string to log(`      Inference: ${parts}`) so the output
consistently preserves the "provider / model" structure even when one side is
empty.

In `@src/lib/tunnel/services.ts`:
- Around line 127-132: The function commandLineNamesCloudflared currently checks
every token and can misidentify wrapper commands; change it to only inspect the
executable token (argv[0]) by splitting the commandLine on \0 or whitespace,
taking the first non-empty token, normalizing with basename(token) and comparing
to "cloudflared" (also trim surrounding quotes if present) so wrapper
invocations like "sh -c cloudflared ..." no longer match.
- Around line 144-146: The PID parsing currently uses Number(raw) which accepts
non-integer notations (e.g. "1.5", "1e3"); change the logic that derives pid
from raw to first validate raw is a strict positive integer string (e.g.
/^\d+$/) and only then parse it (parseInt or Number) and check pid > 0; if the
raw string fails the integer regex return { kind: "stale-pid-file" } (instead of
letting Number produce a value that later becomes "stale-pid-process"), updating
the code around the pid/ raw checks and replacing the Number.isFinite check
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 52411f6a-099d-43c6-bc05-1ac0159716e7

📥 Commits

Reviewing files that changed from the base of the PR and between e4a2f93 and 43a8b2c.

📒 Files selected for processing (6)
  • src/lib/actions/sandbox/doctor.ts
  • src/lib/inventory/index.test.ts
  • src/lib/inventory/index.ts
  • src/lib/status-command-deps.ts
  • src/lib/tunnel/services.test.ts
  • src/lib/tunnel/services.ts

Comment on lines +425 to +427
      if (provider || model) {
        const parts = [provider, model].filter(Boolean).join(" / ");
        log(`      Inference: ${parts}`);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve provider / model structure in Inference: output.

When only one field exists, the current join emits an ambiguous single value. Keep the fixed two-part format so the output remains consistent and parseable.

Proposed fix
-      if (provider || model) {
-        const parts = [provider, model].filter(Boolean).join(" / ");
-        log(`      Inference: ${parts}`);
-      }
+      if (provider || model) {
+        log(`      Inference: ${provider || "unknown"} / ${model || "unknown"}`);
+      }
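
To make the ambiguity concrete, here is a small side-by-side sketch of the two formatting strategies. The function names are hypothetical; the bodies mirror the removed and added lines of the proposed fix.

```typescript
// Behaviour of the code under review: missing parts are silently dropped.
function joinPresentParts(provider?: string, model?: string): string {
  return [provider, model].filter(Boolean).join(" / ");
}

// Behaviour of the proposed fix: always two parts, with a placeholder.
function fixedTwoPart(provider?: string, model?: string): string {
  return `${provider || "unknown"} / ${model || "unknown"}`;
}

// With only a model known, the first form cannot be told apart from a
// provider-only value, while the second keeps the position of each field.
const ambiguous = joinPresentParts(undefined, "qwen2.5:7b"); // "qwen2.5:7b"
const explicit = fixedTwoPart(undefined, "qwen2.5:7b"); // "unknown / qwen2.5:7b"
```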

Comment on lines +127 to +132
function commandLineNamesCloudflared(commandLine: string): boolean {
  return commandLine
    .split(/\0|\s+/)
    .filter(Boolean)
    .some((token) => basename(token) === "cloudflared");
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

commandLineNamesCloudflared can misclassify wrapper processes.

Matching any token means a non-cloudflared process (for example sh -c cloudflared ...) can be treated as cloudflared. Check only the executable token (argv[0] / comm) instead.

Proposed fix
 function commandLineNamesCloudflared(commandLine: string): boolean {
-  return commandLine
-    .split(/\0|\s+/)
-    .filter(Boolean)
-    .some((token) => basename(token) === "cloudflared");
+  const [argv0] = commandLine.split(/\0|\s+/).filter(Boolean);
+  return argv0 !== undefined && basename(argv0) === "cloudflared";
 }
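
The difference between the two matching strategies can be checked with a short sketch. `baseName` is a local stand-in for `basename` from `node:path` so the example needs no imports, and the function names and sample command lines are made up for illustration.

```typescript
// Local stand-in for node:path's basename (POSIX-style paths only).
const baseName = (p: string): string => p.split("/").pop() ?? p;

// Strategy under review: any token named cloudflared counts as a match.
function anyTokenNamesCloudflared(commandLine: string): boolean {
  return commandLine
    .split(/\0|\s+/)
    .filter(Boolean)
    .some((token) => baseName(token) === "cloudflared");
}

// Proposed strategy: only the executable token (argv[0]) is inspected.
function argv0NamesCloudflared(commandLine: string): boolean {
  const [argv0] = commandLine.split(/\0|\s+/).filter(Boolean);
  return argv0 !== undefined && baseName(argv0) === "cloudflared";
}

// A /proc/<pid>/cmdline-style string: arguments separated by NUL bytes.
const wrapper = "sh\0-c\0cloudflared tunnel run";
const direct = "/usr/local/bin/cloudflared\0tunnel\0run";
```

With these inputs the any-token check accepts both command lines, while the argv[0] check accepts only `direct`.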

Comment on lines +144 to +146
const pid = Number(raw);
if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" };
try {

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat non-integer PID text as stale-pid-file.

Number(raw) currently accepts values like 1.5 and 1e3, which are invalid PID-file formats but get classified as stale-pid-process. Tighten parsing to a strict positive integer string first.

Proposed fix
-  const pid = Number(raw);
-  if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" };
+  if (!/^[1-9]\d*$/.test(raw)) return { kind: "stale-pid-file" };
+  const pid = Number(raw);
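
A quick sketch of why the two checks differ is given below. The function names are hypothetical, and the sketch assumes `raw` has already been trimmed, since the strict pattern rejects a trailing newline.

```typescript
// Check under review: Number() accepts decimal and exponent notation.
function parsePidLoose(raw: string): number | null {
  const pid = Number(raw);
  return Number.isFinite(pid) && pid > 0 ? pid : null;
}

// Proposed check: only a positive decimal integer without leading zeros passes.
function parsePidStrict(raw: string): number | null {
  return /^[1-9]\d*$/.test(raw) ? Number(raw) : null;
}

// "1e3" and "1.5" pass the loose check (as 1000 and 1.5), so they would be
// probed as PIDs and reported as stale-pid-process; the strict check rejects
// them up front, which maps to stale-pid-file.
const looseExponent = parsePidLoose("1e3");
const strictExponent = parsePidStrict("1e3");
```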

@prekshivyas
Contributor Author

Cross-link / not-duplicate-but-synergistic note: #3494 reports the same cloudflared (stopped) line surfacing in nightly E2E when cloudflared exits mid-flight (different trigger than wangericnv's manual-kill repro, same un-actionable status line). #3517 closes that issue at the test-infrastructure layer by parsing /tmp/nemoclaw-services-<sandbox>/cloudflared.log to classify [NemoClaw fault] vs [Cloudflare fault].

The two PRs touch disjoint files (this one: src/lib/tunnel/services.ts + src/lib/inventory/index.ts; #3517: test/e2e/test-tunnel-lifecycle.sh + nightly-e2e.yaml) and either can land first. But once both are merged, the failure shape in #3494 would render as

● cloudflared  (stale PID <N>)
    no cloudflared process (PID <N> is dead or not cloudflared); run `nemoclaw tunnel start` to restart it

instead of the bare cloudflared (stopped) line #3494's evidence cites — so the E2E test gets a free fault-attribution datapoint straight from nemoclaw status, and #3517's log-parsing classifier can simplify if a reviewer wants to consolidate later.

cc @hunglp6d for the #3517 angle.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25878912695
Target ref: 43a8b2c21b495e547f458a50b498afbd58adacb6
Workflow ref: main
Requested jobs: deployment-services-e2e,diagnostics-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job Result
deployment-services-e2e ✅ success
diagnostics-e2e ✅ success

@cv cv merged commit 08b2ebb into main May 14, 2026
35 of 36 checks passed
@miyoungc miyoungc mentioned this pull request May 14, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 14, 2026
## Summary
Refreshes the NemoClaw documentation for the local `main` changes
included in the 0.0.42 release. The update adds release notes, updates
the affected user-facing setup and troubleshooting pages, bumps docs
metadata to 0.0.42, and regenerates the matching user skills.

## Changes
- #3537 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`: Documented host-level status
fields, cloudflared state-specific recovery hints, and Local Ollama auth
proxy status diagnostics.
- #3454 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`: Documented macOS Docker-driver
onboarding and removed the expectation that standard macOS setup needs a
VM driver helper.
- #3514 -> `docs/inference/use-local-inference.md`: Documented
compatible-endpoint retry behavior for reasoning-only smoke responses.
- #3448 -> `docs/reference/commands.md`,
`docs/manage-sandboxes/messaging-channels.md`: Documented canonical
channel names and policy preset hints after `channels add`.
- #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery
and uninstall wording in the 0.0.42 release notes.
- #3313 -> `docs/get-started/quickstart.md`,
`docs/reference/troubleshooting.md`: Documented stronger dashboard port
detection and rollback when a forward cannot start.
- #3502 -> `docs/about/release-notes.md`: Captured batched onboarding
policy preset application in the 0.0.42 release notes.
- #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level
Colima socket path.
- #3421 -> `docs/about/release-notes.md`: Captured idempotent installer
shim logging in the 0.0.42 release notes.
- Updated `docs/project.json`, `docs/versions1.json`, and regenerated
`.agents/skills/nemoclaw-user-*` outputs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes - v0.0.42

* **Documentation**
  * Enhanced macOS onboarding guidance for Docker gateway setup
  * Improved dashboard port conflict handling with automatic rollback
  * Better local Ollama inference diagnostics and authentication proxy checks
  * Clarified status command output and recovery procedures
  * Refined messaging channel setup documentation

* **Chores**
  * Version bump to 0.0.42

<!-- review_stack_entry_start -->


<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

Labels

v0.0.42 Release target

Development

Successfully merging this pull request may close these issues.

[NemoClaw][DGX Spark][Ubuntu 24.04][CLI] nemoclaw status omits Connected/Inference fields and shows cloudflared stopped with no context

3 participants