fix(status): explain cloudflared-stopped reason and surface Connected/Inference fields#3537
Conversation
…/Inference fields `nemoclaw status` previously printed `● cloudflared (stopped)` in three distinct failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation — exactly the symptom #2604 reported and re-reported by @wangericnv on 2026-05-11 and @cv on 2026-05-14. The doctor already distinguished the three modes; the status renderer just never picked up the same logic. Extract the shared check into a new `readCloudflaredState(pidDir)` in src/lib/tunnel/services.ts that returns a discriminated union, and have both `showStatus()` and the doctor's `cloudflaredDoctorCheck` consume it. Each mode now emits a coloured marker plus a one-line remediation in the shape both reporters asked for ("no cloudflared process; run `nemoclaw tunnel start` ..."): ● cloudflared (stopped) no cloudflared process; run `nemoclaw tunnel start` to start it ● cloudflared (stale PID file) no cloudflared process (stored PID is invalid); run `nemoclaw tunnel start` to restart it ● cloudflared (stale PID 999999999) no cloudflared process (PID 999999999 is dead or not cloudflared); run `nemoclaw tunnel start` to restart it `nemoclaw tunnel start` already handles all three states — `isRunning` returns false, `startService` proceeds and overwrites any stale PID file — so one command recovers in every case. Doctor's hints mirror the same wording so the two diagnostic paths stay consistent. Bare `nemoclaw status` also surfaces the configured Inference (provider / model) and Connected (active-session count) as labeled fields under each sandbox row, matching what was previously only available via the per-sandbox `nemoclaw <name> status`. `getActiveSessionCount` is wired through `buildStatusCommandDeps`, mirroring the cached SSH-process probe already used by `buildListCommandDeps`. Fixes #2604 Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
📝 WalkthroughWalkthroughRefactors cloudflared health detection to use a discriminated-union ChangesStatus and Health Reporting
Sequence Diagram(s)sequenceDiagram
participant Status as nemoclaw status
participant SessionCount as getActiveSessionCount
participant SSH as SSH processes
participant Gateway as Live gateway
Status->>Gateway: get live provider/model
Status->>Status: render Inference line
Status->>SessionCount: fetch session count per sandbox
SessionCount->>SSH: read cached SSH process output
alt output available
SessionCount->>SessionCount: parse with parseSshProcesses
SessionCount-->>Status: return count or null
else unavailable
SessionCount-->>Status: return null
end
alt count is not null
Status->>Status: log Connected: yes/no with pluralization
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inventory/index.ts`:
- Around line 425-427: The Inference log currently builds parts via [provider,
model].filter(Boolean).join(" / "), which collapses to a single ambiguous value
when one side is missing; change the construction so it always emits two parts
separated by " / " (e.g., `${provider ?? ""} / ${model ?? ""}` or equivalent)
and pass that string to log(` Inference: ${parts}`) so the output
consistently preserves the "provider / model" structure even when one side is
empty.
In `@src/lib/tunnel/services.ts`:
- Around line 127-132: The function commandLineNamesCloudflared currently checks
every token and can misidentify wrapper commands; change it to only inspect the
executable token (argv[0]) by splitting the commandLine on \0 or whitespace,
taking the first non-empty token, normalizing with basename(token) and comparing
to "cloudflared" (also trim surrounding quotes if present) so wrapper
invocations like "sh -c cloudflared ..." no longer match.
- Around line 144-146: The PID parsing currently uses Number(raw) which accepts
non-integer notations (e.g. "1.5", "1e3"); change the logic that derives pid
from raw to first validate raw is a strict positive integer string (e.g.
/^\d+$/) and only then parse it (parseInt or Number) and check pid > 0; if the
raw string fails the integer regex return { kind: "stale-pid-file" } (instead of
letting Number produce a value that later becomes "stale-pid-process"), updating
the code around the pid/ raw checks and replacing the Number.isFinite check
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 52411f6a-099d-43c6-bc05-1ac0159716e7
📒 Files selected for processing (6)
src/lib/actions/sandbox/doctor.tssrc/lib/inventory/index.test.tssrc/lib/inventory/index.tssrc/lib/status-command-deps.tssrc/lib/tunnel/services.test.tssrc/lib/tunnel/services.ts
| if (provider || model) { | ||
| const parts = [provider, model].filter(Boolean).join(" / "); | ||
| log(` Inference: ${parts}`); |
There was a problem hiding this comment.
Preserve provider / model structure in Inference: output.
When only one field exists, the current join emits an ambiguous single value. Keep the fixed two-part format so the output remains consistent and parseable.
Proposed fix
- if (provider || model) {
- const parts = [provider, model].filter(Boolean).join(" / ");
- log(` Inference: ${parts}`);
- }
+ if (provider || model) {
+ log(` Inference: ${provider || "unknown"} / ${model || "unknown"}`);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (provider || model) { | |
| const parts = [provider, model].filter(Boolean).join(" / "); | |
| log(` Inference: ${parts}`); | |
| if (provider || model) { | |
| log(` Inference: ${provider || "unknown"} / ${model || "unknown"}`); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/inventory/index.ts` around lines 425 - 427, The Inference log
currently builds parts via [provider, model].filter(Boolean).join(" / "), which
collapses to a single ambiguous value when one side is missing; change the
construction so it always emits two parts separated by " / " (e.g., `${provider
?? ""} / ${model ?? ""}` or equivalent) and pass that string to log(`
Inference: ${parts}`) so the output consistently preserves the "provider /
model" structure even when one side is empty.
| function commandLineNamesCloudflared(commandLine: string): boolean { | ||
| return commandLine | ||
| .split(/\0|\s+/) | ||
| .filter(Boolean) | ||
| .some((token) => basename(token) === "cloudflared"); | ||
| } |
There was a problem hiding this comment.
commandLineNamesCloudflared can misclassify wrapper processes.
Matching any token means a non-cloudflared process (for example sh -c cloudflared ...) can be treated as cloudflared. Check only the executable token (argv[0] / comm) instead.
Proposed fix
function commandLineNamesCloudflared(commandLine: string): boolean {
- return commandLine
- .split(/\0|\s+/)
- .filter(Boolean)
- .some((token) => basename(token) === "cloudflared");
+ const [argv0] = commandLine.split(/\0|\s+/).filter(Boolean);
+ return argv0 !== undefined && basename(argv0) === "cloudflared";
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function commandLineNamesCloudflared(commandLine: string): boolean { | |
| return commandLine | |
| .split(/\0|\s+/) | |
| .filter(Boolean) | |
| .some((token) => basename(token) === "cloudflared"); | |
| } | |
| function commandLineNamesCloudflared(commandLine: string): boolean { | |
| const [argv0] = commandLine.split(/\0|\s+/).filter(Boolean); | |
| return argv0 !== undefined && basename(argv0) === "cloudflared"; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/tunnel/services.ts` around lines 127 - 132, The function
commandLineNamesCloudflared currently checks every token and can misidentify
wrapper commands; change it to only inspect the executable token (argv[0]) by
splitting the commandLine on \0 or whitespace, taking the first non-empty token,
normalizing with basename(token) and comparing to "cloudflared" (also trim
surrounding quotes if present) so wrapper invocations like "sh -c cloudflared
..." no longer match.
| const pid = Number(raw); | ||
| if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" }; | ||
| try { |
There was a problem hiding this comment.
Treat non-integer PID text as stale-pid-file.
Number(raw) currently accepts values like 1.5 and 1e3, which are invalid PID-file formats but get classified as stale-pid-process. Tighten parsing to a strict positive integer string first.
Proposed fix
- const pid = Number(raw);
- if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" };
+ if (!/^[1-9]\d*$/.test(raw)) return { kind: "stale-pid-file" };
+ const pid = Number(raw);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const pid = Number(raw); | |
| if (!Number.isFinite(pid) || pid <= 0) return { kind: "stale-pid-file" }; | |
| try { | |
| if (!/^[1-9]\d*$/.test(raw)) return { kind: "stale-pid-file" }; | |
| const pid = Number(raw); | |
| try { |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/tunnel/services.ts` around lines 144 - 146, The PID parsing currently
uses Number(raw) which accepts non-integer notations (e.g. "1.5", "1e3"); change
the logic that derives pid from raw to first validate raw is a strict positive
integer string (e.g. /^\d+$/) and only then parse it (parseInt or Number) and
check pid > 0; if the raw string fails the integer regex return { kind:
"stale-pid-file" } (instead of letting Number produce a value that later becomes
"stale-pid-process"), updating the code around the pid/ raw checks and replacing
the Number.isFinite check accordingly.
|
Cross-link / not-duplicate-but-synergistic note: #3494 reports the same The two PRs touch disjoint files (this one: instead of the bare |
Selective E2E Results — ✅ All requested jobs passedRun: 25878912695
|
## Summary Refreshes the NemoClaw documentation for the local `main` changes included in the 0.0.42 release. The update adds release notes, updates the affected user-facing setup and troubleshooting pages, bumps docs metadata to 0.0.42, and regenerates the matching user skills. ## Changes - #3537 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: Documented host-level status fields, cloudflared state-specific recovery hints, and Local Ollama auth proxy status diagnostics. - #3454 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`: Documented macOS Docker-driver onboarding and removed the expectation that standard macOS setup needs a VM driver helper. - #3514 -> `docs/inference/use-local-inference.md`: Documented compatible-endpoint retry behavior for reasoning-only smoke responses. - #3448 -> `docs/reference/commands.md`, `docs/manage-sandboxes/messaging-channels.md`: Documented canonical channel names and policy preset hints after `channels add`. - #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery and uninstall wording in the 0.0.42 release notes. - #3313 -> `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documented stronger dashboard port detection and rollback when a forward cannot start. - #3502 -> `docs/about/release-notes.md`: Captured batched onboarding policy preset application in the 0.0.42 release notes. - #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level Colima socket path. - #3421 -> `docs/about/release-notes.md`: Captured idempotent installer shim logging in the 0.0.42 release notes. - Updated `docs/project.json`, `docs/versions1.json`, and regenerated `.agents/skills/nemoclaw-user-*` outputs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - v0.0.42 * **Documentation** * Enhanced macOS onboarding guidance for Docker gateway setup * Improved dashboard port conflict handling with automatic rollback * Better local Ollama inference diagnostics and authentication proxy checks * Clarified status command output and recovery procedures * Refined messaging channel setup documentation * **Chores** * Version bump to 0.0.42 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3540) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Fixes #2604. Both @wangericnv (2026-05-11) and @cv (2026-05-14) re-reported the same symptom on v0.0.38 / v0.0.41 —
nemoclaw statusprints● cloudflared (stopped)in all three failure modes (no PID file, garbage PID, dead/wrong-process PID) with no cause and no remediation. The bare command also omits theConnected:/Inference:labeled fields the original bug requested. Both symptoms are addressed here.Root cause
src/lib/actions/sandbox/doctor.tsalready distinguished three states and emitted matching hints.showStatus()insrc/lib/tunnel/services.tswas written before that and only hadisRunning() ? ok : "(stopped)"— every failure mode collapsed into the same un-actionable line.Inference:/Connected:labels to the per-sandboxnemoclaw <name> statusbut left barenemoclaw statusshowing only the model in parens.What this PR does
Share the cloudflared state machine. New
readCloudflaredState(pidDir)insrc/lib/tunnel/services.tsreturns a discriminated union{ kind: "running" | "stopped" | "stale-pid-file" | "stale-pid-process" }.showStatus()switches on it and emits a coloured marker + one-line remediation.cloudflaredDoctorCheck()consumes the same function and translates each state into itsDoctorCheck, removing duplicated PID-file //proc/<pid>/cmdlinelogic.Remediation wording —
no cloudflared process; restart with .... All three failure modes lead with the cause and point atnemoclaw tunnel start. Both reporters asked for that exact shape.nemoclaw tunnel startalready handles all three states becauseisRunning()returns false andstartServiceproceeds and overwrites any stale PID file — so one command recovers in every case. The same wording is used indoctor.tsfor consistency.Bare-status
Inference:andConnected:lines.showStatusCommandinsrc/lib/inventory/index.tsnow prints labeledInference: <provider> / <model>andConnected: yes (N session) / nounder each sandbox row. Provider/model prefer live gateway values for the default sandbox (consistent with the existing(model)rendering, [NemoClaw][Linux][CLI&UX]nemoclaw list/statusshows stale model afteropenshell inference set— live gateway state is queried but result is discarded #2369).getActiveSessionCountis wired throughbuildStatusCommandDeps, mirroring the cached SSH-process probe already used bybuildListCommandDeps.Tests (15 new). Three cases each for
showStatusfailure-mode rendering, five cases forreadCloudflaredState, seven for bare-statusInference:/Connected:rendering across live/stored/missing dep variants. All existing cli-doctor tests still pass against the refactored shared function.Behavioural diff
Before (v0.0.41 baseline):
After:
Bare status sandbox row:
Brev reproduction — 3× per case, baseline and fix
(Run on the prior PR #3534 branch, same code as here.) Fresh
n2d-standard-2(Ubuntu 22.04 / Linux 6.8 GCP), v0.0.41 from tag, faked registry, three PID-dir states 3× each. Re-installed from this branch.Baseline v0.0.41 — 9/9 runs reproduce the bug
Fix branch — 9/9 runs emit a state-specific remediation
Same three-case 3× harness, with the new wording. Identical output across reruns within each case — no flake.
Out of scope (filed/to-file as follow-ups)
while true; cloudflared || sleep 5shim. Real design work, separate PR. Manualpkill cloudflaredis not preventable from inside the CLI either way; the right answer there is actionable status output, which this PR provides.Test plan
npm run build:clicleannpx tsc -p tsconfig.cli.jsoncleanvitest run src/lib/tunnel/services.test.ts— 23 passvitest run src/lib/inventory/index.test.ts— 31 passvitest run test/cli.test.ts -t doctor— 4/4 pass (covers refactoredcloudflaredDoctorCheck)n2d-standard-2repro: 9/9 baseline reproduces; 9/9 fix shows the new diagnostic + remediation lines, no flakereason: valid)Signed-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests