Skip to content

Add job cancellation support to the debugger command relay#1262

Merged
cjluo-nv merged 3 commits intomainfrom
feat/debug-skill
Apr 15, 2026
Merged

Add job cancellation support to the debugger command relay#1262
cjluo-nv merged 3 commits intomainfrom
feat/debug-skill

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv commented Apr 14, 2026

Summary

  • Add a cancel subcommand to the client that terminates the currently running command on the server
  • Server now runs commands in the background with PID tracking, enabling cancellation mid-execution
  • Client-side timeouts automatically cancel the running command on the server (previously the server process was left running)
  • Hardened against race conditions through 4 rounds of adversarial review (15 fixes total)

Key changes

server.sh:

  • Commands run in background with PID tracked in $RELAY_DIR/running (atomic tmp+mv write)
  • Cancel detection loop checks for $RELAY_DIR/cancel file with cmd_id verification
  • SIGTERM with 5s grace period, then SIGKILL escalation for stuck processes
  • .exit file written before running marker removed (ordering guarantee)
  • set -e-safe: wait uses || exit_code=$? pattern; cleanup trap fully guarded
  • Stale cancel files cleared at command start; mismatched/empty signals rejected
  • Command file read into memory and removed before execution (eliminates TOCTOU with client timeout)

client.sh:

  • New cancel subcommand: writes target cmd_id to cancel file, waits for server acknowledgment (30s timeout)
  • run timeout now sends targeted cancel signal (verifies cmd_id match to avoid killing wrong command)
  • run timeout cleans up orphaned result files
  • status shows currently running command
  • flush rejects if a command is currently running (prevents state corruption)
  • Exit code validated as numeric before use

Protocol additions

.relay/
├── running    # server writes cmd_id:pid while executing (atomic)
├── cancel     # client writes target cmd_id to request cancellation

Test plan

  • Start server in Docker, handshake from host
  • Run a command (client.sh run "sleep 30"), cancel it (client.sh cancel), verify exit code 130
  • Run a command with short timeout (--timeout 5 run "sleep 30"), verify auto-cancel
  • Run a command that exits non-zero, verify server stays alive
  • Run status during execution, verify it shows the running command
  • Attempt flush during execution, verify it is rejected
  • Cancel when nothing is running, verify clean message

🤖 Generated with Claude Code

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (bash scripts for dev tooling, tested manually)
  • Did you update Changelog?: N/A (internal tooling)

Summary by CodeRabbit

  • Documentation

    • Added a debug skill guide and protocol reference with quick CLI examples and a new “Cancelling Commands” section.
  • New Features

    • Client cancel command to stop the currently running remote command.
    • Status now reports the active command (or (idle)).
  • Improvements

    • Safer shutdown/cleanup, stricter startup/validation, streaming logs and deterministic cancel exit (130), and auto-cancel on client-side timeout.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

Adds a debug skill doc and extends the file-relay debugger: client gains a cancel subcommand and improved run/status/flush behavior; server adds strict local modelopt validation, TOCTOU-safe command read, atomic running marker, background execution with log tailing, and deterministic cancellation handling (exit 130); docs updated.

Changes

Cohort / File(s) Summary
Skill doc
​.claude/skills/debug/SKILL.md
New skill documentation describing host↔container file-relay debug workflow, prerequisites, and quick-reference CLI commands.
Client tooling
tools/debugger/client.sh
Added cancel subcommand and helper to wait for cancellation; run prints command, validates numeric exit files, auto-cancels on timeout when appropriate, removes leftover result artifacts; status, flush, and help updated.
Server runtime
tools/debugger/server.sh
Added check_modelopt_local() and stricter install verification; TOCTOU-safe read+remove of command files, atomic running write (cmd_id:pid), clear cancel before exec, execute bash -c in new process group with setsid, tail logs for streaming, poll and handle cancel (TERM→KILL, append [cancelled], set exit 130), deterministic exit handling, and enhanced shutdown cleanup of running process tree and relay markers.
Protocol docs
tools/debugger/CLAUDE.md, tools/debugger/README.md
Documented cancel client command and .relay/running/.relay/cancel semantics, updated execution representation and client-side timeout auto-cancel behavior, and standardized cancel exit code 130.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Host Client
    participant FS as Relay Dir (shared FS)
    participant Server as Container Server
    participant Cmd as Command Process

    Client->>FS: write .relay/cmd/<id>.sh (cmd_id, cmd_content)
    FS-->>Server: server polls .relay for new command
    Server->>FS: atomically read+remove command file
    Server->>FS: write .relay/running (cmd_id:pid)
    Server->>Cmd: start bash -c "cmd_content" in new process group (bg), log -> .relay/result/<id>.log
    Server->>FS: tail logfile to stream output
    Client->>FS: optionally write .relay/cancel with target cmd_id
    Server->>FS: detect .relay/cancel and read target
    alt target matches running cmd_id
        Server->>Cmd: send SIGTERM -> wait -> send SIGKILL if needed
        Server->>FS: append "[cancelled]" to log
        Server->>FS: write .relay/result/<id>.exit (130), remove .relay/running and .relay/cancel
    else target mismatch
        Server->>FS: clear .relay/cancel and continue
    end
    Server->>FS: on normal completion write .relay/result/<id>.exit (exit_code), remove .relay/running
    Client->>FS: read .relay/result/<id>.exit and .log
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main change: adding job cancellation support to the debugger command relay, which is the core feature implemented across all modified files.
Security Anti-Patterns ✅ Passed PR contains only shell scripts and documentation changes with no Python code, dependency modifications, or security anti-patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/debug-skill

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 14, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-15 16:50 UTC

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Line 94: The current echo "[client] Running: $*" logs the entire command
(which may contain secrets); change it to log only the command identifier
(cmd_id) by default and gate full command text behind an explicit debug flag
(e.g., DEBUG or DEBUG_CLIENT). Update the client.sh location where echo
"[client] Running: $*" appears to: log "[client] Running cmd_id=<cmd_id>" by
default, and add a guarded conditional that, when DEBUG (or DEBUG_CLIENT) is set
(true/1), emits the full command string (the previous "$*") to the logs; ensure
any added flag checks are robust for both "1"/"true" values and reference the
existing cmd_id variable name so reviewers can find the change easily.

In `@tools/debugger/server.sh`:
- Around line 144-145: The current code reads the full command into cmd_content
and prints it, which may leak secrets; change the logging so it does not echo
the full payload—stop using cmd_content in the server log and instead log only
the command identifier (cmd_id) or a redacted/hashed form; update the block that
uses cmd_file, cmd_content and the echo "[server] Executing command $cmd_id:
$cmd_content" to either omit cmd_content entirely or replace it with a safe
redaction/hashing of the content before logging while still preserving cmd_id
for traceability.
- Around line 96-99: The path check using actual.startswith(expected) is
boundary-unsafe and should be replaced with a directory-boundary check using a
canonical commonpath approach (e.g., call Python's os.path.commonpath on real
paths) to ensure actual is inside expected; locate the check referencing
actual.startswith(expected) and change it to compute canonical paths (realpath)
and compare os.path.commonpath([expected, actual]) == expected. Also stop
echoing full command contents from cmd_file into logs: find the variables
cmd_file and cmd_content and replace the logging line that prints "[server]
Executing command $cmd_id: $cmd_content" with a safer log that only includes the
cmd_id and a non-sensitive summary (e.g., truncated length or a redacted
placeholder), or validate/sanitize cmd_content before logging so secrets/API
keys are never emitted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b5c9d28-466f-46b9-aa50-b44b975a4a2d

📥 Commits

Reviewing files that changed from the base of the PR and between b6c6ec3 and 08a838c.

📒 Files selected for processing (3)
  • .claude/skills/debug/SKILL.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh

Comment thread tools/debugger/client.sh
Comment thread tools/debugger/server.sh Outdated
Comment thread tools/debugger/server.sh Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.24%. Comparing base (b6c6ec3) to head (03bbf54).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1262      +/-   ##
==========================================
- Coverage   76.90%   76.24%   -0.67%     
==========================================
  Files         350      355       +5     
  Lines       40524    41416     +892     
==========================================
+ Hits        31166    31578     +412     
- Misses       9358     9838     +480     
Flag Coverage Δ
unit 55.71% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv changed the title Improve debugger skills and make it available from the top skills dir Add job cancellation support to the debugger command relay Apr 14, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
tools/debugger/server.sh (1)

156-163: ⚠️ Potential issue | 🟠 Major

Raw command text is still being written to server logs.

Line 162 echoes the full payload, so any inline token/password passed through client.sh run ends up in persisted logs. Please log only cmd_id by default and keep raw payload logging behind an explicit debug flag.

🔧 Proposed change
-        echo "[server] Executing command $cmd_id: $cmd_content"
+        echo "[server] Executing command $cmd_id"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 156 - 163, The server currently logs
the full raw command payload (cmd_content) which can leak secrets; change the
logging in the command execution path to only log the command identifier
(cmd_id) by default and remove or redact cmd_content from the echo. Add
conditional logging behind an explicit debug flag (e.g., CHECK_DEBUG or
DEBUG_VERBOSE) so that raw payloads are only printed when that flag is set (use
a test like if [ -n "$DEBUG_VERBOSE" ] to echo cmd_content), and ensure the
existing echo line referencing cmd_content is replaced accordingly in the block
that reads and removes the command file (cmd_file -> cmd_id/cmd_content).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/client.sh`:
- Around line 119-120: Replace the non-atomic writes to the cancel marker in
tools/debugger/client.sh (the block that echoes "Sending cancel signal..." and
writes "$cmd_id" > "$RELAY_DIR/cancel") with an atomic write: write the payload
to a per-call temp file in $RELAY_DIR (e.g., "$RELAY_DIR/.cancel.$$" or
similar), fsync if available, then atomically mv the temp file to
"$RELAY_DIR/cancel"; apply the same change to the other occurrence around lines
199-200 to mirror how .relay/running and .exit are handled by the server.
- Around line 121-124: The loop that waits for "$RELAY_DIR/running" to disappear
can falsely time out if the cancelled command finishes and the server
immediately starts the next command; update both wait loops (used in the
timeout-cancel path and the cancel subcommand) to break not only when the
running file is missing but also when its recorded cmd_id differs from the
cancelled target_id. Implement a helper like
wait_for_running_id_to_clear(target_id, timeout) that reads
"$RELAY_DIR/running", extracts the current_id (prefix before ":"), returns
success if the file is gone or current_id != target_id, otherwise sleeps
POLL_INTERVAL and retries until timeout; replace the existing seq/sleep loops
with calls to this helper in both places.

In `@tools/debugger/README.md`:
- Around line 89-99: The Protocol docs describe an outdated pre-buffered flow;
update the README protocol to match the current behavior in
tools/debugger/server.sh and tools/debugger/client.sh by documenting that the
server reads the command file contents, immediately removes the
`.relay/cmd/<timestamp>.sh` file, then executes the command via `bash -c
"$cmd_content"` (rather than running `bash <file>` and deleting at end), and
that the client writes the target `cmd_id` into `.relay/cancel` for
cancellation; also adjust the described on-disk transitions for `.relay/cmd`,
`.relay/running`, `.relay/result/*`, and `.relay/cancel` to reflect these exact
steps and exit code 130 behavior.

In `@tools/debugger/server.sh`:
- Around line 99-116: The check_modelopt_local() helper is currently fooled by
the parent's PYTHONPATH; modify the subprocess invocation used inside
check_modelopt_local to clear PYTHONPATH and run Python in isolated mode so
import resolution ignores the working tree. Concretely, when launching the
inline python check (the python -c "... import modelopt ..." block referenced in
check_modelopt_local), set the environment variable PYTHONPATH="" for that
subprocess and invoke python with the -I flag so the check validates the
actually installed package state rather than the source tree at $WORKDIR.

---

Duplicate comments:
In `@tools/debugger/server.sh`:
- Around line 156-163: The server currently logs the full raw command payload
(cmd_content) which can leak secrets; change the logging in the command
execution path to only log the command identifier (cmd_id) by default and remove
or redact cmd_content from the echo. Add conditional logging behind an explicit
debug flag (e.g., CHECK_DEBUG or DEBUG_VERBOSE) so that raw payloads are only
printed when that flag is set (use a test like if [ -n "$DEBUG_VERBOSE" ] to
echo cmd_content), and ensure the existing echo line referencing cmd_content is
replaced accordingly in the block that reads and removes the command file
(cmd_file -> cmd_id/cmd_content).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 15f5922b-82e3-4c8e-87c9-eefdcbf41ddb

📥 Commits

Reviewing files that changed from the base of the PR and between 08a838c and c4e320e.

📒 Files selected for processing (5)
  • .claude/skills/debug/SKILL.md
  • tools/debugger/CLAUDE.md
  • tools/debugger/README.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh
✅ Files skipped from review due to trivial changes (2)
  • tools/debugger/CLAUDE.md
  • .claude/skills/debug/SKILL.md

Comment thread tools/debugger/client.sh Outdated
Comment thread tools/debugger/client.sh Outdated
Comment thread tools/debugger/README.md Outdated
Comment thread tools/debugger/server.sh
Copy link
Copy Markdown
Contributor

@kaix-nv kaix-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work overall — the protocol design is sound, the TOCTOU handling is thoughtful, and the 4 rounds of adversarial review show. The set -e workarounds are all correct, the ordering invariant (exit written before running removed) is well-documented, and the pip install error handling is a genuine improvement over the old silent-continue pattern.

That said, there are a few issues worth fixing before merge. Two are correctness bugs (non-atomic cancel write, shallow process tree kill), and the rest are robustness/UX gaps that would bite users in realistic failure scenarios.

Summary of findings:

Priority Count Key items
Must fix 2 Non-atomic cancel write; shallow process tree kill
Should fix 4 Timeout race destroying valid results; empty running file silent skip; leaked running marker blocks recovery; status TOCTOU crash
Nit / docs 2 README protocol steps wrong; SKILL.md references nonexistent troubleshooting

See inline comments for details.

Comment thread tools/debugger/client.sh Outdated
Comment thread tools/debugger/server.sh Outdated
Comment thread tools/debugger/client.sh
Comment thread tools/debugger/client.sh
Comment thread tools/debugger/client.sh Outdated
Comment thread tools/debugger/client.sh Outdated
Comment thread tools/debugger/README.md Outdated
Comment thread .claude/skills/debug/SKILL.md
@cjluo-nv cjluo-nv enabled auto-merge (squash) April 15, 2026 05:36
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/debugger/server.sh`:
- Around line 66-71: The shutdown race is caused by starting the detached job
with setsid before its PID is durably published, so the trap can remove
"$RELAY_DIR/running" while the detached process group survives; change the
launch flow around the setsid call (the block that uses `setsid` and writes to
`"$RELAY_DIR/running"`) to immediately capture and atomically publish the new
leader PID (or its process group ID) before returning control (e.g., start
setsid in a way that writes the child/leader PID to the running file right away
or use `setsid --fork`/a short-lived wrapper that persists the PID), and update
the shutdown cleanup (the code that reads `running_pid` and sends signals) to
first kill the entire process group (kill -- -PGID) and have a fallback that
kills the leader PID itself and checks for lingering descendants; ensure the
trap never removes the running file until the published PID is confirmed and
killed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 59630b24-5650-4314-9d12-7b71b8eab028

📥 Commits

Reviewing files that changed from the base of the PR and between a5ab32d and 03bbf54.

📒 Files selected for processing (4)
  • .claude/skills/debug/SKILL.md
  • tools/debugger/README.md
  • tools/debugger/client.sh
  • tools/debugger/server.sh
✅ Files skipped from review due to trivial changes (1)
  • .claude/skills/debug/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/debugger/client.sh

Comment thread tools/debugger/server.sh
Comment on lines +66 to +71
# Kill any running command (guard all reads with || true to prevent set -e
# from aborting the trap and leaving stale marker files)
running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
if [[ -n "$running_pid" ]]; then
kill -- -"$running_pid" 2>/dev/null || kill "$running_pid" 2>/dev/null || true
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Shutdown cleanup can still orphan the detached command tree.

Lines 173-178 start the setsid job before its PID is published anywhere durable, and Lines 66-71 only send a single SIGTERM based on .relay/running. If the server gets SIGINT/SIGTERM in that gap—or the command ignores SIGTERM—the trap removes .relay/running and exits while the process group can keep running in the container.

🔧 Suggested fix
+CURRENT_CMD_PID=""
+
+terminate_cmd_group() {
+    local pid="$1"
+    [[ -n "$pid" ]] || return 0
+
+    kill -- -"$pid" 2>/dev/null || kill "$pid" 2>/dev/null || true
+    for _ in $(seq 1 5); do
+        kill -0 "$pid" 2>/dev/null || return 0
+        sleep 1
+    done
+    kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
+}
+
 cleanup() {
     echo "[server] Shutting down..."
-    running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
+    running_pid="${CURRENT_CMD_PID:-}"
+    [[ -n "$running_pid" ]] || running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
     if [[ -n "$running_pid" ]]; then
-        kill -- -"$running_pid" 2>/dev/null || kill "$running_pid" 2>/dev/null || true
+        terminate_cmd_group "$running_pid"
     fi
     # Kill any child processes in our process group
     pkill -P $$ 2>/dev/null || true
@@
         (cd "$WORKDIR" && exec setsid bash -c "$cmd_content") >> "$RESULT_DIR/$cmd_id.log" 2>&1 &
         cmd_pid=$!
+        CURRENT_CMD_PID="$cmd_pid"
@@
         rm -f "$RELAY_DIR/running"
         rm -f "$RELAY_DIR/cancel"
+        CURRENT_CMD_PID=""

Also applies to: 173-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/debugger/server.sh` around lines 66 - 71, The shutdown race is caused
by starting the detached job with setsid before its PID is durably published, so
the trap can remove "$RELAY_DIR/running" while the detached process group
survives; change the launch flow around the setsid call (the block that uses
`setsid` and writes to `"$RELAY_DIR/running"`) to immediately capture and
atomically publish the new leader PID (or its process group ID) before returning
control (e.g., start setsid in a way that writes the child/leader PID to the
running file right away or use `setsid --fork`/a short-lived wrapper that
persists the PID), and update the shutdown cleanup (the code that reads
`running_pid` and sends signals) to first kill the entire process group (kill --
-PGID) and have a fallback that kills the leader PID itself and checks for
lingering descendants; ensure the trap never removes the running file until the
published PID is confirmed and killed.

@cjluo-nv cjluo-nv merged commit 7c85571 into main Apr 15, 2026
37 checks passed
@cjluo-nv cjluo-nv deleted the feat/debug-skill branch April 15, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants