Add job cancellation support to the debugger command relay#1262
Add job cancellation support to the debugger command relay#1262
Conversation
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 WalkthroughWalkthroughAdds a debug skill doc and extends the file-relay debugger: client gains a Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Host Client
participant FS as Relay Dir (shared FS)
participant Server as Container Server
participant Cmd as Command Process
Client->>FS: write .relay/cmd/<id>.sh (cmd_id, cmd_content)
FS-->>Server: server polls .relay for new command
Server->>FS: atomically read+remove command file
Server->>FS: write .relay/running (cmd_id:pid)
Server->>Cmd: start bash -c "cmd_content" in new process group (bg), log -> .relay/result/<id>.log
Server->>FS: tail logfile to stream output
Client->>FS: optionally write .relay/cancel with target cmd_id
Server->>FS: detect .relay/cancel and read target
alt target matches running cmd_id
Server->>Cmd: send SIGTERM -> wait -> send SIGKILL if needed
Server->>FS: append "[cancelled]" to log
Server->>FS: write .relay/result/<id>.exit (130), remove .relay/running and .relay/cancel
else target mismatch
Server->>FS: clear .relay/cancel and continue
end
Server->>FS: on normal completion write .relay/result/<id>.exit (exit_code), remove .relay/running
Client->>FS: read .relay/result/<id>.exit and .log
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/debugger/client.sh`:
- Line 94: The current echo "[client] Running: $*" logs the entire command
(which may contain secrets); change it to log only the command identifier
(cmd_id) by default and gate full command text behind an explicit debug flag
(e.g., DEBUG or DEBUG_CLIENT). Update the client.sh location where echo
"[client] Running: $*" appears to: log "[client] Running cmd_id=<cmd_id>" by
default, and add a guarded conditional that, when DEBUG (or DEBUG_CLIENT) is set
(true/1), emits the full command string (the previous "$*") to the logs; ensure
any added flag checks are robust for both "1"/"true" values and reference the
existing cmd_id variable name so reviewers can find the change easily.
In `@tools/debugger/server.sh`:
- Around line 144-145: The current code reads the full command into cmd_content
and prints it, which may leak secrets; change the logging so it does not echo
the full payload—stop using cmd_content in the server log and instead log only
the command identifier (cmd_id) or a redacted/hashed form; update the block that
uses cmd_file, cmd_content and the echo "[server] Executing command $cmd_id:
$cmd_content" to either omit cmd_content entirely or replace it with a safe
redaction/hashing of the content before logging while still preserving cmd_id
for traceability.
- Around line 96-99: The path check using actual.startswith(expected) is
boundary-unsafe and should be replaced with a directory-boundary check using a
canonical commonpath approach (e.g., call Python's os.path.commonpath on real
paths) to ensure actual is inside expected; locate the check referencing
actual.startswith(expected) and change it to compute canonical paths (realpath)
and compare os.path.commonpath([expected, actual]) == expected. Also stop
echoing full command contents from cmd_file into logs: find the variables
cmd_file and cmd_content and replace the logging line that prints "[server]
Executing command $cmd_id: $cmd_content" with a safer log that only includes the
cmd_id and a non-sensitive summary (e.g., truncated length or a redacted
placeholder), or validate/sanitize cmd_content before logging so secrets/API
keys are never emitted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9b5c9d28-466f-46b9-aa50-b44b975a4a2d
📒 Files selected for processing (3)
.claude/skills/debug/SKILL.mdtools/debugger/client.shtools/debugger/server.sh
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1262 +/- ##
==========================================
- Coverage 76.90% 76.24% -0.67%
==========================================
Files 350 355 +5
Lines 40524 41416 +892
==========================================
+ Hits 31166 31578 +412
- Misses 9358 9838 +480
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
tools/debugger/server.sh (1)
156-163:⚠️ Potential issue | 🟠 MajorRaw command text is still being written to server logs.
Line 162 echoes the full payload, so any inline token/password passed through
client.sh runends up in persisted logs. Please log onlycmd_idby default and keep raw payload logging behind an explicit debug flag.🔧 Proposed change
- echo "[server] Executing command $cmd_id: $cmd_content" + echo "[server] Executing command $cmd_id"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tools/debugger/server.sh` around lines 156 - 163, The server currently logs the full raw command payload (cmd_content) which can leak secrets; change the logging in the command execution path to only log the command identifier (cmd_id) by default and remove or redact cmd_content from the echo. Add conditional logging behind an explicit debug flag (e.g., CHECK_DEBUG or DEBUG_VERBOSE) so that raw payloads are only printed when that flag is set (use a test like if [ -n "$DEBUG_VERBOSE" ] to echo cmd_content), and ensure the existing echo line referencing cmd_content is replaced accordingly in the block that reads and removes the command file (cmd_file -> cmd_id/cmd_content).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/debugger/client.sh`:
- Around line 119-120: Replace the non-atomic writes to the cancel marker in
tools/debugger/client.sh (the block that echoes "Sending cancel signal..." and
writes "$cmd_id" > "$RELAY_DIR/cancel") with an atomic write: write the payload
to a per-call temp file in $RELAY_DIR (e.g., "$RELAY_DIR/.cancel.$$" or
similar), fsync if available, then atomically mv the temp file to
"$RELAY_DIR/cancel"; apply the same change to the other occurrence around lines
199-200 to mirror how .relay/running and .exit are handled by the server.
- Around line 121-124: The loop that waits for "$RELAY_DIR/running" to disappear
can falsely time out if the cancelled command finishes and the server
immediately starts the next command; update both wait loops (used in the
timeout-cancel path and the cancel subcommand) to break not only when the
running file is missing but also when its recorded cmd_id differs from the
cancelled target_id. Implement a helper like
wait_for_running_id_to_clear(target_id, timeout) that reads
"$RELAY_DIR/running", extracts the current_id (prefix before ":"), returns
success if the file is gone or current_id != target_id, otherwise sleeps
POLL_INTERVAL and retries until timeout; replace the existing seq/sleep loops
with calls to this helper in both places.
In `@tools/debugger/README.md`:
- Around line 89-99: The Protocol docs describe an outdated pre-buffered flow;
update the README protocol to match the current behavior in
tools/debugger/server.sh and tools/debugger/client.sh by documenting that the
server reads the command file contents, immediately removes the
`.relay/cmd/<timestamp>.sh` file, then executes the command via `bash -c
"$cmd_content"` (rather than running `bash <file>` and deleting at end), and
that the client writes the target `cmd_id` into `.relay/cancel` for
cancellation; also adjust the described on-disk transitions for `.relay/cmd`,
`.relay/running`, `.relay/result/*`, and `.relay/cancel` to reflect these exact
steps and exit code 130 behavior.
In `@tools/debugger/server.sh`:
- Around line 99-116: The check_modelopt_local() helper is currently fooled by
the parent's PYTHONPATH; modify the subprocess invocation used inside
check_modelopt_local to clear PYTHONPATH and run Python in isolated mode so
import resolution ignores the working tree. Concretely, when launching the
inline python check (the python -c "... import modelopt ..." block referenced in
check_modelopt_local), set the environment variable PYTHONPATH="" for that
subprocess and invoke python with the -I flag so the check validates the
actually installed package state rather than the source tree at $WORKDIR.
---
Duplicate comments:
In `@tools/debugger/server.sh`:
- Around line 156-163: The server currently logs the full raw command payload
(cmd_content) which can leak secrets; change the logging in the command
execution path to only log the command identifier (cmd_id) by default and remove
or redact cmd_content from the echo. Add conditional logging behind an explicit
debug flag (e.g., CHECK_DEBUG or DEBUG_VERBOSE) so that raw payloads are only
printed when that flag is set (use a test like if [ -n "$DEBUG_VERBOSE" ] to
echo cmd_content), and ensure the existing echo line referencing cmd_content is
replaced accordingly in the block that reads and removes the command file
(cmd_file -> cmd_id/cmd_content).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 15f5922b-82e3-4c8e-87c9-eefdcbf41ddb
📒 Files selected for processing (5)
.claude/skills/debug/SKILL.mdtools/debugger/CLAUDE.mdtools/debugger/README.mdtools/debugger/client.shtools/debugger/server.sh
✅ Files skipped from review due to trivial changes (2)
- tools/debugger/CLAUDE.md
- .claude/skills/debug/SKILL.md
kaix-nv
left a comment
There was a problem hiding this comment.
Good work overall — the protocol design is sound, the TOCTOU handling is thoughtful, and the 4 rounds of adversarial review show. The set -e workarounds are all correct, the ordering invariant (exit written before running removed) is well-documented, and the pip install error handling is a genuine improvement over the old silent-continue pattern.
That said, there are a few issues worth fixing before merge. Two are correctness bugs (non-atomic cancel write, shallow process tree kill), and the rest are robustness/UX gaps that would bite users in realistic failure scenarios.
Summary of findings:
| Priority | Count | Key items |
|---|---|---|
| Must fix | 2 | Non-atomic cancel write; shallow process tree kill |
| Should fix | 4 | Timeout race destroying valid results; empty running file silent skip; leaked running marker blocks recovery; status TOCTOU crash |
| Nit / docs | 2 | README protocol steps wrong; SKILL.md references nonexistent troubleshooting |
See inline comments for details.
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
a5ab32d to
03bbf54
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tools/debugger/server.sh`:
- Around line 66-71: The shutdown race is caused by starting the detached job
with setsid before its PID is durably published, so the trap can remove
"$RELAY_DIR/running" while the detached process group survives; change the
launch flow around the setsid call (the block that uses `setsid` and writes to
`"$RELAY_DIR/running"`) to immediately capture and atomically publish the new
leader PID (or its process group ID) before returning control (e.g., start
setsid in a way that writes the child/leader PID to the running file right away
or use `setsid --fork`/a short-lived wrapper that persists the PID), and update
the shutdown cleanup (the code that reads `running_pid` and sends signals) to
first kill the entire process group (kill -- -PGID) and have a fallback that
kills the leader PID itself and checks for lingering descendants; ensure the
trap never removes the running file until the published PID is confirmed and
killed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 59630b24-5650-4314-9d12-7b71b8eab028
📒 Files selected for processing (4)
.claude/skills/debug/SKILL.mdtools/debugger/README.mdtools/debugger/client.shtools/debugger/server.sh
✅ Files skipped from review due to trivial changes (1)
- .claude/skills/debug/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/debugger/client.sh
| # Kill any running command (guard all reads with || true to prevent set -e | ||
| # from aborting the trap and leaving stale marker files) | ||
| running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true | ||
| if [[ -n "$running_pid" ]]; then | ||
| kill -- -"$running_pid" 2>/dev/null || kill "$running_pid" 2>/dev/null || true | ||
| fi |
There was a problem hiding this comment.
Shutdown cleanup can still orphan the detached command tree.
Lines 173-178 start the setsid job before its PID is published anywhere durable, and Lines 66-71 only send a single SIGTERM based on .relay/running. If the server gets SIGINT/SIGTERM in that gap—or the command ignores SIGTERM—the trap removes .relay/running and exits while the process group can keep running in the container.
🔧 Suggested fix
+CURRENT_CMD_PID=""
+
+terminate_cmd_group() {
+ local pid="$1"
+ [[ -n "$pid" ]] || return 0
+
+ kill -- -"$pid" 2>/dev/null || kill "$pid" 2>/dev/null || true
+ for _ in $(seq 1 5); do
+ kill -0 "$pid" 2>/dev/null || return 0
+ sleep 1
+ done
+ kill -9 -- -"$pid" 2>/dev/null || kill -9 "$pid" 2>/dev/null || true
+}
+
cleanup() {
echo "[server] Shutting down..."
- running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
+ running_pid="${CURRENT_CMD_PID:-}"
+ [[ -n "$running_pid" ]] || running_pid=$(cut -d: -f2 "$RELAY_DIR/running" 2>/dev/null) || true
if [[ -n "$running_pid" ]]; then
- kill -- -"$running_pid" 2>/dev/null || kill "$running_pid" 2>/dev/null || true
+ terminate_cmd_group "$running_pid"
fi
# Kill any child processes in our process group
pkill -P $$ 2>/dev/null || true
@@
(cd "$WORKDIR" && exec setsid bash -c "$cmd_content") >> "$RESULT_DIR/$cmd_id.log" 2>&1 &
cmd_pid=$!
+ CURRENT_CMD_PID="$cmd_pid"
@@
rm -f "$RELAY_DIR/running"
rm -f "$RELAY_DIR/cancel"
+ CURRENT_CMD_PID=""Also applies to: 173-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tools/debugger/server.sh` around lines 66 - 71, The shutdown race is caused
by starting the detached job with setsid before its PID is durably published, so
the trap can remove "$RELAY_DIR/running" while the detached process group
survives; change the launch flow around the setsid call (the block that uses
`setsid` and writes to `"$RELAY_DIR/running"`) to immediately capture and
atomically publish the new leader PID (or its process group ID) before returning
control (e.g., start setsid in a way that writes the child/leader PID to the
running file right away or use `setsid --fork`/a short-lived wrapper that
persists the PID), and update the shutdown cleanup (the code that reads
`running_pid` and sends signals) to first kill the entire process group (kill --
-PGID) and have a fallback that kills the leader PID itself and checks for
lingering descendants; ensure the trap never removes the running file until the
published PID is confirmed and killed.
Summary
cancelsubcommand to the client that terminates the currently running command on the serverKey changes
server.sh:
$RELAY_DIR/running(atomic tmp+mv write)$RELAY_DIR/cancelfile with cmd_id verification.exitfile written beforerunningmarker removed (ordering guarantee)set -e-safe:waituses|| exit_code=$?pattern; cleanup trap fully guardedclient.sh:
cancelsubcommand: writes target cmd_id to cancel file, waits for server acknowledgment (30s timeout)runtimeout now sends targeted cancel signal (verifies cmd_id match to avoid killing wrong command)runtimeout cleans up orphaned result filesstatusshows currently running commandflushrejects if a command is currently running (prevents state corruption)Protocol additions
Test plan
client.sh run "sleep 30"), cancel it (client.sh cancel), verify exit code 130--timeout 5 run "sleep 30"), verify auto-cancelstatusduring execution, verify it shows the running commandflushduring execution, verify it is rejected🤖 Generated with Claude Code
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/ASummary by CodeRabbit
Documentation
New Features
cancelcommand to stop the currently running remote command.(idle)).Improvements