fix: stacktrace-explorer sidecar hangs after main container exits#1134
Merged
jamOne- merged 5 commits intoAI-Hypercomputer:mainfrom Mar 19, 2026
Merged
fix: stacktrace-explorer sidecar hangs after main container exits#1134jamOne- merged 5 commits intoAI-Hypercomputer:mainfrom
jamOne- merged 5 commits intoAI-Hypercomputer:mainfrom
Conversation
jamOne-
requested changes
Mar 18, 2026
Collaborator
jamOne-
left a comment
There was a problem hiding this comment.
Review Report
- Verdict: Request Changes
- Action Items:
- Fix Critical Bug (Missing Stacktraces on Fast Exit): The logic
[ -f /shared-volume/ stacktrace_signal ] && exit 0exits the sidecar immediately if the signal file is created before the script reaches thetailcommand (e.g. if the main job finishes very quickly after
generating traces or if the sidecar was sleeping). Replace this with a check that gracefully
cats the files before exiting if the signal is present. - Fix Critical Bug (Premature
tailKill): Oncetailstarts, if the signal arrives,
the script kills thetailprocess instantly (kill $TAIL_PID). This drops any un-flushed
output. Addsleep 2;beforekill $TAIL_PID 2>/dev/null;to lettailfinish printing existing logs. - Suggestion (Code Formatting): Rather than writing the entire script as a single-line
string with semi-colons, consider using a YAML multiline block (|) for readability. This m
akes the bash logic much easier to review and maintain.
- Fix Critical Bug (Missing Stacktraces on Fast Exit): The logic
Recommended Shell Script Structure
Here is the logic implementing the two fixes, formulated as a YAML multiline string for readability (recommended):
args:
- /bin/sh
- -c
- |
while [ ! -d /tmp/debugging ] && [ ! -f /shared-volume/stacktrace_signal ];
do sleep 5; done
while ! ls /tmp/debugging/* >/dev/null 2>&1 && [ ! -f /shared-volume/stackt
race_signal ]; do sleep 5; done
if ls /tmp/debugging/* >/dev/null 2>&1; then
if [ -f /shared-volume/stacktrace_signal ]; then
cat /tmp/debugging/*
else
tail -n+1 -f /tmp/debugging/* & TAIL_PID=$!
while [ ! -f /shared-volume/stacktrace_signal ]; do sleep 1; done
sleep 2
kill $TAIL_PID 2>/dev/null
fi
fi
exit 0(Note: As python strings do not treat $ specially, you can use $! directly in your Python code string, avoiding any curly braces escape issues)
- Overall Impression: This PR effectively fixes a severe hanging issue in the previous s$
decar script by correctly switching frompidofto$!and making directory checks robust.
However, it introduces a regression where stacktraces can be dropped entirely if the main con
tainer exits too fast. Fixing the order of operations as recommended will ensure no stacktrac
es are dropped.
jamOne-
approved these changes
Mar 19, 2026
Collaborator
There was a problem hiding this comment.
Review Report
- Verdict: Approve
- Action Items: two suggestions that I removed
- Overall Impression: This is an excellent, highly robust PR. It effectively eliminates the racy, brittle behavior of the sidecar script. By switching to simple file-polling and direct pid management (
$!), it fixes thepidofrace conditions, sidesteps the broken$$expansion inbusybox:1.28subshells, and prevents infinite hangs caused by syntax errors with the previous[ ! -e /tmp/debugging/* ]busybox evaluation. The Bash script is logically sound, completely POSIX compliant forash, and the YAML block scalar is properly formatted. Great work!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fix stuck
stacktrace-explorersidecar in--deploy-stacktrace-sidecarfor TPU workloads.Problem
I haven't investigated this issue very deeply. But as I understand, the sidecar uses busybox 1.28 ash which has broken
$$expansion in subshells andpidofraces. When the main container finishes beforetailstarts, the sidecar hangs forever.Fix
Replaced the script with simple polling - each loop checks the signal file directly, no
$$/trap/pidof/subshells needed. Only$!(last background PID) is used to killtailon exit.Testing