fix: stacktrace-explorer sidecar hangs after main container exits #1134

Merged
jamOne- merged 5 commits into AI-Hypercomputer:main from kryvokhyzha:fix/sidecar-stuck
Mar 19, 2026

Conversation

@kryvokhyzha
Contributor

Description

Fix stuck stacktrace-explorer sidecar in --deploy-stacktrace-sidecar for TPU workloads.

Problem

I haven't investigated this issue very deeply, but as I understand it, the sidecar uses busybox 1.28 ash, which has broken $$ expansion in subshells and pidof races. When the main container finishes before tail starts, the sidecar hangs forever.

Fix

Replaced the script with simple polling - each loop checks the signal file directly, no $$/trap/pidof/subshells needed. Only $! (last background PID) is used to kill tail on exit.
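
The polling idea described above can be sketched roughly as follows. This is a minimal illustration, not the script from the PR: the function name `run_until_signal` and its arguments are hypothetical, and it assumes a POSIX-style shell. It starts a command in the background, records its PID via `$!`, polls for a signal file, and then stops the background command.

```shell
# Hypothetical helper (not from the PR): run_until_signal SIGNAL_FILE COMMAND [ARGS...]
# Runs COMMAND in the background and stops it once SIGNAL_FILE exists.
run_until_signal() {
  signal_file=$1
  shift

  # Start the long-running command and remember its PID through $!.
  "$@" &
  bg_pid=$!

  # Poll the signal file directly; no $$, trap, pidof, or subshell is involved.
  while [ ! -f "$signal_file" ]; do
    sleep 1
  done

  # Stop the background command and reap it so no zombie is left behind.
  kill "$bg_pid" 2>/dev/null || true
  wait "$bg_pid" 2>/dev/null || true
  return 0
}
```

A call such as `run_until_signal /shared-volume/stacktrace_signal tail -n +1 -f /tmp/debugging/*` would mirror the sidecar's use case, assuming the trace files already exist when it starts.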

Testing

  • Verified sidecar exits cleanly when main container completes

Collaborator

@jamOne- jamOne- left a comment

Review Report

  • Verdict: Request Changes
  • Action Items:
    1. Fix Critical Bug (Missing Stacktraces on Fast Exit): The logic [ -f /shared-volume/stacktrace_signal ] && exit 0 exits the sidecar immediately if the signal file is created before the script reaches the tail command (e.g. if the main job finishes very quickly after generating traces or if the sidecar was sleeping). Replace this with a check that gracefully cats the files before exiting if the signal is present.
    2. Fix Critical Bug (Premature tail Kill): Once tail starts, if the signal arrives, the script kills the tail process instantly (kill $TAIL_PID). This drops any un-flushed output. Add sleep 2; before kill $TAIL_PID 2>/dev/null; to let tail finish printing existing logs.
    3. Suggestion (Code Formatting): Rather than writing the entire script as a single-line string with semicolons, consider using a YAML multiline block (|) for readability. This makes the bash logic much easier to review and maintain.

Recommended Shell Script Structure

Here is the logic implementing the two fixes, formulated as a YAML multiline string for readability (recommended):

                args:
                - /bin/sh
                - -c
                - |
                  # Wait for the debug directory, then for trace files; give up once the signal file appears.
                  while [ ! -d /tmp/debugging ] && [ ! -f /shared-volume/stacktrace_signal ]; do sleep 5; done
                  while ! ls /tmp/debugging/* >/dev/null 2>&1 && [ ! -f /shared-volume/stacktrace_signal ]; do sleep 5; done

                  if ls /tmp/debugging/* >/dev/null 2>&1; then
                    if [ -f /shared-volume/stacktrace_signal ]; then
                      # Main container already finished: print the traces once.
                      cat /tmp/debugging/*
                    else
                      tail -n+1 -f /tmp/debugging/* & TAIL_PID=$!
                      while [ ! -f /shared-volume/stacktrace_signal ]; do sleep 1; done
                      # Give tail time to print existing output before killing it.
                      sleep 2
                      kill $TAIL_PID 2>/dev/null
                    fi
                  fi
                  exit 0

(Note: As Python strings do not treat $ specially, you can use $! directly in your Python code string, avoiding any curly-brace escaping issues)
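
To exercise the fast-exit path locally, the structure above could be wrapped in a function with the paths passed as arguments. This is only a sketch under assumptions: the name `collect_traces` and its two parameters are hypothetical and do not appear in the PR, and the polling interval is shortened to one second for convenience.

```shell
# Hypothetical wrapper (not from the PR): collect_traces DEBUG_DIR SIGNAL_FILE
collect_traces() {
  debug_dir=$1
  signal_file=$2

  # Wait for trace files, giving up as soon as the signal file appears.
  while ! ls "$debug_dir"/* >/dev/null 2>&1 && [ ! -f "$signal_file" ]; do
    sleep 1
  done

  if ls "$debug_dir"/* >/dev/null 2>&1; then
    if [ -f "$signal_file" ]; then
      # Fast exit: the main job already finished, so print everything once.
      cat "$debug_dir"/*
    else
      # Normal path: stream the files until the signal file appears.
      tail -n +1 -f "$debug_dir"/* &
      tail_pid=$!
      while [ ! -f "$signal_file" ]; do
        sleep 1
      done
      sleep 2  # grace period so tail can print what is already on disk
      kill "$tail_pid" 2>/dev/null || true
      wait "$tail_pid" 2>/dev/null || true
    fi
  fi
  return 0
}
```

Creating a trace file and the signal file in a temporary directory before calling `collect_traces` should print the trace contents once and return, which is the case the first action item is concerned with.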

  • Overall Impression: This PR effectively fixes a severe hanging issue in the previous sidecar script by correctly switching from pidof to $! and making directory checks robust. However, it introduces a regression where stacktraces can be dropped entirely if the main container exits too fast. Fixing the order of operations as recommended will ensure no stacktraces are dropped.

@kryvokhyzha kryvokhyzha requested a review from jamOne- March 19, 2026 10:44
Collaborator

@jamOne- jamOne- left a comment

Review Report

  • Verdict: Approve
  • Action Items: two suggestions that I removed
  • Overall Impression: This is an excellent, highly robust PR. It effectively eliminates the racy, brittle behavior of the sidecar script. By switching to simple file-polling and direct pid management ($!), it fixes the pidof race conditions, sidesteps the broken $$ expansion in busybox:1.28 subshells, and prevents infinite hangs caused by syntax errors with the previous [ ! -e /tmp/debugging/* ] busybox evaluation. The Bash script is logically sound, completely POSIX compliant for ash, and the YAML block scalar is properly formatted. Great work!
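
As background for the [ ! -e /tmp/debugging/* ] remark: when a glob matches more than one file, the shell expands it into several words before `[` runs, so the test receives extra operands and reports an error rather than a plain true or false. How exactly that led to a hang in the old script is not shown in this thread. A small sketch of the `ls`-based check used instead, with a hypothetical helper name:

```shell
# Hypothetical helper (not from the PR): succeeds when at least one entry
# matches DIR/*, regardless of how many files the glob expands to.
has_trace_files() {
  # With two or more matches, [ ! -e "$1"/* ] would receive several operands
  # and fail with an error; ls accepts any number of arguments.
  ls "$1"/* >/dev/null 2>&1
}
```

In a polling loop this could be used as `while ! has_trace_files /tmp/debugging; do sleep 5; done`, which behaves the same for an empty directory, a missing directory, one file, or many files.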

@jamOne- jamOne- added this pull request to the merge queue Mar 19, 2026
Merged via the queue into AI-Hypercomputer:main with commit 4a9d683 Mar 19, 2026
12 checks passed