-
Notifications
You must be signed in to change notification settings - Fork 132
Fix self-hosted CI robustness: build cache, SLURM QOS, and submit resilience #1295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
9163a11
ffe80ec
cfbc023
5742030
7edb7c3
773f5ad
1311cbe
644c9e4
a02f4b2
ba91673
438627e
3e773ff
fae2e6a
3224931
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| #!/bin/bash | ||
| # Run monitor_slurm_job.sh and recover if the monitor is killed (e.g. SIGKILL | ||
| # from the runner OS) before the SLURM job completes. When the monitor exits | ||
| # non-zero, sacct is used to verify the job's actual final state; if the SLURM | ||
| # job succeeded we exit 0 so the CI step is not falsely marked as failed. | ||
| # | ||
| # Usage: run_monitored_slurm_job.sh <job_id> <output_file> | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| if [ $# -ne 2 ]; then | ||
| echo "Usage: $0 <job_id> <output_file>" | ||
| exit 1 | ||
| fi | ||
|
|
||
| job_id="$1" | ||
| output_file="$2" | ||
|
|
||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
|
|
||
| monitor_exit=0 | ||
| bash "$SCRIPT_DIR/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$? | ||
|
|
||
| if [ "$monitor_exit" -ne 0 ]; then | ||
| echo "Monitor exited with code $monitor_exit; re-checking SLURM job $job_id final state..." | ||
| # Give the SLURM epilog time to finalize if the job just finished | ||
| sleep 30 | ||
| final_state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 | tr -d ' ' || echo "UNKNOWN") | ||
| final_exit=$(sacct -j "$job_id" -X --format=ExitCode --noheader --parsable2 2>/dev/null | head -n1 | tr -d ' ' || echo "") | ||
| echo "Final SLURM state=$final_state exit=$final_exit" | ||
| if [ "$final_state" = "COMPLETED" ] && [ "$final_exit" = "0:0" ]; then | ||
| echo "SLURM job $job_id completed successfully despite monitor failure — continuing." | ||
| else | ||
| echo "ERROR: SLURM job $job_id did not complete successfully (state=$final_state exit=$final_exit)" | ||
| exit 1 | ||
| fi | ||
sbryngelson marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| fi | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -88,7 +88,7 @@ jobs: | |
| runs-on: | ||
| group: ${{ matrix.group }} | ||
| labels: ${{ matrix.labels }} | ||
| timeout-minutes: 480 | ||
| timeout-minutes: 240 | ||
sbryngelson marked this conversation as resolved.
Show resolved
Hide resolved
sbryngelson marked this conversation as resolved.
Show resolved
Hide resolved
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Timeout value contradicts PR objectives. The PR description states "Restores , |
||
| steps: | ||
| - name: Clone - PR | ||
| uses: actions/checkout@v4 | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
Repository: MFlowCode/MFC
Length of output: 227
🏁 Script executed:
cat -n .github/scripts/monitor_slurm_job.sh | head -30Repository: MFlowCode/MFC
Length of output: 1111
🏁 Script executed:
Repository: MFlowCode/MFC
Length of output: 1821
🏁 Script executed:
Repository: MFlowCode/MFC
Length of output: 2652
🏁 Script executed:
Repository: MFlowCode/MFC
Length of output: 3546
🏁 Script executed:
cat -n .github/scripts/monitor_slurm_job.sh | tail -40Repository: MFlowCode/MFC
Length of output: 1502
This wrapper's recovery mechanism is defeated by the child monitor's auto-cancel.
monitor_slurm_job.shsetstrap cleanup EXITat line 19, which callsscancelif the monitor exits abnormally (line 16). When the monitor is killed before completion (e.g., SIGKILL) or exits at line 98 (terminal state reached without output file), this cleanup trap cancels the SLURM job. The wrapper's recovery logic atrun_monitored_slurm_job.sh:24-36then attempts to query the final job state viasacct, but the job has already been canceled by the child, making the recovery check unreliable.The defensive error handling in
get_job_state()with|| truemeans transient squeue/sacct failures won't trigger cancellation, but the real problem is the auto-cancel policy itself undermines the wrapper's fallback verification.Recommended fix: Introduce a
MONITOR_CANCEL_ON_ERRORenvironment variable that the wrapper can set to disable auto-cancel in the child, allowing the parent to decide job fate based on final SLURM state.