Consolidate CI infrastructure and add NFS-resilient build cache #1285

Merged
sbryngelson merged 10 commits into MFlowCode:master from sbryngelson:pause-coverage
Mar 3, 2026

@sbryngelson sbryngelson (Member) commented Mar 2, 2026

Summary

  • Add a new case-optimization CI job that builds and runs all 5 benchmark cases with --case-optimization on Phoenix (acc/omp), Frontier (acc/omp), and Frontier AMD (omp), validating output contains no NaN/Inf
  • Add check_case_optimization_output.py validator and --steps CLI override to benchmark cases
  • Replace 4 duplicated frontier_amd/ scripts with symlinks to frontier/ (cluster auto-detected from directory name via BASH_SOURCE)
  • Extract 3 shared helpers into .github/scripts/: gpu-opts.sh, detect-gpus.sh, retry-build.sh
  • Refactor 6 CI scripts to source the helpers, removing duplicated GPU opts blocks, GPU detection, and retry loops
  • Add NFS-resilient build cache for Phoenix self-hosted runners: pre-flight health check detects stale NFS handles before builds start, and retry-build.sh escalates to mv-based cache nuke when rm -rf fails during retry cleanup
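The symlink approach works only if the script derives the cluster name from the path it was invoked through, not from the resolved target. The following is a minimal Python sketch of that idea, standing in for the `BASH_SOURCE`-based logic in the shell scripts; the function names and the throwaway directory layout are illustrative assumptions, not code from this PR.

```python
import os
import tempfile


def cluster_from_script_path(script_path):
    """Return the name of the directory the script was invoked from.

    The path is deliberately NOT resolved with os.path.realpath, so a symlink
    frontier_amd/test.sh -> ../frontier/test.sh still reports "frontier_amd".
    """
    return os.path.basename(os.path.dirname(os.path.abspath(script_path)))


def demo():
    """Build a temporary frontier/ + frontier_amd/ layout and detect both clusters."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "frontier"))
        os.makedirs(os.path.join(root, "frontier_amd"))
        real = os.path.join(root, "frontier", "test.sh")
        link = os.path.join(root, "frontier_amd", "test.sh")
        with open(real, "w") as fh:
            fh.write("#!/bin/bash\n")
        # Relative symlink, mirroring the layout described in the summary.
        os.symlink("../frontier/test.sh", link)
        return cluster_from_script_path(real), cluster_from_script_path(link)
```

Calling `demo()` returns the cluster names seen through the real file and through the symlink. Resolving the symlink first would collapse both to `frontier` and lose the distinction.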

Fixes #1275

Test plan

  • ./mfc.sh format passes
  • ./mfc.sh precheck passes (all 5 lint gates)
  • Symlinks verified: dirname resolves to frontier_amd/ preserving cluster detection
  • Helpers spot-checked: job_device=gpu job_interface=acc source gpu-opts.sh → --gpu acc
  • CI jobs (test, bench, case-optimization) pass on Phoenix, Frontier, Frontier AMD
  • Manual test on Phoenix: corrupt cache dir, verify health check detects and nuke recovers
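The helper spot check in the test plan can be mirrored by a small lookup table. This is a hedged Python sketch of the translation that gpu-opts.sh performs in shell. Only the `gpu`/`acc` pair is confirmed by the test plan, so the table holds just that entry; the `extra` parameter and the error behaviour are assumptions made for illustration.

```python
# Only the (device, interface) pair confirmed by the PR's spot check is listed.
KNOWN_GPU_OPTS = {("gpu", "acc"): "--gpu acc"}


def gpu_opts(job_device, job_interface, extra=None):
    """Translate job_device/job_interface into build options.

    `extra` lets a caller supply further mappings; it is a hypothetical
    extension point, not something the real gpu-opts.sh is known to offer.
    """
    table = dict(KNOWN_GPU_OPTS)
    table.update(extra or {})
    try:
        return table[(job_device, job_interface)]
    except KeyError:
        raise ValueError(
            f"no GPU options known for device={job_device!r}, "
            f"interface={job_interface!r}"
        ) from None
```

Failing loudly on an unknown pair is a design choice of the sketch: a silent empty string would let a misconfigured job build for the wrong target.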

🤖 Generated with Claude Code

@sbryngelson sbryngelson changed the title from "Consolidate CI dispatch: symlink frontier_amd, extract shared helpers" to "Add case-optimization CI tests and consolidate CI dispatch infrastructure" Mar 2, 2026
@sbryngelson sbryngelson marked this pull request as ready for review March 2, 2026 03:53
Copilot AI review requested due to automatic review settings March 2, 2026 03:53
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds a dedicated case-optimization correctness CI job and refactors HPC CI scripts by extracting shared GPU detection/build-retry helpers, while extending benchmark cases with a --steps override and adding NaN/Inf validation.

Changes:

  • Add a new case-optimization job to the test workflow plus scripts to prebuild, run, and validate case-optimized benchmark runs.
  • Consolidate duplicated CI shell logic (GPU opts, GPU detection, build retries) into shared .github/scripts/* helpers; replace Frontier AMD duplicates with symlinks.
  • Extend the packer/test tooling to detect both NaN and Inf; add --steps to benchmark case scripts.

Reviewed changes

Copilot reviewed 26 out of 30 changed files in this pull request and generated 4 comments.

File Description
toolchain/mfc/test/test.py Switch test validation from NaN-only to NaN/Inf detection.
toolchain/mfc/packer/pack.py Rename has_NaNs() to has_bad_values() and include Inf detection.
benchmarks/5eq_rk3_weno3_hllc/case.py Add --steps override; change parallel_io setting.
benchmarks/viscous_weno5_sgb_acoustic/case.py Add --steps override; change parallel_io setting.
benchmarks/hypo_hll/case.py Add --steps override for timestep control.
benchmarks/ibm/case.py Add --steps override; change parallel_io setting.
benchmarks/igr/case.py Add --steps override; change parallel_io setting.
.github/workflows/test.yml Use centralized test retry wrapper; add case-optimization CI job.
.github/workflows/phoenix/test.sh Refactor to use shared GPU opts, GPU detection, and retry-build helper.
.github/workflows/phoenix/bench.sh Refactor to use shared bench preamble + retry-build helper.
.github/workflows/frontier/test.sh Refactor to shared GPU detection and GPU opts helper.
.github/workflows/frontier/build.sh Refactor to shared GPU opts and retry-build helper.
.github/workflows/frontier/bench.sh Refactor to use shared bench preamble.
.github/workflows/frontier_amd/test.sh Replace with symlink to ../frontier/test.sh.
.github/workflows/frontier_amd/submit.sh Replace with symlink to ../frontier/submit.sh.
.github/workflows/frontier_amd/build.sh Replace with symlink to ../frontier/build.sh.
.github/workflows/frontier_amd/bench.sh Replace with symlink to ../frontier/bench.sh.
.github/scripts/run_case_optimization.sh New: runs 5 benchmark cases with --case-optimization and validates output.
.github/scripts/check_case_optimization_output.py New: validates D/*.dat contain no NaN/Inf via packer.
.github/scripts/run-tests-with-retry.sh New: centralizes “retry up to 5 sporadic failures” logic for test workflow.
.github/scripts/retry-build.sh New: shared 3-attempt build retry helper with optional cleanup/validation hooks.
.github/scripts/prebuild-case-optimization.sh New: prebuild benchmark cases with --case-optimization on login node.
.github/scripts/gpu-opts.sh New: shared translation from job_device/job_interface → `--gpu {acc
.github/scripts/detect-gpus.sh New: shared NVIDIA/AMD GPU detection setting ngpus and gpu_ids.
.github/scripts/bench-preamble.sh New: shared benchmark script preamble setting ranks/build/device opts.
.github/file-filter.yml Ensure .github/scripts/** changes trigger CI file-change detection.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 9

🧹 Nitpick comments (1)
.github/scripts/run_case_optimization.sh (1)

23-29: Use a single source of truth for the case list across prebuild/run scripts.

The hardcoded list here can drift from .github/scripts/prebuild-case-optimization.sh discovery behavior. Centralizing this list avoids mismatched “built vs validated” coverage.
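One way to act on this nitpick is to have both scripts derive the case list from the same discovery rule. The sketch below is written in Python for brevity and assumes that every benchmark lives in its own subdirectory containing a `case.py`, which matches the five paths touched by this PR. The function name is hypothetical.

```python
import os


def discover_cases(benchmarks_dir):
    """Return the sorted names of subdirectories that contain a case.py."""
    return sorted(
        entry
        for entry in os.listdir(benchmarks_dir)
        if os.path.isfile(os.path.join(benchmarks_dir, entry, "case.py"))
    )
```

If the prebuild and run steps both call the same function (or a shell equivalent), a newly added benchmark is built and validated together, and the "built vs validated" mismatch cannot occur.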


ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee892c and 83454c9.

📒 Files selected for processing (30)
  • .github/file-filter.yml
  • .github/scripts/bench-preamble.sh
  • .github/scripts/check_case_optimization_output.py
  • .github/scripts/detect-gpus.sh
  • .github/scripts/gpu-opts.sh
  • .github/scripts/prebuild-case-optimization.sh
  • .github/scripts/retry-build.sh
  • .github/scripts/run-tests-with-retry.sh
  • .github/scripts/run_case_optimization.sh
  • .github/workflows/frontier/bench.sh
  • .github/workflows/frontier/build.sh
  • .github/workflows/frontier/test.sh
  • .github/workflows/frontier_amd/bench.sh
  • .github/workflows/frontier_amd/bench.sh
  • .github/workflows/frontier_amd/build.sh
  • .github/workflows/frontier_amd/build.sh
  • .github/workflows/frontier_amd/submit.sh
  • .github/workflows/frontier_amd/submit.sh
  • .github/workflows/frontier_amd/test.sh
  • .github/workflows/frontier_amd/test.sh
  • .github/workflows/phoenix/bench.sh
  • .github/workflows/phoenix/test.sh
  • .github/workflows/test.yml
  • benchmarks/5eq_rk3_weno3_hllc/case.py
  • benchmarks/hypo_hll/case.py
  • benchmarks/ibm/case.py
  • benchmarks/igr/case.py
  • benchmarks/viscous_weno5_sgb_acoustic/case.py
  • toolchain/mfc/packer/pack.py
  • toolchain/mfc/test/test.py

sbryngelson and others added 5 commits March 2, 2026 10:44
…MFlowCode#1281)

Replace 4 duplicated frontier_amd/ scripts with symlinks to frontier/
(cluster auto-detected from directory name via BASH_SOURCE).

Extract 3 shared helpers into .github/scripts/:
- gpu-opts.sh: sets $gpu_opts from $job_device/$job_interface
- detect-gpus.sh: vendor-agnostic GPU detection (NVIDIA + AMD)
- retry-build.sh: retry_build() with configurable cleanup

Refactor 6 CI scripts to source the helpers, removing duplicated
GPU opts blocks, GPU detection, and retry loops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a new 'case-optimization' job to test.yml that builds and runs all
benchmark cases with --case-optimization on Phoenix (acc/omp), Frontier
(acc/omp), and Frontier AMD (omp). Each case runs a small grid (1 GBPP)
for 10 timesteps and validates output contains no NaN/Inf values.

- Add check_case_optimization_output.py validator script
- Add --steps CLI override to all 5 benchmark case files
- Update file-filter.yml to trigger CI on .github/scripts/ changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
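A `--steps` override of the kind this commit describes can be sketched with `argparse`. This is not the code in the benchmark case files: the default step count, the function names, and the use of `t_step_start`/`t_step_stop`/`t_step_save` as the affected keys are assumptions made for illustration.

```python
import argparse

DEFAULT_STEPS = 1000  # hypothetical default; each real case defines its own


def parse_case_args(argv):
    """Parse only the options this sketch knows about, ignoring the rest."""
    parser = argparse.ArgumentParser(description="benchmark case (sketch)")
    parser.add_argument("--steps", type=int, default=None,
                        help="override the number of timesteps")
    # parse_known_args tolerates other arguments the toolchain may pass.
    args, _unknown = parser.parse_known_args(argv)
    return args


def timestep_settings(argv):
    """Return the timestep-related case parameters for the given argv."""
    args = parse_case_args(argv)
    steps = DEFAULT_STEPS if args.steps is None else args.steps
    if steps < 1:
        raise ValueError("--steps must be a positive integer")
    # Assumed parameter names; save once at the end of the run.
    return {"t_step_start": 0, "t_step_stop": steps, "t_step_save": steps}
```

With this shape, a CI job can pass `--steps 10` for a short correctness run while performance benchmarks keep the case's default.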
- Add RETRY_VALIDATE_CMD hook to retry-build.sh for post-build validation
- Replace 37-line inline retry loop in phoenix/test.sh with retry_build()
- Derive module flag from cluster name in prebuild-case-optimization.sh,
  removing redundant flag field from case-optimization matrix
- Extract GitHub job test retry logic to run-tests-with-retry.sh
- Extract shared bench GPU/device preamble to bench-preamble.sh
- Standardize source order: detect-gpus.sh before gpu-opts.sh

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
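The retry helper with cleanup and validation hooks has roughly the following control flow. This Python sketch only illustrates that flow. The real helper is a shell function, `retry_build()`, and the callable-based interface here is an assumption standing in for its command hooks such as `RETRY_VALIDATE_CMD`.

```python
def retry_build(build, max_attempts=3, cleanup=None, validate=None):
    """Run `build` until it succeeds and passes `validate`, up to max_attempts.

    build, validate: callables returning True on success.
    cleanup: optional callable run between failed attempts.
    Returns the attempt number that succeeded; raises RuntimeError otherwise.
    """
    for attempt in range(1, max_attempts + 1):
        ok = build()
        if ok and validate is not None:
            # A build that "succeeds" but fails validation counts as a failure.
            ok = validate()
        if ok:
            return attempt
        if attempt < max_attempts and cleanup is not None:
            cleanup()
    raise RuntimeError(f"build failed after {max_attempts} attempts")
```

Treating a failed validation the same as a failed build is what lets the post-build check trigger a clean rebuild instead of passing a broken binary on to the test step.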
…enchmarks

- Set parallel_io to F in all benchmark cases so simulation writes D/*.dat
  text files readable by the packer (parallel_io=T writes binary to
  restart_data/ instead, which neither the packer nor the validation
  script could read)
- Rewrite check_case_optimization_output.py to use pack.compile() +
  has_bad_values() instead of reimplementing the same parsing logic
- Rename Pack.has_NaNs() to has_bad_values(), adding math.isinf() check
- Call validation via build/venv/bin/python3 for toolchain dependencies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
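The NaN/Inf check itself is small. The sketch below shows the `math.isnan()`/`math.isinf()` test this commit describes, plus a per-line scanner for text output. It does not use the packer's `pack.compile()` API, and it assumes the data files hold whitespace-separated numeric tokens. Both function bodies are illustrative rather than the toolchain's implementation.

```python
import math


def has_bad_values(values):
    """Return True if any value is NaN or +/-Inf."""
    return any(math.isnan(v) or math.isinf(v) for v in values)


def find_bad_values(text):
    """Return (line_number, token) pairs for NaN/Inf tokens in numeric text.

    Assumes whitespace-separated tokens; non-numeric tokens are skipped.
    """
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            try:
                value = float(token)  # accepts "NaN", "Infinity", "-inf", ...
            except ValueError:
                continue
            if math.isnan(value) or math.isinf(value):
                bad.append((lineno, token))
    return bad
```

Reporting the line and token, not only a boolean, is what makes a per-file diagnostic possible when a case-optimized run diverges.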
…, hidden env dependency

- run-tests-with-retry.sh: extract --test-all from "$@" instead of relying
  on $TEST_ALL env var for retry path
- check_case_optimization_output.py: restore argument validation, add
  per-file NaN/Inf diagnostic reporting
- run_case_optimization.sh: check venv exists before loop, fix misleading
  error message, normalize exit code
- pack.py: fix typo in Pack class comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ing cleanup

When NFS stale file handles occur on Phoenix, cached files become both
unreadable and undeletable, causing all retry attempts to fail identically.

Layer 1: Pre-flight health check in setup-build-cache.sh probes the cache
(ls, stat, touch/rm) and nukes immediately if stale, before the build starts.

Layer 2: Resilient cleanup in retry-build.sh escalates to cache nuke (mv-based
rename) when rm -rf fails during retry, so the next attempt gets a fresh cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
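The two layers can be sketched in Python as a probe and a fallback. This is an illustration of the technique, not the contents of setup-build-cache.sh or retry-build.sh: the function names, the probe file name, and the `.stale.` suffix are assumptions, and the `remove` parameter exists only so that a failing delete can be simulated.

```python
import os
import shutil
import time


def cache_is_healthy(cache_dir):
    """Layer 1: probe the cache with ls, stat, and touch/rm equivalents."""
    probe = os.path.join(cache_dir, f".health-probe-{os.getpid()}")
    try:
        os.listdir(cache_dir)
        os.stat(cache_dir)
        with open(probe, "w"):
            pass
        os.remove(probe)
    except OSError:
        # A stale NFS handle surfaces as an OSError on any of these calls.
        return False
    return True


def nuke_cache(cache_dir, remove=shutil.rmtree):
    """Layer 2: delete the cache; if deletion fails, rename it out of the way.

    Returns the path the old cache was moved to, or None if deletion worked.
    """
    moved_to = None
    try:
        remove(cache_dir)
    except FileNotFoundError:
        pass  # nothing to delete
    except OSError:
        # Undeletable files: a rename only touches the parent directory entry.
        moved_to = f"{cache_dir}.stale.{int(time.time())}.{os.getpid()}"
        os.rename(cache_dir, moved_to)
    os.makedirs(cache_dir, exist_ok=True)
    return moved_to
```

A rename can succeed where a recursive delete fails because it does not need to open or unlink the stale entries inside the directory, so the next attempt starts from an empty cache either way.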
@sbryngelson sbryngelson changed the title from "Add case-optimization CI tests and consolidate CI dispatch infrastructure" to "Consolidate CI infrastructure and add NFS-resilient build cache" Mar 3, 2026
sbryngelson and others added 4 commits March 2, 2026 22:49
…enix path

Frontier runners failed because the cache root /storage/coda1/... is
Phoenix-specific. Select cache root via case statement on cluster name:
Phoenix -> /storage/coda1/..., Frontier -> /lustre/orion/....

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
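Selecting the cache root by cluster name amounts to a lookup that fails loudly for unknown clusters. The sketch below is a Python stand-in for the shell `case` statement. The paths are kept elided exactly as in the commit message, and the function name and error handling are assumptions.

```python
# Paths are elided in the commit message and are left elided here.
CACHE_ROOTS = {
    "phoenix": "/storage/coda1/...",
    "frontier": "/lustre/orion/...",
}


def cache_root_for(cluster):
    """Return the build-cache root for a cluster, or raise for unknown names."""
    try:
        return CACHE_ROOTS[cluster]
    except KeyError:
        raise ValueError(
            f"no cache root configured for cluster {cluster!r}"
        ) from None
```

Raising on an unknown cluster avoids silently falling back to a path that exists on only one machine, which is the failure this commit fixes.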
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Disable build retries (max_attempts 1) across all CI jobs so failures
surface immediately. Test --max-attempts remains at 3 for sporadic
test failures.

Case-optimized pre-builds reduced to -j 2: Phoenix login nodes have a
4GB per-user cgroup limit (confirmed via dmesg: CONSTRAINT_MEMCG).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phoenix login nodes have a 4GB per-user cgroup memory limit that OOM-kills
case-optimized GPU builds (confirmed via dmesg: CONSTRAINT_MEMCG). Route
the pre-build through submit.sh on Phoenix so it runs on a compute node
with full memory. Frontier continues to pre-build on the login node.

Reverts retry/parallelism changes (max_attempts back to 3, -j back to 8)
since the root cause was the cgroup, not parallelism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from Copilot AI Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from qodo-code-review bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from coderabbitai bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from codecov bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Mar 3, 2026
@sbryngelson sbryngelson merged commit ce98373 into MFlowCode:master Mar 3, 2026
42 of 55 checks passed
@codecov codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.95%. Comparing base (7c806be) to head (f6918fa).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1285   +/-   ##
=======================================
  Coverage   44.95%   44.95%           
=======================================
  Files          70       70           
  Lines       20503    20503           
  Branches     1946     1946           
=======================================
  Hits         9217     9217           
  Misses      10164    10164           
  Partials     1122     1122           


Development

Successfully merging this pull request may close these issues.

Separate case-optimization correctness testing from performance benchmarks

2 participants