Skip to content

Fix SIGILL crashes on GitHub runners via CPU-aware build cache keys#1278

Merged
sbryngelson merged 10 commits intoMFlowCode:masterfrom
sbryngelson:fix/disable-march-native-github-runners
Feb 28, 2026
Merged

Fix SIGILL crashes on GitHub runners via CPU-aware build cache keys#1278
sbryngelson merged 10 commits intoMFlowCode:masterfrom
sbryngelson:fix/disable-march-native-github-runners

Conversation

@sbryngelson
Copy link
Member

@sbryngelson sbryngelson commented Feb 27, 2026

Summary

  • Add CPU model + compiler version hash to GitHub Actions build cache keys
  • Prevents cross-CPU cache collisions that cause SIGILL crashes when -march=native binaries are restored on runners with different instruction sets
  • Fix --test-all being silently ignored in the Build step, causing post_process (+ hdf5/silo/fftw) to rebuild during the Test step
  • Set FC=ifx and CC=icx explicitly for Intel CI jobs

Fixes #1277

Details

CPU-aware cache keys

GitHub-hosted runners have heterogeneous CPU architectures (different Intel/AMD generations). MFC builds with -march=native, so a binary cached on a runner with AVX-512 will crash with SIGILL when restored on a runner without it.

The fix hashes uname -m, CPU model name, and Fortran/C compiler versions into the cache key. Different hardware or compiler → different key → no cache reuse.

Fix --test-all in Build step

When matrix.precision is empty, --${{ matrix.precision }} expands to bare -- (the POSIX end-of-options marker), causing --test-all after it to be treated as a positional argument. This meant post_process was never built in the dry-run Build step, so the Test step had to build it from scratch — including fftw, hdf5, and silo dependencies.

Fixed by moving precision to an env var that only emits --single when non-empty.

Intel compiler env vars

The Intel setvars.sh sets PATH/LD_LIBRARY_PATH but never exports FC or CC. Without these, the sys-info cache key step was hashing gfortran/gcc versions instead of ifx/icx, and MFC's build relied on CMake finding compilers via PATH alone.

Test plan

  • All GitHub-hosted CI jobs pass (especially ubuntu, mpi, no-debug, true Chemistry tests)
  • ubuntu, no-mpi, single, no-debug, false passes (was broken by previous approach)
  • macOS jobs still work (uses shasum fallback + sysctl for CPU info)
  • post_process built in Build step (not Test step) when --test-all is active

sbryngelson and others added 2 commits February 27, 2026 18:18
GitHub-hosted runners have heterogeneous CPU architectures. When a build
cached on a runner with AVX-512 is restored on a runner without it, the
binary crashes with SIGILL (illegal instruction). This affected all 9
Chemistry tests on ubuntu-mpi-no-debug.

Add MFC_NATIVE_CPU CMake option (default ON) gated by MFC_NO_MARCH_NATIVE
env var. Set the env var in GitHub-hosted CI jobs only — self-hosted HPC
runners (Phoenix, Frontier) continue using -march=native.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GitHub-hosted runners have heterogeneous CPU architectures. When a build
cached on a runner with AVX-512 is restored on a runner without it, the
binary crashes with SIGILL (illegal instruction). This affected Chemistry
tests on ubuntu-mpi-no-debug.

Include a hash of CPU model and compiler versions in the GitHub Actions
build cache key so that builds compiled with -march=native are only
restored on runners with matching hardware and toolchain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson changed the title Disable -march=native on GitHub-hosted runners Fix SIGILL crashes on GitHub runners via CPU-aware build cache keys Feb 28, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@MFlowCode MFlowCode deleted a comment from codecov bot Feb 28, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
The cache key hash must reflect the actual compiler (e.g. ifx for Intel
jobs, gfortran-15 for macOS). Moving the sys-info step after all setup
steps ensures $FC and $CC are set correctly before hashing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
sbryngelson and others added 2 commits February 28, 2026 00:09
…precision

When matrix.precision is empty, --${{ matrix.precision }} expands to bare --,
which is the POSIX end-of-options marker. This causes --test-all after it to
be treated as a positional argument, so post_process is never built in the
Build (dry-run) step. It then gets built in the Test step instead, including
its fftw/hdf5/silo dependencies.

Move precision to an env var that only emits --single when non-empty.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ler versions

The Intel setvars.sh sets PATH/LD_LIBRARY_PATH but never exports FC or CC,
so the sys-info cache key step was falling back to gfortran/gcc versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
…the C compiler

The Intel CI only installs the Fortran compiler (ifx), not the C/C++ compiler
(icx). Setting CC=icx causes CMake to fail with "Could not find compiler".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
Setting FC=ifx in GITHUB_ENV caused CMake's lapack Fortran/C interop check
to fail (ifx + gcc linking). The Intel setvars.sh already puts ifx on PATH,
so CMake finds it without FC being set. Instead, detect ifx on PATH in the
sys-info step to include its version in the cache hash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@codecov
Copy link

codecov bot commented Feb 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.04%. Comparing base (35b2134) to head (10a10e9).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1278   +/-   ##
=======================================
  Coverage   44.04%   44.04%           
=======================================
  Files          70       70           
  Lines       20499    20499           
  Branches     1993     1993           
=======================================
  Hits         9028     9028           
  Misses      10330    10330           
  Partials     1141     1141           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@MFlowCode MFlowCode deleted a comment from github-actions bot Feb 28, 2026
@sbryngelson sbryngelson marked this pull request as ready for review February 28, 2026 16:49
Copilot AI review requested due to automatic review settings February 28, 2026 16:49
@sbryngelson sbryngelson merged commit ab5082e into MFlowCode:master Feb 28, 2026
29 of 34 checks passed
@github-actions
Copy link

Claude Code Review

Head SHA: 10a10e9
Files changed: 2 — .github/workflows/coverage.yml, .github/workflows/test.yml

Summary

  • Adds CPU model + compiler version hash to build cache keys to prevent SIGILL crashes from cross-CPU binary reuse with -march=native
  • Fixes the --test-all flag being silently swallowed when matrix.precision is empty (bare -- POSIX end-of-options issue)
  • Moves cache restore step after system info collection, which is the correct ordering
  • Coverage workflow gets the same CPU-aware cache key treatment

Findings

1. PR description mentions Intel FC/CC export — not visible in diff (.github/workflows/test.yml)
The PR summary states "Set FC=ifx and CC=icx explicitly for Intel CI jobs" but no explicit FC=ifx/CC=icx assignment is present in the diff. The sys-info step handles Fortran detection via command -v ifx, but C compiler detection still falls back to ${CC:-gcc}. If this was intentional (relying on command -v instead), the PR description should be updated. If not, the CC detection for Intel jobs may hash gcc instead of icx.

2. Inconsistency between coverage.yml and test.yml Intel detection (.github/workflows/coverage.yml, line ~41)
test.yml uses if command -v ifx &>/dev/null; then ifx --version ... for smarter Intel detection. coverage.yml uses the simpler ${FC:-gfortran} without this guard. If coverage ever adds an Intel matrix entry, it would hash gfortran's version instead of ifx's. Low risk now, but worth unifying.

3. /tmp/sys-hash instead of $RUNNER_TEMP (both files, sys-info step)
/tmp/sys-hash is functional — GitHub Actions runs one job per runner, so no collision risk. Using $RUNNER_TEMP/sys-hash would be more idiomatic and avoids any hypothetical issue on self-hosted runners that may share /tmp.

4. ARM Linux: /proc/cpuinfo | grep 'model name' may return empty (both files)
On Linux ARM runners, the CPU info field is not model name. The fallback chain handles macOS ARM via sysctl but Linux ARM would produce an empty string, leaving uname -m (returning aarch64) as the only differentiator. For GitHub-hosted ARM runners this is likely fine (uniform hardware), but worth noting if self-hosted ARM runners with different microarchitectures are ever added.

No blocking issues

The core logic is correct — the PRECISION env-var fix and CPU-aware cache key approach are sound. Findings 3 and 4 are low-risk. Finding 1 should be clarified.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 28, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 28fc258 and 10a10e9.

📒 Files selected for processing (2)
  • .github/workflows/coverage.yml
  • .github/workflows/test.yml

📝 Walkthrough

Walkthrough

The pull request modifies two GitHub Actions workflows to make build cache keys system-dependent. Both coverage.yml and test.yml add a new "system-info" step that computes a hash based on system architecture, CPU model, and compiler versions. This hash is prepended to existing cache keys. Additionally, test.yml updates the Build step to pass the precision value as an environment variable instead of inline expansion, ensuring consistent usage across the build and test execution blocks.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes intermittent SIGILL crashes on GitHub-hosted runners (issue #1277) caused by -march=native binaries being cached on one CPU architecture and restored on another. It also fixes a related build-step bug where --test-all was silently dropped.

Changes:

  • Add a CPU + compiler version hash (sys-hash) to GitHub Actions build cache keys, preventing cross-CPU cache collisions
  • Fix --test-all being swallowed by moving matrix.precision from an inline CLI expansion (--${{ matrix.precision }}) to an env var ($PRECISION), preventing a bare -- (POSIX end-of-options) when precision is empty
  • Apply the same CPU-aware cache key to coverage.yml

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
.github/workflows/test.yml Adds Get system info for cache key step (CPU model + Fortran/C compiler version hash); updates cache key to include sys-hash; moves precision flag to $PRECISION env var in Build step
.github/workflows/coverage.yml Adds the same Get system info for cache key step and incorporates sys-hash into the coverage cache key

Two issues were found:

  1. test.yml line 141 — CC still hashes gcc on Intel builds: The PR description states "FC=ifx and CC=icx explicitly for Intel CI jobs," but the Intel setup step never writes CC=icx to $GITHUB_ENV (since setvars.sh doesn't set CC). The ifx detection on line 140 works correctly via PATH, but the C compiler hash always resolves to gcc. This is partially mitigated by matrix.intel already being part of the cache key, but it's an inaccuracy and a discrepancy with the stated fix.

  2. coverage.yml lines 38–47 — Get system info runs before Setup Ubuntu: The compiler hash is computed before apt install gcc g++ runs. If the installed version differs from the pre-installed default, the hash won't reflect the actual build compiler. In test.yml this step is correctly placed after all OS setup steps.

uname -m
cat /proc/cpuinfo 2>/dev/null | grep 'model name' | head -1 || sysctl -n machdep.cpu.brand_string 2>/dev/null || true
if command -v ifx &>/dev/null; then ifx --version 2>/dev/null | head -1; else ${FC:-gfortran} --version 2>/dev/null | head -1 || true; fi
${CC:-gcc} --version 2>/dev/null | head -1 || true
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states "Set FC=ifx and CC=icx explicitly for Intel CI jobs", but the code does not explicitly write FC=ifx or CC=icx to $GITHUB_ENV in the Intel setup step. The ifx Fortran compiler is detected correctly on line 140 via command -v ifx (since setvars.sh updates PATH which is exported via printenv >> $GITHUB_ENV). However, CC is not set by setvars.sh, so line 141 always hashes gcc's version rather than icx's on Intel runs. While this is mitigated by ${{ matrix.intel }} already being part of the cache key, the hash does not accurately reflect the Intel C compiler version. Adding echo "CC=icx" >> $GITHUB_ENV in the Intel setup step (alongside the printenv call) would fix this inconsistency.

Copilot uses AI. Check for mistakes.
Comment on lines +38 to +47
- name: Get system info for cache key
id: sys-info
run: |
{
uname -m
cat /proc/cpuinfo 2>/dev/null | grep 'model name' | head -1 || sysctl -n machdep.cpu.brand_string 2>/dev/null || true
${FC:-gfortran} --version 2>/dev/null | head -1 || true
${CC:-gcc} --version 2>/dev/null | head -1 || true
} | (sha256sum 2>/dev/null || shasum -a 256) | cut -c1-16 > /tmp/sys-hash
echo "sys-hash=$(cat /tmp/sys-hash)" >> "$GITHUB_OUTPUT"
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Get system info for cache key" step runs before "Setup Ubuntu" (which installs gcc, g++, etc.). If apt install gcc installs or upgrades to a different compiler version than what is pre-installed on ubuntu-latest, the hash will reflect the pre-installation version rather than the version used for the build. For correctness, this step should be placed after "Setup Ubuntu", consistent with the ordering in test.yml.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

Chemistry tests crash with SIGILL on GitHub-hosted runners due to -march=native + build cache

2 participants