Coverage-based test selection (done right) — shadow mode by sbryngelson · Pull Request #1461 · MFlowCode/MFC

sbryngelson · 2026-05-30T03:36:46Z

What

Re-introduces execution-coverage-based test selection (./mfc.sh test --only-changes), replacing the gcov coverage-cache removed in #1460. It ships in shadow mode (PR jobs print what they would select but still run the full suite); enforcement is a separate later change, gated on shadow evidence.

Design spec: docs/superpowers/specs/2026-05-29-coverage-test-selection-design.md. Plan: docs/superpowers/plans/2026-05-29-coverage-test-selection.md.

Why this one is sound where the old one wasn't

The old cache rotted invisibly (UUID-keyed, never auto-refreshed, silent fallback). The post-mortem established that only execution coverage is sound — static labels miss transitive side effects. This rebuild fixes the root causes:

Keyed by param_hash (SHA-256 of the test's full resolved params), not CRC32(trace) — stable to cosmetic edits, invalidates on real behavior changes, decoupled from the golden-file UUID.
Single authoritative committed map (tests/coverage_map.json.gz), refreshed by a bot on master — freshness is a git-visible fact, not an evictable cache.
Loud anti-rot: a scheduled coverage-health workflow fails red if the map is stale/under-covers (kept off the PR path so it can't wedge PRs).
Two invariants enforce everything: (1) the selector may only over-include, never under-include; (2) staleness is always loud.
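
To illustrate the keying idea above, here is a minimal sketch of a SHA-256 key over a test's resolved parameters. The canonical-JSON encoding and the parameter names in the demo are assumptions for illustration; the PR does not show how the real param_hash serialises its input.

```python
import hashlib
import json


def param_hash(params: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the resolved parameters.

    Assumption: canonical JSON is used as the serialisation; the real
    implementation may encode parameters differently.
    """
    # sort_keys makes the digest independent of dict insertion order;
    # compact separators rule out whitespace-only differences.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative parameter dicts (names are placeholders).
a = param_hash({"m": 49, "bubbles_euler": "T"})
b = param_hash({"bubbles_euler": "T", "m": 49})  # same params, different order
c = param_hash({"m": 50, "bubbles_euler": "T"})  # one value changed

print(a == b, a == c)  # prints: True False
```

Because the key depends only on parameter values, renaming a test or reordering its definition leaves the key unchanged, while any change to a resolved parameter produces a new key that is absent from the map and therefore runs.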

The conservative ladder (soundness)

A test runs if any rung holds; the rungs are checked in order:

(1) changed-file detection failed → run all
(2) macro/codegen/build/src/**/include/ input changed → run all
(3) changed .f90/.f → run all
(4) changed .fpp covered by zero tests → run all
(5) test param_hash absent from the map → run it
(6) coverage overlaps a changed file → run it
(7) otherwise → skip

Every uncertainty resolves to run.
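
The ladder can be sketched as a single function that returns early on every "run all" rung. This is a simplified illustration, not the PR's select_tests: the name select_sketch, the forces_full_run helper, and the file names in the demo are hypothetical, and the map is assumed to be a dict from param_hash to a list of covered .fpp paths.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

PLAIN_FORTRAN = (".f90", ".f")  # rung 3 as stated in the description


def forces_full_run(path: str) -> bool:
    # Hypothetical stand-in for the macro/codegen/build/include rung.
    return path == "CMakeLists.txt" or "/include/" in path


def select_sketch(
    changed: Optional[Set[str]],
    cov_map: Dict[str, List[str]],
    keys: Iterable[str],
) -> Tuple[List[str], List[str], str]:
    """Return (to_run, skipped, reason). Every uncertain case runs."""
    keys = list(keys)
    if changed is None:
        return keys, [], "rung 1: changed-file detection failed"
    if any(forces_full_run(f) for f in changed):
        return keys, [], "rung 2: build/macro/include input changed"
    if any(f.endswith(PLAIN_FORTRAN) for f in changed):
        return keys, [], "rung 3: plain Fortran source changed"
    changed_fpp = {f for f in changed if f.endswith(".fpp")}
    covered = set().union(*cov_map.values())
    if changed_fpp - covered:
        return keys, [], "rung 4: changed .fpp covered by zero tests"
    to_run, skipped = [], []
    for key in keys:
        files = cov_map.get(key)
        # Rung 5: missing (or empty) coverage is uncertainty, so run.
        # Rung 6: run when the test executes a changed file.
        if not files or changed_fpp & set(files):
            to_run.append(key)
        else:
            skipped.append(key)  # rung 7
    return to_run, skipped, "rungs 5-7: per-test selection"


cov = {"A": ["src/a.fpp"], "B": ["src/b.fpp"]}
print(select_sketch({"src/a.fpp"}, cov, ["A", "B", "C"])[:2])
# prints: (['A', 'C'], ['B'])
```

Test C runs even though it has no overlap, because it has no entry in the map at all. That is the over-inclusion the first invariant asks for.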

Validated end-to-end against a real gcov map (574 tests)

change	selected
bubbles feature	101/610
riemann (core)	589/610
macro / include	610/610 (run all)
docs only	0/610

Known gcov caveat (handled)

gcov rolls #:include'd .fpp into the parent compilation unit, so include files aren't reliably attributed. Closed by rule: any src/**/include/ change forces a full run, so the attribution gap can never cause under-inclusion. (Map built on macOS gfortran 15.2; CI's Linux build may attribute differently — the map rebuilds there, and the include-rule keeps it sound regardless.)
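
One possible way to express the src/**/include/ rule is a path-component check rather than a fixed prefix, so that an include directory at any depth under src/ triggers a full run. The function name and the example paths are illustrative assumptions, not taken from the PR.

```python
from pathlib import PurePosixPath


def is_src_include(path: str) -> bool:
    """True if `path` lies inside an include/ directory anywhere under src/."""
    parts = PurePosixPath(path).parts
    # parts[0] must be "src"; "include" must be a directory component,
    # so the final component (the file name itself) is excluded.
    return len(parts) >= 3 and parts[0] == "src" and "include" in parts[1:-1]


print(is_src_include("src/common/include/macros.fpp"))  # prints: True
print(is_src_include("src/simulation/m_example.fpp"))   # prints: False
```

Matching on components avoids false positives such as a file merely named include.fpp, and it does not depend on which subdirectory of src/ holds the include files.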

Test plan

33 unit tests for the selection logic (param_hash, the ladder, changed-file parsing, health)
Full toolchain suite green
Real --gcov build + --build-coverage-map over all tests (collector validated at runtime)
End-to-end selection verified against the committed map
Shadow-mode CI evidence that selection never under-selects (this PR's purpose)
Re-verify gcov attribution on Linux gfortran
Follow-up PR: self-hosted job wiring + --select-enforce

github-actions · 2026-05-30T03:45:33Z

Claude Code Review

Head SHA: 6d068b0

Files changed: 13
.github/scripts/check_coverage_map_health.py
.github/workflows/common/coverage-refresh.sh
.github/workflows/coverage-health.yml
.github/workflows/coverage-refresh.yml
.github/workflows/test.yml
toolchain/mfc/cli/commands.py
toolchain/mfc/test/case.py
toolchain/mfc/test/coverage.py
toolchain/mfc/test/coverage_build.py
toolchain/mfc/test/test.py

Findings:

select_tests: empty changed_files set silently skips all tests — future soundness risk

toolchain/mfc/test/coverage.py lines 104–106 (select_tests, rung-7 branch):

changed_fpp = {f for f in changed_files if f.endswith(".fpp")}
if not changed_fpp:
    return [], list(cases), "rung7: no Fortran source changed"

When changed_files is an empty set() (not None), changed_fpp is also empty and all cases land in skipped with to_run=[]. In shadow mode this is safe — the full suite runs regardless. However, changed_files=set() can arise from the CI plumbing today: get_changed_files returns set() when explicit="" (an empty string), which is the value produced when CHANGED_FILES is unset or the paths-filter step returns no matches and --changed-files "" is passed.

The spec's soundness rule is that every uncertainty resolves to run. An empty explicit list (--changed-files "") is currently treated as certain knowledge that nothing Fortran-relevant changed (rung 7, skip all) rather than as unknown (rung 1, run all). This distinction is benign in shadow mode. But when --select-enforce is activated in the follow-up PR, a misconfigured paths-filter, a silent step failure, or an environment where CHANGED_FILES is unset would cause the full test suite to be skipped rather than run conservatively.

The minimal fix is to treat explicit="" the same as explicit=None in get_changed_files (i.e. return None → rung 1) rather than returning set(). The two inputs are semantically different — "no flag provided" vs "flag provided with empty value" — and only the latter risks under-selection when enforce mode is active.

sbryngelson · 2026-05-30T03:49:51Z

Good catch — fixed in f7a692f. An empty/whitespace --changed-files now falls through to git detection (→ None → rung 1, run all) instead of returning an empty set (which read as rung 7, skip all). You're right that it's benign in shadow but a silent full-skip under --select-enforce; treating it as uncertainty→run matches the soundness invariant. Added a regression test.
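
A minimal sketch of the behaviour described in this fix, assuming the explicit list arrives as one string separated by spaces, commas, or newlines. The function name parse_changed_files is hypothetical; here None means "unknown", which the caller would resolve by falling back to git detection or by running everything.

```python
import re
from typing import Optional, Set


def parse_changed_files(explicit: Optional[str]) -> Optional[Set[str]]:
    """Parse an explicit changed-file list; None means 'unknown'."""
    if explicit is None:
        return None
    # Split on any run of whitespace or commas and drop empty tokens.
    files = {tok for tok in re.split(r"[\s,]+", explicit) if tok}
    # An empty or whitespace-only value is uncertainty, never "nothing changed".
    return files or None


print(parse_changed_files(""))                    # prints: None
print(sorted(parse_changed_files("a.fpp,b.py")))  # prints: ['a.fpp', 'b.py']
```

Returning None rather than an empty set is the point of the fix: an empty set would read as certain knowledge that nothing changed and would skip every test under enforcement.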

…sts via rung 5 (soundness)

Fix 1: in build_coverage_map Phase 2, a test with non-empty failures produced only partial .gcda files (crash mid-pipeline). Previously those tests were still added to test_results, and their truncated coverage was cached. A later .fpp change that ran only in the missing stage would be silently skipped. Fix: when failures is non-empty, record in all_failures and continue without adding to test_results; absent entries are conservatively included by select_tests (rung 5).

Fix 2: in _parse_gcov_json_output, a mid-stream json.JSONDecodeError returned the partial result set, which is untrustworthy (the truncated JSON stream may be missing coverage for .fpp files that were not yet serialised). Fix: return None on that error path so the caller omits the test from the map entirely.

Fix 6 (comment): correct the post_process comment (~line 347): the regular suite runs post_process only under --test-all (which CI sets), not never.
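
The all-or-nothing parsing in Fix 2 can be sketched as follows. This is not the PR's _parse_gcov_json_output; it assumes the input is a string of concatenated JSON documents in the shape gcov emits with --json-format (a "files" list whose entries carry "file" and per-line "count" values), and the function name covered_files is hypothetical.

```python
import json
from typing import Optional, Set


def covered_files(stream: str) -> Optional[Set[str]]:
    """Return files with at least one executed line, or None if the stream is damaged."""
    decoder = json.JSONDecoder()
    covered: Set[str] = set()
    pos = 0
    while pos < len(stream):
        # raw_decode does not skip leading whitespace, so do it by hand.
        if stream[pos].isspace():
            pos += 1
            continue
        try:
            doc, pos = decoder.raw_decode(stream, pos)
        except json.JSONDecodeError:
            # A truncated stream may be missing files: discard everything.
            return None
        for entry in doc.get("files", []):
            if any(line.get("count", 0) > 0 for line in entry.get("lines", [])):
                covered.add(entry["file"])
    return covered


good = (
    '{"files":[{"file":"a.fpp","lines":[{"line_number":1,"count":2}]}]}\n'
    '{"files":[{"file":"b.fpp","lines":[{"line_number":1,"count":0}]}]}'
)
print(covered_files(good))       # prints: {'a.fpp'}
print(covered_files(good[:-5]))  # prints: None
```

Returning None instead of the partial set lets the caller leave the test out of the map, so the selector treats it as absent and runs it.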

…ching

Fix 3: ALWAYS_RUN_ALL_EXACT + prefixes enumerated only a handful of toolchain files, missing case.py, build.py, common.py, state.py, sched.py, etc. Any toolchain/mfc/*.py change (except cases.py) affects every test's generation or execution, so under-enumeration was unsound. Replace with a catch-all: any(f.startswith('toolchain/mfc/') and f.endswith('.py') and f != CASES_PY). Drop the now-redundant individual file entries and toolchain/mfc/params/ and toolchain/mfc/run/ prefixes (all subsumed). Keep CMakeLists.txt, toolchain/cmake/, toolchain/bootstrap/, and src include rules.

Fix 4: rung 3 matched only .f90 and .f, missing .F90, .F95, .F03, .F08, .FOR and all other uppercase/mixed variants. Changed files ending in those extensions under src/ would fall through to per-test selection against a coverage map that only tracks .fpp, causing silent under-inclusion. Fix: case-insensitive match against the full tuple (.f90, .f, .f95, .f03, .f08, .for).

Tests: add three new unit tests covering the above fixes.
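
The two predicates described in Fix 3 and Fix 4 might look like the sketch below. The catch-all expression is taken from the commit message; the value of CASES_PY, the function names, and the choice not to restrict the extension check to src/ are assumptions.

```python
CASES_PY = "toolchain/mfc/test/cases.py"  # assumed path; not shown in the PR text
FORTRAN_EXTS = (".f90", ".f", ".f95", ".f03", ".f08", ".for")


def is_toolchain_python(path: str) -> bool:
    """Fix 3: any toolchain Python file except cases.py forces a full run."""
    return (
        path.startswith("toolchain/mfc/")
        and path.endswith(".py")
        and path != CASES_PY
    )


def is_plain_fortran(path: str) -> bool:
    """Fix 4: case-insensitive extension match (covers .F90, .FOR, ...)."""
    return path.lower().endswith(FORTRAN_EXTS)


print(is_toolchain_python("toolchain/mfc/build.py"))  # prints: True
print(is_toolchain_python(CASES_PY))                  # prints: False
print(is_plain_fortran("src/common/m_example.F90"))   # prints: True
print(is_plain_fortran("src/common/m_example.fpp"))   # prints: False
```

A catch-all needs no maintenance when a new toolchain module is added, which is the failure mode the enumerated list had.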

… caveat Add 'permissions: contents: write' at the workflow top level so the coverage-refresh job is authorized to commit and push the updated coverage_map.json.gz back to master. Without this, the GITHUB_TOKEN has only read permissions in newer default permission settings. Also add a comment on the git push step noting that branch protection may still reject the default GITHUB_TOKEN and that a PAT or GitHub App with bypass-branch-protection permission may be needed.

Copilot

Pull request overview

Re-introduces execution-coverage-based test selection for ./mfc.sh test --only-changes in shadow mode (prints what would run on PRs while still executing the full suite), along with tooling to build/refresh the committed coverage map and workflows to keep it healthy.

Changes:

Adds coverage-map-driven selection logic (param_hash-keyed), integrates it into the test command, and wires PR CI to run the selector in shadow mode.
Adds a gcov-based coverage map builder and scheduled workflows to refresh the committed map and fail loudly if it becomes stale/under-covering.
Adds unit tests covering the selection ladder, map I/O, and changed-file detection behavior.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
toolchain/mfc/test/test.py	Integrates `--only-changes` selection (shadow/enforce) and adds `--build-coverage-map` entrypoint.
toolchain/mfc/test/test_coverage_unit.py	New unit tests for coverage selection, hashing, map I/O, and changed-file detection.
toolchain/mfc/test/coverage.py	Implements param-hash keying, selection ladder, changed-file detection, summary formatting, and map health checks.
toolchain/mfc/test/coverage_build.py	Adds gcov-based builder to generate `tests/coverage_map.json.gz` from per-test execution coverage.
toolchain/mfc/test/case.py	Adds `coverage_key()` to cases/builders to support map lookups by `param_hash`.
toolchain/mfc/cli/commands.py	Adds CLI flags: `--only-changes`, `--select-enforce`, `--changed-files`, `--changes-branch`, `--build-coverage-map`.
.github/workflows/test.yml	Enables shadow-mode selector on PR CI and passes paths-filter output as `--changed-files`.
.github/workflows/coverage-refresh.yml	Scheduled / on-push workflow to rebuild and commit refreshed coverage map.
.github/workflows/coverage-health.yml	Scheduled workflow to fail if the committed map is stale or under-covers current tests.
.github/workflows/common/test.sh	Enables shadow-mode selector in the common SLURM test wrapper on PRs.
.github/workflows/common/coverage-refresh.sh	SLURM-side script to build with gcov and run the map builder.
.github/scripts/check_coverage_map_health.py	Health-check script used by the coverage-health workflow.

+ALWAYS_RUN_ALL_EXACT = frozenset(
+    [
+        "CMakeLists.txt",
+    ]
+)
+ALWAYS_RUN_ALL_PREFIXES = (
+    "src/common/include/",  # GPU/Fypp macro & include files (CPU map can't line-attribute)
+    "toolchain/cmake/",  # build system
+    "toolchain/bootstrap/",  # build/run scripts
+)


+def load_map(path: Path) -> Tuple[Optional[dict], Optional[dict]]:
+    """Return (entries_without_meta, meta), or (None, None) if missing/corrupt."""
+    if not Path(path).exists():
+        return None, None
+    try:
+        with gzip.open(path, "rt", encoding="utf-8") as f:
+            data = json.load(f)
+    except (OSError, gzip.BadGzipFile, json.JSONDecodeError, UnicodeDecodeError):
+        return None, None
+    if not isinstance(data, dict) or "_meta" not in data:
+        return None, None
+    meta = data.pop("_meta")
+    return data, meta
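
One way to harden load_map against malformed entries is a shape check applied after _meta has been popped. The sketch below assumes each entry maps a param_hash string to a list of file-path strings; the helper name entries_are_well_formed is hypothetical.

```python
def entries_are_well_formed(entries: object) -> bool:
    """True only if `entries` is a dict of str -> list[str]."""
    if not isinstance(entries, dict):
        return False
    return all(
        isinstance(key, str)
        and isinstance(files, list)
        and all(isinstance(name, str) for name in files)
        for key, files in entries.items()
    )


print(entries_are_well_formed({"abc123": ["src/a.fpp"]}))  # prints: True
print(entries_are_well_formed({"abc123": "src/a.fpp"}))    # prints: False
```

A caller could return (None, None) when this check fails, so that a damaged map leads to a full run instead of a wrong selection.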


+        Argument(name="changed-files", dest="changed_files", type=str, default=None, help="Newline- or comma-separated changed-file list (from CI paths-filter). Overrides git detection."),
+        Argument(name="changes-branch", dest="changes_branch", type=str, default="master", help="Branch to diff against for --only-changes."),


+          SELECT=""
+          [ "${{ github.event_name }}" = "pull_request" ] && SELECT="--only-changes"
+          /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) $SELECT --changed-files "$CHANGED_FILES" $TEST_ALL $TEST_PCT $PRECISION


+    # MPI-compiled binaries must be launched via an MPI launcher (even ppn=1).
+    # Use --bind-to none to avoid binding issues with concurrent launches.
+    if shutil.which("mpirun"):
+        mpi_cmd = ["mpirun", "--bind-to", "none", "-np", str(ppn)]
+    elif shutil.which("srun"):
+        mpi_cmd = ["srun", "--ntasks", str(ppn)]
+    else:
+        raise MFCException("No MPI launcher found (mpirun or srun). MFC binaries require an MPI launcher.\n  On Ubuntu: sudo apt install openmpi-bin\n  On macOS:  brew install open-mpi")
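
A later commit in this PR adds mpiexec as a fallback between mpirun and srun. A sketch of that lookup order is shown here with the lookup function injected so it can be exercised without MPI installed. The function name, the mpiexec arguments, and the use of RuntimeError in place of MFCException are assumptions.

```python
import shutil
from typing import Callable, List, Optional


def pick_mpi_launcher(
    ppn: int,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the launcher argv prefix, trying mpirun, then mpiexec, then srun."""
    if which("mpirun"):
        return ["mpirun", "--bind-to", "none", "-np", str(ppn)]
    if which("mpiexec"):
        # Assumed arguments; the PR text does not show the mpiexec branch.
        return ["mpiexec", "-n", str(ppn)]
    if which("srun"):
        return ["srun", "--ntasks", str(ppn)]
    raise RuntimeError("No MPI launcher found (mpirun, mpiexec, or srun).")


# Simulate a machine where only mpiexec is installed.
only_mpiexec = lambda name: "/usr/bin/mpiexec" if name == "mpiexec" else None
print(pick_mpi_launcher(2, which=only_mpiexec))  # prints: ['mpiexec', '-n', '2']
```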


…validate map shape

Fix 1: Replace the allowlist of run-all files with an inverted design: define a small, conservative allowlist of provably test-irrelevant files (docs/*.md, LICENSE, etc.) and treat any changed file that is not .fpp, not cases.py, and not in that allowlist as unattributable -> run all. This closes the gap where mfc.sh, .github/**, tests/**, toolchain/pyproject.toml, and similar files would silently fall through to rung-7 (skip-all) under --select-enforce.

Fix 2: In load_map, validate that every entry is str -> list[str] after popping _meta. A malformed entry returns (None, None) so the caller runs the full suite rather than silently misrouting tests.

Tests: expand test_docs_only_still_skips_all to cover docs/, LICENSE, .claude/; add test_unattributable_nonsource_change_runs_all for mfc.sh / pyproject.toml / tests/** / .github/workflows; add test_load_map_rejects_malformed_entry.
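
The inverted design in Fix 1 can be sketched as a three-way classification in which "run all" is the default. The pattern list, the CASES_PY path, and the function name classify are illustrative assumptions; the commit message names only docs/*.md, LICENSE, and .claude/ as examples of irrelevant files.

```python
from fnmatch import fnmatchcase

CASES_PY = "toolchain/mfc/test/cases.py"           # assumed path
IRRELEVANT = ("docs/*", "LICENSE", ".claude/*")    # assumed allowlist patterns


def classify(path: str) -> str:
    """Return 'attributable', 'irrelevant', or 'run_all' for a changed file."""
    if path.endswith(".fpp") or path == CASES_PY:
        return "attributable"  # handled by the per-test rungs
    # fnmatchcase lets '*' cross '/' and is case-sensitive on every platform.
    if any(fnmatchcase(path, pattern) for pattern in IRRELEVANT):
        return "irrelevant"    # provably cannot affect any test
    return "run_all"           # unknown file type: assume it matters


for p in ("docs/guide.md", "mfc.sh", "toolchain/pyproject.toml", "src/x/m_a.fpp"):
    print(p, "->", classify(p))
```

With an allowlist of irrelevant files, forgetting to list a file costs an unnecessary full run; with the earlier allowlist of run-all files, forgetting one cost a silent skip.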

codecov · 2026-05-30T07:41:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.31%. Comparing base (574e53d) to head (b8f89bd).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1461   +/-   ##
=======================================
  Coverage   61.31%   61.31%           
=======================================
  Files          72       72           
  Lines       19771    19771           
  Branches     2852     2852           
=======================================
  Hits        12123    12123           
  Misses       5699     5699           
  Partials     1949     1949

sbryngelson and others added 19 commits May 29, 2026 22:24

docs: design spec for coverage-based test selection (done right)

bfcd0de

docs: implementation plan for coverage-based test selection

5aa4e4a

feat(test): add stable param_hash key for coverage selection

216e04c

feat(test): coverage map load/save with freshness metadata

94d5e48

feat(test): ALWAYS_RUN_ALL classification for coverage selection

61ed14d

feat(test): conservative coverage selection ladder

9849ec9

feat(test): TestCase.coverage_key() = param_hash(params)

c99289b

feat(test): robust changed-file detection (CI list + self-healing git)

ea0343e

feat(test): coverage selection summary line

505eca2

feat(test): coverage map health check (loud anti-rot)

8f46b3f

fix(test): key coverage_key on full params, not mods (soundness)

9f06bbe

feat(test): gcov coverage-map collector, keyed by param_hash

7581bb4

feat(test): wire --only-changes (shadow) and --build-coverage-map

3c98384

ci: coverage map refresh + health workflows

8232576

fix(test): coverage_key on TestCaseBuilder + treat empty coverage as run (soundness)

9de8e5e

feat(test): accept space/comma/newline separated --changed-files (CI paths-filter)

81e723f

fix(test): force full run on src/**/include/ changes (gcov can't attribute includes)

6d068b0

Delete docs/superpowers/plans/2026-05-29-coverage-test-selection.md

dee8b5c

Delete docs/superpowers directory

36b3c71

fix(test): empty --changed-files is uncertainty (run all), not skip-all

f7a692f

sbryngelson added 5 commits May 29, 2026 23:58

fix(test): ALWAYS_RUN_ALL covers test/run infra; cases.py runs new tests via rung 5 (soundness)

9811cfc

ci: coverage selection shadow mode on self-hosted jobs (git-detected changed files)

a1152d1

sbryngelson marked this pull request as ready for review May 30, 2026 04:48

sbryngelson requested a review from Copilot May 30, 2026 04:52

Copilot started reviewing on behalf of sbryngelson May 30, 2026 04:52

Copilot AI reviewed May 30, 2026

sbryngelson added 4 commits May 30, 2026 01:29

fix(cli): clarify --changed-files help to mention space-separated input

03b2bbe

fix(ci): gate --changed-files with --only-changes via bash array (PR-only)

d208a03

fix(coverage-build): add mpiexec as fallback MPI launcher between mpirun and srun

b8f89bd

sbryngelson merged commit f02f5f2 into MFlowCode:master May 30, 2026
86 checks passed

This was referenced May 30, 2026

ci: wire CACHE_PUSH_TOKEN so the coverage-map refresh actually pushes #1462

Merged

[smoke test] confirm coverage selection prunes in shadow (comment-only .fpp change) #1463

Closed

sbryngelson merged 29 commits into MFlowCode:master from sbryngelson:coverage-test-selection
