[CI] Fix chronic timeouts in Jenkins Code coverage stage #2382

Merged: bogdan-petkovic merged 1 commit into bpetkovi/jenkins-codecov-exclude-test-dirs from bpetkovi/jenkins-coverage-fix-timeout on May 22, 2026

bogdan-petkovic commented May 21, 2026

Motivation

The Jenkins Code coverage stage in mlir/utils/jenkins/Jenkinsfile has been chronically failing on PR builds and on merges to develop, but the failure has gone unnoticed for a long time:

  • The whole stage runs inside a 60-minute activity-based timeout.
  • Two parts of the pipeline produce little or no output that Jenkins sees:
    • lit buffers progress for the duration of ninja check-rocmlir (~60 minutes on the coverage matrix codepaths) and only flushes at the very end.
    • llvm-profdata merge -sparse operates on ~125 GB of *.profraw and runs silently for several minutes.
  • The activity timer fires before llvm-profdata merge finishes → coverage_<cpath>.lcov is never produced → the Codecov uploader never runs → archiveArtifacts ... onlyIfSuccessful: true skips the report/lcov/html artifacts.
  • The outer catchError(buildResult: 'SUCCESS', stageResult: 'FAILURE') plus an inner try { ... } catch (e) { echo "NOTE: Code coverage stage had an error or timeout: ..." } swallow the FlowInterruptedException → the build is green and the matrix bodies show ✓ even though no LCOV reached codecov.io.
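
The interaction between the quiet steps and the activity-based timer can be illustrated with a small model. This is a sketch, not Jenkins code: the function names and the sample output timestamps are hypothetical, chosen only to mirror the roughly 60-minute silent lit run described above.

```python
from typing import Optional, Sequence


def inactivity_timeout_at(output_minutes: Sequence[float],
                          finish_minute: float,
                          limit: float) -> Optional[float]:
    """Minute at which an inactivity timer would fire, or None if it never does.

    The timer is reset by every line of output and fires once `limit`
    minutes pass without any.
    """
    last_activity = 0.0  # the timer starts when the stage starts
    for t in sorted(output_minutes):
        if t >= finish_minute:
            break
        if t - last_activity > limit:
            return last_activity + limit
        last_activity = t
    if finish_minute - last_activity > limit:
        return last_activity + limit
    return None


def wall_clock_timeout_at(finish_minute: float, limit: float) -> Optional[float]:
    """Minute at which a wall-clock timer would fire, or None if the run fits."""
    return limit if finish_minute > limit else None


# Hypothetical timeline: one line of output at minute 1, then silence until
# buffered output is flushed at minute 61.5; the stage would finish at minute 75.
quiet_run = [1.0, 61.5, 62.0]
print(inactivity_timeout_at(quiet_run, 75.0, limit=60.0))  # 61.0
print(wall_clock_timeout_at(75.0, limit=180.0))            # None
```

In this model the inactivity timer fires during the silent stretch even though the whole run would have finished well inside a wall-clock limit.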

Concrete evidence collected while investigating https://github.com/ROCm/rocMLIR/pull/ (the test-exclusion change):

[2026-05-20T12:09:03.701Z] Total Discovered Tests: 919
[2026-05-20T12:14:07.193Z] Sending interrupt signal to process
[2026-05-20T12:14:13.284Z] Terminated
[2026-05-20T12:14:13.333Z] script returned exit code 143
                            Timeout has been exceeded
[2026-05-20T12:14:13.572Z] NOTE: Code coverage stage had an error or timeout:
[2026-05-20T12:14:13.572Z] org.jenkinsci.plugins.workflow.steps.FlowInterruptedException: Timeout has been exceeded
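
Because the build stays green, the console log is the only place this failure shows up. A minimal sketch of a log check follows, assuming the console text is already available as a string. The function and constant names are hypothetical, and the patterns are taken from the log excerpt above.

```python
import re

# Markers copied from the console log excerpt above.
SWALLOWED_TIMEOUT_PATTERNS = (
    re.compile(r"FlowInterruptedException: Timeout has been exceeded"),
    re.compile(r"NOTE: Code coverage stage had an error or timeout"),
)


def coverage_timeout_swallowed(console_log: str) -> bool:
    """Return True if the log shows the coverage stage hit a swallowed timeout."""
    return any(p.search(console_log) for p in SWALLOWED_TIMEOUT_PATTERNS)


SAMPLE_LOG = """\
[2026-05-20T12:14:13.333Z] script returned exit code 143
[2026-05-20T12:14:13.572Z] NOTE: Code coverage stage had an error or timeout:
[2026-05-20T12:14:13.572Z] org.jenkinsci.plugins.workflow.steps.FlowInterruptedException: Timeout has been exceeded
"""

print(coverage_timeout_swallowed(SAMPLE_LOG))  # True
```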

The same FlowInterruptedException pattern was confirmed on a recent merge to develop (build from 2026-05-16), so codecov.io has been displaying a stale snapshot from whichever older build last produced a successful upload. The biweekly Teams report (.github/workflows/codecov-report.yml) reads totals.coverage from that stale data.

Technical Details

Single-line change in the body stage of the collectCoverageData(...) orchestration in mlir/utils/jenkins/Jenkinsfile:

-                                                                timeout(time: 60, activity: true, unit: 'MINUTES') {
+                                                                timeout(time: 180, unit: 'MINUTES') {
  • Switch from activity-based to wall-clock timeout — removes spurious firings caused by lit's buffered output and llvm-profdata merge's silent runtime.
  • Bump the budget to 180 minutes: ~60 min for ninja check-rocmlir + several minutes for llvm-profdata merge over ~125 GB + three llvm-cov invocations (report, export, show) + uploader download and upload, with comfortable margin.
  • A short comment is added inline to explain why activity: was dropped, so this isn't reintroduced inadvertently in a future cleanup.
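
The 180-minute budget can be sanity-checked with simple arithmetic. Only the ~60 minutes for ninja check-rocmlir comes from the description above; every other duration in this sketch is a placeholder assumption, not a measurement.

```python
BUDGET_MINUTES = 180

estimated_minutes = {
    "ninja check-rocmlir": 60,           # from the description above
    "llvm-profdata merge": 10,           # assumed; described only as "several minutes"
    "llvm-cov report/export/show": 30,   # assumed: three invocations at 10 minutes each
    "uploader download and upload": 5,   # assumed
}

total = sum(estimated_minutes.values())
headroom = BUDGET_MINUTES - total
print(total, headroom)  # 105 75
```

Under these assumed figures the stage uses a little over half the budget, which fits the "comfortable margin" described above.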

Test Plan

  • Codecov on PR CI

Test Result

  • PR CI with Codecov

Submission Checklist

Signed-off-by: bogdan-petkovic <bpetkovi@amd.com>
bogdan-petkovic self-assigned this May 21, 2026
bogdan-petkovic marked this pull request as ready for review May 21, 2026 09:40
bogdan-petkovic requested a review from causten as a code owner May 21, 2026 09:40
bogdan-petkovic merged commit 3e56161 into bpetkovi/jenkins-codecov-exclude-test-dirs May 22, 2026
4 of 11 checks passed
bogdan-petkovic deleted the bpetkovi/jenkins-coverage-fix-timeout branch May 22, 2026 10:21