CI Build Optimization — Timing Baselines and Measured Impact #8364
paoloantinori
started this conversation in
Ideas
CI Build Optimization — Timing Baselines and Measured Impact
This document tracks the measured performance impact of Maven CI build optimizations applied to the Apicurio Registry project. It serves as a living reference for the team to evaluate whether each optimization is worth keeping.
Related: Discussion #8365 — Cross-Project Comparison (44 OSS Projects) | Issue #8348 — ttl.sh replacement
Baseline v1 — Pre-Optimization (June 22, 2026)
Three consecutive green runs on main before any optimizations:
Baseline v2 — Post First Wave (June 23-24, 2026)
After PRs #8341 (checkstyle skip), #8343 (git-commit-id skip), #8353 (source/javadoc skip + MAVEN_OPTS):
Total shard time: 59m55s → 52m09s (-13%)
Baseline v3 — Post Second Wave (June 25, 2026)
After PRs #8347, #8375, #8377, #8369 + .mvn/maven.config -T 1C:
Total shard time: 52m09s → 44m19s (-15% vs v2, -26% vs v1)
Baseline v4 — Post Third Wave (June 29, 2026)
After PRs #8378 (-Dlocal), #8367 (.mvn/maven.config), #8388 (surefire retry), #8390 (operator timeouts), #8349 (ttl.sh replacement):
Total shard time: 44m19s → 43m59s (stable, -27% vs v1)
Cumulative improvement from v1: 59m55s → 43m59s = 15m56s saved per CI run (-27%)
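The percentages quoted in the baselines can be re-derived from the reported totals. The sketch below is only a convenience for checking the arithmetic; the helper names to_seconds and pct_change are illustrative and not part of the project.

```python
def to_seconds(duration: str) -> int:
    """Convert a duration such as '59m55s' into whole seconds."""
    minutes, seconds = duration.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


def pct_change(before: int, after: int) -> float:
    """Relative change from before to after, in percent (negative means faster)."""
    return (after - before) / before * 100


# Total shard times exactly as reported in the baseline sections above.
baselines = {"v1": "59m55s", "v2": "52m09s", "v3": "44m19s", "v4": "43m59s"}
totals = {name: to_seconds(value) for name, value in baselines.items()}

v1 = totals["v1"]
for name, total in totals.items():
    print(f"{name}: {total}s ({pct_change(v1, total):+.1f}% vs v1)")

# Absolute saving between the first and the latest baseline.
saved = v1 - totals["v4"]
print(f"saved per run: {saved // 60}m{saved % 60:02d}s")
```

Rounded to whole percentages, the results match the figures quoted above, including the 15m56s saved per run.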
Key Observations (v4)
What Merged
- -Dfast → -Dlocal with plugin skip properties
- .mvn/maven.config with -T 1C default parallel
- [...] (-Daether.syncContext.named.factory=noop)
- [...] -T 1C (force -T1 in Makefile)
Open PRs
- [...] (-Dskip.zip.assembly=true)
Note on #8439 — Maven 3.9+ Lock Contention
Maven 3.9+ introduced maven-resolver-named-locks for file-based resolver locking. Under -T parallel builds, multiple threads contend for these locks and the acquisition itself deadlocks, producing "Could not acquire lock(s)" failures. This is not a timing fluke: PR #8435 hit it in 2 of 3 CI runs, and main branch runs also fail intermittently.
Fix: -Daether.syncContext.named.factory=noop disables the file-based locking entirely. This is safe in CI because each GitHub Actions runner has its own .m2/repository, so there is no cross-process contention. The locking was designed for multi-process scenarios (e.g., two Maven instances sharing a local repo), not for multiple threads within a single process.
Scope: Applied to all 9 CI workflow files (verify, operator, release, reusable-docker-build), but only on Maven commands that use parallel threads. NOT added to .mvn/maven.config, to avoid affecting local development. Previously validated in PRs #8417 and #8430, but there it was bundled with cache-splitting changes that were rejected; this PR applies the fix standalone.
Note on #8442 — Scalpel Affected-Module Detection
Scalpel (v0.3.7) is a Maven core extension that analyzes git diffs against the POM dependency graph to skip tests on modules unaffected by a PR's changes. It needs no static mapping files: the analysis is fully dynamic, derived from the POM structure. Used by Quarkus and Apache Camel.
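The general idea of affected-module detection can be sketched as a reverse walk over a dependency graph. This is a simplified illustration, not Scalpel's actual implementation. The graph below is hypothetical: only the app to serdes test dependency comes from this document, and the other edges, the function names, and the example file paths are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical, heavily simplified module graph: module -> modules it depends on.
# Only the app -> serdes (test scope) edge is taken from this document; the
# remaining edges are illustrative assumptions.
DEPENDS_ON: Dict[str, List[str]] = {
    "common": [],
    "serdes": ["common"],
    "java-sdk": ["common"],
    "app": ["common", "serdes"],
}


def owning_module(path: str) -> Optional[str]:
    """Map a changed file to a module using its top-level directory."""
    top = path.split("/", 1)[0]
    return top if top in DEPENDS_ON else None


def affected_modules(changed_files: List[str]) -> Set[str]:
    """Return the changed modules plus everything that transitively depends on them."""
    # Invert the graph: module -> modules that depend on it.
    dependents: Dict[str, List[str]] = {module: [] for module in DEPENDS_ON}
    for module, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(module)

    seeds = {m for m in map(owning_module, changed_files) if m is not None}
    affected, queue = set(seeds), deque(seeds)
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_modules(["serdes/kafka/avro-serde/pom.xml"])))  # ['app', 'serdes']
print(sorted(affected_modules(["app/src/main/java/Foo.java"])))       # ['app']
```

In this toy graph a serdes-only change still marks app as affected, while an app-only change leaves every other module untouched. The sketch deliberately ignores cases a real tool has to handle, such as build-configuration changes under .mvn/ or property changes in pom.xml.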
Local stress testing (8 edge cases):
- .mvn/** change
- serdes/kafka/avro-serde/ only
- common/ change
- app/src/ only
- pom.xml comment-only
- pom.xml version property bump
- -Dscalpel.enabled=false
- -Dtest= filter + Scalpel
Architectural finding: app test-depends on all serdes modules (<scope>test</scope>) for embedded @QuarkusTest integration tests (e.g., AvroSerdeTest, ProtobufSerdeTest). These 11 test classes start an embedded Quarkus registry and test the full serialize→register→deserialize roundtrip. This is a standard Quarkus testing pattern: fast (seconds, no Docker) and complementary to the container-based integration-tests/.
This means Scalpel correctly identifies app as affected by any serdes change, so all 6 app shards still run for serdes-only PRs. The savings are concentrated elsewhere:
- app/src/ only
- java-sdk/ only
- common/
- serdes/ only
- pom.xml version bump
Recommendation: Keep the current architecture. Restructuring (moving serdes tests to integration-tests/ or a separate module) would sacrifice fast embedded testing for CI savings, which inverts the priorities. The Scalpel savings from app-only and java-sdk-only PRs are valuable enough on their own.
Planned
Tried & Rejected (with evidence)
- [...] .locks/ directory). Fixed with -Daether.syncContext.named.factory=noop, after which the build passes. BUT: not worth implementing. Research across real incidents (Angular cache poisoning $31K bounty, Ultralytics crypto miner, TanStack 84 malicious npm versions) shows cache splitting is motivated by: (1) security with pull_request_target triggers, (2) cache budget exhaustion at 100+ PRs, (3) ~30s save time per PR. At our scale (~20 PRs, standard pull_request), none of these justify the complexity. Closed as won't-do.
- [...] central to google-maven-central forces Maven to re-verify all cached artifacts. Build went from 3m43s to 13m53s (+274%). Works for projects that use the mirror from the start (gRPC Java) but causes cache thrashing when adopted mid-flight.
- [...] http.response.status_code and marks 5xx as StatusCode.ERROR. Verified by Carles with test in #8415.
Visual Summary
Shard Timing Progression (v1 → v2 → v3 → v4)
Each shard has four bars, in the order v1, v2, v3, v4.
---
config:
  xyChart:
    width: 800
    height: 400
---
xychart-beta
  title "Unit Test Shard Duration (seconds)"
  x-axis ["Build", "app-rest", "app-sql", "kafkasql", "app-gitops", "app-k8sops", "app-other", "non-app"]
  y-axis "Duration (seconds)" 0 --> 850
  bar [451, 418, 369, 465, 325, 270, 829, 468]
  bar [269, 352, 349, 430, 296, 266, 789, 376]
  bar [218, 319, 270, 362, 217, 177, 775, 319]
  bar [223, 322, 296, 356, 216, 160, 771, 295]
Cumulative Time Saved
---
config:
  xyChart:
    width: 800
    height: 400
---
xychart-beta
  title "Total Shard Time (seconds) — Lower is Better"
  x-axis ["v1 Baseline", "#8341 checkstyle", "#8343 commitid", "#8353 src/javadoc", "#8347 cli-javadoc", "Jun 25 wave", "Jun 29 wave"]
  y-axis "Total shard time (seconds)" 2400 --> 3700
  line [3595, 3525, 3490, 3129, 3129, 2659, 2639]
Where Time Is Still Spent (v4)
pie title Current CI Time Distribution (v4 — 43m59s total)
  "app-other (12m51s)" : 771
  "app-kafkasql (5m56s)" : 356
  "app-rest (5m22s)" : 322
  "app-sql (4m56s)" : 296
  "non-app (4m55s)" : 295
  "Build (3m43s)" : 223
  "app-gitops (3m36s)" : 216
  "app-k8sops (2m40s)" : 160
Last updated: 2026-07-02 (#8439 and #8448 merged, #8435 rebased). 14 PRs merged total. This is a living document; comments and suggestions welcome.