Maven CI Optimization Techniques — Cross-Project Comparison (44 OSS Projects) #8365
paoloantinori started this conversation in Ideas
Maven CI Optimization Techniques — Cross-Project Comparison
Exhaustive research across 44 major open-source Java projects to identify CI build optimization techniques, who uses them, and what Apicurio Registry should adopt next. All claims verified against actual repository source code (June 2026).
Related: Discussion #8364 — Timing Baselines
Projects Surveyed (44 total)
Original 17 Projects
Expanded Survey (27 additional projects)
Major Maven/Gradle Projects (verified):
K8s Operator Testing (24 operators analyzed):
Technique Comparison Matrices
Plugin Skip Techniques
Build Parallelism
Affected-Module Detection
Caching and Artifact Transfer
New: gRPC Java uses maven-central.storage-download.googleapis.com ("less flaky than mavenCentral()")
Reliability and Flaky Tests
Gradle-Specific Patterns (for reference)
Observability
K8s Operator CI Patterns
Key finding: Apicurio's operator is one of only 3/24 operators still using minikube. 96% have moved to kind for faster, more reliable CI.
Gap Analysis
What we already do that others don't
What most projects do that we don't yet
Techniques only one project does (innovation opportunities)
Techniques nobody does yet
What We Have Already Applied
Cumulative: 26% reduction in total shard time (59m55s → 44m19s, 15m36s saved per CI run)
Still Open
What We Plan to Do Next
Known Issues & FAQ — Maven CI Pitfalls
Issues we encountered (and resolved) that affect any Maven project. Documented here so other teams don't have to rediscover them.
Maven 3.9+ Lock Contention with -T Parallel Builds
Symptom: Random "Could not acquire lock(s)" / "java.lang.IllegalStateException: Could not acquire lock(s)" failures. Non-deterministic; may pass on retry.
Affected versions: Maven 3.9.0+ (introduced maven-resolver-named-locks)
Root cause: Maven 3.9+ added file-based resolver locking (maven-resolver-named-locks) to prevent concurrent modification of the local .m2/repository. Under -T (parallel module builds), multiple threads within a single Maven process contend for these file locks. The lock acquisition itself can deadlock: the locking was designed for multi-process access (e.g., two separate mvn invocations sharing a repo), not multi-thread access within one process.
Fix: Add -Daether.syncContext.named.factory=noop to Maven commands that use -T. This disables the file-based locking entirely.
Why it's safe in CI: Each CI runner has its own .m2/repository. There's no cross-process contention, so the scenario the locking was designed to protect against doesn't exist. Disabling it in CI is a no-op from a safety perspective.
Where NOT to apply: Local development environments where a developer might run two Maven builds in parallel against the same ~/.m2/repository. Don't add this to .mvn/maven.config; keep it in CI workflow commands only.
Our evidence: PR #8435 hit this in 2 of 3 CI runs (failed in different modules each time: Registry :: Operator, Registry :: Common). Main branch also affected. Fix validated in PRs #8417, #8430, and #8439.
Reference: MRESOLVER-392, Maven Resolver Named Locks docs
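One way to keep the flag out of shared config is to append it in the CI step only when a parallel build is requested. The sketch below is illustrative rather than the project's actual workflow: the helper name maven_ci_args is hypothetical, and it assumes the runner sets the CI environment variable to true (as GitHub Actions does). Only -T and -Daether.syncContext.named.factory=noop come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: echo the given Maven arguments, appending the noop
# named-lock factory only when running in CI AND a -T option is present.
maven_ci_args() {
  local uses_parallel=false arg
  for arg in "$@"; do
    case "$arg" in
      -T*|--threads) uses_parallel=true ;;
    esac
  done
  if [ "${CI:-}" = "true" ] && [ "$uses_parallel" = true ]; then
    set -- "$@" "-Daether.syncContext.named.factory=noop"
  fi
  printf '%s\n' "$*"
}

# Usage in a workflow step (word splitting of the output is intended):
#   mvn $(maven_ci_args -T 1C -B verify)
```

Because the flag is only added when CI is true, a developer running the same script locally keeps the default locking.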
GCS Maven Central Mirror — Repository ID Cache Thrashing
Symptom: Build time increases dramatically (+274%) after adding a Maven Central mirror.
Root cause: Maven caches artifact metadata per repository ID. Changing the repository ID from central to a mirror-specific ID (e.g., google-maven-central) forces Maven to re-verify ALL cached artifacts against the new ID. Every artifact gets a fresh HTTP HEAD request.
Who it affects: Projects adopting a mirror mid-flight with an existing Maven cache. Projects that use the mirror from day one (e.g., gRPC Java) are fine, because their cache is built with the mirror's ID from the start.
Workaround: If you must add a mirror, use <mirrorOf>central</mirrorOf> with <id>central</id> to preserve the repository ID. Or accept the one-time cache rebuild cost.
Surefire forkCount: Not Always Faster
Symptom: Increasing forkCount from 1 to 2+ makes tests slower, not faster.
Root cause: JVM fork startup (~300ms per fork) dominates when individual tests are fast (~1ms). The parallelism gain only exceeds fork overhead when avg_test_duration × test_count > fork_startup × forkCount.
Rule of thumb: forkCount > 1 helps when tests take seconds (like Flink, ZooKeeper). For millisecond-range tests (like Apicurio's unit tests), stick with forkCount=1.
Maven Daemon (mvnd): Local Dev Only, Not CI
Symptom: mvnd is slower than plain mvn on CI first builds.
Root cause: mvnd's speed comes from JVM warmup, JIT retention, and classloader caching across repeated builds. CI containers are ephemeral, so no daemon survives between runs. Cold daemon start adds ~20% overhead vs plain mvn.
Who it helps: Developers iterating locally (10-26x speedup on warm daemon). Apache Camel Quarkus (1,336 modules) is the flagship user.
Co-creator's take: Peter Palaga (mvnd co-creator): "I see little potential for mvnd in the area of continuous integration."
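One way to act on this split is to choose the launcher per environment. The sketch below assumes the CI environment variable is set to true on the runner (as GitHub Actions does) and that mvnd, if installed, is on PATH. The helper name select_maven_cmd is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the Maven launcher for the current environment.
# CI runners get plain mvn; local shells use mvnd when it is on PATH.
select_maven_cmd() {
  if [ "${CI:-}" = "true" ]; then
    echo mvn
  elif command -v mvnd >/dev/null 2>&1; then
    echo mvnd
  else
    echo mvn
  fi
}

# Usage: "$(select_maven_cmd)" -B verify
```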
GitHub Actions Lifecycle Bot Race — CI Silently Skips on PR Creation
Symptom: PR CI shows all jobs as "skipping" despite correct labels and path filters. The Decide step passes but every downstream job is skipped.
Root cause: When a PR is created with a label like orchestrator/disabled, two events fire in rapid succession: (1) the opened event triggers Verify correctly, (2) the lifecycle bot adds the lifecycle/new label, and the resulting labeled event triggers a second Verify run. The second run's Decide step sees a lifecycle label and skips everything. GitHub overwrites the first run's results with the second run's "all skipped" results.
Fix: After creating a PR, wait 15 seconds and check gh pr checks <number>. If everything shows "skipping", force-push (git commit --amend --no-edit && git push --force-with-lease). The synchronize event from the force-push triggers a clean run.
How to detect lifecycle-skip vs path-skip: Check the Decide step logs. lifecycle-ready=false means the bot race happened. lifecycle-ready=true with specific run-*=false means path filters correctly skipped irrelevant jobs.
Our evidence: Hit on PRs #8375, #8345, #8350, #8435, #8439 (every single PR creation). We now have a skill (pr-lifecycle) and hook to catch this automatically.
GitHub Actions Silently Skips PRs with Merge Conflicts
Symptom: PR's Verify workflow never triggers — no run appears at all, not even a "skipped" one. Force-pushes, label changes, close/reopen all fail to trigger CI.
Root cause: GitHub Actions silently drops pull_request events for PRs in CONFLICTING state. No error, no log, no "skipped" run; it simply doesn't fire.
Fix: Before any CI debugging, check gh pr view <number> --json mergeable --jq '.mergeable'. If CONFLICTING, rebase the branch. Only investigate labels, queue, or workflow config if the PR is MERGEABLE.
Our evidence: Spent hours debugging #8353: tried label changes, force pushes, empty commits, close/reopen, investigated zombie queued runs. The actual problem was a one-line merge conflict in verify-unit-tests.yaml. A simple rebase fixed it instantly.
CI Service Readiness: Retry Loops, Not Fixed Sleep
Symptom: CI steps that start a Docker container then immediately test against it fail intermittently with connection refused or timeout.
Root cause: sleep N followed by a single healthcheck is a race condition. On slow runners, the fixed sleep isn't enough. On fast runners, you're wasting time waiting for a service that's already up.
Fix: Always use a retry loop:
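A minimal sketch of such a loop, assuming a hypothetical helper named wait_for that retries any readiness command. The URL, attempt count, and delay in the usage comment are placeholders, not values taken from the project's workflows.

```shell
#!/usr/bin/env bash
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
wait_for() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "not ready after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (placeholder URL): poll a health endpoint, up to 30 tries, 2s apart.
#   wait_for 30 2 curl -fsS http://localhost:8080/health
```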
This is both faster (succeeds immediately when ready) and more reliable (waits longer if needed).
CI-Only Flags — Don't Pollute Shared Config
Symptom: Developers report broken local builds after a CI optimization merges.
Root cause: Adding CI-specific flags to .mvn/maven.config or pom.xml profiles affects ALL builds, including local developer environments. Example: adding -s .mvn/settings.xml to maven.config silently overrides every developer's ~/.m2/settings.xml, breaking custom repos, credentials, proxies, and private mirrors.
Rule: CI-only changes go in workflow YAML files (command-line flags, env: variables). Shared config (.mvn/maven.config, pom.xml profiles) must be safe for all environments. If a change has developer-facing tradeoffs, document them explicitly in the PR and get team buy-in.
Scalpel Affected-Module Detection: Test-Scope Dependencies Limit Savings
Symptom: Scalpel (or GIB) is integrated for affected-module test skipping, but most PRs still run nearly all tests because the main application module test-depends on many library modules.
Root cause: In Quarkus-based projects, @QuarkusTest integration tests often live in the app module and import library modules (serdes, SDKs, schema utilities) in <scope>test</scope>. Scalpel correctly treats test-scope dependencies as "affected": a change to avro-serde means app's serdes tests might behave differently, so app tests must run.
Impact on savings: (table not preserved; the surviving row fragments are "app/src/ only (leaf module)" and "common/)")
Could you restructure? Moving @QuarkusTest serdes tests from app/src/test/ to a separate module or to integration-tests/ would decouple app from serdes, but you lose fast embedded testing (seconds, no Docker). This trades developer productivity for CI savings, which is usually the wrong tradeoff.
Rule: Before integrating Scalpel, audit your app module's test-scope dependencies. The savings are real but concentrated in leaf-module and independent-module PRs, not in library PRs that the app tests against. Set expectations accordingly.
Our evidence: Stress-tested 8 scenarios. app/src/ change → 1/41 tested. serdes/ change → 2/41 in non-app shard but all app shards still run. common/ change → 30/41 tested (11 independent modules correctly skipped). See PR #8442 for full results.
Parallel Maven Builds Break Operator Integration Tests: Resource Exhaustion
Symptom: Adding -T 1C (parallel threads) to .mvn/maven.config causes Kubernetes operator integration tests to fail with ConditionTimeoutException: deployments never become ready, and tests time out after ~120 seconds.
Root cause (corrected after investigation): The failure is caused by resource exhaustion, not shared Java memory state. The operator Makefile runs -pl controller -am, so only ONE module's tests execute and cross-module test parallelism is not the issue. Under -T 1C, 8 parallel Maven threads perform Quarkus augmentation and compilation while Minikube pods are also running. The CPU/memory pressure on the CI runner causes pods to start slower, exceeding the awaitility timeout.
Initial (incorrect) hypothesis: We initially attributed the failure to shared static fields in the test base class (ITBase) and CDI singleton conflicts between concurrent test modules. While these ARE code quality issues worth fixing, they're not the cause of the CI failure, because the modules don't actually run tests concurrently under -pl controller -am.
Fix: Override -T 1C with -T1 in the operator Makefile to force single-threaded operator builds. This reduces resource pressure during test execution. The compilation speedup from parallelism on 3 operator modules (~2-3 seconds) isn't worth the resource contention.
Code quality improvement (separate): We also prototyped converting ITBase from static to @TestInstance(PER_CLASS) instance lifecycle (PR #8452). This eliminates unnecessary static mutable state and future-proofs for intra-module test parallelism, but it does not fix the CI failure. That's a resource problem, not a memory sharing problem.
Lesson for other projects: When operator tests fail under parallel Maven builds, check whether the failure is:
(1) Resource exhaustion on the runner: address with -T1 or resource limits
(2) Shared test state: address with @TestInstance(PER_CLASS), namespace isolation, @ResourceLock
Misdiagnosing (2) when the actual cause is (1) leads to unnecessary refactoring that doesn't fix the problem.
Our evidence: 5 consecutive main branch failures (June 29–July 2). Fixed by PR #8448 (-T1 override). Root cause analysis in issue #8447. Code quality prototype in PR #8452.
Cross-Project Technique Adoption: Measure Before Copying
Symptom: A CI optimization that works great for Project A makes things worse (or adds pointless complexity) for Project B.
Root cause: Techniques are designed for specific constraints. Maven cache splitting was motivated by cache poisoning security at 100+ PRs and 10GB budget exhaustion; neither applies to a project with 20 PRs. forkCount=2 works when tests take seconds (Flink) but hurts when tests take milliseconds. GCS mirrors work when adopted from day one but cause cache thrashing mid-flight.
Rule: Before implementing: ask "What problem does this solve? Do we have that problem?" Check the original PR/issue, where the motivation is recorded. Compare scale. If the benefit is marginal (<1 minute), the implementation must be trivially simple. Document negative results to prevent future re-attempts.
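As one example of measuring before copying, the forkCount rule of thumb from the Surefire section (avg_test_duration × test_count > fork_startup × forkCount) can be checked with rough arithmetic before touching the build. This is a minimal sketch: the function name and the test count of 500 are made up for illustration, while the ~1ms test duration and ~300ms fork startup are the figures quoted in that section. Treat the result as a first filter only; timing a real run still decides.

```shell
#!/usr/bin/env bash
# forks_worthwhile AVG_TEST_MS TEST_COUNT FORK_STARTUP_MS FORK_COUNT
# Applies the document's rule of thumb with integer millisecond arithmetic:
#   avg_test_duration * test_count > fork_startup * forkCount
forks_worthwhile() {
  local avg_test_ms=$1 test_count=$2 fork_startup_ms=$3 fork_count=$4
  local test_time_ms=$((avg_test_ms * test_count))
  local fork_overhead_ms=$((fork_startup_ms * fork_count))
  if [ "$test_time_ms" -gt "$fork_overhead_ms" ]; then
    echo "try forkCount=${fork_count} and measure (${test_time_ms}ms of tests vs ${fork_overhead_ms}ms of fork startup)"
  else
    echo "keep forkCount=1 (${test_time_ms}ms of tests vs ${fork_overhead_ms}ms of fork startup)"
  fi
}

forks_worthwhile 1 500 300 2      # 500 hypothetical tests of ~1ms each
forks_worthwhile 2000 500 300 2   # 500 hypothetical tests of ~2s each
```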
Verification: All claims verified against actual repository source code and CI evidence (July 2026). This is a living document — comments, corrections, and suggestions welcome.