Only GC when needed. Reduce allocs from mechanics. Add PrecompileTools workload. (#419)
Merged
IanButterworth merged 9 commits into main on Apr 15, 2026
Conversation
Force-pushed from c67bdf6 to e9fccd4
- Make `Benchmark` parametric (`Benchmark{F,Q}`) so `samplefunc` and
`quote_vals` have concrete types, eliminating dynamic dispatch and
boxing on every sample call
- Skip `gcscrub()` before `gctrial`/`gcsample` when the previous
sample (or warmup) reported zero allocations — nothing to collect
- Pre-allocate `Trial` vectors with `sizehint!` based on the first
real sample time, avoiding repeated heap growth and GC churn from
the harness itself during the run
- Add a test asserting the harness itself reports zero allocations for
a zero-allocation benchmark
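For illustration, a minimal sketch of the parametric `Benchmark` the first bullet describes (the field names come from the commit message; the constructor and body are assumptions, not the package's code):

```julia
# F and Q are fixed to concrete types at construction, so calling
# b.samplefunc dispatches statically: no dynamic dispatch, no boxing.
struct Benchmark{F,Q}
    samplefunc::F
    quote_vals::Q
end

take_sample(b::Benchmark) = b.samplefunc(b.quote_vals)

b = Benchmark(vals -> first(vals) + 1, (41,))
take_sample(b)  # 42, resolved without a dynamic dispatch
```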
Force-pushed from e9fccd4 to e4ed166
Revert `Benchmark` to non-parametric (easier to pass around) and use function barriers (`_run_inner`, `_lineartrial_inner`) so Julia specializes the sampling loops on concrete `samplefunc`/`quote_vals` types without parameterizing the struct.

- Skip GC scrub when warmup/sample reported zero allocations
- Temporarily set `evals=1` for warmup instead of allocating new `Parameters`
- Use explicit `push!(trial, s[1], s[2], s[3], s[4])` instead of `s[1:(end-1)]...` to avoid intermediate tuple allocation
- `resize!` instead of slice-copy in `_lineartrial_inner`

Reduces steady-state allocations from ~102K to ~96 per benchmark run.
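A minimal sketch of the function-barrier pattern this commit switched to (`_run_inner` is named in the commit; the rest is an assumption):

```julia
struct Benchmark
    samplefunc::Function  # abstract field type: easy to pass around
    quote_vals::Any
end

# Barrier: the fields become arguments, so Julia compiles _run_inner for
# their concrete types and the loop runs without per-iteration dispatch.
# The `::F where {F}` spelling forces specialization on the function arg.
function _run_inner(samplefunc::F, quote_vals, n) where {F}
    acc = 0
    for _ in 1:n
        acc += samplefunc(quote_vals)
    end
    return acc
end

_run(b::Benchmark, n) = _run_inner(b.samplefunc, b.quote_vals, n)

b = Benchmark(vals -> first(vals) + 1, (41,))
_run(b, 10)  # 420
```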
Force-pushed from e7855fe to 024bfbd
Member (Author) commented:
Currently, running the "scalar" and "tuple" BaseBenchmarks takes 28s with this branch and 18s with main... so maybe the compilation overhead is too expensive.
The samplefunc is stored as `Function` (abstract type), so every
call returned a heap-allocated tuple through dynamic dispatch.
With ~10K samples per benchmark, this was ~10K allocs per run.
Now the caller passes a pre-allocated `Ref{SampleResult}` that the
samplefunc writes into, reducing per-benchmark allocs from ~10K
to ~12 (all structural: the `Parameters` copy, the `Trial` and its vectors, etc.).
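A sketch of the write-through shape described here, assuming a simple isbits `SampleResult` (the field names and body are assumptions, not the package's actual definition):

```julia
struct SampleResult        # isbits, so assignment into a Ref stores inline
    time::Float64
    gctime::Float64
    memory::Int
    allocs::Int
end

# The samplefunc fills a Ref the caller owns instead of returning a tuple,
# so nothing escapes the call and no per-sample heap allocation occurs.
function samplefunc!(result::Base.RefValue{SampleResult})
    t0 = time_ns()
    x = sum(1:1000)                     # stand-in for the user's expression
    elapsed = Float64(time_ns() - t0)
    result[] = SampleResult(elapsed, 0.0, 0, 0)
    return x
end

result = Ref(SampleResult(0.0, 0.0, 0, 0))  # allocated once, reused per sample
samplefunc!(result)
result[].time  # the measurement, read back by the harness
```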
IanButterworth commented on Apr 11, 2026:
```diff
 end

-function Base.show(io::IO, group::BenchmarkGroup)
+function Base.show(@nospecialize(io::IO), group::BenchmarkGroup)
```
Member (Author) replied:
This is to make sure precompilation covers `IOContext{REPL.LimitIO{TTY}}`. Alternatively we'd make an extension on REPL just to precompile it... which seems excessive.
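A self-contained sketch of what `@nospecialize` buys in this situation (the `Group` and `render` names are stand-ins, not the package's code):

```julia
struct Group end  # stand-in for BenchmarkGroup

# With @nospecialize, Julia compiles one method body for any IO argument,
# so the version compiled during the precompile workload is reused for
# REPL IO types (e.g. IOContext{REPL.LimitIO{TTY}}) that can't be
# constructed at precompile time.
function render(@nospecialize(io::IO), g::Group)
    print(io, "Group(...)")
end

render(stdout, Group())
render(IOBuffer(), Group())  # same compiled body, no fresh specialization
```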
Member (Author) commented:

There's support here, I'm happy and a few different agents are too, so I'll go ahead and release a minor bump.
Reduces harness overhead for benchmarks that don't allocate (the common case), and ships a precompile workload so the first `@benchmark` is fast.

Changes
Sampling hot path (`src/execution.jl`)
- The samplefunc writes into a pre-allocated `Ref{SampleResult}` instead of returning a 5-tuple — eliminates heap-allocating the return tuple on every sample (the samplefunc is stored as `Function`, so returns go through dynamic dispatch)
- Skip `gcscrub()` before each sample when the previous sample reported zero allocations — nothing to collect (sketched after this list)
- Skip `gcscrub()` before the trial when the warmup reported zero allocations
- `sizehint!` the `Trial` vectors after the first sample, estimating the number of remaining samples from elapsed time
- Don't allocate a new `Parameters` for warmup; temporarily mutate `evals` and restore it
- `resize!` instead of slice-copy (`estimates[1:completed]`) in `_lineartrial`
- `push!(trial, s[1], s[2], s[3], s[4])` instead of `s[1:(end-1)]...` to avoid intermediate tuple allocation
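A hedged sketch of the GC-skip logic from the list above (the loop shape and names like `take_sample!` are assumptions; `gcscrub` in BenchmarkTools is essentially a `GC.gc()` wrapper):

```julia
gcscrub() = GC.gc()  # stand-in for BenchmarkTools.gcscrub

function sample_loop!(take_sample!, nsamples, warmup_allocs)
    prev_allocs = warmup_allocs
    for _ in 1:nsamples
        # Only scrub when the previous sample actually allocated; for the
        # common zero-allocation benchmark this skips every GC pass.
        prev_allocs == 0 || gcscrub()
        prev_allocs = take_sample!()
    end
end

sample_loop!(() -> 0, 100, 0)  # zero-alloc benchmark: gcscrub never runs
```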
Result capture (`src/execution.jl`)
- New `capture_result` kwarg to `_run` — when `false` (the `run(::Benchmark)` path), the extra samplefunc call to capture the user's return value is skipped entirely
- `run_result`/`@btime`/`@btimed` still capture the return value via one additional call at the end
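A sketch of the `capture_result` split (the kwarg and `_run` names come from the PR; the body is an assumption):

```julia
# Timing samples never pay to capture the user's value; callers that need
# it (run_result, @btime, @btimed) request one extra call at the end.
function _run(samplefunc, nsamples; capture_result::Bool=false)
    for _ in 1:nsamples
        samplefunc()                      # timing-only samples
    end
    return capture_result ? samplefunc() : nothing
end

_run(() -> sin(1.0), 100)                             # run(::Benchmark) path
val = _run(() -> sin(1.0), 100; capture_result=true)  # run_result path
```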
PrecompileTools (`src/BenchmarkTools.jl`, `Project.toml`)
- A `@compile_workload` runs `@benchmark 1+1` and `show()` at precompile time
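A sketch of a workload of this kind, as it might sit inside the package module (the expressions beyond `@benchmark 1+1` and `show` are assumptions):

```julia
# As it might appear in src/BenchmarkTools.jl (sketch, not the actual code;
# inside the package, @benchmark is already in scope):
using PrecompileTools

@compile_workload begin
    # Runs during precompilation; the methods it hits are cached into the
    # package image, so the first real @benchmark call is fast.
    t = @benchmark 1 + 1
    show(devnull, MIME("text/plain"), t)
end
```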
CI (`.github/workflows/CI.yml`)
- Add `concurrency`/`cancel-in-progress` to avoid redundant CI runs
Tests (`test/ExecutionTests.jl`)
- Assert `gcscrub` is absent from the profile of a zero-allocation benchmark (GC skip works)
- Assert `gcscrub` is present when benchmarking `Ref(1)` (GC still fires when there are allocations)
- Assert the harness itself reports zero allocations when benchmarking `sin(1)`
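A hedged sketch of an assertion in the spirit of the zero-allocation test (not the actual test code; the threshold is an assumption):

```julia
using BenchmarkTools, Test

b = @benchmarkable sin(1)
run(b)                               # warm up the harness itself
allocs = Base.@allocations run(b)    # allocation count of a full harness run
@test allocs < 1000                  # steady state should be ~dozens, not ~100K
```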
Comparison vs main (Julia 1.12, Apple Silicon)
- Single benchmark (`@benchmark foo($1)`)
- BaseBenchmarks.jl (scalar + tuple suites, `tune=false`)

The steady-state improvement comes from eliminating per-sample heap allocations (the `Ref` write-through) and skipping GC scrubs that were collecting nothing. The first-run improvement is from the PrecompileTools workload. The BaseBenchmarks speedup is dominated by the GC skip — most benchmarks in these suites don't allocate, so the old code was spending >90% of its time in unnecessary `gcscrub()` calls.

With help from Claude.