Only GC when needed. Reduce allocs from mechanics. Add PrecompileTools workload.#419

Merged
IanButterworth merged 9 commits into main from ib/skip_unnecessary_gc
Apr 15, 2026
Conversation

@IanButterworth
Member

@IanButterworth IanButterworth commented Apr 10, 2026

Reduces harness overhead for benchmarks that don't allocate (the common case), and ships a precompile workload so the first @benchmark is fast.

Changes

Sampling hot path (src/execution.jl)

  • Write sample results into a Ref{SampleResult} instead of returning a 5-tuple — eliminates heap-allocating the return tuple on every sample (the samplefunc is stored as Function, so returns go through dynamic dispatch)
  • Skip gcscrub() before each sample when the previous sample reported zero allocations — nothing to collect
  • Skip gcscrub() before the trial when the warmup reported zero allocations
  • sizehint! the Trial vectors after the first sample, estimating the number of remaining samples from elapsed time
  • Avoid allocating a new Parameters for warmup; temporarily mutate evals and restore
  • Use resize! instead of slice-copy (estimates[1:completed]) in _lineartrial
  • Use explicit push!(trial, s[1], s[2], s[3], s[4]) instead of s[1:(end-1)]... to avoid intermediate tuple allocation
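
The write-through pattern in the first bullet can be sketched as below. The names `SampleResult` and `samplefunc!` are illustrative stand-ins, not the PR's exact source; the measurement calls (`Base.gc_num`, `Base.GC_Diff`, `Base.gc_alloc_count`) are real Base API:

```julia
# Sketch: returning a tuple through an abstractly-typed `Function` field
# heap-allocates on every call; writing into a pre-allocated mutable
# result object avoids that per-sample allocation.
mutable struct SampleResult           # illustrative stand-in
    time::Float64
    gctime::Float64
    memory::Int
    allocs::Int
end

# Writes measurements into `out` instead of returning a tuple, so the
# dynamically-dispatched call allocates nothing per sample.
function samplefunc!(out::SampleResult, evals::Int)
    stats = Base.gc_num()
    t0 = time_ns()
    for _ in 1:evals
        sin(1.0)                      # stands in for the benchmarked expression
    end
    elapsed = time_ns() - t0
    diff = Base.GC_Diff(Base.gc_num(), stats)
    out.time   = elapsed / evals
    out.gctime = diff.total_time / evals
    out.memory = diff.allocd
    out.allocs = Base.gc_alloc_count(diff)
    return nothing
end
```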

Result capture (src/execution.jl)

  • Add capture_result kwarg to _run — when false (the run(::Benchmark) path), the extra samplefunc call to capture the user's return value is skipped entirely
  • run_result / @btime / @btimed still capture the return value via one additional call at the end
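
The gating described above can be sketched as follows; `_run_sketch` is a hypothetical simplification, not the actual `_run` implementation, but it shows how the `capture_result` kwarg avoids one extra call on the plain `run` path:

```julia
# Illustrative sketch of gating result capture behind a keyword
# (`capture_result` mirrors the PR's kwarg; the body is not the real
# BenchmarkTools sampling loop).
function _run_sketch(f; capture_result::Bool=false)
    for _ in 1:100
        f()                            # timing samples; return values discarded
    end
    # Only pay for one extra call when the caller wants the return value
    # (the run_result / @btime / @btimed path); plain run() skips it.
    return capture_result ? f() : nothing
end
```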

PrecompileTools (src/BenchmarkTools.jl, Project.toml)

  • @compile_workload runs @benchmark 1+1 and show() at precompile time
  • Adds ~2s to precompilation but saves ~1.5s on every first use
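
The workload uses PrecompileTools' `@compile_workload` macro (real API); the exact body below is a guess at the shape, assuming PrecompileTools is installed, not the PR's source:

```julia
# Sketch of a precompile workload in the package module. Code inside
# @compile_workload executes during precompilation, so its compiled
# methods land in the pkgimage and the first real use is fast.
module WorkloadSketch   # hypothetical wrapper module
using PrecompileTools

@compile_workload begin
    # In BenchmarkTools this would be `@benchmark 1 + 1` plus showing
    # the Trial; here a trivial stand-in exercises the same idea.
    x = 1 + 1
    show(IOBuffer(), MIME"text/plain"(), x)
end

end
```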

CI (.github/workflows/CI.yml)

  • Add concurrency / cancel-in-progress to avoid redundant CI runs
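
The standard GitHub Actions pattern for this looks like the fragment below (the PR's exact group key may differ):

```yaml
# Cancel superseded runs of the same workflow on the same ref
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```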

Tests (test/ExecutionTests.jl)

  • Assert gcscrub is absent from the profile of a zero-allocation benchmark (GC skip works)
  • Assert gcscrub is present when benchmarking Ref(1) (GC still fires when there are allocations)
  • Assert the harness reports zero allocations for sin(1)
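
The zero-allocation assertion can be checked directly from a `Trial` using BenchmarkTools' public accessors (real API, assuming the package is installed); this is a sketch of the idea, not the PR's test code:

```julia
using BenchmarkTools, Test

# The harness itself should report zero allocations for sin(1)...
t = @benchmark sin(1)
@test minimum(t).allocs == 0

# ...whereas Ref(1) genuinely allocates, so GC scrubbing still fires.
t2 = @benchmark Ref(1)
@test minimum(t2).allocs > 0
```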

Comparison vs main (Julia 1.12, Apple Silicon)

Single benchmark (@benchmark foo($1))

|  | main | PR | speedup |
| --- | --- | --- | --- |
| First run + display | 2.00 s (9.7M allocs, 474 MiB) | 0.50 s (1.5M allocs, 76 MiB) |  |
| Second run (new expr) | 0.40 s (103k allocs, 3.6 MiB) | 0.03 s (1.3k allocs, 230 KiB) | 13× |
| Steady-state | 0.40 s (102k allocs, 3.6 MiB) | 0.03 s (97 allocs, 172 KiB) | 15× |

BaseBenchmarks.jl (scalar + tuple suites, tune=false)

|  | main | PR | speedup |
| --- | --- | --- | --- |
| Wall time | 311 s | 94 s | 3.3× |
| Allocations | 156M (6.0 GiB) | 41M (2.5 GiB) | 3.8× fewer |
| GC time | 93.4% | 86.2% |  |

The steady-state improvement comes from eliminating per-sample heap allocations (the Ref write-through) and skipping GC scrubs that were collecting nothing. The first-run improvement is from the PrecompileTools workload. The BaseBenchmarks speedup is dominated by the GC skip — most benchmarks in these suites don't allocate, so the old code was spending >90% of its time in unnecessary gcscrub() calls.
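
The GC-skip decision described above can be sketched as follows (simplified, illustrative names; `gcscrub` in BenchmarkTools is essentially a full collection):

```julia
# Simplified sketch of the GC-skip logic, not the actual source:
# only call gcscrub() when the previous sample reported allocations.
gcscrub() = GC.gc()

function sample_loop(f, nsamples)
    prev_allocs = typemax(Int)        # force a scrub before the first sample
    for _ in 1:nsamples
        prev_allocs > 0 && gcscrub()  # nothing to collect if the last sample was alloc-free
        before = Base.gc_num()
        f()
        diff = Base.GC_Diff(Base.gc_num(), before)
        prev_allocs = Base.gc_alloc_count(diff)
    end
end
```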

With help from Claude.

@IanButterworth IanButterworth force-pushed the ib/skip_unnecessary_gc branch 3 times, most recently from c67bdf6 to e9fccd4 on April 10, 2026 at 02:41
- Make `Benchmark` parametric (`Benchmark{F,Q}`) so `samplefunc` and
  `quote_vals` have concrete types, eliminating dynamic dispatch and
  boxing on every sample call

- Skip `gcscrub()` before `gctrial`/`gcsample` when the previous
  sample (or warmup) reported zero allocations — nothing to collect

- Pre-allocate `Trial` vectors with `sizehint!` based on the first
  real sample time, avoiding repeated heap growth and GC churn from
  the harness itself during the run

- Add a test asserting the harness itself reports zero allocations for
  a zero-allocation benchmark
@IanButterworth IanButterworth force-pushed the ib/skip_unnecessary_gc branch from e9fccd4 to e4ed166 on April 10, 2026 at 02:45
@IanButterworth IanButterworth marked this pull request as ready for review April 10, 2026 02:48
Revert Benchmark to non-parametric (easier to pass around) and use
function barriers (_run_inner, _lineartrial_inner) so Julia specializes
the sampling loops on concrete samplefunc/quote_vals types without
parameterizing the struct.

- Skip GC scrub when warmup/sample reported zero allocations
- Temporarily set evals=1 for warmup instead of allocating new Parameters
- Use explicit push!(trial, s[1], s[2], s[3], s[4]) instead of
  s[1:(end-1)]... to avoid intermediate tuple allocation
- resize! instead of slice-copy in _lineartrial_inner

Reduces steady-state allocations from ~102K to ~96 per benchmark run.
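
The function-barrier pattern named in that commit can be sketched as below; `_run_inner` mirrors the commit's naming, while the struct and loop body are illustrative:

```julia
# The struct keeps an abstractly-typed field (easy to pass around)...
struct Bench
    samplefunc::Function
end

# ...and the barrier re-dispatches once, so the hot loop is compiled
# against the concrete function type F rather than `Function`.
run_outer(b::Bench, n) = _run_inner(b.samplefunc, n)

function _run_inner(f::F, n) where {F}
    acc = 0
    for _ in 1:n
        acc += f()   # specialized; no per-iteration dynamic dispatch
    end
    return acc
end
```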
@IanButterworth IanButterworth force-pushed the ib/skip_unnecessary_gc branch from e7855fe to 024bfbd on April 10, 2026 at 03:46
@IanButterworth
Member Author

Currently, running the "scalar" and "tuple" BaseBenchmarks suites takes 28 s with this branch and 18 s with main... so maybe the compilation overhead is too expensive.

The samplefunc is stored as `Function` (abstract type), so every
call returned a heap-allocated tuple through dynamic dispatch.
With ~10K samples per benchmark, this was ~10K allocs per run.

Now the caller passes a pre-allocated Ref{SampleResult} that the
samplefunc writes into, reducing per-benchmark allocs from ~10K
to ~12 (all structural: Parameters copy, Trial, vectors, etc).
Comment thread: src/execution.jl (outdated)
IanButterworth and others added 2 commits April 10, 2026 12:48
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Comment thread: src/groups.jl

```diff
 end

-function Base.show(io::IO, group::BenchmarkGroup)
+function Base.show(@nospecialize(io::IO), group::BenchmarkGroup)
```
Member Author

This is to make sure precompilation covers IOContext{REPL.LimitIO{TTY}}. Alternatively we'd make an extension on REPL just to precompile it... which seems excessive.

@IanButterworth
Member Author

There's support here, I'm happy, and a few different agents are too, so I'll go ahead and release a minor bump.

@IanButterworth IanButterworth merged commit c562362 into main Apr 15, 2026
26 checks passed
@IanButterworth IanButterworth deleted the ib/skip_unnecessary_gc branch April 15, 2026 00:38