Only GC when needed. Reduce allocs from mechanics. Add PrecompileTools workload. (#419)
Merged
IanButterworth merged 9 commits into main on Apr 15, 2026
Conversation
Force-pushed from c67bdf6 to e9fccd4
- Make `Benchmark` parametric (`Benchmark{F,Q}`) so `samplefunc` and
`quote_vals` have concrete types, eliminating dynamic dispatch and
boxing on every sample call
- Skip `gcscrub()` before `gctrial`/`gcsample` when the previous
sample (or warmup) reported zero allocations — nothing to collect
- Pre-allocate `Trial` vectors with `sizehint!` based on the first
real sample time, avoiding repeated heap growth and GC churn from
the harness itself during the run
- Add a test asserting the harness itself reports zero allocations for
a zero-allocation benchmark
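For illustration, a minimal sketch of the parametric `Benchmark` the first bullet describes (the field names come from the commit message; the constructor and body are assumptions, not the package's code):

```julia
# F and Q are fixed to concrete types at construction, so calling
# b.samplefunc dispatches statically: no dynamic dispatch, no boxing.
struct Benchmark{F,Q}
    samplefunc::F
    quote_vals::Q
end

take_sample(b::Benchmark) = b.samplefunc(b.quote_vals)

b = Benchmark(vals -> first(vals) + 1, (41,))
take_sample(b)  # 42, resolved without a dynamic dispatch
```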
Force-pushed from e9fccd4 to e4ed166
Revert `Benchmark` to non-parametric (easier to pass around) and use function barriers (`_run_inner`, `_lineartrial_inner`) so Julia specializes the sampling loops on concrete `samplefunc`/`quote_vals` types without parameterizing the struct.

- Skip GC scrub when warmup/sample reported zero allocations
- Temporarily set `evals=1` for warmup instead of allocating new `Parameters`
- Use explicit `push!(trial, s[1], s[2], s[3], s[4])` instead of `s[1:(end-1)]...` to avoid intermediate tuple allocation
- `resize!` instead of slice-copy in `_lineartrial_inner`

Reduces steady-state allocations from ~102K to ~96 per benchmark run.
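A minimal sketch of the function-barrier pattern this commit switched to (`_run_inner` is named in the commit; the rest is an assumption):

```julia
struct Benchmark
    samplefunc::Function  # abstract field type: easy to pass around
    quote_vals::Any
end

# Barrier: the fields become arguments, so Julia compiles _run_inner for
# their concrete types and the loop runs without per-iteration dispatch.
# The `::F where {F}` spelling forces specialization on the function arg.
function _run_inner(samplefunc::F, quote_vals, n) where {F}
    acc = 0
    for _ in 1:n
        acc += samplefunc(quote_vals)
    end
    return acc
end

_run(b::Benchmark, n) = _run_inner(b.samplefunc, b.quote_vals, n)

b = Benchmark(vals -> first(vals) + 1, (41,))
_run(b, 10)  # 420
```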
Force-pushed from e7855fe to 024bfbd
Member (Author) commented:
Currently, running the "scalar" and "tuple" BaseBenchmarks takes 28s with this branch and 18s with main... so maybe the compilation overhead is too expensive.
The samplefunc is stored as `Function` (abstract type), so every
call returned a heap-allocated tuple through dynamic dispatch.
With ~10K samples per benchmark, this was ~10K allocs per run.
Now the caller passes a pre-allocated `Ref{SampleResult}` that the
samplefunc writes into, reducing per-benchmark allocs from ~10K
to ~12 (all structural: the `Parameters` copy, the `Trial` and its vectors, etc.).
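A sketch of the write-through shape described here, assuming a simple isbits `SampleResult` (the field names and body are assumptions, not the package's actual definition):

```julia
struct SampleResult        # isbits, so assignment into a Ref stores inline
    time::Float64
    gctime::Float64
    memory::Int
    allocs::Int
end

# The samplefunc fills a Ref the caller owns instead of returning a tuple,
# so nothing escapes the call and no per-sample heap allocation occurs.
function samplefunc!(result::Base.RefValue{SampleResult})
    t0 = time_ns()
    x = sum(1:1000)                     # stand-in for the user's expression
    elapsed = Float64(time_ns() - t0)
    result[] = SampleResult(elapsed, 0.0, 0, 0)
    return x
end

result = Ref(SampleResult(0.0, 0.0, 0, 0))  # allocated once, reused per sample
samplefunc!(result)
result[].time  # the measurement, read back by the harness
```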
IanButterworth commented on Apr 11, 2026:
```diff
 end

-function Base.show(io::IO, group::BenchmarkGroup)
+function Base.show(@nospecialize(io::IO), group::BenchmarkGroup)
```
Member (Author) replied:
This is to make sure precompilation covers `IOContext{REPL.LimitIO{TTY}}`. Alternatively we'd make an extension on REPL just to precompile it... which seems excessive.
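A self-contained sketch of what `@nospecialize` buys in this situation (the `Group` and `render` names are stand-ins, not the package's code):

```julia
struct Group end  # stand-in for BenchmarkGroup

# With @nospecialize, Julia compiles one method body for any IO argument,
# so the version compiled during the precompile workload is reused for
# REPL IO types (e.g. IOContext{REPL.LimitIO{TTY}}) that can't be
# constructed at precompile time.
function render(@nospecialize(io::IO), g::Group)
    print(io, "Group(...)")
end

render(stdout, Group())
render(IOBuffer(), Group())  # same compiled body, no fresh specialization
```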
Member (Author) commented:

There's support here, I'm happy and a few different agents are too, so I'll go ahead and release a minor bump.
Reduces harness overhead for benchmarks that don't allocate (the common case), and ships a precompile workload so the first `@benchmark` is fast.

Changes
Sampling hot path (`src/execution.jl`)
- The samplefunc writes into a pre-allocated `Ref{SampleResult}` instead of returning a 5-tuple — eliminates heap-allocating the return tuple on every sample (the samplefunc is stored as `Function`, so returns go through dynamic dispatch)
- Skip `gcscrub()` before each sample when the previous sample reported zero allocations — nothing to collect (sketched after this list)
- Skip `gcscrub()` before the trial when the warmup reported zero allocations
- `sizehint!` the `Trial` vectors after the first sample, estimating the number of remaining samples from elapsed time
- Don't allocate a new `Parameters` for warmup; temporarily mutate `evals` and restore it
- `resize!` instead of slice-copy (`estimates[1:completed]`) in `_lineartrial`
- `push!(trial, s[1], s[2], s[3], s[4])` instead of `s[1:(end-1)]...` to avoid intermediate tuple allocation
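A hedged sketch of the GC-skip logic from the list above (the loop shape and names like `take_sample!` are assumptions; `gcscrub` in BenchmarkTools is essentially a `GC.gc()` wrapper):

```julia
gcscrub() = GC.gc()  # stand-in for BenchmarkTools.gcscrub

function sample_loop!(take_sample!, nsamples, warmup_allocs)
    prev_allocs = warmup_allocs
    for _ in 1:nsamples
        # Only scrub when the previous sample actually allocated; for the
        # common zero-allocation benchmark this skips every GC pass.
        prev_allocs == 0 || gcscrub()
        prev_allocs = take_sample!()
    end
end

sample_loop!(() -> 0, 100, 0)  # zero-alloc benchmark: gcscrub never runs
```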
Result capture (`src/execution.jl`)
- New `capture_result` kwarg to `_run` — when `false` (the `run(::Benchmark)` path), the extra samplefunc call to capture the user's return value is skipped entirely
- `run_result`/`@btime`/`@btimed` still capture the return value via one additional call at the end
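A sketch of the `capture_result` split (the kwarg and `_run` names come from the PR; the body is an assumption):

```julia
# Timing samples never pay to capture the user's value; callers that need
# it (run_result, @btime, @btimed) request one extra call at the end.
function _run(samplefunc, nsamples; capture_result::Bool=false)
    for _ in 1:nsamples
        samplefunc()                      # timing-only samples
    end
    return capture_result ? samplefunc() : nothing
end

_run(() -> sin(1.0), 100)                             # run(::Benchmark) path
val = _run(() -> sin(1.0), 100; capture_result=true)  # run_result path
```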
PrecompileTools (`src/BenchmarkTools.jl`, `Project.toml`)
- A `@compile_workload` runs `@benchmark 1+1` and `show()` at precompile time
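A sketch of a workload of this kind, as it might sit inside the package module (the expressions beyond `@benchmark 1+1` and `show` are assumptions):

```julia
# As it might appear in src/BenchmarkTools.jl (sketch, not the actual code;
# inside the package, @benchmark is already in scope):
using PrecompileTools

@compile_workload begin
    # Runs during precompilation; the methods it hits are cached into the
    # package image, so the first real @benchmark call is fast.
    t = @benchmark 1 + 1
    show(devnull, MIME("text/plain"), t)
end
```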
CI (`.github/workflows/CI.yml`)
- Add `concurrency`/`cancel-in-progress` to avoid redundant CI runs
Tests (`test/ExecutionTests.jl`)
- Assert `gcscrub` is absent from the profile of a zero-allocation benchmark (GC skip works)
- Assert `gcscrub` is present when benchmarking `Ref(1)` (GC still fires when there are allocations)
- Assert the harness itself reports zero allocations when benchmarking `sin(1)`
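A hedged sketch of an assertion in the spirit of the zero-allocation test (not the actual test code; the threshold is an assumption):

```julia
using BenchmarkTools, Test

b = @benchmarkable sin(1)
run(b)                               # warm up the harness itself
allocs = Base.@allocations run(b)    # allocation count of a full harness run
@test allocs < 1000                  # steady state should be ~dozens, not ~100K
```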
Comparison vs main (Julia 1.12, Apple Silicon)
- Single benchmark (`@benchmark foo($1)`)
- BaseBenchmarks.jl (scalar + tuple suites, `tune=false`)

The steady-state improvement comes from eliminating per-sample heap allocations (the `Ref` write-through) and skipping GC scrubs that were collecting nothing. The first-run improvement is from the PrecompileTools workload. The BaseBenchmarks speedup is dominated by the GC skip — most benchmarks in these suites don't allocate, so the old code was spending >90% of its time in unnecessary `gcscrub()` calls.

With help from Claude.