bench(kernel): KernelMatmulBench — scalar vs Panama (M5 evidence)#558
Merged
michalharakal merged 1 commit into develop on Apr 28, 2026
Direct Fp32MatmulKernel.matmul JMH harness, sizes 256/512/1024; the provider param toggles ScalarMatmulKernel vs PanamaVectorMatmulKernel. Used to validate the M5 milestone target (Panama ≥1.5× scalar) without entanglement from the rest of the op pipeline.

Local run on JDK 21.0.10 (M-series macOS) clears the target comfortably:

size   scalar     panama     speedup
256    9.454 ms   1.356 ms   6.97x
512    79.68 ms   13.62 ms   5.85x
1024   862.8 ms   118.2 ms   7.30x

Adds skainet-backend-api as a direct dep on the bench module so the JMH source set can see the kernel SPI types, and documents the new bench in docs/.../perf/jvm-cpu.adoc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
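For readers who want to re-derive the speedup column, here is a minimal sketch of the arithmetic: speedup is the scalar time divided by the Panama time, compared against the 1.5× M5 target. The class and method names (`SpeedupCheck`, `speedup`) are hypothetical and not part of the repository; the timings are copied from the local run reported above.

```java
import java.util.Locale;

final class SpeedupCheck {
    // {size, scalar ms, panama ms}, copied from the local run reported in this PR.
    static final double[][] RUNS = {
        {256, 9.454, 1.356},
        {512, 79.68, 13.62},
        {1024, 862.8, 118.2},
    };

    // M5 milestone target: Panama at least 1.5x faster than scalar.
    static final double M5_TARGET = 1.5;

    // Speedup is simply the ratio of the two mean times.
    static double speedup(double scalarMs, double panamaMs) {
        return scalarMs / panamaMs;
    }

    public static void main(String[] args) {
        for (double[] r : RUNS) {
            double s = speedup(r[1], r[2]);
            String verdict = s >= M5_TARGET ? "meets target" : "below target";
            System.out.printf(Locale.ROOT, "%4d  %.2fx  %s%n", (int) r[0], s, verdict);
        }
    }
}
```

All three sizes land well above the target, which matches the speedup values quoted in the commit message.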
This was referenced Apr 28, 2026
Summary
- Adds `KernelMatmulBench` to `:skainet-backends:benchmarks:jvm-cpu-jmh`: a direct `Fp32MatmulKernel.matmul` JMH harness with `provider ∈ {scalar, panama}` and `size ∈ {256, 512, 1024}`.
- Does not change `ctx.ops.matmul` routing; that routing change is the next follow-up.
- Adds `skainet-backend-api` as a direct dep on the bench module so JMH sources can see the SPI types directly.
- Documents the new bench in `docs/.../perf/jvm-cpu.adoc`.

Local run: JDK 21.0.10, M-series macOS
JMH config: `--enable-preview --add-modules jdk.incubator.vector`, 3 warmup × 10 s + 5 measurement × 10 s, 1 fork. Same input seeding as the existing `MatmulBench` so cross-bench comparison is meaningful.

Reference:
`MatmulBench` (full op-level path, BLAS off, vector on) on the same machine clocks 9.74 ms @ 512², about 1.4× faster than this kernel's 13.62 ms @ 512². The gap comes from the cache-blocked tiled implementation in `JvmVectorKernels.matmulFloatBlocked` that the production routing currently calls; the SPI kernel uses a simpler FMA + B^T pack. Closing that gap by porting the tiling into the SPI kernel is a fair follow-up if the production bench numbers regress after the routing change.

Why direct kernel benching
`MatmulBench` exercises `ctx.ops.matmul`, which today still calls `JvmVectorKernels` directly (not the SPI). Until that routing change lands, only this new bench reflects scalar-vs-Panama through the kernel SPI in isolation. Once routing flips, the existing `MatmulBench` will exercise the same provider end-to-end and we can decide whether to keep both benches or fold one in.

Test plan
- `./gradlew :skainet-backends:benchmarks:jvm-cpu-jmh:jmhCompileGeneratedClasses`: compiles cleanly.
- `./gradlew :skainet-backends:benchmarks:jvm-cpu-jmh:jmh -Pjmh.include=KernelMatmulBench`: produces the numbers above.

Follow-ups (still in M5 hopper)
- Route `DefaultCpuOpsJvm.matmul` through `KernelRegistry`.
- `ServiceLoader` auto-discovery for kernel providers.
- Port the cache-blocked tiling into `PanamaVectorMatmulKernel` if/when production routing exposes a regression vs the current `matmulFloatBlocked` path.

🤖 Generated with Claude Code
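For context on the "FMA + B^T pack" approach mentioned in the description, the following is a scalar sketch of the general technique, not the code of `PanamaVectorMatmulKernel` (which is not shown in this PR text). It assumes square row-major `float[]` matrices; the class and method names are hypothetical. Packing B into its transpose turns the strided column walk of the inner loop into a contiguous read, which is the access pattern a SIMD inner loop would want.

```java
import java.util.Arrays;
import java.util.Random;

final class PackedMatmulSketch {

    /** Naive row-major C = A x B for n x n matrices; the k-loop reads B with stride n. */
    static float[] matmulNaive(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float acc = 0f;
                for (int k = 0; k < n; k++) {
                    acc = Math.fma(a[i * n + k], b[k * n + j], acc);
                }
                c[i * n + j] = acc;
            }
        }
        return c;
    }

    /** Packs B into its transpose so column j of B becomes contiguous row j of B^T. */
    static float[] packTransposed(float[] b, int n) {
        float[] bt = new float[n * n];
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                bt[j * n + k] = b[k * n + j];
            }
        }
        return bt;
    }

    /** Same product, but both operands of the k-loop are read contiguously. */
    static float[] matmulPacked(float[] a, float[] b, int n) {
        float[] bt = packTransposed(b, n);
        float[] c = new float[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float acc = 0f;
                for (int k = 0; k < n; k++) {
                    acc = Math.fma(a[i * n + k], bt[j * n + k], acc);
                }
                c[i * n + j] = acc;
            }
        }
        return c;
    }

    /** Deterministic pseudo-random input in [-0.5, 0.5). */
    static float[] randomMatrix(int n, long seed) {
        Random rng = new Random(seed);
        float[] m = new float[n * n];
        for (int i = 0; i < m.length; i++) {
            m[i] = rng.nextFloat() - 0.5f;
        }
        return m;
    }

    public static void main(String[] args) {
        int n = 64;
        float[] a = randomMatrix(n, 1L);
        float[] b = randomMatrix(n, 2L);
        boolean same = Arrays.equals(matmulNaive(a, b, n), matmulPacked(a, b, n));
        System.out.println("results identical: " + same);
    }
}
```

Both variants accumulate in the same order with the same fused multiply-add, so their outputs should match exactly; only the memory access pattern differs. A cache-blocked version would additionally tile the `i`/`j`/`k` loops, which is the part the last follow-up proposes porting.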