Context
Inference and training runtimes both need short-lived workspace buffers (intermediate activations, masks, RoPE tables, attention slices, padding scratch). Today every call path allocates fresh FloatArrays — measurable allocation pressure on the per-token decode hot path.
In downstream SKaiNET-transformers work this surfaces concretely: a single Gemma 4 forward pass triggers ~6,650 FloatArray allocations (35 layers × 2 sliceView arrays × ~95 forward steps). The recent leak fix at commit 319c394 routed them to the heap; with those allocations now on the GC heap, the next bottleneck is allocation rate / p95 latency.
The right home for this is upstream, alongside sk.ainet.lang.tensor.data, because workspace allocation is generic — it benefits CNN, encoder, embedding, training, and any future model family — not only transformers.
Proposal
Add a ScratchPool SPI in a new package sk.ainet.lang.tensor.scratch:
public interface ScratchPool {
    public fun acquireFloat(minSize: Int): FloatArray        // contents undefined
    public fun acquireFloatZeroed(minSize: Int): FloatArray  // [0, minSize) zeroed
    public fun <R> scope(block: () -> R): R                  // recycles on exit
    public fun stats(): ScratchStats
}
public object NoopScratchPool : ScratchPool // default; allocates fresh
public class SizeClassedScratchPool(maxBuffersPerClass: Int = 8) : ScratchPool
public data class ScratchStats(...)
And add a backward-compatible accessor on ExecutionContext:
public val scratch: ScratchPool get() = NoopScratchPool
Default is NoopScratchPool, so every existing ExecutionContext impl keeps working without modification. Implementations that want pooling override the property (or wrap an existing context).
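A minimal sketch of how this fits together, assuming the SPI above. ScratchPool and NoopScratchPool are restated here (stats() omitted for brevity); ExecutionContext is reduced to a stub, and decodeStep / seqLen are illustrative names, not real SKaiNET API:

```kotlin
interface ScratchPool {
    fun acquireFloat(minSize: Int): FloatArray
    fun acquireFloatZeroed(minSize: Int): FloatArray
    fun <R> scope(block: () -> R): R
}

object NoopScratchPool : ScratchPool {
    override fun acquireFloat(minSize: Int): FloatArray = FloatArray(minSize)
    override fun acquireFloatZeroed(minSize: Int): FloatArray = FloatArray(minSize)
    override fun <R> scope(block: () -> R): R = block()  // nothing to recycle
}

interface ExecutionContext {
    val scratch: ScratchPool get() = NoopScratchPool  // backward-compatible default
}

// Hypothetical call site: workspace is acquired inside a scope and only the
// result escapes, so a pooling implementation may recycle the buffer on exit.
fun decodeStep(ctx: ExecutionContext, seqLen: Int): Float =
    ctx.scratch.scope {
        // acquireFloat's contents are undefined, so initialize before use.
        val scores = ctx.scratch.acquireFloat(seqLen)
        scores.fill(1f)
        scores.sum()
    }
```

Because the default getter returns NoopScratchPool, an existing ExecutionContext implementation compiles unchanged and keeps its current allocate-fresh behavior.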
SizeClassedScratchPool design
- Power-of-two slabs starting at 64 floats; up to 20 classes (covers ~33M floats / 128 MB top bucket).
- Stack of scope frames: acquire adds to the top frame; scope { ... } drains the frame on exit, returning buffers to per-class free lists.
- Single-threaded by intent — concurrent forwards use separate pools (per-thread carriers can be added downstream).
- Surplus buffers (above maxBuffersPerClass) drop to GC, capping retained memory.
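The mechanics above can be sketched in a few dozen lines. This MiniScratchPool is a hypothetical stand-in, not the proposed SizeClassedScratchPool: it is single-threaded and omits acquireFloatZeroed and stats:

```kotlin
class MiniScratchPool(private val maxBuffersPerClass: Int = 8) {
    // Power-of-two size classes starting at 64 floats: class i holds buffers of 64 << i.
    private val freeLists = Array(20) { ArrayDeque<FloatArray>() }
    // Stack of scope frames; each frame records the buffers handed out inside it.
    private val frames = ArrayDeque<MutableList<FloatArray>>()

    private fun classIndex(minSize: Int): Int {
        var size = 64
        var idx = 0
        while (size < minSize) { size = size shl 1; idx++ }
        return idx
    }

    fun acquireFloat(minSize: Int): FloatArray {
        val idx = classIndex(minSize)
        val buf = freeLists[idx].removeLastOrNull() ?: FloatArray(64 shl idx)
        frames.lastOrNull()?.add(buf)  // outside any scope, the buffer just goes to GC
        return buf
    }

    fun <R> scope(block: () -> R): R {
        frames.addLast(mutableListOf())
        try {
            return block()
        } finally {
            for (buf in frames.removeLast()) {
                val idx = classIndex(buf.size)
                if (freeLists[idx].size < maxBuffersPerClass) {
                    freeLists[idx].addLast(buf)
                }
                // else: surplus buffer drops to GC, capping retained memory
            }
        }
    }
}
```

For example, requesting 100 floats rounds up to the 128-float class, and a buffer drained back to a free list by one scope is handed out again by the next acquire of the same class.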
Scope of this issue
- New SPI in sk.ainet.lang.tensor.scratch (commonMain).
- ExecutionContext.scratch accessor with default NoopScratchPool.
- Unit tests: scope semantics, size-class rounding, stats correctness, surplus drop.
- No call-site migrations in this repo — those land downstream in SKaiNET-transformers.
Out of scope
- Per-thread carrier (ScratchPoolContext) — only needed downstream where call sites don't take an ExecutionContext everywhere. With the upstream ctx.scratch field, downstream can pass it through naturally.
- Direct-memory variants (MemorySegment-backed pool) — future iteration.
Acceptance
- New SPI compiles on all KMP targets.
- Unit tests pass.
- NoopScratchPool is a drop-in default; no existing test fails.