Add ScratchPool SPI for runtime workspace allocation #549

@michalharakal

Description

Context

Inference and training runtimes both need short-lived workspace buffers (intermediate activations, masks, RoPE tables, attention slices, padding scratch). Today every call path allocates a fresh FloatArray, which creates measurable allocation pressure on the per-token decode hot path.

In downstream SKaiNET-transformers work this surfaces concretely: a single Gemma 4 generation triggers ~6,650 FloatArray allocations (35 layers × 2 sliceView arrays × ~95 forward steps), which the recent leak fix at commit 319c394 routed to the heap. With those allocations now on the GC heap, the next bottleneck is allocation rate and p95 latency.

The right home for this is upstream, alongside sk.ainet.lang.tensor.data, because workspace allocation is generic — it benefits CNN, encoder, embedding, training, and any future model family — not only transformers.

Proposal

Add a ScratchPool SPI in a new package sk.ainet.lang.tensor.scratch:

public interface ScratchPool {
    public fun acquireFloat(minSize: Int): FloatArray         // contents undefined
    public fun acquireFloatZeroed(minSize: Int): FloatArray   // [0, minSize) zeroed
    public fun <R> scope(block: () -> R): R                   // recycles on exit
    public fun stats(): ScratchStats
}
public object NoopScratchPool : ScratchPool                    // default; allocates fresh
public class SizeClassedScratchPool(maxBuffersPerClass: Int = 8) : ScratchPool
public data class ScratchStats(...)

And add a backward-compatible accessor on ExecutionContext:

public val scratch: ScratchPool get() = NoopScratchPool

Default is NoopScratchPool, so every existing ExecutionContext impl keeps working without modification. Implementations that want pooling override the property (or wrap an existing context).
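To illustrate why this is backward compatible, here is a simplified, self-contained sketch (the real ScratchPool also has acquireFloatZeroed and stats, and the real ExecutionContext has more members; LegacyContext and PooledContext are hypothetical names for illustration only):

```kotlin
// Simplified sketch of the SPI; not the full interface proposed above.
interface ScratchPool {
    fun acquireFloat(minSize: Int): FloatArray
    fun <R> scope(block: () -> R): R
}

// Default: no pooling, every acquire allocates fresh.
object NoopScratchPool : ScratchPool {
    override fun acquireFloat(minSize: Int) = FloatArray(minSize)
    override fun <R> scope(block: () -> R): R = block() // nothing to recycle
}

interface ExecutionContext {
    // Default getter means existing implementations compile unchanged.
    val scratch: ScratchPool get() = NoopScratchPool
}

// A pre-existing impl that knows nothing about scratch still compiles:
class LegacyContext : ExecutionContext

// An impl that opts into pooling simply overrides the property:
class PooledContext(override val scratch: ScratchPool) : ExecutionContext
```

Because the accessor is a property with a default getter rather than an abstract member, adding it is source- and binary-compatible for existing ExecutionContext implementations.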

SizeClassedScratchPool design

  • Power-of-two slabs starting at 64 floats; up to 20 classes (covers ~33M floats / 128 MB top bucket).
  • Stack of scope frames: acquire adds to the top frame; scope { ... } drains the frame on exit, returning buffers to per-class free lists.
  • Single-threaded by design; concurrent forwards use separate pools (per-thread carriers can be added downstream).
  • Surplus buffers (above maxBuffersPerClass) drop to GC, capping retained memory.
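The mechanics above can be sketched in common Kotlin (illustrative only, not the actual SizeClassedScratchPool implementation; error handling and stats are omitted):

```kotlin
// Sketch of the size-classed design: power-of-two classes starting at 64
// floats, per-class free lists capped at maxBuffersPerClass, and a stack of
// scope frames that recycle buffers on exit.
class SketchScratchPool(private val maxBuffersPerClass: Int = 8) {
    private val freeLists = HashMap<Int, ArrayDeque<FloatArray>>()
    private val frames = ArrayDeque<MutableList<FloatArray>>()

    // Round up to the next power-of-two class, minimum 64 floats.
    private fun sizeClass(minSize: Int): Int {
        var c = 64
        while (c < minSize) c = c shl 1
        return c
    }

    fun acquireFloat(minSize: Int): FloatArray {
        val cls = sizeClass(minSize)
        val buf = freeLists[cls]?.removeFirstOrNull() ?: FloatArray(cls)
        frames.lastOrNull()?.add(buf) // tracked for recycling only inside a scope
        return buf
    }

    fun <R> scope(block: () -> R): R {
        frames.addLast(mutableListOf())
        try {
            return block()
        } finally {
            for (buf in frames.removeLast()) {
                val free = freeLists.getOrPut(buf.size) { ArrayDeque() }
                // Surplus buffers above the cap fall through to the GC.
                if (free.size < maxBuffersPerClass) free.addFirst(buf)
            }
        }
    }
}
```

For example, acquiring 100 floats inside a scope returns a 128-float buffer; a second scope reuses that same array from the free list instead of allocating a new one.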

Scope of this issue

  • New SPI in sk.ainet.lang.tensor.scratch (commonMain).
  • An ExecutionContext.scratch accessor with NoopScratchPool as the default.
  • Unit tests: scope semantics, size-class rounding, stats correctness, surplus drop.
  • No call-site migrations in this repo — those land downstream in SKaiNET-transformers.

Out of scope

  • Per-thread carrier (ScratchPoolContext) — only needed downstream where call sites don't take an ExecutionContext everywhere. With the upstream ctx.scratch field, downstream can pass it through naturally.
  • Direct-memory variants (MemorySegment-backed pool) — future iteration.

Acceptance

  • New SPI compiles on all KMP targets.
  • Unit tests pass.
  • NoopScratchPool is a drop-in default; no existing test fails.
