Add ScratchPool SPI for runtime workspace allocation #549

@michalharakal

Description

Context

Inference and training runtimes both need short-lived workspace buffers (intermediate activations, masks, RoPE tables, attention slices, padding scratch). Today every call path allocates a fresh FloatArray, which creates measurable allocation pressure on the per-token decode hot path.

In downstream SKaiNET-transformers work this surfaces concretely: a single Gemma 4 generation triggers ~6,650 FloatArray allocations (35 layers × 2 sliceView arrays × ~95 forward steps), which the recent leak fix at commit 319c394 routed to the heap. With those allocations now on the GC heap, the next bottleneck is allocation rate and p95 latency.

The right home for this is upstream, alongside sk.ainet.lang.tensor.data, because workspace allocation is generic — it benefits CNN, encoder, embedding, training, and any future model family — not only transformers.

Proposal

Add a ScratchPool SPI in a new package sk.ainet.lang.tensor.scratch:

public interface ScratchPool {
    public fun acquireFloat(minSize: Int): FloatArray         // contents undefined
    public fun acquireFloatZeroed(minSize: Int): FloatArray   // [0, minSize) zeroed
    public fun <R> scope(block: () -> R): R                   // recycles on exit
    public fun stats(): ScratchStats
}
public object NoopScratchPool : ScratchPool                    // default; allocates fresh
public class SizeClassedScratchPool(maxBuffersPerClass: Int = 8) : ScratchPool
public data class ScratchStats(...)

And add a backward-compatible accessor on ExecutionContext:

public val scratch: ScratchPool get() = NoopScratchPool

Default is NoopScratchPool, so every existing ExecutionContext impl keeps working without modification. Implementations that want pooling override the property (or wrap an existing context).
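To illustrate why this is backward compatible, here is a simplified, self-contained sketch (the real ScratchPool also has acquireFloatZeroed and stats, and the real ExecutionContext has more members; LegacyContext and PooledContext are hypothetical names for illustration only):

```kotlin
// Simplified sketch of the SPI; not the full interface proposed above.
interface ScratchPool {
    fun acquireFloat(minSize: Int): FloatArray
    fun <R> scope(block: () -> R): R
}

// Default: no pooling, every acquire allocates fresh.
object NoopScratchPool : ScratchPool {
    override fun acquireFloat(minSize: Int) = FloatArray(minSize)
    override fun <R> scope(block: () -> R): R = block() // nothing to recycle
}

interface ExecutionContext {
    // Default getter means existing implementations compile unchanged.
    val scratch: ScratchPool get() = NoopScratchPool
}

// A pre-existing impl that knows nothing about scratch still compiles:
class LegacyContext : ExecutionContext

// An impl that opts into pooling simply overrides the property:
class PooledContext(override val scratch: ScratchPool) : ExecutionContext
```

Because the accessor is a property with a default getter rather than an abstract member, adding it is source- and binary-compatible for existing ExecutionContext implementations.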

SizeClassedScratchPool design

  • Power-of-two slabs starting at 64 floats; up to 20 classes (covers ~33M floats / 128 MB top bucket).
  • Stack of scope frames: acquire adds to the top frame; scope { ... } drains the frame on exit, returning buffers to per-class free lists.
  • Single-threaded by design; concurrent forwards use separate pools (per-thread carriers can be added downstream).
  • Surplus buffers (above maxBuffersPerClass) drop to GC, capping retained memory.
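The mechanics above can be sketched in common Kotlin (illustrative only, not the actual SizeClassedScratchPool implementation; error handling and stats are omitted):

```kotlin
// Sketch of the size-classed design: power-of-two classes starting at 64
// floats, per-class free lists capped at maxBuffersPerClass, and a stack of
// scope frames that recycle buffers on exit.
class SketchScratchPool(private val maxBuffersPerClass: Int = 8) {
    private val freeLists = HashMap<Int, ArrayDeque<FloatArray>>()
    private val frames = ArrayDeque<MutableList<FloatArray>>()

    // Round up to the next power-of-two class, minimum 64 floats.
    private fun sizeClass(minSize: Int): Int {
        var c = 64
        while (c < minSize) c = c shl 1
        return c
    }

    fun acquireFloat(minSize: Int): FloatArray {
        val cls = sizeClass(minSize)
        val buf = freeLists[cls]?.removeFirstOrNull() ?: FloatArray(cls)
        frames.lastOrNull()?.add(buf) // tracked for recycling only inside a scope
        return buf
    }

    fun <R> scope(block: () -> R): R {
        frames.addLast(mutableListOf())
        try {
            return block()
        } finally {
            for (buf in frames.removeLast()) {
                val free = freeLists.getOrPut(buf.size) { ArrayDeque() }
                // Surplus buffers above the cap fall through to the GC.
                if (free.size < maxBuffersPerClass) free.addFirst(buf)
            }
        }
    }
}
```

For example, acquiring 100 floats inside a scope returns a 128-float buffer; a second scope reuses that same array from the free list instead of allocating a new one.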

Scope of this issue

  • New SPI in sk.ainet.lang.tensor.scratch (commonMain).
  • An ExecutionContext.scratch accessor with NoopScratchPool as the default.
  • Unit tests: scope semantics, size-class rounding, stats correctness, surplus drop.
  • No call-site migrations in this repo — those land downstream in SKaiNET-transformers.

Out of scope

  • Per-thread carrier (ScratchPoolContext) — only needed downstream where call sites don't take an ExecutionContext everywhere. With the upstream ctx.scratch field, downstream can pass it through naturally.
  • Direct-memory variants (MemorySegment-backed pool) — future iteration.

Acceptance

  • New SPI compiles on all KMP targets.
  • Unit tests pass.
  • NoopScratchPool is a drop-in default; no existing test fails.
