Context
Part of the priority-ordered NPU / IREE roadmap.
P0-1 is "make quantization first-class in the graph IR so StableHLO export doesn't erase it." Today `Q8_0TensorData` / `Q4_KTensorData` / `TernaryTensorData` live as `TensorData` subclasses and `QuantizedMatmul.matmulAutoDispatch()` discovers them via runtime `is`-checks on the weight buffer. The `ComputeGraph` / `GraphNode` / `TensorSpec` layer is completely blind to quantization, which is why any StableHLO export silently re-materializes everything as FP32.
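To make the status quo concrete, here is a minimal sketch of the runtime `is`-check dispatch pattern described above. All types are simplified stand-ins, not the real skainet declarations, and `dispatchKernel` only approximates the shape of `matmulAutoDispatch()`:

```kotlin
// Simplified stand-ins for the real TensorData subclasses.
sealed interface TensorData
object DenseTensorData : TensorData
object Q8_0TensorData : TensorData
object Q4_KTensorData : TensorData
object TernaryTensorData : TensorData

// Runtime dispatch on the weight buffer's concrete type, in the spirit
// of QuantizedMatmul.matmulAutoDispatch(). The graph IR never sees this
// decision, which is exactly the problem P0-1 addresses.
fun dispatchKernel(weight: TensorData): String = when (weight) {
    is Q8_0TensorData -> "q8_0"
    is Q4_KTensorData -> "q4_k"
    is TernaryTensorData -> "ternary"
    else -> "dense-fp32"
}
```

The point of the sketch: the quantization decision lives entirely at runtime, so any compile-time consumer of the graph sees only dense tensors.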
Scoping uncovered two existing hooks we can lean on:
- `TensorSpec` already carries `metadata: Map<String, Any>` (skainet-lang-core/.../tensor/ops/TensorSpec.kt). No schema change needed.
- `sealed interface TensorEncoding` already models storage encodings: `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque` (skainet-lang-core/.../tensor/storage/TensorEncoding.kt). We can reuse it — no new enum.
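For orientation, the encoding hierarchy presumably looks roughly like the following. This is an approximation reconstructed from the names above, not the actual contents of `TensorEncoding.kt`:

```kotlin
// Approximate shape of the existing storage-encoding model: a sealed
// interface with one object per encoding. Details (e.g. whether these
// carry parameters) may differ in the real file.
sealed interface TensorEncoding {
    object Dense : TensorEncoding
    object Q4_K : TensorEncoding
    object Q8_0 : TensorEncoding
    object TernaryPacked : TensorEncoding
    object TurboQuantPolar : TensorEncoding
    object TurboQuantPolarQjl : TensorEncoding
    object Opaque : TensorEncoding
}
```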
This PR
- Typed accessor helper: add `TensorSpec.tensorEncoding` get/set helpers (extension functions) that read/write a single well-known key on `metadata` and return `TensorEncoding?`. Default `null` means "unknown / not carried" — not the same as `Dense`.
- Populate in the GGUF loader: in `StreamingGgufParametersLoader.load()`, alongside the existing `when (tensorInfo.tensorType)` dispatch that constructs `Q4_KBlockTensorData` / `Q8_0BlockTensorData` / etc., set the corresponding `TensorEncoding` on the `TensorSpec` that surfaces the loaded tensor.
- Preserve through tracing: in `TraceToGraphBuilder` (`buildInputSpecs` / `buildOutputSpecs` / inline fallback sites), propagate `tensorEncoding` from source to derived specs so a node whose input is a `Q4_K` weight carries that metadata onto its `GraphNode.inputs` entry.
- Unit test: load a small synthetic GGUF-like fixture with a Q4_K weight, trace a `matmul`, assert the resulting `GraphNode` input spec's `tensorEncoding` is `TensorEncoding.Q4_K`.
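The pieces above could be sketched roughly as follows. `TensorSpec`, `TensorEncoding`, the `ENCODING_KEY` constant, and the helper names are simplified assumptions declared locally so the sketch is self-contained; the real declarations live in skainet-lang-core and may differ:

```kotlin
// Minimal stand-ins for the real types.
sealed interface TensorEncoding {
    object Dense : TensorEncoding
    object Q4_K : TensorEncoding
}

data class TensorSpec(
    val shape: List<Int>,
    val metadata: Map<String, Any> = emptyMap(),
)

// Hypothetical well-known metadata key; the actual constant may differ.
const val ENCODING_KEY = "tensorEncoding"

// Typed read: null means "unknown / not carried", deliberately distinct
// from Dense.
val TensorSpec.tensorEncoding: TensorEncoding?
    get() = metadata[ENCODING_KEY] as? TensorEncoding

// Typed write via copy, since metadata is modeled as an immutable Map here.
// (This is what the GGUF loader would call when it constructs a Q4_K weight.)
fun TensorSpec.withTensorEncoding(e: TensorEncoding): TensorSpec =
    copy(metadata = metadata + (ENCODING_KEY to e))

// Tracing propagation: a derived spec inherits its source's encoding,
// mirroring what TraceToGraphBuilder would do in buildInputSpecs.
fun propagateEncoding(source: TensorSpec, derived: TensorSpec): TensorSpec {
    val enc = source.tensorEncoding ?: return derived
    return derived.withTensorEncoding(enc)
}
```

A usage shaped like the planned unit test: tag a weight spec with `Q4_K`, derive a spec through `propagateEncoding`, and assert the encoding survives while an untagged spec reads back `null`.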
Out of scope (follow-up PRs in the P0-1 track)
- Teaching `StableHloConverter` / quant-aware emitters to read `tensorEncoding` and emit `stablehlo.uniform_quantize` or quant dialect ops. The metadata has to flow first, then the emitter reads it.
- Parameterizing `Tensor<DType, V>` on a quantization type parameter. That's a much larger redesign and probably the wrong choice — keeping it as metadata on `TensorSpec` is lighter and composes better.
- Changing `QuantizedMatmul.matmulAutoDispatch()`. Runtime dispatch stays for CPU; the goal is only that compile-time (IR) paths stop being blind.
Why this is the right first step
- Purely additive: no existing API changes, no breaking call sites.
- The metadata channel already exists; we're just typing and populating it.
- Later PRs (StableHLO emitter, quant dialect lowering) become one-file local changes instead of cross-cutting refactors.