Plumb TensorEncoding into TensorSpec.metadata (P0-1 step 1) #469

@michalharakal

Description

Context

Part of the priority-ordered NPU / IREE roadmap.

P0-1 is "make quantization first-class in the graph IR so StableHLO export doesn't erase it." Today `Q8_0TensorData` / `Q4_KTensorData` / `TernaryTensorData` live as `TensorData` subclasses and `QuantizedMatmul.matmulAutoDispatch()` discovers them via runtime `is`-checks on the weight buffer. The `ComputeGraph` / `GraphNode` / `TensorSpec` layer is completely blind to quantization, which is why any StableHLO export silently re-materializes everything as FP32.

Scoping uncovered two existing hooks we can lean on:

  1. `TensorSpec` already carries `metadata: Map<String, Any>` (skainet-lang-core/.../tensor/ops/TensorSpec.kt). No schema change needed.
  2. `sealed interface TensorEncoding` already models storage encodings: `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque` (skainet-lang-core/.../tensor/storage/TensorEncoding.kt). We can reuse it — no new enum.
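For orientation, a simplified sketch of the two hooks' shapes (the encoding variants are the ones listed above; the `TensorSpec` fields beyond `metadata` and any payload fields on the encodings are illustrative, not the real declarations):

```kotlin
// Simplified stand-in for the existing sealed interface in
// skainet-lang-core; the real variants may carry payload fields.
sealed interface TensorEncoding {
    object Dense : TensorEncoding
    object Q4_K : TensorEncoding
    object Q8_0 : TensorEncoding
    object TernaryPacked : TensorEncoding
    object TurboQuantPolar : TensorEncoding
    object TurboQuantPolarQjl : TensorEncoding
    object Opaque : TensorEncoding
}

// TensorSpec already carries an untyped metadata map we can piggyback on.
data class TensorSpec(
    val shape: List<Int>,                         // illustrative field
    val metadata: Map<String, Any> = emptyMap(),  // the existing channel
)

fun main() {
    // Nothing stops us from stashing an encoding here today; the PR just
    // makes the key and the type well-known.
    val spec = TensorSpec(
        shape = listOf(2, 2),
        metadata = mapOf("tensor.encoding" to TensorEncoding.Q8_0),
    )
    check(spec.metadata["tensor.encoding"] == TensorEncoding.Q8_0)
    println("ok")
}
```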

This PR

  1. Typed accessor helper: add `TensorSpec.tensorEncoding` get/set helpers (extension functions) that read/write a single well-known key on `metadata` and return `TensorEncoding?`. Default `null` means "unknown / not carried" — not the same as `Dense`.
  2. Populate in the GGUF loader: in `StreamingGgufParametersLoader.load()`, alongside the existing `when (tensorInfo.tensorType)` dispatch that constructs `Q4_KBlockTensorData` / `Q8_0BlockTensorData` / etc., set the corresponding `TensorEncoding` on the `TensorSpec` that surfaces the loaded tensor.
  3. Preserve through tracing: in `TraceToGraphBuilder` (`buildInputSpecs` / `buildOutputSpecs` / inline fallback sites), propagate `tensorEncoding` from source to derived specs so a node whose input is a `Q4_K` weight carries that metadata onto its `GraphNode.inputs` entry.
  4. Unit test: load a small synthetic GGUF-like fixture with a Q4_K weight, trace a `matmul`, assert the resulting `GraphNode` input spec's `tensorEncoding` is `TensorEncoding.Q4_K`.
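Steps 1, 3, and 4 can be sketched together as below. The key name `tensor.encoding`, the copy-based setter, and the stub types are assumptions for illustration; the real declarations live in `TensorSpec.kt` / `TensorEncoding.kt`:

```kotlin
// Minimal stand-ins so the sketch compiles on its own.
sealed interface TensorEncoding {
    object Dense : TensorEncoding
    object Q4_K : TensorEncoding
}

data class TensorSpec(
    val shape: List<Int>,
    val metadata: Map<String, Any> = emptyMap(),
)

// Proposed typed accessor: one well-known metadata key.
// null means "unknown / not carried" -- deliberately distinct from Dense.
const val ENCODING_KEY = "tensor.encoding" // hypothetical key name

val TensorSpec.tensorEncoding: TensorEncoding?
    get() = metadata[ENCODING_KEY] as? TensorEncoding

// "Set" as a copy-returning helper, since metadata is an immutable Map.
fun TensorSpec.withTensorEncoding(e: TensorEncoding): TensorSpec =
    copy(metadata = metadata + (ENCODING_KEY to e))

fun main() {
    // Step 2's effect: the loader tags the weight spec it surfaces.
    val weight = TensorSpec(shape = listOf(4096, 4096))
        .withTensorEncoding(TensorEncoding.Q4_K)

    // Step 3's effect: a derived spec (e.g. a matmul input built by
    // TraceToGraphBuilder) propagates the encoding from its source spec.
    val derived = TensorSpec(shape = weight.shape, metadata = weight.metadata)

    // Step 4's assertion, in miniature.
    check(weight.tensorEncoding == TensorEncoding.Q4_K)
    check(derived.tensorEncoding == TensorEncoding.Q4_K)
    check(TensorSpec(shape = listOf(1)).tensorEncoding == null) // unknown != Dense
    println("ok")
}
```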
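Step 2's mapping in the GGUF loader is essentially a pure function from the GGUF tensor type to an encoding, applied alongside the existing `when (tensorInfo.tensorType)` dispatch. A hedged sketch, where `GgufTensorType` and the variant set are hypothetical stand-ins for the loader's real types:

```kotlin
// Stand-ins; the real encoding type lives in skainet-lang-core and the
// real tensor-type enum in the GGUF loader.
sealed interface TensorEncoding {
    object Dense : TensorEncoding
    object Q4_K : TensorEncoding
    object Q8_0 : TensorEncoding
    object TernaryPacked : TensorEncoding
}

enum class GgufTensorType { F32, Q4_K, Q8_0, TERNARY }

// One-to-one mapping from on-disk tensor type to graph-level encoding,
// evaluated once per tensor next to the existing TensorData construction.
fun encodingFor(t: GgufTensorType): TensorEncoding = when (t) {
    GgufTensorType.F32 -> TensorEncoding.Dense
    GgufTensorType.Q4_K -> TensorEncoding.Q4_K
    GgufTensorType.Q8_0 -> TensorEncoding.Q8_0
    GgufTensorType.TERNARY -> TensorEncoding.TernaryPacked
}

fun main() {
    check(encodingFor(GgufTensorType.Q4_K) == TensorEncoding.Q4_K)
    check(encodingFor(GgufTensorType.F32) == TensorEncoding.Dense)
    println("ok")
}
```

Keeping the mapping exhaustive (no `else` branch) means a future GGUF tensor type fails to compile until someone decides its encoding, rather than silently falling through.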

Out of scope (follow-up PRs in the P0-1 track)

  • Teaching `StableHloConverter` / quant-aware emitters to read `tensorEncoding` and emit `stablehlo.uniform_quantize` or quant dialect ops. The metadata has to flow first, then the emitter reads it.
  • Parameterizing `Tensor<DType, V>` on a quantization type parameter. That's a much larger redesign and probably the wrong choice — keeping it as metadata on `TensorSpec` is lighter and composes better.
  • Changing `QuantizedMatmul.matmulAutoDispatch()`. Runtime dispatch stays for CPU; the goal is only that compile-time (IR) paths stop being blind.

Why this is the right first step

  • Purely additive: no changes to existing APIs, no breaking call sites.
  • The metadata channel already exists; we're just typing and populating it.
  • Later PRs (StableHLO emitter, quant dialect lowering) become one-file local changes instead of cross-cutting refactors.
