Context
Surfaced during the int8 NPU readiness audit on 2026-04-13. The `skainet.tensor_encodings` module-level attribute (#478) tells a downstream consumer which tensors are quantized and how. But `ConstantOperationsConverter` in `skainet-compile-hlo` emits constant tensor bodies as float dense literals:
```mlir
%w = stablehlo.constant dense<[0.123, 0.456, ...]> : tensor<8x4xf32>
```
…even when the source tensor's `TensorData` is a `Q8_0BlockTensorData` or `Q4_KBlockTensorData`. The module attribute separately says `skainet.tensor_encodings = {w = "Q8_0"}` so a consumer can see the "intent" — but the actual constant in the IR is a fictional placeholder because the emitter does not have a path to serialize quantized block bytes into the MLIR.
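For concreteness, the block bytes the emitter currently has no path to serialize are simple. A GGML-style Q8_0 block packs 32 weights into one fp16 scale plus 32 int8 codes, i.e. 34 bytes per 32 weights versus 128 bytes as f32. The sketch below is illustrative Python, not the actual `Q8_0BlockTensorData` API:

```python
from dataclasses import dataclass

# Illustrative sketch of a GGML-style Q8_0 block: one fp16 scale plus
# 32 int8 codes (34 bytes for 32 weights, vs 128 bytes as f32).
# Not the actual SKaiNET Q8_0BlockTensorData API.

Q8_0_BLOCK_SIZE = 32          # weights per block
Q8_0_BLOCK_BYTES = 2 + 32     # fp16 scale + 32 int8 codes

@dataclass
class Q8_0Block:
    scale: float              # stored as fp16 on disk
    codes: bytes              # 32 int8 values, stored as raw bytes

    def dequantize(self) -> list[float]:
        # value = scale * code, per GGML's Q8_0 definition
        signed = [c - 256 if c > 127 else c for c in self.codes]
        return [self.scale * c for c in signed]
```

Q4_K is the same idea with 4-bit codes, per-sub-block scales, and a 256-element super-block, which is what makes it harder to map into standard MLIR types later on.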
Why this is P1 — the load-bearing IREE gap
This is the single biggest blocker for a real "SKaiNET → IREE → NPU quantized inference" pipeline. Until it's fixed:
- IREE sees a dense float weight and dequantizes at compile time (loses the storage benefit).
- The emitted MLIR is numerically wrong if the "dense float" values are just placeholders rather than faithfully dequantized from the Q8_0 / Q4_K blocks.
- The `skainet.tensor_encodings` attribute is informational only, with no ground-truth data in the IR backing it.
Design options
Three approaches, listed from cheapest to most rigorous:
Option A — dequantize at emit time (correct but defeats the point)
`ConstantOperationsConverter` detects `tensor.data is PackedBlockStorage`, calls the existing GGML dequant routine to produce a float buffer, and emits that as the dense constant. Wasteful on bytes but correct: the emitted MLIR is numerically equivalent to the quantized source, and IREE can re-quantize at compile time if it wants. Simplest to implement (~a day), but it leaves the 4× storage saving on the table, and IREE's compile-time re-quant may not match the source quant scheme.
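The Option A emit path can be sketched as: detect packed block storage, run the existing dequant routine over every block, and splice the float buffer into the constant body. A hedged Python sketch (all names illustrative; the real converter is Kotlin in `skainet-compile-hlo`):

```python
# Option A sketch: dequantize at emit time and print a dense f32 constant.
# emit_dense_f32_constant and dequantize_block are hypothetical names, not
# the real ConstantOperationsConverter API.

def emit_dense_f32_constant(name, blocks, dequantize_block, shape):
    """blocks: opaque quantized blocks; dequantize_block: existing dequant routine."""
    values = []
    for block in blocks:
        values.extend(dequantize_block(block))
    body = ", ".join(f"{v:.6e}" for v in values)
    dims = "x".join(str(d) for d in shape)
    return f"%{name} = stablehlo.constant dense<[{body}]> : tensor<{dims}xf32>"
```

The key property is that the dense body is a faithful dequantization of the source blocks, so the emitted IR is numerically honest even though the quantized storage is gone.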
Option B — emit raw int8 constant + separate scale/zp attribute (matches TFLite flatbuffer style)
Emit the constant as a `stablehlo.constant dense<...> : tensor<8x4xi8>` with the raw quantized codes, and carry the dequant params (scales, zero points, block structure) in an `#skainet.quant_params` attribute on the constant op. A downstream pass reads the combination and materializes `!quant.uniform` types itself. Preserves storage, preserves intent, moderate implementation complexity. Requires extending `TypeMapper` and `ConstantOperationsConverter` to understand per-tensor quant types.
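A minimal sketch of the Option B contract, assuming a per-block Q8_0-style symmetric scheme (no zero points). The attribute name `#skainet.quant_params` comes from the proposal above; everything else below is illustrative Python, not the real emitter or pass:

```python
# Option B sketch: the emitter keeps raw int8 codes in the constant and ships
# the dequant params separately; a downstream pass recombines them.
# Per-block symmetric quantization, 32 values per block; names illustrative.

BLOCK = 32

def quantize_blocks(values):
    """Return (codes, scales): int8 codes plus one scale per 32-value block."""
    codes, scales = [], []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        scale = (max(abs(v) for v in chunk) or 1.0) / 127.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return codes, scales

def materialize(codes, scales):
    """What the downstream pass does with the i8 constant + quant_params attr."""
    return [scales[i // BLOCK] * c for i, c in enumerate(codes)]
```

The point of the split is that `quantize_blocks` output maps directly onto an `i8` dense constant plus an attribute, so no information is lost between emitter and consumer.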
Option C — emit as `!quant.uniform<i8:f32, 0.1:128>` directly (the "right" MLIR way)
Use MLIR's `quant` dialect element types natively. Maximum downstream tool compatibility: IREE, XLA, TFLite, ONNXRuntime all understand `!quant.uniform`. But Q4_K is a block-quantized format with per-block scales and a super-block shape that doesn't map cleanly to the `uniform` type. Would require either:
- reducing to per-tensor / per-channel scale (lossy for Q4_K)
- emitting a custom `#skainet.q4_k` dialect type the consumer has to extend to parse
- decomposing Q4_K into a stablehlo.custom_call stub that wraps the raw bytes
Not a 1-PR project. Start with A or B, plan toward C.
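Why the first bullet is lossy can be shown numerically: a small-magnitude block loses nearly all its resolution once it must share a scale with a large-magnitude block. A sketch with int8 codes for simplicity (Q4_K's 4-bit codes only make this worse):

```python
# Demonstration that collapsing per-block scales to one per-tensor scale is
# lossy. Illustrative only; uses int8 rather than Q4_K's 4-bit codes.

def quant_error(values, scale):
    """Max reconstruction error when quantizing values with a fixed scale."""
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return max(abs(v - scale * c) for v, c in zip(values, codes))

small = [0.01 * i for i in range(32)]   # block with tiny magnitudes
large = [10.0 * i for i in range(32)]   # block with large magnitudes

per_block_scale = max(abs(v) for v in small) / 127.0
per_tensor_scale = max(abs(v) for v in small + large) / 127.0

err_per_block = quant_error(small, per_block_scale)
err_per_tensor = quant_error(small, per_tensor_scale)
```

With the shared scale, every value in the small block rounds to code 0, so the block is wiped out entirely. This is the concrete reason a plain `!quant.uniform` per-tensor type cannot carry Q4_K faithfully.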
Recommended starting point
Do Option A first. Correctness beats storage efficiency for the first working IREE round-trip. Once a SKaiNET → IREE → NPU pipeline is running end-to-end on any quantized model, revisit for Option B or C based on whichever downstream consumer is actually in play.
Scope of the first PR
- Extend `ConstantOperationsConverter` (or wherever weight-constant emission happens) to detect `is PackedBlockStorage` on the tensor's `TensorData`.
- Call the existing GGML dequant routine for each quantized block format and materialize a float buffer.
- Emit the float buffer as the dense constant body, keeping the existing `skainet.tensor_encodings` module attribute as the metadata hint for future Option B/C consumers.
- Unit test: build a graph with a Q8_0 weight, emit, parse the constant body back, compare against a reference dequantized FloatArray.
- `./gradlew :skainet-compile:skainet-compile-hlo:allTests` before pushing.
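The parse-back half of the proposed unit test can be sketched as follows. `parse_dense_body` is a hypothetical helper for the test, not an existing `skainet-compile-hlo` API:

```python
# Sketch of the round-trip check: parse the emitted dense<[...]> body back
# into floats so it can be compared against a reference dequantized buffer.
# parse_dense_body is a hypothetical test helper.

def parse_dense_body(text: str) -> list[float]:
    start = text.index("dense<[") + len("dense<[")
    end = text.index("]>", start)
    inner = text[start:end].strip()
    return [float(tok) for tok in inner.split(",")] if inner else []
```

The real test would then assert element-wise closeness between this parsed list and the reference dequantized FloatArray, with a small tolerance to absorb float-formatting round trips.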
Out of scope
- Option B / C. Follow-ups.
- Per-channel or per-block quant dialect emission.
- Changing `TensorEncoding` or `TensorSpec`.
Relationship to other IREE work
Second of two gaps surfaced in the 2026-04-13 audit. The other gap ("TensorEncoding not propagated through intermediate ops") is tracked in its own issue. Both are tagged into the project at https://github.com/orgs/SKaiNET-developers/projects/1.