Emit TensorEncoding into StableHLO output (P0-1 step 2) #473

@michalharakal

Description

Context

Follow-up to #469. `TensorSpec` now carries `tensorEncoding: TensorEncoding?` end-to-end through `TraceToGraphBuilder`, so by the time a `GraphNode` reaches `StableHloConverter` its weight / constant operand specs already know whether they're Q4_K / Q8_0 / TernaryPacked / TurboQuant / Dense / unknown.

The StableHLO emitter currently ignores this metadata entirely. Even if a weight arrives with `tensorEncoding == TensorEncoding.Q8_0`, the emitted `.mlir` is indistinguishable from one built out of dense FP32 constants. That erases quantization at the compile-path boundary, which is exactly the P0-1 gap the whole track is meant to close.
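To make the data flow concrete, here is a minimal sketch of the shapes the Context paragraph describes. These are simplified stand-ins, not the real SKaiNET declarations: the actual `TensorSpec` carries more fields, and a missing encoding ("unknown") is modeled here as `null`.

```kotlin
// Simplified stand-ins for the SKaiNET types named above; field sets
// and enum entries are illustrative, not the repo's exact definitions.
enum class TensorEncoding { Q4_K, Q8_0, TernaryPacked, TurboQuant, Dense }

data class TensorSpec(
    val name: String,
    val shape: List<Int>,
    // Non-null once TraceToGraphBuilder has propagated encoding metadata;
    // null models the "unknown" case.
    val tensorEncoding: TensorEncoding? = null,
)

fun main() {
    val weight = TensorSpec("w", listOf(4096, 4096), TensorEncoding.Q8_0)
    // By the time StableHloConverter sees this spec, the encoding is known:
    println(weight.tensorEncoding)  // Q8_0
}
```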

This PR

A minimal but useful emitter hook: the simplest change that lets downstream tools (and humans reading the MLIR) see that quantization flowed through.

  1. Emit a metadata comment next to encoded operands. In `StableHloConverter`'s main dispatch loop, after an operation has been emitted, check its `GraphNode` inputs / outputs for any spec with a non-null `tensorEncoding` and emit a preceding MLIR comment line like:

    ```mlir
    // tensor_encoding: operand=1 name=w encoding=Q8_0
    ```

    The comment carries the operand index, the tensor name, and the `TensorEncoding.name` string. MLIR tools ignore comments but text round-trips preserve them, so the information survives into consumer pipelines.

  2. Expose an optional typed hook. Add a small helper to `ConversionContext` — `emitEncodingAnnotation(spec: TensorSpec)` — that individual converters can call if they want finer-grained comment placement than the default post-emit sweep.

  3. Unit test. Build a `ComputeGraph` with a weight node whose output spec has `TensorEncoding.Q8_0`, run it through `StableHloConverter`, assert the emitted text contains the expected `// tensor_encoding: ... encoding=Q8_0` comment near the weight's emission site.
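The three steps above can be sketched roughly as follows. This is a hedged illustration, not the PR's implementation: the types are simplified stand-ins, and where the PR gives `emitEncodingAnnotation(spec: TensorSpec)` as the signature, this sketch adds an operand-index parameter purely so the example can produce the full comment format shown in step 1.

```kotlin
// Illustrative sketch of the comment-emission hook; names follow the PR
// text but the surrounding types are simplified stand-ins.
enum class TensorEncoding { Q4_K, Q8_0, TernaryPacked, TurboQuant, Dense }

data class TensorSpec(
    val name: String,
    val tensorEncoding: TensorEncoding? = null,
)

class ConversionContext(private val out: StringBuilder) {
    // Emits the metadata comment for one operand spec. Converters can call
    // this directly for finer-grained placement than the post-emit sweep.
    // The index parameter is an assumption of this sketch, not the PR API.
    fun emitEncodingAnnotation(index: Int, spec: TensorSpec) {
        val enc = spec.tensorEncoding ?: return  // null == unknown: stay silent
        out.appendLine(
            "// tensor_encoding: operand=$index name=${spec.name} encoding=${enc.name}"
        )
    }
}

fun main() {
    val mlir = StringBuilder()
    val ctx = ConversionContext(mlir)
    // Post-emit sweep over a node's operand specs: only encoded operands
    // produce a comment line; dense/unknown operands are skipped.
    val operands = listOf(TensorSpec("x"), TensorSpec("w", TensorEncoding.Q8_0))
    operands.forEachIndexed { i, spec -> ctx.emitEncodingAnnotation(i, spec) }
    print(mlir)  // // tensor_encoding: operand=1 name=w encoding=Q8_0
}
```

The unit test in step 3 then reduces to running a small graph through the converter and asserting the emitted text contains the expected comment line, as shown in the sweep above.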

Why comments and not real quant dialect ops

StableHLO's `quant.` dialect uses typed quant element types (`!quant.uniform<i8:f32, 0.1:128>`) that are fiddly to emit as text and are not yet consumed anywhere in the SKaiNET pipeline. Emitting them prematurely would just produce MLIR that no existing tool in this repo validates. Comments are the cheapest reversible first hop and unblock the next PR, which can: (a) grow the comment to a structured `#skainet.tensor_encoding` attribute, or (b) cut over to real `stablehlo.custom_call @dequantize_q8_0` stubs matching the style already used by `ReductionOperationsConverter`.

Out of scope

  • `stablehlo.uniform_quantize` / real quant dialect emission.
  • Teaching IREE or any downstream tool to read the comments. That's downstream work.
  • Changing the shape of `TensorEncoding` or `TensorSpec`.
  • Conv / attention / softmax lowerings (Fix softmax StableHLO lowering to use real reductions #467 is separate).
