GGUF DEQUANTIZE_TO_FP32 over-allocates: 1.1B Q4_K_M needs >12 GB heap transiently (~4.4 GB legit) #782

Description

@michalharakal

Summary

Loading a 1.1B-parameter Q4_K_M GGUF via LlamaNetworkLoader.fromGguf(..., QuantPolicy.DEQUANTIZE_TO_FP32) over-allocates badly: the legitimate resident cost is ~4.4 GB (dense FP32), but the dequant path transiently needs >12 GB heap to get there. The model only loads cleanly with a 32 GB heap — unreasonable for a 1.1B model.

This is a transient-allocation / extra-copy problem in the dequant path, not a correctness bug — the produced tensors are correct.

Numbers

  • Model: TinyLlama-1.1B-Chat v1.0, Q4_K_M (637 MB GGUF on disk).
  • Parses correctly: 202 parameter tensors, ~1.1B params, correct shapes (token_embd [32000, 2048], GQA k_proj [256, 2048]).
  • Legitimate dense FP32 footprint: ~4.4 GB (1.1e9 params × 4 bytes).
  • Observed peak heap to complete the load: >12 GB transient (the load fails with a smaller heap; 32 GB is needed to be safe), roughly 3× the dense floor.
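
The arithmetic behind these numbers can be checked with a few lines of Kotlin. This is only a back-of-envelope sketch using the figures quoted above; the helper name `denseBytes` is made up for illustration and is not part of the loader.

```kotlin
// Back-of-envelope footprint check; uses only figures quoted in this issue.
const val BYTES_PER_FP32 = 4L

// Dense FP32 size in bytes of a tensor with the given shape (hypothetical helper).
fun denseBytes(vararg dims: Long): Long =
    dims.fold(BYTES_PER_FP32) { acc, d -> acc * d }

// Whole model: ~1.1e9 parameters at 4 bytes each.
val denseTotalBytes: Long = 1_100_000_000L * BYTES_PER_FP32

// Two of the tensors named above.
val tokenEmbdBytes: Long = denseBytes(32_000L, 2_048L) // token_embd [32000, 2048]
val kProjBytes: Long = denseBytes(256L, 2_048L)        // GQA k_proj [256, 2048]

// Lower bound on the overshoot: 12 GB observed peak over the dense floor.
val overshootRatio: Double = 12e9 / denseTotalBytes
```

`denseTotalBytes` comes out at 4.4e9 bytes and `tokenEmbdBytes` at about 262 MB, so even the largest single tensor is small next to the observed peak. The ratio is a lower bound of about 2.7, since the observed peak is only known to exceed 12 GB.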

Likely cause

The DEQUANTIZE_TO_FP32 path (QuantPolicy.kt → skainet-io-gguf dequant, Quants.kt) appears to materialize boxed Float / intermediate copies rather than unpacking each Q4_K/Q6_K block directly into a primitive FloatArray. With ~4.4 GB of final data, even one extra full-size intermediate (plus boxing overhead) blows past 12 GB.
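
To give a rough sense of why boxing would matter here, the sketch below compares per-element costs. The byte sizes are assumptions for a typical 64-bit HotSpot JVM with compressed oops (a 16-byte `java.lang.Float` object plus a 4-byte reference per element), not measurements of this code path, and whether the loader actually boxes is still unconfirmed.

```kotlin
// ASSUMED per-element costs (64-bit HotSpot, compressed oops); not measured.
const val PRIMITIVE_FLOAT_BYTES = 4L   // one slot in a FloatArray
const val BOXED_FLOAT_BYTES = 16L + 4L // java.lang.Float object + reference slot

// Estimated heap for `elements` floats held in a primitive FloatArray.
fun primitiveBytes(elements: Long): Long = elements * PRIMITIVE_FLOAT_BYTES

// Estimated heap for the same floats held as boxed Float (e.g. List<Float>).
fun boxedBytes(elements: Long): Long = elements * BOXED_FLOAT_BYTES

// Element count of token_embd [32000, 2048], the largest shape quoted above.
val tokenEmbdElements: Long = 32_000L * 2_048L
```

Under these assumptions a boxed representation costs about 5× the primitive one: `token_embd` alone would go from about 262 MB to about 1.3 GB while it is held boxed.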

Why it matters

This blocks adding TinyLlama-1.1B as a real-weights conformance/export reference on the IREE conformance side. The export use case (trace → StableHLO with weights baked as constants) fundamentally needs the ~4.4 GB FP32 resident, so the lazy RowDequantSource / ops.gather row-dequant path (a24f21d) does not help here — the issue is specifically the transient overshoot above the 4.4 GB floor during eager full materialization.

Suggested fix

  • Stream the Q4_K/Q6_K unpack block-by-block straight into the destination FloatArray (no boxed Float, no full-size intermediate copy).
  • Target: peak heap ≈ dense FP32 footprint + a small per-block scratch (i.e. ~5–6 GB, not >12 GB, for this model).
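
A minimal sketch of the streaming shape suggested above. To keep it short it uses a simplified, hypothetical 4-bit block layout (a 4-byte little-endian f32 scale followed by 16 bytes of packed nibbles, 32 values per block), which is NOT the real Q4_K or Q6_K layout. The function names are also made up for illustration. The point is only the allocation pattern: one destination `FloatArray` allocated once, each block written straight into it, with no boxed `Float` and no full-size intermediate.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical simplified block format (NOT Q4_K/Q6_K):
// [f32 scale, little-endian][16 bytes, two 4-bit values per byte] -> 32 floats.
const val BLOCK_VALUES = 32
const val BLOCK_BYTES = 4 + BLOCK_VALUES / 2

// Unpack one block directly into dst at dstOffset; allocates nothing.
fun dequantBlockInto(src: ByteBuffer, srcOffset: Int, dst: FloatArray, dstOffset: Int) {
    val scale = src.getFloat(srcOffset)
    for (i in 0 until BLOCK_VALUES / 2) {
        val packed = src.get(srcOffset + 4 + i).toInt() and 0xFF
        // Low nibble fills the first half of the block, high nibble the second.
        dst[dstOffset + i] = ((packed and 0x0F) - 8) * scale
        dst[dstOffset + BLOCK_VALUES / 2 + i] = ((packed ushr 4) - 8) * scale
    }
}

// Stream every block of a tensor into a single preallocated FloatArray.
fun dequantizeStreaming(src: ByteBuffer, blockCount: Int): FloatArray {
    // duplicate() leaves the caller's position and byte order untouched.
    val le = src.duplicate().order(ByteOrder.LITTLE_ENDIAN)
    val dst = FloatArray(blockCount * BLOCK_VALUES) // the only large allocation
    for (b in 0 until blockCount) {
        dequantBlockInto(le, b * BLOCK_BYTES, dst, b * BLOCK_VALUES)
    }
    return dst
}
```

With this shape, the peak heap per tensor is the destination array plus whatever backs `src` (for example a memory-mapped file region). The real Q4_K/Q6_K unpack would replace the body of `dequantBlockInto` and might need a small fixed scratch for per-block scales.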

Notes / related

🤖 Generated with Claude Code
