
[burn-fork] End-to-end GGUF inference parity: Llama 3 8B + Qwen 2.5 7B vs llama.cpp #135

@AdaWorldAPI

Description


Worker: E1 (ensemble: model frameworks; integral validator of the spine pattern)
Mnemonic IDs blocking this: D1 (parity harness — provides per-kernel baseline)
Mnemonic IDs blocked by this: E2 (candle-fork ONNX/ViT/Whisper parity follows the same shape)

Why

crates/burn/ in this repo vendors AdaWorldAPI/burn (pinned at rev 9b2b671, see crates/burn/Cargo.toml) and rewires the burn-ndarray backend through ndarray = { path = "../.." } — gemm, BF16 matmul, AMX dispatch all come from this repo's SIMD + HPC stack now. The standing claim is: burn-fork can reliably load and execute any GGUF, because everything beneath it is the spine.

That claim is currently asserted at the per-kernel level (D1 parity harness) but never integrated end-to-end. Per-kernel parity is necessary but not sufficient — kernel composition, KV-cache layout, RoPE numerics, attention masking, and quantization dequant paths all interact, and BF16 mantissa-exact at one kernel does not guarantee BF16-tolerant logits at token N=50.

This issue is the integral of D1: full-stack GGUF inference, run identically on burn-fork (using ndarray kernels) and on llama.cpp reference, top-N logits compared at every token position. If the parity holds, the spine pattern is validated for real production models. If not, the BF16 mantissa-exact claim has a hole and we know exactly where the model-level numerics diverge from per-kernel numerics.

Clinical relevance. GGUF generative models (Llama 3, Qwen 2.5/3.5) are the generative-reasoning side of the MedCare-rs stack: anamnesis summarization, patient-facing translation (incl. German), clinical reasoning chains. They must produce stable, reproducible outputs across SPR-AMX cloud and NEON-Pi edge. A logit drift at token 30 of a 50-token discharge summary is a clinical safety incident, not a numerics curiosity.

What

A nightly CI job that runs identical prompts through burn-fork (via the ndarray spine) and the pinned llama.cpp reference, and asserts top-5 logit parity within a documented BF16 tolerance band: pinned weights, pinned reference build, pinned ISA paths.

Test models (CI smoke + production-representative)

Model         Quantizations   Approx size (Q4_K_M / Q8_0)
Llama 3 8B    Q4_K_M, Q8_0    ~4.7 GB / ~8.5 GB
Qwen 2.5 7B   Q4_K_M, Q8_0    ~4.4 GB / ~7.6 GB

Two quantizations per model exercise different precision regimes — Q4_K_M (k-quant block-scaled, lossy, common in deployment) and Q8_0 (8-bit, near-lossless, the parity ceiling).
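The "parity ceiling" framing can be made concrete. A minimal sketch of Q8_0 dequantization, per the published GGML block layout (32 signed 8-bit quants sharing one scale); the scale is f16 in the real format, widened to f32 here to keep the sketch dependency-free, and all names are illustrative, not this repo's API:

```rust
/// One Q8_0 block: 32 signed 8-bit quants plus one shared scale.
/// (Real GGUF stores the scale as f16; f32 here for a dependency-free sketch.)
struct BlockQ80 {
    d: f32,        // shared scale
    qs: [i8; 32],  // quantized weights
}

/// Dequant is a single multiply per weight: no sub-block mins, no nested
/// scales. This is why Q8_0 is near-lossless and sets the parity ceiling,
/// while Q4_K_M's block-scaled k-quant path is lossier and needs a looser band.
fn dequantize_q8_0(block: &BlockQ80) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(block.qs.iter()) {
        *o = block.d * q as f32;
    }
    out
}

fn main() {
    let block = BlockQ80 { d: 0.5, qs: [2i8; 32] };
    let vals = dequantize_q8_0(&block);
    println!("{}", vals[0]); // 0.5 * 2 = 1
}
```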

Pinning (must be in tests/fixtures/gguf_parity.toml or equivalent)

  • GGUF weights: SHA-256 of each .gguf file, source URL, license note
  • llama.cpp reference: exact upstream commit SHA (e.g. ggerganov/llama.cpp@), build flags, and the llama-cli invocation used to produce the reference logits
  • Tokenizer: SHA-256 of the tokenizer.json / merges baked into each GGUF's metadata
  • Prompts: the 10-prompt suite frozen in-repo (no live downloads)
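A possible shape for that fixture; every field name below is a guess pending the design pass, and placeholder values stay placeholders:

```toml
# tests/fixtures/gguf_parity.toml (hypothetical layout; field names TBD)

[[model]]
name             = "llama-3-8b"
quant            = "Q4_K_M"
sha256           = "<sha256 of the .gguf>"
source           = "<pinned download URL>"
license          = "<license note>"
tokenizer_sha256 = "<sha256 of tokenizer metadata baked into the GGUF>"

[reference.llama_cpp]
commit      = "<exact upstream commit SHA>"
build_flags = "<flags used for the reference build>"
invocation  = "<llama-cli command that produced the reference logits>"

[tolerance]
# Numbers traceable to D1's per-kernel baseline; Q4_K_M looser than Q8_0.
q4_k_m_max_logit_delta = 0.0  # placeholder
q8_0_max_logit_delta   = 0.0  # placeholder
```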

Test suite (10 prompts × 50 generated tokens × top-5 logits per position)

Bucket           Prompts   Why
Short factual    2         "Capital of France?" — fast smoke, deterministic
Long generative  2         200-token continuation — accumulates drift
Code             2         Rust + Python snippet completion — different token distribution
Multilingual     2         German + one non-Latin (e.g. Mandarin or Arabic) — clinical-translation surface
Reasoning        2         Short chain-of-thought — exposes attention-mask + KV-cache bugs

For each (model, quant, prompt, position), record the top-5 token IDs and their logit values from both engines, then assert:

  • top-1 token ID matches exactly (greedy stability)
  • top-5 token ID set matches (rank-stability allowed within set)
  • per-token logit deltas within BF16 tolerance band from D1, per quantization — Q4_K_M will be looser than Q8_0; tolerances are documented, not guessed
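The three assertions above can be sketched as a single check per position; a minimal version, assuming each engine reports its top-5 sorted by logit (types and names are illustrative, and the tolerance comes from the D1-derived fixture, not from code):

```rust
use std::collections::HashSet;

/// Top-5 at one token position from one engine: (token_id, logit),
/// sorted descending by logit.
type Top5 = [(u32, f32); 5];

fn assert_parity(burn: &Top5, llamacpp: &Top5, tol: f32) -> Result<(), String> {
    // 1. Top-1 token ID matches exactly (greedy stability).
    if burn[0].0 != llamacpp[0].0 {
        return Err(format!("top-1 mismatch: {} vs {}", burn[0].0, llamacpp[0].0));
    }
    // 2. Top-5 token ID *sets* match; rank swaps within the set are allowed.
    let set_a: HashSet<u32> = burn.iter().map(|t| t.0).collect();
    let set_b: HashSet<u32> = llamacpp.iter().map(|t| t.0).collect();
    if set_a != set_b {
        return Err("top-5 set mismatch".into());
    }
    // 3. Per-token logit deltas within the per-quant tolerance band.
    //    The set check above guarantees the lookup succeeds.
    for &(id, la) in burn {
        let lb = llamacpp.iter().find(|t| t.0 == id).unwrap().1;
        if (la - lb).abs() > tol {
            return Err(format!("logit delta {} > {} for token {}", (la - lb).abs(), tol, id));
        }
    }
    Ok(())
}

fn main() {
    let burn: Top5 = [(7, 9.1), (3, 8.0), (9, 6.5), (1, 5.0), (4, 4.2)];
    let llamacpp: Top5 = [(7, 9.05), (3, 8.02), (9, 6.45), (1, 5.01), (4, 4.18)];
    println!("{:?}", assert_parity(&burn, &llamacpp, 0.1)); // Ok(())
}
```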

ISA coverage (per D2's CI matrix)

  • SPR AMX path — explicit, not polyfill. The whole point of the spine is that AMX dispatch comes from ndarray; this is where it must be exercised under a real model.
  • AVX2 path — baseline x86, the floor every cloud runner hits.
  • NEON path (Pi 5 8GB if runner allows) — model size is the binding constraint here; 8B BF16 weights are ~16 GB which does not fit, so NEON coverage is realistically Q4_K_M only and may need a smaller model substituted (e.g. Llama 3.2 3B Q4_K_M) as a proxy. Decision needed at design pass.
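The binding constraint behind the NEON decision is back-of-envelope arithmetic; a sketch, where 8B parameters and ~4.8 bits/weight for Q4_K_M are approximations (and KV cache plus runtime overhead are ignored), not measured file sizes:

```rust
// Weight-memory estimate: bytes ≈ params × bits_per_weight / 8,
// reported in GiB. Ignores KV cache and runtime overhead.
fn approx_model_gib(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0 / 1024f64.powi(3)
}

fn main() {
    // BF16 = 16 bits/weight: ~14.9 GiB for 8B params, over a Pi 5's 8 GB.
    println!("{:.1}", approx_model_gib(8_000_000_000, 16.0));
    // Q4_K_M averages roughly 4.8 bits/weight: ~4.5 GiB, fits with headroom.
    println!("{:.1}", approx_model_gib(8_000_000_000, 4.8));
}
```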

Architecture

        ┌── 10-prompt suite (pinned) ──┐
        │                              │
        ▼                              ▼
┌──────────────────┐         ┌──────────────────┐
│   burn-fork      │         │   llama.cpp      │
│   (this repo's   │         │   (pinned SHA)   │
│   crates/burn)   │         │                  │
│        │         │         │        │         │
│        ▼         │         │        ▼         │
│  burn-ndarray    │         │   ggml kernels   │
│  backend         │         │   (reference)    │
│        │         │         └────────┬─────────┘
│        ▼         │                  │
│  ndarray spine   │                  │
│  (gemm, BF16     │                  │
│   matmul, AMX,   │                  │
│   AVX2, NEON)    │                  │
└────────┬─────────┘                  │
         │ top-5 logits @ every pos   │ top-5 logits @ every pos
         └─────────────┬──────────────┘
                       ▼
        ┌──────────────────────────────┐
        │  parity assert harness       │
        │  (BF16 tolerance per quant)  │
        │  ── from D1                  │
        └──────────────────────────────┘

The point: burn-fork is the validation surface, ndarray is what's actually being validated. burn-fork being correct end-to-end is evidence the spine substitution worked.

Acceptance criteria

  • tests/gguf_parity/ (or crates/burn/tests/gguf_parity/) scaffolded with the 10-prompt fixture, pinned-weights manifest, pinned llama.cpp commit SHA recorded in a LLAMA_CPP_REF constant or fixture file.
  • Llama 3 8B Q4_K_M — top-5 logit parity vs llama.cpp on the 10-prompt suite × 50 tokens, within Q4_K_M BF16 tolerance band, on at least SPR AMX + AVX2 + (NEON or NEON-proxy) paths.
  • Llama 3 8B Q8_0 — same, with tighter Q8_0 tolerance band.
  • Qwen 2.5 7B Q4_K_M — same as Llama Q4_K_M.
  • Qwen 2.5 7B Q8_0 — same as Llama Q8_0.
  • Per-quantization tolerance bands documented in fixture (Q4_K_M looser than Q8_0; numbers traceable to D1's per-kernel baseline).
  • Reproducible: runs in CI nightly (not per-PR — the weights are too heavy), pinned weights/llama.cpp/tokenizer SHAs, deterministic seeds, no live network dependency at test time (weights pre-staged on the runner).
  • On failure, the harness emits (model, quant, prompt_idx, token_idx, top5_burn, top5_llamacpp, delta) so divergence is greppable, not just a red X.
  • CI matrix entry for SPR AMX is explicit — not silently routed through AVX2 polyfill (per D2's ISA-coverage requirement).
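The failure-record bullet above can be sketched as a one-line-per-divergence emitter, so a red run is greppable by model, quant, or token position; struct and field names are illustrative, mirroring the tuple in the bullet:

```rust
/// One divergence record, matching the tuple named in the acceptance criteria.
struct Divergence {
    model: &'static str,
    quant: &'static str,
    prompt_idx: usize,
    token_idx: usize,
    top5_burn: [u32; 5],      // token IDs from burn-fork
    top5_llamacpp: [u32; 5],  // token IDs from the llama.cpp reference
    delta: f32,               // worst per-token logit delta at this position
}

impl std::fmt::Display for Divergence {
    // One key=value line per divergence: greppable, not just a red X.
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "PARITY_FAIL model={} quant={} prompt={} token={} burn={:?} llamacpp={:?} delta={:.4}",
            self.model, self.quant, self.prompt_idx, self.token_idx,
            self.top5_burn, self.top5_llamacpp, self.delta
        )
    }
}

fn main() {
    let d = Divergence {
        model: "llama-3-8b", quant: "Q4_K_M",
        prompt_idx: 3, token_idx: 30,
        top5_burn: [11, 42, 7, 99, 5],
        top5_llamacpp: [42, 11, 7, 99, 5],
        delta: 0.0312,
    };
    println!("{d}");
}
```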

Out of scope

  • candle-fork (ONNX/ViT/Whisper) — that's E2; same shape of issue, different framework, will be filed separately.
  • Production deployment of GGUF models inside MedCare-rs services — downstream consumer concern.
  • Multi-batch / multi-stream inference — single-stream greedy decode only here; batching is a follow-up.
  • Sampling parity (top-p, temperature, mirostat) — out of scope; we compare logits, not sampled tokens, to keep the assertion deterministic. Sampling is a downstream concern.
  • Larger models (70B, MoE) — they don't fit in CI smoke; covered in a future "heavy parity" issue if needed.
  • Training/fine-tuning paths — inference only.

Dependencies

  • Blocks on D1 (parity harness) — D1 defines the per-kernel BF16 tolerance bands this issue integrates over. Without D1 we'd be guessing tolerance numbers.
  • Blocks E2 (candle-fork ONNX/ViT/Whisper parity) — E2 mirrors this issue's structure for the discriminative side of the stack; landing E1 first proves the pattern.

Preliminary exploration (already done)

  • crates/burn/Cargo.toml confirms burn-ndarray backend depends on ndarray = { path = "../.." } and pins upstream burn at AdaWorldAPI/burn rev 9b2b671. ndarray spine wiring is real.
  • No GGUF test scaffolding exists in crates/burn/tests/ or top-level tests/ yet — this issue is greenfield from the test-fixture standpoint.
  • No existing gguf / llama references in repo source tree — confirms this validation is genuinely new work, not a duplicate of an existing test.
