
[burn-fork] End-to-end GGUF inference parity: Llama 3 8B + Qwen 2.5 7B vs llama.cpp #135

@AdaWorldAPI

Description


Worker: E1 (ensemble: model frameworks; integral validator of the spine pattern)
Mnemonic IDs blocking this: D1 (parity harness — provides per-kernel baseline)
Mnemonic IDs blocked by this: E2 (candle-fork ONNX/ViT/Whisper parity follows the same shape)

Why

crates/burn/ in this repo vendors AdaWorldAPI/burn (pinned at rev 9b2b671, see crates/burn/Cargo.toml) and rewires the burn-ndarray backend through ndarray = { path = "../.." } — gemm, BF16 matmul, AMX dispatch all come from this repo's SIMD + HPC stack now. The standing claim is: burn-fork can reliably load and execute any GGUF, because everything beneath it is the spine.

That claim is currently asserted at the per-kernel level (D1 parity harness) but never integrated end-to-end. Per-kernel parity is necessary but not sufficient — kernel composition, KV-cache layout, RoPE numerics, attention masking, and quantization dequant paths all interact, and BF16 mantissa-exact at one kernel does not guarantee BF16-tolerant logits at token N=50.

This issue is the integral of D1: full-stack GGUF inference, run identically on burn-fork (using ndarray kernels) and on llama.cpp reference, top-N logits compared at every token position. If the parity holds, the spine pattern is validated for real production models. If not, the BF16 mantissa-exact claim has a hole and we know exactly where the model-level numerics diverge from per-kernel numerics.

Clinical relevance. GGUF generative models (Llama 3, Qwen 2.5/3.5) are the generative-reasoning side of the MedCare-rs stack: anamnesis summarization, patient-facing translation (incl. German), clinical reasoning chains. They must produce stable, reproducible outputs across SPR-AMX cloud and NEON-Pi edge. A logit drift at token 30 of a 50-token discharge summary is a clinical safety incident, not a numerics curiosity.

What

A nightly CI job that runs identical prompts through burn-fork (via the ndarray spine) and the pinned llama.cpp reference, and asserts top-5 logit parity within a documented BF16 tolerance band: pinned weights, pinned reference build, pinned ISA paths.

Test models (CI smoke + production-representative)

Model         Quantizations   Approx size (Q4_K_M / Q8_0)
Llama 3 8B    Q4_K_M, Q8_0    ~4.7 GB / ~8.5 GB
Qwen 2.5 7B   Q4_K_M, Q8_0    ~4.4 GB / ~7.6 GB

Two quantizations per model exercise different precision regimes — Q4_K_M (k-quant block-scaled, lossy, common in deployment) and Q8_0 (8-bit, near-lossless, the parity ceiling).
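The "parity ceiling" framing can be made concrete. A minimal sketch of Q8_0 dequantization, per the published GGML block layout (32 signed 8-bit quants sharing one scale); the scale is f16 in the real format, widened to f32 here to keep the sketch dependency-free, and all names are illustrative, not this repo's API:

```rust
/// One Q8_0 block: 32 signed 8-bit quants plus one shared scale.
/// (Real GGUF stores the scale as f16; f32 here for a dependency-free sketch.)
struct BlockQ80 {
    d: f32,        // shared scale
    qs: [i8; 32],  // quantized weights
}

/// Dequant is a single multiply per weight: no sub-block mins, no nested
/// scales. This is why Q8_0 is near-lossless and sets the parity ceiling,
/// while Q4_K_M's block-scaled k-quant path is lossier and needs a looser band.
fn dequantize_q8_0(block: &BlockQ80) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(block.qs.iter()) {
        *o = block.d * q as f32;
    }
    out
}

fn main() {
    let block = BlockQ80 { d: 0.5, qs: [2i8; 32] };
    let vals = dequantize_q8_0(&block);
    println!("{}", vals[0]); // 0.5 * 2 = 1
}
```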

Pinning (must be in tests/fixtures/gguf_parity.toml or equivalent)

  • GGUF weights: SHA-256 of each .gguf file, source URL, license note
  • llama.cpp reference: exact upstream commit SHA (e.g. ggerganov/llama.cpp@), build flags, and the llama-cli invocation used to produce the reference logits
  • Tokenizer: SHA-256 of the tokenizer.json / merges baked into each GGUF's metadata
  • Prompts: the 10-prompt suite frozen in-repo (no live downloads)
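A possible shape for that fixture; every field name below is a guess pending the design pass, and placeholder values stay placeholders:

```toml
# tests/fixtures/gguf_parity.toml (hypothetical layout; field names TBD)

[[model]]
name             = "llama-3-8b"
quant            = "Q4_K_M"
sha256           = "<sha256 of the .gguf>"
source           = "<pinned download URL>"
license          = "<license note>"
tokenizer_sha256 = "<sha256 of tokenizer metadata baked into the GGUF>"

[reference.llama_cpp]
commit      = "<exact upstream commit SHA>"
build_flags = "<flags used for the reference build>"
invocation  = "<llama-cli command that produced the reference logits>"

[tolerance]
# Numbers traceable to D1's per-kernel baseline; Q4_K_M looser than Q8_0.
q4_k_m_max_logit_delta = 0.0  # placeholder
q8_0_max_logit_delta   = 0.0  # placeholder
```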

Test suite (10 prompts × 50 generated tokens × top-5 logits per position)

Bucket           Prompts   Why
Short factual    2         "Capital of France?" — fast smoke, deterministic
Long generative  2         200-token continuation — accumulates drift
Code             2         Rust + Python snippet completion — different token distribution
Multilingual     2         German + one non-Latin (e.g. Mandarin or Arabic) — clinical-translation surface
Reasoning        2         Short chain-of-thought — exposes attention-mask + KV-cache bugs

For each (model, quant, prompt, position), record the top-5 token IDs and their logit values from both engines, then assert:

  • top-1 token ID matches exactly (greedy stability)
  • top-5 token ID set matches (rank-stability allowed within set)
  • per-token logit deltas within BF16 tolerance band from D1, per quantization — Q4_K_M will be looser than Q8_0; tolerances are documented, not guessed
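The three assertions above can be sketched as a single check per position; a minimal version, assuming each engine reports its top-5 sorted by logit (types and names are illustrative, and the tolerance comes from the D1-derived fixture, not from code):

```rust
use std::collections::HashSet;

/// Top-5 at one token position from one engine: (token_id, logit),
/// sorted descending by logit.
type Top5 = [(u32, f32); 5];

fn assert_parity(burn: &Top5, llamacpp: &Top5, tol: f32) -> Result<(), String> {
    // 1. Top-1 token ID matches exactly (greedy stability).
    if burn[0].0 != llamacpp[0].0 {
        return Err(format!("top-1 mismatch: {} vs {}", burn[0].0, llamacpp[0].0));
    }
    // 2. Top-5 token ID *sets* match; rank swaps within the set are allowed.
    let set_a: HashSet<u32> = burn.iter().map(|t| t.0).collect();
    let set_b: HashSet<u32> = llamacpp.iter().map(|t| t.0).collect();
    if set_a != set_b {
        return Err("top-5 set mismatch".into());
    }
    // 3. Per-token logit deltas within the per-quant tolerance band.
    //    The set check above guarantees the lookup succeeds.
    for &(id, la) in burn {
        let lb = llamacpp.iter().find(|t| t.0 == id).unwrap().1;
        if (la - lb).abs() > tol {
            return Err(format!("logit delta {} > {} for token {}", (la - lb).abs(), tol, id));
        }
    }
    Ok(())
}

fn main() {
    let burn: Top5 = [(7, 9.1), (3, 8.0), (9, 6.5), (1, 5.0), (4, 4.2)];
    let llamacpp: Top5 = [(7, 9.05), (3, 8.02), (9, 6.45), (1, 5.01), (4, 4.18)];
    println!("{:?}", assert_parity(&burn, &llamacpp, 0.1)); // Ok(())
}
```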

ISA coverage (per D2's CI matrix)

  • SPR AMX path — explicit, not polyfill. The whole point of the spine is that AMX dispatch comes from ndarray; this is where it must be exercised under a real model.
  • AVX2 path — baseline x86, the floor every cloud runner hits.
  • NEON path (Pi 5 8GB if runner allows) — model size is the binding constraint here; 8B BF16 weights are ~16 GB which does not fit, so NEON coverage is realistically Q4_K_M only and may need a smaller model substituted (e.g. Llama 3.2 3B Q4_K_M) as a proxy. Decision needed at design pass.
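The binding constraint behind the NEON decision is back-of-envelope arithmetic; a sketch, where 8B parameters and ~4.8 bits/weight for Q4_K_M are approximations (and KV cache plus runtime overhead are ignored), not measured file sizes:

```rust
// Weight-memory estimate: bytes ≈ params × bits_per_weight / 8,
// reported in GiB. Ignores KV cache and runtime overhead.
fn approx_model_gib(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0 / 1024f64.powi(3)
}

fn main() {
    // BF16 = 16 bits/weight: ~14.9 GiB for 8B params, over a Pi 5's 8 GB.
    println!("{:.1}", approx_model_gib(8_000_000_000, 16.0));
    // Q4_K_M averages roughly 4.8 bits/weight: ~4.5 GiB, fits with headroom.
    println!("{:.1}", approx_model_gib(8_000_000_000, 4.8));
}
```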

Architecture

        ┌── 10-prompt suite (pinned) ──┐
        │                              │
        ▼                              ▼
┌──────────────────┐         ┌──────────────────┐
│   burn-fork      │         │   llama.cpp      │
│   (this repo's   │         │   (pinned SHA)   │
│   crates/burn)   │         │                  │
│        │         │         │        │         │
│        ▼         │         │        ▼         │
│  burn-ndarray    │         │   ggml kernels   │
│  backend         │         │   (reference)    │
│        │         │         └────────┬─────────┘
│        ▼         │                  │
│  ndarray spine   │                  │
│  (gemm, BF16     │                  │
│   matmul, AMX,   │                  │
│   AVX2, NEON)    │                  │
└────────┬─────────┘                  │
         │ top-5 logits @ every pos   │ top-5 logits @ every pos
         └─────────────┬──────────────┘
                       ▼
        ┌──────────────────────────────┐
        │  parity assert harness       │
        │  (BF16 tolerance per quant)  │
        │  ── from D1                  │
        └──────────────────────────────┘

The point: burn-fork is the validation surface, ndarray is what's actually being validated. burn-fork being correct end-to-end is evidence the spine substitution worked.

Acceptance criteria

  • tests/gguf_parity/ (or crates/burn/tests/gguf_parity/) scaffolded with the 10-prompt fixture, pinned-weights manifest, pinned llama.cpp commit SHA recorded in a LLAMA_CPP_REF constant or fixture file.
  • Llama 3 8B Q4_K_M — top-5 logit parity vs llama.cpp on the 10-prompt suite × 50 tokens, within Q4_K_M BF16 tolerance band, on at least SPR AMX + AVX2 + (NEON or NEON-proxy) paths.
  • Llama 3 8B Q8_0 — same, with tighter Q8_0 tolerance band.
  • Qwen 2.5 7B Q4_K_M — same as Llama Q4_K_M.
  • Qwen 2.5 7B Q8_0 — same as Llama Q8_0.
  • Per-quantization tolerance bands documented in fixture (Q4_K_M looser than Q8_0; numbers traceable to D1's per-kernel baseline).
  • Reproducible: runs in CI nightly (not per-PR — the weights are too heavy), pinned weights/llama.cpp/tokenizer SHAs, deterministic seeds, no live network dependency at test time (weights pre-staged on the runner).
  • On failure, the harness emits (model, quant, prompt_idx, token_idx, top5_burn, top5_llamacpp, delta) so divergence is greppable, not just a red X.
  • CI matrix entry for SPR AMX is explicit — not silently routed through AVX2 polyfill (per D2's ISA-coverage requirement).
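The failure-record bullet above can be sketched as a one-line-per-divergence emitter, so a red run is greppable by model, quant, or token position; struct and field names are illustrative, mirroring the tuple in the bullet:

```rust
/// One divergence record, matching the tuple named in the acceptance criteria.
struct Divergence {
    model: &'static str,
    quant: &'static str,
    prompt_idx: usize,
    token_idx: usize,
    top5_burn: [u32; 5],      // token IDs from burn-fork
    top5_llamacpp: [u32; 5],  // token IDs from the llama.cpp reference
    delta: f32,               // worst per-token logit delta at this position
}

impl std::fmt::Display for Divergence {
    // One key=value line per divergence: greppable, not just a red X.
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(
            f,
            "PARITY_FAIL model={} quant={} prompt={} token={} burn={:?} llamacpp={:?} delta={:.4}",
            self.model, self.quant, self.prompt_idx, self.token_idx,
            self.top5_burn, self.top5_llamacpp, self.delta
        )
    }
}

fn main() {
    let d = Divergence {
        model: "llama-3-8b", quant: "Q4_K_M",
        prompt_idx: 3, token_idx: 30,
        top5_burn: [11, 42, 7, 99, 5],
        top5_llamacpp: [42, 11, 7, 99, 5],
        delta: 0.0312,
    };
    println!("{d}");
}
```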

Out of scope

  • candle-fork (ONNX/ViT/Whisper) — that's E2; same shape of issue, different framework, will be filed separately.
  • Production deployment of GGUF models inside MedCare-rs services — downstream consumer concern.
  • Multi-batch / multi-stream inference — single-stream greedy decode only here; batching is a follow-up.
  • Sampling parity (top-p, temperature, mirostat) — out of scope; we compare logits, not sampled tokens, to keep the assertion deterministic. Sampling is a downstream concern.
  • Larger models (70B, MoE) — they don't fit in CI smoke; covered in a future "heavy parity" issue if needed.
  • Training/fine-tuning paths — inference only.

Dependencies

  • Blocks on D1 (parity harness) — D1 defines the per-kernel BF16 tolerance bands this issue integrates over. Without D1 we'd be guessing tolerance numbers.
  • Blocks E2 (candle-fork ONNX/ViT/Whisper parity) — E2 mirrors this issue's structure for the discriminative side of the stack; landing E1 first proves the pattern.

Preliminary exploration (already done)

  • crates/burn/Cargo.toml confirms burn-ndarray backend depends on ndarray = { path = "../.." } and pins upstream burn at AdaWorldAPI/burn rev 9b2b671. ndarray spine wiring is real.
  • No GGUF test scaffolding exists in crates/burn/tests/ or top-level tests/ yet — this issue is greenfield from the test-fixture standpoint.
  • No existing gguf / llama references in repo source tree — confirms this validation is genuinely new work, not a duplicate of an existing test.
