Benchmark: big brain in small body — compressed models vs same-VRAM natives #627

@joelteply

Description

airc-queue card

Coordinates work via the AIRC queue substrate (airc#562). Edit this card by commenting OR by running `airc queue claim`, `airc queue release`, or `airc queue heartbeat` (later PRs).

```json
{
  "kind": "airc-queue-card-v1",
  "id": "#627",
  "owner": "claude-tab-1",
  "status": "claimed",
  "evidence": "bulk-adopted from continuum unlabeled-open backlog (Joel directive 2026-05-14)",
  "next_action": "triage scope + assign or close stale"
}
```

Close this issue when the work is done (status=merged/abandoned).
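
A minimal sketch, not the actual AIRC tooling, of how an agent could sanity-check a card like the one above before acting on it. The accepted status set is an assumption inferred from this card plus the close condition, and `parse_card` is a hypothetical helper:

```python
import json

# Status set inferred from this card ("claimed") and the close condition
# (status=merged/abandoned); "released" is assumed from `airc queue release`.
# Not an official airc-queue-card-v1 schema.
VALID_STATUSES = {"claimed", "released", "merged", "abandoned"}

def parse_card(raw: str) -> dict:
    """Parse and sanity-check an airc-queue-card-v1 JSON block."""
    card = json.loads(raw)
    if card.get("kind") != "airc-queue-card-v1":
        raise ValueError(f"unexpected card kind: {card.get('kind')!r}")
    if card.get("status") not in VALID_STATUSES:
        raise ValueError(f"unexpected status: {card.get('status')!r}")
    return card

card = parse_card('{"kind": "airc-queue-card-v1", "id": "#627",'
                  ' "owner": "claude-tab-1", "status": "claimed"}')
print(card["id"], card["owner"])  # -> #627 claude-tab-1
```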

Original issue body (pre-adoption)

The Headline Comparison

Developers care about VRAM, not parameter counts. Our compressed models should beat native models at the same VRAM tier.

Eval Matrix

| Our Model | VRAM | Compare Against | Their VRAM | Why |
| --- | --- | --- | --- | --- |
| 27B forged GGUF Q4 | ~10GB | Qwen3.5-7B, CodeLlama-7B | ~4-7GB | 27B brain in 7B body |
| 35B-A3B (16 experts) | ~3GB | Any 3B model | ~2-3GB | 35B patterns in 3B |
| 14B compacted GGUF Q4 | ~5GB | Qwen3.5-7B | ~4-7GB | Pruned 14B vs native 7B |
| 4B forged GGUF Q4 | 2.6GB | Qwen2.5-Coder-1.5B | ~1-2GB | Already done: 53% HumanEval |
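
The VRAM column can be sanity-checked with a back-of-envelope rule (my assumption, not a project formula): resident weight memory is roughly effective parameter count × bits per weight / 8, plus overhead for KV cache and runtime buffers. Forged/compacted rows presumably plug in their post-pruning parameter count, which is why they sit below a naive full-size estimate.

```python
# Back-of-envelope VRAM estimate (assumption, not measured): weight bytes
# plus a flat overhead term for KV cache / runtime buffers.
def est_vram_gb(effective_params_b: float,
                bits_per_weight: float = 4.5,  # ~Q4_K_M average, assumed
                overhead_gb: float = 0.5) -> float:
    return effective_params_b * bits_per_weight / 8 + overhead_gb

# Sanity check against the 4B row (table: 2.6GB):
print(f"{est_vram_gb(4.0):.1f} GB")  # -> 2.8 GB, right ballpark
```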

Method

- All HumanEval via EvalPlus, greedy decoding. Save `.jsonl` proof files.
- GGUF models eval via llama-cpp-python (~10 min each; sketch after this list).
- fp16 models eval via HF transformers (~24 hr each for 14B+).
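
A minimal sketch of the GGUF path, assuming llama-cpp-python's `Llama` API and EvalPlus's `get_human_eval_plus`/`write_jsonl` helpers; the model path, context size, and stop strings are illustrative, not project config:

```python
from evalplus.data import get_human_eval_plus, write_jsonl
from llama_cpp import Llama

llm = Llama(
    model_path="models/27b-forged-q4_k_m.gguf",  # hypothetical path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to GPU
    verbose=False,
)

samples = []
for task_id, task in get_human_eval_plus().items():
    out = llm(
        task["prompt"],
        max_tokens=512,
        temperature=0.0,  # greedy decoding, per the method above
        stop=["\ndef ", "\nclass ", "\nif __name__"],  # assumed stop strings
    )
    # "completion" is appended to the prompt by the evaluator
    samples.append({"task_id": task_id,
                    "completion": out["choices"][0]["text"]})

write_jsonl("proofs/27b-forged-q4.jsonl", samples)  # the .jsonl proof file
```

Score the proof file afterwards with EvalPlus's evaluator (e.g. `evalplus.evaluate --dataset humaneval --samples proofs/27b-forged-q4.jsonl`).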

Priority

  1. 27B forged GGUF vs 7B native (THE headline)
  2. 35B-A3B vs 3B native (MoE surgery headline)
  3. Controls (base models unforged)

