Benchmark: big brain in small body — compressed models vs same-VRAM natives #627

@joelteply

Description

airc-queue card

Coordinates work via the AIRC queue substrate (airc#562). Edit this card by commenting OR by running `airc queue claim`, `airc queue release`, or `airc queue heartbeat` (later PRs).

```json
{
  "kind": "airc-queue-card-v1",
  "id": "#627",
  "owner": "claude-tab-1",
  "status": "claimed",
  "evidence": "bulk-adopted from continuum unlabeled-open backlog (Joel directive 2026-05-14)",
  "next_action": "triage scope + assign or close stale"
}
```

Close this issue when the work is done (status=merged/abandoned).
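
A minimal sketch, not the actual AIRC tooling, of how an agent could sanity-check a card like the one above before acting on it. The accepted status set is an assumption inferred from this card plus the close condition, and `parse_card` is a hypothetical helper:

```python
import json

# Status set inferred from this card ("claimed") and the close condition
# (status=merged/abandoned); "released" is assumed from `airc queue release`.
# Not an official airc-queue-card-v1 schema.
VALID_STATUSES = {"claimed", "released", "merged", "abandoned"}

def parse_card(raw: str) -> dict:
    """Parse and sanity-check an airc-queue-card-v1 JSON block."""
    card = json.loads(raw)
    if card.get("kind") != "airc-queue-card-v1":
        raise ValueError(f"unexpected card kind: {card.get('kind')!r}")
    if card.get("status") not in VALID_STATUSES:
        raise ValueError(f"unexpected status: {card.get('status')!r}")
    return card

card = parse_card('{"kind": "airc-queue-card-v1", "id": "#627",'
                  ' "owner": "claude-tab-1", "status": "claimed"}')
print(card["id"], card["owner"])  # -> #627 claude-tab-1
```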

Original issue body (pre-adoption)

The Headline Comparison

Developers care about VRAM, not parameter counts. Our compressed models should beat native models at the same VRAM tier.

Eval Matrix

| Our Model | VRAM | Compare Against | Their VRAM | Why |
| --- | --- | --- | --- | --- |
| 27B forged GGUF Q4 | ~10GB | Qwen3.5-7B, CodeLlama-7B | ~4-7GB | 27B brain in 7B body |
| 35B-A3B (16 experts) | ~3GB | Any 3B model | ~2-3GB | 35B patterns in 3B |
| 14B compacted GGUF Q4 | ~5GB | Qwen3.5-7B | ~4-7GB | Pruned 14B vs native 7B |
| 4B forged GGUF Q4 | 2.6GB | Qwen2.5-Coder-1.5B | ~1-2GB | Already done: 53% HumanEval |
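
The VRAM column can be sanity-checked with a back-of-envelope rule (my assumption, not a project formula): resident weight memory is roughly effective parameter count × bits per weight / 8, plus overhead for KV cache and runtime buffers. Forged/compacted rows presumably plug in their post-pruning parameter count, which is why they sit below a naive full-size estimate.

```python
# Back-of-envelope VRAM estimate (assumption, not measured): weight bytes
# plus a flat overhead term for KV cache / runtime buffers.
def est_vram_gb(effective_params_b: float,
                bits_per_weight: float = 4.5,  # ~Q4_K_M average, assumed
                overhead_gb: float = 0.5) -> float:
    return effective_params_b * bits_per_weight / 8 + overhead_gb

# Sanity check against the 4B row (table: 2.6GB):
print(f"{est_vram_gb(4.0):.1f} GB")  # -> 2.8 GB, right ballpark
```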

Method

- All HumanEval via EvalPlus, greedy decoding. Save `.jsonl` proof files.
- GGUF models eval via llama-cpp-python (~10 min each; sketch after this list).
- fp16 models eval via HF transformers (~24 hr each for 14B+).
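
A minimal sketch of the GGUF path, assuming llama-cpp-python's `Llama` API and EvalPlus's `get_human_eval_plus`/`write_jsonl` helpers; the model path, context size, and stop strings are illustrative, not project config:

```python
from evalplus.data import get_human_eval_plus, write_jsonl
from llama_cpp import Llama

llm = Llama(
    model_path="models/27b-forged-q4_k_m.gguf",  # hypothetical path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to GPU
    verbose=False,
)

samples = []
for task_id, task in get_human_eval_plus().items():
    out = llm(
        task["prompt"],
        max_tokens=512,
        temperature=0.0,  # greedy decoding, per the method above
        stop=["\ndef ", "\nclass ", "\nif __name__"],  # assumed stop strings
    )
    # "completion" is appended to the prompt by the evaluator
    samples.append({"task_id": task_id,
                    "completion": out["choices"][0]["text"]})

write_jsonl("proofs/27b-forged-q4.jsonl", samples)  # the .jsonl proof file
```

Score the proof file afterwards with EvalPlus's evaluator (e.g. `evalplus.evaluate --dataset humaneval --samples proofs/27b-forged-q4.jsonl`).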

Priority

  1. 27B forged GGUF vs 7B native (THE headline)
  2. 35B-A3B vs 3B native (MoE surgery headline)
  3. Controls (base models unforged)

