
feat(grid-inference-routing): PR-1 InferenceCapability + probe + NodeCapabilityRegistry#1315

Merged
joelteply merged 1 commit into canary from feat/grid-inference-routing-pr1-capability
May 16, 2026
Conversation

@joelteply
Contributor

Summary

GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16T15:03Z).

Pure-functions slice — types + derivation + in-memory registry. No grid wiring, no IPC, no async. Sets up the local-side capability surface that PR-2 / PR-3 / PR-4 stack on:

  • PR-2 (claude-tab-1): GridCapabilityAnnouncer + tailscale broadcast + peer registry sync
  • PR-3 (codex): GridInferenceRouter + supervisor pressure integration
  • PR-4 (vhsm-d1f4): bidirectional streaming over grid transport + failure-mode contract

Same pattern as rate_proposals / generate_recipe PR-1 — isolate pure data + derivation first so PR-2 stacks on a stable, test-covered shape.

What ships

inference_capability::types (ts-rs camelCase exports)

  • InferenceKind(String) — NOT a const enum, per the no-hardcoded-enums rule. Backends register dynamically; new ones (tflite, mlx, candle-vulkan) plug in without a schema change.
  • LatencyClass — Local|Fast|Mesh|Wan; serializes lowercase, totally ordered for PR-3 scoring.
  • HardwareProfile — platform, has_metal, has_cuda, has_vulkan, free_vram_bytes, total_vram_bytes, cpu_cores, system_ram_bytes.
  • InferenceCapability — kind, free_vram_bytes, current_lease_count, latency_class.
  • NodeCapability — full advertisement: node_id, hardware, capabilities[], last_updated_ms.
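
A minimal std-only sketch of these shapes, for orientation (the real module adds serde + ts-rs derives and camelCase renaming, which are omitted here; field layout follows the bullet list above but may differ from the shipped code):

```rust
// Hypothetical sketch of the wire types described above. The real module
// derives serde/ts-rs; this version is plain Rust for illustration.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct InferenceKind(String); // open newtype, not a closed enum

// Totally ordered: Local < Fast < Mesh < Wan, so lower = closer for scoring.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LatencyClass { Local, Fast, Mesh, Wan }

#[derive(Clone, Debug)]
struct HardwareProfile {
    platform: String,
    has_metal: bool,
    has_cuda: bool,
    has_vulkan: bool,
    free_vram_bytes: u64,
    total_vram_bytes: u64,
    cpu_cores: u32,
    system_ram_bytes: u64,
}

#[derive(Clone, Debug)]
struct InferenceCapability {
    kind: InferenceKind,
    free_vram_bytes: u64,
    current_lease_count: u32,
    latency_class: LatencyClass,
}

fn main() {
    assert!(LatencyClass::Local < LatencyClass::Wan);
    let k = InferenceKind("llamacpp".to_string());
    let cap = InferenceCapability {
        kind: k.clone(),
        free_vram_bytes: 8 << 30,
        current_lease_count: 0,
        latency_class: LatencyClass::Local,
    };
    assert_eq!(k.0, "llamacpp");
    assert_eq!(cap.current_lease_count, 0);
}
```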

Exports to shared/generated/inference_capability/ (not grid/) — keeps the new module's wire types in their own namespace and avoids collision with the existing grid::node::NodeCapability discriminated union.

inference_capability::probe (pure function)

pub fn probe_inference_capabilities(hw: &HardwareProfile) -> Vec<InferenceCapability>

No IO, no globals, no syscalls. Decisions encoded:

  • llamacpp + candle: require Metal OR CUDA (native GPU path)
  • ort-vision / ort-tts / ort-stt / ort-embedding: accept any GPU (Metal, CUDA, or Vulkan via ORT execution providers)
  • MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor — below that, advertise nothing (deadhead-don't-fail policy per vhsm-d1f4 audit pass 1)
  • CPU-only nodes advertise zero capabilities (the no_cpu_fallback contract at the capability layer)
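
The decision rules above can be sketched as a pure function. This is an illustration, not the shipped probe: capability kinds are simplified to strings, and the real function returns Vec<InferenceCapability>:

```rust
// Hedged sketch of the probe's decision rules as stated in the PR text.
const MIN_GPU_INFERENCE_VRAM_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GiB floor

struct HardwareProfile {
    has_metal: bool,
    has_cuda: bool,
    has_vulkan: bool,
    free_vram_bytes: u64,
}

fn probe_inference_capabilities(hw: &HardwareProfile) -> Vec<String> {
    let mut caps = Vec::new();
    // Below the floor: advertise nothing (deadhead-don't-fail, no CPU fallback).
    if hw.free_vram_bytes < MIN_GPU_INFERENCE_VRAM_BYTES {
        return caps;
    }
    let native_gpu = hw.has_metal || hw.has_cuda; // llamacpp + candle path
    let any_gpu = native_gpu || hw.has_vulkan;    // ORT execution-provider path
    if native_gpu {
        caps.push("llamacpp".to_string());
        caps.push("candle".to_string());
    }
    if any_gpu {
        for k in ["ort-vision", "ort-tts", "ort-stt", "ort-embedding"] {
            caps.push(k.to_string());
        }
    }
    caps
}

fn main() {
    let vulkan_only = HardwareProfile {
        has_metal: false, has_cuda: false, has_vulkan: true, free_vram_bytes: 8 << 30,
    };
    let caps = probe_inference_capabilities(&vulkan_only);
    assert!(!caps.contains(&"llamacpp".to_string())); // native path needs Metal/CUDA
    assert!(caps.contains(&"ort-vision".to_string())); // ORT accepts Vulkan
    let cpu_only = HardwareProfile {
        has_metal: false, has_cuda: false, has_vulkan: false, free_vram_bytes: 0,
    };
    assert!(probe_inference_capabilities(&cpu_only).is_empty()); // no CPU fallback
}
```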

inference_capability::registry (NodeCapabilityRegistry)

In-memory HashMap<String, NodeCapability> with upsert, get, remove, list, find_capable(kind, min_vram), and evict_stale(cutoff_ms). Sync and single-threaded — PR-2 wraps it in parking_lot::RwLock when wiring the announcer.
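
A minimal sketch of that registry surface, assuming simplified field shapes (method names follow the PR text; the real NodeCapability carries the full advertisement):

```rust
use std::collections::HashMap;

// Illustrative, simplified NodeCapability — the real one carries
// hardware + capabilities[] per the types section.
#[derive(Clone)]
struct NodeCapability {
    node_id: String,
    kinds: Vec<String>,
    free_vram_bytes: u64,
    last_updated_ms: u64,
}

#[derive(Default)]
struct NodeCapabilityRegistry {
    nodes: HashMap<String, NodeCapability>,
}

impl NodeCapabilityRegistry {
    fn upsert(&mut self, cap: NodeCapability) {
        self.nodes.insert(cap.node_id.clone(), cap);
    }
    // Inclusive VRAM boundary: a node with exactly min_vram qualifies.
    fn find_capable(&self, kind: &str, min_vram: u64) -> Vec<&NodeCapability> {
        self.nodes.values()
            .filter(|n| n.kinds.iter().any(|k| k == kind) && n.free_vram_bytes >= min_vram)
            .collect()
    }
    // Inclusive at-cutoff, per the test plan: an entry updated exactly at
    // cutoff_ms is evicted.
    fn evict_stale(&mut self, cutoff_ms: u64) {
        self.nodes.retain(|_, n| n.last_updated_ms > cutoff_ms);
    }
}

fn main() {
    let mut reg = NodeCapabilityRegistry::default();
    reg.upsert(NodeCapability {
        node_id: "m5-pro".into(),
        kinds: vec!["llamacpp".into()],
        free_vram_bytes: 32 << 30,
        last_updated_ms: 100,
    });
    assert_eq!(reg.find_capable("llamacpp", 16 << 30).len(), 1);
    assert!(reg.find_capable("llamacpp", 64 << 30).is_empty()); // VRAM filter
    reg.evict_stale(100); // at-cutoff entry is evicted
    assert!(reg.nodes.is_empty());
}
```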

Failure-mode discipline (non-negotiable per audit pass 1 + 6)

  • No CPU fallback: generic_dell_no_gpu_advertises_nothing pins the contract.
  • No hardcoded enums: InferenceKind(String) newtype.
  • No silent unwrap_or / fallback defaults: every field carries explicit data.
  • No per-module concurrency cap: the registry is data-only; concurrency belongs in the upcoming Phase 0 base-trait work, not here.

Test plan

  • cargo test --lib --features metal,accelerate inference_capability:: — 43 passing, 0 failing.
  • cargo check --lib --features metal,accelerate clean (51 pre-existing warnings unrelated to this PR).
  • Hardware tiers covered: MacBook Air M2 (5GB free VRAM), M5 Pro (32GB), Blackwell RTX 5090 (28GB CUDA), generic Dell (no GPU), AMD with Vulkan-only.
  • Edge cases: below VRAM floor on Metal AND on Vulkan, CPU-only host with huge system RAM, exact-VRAM-boundary find_capable, exact-cutoff evict_stale, multi-capability per node, dynamic unknown InferenceKind handling.
  • Serde round-trips for every wire type with lowercase + camelCase verification.
  • ts-rs export tests generate 5 TypeScript bindings + barrel index.ts.
  • Live-deploy verification skipped — pure Rust data + derivation, no chat-path effect. PR-3 / PR-4 will VDD the actual inference dispatch.

VDD evidence

N/A for this slice — no inference dispatch, no tokens/s. The covenant evidence (latency, tok/s, before/after) lands with PR-3 (router decides remote vs local) and PR-4 (streaming over grid transport).

What PR-1 enables: PR-2 announces this node's capabilities to the mesh, PR-3 uses them to route inference jobs to whichever node is fastest/least-loaded, PR-4 streams tokens back. Net covenant target: vector search <50ms, RAG <500ms, voice <3s, persona tick <1ms — distributed across the mesh instead of single-node bottlenecked.

Stack

This is PR-1. PR-2 / PR-3 / PR-4 will branch off this once merged.

Refer to vhsm-d1f4's airc broadcast (2026-05-16T15:03:27Z) for the full design intent + per-PR scope.

…CapabilityRegistry

GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16).
Pure-functions slice — types + derivation + in-memory registry. No grid
wiring, no IPC, no async. PR-2 (claude-tab-1) will stack the
GridCapabilityAnnouncer + tailscale broadcast on top; PR-3 (codex) the
GridInferenceRouter; PR-4 (vhsm-d1f4) bidirectional streaming.

Why this layer first: the rate_proposals / generate_recipe PR-1 cadence
landed faster + safer by isolating data + pure derivation from any
async wiring. PR-2 stacks on a stable shape that's already test-covered.

What ships:

- `inference_capability::types` — wire shape (ts-rs camelCase exports
  to shared/generated/inference_capability/): InferenceKind(String)
  newtype (NOT a const enum — backends register dynamically per the
  no-hardcoded-enums rule), LatencyClass (Local/Fast/Mesh/Wan,
  serialize lowercase, ordered), HardwareProfile, InferenceCapability,
  NodeCapability.

- `inference_capability::probe` — pure function
  `probe_inference_capabilities(hw) -> Vec<InferenceCapability>` that
  derives the capability list from a HardwareProfile. No IO, no
  globals. llamacpp + candle require Metal OR CUDA (native GPU path);
  ort-vision/tts/stt/embedding accept any GPU (Metal/CUDA/Vulkan via
  ORT execution providers). MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor
  — below that, advertise nothing (deadhead-don't-fail policy per
  vhsm-d1f4 audit pass 1).

- `inference_capability::registry` — `NodeCapabilityRegistry` in-memory
  map of node_id -> NodeCapability with upsert/get/remove/list/
  find_capable/evict_stale. Sync, single-threaded — PR-2 wraps in
  parking_lot::RwLock when wiring the announcer.

Failure-mode discipline (non-negotiable per audit pass 1 + 6):

- No CPU fallback: generic_dell_no_gpu_advertises_nothing test pins
  the contract — a CPU-only node returns ZERO capabilities, not "fall
  back to slow CPU."
- No hardcoded enums: InferenceKind is a String newtype; new backends
  (tflite, mlx, candle-vulkan) plug in without a schema change.
- No silent unwrap_or: every field carries explicit data.

Tests: 43 passing on cargo test --lib --features metal,accelerate
inference_capability::

- types (9): kinds const wire-string pin, InferenceKind hashable,
  serde round-trips (string + camelcase), LatencyClass lowercase +
  ordering, NodeCapability full advertisement.

- probe (14): MacBook Air M2 / M5 Pro / Blackwell / generic Dell /
  AMD Vulkan-only — all four hardware tiers vhsm-d1f4 named. Plus
  below-VRAM-floor edge case (Metal AND Vulkan), CPU-only with huge
  RAM still advertises nothing, free-VRAM agreement across both
  native + ORT branches, deterministic ordering, propagation.

- registry (15): upsert/get/remove/list CRUD, find_capable with
  kind + VRAM filter (inclusive boundary), evict_stale with cutoff
  semantics (inclusive at-cutoff), multi-capability per node,
  dynamic unknown kind handling, empty-state, clear-via-remove.

- ts-rs exports (5): InferenceKind + LatencyClass + HardwareProfile +
  InferenceCapability + NodeCapability barrel generated to
  shared/generated/inference_capability/.

Cargo check clean on --features metal,accelerate (51 pre-existing
warnings unrelated to this PR).

No VDD tok/s claim — this PR is pure data + zero inference dispatch.
The tok/s evidence will land with PR-3 (router) + PR-4 (streaming).
joelteply added a commit that referenced this pull request May 16, 2026
…types (PR-1) (#1321)

Pure-types slice of CBAR-SUBSTRATE missing piece 2 (claimed 15:38Z,
unblocked by #1319). Adds the typed wire shape `ServiceModule` will
adopt in PR-2 (Optional fields on ModuleConfig + default
`on_artifact_available` method) and the runtime will dispatch on in
PR-3 (artifact event delivery on cadence).

Same cadence as rate_proposals / generate_recipe PR-1: pure data layer
lands independently mergeable, with full test coverage, before any
runtime wiring. PR-2 stacks the trait extension on this; PR-3 wires
the dispatcher.

Types

- `ArtifactKey(String)` — newtype, transparent serde, no closed enum.
  Modules register their own kinds at boot per CLAUDE.md anti-pattern
  rules + Joel's "we do not hardcode" directive. Same shape as
  `inference_capability::InferenceKind` (codex's #1315 PR-1).
- `ArtifactSelector::{Exact, Prefix}` — what a subscriber wants. Exact
  string match + string-prefix only. Glob/regex deliberately omitted
  — the matcher is the runtime's hot path (walked every publish);
  string-prefix is cheap + covers the cases we have.
- `Cadence::{Periodic, EventDriven, OnArtifact, Mixed}` — supervised
  wake policy. interval_ms over the wire so TS doesn't deal with bigint
  Duration. No `Default` impl, no `OnDemand` variant — broker/supervisor
  decides cadence per the dynamic-hardware-detect rule, every
  registered module has an explicit policy.
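
The selector's hot-path semantics can be sketched as follows; this is an illustration (the `matches` method name and derives are assumptions, but the Exact/Prefix behavior mirrors the bullet above):

```rust
// Illustrative sketch: Exact is a full string match, Prefix a cheap
// starts_with — glob/regex deliberately omitted per the PR text.
enum ArtifactSelector {
    Exact(String),
    Prefix(String),
}

impl ArtifactSelector {
    // Hypothetical matcher name; the real API may differ.
    fn matches(&self, key: &str) -> bool {
        match self {
            ArtifactSelector::Exact(s) => s == key,
            ArtifactSelector::Prefix(p) => key.starts_with(p),
        }
    }
}

fn main() {
    let exact = ArtifactSelector::Exact("persona/turn".into());
    assert!(exact.matches("persona/turn"));
    assert!(!exact.matches("persona/turn/frame")); // Exact does not prefix-match
    let prefix = ArtifactSelector::Prefix("persona/".into());
    assert!(prefix.matches("persona/turn"));
    // Degenerate case: an empty prefix matches everything.
    assert!(ArtifactSelector::Prefix(String::new()).matches("anything"));
}
```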

What this PR is NOT

- No `ServiceModule` trait changes yet (PR-2)
- No `ModuleConfig` field additions yet (PR-2 — Optional so existing
  modules don't break; opt-in)
- No runtime dispatch wiring (PR-3)

12/12 unit tests (cargo test --features metal,accelerate
runtime::artifact_handle). ts-rs exports verified to
shared/generated/runtime/{ArtifactKey,ArtifactSelector,Cadence}.ts.
Test focus: serde wire shape (transparent / internally-tagged),
selector hot-path semantics (Exact doesn't prefix-match,
Prefix handles empty + degenerate cases), Cadence projection
(tick_interval returns None for non-periodic, wants_artifact_wakes
covers the right variants), full roundtrip every variant.

Stacked under codex's Lane D claim (PersonaTurnFrame proof, 19:23Z)
and airc-8a5e's CBAR-PIECE-5 signal. All three slices independent —
PR-2 picks up these types when Phase 0 trunk doc lands; Lane D and
PIECE-5 don't physically depend on PR-2.

Co-authored-by: Test <test@test.com>
@joelteply joelteply merged commit 0a0ef57 into canary May 16, 2026
4 checks passed
@joelteply joelteply deleted the feat/grid-inference-routing-pr1-capability branch May 16, 2026 21:18
joelteply pushed a commit that referenced this pull request May 16, 2026
…nctions slice)

CBAR-SUBSTRATE missing-piece #5 (docs/architecture/CBAR-SUBSTRATE-ARCHITECTURE.md
§336): Qwen GPU residency gate. Stacks on PR #1315 (GRID-INFERENCE-ROUTING PR-1)
inference_capability module — different file, same module surface, same pure-
functions cadence as rate_proposals + generate_recipe + #1315 PR-1s.

#1315's probe answers "does this node have an advertisable GPU at all?" This
gate answers the next question one level deeper: "will the SELECTED MODEL
actually fit with all layers on that GPU, evidenced not guessed?"

Per CBAR-SUBSTRATE spec, before any local-generation turn runs:

- Selected Qwen model named explicitly
- Backend (Metal / CUDA / Vulkan) named + matches platform
- GPU layer count reported
- Unsupported layers enumerated (Vulkan-llama.cpp gaps, etc.)
- VRAM residency estimate covers all layers
- "CPU graph splits or unsupported Qwen layers are blockers unless the
  turn is explicitly degraded with a visible reason."

What ships (pure-functions slice — no GGUF I/O, no dispatch wiring; PR-2
wires the GGUF reader to populate QwenModelMetadata, PR-3 wires the gate
into the actual turn dispatcher with a block-the-turn enforcement point):

- BackendChoice (Metal / Cuda / Vulkan) — lowercase ts-rs export
- QwenModelMetadata — model_name, architecture, layer_count,
  parameter_count_billions, bytes_per_parameter_quantized,
  layer_kinds_needing_check. Pure data populated by future PR-2 GGUF reader
- ResidencyEvidence — typed evidence emitted on Pass; covers every
  CBAR-SUBSTRATE-required field
- ResidencyGateResult — Pass(evidence) | Block { reasons } tagged-union
- BlockReason — NoGpuBackendOnNode | UnsupportedLayer | PartialGpuSplit |
  WrongBackendForPlatform (typed, surfaces specific cause)
- Pure functions: select_backend, check_residency_gate
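
An illustrative shape for the gate's tagged-union result, under the assumption that ResidencyEvidence is elided to a unit here (the real struct carries every CBAR-SUBSTRATE-required field; the `reasons()` helper name follows the test plan):

```rust
// Sketch of the result types named above — variant names follow the PR
// text; evidence payload is simplified for illustration.
#[derive(Debug)]
enum BlockReason {
    NoGpuBackendOnNode,
    UnsupportedLayer(String),
    PartialGpuSplit,
    WrongBackendForPlatform,
}

#[derive(Debug)]
enum ResidencyGateResult {
    Pass(/* ResidencyEvidence elided */ ()),
    Block { reasons: Vec<BlockReason> },
}

impl ResidencyGateResult {
    // Per the test plan: reasons() is an empty slice on Pass.
    fn reasons(&self) -> &[BlockReason] {
        match self {
            ResidencyGateResult::Pass(_) => &[],
            ResidencyGateResult::Block { reasons } => reasons,
        }
    }
}

fn main() {
    assert!(ResidencyGateResult::Pass(()).reasons().is_empty());
    let blocked = ResidencyGateResult::Block {
        reasons: vec![BlockReason::PartialGpuSplit, BlockReason::UnsupportedLayer("moe_gate".into())],
    };
    assert_eq!(blocked.reasons().len(), 2); // multi-reason block accumulates
}
```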

Failure-mode discipline (non-negotiable per vhsm-d1f4 audit pass 1):

- No silent CPU split: PartialGpuSplit fires when free VRAM < estimate
- No silent fallback: NoGpuBackendOnNode fires when no GPU at all
- No silent unsupported layer: UnsupportedLayer fires per-kind for
  Vulkan + qwen3moe (vendored llama.cpp Vulkan gap today)
- No hardcoded enums: BackendChoice is a tagged enum; QwenModelMetadata's
  layer_kinds_needing_check is Vec<String> (new layer kinds plug in)
- No assumed defaults: every field comes from inputs

Backend selection precedence (matches probe.rs llamacpp advertisement rule):
Mac → Metal, NVIDIA → CUDA, AMD/Intel → Vulkan, CPU-only → None.
Metal wins over Cuda on a Mac (native path); CUDA wins over Vulkan on
NVIDIA hardware (llama.cpp CUDA kernels more complete than Vulkan today).
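
The stated precedence is a straight if-chain; a minimal sketch (field and variant names mirror the text, the real signature may differ):

```rust
// Precedence per the PR: Metal > CUDA > Vulkan, None on CPU-only.
#[derive(Debug, PartialEq)]
enum BackendChoice { Metal, Cuda, Vulkan }

struct HardwareProfile { has_metal: bool, has_cuda: bool, has_vulkan: bool }

fn select_backend(hw: &HardwareProfile) -> Option<BackendChoice> {
    if hw.has_metal {
        Some(BackendChoice::Metal)       // native path wins on Mac
    } else if hw.has_cuda {
        Some(BackendChoice::Cuda)        // CUDA kernels more complete than Vulkan
    } else if hw.has_vulkan {
        Some(BackendChoice::Vulkan)
    } else {
        None                             // CPU-only: no backend, no fallback
    }
}

fn main() {
    let mac = HardwareProfile { has_metal: true, has_cuda: true, has_vulkan: true };
    assert_eq!(select_backend(&mac), Some(BackendChoice::Metal));
    let cpu = HardwareProfile { has_metal: false, has_cuda: false, has_vulkan: false };
    assert_eq!(select_backend(&cpu), None);
}
```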

Tests: 41 passing on cargo test --lib --features metal,accelerate
inference_capability::residency::

- select_backend (4): picks Metal/CUDA/Vulkan correctly per HW class; None
  on CPU-only
- check_residency_gate happy paths (4): M5 Pro / MacBook Air M2 / Blackwell
  / AMD-Vulkan all run their expected Qwen variants with full evidence
- check_residency_gate block paths (4): CPU-only blocks with
  NoGpuBackendOnNode + exclusive reason; M2 blocks 30B for VRAM; AMD Vulkan
  blocks Qwen3 MoE with UnsupportedLayer; Vulkan + Qwen2 PASSES (Vulkan
  handles qwen2 today, not qwen3moe)
- VRAM estimate (3): Q4 7B in 3-5GB band, Q4 30B in 14-18GB band,
  estimate scales with quantization
- Evidence + serde (5): every required field present on Pass; BackendChoice
  lowercase; BlockReason + ResidencyGateResult tagged-union round-trips;
  QwenModelMetadata + ResidencyEvidence camelCase
- Edge cases (8): inclusive-vram-boundary pass; one-byte-under blocks;
  tiny model on CPU still blocks; probe-passes-residency-blocks
  composition; multi-reason block accumulates; reasons() empty slice on
  Pass; FP16 7B blocks on 8GB Mac; WrongBackend variant round-trips
- Layer-kind detail (3): backend_choice_as_str; vulkan emits one
  UnsupportedLayer per kind; empty layer_kinds never emits
- ts-rs exports (5): BackendChoice, BlockReason, QwenModelMetadata,
  ResidencyEvidence, ResidencyGateResult

Cargo check clean on --features metal,accelerate.

This is PR-1 of CBAR-PIECE-5. PR-2 wires GGUF metadata reader (extends
backends::read_gguf_metadata with block_count + parameter count) to
populate QwenModelMetadata from a path. PR-3 wires the gate result into
the turn dispatcher with enforcement (block the turn instead of letting
it silently run).

VDD evidence N/A — pure data + derivation, no inference dispatch.
Evidence lands with PR-3.

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (this PR's base; OPEN, MERGEABLE,
  zero file conflict)
- This PR: inference_capability/residency.rs (PIECE-5 PR-1)
- Future PR-2: GGUF reader + metadata populator
- Future PR-3: dispatcher integration + enforcement
joelteply added a commit that referenced this pull request May 16, 2026
…nctions slice) (#1331)

(Commit message identical to the pushed commit above.)

Co-authored-by: Test <test@test.com>
joelteply added a commit that referenced this pull request May 16, 2026
…wenModelMetadata (#1333)

* feat(inference): CBAR-PIECE-5 PR-1 — Qwen GPU residency gate (pure-functions slice)

(Commit message identical to the CBAR-PIECE-5 PR-1 message above.)

* feat(inference): CBAR-PIECE-5 PR-2 — GGUF metadata loader populates QwenModelMetadata

Stacks on #1331 (CBAR-PIECE-5 PR-1, residency gate types). PR-1 defined
the QwenModelMetadata struct + gate; this PR-2 reads a real GGUF file
and produces the metadata the gate consumes. PR-3 will wire both probe
+ this loader into the turn dispatcher with enforcement.

Same pure-functions cadence as PR-1 — file I/O lives in a thin
wrapper, all parsing logic lives in helpers that are unit-testable
without GGUF fixtures.

What ships in inference_capability/gguf_loader.rs:

- pub fn read_qwen_model_metadata(path: &Path) -> Result<QwenModelMetadata>
  Thin file-opener; uses the backends::gguf_file::Content already in the
  crate. No new dependencies.

- pub(crate) fn file_type_to_bytes_per_param(ft: u32) -> Result<f64>
  Maps the GGUF general.file_type enum to bytes-per-weight. Covers the
  full shipped quantization set (Q4_0/Q4_1/Q4_K_S/Q4_K_M/Q5_0/Q5_1/
  Q5_K_S/Q5_K_M/Q6_K/Q8_0, IQ-series sub-2-bit, F16/F32/BF16). Unknown
  ft returns Err with the value named — same no-silent-default posture
  as backends::read_gguf_metadata.

- pub(crate) fn layer_kinds_for_architecture(arch: &str) -> Vec<String>
  Lookup table for architectures with known Vulkan-llama.cpp gaps:
  qwen3moe → [moe_gate, sliding_window_attn], qwen3 → [sliding_window_attn],
  everything else → []. Pinned by a dedicated test so renames must land
  in both the table + residency.rs's matching test simultaneously.
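
A sketch of that lookup table as described (the real table lives in gguf_loader.rs and is pinned by a dedicated test; arch strings and layer-kind names follow the bullet above):

```rust
// Illustrative version of the architecture -> layer-kinds-needing-check
// table: only architectures with known Vulkan-llama.cpp gaps return entries.
fn layer_kinds_for_architecture(arch: &str) -> Vec<String> {
    match arch {
        "qwen3moe" => vec!["moe_gate".into(), "sliding_window_attn".into()],
        "qwen3" => vec!["sliding_window_attn".into()],
        _ => Vec::new(), // qwen2, qwen2vl, unknown arch: nothing to check
    }
}

fn main() {
    assert_eq!(layer_kinds_for_architecture("qwen3moe").len(), 2);
    assert_eq!(layer_kinds_for_architecture("qwen3"), vec!["sliding_window_attn".to_string()]);
    assert!(layer_kinds_for_architecture("qwen2").is_empty());
    assert!(layer_kinds_for_architecture("unknown").is_empty());
}
```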

Failure-mode discipline:

- general.architecture: REQUIRED (refuse to guess — silent fallback was
  the 2026-04-23 bug Joel called out)
- {arch}.block_count: REQUIRED (no fake layer-count evidence)
- general.file_type: REQUIRED (no guessed quantization → wrong VRAM)
- general.parameter_count: OPTIONAL with loud fallback (derive from
  file_size / bytes_per_param — approximate, documented)
- general.name: OPTIONAL with file-stem fallback (display only, doesn't
  affect gate correctness)

Tests: 15 passing on cargo test --lib --features metal,accelerate
inference_capability::gguf_loader::

- file_type_to_bytes_per_param (7): workhorse quants present, Q4_K_M
  in 0.55-0.65 band, FP16=2.0, F32=4.0, unknown=Err, removed
  ft={4,5,6}=Err, ordering monotone, IQ-series sub-0.4 bytes
- layer_kinds_for_architecture (5): qwen3moe = [moe_gate,
  sliding_window_attn], qwen3 = [sliding_window_attn], qwen2 +
  qwen2vl empty, unknown arch empty, table pinning
- read_qwen_model_metadata I/O (2): nonexistent path Err, non-GGUF
  file (Cargo.toml) Err

VDD evidence N/A — pure-data loader, no inference dispatch. Evidence
will land with PR-3 (enforcement integration).

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (merged to canary)
- #1331 CBAR-PIECE-5 PR-1 (residency gate types — base of this PR)
- This PR: GGUF metadata loader (PIECE-5 PR-2)
- Future PR-3: dispatcher integration + enforcement

---------

Co-authored-by: Test <test@test.com>
joelteply pushed a commit that referenced this pull request May 16, 2026
…ning; navigate to MODULE-CATALOG queue

Second refresh of ALPHA-GAP Immediate Next Actions to reflect work
landed since #1316 merged. Six items closed; navigation into
MODULE-CATALOG queue made explicit.

Closed: #6 contract widening (#1341), #8 GRID-INFERENCE-ROUTING PR-1
(#1315), CBAR-PIECE-5 end-to-end (#1331/#1333/#1335/#1338),
PIECE-8 inference-grpc hardcoded-clamps (#1340), doc family
architecture surface (#1324/#1327/#1332/#1336/#1337 open;
#1316/#1317/#1320/#1329 merged).

Item #9 reorganized to point at MODULE-CATALOG's 'Next Modules To
Build' queue (audit-recorder → threat-detector → working-set-manager
→ demand-aligned-recall → substrate-governor).

Adds closeout summary section listing what's done, what's open
(5 architecture-doc PRs ready for review + 2 airc PRs), and what's
queued (5 modules with dependency state + LoC + acceptance criteria
in MODULE-CATALOG).

Doc-driven development cycle is working: doc spec → implementing
agent picks up → ships PR → next spec referenced.
joelteply added a commit that referenced this pull request May 16, 2026
…ning; navigate to MODULE-CATALOG queue (#1342)

(Commit message identical to the pushed commit above.)

Co-authored-by: Test <test@test.com>