
feat(grid-inference-routing): PR-1 InferenceCapability + probe + NodeCapabilityRegistry#1315

Merged
joelteply merged 1 commit into canary from feat/grid-inference-routing-pr1-capability
May 16, 2026
Conversation

@joelteply
Contributor

Summary

GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16T15:03Z).

Pure-functions slice — types + derivation + in-memory registry. No grid wiring, no IPC, no async. Sets up the local-side capability surface that PR-2 / PR-3 / PR-4 stack on:

  • PR-2 (claude-tab-1): GridCapabilityAnnouncer + tailscale broadcast + peer registry sync
  • PR-3 (codex): GridInferenceRouter + supervisor pressure integration
  • PR-4 (vhsm-d1f4): bidirectional streaming over grid transport + failure-mode contract

Same pattern as rate_proposals / generate_recipe PR-1 — isolate pure data + derivation first so PR-2 stacks on a stable, test-covered shape.

What ships

inference_capability::types (ts-rs camelCase exports)

  • InferenceKind(String) — NOT a const enum, per the no-hardcoded-enums rule. Backends register dynamically; new ones (tflite, mlx, candle-vulkan) plug in without a schema change.
  • LatencyClass — Local|Fast|Mesh|Wan; serializes lowercase, totally ordered for PR-3 scoring.
  • HardwareProfile — platform, has_metal, has_cuda, has_vulkan, free_vram_bytes, total_vram_bytes, cpu_cores, system_ram_bytes.
  • InferenceCapability — kind, free_vram_bytes, current_lease_count, latency_class.
  • NodeCapability — full advertisement: node_id, hardware, capabilities[], last_updated_ms.
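
A minimal std-only sketch of these shapes, for orientation (the real module adds serde + ts-rs derives and camelCase renaming, which are omitted here; field layout follows the bullet list above but may differ from the shipped code):

```rust
// Hypothetical sketch of the wire types described above. The real module
// derives serde/ts-rs; this version is plain Rust for illustration.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct InferenceKind(String); // open newtype, not a closed enum

// Totally ordered: Local < Fast < Mesh < Wan, so lower = closer for scoring.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum LatencyClass { Local, Fast, Mesh, Wan }

#[derive(Clone, Debug)]
struct HardwareProfile {
    platform: String,
    has_metal: bool,
    has_cuda: bool,
    has_vulkan: bool,
    free_vram_bytes: u64,
    total_vram_bytes: u64,
    cpu_cores: u32,
    system_ram_bytes: u64,
}

#[derive(Clone, Debug)]
struct InferenceCapability {
    kind: InferenceKind,
    free_vram_bytes: u64,
    current_lease_count: u32,
    latency_class: LatencyClass,
}

fn main() {
    assert!(LatencyClass::Local < LatencyClass::Wan);
    let k = InferenceKind("llamacpp".to_string());
    let cap = InferenceCapability {
        kind: k.clone(),
        free_vram_bytes: 8 << 30,
        current_lease_count: 0,
        latency_class: LatencyClass::Local,
    };
    assert_eq!(k.0, "llamacpp");
    assert_eq!(cap.current_lease_count, 0);
}
```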

Exports to shared/generated/inference_capability/ (not grid/) — keeps the new module's wire types in their own namespace and avoids collision with the existing grid::node::NodeCapability discriminated union.

inference_capability::probe (pure function)

pub fn probe_inference_capabilities(hw: &HardwareProfile) -> Vec<InferenceCapability>

No IO, no globals, no syscalls. Decisions encoded:

  • llamacpp + candle: require Metal OR CUDA (native GPU path)
  • ort-vision / ort-tts / ort-stt / ort-embedding: accept any GPU (Metal, CUDA, or Vulkan via ORT execution providers)
  • MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor — below that, advertise nothing (deadhead-don't-fail policy per vhsm-d1f4 audit pass 1)
  • CPU-only nodes advertise zero capabilities (the no_cpu_fallback contract at the capability layer)
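
The decision rules above can be sketched as a pure function. This is an illustration, not the shipped probe: capability kinds are simplified to strings, and the real function returns Vec<InferenceCapability>:

```rust
// Hedged sketch of the probe's decision rules as stated in the PR text.
const MIN_GPU_INFERENCE_VRAM_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GiB floor

struct HardwareProfile {
    has_metal: bool,
    has_cuda: bool,
    has_vulkan: bool,
    free_vram_bytes: u64,
}

fn probe_inference_capabilities(hw: &HardwareProfile) -> Vec<String> {
    let mut caps = Vec::new();
    // Below the floor: advertise nothing (deadhead-don't-fail, no CPU fallback).
    if hw.free_vram_bytes < MIN_GPU_INFERENCE_VRAM_BYTES {
        return caps;
    }
    let native_gpu = hw.has_metal || hw.has_cuda; // llamacpp + candle path
    let any_gpu = native_gpu || hw.has_vulkan;    // ORT execution-provider path
    if native_gpu {
        caps.push("llamacpp".to_string());
        caps.push("candle".to_string());
    }
    if any_gpu {
        for k in ["ort-vision", "ort-tts", "ort-stt", "ort-embedding"] {
            caps.push(k.to_string());
        }
    }
    caps
}

fn main() {
    let vulkan_only = HardwareProfile {
        has_metal: false, has_cuda: false, has_vulkan: true, free_vram_bytes: 8 << 30,
    };
    let caps = probe_inference_capabilities(&vulkan_only);
    assert!(!caps.contains(&"llamacpp".to_string())); // native path needs Metal/CUDA
    assert!(caps.contains(&"ort-vision".to_string())); // ORT accepts Vulkan
    let cpu_only = HardwareProfile {
        has_metal: false, has_cuda: false, has_vulkan: false, free_vram_bytes: 0,
    };
    assert!(probe_inference_capabilities(&cpu_only).is_empty()); // no CPU fallback
}
```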

inference_capability::registry (NodeCapabilityRegistry)

In-memory HashMap<String, NodeCapability> with upsert, get, remove, list, find_capable(kind, min_vram), and evict_stale(cutoff_ms). Sync and single-threaded — PR-2 wraps it in parking_lot::RwLock when wiring the announcer.
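
A minimal sketch of that registry surface, assuming simplified field shapes (method names follow the PR text; the real NodeCapability carries the full advertisement):

```rust
use std::collections::HashMap;

// Illustrative, simplified NodeCapability — the real one carries
// hardware + capabilities[] per the types section.
#[derive(Clone)]
struct NodeCapability {
    node_id: String,
    kinds: Vec<String>,
    free_vram_bytes: u64,
    last_updated_ms: u64,
}

#[derive(Default)]
struct NodeCapabilityRegistry {
    nodes: HashMap<String, NodeCapability>,
}

impl NodeCapabilityRegistry {
    fn upsert(&mut self, cap: NodeCapability) {
        self.nodes.insert(cap.node_id.clone(), cap);
    }
    // Inclusive VRAM boundary: a node with exactly min_vram qualifies.
    fn find_capable(&self, kind: &str, min_vram: u64) -> Vec<&NodeCapability> {
        self.nodes.values()
            .filter(|n| n.kinds.iter().any(|k| k == kind) && n.free_vram_bytes >= min_vram)
            .collect()
    }
    // Inclusive at-cutoff, per the test plan: an entry updated exactly at
    // cutoff_ms is evicted.
    fn evict_stale(&mut self, cutoff_ms: u64) {
        self.nodes.retain(|_, n| n.last_updated_ms > cutoff_ms);
    }
}

fn main() {
    let mut reg = NodeCapabilityRegistry::default();
    reg.upsert(NodeCapability {
        node_id: "m5-pro".into(),
        kinds: vec!["llamacpp".into()],
        free_vram_bytes: 32 << 30,
        last_updated_ms: 100,
    });
    assert_eq!(reg.find_capable("llamacpp", 16 << 30).len(), 1);
    assert!(reg.find_capable("llamacpp", 64 << 30).is_empty()); // VRAM filter
    reg.evict_stale(100); // at-cutoff entry is evicted
    assert!(reg.nodes.is_empty());
}
```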

Failure-mode discipline (non-negotiable per audit pass 1 + 6)

  • No CPU fallback: generic_dell_no_gpu_advertises_nothing pins the contract.
  • No hardcoded enums: InferenceKind(String) newtype.
  • No silent unwrap_or / fallback defaults: every field carries explicit data.
  • No per-module concurrency cap: the registry is data-only; concurrency belongs in the upcoming Phase 0 base-trait work, not here.

Test plan

  • cargo test --lib --features metal,accelerate inference_capability:: — 43 passing, 0 failing.
  • cargo check --lib --features metal,accelerate clean (51 pre-existing warnings unrelated to this PR).
  • Hardware tiers covered: MacBook Air M2 (5GB free VRAM), M5 Pro (32GB), Blackwell RTX 5090 (28GB CUDA), generic Dell (no GPU), AMD with Vulkan-only.
  • Edge cases: below VRAM floor on Metal AND on Vulkan, CPU-only host with huge system RAM, exact-VRAM-boundary find_capable, exact-cutoff evict_stale, multi-capability per node, dynamic unknown InferenceKind handling.
  • Serde round-trips for every wire type with lowercase + camelCase verification.
  • ts-rs export tests generate 5 TypeScript bindings + barrel index.ts.
  • Live-deploy verification skipped — pure Rust data + derivation, no chat-path effect. PR-3 / PR-4 will VDD the actual inference dispatch.

VDD evidence

N/A for this slice — no inference dispatch, no tokens/s. The covenant evidence (latency, tok/s, before/after) lands with PR-3 (router decides remote vs local) and PR-4 (streaming over grid transport).

What PR-1 enables: PR-2 announces this node's capabilities to the mesh, PR-3 uses them to route inference jobs to whichever node is fastest/least-loaded, PR-4 streams tokens back. Net covenant target: vector search <50ms, RAG <500ms, voice <3s, persona tick <1ms — distributed across the mesh instead of single-node bottlenecked.

Stack

This is PR-1. PR-2 / PR-3 / PR-4 will branch off this once merged.

Refer to vhsm-d1f4's airc broadcast (2026-05-16T15:03:27Z) for the full design intent + per-PR scope.

…CapabilityRegistry

GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16).
Pure-functions slice — types + derivation + in-memory registry. No grid
wiring, no IPC, no async. PR-2 (claude-tab-1) will stack the
GridCapabilityAnnouncer + tailscale broadcast on top; PR-3 (codex) the
GridInferenceRouter; PR-4 (vhsm-d1f4) bidirectional streaming.

Why this layer first: the rate_proposals / generate_recipe PR-1 cadence
landed faster + safer by isolating data + pure derivation from any
async wiring. PR-2 stacks on a stable shape that's already test-covered.

What ships:

- `inference_capability::types` — wire shape (ts-rs camelCase exports
  to shared/generated/inference_capability/): InferenceKind(String)
  newtype (NOT a const enum — backends register dynamically per the
  no-hardcoded-enums rule), LatencyClass (Local/Fast/Mesh/Wan,
  serialize lowercase, ordered), HardwareProfile, InferenceCapability,
  NodeCapability.

- `inference_capability::probe` — pure function
  `probe_inference_capabilities(hw) -> Vec<InferenceCapability>` that
  derives the capability list from a HardwareProfile. No IO, no
  globals. llamacpp + candle require Metal OR CUDA (native GPU path);
  ort-vision/tts/stt/embedding accept any GPU (Metal/CUDA/Vulkan via
  ORT execution providers). MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor
  — below that, advertise nothing (deadhead-don't-fail policy per
  vhsm-d1f4 audit pass 1).

- `inference_capability::registry` — `NodeCapabilityRegistry` in-memory
  map of node_id -> NodeCapability with upsert/get/remove/list/
  find_capable/evict_stale. Sync, single-threaded — PR-2 wraps in
  parking_lot::RwLock when wiring the announcer.

Failure-mode discipline (non-negotiable per audit pass 1 + 6):

- No CPU fallback: generic_dell_no_gpu_advertises_nothing test pins
  the contract — a CPU-only node returns ZERO capabilities, not "fall
  back to slow CPU."
- No hardcoded enums: InferenceKind is a String newtype; new backends
  (tflite, mlx, candle-vulkan) plug in without a schema change.
- No silent unwrap_or: every field carries explicit data.

Tests: 43 passing on cargo test --lib --features metal,accelerate
inference_capability::

- types (9): kinds const wire-string pin, InferenceKind hashable,
  serde round-trips (string + camelcase), LatencyClass lowercase +
  ordering, NodeCapability full advertisement.

- probe (14): MacBook Air M2 / M5 Pro / Blackwell / generic Dell /
  AMD Vulkan-only — all four hardware tiers vhsm-d1f4 named. Plus
  below-VRAM-floor edge case (Metal AND Vulkan), CPU-only with huge
  RAM still advertises nothing, free-VRAM agreement across both
  native + ORT branches, deterministic ordering, propagation.

- registry (15): upsert/get/remove/list CRUD, find_capable with
  kind + VRAM filter (inclusive boundary), evict_stale with cutoff
  semantics (inclusive at-cutoff), multi-capability per node,
  dynamic unknown kind handling, empty-state, clear-via-remove.

- ts-rs exports (5): InferenceKind + LatencyClass + HardwareProfile +
  InferenceCapability + NodeCapability barrel generated to
  shared/generated/inference_capability/.

Cargo check clean on --features metal,accelerate (51 pre-existing
warnings unrelated to this PR).

No VDD tok/s claim — this PR is pure data + zero inference dispatch.
The tok/s evidence will land with PR-3 (router) + PR-4 (streaming).
joelteply added a commit that referenced this pull request May 16, 2026
…types (PR-1) (#1321)

Pure-types slice of CBAR-SUBSTRATE missing piece 2 (claimed 15:38Z,
unblocked by #1319). Adds the typed wire shape `ServiceModule` will
adopt in PR-2 (Optional fields on ModuleConfig + default
`on_artifact_available` method) and the runtime will dispatch on in
PR-3 (artifact event delivery on cadence).

Same cadence as rate_proposals / generate_recipe PR-1: pure data layer
lands independently mergeable, with full test coverage, before any
runtime wiring. PR-2 stacks the trait extension on this; PR-3 wires
the dispatcher.

Types

- `ArtifactKey(String)` — newtype, transparent serde, no closed enum.
  Modules register their own kinds at boot per CLAUDE.md anti-pattern
  rules + Joel's "we do not hardcode" directive. Same shape as
  `inference_capability::InferenceKind` (codex's #1315 PR-1).
- `ArtifactSelector::{Exact, Prefix}` — what a subscriber wants. Exact
  string match + string-prefix only. Glob/regex deliberately omitted
  — the matcher is the runtime's hot path (walked every publish);
  string-prefix is cheap + covers the cases we have.
- `Cadence::{Periodic, EventDriven, OnArtifact, Mixed}` — supervised
  wake policy. interval_ms over the wire so TS doesn't deal with bigint
  Duration. No `Default` impl, no `OnDemand` variant — broker/supervisor
  decides cadence per the dynamic-hardware-detect rule, every
  registered module has an explicit policy.
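
The selector's hot-path semantics can be sketched as follows; this is an illustration (the `matches` method name and derives are assumptions, but the Exact/Prefix behavior mirrors the bullet above):

```rust
// Illustrative sketch: Exact is a full string match, Prefix a cheap
// starts_with — glob/regex deliberately omitted per the PR text.
enum ArtifactSelector {
    Exact(String),
    Prefix(String),
}

impl ArtifactSelector {
    // Hypothetical matcher name; the real API may differ.
    fn matches(&self, key: &str) -> bool {
        match self {
            ArtifactSelector::Exact(s) => s == key,
            ArtifactSelector::Prefix(p) => key.starts_with(p),
        }
    }
}

fn main() {
    let exact = ArtifactSelector::Exact("persona/turn".into());
    assert!(exact.matches("persona/turn"));
    assert!(!exact.matches("persona/turn/frame")); // Exact does not prefix-match
    let prefix = ArtifactSelector::Prefix("persona/".into());
    assert!(prefix.matches("persona/turn"));
    // Degenerate case: an empty prefix matches everything.
    assert!(ArtifactSelector::Prefix(String::new()).matches("anything"));
}
```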

What this PR is NOT

- No `ServiceModule` trait changes yet (PR-2)
- No `ModuleConfig` field additions yet (PR-2 — Optional so existing
  modules don't break; opt-in)
- No runtime dispatch wiring (PR-3)

12/12 unit tests (cargo test --features metal,accelerate
runtime::artifact_handle). ts-rs exports verified to
shared/generated/runtime/{ArtifactKey,ArtifactSelector,Cadence}.ts.
Test focus: serde wire shape (transparent / internally-tagged),
selector hot-path semantics (Exact doesn't prefix-match,
Prefix handles empty + degenerate cases), Cadence projection
(tick_interval returns None for non-periodic, wants_artifact_wakes
covers the right variants), full roundtrip every variant.

Stacked under codex's Lane D claim (PersonaTurnFrame proof, 19:23Z)
and airc-8a5e's CBAR-PIECE-5 signal. All three slices independent —
PR-2 picks up these types when Phase 0 trunk doc lands; Lane D and
PIECE-5 don't physically depend on PR-2.

Co-authored-by: Test <test@test.com>
@joelteply joelteply merged commit 0a0ef57 into canary May 16, 2026
4 checks passed
@joelteply joelteply deleted the feat/grid-inference-routing-pr1-capability branch May 16, 2026 21:18
joelteply pushed a commit that referenced this pull request May 16, 2026
…nctions slice)

CBAR-SUBSTRATE missing-piece #5 (docs/architecture/CBAR-SUBSTRATE-ARCHITECTURE.md
§336): Qwen GPU residency gate. Stacks on PR #1315 (GRID-INFERENCE-ROUTING PR-1)
inference_capability module — different file, same module surface, same pure-
functions cadence as rate_proposals + generate_recipe + #1315 PR-1s.

#1315's probe answers "does this node have an advertisable GPU at all?" This
gate answers the next question one level deeper: "will the SELECTED MODEL
actually fit with all layers on that GPU, evidenced not guessed?"

Per CBAR-SUBSTRATE spec, before any local-generation turn runs:

- Selected Qwen model named explicitly
- Backend (Metal / CUDA / Vulkan) named + matches platform
- GPU layer count reported
- Unsupported layers enumerated (Vulkan-llama.cpp gaps, etc.)
- VRAM residency estimate covers all layers
- "CPU graph splits or unsupported Qwen layers are blockers unless the
  turn is explicitly degraded with a visible reason."

What ships (pure-functions slice — no GGUF I/O, no dispatch wiring; PR-2
wires the GGUF reader to populate QwenModelMetadata, PR-3 wires the gate
into the actual turn dispatcher with a block-the-turn enforcement point):

- BackendChoice (Metal / Cuda / Vulkan) — lowercase ts-rs export
- QwenModelMetadata — model_name, architecture, layer_count,
  parameter_count_billions, bytes_per_parameter_quantized,
  layer_kinds_needing_check. Pure data populated by future PR-2 GGUF reader
- ResidencyEvidence — typed evidence emitted on Pass; covers every
  CBAR-SUBSTRATE-required field
- ResidencyGateResult — Pass(evidence) | Block { reasons } tagged-union
- BlockReason — NoGpuBackendOnNode | UnsupportedLayer | PartialGpuSplit |
  WrongBackendForPlatform (typed, surfaces specific cause)
- Pure functions: select_backend, check_residency_gate
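
An illustrative shape for the gate's tagged-union result, under the assumption that ResidencyEvidence is elided to a unit here (the real struct carries every CBAR-SUBSTRATE-required field; the `reasons()` helper name follows the test plan):

```rust
// Sketch of the result types named above — variant names follow the PR
// text; evidence payload is simplified for illustration.
#[derive(Debug)]
enum BlockReason {
    NoGpuBackendOnNode,
    UnsupportedLayer(String),
    PartialGpuSplit,
    WrongBackendForPlatform,
}

#[derive(Debug)]
enum ResidencyGateResult {
    Pass(/* ResidencyEvidence elided */ ()),
    Block { reasons: Vec<BlockReason> },
}

impl ResidencyGateResult {
    // Per the test plan: reasons() is an empty slice on Pass.
    fn reasons(&self) -> &[BlockReason] {
        match self {
            ResidencyGateResult::Pass(_) => &[],
            ResidencyGateResult::Block { reasons } => reasons,
        }
    }
}

fn main() {
    assert!(ResidencyGateResult::Pass(()).reasons().is_empty());
    let blocked = ResidencyGateResult::Block {
        reasons: vec![BlockReason::PartialGpuSplit, BlockReason::UnsupportedLayer("moe_gate".into())],
    };
    assert_eq!(blocked.reasons().len(), 2); // multi-reason block accumulates
}
```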

Failure-mode discipline (non-negotiable per vhsm-d1f4 audit pass 1):

- No silent CPU split: PartialGpuSplit fires when free VRAM < estimate
- No silent fallback: NoGpuBackendOnNode fires when no GPU at all
- No silent unsupported layer: UnsupportedLayer fires per-kind for
  Vulkan + qwen3moe (vendored llama.cpp Vulkan gap today)
- No hardcoded enums: BackendChoice is a tagged enum; QwenModelMetadata's
  layer_kinds_needing_check is Vec<String> (new layer kinds plug in)
- No assumed defaults: every field comes from inputs

Backend selection precedence (matches probe.rs llamacpp advertisement rule):
Mac → Metal, NVIDIA → CUDA, AMD/Intel → Vulkan, CPU-only → None.
Metal wins over Cuda on a Mac (native path); CUDA wins over Vulkan on
NVIDIA hardware (llama.cpp CUDA kernels more complete than Vulkan today).
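
The stated precedence is a straight if-chain; a minimal sketch (field and variant names mirror the text, the real signature may differ):

```rust
// Precedence per the PR: Metal > CUDA > Vulkan, None on CPU-only.
#[derive(Debug, PartialEq)]
enum BackendChoice { Metal, Cuda, Vulkan }

struct HardwareProfile { has_metal: bool, has_cuda: bool, has_vulkan: bool }

fn select_backend(hw: &HardwareProfile) -> Option<BackendChoice> {
    if hw.has_metal {
        Some(BackendChoice::Metal)       // native path wins on Mac
    } else if hw.has_cuda {
        Some(BackendChoice::Cuda)        // CUDA kernels more complete than Vulkan
    } else if hw.has_vulkan {
        Some(BackendChoice::Vulkan)
    } else {
        None                             // CPU-only: no backend, no fallback
    }
}

fn main() {
    let mac = HardwareProfile { has_metal: true, has_cuda: true, has_vulkan: true };
    assert_eq!(select_backend(&mac), Some(BackendChoice::Metal));
    let cpu = HardwareProfile { has_metal: false, has_cuda: false, has_vulkan: false };
    assert_eq!(select_backend(&cpu), None);
}
```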

Tests: 41 passing on cargo test --lib --features metal,accelerate
inference_capability::residency::

- select_backend (4): picks Metal/CUDA/Vulkan correctly per HW class; None
  on CPU-only
- check_residency_gate happy paths (4): M5 Pro / MacBook Air M2 / Blackwell
  / AMD-Vulkan all run their expected Qwen variants with full evidence
- check_residency_gate block paths (4): CPU-only blocks with
  NoGpuBackendOnNode + exclusive reason; M2 blocks 30B for VRAM; AMD Vulkan
  blocks Qwen3 MoE with UnsupportedLayer; Vulkan + Qwen2 PASSES (Vulkan
  handles qwen2 today, not qwen3moe)
- VRAM estimate (3): Q4 7B in 3-5GB band, Q4 30B in 14-18GB band,
  estimate scales with quantization
- Evidence + serde (5): every required field present on Pass; BackendChoice
  lowercase; BlockReason + ResidencyGateResult tagged-union round-trips;
  QwenModelMetadata + ResidencyEvidence camelCase
- Edge cases (8): inclusive-vram-boundary pass; one-byte-under blocks;
  tiny model on CPU still blocks; probe-passes-residency-blocks
  composition; multi-reason block accumulates; reasons() empty slice on
  Pass; FP16 7B blocks on 8GB Mac; WrongBackend variant round-trips
- Layer-kind detail (3): backend_choice_as_str; vulkan emits one
  UnsupportedLayer per kind; empty layer_kinds never emits
- ts-rs exports (5): BackendChoice, BlockReason, QwenModelMetadata,
  ResidencyEvidence, ResidencyGateResult

Cargo check clean on --features metal,accelerate.

This is PR-1 of CBAR-PIECE-5. PR-2 wires GGUF metadata reader (extends
backends::read_gguf_metadata with block_count + parameter count) to
populate QwenModelMetadata from a path. PR-3 wires the gate result into
the turn dispatcher with enforcement (block the turn instead of letting
it silently run).

VDD evidence N/A — pure data + derivation, no inference dispatch.
Evidence lands with PR-3.

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (this PR's base; OPEN, MERGEABLE,
  zero file conflict)
- This PR: inference_capability/residency.rs (PIECE-5 PR-1)
- Future PR-2: GGUF reader + metadata populator
- Future PR-3: dispatcher integration + enforcement
joelteply added a commit that referenced this pull request May 16, 2026
…nctions slice) (#1331)

(Commit message identical to the pushed commit above.)

Co-authored-by: Test <test@test.com>
joelteply added a commit that referenced this pull request May 16, 2026
…wenModelMetadata (#1333)

* feat(inference): CBAR-PIECE-5 PR-1 — Qwen GPU residency gate (pure-functions slice)

(Commit message identical to the CBAR-PIECE-5 PR-1 message above.)

* feat(inference): CBAR-PIECE-5 PR-2 — GGUF metadata loader populates QwenModelMetadata

Stacks on #1331 (CBAR-PIECE-5 PR-1, residency gate types). PR-1 defined
the QwenModelMetadata struct + gate; this PR-2 reads a real GGUF file
and produces the metadata the gate consumes. PR-3 will wire both probe
+ this loader into the turn dispatcher with enforcement.

Same pure-functions cadence as PR-1 — file I/O lives in a thin
wrapper, all parsing logic lives in helpers that are unit-testable
without GGUF fixtures.

What ships in inference_capability/gguf_loader.rs:

- pub fn read_qwen_model_metadata(path: &Path) -> Result<QwenModelMetadata>
  Thin file-opener; uses the backends::gguf_file::Content already in the
  crate. No new dependencies.

- pub(crate) fn file_type_to_bytes_per_param(ft: u32) -> Result<f64>
  Maps the GGUF general.file_type enum to bytes-per-weight. Covers the
  full shipped quantization set (Q4_0/Q4_1/Q4_K_S/Q4_K_M/Q5_0/Q5_1/
  Q5_K_S/Q5_K_M/Q6_K/Q8_0, IQ-series sub-2-bit, F16/F32/BF16). Unknown
  ft returns Err with the value named — same no-silent-default posture
  as backends::read_gguf_metadata.

- pub(crate) fn layer_kinds_for_architecture(arch: &str) -> Vec<String>
  Lookup table for architectures with known Vulkan-llama.cpp gaps:
  qwen3moe → [moe_gate, sliding_window_attn], qwen3 → [sliding_window_attn],
  everything else → []. Pinned by a dedicated test so renames must land
  in both the table + residency.rs's matching test simultaneously.
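
A sketch of that lookup table as described (the real table lives in gguf_loader.rs and is pinned by a dedicated test; arch strings and layer-kind names follow the bullet above):

```rust
// Illustrative version of the architecture -> layer-kinds-needing-check
// table: only architectures with known Vulkan-llama.cpp gaps return entries.
fn layer_kinds_for_architecture(arch: &str) -> Vec<String> {
    match arch {
        "qwen3moe" => vec!["moe_gate".into(), "sliding_window_attn".into()],
        "qwen3" => vec!["sliding_window_attn".into()],
        _ => Vec::new(), // qwen2, qwen2vl, unknown arch: nothing to check
    }
}

fn main() {
    assert_eq!(layer_kinds_for_architecture("qwen3moe").len(), 2);
    assert_eq!(layer_kinds_for_architecture("qwen3"), vec!["sliding_window_attn".to_string()]);
    assert!(layer_kinds_for_architecture("qwen2").is_empty());
    assert!(layer_kinds_for_architecture("unknown").is_empty());
}
```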

Failure-mode discipline:

- general.architecture: REQUIRED (refuse to guess — silent fallback was
  the 2026-04-23 bug Joel called out)
- {arch}.block_count: REQUIRED (no fake layer-count evidence)
- general.file_type: REQUIRED (no guessed quantization → wrong VRAM)
- general.parameter_count: OPTIONAL with loud fallback (derive from
  file_size / bytes_per_param — approximate, documented)
- general.name: OPTIONAL with file-stem fallback (display only, doesn't
  affect gate correctness)

Tests: 15 passing on cargo test --lib --features metal,accelerate
inference_capability::gguf_loader::

- file_type_to_bytes_per_param (7): workhorse quants present, Q4_K_M
  in 0.55-0.65 band, FP16=2.0, F32=4.0, unknown=Err, removed
  ft={4,5,6}=Err, ordering monotone, IQ-series sub-0.4 bytes
- layer_kinds_for_architecture (5): qwen3moe = [moe_gate,
  sliding_window_attn], qwen3 = [sliding_window_attn], qwen2 +
  qwen2vl empty, unknown arch empty, table pinning
- read_qwen_model_metadata I/O (2): nonexistent path Err, non-GGUF
  file (Cargo.toml) Err

VDD evidence N/A — pure-data loader, no inference dispatch. Evidence
will land with PR-3 (enforcement integration).

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (merged to canary)
- #1331 CBAR-PIECE-5 PR-1 (residency gate types — base of this PR)
- This PR: GGUF metadata loader (PIECE-5 PR-2)
- Future PR-3: dispatcher integration + enforcement

---------

Co-authored-by: Test <test@test.com>
joelteply pushed a commit that referenced this pull request May 16, 2026
…ning; navigate to MODULE-CATALOG queue

Second refresh of ALPHA-GAP Immediate Next Actions to reflect work
landed since #1316 merged. Six items closed; navigation into
MODULE-CATALOG queue made explicit.

Closed: #6 contract widening (#1341), #8 GRID-INFERENCE-ROUTING PR-1
(#1315), CBAR-PIECE-5 end-to-end (#1331/#1333/#1335/#1338),
PIECE-8 inference-grpc hardcoded-clamps (#1340), doc family
architecture surface (#1324/#1327/#1332/#1336/#1337 open;
#1316/#1317/#1320/#1329 merged).

Item #9 reorganized to point at MODULE-CATALOG's 'Next Modules To
Build' queue (audit-recorder → threat-detector → working-set-manager
→ demand-aligned-recall → substrate-governor).

Adds closeout summary section listing what's done, what's open
(5 architecture-doc PRs ready for review + 2 airc PRs), and what's
queued (5 modules with dependency state + LoC + acceptance criteria
in MODULE-CATALOG).

Doc-driven development cycle is working: doc spec → implementing
agent picks up → ships PR → next spec referenced.
joelteply added a commit that referenced this pull request May 16, 2026
…ning; navigate to MODULE-CATALOG queue (#1342)

(Commit message identical to the pushed commit above.)

Co-authored-by: Test <test@test.com>