feat(grid-inference-routing): PR-1 InferenceCapability + probe + NodeCapabilityRegistry #1315
Merged
…CapabilityRegistry

GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16). Pure-functions slice — types + derivation + in-memory registry. No grid wiring, no IPC, no async. PR-2 (claude-tab-1) will stack the GridCapabilityAnnouncer + tailscale broadcast on top; PR-3 (codex) the GridInferenceRouter; PR-4 (vhsm-d1f4) bidirectional streaming.

Why this layer first: the rate_proposals / generate_recipe PR-1 cadence landed faster + safer by isolating data + pure derivation from any async wiring. PR-2 stacks on a stable shape that's already test-covered.

What ships:
- `inference_capability::types` — wire shape (ts-rs camelCase exports to shared/generated/inference_capability/): InferenceKind(String) newtype (NOT a const enum — backends register dynamically per the no-hardcoded-enums rule), LatencyClass (Local/Fast/Mesh/Wan, serializes lowercase, ordered), HardwareProfile, InferenceCapability, NodeCapability.
- `inference_capability::probe` — pure function `probe_inference_capabilities(hw) -> Vec<InferenceCapability>` that derives the capability list from a HardwareProfile. No IO, no globals. llamacpp + candle require Metal OR CUDA (native GPU path); ort-vision/tts/stt/embedding accept any GPU (Metal/CUDA/Vulkan via ORT execution providers). MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor — below that, advertise nothing (deadhead-don't-fail policy per vhsm-d1f4 audit pass 1).
- `inference_capability::registry` — `NodeCapabilityRegistry`, an in-memory map of node_id -> NodeCapability with upsert/get/remove/list/find_capable/evict_stale. Sync, single-threaded — PR-2 wraps it in parking_lot::RwLock when wiring the announcer.

Failure-mode discipline (non-negotiable per audit passes 1 + 6):
- No CPU fallback: the generic_dell_no_gpu_advertises_nothing test pins the contract — a CPU-only node returns ZERO capabilities, not "fall back to slow CPU."
- No hardcoded enums: InferenceKind is a String newtype; new backends (tflite, mlx, candle-vulkan) plug in without a schema change.
- No silent unwrap_or: every field carries explicit data.

Tests: 43 passing on cargo test --lib --features metal,accelerate inference_capability::
- types (9): kinds const wire-string pin, InferenceKind hashable, serde round-trips (string + camelCase), LatencyClass lowercase + ordering, NodeCapability full advertisement.
- probe (14): MacBook Air M2 / M5 Pro / Blackwell / generic Dell / AMD Vulkan-only — every hardware tier vhsm-d1f4 named. Plus below-VRAM-floor edge case (Metal AND Vulkan), CPU-only with huge RAM still advertises nothing, free-VRAM agreement across both native + ORT branches, deterministic ordering, propagation.
- registry (15): upsert/get/remove/list CRUD, find_capable with kind + VRAM filter (inclusive boundary), evict_stale with cutoff semantics (inclusive at-cutoff), multi-capability per node, dynamic unknown-kind handling, empty state, clear via remove.
- ts-rs exports (5): InferenceKind + LatencyClass + HardwareProfile + InferenceCapability + NodeCapability barrel generated to shared/generated/inference_capability/.

Cargo check clean on --features metal,accelerate (51 pre-existing warnings unrelated to this PR).

No VDD tok/s claim — this PR is pure data + zero inference dispatch. The tok/s evidence will land with PR-3 (router) + PR-4 (streaming).
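The probe contract described above (String-newtype kinds, ordered latency classes, the 2 GiB deadhead floor) can be sketched roughly as follows. This is an illustrative reduction, not the shipped code: field sets are trimmed, the serde/ts-rs derives are omitted, and the advertised capability list is abbreviated to one native kind and one representative ORT kind.

```rust
/// Open newtype: backends register kinds dynamically, no const enum.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct InferenceKind(pub String);

/// Variant order matters: derived Ord gives Local < Fast < Mesh < Wan for PR-3 scoring.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum LatencyClass { Local, Fast, Mesh, Wan }

pub struct HardwareProfile {
    pub has_metal: bool,
    pub has_cuda: bool,
    pub has_vulkan: bool,
    pub free_vram_bytes: u64,
}

pub struct InferenceCapability {
    pub kind: InferenceKind,
    pub free_vram_bytes: u64,
    pub latency_class: LatencyClass,
}

pub const MIN_GPU_INFERENCE_VRAM_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GiB floor

/// Deadhead-don't-fail: below the floor, or with no GPU at all, advertise
/// NOTHING — never a slow CPU fallback.
pub fn probe_inference_capabilities(hw: &HardwareProfile) -> Vec<InferenceCapability> {
    let any_gpu = hw.has_metal || hw.has_cuda || hw.has_vulkan;
    if !any_gpu || hw.free_vram_bytes < MIN_GPU_INFERENCE_VRAM_BYTES {
        return Vec::new();
    }
    let mut caps = Vec::new();
    // llamacpp (and candle) need the native GPU path: Metal or CUDA only.
    if hw.has_metal || hw.has_cuda {
        caps.push(InferenceCapability {
            kind: InferenceKind("llamacpp".into()),
            free_vram_bytes: hw.free_vram_bytes,
            latency_class: LatencyClass::Local,
        });
    }
    // ORT-backed kinds accept any GPU via execution providers.
    caps.push(InferenceCapability {
        kind: InferenceKind("ort-embedding".into()),
        free_vram_bytes: hw.free_vram_bytes,
        latency_class: LatencyClass::Local,
    });
    caps
}
```

The Vulkan-only branch pins the native/ORT split: a Vulkan box still advertises ORT kinds but never llamacpp, and a CPU-only box with any amount of RAM advertises nothing.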
joelteply added a commit that referenced this pull request on May 16, 2026
…types (PR-1) (#1321)

Pure-types slice of CBAR-SUBSTRATE missing piece 2 (claimed 15:38Z, unblocked by #1319). Adds the typed wire shape `ServiceModule` will adopt in PR-2 (optional fields on ModuleConfig + a default `on_artifact_available` method) and the runtime will dispatch on in PR-3 (artifact event delivery on cadence). Same cadence as rate_proposals / generate_recipe PR-1: the pure data layer lands independently mergeable, with full test coverage, before any runtime wiring. PR-2 stacks the trait extension on this; PR-3 wires the dispatcher.

Types:
- `ArtifactKey(String)` — newtype, transparent serde, no closed enum. Modules register their own kinds at boot per CLAUDE.md anti-pattern rules + Joel's "we do not hardcode" directive. Same shape as `inference_capability::InferenceKind` (codex's #1315 PR-1).
- `ArtifactSelector::{Exact, Prefix}` — what a subscriber wants. Exact string match + string-prefix only. Glob/regex deliberately omitted — the matcher is the runtime's hot path (walked on every publish); string-prefix is cheap + covers the cases we have.
- `Cadence::{Periodic, EventDriven, OnArtifact, Mixed}` — supervised wake policy. interval_ms over the wire so TS doesn't deal with bigint Duration. No `Default` impl, no `OnDemand` variant — broker/supervisor decides cadence per the dynamic-hardware-detect rule; every registered module has an explicit policy.

What this PR is NOT:
- No `ServiceModule` trait changes yet (PR-2)
- No `ModuleConfig` field additions yet (PR-2 — optional so existing modules don't break; opt-in)
- No runtime dispatch wiring (PR-3)

12/12 unit tests (cargo test --features metal,accelerate runtime::artifact_handle). ts-rs exports verified to shared/generated/runtime/{ArtifactKey,ArtifactSelector,Cadence}.ts.

Test focus: serde wire shape (transparent / internally tagged), selector hot-path semantics (Exact doesn't prefix-match; Prefix handles empty + degenerate cases), Cadence projection (tick_interval returns None for non-periodic; wants_artifact_wakes covers the right variants), full roundtrip of every variant.

Stacked under codex's Lane D claim (PersonaTurnFrame proof, 19:23Z) and airc-8a5e's CBAR-PIECE-5 signal. All three slices independent — PR-2 picks up these types when the Phase 0 trunk doc lands; Lane D and PIECE-5 don't physically depend on PR-2.

Co-authored-by: Test <test@test.com>
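The selector hot-path semantics called out above (Exact must not prefix-match; Prefix must handle the empty and degenerate cases) can be sketched like this. It is a minimal illustration: the shipped types derive serde/ts-rs, which is omitted here to keep the example self-contained.

```rust
/// Open newtype — modules register their own artifact kinds, no closed enum.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ArtifactKey(pub String);

/// What a subscriber wants. String-prefix only, no glob/regex:
/// the matcher runs on every publish, so it stays a plain string compare.
#[derive(Debug, Clone)]
pub enum ArtifactSelector {
    Exact(String),
    Prefix(String),
}

impl ArtifactSelector {
    pub fn matches(&self, key: &ArtifactKey) -> bool {
        match self {
            ArtifactSelector::Exact(s) => key.0 == *s,
            ArtifactSelector::Prefix(p) => key.0.starts_with(p),
        }
    }
}
```

Exact and Prefix are deliberately distinct variants rather than a single pattern string, so an Exact subscription can never accidentally widen into a prefix match.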
joelteply pushed a commit that referenced this pull request on May 16, 2026
…nctions slice)

CBAR-SUBSTRATE missing-piece #5 (docs/architecture/CBAR-SUBSTRATE-ARCHITECTURE.md §336): Qwen GPU residency gate. Stacks on PR #1315 (GRID-INFERENCE-ROUTING PR-1) inference_capability module — different file, same module surface, same pure-functions cadence as rate_proposals + generate_recipe + #1315 PR-1s.

#1315's probe answers "does this node have an advertisable GPU at all?" This gate answers the next question, one level deeper: "will the SELECTED MODEL actually fit with all layers on that GPU, evidenced not guessed?"

Per CBAR-SUBSTRATE spec, before any local-generation turn runs:
- Selected Qwen model named explicitly
- Backend (Metal / CUDA / Vulkan) named + matches platform
- GPU layer count reported
- Unsupported layers enumerated (Vulkan-llama.cpp gaps, etc.)
- VRAM residency estimate covers all layers
- "CPU graph splits or unsupported Qwen layers are blockers unless the turn is explicitly degraded with a visible reason."

What ships (pure-functions slice — no GGUF I/O, no dispatch wiring; PR-2 wires the GGUF reader to populate QwenModelMetadata, PR-3 wires the gate into the actual turn dispatcher with a block-the-turn enforcement point):
- BackendChoice (Metal / Cuda / Vulkan) — lowercase ts-rs export
- QwenModelMetadata — model_name, architecture, layer_count, parameter_count_billions, bytes_per_parameter_quantized, layer_kinds_needing_check. Pure data, populated by the future PR-2 GGUF reader
- ResidencyEvidence — typed evidence emitted on Pass; covers every CBAR-SUBSTRATE-required field
- ResidencyGateResult — Pass(evidence) | Block { reasons } tagged union
- BlockReason — NoGpuBackendOnNode | UnsupportedLayer | PartialGpuSplit | WrongBackendForPlatform (typed, surfaces the specific cause)
- Pure functions: select_backend, check_residency_gate

Failure-mode discipline (non-negotiable per vhsm-d1f4 audit pass 1):
- No silent CPU split: PartialGpuSplit fires when free VRAM < estimate
- No silent fallback: NoGpuBackendOnNode fires when there's no GPU at all
- No silent unsupported layer: UnsupportedLayer fires per-kind for Vulkan + qwen3moe (vendored llama.cpp Vulkan gap today)
- No hardcoded enums: BackendChoice is a tagged enum; QwenModelMetadata's layer_kinds_needing_check is Vec<String> (new layer kinds plug in)
- No assumed defaults: every field comes from inputs

Backend selection precedence (matches probe.rs's llamacpp advertisement rule): Mac → Metal, NVIDIA → CUDA, AMD/Intel → Vulkan, CPU-only → None. Metal wins over CUDA on a Mac (native path); CUDA wins over Vulkan on NVIDIA hardware (llama.cpp CUDA kernels are more complete than Vulkan today).

Tests: 41 passing on cargo test --lib --features metal,accelerate inference_capability::residency::
- select_backend (4): picks Metal/CUDA/Vulkan correctly per HW class; None on CPU-only
- check_residency_gate happy paths (4): M5 Pro / MacBook Air M2 / Blackwell / AMD-Vulkan all run their expected Qwen variants with full evidence
- check_residency_gate block paths (4): CPU-only blocks with NoGpuBackendOnNode as the exclusive reason; M2 blocks 30B for VRAM; AMD Vulkan blocks Qwen3 MoE with UnsupportedLayer; Vulkan + Qwen2 PASSES (Vulkan handles qwen2 today, not qwen3moe)
- VRAM estimate (3): Q4 7B in 3-5 GB band, Q4 30B in 14-18 GB band, estimate scales with quantization
- Evidence + serde (5): every required field present on Pass; BackendChoice lowercase; BlockReason + ResidencyGateResult tagged-union round-trips; QwenModelMetadata + ResidencyEvidence camelCase
- Edge cases (8): inclusive-VRAM-boundary pass; one-byte-under blocks; tiny model on CPU still blocks; probe-passes-residency-blocks composition; multi-reason block accumulates; reasons() empty slice on Pass; FP16 7B blocks on 8 GB Mac; WrongBackend variant round-trips
- Layer-kind detail (3): backend_choice_as_str; Vulkan emits one UnsupportedLayer per kind; empty layer_kinds never emits
- ts-rs exports (5): BackendChoice, BlockReason, QwenModelMetadata, ResidencyEvidence, ResidencyGateResult

Cargo check clean on --features metal,accelerate.

This is PR-1 of CBAR-PIECE-5. PR-2 wires the GGUF metadata reader (extends backends::read_gguf_metadata with block_count + parameter count) to populate QwenModelMetadata from a path. PR-3 wires the gate result into the turn dispatcher with enforcement (block the turn instead of letting it silently run). VDD evidence N/A — pure data + derivation, no inference dispatch. Evidence lands with PR-3.

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (this PR's base; OPEN, MERGEABLE, zero file conflict)
- This PR: inference_capability/residency.rs (PIECE-5 PR-1)
- Future PR-2: GGUF reader + metadata populator
- Future PR-3: dispatcher integration + enforcement
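The gate's result shape and accumulation rules described above can be sketched like this. Names follow the commit message, but field sets are trimmed and the signature is simplified to scalars; the shipped check_residency_gate takes the real HardwareProfile + QwenModelMetadata and emits full ResidencyEvidence on Pass.

```rust
#[derive(Debug, PartialEq)]
pub enum BlockReason {
    NoGpuBackendOnNode,
    /// One entry per unsupported layer kind (e.g. Vulkan + qwen3moe).
    UnsupportedLayer(String),
    /// No silent CPU graph split: fires when free VRAM < estimate.
    PartialGpuSplit { free_vram_bytes: u64, estimate_bytes: u64 },
}

#[derive(Debug)]
pub enum ResidencyGateResult {
    Pass, // shipped variant carries ResidencyEvidence
    Block { reasons: Vec<BlockReason> },
}

pub fn check_residency_gate(
    has_gpu_backend: bool,
    free_vram_bytes: u64,
    estimate_bytes: u64,
    unsupported_layer_kinds: &[String],
) -> ResidencyGateResult {
    if !has_gpu_backend {
        // Exclusive reason: nothing else is meaningful without a GPU backend.
        return ResidencyGateResult::Block { reasons: vec![BlockReason::NoGpuBackendOnNode] };
    }
    let mut reasons = Vec::new();
    for kind in unsupported_layer_kinds {
        reasons.push(BlockReason::UnsupportedLayer(kind.clone()));
    }
    if free_vram_bytes < estimate_bytes {
        reasons.push(BlockReason::PartialGpuSplit { free_vram_bytes, estimate_bytes });
    }
    if reasons.is_empty() {
        ResidencyGateResult::Pass
    } else {
        // Multi-reason blocks accumulate — surface every cause, not just the first.
        ResidencyGateResult::Block { reasons }
    }
}
```

Note the boundary semantics the edge-case tests pin: free VRAM exactly equal to the estimate passes (the comparison is strict `<`), one byte under blocks.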
joelteply added a commit that referenced this pull request on May 16, 2026
…nctions slice) (#1331)
joelteply added a commit that referenced this pull request on May 16, 2026
…wenModelMetadata (#1333)

* feat(inference): CBAR-PIECE-5 PR-1 — Qwen GPU residency gate (pure-functions slice)

* feat(inference): CBAR-PIECE-5 PR-2 — GGUF metadata loader populates QwenModelMetadata

Stacks on #1331 (CBAR-PIECE-5 PR-1, residency gate types). PR-1 defined the QwenModelMetadata struct + gate; this PR-2 reads a real GGUF file and produces the metadata the gate consumes. PR-3 will wire both probe + this loader into the turn dispatcher with enforcement. Same pure-functions cadence as PR-1 — file I/O lives in a thin wrapper; all parsing logic lives in helpers that are unit-testable without GGUF fixtures.

What ships in inference_capability/gguf_loader.rs:
- pub fn read_qwen_model_metadata(path: &Path) -> Result<QwenModelMetadata> — thin file-opener; uses the backends::gguf_file::Content already in the crate. No new dependencies.
- pub(crate) fn file_type_to_bytes_per_param(ft: u32) -> Result<f64> — maps the GGUF general.file_type enum to bytes-per-weight. Covers the full shipped quantization set (Q4_0/Q4_1/Q4_K_S/Q4_K_M/Q5_0/Q5_1/Q5_K_S/Q5_K_M/Q6_K/Q8_0, IQ-series sub-2-bit, F16/F32/BF16). Unknown ft returns Err with the value named — same no-silent-default posture as backends::read_gguf_metadata.
- pub(crate) fn layer_kinds_for_architecture(arch: &str) -> Vec<String> — lookup table for architectures with known Vulkan-llama.cpp gaps: qwen3moe → [moe_gate, sliding_window_attn], qwen3 → [sliding_window_attn], everything else → []. Pinned by a dedicated test so renames must land in both the table + residency.rs's matching test simultaneously.

Failure-mode discipline:
- general.architecture: REQUIRED (refuse to guess — silent fallback was the 2026-04-23 bug Joel called out)
- {arch}.block_count: REQUIRED (no fake layer-count evidence)
- general.file_type: REQUIRED (no guessed quantization → wrong VRAM)
- general.parameter_count: OPTIONAL with loud fallback (derived from file_size / bytes_per_param — approximate, documented)
- general.name: OPTIONAL with file-stem fallback (display only, doesn't affect gate correctness)

Tests: 15 passing on cargo test --lib --features metal,accelerate inference_capability::gguf_loader::
- file_type_to_bytes_per_param (7): workhorse quants present, Q4_K_M in 0.55-0.65 band, FP16 = 2.0, F32 = 4.0, unknown = Err, removed ft = {4,5,6} = Err, ordering monotone, IQ-series sub-0.4 bytes
- layer_kinds_for_architecture (5): qwen3moe = [moe_gate, sliding_window_attn], qwen3 = [sliding_window_attn], qwen2 + qwen2vl empty, unknown arch empty, table pinning
- read_qwen_model_metadata I/O (2): nonexistent path Err, non-GGUF file (Cargo.toml) Err

VDD evidence N/A — pure-data loader, no inference dispatch. Evidence will land with PR-3 (enforcement integration).

Stack:
- #1315 GRID-INFERENCE-ROUTING PR-1 (merged to canary)
- #1331 CBAR-PIECE-5 PR-1 (residency gate types — base of this PR)
- This PR: GGUF metadata loader (PIECE-5 PR-2)
- Future PR-3: dispatcher integration + enforcement

---------

Co-authored-by: Test <test@test.com>
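The no-silent-default posture of the loader helpers can be sketched as below. The file_type values and bytes-per-weight figures here are an illustrative subset, not the full shipped table; the point is the shape of the contract: unknown inputs become an Err that names the value, and the one permitted fallback (parameter count from file size) is explicit and documented as approximate.

```rust
/// Maps a GGUF general.file_type value to approximate bytes per weight.
/// Illustrative subset only — the shipped table covers the full quant set.
/// Unknown values are a named Err, never a guessed quantization.
pub fn file_type_to_bytes_per_param(ft: u32) -> Result<f64, String> {
    match ft {
        0 => Ok(4.0),  // F32
        1 => Ok(2.0),  // F16
        2 => Ok(0.56), // Q4_0-class: ~4.5 bits/weight incl. scales
        7 => Ok(1.06), // Q8_0-class
        other => Err(format!(
            "unknown GGUF file_type {other}: refusing to guess bytes/param"
        )),
    }
}

/// Loud, documented fallback when general.parameter_count is absent:
/// derive an APPROXIMATE count from file size — never silently default to 0.
pub fn approx_param_count(file_size_bytes: u64, bytes_per_param: f64) -> f64 {
    file_size_bytes as f64 / bytes_per_param
}

/// Lookup table for architectures with known Vulkan-llama.cpp gaps.
/// Unknown architectures get an empty list, not an invented one.
pub fn layer_kinds_for_architecture(arch: &str) -> Vec<String> {
    match arch {
        "qwen3moe" => vec!["moe_gate".into(), "sliding_window_attn".into()],
        "qwen3" => vec!["sliding_window_attn".into()],
        _ => Vec::new(),
    }
}
```

A ~4 GB Q4 file divided by ~0.56 bytes/param lands near 7B parameters, which is the sanity check the approximate fallback relies on.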
joelteply pushed a commit that referenced this pull request on May 16, 2026
…ning; navigate to MODULE-CATALOG queue

Second refresh of ALPHA-GAP Immediate Next Actions to reflect work landed since #1316 merged. Six items closed; navigation into the MODULE-CATALOG queue made explicit.

Closed: #6 contract widening (#1341), #8 GRID-INFERENCE-ROUTING PR-1 (#1315), CBAR-PIECE-5 end-to-end (#1331/#1333/#1335/#1338), PIECE-8 inference-grpc hardcoded-clamps (#1340), doc-family architecture surface (#1324/#1327/#1332/#1336/#1337 open; #1316/#1317/#1320/#1329 merged).

Item #9 reorganized to point at MODULE-CATALOG's "Next Modules To Build" queue (audit-recorder → threat-detector → working-set-manager → demand-aligned-recall → substrate-governor). Adds a closeout summary section listing what's done, what's open (5 architecture-doc PRs ready for review + 2 airc PRs), and what's queued (5 modules with dependency state + LoC + acceptance criteria in MODULE-CATALOG).

The doc-driven development cycle is working: doc spec → implementing agent picks up → ships PR → next spec referenced.
joelteply added a commit that referenced this pull request on May 16, 2026
…ning; navigate to MODULE-CATALOG queue (#1342)
Summary
GRID-INFERENCE-ROUTING PR-1 of 4 (vhsm-d1f4 allocation 2026-05-16T15:03Z).
Pure-functions slice — types + derivation + in-memory registry. No grid wiring, no IPC, no async. Sets up the local-side capability surface that PR-2 / PR-3 / PR-4 stack on.
Same pattern as rate_proposals / generate_recipe PR-1 — isolate pure data + derivation first so PR-2 stacks on a stable, test-covered shape.

What ships

inference_capability::types (ts-rs camelCase exports)
- InferenceKind(String) — NOT a const enum, per the no-hardcoded-enums rule. Backends register dynamically; new ones (tflite, mlx, candle-vulkan) plug in without a schema change.
- LatencyClass — Local | Fast | Mesh | Wan; serializes lowercase, totally ordered for PR-3 scoring.
- HardwareProfile — platform, has_metal, has_cuda, has_vulkan, free_vram_bytes, total_vram_bytes, cpu_cores, system_ram_bytes.
- InferenceCapability — kind, free_vram_bytes, current_lease_count, latency_class.
- NodeCapability — full advertisement: node_id, hardware, capabilities[], last_updated_ms.

Exports go to shared/generated/inference_capability/ (not grid/) — keeps the new module's wire types in their own namespace and avoids collision with the existing grid::node::NodeCapability discriminated union.

inference_capability::probe (pure function)
No IO, no globals, no syscalls. Decisions encoded:
- MIN_GPU_INFERENCE_VRAM_BYTES = 2 GiB floor — below that, advertise nothing (deadhead-don't-fail policy per vhsm-d1f4 audit pass 1)
- CPU-only nodes advertise nothing (the no_cpu_fallback contract at the capability layer)

inference_capability::registry (NodeCapabilityRegistry)
In-memory HashMap<String, NodeCapability> with upsert, get, remove, list, find_capable(kind, min_vram), evict_stale(cutoff_ms). Sync, single-threaded — PR-2 wraps it in parking_lot::RwLock when wiring the announcer.

Failure-mode discipline (non-negotiable per audit passes 1 + 6)
- No CPU fallback: generic_dell_no_gpu_advertises_nothing pins the contract.
- No hardcoded enums: InferenceKind(String) newtype.

Test plan
- cargo test --lib --features metal,accelerate inference_capability:: — 43 passing, 0 failing.
- cargo check --lib --features metal,accelerate clean (51 pre-existing warnings unrelated to this PR).
- ts-rs exports generated, barreled through index.ts.

VDD evidence
N/A for this slice — no inference dispatch, no tokens/s. The covenant evidence (latency, tok/s, before/after) lands with PR-3 (router decides remote vs local) and PR-4 (streaming over grid transport).
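The registry surface this PR ships (find_capable with an inclusive min-VRAM boundary, evict_stale inclusive at the cutoff) can be sketched as follows. This is a trimmed illustration, not the shipped struct: field sets are reduced to what the two filters need, and the serde/ts-rs derives are omitted.

```rust
use std::collections::HashMap;

pub struct Capability {
    pub kind: String,
    pub free_vram_bytes: u64,
}

pub struct NodeCapability {
    pub capabilities: Vec<Capability>,
    pub last_updated_ms: u64,
}

/// Sync, single-threaded — PR-2 wraps this in parking_lot::RwLock
/// when wiring the announcer.
#[derive(Default)]
pub struct NodeCapabilityRegistry {
    nodes: HashMap<String, NodeCapability>,
}

impl NodeCapabilityRegistry {
    pub fn upsert(&mut self, node_id: String, cap: NodeCapability) {
        self.nodes.insert(node_id, cap);
    }

    /// Nodes advertising `kind` with at least `min_vram` free.
    /// Boundary is inclusive: exactly min_vram qualifies.
    pub fn find_capable(&self, kind: &str, min_vram: u64) -> Vec<&str> {
        self.nodes
            .iter()
            .filter(|(_, n)| {
                n.capabilities
                    .iter()
                    .any(|c| c.kind == kind && c.free_vram_bytes >= min_vram)
            })
            .map(|(id, _)| id.as_str())
            .collect()
    }

    /// Drop nodes not updated since `cutoff_ms`.
    /// Inclusive at-cutoff: a node whose timestamp equals the cutoff is evicted.
    pub fn evict_stale(&mut self, cutoff_ms: u64) {
        self.nodes.retain(|_, n| n.last_updated_ms > cutoff_ms);
    }
}
```

PR-3's router would call find_capable to shortlist nodes for a job; evict_stale runs on the announcer's heartbeat so dead nodes fall out of routing.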
What PR-1 enables: PR-2 announces this node's capabilities to the mesh, PR-3 uses them to route inference jobs to whichever node is fastest/least-loaded, PR-4 streams tokens back. Net covenant target: vector search <50ms, RAG <500ms, voice <3s, persona tick <1ms — distributed across the mesh instead of single-node bottlenecked.
Stack
This is PR-1. PR-2 / PR-3 / PR-4 will branch off this once merged.
Refer to vhsm-d1f4's airc broadcast (2026-05-16T15:03:27Z) for the full design intent + per-PR scope.