
Inference capacity: consolidate to adapter-owned, delete duplicate gates #887

@joelteply

Description

Problem

Three layers independently compute local-inference concurrency. They happen to agree today (both sides map RAM ≥ 48 GB → 3 permits, ≥ 16 GB → 2, else 1), but the formula is duplicated across TS and Rust, and either side drifting silently is a latent bug.

| Layer | File / line | Computes |
| --- | --- | --- |
| Rust scheduler | `workers/continuum-core/src/inference/candle_adapter.rs:967` `concurrent_inference_permits()` | `n_seq_max` for `Scheduler` |
| Rust pre-enqueue gate | `candle_adapter.rs:107` `inference_semaphore` + `:733` acquire | permits in front of `scheduler.enqueue` |
| TS admission | `system/coordination/server/InferenceCoordinator.ts:57` `localInferenceCapacity()` | TS-side slot count |

Two forms of waste:

  1. Duplicated formula (TS + Rust). If one changes and the other doesn't, TS denies when Rust has capacity, or over-admits beyond Rust's actual gate. Classic single-source-of-truth violation.
  2. Redundant Rust gate. BatchScheduler already has a free-seq queue that naturally admits n_seq_max and backpressures the rest. Adding a tokio::sync::Semaphore with equal permits in front of enqueue() is a second serialization point doing the same job — and it's the layer that doesn't know about per-seq KV pressure.
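The duplicated formula, as both sides implement it today, can be sketched as follows (the function name is illustrative, not the actual identifier on either side):

```rust
// Hedged sketch of the RAM → permits formula currently duplicated in
// TS and Rust. The thresholds match the ones stated above; the name
// `permits_for_ram_gb` is a placeholder.
fn permits_for_ram_gb(ram_gb: u64) -> usize {
    if ram_gb >= 48 {
        3
    } else if ram_gb >= 16 {
        2
    } else {
        1
    }
}

fn main() {
    // Exercise the three bands described in the issue.
    assert_eq!(permits_for_ram_gb(48), 3);
    assert_eq!(permits_for_ram_gb(32), 2);
    assert_eq!(permits_for_ram_gb(8), 1);
    println!("ok");
}
```

Any change to these thresholds today must be made twice, once per language, which is exactly the drift risk described above.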

Proposed shape

  1. Adapter owns capacity. Add CandleAdapter::inference_capacity() -> usize returning the adapter's declared concurrency (same surface-shape as lora_capabilities() at candle_adapter.rs:159). RAM detection moves inside, no free function.
  2. Delete concurrent_inference_permits() as standalone.
  3. Delete inference_semaphore and the acquire_owned().await at line 733. Scheduler's free-seq queue is the truth. If enqueue-outrunning-n_seq_max causes problems (unbounded channel growth), fix that in the scheduler — one gate, one place.
  4. Expose capacity over IPC. New command inference/capacity returns adapter.inference_capacity().
  5. TS reads, doesn't compute. InferenceCoordinator replaces localInferenceCapacity() with an IPC call (cached at init). localInferenceCapacity() deleted.
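Steps 1 and 4 might take roughly this shape. A minimal sketch, assuming a toy dispatcher; the struct field, the command routing, and RAM detection are placeholders for the real candle_adapter.rs and IPC surfaces:

```rust
// Hedged sketch of adapter-owned capacity plus an IPC read path.
struct CandleAdapter {
    ram_gb: u64, // stands in for RAM detection moved inside the adapter
}

impl CandleAdapter {
    /// The one declared concurrency; same surface-shape as lora_capabilities().
    fn inference_capacity(&self) -> usize {
        if self.ram_gb >= 48 {
            3
        } else if self.ram_gb >= 16 {
            2
        } else {
            1
        }
    }
}

/// Toy dispatcher: the "inference/capacity" command only reads the
/// adapter's value; nothing outside the adapter recomputes it.
fn handle_command(adapter: &CandleAdapter, cmd: &str) -> Option<usize> {
    match cmd {
        "inference/capacity" => Some(adapter.inference_capacity()),
        _ => None,
    }
}

fn main() {
    let adapter = CandleAdapter { ram_gb: 48 };
    assert_eq!(handle_command(&adapter, "inference/capacity"), Some(3));
    println!("ok");
}
```

On the TS side, InferenceCoordinator would issue this command once at init and cache the result, rather than re-deriving the formula.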

Why this matters

  • Compression: one place defines concurrency, everything downstream reads it.
  • Correctness: removes the class of bug where TS+Rust drift.
  • Next-step enablement: the TODO at candle_adapter.rs:963 for pressure-reactive permits becomes trivial — mutate one value in one place and both layers pick it up.
  • Cleaner path to LoRA-per-seq (v2 of scheduler) and pressure-reactive permits without re-plumbing three layers.
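The pressure-reactive step becomes a one-value mutation once capacity lives in one place. A sketch under that assumption, with all names illustrative (the real adapter may expose this differently):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hedged sketch: if the adapter stores capacity in a single atomic,
// a KV-pressure callback can shrink it and both the scheduler and the
// IPC `inference/capacity` reply observe the same value.
struct CandleAdapter {
    capacity: AtomicUsize,
}

impl CandleAdapter {
    fn inference_capacity(&self) -> usize {
        self.capacity.load(Ordering::Relaxed)
    }

    /// What a pressure-reactive callback might do: reduce the one
    /// shared value, never below a floor of 1.
    fn on_kv_pressure(&self) {
        let cur = self.capacity.load(Ordering::Relaxed);
        if cur > 1 {
            self.capacity.store(cur - 1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let adapter = CandleAdapter { capacity: AtomicUsize::new(3) };
    adapter.on_kv_pressure();
    assert_eq!(adapter.inference_capacity(), 2);
    println!("ok");
}
```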

Risks / open questions

  • Does dropping the semaphore expose a backpressure gap? When enqueue outruns n_seq_max, does the mpsc channel grow unboundedly? Scheduler-side bounded queue may be needed.
  • TS startup ordering. InferenceCoordinator is a singleton that reads capacity at init; need to ensure Rust IPC is up before the first requestSlot call, or fall back cleanly on cold start.
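If the backpressure gap is real, the scheduler-side bounded queue floated above could be as simple as a bounded channel sized off capacity. A sketch with `std::sync::mpsc::sync_channel` standing in for whatever channel the scheduler actually uses:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hedged sketch: a channel bounded at the capacity value makes send()
// block once the queue is full, so enqueue cannot outrun the scheduler
// unboundedly. The bound and request type are illustrative.
fn main() {
    let capacity = 3;
    let (tx, rx) = sync_channel::<u32>(capacity);

    // Producer models callers hitting enqueue faster than the
    // scheduler drains; sends past the bound block until a recv.
    let producer = thread::spawn(move || {
        for req in 0..6 {
            tx.send(req).unwrap();
        }
    });

    // Consumer models the scheduler draining its free-seq queue.
    let served: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    assert_eq!(served, vec![0, 1, 2, 3, 4, 5]);
    println!("ok");
}
```

This keeps one gate in one place: the bound lives next to the scheduler, not in a second semaphore in front of it.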

Context

  • M5's continuous-batching scheduler lands in c9a0e77 + KV-pool fix in 9ca9908 on feature/inference-perf.
  • M1 Pro (32GB, 2 permits) and M5 Pro (48GB, 3 permits) both validated post-scheduler: concurrent multi-stream without crashes.
  • Work split TBD between Claude M1 Pro + Claude M5 — scheduler-side deletions by whoever owns the scheduler, TS/IPC side by the other.

Related: the comment at candle_adapter.rs:963 anticipates the pressure-reactive dynamic layer this unblocks.

Labels: enhancement (New feature or request)