## Problem

Three layers independently compute local-inference concurrency. They happen to agree today (both sides use `RAM >= 48 → 3, >= 16 → 2, else 1`) — but they're duplicated across TS and Rust, and either one drifting silently is a latent bug.
| Layer | File / line | Computes |
| --- | --- | --- |
| Rust scheduler | `workers/continuum-core/src/inference/candle_adapter.rs:967` `concurrent_inference_permits()` | `n_seq_max` for `Scheduler` |
| Rust pre-enqueue gate | `candle_adapter.rs:107` `inference_semaphore` + `:733` acquire | permits in front of `scheduler.enqueue` |
| TS admission | `system/coordination/server/InferenceCoordinator.ts:57` `localInferenceCapacity()` | TS-side slot count |
Two forms of waste:
- Duplicated formula (TS + Rust). If one changes and the other doesn't, TS denies when Rust has capacity, or over-admits beyond Rust's actual gate. Classic single-source-of-truth violation.
- Redundant Rust gate. `BatchScheduler` already has a free-seq queue that naturally admits `n_seq_max` sequences and backpressures the rest. Adding a `tokio::sync::Semaphore` with equal permits in front of `enqueue()` is a second serialization point doing the same job — and it's the layer that doesn't know about per-seq KV pressure.
## Proposed shape

- Adapter owns capacity. Add `CandleAdapter::inference_capacity() -> usize` returning the adapter's declared concurrency (same surface shape as `lora_capabilities()` at `candle_adapter.rs:159`). RAM detection moves inside; no free function.
- Delete `concurrent_inference_permits()` as a standalone function.
- Delete `inference_semaphore` and the `acquire_owned().await` at line 733. The scheduler's free-seq queue is the truth. If enqueue outrunning `n_seq_max` causes problems (unbounded channel growth), fix that in the scheduler — one gate, one place.
- Expose capacity over IPC. A new command `inference/capacity` returns `adapter.inference_capacity()`.
- TS reads, doesn't compute. `InferenceCoordinator` replaces `localInferenceCapacity()` with an IPC call (cached at init); `localInferenceCapacity()` is deleted.
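A minimal sketch of the proposed adapter method, using the RAM formula both sides already agree on. The `total_ram_gb` field is a hypothetical stand-in for the adapter's internal RAM detection; the real struct has more state:

```rust
/// Sketch of the proposed single source of truth for local-inference
/// concurrency. `total_ram_gb` is hypothetical — it stands in for the RAM
/// detection that moves inside the adapter.
pub struct CandleAdapter {
    total_ram_gb: u64,
}

impl CandleAdapter {
    /// Declared concurrency: RAM >= 48 → 3, >= 16 → 2, else 1.
    /// Scheduler `n_seq_max`, the `inference/capacity` IPC command, and TS
    /// admission would all read this one value.
    pub fn inference_capacity(&self) -> usize {
        match self.total_ram_gb {
            gb if gb >= 48 => 3,
            gb if gb >= 16 => 2,
            _ => 1,
        }
    }
}

fn main() {
    // The two validated machines from the Context section:
    assert_eq!(CandleAdapter { total_ram_gb: 32 }.inference_capacity(), 2); // M1 Pro
    assert_eq!(CandleAdapter { total_ram_gb: 48 }.inference_capacity(), 3); // M5 Pro
}
```

With this shape, a later pressure-reactive version only has to change what this one method returns; no caller changes.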
## Why this matters

- Compression: one place defines concurrency; everything downstream reads it.
- Correctness: removes the class of bug where TS and Rust drift.
- Next-step enablement: the TODO at `candle_adapter.rs:963` for pressure-reactive permits becomes trivial — mutate one value in one place and both layers pick it up.
- Cleaner path to LoRA-per-seq (v2 of the scheduler) and pressure-reactive permits without re-plumbing three layers.
## Risks / open questions

- Does dropping the semaphore expose a backpressure gap? When enqueue outruns `n_seq_max`, does the mpsc channel grow unboundedly? A scheduler-side bounded queue may be needed.
- TS startup ordering. `InferenceCoordinator` is a singleton that reads capacity at init; the Rust IPC must be up before the first `requestSlot` call, or the coordinator must fall back cleanly on cold start.
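On the backpressure question: a bounded queue at the scheduler boundary would give the same admission limit the semaphore provides today, without being a second gate. A std-library sketch of the mechanism (the real scheduler presumably uses tokio's bounded `mpsc::channel(n)`, where senders await a slot instead of blocking; capacity 3 here is illustrative):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Bounded queue with capacity = n_seq_max. Once 3 requests are in
    // flight, further enqueues are refused at the scheduler boundary —
    // backpressure without a separate semaphore in front of enqueue().
    let (tx, rx) = sync_channel::<&str>(3);

    for req in ["a", "b", "c"] {
        tx.try_send(req).expect("within capacity");
    }
    // A fourth enqueue is rejected until the scheduler frees a slot:
    assert!(matches!(tx.try_send("d"), Err(TrySendError::Full("d"))));

    // The scheduler drains one sequence; a slot opens up again.
    assert_eq!(rx.recv().unwrap(), "a");
    assert!(tx.try_send("d").is_ok());
}
```

This keeps the one-gate property: admission pressure surfaces exactly where `n_seq_max` is enforced, not in a duplicated layer above it.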
## Context

- M5's continuous-batching scheduler lands in c9a0e77, with the KV-pool fix in 9ca9908, on `feature/inference-perf`.
- M1 Pro (32 GB, 2 permits) and M5 Pro (48 GB, 3 permits) both validated post-scheduler: concurrent multi-stream without crashes.
- Work split TBD between Claude M1 Pro and Claude M5 — scheduler-side deletions by whoever owns the scheduler, TS/IPC side by the other.

Related: the comment at `candle_adapter.rs:963` anticipates the pressure-reactive dynamic layer this unblocks.