
Inference capacity: consolidate to adapter-owned, delete duplicate gates #887

@joelteply

Description

Problem

Three layers independently compute local-inference concurrency. They happen to agree today (both sides map RAM ≥ 48 GB → 3 permits, ≥ 16 GB → 2, else 1), but the formula is duplicated across TS and Rust, and either side drifting silently is a latent bug.

| Layer | File / line | Computes |
| --- | --- | --- |
| Rust scheduler | `workers/continuum-core/src/inference/candle_adapter.rs:967` `concurrent_inference_permits()` | `n_seq_max` for `Scheduler` |
| Rust pre-enqueue gate | `candle_adapter.rs:107` `inference_semaphore` + `:733` acquire | permits in front of `scheduler.enqueue` |
| TS admission | `system/coordination/server/InferenceCoordinator.ts:57` `localInferenceCapacity()` | TS-side slot count |

Two forms of waste:

  1. Duplicated formula (TS + Rust). If one changes and the other doesn't, TS denies when Rust has capacity, or over-admits beyond Rust's actual gate. Classic single-source-of-truth violation.
  2. Redundant Rust gate. BatchScheduler already has a free-seq queue that naturally admits n_seq_max and backpressures the rest. Adding a tokio::sync::Semaphore with equal permits in front of enqueue() is a second serialization point doing the same job — and it's the layer that doesn't know about per-seq KV pressure.
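The duplicated formula, as both sides implement it today, can be sketched as follows (the function name is illustrative, not the actual identifier on either side):

```rust
// Hedged sketch of the RAM → permits formula currently duplicated in
// TS and Rust. The thresholds match the ones stated above; the name
// `permits_for_ram_gb` is a placeholder.
fn permits_for_ram_gb(ram_gb: u64) -> usize {
    if ram_gb >= 48 {
        3
    } else if ram_gb >= 16 {
        2
    } else {
        1
    }
}

fn main() {
    // Exercise the three bands described in the issue.
    assert_eq!(permits_for_ram_gb(48), 3);
    assert_eq!(permits_for_ram_gb(32), 2);
    assert_eq!(permits_for_ram_gb(8), 1);
    println!("ok");
}
```

Any change to these thresholds today must be made twice, once per language, which is exactly the drift risk described above.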

Proposed shape

  1. Adapter owns capacity. Add CandleAdapter::inference_capacity() -> usize returning the adapter's declared concurrency (same surface-shape as lora_capabilities() at candle_adapter.rs:159). RAM detection moves inside, no free function.
  2. Delete concurrent_inference_permits() as standalone.
  3. Delete inference_semaphore and the acquire_owned().await at line 733. Scheduler's free-seq queue is the truth. If enqueue-outrunning-n_seq_max causes problems (unbounded channel growth), fix that in the scheduler — one gate, one place.
  4. Expose capacity over IPC. New command inference/capacity returns adapter.inference_capacity().
  5. TS reads, doesn't compute. InferenceCoordinator replaces localInferenceCapacity() with an IPC call (cached at init). localInferenceCapacity() deleted.
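Steps 1 and 4 might take roughly this shape. A minimal sketch, assuming a toy dispatcher; the struct field, the command routing, and RAM detection are placeholders for the real candle_adapter.rs and IPC surfaces:

```rust
// Hedged sketch of adapter-owned capacity plus an IPC read path.
struct CandleAdapter {
    ram_gb: u64, // stands in for RAM detection moved inside the adapter
}

impl CandleAdapter {
    /// The one declared concurrency; same surface-shape as lora_capabilities().
    fn inference_capacity(&self) -> usize {
        if self.ram_gb >= 48 {
            3
        } else if self.ram_gb >= 16 {
            2
        } else {
            1
        }
    }
}

/// Toy dispatcher: the "inference/capacity" command only reads the
/// adapter's value; nothing outside the adapter recomputes it.
fn handle_command(adapter: &CandleAdapter, cmd: &str) -> Option<usize> {
    match cmd {
        "inference/capacity" => Some(adapter.inference_capacity()),
        _ => None,
    }
}

fn main() {
    let adapter = CandleAdapter { ram_gb: 48 };
    assert_eq!(handle_command(&adapter, "inference/capacity"), Some(3));
    println!("ok");
}
```

On the TS side, InferenceCoordinator would issue this command once at init and cache the result, rather than re-deriving the formula.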

Why this matters

  • Compression: one place defines concurrency, everything downstream reads it.
  • Correctness: removes the class of bug where TS+Rust drift.
  • Next-step enablement: the TODO at candle_adapter.rs:963 for pressure-reactive permits becomes trivial — mutate one value in one place and both layers pick it up.
  • Cleaner path to LoRA-per-seq (v2 of scheduler) and pressure-reactive permits without re-plumbing three layers.
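The pressure-reactive step becomes a one-value mutation once capacity lives in one place. A sketch under that assumption, with all names illustrative (the real adapter may expose this differently):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hedged sketch: if the adapter stores capacity in a single atomic,
// a KV-pressure callback can shrink it and both the scheduler and the
// IPC `inference/capacity` reply observe the same value.
struct CandleAdapter {
    capacity: AtomicUsize,
}

impl CandleAdapter {
    fn inference_capacity(&self) -> usize {
        self.capacity.load(Ordering::Relaxed)
    }

    /// What a pressure-reactive callback might do: reduce the one
    /// shared value, never below a floor of 1.
    fn on_kv_pressure(&self) {
        let cur = self.capacity.load(Ordering::Relaxed);
        if cur > 1 {
            self.capacity.store(cur - 1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let adapter = CandleAdapter { capacity: AtomicUsize::new(3) };
    adapter.on_kv_pressure();
    assert_eq!(adapter.inference_capacity(), 2);
    println!("ok");
}
```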

Risks / open questions

  • Does dropping the semaphore expose a backpressure gap? When enqueue outruns n_seq_max, does the mpsc channel grow unboundedly? Scheduler-side bounded queue may be needed.
  • TS startup ordering. InferenceCoordinator is a singleton that reads capacity at init; need to ensure Rust IPC is up before the first requestSlot call, or fall back cleanly on cold start.
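If the backpressure gap is real, the scheduler-side bounded queue floated above could be as simple as a bounded channel sized off capacity. A sketch with `std::sync::mpsc::sync_channel` standing in for whatever channel the scheduler actually uses:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hedged sketch: a channel bounded at the capacity value makes send()
// block once the queue is full, so enqueue cannot outrun the scheduler
// unboundedly. The bound and request type are illustrative.
fn main() {
    let capacity = 3;
    let (tx, rx) = sync_channel::<u32>(capacity);

    // Producer models callers hitting enqueue faster than the
    // scheduler drains; sends past the bound block until a recv.
    let producer = thread::spawn(move || {
        for req in 0..6 {
            tx.send(req).unwrap();
        }
    });

    // Consumer models the scheduler draining its free-seq queue.
    let served: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    assert_eq!(served, vec![0, 1, 2, 3, 4, 5]);
    println!("ok");
}
```

This keeps one gate in one place: the bound lives next to the scheduler, not in a second semaphore in front of it.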

Context

  • M5's continuous-batching scheduler lands in c9a0e77 + KV-pool fix in 9ca9908 on feature/inference-perf.
  • M1 Pro (32GB, 2 permits) and M5 Pro (48GB, 3 permits) both validated post-scheduler: concurrent multi-stream without crashes.
  • Work split TBD between Claude M1 Pro + Claude M5 — scheduler-side deletions by whoever owns the scheduler, TS/IPC side by the other.

Related: the comment at candle_adapter.rs:963 anticipates the pressure-reactive dynamic layer this unblocks.

Labels: enhancement (New feature or request)