Unified Admission

Ameya Borkar edited this page Jun 10, 2026 · 6 revisions

Unified admission — one decision across rate, concurrency, cost

unifiedAdmission(...) composes the three orthogonal admission axes a real API request must clear — rate (req/min), concurrency (in-flight ceiling), and cost (tokens per window) — into a single Decision via a pure, four-law algebra (combineDecisions). It's the shape LLM gateways need: one decision, one observable binding axis, one shared retry hint.

import {
  adaptiveConcurrency,
  gcra,
  rateLimit,
  tokenBucket,
  unifiedAdmission,
} from "throttlekit";

const admit = unifiedAdmission({
  rate:        rateLimit({ strategy: gcra({ limit: 60, periodMs: 60_000 }) }),
  concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
  cost:        rateLimit({ strategy: tokenBucket({ capacity: 100_000, refillPerSec: 1_667 }) }),
});

// In an express handler:
const { decision, release } = await admit.admit({
  key:  req.user.id,
  cost: req.body.maxTokens ?? 1000,
});

if (!decision.allowed) {
  res.setHeader("Retry-After", Math.ceil(decision.retryAfterMs / 1000));
  return res.status(429).json({ error: "rate_limited", retryAfterMs: decision.retryAfterMs });
}

res.on("finish", () => release({ dropped: false }));
res.on("close",  () => release({ dropped: true }));  // client hung up; a no-op if "finish" already released
await callLLM(req.body);

The algebra (combineDecisions)

Aggregation across axes:

| Field | Rule | Why |
| --- | --- | --- |
| allowed | a.allowed && b.allowed | AND: both must allow |
| limit | min(a.limit, b.limit) | binding ceiling: what the client should see |
| remaining | min(a.remaining, b.remaining) | binding remainder |
| resetAt | max(a.resetAt, b.resetAt) | latest-resolution wait |
| retryAfterMs | max(a.retryAfterMs, b.retryAfterMs) | dominant wait: never under-state |

Four algebraic laws hold (checked as fast-check properties at numRuns ≥ 500): identity, associativity, commutativity, and idempotence. Together they mean that axis evaluation order doesn't change the result, N inputs reduce flat, retried sub-checks are safe, and unused axes plug in cleanly via the ALLOW_FULL neutral element.

combineDecisions and ALLOW_FULL are publicly exported off the root — useful for tests and N-ary composition.
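To make the field rules concrete, here is a minimal self-contained sketch of the algebra. It is an illustrative re-implementation, not the library's source: the names Decision, ALLOW_FULL_SKETCH, combineSketch and combineAllSketch are local to the example, and the exact field values of the neutral element are an assumption chosen so that the identity law holds for non-negative timestamps and waits.

```typescript
// Illustrative only: field rules follow the aggregation table; the shape of
// the neutral element below is an assumption, not the exported ALLOW_FULL.
interface Decision {
  allowed: boolean;
  limit: number;
  remaining: number;
  resetAt: number; // epoch ms
  retryAfterMs: number;
}

// Neutral element: true for AND, +Infinity for min, 0 for max over values >= 0.
const ALLOW_FULL_SKETCH: Decision = {
  allowed: true,
  limit: Infinity,
  remaining: Infinity,
  resetAt: 0,
  retryAfterMs: 0,
};

function combineSketch(a: Decision, b: Decision): Decision {
  return {
    allowed: a.allowed && b.allowed,
    limit: Math.min(a.limit, b.limit),
    remaining: Math.min(a.remaining, b.remaining),
    resetAt: Math.max(a.resetAt, b.resetAt),
    retryAfterMs: Math.max(a.retryAfterMs, b.retryAfterMs),
  };
}

// N-ary composition is a flat reduce seeded with the neutral element.
function combineAllSketch(decisions: readonly Decision[]): Decision {
  return decisions.reduce((acc, d) => combineSketch(acc, d), ALLOW_FULL_SKETCH);
}

const rateDecision: Decision = {
  allowed: true, limit: 60, remaining: 12, resetAt: 1_000, retryAfterMs: 0,
};
const costDecision: Decision = {
  allowed: false, limit: 100_000, remaining: 0, resetAt: 4_000, retryAfterMs: 3_000,
};

// The cost axis denies, so the combined decision denies and carries the longer wait.
const combined = combineAllSketch([rateDecision, costDecision]);
console.log(combined);
```

Because min, max and AND are each associative, commutative and idempotent, the four laws follow field by field, which is why a flat reduce over any number of axes is safe.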

Two backend modes

| Mode | When to use | Wire cost |
| --- | --- | --- |
| Sequential (default) | Any backend mix (in-process + Redis + Postgres) | rate-axis RTT + cost-axis RTT (often pipelined to ~1 RTT) |
| Lua-fused (opt-in) | All rate/cost on the same Redis client; you want atomic joint enforcement | 1 RTT regardless of axes |

The lua-fused path ships GCRA + tokenBucket fusion in 0.9.0 — the LLM-gateway combo:

import Redis from "ioredis";
import { adaptiveConcurrency, unifiedAdmission } from "throttlekit";
import { fromIoredis } from "throttlekit/redis";

const admit = unifiedAdmission({
  concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
  backend: "lua-fused",
  fused: {
    client: fromIoredis(new Redis(process.env.REDIS_URL!)),
    rate: { strategy: "gcra",        limit: 60,     periodMs: 60_000,    prefix: "rl:rate" },
    cost: { strategy: "tokenBucket", capacity: 100_000, refillPerSec: 1_667, prefix: "rl:cost" },
  },
});

Sequential ≡ Lua-fused: the byte-identical Decision-stream property is proven across 100 fast-check timelines per (rate-binding, cost-binding, both-binding) configuration in test/admission/fused-conformance.test.ts (TK-1006).

Observability — the binding axis

When an admission denies, which axis was binding? That's the #1 missing OTel signal for LLM gateways today. Two helpers from the throttlekit/observability subpath:

import { trace } from "@opentelemetry/api";
import { bindingAxisOf, recordUnifiedAdmissionOnSpan } from "throttlekit/observability";

const { decision, release } = await admit.admit({ key, cost });
const span = trace.getActiveSpan();
if (span) recordUnifiedAdmissionOnSpan(span, decision, admit.lastDecisions());

// Or query directly:
if (!decision.allowed) {
  log.info({ axis: bindingAxisOf(admit.lastDecisions()), retryAfterMs: decision.retryAfterMs });
}

The attribute key is throttlekit.binding_axis ∈ {"rate", "concurrency", "cost"}. It is set only on denied admissions and omitted when the admission is allowed. When multiple axes deny (possible in lua-fused mode), the reported axis follows the priority concurrency → rate → cost, which matches sequential's evaluation order, so the value is deterministic regardless of backend.

UnifiedAdmitter.lastDecisions() returns a frozen per-axis snapshot ({ rate, concurrency, cost }). Unconfigured axes are undefined, and so are axes skipped by a short-circuit, so in sequential mode the first denying axis can be identified from absence alone.
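The priority rule can be sketched as a small pure function over a per-axis snapshot. This is a hypothetical stand-in for illustration, not the source of bindingAxisOf: the names AxisName, AxisDecision, AxisSnapshot and bindingAxisSketch are invented here, and the snapshot is reduced to the two fields the example needs.

```typescript
// Hypothetical sketch: picks the binding axis from a per-axis snapshot using
// the documented concurrency -> rate -> cost priority.
type AxisName = "rate" | "concurrency" | "cost";

interface AxisDecision {
  allowed: boolean;
  retryAfterMs: number;
}

// Unconfigured or short-circuited axes are simply absent (undefined).
type AxisSnapshot = Partial<Record<AxisName, AxisDecision>>;

const AXIS_PRIORITY: readonly AxisName[] = ["concurrency", "rate", "cost"];

function bindingAxisSketch(snapshot: AxisSnapshot): AxisName | undefined {
  for (const axis of AXIS_PRIORITY) {
    const decision = snapshot[axis];
    if (decision !== undefined && !decision.allowed) return axis;
  }
  return undefined; // nothing denied, so no binding axis is reported
}

// Sequential mode: rate denied, so cost was never evaluated.
const sequentialDeny: AxisSnapshot = {
  concurrency: { allowed: true, retryAfterMs: 0 },
  rate: { allowed: false, retryAfterMs: 800 },
};

// Fused mode: rate and cost can both deny in the same round-trip.
const fusedDeny: AxisSnapshot = {
  concurrency: { allowed: true, retryAfterMs: 0 },
  rate: { allowed: false, retryAfterMs: 800 },
  cost: { allowed: false, retryAfterMs: 2_500 },
};

console.log(bindingAxisSketch(sequentialDeny), bindingAxisSketch(fusedDeny));
```

Both snapshots resolve to the same axis, which is the point of fixing the priority to the sequential evaluation order.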

Behind a ThrottleKit server the binding axis is also readable remotely and from any language via the read-only Monitor door (GetSnapshot / Watch), and exported to Prometheus /metrics as throttlekit_denied_by_axis_total — the same signal, off the request path. See the Monitoring guide and Operations.

Lifecycle — admit() vs admitSync()

  • admit() is async and returns Promise<UnifiedAdmission> — works for any backend mix (Redis, Postgres, in-process).
  • admitSync() is the sync sibling — only valid when every configured axis has checkSync (in-process MemoryStore for rate/cost; concurrency is always sync). Throws otherwise (same convention as Limiter.check / Limiter.checkSync).

Both return { decision, release }. The release function is the lifecycle hook for the concurrency slot. It is kept separate from the Decision because concurrency has lease semantics (acquire, then release) that don't fit Limiter's stateless .check() → Decision shape. This is locked decision D-U4 in research/bigger-bets/unified/DESIGN.md §14.

Idempotency: a second release() call is a no-op. A denied admit's release is a no-op (no slot was held — any transient acquire upstream of the binding axis was released as part of the short-circuit).
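The once-only contract is easy to picture as a guard around the underlying slot release. The sketch below is an assumption about how such a guard can be written, not the library's implementation; onceRelease and ReleaseInfo are names invented for the example.

```typescript
// Hypothetical sketch of an idempotent release: the first call frees the slot,
// every later call is a no-op.
interface ReleaseInfo {
  dropped: boolean;
}

function onceRelease(
  freeSlot: (info: ReleaseInfo) => void,
): (info: ReleaseInfo) => void {
  let released = false;
  return (info: ReleaseInfo): void => {
    if (released) return; // second and later calls do nothing
    released = true;
    freeSlot(info);
  };
}

// Record what actually reaches the concurrency limiter.
const observed: ReleaseInfo[] = [];
const releaseSlot = onceRelease((info) => {
  observed.push(info);
});

releaseSlot({ dropped: false }); // e.g. the response finished normally
releaseSlot({ dropped: true });  // e.g. a later "close" event: ignored
console.log(observed.length);
```

This is what makes it safe to wire release to more than one completion signal: whichever fires first decides the dropped flag.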

Joint-LP policy — bid-price admission (opt-in, 0.11.1)

Marginal-AND admits when each axis independently has room. When the cost axis binds and request types differ in value-per-cost-unit, that greedily burns budget on whatever arrives first — including cheap-to-pass, low-value, cost-heavy requests that starve the high-value requests arriving later. The fix from revenue management is a bid-price filter: admit iff the request's value clears the shadow price of the budget it consumes,

admit  ⟺  value ≥ p_R + p_C · cost

where (p_R, p_C) are the dual variables of the workload's fluid LP. The literature (Talluri–van Ryzin 1998; Devanur–Hayes 2009; Buchbinder–Jain–Naor 2007) shows static bid prices are asymptotically fluid-optimal under (approximate) stationarity. research/bigger-bets/unified/THEORY.md (TK-1007) calibrated the gap on an LLM-gateway workload:

| ρ (autocorrelation) | regret(marginal-AND) | regret(joint-LP) | ε |
| --- | --- | --- | --- |
| −1.0 (alternation) | 40.00% | 0.00% | +40.00% |
| 0.0 (independent) | 40.50% | 1.01% | +39.49% |
| +1.0 (one type forever) | 32.50% | 65.00% | −32.50% (the foil) |
| mean | 38.90% | 13.57% | +25.33% |

Mean ε = 25.33% ≫ the 5% ship gate (DR-19) → shipped as opt-in policy: "joint-lp" in 0.11.1.

API

// Supply a workload model — the library solves the fluid LP once at construction:
const admit = unifiedAdmission({
  cost: rateLimit({ strategy: tokenBudget({ budget: 50_000, windowMs: 60_000 }) }),
  policy: "joint-lp",
  jointLp: {
    workload: {
      types: [
        { cost: 100,    value: 1,  weight: 0.5 },  // small completion
        { cost: 10_000, value: 50, weight: 0.5 },  // large completion
      ],
      rateBudget: 1_000,
      costBudget: 50_000,
    },
  },
});

// …or supply precomputed bid prices directly (e.g. solved offline):
//   jointLp: { duals: { rate: 0, cost: 0.01 } }

const { decision, release, policyDenied } = admit.admitSync({ cost: 10_000, value: 50 });
// policyDenied === true ⇒ the bid-price filter bound (every axis had room).

solveFluidLp(...) is also exported standalone (returns { duals, admitFractions, objective }). Per-call value defaults to 1. The policy is strictly more selective than marginal-AND — it only ever removes admits, so it cannot breach any limit — and runs identically over the sequential and lua-fused backends. Default "marginal" is byte-for-byte unchanged. Requires a cost axis.
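As a worked illustration of the bid-price test, the sketch below applies value ≥ p_R + p_C · cost on top of an already-computed marginal-AND result. It is a simplified stand-in rather than the library's code: clearsBidPrice and jointAdmitSketch are hypothetical names, and the duals are the precomputed pair from the commented-out jointLp.duals line in the API example.

```typescript
// Hypothetical sketch of the 2-axis bid-price filter layered on marginal-AND.
interface Duals {
  rate: number; // p_R
  cost: number; // p_C
}

interface PricedRequest {
  cost: number;
  value?: number; // per-call value defaults to 1
}

function clearsBidPrice(req: PricedRequest, duals: Duals): boolean {
  const value = req.value ?? 1;
  return value >= duals.rate + duals.cost * req.cost;
}

// The filter can only turn an admit into a deny, never the reverse.
function jointAdmitSketch(marginalAllowed: boolean, req: PricedRequest, duals: Duals) {
  const clears = clearsBidPrice(req, duals);
  return {
    allowed: marginalAllowed && clears,
    policyDenied: marginalAllowed && !clears, // every axis had room, the price bound
  };
}

const offlineDuals: Duals = { rate: 0, cost: 0.01 };

// Small completion: threshold 0 + 0.01 * 100 = 1, value 1 clears it.
const smallResult = jointAdmitSketch(true, { cost: 100, value: 1 }, offlineDuals);

// Large completion: threshold 0 + 0.01 * 10_000 = 100, value 50 does not.
const largeResult = jointAdmitSketch(true, { cost: 10_000, value: 50 }, offlineDuals);

console.log(smallResult, largeResult);
```

Because the filter is ANDed with the marginal result, a request denied by an axis stays denied with policyDenied false, which keeps the two denial causes distinguishable.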

Honest caveat — do not enable blindly

The ρ = +1 row has a negative ε: under a highly autocorrelated, near-absorbing workload (long runs of one type), the static fluid-LP duals can under-perform marginal-AND. This is the textbook fluid-LP failure under non-stationarity (Talluri–van Ryzin 1998). Real aggregator traffic sits in moderate ρ where joint-LP wins by +39–40%, but if your arrivals are strongly autocorrelated, re-measure ε on your own trace and keep the default unless joint-LP still wins.

Online dual refinement — jointLp.adaptive (opt-in, 0.11.3)

If you can't pin the prior confidently, let the policy learn the bid prices online (Devanur–Hayes sample-then-price). Requires the workload form:

const admit = unifiedAdmission({
  cost,
  policy: "joint-lp",
  jointLp: {
    workload,                        // the construction PRIOR (+ per-arrival budgets)
    adaptive: { sampleWindow: 500 }, // observe 500 requests, then re-price
  },
});

It prices the first sampleWindow requests with the prior while observing the live (cost, value) mixture, then re-solves the fluid LP and adopts the learned duals only if they beat the prior on the observed sample, else keeps the prior — then freezes. So:

  • a misspecified prior is rescued (a prior whose duals reject everything is escaped — ~100% → ~20–30% regret in the gate);
  • a correct prior is kept (noise can't dislodge it; the naïve "always re-price" variant instead hurts a correct prior, 9.9–21.1% vs static's 0.7–1.2% — that design was rejected by the gate).
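The guard described above ("adopt only if the learned duals beat the prior on the observed sample") can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: it scores each dual pair by replaying the sample in arrival order against a cost budget only, it takes the learned duals as given instead of re-solving the LP, and replayValue and chooseDuals are hypothetical names.

```typescript
// Hypothetical sketch of guarded adoption: keep the prior unless the learned
// duals earn strictly more value on the observed sample.
interface Duals {
  rate: number;
  cost: number;
}

interface Observation {
  cost: number;
  value: number;
}

// Replay the sample in arrival order: admit when the request clears the bid
// price and the cost budget still has room; return the total admitted value.
function replayValue(sample: readonly Observation[], duals: Duals, costBudget: number): number {
  let spent = 0;
  let total = 0;
  for (const obs of sample) {
    const clears = obs.value >= duals.rate + duals.cost * obs.cost;
    if (clears && spent + obs.cost <= costBudget) {
      spent += obs.cost;
      total += obs.value;
    }
  }
  return total;
}

function chooseDuals(prior: Duals, learned: Duals, sample: readonly Observation[], costBudget: number): Duals {
  const priorScore = replayValue(sample, prior, costBudget);
  const learnedScore = replayValue(sample, learned, costBudget);
  return learnedScore > priorScore ? learned : prior; // ties keep the prior
}

const observedSample: Observation[] = [
  { cost: 1_000, value: 5 },
  { cost: 100, value: 10 },
  { cost: 100, value: 10 },
  { cost: 100, value: 10 },
];
const sampleBudget = 1_000;

const rejectAllPrior: Duals = { rate: 0, cost: 1 };  // misspecified: prices out everything
const sensibleDuals: Duals = { rate: 0, cost: 0.05 }; // admits only the high-density requests
const noisyDuals: Duals = { rate: 0, cost: 0 };       // admits whatever arrives first

const rescued = chooseDuals(rejectAllPrior, sensibleDuals, observedSample, sampleBudget);
const kept = chooseDuals(sensibleDuals, noisyDuals, observedSample, sampleBudget);
console.log(rescued === sensibleDuals, kept === sensibleDuals);
```

The strict comparison is the design choice that matters: a tie or a noisy candidate leaves the prior in place, so the prior remains the floor on the observed sample.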

Honest scope: the guarantee is non-inferiority on the observed sample, not over the full horizon — under autocorrelated arrivals the window can be unrepresentative and an adopted dual can be slightly worse on the full stream (the ρ=+1 foil's cousin; bounded, +~0.8pp measured). Prefer a larger sampleWindow on bursty traffic; the prior is always the floor. With a concurrency axis the window counts the concurrency-passed population.

Concurrency shadow price — the 3-axis filter (opt-in, 0.11.3)

The 2-axis filter prices two flow budgets (rate, cost). A third axis — concurrency — is a stock (a held slot), which looks like it doesn't fit the same fluid relaxation. It does, via Little's law: an occupancy cap L over a window T is a concurrency-seconds budget K = L·T, and each admit consumes its hold time h. The bid test gains a term:

admit iff  value ≥ p_R + p_C·cost + p_K·hold

This rejects a hold-time hog — a request that is cheap and valuable per token but holds a worker slot for a long time — that the 2-axis filter is structurally blind to (two requests identical on cost+value but 10× apart in hold time look the same to it). The gate (three-axis-gate.ts) measures regret 53% → 2% (ε≈51pp) when concurrency binds and the hog is indistinguishable on (rate, cost).

const admit = unifiedAdmission({
  concurrency,                                 // the real occupancy limiter
  cost,
  policy: "joint-lp",
  jointLp: {
    workload: {
      types: [
        { cost: 100, value: 10, weight: 1800, hold: 15 },   // short — frees its slot fast
        { cost: 100, value: 10, weight: 200, hold: 200 },    // long  — a concurrency hog
      ],
      rateBudget: 2000, costBudget: 1e9,
      concBudget: 20_000,                       // K = L·T  (e.g. L=10 slots × T=2000)
    },
  },
});
// pass the request's expected service time per call:
admit.admitSync({ cost: 100, value: 10, hold: 200 }); // policyDenied — the hog is priced out
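The arithmetic behind this configuration can be checked by hand. The sketch below recomputes K = L·T, compares it with the hold-time demand of the two workload types, and applies the three-term bid test. It is illustrative only: the dual values are made up for the example (the library derives them from the fluid LP), and clearsThreeAxis and pricedHold are hypothetical names.

```typescript
// Hypothetical sketch: Little's-law budget plus the 3-axis bid test.
interface WorkType {
  cost: number;
  value: number;
  weight: number; // expected arrivals over the window
  hold: number;   // expected slot-hold time per request
}

const slots = 10;         // L
const windowLen = 2000;   // T, in the same time unit as hold
const concBudget = slots * windowLen; // K = L * T

const workTypes: WorkType[] = [
  { cost: 100, value: 10, weight: 1800, hold: 15 },
  { cost: 100, value: 10, weight: 200, hold: 200 },
];

// Total slot-time the workload would consume if everything were admitted.
const holdDemand = workTypes.reduce((sum, t) => sum + t.weight * t.hold, 0);
const concurrencyBinds = holdDemand > concBudget;

interface ThreeAxisDuals {
  rate: number; // p_R
  cost: number; // p_C
  conc: number; // p_K
}

// Fail-open: a missing, non-finite, or negative hold contributes no term.
function pricedHold(hold: number | undefined): number {
  return hold !== undefined && Number.isFinite(hold) && hold >= 0 ? hold : 0;
}

function clearsThreeAxis(
  req: { cost: number; value: number; hold?: number },
  duals: ThreeAxisDuals,
): boolean {
  return req.value >= duals.rate + duals.cost * req.cost + duals.conc * pricedHold(req.hold);
}

// Illustrative duals only: cost is ample (p_C = 0), concurrency is scarce.
const exampleDuals: ThreeAxisDuals = { rate: 0, cost: 0, conc: 0.5 };

const shortAdmitted = clearsThreeAxis({ cost: 100, value: 10, hold: 15 }, exampleDuals);
const hogAdmitted = clearsThreeAxis({ cost: 100, value: 10, hold: 200 }, exampleDuals);
console.log(concBudget, holdDemand, concurrencyBinds, shortAdmitted, hogAdmitted);
```

Demand is 1800·15 + 200·200 = 67,000 slot-time units against a budget of 20,000, so concurrency binds and p_K is positive. With p_K = 0.5 the short request faces a threshold of 7.5 and passes, while the hog faces 100 and is priced out even though the two are identical on cost and value.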

Honest scope: it earns its keep only when concurrency BINDS and the hog is strictly dominated (a bid-price threshold can't ration a marginal hog — the same limit as the ρ=+1 foil); when concurrency is ample p_K = 0 and it's a no-op. A missing / non-finite / negative per-request hold is fail-open (no concurrency term — never a wrongful reject, and a hog can't dodge the price by reporting a negative hold). Not combinable with jointLp.adaptive yet.

Deferred / future work

| Item | Where | Status |
| --- | --- | --- |
| policy: "joint-lp" | runtime bid-price filter on unifiedAdmission | ✅ shipped 0.11.1 (ε = 25.33%) |
| Online primal-dual (Devanur–Hayes sample-then-price) | jointLp.adaptive: guarded warm-up on unifiedAdmission | ✅ shipped 0.11.3 (guarded self-validating; D-JLP-13/14) |
| 3-axis joint LP (rate + cost + concurrency shadow price) | value ≥ p_R + p_C·cost + p_K·hold via Little's law (concBudget + per-request hold) | ✅ shipped 0.11.3 (D-JLP-15/16) |
| Decision.bindingAxis field | breaking change to Decision shape; use the OTel attr + lastDecisions() / policyDenied instead | 1.0 candidate |

See research/bigger-bets/joint-lp-admission/DESIGN.md (D-JLP-1..16) for the policy design, research/bigger-bets/unified/DESIGN.md for the composition algebra, and research/bigger-bets/PLAN.md for the roadmap.

Recipes

Fastest-fail order matters (when correlated with cost)

Sequential evaluates concurrency → rate → cost (in-process first, fastest fail). Commutativity of combineDecisions (the proven law) means the result doesn't depend on order — only the short-circuit cost. For LLM gateways the concurrency axis usually binds first under load, so this saves the Redis round-trip in the common deny path.
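The saving from short-circuiting can be seen in a toy evaluator. The sketch below is an assumption-laden illustration, not the library's sequential backend: evaluateSequentialSketch, SeqAxis and AxisCheck are invented names, and the "remote" axes are plain functions that count how often they are called.

```typescript
// Hypothetical sketch: evaluate axes in concurrency -> rate -> cost order and
// stop at the first denial, so later (possibly remote) checks never run.
type SeqAxis = "concurrency" | "rate" | "cost";

interface AxisResult {
  allowed: boolean;
  retryAfterMs: number;
}

type AxisCheck = () => AxisResult;

const SEQ_ORDER: readonly SeqAxis[] = ["concurrency", "rate", "cost"];

function evaluateSequentialSketch(checks: Partial<Record<SeqAxis, AxisCheck>>) {
  const snapshot: Partial<Record<SeqAxis, AxisResult>> = {};
  let evaluated = 0;
  for (const axis of SEQ_ORDER) {
    const check = checks[axis];
    if (check === undefined) continue; // unconfigured axis
    const result = check();
    evaluated += 1;
    snapshot[axis] = result;
    if (!result.allowed) break; // short-circuit: skip the remaining axes
  }
  return { snapshot, evaluated };
}

// Count calls to the axes that would cost a network round-trip.
let remoteCalls = 0;
const remoteAllow: AxisCheck = () => {
  remoteCalls += 1;
  return { allowed: true, retryAfterMs: 0 };
};

// Under load the in-process concurrency axis denies first.
const underLoad = evaluateSequentialSketch({
  concurrency: () => ({ allowed: false, retryAfterMs: 50 }),
  rate: remoteAllow,
  cost: remoteAllow,
});

console.log(underLoad.evaluated, remoteCalls);
```

In the deny path only the in-process check runs, and the skipped axes stay absent from the snapshot, which is the absence signal lastDecisions() exposes.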

Federated unified admission

Federation (federate({ coordinator, ... })) is shipped as of 0.8.3. unifiedAdmission composes with it for free — pass a federated Limiter as the rate or cost axis. No new surface is needed; the unified layer doesn't know or care about the federation layer. Tested in test/admission/unified.test.ts.

Configuring multiple tenants

Concurrency is keyless (one global guard per process). Rate and cost are per-key — admit({ key: "tenant:abc", cost: 1500 }) keys the rate and cost limits independently per tenant. For weighted fair sharing across tenants, layer weightedFairShare(...) upstream of unifiedAdmission.

Error path on release

const { decision, release } = await admit.admit({ key, cost });
if (!decision.allowed) return reject(decision);
try {
  await doWork();
  release({ dropped: false });
} catch (err) {
  release({ dropped: true });  // signal overload → AIMD contracts the ceiling
  throw err;
}

The dropped: true flag propagates to the underlying gradient2 / AIMD update — the limit contracts on overload signals.
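For readers unfamiliar with AIMD, the following sketch shows the general shape of the update the dropped flag feeds. It is a textbook additive-increase, multiplicative-decrease rule, not throttlekit's gradient2 or AIMD code: the class name, the +1 step and the 0.5 backoff factor are all assumptions made for the example.

```typescript
// Hypothetical AIMD sketch: grow the limit by 1 on a clean release, halve it
// on a dropped release, and clamp to [minLimit, maxLimit].
class AimdLimitSketch {
  private current: number;
  private readonly minLimit: number;
  private readonly maxLimit: number;
  private readonly backoff: number;

  constructor(minLimit: number, maxLimit: number, backoff = 0.5) {
    this.minLimit = minLimit;
    this.maxLimit = maxLimit;
    this.backoff = backoff;
    this.current = minLimit; // start conservatively at the floor
  }

  get limit(): number {
    return this.current;
  }

  onRelease(info: { dropped: boolean }): void {
    this.current = info.dropped
      ? Math.max(this.minLimit, Math.floor(this.current * this.backoff))
      : Math.min(this.maxLimit, this.current + 1);
  }
}

const aimd = new AimdLimitSketch(4, 32);

// Twelve clean releases: 4 -> 16.
for (let i = 0; i < 12; i += 1) aimd.onRelease({ dropped: false });
const afterGrowth = aimd.limit;

// One overload signal: 16 -> 8.
aimd.onRelease({ dropped: true });
const afterDrop = aimd.limit;

console.log(afterGrowth, afterDrop);
```

The asymmetry is the point: the ceiling recovers slowly and contracts quickly, so reporting dropped: true on failures is what lets the concurrency axis back off under overload.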

See also

  • Distributed & provable: twoTier(leased), windowCoupled, the per-window overshoot bound.
  • Federation — the cross-cluster federation primitive that composes with unifiedAdmission.
  • Operations — Prometheus / Grafana / OTel guidance.
  • GALE & TALE — the bounded-overshoot guarantees unifiedAdmission plugs into.
  • examples/unified.ts in the repo — runnable LLM-gateway-style demo.
  • research/bigger-bets/unified/DESIGN.md — the design lock + decision records.
  • research/bigger-bets/unified/THEORY.md — joint-vs-marginal empirical regret analysis.
