Unified Admission

Unified admission — one decision across rate, concurrency, cost

unifiedAdmission(...) composes the three orthogonal admission axes a real API request must clear — rate (req/min), concurrency (in-flight ceiling), and cost (tokens per window) — into a single Decision via a pure, four-law algebra (combineDecisions). It's the shape LLM gateways need: one decision, one observable binding axis, one shared retry hint.

import {
  adaptiveConcurrency,
  gcra,
  rateLimit,
  tokenBucket,
  unifiedAdmission,
} from "throttlekit";

const admit = unifiedAdmission({
  rate:        rateLimit({ strategy: gcra({ limit: 60, periodMs: 60_000 }) }),
  concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
  cost:        rateLimit({ strategy: tokenBucket({ capacity: 100_000, refillPerSec: 1_667 }) }),
});

// In an express handler:
const { decision, release } = await admit.admit({
  key:  req.user.id,
  cost: req.body.maxTokens ?? 1000,
});

if (!decision.allowed) {
  res.setHeader("Retry-After", Math.ceil(decision.retryAfterMs / 1000));
  return res.status(429).json({ error: "rate_limited", retryAfterMs: decision.retryAfterMs });
}

res.on("finish", () => release({ dropped: false }));
res.on("close",  () => release({ dropped: true }));  // client hung up
await callLLM(req.body);

The algebra (`combineDecisions`)

Aggregation across axes:

Field	Rule	Why
`allowed`	`a.allowed && b.allowed`	AND — both must allow
`limit`	`min(a.limit, b.limit)`	binding ceiling — what the client should see
`remaining`	`min(a.remaining, b.remaining)`	binding remainder
`resetAt`	`max(a.resetAt, b.resetAt)`	latest-resolution wait
`retryAfterMs`	`max(a.retryAfterMs, b.retryAfterMs)`	dominant wait — never under-state

Four algebraic laws hold, each property-tested with fast-check at numRuns ≥ 500: identity, associativity, commutativity, and idempotency. In practice, commutativity means axis evaluation order doesn't change the result; associativity means N inputs reduce flat; idempotency means retried sub-checks are safe; and identity means unused axes plug in cleanly via the ALLOW_FULL neutral element.

combineDecisions and ALLOW_FULL are publicly exported off the root — useful for tests and N-ary composition.
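The rules in the table can be sketched in a few lines. This is a local re-implementation for illustration, not the library's source: combineSketch and ALLOW_FULL_SKETCH are hypothetical names, and the values chosen for the neutral element (infinite limit and remaining, zero waits) are an assumption that simply makes the identity law hold for the min/max rules above.

```typescript
interface Decision {
  allowed: boolean;
  limit: number;
  remaining: number;
  resetAt: number;      // epoch ms
  retryAfterMs: number;
}

// Assumed neutral element: combining with it leaves any Decision unchanged.
const ALLOW_FULL_SKETCH: Decision = {
  allowed: true,
  limit: Infinity,
  remaining: Infinity,
  resetAt: 0,
  retryAfterMs: 0,
};

// One rule per field, exactly as listed in the table.
function combineSketch(a: Decision, b: Decision): Decision {
  return {
    allowed: a.allowed && b.allowed,
    limit: Math.min(a.limit, b.limit),
    remaining: Math.min(a.remaining, b.remaining),
    resetAt: Math.max(a.resetAt, b.resetAt),
    retryAfterMs: Math.max(a.retryAfterMs, b.retryAfterMs),
  };
}

// Made-up per-axis results: the rate axis allows, the cost axis denies.
const rateDecision: Decision = {
  allowed: true, limit: 60, remaining: 12, resetAt: 1_000, retryAfterMs: 0,
};
const costDecision: Decision = {
  allowed: false, limit: 100_000, remaining: 0, resetAt: 4_000, retryAfterMs: 3_000,
};

// N-ary composition is a flat reduce seeded with the neutral element.
const joint = [rateDecision, costDecision].reduce(combineSketch, ALLOW_FULL_SKETCH);
// joint: { allowed: false, limit: 60, remaining: 0, resetAt: 4000, retryAfterMs: 3000 }
```

Because every field is combined with AND, min, or max, swapping the inputs or combining a Decision with itself gives the same result, which is what the four laws state.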

Two backend modes

Mode	When to use	Wire cost
Sequential (default)	Any backend mix (in-process + Redis + Postgres)	rate-axis RTT + cost-axis RTT (often pipelined to ~1 RTT)
Lua-fused (opt-in)	All rate/cost on the same Redis client; you want atomic joint enforcement	1 RTT regardless of axes

The lua-fused path ships GCRA + tokenBucket fusion in 0.9.0 — the LLM-gateway combo:

import Redis from "ioredis";
import { adaptiveConcurrency, unifiedAdmission } from "throttlekit";
import { fromIoredis } from "throttlekit/redis";

const admit = unifiedAdmission({
  concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
  backend: "lua-fused",
  fused: {
    client: fromIoredis(new Redis(process.env.REDIS_URL!)),
    rate: { strategy: "gcra",        limit: 60,     periodMs: 60_000,    prefix: "rl:rate" },
    cost: { strategy: "tokenBucket", capacity: 100_000, refillPerSec: 1_667, prefix: "rl:cost" },
  },
});

Sequential ≡ Lua-fused: the two backends produce byte-identical Decision streams. The property is checked across 100 fast-check timelines per configuration (rate-binding, cost-binding, both-binding) in test/admission/fused-conformance.test.ts (TK-1006).

Observability — the binding axis

When an admission denies, which axis was binding? That's the #1 missing OTel signal for LLM gateways today. Two helpers from the throttlekit/observability subpath:

import { trace } from "@opentelemetry/api";
import { bindingAxisOf, recordUnifiedAdmissionOnSpan } from "throttlekit/observability";

const { decision, release } = await admit.admit({ key, cost });
const span = trace.getActiveSpan();
if (span) recordUnifiedAdmissionOnSpan(span, decision, admit.lastDecisions());

// Or query directly:
if (!decision.allowed) {
  log.info({ axis: bindingAxisOf(admit.lastDecisions()), retryAfterMs: decision.retryAfterMs });
}

The attribute key is throttlekit.binding_axis ∈ {"rate", "concurrency", "cost"}. It is set only on denied admissions and omitted when the request is allowed. When multiple axes deny (possible in lua-fused mode), the reported axis follows the priority concurrency → rate → cost. This matches the sequential evaluation order, so the value is deterministic regardless of backend.

UnifiedAdmitter.lastDecisions() returns a frozen per-axis snapshot ({ rate, concurrency, cost }). Unconfigured axes are undefined, and so are axes skipped by a short-circuit, so the first denying axis can be identified from which axes are absent.
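As a rough illustration of the priority rule, the sketch below picks the binding axis from a per-axis snapshot. It is not the library's bindingAxisOf; bindingAxisSketch, AxisSnapshot, and the sample snapshots are hypothetical, and only the concurrency → rate → cost ordering and the "undefined means not evaluated" convention come from the text.

```typescript
type Axis = "concurrency" | "rate" | "cost";

interface AxisDecision {
  allowed: boolean;
  retryAfterMs: number;
}

// undefined means "axis not configured" or "axis skipped by a short-circuit".
type AxisSnapshot = Partial<Record<Axis, AxisDecision>>;

// Priority stated in the text: concurrency, then rate, then cost.
const AXIS_PRIORITY: readonly Axis[] = ["concurrency", "rate", "cost"];

function bindingAxisSketch(snapshot: AxisSnapshot): Axis | undefined {
  for (const axis of AXIS_PRIORITY) {
    const d = snapshot[axis];
    if (d !== undefined && !d.allowed) return axis; // first denier in priority order
  }
  return undefined; // nothing denied, so no binding axis is reported
}

// Fused-style snapshot: both remote axes were evaluated and both denied.
const bothDeny: AxisSnapshot = {
  concurrency: { allowed: true, retryAfterMs: 0 },
  rate: { allowed: false, retryAfterMs: 1_000 },
  cost: { allowed: false, retryAfterMs: 5_000 },
};

// Sequential-style snapshot: concurrency denied first, the rest were skipped.
const shortCircuited: AxisSnapshot = {
  concurrency: { allowed: false, retryAfterMs: 250 },
};

const fusedAxis = bindingAxisSketch(bothDeny);            // "rate"
const sequentialAxis = bindingAxisSketch(shortCircuited); // "concurrency"
```

Note that the reported axis is the first denier in priority order, not the axis with the longest wait: in the fused snapshot cost has the larger retryAfterMs, yet rate is reported.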

Behind a ThrottleKit server the binding axis is also readable remotely and from any language via the read-only Monitor door (GetSnapshot / Watch), and exported to Prometheus /metrics as throttlekit_denied_by_axis_total — the same signal, off the request path. See the Monitoring guide and Operations.

Lifecycle — `admit()` vs `admitSync()`

admit() is async and returns Promise<UnifiedAdmission> — works for any backend mix (Redis, Postgres, in-process).
admitSync() is the sync sibling — only valid when every configured axis has checkSync (in-process MemoryStore for rate/cost; concurrency is always sync). Throws otherwise (same convention as Limiter.check / Limiter.checkSync).

Both return { decision, release } — the release is the lifecycle hook for the concurrency slot, separate from the Decision because concurrency has lease semantics (acquire-release) that don't fit Limiter's stateless .check() → Decision shape (the locked decision is D-U4 in research/bigger-bets/unified/DESIGN.md §14).

Idempotency: a second release() call is a no-op. A denied admit's release is a no-op (no slot was held — any transient acquire upstream of the binding axis was released as part of the short-circuit).
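The idempotency contract is easy to picture as a one-shot wrapper. The sketch below is an assumption about the shape of such a guard, not the library's code; onceRelease, NOOP_RELEASE, and the seen array are hypothetical names used only for illustration.

```typescript
interface ReleaseInfo {
  dropped: boolean;
}
type Release = (info: ReleaseInfo) => void;

// Wraps a slot-freeing callback so that only the first call has any effect.
function onceRelease(freeSlot: Release): Release {
  let done = false;
  return (info) => {
    if (done) return; // second and later calls are no-ops
    done = true;
    freeSlot(info);
  };
}

// What a denied admission could hand back: nothing was held, nothing to free.
const NOOP_RELEASE: Release = () => {};

// Record what actually reaches the underlying limiter.
const seen: ReleaseInfo[] = [];
const releaseSlot = onceRelease((info) => {
  seen.push(info);
});

releaseSlot({ dropped: false }); // e.g. the response finished normally
releaseSlot({ dropped: true });  // e.g. a later "close" event: ignored
NOOP_RELEASE({ dropped: true }); // denied admit: never touches the limiter
```

This is why registering release on both "finish" and "close" in the Express example is safe: whichever event fires first decides the dropped flag, and the other call does nothing.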

Joint-LP policy — bid-price admission (opt-in, 0.11.1)

Marginal-AND admits when each axis independently has room. When the cost axis binds and request types differ in value-per-cost-unit, that greedily burns budget on whatever arrives first — including cheap-to-pass, low-value, cost-heavy requests that starve the high-value requests arriving later. The fix from revenue management is a bid-price filter: admit iff the request's value clears the shadow price of the budget it consumes,

admit  ⟺  value ≥ p_R + p_C · cost

where (p_R, p_C) are the dual variables of the workload's fluid LP. The literature (Talluri–van Ryzin 1998; Devanur–Hayes 2009; Buchbinder–Jain–Naor 2007) shows static bid prices are asymptotically fluid-optimal under (approximate) stationarity. research/bigger-bets/unified/THEORY.md (TK-1007) calibrated the gap on an LLM-gateway workload:

ρ (autocorrelation)	regret(marginal-AND)	regret(joint-LP)	ε
−1.0 (alternation)	40.00%	0.00%	+40.00%
0.0 (independent)	40.50%	1.01%	+39.49%
+1.0 (one type forever)	32.50%	65.00%	−32.50% (the foil)
mean	38.90%	13.57%	+25.33%

Mean ε = 25.33% ≫ the 5% ship gate (DR-19) → shipped as opt-in policy: "joint-lp" in 0.11.1.
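The ε column reads as regret(marginal-AND) minus regret(joint-LP), in percentage points. The snippet below only re-derives that column from the numbers printed in the table and compares the mean against the 5% gate; it does not reproduce the calibration itself. The mean row is copied as given rather than recomputed from the three rows shown, and regretRows, epsilonOf, and SHIP_GATE are hypothetical names.

```typescript
interface RegretRow {
  rho: string;
  marginal: number; // regret of marginal-AND, in %
  jointLp: number;  // regret of joint-LP, in %
}

// Values copied from the table above.
const regretRows: readonly RegretRow[] = [
  { rho: "-1.0", marginal: 40.0, jointLp: 0.0 },
  { rho: "0.0", marginal: 40.5, jointLp: 1.01 },
  { rho: "+1.0", marginal: 32.5, jointLp: 65.0 },
  { rho: "mean", marginal: 38.9, jointLp: 13.57 },
];

// Positive ε means joint-LP has lower regret than marginal-AND.
const epsilonOf = (r: RegretRow): number => r.marginal - r.jointLp;

const epsilons = regretRows.map((r) => ({
  rho: r.rho,
  epsilon: epsilonOf(r).toFixed(2),
}));
// "40.00", "39.49", "-32.50", "25.33"

// Gate check on the mean row only.
const SHIP_GATE = 5;
const meanRow = regretRows[regretRows.length - 1];
const clearsGate = meanRow !== undefined && epsilonOf(meanRow) > SHIP_GATE; // true
```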

API

// Supply a workload model — the library solves the fluid LP once at construction:
const admit = unifiedAdmission({
  cost: rateLimit({ strategy: tokenBudget({ budget: 50_000, windowMs: 60_000 }) }),
  policy: "joint-lp",
  jointLp: {
    workload: {
      types: [
        { cost: 100,    value: 1,  weight: 0.5 },  // small completion
        { cost: 10_000, value: 50, weight: 0.5 },  // large completion
      ],
      rateBudget: 1_000,
      costBudget: 50_000,
    },
  },
});

// …or supply precomputed bid prices directly (e.g. solved offline):
//   jointLp: { duals: { rate: 0, cost: 0.01 } }

const { decision, release, policyDenied } = admit.admitSync({ cost: 10_000, value: 50 });
// policyDenied === true ⇒ the bid-price filter bound (every axis had room).

solveFluidLp(...) is also exported standalone and returns { duals, admitFractions, objective }. The per-call value defaults to 1. The policy is never less selective than marginal-AND: it only ever removes admits, so it cannot breach any limit. It runs identically over the sequential and lua-fused backends and requires a cost axis. The default policy, "marginal", is byte-for-byte unchanged.
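To make the bid test concrete, here is the arithmetic for the two request types from the workload above, priced with the precomputed duals shown in the comment ({ rate: 0, cost: 0.01 }). This is a standalone sketch of the inequality value ≥ p_R + p_C · cost, not a call into the library; clearsBidPrice and the interface names are hypothetical, and whether the library's LP solve yields exactly these duals for that workload is not claimed here.

```typescript
interface Duals {
  rate: number; // p_R: shadow price of one request
  cost: number; // p_C: shadow price of one cost unit
}

interface Bid {
  cost: number;
  value: number;
}

// The bid-price test: admit iff value >= p_R + p_C * cost.
function clearsBidPrice(req: Bid, duals: Duals): boolean {
  return req.value >= duals.rate + duals.cost * req.cost;
}

const exampleDuals: Duals = { rate: 0, cost: 0.01 };

const smallCompletion: Bid = { cost: 100, value: 1 };     // price: 0 + 0.01 * 100    = 1
const largeCompletion: Bid = { cost: 10_000, value: 50 }; // price: 0 + 0.01 * 10_000 = 100

const smallAdmitted = clearsBidPrice(smallCompletion, exampleDuals); // true  (1 >= 1)
const largeAdmitted = clearsBidPrice(largeCompletion, exampleDuals); // false (50 < 100)
```

Under these duals the large completion is priced out even though its absolute value is higher: it earns 0.005 per cost unit against the small completion's 0.01, and the cost price of 0.01 is the cut-off.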

Honest caveat — do not enable blindly

The ρ = +1 row has a negative ε: under a highly autocorrelated, near-absorbing workload (long runs of one type), the static fluid-LP duals can under-perform marginal-AND. This is the textbook fluid-LP failure under non-stationarity (Talluri–van Ryzin 1998). Real aggregator traffic sits in moderate ρ, where joint-LP wins by roughly 39–40 percentage points, but if your arrivals are strongly autocorrelated, re-measure ε on your own trace before moving off the default.

Online dual refinement — `jointLp.adaptive` (opt-in, 0.11.3)

If you can't pin the prior confidently, let the policy learn the bid prices online (Devanur–Hayes sample-then-price). Requires the workload form:

const admit = unifiedAdmission({
  cost,
  policy: "joint-lp",
  jointLp: {
    workload,                        // the construction PRIOR (+ per-arrival budgets)
    adaptive: { sampleWindow: 500 }, // observe 500 requests, then re-price
  },
});

It prices the first sampleWindow requests with the prior while observing the live (cost, value) mixture, then re-solves the fluid LP and adopts the learned duals only if they beat the prior on the observed sample, else keeps the prior — then freezes. So:

a misspecified prior is rescued (a prior whose duals reject everything is escaped — ~100% → ~20–30% regret in the gate);
a correct prior is kept (noise can't dislodge it; the naïve "always re-price" variant instead hurts a correct prior, 9.9–21.1% vs static's 0.7–1.2% — that design was rejected by the gate).
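The "adopt only if it beats the prior on the observed sample" guard can be sketched as follows. This is a deliberately reduced model and an assumption throughout: it prices the cost axis only, takes the learned dual as an input instead of re-solving the fluid LP, and scores a dual by replaying the sample under a cost budget. replayValue, guardedAdopt, and the sample data are hypothetical.

```typescript
interface Arrival {
  cost: number;
  value: number;
}

// Value collected when the sample is replayed with bid price pCost under a cost budget.
function replayValue(sample: readonly Arrival[], pCost: number, costBudget: number): number {
  let spent = 0;
  let collected = 0;
  for (const r of sample) {
    const clearsBid = r.value >= pCost * r.cost;
    if (clearsBid && spent + r.cost <= costBudget) {
      spent += r.cost;
      collected += r.value;
    }
  }
  return collected;
}

// Guard: the learned dual replaces the prior only if it strictly wins on the sample.
function guardedAdopt(
  sample: readonly Arrival[],
  prior: number,
  learned: number,
  costBudget: number,
): number {
  const priorValue = replayValue(sample, prior, costBudget);
  const learnedValue = replayValue(sample, learned, costBudget);
  return learnedValue > priorValue ? learned : prior;
}

// Three low-value arrivals followed by three high-value ones; the budget fits three.
const lowValue: Arrival = { cost: 100, value: 1 };
const highValue: Arrival = { cost: 100, value: 5 };
const observed: readonly Arrival[] = [lowValue, lowValue, lowValue, highValue, highValue, highValue];
const sampleBudget = 300;

// Misspecified prior (price 1 rejects everything) is escaped.
const rescued = guardedAdopt(observed, 1, 0.02, sampleBudget);   // 0.02
// Correct prior is kept when the learned dual is worse on the sample.
const kept = guardedAdopt(observed, 0.02, 0.005, sampleBudget);  // 0.02
```

In the second call the lower price 0.005 admits the three low-value arrivals first and exhausts the budget (value 3), while the prior holds out for the high-value ones (value 15), so the prior stays.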

Honest scope: the guarantee is non-inferiority on the observed sample, not over the full horizon — under autocorrelated arrivals the window can be unrepresentative and an adopted dual can be slightly worse on the full stream (the ρ=+1 foil's cousin; bounded, +~0.8pp measured). Prefer a larger sampleWindow on bursty traffic; the prior is always the floor. With a concurrency axis the window counts the concurrency-passed population.

Concurrency shadow price — the 3-axis filter (opt-in, 0.11.3)

The 2-axis filter prices two flow budgets (rate, cost). A third axis — concurrency — is a stock (a held slot), which looks like it doesn't fit the same fluid relaxation. It does, via Little's law: an occupancy cap L over a window T is a concurrency-seconds budget K = L·T, and each admit consumes its hold time h. The bid test gains a term:

admit iff  value ≥ p_R + p_C·cost + p_K·hold

This rejects a hold-time hog — a request that is cheap and valuable per token but holds a worker slot for a long time — that the 2-axis filter is structurally blind to (two requests identical on cost+value but 10× apart in hold time look the same to it). The gate (three-axis-gate.ts) measures regret 53% → 2% (ε≈51pp) when concurrency binds and the hog is indistinguishable on (rate, cost).

const admit = unifiedAdmission({
  concurrency,                                 // the real occupancy limiter
  cost,
  policy: "joint-lp",
  jointLp: {
    workload: {
      types: [
        { cost: 100, value: 10, weight: 1800, hold: 15 },   // short — frees its slot fast
        { cost: 100, value: 10, weight: 200, hold: 200 },    // long  — a concurrency hog
      ],
      rateBudget: 2000, costBudget: 1e9,
      concBudget: 20_000,                       // K = L·T  (e.g. L=10 slots × T=2000)
    },
  },
});
// pass the request's expected service time per call:
admit.admitSync({ cost: 100, value: 10, hold: 200 }); // policyDenied — the hog is priced out

Honest scope: it earns its keep only when concurrency BINDS and the hog is strictly dominated. A bid-price threshold can't ration a marginal hog (the same limit as the ρ=+1 foil), and when concurrency is ample p_K = 0 and the term is a no-op. A missing, non-finite, or negative per-request hold is fail-open: the concurrency term is dropped, so it never causes a wrongful reject, and a negative hold cannot turn the term into a discount. Not combinable with jointLp.adaptive yet.
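A worked version of the numbers in the example above, as a sketch. The budget K = L·T and the hold demand are computed from the workload as written; the dual p_K = 0.5 is a hypothetical value picked for illustration, not the one the library's LP would return. holdTerm and clearsThreeAxisBid are hypothetical names that follow the fail-open rule described in the text.

```typescript
interface Duals3 {
  rate: number; // p_R
  cost: number; // p_C
  conc: number; // p_K: price of one unit of hold time
}

interface Bid3 {
  cost: number;
  value: number;
  hold?: number; // expected service time, same unit as the window T
}

// Fail-open: a missing, non-finite, or negative hold contributes no concurrency term.
function holdTerm(hold: number | undefined, pConc: number): number {
  if (hold === undefined || !Number.isFinite(hold) || hold < 0) return 0;
  return pConc * hold;
}

// admit iff value >= p_R + p_C * cost + p_K * hold
function clearsThreeAxisBid(req: Bid3, duals: Duals3): boolean {
  return req.value >= duals.rate + duals.cost * req.cost + holdTerm(req.hold, duals.conc);
}

// Budget from the example: L = 10 slots over a window of T = 2000.
const slots = 10;
const windowLen = 2_000;
const concBudget = slots * windowLen;      // 20_000
// Hold demand if every arrival were admitted: 1800 short + 200 long requests.
const holdDemand = 1_800 * 15 + 200 * 200; // 67_000, well above the budget
const concurrencyBinds = holdDemand > concBudget;

// Hypothetical duals: only the concurrency axis carries a price.
const hypotheticalDuals: Duals3 = { rate: 0, cost: 0, conc: 0.5 };

const shortOk = clearsThreeAxisBid({ cost: 100, value: 10, hold: 15 }, hypotheticalDuals);  // 10 >= 7.5
const hogOk = clearsThreeAxisBid({ cost: 100, value: 10, hold: 200 }, hypotheticalDuals);   // 10 <  100
const noHoldOk = clearsThreeAxisBid({ cost: 100, value: 10 }, hypotheticalDuals);           // fail-open
```

The two requests are identical on cost and value, so only the hold term separates them, which is the case the 2-axis filter cannot see.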

Deferred / future work

Item	Where	Status
`policy: "joint-lp"` runtime	bid-price filter on `unifiedAdmission`	✅ shipped 0.11.1 (ε = 25.33%)
Online primal-dual (Devanur–Hayes sample-then-price)	`jointLp.adaptive` — guarded warm-up on `unifiedAdmission`	✅ shipped 0.11.3 (guarded self-validating; D-JLP-13/14)
3-axis joint LP (rate + cost + concurrency shadow price)	`value ≥ p_R + p_C·cost + p_K·hold` via Little's law (`concBudget` + per-request `hold`)	✅ shipped 0.11.3 (D-JLP-15/16)
`Decision.bindingAxis` field	breaking change to Decision shape — use the OTel attr + `lastDecisions()` / `policyDenied` instead	1.0 candidate

See research/bigger-bets/joint-lp-admission/DESIGN.md (D-JLP-1..16) for the policy design, research/bigger-bets/unified/DESIGN.md for the composition algebra, and research/bigger-bets/PLAN.md for the roadmap.

Recipes

Fastest-fail order matters (when correlated with cost)

Sequential evaluates concurrency → rate → cost (in-process first, fastest fail). Because combineDecisions is commutative, the order doesn't change the result; it changes only the short-circuit cost. For LLM gateways the concurrency axis usually binds first under load, so this saves the Redis round-trip in the common deny path.
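The saving can be seen in a toy model where the rate and cost checks stand in for remote calls and a counter tracks how many were made. This is a sketch under assumptions: evaluateInOrder, checksUnderLoad, and the counter are hypothetical, the checks are synchronous for brevity, and real backends would be async.

```typescript
type AxisName = "concurrency" | "rate" | "cost";

interface AxisResult {
  allowed: boolean;
}
type AxisCheck = () => AxisResult;

// Evaluate axes in the given order and stop at the first deny.
function evaluateInOrder(
  order: readonly AxisName[],
  checks: Record<AxisName, AxisCheck>,
): { allowed: boolean; evaluated: AxisName[] } {
  const evaluated: AxisName[] = [];
  for (const axis of order) {
    evaluated.push(axis);
    if (!checks[axis]().allowed) return { allowed: false, evaluated };
  }
  return { allowed: true, evaluated };
}

// Under load: the in-process concurrency axis denies; rate and cost would allow.
let remoteCalls = 0;
const checksUnderLoad: Record<AxisName, AxisCheck> = {
  concurrency: () => ({ allowed: false }),
  rate: () => {
    remoteCalls += 1; // stands in for a Redis round-trip
    return { allowed: true };
  },
  cost: () => {
    remoteCalls += 1;
    return { allowed: true };
  },
};

const inProcessFirst = evaluateInOrder(["concurrency", "rate", "cost"], checksUnderLoad);
const callsInProcessFirst = remoteCalls; // 0

remoteCalls = 0;
const remoteFirst = evaluateInOrder(["rate", "cost", "concurrency"], checksUnderLoad);
const callsRemoteFirst = remoteCalls; // 2
```

Both orders deny the request; the difference is only that the in-process-first order never reaches the remote checks.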

Federated unified admission

Federation (federate({ coordinator, ... })) is shipped as of 0.8.3. unifiedAdmission composes with it for free — pass a federated Limiter as the rate or cost axis. No new surface is needed; the unified layer doesn't know or care about the federation layer. Tested in test/admission/unified.test.ts.

Configuring multiple tenants

Concurrency is keyless (one global guard per process). Rate and cost are per-key — admit({ key: "tenant:abc", cost: 1500 }) keys the rate and cost limits independently per tenant. For weighted fair sharing across tenants, layer weightedFairShare(...) upstream of unifiedAdmission.
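Per-key independence means each tenant gets its own bucket state. The sketch below is a generic keyed token bucket with an explicit clock, written for illustration; it is not the library's tokenBucket strategy, and keyedTokenBucket, its parameters, and the tenant keys other than "tenant:abc" are hypothetical.

```typescript
interface BucketState {
  tokens: number;
  updatedAtMs: number;
}

// One bucket per key; a new key starts with a full bucket.
function keyedTokenBucket(capacity: number, refillPerSec: number) {
  const buckets = new Map<string, BucketState>();
  return function tryTake(key: string, cost: number, nowMs: number): boolean {
    const prev = buckets.get(key) ?? { tokens: capacity, updatedAtMs: nowMs };
    const elapsedSec = Math.max(0, nowMs - prev.updatedAtMs) / 1_000;
    // Refill lazily, capped at capacity.
    const tokens = Math.min(capacity, prev.tokens + elapsedSec * refillPerSec);
    const admitted = cost <= tokens;
    buckets.set(key, { tokens: admitted ? tokens - cost : tokens, updatedAtMs: nowMs });
    return admitted;
  };
}

const takeCost = keyedTokenBucket(3_000, 50);

const abcFirst = takeCost("tenant:abc", 1_500, 0);      // true
const abcSecond = takeCost("tenant:abc", 1_500, 0);     // true, bucket now empty
const abcThird = takeCost("tenant:abc", 1_500, 0);      // false
const xyzFirst = takeCost("tenant:xyz", 1_500, 0);      // true: separate bucket
const abcLater = takeCost("tenant:abc", 1_500, 30_000); // true: 30 s * 50/s = 1500 refilled
```

Exhausting one tenant's bucket leaves the other tenant untouched, which is the per-key behaviour described above; the keyless concurrency guard would sit in front of both.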

Error path on release

const { decision, release } = await admit.admit({ key, cost });
if (!decision.allowed) return reject(decision);
try {
  await doWork();
  release({ dropped: false });
} catch (err) {
  release({ dropped: true });  // signal overload → AIMD contracts the ceiling
  throw err;
}

The dropped: true flag propagates to the underlying gradient2 / AIMD update — the limit contracts on overload signals.
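For intuition, here is a plain AIMD update driven by the dropped flag. It is a sketch only: it does not model gradient2, and the increase step, the backoff factor, and the starting limit are assumptions chosen for the example. Only minLimit: 4 and maxLimit: 32 are taken from the configuration shown earlier; nextLimit and AimdConfig are hypothetical names.

```typescript
interface AimdConfig {
  minLimit: number;
  maxLimit: number;
  increase: number; // additive step on success (assumed)
  backoff: number;  // multiplicative factor on overload (assumed)
}

// Additive increase on success, multiplicative decrease on dropped, clamped to the bounds.
function nextLimit(limit: number, dropped: boolean, cfg: AimdConfig): number {
  const proposed = dropped ? Math.floor(limit * cfg.backoff) : limit + cfg.increase;
  return Math.min(cfg.maxLimit, Math.max(cfg.minLimit, proposed));
}

const aimd: AimdConfig = { minLimit: 4, maxLimit: 32, increase: 1, backoff: 0.5 };

// dropped flags as they might arrive from release({ dropped }) calls.
const droppedFlags = [false, false, true, true, true, true];

const limits: number[] = [];
let currentLimit = 16;
for (const dropped of droppedFlags) {
  currentLimit = nextLimit(currentLimit, dropped, aimd);
  limits.push(currentLimit);
}
// limits: 17, 18, 9, 4, 4, 4
```

Two successes nudge the ceiling up by one each; repeated overload signals halve it until it rests on minLimit.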
