Unified Admission
unifiedAdmission(...) composes the three orthogonal admission axes a
real API request must clear — rate (req/min), concurrency
(in-flight ceiling), and cost (tokens per window) — into a single
Decision via a pure, four-law algebra (combineDecisions). It's the
shape LLM gateways need: one decision, one observable binding axis,
one shared retry hint.
import {
adaptiveConcurrency,
gcra,
rateLimit,
tokenBucket,
unifiedAdmission,
} from "throttlekit";
const admit = unifiedAdmission({
rate: rateLimit({ strategy: gcra({ limit: 60, periodMs: 60_000 }) }),
concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
cost: rateLimit({ strategy: tokenBucket({ capacity: 100_000, refillPerSec: 1_667 }) }),
});
// In an express handler:
const { decision, release } = await admit.admit({
key: req.user.id,
cost: req.body.maxTokens ?? 1000,
});
if (!decision.allowed) {
res.setHeader("Retry-After", Math.ceil(decision.retryAfterMs / 1000));
return res.status(429).json({ error: "rate_limited", retryAfterMs: decision.retryAfterMs });
}
res.on("finish", () => release({ dropped: false }));
res.on("close", () => release({ dropped: true })); // client hung up
await callLLM(req.body);

Aggregation across axes:
| Field | Rule | Why |
|---|---|---|
| allowed | a.allowed && b.allowed | AND: both must allow |
| limit | min(a.limit, b.limit) | binding ceiling: what the client should see |
| remaining | min(a.remaining, b.remaining) | binding remainder |
| resetAt | max(a.resetAt, b.resetAt) | latest-resolution wait |
| retryAfterMs | max(a.retryAfterMs, b.retryAfterMs) | dominant wait: never under-state |
Four algebraic laws hold (proven via fast-check at numRuns ≥ 500):
identity, associativity, commutativity, idempotency.
Together they mean axis evaluation order doesn't change the result,
N inputs reduce flat, retried sub-checks are safe, and unused axes
plug in cleanly via the ALLOW_FULL neutral element.
combineDecisions and ALLOW_FULL are publicly exported off the root
— useful for tests and N-ary composition.
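To make the aggregation rules and the laws concrete, here is a minimal self-contained sketch that re-implements the table above. It is not the library's source: `AxisDecision`, `NEUTRAL`, `combine`, and `combineAll` are hypothetical local stand-ins for `Decision`, `ALLOW_FULL`, and `combineDecisions`, and the neutral element's field values are an assumption (they presume non-negative timestamps and waits).

```typescript
// Illustrative stand-in for the library's Decision type (field names from the table).
interface AxisDecision {
  allowed: boolean;
  limit: number;
  remaining: number;
  resetAt: number; // epoch ms
  retryAfterMs: number;
}

// Assumed neutral element: combining with it leaves the other operand unchanged.
const NEUTRAL: AxisDecision = {
  allowed: true,
  limit: Infinity,
  remaining: Infinity,
  resetAt: 0,
  retryAfterMs: 0,
};

// Field-by-field aggregation, exactly as listed in the table.
function combine(a: AxisDecision, b: AxisDecision): AxisDecision {
  return {
    allowed: a.allowed && b.allowed,
    limit: Math.min(a.limit, b.limit),
    remaining: Math.min(a.remaining, b.remaining),
    resetAt: Math.max(a.resetAt, b.resetAt),
    retryAfterMs: Math.max(a.retryAfterMs, b.retryAfterMs),
  };
}

// N-ary composition is a flat fold that starts from the neutral element.
function combineAll(decisions: AxisDecision[]): AxisDecision {
  return decisions.reduce(combine, NEUTRAL);
}

const rateOk: AxisDecision = {
  allowed: true, limit: 60, remaining: 12, resetAt: 1_000, retryAfterMs: 0,
};
const costDenied: AxisDecision = {
  allowed: false, limit: 100_000, remaining: 0, resetAt: 4_000, retryAfterMs: 3_000,
};

const joint = combineAll([rateOk, costDenied]);
console.log(joint); // denied, limit 60, remaining 0, resetAt 4000, retryAfterMs 3000
```

Because every field is reduced with AND, min, or max, each of which is associative, commutative, and idempotent, the four laws follow field by field in this sketch.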
| Mode | When to use | Wire cost |
|---|---|---|
| Sequential (default) | Any backend mix (in-process + Redis + Postgres) | rate-axis RTT + cost-axis RTT (often pipelined to ~1 RTT) |
| Lua-fused (opt-in) | All rate/cost on the same Redis client; you want atomic joint enforcement | 1 RTT regardless of axes |
The lua-fused path ships GCRA + tokenBucket fusion in 0.9.0 — the LLM-gateway combo:
import Redis from "ioredis";
import { fromIoredis } from "throttlekit/redis";
const admit = unifiedAdmission({
concurrency: adaptiveConcurrency({ minLimit: 4, maxLimit: 32 }),
backend: "lua-fused",
fused: {
client: fromIoredis(new Redis(process.env.REDIS_URL!)),
rate: { strategy: "gcra", limit: 60, periodMs: 60_000, prefix: "rl:rate" },
cost: { strategy: "tokenBucket", capacity: 100_000, refillPerSec: 1_667, prefix: "rl:cost" },
},
});

Sequential ≡ Lua-fused: the byte-identical Decision-stream property is
proven across 100 fast-check timelines per (rate-binding,
cost-binding, both-binding) configuration in
test/admission/fused-conformance.test.ts (TK-1006).
When an admission is denied, which axis was binding? That signal is
rarely exposed by OTel instrumentation for LLM gateways today. Two helpers
from the throttlekit/observability subpath provide it:
import { trace } from "@opentelemetry/api";
import { bindingAxisOf, recordUnifiedAdmissionOnSpan } from "throttlekit/observability";
const { decision, release } = await admit.admit({ key, cost });
const span = trace.getActiveSpan();
if (span) recordUnifiedAdmissionOnSpan(span, decision, admit.lastDecisions());
// Or query directly:
if (!decision.allowed) {
log.info({ axis: bindingAxisOf(admit.lastDecisions()), retryAfterMs: decision.retryAfterMs });
}

The attribute key is throttlekit.binding_axis ∈ {"rate", "concurrency", "cost"}. It's set only on denied admissions (omitted when allowed).
When multiple axes deny (possible in lua-fused mode), the convention
is concurrency → rate → cost priority — matches sequential's
evaluation order so the value is deterministic regardless of backend.
UnifiedAdmitter.lastDecisions() returns a frozen per-axis snapshot
({ rate, concurrency, cost }). Unconfigured axes are undefined, and
short-circuited axes are also undefined (so you can identify the first
denying axis from absence alone).
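The priority convention can be sketched as a small lookup over a per-axis snapshot. This is an illustration only, not the implementation of bindingAxisOf: `AxisName`, `AxisResult`, `AxisSnapshot`, and `firstDenyingAxis` are hypothetical names, and the snapshot entries are reduced to a single `allowed` flag.

```typescript
type AxisName = "concurrency" | "rate" | "cost";

// Hypothetical minimal shapes; a real snapshot would carry full Decision objects.
interface AxisResult { allowed: boolean }
type AxisSnapshot = Partial<Record<AxisName, AxisResult>>;

// Priority mirrors the sequential evaluation order: concurrency, then rate, then cost.
const AXIS_PRIORITY: readonly AxisName[] = ["concurrency", "rate", "cost"];

function firstDenyingAxis(snapshot: AxisSnapshot): AxisName | undefined {
  for (const axis of AXIS_PRIORITY) {
    const result = snapshot[axis];
    // Absent axes (unconfigured or short-circuited) are skipped.
    if (result !== undefined && !result.allowed) return axis;
  }
  return undefined; // every evaluated axis allowed
}

// Sequential mode: rate denied, so cost was short-circuited and is absent.
const sequentialSnapshot: AxisSnapshot = {
  concurrency: { allowed: true },
  rate: { allowed: false },
};

// Fused mode: rate and cost can both deny; the priority order picks rate.
const fusedSnapshot: AxisSnapshot = {
  concurrency: { allowed: true },
  rate: { allowed: false },
  cost: { allowed: false },
};

console.log(firstDenyingAxis(sequentialSnapshot)); // "rate"
console.log(firstDenyingAxis(fusedSnapshot)); // "rate"
```

Both snapshots report the same axis, which is the point of fixing the priority order: the reported value does not depend on which backend produced the denial.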
Behind a ThrottleKit server the binding axis is also readable remotely
and from any language via the read-only Monitor door (GetSnapshot
/ Watch), and exported to Prometheus /metrics as
throttlekit_denied_by_axis_total — the same signal, off the request
path. See the Monitoring guide and
Operations.
- admit() is async and returns Promise<UnifiedAdmission>; it works for any backend mix (Redis, Postgres, in-process).
- admitSync() is the sync sibling. It is only valid when every configured axis has checkSync (in-process MemoryStore for rate/cost; concurrency is always sync), and throws otherwise (same convention as Limiter.check / Limiter.checkSync).
Both return { decision, release } — the release is the lifecycle
hook for the concurrency slot, separate from the Decision because
concurrency has lease semantics (acquire-release) that don't fit
Limiter's stateless .check() → Decision shape (the locked
decision is D-U4 in research/bigger-bets/unified/DESIGN.md §14).
Idempotency: a second release() call is a no-op. A denied admit's
release is a no-op (no slot was held — any transient acquire upstream
of the binding axis was released as part of the short-circuit).
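The release contract described above can be pictured with a one-shot wrapper. This is a sketch under assumptions, not the library's code: `makeIdempotentRelease`, `noopRelease`, and the toy `inFlight` counter are hypothetical.

```typescript
interface ReleaseInfo { dropped: boolean }
type Release = (info: ReleaseInfo) => void;

// Wraps a cleanup callback so that only the first call has any effect.
function makeIdempotentRelease(onFirstRelease: Release): Release {
  let released = false;
  return (info) => {
    if (released) return; // second and later calls are no-ops
    released = true;
    onFirstRelease(info);
  };
}

// A denied admission holds no slot, so its release does nothing at all.
const noopRelease: Release = () => {};

// Toy slot counter standing in for a concurrency limiter with one request in flight.
let inFlight = 1;
const outcomes: boolean[] = [];
const releaseSlot = makeIdempotentRelease(({ dropped }) => {
  inFlight -= 1;
  outcomes.push(dropped);
});

releaseSlot({ dropped: false }); // e.g. the response "finish" event
releaseSlot({ dropped: true }); // e.g. the later "close" event: ignored
noopRelease({ dropped: false });

console.log(inFlight, outcomes); // inFlight is 0; outcomes holds a single false
```

This is why wiring release to both the "finish" and "close" response events is safe: whichever fires first decides the `dropped` flag, and the slot is returned exactly once.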
Marginal-AND admits when each axis independently has room. When the cost axis binds and request types differ in value-per-cost-unit, that greedily burns budget on whatever arrives first — including cheap-to-pass, low-value, cost-heavy requests that starve the high-value requests arriving later. The fix from revenue management is a bid-price filter: admit iff the request's value clears the shadow price of the budget it consumes,
admit ⟺ value ≥ p_R + p_C · cost
where (p_R, p_C) are the dual variables of the workload's fluid LP.
The literature (Talluri–van Ryzin 1998; Devanur–Hayes 2009;
Buchbinder–Jain–Naor 2007) shows static bid prices are asymptotically
fluid-optimal under (approximate) stationarity. research/bigger-bets/unified/THEORY.md
(TK-1007) calibrated the gap on an LLM-gateway workload:
| ρ (autocorrelation) | regret(marginal-AND) | regret(joint-LP) | ε |
|---|---|---|---|
| −1.0 (alternation) | 40.00% | 0.00% | +40.00% |
| 0.0 (independent) | 40.50% | 1.01% | +39.49% |
| +1.0 (one type forever) | 32.50% | 65.00% | −32.50% (the foil) |
| mean | 38.90% | 13.57% | +25.33% |
Mean ε (regret of marginal-AND minus regret of joint-LP) = 25.33% ≫ the 5% ship
gate (DR-19) → shipped as opt-in policy: "joint-lp" in 0.11.1.
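As a worked instance of the bid-price test, the sketch below evaluates value ≥ p_R + p_C · cost for three requests. The prices and the third request are made-up numbers chosen for illustration, and `BidPrices`, `PricedRequest`, and `clearsBidPrice` are hypothetical names rather than library exports.

```typescript
// Hypothetical shapes for illustration only.
interface BidPrices { rate: number; cost: number } // (p_R, p_C)
interface PricedRequest { cost: number; value: number }

// Admit iff the request's value covers the shadow price of the budget it consumes.
function clearsBidPrice(req: PricedRequest, p: BidPrices): boolean {
  return req.value >= p.rate + p.cost * req.cost;
}

const prices: BidPrices = { rate: 0, cost: 0.004 }; // made-up duals

const smallCompletion: PricedRequest = { cost: 100, value: 1 }; // price about 0.4
const largeCompletion: PricedRequest = { cost: 10_000, value: 50 }; // price about 40
const bulkLowValue: PricedRequest = { cost: 20_000, value: 50 }; // price about 80

console.log(clearsBidPrice(smallCompletion, prices)); // true
console.log(clearsBidPrice(largeCompletion, prices)); // true
console.log(clearsBidPrice(bulkLowValue, prices)); // false
```

The third request is rejected even if the cost axis currently has room, which is exactly the behavior that distinguishes the filter from marginal-AND.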
// Supply a workload model — the library solves the fluid LP once at construction:
const admit = unifiedAdmission({
cost: rateLimit({ strategy: tokenBudget({ budget: 50_000, windowMs: 60_000 }) }),
policy: "joint-lp",
jointLp: {
workload: {
types: [
{ cost: 100, value: 1, weight: 0.5 }, // small completion
{ cost: 10_000, value: 50, weight: 0.5 }, // large completion
],
rateBudget: 1_000,
costBudget: 50_000,
},
},
});
// …or supply precomputed bid prices directly (e.g. solved offline):
// jointLp: { duals: { rate: 0, cost: 0.01 } }
const { decision, release, policyDenied } = admit.admitSync({ cost: 10_000, value: 50 });
// policyDenied === true ⇒ the bid-price filter bound (every axis had room).

solveFluidLp(...) is also exported standalone (returns { duals, admitFractions, objective }).
Per-call value defaults to 1. The policy is strictly more selective than
marginal-AND — it only ever removes admits, so it cannot breach any limit — and
runs identically over the sequential and lua-fused backends. Default "marginal"
is byte-for-byte unchanged. Requires a cost axis.
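To show where a cost bid price comes from, here is a minimal sketch of the fluid LP restricted to a single cost budget, where it reduces to a fractional knapsack. It is not solveFluidLp: `RequestType`, `FluidSolution`, and `solveCostOnlyFluidLp` are hypothetical, `arrivals` is assumed to be the expected request count per window, every cost and arrival count is assumed to be positive, and the rate constraint is ignored.

```typescript
// Hypothetical workload shape: `arrivals` is the expected request count per window.
interface RequestType { cost: number; value: number; arrivals: number }
interface FluidSolution { costDual: number; admitFractions: number[]; objective: number }

// Cost-only fluid LP (assumes cost > 0 and arrivals > 0 for every type):
//   maximize   sum_i arrivals_i * value_i * x_i
//   subject to sum_i arrivals_i * cost_i * x_i <= costBudget,  0 <= x_i <= 1.
// This is a fractional knapsack, so filling by value density (value / cost) is
// optimal, and the budget's dual is the density of the marginal type.
function solveCostOnlyFluidLp(types: RequestType[], costBudget: number): FluidSolution {
  const order = types
    .map((t, index) => ({ t, index }))
    .sort((a, b) => b.t.value / b.t.cost - a.t.value / a.t.cost);

  const admitFractions = types.map(() => 0);
  let remaining = costBudget;
  let objective = 0;
  let costDual = 0; // stays 0 when the budget never binds

  for (const { t, index } of order) {
    const demand = t.arrivals * t.cost;
    const fraction = Math.min(1, remaining / demand);
    admitFractions[index] = fraction;
    remaining -= fraction * demand;
    objective += fraction * t.arrivals * t.value;
    if (fraction < 1) {
      costDual = t.value / t.cost; // first type that does not fully fit sets the price
      break;
    }
  }
  return { costDual, admitFractions, objective };
}

const solution = solveCostOnlyFluidLp(
  [
    { cost: 100, value: 2, arrivals: 100 }, // density 0.02, fits entirely
    { cost: 1_000, value: 5, arrivals: 100 }, // density 0.005, only half fits
  ],
  60_000,
);
console.log(solution); // costDual 0.005, admitFractions [1, 0.5], objective 450
```

In this example the second type sits exactly on the threshold (5 = 0.005 × 1000), so a static bid price admits it whenever it arrives and cannot enforce the 0.5 fraction by itself. Only types strictly below the price are filtered out.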
In the ρ = +1 row, ε is negative: under a highly autocorrelated, near-absorbing workload (long runs of one type), the static fluid-LP duals can under-perform marginal-AND. This is the textbook fluid-LP failure under non-stationarity (Talluri–van Ryzin 1998). Real aggregator traffic sits in moderate ρ, where joint-LP wins by +39–40%, but if your arrivals are strongly autocorrelated, re-measure ε on your own trace and keep the default "marginal" policy.
If you can't pin the prior confidently, let the policy learn the bid prices online
(Devanur–Hayes sample-then-price). Requires the workload form:
const admit = unifiedAdmission({
cost,
policy: "joint-lp",
jointLp: {
workload, // the construction PRIOR (+ per-arrival budgets)
adaptive: { sampleWindow: 500 }, // observe 500 requests, then re-price
},
});

It prices the first sampleWindow requests with the prior while observing the live
(cost, value) mixture, then re-solves the fluid LP and adopts the learned duals only
if they beat the prior on the observed sample, else keeps the prior — then freezes. So:
- a misspecified prior is rescued (a prior whose duals reject everything is escaped — ~100% → ~20–30% regret in the gate);
- a correct prior is kept (noise can't dislodge it; the naïve "always re-price" variant instead hurts a correct prior, 9.9–21.1% vs static's 0.7–1.2% — that design was rejected by the gate).
Honest scope: the guarantee is non-inferiority on the observed sample, not over the
full horizon — under autocorrelated arrivals the window can be unrepresentative and an
adopted dual can be slightly worse on the full stream (the ρ=+1 foil's cousin; bounded,
+~0.8pp measured). Prefer a larger sampleWindow on bursty traffic; the prior is always the
floor. With a concurrency axis the window counts the concurrency-passed population.
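The guarded adoption step can be sketched as a replay comparison. How the library scores "beats the prior on the observed sample" is not stated here, so this sketch assumes one plausible measure: total admitted value when the sample is replayed in arrival order against a cost budget. `Observed`, `replayValue`, and `chooseDual` are hypothetical names, and only a cost dual is modeled.

```typescript
interface Observed { cost: number; value: number }

// Total value admitted when replaying `sample` in arrival order under a cost bid price.
function replayValue(sample: Observed[], costDual: number, costBudget: number): number {
  let remaining = costBudget;
  let total = 0;
  for (const r of sample) {
    const clearsPrice = r.value >= costDual * r.cost;
    if (clearsPrice && r.cost <= remaining) {
      remaining -= r.cost;
      total += r.value;
    }
  }
  return total;
}

// Guarded adoption: keep the prior unless the learned dual is strictly better on the sample.
function chooseDual(prior: number, learned: number, sample: Observed[], costBudget: number): number {
  return replayValue(sample, learned, costBudget) > replayValue(sample, prior, costBudget)
    ? learned
    : prior;
}

const sample: Observed[] = [
  { cost: 100, value: 1 },
  { cost: 1_000, value: 50 },
  { cost: 100, value: 1 },
  { cost: 1_000, value: 50 },
];

// A misspecified prior (price 1 per cost unit rejects everything) is replaced.
console.log(chooseDual(1, 0.02, sample, 2_000)); // 0.02
// A good prior survives a worse learned price that wastes budget on small requests.
console.log(chooseDual(0.02, 0.005, sample, 2_000)); // 0.02
```

The strict comparison is what makes the prior the floor on the observed sample: a tie or a loss leaves it in place.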
The 2-axis filter prices two flow budgets (rate, cost). A third axis — concurrency —
is a stock (a held slot), which looks like it doesn't fit the same fluid relaxation. It
does, via Little's law: an occupancy cap L over a window T is a concurrency-seconds
budget K = L·T, and each admit consumes its hold time h. The bid test gains a term:
admit iff value ≥ p_R + p_C·cost + p_K·hold
This rejects a hold-time hog — a request that is cheap and valuable per token but holds
a worker slot for a long time — that the 2-axis filter is structurally blind to (two requests
identical on cost+value but 10× apart in hold time look the same to it). The gate
(three-axis-gate.ts) measures regret 53% → 2% (ε≈51pp) when concurrency binds and the hog
is indistinguishable on (rate, cost).
const admit = unifiedAdmission({
concurrency, // the real occupancy limiter
cost,
policy: "joint-lp",
jointLp: {
workload: {
types: [
{ cost: 100, value: 10, weight: 1800, hold: 15 }, // short — frees its slot fast
{ cost: 100, value: 10, weight: 200, hold: 200 }, // long — a concurrency hog
],
rateBudget: 2000, costBudget: 1e9,
concBudget: 20_000, // K = L·T (e.g. L=10 slots × T=2000)
},
},
});
// pass the request's expected service time per call:
admit.admitSync({ cost: 100, value: 10, hold: 200 }); // policyDenied: the hog is priced out

Honest scope: it earns its keep only when concurrency BINDS and the hog is strictly dominated
(a bid-price threshold can't ration a marginal hog — the same limit as the ρ=+1 foil); when
concurrency is ample p_K = 0 and it's a no-op. A missing / non-finite / negative per-request
hold is fail-open (no concurrency term — never a wrongful reject, and a hog can't dodge the
price by reporting a negative hold). Not combinable with jointLp.adaptive yet.
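The Little's-law budget and the three-term bid test, including the fail-open handling of a bad hold value, can be sketched as follows. The prices are made-up numbers, and `Duals3`, `HeldRequest`, `concurrencyBudget`, `holdTerm`, and `clearsThreeAxisPrice` are hypothetical names, not library exports.

```typescript
interface Duals3 { rate: number; cost: number; concurrency: number } // (p_R, p_C, p_K)
interface HeldRequest { cost: number; value: number; hold?: number }

// Little's law: an occupancy cap of L slots over a window of T time units
// is a budget of K = L * T slot-time units.
function concurrencyBudget(slots: number, windowLength: number): number {
  return slots * windowLength;
}

// A missing, non-finite, or negative hold contributes no concurrency term (fail-open).
// A negative hold therefore cannot earn a discount on the threshold.
function holdTerm(hold: number | undefined, pK: number): number {
  if (hold === undefined || !isFinite(hold) || hold < 0) return 0;
  return pK * hold;
}

function clearsThreeAxisPrice(req: HeldRequest, p: Duals3): boolean {
  return req.value >= p.rate + p.cost * req.cost + holdTerm(req.hold, p.concurrency);
}

const duals: Duals3 = { rate: 0, cost: 0, concurrency: 0.1 }; // made-up prices

console.log(concurrencyBudget(10, 2_000)); // 20000
console.log(clearsThreeAxisPrice({ cost: 100, value: 10, hold: 15 }, duals)); // true
console.log(clearsThreeAxisPrice({ cost: 100, value: 10, hold: 200 }, duals)); // false
console.log(clearsThreeAxisPrice({ cost: 100, value: 10, hold: -50 }, duals)); // true
console.log(clearsThreeAxisPrice({ cost: 100, value: 10 }, duals)); // true
```

The two middle requests are identical on cost and value, so only the hold term separates them. With p_K = 0 the function degenerates to the 2-axis test.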
| Item | Where | Status |
|---|---|---|
| policy: "joint-lp" runtime | bid-price filter on unifiedAdmission | ✅ shipped 0.11.1 (ε = 25.33%) |
| Online primal-dual (Devanur–Hayes sample-then-price) | jointLp.adaptive: guarded warm-up on unifiedAdmission | ✅ shipped 0.11.3 (guarded self-validating; D-JLP-13/14) |
| 3-axis joint LP (rate + cost + concurrency shadow price) | value ≥ p_R + p_C·cost + p_K·hold via Little's law (concBudget + per-request hold) | ✅ shipped 0.11.3 (D-JLP-15/16) |
| Decision.bindingAxis field | breaking change to Decision shape; use the OTel attr + lastDecisions() / policyDenied instead | 1.0 candidate |
See research/bigger-bets/joint-lp-admission/DESIGN.md (D-JLP-1..16) for
the policy design, research/bigger-bets/unified/DESIGN.md for the
composition algebra, and research/bigger-bets/PLAN.md for the roadmap.
Sequential evaluates concurrency → rate → cost (in-process first,
fastest fail). Commutativity of combineDecisions (the proven law)
means the result doesn't depend on order — only the short-circuit
cost. For LLM gateways the concurrency axis usually binds first
under load, so this saves the Redis round-trip in the common deny path.
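The saving from short-circuiting can be seen in a toy evaluator that counts how many remote checks actually run. This is a synchronous sketch with hypothetical names (`EvalAxis`, `AxisCheck`, `evaluateSequential`, `remoteAllow`); the real rate and cost checks may be asynchronous calls to Redis or Postgres.

```typescript
type EvalAxis = "concurrency" | "rate" | "cost";
type AxisCheck = () => { allowed: boolean };

// In-process axis first, remote axes after it.
const EVAL_ORDER: readonly EvalAxis[] = ["concurrency", "rate", "cost"];

// Evaluates axes in order and stops at the first denial.
// Axes after the denying one are never called.
function evaluateSequential(
  checks: Record<EvalAxis, AxisCheck>,
): { allowed: boolean; evaluated: EvalAxis[] } {
  const evaluated: EvalAxis[] = [];
  for (const axis of EVAL_ORDER) {
    evaluated.push(axis);
    if (!checks[axis]().allowed) return { allowed: false, evaluated };
  }
  return { allowed: true, evaluated };
}

let remoteCalls = 0;
const remoteAllow: AxisCheck = () => {
  remoteCalls += 1; // stands in for one network round-trip
  return { allowed: true };
};

// The in-process concurrency guard is full, so no remote check runs.
const saturated = evaluateSequential({
  concurrency: () => ({ allowed: false }),
  rate: remoteAllow,
  cost: remoteAllow,
});
const callsAfterDeny = remoteCalls;

console.log(saturated.allowed, saturated.evaluated, callsAfterDeny); // denied after one axis, 0 remote calls
```

Swapping the order would still produce a denial, since the combined result is order-independent, but it would pay for the remote checks before reaching the axis that denies.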
Federation (federate({ coordinator, ... })) is shipped as of 0.8.3.
unifiedAdmission composes with it for free — pass a federated
Limiter as the rate or cost axis. No new surface is needed; the
unified layer doesn't know or care about the federation layer. Tested
in test/admission/unified.test.ts.
Concurrency is keyless (one global guard per process). Rate and cost
are per-key — admit({ key: "tenant:abc", cost: 1500 }) keys the rate
and cost limits independently per tenant. For weighted fair sharing
across tenants, layer weightedFairShare(...) upstream of
unifiedAdmission.
const { decision, release } = await admit.admit({ key, cost });
if (!decision.allowed) return reject(decision);
try {
await doWork();
release({ dropped: false });
} catch (err) {
release({ dropped: true }); // signal overload → AIMD contracts the ceiling
throw err;
}

The dropped: true flag propagates to the underlying gradient2 / AIMD
update — the limit contracts on overload signals.
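For readers unfamiliar with AIMD, the contraction can be sketched as a single update step. This shows plain AIMD only, not gradient2, and it is not the library's update rule: `AimdConfig`, `nextLimit`, and the `increase` and `backoff` values are assumptions made for illustration.

```typescript
interface AimdConfig { minLimit: number; maxLimit: number; increase: number; backoff: number }

// One AIMD step: grow additively on success, shrink multiplicatively on a drop,
// then clamp to the configured range.
function nextLimit(current: number, dropped: boolean, cfg: AimdConfig): number {
  const proposed = dropped ? Math.floor(current * cfg.backoff) : current + cfg.increase;
  return Math.min(cfg.maxLimit, Math.max(cfg.minLimit, proposed));
}

// minLimit/maxLimit echo the earlier examples; increase and backoff are made up.
const aimd: AimdConfig = { minLimit: 4, maxLimit: 32, increase: 1, backoff: 0.5 };

const limitTrace: number[] = [];
let currentLimit = 16;
for (const dropped of [false, true, true, true]) {
  currentLimit = nextLimit(currentLimit, dropped, aimd);
  limitTrace.push(currentLimit);
}

console.log(limitTrace); // 17, 8, 4, 4
```

One success raises the ceiling by one slot, while each drop halves it until the floor of 4 holds, which is why reporting dropped: true on failures matters for the limiter to back off.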
- Distributed & provable: twoTier(leased), windowCoupled, the per-window overshoot bound.
- Federation: the cross-cluster federation primitive that composes with unifiedAdmission.
- Operations: Prometheus / Grafana / OTel guidance.
- GALE & TALE: the bounded-overshoot guarantees unifiedAdmission plugs into.
- examples/unified.ts in the repo: runnable LLM-gateway-style demo.
- research/bigger-bets/unified/DESIGN.md: the design lock + decision records.
- research/bigger-bets/unified/THEORY.md: joint-vs-marginal empirical regret analysis.
ThrottleKit · MIT · 1.0 — API frozen under SemVer (Stability)