The Axes
ThrottleKit composes a few orthogonal admission axes. Every one is reachable from Python through the service door — and the core, inside the service, computes each decision. (The direct RedisBackend is check-only by design; the stateful axes stay on the service door, where the core — not a re-derived client port — produces the decision.)
| Axis | Python | Server policy |
|---|---|---|
| Rate | check / check_many / peek / forecast | a strategy (gcra, tokenBucket, …) |
| Two-tier leased | check (transparent) | a strategy + a twoTier block |
| Cost | debit(policy, key, tokens) | a tokenBudget block |
| Concurrency / unified | admit(policy, key) → Admission | a concurrency block (± a strategy) |
| Tier-2 fleet lease | FleetBackend(...).leased(policy).check() | a federated: policy (lease its global budget) |
Async & runnable examples. Every axis below has an await-mirror on AsyncServiceBackend / AsyncRedisBackend (identical surface, same one-oracle guarantee: they await the transport and never re-derive a decision). A self-contained, runnable script for each axis lives in examples/: async_service_backend.py, redis_backend.py, llm_token_budget.py, concurrency_admit.py, fastapi_app.py.
```python
d = rl.check("api", api_key)  # consume 1 (or check(..., cost=n))
d = rl.check_many("api", [k1, k2, k3])  # many keys at one consistent instant → list[Decision]
d = rl.peek("api", api_key)  # non-consuming: what's left right now
f = rl.forecast("api", api_key)  # Forecast: spendable_now / next_replenish_at / full_at
```

peek and forecast are service-door only: the core computes them, and the direct door deliberately doesn't re-derive them client-side.
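To make the difference between consuming and non-consuming calls concrete, here is a minimal in-memory sketch. It is not ThrottleKit's core and does not talk to the service: ToyBucket, ToyForecast, the explicit now argument, and the token-bucket arithmetic are all assumptions made for illustration. Only the method names (check, peek, forecast) and the forecast field names mirror the text above.

```python
from dataclasses import dataclass


@dataclass
class ToyForecast:
    spendable_now: int        # whole tokens available at this instant
    next_replenish_at: float  # time at which one more whole token is spendable
    full_at: float            # time at which the bucket is full again


class ToyBucket:
    """Hypothetical token bucket; time is passed in so runs are deterministic."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated_at = 0.0

    def _level(self, now: float) -> float:
        # Token level at `now`, computed without mutating any state.
        elapsed = now - self.updated_at
        return min(float(self.capacity), self.tokens + elapsed * self.refill_per_sec)

    def peek(self, now: float) -> int:
        # Non-consuming: report what is left right now.
        return int(self._level(now))

    def check(self, now: float, cost: int = 1) -> bool:
        # Consuming: spend `cost` tokens if they are available.
        level = self._level(now)
        self.tokens, self.updated_at = level, now
        if level < cost:
            return False
        self.tokens = level - cost
        return True

    def forecast(self, now: float) -> ToyForecast:
        # Non-consuming: project when tokens come back.
        level = self._level(now)
        if level >= self.capacity:
            return ToyForecast(self.capacity, now, now)
        to_next = (int(level) + 1 - level) / self.refill_per_sec
        to_full = (self.capacity - level) / self.refill_per_sec
        return ToyForecast(int(level), now + to_next, now + to_full)


bucket = ToyBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.check(now=0.0) for _ in range(4)]  # three admits, then one denial
left = bucket.peek(now=0.0)      # reading the level does not change it
plan = bucket.forecast(now=0.0)  # neither does forecasting
```

In this toy, calling peek or forecast any number of times leaves the bucket untouched, while each admitted check lowers the level by its cost.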
A server policy configured as twoTier: { mode: leased, … } draws L1-local credits in batches from the shared L2 store, cutting the per-request round trip while holding a machine-checked overshoot bound (≤ Limit + L·(batch−1), or exactly Limit with windowCoupled). It needs no new client API — you call check exactly as for a plain limiter, and the core still computes the decision:
```python
d = rl.check("leased-api", api_key)  # leased semantics, zero client changes
```

For costs you only learn after a request runs (the LLM-gateway problem, where a completion's token count isn't known until it streams), a tokenBudget policy meters a windowed budget. Debit the actual tokens as they're produced:
```python
for chunk in stream:
    d = rl.debit("completions", tenant, tokens=len(chunk.tokens))
    if not d.allowed:
        break  # the window's budget is spent
```

A debit is admitted while budget remains; the crossing debit is counted in full, then later debits in the window are refused (per-token debiting overshoots by 0). d.remaining is the tokens left in the window. debit on a rate limiter raises OperationNotSupportedError.
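The debit rule above can be sketched as a tiny in-memory model. This is an illustration under stated assumptions, not the core's implementation: ToyTokenBudget and ToyDecision are hypothetical names, the model covers a single window with no clock, and it assumes the crossing debit is itself admitted before later debits are refused.

```python
from dataclasses import dataclass


@dataclass
class ToyDecision:
    allowed: bool
    remaining: int  # tokens left in the window, floored at zero


class ToyTokenBudget:
    """Hypothetical single-window budget; window rollover is left out."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.spent = 0

    def debit(self, tokens: int) -> ToyDecision:
        if self.spent >= self.budget:
            # Budget already used up: refuse and count nothing.
            return ToyDecision(allowed=False, remaining=0)
        # Budget remains, so admit; a debit that crosses the limit
        # is still counted in full.
        self.spent += tokens
        return ToyDecision(allowed=True, remaining=max(0, self.budget - self.spent))


# Chunked debits: the third debit crosses the budget, the fourth is refused.
chunked = ToyTokenBudget(budget=10)
chunked_allowed = [chunked.debit(4).allowed for _ in range(4)]

# Per-token debits: spending one token at a time never overshoots.
per_token = ToyTokenBudget(budget=10)
per_token_allowed = [per_token.debit(1).allowed for _ in range(12)]
```

With 4-token debits the toy ends the window at 12 spent against a budget of 10, an overshoot of 2; with 1-token debits it stops at exactly 10, which is the "overshoots by 0" case.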
For limiting concurrent work — how many requests are in flight at once — a concurrency policy is served by a stateful lifecycle. admit holds an in-flight slot and returns an Admission context manager that releases it on exit:
```python
with rl.admit("checkout", user_id) as adm:
    if not adm.allowed:
        return 429  # adm.binding_axis names the axis that bound it
    do_work()  # released on exit (dropped=True if the block raised)
```

Add a strategy to the policy server-side and it becomes a unified rate × concurrency admitter: the core composes the axes, and adm.binding_axis reports which one ("rate" / "concurrency") bound a denial.
Admission:
| Member | Meaning |
|---|---|
| allowed | whether the work may proceed |
| binding_axis | "rate" / "concurrency" / "" (the axis that denied) |
| held | True iff a server slot is held (a denied admission holds none) |
| reclaimed | True iff the server reclaimed the lease (a missed heartbeat) |
| release(dropped=False) | return the slot; idempotent; dropped=True signals an overload so the adaptive limit contracts |
The with block calls release(dropped=exc is not None) on exit, so a raised exception releases with dropped=True.
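The release-on-exit behaviour described above can be sketched with Python's context-manager protocol. This is a local stand-in, not ThrottleKit's Admission: ToyLimiter, ToyAdmission, and the drops counter are hypothetical, and no server lease is involved. Only the member names (allowed, binding_axis, held, release) follow the table.

```python
class ToyAdmission:
    """Hypothetical admission handle that returns its slot on exit."""

    def __init__(self, limiter: "ToyLimiter", allowed: bool, binding_axis: str = "") -> None:
        self._limiter = limiter
        self.allowed = allowed
        self.binding_axis = binding_axis
        self.held = allowed  # a denied admission holds no slot

    def release(self, dropped: bool = False) -> None:
        if not self.held:
            return  # idempotent: nothing to give back
        self.held = False
        self._limiter.in_flight -= 1
        if dropped:
            self._limiter.drops += 1  # stand-in for an overload signal

    def __enter__(self) -> "ToyAdmission":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Mirrors release(dropped=exc is not None) from the text.
        self.release(dropped=exc is not None)
        return False  # never swallow the caller's exception


class ToyLimiter:
    """Hypothetical in-process concurrency ceiling."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.drops = 0

    def admit(self) -> ToyAdmission:
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return ToyAdmission(self, allowed=True)
        return ToyAdmission(self, allowed=False, binding_axis="concurrency")


limiter = ToyLimiter(max_in_flight=1)
with limiter.admit() as first:
    second = limiter.admit()  # the only slot is taken, so this one is denied

try:
    with limiter.admit() as third:
        raise RuntimeError("work failed")  # exit path releases with dropped=True
except RuntimeError:
    pass
```

Returning False from __exit__ keeps the caller's exception visible, so the slot is returned without hiding the failure that caused the drop.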
A granted admission holds a server lease. If the client crashes without releasing, the server reclaims the slot once the lease TTL (default 2s) lapses without a heartbeat — the node↔coordinator crash-safety contract, one layer out. Short holds (under the TTL) need nothing extra. For a hold longer than the TTL, opt into heartbeats — a background daemon thread renews the lease, and adm.reclaimed flips to True if the server reclaimed it anyway:
```python
with rl.admit("long-job", job_id, heartbeat=True) as adm:
    if adm.allowed:
        run_long_job()  # renewed across the TTL boundary by a background beat
        if adm.reclaimed:
            ...  # the server reclaimed our slot mid-flight: treat as dropped
```

As of 0.5.0 the advanced axes scale to a fleet. Configure the server policy as distributed (distributedConcurrency: for the in-flight ceiling, federated: for rate, fleetBudget: for cost) and the same admit / check / debit calls become fleet-coordinated across every instance, with no client change. For the highest-throughput rate path, lease a chunk of the global budget with FleetBackend and spend it locally. See Fleet & Monitor clients.
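The idea of leasing a chunk of a global budget and spending it locally can be sketched as follows. This is a toy model, not FleetBackend: ToyGlobalBudget, ToyLeasedLimiter, and the round_trips counter are hypothetical, and the toy caps every grant at what remains in the shared budget, so it never overshoots. A real leased design may grant differently.

```python
class ToyGlobalBudget:
    """Hypothetical shared store; each lease() models one round trip."""

    def __init__(self, limit: int) -> None:
        self.remaining = limit
        self.round_trips = 0

    def lease(self, chunk: int) -> int:
        self.round_trips += 1
        granted = min(chunk, self.remaining)  # never grant more than is left
        self.remaining -= granted
        return granted


class ToyLeasedLimiter:
    """Hypothetical instance that spends leased credits locally."""

    def __init__(self, shared: ToyGlobalBudget, chunk: int) -> None:
        self.shared = shared
        self.chunk = chunk
        self.local = 0

    def check(self) -> bool:
        if self.local == 0:
            # Out of local credit: go back to the shared budget for a new chunk.
            self.local = self.shared.lease(self.chunk)
        if self.local == 0:
            return False  # the shared budget is exhausted
        self.local -= 1
        return True


shared = ToyGlobalBudget(limit=10)
a = ToyLeasedLimiter(shared, chunk=4)
b = ToyLeasedLimiter(shared, chunk=4)

# Twelve requests alternate between the two instances.
admitted = sum(limiter.check() for limiter in [a, b] * 6)
```

In this run twelve requests cost five round trips to the shared budget instead of twelve. The trade-off is that credit already leased by one instance is unavailable to the other, so an instance can be refused while its peer still holds unspent credit.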
Next: Conformance & development — how this stays bit-for-bit with the core.