
The Axes

ameyaborkar edited this page Jun 10, 2026 · 3 revisions


ThrottleKit composes several orthogonal admission axes. Each one is reachable from Python through the service door, and the core, running inside the service, computes every decision. (The direct RedisBackend is check-only by design; the stateful axes stay on the service door, where the core, not a re-derived client port, produces the decision.)

| Axis | Python | Server policy |
| --- | --- | --- |
| Rate | check / check_many / peek / forecast | a strategy (gcra, tokenBucket, …) |
| Two-tier leased | check (transparent) | a strategy + a twoTier block |
| Cost | debit(policy, key, tokens) | a tokenBudget block |
| Concurrency / unified | admit(policy, key) → Admission | a concurrency block (± a strategy) |
| Tier-2 fleet lease | FleetBackend(...).leased(policy).check() | a federated: policy (lease its global budget) |

Async & runnable examples. Every axis below has an await-mirror on AsyncServiceBackend / AsyncRedisBackend (identical surface, same one-oracle guarantee — they await the transport, never re-derive a decision). A self-contained, runnable script for each axis lives in examples/: async_service_backend.py, redis_backend.py, llm_token_budget.py, concurrency_admit.py, fastapi_app.py.

Rate — check

d = rl.check("api", api_key)            # consume 1 (or check(..., cost=n))
d = rl.check_many("api", [k1, k2, k3])  # many keys at one consistent instant → list[Decision]
d = rl.peek("api", api_key)             # non-consuming: what's left right now
f = rl.forecast("api", api_key)         # Forecast: spendable_now / next_replenish_at / full_at

peek and forecast are service-door only — the core computes them; the direct door deliberately doesn't re-derive them client-side.
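
The difference between the consuming call (check) and the non-consuming ones (peek, forecast) can be illustrated with a toy in-memory token bucket. This is a minimal sketch, not ThrottleKit's implementation: the real decisions are computed by the core inside the service. `ToyBucket`, the explicit `now` argument, and the exact meaning given to each `Forecast` field here are assumptions made for illustration; only the field names come from the snippet above.

```python
import math
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    remaining: float


@dataclass
class Forecast:
    spendable_now: int        # whole tokens available right now
    next_replenish_at: float  # assumed meaning: when one more whole token is available
    full_at: float            # assumed meaning: when the bucket is back at capacity


class ToyBucket:
    """Hypothetical stand-in for a rate policy; time is passed in explicitly."""

    def __init__(self, capacity: int, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = refill_per_sec
        self.tokens = capacity
        self.stamp = now

    def _level(self, now: float) -> float:
        # Tokens available at `now`, computed without mutating state.
        return min(self.capacity, self.tokens + (now - self.stamp) * self.rate)

    def check(self, now: float, cost: float = 1.0) -> Decision:
        level = self._level(now)
        allowed = level >= cost
        if allowed:
            level -= cost
        self.tokens, self.stamp = level, now  # commit the refill and any spend
        return Decision(allowed, level)

    def peek(self, now: float) -> Decision:
        level = self._level(now)              # read-only: nothing is consumed
        return Decision(level >= 1.0, level)

    def forecast(self, now: float) -> Forecast:
        level = self._level(now)
        whole = math.floor(level)
        if level >= self.capacity:
            next_at = full_at = now
        else:
            next_at = now + (whole + 1 - level) / self.rate
            full_at = now + (self.capacity - level) / self.rate
        return Forecast(whole, next_at, full_at)
```

Calling peek or forecast any number of times leaves the bucket untouched, while each allowed check lowers the level by its cost.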

Two-tier leased — still check

A server policy configured as twoTier: { mode: leased, … } draws L1-local credits in batches from the shared L2 store, cutting the per-request round trip while holding a machine-checked overshoot bound (≤ Limit + L·(batch−1), or exactly Limit with windowCoupled). It needs no new client API — you call check exactly as for a plain limiter, and the core still computes the decision:

d = rl.check("leased-api", api_key)     # leased semantics, zero client changes
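
As a worked example of the overshoot bound quoted above, the arithmetic can be written as a small helper. This is a sketch under one stated assumption: the text does not define the `L` in `Limit + L·(batch−1)`, and it is read here as the number of L1 instances drawing leases. The function name and parameters are hypothetical and not part of ThrottleKit.

```python
def leased_overshoot_bound(limit: int, nodes: int, batch: int,
                           window_coupled: bool = False) -> int:
    """Worst-case admissions per window under the bound quoted in the text.

    Assumption: `nodes` plays the role of L in Limit + L*(batch - 1).
    """
    if window_coupled:
        return limit                      # windowCoupled: exactly Limit
    return limit + nodes * (batch - 1)    # each node may strand batch-1 credits


# Example: Limit=1000, 4 nodes, batches of 10 credits.
bound = leased_overshoot_bound(limit=1000, nodes=4, batch=10)
coupled = leased_overshoot_bound(limit=1000, nodes=4, batch=10, window_coupled=True)
```

With a batch size of 1 the bound collapses to the plain limit, which matches the intuition that leasing one credit at a time is the same as an unleased check.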

Cost — debit

For costs you only learn after a request runs — the LLM-gateway problem, where a completion's token count isn't known until it streams — a tokenBudget policy meters a windowed budget. Debit the actual tokens as they're produced:

for chunk in stream:
    d = rl.debit("completions", tenant, tokens=len(chunk.tokens))
    if not d.allowed:
        break                           # the window's budget is spent

A debit is admitted while budget remains. The crossing debit is counted in full, and later debits in the same window are refused, so the overshoot is always smaller than the crossing debit (per-token debiting overshoots by 0). d.remaining is the number of tokens left in the window. Calling debit on a rate limiter raises OperationNotSupportedError.
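
The crossing-debit rule can be sketched with a toy single-window budget. This is an illustrative model only, not the core's algorithm: `ToyTokenBudget` is hypothetical, window rollover is left out, and clamping `remaining` at 0 once the budget is exceeded is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    remaining: int


class ToyTokenBudget:
    """Hypothetical model of one window of a tokenBudget policy."""

    def __init__(self, budget: int):
        self.budget = budget
        self.spent = 0

    def debit(self, tokens: int) -> Decision:
        if self.spent >= self.budget:
            # The window's budget is already spent: refuse without counting.
            return Decision(False, 0)
        # Budget remains, so the debit is admitted and counted in full,
        # even if it crosses the budget.
        self.spent += tokens
        return Decision(True, max(0, self.budget - self.spent))
```

With a budget of 10, debits of 4, 4, and 5 are all admitted (the last one crosses the budget and leaves 13 tokens counted), and any later debit in the window is refused. Debiting one token at a time stops at exactly 10.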

Concurrency & unified — admit

For limiting concurrent work — how many requests are in flight at once — a concurrency policy is served by a stateful lifecycle. admit holds an in-flight slot and returns an Admission context manager that releases it on exit:

with rl.admit("checkout", user_id) as adm:
    if not adm.allowed:
        return 429                      # adm.binding_axis names the axis that bound it
    do_work()                           # released on exit (dropped=True if the block raised)

Add a strategy to the policy server-side and it becomes a unified rate × concurrency admitter — the core composes the axes and adm.binding_axis reports which one ("rate" / "concurrency") bound a denial.

Admission:

| Member | Meaning |
| --- | --- |
| allowed | whether the work may proceed |
| binding_axis | "rate" / "concurrency" / "" (the axis that denied) |
| held | True iff a server slot is held (a denied admission holds none) |
| reclaimed | True iff the server reclaimed the lease (a missed heartbeat) |
| release(dropped=False) | return the slot; idempotent; dropped=True signals an overload so the adaptive limit contracts |

The with block calls release(dropped=exc is not None) on exit, so a raised exception releases with dropped=True.
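
The lifecycle described above (hold a slot, release on exit, idempotent release, dropped=True on an exception) can be sketched with a local in-process model. `ToyConcurrencyLimiter` and `ToyAdmission` are hypothetical stand-ins, not ThrottleKit classes: the real slot lives on the server, and the adaptive-limit reaction to `dropped` is not modeled here, only recorded.

```python
import threading


class ToyAdmission:
    """Hypothetical Admission: a context manager around one in-flight slot."""

    def __init__(self, limiter, allowed: bool, binding_axis: str):
        self._limiter = limiter
        self.allowed = allowed
        self.binding_axis = binding_axis
        self.held = allowed      # a denied admission holds no slot
        self.dropped = None      # what release() was called with, for inspection

    def release(self, dropped: bool = False) -> None:
        if not self.held:
            return               # idempotent: nothing to give back
        self.held = False
        self.dropped = dropped
        self._limiter._give_back()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release(dropped=exc is not None)
        return False             # never swallow the caller's exception


class ToyConcurrencyLimiter:
    """Hypothetical in-process ceiling on concurrent work."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def admit(self) -> ToyAdmission:
        with self._lock:
            if self.in_flight < self._max:
                self.in_flight += 1
                return ToyAdmission(self, True, "")
        return ToyAdmission(self, False, "concurrency")

    def _give_back(self) -> None:
        with self._lock:
            self.in_flight -= 1
```

Because a denied admission has `held == False`, leaving the with block after a denial is a no-op, which is why the caller can check `adm.allowed` inside the block without any extra cleanup.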

Crash safety & long holds

A granted admission holds a server lease. If the client crashes without releasing, the server reclaims the slot once the lease TTL (default 2s) lapses without a heartbeat — the node↔coordinator crash-safety contract, one layer out. Short holds (under the TTL) need nothing extra. For a hold longer than the TTL, opt into heartbeats — a background daemon thread renews the lease, and adm.reclaimed flips to True if the server reclaimed it anyway:

with rl.admit("long-job", job_id, heartbeat=True) as adm:
    if adm.allowed:
        run_long_job()                  # renewed across the TTL boundary by a background beat
        if adm.reclaimed:
            ...                          # the server reclaimed our slot mid-flight — treat as dropped
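
The reclaim rule behind heartbeat=True can be sketched from the server's point of view. This is a simplified model under stated assumptions: `ToyLeaseTable` is hypothetical, the clock is passed in explicitly so the behavior is deterministic, and the background daemon thread that the client uses to send beats is replaced by direct `heartbeat` calls.

```python
class ToyLeaseTable:
    """Hypothetical server-side view: slot holders and their lease deadlines."""

    def __init__(self, ttl: float = 2.0):
        self.ttl = ttl
        self._deadline = {}  # holder id -> time at which the lease lapses

    def grant(self, holder: str, now: float) -> None:
        self._deadline[holder] = now + self.ttl

    def reap(self, now: float) -> list:
        """Reclaim every slot whose lease lapsed without a heartbeat."""
        lapsed = [h for h, d in self._deadline.items() if d <= now]
        for h in lapsed:
            del self._deadline[h]
        return lapsed

    def heartbeat(self, holder: str, now: float) -> bool:
        """Renew a lease; False means the slot was already reclaimed."""
        self.reap(now)
        if holder not in self._deadline:
            return False
        self._deadline[holder] = now + self.ttl
        return True

    def is_held(self, holder: str, now: float) -> bool:
        self.reap(now)
        return holder in self._deadline
```

In this model a holder that beats before each deadline keeps its slot indefinitely, while a crashed holder loses it one TTL after its last beat. A `False` return from `heartbeat` corresponds to the situation in which `adm.reclaimed` would flip to True.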

As of 0.5.0 the advanced axes scale to a fleet. Configure the server policy as distributed (distributedConcurrency: for the in-flight ceiling, federated: for rate, fleetBudget: for cost) and the same admit / check / debit calls become fleet-coordinated across every instance, with no client change. For the highest-throughput rate path, lease a chunk of the global budget with FleetBackend and spend it locally. See Fleet & Monitor clients.

Next: Conformance & development — how this stays bit-for-bit with the core.