
Scaling and the Fleet

Ameya Borkar edited this page Jun 10, 2026 · 1 revision


ThrottleKit scales from a single in-process check to a globally-coordinated fleet on one design: one oracle, four doors, two coordination tiers. The same configuration that limits one process limits a thousand — and the distributed behaviour is something you can verify rather than hope for.

The one invariant — one oracle. Exactly one thing ever computes a Decision or sizes a grant: the Node core, directly or as Lua-in-Redis. Every other surface (Python, edge, a leased client) is a thin pipe. Scaling never adds a second rate limiter to keep in sync — so a fleet can't silently drift.

Two coordination tiers

| Tier | What it is | Wire change | Reach from any client |
| --- | --- | --- | --- |
| Tier 1: shared-store coordination | Server instances coordinate through a shared store via the core's coordinators. Configure a policy as distributed; every client gets globally-coordinated decisions over the existing RPCs. | None | Check / Debit / Admit, unchanged |
| Tier 2: client-held lease | A high-throughput client leases a chunk of the global budget and enforces locally, round-tripping only to refresh. | Additive Fleet service | New Fleet.Reserve door + a local-spend helper |

Tier 1 is the default and the answer to "make it fleet-wide without touching my client." Tier 2 is the scale ceiling for a client that can't afford a round trip per request.
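To make the trade-off concrete, the round-trip counts of the two tiers can be compared with a little arithmetic. This is only a sketch under simplifying assumptions: the function names are hypothetical, and Tier 2 is assumed to receive every lease in full and spend it completely (partial grants and leftovers discarded at a window boundary would add round trips).

```python
import math


def round_trips_tier1(requests: int) -> int:
    """Tier 1: every request is one RPC to a server instance."""
    return requests


def round_trips_tier2(requests: int, lease_size: int) -> int:
    """Tier 2: one Reserve call per lease.

    Assumes each lease is granted in full and fully spent before refresh.
    """
    if lease_size <= 0:
        raise ValueError("lease_size must be positive")
    return math.ceil(requests / lease_size)
```

Under these assumptions, 10,000 requests cost 10,000 round trips on Tier 1 and 50 on Tier 2 with leases of 200.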

Tier 1 — distributed over the existing RPCs (zero client change)

Configure a policy as one of four distributed blocks and point every server instance at one shared store (--redis / --postgres / …). The instances coordinate through the store; the client calls the same RPC it always did and gets a fleet-coordinated decision. This is how the distributed features reach Python and every other language with no client change.

| Feature | Config block | Served over | What it holds across the fleet |
| --- | --- | --- | --- |
| Cross-region federation | federated: | Check | One global per-window budget across regions (the core's federate()); Δ = 0 independent of region count |
| Fleet token budget | fleetBudget: | Debit | One LLM/cost budget across every instance (the cost axis, fleet-wide) |
| Distributed concurrency | distributedConcurrency: | Admit | One in-flight ceiling across every instance, not N × per-instance |
| Cross-region fair escrow | federatedFairEscrow: | Check | One weighted-fair budget L split across tenants, with fleet total ≤ L across regions |

version: 1
limiters:
  global-api:                 # ONE global rate limit across regions — served by plain Check
    federated: { batch: 16 }
    strategy: fixedWindow      # must be window-coupled (fixedWindow / slidingWindow / fixed-cadence quota)
    limit: 10000
    period: 1m
  completions:                # ONE token budget across the fleet — served by plain Debit
    fleetBudget: { budget: 1000000, windowMs: 60000 }
  checkout:                   # ONE in-flight ceiling across the fleet — served by plain Admit
    distributedConcurrency: { minLimit: 4, maxLimit: 200, aggregate: median }

Each is covered in depth on its own page — Federation, Overload, fairness & DDoS (fleet token budget), Distributed adaptive concurrency, and Pillar 4 — Weighted Fair Escrow (whose RedisRegionFairPool makes fair-escrow correct across separate region processes). Run the server with throttlekit-server; a memory/dynamodb store that can't coordinate fails fast at load, and unsupported ops on a distributed policy raise UNIMPLEMENTED rather than return a meaningless answer.
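The core idea of Tier 1, several server instances enforcing one limit because they all count against the same store, can be illustrated with a toy model. This is a minimal sketch and not ThrottleKit's implementation: `SharedWindowStore`, `ServerInstance`, and `simulate` are hypothetical names, the store is an in-process dictionary standing in for Redis or Postgres, and each check touches the store once (the real `federated:` policy batches, per its `batch` setting).

```python
import threading


class SharedWindowStore:
    """Stand-in for the shared store (hypothetical; not Redis or Postgres)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def incr_if_below(self, key, window_id, limit):
        """Atomically admit one request if this window's count is below limit."""
        with self._lock:
            slot = (key, window_id)
            used = self._counts.get(slot, 0)
            if used >= limit:
                return False
            self._counts[slot] = used + 1
            return True


class ServerInstance:
    """One server process; holds no counters of its own."""

    def __init__(self, store, limit, period_ms):
        self.store = store
        self.limit = limit
        self.period_ms = period_ms

    def check(self, key, now_ms):
        # Fixed window: every instance derives the same window id from the time.
        window_id = now_ms // self.period_ms
        return self.store.incr_if_below(key, window_id, self.limit)


def simulate(instances=4, limit=100, attempts_per_instance=50):
    """Round-robin requests over all instances; return the total admitted."""
    store = SharedWindowStore()
    servers = [ServerInstance(store, limit, 60_000) for _ in range(instances)]
    admitted = 0
    for _ in range(attempts_per_instance):
        for server in servers:
            if server.check("acme", now_ms=1_000):
                admitted += 1
    return admitted
```

With four instances and 200 attempts in one window, the total admitted stays at the single limit of 100 rather than 4 × 100, because the only counter lives in the store.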

Tier 2 — the client-held lease (Fleet.Reserve)

A per-request Check/Debit round trip is the bottleneck for a very high-throughput client. The Fleet door hands such a client a chunk of a federated: policy's global per-window budget to spend locally, so it round-trips only to refresh — not once per request:

Reserve { policy: "global-api", caller: { domain: "acme" }, wants: 200 }
  → Lease { capacity: 200, expiry_ms, refresh_interval_ms, safe_capacity, retry_after_ms, limit }

The server is the one oracle. It computes the grant size via the policy's federation coordinator — a partial grant (capacity < wants) is legitimate, and the grant is window-coupled, discarded at expiry_ms. The client only spends it, with the core LeaseSpender (throttlekit/twotier) — a verbatim port of the leased-L1 spend, proven byte-for-byte against the shipped twoTier(leased, windowCoupled) path and pinned by a golden lease vector suite every polyglot port replays. The client never invents a denial; when capacity is 0 it surfaces the server's verdict. Local spend is ≈ 10 ns/op, so the lease effectively removes the network from the hot path.
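The division of labour described above (the server sizes the grant, the client only spends it) can be sketched as follows. This is an illustrative model, not the shipped LeaseSpender or the federation coordinator: `WindowBudget` and `SketchSpender` are hypothetical names, only the `capacity` and `expiry_ms` lease fields are modelled, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    capacity: int   # tokens granted; may be less than requested
    expiry_ms: int  # window end; leftover tokens are discarded here


class WindowBudget:
    """Server side (hypothetical): one per-window budget that sizes grants."""

    def __init__(self, limit, period_ms):
        self.limit = limit
        self.period_ms = period_ms
        self._window_id = None
        self._remaining = limit

    def reserve(self, wants, now_ms):
        window_id = now_ms // self.period_ms
        if window_id != self._window_id:
            # New window: the full budget becomes available again.
            self._window_id = window_id
            self._remaining = self.limit
        granted = max(0, min(wants, self._remaining))  # partial grant is legitimate
        self._remaining -= granted
        return Lease(capacity=granted, expiry_ms=(window_id + 1) * self.period_ms)


class SketchSpender:
    """Client side: spends a lease locally and never extends its expiry."""

    def __init__(self, lease):
        self._remaining = lease.capacity
        self._expiry_ms = lease.expiry_ms

    def try_spend(self, now_ms, tokens=1):
        """Return True if spent locally.

        False is not a denial; it means the caller must go back to the server.
        """
        if now_ms >= self._expiry_ms:
            self._remaining = 0  # window-coupled: discard leftovers at expiry
            return False
        if tokens > self._remaining:
            return False
        self._remaining -= tokens
        return True
```

A `False` from `try_spend` is deliberately not treated as a rejection: in this sketch it only signals that the local lease cannot cover the request, so the verdict still comes from the server on the next Reserve.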

The door is served automatically whenever a federated: policy is configured, on the same gRPC port, and is loopback-only by default (handing out budget is a poisoning vector) — set --fleet-secret (or THROTTLEKIT_FLEET_SECRET), paired with TLS, to use it from a remote peer. v1 leases the rate axis (Reserve returns UNIMPLEMENTED for concurrency, NOT_FOUND for a non-leasable policy). From Python, FleetBackend / LeasedLimiter wrap this — see Fleet & Monitor clients.

Clock skew, defended — leaseWindowed

A leased budget is only safe if the client discards leftover credits at exactly the global window boundary. A node whose clock runs fast would discard early (wasting budget); one whose clock runs slow would discard late (over-admitting). ThrottleKit closes this with an optional coordinator method, leaseWindowed(key, tokens) → { granted, expiresAt }, which returns the authoritative store-clock boundary atomically with the grant: the Redis TIME-derived (or Postgres clock_timestamp()-derived) window end, never a node-clock value. The Tier-2 client treats expiry_ms as authoritative and never extends it. The method is additive and optional: callers feature-detect it and fall back to a node-clock window, and the existing lease() is unchanged, so a coordinator that predates it still works, just without the skew-proof boundary.
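The feature-detect-and-fall-back pattern can be sketched like this. The classes and the snake_case `lease_windowed` name are hypothetical Python stand-ins for the coordinator's `leaseWindowed`; the store clock is injected as a callable so the example stays deterministic, and the fixed-window boundary formula is an assumption made for illustration.

```python
class LegacyCoordinator:
    """Predates the windowed method: returns a grant but no boundary."""

    def lease(self, key, tokens):
        return tokens


class WindowedCoordinator(LegacyCoordinator):
    """Adds the optional method; the boundary comes from the store's clock."""

    def __init__(self, store_now_ms, period_ms):
        self._store_now_ms = store_now_ms  # callable returning store time in ms
        self._period_ms = period_ms

    def lease_windowed(self, key, tokens):
        now = self._store_now_ms()
        expires_at = (now // self._period_ms + 1) * self._period_ms
        return {"granted": tokens, "expiresAt": expires_at}


def lease_with_boundary(coordinator, key, tokens, node_now_ms, period_ms):
    """Prefer the store-clock boundary; fall back to the node clock."""
    method = getattr(coordinator, "lease_windowed", None)
    if callable(method):
        result = method(key, tokens)
        return result["granted"], result["expiresAt"], "store-clock"
    granted = coordinator.lease(key, tokens)
    expires_at = (node_now_ms // period_ms + 1) * period_ms
    return granted, expires_at, "node-clock"
```

With a store clock at 59,000 ms and a node clock running two seconds fast at 61,000 ms, the windowed path returns the true boundary of 60,000 ms, while the fallback computes 120,000 ms from the skewed node clock. That gap is the failure the store-clock boundary is meant to remove.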

The Monitor door — read the fleet remotely

The fleet's live operational state is readable from any language over the read-only Monitor door (throttlekit.v1.Monitor: GetSnapshot + Watch), with a Prometheus /metrics endpoint and standard gRPC health. It's the same state ThrottleKit Lens renders in the terminal — see that page for the full board and the auth posture.

Honest boundaries (the non-claims)

  • Tier-1 fleetBudget key-semantics: the wire key selects which budget (a per-policy key→store-key mapping). Two clients coordinate iff they resolve the same store key — which same-config instances do automatically.
  • Distributed CheckMany fans out to N coordinator round-trips (not the single consistent instant a local batch gives); distributed batch size is capped.
  • Peek / Forecast are UNIMPLEMENTED under federation/leasing (those limiters are async, window-only).
  • Federated fair-escrow is correct across N region instances only with the store-backed RedisRegionFairPool (--redis); a single-instance fairEscrow: is the right tool for one process.
  • Tier-2 lease decisions are made client-side, so the server's capture/Replay sees the lease grants, not each local spend — Tier-1 decisions remain fully observable.
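The first boundary above, that two clients coordinate if and only if they resolve the same store key, can be illustrated with a toy mapping. Everything here is hypothetical: the `namespace:policy:wire_key` format, `resolve_store_key`, and `BudgetStore` are invented for the sketch and do not describe ThrottleKit's actual key layout.

```python
def resolve_store_key(namespace: str, policy: str, wire_key: str) -> str:
    """Hypothetical per-policy mapping from a wire key to a store key."""
    return f"{namespace}:{policy}:{wire_key}"


class BudgetStore:
    """Toy shared store: one budget counter per store key."""

    def __init__(self, budget):
        self.budget = budget
        self.used = {}

    def debit(self, store_key, cost):
        used = self.used.get(store_key, 0)
        if used + cost > self.budget:
            return False
        self.used[store_key] = used + cost
        return True


def demo():
    store = BudgetStore(budget=100)
    key_a = resolve_store_key("tk", "completions", "acme")     # instance A
    key_b = resolve_store_key("tk", "completions", "acme")     # instance B, same config
    key_c = resolve_store_key("other", "completions", "acme")  # differently configured
    return (
        store.debit(key_a, 60),  # admitted
        store.debit(key_b, 60),  # same store key, shared budget: refused
        store.debit(key_c, 60),  # different store key, separate budget: admitted
    )
```

Instances A and B share a budget because their configuration resolves to the same key; the third instance silently gets its own, which is why identical configuration across the fleet matters.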
