# Pillar 4 — Weighted Fair Escrow (WFE) `weightedFairEscrow(...)` splits a shared per-window budget across tenants in weighted-max-min fair proportion, with idle tenants' surplus reclaimed by backlogged ones. It's the production graduation of GALE's Pillar 4 (`research/gale/PILLAR4-fairness.md`) — work-conserving, weight- honoring, with a hard `Δ = 0` per-window cap inherited from Pillar 1. ```ts import { weightedFairEscrow } from "throttlekit"; const TIERS = { enterprise: 4, pro: 2, free: 1 }; const escrow = weightedFairEscrow({ limit: 30_000, // L tokens / window windowMs: 60_000, weightOf: (tenant) => TIERS[tenant.split(":")[0]] ?? 1, l1: { maxKeys: 1024 }, // cap tenant set }); const d = await escrow.check("enterprise:alpha", 8000); if (!d.allowed) return reject({ retryAfterMs: d.retryAfterMs }); // d.limit is THIS tenant's current ceiling; d.remaining the headroom. ``` ## When to use WFE vs the other fairness primitives ThrottleKit ships three fairness primitives that compose orthogonally: | Primitive | Best for | Distributed? | Work-conserving? | Weight-honoring? | |---|---|---|---|---| | `fairShare` | Single-process equal-share | no | no (FCFS surplus) | no (equal weights) | | `weightedFairShare` | Single-process weight-skewed | no | no (FCFS surplus) | yes (approximate) | | **`weightedFairEscrow`** | **Multi-process, work-conserving, weight-honoring** | **yes** | **yes (water-filled)** | **yes (exact)** | The difference matters under **multi-tenant overload**. With `weightedFairShare`, a low-weight flood gets unused budget on a first-come basis — the high-weight tenant's idle share is consumed by whoever happens to ask first. WFE redistributes that surplus by weight: under skew, an idle high-priority tenant's quota flows to backlogged tenants in proportion to their weights, not to the loudest one. ## The algorithm (the streaming realisation) Each tenant has a **dynamic guaranteed share** `gᵢ = ⌊wᵢ·L/W⌋` (recomputed from the current active set on every check; `W` = total weight of active tenants). Each `.check()` is hierarchical: 1. **Within guarantee** (`used + cost ≤ gᵢ`): always allowed, subject to the global cap `Σ used + cost ≤ L`. This is the **T2 sharing-incentive** promise — every backlogged tenant is admissible up to `gᵢ` before any other tenant can borrow beyond their `gⱼ`. 2. **Borrow phase** (`used + cost > gᵢ`): the asker tries to grow into surplus other tenants haven't claimed. The pessimistic surplus is ```text borrowAvailable = max(0, (L − Σ used) − Σⱼ≠ᵢ max(0, gⱼ − usedⱼ)) ``` — the unallocated budget minus what would still need to be served to *other* backlogged tenants' guarantees. The per-call borrow is capped at the call's own `cost`, matching Shreedhar-Varghese DRR semantics (`q = cost`). If `cost ≤ borrowAvailable`, the request is admitted; otherwise denied. ## What WFE guarantees | Theorem | Statement | |---|---| | **T1 safety** | `Σ used ≤ L` always (global cap; never over-admits even under adversarial scheduling). | | **T2 sharing-incentive** | Every active tenant is admissible up to `gᵢ` before any other tenant can borrow past `gⱼ`. | | **T3 work-conservation** | At end-of-window with continuously-backlogged tenants, `Σ used = L` (full utilisation). | | **T4 bounded unfairness** | For continuously-backlogged tenants, `|aᵢ/wᵢ − aⱼ/wⱼ| ≤ q·(1/wᵢ + 1/wⱼ)` with `q = cost`. | All four hold under `fast-check` at `numRuns: 200` (`test/twotier/ weighted-fair-escrow-properties.test.ts`). The pure batch algebra (`weightedMaxMin`) is independently machine-checked at 20 000 random trials (`test/gale/fair-escrow.test.ts`). ## What WFE does NOT guarantee - **Not strategy-proof.** Per [FairRide](https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-pu.pdf) (Pu et al., NSDI'16): no shared-cache primitive can be simultaneously sharing-incentive, work-conserving, AND strategy-proof. WFE takes the first two and concedes the third. A tenant *can* over-declare demand to claim surplus; window-coupling bounds the gain to one window. - **Flat tenant set** in the base limiter. A single `weightedFairEscrow` has one weight per tenant. The *recursive* WFE composition — an internal node weighted by the sum of its children — is **shipped** as [Federated Weighted Fair Escrow](#federated-weighted-fair-escrow-cross-region) for the region→tenant hierarchy (the GPS-decomposition special case). Arbitrary nested weights (teams-within-orgs) reuse the same two-level machinery and are the documented next step (DR-P4-8). ## Multi-process backing (the L2 path) When the budget needs to be shared across processes (a fleet of gateways, a horizontally-scaled service), pass any distributed `Store` and a per-process `quantum`: ```ts import Redis from "ioredis"; import { fromIoredis } from "throttlekit/redis"; const escrow = weightedFairEscrow({ limit: 200_000, windowMs: 60_000, weightOf: (t) => TIERS[t.split(":")[0]] ?? 1, l2: fromIoredis(new Redis(process.env.REDIS_URL!)), quantum: 500, // per-process lease size l2Key: "tk:wfe:gateway-prod", // same key on every process }); // Async-only when l2 is configured (the lease step is async): const d = await escrow.check("enterprise:alpha", 1500); ``` Internally each process leases `quantum` credits at a time from a shared `fixedWindow({ limit: L, windowMs })` counter in the store — the same atomic Lua that ships with `distributedTokenBudget`, so the wire surface adds zero new scripts (DR-P4-5). The per-tenant fairness arithmetic stays in-process; the shared store's atomicity is what bounds the global `Σ used ≤ L`. **Tuning `quantum`.** Smaller quantum = tighter cross-process fairness, more RTTs (`O(L/quantum)` Redis calls per window). Larger quantum = fewer RTTs, looser bound (the cross-process T4 picks up a `quantum`- scaled DRR slack: `Σₚ Q⁽ᵖ⁾ · (1/wᵢ + 1/wⱼ)`). For an LLM gateway with `L = 200 000` TPM and 4 process replicas, `quantum = 500` gives ~400 RTTs per minute per process and a sub-millisecond cross-process T4 in the common case. ## Federated Weighted Fair Escrow (cross-region) A single `weightedFairEscrow` splits ONE budget `L` across tenants in one process. **Federated WFE** (`federatedWeightedFairEscrow` + `regionFairPool`, shipped — TK-1404) composes that primitive across **regions** so that a tenant's **global** total — summed over every region it is active in — is the weighted-max-min allocation a single, flat, global WFE over `L` would produce. The regions are plumbing, not a fairness boundary: a tenant is neither helped nor hurt by *which* region it lives in. ### Why it isn't just "run WFE per region" Hierarchical max-min fairness is, in general, **not** flat max-min fairness. Running WFE per region over a budget shared by a plain first-come-first-served counter gives *in-region* weighted fairness but *cross-region* FCFS pooling — a heavily-weighted region that arrives late is starved by a lightly-weighted one (HLS isolation, Saeed et al. 2021). Flat global fairness emerges only under the **Parekh–Gallager GPS decomposition**: an internal node weighted by the **sum of its children's weights** reproduces the flat allocation. Federated WFE realises that as **two composed WFEs**: 1. **Cross-region WFE** — `regionFairPool`, a weighted-fair *reservation* layer whose "tenants" are regions. Region `r`'s weight is its **dynamic active aggregate** tenant weight `W_r = Σ w_{t,r}`; the pool reserves region `r` at least `⌊W_r/ΣW · L⌋` (a busy region cannot steal it) and lets it borrow idle regions' surplus. A plain counter cannot reserve — exactly why cross-region weights need this layer. 2. **In-region WFE** — `federatedWeightedFairEscrow`, the same per-tenant `gᵢ = ⌊wᵢ·L_r/W_r⌋` arithmetic from above, splitting each region's pool-granted budget `L_r` among its tenants by weight. ### API ```ts import { regionFairPool, federatedWeightedFairEscrow } from "throttlekit/twotier"; // ONE shared cross-region pool per global budget L. const pool = regionFairPool({ limit: 30_000, windowMs: 60_000 }); const TIERS = { enterprise: 4, pro: 2, free: 1 }; const weightOf = (t: string) => TIERS[t.split(":")[0]] ?? 1; // One limiter per region; they ALL share the same pool. const usEast = federatedWeightedFairEscrow({ region: "us-east", pool, weightOf }); const euWest = federatedWeightedFairEscrow({ region: "eu-west", pool, weightOf }); const d = usEast.checkSync("enterprise:alpha", 1_500); // sync — the in-process pool is synchronous // `.check(...)` is the Promise form; `.reset(tenant?)`, `.stats()` mirror the base limiter. ``` Each region's `weightOf` returns the tenant's weight **as seen in that region** (see the split rule below). `regionFairPool` owns `limit`, `windowMs`, and the `clock`; every region drawing from one global budget **must share one pool instance**. ### The demand-proportional weight split (region-spanning tenants) A tenant active in **one** region returns its full global weight `w_t`. A tenant active in **several** regions must have its weight **split** demand-proportionally: ```text w_{t,r} = w_t · d_{t,r} / d_t (so Σ_r w_{t,r} = w_t) ``` Returning the full `w_t` in *every* region **double-counts** the tenant — it claims a full-weight floor in each region and is over-served ≈k× (k = the number of regions it spans). The split keeps its **global** weight at exactly `w_t`, so its global share matches the flat ideal. ### What it guarantees | Theorem | Statement | |---|---| | **T-FED-1 safety** | The pool grants `Σ_r L_r ≤ L` and in-region WFE serves `Σ used ≤ L_r`, so **`Σ admitted ≤ L` globally**, independent of region count (`Δ = 0`, inherited from Pillar 1). | | **T-FED-2 fluid exactness** | Under the three conditions (region weight = `ΣW_r`, in-region WFE, demand-proportional split), each tenant's **global** total equals the flat global weighted-max-min ideal **exactly in the fluid limit**. | | **T-FED-3 bounded discrete error** | With discrete granting, `\|a_t − a*_t\| ≤ span(t)·(2·q_R+1)` — a two-level DRR residual (`q_R` = region-level lease quantum). Verified ≈0.02% relative. | Proof + the machine-checked gate (0 deviation on structured + random worlds, never over-admits): [`research/bigger-bets/federation/federated-wfe-gate.ts`](https://github.com/AmeyaBorkar/throttlekit/blob/main/research/bigger-bets/federation/federated-wfe-gate.ts); theorem write-up in [`research/gale/PILLAR4-fairness.md`](https://github.com/AmeyaBorkar/throttlekit/blob/main/research/gale/PILLAR4-fairness.md) §"Federated composition". ### Honest scope - **Exact in the all-backlogged limit** (the common overload case). - **Mixed saturation** (a region with a demand-bottlenecked co-tenant) inherits in-region WFE's **T3 reserve gap**: the shipped *streaming* code reclaims only from **truly-absent** regions (which never call the pool), not from a present-but-paused one — so it strands a saturated participant's reserve and deviates from the *clairvoyant* fluid oracle by that fraction. This is exactly what a flat *streaming* WFE does; no streaming limiter matches the clairvoyant oracle without a demand oracle. - **Topology.** `regionFairPool` is the **in-process** pool — correct and complete for a **single arbiter process** all regions consult (e.g. a central rate-limit service), and the substrate the tests verify against the flat oracle. To distribute the pool across **separate region processes**, **`RedisRegionFairPool`** keeps the same per-region accounting in a shared store (the weighted-max-min ceiling math ported to atomic Lua over a Redis hash, the weighted analog of `RegionalEscrow`'s Lua) — verified grant-for-grant against the in-process oracle on live Redis — so federated fair-escrow stays correct when each region runs as its own process (DR-FWFE-1). Behind a server this is also reachable over the existing `Check` RPC (`federatedFairEscrow:`) with no client change — see **[Scaling & the Fleet](Scaling-and-the-Fleet)**. Runnable demo: [`examples/federated-weighted-fair-escrow.ts`](https://github.com/AmeyaBorkar/throttlekit/blob/main/examples/federated-weighted-fair-escrow.ts) (`npx tsx examples/federated-weighted-fair-escrow.ts`). ## Failure modes | Failure | Behavior | |---|---| | L2 `Store` unreachable (Redis down, etc.) | `check()` rejects with `StoreUnavailableError`. No silent fallback — the leased pool is the safety boundary. | | Process restart mid-window | L1 state lost; tenants restart at `gᵢ`; pool re-leases from L2. `Δ ≤ quantum · N_tenants` in the cross-restart transition. | | Untrusted tenant flood (key blowup) | L1 `tenants` map grows unbounded. Set `l1.maxKeys` to a value that comfortably exceeds the working set. | | Tenant weight changes mid-window | New weight takes effect from the next check; the current window honors the weight at first check. | | Tenant stops mid-window | Its guaranteed reserve stays pinned until the window rolls — pessimistic-correct (we cannot distinguish "stopped" from "about to ask again"). End-of-window T3 still holds. | | Tenant over-declares demand (T5) | Window-coupling bounds the gain to one window; inflated credits expire at window roll. **Not** a bug — a stated impossibility (FairRide). | See `docs/FAILURE-MODES.md` for the complete outage matrix. ## Composition ### With `combineDecisions` (manual axis composition) WFE returns a `Decision` so it composes with any other axis via the 0.9.0 algebra: ```ts import { combineDecisions, rateLimit, gcra } from "throttlekit"; const rate = rateLimit({ strategy: gcra({ limit: 500, periodMs: 60_000 }) }); async function check(tenant: string, key: string, tokens: number) { const [r, w] = await Promise.all([rate.check(key), escrow.check(tenant, tokens)]); return combineDecisions(r, w); } ``` ### With `unifiedAdmission` At 0.9.1, `unifiedAdmission`'s admit options accept `{ cost?, tenant? }` additively — pass through to a WFE-backed cost axis when present. The existing rate / concurrency axes ignore `tenant`. See `docs/DESIGN-NOTES.md` or `research/bigger-bets/pillar4-wfe/DESIGN.md` §7.1. ## Roadmap notes - **0.9.1** ships single-pool WFE (this page). - **Shipped** — **federated WFE**: `federatedWeightedFairEscrow` + `regionFairPool`, the two-level region→tenant composition with the `(region, tenant)` cross-fairness theorem (DR-P4-7, now *shipped*). See [Federated Weighted Fair Escrow](#federated-weighted-fair-escrow-cross-region) above. The in-process pool ships now; the store-backed multi-process pool **`RedisRegionFairPool`** (DR-FWFE-1) is **also shipped** for separate-region-process topologies. - **next** — arbitrary hierarchical weights (teams-within-orgs) on the same two-level machinery (DR-P4-8). - **0.10.x?** — hot-path-fused leased WFE inside `twoTier` (DR-P4-9), conditional on a benchmark demonstrating ≥ 2× speedup over the Option L2-B path. The current `l2: Store` shape is fast enough for the LLM-gateway target. ## Anchors - Shreedhar & Varghese, **Efficient Fair Queuing using Deficit Round Robin**, *SIGCOMM '95* — the DRR quantum and its fairness bound (T4). - Demers, Keshav & Shenker, **Analysis and Simulation of a Fair Queueing Algorithm** (WFQ), *SIGCOMM '89*; Parekh & Gallager (GPS), *IEEE/ACM ToN* 1993 — the weighted max-min ideal that WFE approximates. - Stoica, Shenker & Zhang, **Core-Stateless Fair Queueing**, *SIGCOMM '98* — fairness with O(1) shared state. WFE is its escrow-leasing analog (one shared integer, no per-tenant queues). - Pu et al., **FairRide**, *NSDI '16* — the SIP impossibility theorem (T5, conceded vertex). - Bertsekas & Gallager, **Data Networks** §6.5.2 — canonical max-min fairness definition.