-
Notifications
You must be signed in to change notification settings - Fork 0
Federation
federate(...) ships a cross-cluster rate limiter that pools a single
global budget across K regions, with a formally-verified, K-INDEPENDENT
overshoot bound:
admitted_per_global_window ≤ Limit (for ANY number of regions K)
The TLA⁺ spec is spec/GaleFederatedLeasing.tla; the BFS twin in
test/gale/federated/leasing-variants.test.ts re-runs it in CI without
Java. The end-to-end eval against a real Lua-backed coordinator is in
research/bigger-bets/federation/eval/RESULTS.md.
When to use this vs
twoTier(leased). If your fleet shares ONE regional Redis (one cluster, many processes),twoTier(leased)withwindowCoupled: truealready gives you the K-independent bound. Usefederate(...)only when your processes span MULTIPLE regions and you need a single global budget across them.
From any client, no client change. Behind a ThrottleKit server this is a Tier-1 feature reachable over the server's existing
CheckRPC (federated:policy) — a stock client (including Python) gets it just by pointing at a federated policy. See Scaling & the Fleet.
import { fixedWindow } from "throttlekit";
import { federate, RedisCoordinator } from "throttlekit/federation";
import { fromIoredis } from "throttlekit/redis";
import Redis from "ioredis";
// One global coordinator Redis (the federation's "L3").
const coordinator = new RedisCoordinator({
client: fromIoredis(new Redis(process.env.GLOBAL_COORDINATOR_URL!)),
windowMs: 60_000,
budgetPerWindow: 1000,
});
// One federated Limiter per region. They all share `coordinator`.
const limiter = federate({
strategy: fixedWindow({ limit: 1000, windowMs: 60_000 }),
coordinator,
region: "us-east", // your region identifier
batch: 16, // escrow size; 16 is a sensible default
});
const decision = await limiter.check("user:42");Run the full example end-to-end:
docker compose -f research/bigger-bets/federation/eval/docker-compose.yml up -d
npx tsx examples/federation.ts
docker compose -f research/bigger-bets/federation/eval/docker-compose.yml down-
Each region holds a local escrow lease —
batchunits of budget drawn from the global coordinator in a single cross-region RPC. -
Most requests serve from local escrow — no coordinator hit. The
amortized coordinator cost is
1/batchper request. -
At the global window boundary, escrow EXPIRES —
Rollin the formal model. Un-served escrow forfeits, then reconciles back to the coordinator (idempotent onwindowStart) for the next window. This is the window-coupling rule — it collapses the per-window overshoot to zero. -
On coordinator outage, regions fail closed — the existing escrow
keeps serving until exhausted, then denies. The bound holds at
Δ = 0even through partition.
| Scheme | Δ (overshoot) | Pooling under skew | Failure mode |
|---|---|---|---|
| Per-region independent limiters | Unbounded | None | Each region serves its own budget |
Static partition (L/K each) |
0 | None — hot region binds at L/K | Each region serves its own L/K |
federate(...) |
0 | Full — hot region can draw the whole budget | Fail closed; bound preserved through partition |
| CRDT / gossip merge | Bounded by staleness | Full | Staleness-Δ tradeoff (research follow-up) |
At max skew (s = 1, all load on one region) the static partition admits
only L/K; federate(...) admits up to the full L. The eval rows in
RESULTS.md:
| skew | static U_capacity
|
federated U_capacity
|
|---|---|---|
| 0.00 | 1.000 | 0.973 |
| 0.25 | 0.833 | 0.977 |
| 0.50 | 0.667 | 0.990 |
| 0.75 | 0.500 | 0.957 |
| 1.00 | 0.333 | 1.000 |
The −0.027 at uniform load is the bounded batch overhead — at most
(K−1)·(batch−1) un-served escrow per window. Tighten batch to shrink
that gap at the cost of more coordinator round trips.
The bound holds end-to-end through every coordinator outage shape;
detailed table in
docs/FAILURE-MODES.md.
Summary:
| Outage shape | Behavior | Δ |
|---|---|---|
| Region partitioned from coordinator | Region serves existing escrow until empty, then denies (fail-closed default) | 0 |
| Coordinator crash + recovery within a window | Region serves existing escrow; on recovery, leases resume against the preserved budget | 0 |
| Coordinator unavailable across a window boundary | All regions deny during outage; on recovery, fresh window's budget acquired normally | 0 |
Coordinator outage with onCoordinatorOutage: "regional-only" (0.8.5; requires regionalEscrow) |
Engine continues serving from the L2 balance until depletion; re-probes via coordinator.isHealthy() every coordinatorHealthCheckMs (default 5 s); resumes on recovery |
≤ L2 balance |
Multi-process within a region sharing a regionalEscrow (0.8.5) |
M engines share one L2; in-flight per-region escrow bounded by perKeyBudget, not M × batch
|
0 |
| Regional Redis (L2) outage, coordinator reachable (0.8.5) | Engine falls through to direct L3 leasing (matches 0.8.4 behavior); multi-process atomicity lost for the outage duration | 0 |
For soft-traffic operators who prefer availability, onCoordinatorOutage: "regional-only" (shipped 0.8.5; requires a regionalEscrow) keeps
serving from the L2 balance during outage — Δ degrades to ≤ the L2
balance at outage onset, not 0; documented opt-in. See the
multi-process regional escrow section
below.
| Coordinator | When to use | Status |
|---|---|---|
TestCoordinator (in-memory) |
Tests + examples; deterministic | Shipped 0.8.3 |
RedisCoordinator (single global Redis) |
Production default; lowest latency; documented SPOF | Shipped 0.8.3 |
PostgresCoordinator (single Postgres primary) |
When you already run Postgres; faster failover via Patroni | Shipped 0.8.4 |
| Raft-via-etcd | HA-without-SPOF (the SPOF mitigation) | Future |
| CRDT / gossip | Multi-leader with bounded staleness | Research follow-up |
Both RedisCoordinator and PostgresCoordinator implement the identical
GlobalCoordinator interface — same window-coupling guarantee (Δ = 0),
same K-independent bound. They are drop-in interchangeable; the choice
is operational, not semantic.
| Axis | RedisCoordinator |
PostgresCoordinator |
|---|---|---|
| Latency per lease | ~0.5–1 ms (Lua EVALSHA) | ~1–3 ms (transactional SQL) |
| Throughput cap | 100K+ leases/sec | 5K–20K leases/sec |
| HA story | Sentinel / Cluster | Synchronous replication + Patroni / pg_auto_failover |
| Durability | Configurable (RDB / AOF) | WAL + sync replication (byte-durable) |
import { Pool } from "pg";
import { fixedWindow } from "throttlekit";
import { federate, PostgresCoordinator } from "throttlekit/federation";
const pool = new Pool({ connectionString: process.env.GLOBAL_COORDINATOR_PG_URL });
const coordinator = new PostgresCoordinator({
pool,
windowMs: 60_000,
budgetPerWindow: 1000,
});
const limiter = federate({
strategy: fixedWindow({ limit: 1000, windowMs: 60_000 }),
coordinator,
region: "us-east",
batch: 16,
});
// Same Limiter surface as RedisCoordinator-backed federation.
const decision = await limiter.check("global:checkout");
// On shutdown:
coordinator.close(); // stops the background GC timer
await pool.end();The Postgres-backed coordinator creates its own table (tk_fed_state)
on first use — no migration tool required. Schema details in
research/postgres-coordinator/DESIGN.md.
The GlobalCoordinator interface is small (3 methods); rolling a custom
backend is a couple of dozen lines of Lua / SQL / equivalent.
When you run multiple processes in the same region (e.g., M=4 Node
workers behind a load balancer), the default federation engine has each
process hold its own in-process escrow — so in-flight per-region escrow
is M × batch. For most workloads this is fine (the federation bound
Δ = 0 still holds at the global coordinator), but two scenarios benefit
from a tighter regional sub-bound:
-
Tight regional sub-bounds (rare; some billing/quota stories want
per-region admissions ≤
perKeyBudgetnotM × batch). -
The
regional-onlyoutage mode — without an L2 to serve from during a coordinator outage, the mode collapses to fail-closed.
The fix is a regional escrow — an L2 layer between in-process L1 and the cross-region L3 coordinator, typically backed by a regional Redis cluster shared by all M processes:
import { createClient } from "redis";
import { fixedWindow } from "throttlekit";
import {
federate,
RedisCoordinator,
RedisRegionalEscrow,
} from "throttlekit/federation";
import { fromNodeRedis } from "throttlekit/redis";
const l2 = createClient({ url: process.env.REGIONAL_REDIS_URL });
const l3 = createClient({ url: process.env.GLOBAL_COORDINATOR_REDIS_URL });
await l2.connect();
await l3.connect();
const coordinator = new RedisCoordinator({
client: fromNodeRedis(l3),
windowMs: 60_000,
budgetPerWindow: 1000,
});
const regionalEscrow = new RedisRegionalEscrow({
client: fromNodeRedis(l2),
windowMs: 60_000,
region: "us-east",
});
const limiter = federate({
strategy: fixedWindow({ limit: 1000, windowMs: 60_000 }),
coordinator,
regionalEscrow,
region: "us-east",
batch: 16,
onCoordinatorOutage: "regional-only", // optional; default is "fail-closed"
});The RegionalEscrow interface mirrors GlobalCoordinator one layer
down (lease / refill / release / optional isHealthy); the
default RedisRegionalEscrow uses three atomic Lua scripts, same
pattern as RedisCoordinator. TestRegionalEscrow is the
deterministic in-memory mirror for tests + examples.
regional-only outage mode: when set, on a coordinator outage the
engine continues serving from the L2 balance until depletion. It
re-probes coordinator.isHealthy() every coordinatorHealthCheckMs
(default 5 s, clock-driven not timer-driven); on recovery, normal lease
- reconcile resumes. The federation bound (Δ = 0) degrades to the regional sub-bound (≤ L2 balance at outage onset) during the outage and is re-enforced from the recovery point onward.
Backward compatible: federate({ ... }) without regionalEscrow uses
the legacy 0.8.4 in-process-only path verbatim. See
examples/federation-regional-escrow.ts
for a runnable demo, and
research/regional-escrow/DESIGN.md
for the full design.
A single global Redis IS a single point of failure for the federation's
safety bound. When the Redis is unreachable, every region's lease()
throws → fail-closed (default) → no new admissions across the entire
federation until the Redis returns. The mitigations:
- Sentinel / Cluster under your Redis client — the Lua scripts work unchanged.
-
PostgresCoordinator(0.8.x follow-up) — replaces the SPOF with Postgres failover semantics (synchronous replication, automatic primary promotion). - Raft-via-etcd (1.0.x) — true HA-without-SPOF.
For 0.8.3 the SPOF is documented; users in regulated environments should
opt for Sentinel or wait for PostgresCoordinator.
-
DESIGN.md —
full design, the lift argument from
windowCoupled, the lockedGlobalCoordinatorinterface, failure semantics in depth, the worked example. -
spec/GaleFederatedLeasing.tla— the TLA⁺ spec provingadmitted ≤ Limitindependent ofK. -
test/gale/federated/leasing-variants.test.ts— the CI-runnable BFS twin matching TLC's distinct-state counts. -
research/bigger-bets/federation/eval/RESULTS.md— the end-to-end 3-region eval with measured Δ and U numbers.
federate(...) returns a regular Limiter. It composes naturally with
the rest of the library:
- Wrap with
withAnalytics(...)for heavy-hitter detection. - Tap with
tapDecisions(...)for OTel / Prometheus integration. - Use with adapters (
createEnforcer(...), the Express middleware, etc.) via the standardLimiterinterface. - Stack
twoTier(leased)on TOP for an in-process L1 cache; the recursive twoTier composition is the canonical multi-process per- region setup.
ThrottleKit · MIT · 1.0 — API frozen under SemVer (Stability)
- Getting Started
- Choosing a strategy
- Frameworks & the edge
- Distributed & provable
- Federation
- Scaling & the Fleet
- Unified admission
- Pillar 4 — Weighted Fair Escrow
- Middleware integration
- Distributed adaptive concurrency
- Advanced limiting
- Overload, fairness & DDoS
- Operations
- Monitoring — ThrottleKit Lens
- Policy Plans
- Replay
- Performance
- Migrating
- Polyglot & Python
- GALE & TALE