Skip to content

Overload Fairness and DDoS

Ameya Borkar edited this page Jun 10, 2026 · 7 revisions

Overload, fairness, and DDoS

Admission-control primitives that sit upstream of the per-key limiters, plus the fixed-memory sketches for surviving a flood.

Adaptive load-shedding

adaptiveThrottle — Google-SRE client-side adaptive throttling. A client that keeps hammering an overloaded backend only deepens the overload; this sheds a growing fraction locally based on the backend's recent accept rate:

import { adaptiveThrottle } from "throttlekit";

const throttle = adaptiveThrottle({ k: 2 }); // shed once sending > 2× what the backend accepts
if (!throttle.request()) return failFast();   // shed locally, don't even try
try { const res = await callBackend(); throttle.record(res.ok); }
catch { throttle.record(false); }            // feed outcomes back so it self-corrects

p = max(0, (requests − K·accepts) / (requests + 1)) over a rolling window; request(priority) protects critical traffic.

Fairness

fairShare — split one global per-window budget across tenants so a greedy tenant can't starve the others (each active tenant guaranteed ≥ limit/N, global total ≤ limit):

import { fairShare } from "throttlekit";
const fair = fairShare({ limit: 1000, windowMs: 60_000 });
const d = fair.checkSync(tenantId); // d.limit is this tenant's current fair cap

weightedFairShare / weightedMaxMin / weightedFairEscrow — the three weighted-fairness primitives. Pick by use case:

Primitive Single-process Distributed (multi-process) Work-conserving Weight-honoring
weightedFairShare ✗ (FCFS surplus) ✓ (approximate)
weightedMaxMin (pure batch) ✓ (exact) ✓ (exact)
weightedFairEscrow ✓ (Store-backed) ✓ (water-filled) ✓ (exact)
import { weightedFairShare, weightedMaxMin, weightedFairEscrow } from "throttlekit";

// Single-process, equal-share approximation — surplus is FCFS, not weight-redistributed.
const lite = weightedFairShare({ limit: 1000, windowMs: 60_000, weightOf: (t) => t === "pro" ? 4 : 1 });

// Pure batch allocator (no streaming, no clock, no store) — exact integer water-filling.
weightedMaxMin([100, 100, 100, 100], [4, 1, 1, 1], 100); // → [57, 15, 14, 14]

// Work-conserving streaming WFE — surplus from idle tenants flows by weight; multi-process via l2.
const escrow = weightedFairEscrow({ limit: 1000, windowMs: 60_000, weightOf: (t) => t === "pro" ? 4 : 1 });
const d = await escrow.check("pro:alice", 5);

weightedFairEscrow is the Pillar 4 production graduation — the four fairness theorems (T1 safety, T2 sharing-incentive, T3 work-conservation, T4 bounded unfairness) are proven via property tests at numRuns: 200 plus the pure batch algebra at 20 000 random trials. Multi-process backing is opt-in via an l2: Store parameter that re-uses the existing fixedWindow strategy for atomic per-process leases — zero new wire surface.

Token budgets under post-hoc cost (tokenBudget)

When a request's cost is revealed only as it runs — the LLM-gateway problem, where you don't know a completion's output-token count until it has streamed — you can't price it at admission. tokenBudget is a streaming meter for exactly that: debit the actual tokens as they're produced, and worst-case overshoot is bounded by the debit granularity — exactly 0 per tokenindependent of the per-request cap (max_tokens) and of how many streams meter concurrently.

import { tokenBudget } from "throttlekit";

const meter = tokenBudget({ budget: 100_000, windowMs: 60_000 }); // 100k tokens/min
for await (const tok of completion) {
  if (!meter.debitSync(1).allowed) break; // budget spent this window — stop generating
  emit(tok);
}

It dominates the two production corners at once: reserving max_tokens up front never overshoots but sterilizes most of every reservation on a heavy-tailed length distribution, while charging the real cost only at completion is efficient but overshoots by up to C · max_tokens. tokenBudget gets reserve-max's safety at admit-then-count's utilization, with no dependence on the cap. It is the shipped Layer 1 of TALE (see GALE & TALE).

For a budget shared across a fleet of gateways, distributedTokenBudget runs the same stop-at-boundary debit as an atomic read-modify-write against a Store — the B = 1 GALE window-coupled instantiation with the token as the unit — so the per-token overshoot stays 0 independent of fleet size:

import { distributedTokenBudget } from "throttlekit";
import { RedisStore } from "throttlekit/redis";

// Construct the SAME budget (same key + store) on every gateway in the fleet.
const meter = distributedTokenBudget({
  budget: 1_000_000, windowMs: 60_000, store: new RedisStore({ client }), key: "tpm:acme",
});
for await (const tok of completion) {
  if (!(await meter.debit(1)).allowed) break; // fleet budget spent this window — stop generating
  emit(tok);
}

It carries a Lua form for single-round-trip Redis, and works over any Store (Postgres, DynamoDB, D1, Deno KV); the multi-gateway reduction is byte-identical to GALE's leased budget. Behind a ThrottleKit server this fleet-shared budget is reachable from any client (including Python) over the existing Debit RPC (fleetBudget:) with no client change — see Scaling & the Fleet.

Pacing admission over the budget (learnedReservation / predictiveReservation)

tokenBudget bounds overshoot for any reservation, but admission still needs a per-request token reservation committed before the cost is known — it decides the 429 and paces concurrency. Reserve too much (= max_tokens) and you reject admissible traffic and starve concurrency; too little and you over-admit, forcing the meter to abort half-finished streams at the boundary. learnedReservation learns the cost-optimal reservation online — projected online gradient descent on the asymmetric newsvendor / pinball loss, descending onto the critical-fractile quantile τ = overrunCost/(holdCost+overrunCost) with O(√T) regret versus the best fixed reservation:

import { tokenBudget, learnedReservation } from "throttlekit";

const meter = tokenBudget({ budget: 100_000, windowMs: 60_000 });
const policy = learnedReservation({ holdCost: 1, overrunCost: 4, maxReservation: 4096 });
if (policy.reserve() <= meter.remaining()) {  // admit only if the reservation fits the budget
  let produced = 0;
  for await (const tok of completion) {
    if (!meter.debitSync(1).allowed) break;   // budget spent — stop generating
    produced++; emit(tok);
  }
  policy.observe(produced);                   // learn from the realised cost
}

predictiveReservation adds a per-request output-length prediction (e.g. a learning-to-rank model), blended against the robust learner with a Hedge meta-learner: accurate hints drive cost to the clairvoyant optimum (consistency), adversarial ones fall back to the no-regret quantile (robustness), and safety is untouched — the prediction is just a number the meter caps, so no prediction can breach the budget. These are TALE Layers 2–3 (see GALE & TALE); safety always stays the meter's job — the learner only makes admission efficient.

Surviving a flood

Per-key stores keep one record per active key — which, under a flood of millions of distinct keys (every source IP in a volumetric attack), makes that state itself the memory-exhaustion vector. sketchRateLimit limits an unbounded key universe in fixed memory with a Count-Min Sketch: ~7.4 KB total, regardless of how many keys it sees.

import { sketchRateLimit } from "throttlekit";

const shield = sketchRateLimit({ limit: 100, windowMs: 60_000 }); // ε=0.01, δ=0.001 by default
if (!shield.checkSync(clientIp).allowed) return reject(429);       // sync or async (check)

Because the sketch never undercounts, allowed implies the true admitted count is ≤ limit — it never over-admits (a hard, non-probabilistic property). Its only error is the safe direction: it may deny a key slightly early once collisions inflate its estimate, bounded by ε·N w.p. ≥ 1−δ (Cormode & Muthukrishnan 2005; conservative-update from Estan & Varghese).

Cluster-wide detection (mergeableSketch)

A low-and-slow distributed attacker can stay under every node's threshold while flooding the fleet. Because CMS counters are linear, each node keeps its own fixed-memory sketch, ships it as compact bytes (snapshot() / toBytes()), and merge()s peers' — the sum is exactly the sketch of the whole cluster's traffic, so the global heavy hitter becomes visible everywhere. Honestly scoped as eventually-consistent detection, not a strongly-consistent global limit. See examples/distributed-sketch.ts.

Clone this wiki locally