Overload Fairness and DDoS

Overload, fairness, and DDoS

Admission-control primitives that sit upstream of the per-key limiters, plus the fixed-memory sketches for surviving a flood.

Adaptive load-shedding

adaptiveThrottle — Google-SRE client-side adaptive throttling. A client that keeps hammering an overloaded backend only deepens the overload; this sheds a growing fraction locally based on the backend's recent accept rate:

import { adaptiveThrottle } from "throttlekit";

const throttle = adaptiveThrottle({ k: 2 }); // shed once sending > 2× what the backend accepts
if (!throttle.request()) return failFast();   // shed locally, don't even try
try { const res = await callBackend(); throttle.record(res.ok); }
catch { throttle.record(false); }            // feed outcomes back so it self-corrects

p = max(0, (requests − K·accepts) / (requests + 1)) over a rolling window; request(priority) protects critical traffic.

Fairness

fairShare — split one global per-window budget across tenants so a greedy tenant can't starve the others (each active tenant guaranteed ≥ limit/N, global total ≤ limit):

import { fairShare } from "throttlekit";
const fair = fairShare({ limit: 1000, windowMs: 60_000 });
const d = fair.checkSync(tenantId); // d.limit is this tenant's current fair cap

weightedFairShare / weightedMaxMin / weightedFairEscrow — the three weighted-fairness primitives. Pick by use case:

Primitive	Single-process	Distributed (multi-process)	Work-conserving	Weight-honoring
`weightedFairShare`	✓	—	✗ (FCFS surplus)	✓ (approximate)
`weightedMaxMin` (pure batch)	✓	—	✓ (exact)	✓ (exact)
`weightedFairEscrow`	✓	✓ (Store-backed)	✓ (water-filled)	✓ (exact)

import { weightedFairShare, weightedMaxMin, weightedFairEscrow } from "throttlekit";

// Single-process, equal-share approximation — surplus is FCFS, not weight-redistributed.
const lite = weightedFairShare({ limit: 1000, windowMs: 60_000, weightOf: (t) => t === "pro" ? 4 : 1 });

// Pure batch allocator (no streaming, no clock, no store) — exact integer water-filling.
weightedMaxMin([100, 100, 100, 100], [4, 1, 1, 1], 100); // → [57, 15, 14, 14]

// Work-conserving streaming WFE — surplus from idle tenants flows by weight; multi-process via l2.
const escrow = weightedFairEscrow({ limit: 1000, windowMs: 60_000, weightOf: (t) => t === "pro" ? 4 : 1 });
const d = await escrow.check("pro:alice", 5);

weightedFairEscrow is the Pillar 4 production graduation — the four fairness theorems (T1 safety, T2 sharing-incentive, T3 work-conservation, T4 bounded unfairness) are proven via property tests at numRuns: 200 plus the pure batch algebra at 20 000 random trials. Multi-process backing is opt-in via an l2: Store parameter that re-uses the existing fixedWindow strategy for atomic per-process leases — zero new wire surface.

Token budgets under post-hoc cost (`tokenBudget`)

When a request's cost is revealed only as it runs — the LLM-gateway problem, where you don't know a completion's output-token count until it has streamed — you can't price it at admission. tokenBudget is a streaming meter for exactly that: debit the actual tokens as they're produced, and worst-case overshoot is bounded by the debit granularity — exactly 0 per token — independent of the per-request cap (max_tokens) and of how many streams meter concurrently.

import { tokenBudget } from "throttlekit";

const meter = tokenBudget({ budget: 100_000, windowMs: 60_000 }); // 100k tokens/min
for await (const tok of completion) {
  if (!meter.debitSync(1).allowed) break; // budget spent this window — stop generating
  emit(tok);
}

It dominates the two production corners at once: reserving max_tokens up front never overshoots but sterilizes most of every reservation on a heavy-tailed length distribution, while charging the real cost only at completion is efficient but overshoots by up to C · max_tokens. tokenBudget gets reserve-max's safety at admit-then-count's utilization, with no dependence on the cap. It is the shipped Layer 1 of TALE (see GALE & TALE).

For a budget shared across a fleet of gateways, distributedTokenBudget runs the same stop-at-boundary debit as an atomic read-modify-write against a Store — the B = 1 GALE window-coupled instantiation with the token as the unit — so the per-token overshoot stays 0 independent of fleet size:

import { distributedTokenBudget } from "throttlekit";
import { RedisStore } from "throttlekit/redis";

// Construct the SAME budget (same key + store) on every gateway in the fleet.
const meter = distributedTokenBudget({
  budget: 1_000_000, windowMs: 60_000, store: new RedisStore({ client }), key: "tpm:acme",
});
for await (const tok of completion) {
  if (!(await meter.debit(1)).allowed) break; // fleet budget spent this window — stop generating
  emit(tok);
}

It carries a Lua form for single-round-trip Redis, and works over any Store (Postgres, DynamoDB, D1, Deno KV); the multi-gateway reduction is byte-identical to GALE's leased budget. Behind a ThrottleKit server this fleet-shared budget is reachable from any client (including Python) over the existing Debit RPC (fleetBudget:) with no client change — see Scaling & the Fleet.

Pacing admission over the budget (`learnedReservation` / `predictiveReservation`)

tokenBudget bounds overshoot for any reservation, but admission still needs a per-request token reservation committed before the cost is known — it decides the 429 and paces concurrency. Reserve too much (= max_tokens) and you reject admissible traffic and starve concurrency; too little and you over-admit, forcing the meter to abort half-finished streams at the boundary. learnedReservation learns the cost-optimal reservation online — projected online gradient descent on the asymmetric newsvendor / pinball loss, descending onto the critical-fractile quantile τ = overrunCost/(holdCost+overrunCost) with O(√T) regret versus the best fixed reservation:

import { tokenBudget, learnedReservation } from "throttlekit";

const meter = tokenBudget({ budget: 100_000, windowMs: 60_000 });
const policy = learnedReservation({ holdCost: 1, overrunCost: 4, maxReservation: 4096 });
if (policy.reserve() <= meter.remaining()) {  // admit only if the reservation fits the budget
  let produced = 0;
  for await (const tok of completion) {
    if (!meter.debitSync(1).allowed) break;   // budget spent — stop generating
    produced++; emit(tok);
  }
  policy.observe(produced);                   // learn from the realised cost
}

predictiveReservation adds a per-request output-length prediction (e.g. a learning-to-rank model), blended against the robust learner with a Hedge meta-learner: accurate hints drive cost to the clairvoyant optimum (consistency), adversarial ones fall back to the no-regret quantile (robustness), and safety is untouched — the prediction is just a number the meter caps, so no prediction can breach the budget. These are TALE Layers 2–3 (see GALE & TALE); safety always stays the meter's job — the learner only makes admission efficient.

Surviving a flood

Per-key stores keep one record per active key — which, under a flood of millions of distinct keys (every source IP in a volumetric attack), makes that state itself the memory-exhaustion vector. sketchRateLimit limits an unbounded key universe in fixed memory with a Count-Min Sketch: ~7.4 KB total, regardless of how many keys it sees.

import { sketchRateLimit } from "throttlekit";

const shield = sketchRateLimit({ limit: 100, windowMs: 60_000 }); // ε=0.01, δ=0.001 by default
if (!shield.checkSync(clientIp).allowed) return reject(429);       // sync or async (check)

Because the sketch never undercounts, allowed implies the true admitted count is ≤ limit — it never over-admits (a hard, non-probabilistic property). Its only error is the safe direction: it may deny a key slightly early once collisions inflate its estimate, bounded by ε·N w.p. ≥ 1−δ (Cormode & Muthukrishnan 2005; conservative-update from Estan & Varghese).

Cluster-wide detection (`mergeableSketch`)

A low-and-slow distributed attacker can stay under every node's threshold while flooding the fleet. Because CMS counters are linear, each node keeps its own fixed-memory sketch, ships it as compact bytes (snapshot() / toBytes()), and merge()s peers' — the sum is exactly the sketch of the whole cluster's traffic, so the global heavy hitter becomes visible everywhere. Honestly scoped as eventually-consistent detection, not a strongly-consistent global limit. See examples/distributed-sketch.ts.

ThrottleKit · MIT · 1.0 — API frozen under SemVer (Stability)

ThrottleKit Wiki

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overload Fairness and DDoS

Overload, fairness, and DDoS

Adaptive load-shedding

Fairness

Token budgets under post-hoc cost (`tokenBudget`)

Pacing admission over the budget (`learnedReservation` / `predictiveReservation`)

Surviving a flood

Cluster-wide detection (`mergeableSketch`)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Overload Fairness and DDoS

Overload, fairness, and DDoS

Adaptive load-shedding

Fairness

Token budgets under post-hoc cost (tokenBudget)

Pacing admission over the budget (learnedReservation / predictiveReservation)

Surviving a flood

Cluster-wide detection (mergeableSketch)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Token budgets under post-hoc cost (`tokenBudget`)

Pacing admission over the budget (`learnedReservation` / `predictiveReservation`)

Cluster-wide detection (`mergeableSketch`)