Commit 25d53b6

fix(throttle): reset consecutive_429s on non-rate-limit failure
In `release_failure`, the cascade counter wasn't reset, so a sequence like 429 → 500 → 429 was treated as 2 consecutive 429s. The cascade counter feeds AIMD's reduce-once-per-cascade logic; the second 429 should start a fresh cascade and trigger another concurrency reduction, but currently doesn't. Standalone bug surfaced during #575 investigation; not on the failure path that drives the gate-trip outcome but worth fixing while we're in this code.
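
The effect on the limit can be illustrated with a toy model. This is only a sketch, not the real `ThrottleManager`: `ToyDomainState`, `on_rate_limited`, `on_failure`, the starting limit of 8, and the halving factor are all hypothetical, and it assumes the reduction fires only on the first 429 of a cascade.

```python
from dataclasses import dataclass


@dataclass
class ToyDomainState:
    """Hypothetical stand-in for the per-domain throttle state."""

    limit: int = 8  # assumed starting concurrency limit
    consecutive_429s: int = 0


def on_rate_limited(state: ToyDomainState) -> None:
    # Assumed reduce-once-per-cascade rule: only the first 429 of a
    # cascade halves the limit; later 429s in the same cascade do not.
    state.consecutive_429s += 1
    if state.consecutive_429s == 1:
        state.limit = max(1, state.limit // 2)


def on_failure(state: ToyDomainState, reset_cascade: bool) -> None:
    # Non-rate-limit failure (e.g. a 500). reset_cascade=True mirrors
    # this commit; reset_cascade=False mirrors the old behavior.
    if reset_cascade:
        state.consecutive_429s = 0


def limit_after_429_500_429(reset_cascade: bool) -> int:
    # Replay the sequence from the commit message: 429, then 500, then 429.
    state = ToyDomainState()
    on_rate_limited(state)
    on_failure(state, reset_cascade)
    on_rate_limited(state)
    return state.limit


print(limit_after_429_500_429(reset_cascade=False))  # 4: second 429 ignored
print(limit_after_429_500_429(reset_cascade=True))   # 2: second 429 reduces again
```

In this model the old behavior leaves the limit at 4 after the second 429, while resetting the counter on the 500 lets the second 429 halve it again to 2.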
1 parent 763eedd commit 25d53b6

2 files changed

Lines changed: 22 additions & 0 deletions

packages/data-designer-engine/src/data_designer/engine/models/clients/throttle_manager.py

Lines changed: 4 additions & 0 deletions
@@ -310,6 +310,10 @@ def release_failure(
         with self._lock:
             state = self._get_or_create_domain(provider_name, model_id, domain)
             state.in_flight = max(0, state.in_flight - 1)
+            # Non-rate-limit failure breaks the 429 cascade: a sequence like
+            # 429 → 500 → 429 should treat the second 429 as the start of a
+            # new cascade, not the third in a row.
+            state.consecutive_429s = 0

     # -------------------------------------------------------------------
     # Sync / async wrappers

packages/data-designer-engine/tests/engine/models/clients/test_throttle_manager.py

Lines changed: 18 additions & 0 deletions
@@ -183,6 +183,24 @@ def test_failure_releases_slot_without_limit_change(manager: ThrottleManager) ->
     assert state.in_flight == 0


+def test_failure_resets_consecutive_429s_cascade(manager: ThrottleManager) -> None:
+    """Non-rate-limit failure breaks the 429 cascade so 429→500→429 isn't treated as 2-in-a-row.
+
+    The cascade counter feeds the AIMD reduce-once-per-cascade logic; if a
+    non-RL failure doesn't reset it, the subsequent 429 is treated as part of
+    the previous cascade and the limit isn't reduced when it should be.
+    """
+    manager.try_acquire(provider_name=PROVIDER, model_id=MODEL, domain=DOMAIN, now=0.0)
+    manager.release_rate_limited(provider_name=PROVIDER, model_id=MODEL, domain=DOMAIN, now=0.0)
+    state = manager.get_domain_state(PROVIDER, MODEL, DOMAIN)
+    assert state is not None
+    assert state.consecutive_429s == 1
+
+    manager.try_acquire(provider_name=PROVIDER, model_id=MODEL, domain=DOMAIN, now=0.0)
+    manager.release_failure(provider_name=PROVIDER, model_id=MODEL, domain=DOMAIN, now=0.0)
+    assert state.consecutive_429s == 0
+
+
 # --- Global cap ---
