Skip to content

fix: tighten request controller release semantics#682

Merged
eric-tramel merged 3 commits into
scheduling-yolofrom
andreatgretel/fix/request-controller-semantics
May 20, 2026
Merged

fix: tighten request controller release semantics#682
eric-tramel merged 3 commits into
scheduling-yolofrom
andreatgretel/fix/request-controller-semantics

Conversation

@andreatgretel
Copy link
Copy Markdown
Contributor

Summary

  • Apply multiplicative decrease on every rate-limited request outcome.
  • Emit the terminal release outcome on request_lease_released events instead of the generic release result.
  • Treat modified same-id request leases as stale so they cannot release or account for the active lease.

Why

The request-admission plan describes rate-limited outcomes as applying multiplicative decrease, release events as carrying the release outcome, and stale leases as diagnostic-only. This keeps controller accounting and telemetry aligned with those contracts.

Validation

  • .venv/bin/ruff check --fix .
  • .venv/bin/ruff format .
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py

@nabinchha
Copy link
Copy Markdown
Contributor

Red-teamed this together with PR #681 and found one concurrency edge case around repeated rate_limited releases.

If two leases are both acquired when current_limit == 8, both leases carry current_adaptive_limit == 8. If both later return 429, this change drops the controller from 8 -> 4 -> 2. The second 429 is stale evidence from the pre-backoff admission window, not proof that limit 4 is still too high.

Could we gate the multiplicative decrease on the lease limit, for example only decrease when active.current_adaptive_limit <= state.current_limit, or otherwise suppress additional decreases from leases admitted before the latest backoff? We can still extend cooldown / record the 429, but avoid compounding stale in-flight failures.

@eric-tramel eric-tramel marked this pull request as ready for review May 19, 2026 20:38
@eric-tramel eric-tramel requested a review from a team as a code owner May 19, 2026 20:38
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR tightens the release semantics of AdaptiveRequestAdmissionController in three ways: it now treats any caller-provided lease whose fields differ from the stored active lease as stale (diagnostic-only), captures rate_limit_ceiling only on the first rate-limited event in a cascade rather than on every one, and records the terminal RequestReleaseOutcome kind as the reason_or_outcome on request_lease_released events instead of the generic "released" string.

  • Stale-lease guard (elif active != lease): leases with a matching ID but mutated fields are rejected and logged as diagnostics without touching counters or applying outcomes.
  • One-shot ceiling capture (if state.rate_limit_ceiling == 0): prevents the pre-cascade limit from being overwritten by the already-reduced limit on successive decreases in the same cascade, fixing a false-recovery signal.
  • Outcome-first event routing: _request_event_locked now checks outcome before result, so request_lease_released reports the real outcome kind ("rate_limited", "provider_failure", etc.).

Confidence Score: 5/5

Safe to merge — all three behavioral changes are logically correct, well-scoped, and backed by updated tests.

The stale-lease branch correctly restores the active lease before returning, so no counter leaks occur. The one-shot ceiling guard only fires when the ceiling is still zero, preserving the pre-cascade value on every subsequent decrease. The outcome-priority reordering in _request_event_locked is intentional and the new test directly verifies the emitted reason_or_outcome. No unintended side-effects were found.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/models/request_admission/controller.py Three targeted fixes: stale lease detection via exact equality check, one-shot rate_limit_ceiling capture, and outcome-first reason_or_outcome in release events — all logically consistent with the PR's stated contracts.
packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py Test updates cover the ceiling fix (assert 8 instead of 4), stale-lease detection with explicit construction, and a new test asserting that request_lease_released events carry the terminal outcome kind.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["release(lease, outcome)"] --> B{controller_generation\nmatches?}
    B -- No --> D1["diagnostic: wrong_controller_generation"]
    B -- Yes --> C{active_leases\nhas lease_id?}
    C -- No --> D2["diagnostic: duplicate / unknown_lease"]
    C -- Yes, pop active --> E{active == lease?}
    E -- No --> D3["put active back\ndiagnostic: stale_lease"]
    E -- Yes --> F["decrement counters\n_apply_outcome()"]
    F --> G{outcome.kind ==\n'rate_limited'?}
    G -- Yes --> H["emit request_rate_limited\nusing active lease"]
    G -- No --> I
    H --> I["emit request_lease_released\nwith outcome.kind as reason_or_outcome"]
    I --> J["admit_waiters / notify"]

    subgraph _apply_outcome ["_apply_outcome (rate_limited path)"]
        direction TB
        P["multiplicative decrease\ncurrent_limit x factor"] --> Q{rate_limit_ceiling == 0?}
        Q -- Yes --> R["set ceiling = admitted_adaptive_limit\n(captured once per cascade)"]
        Q -- No --> S["ceiling unchanged"]
    end

    F -.-> _apply_outcome
Loading

Reviews (4): Last reviewed commit: "test: restore waiter deadline time impor..." | Re-trigger Greptile

@andreatgretel andreatgretel force-pushed the andreatgretel/fix/request-controller-semantics branch from 4a838c1 to 8f0255c Compare May 19, 2026 20:43
@andreatgretel andreatgretel force-pushed the andreatgretel/fix/request-controller-semantics branch from c19a078 to cb9097d Compare May 19, 2026 22:00
@eric-tramel eric-tramel merged commit 05852e4 into scheduling-yolo May 20, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants