Skip to content

feat: enable bounded-borrow task admission#693

Open
eric-tramel wants to merge 9 commits into
mainfrom
codex/bounded-borrow-default
Open

feat: enable bounded-borrow task admission#693
eric-tramel wants to merge 9 commits into
mainfrom
codex/bounded-borrow-default

Conversation

@eric-tramel
Copy link
Copy Markdown
Contributor

@eric-tramel eric-tramel commented May 21, 2026

📋 Summary

Enables bounded-borrow task admission as the default async scheduler policy for #650. The policy computes per-resource strict share from scheduler capacity, queued competing groups, and group weights, rounds strict share up by default to avoid avoidable capacity loss, then permits solo groups to borrow only up to a capacity-derived reserve so heavy root work cannot fill the whole scheduler window before downstream or peer work becomes ready.

This PR is the scheduler-side admission-control guardrail that #651 can generalize from coarse task-stage resources to provider/model/resource-vector resources; it is not claiming to solve endpoint/GPU utilization by itself.

🔗 Related Issue

Closes #650

Follow-up design: #651

🔄 Changes

  • Enable BoundedBorrowTaskAdmissionPolicyConfig in the default AsyncTaskScheduler task-admission config.
  • Implement bounded-borrow strict-share accounting using SchedulerResourceRequest, QueueView, TaskAdmissionView, and TaskGroupSpec without moving DAG traversal or resource ownership into FairTaskQueue.
  • Default bounded-borrow strict-share rounding to ceil, keeping the older floor behavior available only as an internal policy config for focused tests/experiments.
  • Replace the fixed default borrow ceiling with a dynamic internal reserve: reserved_slots = min(8, max(1, ceil(resource_limit * 0.125))); solo groups may borrow up to resource_limit - reserved_slots.
  • Track borrow debt as marginal over-share slots, so explicit ceilings greater than 1 admit the number of borrowed slots they advertise.
  • Preserve explicit internal borrow ceilings through default_borrow_ceiling and borrow_ceiling_by_group_resource for tests/experiments, while leaving public RunConfig unchanged.
  • Preserve the strict/lease-only internal path through TaskAdmissionConfig(..., bounded_borrow=None) for tests and benchmark comparison.
  • Add deterministic benchmark coverage in scripts/benchmarks/benchmark_bounded_borrow_admission.py comparing strict fair admission against the default bounded-borrow policy for heavy-root peer arrival and neutral ready-at-start traffic.
  • Report benchmark generation-column idle explicitly: per-column zero-inflight idle, total column idle, and max column idle across the workflow.
  • Add focused tests for solo heavy-group reservation, dynamic reserve-derived ceilings, explicit ceiling semantics, peer-arrival queue selection, release/debt repayment, strict-share rounding defaults, and the async scheduler default policy.
  • Document bounded-borrow as the default async admission policy in architecture/dataset-builders.md and expose the effective scheduler task-admission config through a read-only engine accessor for tests/diagnostics.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the policy and benchmark interpretation.

  • task_policies.py — strict-share computation, ceil rounding default, dynamic reserve-derived ceiling, marginal borrow-debt accounting, and admission decisions.
  • async_scheduler.py — default policy selection for production async scheduling.
  • benchmark_bounded_borrow_admission.py — deterministic evidence path for strict versus default bounded behavior and per-generation-column idle time.
  • #651 — follow-up resource-vector/provider-aware design should decide how this policy maps from coarse task-stage resources to concrete provider/model generation resources.

📈 Bounded Borrow Verification Report

Plan alignment against plans/645

  • Default policy: async scheduler defaults to bounded borrow, while TaskAdmissionController still supports strict fair admission when bounded_borrow=None for the Implement TaskAdmissionController lease boundary for async scheduler #644 lease-only comparison path.
  • No public knob: no RunConfig or public API field was added; the change stays engine-internal as requested by Add bounded-borrow admission policy for heavy-root async workloads #650.
  • Correct ownership: FairTaskQueue still owns only ready ordering. The policy consumes queue/admission snapshots and does not inspect the DAG, generators, model registry, provider registry, or request-admission AIMD state.
  • Scheduler-resource semantics: decisions are based on scheduler resource requests and capacity snapshots. Borrow debt is tracked by group/resource and remains separate from hard resource availability.
  • Single-group liveness: solo model groups remain live, but the policy is intentionally less work-conserving than strict fair admission when a heavy group is alone at launch, preserving a small resource reserve for later peer/downstream work.
  • Peer arrival: when an over-borrowed group has debt and a queued peer can use the resource, the indebted group is withheld so the peer can be selected.
  • Rounding: strict share rounds up by default, which reduces unnecessary idle capacity compared with the initial floor default while preserving bounded-borrow peer protection.
  • Dynamic reserve: the default ceiling scales with task-admission capacity. For capacity 8, the policy reserves 1 slot, admits up to 7 solo tasks, and records 3 marginal over-share debt slots.

Bridge to #651 resource-vector work

Bounded borrow establishes the control-law shape that #651 should generalize: strict share, bounded solo prefill, peer-pressure yielding, and borrow-debt repayment. Today this operates over coarse scheduler task-stage resources such as llm_wait; #651 should decide how the same policy maps onto provider/model/resource-vector identities without duplicating request-stage AIMD.

The benchmark now reports per-generation-column zero-inflight idle as a proxy for endpoint/GPU idleness. #651 should carry that acceptance criterion forward for provider/model resources: per-resource idle, total idle, max idle, and baseline deltas should be part of the resource-vector evidence gate.

Benchmark command

.venv/bin/python scripts/benchmarks/benchmark_bounded_borrow_admission.py --output-dir .scratch/bounded-borrow-admission

Evidence was generated from commit 96b1a4db58592189f633530601202f6e0a285c15 on macOS with Python 3.12.11.

Benchmark metric definition

Generation-column idle is the amount of workflow wall time where that column has no in-flight generation task. This is the benchmark proxy for endpoint/GPU time with no tokens being generated for that model resource. The benchmark reports each generation column independently, plus total and max column idle.

Benchmark results

Scenario Policy Tasks Wall time (s) Generation-column idle (s) Total column idle (s) Max column idle (s) Hot dispatches before peer ready Peer wait p95 (s) Peer first wait (s)
heavy_root_peer_arrival strict 768 33.600 hot=0.450, peer=26.400 26.850 26.400 8 33.450 0.450
heavy_root_peer_arrival bounded 768 36.700 hot=0.000, peer=24.050 24.050 24.050 7 11.950 0.000
neutral_ready_at_start strict 512 6.400 hot=0.000, peer=0.000 0.000 0.000 0 6.000 0.000
neutral_ready_at_start bounded 512 6.400 hot=0.000, peer=0.000 0.000 0.000 0 6.000 0.000

Interpretation

  • In the heavy-root peer-arrival case, strict fair admission filled all 8 scheduler slots with hot/root work before peer work arrived. Default bounded borrow admitted 7 hot tasks, preserved one slot, and dispatched peer work immediately when it became ready.
  • Peer p95 ready-to-dispatch wait improved from 33.450s to 11.950s, a 64.3% reduction. First peer wait improved from 0.450s to 0.000s.
  • Under the per-generation-column idle metric, bounded borrow reduced total idle from 26.850s to 24.050s (-2.800s) and reduced max column idle from 26.400s to 24.050s (-2.350s) in the heavy-root scenario.
  • The dynamic reserve substantially reduces the solo-work penalty versus the earlier fixed ceiling: bounded wall time is now 36.700s instead of 46.550s for the ceil/fixed-ceiling version, while keeping immediate peer dispatch.
  • The neutral ready-at-start scenario is unchanged: same wall time, generation-column idle, peer p95 wait, and first peer wait for strict and bounded policies.

🧪 Testing

  • .venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff check scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff format --check scripts/benchmarks/benchmark_bounded_borrow_admission.py
  • .venv/bin/ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py
  • .venv/bin/ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py
  • git diff --check
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_policies.py -q — 24 passed
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py -q — 169 passed
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/test_capacity.py -q — 10 passed
  • .venv/bin/python scripts/benchmarks/benchmark_bounded_borrow_admission.py --output-dir .scratch/bounded-borrow-admission
  • E2E tests added/updated: N/A — scheduler-policy change is covered by unit/integration tests and deterministic benchmark evidence.

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture checked against plans/645; no public docs/API updates required

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
@eric-tramel eric-tramel requested a review from a team as a code owner May 21, 2026 01:16
@github-actions
Copy link
Copy Markdown
Contributor

PR #693 Review: enable bounded-borrow task admission

Summary

Promotes the bounded-borrow task admission policy from an opt-in alternative to the default for the production AsyncTaskScheduler. The policy now computes a per-resource "strict share" from scheduler capacity, queued competing groups, and group weights, and only allows a small bounded amount of solo borrowing past that share so a heavy root group cannot pre-fill the scheduler window before downstream/peer work becomes ready. The change is engine-internal (no public API surface added) and is backed by unit tests, an integration smoke test on the scheduler default, and a deterministic benchmark script under scripts/benchmarks/.

Touched files:

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py — set bounded-borrow config as default
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py — strict-share computation, restructured evaluate, validation
  • packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py — two new tests
  • packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py — default-policy smoke test
  • scripts/benchmarks/benchmark_bounded_borrow_admission.py — new (377 lines)

Findings

Correctness

  • evaluate logic is internally consistent (task_policies.py:134-206). Tracing the new heavy-solo test: with capacity=4, weight=4, no peers, _strict_share returns 2 (solo case uses total_weight = group_weight * 2); items 0 and 1 admit at-or-below share, item 2 borrows new_debt=1 against ceiling=1, item 3 is denied with reason="borrow_debt" because debt(1) + new_debt(2) = 3 > 1. The asserted policy_debt_by_group_resource == 1 and resources_available == 1 line up.
  • Default value change is a subtle behavior shift for direct callers. default_borrow_ceiling default went from 01 (line 47). For the production async scheduler this is intentional and exercised by the new tests, but any code that constructed BoundedBorrowTaskAdmissionPolicyConfig() expecting "no borrowing" now silently permits one unit of debt per (group, resource). Worth a one-line docstring note that the default is non-zero, since the dataclass docstring still describes borrow tracking generically without naming the new default. All existing tests pass default_borrow_ceiling=1 explicitly, so the change is invisible there — which actually hides whether anything else relied on the old 0 default.
  • Solo-case heuristic is undocumented in the code. _strict_share (lines 250-275) hard-codes total_weight = group_weight * 2 when only one competing group is visible (line 264). This effectively reserves half of every typed resource for a hypothetical future peer, which is exactly the "less work-conserving for solo heavy traffic" tradeoff called out in the PR description. The reasoning is non-obvious from the code alone — a brief comment ("reserve half capacity for an unobserved future peer; otherwise solo groups would never see strict-share pressure") would save the next reader a trip back to this PR.
  • _strict_share early-return for resource_limit <= 1. Returning 1 when limit is 0 (line 258-259) is mildly surprising — it would let a strict-share comparison succeed against a resource that has no capacity. In practice the hard-resource eligibility check (_is_hard_resource_eligible) gates this earlier in the admission flow, so it is not exploitable, but the bound <= 1 would be more defensible as < 1 returning 0, with == 1 returning 1. Minor.
  • Pressure-path group_cap denial leaves the heavy group blocked but does not mutate debt. Correct — the group voluntarily yields without consuming borrow budget. The behavior is exercised by the new test_bounded_borrow_debt_yields_queue_selection_to_new_peer test.
  • __post_init__ validation is good. Rejects negative ceilings on both default_borrow_ceiling and per-(group, resource) overrides at construction time.

Style / conventions

  • Absolute imports, from __future__ import annotations, modern type syntax — all consistent with STYLEGUIDE.md.
  • import math is stdlib so no lazy_heavy_imports concern.
  • New file benchmark_bounded_borrow_admission.py has the SPDX header (line 1-2). Good.
  • The evaluate method has grown to ~70 lines with three nested branches. It is still readable, but the per-resource block mixes pressure-vs-non-pressure logic with diagnostic accumulation. Pulling the pressure branch into a small helper (e.g., _evaluate_under_pressure) would shorten the main function and make the "yield to peer" semantics easier to reason about. Optional; not blocking.

Tests

  • test_bounded_borrow_prevents_solo_heavy_group_from_consuming_all_typed_capacity is a strong addition — exercises the new strict-share clamp end-to-end through TaskAdmissionController.
  • test_bounded_borrow_debt_yields_queue_selection_to_new_peer verifies the queue-level effect (peer wins selection). Good integration-flavored coverage.
  • test_scheduler_default_task_admission_uses_bounded_borrow_policy is intentionally light. Two suggestions:
    • Assert isinstance(scheduler._task_admission_config.bounded_borrow, BoundedBorrowTaskAdmissionPolicyConfig) rather than just is not None — the latter would still pass if a future change set it to a different sentinel.
    • Reaching into _task_admission_config is private-attribute coupling. Acceptable here, but a TaskAdmissionConfig.snapshot() accessor or similar would age better.
  • No test pins the new default default_borrow_ceiling=1 value. Every existing test passes default_borrow_ceiling=1 explicitly, so a regression that flipped the default back to 0 (or to some other value) would not be caught. A one-line assert BoundedBorrowTaskAdmissionPolicyConfig().default_borrow_ceiling == 1 test, or asserting it through the scheduler default, would lock the contract. Recommended.

Benchmark script

  • Uses time.strftime for the dated subdirectory; the JSON report's created_at uses time.gmtime() while the directory name uses local time (line 90 vs line 99). Minor inconsistency — pick one timezone.
  • _git_sha swallows all exceptions to return "unknown". Fine for a benchmark, but subprocess.check_output will also raise FileNotFoundError if git is missing — covered by bare except Exception so no issue.
  • The script is not registered as a test or CI step — it is a manual evidence-generation tool, which matches its location under scripts/benchmarks/. Good.
  • Benchmark numbers in the PR description show a 63.8% reduction in peer p95 wait at the cost of +15.4s wall time and a 31% utilization drop in the heavy-root scenario. This is a real production tradeoff being made the default. The PR description owns it explicitly, which is the right call, but downstream users running heavy-solo workloads (no peers) will see longer wall times. Worth surfacing in a CHANGELOG or release note when this ships.

Performance

  • _strict_share is invoked per-resource per-item per evaluate. Each call scans queue_view.first_candidate_resources_by_group (and _is_hard_resource_eligible does another loop over each peer's resources). For workloads with many groups and resources this is O(groups × resources²) per admission decision. Probably fine at current scales; revisit if admission becomes hot. No action needed.

Architecture / structural impact

  • Direction is clean: task_policies.py consumes QueueView / TaskAdmissionView / TaskGroupSpec snapshots and does not import from dataset_builders.async_scheduler or higher layers. Matches the AGENTS.md "registries / snapshots, no reverse imports" guidance.
  • No public API change; the change is contained within data_designer.engine.dataset_builders.scheduling.

Verdict

Approve with minor follow-ups. The implementation is correct under the cases I traced, well-covered by new unit tests, and aligned with the architectural guidance in AGENTS.md (snapshot-driven policy, no DAG inspection, no public knob). The main asks:

  1. Add a test that pins BoundedBorrowTaskAdmissionPolicyConfig().default_borrow_ceiling == 1 (or assert the value through the scheduler default) so the default cannot drift silently.
  2. Comment the solo-case total_weight = group_weight * 2 heuristic in _strict_share.
  3. Tighten test_scheduler_default_task_admission_uses_bounded_borrow_policy to assert the type of the bounded-borrow config.
  4. Consider noting the new default ceiling (1) in the BoundedBorrowTaskAdmissionPolicyConfig docstring, since direct callers using BoundedBorrowTaskAdmissionPolicyConfig() will now see borrow behavior they previously did not.

The intentional utilization regression for solo-heavy traffic is well-documented in the PR description and benchmark, and is the right tradeoff for the #650 goal.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR activates BoundedBorrowTaskAdmissionPolicyConfig as the default policy for AsyncTaskScheduler, replacing the previous no-borrow baseline. The core change redesigns the borrow-debt model to use marginal over-share accounting (min(amount, projected − strict_share)) and a dynamic reserve-derived ceiling rather than a fixed-zero default.

  • task_policies.py: Adds _strict_share (proportional fair share with synthesized-peer solo treatment), _competing_group_specs (queue-based competitor set), and _dynamic_reserved_slots; rewrites BoundedBorrowTaskAdmissionPolicy.evaluate to check strict-share per resource under both peer-pressure and solo paths; switches debt recording to marginal borrow units.
  • async_scheduler.py: Passes BoundedBorrowTaskAdmissionPolicyConfig() to the default TaskAdmissionConfig and exposes task_admission_config for test inspection.
  • Test and benchmark additions: New focused unit tests cover dynamic-ceiling reservation, explicit-ceiling marginal semantics, peer-arrival queue selection, and the scheduler default; a new deterministic benchmark compares strict vs. bounded policies across heavy-root and neutral scenarios.

Confidence Score: 5/5

Safe to merge. The bounded-borrow policy is well-scoped to engine-internal scheduler state with no public API changes, and the neutral-scenario benchmark confirms zero regression under balanced workloads.

The marginal debt increment fix correctly tracks only the over-share portion, and the dynamic ceiling computation matches the documented formula. Acquire/release accounting is symmetric and existing tests were carefully updated to preserve prior semantics.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py Core logic rewrite: adds strict-share computation with synthesized-peer solo treatment, marginal borrow-debt via min(amount, projected−strict_share), dynamic reserve-derived ceiling, and per-resource peer-pressure gating. Logic is correct; debt increment and ceiling are consistent across acquire/release paths.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Minimal change: threads BoundedBorrowTaskAdmissionPolicyConfig() into the default TaskAdmissionConfig and adds the task_admission_config read-only property for test introspection.
packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_task_admission.py Adds five new focused tests covering solo reservation, dynamic-ceiling reserve values, explicit-ceiling marginal semantics, peer-arrival queue selection, and updates three existing tests to opt into floor rounding to preserve prior behavior.
scripts/benchmarks/benchmark_bounded_borrow_admission.py New deterministic event-driven benchmark; correctly models task dispatch, lease lifecycle, and per-column zero-inflight idle time.

Reviews (4): Last reviewed commit: "address bounded-borrow review feedback" | Re-trigger Greptile

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
@nabinchha
Copy link
Copy Markdown
Contributor

Thanks for putting this together, @eric-tramel — the strict-share + dynamic reserve story holds up well, and the benchmark gives reviewers exactly the evidence we need.

Summary

Enables bounded-borrow as the default async-scheduler task admission policy. Strict share now computes per-resource fair share from queue/admission snapshots and rounds up by default, while solo groups can borrow up to a capacity-derived reserve (reserved_slots = min(8, max(1, ceil(resource_limit * 0.125)))). Borrow debt is now tracked as marginal over-share (fixing the silent under-permissiveness that Greptile flagged), and the change ships with a deterministic benchmark that quantifies the trade-off (peer p95 wait −64.3% vs +9% wall time on heavy_root_peer_arrival, neutral case unchanged). Implementation matches the stated intent — the change stays engine-internal, public RunConfig is untouched, and FairTaskQueue ownership remains read-only via QueueView/TaskAdmissionView/TaskGroupSpec.

Findings

Warnings — Worth addressing

packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py:1857 — Test reaches into private attribute

  • What: assert scheduler._task_admission_config.bounded_borrow is not None accesses a _-prefixed attribute. DEVELOPMENT.md calls this out: "Tests should exercise public interfaces, not _-prefixed functions or classes."
  • Why: This couples the regression test to internal layout. If we ever rename or restructure how the scheduler exposes its admission config, this is the test that breaks for the wrong reason.
  • Suggestion: Add a small public surface — even a read-only property task_admission_config (or a policy_kind on the controller) — and assert against that. If we want to keep it engine-internal, a # noqa-style carve-out plus a comment explaining the intent would at least flag the trade-off explicitly.

architecture/dataset-builders.md — Default policy change isn't reflected in the architecture map

  • What: The dataset-builders architecture doc describes async admission as "Fair async admission keeps the scheduler flowing across ready columns and model groups" without mentioning that the default is now bounded-borrow with a strict-share + reserved-slots guardrail.
  • Why: A reader of architecture/dataset-builders.md today would still get the strict-fair mental model. Since this PR is also the spot where bounded-borrow becomes "the way the scheduler behaves by default in production," the doc starts being subtly false relative to runtime behavior.
  • Suggestion: Add a sentence or two to the "Async Execution" / "Design Decisions" section that names BoundedBorrowTaskAdmissionPolicyConfig as the default, summarizes the strict-share + reserve invariant, and points at this PR's policy file. A short note that bounded_borrow=None selects strict fair admission for tests/benchmarks would also cover Implement TaskAdmissionController lease boundary for async scheduler #644-style comparisons. Either in this PR or a quick follow-up before Design resource-vector and provider-aware admission policies #651 builds on top.

Suggestions — Take it or leave it

packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py:298resource_limit == 0 returns strict_share = 1

  • What: _strict_share early-exits with return 1 when resource_limit <= 1, including the resource_limit == 0 case (resource missing from resource_limits).
  • Why: This is harmless today because TaskAdmissionController._missing_resources denies before the policy is ever asked. But it's a subtle invariant to rely on — if someone later refactors the controller to defer missing-resource checks, the policy would silently report a strict share above the limit.
  • Suggestion: Either guard the 0 case explicitly (if resource_limit <= 0: return 0) or add a one-line comment near the early-exit noting the _missing_resources precondition. Both options keep the body readable.

packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py:303-305 — Doubled-weight solo heuristic deserves a slightly fuller note

  • What: When the candidate set is <= 1, total_weight = group_weight * 2 to keep solo groups from claiming the full resource and skipping borrow accounting entirely. The comment captures the what but not really the why we picked 2x.
  • Why: For future readers (especially Design resource-vector and provider-aware admission policies #651, which is going to generalize this), the choice of "2x" reads like a magic number. It's effectively a 50% solo strict share before any borrow.
  • Suggestion: A one-liner like # 2x synthesizes a single equally-weighted future peer; combined with the dynamic reserve this gives ~50% strict share and the rest borrowable. would land it. No code change needed.

scripts/benchmarks/benchmark_bounded_borrow_admission.py:411-415 — Bare except Exception

  • What: _git_sha() catches Exception to fall back to "unknown". STYLEGUIDE.md discourages catching Exception without re-raising.
  • Why: It's a benchmark utility and the broad catch is pragmatic, but it'll trip future "no bare-Exception" lint sweeps and makes it slightly easier for a real bug (e.g. wrong cwd, oddly-shaped git output) to disappear silently.
  • Suggestion: Narrow to (subprocess.CalledProcessError, FileNotFoundError, OSError). That covers "git not installed", "not a git repo", and "subprocess failed", which is what we actually care about here.

packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_policies.py:200-217 — Partial diagnostics on early-exit denial

  • What: evaluate builds diagnostics_by_resource for every resource, but on the first per-resource denial we return diagnostics={"resource": …, "debt": …, "new_debt": …, "ceiling": …, "strict_share": …} — i.e. only the offending resource. The accumulated diagnostics_by_resource is dropped.
  • Why: For multi-resource tasks (e.g. submission + llm_wait), an operator looking at explain_blocked only sees one resource at a time. We're already paying the cost of computing both.
  • Suggestion: Optional — pass the accumulated diagnostics_by_resource along on denial too, e.g. diagnostics={..., "all_resources": diagnostics_by_resource}. Cheap and aids debugging when a single task is contended on multiple resources.

What Looks Good

  • Greptile's P1 about marginal borrow debt is actually fixed, and test_bounded_borrow_explicit_ceiling_counts_marginal_borrowed_slots pins the new behavior — ceiling=3 now admits the advertised 7 (4 strict + 3 borrowed) instead of silently under-permitting. Good catch turned into a regression test.
  • __post_init__ validation on the frozen config is exactly the right shape: explicit non-negative checks for default_borrow_ceiling, the per-key entries in borrow_ceiling_by_group_resource, the fraction in [0, 1], and a positive dynamic_borrow_max_reserved_slots. Failing fast on bad config beats discovering it as a weird scheduler hiccup later.
  • The benchmark deserves a callout — it's deterministic, reports per-column zero-inflight idle (the right proxy for endpoint/GPU idleness), and the strict-vs-bounded comparison gives reviewers a clean evidence path. The neutral scenario being a no-op (same wall time, same idle) is exactly the regression check we'd want before flipping a default.
  • The _strict_share / _competing_group_specs / _dynamic_reserved_slots split keeps the policy readable: each helper has one job, the main evaluate reads top-down, and the diagnostics dict is built once and updated incrementally rather than reconstructed per branch.
  • Existing tests that depended on the old floor rounding were explicitly migrated to strict_share_rounding="floor" rather than silently re-tuned to the new defaults — that's the right call when changing a default behavior.

Verdict

Needs changes — let's address the two warnings before merge:

  • test_async_scheduler.py:1857 reaching into _task_admission_config (add a public property on AsyncTaskScheduler and assert against it).
  • architecture/dataset-builders.md not reflecting that bounded-borrow is now the default async-admission policy (a couple of sentences in the "Async Execution" / "Design Decisions" section, plus naming BoundedBorrowTaskAdmissionPolicyConfig and the strict-share + reserve invariant).

The suggestions can land later or in a follow-up — they're not blocking.

@eric-tramel eric-tramel self-assigned this May 22, 2026
@eric-tramel
Copy link
Copy Markdown
Contributor Author

@nabinchha thanks for the detailed review. I pushed 96b1a4db addressing the actionable items:

  • Added a read-only AsyncTaskScheduler.task_admission_config accessor and updated the default-policy test to use that instead of _task_admission_config.
  • Updated architecture/dataset-builders.md to describe bounded borrow as the default async-admission policy, including strict share, dynamic reserve, peer-pressure yielding, and bounded_borrow=None for strict-fair comparisons.
  • Tightened _strict_share so a zero/missing resource limit defensively returns 0 rather than 1, with a focused policy test.
  • Expanded the solo-group 2x heuristic comment to explain that it synthesizes one equally weighted future peer and keeps solo strict share near 50% before dynamic borrow.
  • Narrowed the benchmark _git_sha() exception handling to CalledProcessError/OSError.

I left the optional accumulated denial diagnostics out of this PR. The point is valid, but the main blocked-queue path currently aggregates only denial reasons, so making that operationally useful would require broader explain_blocked() diagnostic aggregation and feels better as follow-up scope.

I reran the focused checks and benchmark, then refreshed the PR description with the current benchmark evidence SHA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add bounded-borrow admission policy for heavy-root async workloads

2 participants