BudgetWindowBatchRunner busy-polls child simulations at a 100 ms interval #449

@MaxGhenis

Summary

BudgetWindowBatchRunner.run() busy-polls child simulations on a 100 ms interval for up to the 1-hour parent timeout. Each budget-window request occupies a Modal container for the full duration of its batch, starving the max_containers=100 pool and hammering Modal's control plane.

Location

  • projects/policyengine-api-simulation/src/modal/budget_window_scheduler.py:33 (POLL_INTERVAL_SECONDS = 0.1)
  • projects/policyengine-api-simulation/src/modal/budget_window_scheduler.py:74-86 (main loop)
  • projects/policyengine-api-simulation/src/modal/app.py:113 (timeout=3600 on run_budget_window_batch)

What goes wrong

POLL_INTERVAL_SECONDS = 0.1
...
def run(self) -> dict[str, Any]:
    mark_batch_running(self.state)
    put_batch_job_state(self.state)

    while self.has_pending_work():
        self.spawn_until_capacity()
        progress_made = self.poll_running_children_once()
        if self.state.status == "failed":
            return serialize_batch_status(self.state)
        if self.state.running_years and not progress_made:
            time.sleep(self.poll_interval_seconds)

poll_running_children_once calls handle.call.get(timeout=0) for every running year on every pass. With 75 years at a 100 ms cadence, that is up to 750 Modal RPCs per second per active batch. The parent container also sits mostly idle, holding a container slot for as long as the batch runs, up to the full 3600 s timeout.
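
The per-batch RPC figure is simple arithmetic. A rough sketch follows, assuming every pass polls all running years, no pass makes progress (so each pass ends in the sleep), and the RPCs themselves take negligible time. The function name is illustrative and does not exist in the codebase.

```python
def worst_case_poll_rate(running_years: int, poll_interval_seconds: float) -> float:
    """Upper bound on get(timeout=0) calls per second for one batch.

    Assumes each pass polls every running year once and that passes occur
    once per poll interval (RPC latency is ignored).
    """
    passes_per_second = 1.0 / poll_interval_seconds
    return running_years * passes_per_second


rate = worst_case_poll_rate(running_years=75, poll_interval_seconds=0.1)
print(rate)         # 750.0
print(rate * 3600)  # 2700000.0, the bound over a full 3600 s timeout
```

In practice each RPC has non-zero latency, so the observed rate will sit below this bound, which is why the figure is stated as "up to".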

Combined with max_containers=100 on run_budget_window_batch in app.py:115, a handful of concurrent budget-window requests can trivially saturate the pool with sleeping parent containers. Cost therefore scales with a batch's wall-clock duration rather than with the work it performs.
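
To put a number on pool saturation, Little's law (concurrency = arrival rate × hold time) gives the sustained request rate at which every slot is occupied. This is a back-of-the-envelope sketch that assumes the worst case, where each parent holds its container for the whole 3600 s timeout; the function name is hypothetical.

```python
def saturating_arrival_rate(max_containers: int, hold_seconds: float) -> float:
    """Arrival rate (requests/second) at which the pool is fully occupied.

    Little's law: concurrency = arrival_rate * hold_time, so the pool fills
    when arrival_rate = max_containers / hold_time.
    """
    return max_containers / hold_seconds


# Worst case: every parent holds its container for the whole 3600 s timeout.
rate = saturating_arrival_rate(max_containers=100, hold_seconds=3600)
print(f"one request every {1 / rate:.0f} s")  # one request every 36 s
```

Shorter batches raise the threshold proportionally, but the hold time is set by how long children take, not by how much work the parent itself does.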

Suggested fix

  • Replace the fixed 100 ms interval with exponential backoff (start at 1 s, cap at 30 s).
  • Preferred: convert to event-driven orchestration. Spawn children with callbacks, or use Modal primitives such as Function.map() or FunctionCall.gather(), so the parent blocks on completion instead of polling.
  • Consider decoupling the parent's lifetime from poll duration by having poll requests re-consult the store and letting a short-lived worker resume.
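
The first suggestion (backoff starting at 1 s, capped at 30 s) could look roughly like the following. This is a minimal standalone sketch, not the runner's actual code: backoff_delays and poll_with_backoff are hypothetical names, the doubling factor and the reset-on-progress rule are assumptions, and the callables stand in for has_pending_work() and poll_running_children_once().

```python
import time
from typing import Callable, Iterator


def backoff_delays(
    initial: float = 1.0, cap: float = 30.0, factor: float = 2.0
) -> Iterator[float]:
    """Yield delays that grow geometrically from `initial` up to `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_with_backoff(
    has_pending_work: Callable[[], bool],
    poll_once: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until no work is pending, sleeping longer while nothing changes.

    `poll_once` returns True when at least one child made progress; the
    delay sequence restarts at its initial value whenever that happens.
    """
    delays = backoff_delays()
    while has_pending_work():
        if poll_once():
            delays = backoff_delays()  # progress: reset to the short delay
        else:
            sleep(next(delays))


# Demo with a recorded fake sleep: six idle polls, then one that finishes.
results = iter([False] * 6 + [True])
state = {"done": False}
slept: list[float] = []


def fake_poll() -> bool:
    progressed = next(results)
    state["done"] = progressed
    return progressed


poll_with_backoff(lambda: not state["done"], fake_poll, sleep=slept.append)
print(slept)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

The trade-off is latency: with a 30 s cap, a child that finishes just after a poll may go unnoticed for up to 30 s, which is one reason the event-driven option is listed as preferred.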

Severity

High. Cost and scalability impact grows with adoption; also amplifies the unauthenticated-gateway DoS vector.

Labels: bug (Something isn't working)
