Summary
BudgetWindowBatchRunner.run() busy-polls child simulations on a 100 ms interval for up to the 1-hour parent timeout. Each budget-window request occupies a Modal container for the full duration, starving the max_containers=100 pool and hammering Modal's control plane.
Location
projects/policyengine-api-simulation/src/modal/budget_window_scheduler.py:33 (POLL_INTERVAL_SECONDS = 0.1)
projects/policyengine-api-simulation/src/modal/budget_window_scheduler.py:74-86 (main loop)
projects/policyengine-api-simulation/src/modal/app.py:113 (timeout=3600 on run_budget_window_batch)
What goes wrong
    POLL_INTERVAL_SECONDS = 0.1
    ...
    def run(self) -> dict[str, Any]:
        mark_batch_running(self.state)
        put_batch_job_state(self.state)
        while self.has_pending_work():
            self.spawn_until_capacity()
            progress_made = self.poll_running_children_once()
            if self.state.status == "failed":
                return serialize_batch_status(self.state)
            if self.state.running_years and not progress_made:
                time.sleep(self.poll_interval_seconds)
poll_running_children_once calls handle.call.get(timeout=0) on every running year on every pass. With 75 years running and a 100 ms cadence (10 passes per second), that is up to 750 Modal RPCs per second per active batch. The parent container does no work of its own between polls, yet it holds a scheduler slot for as long as the batch runs, up to the full 3600 s timeout.
Combined with max_containers=100 on run_budget_window_batch in app.py:115, a handful of concurrent budget-window requests can trivially saturate the pool with sleeping parent containers. Costs scale linearly with batch duration, not work.
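The RPC figure above can be reproduced with a short calculation. This is a back-of-the-envelope sketch, not code from the repository: the function name is hypothetical, and it assumes the worst case where every child is still running and no pass makes progress, so the loop sleeps for the full interval each time.

```python
def poll_rpcs_per_second(running_children: int, poll_interval_seconds: float) -> float:
    """Worst-case status RPCs per second for one batch.

    Assumes one RPC per running child per pass and one pass per poll
    interval (hypothetical helper, for illustration only).
    """
    passes_per_second = 1.0 / poll_interval_seconds
    return running_children * passes_per_second


# Current setting from the issue: 75 years polled every 100 ms.
current_rate = poll_rpcs_per_second(75, 0.1)

# Same batch with a 30 s interval (the backoff cap proposed under "Suggested fix").
capped_rate = poll_rpcs_per_second(75, 30.0)

print(f"current: {current_rate:.0f} RPC/s, capped: {capped_rate:.1f} RPC/s")
```

Under these assumptions the rate depends only on the number of running children and the interval, so lengthening the interval reduces control-plane load proportionally.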
Suggested fix
- Replace the fixed 100 ms interval with exponential backoff (start at 1 s, cap at 30 s).
- Preferred: convert to event-driven orchestration. Spawn children with callbacks, or use Modal's Function.map() / FunctionCall.gather() primitives, so the parent blocks on completion instead of polling.
- Consider decoupling the parent's lifetime from poll duration by having poll requests re-consult the store and letting a short-lived worker resume.
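The first option could look roughly like the sketch below. It is a minimal illustration, not the repository's code: the PollBackoff class and its method names are hypothetical, the 1 s start and 30 s cap come from the suggestion above, and the doubling factor plus the reset-on-progress behaviour are assumptions.

```python
class PollBackoff:
    """Exponential backoff for the poll sleep (hypothetical helper).

    Delays grow by `factor` from `initial` up to `cap`; reset() returns to
    `initial`, for example after a pass in which a child finished.
    """

    def __init__(self, initial: float = 1.0, cap: float = 30.0, factor: float = 2.0):
        if initial <= 0 or cap < initial or factor < 1:
            raise ValueError("require initial > 0, cap >= initial, factor >= 1")
        self.initial = initial
        self.cap = cap
        self.factor = factor
        self._next = initial

    def next_delay(self) -> float:
        """Return the delay to sleep now and advance toward the cap."""
        delay = self._next
        self._next = min(self._next * self.factor, self.cap)
        return delay

    def reset(self) -> None:
        """Start over from the initial delay."""
        self._next = self.initial


backoff = PollBackoff()
delays = [backoff.next_delay() for _ in range(7)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

In run(), the fixed sleep might then become time.sleep(backoff.next_delay()), with backoff.reset() called whenever progress_made is true. Resetting on progress keeps the loop responsive while children are completing, but whether that is desirable depends on how bursty completions are.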
Severity
High. Cost and scalability impact grows with adoption; also amplifies the unauthenticated-gateway DoS vector.