Summary
When a budget-window parent job fails on the Modal side, the gateway's polling endpoint mutates the seed BudgetWindowBatchState in memory only. The mutation is never persisted, so the next poll reloads an untouched seed and reports status="submitted" again. Clients see the status flap between failed and submitted.
Location
projects/policyengine-api-simulation/src/modal/gateway/endpoints.py:298-310
What goes wrong
    try:
        result = call.get(timeout=0)
    except TimeoutError:
        return batch_status_response(build_batch_status_response(seed_state))
    except Exception as e:
        seed_state.status = "failed"
        seed_state.error = str(e)
        return batch_status_response(build_batch_status_response(seed_state))
seed_state is an instance loaded from the seed store via get_batch_job_seed (line 292). Neither put_batch_job_seed(seed_state) nor put_batch_job_state(seed_state) is called before returning. On the next /budget-window-jobs/{batch_job_id} poll:
- get_batch_job_state returns None (worker never reached the main store).
- get_batch_job_seed returns the original seed (status still "submitted").
- call.get(timeout=0) either succeeds or raises again.
- Client alternates between "failed" and "submitted" on each poll.
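The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions: BatchState and CopyingStore are hypothetical stand-ins (not the real BudgetWindowBatchState or seed store), the store is assumed to hand back a fresh copy on every read, and the second poll is assumed to take the TimeoutError branch. The persist flag switches between the current behavior and a write-back before returning.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchState:
    # Hypothetical stand-in for BudgetWindowBatchState; only the fields used here.
    status: str = "submitted"
    error: Optional[str] = None


class CopyingStore:
    """Toy key-value store that returns a fresh copy on every read (assumption)."""

    def __init__(self):
        self._data = {}

    def put(self, key, state):
        self._data[key] = copy.deepcopy(state)

    def get(self, key):
        state = self._data.get(key)
        return copy.deepcopy(state) if state is not None else None


def poll(store, key, outcome, persist):
    """Simulate one poll; `outcome` is "timeout" or "error" (hypothetical flag)."""
    state = store.get(key)
    if outcome == "error":
        # Mirrors the except branch: mutate the loaded copy.
        state.status = "failed"
        state.error = "parent job crashed"
        if persist:
            # Write the mutated copy back so later polls can see it.
            store.put(key, state)
    return state.status


def simulate(persist):
    store = CopyingStore()
    store.put("job-1", BatchState())
    # First poll hits the failure; the second poll only times out.
    return [
        poll(store, "job-1", "error", persist),
        poll(store, "job-1", "timeout", persist),
    ]


unpersisted = simulate(persist=False)
persisted = simulate(persist=True)
print(unpersisted)  # ['failed', 'submitted']
print(persisted)    # ['failed', 'failed']
```

Without the write-back, the second poll reloads the untouched seed and the failure disappears; with it, the failed status sticks.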
Suggested fix
Persist state transitions before returning:
    except Exception as e:
        seed_state.status = "failed"
        seed_state.error = str(e)
        put_batch_job_seed(seed_state)  # or put_batch_job_state
        return batch_status_response(build_batch_status_response(seed_state))
Consider consolidating on a single store so the gateway and worker cannot diverge on which dict is authoritative.
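One possible shape for a consolidated store is sketched below. Everything here is hypothetical and for illustration only: BatchStateStore, its save/load/mark_failed methods, and the simplified BatchState are not names from the codebase, and a plain dict stands in for whatever backing store the gateway and worker actually share. The point is that a state transition and its persistence happen in one call, so a caller cannot mutate without writing back.

```python
import copy
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BatchState:
    # Hypothetical, simplified stand-in for BudgetWindowBatchState.
    batch_job_id: str
    status: str = "submitted"
    error: Optional[str] = None


class BatchStateStore:
    """Single authoritative store shared by gateway and worker (hypothetical)."""

    def __init__(self) -> None:
        self._states: Dict[str, BatchState] = {}

    def save(self, state: BatchState) -> None:
        self._states[state.batch_job_id] = copy.deepcopy(state)

    def load(self, batch_job_id: str) -> Optional[BatchState]:
        state = self._states.get(batch_job_id)
        return copy.deepcopy(state) if state is not None else None

    def mark_failed(self, batch_job_id: str, error: str) -> BatchState:
        """Record the failure and persist it in the same step."""
        state = self.load(batch_job_id)
        if state is None:
            raise KeyError(batch_job_id)
        state.status = "failed"
        state.error = error
        self.save(state)
        return state


store = BatchStateStore()
store.save(BatchState(batch_job_id="job-1"))
store.mark_failed("job-1", "parent job crashed")
reloaded = store.load("job-1")
print(reloaded.status, reloaded.error)
```

With this shape, the except branch in the endpoint would reduce to a single mark_failed call followed by building the response from the returned state.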
Severity
High. Breaks the polling contract and any retry logic that keys off "failed".