Summary
Critical background operations lack error recovery, causing permanent data loss or stuck resources.
Findings
The server is set to `connected: false` before K8s job creation. If the job fails, the server is left unusable with no recovery.
Impact
ELO calculations are permanently lost on transient errors. Servers can become permanently stuck.
Suggested Fix
- Add BullMQ retry configuration with backoff to ELO jobs.
- Add a recovery mechanism for servers in a bad state (periodic health check job).
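
The two suggested fixes could look roughly like the sketch below. It is illustrative only and does not import BullMQ: `RetryOptions` mirrors the `attempts` and `backoff` job options that BullMQ documents, while the specific values, `eloQueue`, the job name, `ServerRecord`, and its `jobFailed` field are assumptions that do not come from the codebase.

```typescript
// Mirrors the `attempts` / `backoff` job options documented by BullMQ.
interface RetryOptions {
  attempts: number;
  backoff: { type: "exponential" | "fixed"; delay: number };
}

// Assumed values; tune them for the real ELO workload.
const eloJobOptions: RetryOptions = {
  attempts: 5,
  backoff: { type: "exponential", delay: 1000 },
};

// With BullMQ these options would be passed when enqueueing, for example:
//   await eloQueue.add("calculate-elo", { matchId }, eloJobOptions);
// (`eloQueue`, the job name, and `matchId` are hypothetical.)

// Wait before each retry, using BullMQ's documented exponential formula
// 2^(attemptsMade - 1) * delay, or a constant delay for "fixed".
function retryDelays(opts: RetryOptions): number[] {
  const retries = opts.attempts - 1; // the first attempt is not a retry
  return Array.from({ length: retries }, (_, i) =>
    opts.backoff.type === "exponential"
      ? Math.pow(2, i) * opts.backoff.delay
      : opts.backoff.delay
  );
}

// Hypothetical server record; field names are assumptions.
interface ServerRecord {
  id: string;
  connected: boolean;
  jobFailed: boolean;
}

// Health check helper: servers flagged as disconnected whose K8s job failed.
// A periodic job could reset these, or retry the K8s job creation.
function findStuckServers(servers: ServerRecord[]): ServerRecord[] {
  return servers.filter((s) => !s.connected && s.jobFailed);
}
```

With these assumed values a failing ELO job is retried four times over about 15 seconds before it is marked failed, which should cover short transient outages. The health check helper only selects candidates; how a stuck server is actually recovered depends on how the K8s job is created.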