[API] ELO calculation missing retry & dedicated server stuck in bad state #377

@Flegma

Summary

Critical background operations lack error recovery, causing permanent data loss or stuck resources.

Findings

  • EloCalculation job: no error handling or retry logic. If the DB query fails, the job fails silently, and there is no dead letter queue to capture it.
  • DedicatedServersService: the server is marked connected: false before the K8s job is created. If the job fails, the server is left unusable with no recovery path.
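
To make the second finding concrete, here is a minimal sketch of the ordering problem and one way to narrow it. The `DedicatedServer` shape, `restartWithRollback`, and the `createJob` callback are hypothetical names for illustration; the actual DedicatedServersService code is not shown in this issue, and whether restoring the previous flag is the right compensation depends on why the service clears it in the first place.

```typescript
// Hypothetical shape: only the `connected` flag is taken from the issue text.
interface DedicatedServer {
  id: string;
  connected: boolean;
}

// Hypothetical stand-in for whatever creates the K8s job.
type CreateJob = (serverId: string) => Promise<void>;

// Marks the server disconnected, then tries to create the job.
// If job creation throws, the previous `connected` value is restored
// before the error is rethrown, so the flag does not stay false forever.
async function restartWithRollback(
  server: DedicatedServer,
  createJob: CreateJob,
): Promise<void> {
  const previous = server.connected;
  server.connected = false;
  try {
    await createJob(server.id);
  } catch (err) {
    server.connected = previous;
    throw err;
  }
}
```

This only covers failures that surface while the job is being created. A job that is created successfully and fails later would still need a separate recovery path.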

Impact

ELO calculations are permanently lost on transient errors. Servers can become permanently stuck.

Suggested Fix

  • Add BullMQ retry configuration with backoff to ELO jobs.
  • Add recovery mechanism for servers in bad state (periodic health check job).
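
The two suggestions above could be sketched roughly as follows. The option shape mirrors the `attempts` and `backoff` fields that BullMQ accepts in job options, but it is declared locally so the sketch runs without BullMQ or Redis. The values, the `ServerRecord` shape, the `disconnectedAt` field, and both helper functions are assumptions for illustration, not the project's code.

```typescript
// Mirrors BullMQ's `attempts` / `backoff` job options; declared locally
// so this sketch has no dependency on BullMQ or a Redis connection.
interface RetryOptions {
  attempts: number;
  backoff: { type: "exponential" | "fixed"; delay: number };
}

// Hypothetical values; tune them to how long DB outages usually last.
const eloJobOptions: RetryOptions = {
  attempts: 5,
  backoff: { type: "exponential", delay: 1000 },
};

// Wait before each retry. For "exponential", BullMQ documents the wait
// as 2^(attemptsMade - 1) * delay milliseconds; "fixed" reuses `delay`.
function retryDelays(opts: RetryOptions): number[] {
  const delays: number[] = [];
  for (let attemptsMade = 1; attemptsMade < opts.attempts; attemptsMade++) {
    delays.push(
      opts.backoff.type === "exponential"
        ? 2 ** (attemptsMade - 1) * opts.backoff.delay
        : opts.backoff.delay,
    );
  }
  return delays;
}

// Hypothetical record: `disconnectedAt` (epoch ms) is an assumed field
// that a periodic health check would need in order to spot stuck servers.
interface ServerRecord {
  id: string;
  connected: boolean;
  disconnectedAt: number | null;
}

// Selects servers that have been disconnected longer than the threshold.
// A periodic job could run this and then reset or re-provision each one.
function findStuckServers(
  servers: ServerRecord[],
  now: number,
  maxDisconnectedMs: number,
): ServerRecord[] {
  return servers.filter(
    (s) =>
      !s.connected &&
      s.disconnectedAt !== null &&
      now - s.disconnectedAt > maxDisconnectedMs,
  );
}
```

With the values above, `retryDelays(eloJobOptions)` yields four waits of 1000, 2000, 4000, and 8000 ms. In BullMQ the same `attempts` and `backoff` fields would be passed as the options argument to `queue.add(name, data, opts)`; a job that exhausts its attempts ends up in the queue's failed set, which can be inspected in place of a dead letter queue.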

Metadata

Assignees: none

Labels:

  • P1-high (Stability & reliability)
  • audit-2026-03 (From March 2026 codebase audit)
  • reliability (Reliability or availability concern)
  • service:api (5stackgg/api service)

Status: Backlog