Summary
backend/gateway/src/dependencies/grpc.py builds channels once at startup (grpc.aio.insecure_channel) and forwards errors as-is. A blip in any backend = 5xx storm on the user's screen. No retry, no timeout budget, no circuit breaker.
Changes
- Wrap every stub call in a retry helper: exponential backoff on UNAVAILABLE and DEADLINE_EXCEEDED, max 3 attempts, total budget 2s.
- Circuit breaker per upstream (auth, places, devices, collector): open after N consecutive failures in a window, half-open probe after cooldown. Use the purgatory or circuitbreaker library.
- Per-upstream timeout from env: GATEWAY__AUTH_TIMEOUT_MS etc.
- Metrics: gateway_grpc_retries_total, gateway_grpc_circuit_state{upstream}.
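
A minimal sketch of the retry helper from the first bullet, assuming an asyncio gateway. The names call_with_retry, is_retryable and base_delay_s are hypothetical and not taken from the codebase. The helper is kept independent of grpc so it can be exercised in isolation: the caller supplies the RPC as a callable that receives the remaining budget as its timeout.

```python
import asyncio
import random
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[float], Awaitable[T]],
    is_retryable: Callable[[BaseException], bool],
    max_attempts: int = 3,        # "max 3 attempts" from the issue
    total_budget_s: float = 2.0,  # "total budget 2s" from the issue
    base_delay_s: float = 0.05,   # assumed starting backoff, not specified in the issue
) -> T:
    """Run call(timeout) with exponential backoff inside one shared time budget."""
    deadline = time.monotonic() + total_budget_s
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("retry budget exhausted")
        try:
            # Hand the remaining budget to the call as its per-attempt timeout.
            return await call(remaining)
        except Exception as exc:
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Full jitter: sleep a random time up to base * 2^(attempt - 1).
            delay = random.uniform(0, base_delay_s * 2 ** (attempt - 1))
            if deadline - time.monotonic() <= delay:
                raise  # not enough budget left for another attempt
            await asyncio.sleep(delay)
    raise AssertionError("unreachable")
```

With grpc.aio, call might be `lambda t: stub.SomeMethod(request, timeout=t)` (stub and method names are placeholders), and is_retryable would check that the exception is a grpc.aio.AioRpcError whose code() is grpc.StatusCode.UNAVAILABLE or grpc.StatusCode.DEADLINE_EXCEEDED. The gateway_grpc_retries_total counter could be incremented just before the sleep.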
Verification
- docker restart auth under light load: brief latency bump, no 5xx spike, retries visible in metrics.
- Prolonged auth outage: circuit opens, subsequent requests fail fast with a clear error.
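
The fail-fast behavior in the outage check relies on a small per-upstream state machine. The sketch below is hand-rolled for illustration only, since the issue proposes using purgatory or circuitbreaker instead. All class and method names are hypothetical. It counts consecutive failures without the time window mentioned in the issue, and takes an injectable clock so the transitions can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised instead of calling an upstream whose circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,   # the "N" from the issue; value assumed
        cooldown_s: float = 10.0,     # assumed cooldown before a probe
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self._probe_in_flight = False

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def before_call(self) -> None:
        """Call before each RPC; raises if the request should fail fast."""
        state = self.state
        if state == "open":
            raise CircuitOpenError("upstream circuit is open")
        if state == "half_open":
            # Allow exactly one probe through while half-open.
            if self._probe_in_flight:
                raise CircuitOpenError("probe already in flight")
            self._probe_in_flight = True

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self._probe_in_flight = False
        self._failures += 1
        # A failed probe, or N consecutive failures, (re)opens the circuit.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

One instance per upstream (auth, places, devices, collector) would let the gateway map CircuitOpenError to a clear error response and export state as gateway_grpc_circuit_state{upstream}.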