Backend: clock skew tolerance + JSON 408 diagnostics (#5929) #5932
beastoin merged 2 commits into collab/5929-integration
- Add HTTP_CLOCK_SKEW_ALLOWANCE env var (default 5min) for clock drift
- Effective stale threshold = max_age + skew_allowance (10min default)
- 408 response returns JSON with server_time, client_time, skew_seconds so the app can detect drift and show a user-facing warning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

12 tests: tolerance boundaries, JSON diagnostics fields, env var config, zero-allowance fallback, multipart with skew, 504 timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary
This PR extends … Key observations:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Client
participant TimeoutMiddleware
participant App
Client->>TimeoutMiddleware: Request + X-Request-Start-Time header
TimeoutMiddleware->>TimeoutMiddleware: request_age = server_time - client_time
alt request_age > max_age + skew_allowance (default 10 min)
TimeoutMiddleware-->>Client: 408 JSON {error, server_time, client_time, skew_seconds, hint}
else header absent or malformed
TimeoutMiddleware->>App: pass through
else request_age within threshold
TimeoutMiddleware->>App: pass through
App-->>TimeoutMiddleware: response
TimeoutMiddleware-->>Client: response
end
note over TimeoutMiddleware: asyncio.wait_for wraps call_next
alt execution exceeds per-method timeout
TimeoutMiddleware-->>Client: 504 Gateway Timeout
end
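
The staleness check in the diagram above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `timeout.py`: the function name `stale_request_body`, the two module constants, and the `message` and `hint` strings are hypothetical, and the header is assumed to carry a Unix timestamp in seconds.

```python
import time
from typing import Optional

# Assumed defaults taken from the PR description: 5 min max age + 5 min skew allowance.
MAX_AGE_SECONDS = 300.0
SKEW_ALLOWANCE_SECONDS = 300.0  # would come from HTTP_CLOCK_SKEW_ALLOWANCE


def stale_request_body(
    header_value: Optional[str],
    now: Optional[float] = None,
    max_age: float = MAX_AGE_SECONDS,
    skew_allowance: float = SKEW_ALLOWANCE_SECONDS,
) -> Optional[dict]:
    """Return a 408 JSON body if the request is stale, else None (pass through)."""
    if now is None:
        now = time.time()
    if header_value is None:
        return None  # header absent: pass through
    try:
        client_time = float(header_value)
    except ValueError:
        return None  # header malformed: pass through

    request_age = now - client_time
    if request_age <= max_age + skew_allowance:
        return None  # within the effective threshold

    return {
        "error": "clock_skew",
        "message": "Request rejected: device clock may be out of sync",  # placeholder text
        "server_time": now,
        "client_time": client_time,
        "skew_seconds": round(request_age, 1),
        "hint": "Check the device date and time settings",  # placeholder text
    }


body = stale_request_body("100", now=1000.0)
print(body["skew_seconds"] if body else "pass through")
```

A request exactly at the threshold (600 s with the assumed defaults) passes through, matching the strict `>` comparison in the diagram.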
import asyncio
import os
"message": "Request rejected — your device clock may be out of sync",
"server_time": current_time,
"client_time": request_start_time,
"skew_seconds": round(request_age, 1),
skew_seconds field contains request age, not clock skew
round(request_age, 1) is the apparent age of the request from the server's perspective (i.e. current_time - client_reported_time). This includes both the genuine elapsed time since the request was sent and any clock skew component; it isn't the clock skew in isolation.
For a request that took 5 s to arrive with a 650 s skew, request_age would be ~655 s, but the actual skew is 650 s. Labelling this skew_seconds is semantically misleading for API consumers (e.g. mobile apps parsing the JSON to display a friendly error). Consider renaming to apparent_age_seconds (or similar) to be accurate, and optionally add a separate field for the estimated skew if needed:
Suggested change:
- "skew_seconds": round(request_age, 1),
+ "skew_seconds": round(request_age - self.maximum_age_seconds, 1),
or rename:
"apparent_age_seconds": round(request_age, 1),
The test assertion on line 84 (body["skew_seconds"] >= 900) also asserts the request age rather than the skew, confirming the current semantics.
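
The reviewer's arithmetic can be checked with a short worked example. This is only an illustration: the timestamps are made up, the client clock is assumed to run 650 s behind the server, and `maximum_age_seconds` is assumed to be 300 (the 5 min base max age stated in the PR).

```python
# All times are on the server's clock unless noted; values are illustrative.
true_send_time = 10_000.0      # when the client actually sent the request
transit_seconds = 5.0          # genuine elapsed time before the server saw it
clock_skew_seconds = 650.0     # client clock runs this far behind the server
maximum_age_seconds = 300.0    # assumed base max age (5 min)

# What the client writes into X-Request-Start-Time (its own, slow clock).
client_reported_time = true_send_time - clock_skew_seconds

# What the middleware computes when the request arrives.
server_time = true_send_time + transit_seconds
request_age = server_time - client_reported_time

print(request_age)                        # elapsed time plus skew
print(request_age - clock_skew_seconds)   # the transit component alone
print(request_age - maximum_age_seconds)  # value produced by the suggested subtraction
```

Under these numbers `request_age` is 655 s, which is transit plus skew. The suggested subtraction of `maximum_age_seconds` gives 355 s, so it does not isolate the 650 s skew either; renaming the field may be the more accurate of the two options.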
…5934)

Fixes #5929: voice chat transcription fails with 408 when user's device clock is out of sync with server.

## Changes

### Backend (kenji, sub-PR #5932)
- Add `HTTP_CLOCK_SKEW_ALLOWANCE` env var (default 5 min) to `TimeoutMiddleware`
- Stale request threshold becomes `max_age + skew_allowance` (effective 10 min)
- Return structured JSON on 408 with `server_time`, `client_time`, `skew_seconds`, `hint` for client-side detection
- 13 unit tests

### App (kelvin, sub-PRs #5937, #5938)
- **New: `ClockSkewDetector`** singleton (`backend/http/clock_skew_detector.dart`): parses 408 JSON, emits typed `ClockSkewEvent` via broadcast stream, rate-limits (45s cooldown)
- **`shared.dart` cleaned**: delegates to `ClockSkewDetector.instance.checkResponse()`, zero UI imports (no `app_globals`, `AppLocalizations`, `AppSnackbar`)
- **`AppShell` subscribes** to `ClockSkewDetector.onClockSkew` stream and shows a localized snackbar with proper `BuildContext`
- Content-type check before JSON parsing (ignores HTML/text 408s from proxies)
- `clockSkewWarning(minutes)` l10n key in all 34 locales
- 28 unit tests: parsing (17), skewMinutes (3), checkResponse cooldown/emission/broadcast (8)

### Architecture (review feedback)
- **Separation of concerns**: HTTP transport layer (`shared.dart`) no longer controls UI; it detects and emits, `AppShell` subscribes and renders
- **Consistent with 401 pattern**: 401 handling does domain actions (refresh/signout), UI reacts via auth state at higher layers. Clock skew now follows the same boundary.
- **Testability**: Tests import and verify real production classes directly instead of duplicating private logic
- **Scalable**: Broadcast stream pattern supports future global HTTP signals (rate limit warnings, maintenance mode, etc.)

- `.coordination/` added to `.gitignore`
- Fixed pre-existing `MyApp.navigatorKey` → `globalNavigatorKey` in `device_provider.dart`
- Rebased onto latest `main`

## Test Results
- **Backend**: 13/13 unit tests pass (`test_timeout_middleware.py`)
- **App**: 28/28 unit tests pass (`clock_skew_detection_test.dart`), testing real production classes
- CP7 reviewer: PR_APPROVED_LGTM
- CP8 tester: TESTS_APPROVED

## CP9 L2 Evidence (Backend + App Integrated)

**Setup**: Local proxy backend (port 10150) returning 408 `{error: "clock_skew", skew_seconds: 900}` + Flutter app on emulator (commit `a47a06b1d`).

**Screenshot**: snackbar visible at bottom.

**Verified behaviors**:
1. `ClockSkewDetector.parseResponse` correctly parses `clock_skew` JSON from 408 responses
2. Content-type check ignores non-JSON 408s (confirmed against prod API which returns `text/html`)
3. `AppShell` stream subscriber shows localized snackbar with correct minutes (900s → ~15 min)
4. Rate limiter: 48 concurrent 408s → 1 snackbar (45s cooldown)
5. Warning logging for all detections

**Evidence**: [Screenshot](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_snackbar_v2.webp) · [Flutter logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_flutter_logs_v2.txt) · [Proxy logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_proxy_logs_v2.txt)

## Deployment Steps

### 1. Backend first (backward-compatible)

```bash
# Set env var in Helm values (optional; default is 300s / 5 min)
HTTP_CLOCK_SKEW_ALLOWANCE=300

# Deploy backend-listen
gh workflow run gcp_backend.yml -f environment=prod -f branch=main

# Verify: stale request returns JSON 408 (not plain text)
curl -s -H "X-Request-Start-Time: $(echo "$(date +%s) - 900" | bc)" \
  https://api.omiapi.com/health | python3 -m json.tool
# Expected: {"error": "clock_skew", "server_time": ..., "skew_seconds": ...}
```

### 2. App second (backward-compatible)
- The app change is backward-compatible: it only parses 408 with `content-type: json` AND `error: "clock_skew"`
- Old backend 408s (text/html) are safely ignored
- Release via normal mobile release pipeline (App Store + Google Play)

### Order matters
- Backend **must** deploy first so the structured JSON 408 is available
- App can deploy anytime after; it gracefully handles both old (text) and new (JSON) 408 formats

## Changed-Path Coverage

| Path | Changed code | Happy | Non-happy | L1 | L2 |
|------|-------------|-------|-----------|----|----|
| P1 | `timeout.py:dispatch`: skew tolerance + JSON 408 | Fresh → 200 | 15min stale → 408 JSON | PASS | PASS |
| P2 | `clock_skew_detector.dart:parseResponse`: JSON parse | Valid 408 → parsed | Non-JSON/malformed → null | PASS (17 tests) | PASS |
| P3 | `clock_skew_detector.dart:checkResponse`: rate-limit + stream emit | First → event emitted | Second <45s → suppressed; 45s exact → emits | PASS (8 tests) | PASS |
| P4 | `shared.dart:makeApiCall`: delegates to detector | 408 → detector called | Non-408 → skipped | PASS | PASS |
| P5 | `shared.dart:makeMultipartApiCall`: delegates to detector | 408 → detector called | Non-408 → skipped | PASS | PASS |
| P6 | `app_shell.dart:initState`: stream subscriber | Event → snackbar shown | Unmounted → ignored | PASS | PASS |

CP9C skipped (no cluster/infra deps).

---
_by AI for @beastoin_
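
The client-side behavior described above (content-type check, `clock_skew` parsing, 45 s cooldown) can be sketched as follows. The app itself is written in Dart; this Python version is only a language-neutral illustration, and `parse_clock_skew` and `CooldownGate` are hypothetical names, not the API of `ClockSkewDetector`.

```python
import json
from typing import Optional

COOLDOWN_SECONDS = 45.0  # cooldown stated in the PR description


def parse_clock_skew(status: int, content_type: str, body: str) -> Optional[float]:
    """Return skew_seconds for a JSON clock_skew 408, else None."""
    if status != 408:
        return None
    if "json" not in content_type.lower():
        return None  # e.g. text/html 408s from proxies are ignored
    try:
        data = json.loads(body)
    except ValueError:
        return None  # malformed JSON
    if not isinstance(data, dict) or data.get("error") != "clock_skew":
        return None
    skew = data.get("skew_seconds")
    if isinstance(skew, bool) or not isinstance(skew, (int, float)):
        return None
    return float(skew)


class CooldownGate:
    """Lets one event through, then suppresses events until the cooldown elapses."""

    def __init__(self, cooldown: float = COOLDOWN_SECONDS) -> None:
        self._cooldown = cooldown
        self._last_emit: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self._last_emit is not None and now - self._last_emit < self._cooldown:
            return False  # still inside the cooldown window
        self._last_emit = now
        return True


gate = CooldownGate()
skew = parse_clock_skew(408, "application/json", '{"error": "clock_skew", "skew_seconds": 900}')
if skew is not None and gate.allow(now=0.0):
    print(f"clock skew warning: about {round(skew / 60)} min")
```

The strict `<` comparison in `allow` mirrors the coverage table: a second event inside 45 s is suppressed, while one at exactly 45 s is emitted.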
Summary
Backend sub-PR for issue #5929 collab. Adds clock skew tolerance to TimeoutMiddleware and returns JSON 408 with diagnostic info.

Changes
- backend/utils/other/timeout.py: HTTP_CLOCK_SKEW_ALLOWANCE env var (default 5min) added to stale threshold
- max_age (5min) + skew_allowance (5min) = 10min
- {error, message, server_time, client_time, skew_seconds, hint}
- backend/tests/unit/test_timeout_middleware.py: 12 tests

Ownership
- backend/utils/other/timeout.py, backend/tests/unit/test_timeout_middleware.py
- app/ files (separate sub-PR)

Status
Fixes #5929 (backend portion)

🤖 Generated with Claude Code