Skip to content

Backend: clock skew tolerance + JSON 408 diagnostics (#5929)#5932

Merged
beastoin merged 2 commits intocollab/5929-integrationfrom
collab/5929-kenji-backend
Mar 23, 2026
Merged

Backend: clock skew tolerance + JSON 408 diagnostics (#5929)#5932
beastoin merged 2 commits intocollab/5929-integrationfrom
collab/5929-kenji-backend

Conversation

@beastoin
Copy link
Copy Markdown
Collaborator

Summary

Backend sub-PR for issue #5929 collab. Adds clock skew tolerance to TimeoutMiddleware and returns JSON 408 with diagnostic info.

Changes

backend/utils/other/timeout.py:

  • HTTP_CLOCK_SKEW_ALLOWANCE env var (default 5min) added to stale threshold
  • Effective threshold = max_age (5min) + skew_allowance (5min) = 10min
  • 408 returns JSON: {error, message, server_time, client_time, skew_seconds, hint}

backend/tests/unit/test_timeout_middleware.py — 12 tests

Ownership

  • kenji owns: backend/utils/other/timeout.py, backend/tests/unit/test_timeout_middleware.py
  • kelvin owns: app/ files (separate sub-PR)

Status

  • 12 unit tests pass
  • Codex reviewer: PR_APPROVED_LGTM
  • Codex tester: TESTS_APPROVED

Fixes #5929 (backend portion)

🤖 Generated with Claude Code

beastoin and others added 2 commits March 23, 2026 04:05
- Add HTTP_CLOCK_SKEW_ALLOWANCE env var (default 5min) for clock drift
- Effective stale threshold = max_age + skew_allowance (10min default)
- 408 response returns JSON with server_time, client_time, skew_seconds
  so the app can detect drift and show user-facing warning

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12 tests: tolerance boundaries, JSON diagnostics fields, env var
config, zero-allowance fallback, multipart with skew, 504 timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 23, 2026

Greptile Summary

This PR extends TimeoutMiddleware with clock-skew tolerance and richer 408 diagnostics, addressing the root cause of spurious request rejections from mobile clients whose clocks drift behind the server. The effective stale threshold becomes max_age + skew_allowance (default 10 min total), and a rejected request now returns structured JSON with server_time, client_time, skew_seconds, and a user-facing hint rather than a bare 408 text body.

Key observations:

  • The clock skew allowance is correctly read once at middleware construction time (not per-request), so it's efficient and env-var overrides work cleanly via monkeypatch.
  • The skew_seconds field in the 408 payload is actually the full apparent request age (current_time − client_reported_time), not the isolated clock skew. Clients parsing this field to display a "your clock is X seconds off" message may mislead users; consider renaming to apparent_age_seconds or adjusting the value to request_age − maximum_age_seconds.
  • import os in the test file is unused and can be removed.
  • Test coverage (12 tests) is thorough, hitting boundary conditions, env-var overrides, malformed/missing/future headers, multipart uploads, and the 504 timeout path.

Confidence Score: 4/5

  • This PR is safe to merge; the two remaining notes are non-blocking style suggestions.
  • The core logic is correct, the feature works as described, and test coverage is comprehensive. The only open items are a misleading JSON field name (skew_seconds vs. apparent age) and a single unused import — neither is a runtime bug nor a data-loss/security risk.
  • No files require special attention; backend/utils/other/timeout.py has one minor naming concern in the JSON response payload.

Important Files Changed

Filename Overview
backend/utils/other/timeout.py Adds clock skew tolerance (configurable via HTTP_CLOCK_SKEW_ALLOWANCE, default 5 min) to the stale-request check, and replaces the plain-text 408 with a structured JSON diagnostic payload. Logic is sound; minor issue: the skew_seconds JSON field contains the full request apparent age, not the isolated clock skew.
backend/tests/unit/test_timeout_middleware.py 12 well-structured unit tests covering fresh requests, boundary conditions, clock skew tolerance, multipart uploads, malformed/missing/future headers, env-var overrides, and the 504 timeout path. One unused import os on line 13.

Sequence Diagram

sequenceDiagram
    participant Client
    participant TimeoutMiddleware
    participant App

    Client->>TimeoutMiddleware: Request + X-Request-Start-Time header
    TimeoutMiddleware->>TimeoutMiddleware: request_age = server_time - client_time
    alt request_age > max_age + skew_allowance (default 10 min)
        TimeoutMiddleware-->>Client: 408 JSON {error, server_time, client_time, skew_seconds, hint}
    else header absent or malformed
        TimeoutMiddleware->>App: pass through
    else request_age within threshold
        TimeoutMiddleware->>App: pass through
        App-->>TimeoutMiddleware: response
        TimeoutMiddleware-->>Client: response
    end
    note over TimeoutMiddleware: asyncio.wait_for wraps call_next
    alt execution exceeds per-method timeout
        TimeoutMiddleware-->>Client: 504 Gateway Timeout
    end
Loading

Reviews (1): Last reviewed commit: "Add unit tests for clock skew tolerance ..." | Re-trigger Greptile

"""

import asyncio
import os
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unused import

os is imported but never referenced in the test file — all env manipulation goes through monkeypatch.setenv(). Safe to remove.

Suggested change
import os
import asyncio

"message": "Request rejected — your device clock may be out of sync",
"server_time": current_time,
"client_time": request_start_time,
"skew_seconds": round(request_age, 1),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 skew_seconds field contains request age, not clock skew

round(request_age, 1) is the apparent age of the request from the server's perspective (i.e. current_time - client_reported_time). This includes both the genuine elapsed time since the request was sent and any clock skew component; it isn't the clock skew in isolation.

For a request that took 5 s to arrive with a 650 s skew, request_age would be ~655 s, but the actual skew is 650 s. Labelling this skew_seconds is semantically misleading for API consumers (e.g. mobile apps parsing the JSON to display a friendly error). Consider renaming to apparent_age_seconds (or similar) to be accurate, and optionally add a separate field for the estimated skew if needed:

Suggested change
"skew_seconds": round(request_age, 1),
"skew_seconds": round(request_age - self.maximum_age_seconds, 1),

or rename:

                            "apparent_age_seconds": round(request_age, 1),

The test assertion on line 84 (body["skew_seconds"] >= 900) also asserts the request age rather than the skew, confirming the current semantics.

@beastoin beastoin merged commit 19deaf6 into collab/5929-integration Mar 23, 2026
2 checks passed
@beastoin beastoin deleted the collab/5929-kenji-backend branch March 23, 2026 04:22
beastoin added a commit that referenced this pull request Mar 23, 2026
…5934)

Fixes #5929 — voice chat transcription fails with 408 when user's device
clock is out of sync with server.

## Changes

### Backend (kenji, sub-PR #5932)
- Add `HTTP_CLOCK_SKEW_ALLOWANCE` env var (default 5 min) to
`TimeoutMiddleware`
- Stale request threshold becomes `max_age + skew_allowance` (effective
10 min)
- Return structured JSON on 408 with `server_time`, `client_time`,
`skew_seconds`, `hint` for client-side detection
- 13 unit tests

### App (kelvin, sub-PRs #5937, #5938)
- **New: `ClockSkewDetector`** singleton
(`backend/http/clock_skew_detector.dart`) — parses 408 JSON, emits typed
`ClockSkewEvent` via broadcast stream, rate-limits (45s cooldown)
- **`shared.dart` cleaned** — delegates to
`ClockSkewDetector.instance.checkResponse()`, zero UI imports (no
`app_globals`, `AppLocalizations`, `AppSnackbar`)
- **`AppShell` subscribes** to `ClockSkewDetector.onClockSkew` stream —
shows localized snackbar with proper `BuildContext`
- Content-type check before JSON parsing (ignores HTML/text 408s from
proxies)
- `clockSkewWarning(minutes)` l10n key in all 34 locales
- 28 unit tests — parsing (17), skewMinutes (3), checkResponse
cooldown/emission/broadcast (8)

### Architecture (review feedback)
- **Separation of concerns**: HTTP transport layer (`shared.dart`) no
longer controls UI — it detects and emits, `AppShell` subscribes and
renders
- **Consistent with 401 pattern**: 401 handling does domain actions
(refresh/signout), UI reacts via auth state at higher layers. Clock skew
now follows the same boundary.
- **Testability**: Tests import and verify real production classes
directly instead of duplicating private logic
- **Scalable**: Broadcast stream pattern supports future global HTTP
signals (rate limit warnings, maintenance mode, etc.)
- `.coordination/` added to `.gitignore`
- Fixed pre-existing `MyApp.navigatorKey` → `globalNavigatorKey` in
`device_provider.dart`
- Rebased onto latest `main`

## Test Results

- **Backend**: 13/13 unit tests pass (`test_timeout_middleware.py`)
- **App**: 28/28 unit tests pass (`clock_skew_detection_test.dart`) —
tests real production classes
- CP7 reviewer: PR_APPROVED_LGTM
- CP8 tester: TESTS_APPROVED

## CP9 L2 Evidence (Backend + App Integrated)

**Setup**: Local proxy backend (port 10150) returning 408 `{error:
"clock_skew", skew_seconds: 900}` + Flutter app on emulator (commit
`a47a06b1d`).

**Screenshot** — snackbar visible at bottom:
![Clock skew
snackbar](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_snackbar_v2.webp)

**Verified behaviors**:
1. `ClockSkewDetector.parseResponse` correctly parses `clock_skew` JSON
from 408 responses
2. Content-type check ignores non-JSON 408s (confirmed against prod API
which returns `text/html`)
3. `AppShell` stream subscriber shows localized snackbar with correct
minutes (900s → ~15 min)
4. Rate limiter: 48 concurrent 408s → 1 snackbar (45s cooldown)
5. Warning logging for all detections

**Evidence**:
[Screenshot](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_snackbar_v2.webp)
· [Flutter
logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_flutter_logs_v2.txt)
· [Proxy
logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_proxy_logs_v2.txt)

## Deployment Steps

### 1. Backend first (backward-compatible)
```bash
# Set env var in Helm values (optional — default is 300s / 5 min)
HTTP_CLOCK_SKEW_ALLOWANCE=300

# Deploy backend-listen
gh workflow run gcp_backend.yml -f environment=prod -f branch=main

# Verify: stale request returns JSON 408 (not plain text)
curl -s -H "X-Request-Start-Time: $(echo "$(date +%s) - 900" | bc)" \
  https://api.omiapi.com/health | python3 -m json.tool
# Expected: {"error": "clock_skew", "server_time": ..., "skew_seconds": ...}
```

### 2. App second (backward-compatible)
- The app change is backward-compatible: it only parses 408 with
`content-type: json` AND `error: "clock_skew"`
- Old backend 408s (text/html) are safely ignored
- Release via normal mobile release pipeline (App Store + Google Play)

### Order matters
- Backend **must** deploy first so the structured JSON 408 is available
- App can deploy anytime after — it gracefully handles both old (text)
and new (JSON) 408 formats

## Changed-Path Coverage

| Path | Changed code | Happy | Non-happy | L1 | L2 |
|------|-------------|-------|-----------|----|----|
| P1 | `timeout.py:dispatch` — skew tolerance + JSON 408 | Fresh → 200 |
15min stale → 408 JSON | PASS | PASS |
| P2 | `clock_skew_detector.dart:parseResponse` — JSON parse | Valid 408
→ parsed | Non-JSON/malformed → null | PASS (17 tests) | PASS |
| P3 | `clock_skew_detector.dart:checkResponse` — rate-limit + stream
emit | First → event emitted | Second <45s → suppressed; 45s exact →
emits | PASS (8 tests) | PASS |
| P4 | `shared.dart:makeApiCall` — delegates to detector | 408 →
detector called | Non-408 → skipped | PASS | PASS |
| P5 | `shared.dart:makeMultipartApiCall` — delegates to detector | 408
→ detector called | Non-408 → skipped | PASS | PASS |
| P6 | `app_shell.dart:initState` — stream subscriber | Event → snackbar
shown | Unmounted → ignored | PASS | PASS |

CP9C skipped (no cluster/infra deps).

---
_by AI for @beastoin_
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
…rdware#5929) (BasedHardware#5934)

Fixes BasedHardware#5929 — voice chat transcription fails with 408 when user's device
clock is out of sync with server.

## Changes

### Backend (kenji, sub-PR BasedHardware#5932)
- Add `HTTP_CLOCK_SKEW_ALLOWANCE` env var (default 5 min) to
`TimeoutMiddleware`
- Stale request threshold becomes `max_age + skew_allowance` (effective
10 min)
- Return structured JSON on 408 with `server_time`, `client_time`,
`skew_seconds`, `hint` for client-side detection
- 13 unit tests

### App (kelvin, sub-PRs BasedHardware#5937, BasedHardware#5938)
- **New: `ClockSkewDetector`** singleton
(`backend/http/clock_skew_detector.dart`) — parses 408 JSON, emits typed
`ClockSkewEvent` via broadcast stream, rate-limits (45s cooldown)
- **`shared.dart` cleaned** — delegates to
`ClockSkewDetector.instance.checkResponse()`, zero UI imports (no
`app_globals`, `AppLocalizations`, `AppSnackbar`)
- **`AppShell` subscribes** to `ClockSkewDetector.onClockSkew` stream —
shows localized snackbar with proper `BuildContext`
- Content-type check before JSON parsing (ignores HTML/text 408s from
proxies)
- `clockSkewWarning(minutes)` l10n key in all 34 locales
- 28 unit tests — parsing (17), skewMinutes (3), checkResponse
cooldown/emission/broadcast (8)

### Architecture (review feedback)
- **Separation of concerns**: HTTP transport layer (`shared.dart`) no
longer controls UI — it detects and emits, `AppShell` subscribes and
renders
- **Consistent with 401 pattern**: 401 handling does domain actions
(refresh/signout), UI reacts via auth state at higher layers. Clock skew
now follows the same boundary.
- **Testability**: Tests import and verify real production classes
directly instead of duplicating private logic
- **Scalable**: Broadcast stream pattern supports future global HTTP
signals (rate limit warnings, maintenance mode, etc.)
- `.coordination/` added to `.gitignore`
- Fixed pre-existing `MyApp.navigatorKey` → `globalNavigatorKey` in
`device_provider.dart`
- Rebased onto latest `main`

## Test Results

- **Backend**: 13/13 unit tests pass (`test_timeout_middleware.py`)
- **App**: 28/28 unit tests pass (`clock_skew_detection_test.dart`) —
tests real production classes
- CP7 reviewer: PR_APPROVED_LGTM
- CP8 tester: TESTS_APPROVED

## CP9 L2 Evidence (Backend + App Integrated)

**Setup**: Local proxy backend (port 10150) returning 408 `{error:
"clock_skew", skew_seconds: 900}` + Flutter app on emulator (commit
`a47a06b1d`).

**Screenshot** — snackbar visible at bottom:
![Clock skew
snackbar](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_snackbar_v2.webp)

**Verified behaviors**:
1. `ClockSkewDetector.parseResponse` correctly parses `clock_skew` JSON
from 408 responses
2. Content-type check ignores non-JSON 408s (confirmed against prod API
which returns `text/html`)
3. `AppShell` stream subscriber shows localized snackbar with correct
minutes (900s → ~15 min)
4. Rate limiter: 48 concurrent 408s → 1 snackbar (45s cooldown)
5. Warning logging for all detections

**Evidence**:
[Screenshot](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_snackbar_v2.webp)
· [Flutter
logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_flutter_logs_v2.txt)
· [Proxy
logs](https://storage.googleapis.com/omi-pr-assets/pr-5934/cp9b_proxy_logs_v2.txt)

## Deployment Steps

### 1. Backend first (backward-compatible)
```bash
# Set env var in Helm values (optional — default is 300s / 5 min)
HTTP_CLOCK_SKEW_ALLOWANCE=300

# Deploy backend-listen
gh workflow run gcp_backend.yml -f environment=prod -f branch=main

# Verify: stale request returns JSON 408 (not plain text)
curl -s -H "X-Request-Start-Time: $(echo "$(date +%s) - 900" | bc)" \
  https://api.omiapi.com/health | python3 -m json.tool
# Expected: {"error": "clock_skew", "server_time": ..., "skew_seconds": ...}
```

### 2. App second (backward-compatible)
- The app change is backward-compatible: it only parses 408 with
`content-type: json` AND `error: "clock_skew"`
- Old backend 408s (text/html) are safely ignored
- Release via normal mobile release pipeline (App Store + Google Play)

### Order matters
- Backend **must** deploy first so the structured JSON 408 is available
- App can deploy anytime after — it gracefully handles both old (text)
and new (JSON) 408 formats

## Changed-Path Coverage

| Path | Changed code | Happy | Non-happy | L1 | L2 |
|------|-------------|-------|-----------|----|----|
| P1 | `timeout.py:dispatch` — skew tolerance + JSON 408 | Fresh → 200 |
15min stale → 408 JSON | PASS | PASS |
| P2 | `clock_skew_detector.dart:parseResponse` — JSON parse | Valid 408
→ parsed | Non-JSON/malformed → null | PASS (17 tests) | PASS |
| P3 | `clock_skew_detector.dart:checkResponse` — rate-limit + stream
emit | First → event emitted | Second <45s → suppressed; 45s exact →
emits | PASS (8 tests) | PASS |
| P4 | `shared.dart:makeApiCall` — delegates to detector | 408 →
detector called | Non-408 → skipped | PASS | PASS |
| P5 | `shared.dart:makeMultipartApiCall` — delegates to detector | 408
→ detector called | Non-408 → skipped | PASS | PASS |
| P6 | `app_shell.dart:initState` — stream subscriber | Event → snackbar
shown | Unmounted → ignored | PASS | PASS |

CP9C skipped (no cluster/infra deps).

---
_by AI for @beastoin_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant