Skip to content

feat: retry transient gateway 5xx via urllib3 Retry (#139)#144

Merged
martinkersner merged 1 commit into
mainfrom
feat/retry-transient-5xx
Jul 3, 2026
Merged

feat: retry transient gateway 5xx via urllib3 Retry (#139)#144
martinkersner merged 1 commit into
mainfrom
feat/retry-transient-5xx

Conversation

@martinkersner

Copy link
Copy Markdown
Member

Closes #139

Summary

First of the roadmap issues. The prod edge intermittently returns 502/503/504 under load — tests/conftest.py already documents and retries this, but only for the test suite, so real consumers got a ServerError on the first transient blip.

Mount a urllib3.util.retry.Retry adapter on the shared Session (the single one introduced in #137):

  • Retries retry_statuses (default 502, 503, 504) + connection/read failures, GET only (all endpoints are GET).
  • backoff_factor between attempts, honors Retry-After.
  • raise_on_status=False so an exhausted retry returns the final response and the SDK's own _handle_exception still raises ServerErrorerror contract unchanged, no leaked urllib3 MaxRetryError.

On by default (max_retries=3). Tunable via max_retries / retry_backoff / retry_statuses; max_retries=0 disables. Rides through **kwargs, so Datamaxi(api_key, max_retries=...) reaches every sub-client.

Design choices (confirmed before implementing)

  • urllib3 Retry on the Session over a manual loop — standard, handles connect/read/status + backoff + Retry-After in one place.
  • Enabled by default — the flakiness is real and already documented; opt-out via max_retries=0.

Tests / verification

  • New tests/test_retry.py (5 tests): asserts the mounted policy config (total, status_forcelist, backoff, respect_retry_after_header, raise_on_status, GET-only), max_retries=0, tunability, shared adapter, and propagation through Datamaxi.
  • responses intercepts above urllib3, so it can't drive retries — verified end-to-end against a local server instead: default client 503,503,200 → one transparent success (3 hits); max_retries=0ServerError on the first 503.
  • pytest -m "not integration": 139 passed, 11 skipped (was 134 + 5 new).
  • black + flake8 clean.

Notes

  • Mocked lane unaffected: responses bypasses the adapter Retry, so existing 5xx tests stay fast and still assert ServerError.
  • conftest's live-lane retry now sits on top of SDK retry (harmless redundancy); simplifying/removing it is a possible follow-up, left out to keep this PR focused.

Known failures

None.

…sion

Closes #139

The prod edge intermittently returns 502/503/504 under load (documented in
conftest, which retried them only for the test suite). Consumers got a
ServerError on the first blip. Mount a urllib3 Retry adapter on the shared
Session: retries transient statuses + connect/read failures on GETs, honors
Retry-After, backoff. raise_on_status=False so an exhausted retry still flows
through _handle_exception as ServerError (error contract unchanged).

On by default (max_retries=3); tunable via max_retries/retry_backoff/
retry_statuses, max_retries=0 to disable. Verified end-to-end (503,503,200
-> one transparent success; max_retries=0 -> ServerError on first 503).
@martinkersner martinkersner merged commit b679a5f into main Jul 3, 2026
5 checks passed
@martinkersner martinkersner deleted the feat/retry-transient-5xx branch July 3, 2026 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Built-in retry/backoff for transient gateway 5xx in API.send_request

1 participant