Skip to content

perf: O(1) qsize, AWS Full Jitter, buffered recordings#111

Merged
TexasCoding merged 3 commits into
mainfrom
perf/issues-103-104-105
May 17, 2026
Merged

perf: O(1) qsize, AWS Full Jitter, buffered recordings#111
TexasCoding merged 3 commits into
mainfrom
perf/issues-103-104-105

Conversation

@TexasCoding

Copy link
Copy Markdown
Owner

Summary

Three disjoint performance fixes from the v1.2 audit:

Test deltas

Issue New test file Note
#103 tests/ws/test_backpressure.py (+) Counter correctness across put/get/drop
#104 tests/test_backoff.py (new) Verifies upper-bound invariant; deterministic via random.seed
#105 tests/test_mock_transport.py (+) Counter-based regression: exactly 1 save_pairs call for 50 sequential requests (no wall-clock)

Full suite: 1413 passed (+6), 48 skipped. ruff + mypy --strict clean.

Closes #103
Closes #104
Closes #105

Test plan

  • uv run pytest tests/ --ignore=tests/integration -v
  • uv run ruff check .
  • uv run mypy kalshi/

TexasCoding and others added 3 commits May 17, 2026 06:41
Replace the O(n) linear scan over the deque (subtracting the at-most-one
sentinel) with an integer counter incremented in put() and decremented in
get()/__anext__() and on DROP_OLDEST eviction. Measured 600x regression at
maxsize=1000 in the audit becomes a single attribute read.

Sentinel never counts toward the running total, so qsize() continues to
report logical occupancy unchanged.

Regression test exercises put, get, DROP_OLDEST eviction, and
put_sentinel paths to pin the counter against logical occupancy.

Closes #103

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the fixed-magnitude additive jitter (base * 2**attempt + uniform(0, 0.5))
with AWS "Full Jitter": uniform(0, min(cap, base * 2**attempt)). The previous
formula collapsed to <3% jitter at higher attempts and forced every parallel
client into a narrow ~0.5s window on attempt 0, defeating the purpose of jitter
during retry storms. Full Jitter spreads colliding retries evenly across the
entire backoff window.

retry_max_delay continues to cap the exponential term before randomization.

Tests assert: every sample stays within [0, capped_upper]; attempt-0 samples
cover the full window with distinct values; the cap clamps high attempts.

Closes #104

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RecordingTransport / AsyncRecordingTransport previously did
load_pairs -> append -> save_pairs (with atomic .tmp rename) for every single
request. With N sequential requests to one endpoint that's O(N^2) work over
the session: the last request re-parses and re-writes all 999 prior pairs.

Switch to a session-scoped _RecorderBuffer that lazily loads any existing
on-disk pairs once per (method, path), appends to an in-memory list, and
flushes only the dirty keys to disk on close()/aclose(). Re-recording into
the same fixtures dir still extends prior pairs (load-once preserves the
original semantics).

Regression test counts save_pairs invocations: 50 requests to one endpoint
must produce exactly one save_pairs call at flush time, holding a list of
50 pairs. Second test verifies cross-session append safety.

Closes #105

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented May 17, 2026

Copy link
Copy Markdown

Code Review — PR #111: O(1) qsize, AWS Full Jitter, buffered recordings

Overall this is a solid, well-scoped performance PR. Each fix is independently motivated, the test coverage matches the changes, and the implementation choices are sound. A few things worth calling out.


#103 — O(1) qsize() counter (kalshi/ws/backpressure.py)

Correct and well-executed. The counter is updated on every path that mutates the buffer (put, DROP_OLDEST eviction, get, __anext__), and the sentinel is correctly excluded from the count because:

  • put_sentinel() doesn't increment _size
  • get() and __anext__() raise StopAsyncIteration before decrementing when they hit the sentinel

Since asyncio is single-threaded, there are no race conditions within the hot path (no await between eviction and append).

Minor gap: There's no test that mixes get() and __anext__() calls on the same queue. Both paths decrement _size, and while they're identical in structure, a quick combined exercise would close that coverage hole.


#104 — AWS Full Jitter (kalshi/_base_client.py)

The formula is correctly implemented — cap is applied before randomization, so the upper bound never exceeds retry_max_delay. The tests are well-targeted: upper-bound invariant, spread check, and cap-clamp at high attempt numbers.

One behavioral change worth acknowledging: the old formula had a guaranteed minimum delay of retry_base_delay on attempt 0 (e.g., 0.5s). Full Jitter allows essentially 0-delay retries at any attempt. For a trading API, a first-retry delay near 0 on a 429 could re-hammer a rate-limited server immediately. The test explicitly asserts min(samples) < 0.1, confirming this is intentional — but it may be worth a note in the _compute_backoff docstring or KalshiConfig that retry_base_delay no longer acts as a minimum floor.


#105 — Buffered RecordingTransport (kalshi/testing/_recorder.py)

Clean rewrite. The _dirty set is a nice touch — flush only rewrites files that were actually touched, avoiding unnecessary I/O for endpoints not recorded in a session.

Correctness of try/finally: The close()/aclose() wrapping is correct — the real transport closes even if flush raises.

Blocking I/O in async context: AsyncRecordingTransport.aclose() calls self._buffer.flush() synchronously. flush() does disk writes without await. This blocks the event loop briefly during test teardown. For test infrastructure it's acceptable, but worth a short comment acknowledging the intentional sync-in-async call.

Removed inline comment: The old handle_request had the note "Recordings are expected to run sequentially." It was removed from the method body but still partially appears in the sync class docstring. The async handle_async_request is now missing this note entirely — async recording has a stronger race risk than sync, so it's more important there.

Test design: The monkey-patch approach for counting save_pairs calls is correct — patching the module attribute makes the lookup in flush() hit the stub. The try/finally cleanup restoring the real function is present and necessary. The cross-session append test is particularly valuable.


Tests summary

Change Test quality
#103 qsize counter Good — covers put/get/drop/sentinel transitions
#104 full jitter Good — upper-bound invariant + spread check + cap-clamp
#105 buffered flush Good — call-count regression + cross-session semantics

Verdict

Approve with minor suggestions. No blocking issues. All three changes are correct and well-tested. Suggested follow-ups (none blocking):

  1. Add a test mixing get() and __anext__() on the same queue to fully close the counter coverage gap.
  2. Add a note to _compute_backoff or KalshiConfig.retry_base_delay that Full Jitter allows near-zero first-retry delays (the old minimum floor from additive jitter is gone).
  3. Restore the sequential-recordings comment in AsyncRecordingTransport.handle_async_request — it matters more there than in the sync version.

@TexasCoding TexasCoding merged commit ca0d125 into main May 17, 2026
4 checks passed
@TexasCoding TexasCoding deleted the perf/issues-103-104-105 branch May 17, 2026 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant