@jkebinger (Collaborator) commented Jan 22, 2026

Summary

This PR addresses issues where customers were reporting stale configuration data. Investigation revealed several failure modes where the SSE streaming connection could silently fail or never start, leaving clients stuck with outdated configs.

Changes

  • SSE Watchdog: New monitoring thread that detects stuck SSE connections by tracking keepalive activity. If no data is received for 120 seconds (4 missed 30s keepalives), it triggers recovery by polling for fresh data and forcing SSE reconnection.

  • Fixed 401/403 handling: The previous code caught UnauthorizedException, which raise_for_status() never raises. The code now catches HTTPError and inspects response.status_code for 401/403.

  • Fixed silent loop exits: Changed except Exception to except BaseException and added finally block logging to detect when the streaming loop exits unexpectedly.

  • Fixed streaming startup on checkpoint failure: If checkpoint loading fails (CDN down, unexpected exception), streaming now starts as a fallback so SSE can potentially load configs. Preserves timeout behavior for get() calls.

  • Dev runner script: Added dev_runner.py for observing SDK behavior during development.

Files Changed

| File | Description |
| --- | --- |
| sdk_reforge/_sse_watchdog.py | New watchdog implementation |
| sdk_reforge/_sse_connection_manager.py | Watchdog integration, HTTPError handling, BaseException catch |
| sdk_reforge/config_sdk.py | Watchdog wiring, checkpoint failure fallback |
| tests/test_sse_watchdog.py | 16 new tests for watchdog |
| tests/test_sse_connection_manager.py | Updated tests for 401/403 handling |
| tests/test_config_sdk.py | New tests for checkpoint error handling |
| dev_runner.py | Development script for observing SDK |

Test plan

  • All 384 existing tests pass
  • New watchdog tests cover: touch updates, recovery triggers, exception handling, thread lifecycle
  • New 401/403 tests verify handle_unauthorized_response is called
  • New checkpoint error tests verify streaming starts as fallback
  • Manual testing with dev_runner.py to observe SSE connection and watchdog behavior

🤖 Generated with Claude Code

jkebinger and others added 4 commits January 22, 2026 14:07
Adds a watchdog thread that monitors SSE connection health by tracking
when data (including keepalives) is received. If no data is received for
120 seconds (configurable), the watchdog:
1. Logs a warning
2. Polls the checkpoint API for fresh config data
3. Closes the SSE client to force reconnection

This helps detect and recover from stuck SSE connections that may not
trigger normal timeout/error handling (e.g., proxy issues, half-open
connections).
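
The watchdog behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in sdk_reforge/_sse_watchdog.py: the class name `WatchdogSketch`, the `poll_fn`/`close_fn` callbacks, the injectable `clock`, and the 5-second check interval are all assumptions made for the example. Only the 120-second timeout and the warn, poll, close sequence come from the commit message.

```python
import logging
import threading
import time
from typing import Callable, Optional

logger = logging.getLogger("sse_watchdog_sketch")


class WatchdogSketch:
    """Illustrative watchdog; names and defaults here are hypothetical."""

    def __init__(
        self,
        poll_fn: Callable[[], None],   # assumed: polls the checkpoint API
        close_fn: Callable[[], None],  # assumed: closes the SSE client
        timeout_s: float = 120.0,      # from the PR: 4 missed 30s keepalives
        check_interval_s: float = 5.0,  # assumed value
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._poll_fn = poll_fn
        self._close_fn = close_fn
        self._timeout_s = timeout_s
        self._check_interval_s = check_interval_s
        self._clock = clock
        self._last_seen = clock()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def touch(self) -> None:
        """Record that data (including a keepalive) just arrived."""
        with self._lock:
            self._last_seen = self._clock()

    def check_once(self) -> bool:
        """Run recovery if the connection looks stuck; return True if it ran."""
        with self._lock:
            idle = self._clock() - self._last_seen
        if idle < self._timeout_s:
            return False
        logger.warning("No SSE data for %.0fs; starting recovery", idle)
        # Run each step independently so a failed poll still closes the client.
        for step in (self._poll_fn, self._close_fn):
            try:
                step()
            except Exception:
                logger.exception("Watchdog recovery step failed")
        self.touch()  # reset the timer so recovery does not fire every check
        return True

    def start(self) -> None:
        self._thread = threading.Thread(
            target=self._run, name="sse-watchdog", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._check_interval_s):
            self.check_once()
```

In this sketch the streaming loop would call `touch()` for every event it reads, and the injectable clock keeps the timeout logic testable without sleeping.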

Additional improvements:
- Changed except Exception to except BaseException to catch GeneratorExit
  and other BaseException subclasses that could silently kill the thread
- Added logging when streaming loop exits (with shutdown reason)
- Fixed backoff logging to show actual sleep time instead of pre-doubled value
- Removed dead code (ConfigSDK.sse_client was never assigned)
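
To illustrate why the switch to BaseException matters, here is a small self-contained sketch. The helper names (`caught_by_except_exception`, `run_guarded`, `raise_it`) are hypothetical and do not appear in the SDK. It relies only on the standard Python exception hierarchy, where GeneratorExit and SystemExit derive from BaseException rather than Exception.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("sse_loop_sketch")


def caught_by_except_exception(exc: BaseException) -> bool:
    """Report whether a plain `except Exception` handler would catch exc."""
    try:
        raise exc
    except Exception:
        return True
    except BaseException:
        return False


def run_guarded(step: Callable[[], None], events: List[str]) -> None:
    """Run one hypothetical loop body and record how it ended."""
    try:
        step()
        events.append("ok")
    except BaseException as exc:  # also sees GeneratorExit, SystemExit, ...
        events.append(f"error:{type(exc).__name__}")
        logger.error("Streaming loop error: %r", exc)
    finally:
        # Runs on every exit path, so an unexpected exit is always visible.
        events.append("exit")
        logger.info("Streaming loop exited")


def raise_it(exc: BaseException) -> None:
    """Helper for demonstrations: raise the given exception."""
    raise exc
```

One design caveat: a handler this broad also intercepts KeyboardInterrupt and SystemExit, so code following this pattern may want to re-raise those after logging.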

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The previous code caught UnauthorizedException, which is never raised
by raise_for_status(). Instead, HTTPError is raised. This change:

- Catches HTTPError and inspects response.status_code for 401/403
- Removes dead UnauthorizedException catch block
- Adds specific tests for 401 and 403 responses
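
The pattern described in this commit might look roughly like the sketch below. It assumes a requests-style HTTP client, where raise_for_status() raises an HTTPError that carries the response. To stay runnable without third-party packages, `HTTPError` and `FakeResponse` are local stand-ins, and `check_response` and its `on_unauthorized` callback are hypothetical names, not the SDK's API.

```python
from typing import Callable


class HTTPError(Exception):
    """Local stand-in for a requests-style HTTPError carrying the response."""

    def __init__(self, message: str, response: "FakeResponse") -> None:
        super().__init__(message)
        self.response = response


class FakeResponse:
    """Local stand-in for an HTTP response object."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise HTTPError(f"{self.status_code} error", response=self)


def check_response(
    response: FakeResponse,
    on_unauthorized: Callable[[FakeResponse], None],
) -> bool:
    """Return True if the response is usable, False if it was 401/403."""
    try:
        response.raise_for_status()
    except HTTPError as exc:
        status = getattr(exc.response, "status_code", None)
        if status in (401, 403):
            # Auth failures get dedicated handling instead of a retry.
            on_unauthorized(exc.response)
            return False
        raise  # other HTTP errors go to the caller's retry/backoff logic
    return True
```

Inspecting the status code on the raised error, rather than catching a dedicated exception type, is what keeps the 401/403 branch reachable in this sketch.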

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

If load_checkpoint() fails (no data found or unexpected exception),
streaming would never start because finish_init() was never called.

This fix starts streaming as a fallback when checkpoint loading fails,
but does NOT call finish_init() - this preserves the timeout behavior
where get() blocks until timeout if no data is available.

- Start streaming when CDN and cache both fail to load
- Start streaming when unexpected exception occurs
- Do NOT start streaming on UnauthorizedException (handled separately)
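
The three rules above can be sketched as a single decision function. This is a simplified illustration under stated assumptions: `initialize_sketch` and its callback parameters are hypothetical, `UnauthorizedException` is a local stand-in, and `finish_init` is assumed to start streaming itself on the success path, as the commit message implies.

```python
from typing import Callable


class UnauthorizedException(Exception):
    """Local stand-in for the SDK's exception of the same name."""


def initialize_sketch(
    load_checkpoint: Callable[[], bool],  # assumed: True if configs loaded
    start_streaming: Callable[[], None],
    finish_init: Callable[[], None],      # assumed: also starts streaming
    handle_unauthorized: Callable[[], None],
) -> str:
    """Return a label describing which startup path was taken."""
    try:
        loaded = load_checkpoint()
    except UnauthorizedException:
        # Handled separately; streaming is deliberately not started.
        handle_unauthorized()
        return "unauthorized"
    except Exception:
        # Unexpected failure: stream anyway so SSE can deliver configs later.
        start_streaming()
        return "streaming_fallback"
    if loaded:
        finish_init()
        return "initialized"
    # CDN and cache both came up empty: stream, but do not call
    # finish_init(), so get() still blocks until its timeout.
    start_streaming()
    return "streaming_fallback"
```

Keeping finish_init() off the fallback paths is what preserves the timeout behavior: callers of get() are not told that data is ready when none has arrived.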

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Bump version to 1.2.0 for SSE watchdog and error handling improvements
- Add dev_runner.py for observing SDK behavior during development

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Poetry 2.x installation was failing in GitHub Actions.
Pin to 1.8.5 for stability.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@jkebinger enabled auto-merge (squash) January 22, 2026 20:11
@jdwyah (Contributor) left a comment


thanks

@jkebinger merged commit ffcb022 into main Jan 22, 2026
7 checks passed
@jkebinger deleted the add-sse-watchdog branch January 22, 2026 20:12
@jkebinger mentioned this pull request Jan 22, 2026
1 task