fix(runner): add health probes and improve INITIAL_PROMPT error logging #68

Merged
maknop merged 1 commit into main from fix/add-health-probes-and-improve-logging on May 8, 2026

maknop commented on May 8, 2026

Summary

This PR implements health probes for runner pods and improves error logging for INITIAL_PROMPT retries, matching the implementation from ambient-code#1529.

Changes

Kubernetes Health Probes

  • Added readiness probe to runner container (3s initial delay, 5s period)
  • Added liveness probe to runner container (20s initial delay, 30s period)
  • Probes check /health endpoint on the runner's FastAPI server
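
As an illustration of the probe settings listed above, here is a minimal Python sketch that builds the equivalent container spec fragment as a plain dictionary. The PR's actual change appears to live in Go code, which is not shown here; the helper name `http_probe` and the port value are hypothetical and chosen only for this example. The field names (`httpGet`, `initialDelaySeconds`, `periodSeconds`) are the standard Kubernetes probe fields.

```python
import json


def http_probe(path, port, initial_delay_seconds, period_seconds):
    """Build an HTTP GET probe using standard Kubernetes probe field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
    }


# Hypothetical port: the PR description does not state which port the
# runner's FastAPI server listens on.
RUNNER_PORT = 8000

runner_probes = {
    # Readiness: gates Service traffic until /health responds.
    "readinessProbe": http_probe("/health", RUNNER_PORT, 3, 5),
    # Liveness: lets the kubelet restart the container if /health stops responding.
    "livenessProbe": http_probe("/health", RUNNER_PORT, 20, 30),
}

print(json.dumps(runner_probes, indent=2))
```

The short readiness delay lets a pod start receiving traffic soon after FastAPI comes up, while the longer liveness delay and period reduce the chance of restarting a container that is merely slow to start.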

Error Logging Improvements

  • Enhanced retry error logging in app.py to include exception type
  • Previously, exceptions with an empty message, such as asyncio.TimeoutError, were logged as an empty string
  • Now logs: "error: TimeoutError: <details>" instead of "error: "
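
The logging problem described above can be reproduced in a few lines. This is a minimal sketch, not the code from app.py: the helper name `describe_exception` is hypothetical, and it omits the trailing colon when there is no message, which may differ from the exact format the PR uses.

```python
import asyncio


def describe_exception(exc: BaseException) -> str:
    """Return 'TypeName: message', or just 'TypeName' when the message is empty."""
    name = type(exc).__name__
    detail = str(exc)
    return f"{name}: {detail}" if detail else name


# str() of an exception raised without arguments is empty, so formatting
# the exception directly gives the reader nothing to go on.
old_style = f"error: {asyncio.TimeoutError()}"

# Including the exception type keeps the log line informative.
new_style = f"error: {describe_exception(asyncio.TimeoutError())}"
with_detail = f"error: {describe_exception(ConnectionError('connection refused'))}"

print(repr(old_style))
print(repr(new_style))
print(repr(with_detail))
```

Using `type(exc).__name__` rather than `repr(exc)` keeps the output close to the "TimeoutError: <details>" shape quoted in the description.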

Benefits

  • Prevents premature traffic routing: Service won't route to pods until FastAPI is ready
  • Reduces 503 errors: Avoids "runner unavailable" errors caused by routing traffic to pods that are still starting
  • Better debugging: More informative error logs with exception types
  • Self-healing: Liveness probe enables automatic pod restarts on failure

Test Plan

  • Code compiles successfully (go vet passes)
  • Code formatting is correct (gofmt passes)
  • Deploy to test cluster and verify health probes are configured
  • Verify no 503 errors during pod startup
  • Verify error logs include exception types during connection failures

🤖 Generated with Claude Code

Kubernetes Health Probes:
- Added readiness probe (3s initial delay, 5s period)
- Added liveness probe (20s initial delay, 30s period)
- Prevents Service routing traffic before FastAPI is ready
- Reduces 503 "runner unavailable" errors

Error Logging Improvements:
- Enhanced retry error logging to include exception type
- Previously logged empty strings for exceptions like asyncio.TimeoutError
- Now logs: "error: TimeoutError: <details>" instead of "error: "

Benefits:
- Prevents premature traffic routing to starting pods
- More informative error logs for debugging
- Better system resilience through health probes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
maknop merged commit 393378a into main on May 8, 2026
52 of 72 checks passed