achieve: bypass circuit breaker on Claude auth failures#168
Conversation
When the Claude CLI is not authenticated, every session exits 1 with "Not logged in. Please run /login" before spending any API budget. Previously these failures incremented CONSECUTIVE_FAILURES identically to code bugs, tripping the circuit breaker (3 failures = daemon stop) and requiring a human to restart the daemon after re-authenticating. Add is_auth_failure() to lib-agent.sh: parses the stream-json session log for "not logged in" / "please run /login" in the result event. In daemon.sh, check is_auth_failure before the counter increment: if detected, notify_human + sleep 300 + continue (no counter change). The daemon now survives auth lapses without human restart. The human still needs to run /login, but the daemon waits and retries on its own. 6 regression tests in TestAuthFailureDetection.
|
|
||
| def test_detects_not_logged_in(self, tmp_path: Path) -> None: | ||
| """is_auth_failure returns 0 for a Claude 'Not logged in' result.""" | ||
| import json |
|
|
||
| def test_detects_please_run_login(self, tmp_path: Path) -> None: | ||
| """is_auth_failure returns 0 when result contains 'please run /login' (case-insensitive).""" | ||
| import json |
|
|
||
| def test_ignores_code_errors(self, tmp_path: Path) -> None: | ||
| """is_auth_failure returns 1 for a regular code failure (not an auth issue).""" | ||
| import json |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 39918ebb24
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| result_text = str(event.get("result", "")).lower() | ||
| if any(pat in result_text for pat in AUTH_PATTERNS): |
There was a problem hiding this comment.
Require explicit auth signal before bypassing breaker
is_auth_failure() currently classifies a failure as authentication-related purely by substring match on the free-form result text, so any normal failed run that mentions "please run /login" (for example, while editing login-related docs/tests) will be treated as an auth outage. In that case the daemon skips incrementing CONSECUTIVE_FAILURES, which can effectively disable the circuit breaker for a real code regression and keep retrying indefinitely.
Useful? React with 👍 / 👎.
| notify_human "Authentication required" \ | ||
| "Daemon session $SESSION_ID failed because the agent is not logged in. Run /login (or equivalent) to restore service. The daemon will retry automatically in 5 minutes." || true | ||
| sleep 300 | ||
| continue |
There was a problem hiding this comment.
Throttle repeated auth alerts in retry loop
This branch sends notify_human on every auth-failed cycle and then continues, so during a prolonged credential outage it will open a new GitHub issue (and webhook post) every 5 minutes without bound. Because the circuit breaker is intentionally bypassed here, duplicate alert spam can persist for hours and create operational noise or rate-limit pressure unless notifications are deduplicated/cooldown-limited.
Useful? React with 👍 / 👎.
- Add Codex item.completed/agent_message path to is_auth_failure() so auth failures for the Codex agent also bypass the circuit breaker - Add test_detects_codex_auth_failure covering the Codex format - Use ensure_ascii=False in test json.dumps calls to match the real UTF-8 log format (middot as bytes, not \u00b7 ASCII escape)
Autonomy score: 81/100 (was 71/100). Dependency eliminated: auth failures no longer stop the daemon. PR #168 merged.
- docs/strategy/2026-04-06.md: full strategy review covering 70 sessions, 15 PRs, cost analysis (Sonnet $2.30/test vs Opus $5.09/test), and prompt health. Key finding: eval gate deadlocked on stale 53/100. - Pentest validation: all 3 prompt-alert changes confirmed legitimate (PR #168). Finding #172 (eval fabrication) and #125 (clean-state) already tracked. New task #176 for autonomy first-match bug. - Task #176: fix read_latest_autonomy_score() to use re.findall()[-1] - Task #177: re-run Step 0 eval to unblock the BUILD eval gate - Task #178: fix cost classifier to recognize role-based session types
Summary
is_auth_failure()toscripts/lib-agent.sh: parses stream-json session log for "not logged in" / "please run /login" in theresulteventscripts/daemon.shcircuit breaker: callsis_auth_failurebefore incrementingCONSECUTIVE_FAILURES; auth failures triggernotify_human+ 300s wait +continue(no counter change)TestAuthFailureDetectionWhy
Today's session index shows 9 consecutive auth failures (11:46–11:54) that triggered the circuit breaker twice, stopping the daemon entirely. The root cause: Claude CLI returned "Not logged in · Please run /login" but these were counted identically to code bugs. The daemon died and required a human restart after the human re-authenticated.
With this fix, the daemon survives auth lapses without a human restart. The human still needs to run
/login, but the daemon waits 5 minutes and retries on its own.Impact
Autonomy score: +5 Self-Improving (session success rate from ~50% to 90%+)
Test plan
make checkpasses (1085 tests)TestAuthFailureDetection: 6 tests covering auth detection, code-error passthrough, missing log, daemon.sh contract, lib-agent.sh contractis_auth_failurecorrectly identifies "Not logged in" from real session log atdocs/sessions/20260406-114606.log