Conversation
rhencke
approved these changes
Apr 7, 2026
rhencke
approved these changes
Apr 7, 2026
FidoCanCode
added a commit
that referenced
this pull request
Apr 10, 2026
Two compounding bugs in the self-restart path manifested when a kennel PR merged (#236) tonight: 1. _pull_with_backoff used `git pull`, which fails on a divergent local branch with "fatal: Need to specify how to reconcile divergent branches" (exit 128). The runner clone is supposed to be read-only, but if it ever drifts even one commit, every subsequent pull fails the same way and the retry loop never recovers. Replace `git pull` with `git fetch origin main && git reset --hard origin/main` so the runner is forcibly synced regardless of local state. 2. _self_restart called `stop_and_join(repo_name)` BEFORE the pull. When the pull failed, the function returned but the worker thread for the merged repo was already torn down — silently leaving kennel without a fido worker on its own repo, with no log line indicating why. Move the stop_and_join AFTER the pull succeeds so a failed pull leaves the worker intact and the existing code keeps running. Both fixes are independent but interact: with #1 alone the pull would have succeeded and the test wouldn't have caught #2. With #2 alone we still need #1 to actually deliver new code. Closes #268
FidoCanCode
added a commit
that referenced
this pull request
Apr 10, 2026
…#269) Two compounding bugs in the self-restart path manifested when a kennel PR merged (#236) tonight: 1. _pull_with_backoff used `git pull`, which fails on a divergent local branch with "fatal: Need to specify how to reconcile divergent branches" (exit 128). The runner clone is supposed to be read-only, but if it ever drifts even one commit, every subsequent pull fails the same way and the retry loop never recovers. Replace `git pull` with `git fetch origin main && git reset --hard origin/main` so the runner is forcibly synced regardless of local state. 2. _self_restart called `stop_and_join(repo_name)` BEFORE the pull. When the pull failed, the function returned but the worker thread for the merged repo was already torn down — silently leaving kennel without a fido worker on its own repo, with no log line indicating why. Move the stop_and_join AFTER the pull succeeds so a failed pull leaves the worker intact and the existing code keeps running. Both fixes are independent but interact: with #1 alone the pull would have succeeded and the test wouldn't have caught #2. With #2 alone we still need #1 to actually deliver new code. Closes #268 Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
This was referenced Apr 14, 2026
This was referenced Apr 26, 2026
This was referenced May 1, 2026
Open
Open
FidoCanCode
added a commit
that referenced
this pull request
May 3, 2026
…1281) ## Summary Closes #1280 — the inbox-counter leak that parked the home worker for ~58 min after a single PR comment landed. **Most likely root cause** (consistent with the 19:48:28 timeline): the BG rescope thread either crashed in the prelude or hung inside an Opus call. Either way the finally never ran, the inbox hold leaked, and the worker yielded forever waiting on `wait_for_inbox_drain(timeout=None)`. Three layers, root-cause first: 1. **Move BG prelude into the try/finally** (`events.py`). `set_thread_kind`, `set_thread_repo`, `set_rescoping(True)` were OUTSIDE the try. A raise from any of them would skip the finally entirely and leak the count. Now everything from the first prelude line through the loop is inside the try; the finally fires regardless of where the thread crashes. 2. **Reorder the finally so release fires first**. `exit_untriaged` calls now run BEFORE `set_rescoping(False)` and the thread-context cleanup. A failure in any later cleanup step can't swallow the release. Same reordering for the spawn-failure except path. 3. **Backstop watchdog + breadcrumb logs**. `WorkerRegistry.force_clear_untriaged` zeroes a leaked count, signals the drained event, and logs WARNING with the count cleared. `_admit_worker_turn` caps `wait_for_inbox_drain` at 5 minutes and force-clears on timeout. The BG thread now logs `starting`, `iteration N starting/complete`, `entering finally (iterations=N)`, and `finally complete` — so the next leak is immediately diagnosable from journalctl. ## Tests - `test_release_runs_before_set_rescoping_in_finally` — call-order invariant via `mock_calls` ordering - `test_release_runs_even_when_prelude_set_rescoping_raises` — the new prelude-inside-try invariant; without #1, this would leak - `test_release_survives_set_rescoping_raising_on_thread_start_failure` — same invariant on the spawn-failure path - `test_force_clear_*` (registry) — three cases: no-op, warns + resets, signals waiter event - `test_force_clears_when_inbox_drain_times_out` (worker) — backstop fires on timeout - `test_waits_before_asserting_when_inbox_non_empty` — updated assertion, no force-clear in the happy path New tests use hand-rolled `_FakeRescopeRegistry` / `_FakeActivityReporter` per Rob's no-MagicMock feedback. Existing MagicMock tests in the same files left alone. ## Test plan - [x] `./fido tests` — 3869 passed, 100% coverage - [x] Manual review confirmed prelude-into-try is the most plausible-and-cheapest root-cause patch for the field leak Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
This was referenced May 3, 2026
FidoCanCode
added a commit
that referenced
this pull request
May 16, 2026
…andled gating Three changes from codex review + Rob's design Qs: 1. **Route by comment_type (codex P2 #1).** RescopeIntent now carries comment_type (default "pulls" for back-compat); the per-intent notifier skips issue comments with a log line — same policy as the deleted _notify_thread_change ("webhook already replied"). Backfilled at create-time in SynthesisExecutor._maybe_trigger_rescope from the CommentTarget. 2. **Drop the duplicate _notify_thread_change path (codex P2 #2).** Per-task on_changes notifications and per-intent dispositions were both firing for the same rescope event, creating duplicate replies on the same comment. The per-intent path is now the canonical reply mechanism — the old per-task path is gone along with its TestNotifyThreadChange tests and the on_changes wiring in _make_reorder_kwargs. Filed #1748 (per-PR comment cache w/ webhook patching, mirroring IssueTreeCache) and #1749 (collapse RescopeIntent to comment-id reference, fetch via the cache) so the past-batch-intent case can be addressed cleanly in the new model. BUG_MINED_INVARIANTS.md J cluster updated to note the gap and the tracked follow-ups. 3. **Unhandled fires only when an EARLIER attributed sibling exists (Rob's directive).** When every acted-on sibling is strictly newer, the original "got it" reply still stands — the work continues under the later cover, no need to ping the original commenter. Solo-unhandled and all-unhandled batches stay silent for the same reason. Coverage: 4376 tests, 100%. Removed defensive None guard in _has_earlier_attributed_sibling per the no-defensive-code principle (classifier always produces a disposition per batch intent).
FidoCanCode
added a commit
that referenced
this pull request
May 24, 2026
## P1 (events.py:2942) — HOL-28 dispatcher saw stale task status ``notify_terminal_task_thread`` gated on ``task["status"]`` but the worker passed the pre-completion snapshot, so the dispatcher saw ``"in_progress"`` and silently returned for normal completions — the whole live-path emission was suppressed. Fix: restructure the dispatcher API to take ``new_tasks: list[dict]`` (the live post-completion snapshot) plus ``anchor_intent`` and derive the completed/skipped lists from EVERY task anchored at ``anchor_intent.comment_id``. The worker passes ``self._tasks.list()`` directly. No more stale-snapshot trap. ## P2 (events.py:2943) — aggregate summary missed sibling tasks Building completed/skipped lists from a SINGLE task argument meant multi-task threads got a closing summary mentioning only the trigger task; sibling completed tasks were lost. Same fix as P1: the dispatcher now iterates ``new_tasks`` filtering on anchor and builds the lists from all siblings. Log line now includes the sibling count for traceability. ## P2 (events.py:2496) — insight pre-reply failure permanently lost the insight Both production reply paths called ``execute_effects_only(..., insights_already_filed=True)`` unconditionally. If the pre-reply filing raised (transient GitHub failure), the post-reply effects pass suppressed the retry and the insight was permanently lost. Fix: ``_safe_file_insights_pre_reply`` now returns ``bool`` (True on clean success, False on raised exception). Both production reply paths capture the result into ``pre_reply_insights_filed`` and pass it as ``insights_already_filed=`` to ``execute_effects_only``. On pre-reply failure the post-reply effects re-attempt the file path, preserving the insight. ## P2 (worker.py:3851) — fabricated change_request leaked into reply The HOL-28 worker wire constructed a stub ``RescopeIntent`` with ``change_request="(task-anchor reconstruction)"`` and the framing interpolated that field verbatim into the LLM prompt, which then leaked into the generated closing reply. Fix: worker now passes ``change_request=""``. The framing skips the "This commenter's change request:" line entirely when the field is empty/whitespace. The task titles + descriptions still carry the story; an empty change_request degrades cleanly. ## Tests - ``test_completed_task_fires_reply_at_anchor`` / ``test_skipped_task_fires_reply`` — updated to the new ``new_tasks`` API. - ``test_non_terminal_anchored_task_skips_dispatch`` — multi-task thread with a pending sibling does NOT fire. - ``test_aggregate_includes_all_sibling_tasks`` — pin the multi- task framing covers ALL siblings (P2 #1). - ``test_anchor_empty_change_request_not_leaked`` — pin the empty change_request → framing-omits-line guard (P2 #4). - ``test_no_anchored_tasks_skips_dispatch`` — anchor with no matching live tasks → vacuous skip. - ``test_post_failure_logged_not_raised`` — updated to new API. - Worker ``_StubDispatcher.notify_terminal_task_thread`` updated to the new ``new_tasks`` signature. - Worker test assertion changed from ``call["task"] is task`` to ``task in call["new_tasks"]``. Full ``./fido ci`` green, 4945 tests, 100% coverage.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds full CI infrastructure:
Fixes the ID collision bug in task generation (millisecond → millisecond+random).