Skip to content

fix(code): coalesce writeLocalLogs per taskRunId to stop main-thread storm#2255

Open
k11kirky wants to merge 4 commits into
mainfrom
posthog-code/debounce-write-local-logs
Open

fix(code): coalesce writeLocalLogs per taskRunId to stop main-thread storm#2255
k11kirky wants to merge 4 commits into
mainfrom
posthog-code/debounce-write-local-logs

Conversation

@k11kirky
Copy link
Copy Markdown
Contributor

Summary

Layer 1 of a multi-PR fix for the launch hang and "can't click cloud task" symptoms reported by cloud-task users.

Extracts local NDJSON cache handling (readLocalLogs / writeLocalLogs) into a singleton LocalLogsService and single-flights writes per taskRunId with latest-wins coalescing. If a write is already in flight when another arrives for the same run, the new content replaces any queued content rather than spawning a parallel fs.promises.writeFile.

Why this is needed

When the renderer's gap-reconcile loop fires on every SSE snapshot — which happens whenever parseLogContent silently drops corrupted lines and processedLineCount never catches up to the server's expectedCount — the old fire-and-forget writeLocalLogs piles fs.promises.writeFile continuations onto the main process, producing the FileHandle::CloseReq::Resolve saturation signature we saw at app launch.

This is one of two distinct main-thread hangs reported on the same crash signature:

  • #2242 (merged) addressed the startup unzipSync path that affected all users.
  • This PR addresses the cloud-task corruption-feedback loop that only manifested for users with cloud tasks.

What this does NOT fix

Stops the storm, but doesn't address the underlying corruption-amplification loop or unbounded reconnects — those are layer 2 and 3 below.

Follow-ups (separate PRs to be stacked on this one)

Layer 2 — break the corruption-amplification feedback loop.
parseLogContent (renderer service.ts:3539) silently drops malformed lines, so processedLineCount < expectedCount forever and every SSE snapshot triggers another gap-reconcile + S3 fetch + overwrite. Need to either:

  • Track dropped-line counts and feed them into the reconciliation math so a known-corrupted file stops triggering gap-reconcile after one observation, or
  • Hash-compare local vs S3 content and short-circuit re-write when they match.

Fixes the "can't click on cloud task" symptom for users whose local NDJSON is already poisoned.

Layer 3 — bound the reconnect budget.
MAX_SSE_RECONNECT_ATTEMPTS = 5 (cloud-task/service.ts:21) is defeated by two paths:

  • handleStreamCompletion reconnects with countAttempt: false for non-terminal clean EOF (cloud-task/service.ts:1057).
  • retry / retryUnhealthyCloudSessions resets the counter on every focus.

Need a per-run cumulative cap and an explicit unrecoverable terminal state so the UI can surface "this run is broken" instead of looping silently.

Separate ticket — S3 source corruption.
The agent's local writer (packages/agent/src/session-log-writer.ts:391) correctly appends \n, but user logs show records concatenated without separators across days. Missing newlines are being introduced somewhere in the agent-server upload/aggregation path. This PR limits the blast radius of corruption but doesn't stop it from being produced.

Test plan

  • Unit tests cover: single-flight coalescing, multi-run independence, propagation of in-flight resolution to all coalesced callers, recovery after write rejection, queue draining after completion.
  • pnpm --filter code typecheck clean.
  • pnpm --filter code test — new tests pass; remaining failures are pre-existing archive integration tests unrelated to this change.
  • Manual verification on a dev build with cloud tasks (post-merge).

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
apps/code/src/main/services/local-logs/service.test.ts:79-93
**Prefer parameterised tests for equivalent null-return cases**

The two `readLocalLogs` tests — `"returns null when the file is missing"` and `"returns null on other read errors"` — differ only in the error object injected. Per the team style, these should be collapsed into a single `it.each`.

Reviews (1): Last reviewed commit: "fix(code): coalesce writeLocalLogs per t..." | Re-trigger Greptile

Comment on lines +79 to +93
const service = new LocalLogsService();
await expect(service.readLocalLogs(RUN_ID)).resolves.toBeNull();
});

it("returns null on other read errors", async () => {
mockReadFile.mockRejectedValue(new Error("boom"));
const service = new LocalLogsService();
await expect(service.readLocalLogs(RUN_ID)).resolves.toBeNull();
});
});

describe("writeLocalLogs", () => {
it("writes content to the run's NDJSON path", async () => {
const service = new LocalLogsService();
await service.writeLocalLogs(RUN_ID, "line1\n");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Prefer parameterised tests for equivalent null-return cases

The two readLocalLogs tests — "returns null when the file is missing" and "returns null on other read errors" — differ only in the error object injected. Per the team style, these should be collapsed into a single it.each.

Context Used: Do not attempt to comment on incorrect alphabetica... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: apps/code/src/main/services/local-logs/service.test.ts
Line: 79-93

Comment:
**Prefer parameterised tests for equivalent null-return cases**

The two `readLocalLogs` tests — `"returns null when the file is missing"` and `"returns null on other read errors"` — differ only in the error object injected. Per the team style, these should be collapsed into a single `it.each`.

**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@k11kirky k11kirky force-pushed the posthog-code/debounce-write-local-logs branch from b81c449 to f2b4def Compare May 21, 2026 12:32
Copy link
Copy Markdown
Contributor Author

k11kirky commented May 21, 2026

k11kirky added 4 commits May 22, 2026 13:55
…storm

Extract local NDJSON cache handling into a singleton LocalLogsService and
single-flight writes per taskRunId with latest-wins coalescing. If a write
is already in flight when another arrives for the same run, the new content
replaces any queued content rather than spawning a parallel
fs.promises.writeFile.

When the renderer's gap-reconcile loop fires on every SSE snapshot (which
happens whenever parseLogContent silently drops corrupted lines and the
processedLineCount never catches up to the server's expectedCount), the old
fire-and-forget writeLocalLogs would pile fs.promises.writeFile continuations
onto the main thread, producing the FileHandle::CloseReq::Resolve hang
signature we saw at app launch.

Generated-By: PostHog Code
Task-Id: 0270afde-6140-4867-bd57-52ae8a767a4d
…-content writes

Tighten WriteState shape (drop the placeholder cast), skip mkdir after the
first successful write per drain, skip writeFile when the coalesced content
matches the last written payload, and use the DATA_DIR constant.

Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
The duplicate `INBOX_VIEWED` entries in `ANALYTICS_EVENTS` (one before
PR #2228 added it as a typed event, and the second inside the proper
Inbox section) were tripping `tsc`. Remove the legacy untyped placeholder
plus the no-props `track(INBOX_VIEWED)` call in `navigateToInbox` — the
properly-typed event already fires from `InboxSignalsTab` with the full
report counts.

Also trim the LocalLogsService docblock to focus on the non-obvious
single-flight rationale.

Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
Generated-By: PostHog Code
Task-Id: 65171e71-4765-479c-b0c8-e4d0dd5c5bc7
@k11kirky k11kirky force-pushed the posthog-code/debounce-write-local-logs branch from 8913168 to d7b92a8 Compare May 22, 2026 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants