Skip to content

fix(llma): stop capturing 4xx MCP API errors as exceptions#59231

Merged
rafaeelaudibert merged 2 commits into
masterfrom
posthog-code/mcp-4xx-no-exception-capture
May 20, 2026
Merged

fix(llma): stop capturing 4xx MCP API errors as exceptions#59231
rafaeelaudibert merged 2 commits into
masterfrom
posthog-code/mcp-4xx-no-exception-capture

Conversation

@posthog
Copy link
Copy Markdown
Contributor

@posthog posthog Bot commented May 20, 2026

Problem

The PostHog team is being paged with new error tracking issues every time an LLM agent passes a bad parameter to an MCP tool — noise rather than real bugs.

Concretely, two new error tracking issues showed up within ~20 seconds of each other (error-tracking-symbol-sets-retrieve and error-tracking-symbol-sets-download-retrieve), both reporting MCPToolError: Request failed... 404 (Not Found) for …/symbol_sets/00000000-0000-0000-0000-000000000000/[download/]. Per the $exception event stream, each fingerprint had exactly 1 occurrence over the last 30 days — but every additional MCP tool that gets handed an all-zeros placeholder UUID would create yet another distinct issue.

The PostHog API was correctly returning 404 via self.get_object() in products/error_tracking/backend/api/symbol_sets.py. The real issue lived in the MCP service: handleToolError in services/mcp/src/lib/errors.ts called getPostHogClient().captureException(...) on every non-permission error, fingerprinted by $exception_fingerprint: mcpError.tool. So any agent-induced 4xx response got logged as a brand-new PostHog exception keyed by tool name.

Changes

  • New typed error PostHogApiError in services/mcp/src/lib/errors.ts carrying status, statusText, body, url, method.
  • services/mcp/src/api/client.ts (text-response and JSON-response paths) and services/mcp/src/api/fetcher.ts now throw PostHogApiError instead of generic Errors, preserving the existing message shapes for any string-matching consumers.
  • handleToolError walks the Error.cause chain for PostHogApiError and PostHogValidationError; if the underlying response is 4xx, it returns the typed message to the LLM without calling captureException. 5xx and unexpected non-HTTP errors still capture as before — this mirrors the existing recoverable-error short-circuit for MissingProjectContextError.

No backend change is needed; the 404 behavior in symbol_sets.py is correct — the right move is to treat 4xx as agent input error (recoverable, expected) at the MCP layer.

How did you test this code?

I am an agent (PostHog Code). I did not perform manual testing. I ran the following automated tests:

  • pnpm test run tests/unit/permission-errors.test.ts tests/unit/fetcher.test.ts from services/mcp/ — 45 tests pass (26 + 19), including 5 new tests covering: 404 short-circuit, validation-error short-circuit, cause-chain unwrapping, 5xx still captures, unexpected non-HTTP errors still capture.
  • pnpm exec tsc --noEmit — no new type errors in the changed files (pre-existing @posthog/quill / react UI-app errors remain, unrelated to this change).
  • pnpm exec oxlint on the changed files — clean (one pre-existing warning on tests/unit/fetcher.test.ts:199 unrelated to this change).

Publish to changelog?

no

Docs update

No docs change needed.

🤖 Agent context

  • Agent: PostHog Code (Claude). Task acted on a signal report describing the dual-issue burst and pointing at services/mcp/src/lib/errors.ts:275 as the captureException site (added in commit 042d86b, made reachable for symbol-set tools in a70f94d).
  • Approach considered and rejected: dedup at fingerprint level (e.g. include status code in the fingerprint). Rejected because the underlying signal is still noise — a 4xx from agent input is not actionable for engineers and shouldn't reach the issue tracker at all.
  • Chosen approach: introduce a typed PostHogApiError (so future call sites get the same classification for free) and short-circuit 4xx in handleToolError, mirroring the existing MissingProjectContextError short-circuit pattern.
  • One judgment call: the fetcher's existing error message format (Failed request: [404] {...}) is preserved via the PostHogApiError message override so any string-matching consumers keep working, while the client.ts paths use the default Request failed:\nURL: ... format that was already in place.

Generated-By: PostHog Code
Task-Id: 22c240c9-b71a-4acf-b137-1187eb679ba8

handleToolError captured every non-permission tool error via
captureException with a fingerprint of the tool name. Any LLM-induced
4xx (e.g. a placeholder UUID that returns 404) created a brand-new
error tracking issue per tool, paging the team for noise rather than
real bugs.

Introduce a typed PostHogApiError that carries the HTTP status, and
short-circuit 4xx + validation errors in handleToolError so the typed
message is returned to the LLM (which can self-correct) without
capturing an exception. 5xx and unexpected non-HTTP errors still
capture as before.

Generated-By: PostHog Code
Task-Id: 22c240c9-b71a-4acf-b137-1187eb679ba8
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
services/mcp/src/api/fetcher.ts:97-108
Variable shadowing: the inner `const body` declaration reuses the same name as the outer `body` (which holds the request body for POST/PUT/PATCH). TypeScript/eslint `no-shadow` rules will catch this, and it makes the two semantically different values harder to distinguish at a glance. Renaming the error-body variable avoids the ambiguity.

```suggestion
                if (!response.ok) {
                    const errorResponse = await response.json()
                    const errorBody = JSON.stringify(errorResponse)
                    throw new PostHogApiError({
                        status: response.status,
                        statusText: response.statusText,
                        body: errorBody,
                        url: input.url.toString(),
                        method: input.method.toUpperCase(),
                        message: `Failed request: [${response.status}] ${errorBody}`,
                    })
                }
```

### Issue 2 of 2
services/mcp/tests/unit/permission-errors.test.ts:409-462
**Prefer parameterised tests for status-code boundary cases**

The new test suite has four individual `it()` blocks that could be collapsed into two `it.each()` tables: one for the "short-circuit / no capture" family (400, 404, 422, 499) and one for the "capture as exception" family (500, 502, 503). The exact boundary between 4xx and 5xx is the core invariant being tested, so covering it with multiple representative values matters. For example, an off-by-one in the `status < 500` guard would only be caught if 499 and 500 are both in the table. The same applies to the `fetcher.test.ts` rewrites.

Reviews (1): Last reviewed commit: "fix(llma): stop capturing 4xx MCP API er..." | Re-trigger Greptile

Comment thread services/mcp/src/api/fetcher.ts
Comment thread services/mcp/tests/unit/permission-errors.test.ts
…ze boundary tests

- fetcher.ts: rename `body` (inner JSON-string of error response) to
  `errorBody` so it no longer shadows the outer request `body`
- permission-errors.test.ts: collapse the four `it()` blocks for 4xx
  short-circuit / 5xx capture into two `it.each()` tables, and add 499
  + 500 explicitly so an off-by-one in the `status < 500` guard would
  be caught

Generated-By: PostHog Code
Task-Id: 22c240c9-b71a-4acf-b137-1187eb679ba8
@rafaeelaudibert rafaeelaudibert enabled auto-merge (squash) May 20, 2026 16:29
@rafaeelaudibert rafaeelaudibert merged commit 070fe8c into master May 20, 2026
145 checks passed
@rafaeelaudibert rafaeelaudibert deleted the posthog-code/mcp-4xx-no-exception-capture branch May 20, 2026 16:53
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 20, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-20 17:34 UTC Run
prod-us ✅ Deployed 2026-05-20 17:53 UTC Run
prod-eu ✅ Deployed 2026-05-20 17:52 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant