Skip to content

fix(telemetry): extend error classifier + stamp category/fatal on failed audits (B-007)#106

Merged
George-iam merged 1 commit intomainfrom
fix/telemetry-error-vocab-20260414
Apr 14, 2026
Merged

fix(telemetry): extend error classifier + stamp category/fatal on failed audits (B-007)#106
George-iam merged 1 commit intomainfrom
fix/telemetry-error-vocab-20260414

Conversation

@George-iam
Copy link
Copy Markdown
Contributor

Summary

Fixes B-007: admin "Top error classes" panel was useless for triage — every failed audit collapsed into `error_class='unknown'` with `category=NULL` and `fatal=NULL`.

Root cause (two parts)

Prod query for the last 30 days:

event category error_class fatal n first_seen last_seen
audit_complete (null) unknown (null) 16 2026-04-12 2026-04-14

All 16 failures — same opaque bucket, impossible to distinguish B-006 from anything else.

  1. `classifyError()` had no pattern for `ERR_INVALID_ARG_TYPE` / `fileURLToPath(undefined)`. B-006 errors degraded to `unknown`.
  2. `audit_complete` with `outcome='failed'` didn't set `category` or `fatal`, so failed audits missed the `(category, error_class)` composite index on the backend.

Changes

`src/telemetry.ts` — vocabulary

Added specific Node codes (matched before generic fallbacks):

  • `node_invalid_arg` — `ERR_INVALID_ARG_TYPE` / `fileURLToPath` / `argument must be of type ... received undefined`
  • `module_not_found` — `ERR_MODULE_NOT_FOUND` / `Cannot find module|package`
  • `spawn_error` — `spawn ENOENT` / `spawn EACCES`
  • `out_of_memory` — `ENOMEM` / `heap out of memory` / `allocation failed`

Added generic JS kinds as last resort (via `err.name` so bland messages still classify):

  • `type_error`, `reference_error`

Also: network catches `econnreset` alongside `econnrefused`. Load-bearing match order documented in code comment.

`src/session-cleanup.ts` — stamp category/fatal on failed audits

When `outcome === 'failed'`, stamp `category: 'audit'` and `fatal: false`. Audit failures are non-fatal — session still closes, only background extraction is lost until next attempt. This gets failed audits onto the backend's composite index so they show up in the error panel alongside other categories.

`test/telemetry.test.ts` — regression tests

  • Exact B-006 `TypeError` message → `node_invalid_arg`
  • Order guard: same message through `TypeError` path must NOT degrade to `type_error`
  • One test per new class

Verification

  • `npm test` — 481/481 pass (6 new cases; prev baseline 478)
  • `npx tsc --noEmit` — clean
  • `npm run build` — clean
  • `grep -oE 'node_invalid_arg|module_not_found|spawn_error|out_of_memory|type_error|reference_error' dist/cli.mjs` — all 6 new slugs present

Test plan

Related

Fixes B-007.

🤖 Generated with Claude Code

…led audits

B-007 addressed the two root causes that made the admin "Top error classes"
panel useless for triage:

1) `classifyError()` had no pattern for `ERR_INVALID_ARG_TYPE` /
   `fileURLToPath(undefined)` — every B-006 failure collapsed into `unknown`.
   All 16 failed audits in the last 30 days on prod landed in the same
   opaque bucket, impossible to distinguish from unrelated regressions.

2) `audit_complete` with `outcome="failed"` never set `category` or `fatal`,
   so failed audits did not hit the (category, error_class) composite index
   on the backend. They showed up with NULL category on the dashboard.

Vocabulary changes (src/telemetry.ts):
- Add specific node codes: `node_invalid_arg`, `module_not_found`,
  `spawn_error`, `out_of_memory`. These match BEFORE the generic
  fallbacks below so B-006-class failures keep their triage signal.
- Add generic JS kinds as last resort before `unknown`:
  `type_error`, `reference_error` — dispatched via `err.name` rather
  than message text, so a bare `TypeError: x is not a function` at
  least lands in a non-empty bucket.
- Drop the `msg.includes("enoent") → transcript_not_found` shortcut;
  a bare ENOENT is a generic missing-file hit, not a transcript issue.
  Kept `transcript not found` literal as the transcript-specific matcher
  and generic ENOENT now falls through to `transcript_not_found` only
  after `spawn ENOENT` is checked.
- Network: `econnreset` added alongside `econnrefused`.
- Doc-comment the load-bearing match order.

audit_complete event (src/session-cleanup.ts):
- When `outcome === "failed"`, stamp `category: "audit"` and
  `fatal: false`. Audit failures are non-fatal — the session still
  closes and the user's work is unaffected; only background extraction
  is lost until the next attempt. Setting these fields lets the
  existing backend index surface failed audits in the same panel as
  other categorized errors.

Tests (test/telemetry.test.ts):
- B-006 reproducer: exact TypeError message from the audit-worker-logs
  → `node_invalid_arg`.
- Order guard: the same message through the `TypeError` path must NOT
  degrade into the generic `type_error` fallback.
- One test per new class (module_not_found, spawn_error, out_of_memory,
  type_error, reference_error fallback).

Verified:
- 481/481 unit tests pass (6 new cases; full suite was 478 before)
- `tsc --noEmit` clean
- `npm run build` clean
- `grep` in `dist/cli.mjs` shows all 6 new slugs present in the bundle

Follow-up for v0.2.8 release: re-query
  SELECT error_class, COUNT(*)
  FROM telemetry_events
  WHERE event='audit_complete' AND outcome='failed'
  GROUP BY error_class
  ORDER BY COUNT(*) DESC
`unknown` should drop from 100% to a small minority; `node_invalid_arg`
should be 0 on v0.2.8 (B-006 fixed in PR #105).

Fixes B-007.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> #!axme pr=none repo=AxmeAI/axme-code
@George-iam George-iam merged commit 930d251 into main Apr 14, 2026
George-iam added a commit that referenced this pull request Apr 14, 2026
Patch release containing three bug fixes already merged on main:

- B-006 (#105): audit worker fileURLToPath(undefined) crash on every
  session close. pathToClaudeCodeExecutable now set on all three
  direct sdk.query() call sites in session-auditor + memory-extractor.
- B-007 (#106): classifyError vocabulary extended with node_invalid_arg
  / module_not_found / spawn_error / out_of_memory / type_error /
  reference_error. audit_complete failures now stamp category="audit"
  and fatal=false so they index correctly on the backend.
- B-008 (#107): #!axme safety gate regex tightened so a closing quote
  from a surrounding -m "..." string no longer gets glued onto the
  parsed repo name. Hook stops false-blocking commits on every retry.

Files bumped:
- package.json
- .claude-plugin/plugin.json
- templates/plugin-README.md (version badge)

CHANGELOG entry added under [0.2.8] - 2026-04-14.

Verified: 489/489 unit tests pass, npx tsc --noEmit clean,
npm run build clean.

Release flow after this PR merges:
1. user runs: git tag v0.2.8 && git push origin v0.2.8
2. release-binary.yml workflow auto-runs the chain:
   build (4 platforms) -> GitHub Release ->
   npm publish @axme/code@0.2.8 -> sync to axme-code-plugin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant