feat: transcript-based session audit with strict extraction rules #4
Merged
George-iam merged 1 commit into main on Apr 5, 2026
Previously the session auditor ran on worklog events (session_start, memory_saved, etc.), which contain almost none of what actually happened during a session. The audit was guessing from filesChanged alone and extracting meta-knowledge about the code instead of real user feedback, producing storage-module junk.

This PR rewires the audit to read the actual Claude Code transcript (the jsonl conversation file), filter it down to real signal, and run the extraction with stricter rules.

## Transcript-based audit

- New ClaudeSessionRef type on SessionMeta. Hooks read session_id and transcript_path from Claude Code's hook event stdin and call the new attachClaudeSession() helper. The field is a list so future multi-agent scenarios (tester, reviewer sub-agents) can attach their own transcripts to the same AXME session.
- Attach happens in PreToolUse (so every tool call, even blocked ones, registers the session), PostToolUse, and SessionEnd (safety net for read-only sessions).
- New src/transcript-parser.ts filters the raw jsonl (~1-2 MB per session) down to the actual conversation (~4% of raw). Kept: user text, assistant text >=80 chars, thinking blocks, compact tool_use lines. Dropped: tool results, IDE notifications, system reminders, queue-operations.
- session-cleanup.ts now prefers the filtered transcript over the worklog when a Claude session is attached. Worklog is kept as fallback for sessions with no transcript attached (pre-fix sessions, edge cases).

## Audit prompt and model changes

- Upgraded from Sonnet to Opus 4.6: strict "default is nothing" rules are followed more reliably.
- Audit agent now has Read/Grep/Glob tools (but NOT Bash). It can Grep .axme-code/memory/ and .axme-code/decisions/ to dedup candidates, but cannot read live repo state. This was the critical fix for handoff drift: a Bash-enabled auditor will naturally run git status and report the CURRENT state of the workspace into the handoff, which is wrong. The handoff must reflect the end of the audited session, not today.
- Prompt rewritten with strict REJECT/ACCEPT examples for decisions and an explicit language requirement (output in English even for non-English transcripts, with non-English quotes embedded inline as evidence).
- Existing decisions and memories are now passed into the prompt as "do not re-extract these" context. This prevents duplicate extractions.

## Side cleanups

- Removed budgetUsd / maxBudgetUsd from every agent (scanners, memory extractor, session auditor). Budget caps truncate LLM output mid-task and produce low-quality artifacts that then pollute storage modules. See .axme-code/memory/feedback/no-llm-budget-caps.md.
- The now-unused ensureSession destructive fallback was already cleaned up in PR #3; this PR removes the last budgetUsd vestige from buildAgentQueryOptions.

## Dry-run results on two real sessions

Validated on transcripts 0d770f0c (1.4 MB, 13 user turns) and 1df5d43d (926 KB, 7 user turns) before committing:

| Metric | Old (Sonnet + worklog) | New (Opus + transcript) |
|--------------------------|------------------------|-------------------------|
| False-positive memories | 2-3 per session | 0 per session |
| False-positive decisions | 3-5 per session | 0-1 per session |
| Memory quality | generic meta-knowledge | user corrections with inline evidence quotes |
| Handoff drift | significant | none (tool restriction) |
| Language | mixed | English + inline quotes |
| Cost per session | $0.11-0.12 | $0.48-0.66 |

Cost is 4-6x higher but quality is the difference between usable and unusable storage-module entries.

## Not in this PR

- Worklog enrichment (logging tool_call / turn events). Transcript covers the same need more reliably, so worklog can stay sparse.
- Multi-agent auditor variant. The claudeSessions field supports it, but exercising it requires the tester/reviewer agent work, which is separate.
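The attach-and-fallback flow described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the field names on ClaudeSessionRef and SessionMeta, and the helpers parseHookEvent and pickAuditSource, are assumptions. Only session_id, transcript_path, claudeSessions, and attachClaudeSession() come from the description.

```typescript
// Hypothetical shapes: the PR names ClaudeSessionRef, SessionMeta and
// attachClaudeSession(), but the exact fields below are assumptions.
interface ClaudeSessionRef {
  sessionId: string; // session_id from the hook event stdin
  transcriptPath: string; // transcript_path from the hook event stdin
}

interface SessionMeta {
  id: string;
  // A list, so sub-agents could attach their own transcripts later.
  claudeSessions: ClaudeSessionRef[];
}

// Hypothetical helper: parse the hook event JSON read from stdin.
// Returns null when the payload is malformed or the fields are missing.
function parseHookEvent(stdin: string): ClaudeSessionRef | null {
  try {
    const event: unknown = JSON.parse(stdin);
    if (typeof event !== "object" || event === null) return null;
    const { session_id, transcript_path } = event as Record<string, unknown>;
    if (typeof session_id !== "string" || typeof transcript_path !== "string") {
      return null;
    }
    return { sessionId: session_id, transcriptPath: transcript_path };
  } catch {
    return null;
  }
}

// Idempotent attach: PreToolUse, PostToolUse and SessionEnd may all fire
// for the same Claude session, so a repeated ref is ignored.
function attachClaudeSession(meta: SessionMeta, ref: ClaudeSessionRef): SessionMeta {
  const known = meta.claudeSessions.some((s) => s.sessionId === ref.sessionId);
  return known ? meta : { ...meta, claudeSessions: [...meta.claudeSessions, ref] };
}

type AuditSource =
  | { kind: "transcript"; paths: string[] }
  | { kind: "worklog" };

// Hypothetical helper: prefer attached transcripts, fall back to the worklog.
function pickAuditSource(meta: SessionMeta): AuditSource {
  return meta.claudeSessions.length > 0
    ? { kind: "transcript", paths: meta.claudeSessions.map((s) => s.transcriptPath) }
    : { kind: "worklog" };
}
```

Making the attach idempotent is one way to let all three hooks call it without coordinating; the actual helper may deduplicate differently.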
Summary
The session auditor previously ran on worklog events (session_start, memory_saved, etc.) which contain almost none of what actually happened. The audit guessed from filesChanged alone and extracted meta-knowledge about the code instead of real user feedback - producing storage-module junk.
This PR rewires the audit to read the actual Claude Code transcript (jsonl conversation file), filter it down to real signal, and run extraction with stricter rules.
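The filtering step can be sketched roughly as follows, using the keep/drop rules stated in this PR (user text, assistant text of at least 80 characters, thinking blocks, compact tool_use lines kept; tool results and non-conversation entries dropped). The entry and block shapes, the output prefixes, and the function name filterTranscript are assumptions for illustration; the real jsonl schema and src/transcript-parser.ts may differ. Detection of IDE notifications and system reminders is omitted because their markers are not given here.

```typescript
// Assumed block and entry shapes; the real transcript schema may differ.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "tool_use"; name: string; input?: unknown }
  | { type: "tool_result"; content?: unknown };

interface Entry {
  type: string; // e.g. "user", "assistant", "system", "queue-operation"
  message?: { content: string | Block[] };
}

const MIN_ASSISTANT_CHARS = 80;

// Hypothetical filter: returns one compact line per kept block.
function filterTranscript(jsonl: string): string[] {
  const kept: string[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as Entry;
    // Only conversation turns survive; system and queue entries are dropped.
    if (entry.type !== "user" && entry.type !== "assistant") continue;
    const content = entry.message?.content;
    const blocks: Block[] =
      typeof content === "string"
        ? [{ type: "text", text: content }]
        : Array.isArray(content)
          ? content
          : [];
    for (const block of blocks) {
      if (block.type === "text") {
        const text = block.text.trim();
        if (entry.type === "user" && text.length > 0) {
          kept.push(`USER: ${text}`);
        } else if (entry.type === "assistant" && text.length >= MIN_ASSISTANT_CHARS) {
          kept.push(`ASSISTANT: ${text}`);
        }
      } else if (block.type === "thinking") {
        kept.push(`THINKING: ${block.thinking.trim()}`);
      } else if (block.type === "tool_use") {
        kept.push(`TOOL: ${block.name}`); // compact: tool name only, input dropped
      }
      // tool_result and unknown block types are dropped
    }
  }
  return kept;
}
```

Dropping tool results is where most of the size reduction would come from in a sketch like this, since file reads and command output dominate the raw jsonl.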
Changes
Transcript-based audit pipeline
- `ClaudeSessionRef` on `SessionMeta` + `attachClaudeSession()` helper. Hooks read `session_id` and `transcript_path` from Claude Code's hook event stdin (which we were discarding before) and attach them to the current AXME session. The field is a list to support future multi-agent scenarios (tester, reviewer sub-agents).
- `src/transcript-parser.ts` filters the raw jsonl (1-2 MB per session) down to the actual conversation (~4% of raw). Keeps user text, assistant text >=80 chars, thinking blocks, compact tool_use lines. Drops tool results, IDE notifications, system reminders.
- `session-cleanup.ts` prefers the filtered transcript over the worklog when available. Worklog is kept as fallback.
Audit prompt and model
Side cleanup
Dry-run results
Validated on transcripts 0d770f0c (1.4 MB, 13 user turns) and 1df5d43d (926 KB, 7 user turns) before committing.
Cost is 4-6x higher but quality is the difference between usable and unusable storage entries.
Test plan (manual, post-merge)
Not in this PR