feat: transcript-based session audit with strict extraction rules by George-iam · Pull Request #4 · AxmeAI/axme-code

George-iam · 2026-04-05T08:31:29Z

Summary

The session auditor previously ran on worklog events (session_start, memory_saved, etc.), which contain almost none of what actually happened during a session. The audit guessed from filesChanged alone and extracted meta-knowledge about the code instead of real user feedback, producing storage-module junk.

This PR rewires the audit to read the actual Claude Code transcript (jsonl conversation file), filter it down to real signal, and run extraction with stricter rules.

Changes

Transcript-based audit pipeline

New ClaudeSessionRef type on SessionMeta + attachClaudeSession() helper. Hooks read session_id and transcript_path from Claude Code's hook event stdin (which we were discarding before) and attach them to the current AXME session. The field is a list so that future multi-agent scenarios (tester, reviewer sub-agents) can attach their own transcripts to the same AXME session.
Attach happens in PreToolUse (so every tool call, even a blocked one, registers the session), PostToolUse, and SessionEnd (a safety net for read-only sessions).
New src/transcript-parser.ts filters the raw jsonl (1-2 MB per session) down to the actual conversation (~4% of raw). Keeps user text, assistant text >=80 chars, thinking blocks, compact tool_use lines. Drops tool results, IDE notifications, system reminders.
session-cleanup.ts prefers the filtered transcript over the worklog when available. Worklog is kept as fallback.
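
The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/transcript-parser.ts: the PR does not show the transcript schema, so the `TranscriptEntry` and `Block` shapes, the `filterTranscript` name, the output prefixes, and the 120-character cap on tool_use lines are all assumptions. Detection of IDE notifications and system reminders is left out because the description does not say how they are marked.

```typescript
// Hypothetical shapes: assumed for illustration, not taken from the PR.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "tool_use"; name: string; input?: unknown }
  | { type: "tool_result"; content?: unknown };

interface TranscriptEntry {
  type: string; // assumed values: "user", "assistant", "system", ...
  message?: { content: string | Block[] };
}

// Threshold stated in the PR description.
const MIN_ASSISTANT_CHARS = 80;

function filterTranscript(rawJsonl: string): string[] {
  const kept: string[] = [];
  for (const line of rawJsonl.split("\n")) {
    if (line.trim() === "") continue;
    let entry: TranscriptEntry;
    try {
      entry = JSON.parse(line) as TranscriptEntry;
    } catch {
      continue; // skip malformed lines instead of aborting the audit
    }
    // Anything that is not a user or assistant turn is dropped.
    if (entry.type !== "user" && entry.type !== "assistant") continue;
    const content = entry.message?.content;
    if (content === undefined) continue;
    const blocks: Block[] =
      typeof content === "string" ? [{ type: "text", text: content }] : content;
    for (const block of blocks) {
      if (block.type === "text") {
        const text = block.text.trim();
        if (entry.type === "user" && text !== "") {
          kept.push(`USER: ${text}`);
        } else if (entry.type === "assistant" && text.length >= MIN_ASSISTANT_CHARS) {
          kept.push(`ASSISTANT: ${text}`);
        }
      } else if (block.type === "thinking") {
        kept.push(`THINKING: ${block.thinking.trim()}`);
      } else if (block.type === "tool_use") {
        // Compact one-line form: tool name plus truncated arguments.
        const args = JSON.stringify(block.input ?? {}).slice(0, 120);
        kept.push(`TOOL: ${block.name} ${args}`);
      }
      // tool_result blocks (and unknown block types) are dropped.
    }
  }
  return kept;
}
```

Returning one string per kept item keeps the output easy to join into a prompt and easy to measure against the raw file size.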

Audit prompt and model

Upgraded Sonnet -> Opus 4.6. Strict 'default is nothing' rules are followed more reliably.
Audit agent now has Read/Grep/Glob tools but NOT Bash. This was the critical fix for handoff drift: a Bash-enabled auditor naturally runs `git status` and reports the CURRENT state of the workspace into the handoff, which is wrong - handoff must reflect the end of the audited session.
Prompt rewritten with explicit REJECT/ACCEPT examples for decisions and an English-output requirement (output is in English even for non-English transcripts, with non-English quotes embedded inline as evidence).
Existing decisions and memories are passed into the prompt as 'do not re-extract these' context.
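
A rough sketch of how the tool restriction and the "do not re-extract these" context could fit together. Everything here is hypothetical: `ExistingEntry`, `AUDIT_TOOLS`, `buildAuditPrompt`, and the prompt wording are illustrative stand-ins, not the prompt or agent options used in this PR.

```typescript
// Hypothetical entry shape for already-stored decisions and memories.
interface ExistingEntry {
  slug: string;
  title: string;
}

// Read-only allowlist as described above. Bash is deliberately absent, so the
// auditor cannot inspect the live workspace (for example via `git status`).
const AUDIT_TOOLS: readonly string[] = ["Read", "Grep", "Glob"];

function buildAuditPrompt(
  filteredTranscript: string,
  decisions: ExistingEntry[],
  memories: ExistingEntry[]
): string {
  // Render known entries as a list the model is told to skip.
  const list = (entries: ExistingEntry[]): string =>
    entries.length === 0
      ? "(none)"
      : entries.map((e) => `- ${e.slug}: ${e.title}`).join("\n");

  // Illustrative wording only; the real prompt is not shown in the PR.
  return [
    "Default is nothing: extract an item only when the transcript clearly supports it.",
    "Write all output in English; embed non-English quotes inline as evidence.",
    "",
    "Existing decisions (do not re-extract these):",
    list(decisions),
    "",
    "Existing memories (do not re-extract these):",
    list(memories),
    "",
    "Transcript:",
    filteredTranscript,
  ].join("\n");
}
```

Passing the existing entries in the prompt covers the common duplicates cheaply, while the Grep tool remains available for checking candidates against the stored files.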

Side cleanup

Removed `budgetUsd` / `maxBudgetUsd` from every agent (scanners, memory extractor, session auditor, buildAgentQueryOptions). Budget caps truncate LLM output mid-task and pollute storage modules with low-quality artifacts.

Dry-run results

Validated on transcripts 0d770f0c (1.4 MB, 13 user turns) and 1df5d43d (926 KB, 7 user turns) before committing.

Metric	Old (Sonnet + worklog)	New (Opus + transcript)
False-positive memories	2-3 per session	0 per session
False-positive decisions	3-5 per session	0-1 per session
Memory quality	generic meta-knowledge	user corrections with inline evidence quotes
Handoff drift	significant	none
Language	mixed	English + inline quotes
Cost per session	$0.11-0.12	$0.48-0.66

Cost is 4-6x higher but quality is the difference between usable and unusable storage entries.
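
The 4-6x figure follows from the cost ranges in the table. A small check, with variable names chosen here for illustration:

```typescript
// Cost-per-session ranges copied from the dry-run table (USD).
const oldCost = { min: 0.11, max: 0.12 };
const newCost = { min: 0.48, max: 0.66 };

// Smallest ratio: cheapest new run against the most expensive old run.
const lowRatio = newCost.min / oldCost.max;
// Largest ratio: most expensive new run against the cheapest old run.
const highRatio = newCost.max / oldCost.min;

// One decimal place is enough here and hides floating-point noise.
const costRatioSummary = `${lowRatio.toFixed(1)}x to ${highRatio.toFixed(1)}x`;
```

`costRatioSummary` evaluates to "4.0x to 6.0x".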

Test plan (manual, post-merge)

Not in this PR

Multi-agent auditor variant. The `claudeSessions` field supports it, but exercising it requires the tester/reviewer agent work, which is separate.
Worklog enrichment. Transcript covers the same need more reliably.

Previously the session auditor ran on worklog events (session_start, memory_saved, etc.), which contain almost none of what actually happened during a session. The audit was guessing from filesChanged alone and extracting meta-knowledge about the code instead of real user feedback, producing storage-module junk.

This PR rewires the audit to read the actual Claude Code transcript (the jsonl conversation file), filter it down to real signal, and run the extraction with stricter rules.

## Transcript-based audit

- New ClaudeSessionRef type on SessionMeta. Hooks read session_id and transcript_path from Claude Code's hook event stdin and call the new attachClaudeSession() helper. The field is a list so future multi-agent scenarios (tester, reviewer sub-agents) can attach their own transcripts to the same AXME session.
- Attach happens in PreToolUse (so every tool call, even blocked ones, registers the session), PostToolUse, and SessionEnd (safety net for read-only sessions).
- New src/transcript-parser.ts filters the raw jsonl (~1-2 MB per session) down to the actual conversation (~4% of raw). Kept: user text, assistant text >=80 chars, thinking blocks, compact tool_use lines. Dropped: tool results, IDE notifications, system reminders, queue-operations.
- session-cleanup.ts now prefers the filtered transcript over the worklog when a Claude session is attached. Worklog is kept as fallback for sessions with no transcript attached (pre-fix sessions, edge cases).

## Audit prompt and model changes

- Upgraded from Sonnet to Opus 4.6: strict "default is nothing" rules are followed more reliably.
- Audit agent now has Read/Grep/Glob tools (but NOT Bash). It can Grep .axme-code/memory/ and .axme-code/decisions/ to dedup candidates, but cannot read live repo state. This was the critical fix for handoff drift: a Bash-enabled auditor will naturally run git status and report the CURRENT state of the workspace into the handoff, which is wrong. The handoff must reflect the end of the audited session, not today.
- Prompt rewritten with strict REJECT/ACCEPT examples for decisions and an explicit language requirement (output in English even for non-English transcripts, non-English quotes embedded inline as evidence).
- Existing decisions and memories are now passed into the prompt as "do not re-extract these" context. This prevents duplicate extractions.

## Side cleanups

- Removed budgetUsd / maxBudgetUsd from every agent (scanners, memory extractor, session auditor). Budget caps truncate LLM output mid-task and produce low-quality artifacts that then pollute storage modules. See .axme-code/memory/feedback/no-llm-budget-caps.md.
- The now-unused ensureSession destructive fallback was already cleaned up in PR #3; this PR removes the last budgetUsd vestige from buildAgentQueryOptions.

## Dry-run results on two real sessions

Validated on transcripts 0d770f0c (1.4 MB, 13 user turns) and 1df5d43d (926 KB, 7 user turns) before committing:

| Metric | Old (Sonnet + worklog) | New (Opus + transcript) |
|--------------------|------------------------|-------------------------|
| False-positive memories | 2-3 per session | 0 per session |
| False-positive decisions | 3-5 per session | 0-1 per session |
| Memory quality | generic meta-knowledge | user corrections with inline evidence quotes |
| Handoff drift | significant | none (tool restriction) |
| Language | mixed | English + inline quotes |
| Cost per session | $0.11-0.12 | $0.48-0.66 |

Cost is 4-6x higher but quality is the difference between usable and unusable storage-module entries.

## Not in this PR

- Worklog enrichment (logging tool_call / turn events). Transcript covers the same need more reliably, so worklog can stay sparse.
- Multi-agent auditor variant. The claudeSessions field supports it, but exercising it requires the tester/reviewer agent work, which is separate.

George-iam merged commit ad88bd0 into main Apr 5, 2026

George-iam deleted the feat/transcript-based-audit-20260405 branch April 7, 2026 08:01

