
feat(claude-plugin): record guideline usage per session in audit.log#239

Merged
visahak merged 10 commits into AgentToolkit:main from vinodmut:provenance-usage
May 2, 2026

Conversation

@vinodmut
Contributor

@vinodmut vinodmut commented Apr 30, 2026

Summary

Adds the reverse provenance direction to complement #236: now we know not just which session produced a guideline, but also which sessions used it, and whether it actually influenced the agent's behavior.

  • recall/retrieve_entities.py — every UserPromptSubmit appends one JSONL record to .evolve/audit.log with the served entity slugs and the session_id derived from transcript_path. Audit failures are swallowed so logging cannot break the user-visible recall path.
  • learn/SKILL.md — adds Step 4 (Assess Influence). The forked learn skill reads .evolve/audit.log, reconstructs the set of guidelines served to this session, and for each one emits a verdict (followed | contradicted | not_applicable) with a short evidence sentence grounded in the transcript.
  • learn/scripts/log_influence.py — new stdin → audit bridge. Validates the verdict against the allowed set and writes it back via the existing audit.append() helper (lib/audit.py).
  • tests/e2e/test_sandbox_learn_recall.py — the existing end-to-end test is extended with assertions that after session 2, the audit log contains both a recall event (with at least one slug from session 1) and an influence event (with a valid verdict).

No new dependencies, no entity-file mutation, no per-entity write contention — everything is append-only JSONL.
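The append-only recall logging described above can be sketched as follows. This is a minimal illustration, not the actual `audit.append()` helper from `lib/audit.py`: the function name, timestamp format, and error handling here are assumptions; only the event fields follow the example log in this description.

```python
import json
import time
from pathlib import Path


def append_recall_event(audit_path: Path, session_id: str, entities: list[str]) -> None:
    """Append one NDJSON recall record; never raise, so logging cannot break recall."""
    try:
        record = {
            "event": "recall",
            "session_id": session_id,
            "entities": entities,
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        }
        audit_path.parent.mkdir(parents=True, exist_ok=True)
        with audit_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
    except Exception:
        pass  # swallowed: audit failures must not break the user-visible recall path
```

Because each event is one self-contained line appended atomically-enough for this use, concurrent sessions never contend on a shared entity file.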

Example audit log from a real run:

{"event":"recall","session_id":"4bce...","entities":["in-this-environment-use-raw-python-with-the-struct-module"],"ts":"..."}
{"event":"influence","session_id":"4bce...","entity":"in-this-environment-use-raw-python-with-the-struct-module","verdict":"followed","evidence":"Agent immediately used raw Python struct parsing for EXIF extraction without attempting exiftool or PIL first.","ts":"..."}
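Reconstructing the set of guidelines served to a session (what the forked learn skill does in its assessment step) then reduces to a single NDJSON scan. A minimal sketch, with `guidelines_served` as a hypothetical helper name and field names taken from the example log above:

```python
import json
from pathlib import Path


def guidelines_served(audit_path: Path, session_id: str) -> set[str]:
    """Union of `entities` across all recall events for one session."""
    served: set[str] = set()
    for line in audit_path.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a torn or partial trailing line
        if rec.get("event") == "recall" and rec.get("session_id") == session_id:
            served.update(rec.get("entities", []))
    return served
```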

Addresses the second half of #180 (Explainability & Provenance) — specifically "Track entity usage — record when an entity is retrieved and whether it influenced the outcome" and "A user can see which entities were used in a given session."

Test plan

  • End-to-end: pytest tests/e2e/test_sandbox_learn_recall.py -m e2e passes (3:15 on macOS with the claude-sandbox image).
  • Manually inspected the resulting audit.log — both recall and influence events land with session_id correctly derived and evidence grounded in the transcript.
  • Codex and bob integrations still use their own recall scripts and don't emit these events — follow-up PRs will propagate the pattern.

Follow-ups (not in this PR)

  • Viz rendering of the new events (altk_evolve/viz/) — natural place to surface "sessions that used guideline X" once the log schema settles here.
  • Per-turn influence granularity (currently we emit one verdict per (session, entity); a long multi-turn session can only report one verdict even if relevance changed across turns).
  • Bob and codex parity.

Summary by CodeRabbit

  • New Features

    • Adds automated post-processing that links recalled guidelines to session impact, emitting validated per-entity verdicts (followed/contradicted/not_applicable) with brief evidence and recording influence events.
    • Improves entity identification to avoid ambiguity across projects.
  • Documentation

    • Updated learning workflow docs to describe the new assessment/evidence step.
  • Tests

    • Expanded tests to cover provenance audit parsing, influence recording, validation rules, and malformed-input handling.

Adds the reverse provenance direction: which sessions used a given
guideline. Complements the existing source-trajectory stamp on each
entity.

- recall/retrieve_entities.py now appends one `recall` event per
  UserPromptSubmit to .evolve/audit.log, listing the served entity
  slugs and the session_id derived from transcript_path. Failures are
  swallowed so logging cannot break the user-visible recall path.
- learn/SKILL.md gains a Step 4 that reads audit.log, reconstructs the
  set of guidelines served to this session, and emits per-entity
  verdicts (followed | contradicted | not_applicable) with a short
  evidence line.
- A new log_influence.py script validates and writes those verdicts
  back into audit.log.
- The e2e test asserts both event types land correctly after session 2.
@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Warning

Rate limit exceeded

@vinodmut has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 32 minutes and 27 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9b82a557-0564-498d-830a-98ea79864e21

📥 Commits

Reviewing files that changed from the base of the PR and between a18073c and cbef977.

📒 Files selected for processing (1)
  • tests/platform_integrations/test_log_influence.py
📝 Walkthrough


Adds recall-to-influence provenance: recall events now include entity _ids; a new log_influence.py ingests JSON assessments and writes influence audit events; recall script appends recall events; tests validate recall and influence entries in .evolve/audit.log.

Changes

  • Workflow Documentation (platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md): adds a new step describing how recall provenance links to impact evaluation, emitting a JSON payload to log_influence.py for the current session.
  • Influence Logging Script (platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py): new executable that reads JSON from stdin, validates the payload and verdicts (followed, contradicted, not_applicable), appends influence events to the evolve audit log, logs counts, and exits non-zero on a malformed top-level payload or invalid JSON.
  • Entity Recall + Audit (platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py): adds an _id derived from the entity filepath (relative stem) when loading entities; performs a best-effort audit.append(event="recall", session_id, entities) using get_evolve_dir/audit, swallowing audit errors so recall still succeeds.
  • End-to-End Tests (tests/e2e/test_sandbox_learn_recall.py): extends the E2E test to parse .evolve/audit.log (ndjson), assert a recall event for session 2 containing at least one entity from session 1, and assert at least one influence event for session 2 with an allowed verdict.
  • Unit/Integration Tests for Script (tests/platform_integrations/test_log_influence.py): new test suite running log_influence.py via subprocess; verifies correct influence audit writes for valid payloads, that invalid assessments are skipped, and that malformed payloads and bad JSON exit with code 1 without writing events.

Sequence Diagram

sequenceDiagram
    participant Transcript as Input Transcript
    participant Recall as Recall Service
    participant Audit as Audit Log
    participant Assess as log_influence.py
    participant Test as Test Runner

    Transcript->>Recall: request recall (transcript_path)
    Recall->>Recall: load entity files, set `_id` from path
    Recall->>Audit: append "recall" event (project_root, session_id, entities)

    Assess->>Audit: read `.evolve/audit.log` for session's recall events
    Assess->>Assess: for each recalled entity load guideline, determine verdict & evidence
    Assess->>Audit: append "influence" event (project_root, session_id, entity, verdict, evidence)

    Test->>Audit: read audit log entries
    Test->>Test: assert presence/structure of recall & influence events

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested Reviewers

  • visahak
  • gaodan-fang

Poem

🐇 I hopped through logs with curious paws,

Found stems and IDs beneath the laws.
I scribble verdicts, short and spry,
Then stamp the audit—watch them fly. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 21.05%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the PR title accurately summarizes the main change, recording guideline usage (recall and influence verdicts) per session in the audit log, and directly reflects the core objective of adding reverse provenance logging.
  • Linked Issues check (✅ Passed): check skipped; no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped; no linked issues were found for this pull request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/e2e/test_sandbox_learn_recall.py (1)

168-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add unit tests for the new influence logging paths.

This adds solid e2e coverage, but the new feature still needs unit tests for log_influence.py validation branches (bad payload shape, non-dict assessments, invalid verdict skipping) so failures are localized and deterministic.

As per coding guidelines tests/**/*.py: “All new features need tests (unit + e2e where applicable)”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_sandbox_learn_recall.py` around lines 168 - 213, Add unit
tests for log_influence.py that target its validation branches: create tests
that call the public entry points (e.g., the main handler function in
log_influence.py such as process_influence or validate_influence_payload) with
(1) a malformed payload shape (missing keys / wrong types) to assert it raises
or returns the expected validation error, (2) payloads where "assessments" is
not a dict or contains non-dict entries to assert those entries are
skipped/flagged, and (3) assessments containing invalid "verdict" values to
assert those assessments are ignored and do not produce influence events; assert
the function logs or returns the correct diagnostics and does not produce
side-effect influence records for invalid inputs. Ensure tests import
log_influence.py functions directly and stub/mock any I/O so failures are
deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py`:
- Around line 39-55: The handler currently assumes payload and its items are
dicts (using payload.get(...) and a.get(...)), which raises AttributeError for
non-dict JSON; first validate that payload is a mapping (e.g.,
isinstance(payload, dict)) before accessing payload.get("session_id") and
payload.get("assessments"), and ensure assessments is a list of dicts by
filtering or checking each item in the for loop (in the loop over assessments,
replace direct a.get(...) with a type guard: if not isinstance(a, dict): log and
continue), then extract entity/verdict/evidence from dicts only and skip/log
invalid entries; update the checks around session_id, assessments, and the loop
over a in assessments (refer to session_id, assessments, and the loop variable
a) to avoid AttributeError and to produce clear logs when items are skipped.

In `@platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md`:
- Around line 125-127: The current flow reads recalled slugs from
`.evolve/audit.log` (event == "recall", session_id) and then only opens
`.evolve/entities/guideline/<slug>.md`, which misses slugs served from
subscribed trees; update the resolution step so for each slug you attempt to
resolve it across subscribed trees rather than only
`.evolve/entities/guideline/`. Implement or call a resolver (e.g.,
resolveEntityPath(slug, subscribedTrees) or similar) that searches
`.evolve/entities/<tree>/.../<slug>.md` (falling back to
`.evolve/entities/guideline/<slug>.md` if needed), then read the resolved file
content + trigger for influence assessment. Ensure the union of `entities` from
audit.log is used with this multi-tree resolution.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f1f237e-addc-4971-96a9-9efbbf92934f

📥 Commits

Reviewing files that changed from the base of the PR and between 6166cf0 and 98e7c32.

📒 Files selected for processing (4)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • tests/e2e/test_sandbox_learn_recall.py

Fixes failing CI check: check-formatting
Validate that the JSON payload is a mapping before calling
payload.get(), and skip non-dict assessment items instead of letting
a.get() raise AttributeError. Logs each skipped item for traceability.

Addresses CodeRabbit review finding: Guard payload and assessment item types before calling `.get()`
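The guards that commit describes amount to an isinstance check before every `.get()`. A minimal sketch, assuming the verdict set from this PR; the function name and error messages are illustrative, not the actual script:

```python
import json
import sys

ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def parse_assessments(raw: str) -> list[dict]:
    """Validate the stdin payload; return only well-formed assessment dicts."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        print("error: invalid JSON payload", file=sys.stderr)
        raise SystemExit(1)
    # Top-level guards: a malformed payload exits non-zero.
    if not isinstance(payload, dict):
        print("error: payload must be a JSON object", file=sys.stderr)
        raise SystemExit(1)
    if not isinstance(payload.get("session_id"), str):
        print("error: missing session_id", file=sys.stderr)
        raise SystemExit(1)
    assessments = payload.get("assessments")
    if not isinstance(assessments, list):
        print("error: assessments must be a list", file=sys.stderr)
        raise SystemExit(1)
    # Per-item guards: invalid items are logged and skipped, not fatal.
    valid = []
    for a in assessments:
        if not isinstance(a, dict):
            print(f"skipping non-dict assessment item: {a!r}", file=sys.stderr)
            continue
        if not a.get("entity") or a.get("verdict") not in ALLOWED_VERDICTS:
            print(f"skipping invalid assessment: {a!r}", file=sys.stderr)
            continue
        valid.append(a)
    return valid
```

Splitting the guards this way matches the behavior the tests below exercise: top-level problems are fatal, per-item problems only skip that item.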
Step 4 previously hard-coded .evolve/entities/guideline/<slug>.md,
which misses entities served from subscribed repositories at
.evolve/entities/subscribed/<source>/guideline/<slug>.md. Switch the
instructions to a recursive search under .evolve/entities/ so the
influence assessment can reach the actual file wherever it lives.

Addresses CodeRabbit review finding: Resolve recalled slugs beyond only `.evolve/entities/guideline/`
Covers happy path (single/multiple assessments, default evidence),
per-item skip paths (invalid verdict, missing entity, non-dict item),
and top-level input validation (non-dict payload, missing session_id,
non-list assessments, invalid JSON).

Complements the e2e sandbox test with fast, deterministic coverage of
the helper's input-validation surface.

Addresses CodeRabbit review finding: Add unit tests for log_influence.py
@vinodmut vinodmut requested review from illeatmyhat and visahak April 30, 2026 18:44
@visahak
Collaborator

visahak commented May 1, 2026

Summary

This PR adds a provenance loop to the Claude evolve-lite plugin: recall now audits which guideline slugs were served, learn can log post-hoc influence assessments, and the stop hook includes the
saved trajectory path so new guidelines can point back to their source session. It also adds focused tests for log_influence.py, a sandbox e2e for the learn/recall flow, and docs for running that
sandbox test.

Findings

  1. Recall provenance collapses duplicate local/subscribed slugs (confidence: 96/100)
    • Why it matters: the new audit trail is supposed to let learn assess whether recalled guidance was followed, but it currently drops the subscribed-source identity and keeps only the filename stem.
      If the same slug exists locally and in a subscribed repo, the later influence step can attribute behavior to the wrong guideline.
    • Evidence: platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py:86-95 already knows whether an entity came from entities/subscribed/{source}/..., but
      :141-148 writes only entities=[_slug] to audit.log. Then platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md:125-127 tells learn to reopen the "first match" for that slug
      under .evolve/entities/, which is ambiguous across local and subscribed trees.
    • Example: I reproduced this with both .evolve/entities/guideline/shared-slug.md and .evolve/entities/subscribed/alice/guideline/shared-slug.md present. Recall served both entries, but the audit
      row recorded only entities: ["shared-slug"], so the later influence pass has no way to know which one was actually recalled.

Testing

  • Targeted agent e2e checks passed after installing the required extras:
    • uv run pytest tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s -v: passed
    • uv run pytest tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s -v: passed
  • Focused validation on the changed Claude plugin paths passed:
    • uv run pytest tests/platform_integrations/test_log_influence.py -v --run-e2e: 11 passed
    • uv run pytest tests/platform_integrations/test_retrieve.py -v: 18 passed
  • Sandbox learn/recall e2e was not runnable in this environment:
    • uv run pytest tests/e2e/test_sandbox_learn_recall.py -v --run-e2e: skipped locally because sandbox prerequisites/credentials were unavailable
  • Full suite status after installing extras:
    • uv run pytest -m e2e --run-e2e -v: 6 failed, 155 passed, 1 skipped, 32 errors, 524 deselected

Follow-up Issue

We are opening a separate issue to track the currently failing full-suite e2e/platform tests, since they do not clearly line up with this PR’s write set.

Failed tests from the full run:

  • tests/e2e/test_e2e_segmentation.py::test_segment_trajectory_min_subtasks
  • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_skips_invalid_subscription_name
  • tests/platform_integrations/test_bob_sharing.py::TestBobSync::test_rejects_dot_and_double_dot_names
  • tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_subscribe_warns_when_audit_write_fails
  • tests/platform_integrations/test_codex_sharing.py::TestCodexSharingScripts::test_sync_skips_invalid_subscription_name
  • tests/platform_integrations/test_sync.py::TestSync::test_skips_invalid_subscription_name

Additional full-run errors to track in that issue:

  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[manual_phoenix]
  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[simple_openai]
  • all selected tests/e2e/test_mcp.py cases from the full run
  • all selected tests/e2e/test_sharing.py cases from the full run

Notes for that issue:

  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[smolagents] passed when run individually
  • tests/e2e/test_e2e_pipeline.py::test_e2e_pipeline_agent[openai_agents] passed when run individually
  • the earlier package/import issue was resolved by installing the required extras
  • the full-suite pipeline errors were caused by the Phoenix fixture failing to bind gRPC on port 4317
  • the invalid-subscription-name failures are in sync/subscribe paths not modified by this PR

@visahak
Collaborator

visahak commented May 1, 2026

@vinodmut Not sure if the findings are a problem.

visahak
visahak previously approved these changes May 1, 2026
…trees

The recall audit log stored bare filename stems, so the same slug from
a local entity and a subscribed repo collapsed into one entry and the
influence step couldn't tell which guideline actually fired. Switch
the stored id to the path relative to .evolve/entities/ (without .md):
"guideline/foo" for local entries, "subscribed/alice/guideline/foo"
for subscribed ones. The id is unambiguous, names exactly one file,
and lets learn open it directly (no recursive search).

SKILL.md Step 4 is simplified accordingly — no more find / multi-tree
resolver; just Read .evolve/entities/<id>.md.

The e2e test matches session 1's guidelines against the new qualified
ids, and the existing log_influence unit tests pass unchanged.

Addresses review feedback from visahak

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)

42-199: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use temp_project_dir instead of tmp_path in these platform integration tests.

From Line 42 onward, these tests run a subprocess and do file I/O under tmp_path. For tests/platform_integrations/test_*.py, these operations should use temp_project_dir to keep integration-test isolation consistent with suite conventions.

Suggested patch pattern
-class TestLogInfluence:
-    def test_writes_single_assessment(self, tmp_path):
-        evolve_dir = tmp_path / ".evolve"
+class TestLogInfluence:
+    def test_writes_single_assessment(self, temp_project_dir):
+        evolve_dir = temp_project_dir / ".evolve"
         result = run_log_influence(
-            tmp_path,
+            temp_project_dir,
             {
                 "session_id": "abc-123",
                 "assessments": [
                     {"entity": "slug-a", "verdict": "followed", "evidence": "because"},
                 ],
             },
             evolve_dir=evolve_dir,
         )

Apply the same replacement across the remaining test methods.

As per coding guidelines, “All file operations in platform integration tests must use temp_project_dir fixture - never touch real files”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/platform_integrations/test_log_influence.py` around lines 42 - 199,
Replace the use of the tmp_path fixture with the temp_project_dir fixture in
these platform integration tests: change each test method signature (e.g.,
test_writes_single_assessment(self, tmp_path) ->
test_writes_single_assessment(self, temp_project_dir)) and update all usages of
tmp_path inside the test (e.g., tmp_path / ".evolve" -> temp_project_dir /
".evolve" and run_log_influence(tmp_path, ...) ->
run_log_influence(temp_project_dir, ...)); apply the same replacement for every
test function in this file (test_writes_multiple_assessments,
test_skips_assessments_with_invalid_verdict,
test_skips_assessments_missing_entity, test_skips_non_dict_assessment_items,
test_empty_assessments_list_is_ok, test_evidence_defaults_to_empty_string,
test_rejects_non_dict_payload, test_rejects_missing_session_id,
test_rejects_non_list_assessments, test_rejects_invalid_json) to ensure all
subprocess and file I/O use temp_project_dir.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: adf197b4-c721-4d19-aef0-faf4c1751a51

📥 Commits

Reviewing files that changed from the base of the PR and between 98e7c32 and c005954.

📒 Files selected for processing (5)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • tests/e2e/test_sandbox_learn_recall.py
  • tests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py

@vinodmut
Contributor Author

vinodmut commented May 1, 2026

@visahak thanks — your first finding reproduced exactly, and is fixed in c005954.

Change: the recall audit now records each entity as a path relative to .evolve/entities/ without the .md suffix, instead of a bare filename stem. Same-slug entries in different trees are now distinct.

  • Local: guideline/<slug>
  • Subscribed: subscribed/<source>/guideline/<slug>

Where:

  • retrieve_entities.py — computes str(md.relative_to(entities_dir).with_suffix("")) and stores it in the entities array of the recall row.
  • learn/SKILL.md Step 4 — simplified. The id names exactly one file, so the instruction is now "open .evolve/entities/<id>.md with Read" — no recursive find, no multi-tree resolver.
  • The influence example in SKILL.md now shows "entity": "guideline/<slug>" with a note that the prefix must match whatever appeared in the recall event.
  • tests/e2e/test_sandbox_learn_recall.py — session1_ids builds qualified ids to match.
  • test_log_influence.py — unchanged; the helper treats entity as an opaque string.
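The id computation quoted above is a pure path operation; a quick sketch of its effect, with `entity_id` as a hypothetical wrapper name around the `relative_to(...).with_suffix("")` expression:

```python
from pathlib import Path


def entity_id(md: Path, entities_dir: Path) -> str:
    """Qualified id: path relative to .evolve/entities/, without the .md suffix."""
    return str(md.relative_to(entities_dir).with_suffix(""))
```

Because the id keeps the tree prefix, a local and a subscribed copy of the same slug map to two distinct strings, which is exactly what the influence step needs to attribute verdicts correctly.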

Repro on this commit — same setup you described, local guideline/shared-slug.md plus a subscribed copy at subscribed/alice/guideline/shared-slug.md. Simulated a UserPromptSubmit hook with a fake session_id=fake-session-abc123:

{"event":"recall","session_id":"fake-session-abc123",
 "entities":["guideline/shared-slug","subscribed/alice/guideline/shared-slug"],
 "ts":"2026-05-01T15:32:08Z"}

Before the fix, both would have collapsed to "shared-slug" in the set literal. Now they're distinct identities, and the influence step can attribute verdicts to the correct guideline.

Example from the sandbox e2e (single-source case, just to show round-trip on a real run):

{
  "event": "recall",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entities": ["guideline/in-this-environment-extract-jpeg-exif-metadata-including"],
  "ts": "2026-05-01T15:06:05Z"
}
{
  "event": "influence",
  "session_id": "e8df99dd-fecb-4a75-840d-06c290bcd3dd",
  "entity": "guideline/in-this-environment-extract-jpeg-exif-metadata-including",
  "verdict": "followed",
  "evidence": "Agent went directly to raw Python struct-based JPEG/EXIF/TIFF parsing without attempting exiftool, ImageMagick, or PIL/Pillow",
  "ts": "2026-05-01T15:07:14Z"
}

On the follow-up full-suite issue — I can confirm locally that test_e2e_pipeline.py -k smolagents|openai_agents need OPENAI_API_KEY for the evolve sync phoenix step (gpt-4o via LiteLLM), which this PR doesn't touch. Agreed those belong in a separate issue.

vinodmut added 3 commits May 1, 2026 10:44
Merging public/main brought in AgentToolkit#243's SKILL.md restructure, which
inserted a "Review Existing Guidelines" step and shifted Save Entities
to Step 4 — colliding with the "Step 4: Assess Influence" section this
branch added. Rename the influence section to Step 5 and update its
sub-steps to reference Step 4 (save) and derive session_id from the
saved_trajectory_path variable (the post-AgentToolkit#243 name) instead of the
removed transcript_path.
Align with the rest of tests/platform_integrations/ which use the
temp_project_dir fixture (a thin wrapper over tmp_path that creates
an isolated test_project/ subdir). All 11 tests still pass.

Addresses CodeRabbit review finding: Replace the use of the tmp_path fixture with the temp_project_dir fixture

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/platform_integrations/test_log_influence.py (1)

194-199: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the missing “no audit side effect” assertion in invalid-JSON case.

This reject-path test checks exit code/stderr, but unlike other reject tests it does not verify audit.log remains empty.

✅ Suggested patch
     def test_rejects_invalid_json(self, temp_project_dir):
         evolve_dir = temp_project_dir / ".evolve"
         result = run_log_influence(temp_project_dir, None, raw_input="{not valid json", evolve_dir=evolve_dir)
         assert result.returncode == 1
         assert "json" in result.stderr.lower()
+        assert read_audit(evolve_dir) == []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/platform_integrations/test_log_influence.py` around lines 194 - 199,
The test_rejects_invalid_json case is missing the "no audit side effect"
assertion; after calling run_log_influence (in test_rejects_invalid_json) add an
assertion that the audit log in evolve_dir (e.g., check evolve_dir /
"audit.log") is either non-existent or empty to match other reject-path tests
and ensure no auditing occurred when raw_input is invalid JSON.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c4edbc29-5e2b-44df-ac32-47bae691e334

📥 Commits

Reviewing files that changed from the base of the PR and between c005954 and a18073c.

📒 Files selected for processing (2)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
  • tests/platform_integrations/test_log_influence.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md

Mirrors the "no audit side effect" assertion present in the other
reject-path tests so we catch any regression where invalid JSON would
sneak a partial write into audit.log.

Addresses CodeRabbit review finding: The test_rejects_invalid_json case is missing the "no audit side effect" assertion
@vinodmut
Contributor Author

vinodmut commented May 1, 2026

@visahak ready for another look when you have a minute.

Since your last review:

  • c005954 — qualify recall audit ids with the subscribed-source prefix (addresses your main finding; repro in this earlier comment)
  • Merged public/main (picking up the SKILL.md restructure from #243, "fix(learn): stop the learn-skill permission-prompt storm at Stop") and renumbered the influence section to Step 5 so it doesn't collide with the renamed Save step
  • a18073c — swap tmp_path → temp_project_dir in test_log_influence.py to match the rest of tests/platform_integrations/
  • cbef977 — add the "no audit side effect" assertion to test_rejects_invalid_json

All 16 CI checks are green.

@visahak visahak merged commit 6cc2a5b into AgentToolkit:main May 2, 2026
16 checks passed
