Skip to content

feat(telemetry): add human-driven eval foundation#88

Merged
khaliqgant merged 4 commits intomainfrom
feat/sage-v2-substrate-followups
May 8, 2026
Merged

feat(telemetry): add human-driven eval foundation#88
khaliqgant merged 4 commits intomainfrom
feat/sage-v2-substrate-followups

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

Summary

  • adds Agent Assistant telemetry eval primitives for human-authored suites and deterministic checks
  • enriches human review artifacts with input, Must/Must Not criteria, deterministic check status, and captured output
  • documents product adoption so Sage can consume this foundation instead of owning generic eval mechanics locally

Validation

  • npm --prefix ../agent-assistant run build -w @agent-assistant/telemetry
  • npm --prefix ../agent-assistant run test -w @agent-assistant/telemetry -- human-suite.test.ts eval-runner.test.ts

Follow-up

  • after publish, bump Sage's @agent-assistant/telemetry dependency and remove the sibling-package fallback from the Sage eval runner

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e80017fc-30c3-445b-9a18-5c2b39e6fdef

📥 Commits

Reviewing files that changed from the base of the PR and between 4d42e56 and 616664d.

📒 Files selected for processing (2)
  • packages/telemetry/src/evals/human-suite.ts
  • packages/telemetry/test/human-suite.test.ts

📝 Walkthrough

Walkthrough

This PR extends HumanEvalTrial with optional input and expected fields, adds a deterministic maxQuestionMarks check, and makes renderHumanEvalReview conditionally emit input bullets, Must/Must Not sections, a filtered Deterministic Checks subsection, and an Output block only when data exists; tests updated accordingly.

Changes

Human Review Rendering Enhancements

Layer / File(s) Summary
Data Type Extension
packages/telemetry/src/evals/human-suite.ts
HumanEvalTrial interface adds optional input?: Record<string, unknown> and expected?: HumanEvalExpected.
Deterministic Check
packages/telemetry/src/evals/human-suite.ts
assertHumanEvalExpected adds maxQuestionMarks gate that counts ? in normalized actual content and fails when the count exceeds the configured limit.
Rendering: input fields
packages/telemetry/src/evals/human-suite.ts
renderHumanEvalReview conditionally renders input.message and input.actualPath bullets when present.
Rendering: rubric subsections
packages/telemetry/src/evals/human-suite.ts
renderHumanEvalReview emits "Must" and "Must Not" subsections only when expected.must/expected.mustNot arrays are non-empty.
Rendering: deterministic checks
packages/telemetry/src/evals/human-suite.ts
Adds a "Deterministic Checks" subsection containing only checks whose names do not start with human:, rendered only when such checks exist.
Rendering: output block
packages/telemetry/src/evals/human-suite.ts
"Output" subsection is emitted only when test.actual.content is non-empty; content is truncated to 4000 characters in the code block.
Test Verification
packages/telemetry/test/human-suite.test.ts
Tests extended: deterministic expected includes maxQuestionMarks; new test ensures maxQuestionMarks fails when exceeded; render/write test updated to include input/expected and assert presence of "### Must"/"### Must Not" and expected bullets.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through trials with careful cheer,

counting questions far and near,
I print the musts when they appear,
skip the blanks and keep it clear,
a tidy review, one carrot near.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(telemetry): add human-driven eval foundation' directly and clearly summarizes the main change: adding human-driven evaluation primitives to the telemetry package.
Description check ✅ Passed The description is well-related to the changeset, explaining the addition of eval primitives, enrichment of review artifacts, and product adoption documentation that align with the code changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/sage-v2-substrate-followups

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 397-412: The Must/Must Not sections use
(test.expected?.must?.length ?? 0) and iterate with for...of without verifying
the values are arrays; change both guards to use
Array.isArray(test.expected?.must) and Array.isArray(test.expected?.mustNot)
before pushing headers or iterating, and only run the for...of loops when those
Array.isArray checks pass (mirroring the existing safe pattern used for other
expected fields in this file); update the conditional around the
`lines.push('### Must')`/`lines.push('### Must Not')` blocks to use these
Array.isArray checks for `test.expected.must` and `test.expected.mustNot`
respectively.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2fc95317-1342-48c8-bc51-c4309a5d090c

📥 Commits

Reviewing files that changed from the base of the PR and between c6d5517 and f3b756c.

📒 Files selected for processing (2)
  • packages/telemetry/src/evals/human-suite.ts
  • packages/telemetry/test/human-suite.test.ts

Comment thread packages/telemetry/src/evals/human-suite.ts Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
packages/telemetry/src/evals/human-suite.ts (1)

407-422: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard must/mustNot with Array.isArray() before iterating.

Line 407 and Line 415 rely on .length and then iterate with for...of; malformed JSON like { must: { length: 1 } } passes the length check but throws at runtime during iteration.

Proposed fix
-    if ((test.expected?.must?.length ?? 0) > 0) {
+    const mustItems = Array.isArray(test.expected?.must) ? test.expected.must : [];
+    if (mustItems.length > 0) {
       lines.push('### Must');
       lines.push('');
-      for (const item of test.expected?.must ?? []) {
+      for (const item of mustItems) {
         lines.push(`- ${item}`);
       }
       lines.push('');
     }
-    if ((test.expected?.mustNot?.length ?? 0) > 0) {
+    const mustNotItems = Array.isArray(test.expected?.mustNot) ? test.expected.mustNot : [];
+    if (mustNotItems.length > 0) {
       lines.push('### Must Not');
       lines.push('');
-      for (const item of test.expected?.mustNot ?? []) {
+      for (const item of mustNotItems) {
         lines.push(`- ${item}`);
       }
       lines.push('');
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/telemetry/src/evals/human-suite.ts` around lines 407 - 422, The code
in human-suite.ts checks test.expected?.must?.length and mustNot?.length then
uses for...of, which can crash if must/mustNot are non-array objects with a
length property; update both checks and iteration to first verify
Array.isArray(test.expected?.must) and Array.isArray(test.expected?.mustNot)
(e.g., if (Array.isArray(test.expected?.must) && test.expected.must.length > 0)
{ ... for (const item of test.expected.must) { ... } }) so only real arrays are
iterated. Ensure you change both the length guards and the for...of loops that
reference test.expected?.must and test.expected?.mustNot.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 407-422: The code in human-suite.ts checks
test.expected?.must?.length and mustNot?.length then uses for...of, which can
crash if must/mustNot are non-array objects with a length property; update both
checks and iteration to first verify Array.isArray(test.expected?.must) and
Array.isArray(test.expected?.mustNot) (e.g., if
(Array.isArray(test.expected?.must) && test.expected.must.length > 0) { ... for
(const item of test.expected.must) { ... } }) so only real arrays are iterated.
Ensure you change both the length guards and the for...of loops that reference
test.expected?.must and test.expected?.mustNot.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 98efd583-46b8-4a6c-a56f-43e19c1f9b77

📥 Commits

Reviewing files that changed from the base of the PR and between f3b756c and 8280381.

📒 Files selected for processing (2)
  • packages/telemetry/src/evals/human-suite.ts
  • packages/telemetry/test/human-suite.test.ts

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 292-300: The check for maxQuestionMarks currently allows JS
coercion of malformed rubric values; update the block that references
expected.maxQuestionMarks, questionMarkCount, addCheck and checks to first
validate that expected.maxQuestionMarks is a finite number (e.g., using typeof
=== 'number' && Number.isFinite(expected.maxQuestionMarks')) before performing
the comparison against questionMarkCount computed from actual.content, and if
the value is not a finite number, skip adding the maxQuestionMarks check so
invalid config is ignored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 028a0dd3-9679-49ae-a04f-b783b2bef153

📥 Commits

Reviewing files that changed from the base of the PR and between 8280381 and 4d42e56.

📒 Files selected for processing (2)
  • packages/telemetry/src/evals/human-suite.ts
  • packages/telemetry/test/human-suite.test.ts

Comment thread packages/telemetry/src/evals/human-suite.ts Outdated
@khaliqgant khaliqgant merged commit ba495b7 into main May 8, 2026
1 of 2 checks passed
@khaliqgant khaliqgant deleted the feat/sage-v2-substrate-followups branch May 8, 2026 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant