feat(telemetry): add human-driven eval foundation by khaliqgant · Pull Request #88 · AgentWorkforce/agent-assistant

khaliqgant · 2026-05-08T07:51:40Z

Summary

adds Agent Assistant telemetry eval primitives for human-authored suites and deterministic checks
enriches human review artifacts with input, Must/Must Not criteria, deterministic check status, and captured output
documents product adoption so Sage can consume this foundation instead of owning generic eval mechanics locally

Validation

npm --prefix ../agent-assistant run build -w @agent-assistant/telemetry
npm --prefix ../agent-assistant run test -w @agent-assistant/telemetry -- human-suite.test.ts eval-runner.test.ts

Follow-up

after publish, bump Sage's @agent-assistant/telemetry dependency and remove the sibling-package fallback from the Sage eval runner

coderabbitai · 2026-05-08T07:51:51Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e80017fc-30c3-445b-9a18-5c2b39e6fdef

📥 Commits

Reviewing files that changed from the base of the PR and between 4d42e56 and 616664d.

📒 Files selected for processing (2)

packages/telemetry/src/evals/human-suite.ts
packages/telemetry/test/human-suite.test.ts

📝 Walkthrough

Walkthrough

This PR extends HumanEvalTrial with optional input and expected fields, adds a deterministic maxQuestionMarks check, and makes renderHumanEvalReview conditionally emit input bullets, Must/Must Not sections, a filtered Deterministic Checks subsection, and an Output block only when data exists; tests updated accordingly.

Changes

Human Review Rendering Enhancements

Layer / File(s)	Summary
Data Type Extension `packages/telemetry/src/evals/human-suite.ts`	`HumanEvalTrial` interface adds optional `input?: Record<string, unknown>` and `expected?: HumanEvalExpected`.
Deterministic Check `packages/telemetry/src/evals/human-suite.ts`	`assertHumanEvalExpected` adds `maxQuestionMarks` gate that counts `?` in normalized actual content and fails when the count exceeds the configured limit.
Rendering: input fields `packages/telemetry/src/evals/human-suite.ts`	`renderHumanEvalReview` conditionally renders `input.message` and `input.actualPath` bullets when present.
Rendering: rubric subsections `packages/telemetry/src/evals/human-suite.ts`	`renderHumanEvalReview` emits "Must" and "Must Not" subsections only when `expected.must`/`expected.mustNot` arrays are non-empty.
Rendering: deterministic checks `packages/telemetry/src/evals/human-suite.ts`	Adds a "Deterministic Checks" subsection containing only checks whose names do not start with `human:`, rendered only when such checks exist.
Rendering: output block `packages/telemetry/src/evals/human-suite.ts`	"Output" subsection is emitted only when `test.actual.content` is non-empty; content is truncated to 4000 characters in the code block.
Test Verification `packages/telemetry/test/human-suite.test.ts`	Tests extended: deterministic expected includes `maxQuestionMarks`; new test ensures `maxQuestionMarks` fails when exceeded; render/write test updated to include `input`/`expected` and assert presence of "### Must"/"### Must Not" and expected bullets.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through trials with careful cheer,

counting questions far and near,
I print the musts when they appear,
skip the blanks and keep it clear,
a tidy review, one carrot near.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(telemetry): add human-driven eval foundation' directly and clearly summarizes the main change: adding human-driven evaluation primitives to the telemetry package.
Description check	✅ Passed	The description is well-related to the changeset, explaining the addition of eval primitives, enrichment of review artifacts, and product adoption documentation that align with the code changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/sage-v2-substrate-followups

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 397-412: The Must/Must Not sections use
(test.expected?.must?.length ?? 0) and iterate with for...of without verifying
the values are arrays; change both guards to use
Array.isArray(test.expected?.must) and Array.isArray(test.expected?.mustNot)
before pushing headers or iterating, and only run the for...of loops when those
Array.isArray checks pass (mirroring the existing safe pattern used for other
expected fields in this file); update the conditional around the
`lines.push('### Must')`/`lines.push('### Must Not')` blocks to use these
Array.isArray checks for `test.expected.must` and `test.expected.mustNot`
respectively.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2fc95317-1342-48c8-bc51-c4309a5d090c

📥 Commits

Reviewing files that changed from the base of the PR and between c6d5517 and f3b756c.

📒 Files selected for processing (2)

packages/telemetry/src/evals/human-suite.ts
packages/telemetry/test/human-suite.test.ts

coderabbitai

♻️ Duplicate comments (1)

packages/telemetry/src/evals/human-suite.ts (1)

407-422: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard must/mustNot with Array.isArray() before iterating.

Line 407 and Line 415 rely on .length and then iterate with for...of; malformed JSON like { must: { length: 1 } } passes the length check but throws at runtime during iteration.

Proposed fix

-    if ((test.expected?.must?.length ?? 0) > 0) {
+    const mustItems = Array.isArray(test.expected?.must) ? test.expected.must : [];
+    if (mustItems.length > 0) {
       lines.push('### Must');
       lines.push('');
-      for (const item of test.expected?.must ?? []) {
+      for (const item of mustItems) {
         lines.push(`- ${item}`);
       }
       lines.push('');
     }
-    if ((test.expected?.mustNot?.length ?? 0) > 0) {
+    const mustNotItems = Array.isArray(test.expected?.mustNot) ? test.expected.mustNot : [];
+    if (mustNotItems.length > 0) {
       lines.push('### Must Not');
       lines.push('');
-      for (const item of test.expected?.mustNot ?? []) {
+      for (const item of mustNotItems) {
         lines.push(`- ${item}`);
       }
       lines.push('');
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/telemetry/src/evals/human-suite.ts` around lines 407 - 422, The code
in human-suite.ts checks test.expected?.must?.length and mustNot?.length then
uses for...of, which can crash if must/mustNot are non-array objects with a
length property; update both checks and iteration to first verify
Array.isArray(test.expected?.must) and Array.isArray(test.expected?.mustNot)
(e.g., if (Array.isArray(test.expected?.must) && test.expected.must.length > 0)
{ ... for (const item of test.expected.must) { ... } }) so only real arrays are
iterated. Ensure you change both the length guards and the for...of loops that
reference test.expected?.must and test.expected?.mustNot.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 407-422: The code in human-suite.ts checks
test.expected?.must?.length and mustNot?.length then uses for...of, which can
crash if must/mustNot are non-array objects with a length property; update both
checks and iteration to first verify Array.isArray(test.expected?.must) and
Array.isArray(test.expected?.mustNot) (e.g., if
(Array.isArray(test.expected?.must) && test.expected.must.length > 0) { ... for
(const item of test.expected.must) { ... } }) so only real arrays are iterated.
Ensure you change both the length guards and the for...of loops that reference
test.expected?.must and test.expected?.mustNot.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 98efd583-46b8-4a6c-a56f-43e19c1f9b77

📥 Commits

Reviewing files that changed from the base of the PR and between f3b756c and 8280381.

📒 Files selected for processing (2)

packages/telemetry/src/evals/human-suite.ts
packages/telemetry/test/human-suite.test.ts

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/telemetry/src/evals/human-suite.ts`:
- Around line 292-300: The check for maxQuestionMarks currently allows JS
coercion of malformed rubric values; update the block that references
expected.maxQuestionMarks, questionMarkCount, addCheck and checks to first
validate that expected.maxQuestionMarks is a finite number (e.g., using typeof
=== 'number' && Number.isFinite(expected.maxQuestionMarks')) before performing
the comparison against questionMarkCount computed from actual.content, and if
the value is not a finite number, skip adding the maxQuestionMarks check so
invalid config is ignored.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 028a0dd3-9679-49ae-a04f-b783b2bef153

📥 Commits

Reviewing files that changed from the base of the PR and between 8280381 and 4d42e56.

📒 Files selected for processing (2)

packages/telemetry/src/evals/human-suite.ts
packages/telemetry/test/human-suite.test.ts

feat(telemetry): enrich human eval review artifacts

f3b756c

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread packages/telemetry/src/evals/human-suite.ts Outdated

feat(telemetry): add question count eval check

8280381

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

fix(telemetry): guard malformed eval rubric fields

4d42e56

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

Comment thread packages/telemetry/src/evals/human-suite.ts Outdated

fix(telemetry): validate question count eval config

616664d

khaliqgant merged commit ba495b7 into main May 8, 2026
1 of 2 checks passed

khaliqgant deleted the feat/sage-v2-substrate-followups branch May 8, 2026 08:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(telemetry): add human-driven eval foundation#88

feat(telemetry): add human-driven eval foundation#88
khaliqgant merged 4 commits intomainfrom
feat/sage-v2-substrate-followups

khaliqgant commented May 8, 2026

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

khaliqgant commented May 8, 2026

Summary

Validation

Follow-up

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 8, 2026 •

edited

Loading