chore: skill eval fix batch 1 by miyoungc · Pull Request #4470 · NVIDIA/NemoClaw

miyoungc · 2026-05-28T22:15:51Z

Summary

Related Issue

Changes

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Your Name your-email@example.com

Summary by CodeRabbit

Tests
- Enhanced evaluation expectations for agent skills and inference configuration modules with detailed behavior validation criteria
- Strengthened testing framework to ensure consistent agent performance across multiple operational scenarios

miyoungc · 2026-05-28T22:16:01Z

/nvskills-ci

coderabbitai · 2026-05-28T22:16:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28a10a04-44b0-4f23-b82c-ea499c55d699

📥 Commits

Reviewing files that changed from the base of the PR and between 442f64b and 07601e9.

📒 Files selected for processing (4)

.agents/skills/nemoclaw-user-agent-skills/evals/evals.json
.agents/skills/nemoclaw-user-configure-inference/evals/evals.json
skills/nemoclaw-user-agent-skills/evals/evals.json
skills/nemoclaw-user-configure-inference/evals/evals.json
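
Each eval file in this list appears under both `.agents/skills/` and `skills/`. If the two trees are meant to be identical copies (an assumption; this PR does not say so), a small check like the sketch below could flag drift between them. The function name `out_of_sync` and the `SKILLS` list are hypothetical helpers for illustration, not part of the repository.

```python
import json
from pathlib import Path

# Skill directory names taken from the file list above.
SKILLS = ["nemoclaw-user-agent-skills", "nemoclaw-user-configure-inference"]


def out_of_sync(repo_root):
    """Return the skills whose two evals.json copies parse to different JSON.

    Assumes (not confirmed by the PR) that .agents/skills/ and skills/
    should carry identical eval content.
    """
    root = Path(repo_root)
    drifted = []
    for skill in SKILLS:
        rel = Path(skill) / "evals" / "evals.json"
        agents_copy = root / ".agents" / "skills" / rel
        skills_copy = root / "skills" / rel
        left = json.loads(agents_copy.read_text(encoding="utf-8"))
        right = json.loads(skills_copy.read_text(encoding="utf-8"))
        # Compare parsed JSON so whitespace-only differences are ignored.
        if left != right:
            drifted.append(skill)
    return drifted
```

Calling `out_of_sync(".")` from the repository root would return an empty list when both trees agree.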

📝 Walkthrough

Walkthrough

Four evaluation JSON files are updated to add `expected_behavior` metadata arrays specifying required skill usage, reference documentation, and task answering criteria. Agent skills and inference configuration evals are strengthened across both the `.agents/skills/` and `skills/` directory structures.

Changes

Evaluation Specification Updates

| Layer / File(s) | Summary |
|---|---|
| Agent Skills Evaluation Specifications: `.agents/skills/nemoclaw-user-agent-skills/evals/evals.json`, `skills/nemoclaw-user-agent-skills/evals/evals.json` | Three agent skill test cases (`docs-resources-agent-skills-001`, `-002`, `-003`) are enhanced with `expected_behavior` arrays requiring use of the `nemoclaw-user-agent-skills` skill, reference to `references/agent-skills.md`, and direct task answering. Ground truth strings are also refined to clarify delegation guidance, scoping rationale, and trust/control framing. |
| Configure Inference Evaluation Specifications: `.agents/skills/nemoclaw-user-configure-inference/evals/evals.json`, `skills/nemoclaw-user-configure-inference/evals/evals.json` | Multiple inference configuration test cases (covering options selection, provider credentials, local routing, provider switching, sub-agent setup, and tool-calling reliability) are updated with `expected_behavior` arrays requiring use of the `nemoclaw-user-configure-inference` skill, reference to scenario-specific markdown under `references/`, and direct task answering. Existing `question` and `ground_truth` fields remain unchanged. |
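
To make the shape of the change concrete, here is a minimal sketch of an eval case carrying an `expected_behavior` array, plus a check that every case has one. The field names `question`, `ground_truth`, and `expected_behavior` and the case ids come from the summary above. The `id` key, the top-level list layout, the wording of each criterion, and the helper `missing_expected_behavior` are assumptions made for illustration, not the repository's actual schema or tooling.

```python
import json

# Hypothetical eval cases shaped after the fields named in this PR.
# The "id" key and the text of each criterion are assumptions.
SAMPLE = json.loads("""
[
  {
    "id": "docs-resources-agent-skills-001",
    "question": "...",
    "ground_truth": "...",
    "expected_behavior": [
      "Uses the nemoclaw-user-agent-skills skill",
      "References references/agent-skills.md",
      "Answers the task directly"
    ]
  },
  {
    "id": "docs-resources-agent-skills-002",
    "question": "...",
    "ground_truth": "..."
  }
]
""")


def missing_expected_behavior(cases):
    """Return ids of cases that lack a non-empty list of string criteria."""
    bad = []
    for case in cases:
        behavior = case.get("expected_behavior")
        ok = (
            isinstance(behavior, list)
            and len(behavior) > 0
            and all(isinstance(item, str) and item.strip() for item in behavior)
        )
        if not ok:
            bad.append(case.get("id", "<no id>"))
    return bad


print(missing_expected_behavior(SAMPLE))
# prints ['docs-resources-agent-skills-002']
```

A check of this kind is static: it inspects the JSON only and does not run any skill, which fits a change limited to evaluation metadata.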

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

NVIDIA/NemoClaw#4463: Directly conflicts with this PR—removes expected_behavior field from configure-inference eval schema while this PR adds it.

Suggested reviewers

jyaunches

Poem

🐰 JSON evals bloom with clarity,
Expected behaviors crystalline and bright,
References guide the way with clarity,
Agent skills and inference aligned just right,
A structured path to truth takes flight! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'chore: skill eval fix batch 1' is vague and generic, using non-descriptive terms that don't convey what skill evaluations are being fixed or the specific nature of the improvements. | Consider a more descriptive title that specifies which skills are being updated and what aspect of evaluations is being improved, e.g., 'chore: add expected_behavior to agent-skills and configure-inference eval cases'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


github-actions · 2026-05-28T22:16:47Z

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

None. No NemoClaw E2E is recommended. The changes are limited to skill evaluation JSON metadata used to assess expected skill/reference usage; they do not affect runtime/user flows or security-sensitive paths. Prefer the existing NVSkills/static evaluation validation path rather than sandbox E2E.

Optional E2E

None.

New E2E recommendations

None.

github-actions · 2026-05-28T22:16:48Z

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

None.

Relevant changed files

None.

coderabbitai · 2026-05-28T22:17:58Z

Actionable comments posted: 0

github-actions · 2026-05-28T22:18:14Z

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Top item: No blocking code-review findings

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

copy-pr-bot · 2026-05-28T23:15:42Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

chore: skill eval fix batch 1

07601e9

miyoungc marked this pull request as draft May 28, 2026 23:15

wscurran added the `fix` and `enhancement: skill` (Improvements to NemoClaw repository hygiene or user functionality with skills) labels May 29, 2026
