chore: skill eval fix batch 1#4470

Draft
miyoungc wants to merge 1 commit into main from skills-eval-val-1

Conversation

@miyoungc
Collaborator

@miyoungc miyoungc commented May 28, 2026

Summary

Related Issue

Changes

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Your Name your-email@example.com

Summary by CodeRabbit

  • Tests
    • Enhanced evaluation expectations for agent skills and inference configuration modules with detailed behavior validation criteria
    • Strengthened testing framework to ensure consistent agent performance across multiple operational scenarios

@miyoungc
Collaborator Author

/nvskills-ci

@coderabbitai
Contributor

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 28a10a04-44b0-4f23-b82c-ea499c55d699

📥 Commits

Reviewing files that changed from the base of the PR and between 442f64b and 07601e9.

📒 Files selected for processing (4)
  • .agents/skills/nemoclaw-user-agent-skills/evals/evals.json
  • .agents/skills/nemoclaw-user-configure-inference/evals/evals.json
  • skills/nemoclaw-user-agent-skills/evals/evals.json
  • skills/nemoclaw-user-configure-inference/evals/evals.json

📝 Walkthrough

Four evaluation JSON files are updated to add expected_behavior metadata arrays specifying required skill usage, reference documentation, and task answering criteria. Agent skills and inference configuration evals are strengthened across both .agents/skills/ and skills/ directory structures.
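The shape of an eval case with an expected_behavior array, and a check that every case carries one, could be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual schema or validator: only the field names question, ground_truth, and expected_behavior and the case IDs come from the walkthrough above. The top-level list layout, the id key, the placeholder question and ground-truth text, and the wording of each expected_behavior entry are hypothetical.

```python
import json

# Field names taken from the walkthrough; everything else here is assumed.
REQUIRED_FIELDS = ("question", "ground_truth", "expected_behavior")


def check_eval_cases(cases):
    """Return a list of human-readable problems found in a list of eval cases."""
    problems = []
    for index, case in enumerate(cases):
        # "id" is a hypothetical key used only to label messages.
        label = case.get("id", f"case #{index}")
        for field in REQUIRED_FIELDS:
            if field not in case:
                problems.append(f"{label}: missing {field}")
        behaviors = case.get("expected_behavior")
        if behaviors is not None:
            if not isinstance(behaviors, list) or not behaviors:
                problems.append(f"{label}: expected_behavior must be a non-empty list")
            elif not all(isinstance(b, str) and b.strip() for b in behaviors):
                problems.append(f"{label}: expected_behavior entries must be non-empty strings")
    return problems


# Hypothetical sample data; the real evals.json layout may differ.
sample = json.loads("""
[
  {
    "id": "docs-resources-agent-skills-001",
    "question": "<question text>",
    "ground_truth": "<ground truth text>",
    "expected_behavior": [
      "Uses the nemoclaw-user-agent-skills skill",
      "References references/agent-skills.md",
      "Answers the task directly"
    ]
  },
  {
    "id": "docs-resources-agent-skills-002",
    "question": "<question text>",
    "ground_truth": "<ground truth text>"
  }
]
""")

for problem in check_eval_cases(sample):
    print(problem)
```

In this sample the first case passes and the second is reported because it has no expected_behavior array, which is the kind of gap this PR is described as closing.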

Changes

Evaluation Specification Updates

| Layer / File(s) | Summary |
|---|---|
| Agent Skills Evaluation Specifications: .agents/skills/nemoclaw-user-agent-skills/evals/evals.json, skills/nemoclaw-user-agent-skills/evals/evals.json | Three agent skill test cases (docs-resources-agent-skills-001, -002, -003) are enhanced with expected_behavior arrays requiring use of the nemoclaw-user-agent-skills skill, reference to references/agent-skills.md, and direct task answering. Ground truth strings are also refined to clarify delegation guidance, scoping rationale, and trust/control framing. |
| Configure Inference Evaluation Specifications: .agents/skills/nemoclaw-user-configure-inference/evals/evals.json, skills/nemoclaw-user-configure-inference/evals/evals.json | Multiple inference configuration test cases (covering options selection, provider credentials, local routing, provider switching, sub-agent setup, and tool-calling reliability) are updated with expected_behavior arrays requiring use of the nemoclaw-user-configure-inference skill, reference to scenario-specific markdown under references/, and direct task answering. Existing question and ground_truth fields remain unchanged. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4463: Directly conflicts with this PR. It removes the expected_behavior field from the configure-inference eval schema, whereas this PR adds it.

Suggested reviewers

  • jyaunches

Poem

🐰 JSON evals bloom with clarity,
Expected behaviors crystalline and bright,
References guide the way with clarity,
Agent skills and inference aligned just right,
A structured path to truth takes flight! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'chore: skill eval fix batch 1' is vague and generic, using non-descriptive terms that don't convey what skill evaluations are being fixed or the specific nature of the improvements. | Consider a more descriptive title that specifies which skills are being updated and what aspect of evaluations is being improved, e.g., 'chore: add expected_behavior to agent-skills and configure-inference eval cases'. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions
Contributor

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. No NemoClaw E2E is recommended. The changes are limited to skill evaluation JSON metadata used to assess expected skill/reference usage; they do not affect runtime/user flows or security-sensitive paths. Prefer the existing NVSkills/static evaluation validation path rather than sandbox E2E.

Optional E2E

  • None.

New E2E recommendations

  • None.

@github-actions
Contributor

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • None. No scenario workflow, scenario metadata, scenario runtime, or validation-suite files changed.

Optional scenario E2E

  • None.

Relevant changed files

  • None.

@coderabbitai
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@github-actions
Contributor

PR Review Advisor

Findings: 0 needs attention, 0 worth checking, 0 nice ideas
Top item: No blocking code-review findings

This is an automated advisory review. A human maintainer must make the final merge decision.

@miyoungc miyoungc marked this pull request as draft May 28, 2026 23:15
@copy-pr-bot

copy-pr-bot Bot commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@wscurran wscurran added the fix and enhancement: skill (Improvements to NemoCall repository hygiene or user functionality with skills.) labels May 29, 2026
2 participants