test: regenerate NeMo Relay evals with NV-BASE#226

Open
abhisawa-Nvidia wants to merge 3 commits into
NVIDIA:main from
abhisawa-Nvidia:onboard-nvskills-evals-nvbase

Conversation

@abhisawa-Nvidia
Contributor

@abhisawa-Nvidia abhisawa-Nvidia commented Jun 4, 2026

Overview

Follow-up to #225. The original PR added eval datasets for the public NeMo Relay skills and was merged before the NV-BASE regeneration pass landed on the fork branch. This PR replaces those hand-seeded datasets with NV-BASE-generated datasets so the skills follow the verified-skills onboarding guide and are ready for the official NVIDIA skills catalog flow.

  • I confirm this contribution is my own work, or I have the right to submit it under this project's license.
  • I searched existing issues and open pull requests, and this does not duplicate existing work.

Details

  • Regenerates evals/evals.json for all 14 public nemo-relay-* consumer skills using NV-BASE.
  • Uses nv-base create-eval-dataset --full --force to produce 4 cases per skill: 3 positive routing/use cases plus 1 negative case.
  • Converts the datasets from the earlier { "skill": ..., "cases": [...] } wrapper shape to NV-BASE's top-level array format.
  • Keeps the PR scoped to eval dataset updates only; no runtime, binding, docs-site, exporter, plugin, or adaptive implementation changes.
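
The shape change described in the third bullet can be pictured with a minimal Python sketch. It is illustrative only: the PR regenerates the datasets with NV-BASE rather than converting them mechanically, and the helper name `unwrap_cases` and the sample case are hypothetical.

```python
import json


def unwrap_cases(dataset):
    """Return the list of cases from either dataset shape.

    Accepts the earlier {"skill": ..., "cases": [...]} wrapper or an
    already-flat top-level array, and always returns a list.
    """
    if isinstance(dataset, list):
        return dataset
    if isinstance(dataset, dict) and isinstance(dataset.get("cases"), list):
        return dataset["cases"]
    raise ValueError("unrecognized eval dataset shape")


# Hypothetical sample in the earlier wrapper shape.
wrapped = {
    "skill": "nemo-relay-start",
    "cases": [{"question": "How do I get started with NeMo Relay?"}],
}

flat = unwrap_cases(wrapped)
print(json.dumps(flat, indent=2))
```

The regenerated files also carry richer per-case fields than the earlier ones, so a plain unwrap like this would not reproduce them; it only shows where the array now sits.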

Validated with:

  • nv-base create-eval-dataset skills/<skill> --force --full for all 14 public NeMo Relay skills
  • jq empty skills/*/evals/evals.json
  • Verified that all 14 eval files are top-level arrays containing 56 cases in total
  • nv-base validate skills --external --no-dedup --fail-fast
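
The array-shape and case-count check in the list above could be scripted roughly as follows. This is a sketch, not the command the author ran: the function name `summarize_evals` is hypothetical, and it assumes the `skills/*/evals/evals.json` layout used in the `jq` step.

```python
import json
from pathlib import Path


def summarize_evals(skills_root):
    """Count eval files and cases under skills_root/*/evals/evals.json.

    Raises ValueError if any file is not a top-level JSON array.
    """
    files = sorted(Path(skills_root).glob("*/evals/evals.json"))
    total_cases = 0
    for path in files:
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path} is not a top-level array")
        total_cases += len(data)
    return len(files), total_cases


if __name__ == "__main__":
    n_files, n_cases = summarize_evals("skills")
    # The PR description reports 14 files and 56 cases for this change.
    print(f"{n_files} eval files, {n_cases} total cases")
```

Run from the repository root, the script reports the totals and fails on the first file that still uses the wrapper shape. If the `skills/` directory holds skills beyond the 14 NeMo Relay ones, the counts would include them.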

Where should the reviewer start?

Start with skills/nemo-relay-start/evals/evals.json to see the updated NV-BASE dataset shape, then spot-check skills/nemo-relay-migrate-from-flow/evals/evals.json for script-aware cases and skills/nemo-relay-tune-adaptive-hints/evals/evals.json for sibling-skill negative routing.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Follow-up

After review, a NeMo Relay maintainer/admin should comment /nvskills-ci on this PR to generate the benchmark, skill-card, and signature artifacts required for NVIDIA/skills publication.

Summary by CodeRabbit

  • Tests
    • Updated evaluation cases across many NeMo Relay skills: expanded and standardized scenarios (more language-specific and multi-surface coverage), added negative/out-of-scope checks, improved troubleshooting/export/observability/performance/instrumentation coverage, and clarified expected behaviors for each case.

Signed-off-by: asawarkar <asawarkar@nvidia.com>
@abhisawa-Nvidia abhisawa-Nvidia requested a review from a team as a code owner June 4, 2026 21:25
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the size:XL (PR is extra large) and Test (Test related) labels Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2c8722c6-d387-41e9-923b-ec6b2f8991f8

📥 Commits

Reviewing files that changed from the base of the PR and between c03e8bc and 8c5ee3e.

📒 Files selected for processing (14)
  • skills/nemo-relay-build-plugin/evals/evals.json
  • skills/nemo-relay-debug-runtime-integration/evals/evals.json
  • skills/nemo-relay-export-atif-trajectories/evals/evals.json
  • skills/nemo-relay-export-openinference/evals/evals.json
  • skills/nemo-relay-export-otel/evals/evals.json
  • skills/nemo-relay-instrument-calls/evals/evals.json
  • skills/nemo-relay-migrate-from-flow/evals/evals.json
  • skills/nemo-relay-setup-observability/evals/evals.json
  • skills/nemo-relay-start/evals/evals.json
  • skills/nemo-relay-tune-adaptive-config/evals/evals.json
  • skills/nemo-relay-tune-adaptive-hints/evals/evals.json
  • skills/nemo-relay-tune-performance/evals/evals.json
  • skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
  • skills/nemo-relay-use-context-isolation/evals/evals.json
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{md,mdx,py,sh,yaml,yml,toml,json}

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

Keep package names, repo references, and build commands current

Files:

  • skills/nemo-relay-tune-adaptive-hints/evals/evals.json
  • skills/nemo-relay-debug-runtime-integration/evals/evals.json
  • skills/nemo-relay-migrate-from-flow/evals/evals.json
  • skills/nemo-relay-instrument-calls/evals/evals.json
  • skills/nemo-relay-use-context-isolation/evals/evals.json
  • skills/nemo-relay-build-plugin/evals/evals.json
  • skills/nemo-relay-tune-performance/evals/evals.json
  • skills/nemo-relay-start/evals/evals.json
  • skills/nemo-relay-export-atif-trajectories/evals/evals.json
  • skills/nemo-relay-export-otel/evals/evals.json
  • skills/nemo-relay-export-openinference/evals/evals.json
  • skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
  • skills/nemo-relay-setup-observability/evals/evals.json
  • skills/nemo-relay-tune-adaptive-config/evals/evals.json
**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.

Project Overview

NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.

The shared runtime model is:

  1. Scope stacks decide where work belongs and which scope-local behavior is visible.
  2. Middleware registries decide what guardrails and intercepts run around managed calls.
  3. Plugins install reusable runtime behavior from configuration.
  4. Events record runtime behavior in ATOF form.
  5. Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.

Repository Structure

The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.

crates/
  core/       # Rust core runtime crate, published as nemo-relay
  adaptive/   # Adaptive runtime primitives and plugin components
  python/     # PyO3 native extension for the Python package
  ffi/        # Raw C ABI layer used by downstream bindings such as Go
  node/       # NAPI Node.js binding and JavaScript/TypeScript entry points
  wasm/       # wasm-bindgen WebAssembly binding and JS wrappers
python/
  nemo_relay/  # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers
  tests/      # Python tests
go/
  nemo_relay/  # Experimental Go CGo binding and tests
fern/         # Fern documentation site
scripts/      # Stable wrappers and helper scripts; build/test/docs entry points live in justfile
third_party/  # P...

Files:

  • skills/nemo-relay-tune-adaptive-hints/evals/evals.json
  • skills/nemo-relay-debug-runtime-integration/evals/evals.json
  • skills/nemo-relay-migrate-from-flow/evals/evals.json
  • skills/nemo-relay-instrument-calls/evals/evals.json
  • skills/nemo-relay-use-context-isolation/evals/evals.json
  • skills/nemo-relay-build-plugin/evals/evals.json
  • skills/nemo-relay-tune-performance/evals/evals.json
  • skills/nemo-relay-start/evals/evals.json
  • skills/nemo-relay-export-atif-trajectories/evals/evals.json
  • skills/nemo-relay-export-otel/evals/evals.json
  • skills/nemo-relay-export-openinference/evals/evals.json
  • skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
  • skills/nemo-relay-setup-observability/evals/evals.json
  • skills/nemo-relay-tune-adaptive-config/evals/evals.json
🔇 Additional comments (16)
skills/nemo-relay-build-plugin/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-debug-runtime-integration/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-atif-trajectories/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-openinference/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-otel/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-instrument-calls/evals/evals.json (2)

1-58: AI summary count mismatch.

The AI summary claims "a top-level array of five new/updated evaluation cases" but the annotated code contains exactly 4 cases (nemo-relay-instrument-calls-001 through -004).


1-58: LGTM!

skills/nemo-relay-migrate-from-flow/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-setup-observability/evals/evals.json (2)

1-58: LGTM!


1-1: Add SPDX metadata for skills/nemo-relay-setup-observability/evals/evals.json (or document an exemption). This repo sometimes encodes SPDX in JSON via top-level keys (e.g., "SPDX-License-Identifier": "Apache-2.0"), so make evals.json follow the same convention or add an explicit policy stating evaluation-data JSON is exempt.

skills/nemo-relay-start/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-adaptive-config/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-adaptive-hints/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-performance/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-typed-wrappers-codecs/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-use-context-isolation/evals/evals.json (1)

1-58: LGTM!


Walkthrough

This PR restructures evaluation test data across 14 NeMo Relay skill directories. Each evals.json file migrates from a wrapped object schema to a top-level array, with expanded test cases featuring explicit id, richer ground_truth, and explicit out-of-scope/negative cases (expected_skill: null).

Changes

Evaluation Fixture Schema Migration

Layer / File(s) Summary
Evaluation JSON schema migration and scenario expansion
skills/nemo-relay-build-plugin/evals/evals.json, skills/nemo-relay-debug-runtime-integration/evals/evals.json, skills/nemo-relay-export-atif-trajectories/evals/evals.json, skills/nemo-relay-export-openinference/evals/evals.json, skills/nemo-relay-export-otel/evals/evals.json, skills/nemo-relay-instrument-calls/evals/evals.json, skills/nemo-relay-migrate-from-flow/evals/evals.json, skills/nemo-relay-setup-observability/evals/evals.json, skills/nemo-relay-start/evals/evals.json, skills/nemo-relay-tune-adaptive-config/evals/evals.json, skills/nemo-relay-tune-adaptive-hints/evals/evals.json, skills/nemo-relay-tune-performance/evals/evals.json, skills/nemo-relay-typed-wrappers-codecs/evals/evals.json, skills/nemo-relay-use-context-isolation/evals/evals.json
All 14 skill evaluation files restructured from a { skill, cases: [...] } wrapper to a top-level array. Each case now uses explicit id, question, expected_skill, expected_script, ground_truth, and detailed expected_behavior. Files add language-specific scenarios, exporter/instrumentation flows, codec/isolation checks, and negative cases that assert expected_skill: null for out-of-scope queries.
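
The per-case structure summarized above can be spot-checked with a short Python sketch. It assumes every case carries all six listed keys and that each file has at least one negative case; the function name `check_cases` and the sample values are hypothetical, while the key names and the two case ids come from this review.

```python
REQUIRED_KEYS = {
    "id",
    "question",
    "expected_skill",
    "expected_script",
    "ground_truth",
    "expected_behavior",
}


def check_cases(cases):
    """Return a list of problems found in a top-level array of eval cases."""
    if not isinstance(cases, list):
        return ["dataset is not a top-level array"]
    problems = []
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index} is not an object")
            continue
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"case {index} is missing {sorted(missing)}")
    # A negative (out-of-scope) case is one whose expected_skill is null.
    has_negative = any(
        isinstance(c, dict) and c.get("expected_skill", "") is None for c in cases
    )
    if not has_negative:
        problems.append("no negative case with expected_skill set to null")
    return problems


# Hypothetical sample values; only the ids and key names come from the PR.
cases = [
    {
        "id": "nemo-relay-instrument-calls-001",
        "question": "How do I instrument a tool call?",
        "expected_skill": "nemo-relay-instrument-calls",
        "expected_script": None,
        "ground_truth": "placeholder",
        "expected_behavior": "placeholder",
    },
    {
        "id": "nemo-relay-instrument-calls-004",
        "question": "What is the weather today?",
        "expected_skill": None,
        "expected_script": None,
        "ground_truth": "placeholder",
        "expected_behavior": "placeholder",
    },
]

print(check_cases(cases))  # prints []
```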

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NeMo-Relay#225: Initial introduction and earlier restructuring of these evaluation JSON fixtures; this PR further standardizes and expands those fixtures.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | Title follows Conventional Commits format with 'test' type and concise imperative summary under 72 characters.
Description check | ✅ Passed | Description includes all required template sections: Overview with confirmations, Details with comprehensive change summary, reviewer guidance, and related issues.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


willkill07 previously approved these changes Jun 4, 2026
@willkill07
Member

/ok to test c03e8bc

Signed-off-by: asawarkar <asawarkar@nvidia.com>
@willkill07 willkill07 self-assigned this Jun 5, 2026
@willkill07 willkill07 added this to the 0.4 milestone Jun 5, 2026
@willkill07
Member

/ok to test 8c5ee3e

@willkill07
Member

/ok to test 3e2e31b

@willkill07
Member

/nvskills-ci

Labels

size:XL (PR is extra large), Test (Test related)

2 participants