test: regenerate NeMo Relay evals with NV-BASE by abhisawa-Nvidia · Pull Request #226 · NVIDIA/NeMo-Relay

abhisawa-Nvidia · 2026-06-04T21:25:23Z

Overview

Follow-up to #225. The original PR added eval datasets for the public NeMo Relay skills and was merged before the NV-BASE regeneration pass landed on the fork branch. This PR replaces those hand-seeded datasets with NV-BASE-generated datasets so the skills follow the verified-skills onboarding guide and are ready for the official NVIDIA skills catalog flow.

I confirm this contribution is my own work, or I have the right to submit it under this project's license.
I searched existing issues and open pull requests, and this does not duplicate existing work.

Details

Regenerates evals/evals.json for all 14 public nemo-relay-* consumer skills using NV-BASE.
Uses nv-base create-eval-dataset --full --force to produce 4 cases per skill: 3 positive routing/use cases plus 1 negative case.
Converts the datasets from the earlier { "skill": ..., "cases": [...] } wrapper shape to NV-BASE's top-level array format.
Keeps the PR scoped to eval dataset updates only; no runtime, binding, docs-site, exporter, plugin, or adaptive implementation changes.
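
The shape change described above can be illustrated with a small Python sketch. It is not how NV-BASE produces the files (the datasets in this PR were regenerated, not converted by hand). The `unwrap_cases` helper and the sample case contents are hypothetical; only the `skill`/`cases` wrapper keys and the top-level array target come from the description.

```python
import json


def unwrap_cases(dataset):
    """Return the list of cases from either dataset shape.

    Hypothetical helper: accepts the earlier {"skill": ..., "cases": [...]}
    wrapper or a top-level array, and always returns a list.
    """
    if isinstance(dataset, list):
        return dataset
    if isinstance(dataset, dict) and isinstance(dataset.get("cases"), list):
        return dataset["cases"]
    raise ValueError("unrecognized eval dataset shape")


# Illustrative data only; the real case fields are defined by NV-BASE.
old_shape = {"skill": "nemo-relay-start", "cases": [{"question": "example"}]}
new_shape = unwrap_cases(old_shape)
print(json.dumps(new_shape, indent=2))
```

A reader that accepts both shapes like this could be useful to tooling that still has to load datasets from before this PR.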

Validated with:

nv-base create-eval-dataset skills/<skill> --force --full for all 14 public NeMo Relay skills
jq empty skills/*/evals/evals.json
verified that all 14 eval files are top-level arrays with 56 cases in total (14 skills × 4 cases)
nv-base validate skills --external --no-dedup --fail-fast
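
The array-shape and case-count check in the list above could be scripted roughly as follows. This is a sketch, not the command that was actually run for this PR; `summarize_evals` is a hypothetical helper, and the only assumption about the layout is the `skills/*/evals/evals.json` glob already used in the `jq` step.

```python
import json
from pathlib import Path


def summarize_evals(skills_root):
    """Return (file_count, case_count) for <skills_root>/*/evals/evals.json.

    Raises ValueError if any file is not a top-level JSON array.
    """
    files = sorted(Path(skills_root).glob("*/evals/evals.json"))
    total = 0
    for path in files:
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path} is not a top-level array")
        total += len(data)
    return len(files), total


if __name__ == "__main__":
    # Run from the repository root; if the numbers in the description hold,
    # this reports 14 files and 56 cases.
    print(summarize_evals("skills"))
```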

Where should the reviewer start?

Start with skills/nemo-relay-start/evals/evals.json to see the updated NV-BASE dataset shape, then spot-check skills/nemo-relay-migrate-from-flow/evals/evals.json for script-aware cases and skills/nemo-relay-tune-adaptive-hints/evals/evals.json for sibling-skill negative routing.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to #225 ("test: add NeMo Relay skill eval datasets").
Relates to NeMo Relay onboarding for the official NVIDIA skills catalog.

Follow-up

After review, a NeMo Relay maintainer/admin should comment /nvskills-ci on this PR to generate the benchmark, skill-card, and signature artifacts required for NVIDIA/skills publication.

Summary by CodeRabbit

Tests
- Updated evaluation cases across many NeMo Relay skills: expanded and standardized scenarios (more language-specific and multi-surface coverage), added negative/out-of-scope checks, improved troubleshooting/export/observability/performance/instrumentation coverage, and clarified expected behaviors for each case.

Signed-off-by: asawarkar <asawarkar@nvidia.com>

copy-pr-bot · 2026-06-04T21:25:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-04T21:25:37Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2c8722c6-d387-41e9-923b-ec6b2f8991f8

📥 Commits

Reviewing files that changed from the base of the PR and between c03e8bc and 8c5ee3e.

📒 Files selected for processing (14)

skills/nemo-relay-build-plugin/evals/evals.json
skills/nemo-relay-debug-runtime-integration/evals/evals.json
skills/nemo-relay-export-atif-trajectories/evals/evals.json
skills/nemo-relay-export-openinference/evals/evals.json
skills/nemo-relay-export-otel/evals/evals.json
skills/nemo-relay-instrument-calls/evals/evals.json
skills/nemo-relay-migrate-from-flow/evals/evals.json
skills/nemo-relay-setup-observability/evals/evals.json
skills/nemo-relay-start/evals/evals.json
skills/nemo-relay-tune-adaptive-config/evals/evals.json
skills/nemo-relay-tune-adaptive-hints/evals/evals.json
skills/nemo-relay-tune-performance/evals/evals.json
skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
skills/nemo-relay-use-context-isolation/evals/evals.json

📜 Recent review details

🧰 Additional context used

📓 Path-based instructions (2)

**/*.{md,mdx,py,sh,yaml,yml,toml,json}

📄 CodeRabbit inference engine (.agents/skills/contribute-docs/SKILL.md)

Keep package names, repo references, and build commands current

Files:

skills/nemo-relay-tune-adaptive-hints/evals/evals.json
skills/nemo-relay-debug-runtime-integration/evals/evals.json
skills/nemo-relay-migrate-from-flow/evals/evals.json
skills/nemo-relay-instrument-calls/evals/evals.json
skills/nemo-relay-use-context-isolation/evals/evals.json
skills/nemo-relay-build-plugin/evals/evals.json
skills/nemo-relay-tune-performance/evals/evals.json
skills/nemo-relay-start/evals/evals.json
skills/nemo-relay-export-atif-trajectories/evals/evals.json
skills/nemo-relay-export-otel/evals/evals.json
skills/nemo-relay-export-openinference/evals/evals.json
skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
skills/nemo-relay-setup-observability/evals/evals.json
skills/nemo-relay-tune-adaptive-config/evals/evals.json

**

⚙️ CodeRabbit configuration file

**:

AGENTS.md

This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.

Project Overview

NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.

The shared runtime model is:

Scope stacks decide where work belongs and which scope-local behavior is visible.

Middleware registries decide what guardrails and intercepts run around managed calls.

Plugins install reusable runtime behavior from configuration.

Events record runtime behavior in ATOF form.

Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.

Repository Structure

The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.
crates/
  core/       # Rust core runtime crate, published as nemo-relay
  adaptive/   # Adaptive runtime primitives and plugin components
  python/     # PyO3 native extension for the Python package
  ffi/        # Raw C ABI layer used by downstream bindings such as Go
  node/       # NAPI Node.js binding and JavaScript/TypeScript entry points
  wasm/       # wasm-bindgen WebAssembly binding and JS wrappers
python/
  nemo_relay/  # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers
  tests/      # Python tests
go/
  nemo_relay/  # Experimental Go CGo binding and tests
fern/         # Fern documentation site
scripts/      # Stable wrappers and helper scripts; build/test/docs entry points live in justfile
third_party/  # P...

Files:

skills/nemo-relay-tune-adaptive-hints/evals/evals.json
skills/nemo-relay-debug-runtime-integration/evals/evals.json
skills/nemo-relay-migrate-from-flow/evals/evals.json
skills/nemo-relay-instrument-calls/evals/evals.json
skills/nemo-relay-use-context-isolation/evals/evals.json
skills/nemo-relay-build-plugin/evals/evals.json
skills/nemo-relay-tune-performance/evals/evals.json
skills/nemo-relay-start/evals/evals.json
skills/nemo-relay-export-atif-trajectories/evals/evals.json
skills/nemo-relay-export-otel/evals/evals.json
skills/nemo-relay-export-openinference/evals/evals.json
skills/nemo-relay-typed-wrappers-codecs/evals/evals.json
skills/nemo-relay-setup-observability/evals/evals.json
skills/nemo-relay-tune-adaptive-config/evals/evals.json

🔇 Additional comments (16)

skills/nemo-relay-build-plugin/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-debug-runtime-integration/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-atif-trajectories/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-openinference/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-export-otel/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-instrument-calls/evals/evals.json (2)

1-58: AI summary count mismatch.

The AI summary claims "a top-level array of five new/updated evaluation cases" but the annotated code contains exactly 4 cases (nemo-relay-instrument-calls-001 through -004).

1-58: LGTM!

skills/nemo-relay-migrate-from-flow/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-setup-observability/evals/evals.json (2)

1-58: LGTM!

1-1: Add SPDX metadata for skills/nemo-relay-setup-observability/evals/evals.json (or document an exemption). This repo sometimes encodes SPDX in JSON via top-level keys (e.g., "SPDX-License-Identifier": "Apache-2.0"), so make evals.json follow the same convention or add an explicit policy stating evaluation-data JSON is exempt.

skills/nemo-relay-start/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-adaptive-config/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-adaptive-hints/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-tune-performance/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-typed-wrappers-codecs/evals/evals.json (1)

1-58: LGTM!

skills/nemo-relay-use-context-isolation/evals/evals.json (1)

1-58: LGTM!

Walkthrough

This PR restructures evaluation test data across 14 NeMo Relay skill directories. Each evals.json file migrates from a wrapped object schema to a top-level array, with expanded test cases featuring explicit id, richer ground_truth, and explicit out-of-scope/negative cases (expected_skill: null).

Changes

Evaluation Fixture Schema Migration

Layer / File(s)	Summary
Evaluation JSON schema migration and scenario expansion `skills/nemo-relay-build-plugin/evals/evals.json`, `skills/nemo-relay-debug-runtime-integration/evals/evals.json`, `skills/nemo-relay-export-atif-trajectories/evals/evals.json`, `skills/nemo-relay-export-openinference/evals/evals.json`, `skills/nemo-relay-export-otel/evals/evals.json`, `skills/nemo-relay-instrument-calls/evals/evals.json`, `skills/nemo-relay-migrate-from-flow/evals/evals.json`, `skills/nemo-relay-setup-observability/evals/evals.json`, `skills/nemo-relay-start/evals/evals.json`, `skills/nemo-relay-tune-adaptive-config/evals/evals.json`, `skills/nemo-relay-tune-adaptive-hints/evals/evals.json`, `skills/nemo-relay-tune-performance/evals/evals.json`, `skills/nemo-relay-typed-wrappers-codecs/evals/evals.json`, `skills/nemo-relay-use-context-isolation/evals/evals.json`	All 14 skill evaluation files restructured from a `{ skill, cases: [...] }` wrapper to a top-level array. Each case now uses explicit `id`, `question`, `expected_skill`, `expected_script`, `ground_truth`, and detailed `expected_behavior`. Files add language-specific scenarios, exporter/instrumentation flows, codec/isolation checks, and negative cases that assert `expected_skill: null` for out-of-scope queries.
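
The per-case fields named in the summary above can be spot-checked with a short Python sketch. The key names come from that summary; the value types, the `check_cases` helper, and the sample cases are assumptions for illustration and are not taken from the actual datasets or from NV-BASE's own validator.

```python
REQUIRED_KEYS = {
    "id", "question", "expected_skill",
    "expected_script", "ground_truth", "expected_behavior",
}


def check_cases(cases):
    """Return a list of problem strings for one skill's top-level array."""
    problems = []
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
    # A negative case is one whose expected_skill is explicitly null.
    has_negative = any(
        "expected_skill" in c and c["expected_skill"] is None for c in cases
    )
    if not has_negative:
        problems.append("no negative case (expected_skill: null)")
    return problems


def make_case(case_id, skill):
    """Build a placeholder case; all values here are hypothetical."""
    return {
        "id": case_id,
        "question": "placeholder question",
        "expected_skill": skill,
        "expected_script": None,
        "ground_truth": "placeholder answer",
        "expected_behavior": ["placeholder behavior"],
    }


# Three positive cases plus one negative, mirroring the 3 + 1 split in the PR.
sample = [make_case(f"example-skill-00{n}", "example-skill") for n in (1, 2, 3)]
sample.append(make_case("example-skill-004", None))
print(check_cases(sample))
```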

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/NeMo-Relay#225: Initial introduction and earlier restructuring of these evaluation JSON fixtures; this PR further standardizes and expands those fixtures.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title follows Conventional Commits format with 'test' type and concise imperative summary under 72 characters.
Description check	✅ Passed	Description includes all required template sections: Overview with confirmations, Details with comprehensive change summary, reviewer guidance, and related issues.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

willkill07 · 2026-06-04T21:44:50Z

/ok to test c03e8bc

willkill07 · 2026-06-05T01:26:45Z

/ok to test 8c5ee3e

willkill07 · 2026-06-05T01:27:42Z

/ok to test 3e2e31b

willkill07 · 2026-06-05T01:27:55Z

/nvskills-ci

test: regenerate NeMo Relay evals with NV-BASE

c03e8bc

Signed-off-by: asawarkar <asawarkar@nvidia.com>

abhisawa-Nvidia requested a review from a team as a code owner June 4, 2026 21:25

github-actions Bot added the size:XL (PR is extra large) and Test (Test related) labels Jun 4, 2026

willkill07 previously approved these changes Jun 4, 2026

chore: add trailing newlines to eval datasets

8c5ee3e

Signed-off-by: asawarkar <asawarkar@nvidia.com>

abhisawa-Nvidia dismissed willkill07’s stale review via 8c5ee3e June 4, 2026 22:23

willkill07 self-assigned this Jun 5, 2026

willkill07 added this to the 0.4 milestone Jun 5, 2026

Merge branch 'main' into onboard-nvskills-evals-nvbase

3e2e31b

abhisawa-Nvidia wants to merge 3 commits into NVIDIA:main from abhisawa-Nvidia:onboard-nvskills-evals-nvbase
