test: add NeMo Relay skill eval datasets #225

Merged
rapids-bot[bot] merged 2 commits into NVIDIA:main from abhisawa-Nvidia:onboard-nvskills-evals
Jun 4, 2026

@abhisawa-Nvidia
Contributor

@abhisawa-Nvidia abhisawa-Nvidia commented Jun 4, 2026

Overview

This PR adds P0 eval datasets for all public NeMo Relay consumer skills so the skills can move through NVIDIA verified-skills onboarding and become available in the official NVIDIA skills catalog. The datasets give NVSkills/NVCARPS the required evals/evals.json inputs before benchmark, skill-card, and signature artifacts are generated.

  • I confirm this contribution is my own work, or I have the right to submit it under this project's license.
  • I searched existing issues and open pull requests, and this does not duplicate existing work.

Details

  • Adds NV-BASE-generated evals/evals.json for all 14 public nemo-relay-* consumer skills.
  • Includes 4 cases per skill from nv-base create-eval-dataset --full, with positive routing coverage and one negative case per skill.
  • Keeps the first-pass datasets scoped to P0 smoke coverage and NVSkills CI onboarding.
  • Covers discoverability and routing boundaries across startup, instrumentation, observability, ATIF, OTEL, OpenInference, plugins, adaptive tuning, context isolation, typed wrappers, runtime debugging, and Flow-to-Relay migration.
  • Normalizes missing author metadata for nemo-relay-migrate-from-flow to match the other public skills.
  • Avoids runtime, binding, docs-site, exporter, plugin, or adaptive implementation changes in this PR.
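
The per-skill case mix described above (4 cases, one of them negative) could be checked with a sketch like the following. It is only an illustration: the function name `check_case_mix` is hypothetical, and because the real evals.json field names are not shown in this PR, the caller supplies the predicate that decides whether a case is negative.

```python
def check_case_mix(cases, is_negative, expected_total=4, expected_negative=1):
    """Return a list of problems with one skill's eval cases (empty if none)."""
    problems = []
    if len(cases) != expected_total:
        problems.append(f"expected {expected_total} cases, found {len(cases)}")
    # Count the cases that the caller-supplied predicate marks as negative.
    negatives = sum(1 for case in cases if is_negative(case))
    if negatives != expected_negative:
        problems.append(
            f"expected {expected_negative} negative case(s), found {negatives}"
        )
    return problems


# Hypothetical data: the "negative" key is an assumption, not the real schema.
sample = [
    {"negative": False},
    {"negative": False},
    {"negative": False},
    {"negative": True},
]
print(check_case_mix(sample, lambda case: case["negative"]))  # prints []
```

Passing the predicate in keeps the check independent of whatever schema nv-base actually emits.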

Validated with:

  • nv-base create-eval-dataset skills/<skill> --force --full for all 14 public NeMo Relay skills
  • jq empty skills/*/evals/evals.json
  • verified all 14 eval files are top-level arrays with 56 total cases
  • nv-base validate skills --external --no-dedup --fail-fast
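
For reviewers who want to reproduce the array-shape and case-count check without jq, a minimal Python sketch might look like this. The helper name `summarize_eval_files` is hypothetical; it assumes only the `skills/<skill>/evals/evals.json` layout used in this PR and says nothing about the fields inside each case.

```python
import json
from pathlib import Path


def summarize_eval_files(skills_root):
    """Return {skill_name: case_count} for every <skill>/evals/evals.json.

    Raises ValueError if a file is malformed JSON or not a top-level array.
    """
    counts = {}
    for path in sorted(Path(skills_root).glob("*/evals/evals.json")):
        # json.loads raises json.JSONDecodeError (a ValueError subclass)
        # on malformed input, the kind of failure `jq empty` would surface.
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, list):
            raise ValueError(f"{path} is not a top-level array")
        # path is <root>/<skill>/evals/evals.json, so the skill directory
        # is two levels up from the file.
        counts[path.parent.parent.name] = len(data)
    return counts
```

Run from the repository root as `summarize_eval_files("skills")`, the result should have 14 entries whose values sum to 56 if the numbers stated above hold.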

Where should the reviewer start?

Start with skills/nemo-relay-start/evals/evals.json to see the eval shape, then spot-check sibling routing cases in skills/nemo-relay-instrument-calls/evals/evals.json, skills/nemo-relay-setup-observability/evals/evals.json, and skills/nemo-relay-migrate-from-flow/evals/evals.json. The only SKILL.md metadata change is in skills/nemo-relay-migrate-from-flow/SKILL.md.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • Relates to NeMo Relay onboarding for the official NVIDIA skills catalog.

Follow-up

After review, a NeMo Relay maintainer/admin should comment /nvskills-ci on this PR to generate the benchmark, skill-card, and signature artifacts required for NVIDIA/skills publication.

Signed-off-by: asawarkar <asawarkar@nvidia.com>
@abhisawa-Nvidia abhisawa-Nvidia requested a review from a team as a code owner June 4, 2026 20:21
@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the size:M PR is medium label Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Walkthrough

This PR adds evaluation test specifications for 14 NeMo Relay skills across the complete instrumentation, observability, optimization, and migration pipeline. Each evals.json file defines positive and negative test cases with expected behaviors, skill routing, and validation checklists. One metadata addition updates the migration skill's author field.

Changes

NeMo Relay Evaluation Test Suite

| Layer | File(s) | Summary |
| --- | --- | --- |
| Startup and core instrumentation evaluation | skills/nemo-relay-start/evals/evals.json, skills/nemo-relay-instrument-calls/evals/evals.json | nemo-relay-start initializes scope/tool/LLM bindings and defers observability; nemo-relay-instrument-calls wraps existing calls while preserving metadata and delegating export setup. |
| Observability setup and export pathway evaluation | skills/nemo-relay-setup-observability/evals/evals.json, skills/nemo-relay-export-atif-trajectories/evals/evals.json, skills/nemo-relay-export-otel/evals/evals.json, skills/nemo-relay-export-openinference/evals/evals.json | Setup skill routes to specialized export skills: ATIF handles offline trajectory replay, OTLP handles OpenTelemetry tracing with metadata attachment, and OpenInference preserves semantic LLM span fields. |
| Advanced features and diagnostics evaluation | skills/nemo-relay-use-context-isolation/evals/evals.json, skills/nemo-relay-typed-wrappers-codecs/evals/evals.json, skills/nemo-relay-debug-runtime-integration/evals/evals.json | Context isolation manages scope stacks across concurrent boundaries; typed wrappers preserve middleware and trace metadata; debug skill diagnoses missing events through binding/propagation/registration checks. |
| Optimization and adaptive configuration evaluation | skills/nemo-relay-tune-performance/evals/evals.json, skills/nemo-relay-tune-adaptive-config/evals/evals.json, skills/nemo-relay-tune-adaptive-hints/evals/evals.json | Performance tuning requires baseline observability and defines metric-driven rollout; adaptive config manages plugin state/telemetry with rollout constraints; hints skill safely consumes adaptive hints with fallback recommendations. |
| Migration support and plugin packaging evaluation | skills/nemo-relay-migrate-from-flow/SKILL.md, skills/nemo-relay-migrate-from-flow/evals/evals.json, skills/nemo-relay-build-plugin/evals/evals.json | Flow-to-Relay migration handles Python/manifest naming updates with dry-run and scoped write; build-plugin packages reusable config-activated plugins with JSON config, stable kind, deterministic validation, and rollback semantics. Author metadata added to migration skill. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | Title follows Conventional Commits format with lowercase type 'test' and concise imperative summary, under 72 characters. |
| Description check | ✅ Passed | PR description is comprehensive, well-structured, and addresses all template requirements with clear details on changes, validation, and next steps. |
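
As an illustration of what the title check above looks for, a rough approximation in Python follows. This is not the bot's actual implementation: the pattern and the function name `is_conventional_title` are assumptions that cover only the lowercase type prefix, an optional scope, and the under-72-character limit mentioned in the explanation.

```python
import re

# Lowercase type, optional "(scope)", optional "!", then ": " and a summary.
TITLE_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(?:\([^()]+\))?!?: (?P<summary>\S.*)$"
)


def is_conventional_title(title, max_length=72):
    """True if the title has a lowercase type prefix and is under max_length."""
    return len(title) < max_length and TITLE_PATTERN.match(title) is not None
```

Under this sketch, "test: add NeMo Relay skill eval datasets" is accepted, while "Add NeMo Relay skill eval datasets" is rejected because it has no type prefix.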


@abhisawa-Nvidia abhisawa-Nvidia changed the title from "Add NeMo Relay skill eval datasets" to "test: add NeMo Relay skill eval datasets" Jun 4, 2026
@github-actions github-actions Bot added the Test Test related label Jun 4, 2026
Signed-off-by: asawarkar <asawarkar@nvidia.com>
@willkill07 willkill07 self-assigned this Jun 4, 2026
@willkill07 willkill07 added this to the 0.4 milestone Jun 4, 2026
@willkill07
Member

/ok to test 4c57d2f

@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

/ok to test 4c57d2f

@willkill07, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@willkill07
Member

/ok to test 0513bc5

@willkill07
Member

/merge

@rapids-bot rapids-bot Bot merged commit 380208d into NVIDIA:main Jun 4, 2026
27 checks passed
@abhisawa-Nvidia
Contributor Author

/nvskills-ci

Labels

size:M (PR is medium), Test (Test related)

2 participants