Add 5-case evals for tilegym-cutile-autotuning (parallel to #135) #136
Open
hannahli-nv wants to merge 8 commits into
Changes:
- Move 7 cuTile skill folders from .agents/skills/ to skills/.
- Add .agents/skills and .claude/skills symlinks pointing to ../skills for backward compatibility.
- Update LICENSE, CONTRIBUTING.md, and .github/scripts/check_spdx_headers.py to reference the new skills/ path.
- Split skills/cutile-autotuning/SKILL.md: move API Reference, Step-by-Step Workflow, and Pitfall Checklist into new files under references/ to keep SKILL.md concise.

Signed-off-by: Hannah Li <hanli@nvidia.com>
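The backward-compatibility symlinks described in this commit can be sketched as below. This is an illustration only, not the script used in the PR: the helper name `add_compat_symlinks` and the throwaway directory layout are assumptions, and the sketch assumes a platform where unprivileged symlink creation is allowed.

```python
import os
import tempfile
from pathlib import Path


def add_compat_symlinks(repo_root: Path) -> list[Path]:
    """Create .agents/skills and .claude/skills as relative links to ../skills.

    Hypothetical helper for illustration; the PR may have created the
    links by hand or with a different tool.
    """
    links = []
    for parent in (".agents", ".claude"):
        link = repo_root / parent / "skills"
        link.parent.mkdir(parents=True, exist_ok=True)
        if not link.is_symlink():
            # The relative target is resolved from the link's own directory,
            # so ../skills points at <repo_root>/skills.
            os.symlink(os.path.join("..", "skills"), link, target_is_directory=True)
        links.append(link)
    return links


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        skill_dir = root / "skills" / "cutile-autotuning"
        skill_dir.mkdir(parents=True)
        (skill_dir / "SKILL.md").write_text("# placeholder\n")
        for link in add_compat_symlinks(root):
            print(link.relative_to(root), "->", os.readlink(link))
```

A relative target keeps the links valid wherever the repository is cloned, which is presumably why the commit points them at ../skills rather than an absolute path.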
Adds the skill evaluation dataset for `adding-cutile-kernel`. The question targets two TileGym CI naming conventions required for new operators: the `test_op*` test-function prefix that gates which pytest functions are collected by the CI `-k test_op` filter, and the `-TFLOPS` / `-GBps` suffix required for the benchmark `plot_name` parameter so results are parsed into the CI summary. These conventions are documented in the adding-cutile-kernel `SKILL.md` and are not reliably available from general training data, so the skill provides clear lift over the no-skill baseline. The case has produced zero high-severity regression findings across three prior nvskills-ci runs.

Schema fields used: `id`, `question`, `expected_skill`, `expected_script`, `ground_truth`, `expected_behavior`.

After this PR merges, the publication pipeline auto-generates `BENCHMARK.md`, `skill-card.md`, and the detached signature `skill.oms.sig` for this skill, and the nvidia/skills sync workflow publishes it to the public catalog (per NVIDIA/skills#121).

The remaining 6 cuTile skills will receive their own `evals/evals.json` in follow-up PRs, scoped per-skill to keep each evaluation run within the per-job time budget and avoid blocking each other through the global gate.

Signed-off-by: Hannah Li <hanli@nvidia.com>
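The two naming conventions this eval question targets can be illustrated with a small check. This is a sketch, not TileGym's CI code: the helper names and the sample names (`test_op_softmax`, `softmax-GBps`, and so on) are hypothetical, and the prefix test is a simplification of how the `-k test_op` filter selects tests.

```python
# Suffixes the commit message says the benchmark plot_name must end with.
PLOT_SUFFIXES = ("-TFLOPS", "-GBps")


def follows_test_naming(func_name: str) -> bool:
    """True if a test function uses the test_op prefix from the convention."""
    return func_name.startswith("test_op")


def follows_plot_naming(plot_name: str) -> bool:
    """True if a benchmark plot_name ends with -TFLOPS or -GBps."""
    return plot_name.endswith(PLOT_SUFFIXES)


if __name__ == "__main__":
    # Hypothetical names, chosen only to exercise both outcomes.
    for name in ("test_op_softmax", "test_softmax"):
        print(f"{name}: test naming ok = {follows_test_naming(name)}")
    for name in ("softmax-GBps", "matmul-TFLOPS", "softmax-latency"):
        print(f"{name}: plot naming ok = {follows_plot_naming(name)}")
```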
…m-adding-cutile-kernel
Replace two implementation-heavy positive cases with one orientation-style positive case and two additional negative cases to cover related-but-out-of-scope topics (performance tuning and multi-GPU distribution). Adjust expected_behavior phrasing to be agent-agnostic.
Parallel experiment to PR #135. Same scaffolding (skill renames + nvskills signatures) but evals.json is attached to tilegym-cutile-autotuning instead of tilegym-adding-cutile-kernel. 5-case design: 2 positive (impl + diagnostic), 3 negative (logging, install/version, multi-GPU/NCCL). All expected_behavior phrased positively to avoid binary 'did not invoke skill' gates that have been tripping up claude in the prior pipeline runs.
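For reviewers who want a concrete picture of the 5-case layout, here is a minimal sketch using the six schema fields listed in the earlier commit message. Everything beyond those field names is an assumption: the case ids, the placeholder question and ground-truth text, the use of `None` in `expected_skill` to mark a negative case, and `expected_behavior` being a list of strings are all hypothetical and may not match the real `evals.json`.

```python
REQUIRED_FIELDS = {
    "id", "question", "expected_skill",
    "expected_script", "ground_truth", "expected_behavior",
}
SKILL = "tilegym-cutile-autotuning"


def make_case(case_id: str, topic: str, positive: bool) -> dict:
    """Build a placeholder case; all values are illustrative only."""
    return {
        "id": case_id,
        "question": f"<placeholder question about {topic}>",
        # Assumption: negative cases carry no expected skill.
        "expected_skill": SKILL if positive else None,
        "expected_script": None,
        "ground_truth": f"<placeholder ground truth for {topic}>",
        # Phrased positively ("Addressed X"), per the commit message.
        "expected_behavior": [f"Addressed {topic}"],
    }


CASES = [
    make_case("pos-impl", "implementation", positive=True),
    make_case("pos-diagnostic", "diagnostics", positive=True),
    make_case("neg-logging", "logging", positive=False),
    make_case("neg-install-version", "install/version", positive=False),
    make_case("neg-multi-gpu-nccl", "multi-GPU/NCCL", positive=False),
]


def validate(cases: list[dict]) -> list[str]:
    """Return a list of problems: missing fields or binary 'not invoke' gates."""
    problems = []
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"{case.get('id')}: missing {sorted(missing)}")
        for line in case.get("expected_behavior", []):
            if "not invoke" in line.lower():
                problems.append(f"{case.get('id')}: negatively phrased gate")
    return problems


if __name__ == "__main__":
    positives = sum(c["expected_skill"] is not None for c in CASES)
    print(f"{positives} positive, {len(CASES) - positives} negative")
    print("problems:", validate(CASES))
```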
Collaborator
Author
/ok to test 514bd49
Collaborator
Author
/nvskills-ci
1 similar comment
Collaborator
Author
/nvskills-ci
Bessss-zyw approved these changes on May 29, 2026
Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
Collaborator
Author
/ok to test c2285e6
Summary
Parallel experiment to PR #135. Same scaffolding (skill renames with `tilegym-` prefix, NVSkills signature files, workflow.md fix) but the `evals.json` is attached to `tilegym-cutile-autotuning` instead of `tilegym-adding-cutile-kernel`.

PR #135's Tier 3 AGENT_EVAL keeps failing because codex regresses by -0.05 lift on the `adding-cutile-kernel` skill (the skill's MUST-style execution rules cause codex to walk extra steps it isn't equipped for). This branch tests whether `cutile-autotuning` clears Tier 3 more reliably:

- … `ct.launch` into tune-once/cache/launch)
- `cutile-autotuning` had claude lift -0.02 in earlier runs (closest to the -0.01 gate among the non-`adding-cutile-kernel` skills)

Eval design (5 cases)

[Case table not recovered; the surviving cells read "`adding-cutile-kernel` evals".]

All `expected_behavior` entries phrased positively ("addressed X" rather than "did NOT invoke skill"), based on the lesson learned in PR #135 that binary "did not invoke" gates penalize claude even for cursory SKILL.md reads.

Open questions for reviewers
CI Configuration
🤖 Generated with Claude Code
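The lift figures quoted in the Summary can be read against the -0.01 gate with a short sketch. The definition of lift as with-skill score minus no-skill baseline, the inclusive `>=` comparison, and the function names are assumptions for illustration; the actual Tier 3 AGENT_EVAL logic is not shown in this PR.

```python
LIFT_GATE = -0.01  # gate value quoted in the Summary


def lift(with_skill: float, baseline: float) -> float:
    """Assumed definition: score with the skill minus the no-skill baseline."""
    return round(with_skill - baseline, 4)


def clears_gate(lift_value: float, gate: float = LIFT_GATE) -> bool:
    """Assumed rule: a run clears Tier 3 when its lift is at or above the gate."""
    return lift_value >= gate


if __name__ == "__main__":
    # Lift values reported in the Summary.
    reported = {
        "codex on adding-cutile-kernel": -0.05,
        "claude on cutile-autotuning (earlier runs)": -0.02,
    }
    for label, value in reported.items():
        print(f"{label}: lift {value:+.2f}, clears gate = {clears_gate(value)}")
```

Under these assumptions both reported lifts fall short of the gate, with the `cutile-autotuning` figure missing it by the smaller margin.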