Add 5-case evals for tilegym-cutile-autotuning (parallel to #135) by hannahli-nv · Pull Request #136 · NVIDIA/TileGym

hannahli-nv · 2026-05-29T07:20:03Z

Summary

Parallel experiment to PR #135. Same scaffolding (skill renames with tilegym- prefix, NVSkills signature files, workflow.md fix) but the evals.json is attached to tilegym-cutile-autotuning instead of tilegym-adding-cutile-kernel.

PR #135's Tier 3 AGENT_EVAL keeps failing because codex shows a lift of -0.05 on the adding-cutile-kernel skill (the skill's MUST-style execution rules cause codex to walk extra steps it isn't equipped for). This branch tests whether cutile-autotuning clears Tier 3 more reliably:

- Less prescriptive SKILL.md: no "MUST TodoWrite + walk 6 steps" execution rules; the pattern is built around copy-paste reference snippets that both agents can imitate
- Pattern-matching topic: autotune config / search-space design plays to codex's strength (mechanical refactor of an existing ct.launch into tune-once/cache/launch)
- Historical signal: cutile-autotuning had a claude lift of -0.02 in earlier runs (closest to the -0.01 gate among the non-adding-cutile-kernel skills)

Eval design (5 cases)

| # | Type | Topic | Why it should pass for both agents |
|---|---|---|---|
| 001 | positive impl | Add occupancy-only autotune to RMSNorm | Simple refactor, both agents can copy the Quick Reference pattern |
| 002 | positive diagnostic | In-place RoPE corruption after first call | Focused single-issue diagnosis (Pitfall #1), no full kernel rewrite required |
| 003 | negative | TileGym logging configuration | Clear out-of-domain, reused from successful `adding-cutile-kernel` evals |
| 004 | negative | Python / PyTorch version requirements | Pure environment question, totally unrelated to autotune |
| 005 | negative | Multi-GPU NCCL all-reduce | Distributed topic, reused (scored claude/codex ≈ 0) in `adding-cutile-kernel` evals |
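Case 002 asks for a diagnosis of in-place corruption. The skill's Pitfall #1 text is not reproduced in this PR, so the sketch below only assumes one common cause: an autotune search that runs an in-place kernel several times on the caller's buffer. `scale_in_place` is a hypothetical stand-in for an in-place kernel such as RoPE, and both tuning helpers are illustrative.

```python
from typing import List, Sequence


def scale_in_place(buf: List[float], cfg: dict) -> None:
    """Stand-in for an in-place kernel: doubles every element of `buf`."""
    for i in range(len(buf)):
        buf[i] *= 2.0


def naive_tune_then_run(buf: List[float], configs: Sequence[dict]) -> None:
    """Buggy: each timing trial mutates the caller's data before the real run."""
    for cfg in configs:
        scale_in_place(buf, cfg)
    scale_in_place(buf, configs[0])


def safe_tune_then_run(buf: List[float], configs: Sequence[dict]) -> None:
    """Trials run on scratch copies; only the final launch touches `buf`."""
    for cfg in configs:
        scale_in_place(list(buf), cfg)
    scale_in_place(buf, configs[0])
```

With two candidate configs, the naive version applies the kernel three times to the same buffer while the safe version applies it once, which is the kind of symptom a diagnostic answer would be expected to trace back to the tuning loop.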

All expected_behavior entries are phrased positively ("addressed X" rather than "did NOT invoke skill"), based on the lesson learned in PR #135 that binary "did not invoke" gates penalize claude even for cursory SKILL.md reads.
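A small lint along these lines could catch negatively phrased entries before a CI run. It is a sketch under assumptions: the field names are the schema fields listed elsewhere in this PR, but treating all of them as required, treating `expected_behavior` as a list of strings, and the `check_case` helper itself are hypothetical and not part of nvskills-ci.

```python
import re
from typing import List

# Schema fields named in this PR; treating all of them as required is an assumption.
REQUIRED_FIELDS = ("id", "question", "expected_skill", "expected_script",
                   "ground_truth", "expected_behavior")

# Phrases that signal a binary "did not do X" style gate.
_NEGATIVE = re.compile(r"\b(did not|does not|didn't|doesn't)\b", re.IGNORECASE)


def check_case(case: dict) -> List[str]:
    """Return a list of problems found in one eval case (empty if none)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in case]
    for item in case.get("expected_behavior", []):
        if _NEGATIVE.search(item):
            problems.append(f"negatively phrased: {item!r}")
    return problems
```

For a negative case, an entry such as "Addressed the logging question from general TileGym documentation" passes, while "Did NOT invoke skill" is flagged.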

Open questions for reviewers

- This is a parallel experiment to PR #135 (Initial NVSkills-CI onboarding for TileGym skills), not a replacement. Only one will ultimately merge.
- If both pass CI, prefer this one (cleaner, less coercive SKILL.md → broader future-proofing).
- If only PR #135 passes, this branch can be closed.

CI Configuration

config:
  build: true
  # valid options are "ops" and "benchmark"
  test: []

🤖 Generated with Claude Code

Changes:
- Move 7 cuTile skill folders from .agents/skills/ to skills/.
- Add .agents/skills and .claude/skills symlinks pointing to ../skills for backward compatibility.
- Update LICENSE, CONTRIBUTING.md, and .github/scripts/check_spdx_headers.py to reference the new skills/ path.
- Split skills/cutile-autotuning/SKILL.md: move API Reference, Step-by-Step Workflow, and Pitfall Checklist into new files under references/ to keep SKILL.md concise.

Signed-off-by: Hannah Li <hanli@nvidia.com>

Signed-off-by: Hannah Li <hanli@nvidia.com>

Adds the skill evaluation dataset for `adding-cutile-kernel`.

The question targets two TileGym CI naming conventions required for new operators: the `test_op*` test-function prefix that gates which pytest functions are collected by the CI `-k test_op` filter, and the `-TFLOPS` / `-GBps` suffix required for the benchmark `plot_name` parameter so results are parsed into the CI summary. These conventions are documented in the adding-cutile-kernel `SKILL.md` and are not reliably available from general training data, so the skill provides clear lift over the no-skill baseline. The case has produced zero high-severity regression findings across three prior nvskills-ci runs.

Schema fields used: `id`, `question`, `expected_skill`, `expected_script`, `ground_truth`, `expected_behavior`.

After this PR merges, the publication pipeline auto-generates `BENCHMARK.md`, `skill-card.md`, and the detached signature `skill.oms.sig` for this skill, and the nvidia/skills sync workflow publishes it to the public catalog (per NVIDIA/skills#121).

The remaining 6 cuTile skills will receive their own `evals/evals.json` in follow-up PRs, scoped per-skill to keep each evaluation run within the per-job time budget and avoid blocking each other through the global gate.

Signed-off-by: Hannah Li <hanli@nvidia.com>
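The two naming conventions described in this commit message can be checked mechanically. The helpers below are a hypothetical sketch, not part of TileGym or its CI; they only encode the prefix and suffix rules as stated above, and the example names are made up.

```python
def is_collected_by_ci(test_name: str) -> bool:
    """True if the test function uses the `test_op` prefix described above."""
    return test_name.startswith("test_op")


def has_valid_plot_suffix(plot_name: str) -> bool:
    """True if a benchmark plot_name ends in `-TFLOPS` or `-GBps`."""
    return plot_name.endswith(("-TFLOPS", "-GBps"))


# Hypothetical names, for illustration only.
print(is_collected_by_ci("test_op_rmsnorm"))      # prefix present
print(is_collected_by_ci("test_rmsnorm"))         # prefix missing
print(has_valid_plot_suffix("rmsnorm-fwd-GBps"))  # suffix present
```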

…m-adding-cutile-kernel

Replace two implementation-heavy positive cases with one orientation-style positive case and two additional negative cases to cover related-but-out-of-scope topics (performance tuning and multi-GPU distribution). Adjust expected_behavior phrasing to be agent-agnostic.

Parallel experiment to PR #135. Same scaffolding (skill renames + nvskills signatures) but evals.json is attached to tilegym-cutile-autotuning instead of tilegym-adding-cutile-kernel. 5-case design: 2 positive (impl + diagnostic), 3 negative (logging, install/version, multi-GPU/NCCL). All expected_behavior phrased positively to avoid binary 'did not invoke skill' gates that have been tripping up claude in the prior pipeline runs.

copy-pr-bot · 2026-05-29T07:20:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

hannahli-nv · 2026-05-29T07:20:12Z

/ok to test 514bd49

hannahli-nv · 2026-05-29T07:22:17Z

/nvskills-ci

hannahli-nv · 2026-05-29T07:39:37Z

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

hannahli-nv · 2026-05-29T08:21:49Z

/ok to test c2285e6

hannahli-nv and others added 7 commits May 28, 2026 07:55

Fix sibling-link paths in references/workflow.md

6131ed3

Signed-off-by: Hannah Li <hanli@nvidia.com>

Attach NVSkills validation signatures

0531fbc

Add tilegym- prefix to skill folder names and 4-case evals for tilegy…

4f6ce19

…m-adding-cutile-kernel

Bessss-zyw approved these changes May 29, 2026


Attach NVSkills validation signatures

c2285e6

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>

hannahli-nv wants to merge 8 commits into main from add-cutile-autotuning-evals

3 participants
