Add 5-case evals for tilegym-cutile-autotuning (parallel to #135) #136

Open: hannahli-nv wants to merge 8 commits into main from add-cutile-autotuning-evals

@hannahli-nv
Collaborator

Summary

Parallel experiment to PR #135. Same scaffolding (skill renames with tilegym- prefix, NVSkills signature files, workflow.md fix) but the evals.json is attached to tilegym-cutile-autotuning instead of tilegym-adding-cutile-kernel.

PR #135's Tier 3 AGENT_EVAL keeps failing because codex shows a lift of -0.05 on the adding-cutile-kernel skill (a regression: the skill's MUST-style execution rules cause codex to walk extra steps it isn't equipped for). This branch tests whether cutile-autotuning clears Tier 3 more reliably:

  • Less prescriptive SKILL.md: no "MUST TodoWrite + walk 6 steps" execution rules; the pattern is built around copy-paste reference snippets that both agents can imitate
  • Pattern-matching topic: autotune config / search-space design plays to codex's strength (mechanical refactor of an existing ct.launch into tune-once/cache/launch)
  • Historical signal: cutile-autotuning had a claude lift of -0.02 in earlier runs (closest to the -0.01 gate among the non-adding-cutile-kernel skills)
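
For readers unfamiliar with the tune-once/cache/launch shape mentioned above, here is a minimal, framework-agnostic sketch. It does not use the cuTile API: `tuned_launch`, `time_once`, the `occupancy` config key, and the cache layout are all hypothetical names chosen for illustration, and the "kernel" is any Python callable that accepts a config dict.

```python
import time
from typing import Callable, Dict, Hashable, Sequence

# Hypothetical cache: problem key (e.g. op name and shape) -> best config found so far.
_best_config: Dict[Hashable, dict] = {}


def time_once(launch: Callable[[dict], None], config: dict, repeats: int = 3) -> float:
    """Return the fastest wall-clock time of launch(config) over `repeats` runs.

    A real GPU benchmark would also need device synchronization; omitted here.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        launch(config)
        best = min(best, time.perf_counter() - start)
    return best


def tuned_launch(key: Hashable,
                 launch: Callable[[dict], None],
                 search_space: Sequence[dict]) -> dict:
    """Tune on the first call for `key`, then reuse the cached config."""
    if key not in _best_config:
        # Tune once: time every candidate and remember the fastest.
        _best_config[key] = min(search_space, key=lambda cfg: time_once(launch, cfg))
    config = _best_config[key]
    launch(config)  # Launch with the cached config.
    return config
```

With an occupancy-only search space such as `[{"occupancy": 1}, {"occupancy": 2}]`, the first call for a given key times both candidates and later calls for the same key skip straight to the launch.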

Eval design (5 cases)

| # | Type | Topic | Why it should pass for both agents |
|---|---|---|---|
| 001 | positive impl | Add occupancy-only autotune to RMSNorm | Simple refactor, both agents can copy the Quick Reference pattern |
| 002 | positive diagnostic | In-place RoPE corruption after first call | Focused single-issue diagnosis (Pitfall #1), no full kernel rewrite required |
| 003 | negative | TileGym logging configuration | Clear out-of-domain, reused from successful adding-cutile-kernel evals |
| 004 | negative | Python / PyTorch version requirements | Pure environment question, totally unrelated to autotune |
| 005 | negative | Multi-GPU NCCL all-reduce | Distributed topic, reused (scored claude/codex ≈ 0) in adding-cutile-kernel evals |

All expected_behavior entries are phrased positively ("addressed X" rather than "did NOT invoke skill"), following the lesson from PR #135 that binary "did not invoke" gates penalize claude even for cursory SKILL.md reads.
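
To make the phrasing rule concrete, the sketch below shows what one negative case might look like and how it could be linted. The field names come from the schema listed in this PR's commit messages for the adding-cutile-kernel evals, assumed here to apply unchanged. Every value, the list shape of `expected_behavior`, the use of `None` for negative cases, and both helper functions are hypothetical illustrations rather than contents of the actual evals.json.

```python
REQUIRED_FIELDS = (
    "id", "question", "expected_skill",
    "expected_script", "ground_truth", "expected_behavior",
)

# Hypothetical negative case: all values are placeholders, not copied from the PR.
example_case = {
    "id": "003",
    "question": "How do I configure logging in TileGym?",
    "expected_skill": None,    # assumption: negative cases expect no skill
    "expected_script": None,
    "ground_truth": "Placeholder answer text.",
    "expected_behavior": ["Addressed the logging configuration question directly."],
}


def missing_fields(case: dict) -> list:
    """Return the schema fields that are absent from `case`."""
    return [field for field in REQUIRED_FIELDS if field not in case]


def negatively_phrased(case: dict) -> list:
    """Return expected_behavior entries that read like binary 'did not' gates."""
    markers = ("did not", "does not", "must not")
    return [
        entry for entry in case["expected_behavior"]
        if any(marker in entry.lower() for marker in markers)
    ]
```

A check along these lines could run before an eval file is committed, flagging entries such as "Did NOT invoke skill" for rewording.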

Open questions for reviewers

CI Configuration

config:
  build: true
  # valid options are "ops" and "benchmark"
  test: []

🤖 Generated with Claude Code

hannahli-nv and others added 7 commits May 28, 2026 07:55
Changes:
- Move 7 cuTile skill folders from .agents/skills/ to skills/.
- Add .agents/skills and .claude/skills symlinks pointing to ../skills
  for backward compatibility.
- Update LICENSE, CONTRIBUTING.md, and .github/scripts/check_spdx_headers.py
  to reference the new skills/ path.
- Split skills/cutile-autotuning/SKILL.md: move API Reference,
  Step-by-Step Workflow, and Pitfall Checklist into new files under
  references/ to keep SKILL.md concise.

Signed-off-by: Hannah Li <hanli@nvidia.com>
Signed-off-by: Hannah Li <hanli@nvidia.com>
Adds the skill evaluation dataset for `adding-cutile-kernel`. The
question targets two TileGym CI naming conventions required for new
operators — the `test_op*` test-function prefix that gates which
pytest functions are collected by the CI `-k test_op` filter, and
the `-TFLOPS` / `-GBps` suffix required for the benchmark
`plot_name` parameter so results are parsed into the CI summary.

These conventions are documented in the adding-cutile-kernel
`SKILL.md` and are not reliably available from general training
data, so the skill provides clear lift over the no-skill baseline.
The case has produced zero high-severity regression findings across
three prior nvskills-ci runs.
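
The two naming conventions described in this commit message can be checked mechanically. The helpers below are a hypothetical sketch and not part of TileGym or its CI. They encode only what the message states: the `test_op` prefix and the `-TFLOPS` / `-GBps` suffix on `plot_name`.

```python
def collected_by_ci(test_name: str) -> bool:
    """True if the function name follows the test_op* prefix convention.

    Note: pytest's -k option matches substrings, so `-k test_op` would also
    select names that merely contain "test_op"; this check is stricter and
    enforces the prefix form the convention asks for.
    """
    return test_name.startswith("test_op")


def valid_plot_name(plot_name: str) -> bool:
    """True if the benchmark plot_name ends with a suffix the CI summary parses."""
    return plot_name.endswith(("-TFLOPS", "-GBps"))
```

For example, `collected_by_ci("test_op_rmsnorm")` is true while `collected_by_ci("test_rmsnorm")` is false, and `valid_plot_name("rmsnorm-fwd-GBps")` passes while a name without either suffix does not.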

Schema fields used: `id`, `question`, `expected_skill`,
`expected_script`, `ground_truth`, `expected_behavior`.

After this PR merges, the publication pipeline auto-generates
`BENCHMARK.md`, `skill-card.md`, and the detached signature
`skill.oms.sig` for this skill, and the nvidia/skills sync workflow
publishes it to the public catalog (per
NVIDIA/skills#121).

The remaining 6 cuTile skills will receive their own
`evals/evals.json` in follow-up PRs, scoped per-skill to keep each
evaluation run within the per-job time budget and avoid blocking
each other through the global gate.

Signed-off-by: Hannah Li <hanli@nvidia.com>
Replace two implementation-heavy positive cases with one orientation-style positive case and two additional negative cases to cover related-but-out-of-scope topics (performance tuning and multi-GPU distribution). Adjust expected_behavior phrasing to be agent-agnostic.
Parallel experiment to PR #135. Same scaffolding (skill renames + nvskills
signatures) but evals.json is attached to tilegym-cutile-autotuning instead
of tilegym-adding-cutile-kernel.

5-case design: 2 positive (impl + diagnostic), 3 negative (logging,
install/version, multi-GPU/NCCL). All expected_behavior phrased positively
to avoid binary 'did not invoke skill' gates that have been tripping up
claude in the prior pipeline runs.
@copy-pr-bot

copy-pr-bot Bot commented May 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hannahli-nv
Collaborator Author

/ok to test 514bd49

@hannahli-nv
Collaborator Author

/nvskills-ci

@hannahli-nv
Collaborator Author

/nvskills-ci

Signed-off-by: nvskills-svc-account <svc-nvskills-signing@nvidia.com>
@hannahli-nv
Collaborator Author

/ok to test c2285e6
