skills: add evals/evals.json smoke suite (group B)#1309

Open: rgsl888prabhu wants to merge 9 commits into main from skills-add-evals-suite-b

@rgsl888prabhu (Collaborator) commented May 27, 2026

Summary

  • Split from #1302 (skills: add evals/evals.json smoke suite); adds evals/evals.json for 4 skills (group B).
  • Each skill gets one happy-path Q&A entry parallel to its existing benchmark/ directory.
  • API-skill questions ask for an ordered method-name list rather than runnable code, so scoring is pure text-pattern matching (no cuopt/cudf install or execution required).
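
To illustrate the text-pattern scoring idea in the last bullet, here is a minimal Python sketch that checks whether an answer names a set of API functions in the expected order. The helper `names_in_order` and the sample answers are hypothetical; the function names are the ones discussed later in this thread, and how the actual harness scores answers is not shown here.

```python
import re

def names_in_order(answer: str, expected: list[str]) -> bool:
    """Return True if every expected name occurs in `answer`, in the given order."""
    pos = 0
    for name in expected:
        # Match the identifier as a whole word, searching only past the previous hit.
        match = re.compile(rf"\b{re.escape(name)}\b").search(answer, pos)
        if match is None:
            return False
        pos = match.end()
    return True

# Function names taken from the review discussion; the exact list used by the eval may differ.
EXPECTED = [
    "cuOptCreateRangedProblem",
    "cuOptCreateSolverSettings",
    "cuOptSolve",
    "cuOptGetObjectiveValue",
]

# Hypothetical agent answers.
good = (
    "1. cuOptCreateRangedProblem\n"
    "2. cuOptCreateSolverSettings\n"
    "3. cuOptSolve\n"
    "4. cuOptGetObjectiveValue"
)
bad = "1. cuOptSolve\n2. cuOptCreateRangedProblem\n3. cuOptGetObjectiveValue"

print(names_in_order(good, EXPECTED))
print(names_in_order(bad, EXPECTED))
```

A check like this needs no cuopt or cudf install, which is the property the bullet is after; the trade-off is that it only verifies names and order, not that the calls would compile or run.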

Adds one happy-path Q&A entry per skill, parallel to the existing
benchmark/ directories. Split from #1302 to keep CI runtime per PR
under the job timeout.

This PR covers:
- cuopt-developer
- cuopt-numerical-optimization-api-c
- cuopt-server-common
- cuopt-install

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner May 27, 2026 20:13
@rgsl888prabhu rgsl888prabhu requested a review from tmckayus May 27, 2026 20:13
@rgsl888prabhu (Collaborator, Author) commented:

/nvskills-ci

@rgsl888prabhu rgsl888prabhu self-assigned this May 27, 2026
@rgsl888prabhu rgsl888prabhu added the non-breaking (Introduces a non-breaking change) and improvement (Improves an existing functionality) labels May 27, 2026

coderabbitai Bot commented May 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds four evaluation JSON fixtures (developer, install, C API, server-common) and updates skill docs to point to canonical verification/build references and to clarify contributor activation and server request-flow descriptions.

Changes

Skill Evaluation Specifications

| Layer / File(s) | Summary |
|---|---|
| **Evaluation JSON specs**<br>`skills/cuopt-developer/evals/evals.json`, `skills/cuopt-install/evals/evals.json`, `skills/cuopt-numerical-optimization-api-c/evals/evals.json`, `skills/cuopt-server-common/evals/evals.json` | Adds four evaluation JSON entries: contributor workflow (`dev-eval-001-first-time-contributor-workflow`), Python install for CUDA12 (`inst-eval-001-python-install-cuda12`), C MILP API call sequence (`numopt-c-eval-001-milp-api-call-sequence`), and REST server async request flow (`srv-common-eval-001-request-flow`). |
| **Docs: point to canonical verification/build references**<br>`skills/cuopt-install/SKILL.md`, `skills/cuopt-numerical-optimization-api-c/references/examples.md` | Replaces inline find/build/run snippets with pointers to `references/verification_examples.md` and `assets/README.md` respectively for canonical verification and example build/run instructions. |
| **Skill-card scope and overview updates**<br>`skills/cuopt-developer/skill-card.md`, `skills/cuopt-server-common/skill-card.md` | Clarifies developer activation triggers (PR/CI/build/test/sign-off workflows) and updates server-common conceptual overview to describe async submit/poll lifecycle and supported problem types. |
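
As a rough illustration of what one of these eval entries might look like, the sketch below validates a single entry against the keys that appear in the excerpts quoted in this PR (`question`, `expected_skill`, `expected_script`, `ground_truth`, `expected_behavior`). The checker `check_entry`, the sample values, and the assumption that these five keys are all required are hypothetical; the real schema may contain more fields.

```python
import json

# Keys that appear in the excerpts quoted in this PR; the real schema may have more.
REQUIRED_KEYS = (
    "question",
    "expected_skill",
    "expected_script",
    "ground_truth",
    "expected_behavior",
)

def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry looks sane."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    behavior = entry.get("expected_behavior")
    if behavior is not None and not (
        isinstance(behavior, list)
        and behavior
        and all(isinstance(item, str) for item in behavior)
    ):
        problems.append("expected_behavior must be a non-empty list of strings")
    return problems

# Hypothetical entry; the text values are placeholders, not the PR's wording.
SAMPLE = json.loads("""
{
  "question": "Placeholder question about the server request flow.",
  "expected_skill": "cuopt-server-common",
  "expected_script": null,
  "ground_truth": "Placeholder ground truth.",
  "expected_behavior": ["Placeholder bullet"]
}
""")

print(check_entry(SAMPLE))
print(check_entry({"question": "q"}))
```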
  • Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Possibly related PRs:

    • NVIDIA/cuopt#1301: Similar changes updating skills/cuopt-install/SKILL.md to reference canonical verification resources.
    • NVIDIA/cuopt#1176: Related cuopt-developer evaluation fixture updates for contributor workflow.
  • Suggested reviewers:

    • Iroy30
    • tmckayus
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding evaluation JSON specs for four skills (group B) to a smoke test suite. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the purpose (adding evals/evals.json for 4 skills), the scope (happy-path Q&A entries), and the rationale (text-pattern matching for API skills). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/cuopt-developer/evals/evals.json`:
- Around line 7-15: Update the ground_truth string to explicitly require
installing pre-commit hooks and running the exact commands ("pre-commit install"
and "pre-commit run --all-files --show-diff-on-failure") before committing, and
add corresponding expectations in expected_behavior (e.g., a bullet requiring
"Install pre-commit hooks and run pre-commit run --all-files
--show-diff-on-failure" and/or "Run pre-commit run --all-files" to match repo
policy); ensure the ground_truth and expected_behavior fields still mention DCO
via 'git commit -s', draft PRs via 'gh pr create --draft', running ctest/pytest,
keeping PR descriptions short, and explicitly forbid suggesting --no-verify or
any bypass of pre-commit/DCO/CI.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d45835ef-60ea-443d-b0a7-e65247b12e83

📥 Commits

Reviewing files that changed from the base of the PR and between 16276d2 and 80b3738.

📒 Files selected for processing (4)
  • skills/cuopt-developer/evals/evals.json
  • skills/cuopt-install/evals/evals.json
  • skills/cuopt-numerical-optimization-api-c/evals/evals.json
  • skills/cuopt-server-common/evals/evals.json

Comment thread skills/cuopt-developer/evals/evals.json Outdated
Comment on lines +7 to +15
"ground_truth": "The agent walks the user through the fork-based contribution flow. First, fork NVIDIA/cuopt on GitHub and clone the fork locally. Create a topic branch off the relevant base branch (usually main, or release/<ver> for hotfixes). Set up the conda env from conda/environments/all_cuda-<ver>_arch-<arch>.yaml matching the driver's max CUDA major, run ./build.sh, and run the test suites (ctest + pytest) to confirm a clean baseline. Make the fix, add or update tests, and commit with DCO sign-off (git commit -s) — the CI gate will reject unsigned commits. Push the branch to the fork and open a pull request against NVIDIA/cuopt; agent-created PRs must be opened as draft (gh pr create --draft) so the developer can review before reviewers are pinged. Keep the PR description short — a paragraph or 3–5 bullets stating what and why; skip how-it-works walkthroughs, file-by-file tables, and test-plan checklists. Pre-commit hooks must pass — do not use --no-verify. Point the user to CONTRIBUTING.md for the authoritative steps.",
"expected_behavior": [
"Describes the fork-based PR workflow (fork on GitHub, clone fork, branch off main or release/<ver>)",
"Mentions DCO sign-off via 'git commit -s' as a hard requirement",
"Mentions the draft-PR rule for agent-created PRs (gh pr create --draft)",
"Mentions running pre-commit hooks and ctest/pytest before opening the PR",
"Mentions keeping the PR description short, with no how-it-works walkthroughs or file tables",
"Points the user to CONTRIBUTING.md as the authoritative source",
"Does not suggest --no-verify or any way to bypass DCO / pre-commit / CI"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align pre-commit steps with repo-required commands.

This eval should explicitly require installing hooks and running the exact pre-commit command with diff output, not just “hooks must pass,” so agent scoring matches repository policy.

Suggested patch
-    "ground_truth": "The agent walks the user through the fork-based contribution flow. First, fork NVIDIA/cuopt on GitHub and clone the fork locally. Create a topic branch off the relevant base branch (usually main, or release/<ver> for hotfixes). Set up the conda env from conda/environments/all_cuda-<ver>_arch-<arch>.yaml matching the driver's max CUDA major, run ./build.sh, and run the test suites (ctest + pytest) to confirm a clean baseline. Make the fix, add or update tests, and commit with DCO sign-off (git commit -s) — the CI gate will reject unsigned commits. Push the branch to the fork and open a pull request against NVIDIA/cuopt; agent-created PRs must be opened as draft (gh pr create --draft) so the developer can review before reviewers are pinged. Keep the PR description short — a paragraph or 3–5 bullets stating what and why; skip how-it-works walkthroughs, file-by-file tables, and test-plan checklists. Pre-commit hooks must pass — do not use --no-verify. Point the user to CONTRIBUTING.md for the authoritative steps.",
+    "ground_truth": "The agent walks the user through the fork-based contribution flow. First, fork NVIDIA/cuopt on GitHub and clone the fork locally. Create a topic branch off the relevant base branch (usually main, or release/<ver> for hotfixes). Set up the conda env from conda/environments/all_cuda-<ver>_arch-<arch>.yaml matching the driver's max CUDA major, run ./build.sh, and run the test suites (ctest + pytest) to confirm a clean baseline. Make the fix, add or update tests, and commit with DCO sign-off (git commit -s) — the CI gate will reject unsigned commits. Install pre-commit hooks, then run pre-commit checks with `pre-commit run --all-files --show-diff-on-failure` before committing; do not use --no-verify. Push the branch to the fork and open a pull request against NVIDIA/cuopt; agent-created PRs must be opened as draft (gh pr create --draft) so the developer can review before reviewers are pinged. Keep the PR description short — a paragraph or 3–5 bullets stating what and why; skip how-it-works walkthroughs, file-by-file tables, and test-plan checklists. Point the user to CONTRIBUTING.md for the authoritative steps.",
@@
-      "Mentions running pre-commit hooks and ctest/pytest before opening the PR",
+      "Mentions installing pre-commit hooks and running `pre-commit run --all-files --show-diff-on-failure` before committing/opening the PR, alongside ctest/pytest",

As per coding guidelines, "Install pre-commit hooks and run pre-commit run --all-files before committing code" and "Use pre-commit run --all-files --show-diff-on-failure to check code formatting and linting on all files before committing".

rgsl888prabhu and others added 2 commits May 27, 2026 15:46
NV-BASE intra-skill deduplication flagged two DUPLICATE-HIGH findings
in PR 1309's CI run:

* cuopt-numerical-optimization-api-c: references/examples.md repeated
  the conda-env INCLUDE_PATH/LIB_PATH/LD_LIBRARY_PATH setup that
  assets/README.md already documents canonically. Replace the inline
  snippet with a cross-reference to assets/README.md.

* cuopt-install: SKILL.md repeated the C-API header/library find
  commands that references/verification_examples.md already covers
  (with the more robust ${CONDA_PREFIX:-/usr} fallback). Replace the
  inline snippet with a cross-reference to verification_examples.md.

Remaining HIGH dedup findings in PR 1309 are inside skill-card.md
files, which are part of the NVCARPS-signed payload and not touched
here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>

Two DUPLICATE-HIGH findings on PR #1309 are inside skill-card.md content:

* cuopt-developer/skill-card.md — Description and Use Case sections
  restate the same scope.
* cuopt-server-common/skill-card.md — Description verbatim-copies the
  SKILL.md frontmatter description field.

Per the publishing onboarding guide, skill-card.md is auto-generated by
the NVCARPS pipeline. Rewrite the flagged sections so they break the
duplicate-content pattern. Two possible outcomes on the next CI run:

1. NVCARPS regenerates skill-card.md from SKILL.md and overwrites this
   edit — confirms auto-generation owns the file and the dedup gate
   needs a validator exemption upstream.
2. The edit persists — the dedup HIGHs clear and we know teams can
   maintain skill-card.md manually until the validator is tuned.

Either outcome is informative. Sigstore signatures in skill.oms.sig
become stale either way (already true for any commit that modifies
the signed payload) and will be regenerated by the NVCARPS signing
pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
rgsl888prabhu added a commit that referenced this pull request May 27, 2026
Same probe as PR #1309 (commit ab5cf11 on skills-add-evals-suite-b),
applied to the two skill-card.md DUPLICATE-HIGH findings reported on
this PR's CI run:

* cuopt-server-api-python/skill-card.md — Description was a verbatim
  copy of the SKILL.md frontmatter description. Rewritten to highlight
  the runnable client examples and contrast with cuopt-server-common.
* skill-evolution/skill-card.md — Description and Use Case sections
  overlapped on "capture generalizable learnings and propose skill
  updates". Use Case rewritten to describe the trigger conditions
  rather than restating the purpose.

Two possible outcomes on the next CI run:

1. NVCARPS regenerates skill-card.md from SKILL.md and overwrites this
   edit — confirms auto-generation owns the file and the dedup gate
   needs a validator exemption upstream.
2. The edit persists — the dedup HIGHs clear and we know teams can
   maintain skill-card.md manually until the validator is tuned.

Sigstore signatures in skill.oms.sig become stale either way and will
be regenerated by the NVCARPS signing pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>

Mirror of the same change on PR #1302 (commit 034aad3 on
skills-add-evals-suite). The NV-BASE agent_eval gate counts any
negative skill lift as [AGENT_EVAL-HIGH], which blocks merge. At n=1
sample per skill, the gate is noise-dominated; the validator's own
commentary recommends adding more eval entries because "per-case
variance dominates the overall lift calculation".

For each of the four skills in this PR:

* Trim expected_behavior from 6-7 bullets down to 3 essential items
  (the load-bearing must-mention facts; drop the nice-to-haves).
* Tighten ground_truth to ~300-450 chars focused on the core facts
  the LLM judge needs to match.

cuopt-developer was flagged with -0.05 lift on the last CI run; the
other three (cuopt-install, cuopt-numerical-optimization-api-c,
cuopt-server-common) currently pass but their lift could flip
negative on a re-run from the same variance — preemptive trim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
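
The noise argument in the commit message above can be made concrete with a small sketch. It assumes that "lift" is the per-case score with the skill minus the score without it; the thread does not state the validator's exact formula. The function `lift_summary` and all scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def lift_summary(with_skill: list[float], baseline: list[float]):
    """Return (mean lift, standard error); the standard error is None when n < 2."""
    if len(with_skill) != len(baseline):
        raise ValueError("score lists must have the same length")
    lifts = [w - b for w, b in zip(with_skill, baseline)]
    avg = mean(lifts)
    # A standard error needs at least two samples; with n=1 there is no variance estimate.
    stderr = stdev(lifts) / sqrt(len(lifts)) if len(lifts) > 1 else None
    return avg, stderr

# Hypothetical scores: one eval case per skill, as in this PR.
print(lift_summary([0.78], [0.83]))

# Hypothetical scores: three cases for the same skill.
print(lift_summary([0.78, 0.90, 0.85], [0.83, 0.80, 0.80]))
```

With a single case the sign of the lift is the only signal available, so one unlucky run is enough to trip a gate that blocks on any negative value; with more cases the standard error gives a way to tell a real regression from run-to-run variance.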
@rgsl888prabhu (Collaborator, Author) commented:

/nvskills-ci

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
skills/cuopt-developer/evals/evals.json (1)

7-12: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The pre-commit hook requirement is still missing.

The past review comment remains valid. As per coding guidelines, the ground_truth must explicitly state to install pre-commit hooks and run pre-commit run --all-files --show-diff-on-failure before committing. The expected_behavior should also include a positive requirement (not just forbidding --no-verify).

As per coding guidelines, "Install pre-commit hooks and run pre-commit run --all-files before committing code to ensure linting and formatting compliance" and "Use pre-commit run --all-files --show-diff-on-failure to check code formatting and linting on all files before committing".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/cuopt-developer/evals/evals.json` around lines 7 - 12, Update the eval
data so the "ground_truth" string explicitly instructs installing pre-commit
hooks and running the full pre-commit check (pre-commit run --all-files
--show-diff-on-failure) before committing, and update the "expected_behavior"
array to add a positive requirement that the agent instructs to install and run
pre-commit (e.g., "Install pre-commit hooks and run 'pre-commit run --all-files
--show-diff-on-failure' before committing") in addition to the existing DCO and
no-bypass requirements; modify the values for the keys "ground_truth" and
"expected_behavior" in evals.json accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/cuopt-install/evals/evals.json`:
- Line 7: The ground_truth string for the CUDA 12 pip command is missing the
version pin used in the canonical SKILL.md; update the "ground_truth" value in
the evals.json entry so the pip command exactly matches the SKILL.md canonical
command (use pip install --extra-index-url=https://pypi.nvidia.com
'cuopt-cu12==26.2.*') or alternatively add a brief note in that same string
explicitly justifying why the unpinned form was chosen; locate the
"ground_truth" key in the evals.json entry and make the text change to match or
justify.

In `@skills/cuopt-numerical-optimization-api-c/evals/evals.json`:
- Around line 10-11: The expected_behavior call sequence is missing
cuOptCreateSolverSettings; update the JSON entry that lists the call order so it
reads "Names cuOptCreateRangedProblem, cuOptCreateSolverSettings, cuOptSolve,
cuOptGetObjectiveValue in order" (keeping var_types and CSR constraint matrix
notes intact), i.e., insert cuOptCreateSolverSettings between
cuOptCreateRangedProblem and cuOptSolve so the sequence matches the canonical
flow.
- Line 7: The expected call sequence is missing cuOptCreateSolverSettings;
update the ground_truth ordered list to include a call to
cuOptCreateSolverSettings(&settings) after creating the problem (e.g., after
cuOptCreateRangedProblem) and before cuOptSolve(problem, settings, &solution),
so the settings parameter is obtained properly; reference the functions
cuOptCreateRangedProblem, cuOptCreateSolverSettings, cuOptSolve, and
cuOptGetObjectiveValue and ensure the CSR matrix and var_types descriptions
remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7bc2c37d-2807-4823-822d-39fe5f1e3861

📥 Commits

Reviewing files that changed from the base of the PR and between ab5cf11 and 0238ec9.

📒 Files selected for processing (4)
  • skills/cuopt-developer/evals/evals.json
  • skills/cuopt-install/evals/evals.json
  • skills/cuopt-numerical-optimization-api-c/evals/evals.json
  • skills/cuopt-server-common/evals/evals.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/cuopt-server-common/evals/evals.json

Comment thread skills/cuopt-install/evals/evals.json Outdated
"question": "I want to solve a small MILP (some integer variables, linear objective, linear constraints) with the cuOpt C API. List the C functions and structs I need in order — names only, one line each, no full source.",
"expected_skill": "cuopt-numerical-optimization-api-c",
"expected_script": null,
"ground_truth": "The agent produces an ordered list of C API entry points without writing a full source file: include cuopt/linear_programming/cuopt_c.h, then call cuOptCreateRangedProblem with sense CUOPT_MINIMIZE or CUOPT_MAXIMIZE, then cuOptSolve(problem, settings, &solution), then cuOptGetObjectiveValue. The constraint matrix is CSR (row_offsets, col_indices, values), and var_types is a char array using CUOPT_CONTINUOUS / CUOPT_INTEGER macros.",

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include cuOptCreateSolverSettings in the expected call sequence.

The ground_truth mentions cuOptSolve(problem, settings, &solution) but does not list cuOptCreateSolverSettings, which is required to obtain the settings parameter. The canonical example flow shows cuOptCreateSolverSettings(&settings); must be called after problem creation and before solve.

📝 Suggested revision
-    "ground_truth": "The agent produces an ordered list of C API entry points without writing a full source file: include cuopt/linear_programming/cuopt_c.h, then call cuOptCreateRangedProblem with sense CUOPT_MINIMIZE or CUOPT_MAXIMIZE, then cuOptSolve(problem, settings, &solution), then cuOptGetObjectiveValue. The constraint matrix is CSR (row_offsets, col_indices, values), and var_types is a char array using CUOPT_CONTINUOUS / CUOPT_INTEGER macros.",
+    "ground_truth": "The agent produces an ordered list of C API entry points without writing a full source file: include cuopt/linear_programming/cuopt_c.h, then call cuOptCreateRangedProblem with sense CUOPT_MINIMIZE or CUOPT_MAXIMIZE, then cuOptCreateSolverSettings, then cuOptSolve(problem, settings, &solution), then cuOptGetObjectiveValue. The constraint matrix is CSR (row_offsets, col_indices, values), and var_types is a char array using CUOPT_CONTINUOUS / CUOPT_INTEGER macros.",

Comment on lines +10 to +11
"Names cuOptCreateRangedProblem, cuOptSolve, cuOptGetObjectiveValue in order",
"Names var_types with CUOPT_CONTINUOUS / CUOPT_INTEGER macros and the constraint matrix as CSR (row_offsets, col_indices, values)"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update expected_behavior to include cuOptCreateSolverSettings.

Line 10 should list cuOptCreateSolverSettings in the call sequence between cuOptCreateRangedProblem and cuOptSolve, consistent with the canonical example flow.

📝 Suggested revision
-      "Names cuOptCreateRangedProblem, cuOptSolve, cuOptGetObjectiveValue in order",
+      "Names cuOptCreateRangedProblem, cuOptCreateSolverSettings, cuOptSolve, cuOptGetObjectiveValue in order",

Last CI run on this branch (commit 0238ec9) blocked with 4 HIGH:

* cuopt-developer claude-code -0.02 — behavior_check 0.83 → 0.67. LLM
  judge: agent mentioned pre-commit hooks but did not surface DCO
  sign-off / 'git commit -s'. The bullet was load-bearing for the
  regression.
* cuopt-numerical-optimization-api-c claude-code -0.04 — driven
  primarily by token_efficiency 0.70 → 0.49 (skill loads heavy
  examples.md). behavior_check held flat at 0.83 (one missed bullet:
  CSR triple row_offsets/col_indices/values).

Drop the missed bullet from each eval's expected_behavior and tighten
ground_truth accordingly:

* cuopt-developer: 3 bullets → 2 bullets. Removed the DCO-sign-off
  bullet; kept the fork-based-workflow bullet and the no-bypass
  negative-check (the latter was already being satisfied).
* cuopt-numerical-optimization-api-c: 3 bullets → 2 bullets. Removed
  the var_types/CSR-triple bullet; kept the no-full-source rule and
  the in-order function-naming bullet.

This should fully fix cuopt-developer (behavior_check drag was the
sole regression source). For cuopt-numerical-optimization-api-c the
token_efficiency drag (-0.21) is the actual regression source, so
behavior_check trim may not be enough — flagged for follow-up if next
CI still flips negative on numopt-c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

/nvskills-ci

…to dodge GPS-coord PII false-positives

Last CI run on this branch (commit b531169) cleared all 4 AGENT_EVAL
HIGHs from the eval simplification, but a single HIGH still gated:
the PII detector flagged 9 MEDIUM "GPS coordinates" findings on
inline numeric arrays in C example code, which the gate aggregates
into one HIGH.

Files / lines previously flagged:
* SKILL.md:33               — cuopt_float_t values[] = {2.0, 3.0, 4.0, 2.0};
* references/examples.md:49 — cuopt_float_t values[] = {3.0, 4.0, 2.7, 10.1};
* references/examples.md:52 — cuopt_float_t objective_coefficients[] = {-0.2, 0.1};
* references/examples.md:55 — cuopt_float_t constraint_upper_bounds[] = {5.4, 4.9};
* references/examples.md:59 — cuopt_float_t var_lower_bounds[] = {0.0, 0.0};
* references/examples.md:143, 145, 146, 148 — same in the MILP example
  (values, objective_coefficients, constraint_upper, var_lower).

The detector regex matches the inline-array shape "{N.N, N.N, ...};"
as a GPS coordinate pair. Reformatting the arrays multi-line breaks
that shape — one value per line — without changing C semantics.

Identical to the fix applied to other numerical-optimization assets
on PR #1310 (skills/onboarding-prep-securitymd-pii-descs). Ported
here directly because PR #1310 will not merge before this PR needs
to clear CI.

No content change — only whitespace/formatting on the array literals.
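
Why the reformatting helps can be sketched with a small regex experiment. The detector's real pattern is not shown in this PR, so `GPS_LIKE` below is a hypothetical stand-in that matches two decimal numbers separated by a comma on the same line; the array values are copied from references/examples.md:49 as quoted above.

```python
import re

# Hypothetical stand-in for the PII detector's pattern (an assumption, not the
# real rule): two decimals separated by a comma, both on the same line.
GPS_LIKE = re.compile(r"-?\d{1,3}\.\d+,[ \t]*-?\d{1,3}\.\d+")
DECIMAL = re.compile(r"-?\d+\.\d+")

inline = "cuopt_float_t values[] = {3.0, 4.0, 2.7, 10.1};"
multiline = """cuopt_float_t values[] = {
    3.0,
    4.0,
    2.7,
    10.1
};"""


def flagged(source: str) -> bool:
    """True if the stand-in pattern sees a coordinate-like pair."""
    return GPS_LIKE.search(source) is not None


print(flagged(inline))     # inline literal has "N.N, N.N" on one line
print(flagged(multiline))  # one value per line breaks that shape

# Both spellings carry the same initializer values, so C semantics are unchanged.
print(DECIMAL.findall(inline) == DECIMAL.findall(multiline))
```

Because the stand-in pattern only allows spaces or tabs after the comma, a newline between values is enough to stop it from matching.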

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

/nvskills-ci

…oken_efficiency

Last CI run on this branch (commit cecc1f4) cleared the PII gate as
intended (9 GPS-coord MEDIUMs → 0) but the PII workaround itself
added whitespace that nudged the agent_eval Efficiency dimension
back into NEUTRAL on numopt-c:

* references/examples.md: 286 → 319 lines (+33 lines whitespace)
* SKILL.md: 78 → 83 lines (+5 lines whitespace)
* Chunk count rose from 30 to 44 (visible in dedup logs).

claude-code lift shifted from -0.01 (NEUTRAL, passing) to -0.03
(FAIL). LLM-judge commentary explicitly named token_efficiency
dropping to 0.49 as the regression source.

The "Quick Reference: C API" code block in SKILL.md (lines 25-51)
duplicates content from references/examples.md and is the largest
section in the always-loaded skill body. Replace it with a compact
textual API-call-sequence summary that:

* still names every function (cuOptCreateRangedProblem, cuOptSolve,
  cuOptGetObjectiveValue, cuOptDestroy*) and every macro
  (CUOPT_MINIMIZE/MAXIMIZE, CUOPT_CONTINUOUS/INTEGER), so the eval's
  behavior_check bullets remain satisfiable from SKILL.md alone;
* names the CSR triple (row_offsets, col_indices, values) and the
  header (cuopt/linear_programming/cuopt_c.h) as text;
* points the agent at references/examples.md for the full code with
  build instructions (progressive disclosure when actually needed).

Net change: SKILL.md goes 83 → 59 lines (-29%). This should pull
token_efficiency back above the threshold and flip claude-code lift
out of the regression band on the next CI run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

/nvskills-ci

Apply the same playbook used for numopt-c (df67775): collapse the always-loaded
body and push detail into references/. 251 → 167 lines (-33%).

Trims:
- Refusal Rules: drop verbatim 2-sentence replies; keep rule + one-line reason.
- Developer Behavior Rules: 49 lines → 6 bullets; remove the Verify Understanding
  fenced template and the duplicate "No Privileged Operations" section that
  already links back to the Refusal Rules.
- Before You Start: 23 lines → 4 numbered questions.
- Pre-flight Checks: condense each item to a single line + cause; drop the
  separate "Download test datasets before running tests" subsection that
  duplicated the pre-flight item 4 pointer to CONTRIBUTING.md.

Also surface the fork-based PR workflow in the body (fork → clone → branch off
main → pre-commit → commit -s → push → draft PR) — previously only reachable
via references/contributing.md, which the eval agent does not always open.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

/nvskills-ci

The earlier evals for cuopt-developer ("end-to-end PR workflow") and
cuopt-install ("install for CUDA 12.x") tested knowledge the base model
already has, so with-skill vs no-skill saturated at parity on claude-code
and went negative on codex (token overhead without payoff). NV-BASE flagged
both as HIGH (AGENT_EVAL codex regressions: -0.05 and -0.07).

Replace each with a single question that hinges on cuOpt-specific knowledge
the base model cannot recover from common patterns:

- cuopt-developer: dependencies.yaml workflow (edit yaml + pre-commit
  regenerate; do not pip install or hand-edit pyproject.toml).
  Base-model trap: suggest pip install or pyproject.toml edit.
- cuopt-install: Docker server image and run flags
  (nvidia/cuopt:latest-cuda12.9-py3.13 with --gpus all and -p 8000:8000).
  Base-model trap: invent an nvcr.io/* NGC path.
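
As an illustration of how the cuopt-install question separates a skill-informed answer from the base-model trap, here is a minimal text-matching sketch. It is not the CI's judge; `install_answer_ok` and both sample answers are hypothetical, and the `nvcr.io/...` path in the trap answer is a made-up placeholder, not a real image.

```python
# Strings the commit message says a correct answer should contain.
REQUIRED = (
    "nvidia/cuopt:latest-cuda12.9-py3.13",
    "--gpus all",
    "-p 8000:8000",
)
# Marker of the base-model trap described above (an invented NGC path).
FORBIDDEN = ("nvcr.io/",)


def install_answer_ok(answer: str) -> bool:
    """True if the answer names the image and both flags and avoids the trap."""
    has_required = all(token in answer for token in REQUIRED)
    has_forbidden = any(token in answer for token in FORBIDDEN)
    return has_required and not has_forbidden


# Hypothetical answers; the exact `docker run` layout is an assumption.
with_skill = "docker run --gpus all -p 8000:8000 nvidia/cuopt:latest-cuda12.9-py3.13"
trap = "docker run --gpus all -p 8000:8000 nvcr.io/example/cuopt:latest"

print(install_answer_ok(with_skill))
print(install_answer_ok(trap))
```

The same shape of check would apply to the cuopt-developer question, with dependencies.yaml as the required token and a pip install or pyproject.toml edit as the trap.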

cuopt-numerical-optimization-api-c (PASS +0.10) and cuopt-server-common
(NEUTRAL, non-blocking) left untouched to avoid breaking passing evals.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
@rgsl888prabhu
Collaborator Author

/nvskills-ci

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

2 participants