Add skill evaluation dataset for cuopt-lp-milp-api-python #1172
rapids-bot[bot] merged 5 commits into NVIDIA:main
Conversation
Initial skill evaluation dataset for cuopt-lp-milp-api-python at skills/cuopt-lp-milp-api-python/evals/evals.json. 10 entries adapted from the microsoft/OptiGuide IndustryOR corpus (MIT license, attribution in evals/SOURCES.md):

- 5 LP-style problems (production planning, profit maximization, transportation, diet, blending with tiered pricing)
- 5 MILP-style problems (assignment, knapsack, lot-sizing, set multi-cover / shift scheduling, bin packing / car parking)

Each entry uses the standard schema with one extra `source` field for provenance.

Per the user's review:

- `ground_truth` is the numeric optimal value only (exact match, no tolerance), so the LLM judge has a deterministic check
- `expected_behavior` is generic and problem-agnostic: it does not pre-categorize a problem as LP vs. MILP, since that is the agent's job to infer from the problem text, and the cuopt-lp-milp-api-python skill covers both

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
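For reference, a single entry under this schema might look like the following sketch. The `id` string, `prompt` key name, and numbers are illustrative assumptions; `ground_truth`, `expected_behavior`, and the extra `source` field follow the description above.

```python
import json

# Hypothetical evals.json entry; field names other than ground_truth,
# expected_behavior, and source are assumptions for illustration.
entry = {
    "id": "lpmilp-001-production-planning",  # assumed ID convention
    "prompt": "A factory makes two products ...",  # abbreviated problem text
    "expected_behavior": (
        "Reports an optimal objective value that exactly matches the "
        "ground_truth to the precision shown (no rounding tolerance is allowed)"
    ),
    "ground_truth": "4600.0",  # numeric optimal value only, exact match
    "source": "microsoft/OptiGuide IndustryOR",  # provenance
}

print(json.dumps(entry, indent=2))
```

The exact-match `ground_truth` keeps the judge's check deterministic: a string comparison against the shown precision, with no tolerance band to argue about.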
The previous expected_behavior bullet ('Reports the optimal objective
value as part of the response') did not state that the value must
match the ground_truth exactly to the shown precision, leaving room
for the LLM judge to accept rounded answers. The bullet is replaced with
'Reports an optimal objective value that exactly matches the
ground_truth to the precision shown (no rounding tolerance is
allowed)' so the requirement is unambiguous.
Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
The earlier seed of 10 entries was a leftover 'start small' habit that doesn't really apply to dataset-derived evals: once the rubric and schema are validated, including the rest of the corpus is a near-zero-cost transcription.

Regenerated evals.json from scratch using the same generic rubric and the exact-precision ground_truth requirement, so all 99 entries are internally consistent. IDs are stable: lpmilp-NNN-<class-slug>, where NNN is the source row index + 1 and the class slug is derived from the first problem_class tag. This makes problem-level traceability easy without breaking the source-row mapping.

The corpus is overwhelmingly LP/MILP. The one row tagged PortfolioOptimization is included because its own text says 'Formulate this as a linear programming problem': it is an LP, not an actual QP, so it is in scope for cuopt-lp-milp-api-python.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
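The stated ID scheme (`lpmilp-NNN-<class-slug>`) can be sketched as below. The exact slugification rule is an assumption, since only "derived from the first problem_class tag" is specified.

```python
def make_eval_id(row_index: int, problem_class_tags: list) -> str:
    """Build a stable ID: lpmilp-NNN-<class-slug>, NNN = source row index + 1."""
    # Assumed slug rule: lowercase the first tag and hyphenate separators.
    slug = problem_class_tags[0].strip().lower().replace("_", "-").replace(" ", "-")
    return f"lpmilp-{row_index + 1:03d}-{slug}"

print(make_eval_id(0, ["Production Planning"]))  # lpmilp-001-production-planning
```

Deriving the number from the source row index means an entry's ID never changes when unrelated rows are added or removed, which is what preserves the source-row mapping.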
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 62cfb1c7-800f-4655-93ba-9e0ebb4348f9
📒 Files selected for processing (2)
- skills/cuopt-lp-milp-api-python/evals/SOURCES.md
- skills/cuopt-lp-milp-api-python/evals/evals.json
SOURCES.md context at the commented line: "The MIT license under which the source dataset is distributed:" followed by an opening ``` fence with no language tag.
Add a language tag to the fenced license block to satisfy markdownlint.
Line 18 opens a fenced block without a language, which triggers MD040.
Proposed fix:

````diff
-```
+```text
 MIT License
@@
 SOFTWARE
````
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 18-18: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
The other six bullets (decision variables, constraints, objective sense, cuOpt API usage, clarification, no solver substitution) were identical across all 99 entries and largely implied by the exact-precision objective-match check: an agent that gets the right answer to the shown precision must have formulated the problem correctly. This reduces duplication and file size without losing the load-bearing signal.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>
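The trim described above could be applied mechanically; this is a hedged sketch, with the surviving bullet text taken from the rubric quoted earlier and the entry structure assumed.

```python
# The single load-bearing rubric bullet retained across all entries.
EXACT_MATCH_BULLET = (
    "Reports an optimal objective value that exactly matches the "
    "ground_truth to the precision shown (no rounding tolerance is allowed)"
)

def trim_expected_behavior(entry: dict) -> dict:
    """Return a copy of an eval entry whose expected_behavior is reduced to
    the one bullet that the exact-precision check does not already imply."""
    trimmed = dict(entry)  # shallow copy; top-level fields only
    trimmed["expected_behavior"] = EXACT_MATCH_BULLET
    return trimmed

slim = trim_expected_behavior({"id": "lpmilp-001-x", "expected_behavior": "..."})
print(slim["expected_behavior"])
```

Since the bullet string is now identical everywhere, the per-entry rubric carries no entry-specific information, which is what makes the file-size reduction essentially free.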
Iroy30 left a comment:
LGTM. Just curious: is it easier to wget it each time or to have it in the repo?
If fetching from GitHub fails, that becomes a headache for CI, and even for local runs. Also, the evals need the data in JSON format, which is not the same as the format in the GitHub repo.
/merge
Summary
Initial skill evaluation dataset for cuopt-lp-milp-api-python at skills/cuopt-lp-milp-api-python/evals/evals.json. 99 entries adapted from the microsoft/OptiGuide IndustryOR corpus (MIT, attribution in evals/SOURCES.md).

- `ground_truth` is the numeric optimal value; the rubric requires an exact match to the precision shown (no tolerance)
- `expected_behavior` is generic across all entries and does not pre-categorize a problem as LP vs. MILP
- a `source` field references the dataset row for traceability

A QP eval set is out of scope (the corpus has no genuine QP problems) and will follow in a separate PR.