Adds targeted eval coverage for the data-designer skill so Autopilot routing and skill-specific behaviors are easier to verify. The cases focus on Data Designer workflow use, person sampling, LLM judge score access, sampler params, and unrelated negative prompts.
🔗 Related Issue
N/A
🔄 Changes
Add skills/data-designer/evals/evals.json with focused positive evals for Autopilot dataset generation scenarios.
Add negative evals for unrelated PostgreSQL and React UI requests to verify the skill is not selected outside dataset-generation tasks.
Keep behavior checks short, targeted, and easy to grade as true or false.
Adds a single new file, skills/data-designer/evals/evals.json (86 lines), containing six eval cases for the data-designer skill: four positive cases exercising Autopilot routing plus specific behaviors (person sampling, LLM judge score access, sampler params), and two negative cases (PostgreSQL admin, React UI) that should not invoke the skill. No existing files are modified. JSON parses cleanly.
This is purely test/eval data: no runtime code paths are touched, and the blast radius is limited to anyone consuming the eval file.
Findings
Correctness & alignment with SKILL.md
The eval expectations line up well with the rules documented in skills/data-designer/SKILL.md:
The sampler-params case asserts params is used and sampler_params is not — directly mirrors the SKILL.md rule at line 37 (SamplerColumnConfig: Takes params, not sampler_params).
The llm-judge-scores case asserts the accepted-boolean derivation references .score — mirrors the LLM judge access guidance at SKILL.md line 38.
The person-reviews case references get_person_object_schema.py, which exists at skills/data-designer/scripts/get_person_object_schema.py. ✓
Both negative cases assert the skill is not selected for unrelated tasks (DB admin, React UI), which is appropriate given the SKILL.md description scopes the skill to dataset/synthetic-data/data-generation tasks.
Schema consistency
All six entries share the same shape: id, question, expected_skill, expected_script, ground_truth, expected_behavior. Good — uniform schema makes downstream grading simpler.
expected_script is null for five entries and "get_person_object_schema.py" for one. That's the only structured assertion about which script the agent should run; the rest of script-execution checks live in expected_behavior strings. Worth noting the schema isn't documented anywhere (no README in evals/, no JSON Schema), so the meaning of expected_script vs. the free-text expected_behavior items is implicit. Not blocking, but a short evals/README.md describing the schema and how cases are graded would help future contributors.
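Until the schema is documented, a small shape check could make the implicit contract explicit. The sketch below is illustrative only: it assumes the file is a top-level JSON array, that the six keys listed above are all required, and that `expected_behavior` is a list of strings. The helper name `validate_cases` is hypothetical and not part of the PR.

```python
import json

# Keys the review observed on every case. Treating all six as required is an assumption.
REQUIRED_KEYS = {
    "id",
    "question",
    "expected_skill",
    "expected_script",
    "ground_truth",
    "expected_behavior",
}


def validate_cases(cases):
    """Return a list of human-readable problems; an empty list means the shape is consistent."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        label = case.get("id", f"<case {index}>")
        keys = set(case)
        missing = REQUIRED_KEYS - keys
        extra = keys - REQUIRED_KEYS
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
        if extra:
            problems.append(f"{label}: unexpected keys {sorted(extra)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # Assumption: expected_behavior is a list of natural-language strings.
        behaviors = case.get("expected_behavior")
        if "expected_behavior" in case and not (
            isinstance(behaviors, list) and all(isinstance(b, str) for b in behaviors)
        ):
            problems.append(f"{label}: expected_behavior should be a list of strings")
    return problems


def validate_file(path):
    """Load an eval file from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_cases(json.load(handle))
```

Calling `validate_file("skills/data-designer/evals/evals.json")` from the repo root would return an empty list if all six entries match the assumed shape.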
Behavior assertions — graders need to match
The expected_behavior strings are natural-language predicates that presumably get fed to an LLM judge. A few observations:
Sampler-params case, item "The site column is generated by a category sampler" — strong assertion. The user prompt mentions only site as a column; the agent could reasonably pick category or subcategory. Probably fine as written, but if the grader is strict about category vs. any categorical sampler, this could be flaky.
Person-reviews case, item "The agent ran python scripts/get_person_object_schema.py with a locale argument" — assumes the script accepts a locale arg. Worth a quick sanity check that the script's CLI actually supports this; if it's optional, a flaky-grader risk exists.
Negative cases — the assertion "The agent did not read the data-designer SKILL.md" is a precise behavior, but Autopilot/skill-invocation typically reads SKILL.md as part of skill selection, not just execution. Whether the harness counts skill-selection reads against this assertion depends on grader semantics. Worth verifying with one dry run of the negative cases that a passing agent actually trips zero reads.
"The agent did not ask the user a clarifying question before building the script" appears in only one positive case, even though all four positive prompts include "do not ask follow-up questions" / "be opinionated" language. Consider adding this assertion to the other three for consistency, or deliberately leaving it off if the grading focus is different.
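One way to spot this kind of inconsistency mechanically is to list the positive cases that lack a given assertion. This is a rough sketch, not the grader's logic: it assumes negative cases carry `expected_skill: null`, that `expected_behavior` is a list of strings, and that a case-insensitive substring match is good enough. The function name `cases_missing_behavior` is made up for illustration.

```python
def cases_missing_behavior(cases, required_text):
    """Return ids of positive cases whose expected_behavior list lacks required_text."""
    needle = required_text.lower()
    missing = []
    for case in cases:
        # Assumption: negative (should-not-route) cases have expected_skill set to null.
        if case.get("expected_skill") is None:
            continue
        behaviors = case.get("expected_behavior") or []
        if not any(needle in behavior.lower() for behavior in behaviors):
            missing.append(case["id"])
    return missing
```

For example, `cases_missing_behavior(cases, "clarifying question")` would return the ids of positive cases that never assert the no-follow-up behavior, leaving the decision to add or omit it a deliberate one.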
Project conventions
File location (skills/<skill>/evals/evals.json) is the first instance of this pattern in the repo (no other evals*.json files exist). The PR establishes the convention rather than following one — fine, but worth being deliberate. If multiple skills will get evals, codifying the layout (one file per skill, top-level array, this schema) in skills/README.md or similar would prevent drift.
No SPDX/license header — appropriate, JSON doesn't carry one.
No tests or CI wiring is added to consume the eval file. The PR description notes this is intentional ("eval JSON only"), and presumably an external evaluator picks it up. If there's a make target or repo-internal runner, a follow-up that wires it in would close the loop.
Minor
id values use kebab-case prefixed with data-designer- — consistent and grep-friendly. ✓
ground_truth and expected_behavior are slightly redundant in places (e.g., the positive cases' ground_truth restates what the behavior list already covers). Not a problem, just verbose.
DCO sign-off is missing per the PR checklist. Will need to be addressed before merge if this repo enforces DCO.
Risks
Low. No code paths change. Worst case is a flaky eval grader; that surfaces as eval-suite noise, not user-facing breakage.
The eval assertions are tightly coupled to current SKILL.md wording (e.g., file names, the .score rule, the params vs. sampler_params rule). Future SKILL.md edits will need to keep this file in sync — a one-line note in SKILL.md or a comment in the eval file pointing at the dependency would help.
Verdict
Looks good. Small, well-scoped, internally consistent, and the assertions match documented skill behavior. Suggested non-blocking follow-ups:
Add a brief skills/data-designer/evals/README.md (or top-level skills/EVALS.md) describing the schema and grading model.
Verify get_person_object_schema.py actually accepts a locale argument before this eval runs in CI.
Consider tightening or relaxing the "did not read SKILL.md" assertion on negative cases based on how the grader treats skill-selection reads.
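The locale check in the second follow-up could be done with a quick probe of the script's help text. The sketch below is only a starting point and rests on an unverified assumption: that `get_person_object_schema.py` responds to `--help` (as argparse-based CLIs do) and names its locale option there. The helper `help_mentions` is hypothetical.

```python
import subprocess
import sys


def help_mentions(script_path, needle, timeout=30):
    """Run `python <script_path> --help` and report whether needle appears in its output."""
    result = subprocess.run(
        [sys.executable, script_path, "--help"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Some CLIs print usage to stderr, so search both streams.
    return needle in (result.stdout + result.stderr)
```

Running `help_mentions("skills/data-designer/scripts/get_person_object_schema.py", "locale")` from the repo root would give a first signal. A `False` result only means the help text does not mention the word, so a positional locale argument would still need a manual look.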
Adds skills/data-designer/evals/evals.json with six targeted eval cases for the data-designer skill: four positive Autopilot workflow cases (support tickets, product reviews with person sampling, LLM judge score access, IoT telemetry sampler params) and two negative routing cases (PostgreSQL admin and React UI). The expected behavior checks are well-aligned with SKILL.md constraints such as .score accessor patterns and params vs sampler_params.
Four positive evals exercise the Autopilot workflow end-to-end, including validate and preview steps; two negative evals guard against false-positive routing to the data-designer skill for unrelated tasks.
Three of the four positive evals omit the "The agent ran data-designer agent context" behavior check that is explicitly step 2 of autopilot.md; the llm-judge-scores and sampler-params evals have implicit proxy checks, but the person-reviews eval does not.
Confidence Score: 5/5
JSON-only eval data file with no executable code; safe to merge.
The change is a single JSON file containing eval test cases with no executable code, schema migrations, or runtime paths. The behavior checks are accurate and consistent with the skill's documented API constraints.
No files require special attention — only skills/data-designer/evals/evals.json was changed.
Important Files Changed
| Filename | Overview |
| --- | --- |
| skills/data-designer/evals/evals.json | New eval file adding 4 positive and 2 negative cases for data-designer; expected_script inconsistency in person-reviews case and a missing agent context coverage check noted. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Eval Runner] --> B{expected_skill?}
    B -->|data-designer| C[Positive Evals]
    B -->|null| D[Negative Evals]
    C --> E[autopilot-support-tickets]
    C --> F[autopilot-person-reviews]
    C --> G[autopilot-llm-judge-scores]
    C --> H[autopilot-sampler-params]
    D --> I[negative-database-admin]
    D --> J[negative-react-component]
```
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
skills/data-designer/evals/evals.json:18-31
**Missing `data-designer agent context` check**
Every Autopilot eval except `data-designer-autopilot-support-tickets` omits the "The agent ran data-designer agent context before writing the script" behavior. This is step 2 of `workflows/autopilot.md` and is how the agent learns all column, sampler, and processor schemas before writing code. In this case the person-reviews eval substitutes it with "The agent ran python scripts/get_person_object_schema.py", which inspects only the person-object schema — a narrower check. An agent that skips `data-designer agent context` and goes straight to person schema inspection would still pass, but would be writing script columns without having learned the full config schema. The llm-judge-scores and sampler-params evals have `.score` / `sampler_params` proxy checks that implicitly verify schema learning, but person-reviews has no such proxy.
🧪 Testing
`make test` passes: not run; eval JSON only
`python3 -m json.tool skills/data-designer/evals/evals.json` passes