fix: validate subcategory parents are sampler columns #614
Subcategory sampler columns require a category-sampler parent. When the parent column was a non-sampler type (e.g. llm-text), validation failed deep inside the sampler-only DataSchema with the misleading message "Column 'X' not found in schema" - the column does exist in the user's config, just not in the sampler subset. Add a model validator on DataDesignerConfig that has visibility into all column types and raises a precise error naming the parent's actual column type.
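The failure mode can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the project's code: `Column`, `build_sampler_schema`, and `lookup` are hypothetical stand-ins for the sampler-only DataSchema, and only the error text is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Column:
    # Hypothetical stand-in for a column config; not the real DataDesigner classes.
    name: str
    column_type: str  # e.g. "sampler" or "llm-text"


def build_sampler_schema(columns: list[Column]) -> dict[str, Column]:
    """Keep only sampler columns, as a sampler-only schema would (assumed behavior)."""
    return {c.name: c for c in columns if c.column_type == "sampler"}


def lookup(schema: dict[str, Column], name: str) -> Column:
    """Look a column up in the sampler-only subset."""
    if name not in schema:
        # The column may exist in the full config; it is only missing from this subset.
        raise ValueError(f"Column '{name}' not found in schema")
    return schema[name]


# "topic" exists in the config, but as an llm-text column rather than a sampler.
columns = [Column("topic", "llm-text"), Column("subtopic", "sampler")]
schema = build_sampler_schema(columns)

try:
    lookup(schema, "topic")
    message = ""
except ValueError as exc:
    message = str(exc)
```

Here `message` reports the column as missing even though `topic` is present in `columns`, which is why a check with visibility into all column types can give a more precise error.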
PR #614 Review - fix: validate subcategory parents are sampler columns
Summary
Adds a
Findings
Correctness - looks right
Error message - slightly ambiguous
The message says This also mirrors the exact phrasing the downstream validator at
Test coverage - good for the happy/unhappy paths, one gap
Style / conventions - conforming
Risk / blast radius - low
Verdict
Approve with minor suggestions. The fix is well-scoped, matches the existing validator conventions in the codebase, and genuinely improves the error surface for a documented user-reported issue. The two non-blocking requests: (1) tighten the error wording to disambiguate "category-sampler," and (2) add a test case for a non-llm, non-sampler parent (e.g.
Greptile Summary
This PR adds a model validator to
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/data_designer_config.py | Adds _validate_subcategory_parents model validator; logic is correct, short-circuits on non-sampler columns, accesses params.category only after confirming subcategory sampler type. |
| packages/data-designer-config/tests/config/test_data_designer_config.py | Three new tests cover the error case, happy path, and missing-parent deferral — all scenarios are meaningful and the assertions are appropriate. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[model_validate called on DataDesignerConfig] --> B[Pydantic validates individual fields and columns]
B --> C[_validate_subcategory_parents runs]
C --> D[Build name to column map]
D --> E{For each column}
E --> F{column_type == sampler AND sampler_type == SUBCATEGORY?}
F -- No --> E
F -- Yes --> G{parent = by_name.get col.params.category}
G -- parent is None --> H[Skip - defer to validate_subcategory_columns_if_present]
H --> E
G -- parent found --> I{parent.column_type != sampler?}
I -- No, parent is a sampler --> J[Pass - proceed]
J --> E
I -- Yes, parent is non-sampler e.g. llm-text --> K[Raise ValueError with clear column type message]
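The flow above can be sketched in plain Python. This is a minimal sketch under stated assumptions: dataclasses stand in for the Pydantic models, `Params`, `Column`, and the free function `validate_subcategory_parents` are hypothetical names, and the error wording is illustrative rather than the message the PR actually raises.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    # Hypothetical: name of the parent column a subcategory sampler refers to.
    category: str


@dataclass
class Column:
    # Hypothetical stand-in for a column config; not the real DataDesigner classes.
    name: str
    column_type: str  # e.g. "sampler" or "llm-text"
    sampler_type: Optional[str] = None  # e.g. "category" or "subcategory"
    params: Optional[Params] = None


def validate_subcategory_parents(columns: list[Column]) -> None:
    """Raise if a subcategory sampler's parent exists but is not a sampler column."""
    by_name = {c.name: c for c in columns}
    for col in columns:
        # Short-circuit: only subcategory samplers have a parent to check.
        if col.column_type != "sampler" or col.sampler_type != "subcategory":
            continue
        parent = by_name.get(col.params.category)
        if parent is None:
            # Missing parent: skip and defer to the downstream "not found" check.
            continue
        if parent.column_type != "sampler":
            # Illustrative wording only; it names the parent's actual column type.
            raise ValueError(
                f"Subcategory column '{col.name}' references parent '{parent.name}', "
                f"which has column type '{parent.column_type}'; "
                "the parent must be a sampler column."
            )


# Happy path: the parent is a sampler column, so nothing is raised.
valid_columns = [
    Column("topic", "sampler", "category"),
    Column("subtopic", "sampler", "subcategory", Params("topic")),
]
validate_subcategory_parents(valid_columns)
```

As in the flowchart, the wrong-sampler-type case (a parent that is a sampler but not a category sampler) is deliberately not checked here and is left to the downstream validator.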
Reviews (4): Last reviewed commit: "chore: trim trailing whitespace introduc..."
Nice fix on this one, @amanoel - turning a confusing "column not in schema" error into a precise, user-facing message is exactly the kind of polish that pays for itself in support time.
Summary
Adds a
Findings
Suggestions - Take it or leave it
What Looks Good
Verdict
Ship it (with nits) - Only Suggestions. Nothing blocking. The error-message-vs-check mismatch is the one I'd most want to see addressed before merge, but a defensible answer is "we deliberately keep the category-sampler check at the engine layer," in which case just softening the message wording is enough.
This review was generated by an AI assistant.
- Tighten the error message to match what the validator actually checks
("sampler columns with sampler_type='category'") and mirror the wording
the engine-level companion validator at schema.py uses.
- Rename `_check_subcategory_parents_are_samplers` to
`_validate_subcategory_parents` to follow STYLEGUIDE convention
(`validate_*` for check-style validators).
- Add a regression test pinning the deliberate scope: when the parent
column name does not exist at all, this validator does not raise and
defers to the existing engine-level "Column not found" path.
Thanks for the careful reviews, @johnnygreco and the bots. Pushed 4751535 addressing the three findings that converged across the reviews:
1. Error message tightened - The previous wording ("category-sampler columns") overpromised what the check enforces. Now mirrors the engine-level validator's phrasing so users see consistent terminology if they trip both:
I went with softening the message rather than extending the validator to also catch wrong-sampler-type - keeps the layer split clean, and the engine-level check at
2. Validator renamed -
3. Regression test added -
Skipped: the suggestion to add a separate
510/510 tests pass; ruff clean.
📋 Summary
When a subcategory sampler's parent column is a non-sampler type (e.g. `llm-text`), DataDesigner used to fail deep in `DataSchema` validation with the misleading message `Column 'X' not found in schema` - even though the column exists in the user's config, just not in the sampler subset. This PR adds a model validator at the `DataDesignerConfig` level (which has visibility into all column types) so the error names the parent's actual column type and points the user at the real fix.
🔗 Related Issue
N/A — surfaced by a user on Slack who hit it after changing a column from a category sampler to an llm-text column without realizing a downstream subcategory sampler still referenced it as parent.
🔄 Changes
- `_check_subcategory_parents_are_samplers` model validator on `DataDesignerConfig` (`packages/data-designer-config/src/data_designer/config/data_designer_config.py`).
- `validate_subcategory_columns_if_present` continues to handle the wrong-sampler-type case.
- `llm-text`) and a happy-path case (`packages/data-designer-config/tests/config/test_data_designer_config.py`).
🧪 Testing
- `pytest packages/data-designer-config/tests/` passes (499/499)
- `pytest packages/data-designer-engine/tests/engine/sampling_gen/test_schema.py` passes (10/10)
✅ Checklist