fix: validate subcategory parents are sampler columns by andreatgretel · Pull Request #614 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-05-07T16:56:35Z

📋 Summary

When a subcategory sampler's parent column is a non-sampler type (e.g. llm-text), DataDesigner used to fail deep in DataSchema validation with the misleading message Column 'X' not found in schema - even though the column exists in the user's config, just not in the sampler subset. This PR adds a model validator at the DataDesignerConfig level (which has visibility into all column types) so the error names the parent's actual column type and points the user at the real fix.

🔗 Related Issue

N/A — surfaced by a user on Slack who hit it after changing a column from a category sampler to an llm-text column without realizing a downstream subcategory sampler still referenced it as parent.

🔄 Changes

Add _check_subcategory_parents_are_samplers model validator on DataDesignerConfig (packages/data-designer-config/src/data_designer/config/data_designer_config.py).
Validator only fires for the case the existing schema-level validator can't handle (parent exists but is non-sampler); the existing validate_subcategory_columns_if_present continues to handle the wrong-sampler-type case.
Add unit tests covering Kendrick's case (parent is llm-text) and a happy-path case (packages/data-designer-config/tests/config/test_data_designer_config.py).

🧪 Testing

pytest packages/data-designer-config/tests/ passes (499/499)
pytest packages/data-designer-engine/tests/engine/sampling_gen/test_schema.py passes (10/10)
Unit tests added for the new validator
E2E tests added/updated (N/A — config-only change)

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (N/A)

Subcategory sampler columns require a category-sampler parent. When the parent column was a non-sampler type (e.g. llm-text), validation failed deep inside the sampler-only DataSchema with the misleading message "Column 'X' not found in schema" - the column does exist in the user's config, just not in the sampler subset. Add a model validator on DataDesignerConfig that has visibility into all column types and raises a precise error naming the parent's actual column type.

github-actions · 2026-05-07T16:58:49Z

PR #614 Review — fix: validate subcategory parents are sampler columns

Summary

Adds a model_validator(mode="after") on DataDesignerConfig that fires when a subcategory sampler's parent column exists but has a non-sampler column_type (e.g. llm-text, seed-dataset, expression). Previously this case bottomed out in DataSchema.get_column at packages/data-designer-engine/src/data_designer/engine/sampling_gen/schema.py:138-141 with the misleading Column 'X' not found in schema error, because only sampler columns make it into the schema. The new validator runs at the config layer where all column types are visible, so it can name the actual offending column_type back to the user. +67/-1, config-only, two unit tests.

Findings

Correctness — looks right

The validator guards parent is not None, so a subcategory whose params.category refers to a non-existent column falls through to the downstream DataSchema error. This overlap with the existing schema-level validator is intentional and documented in the PR description — the two validators split responsibility cleanly: this one handles "parent exists but wrong column_type"; validate_subcategory_columns_if_present (schema.py:112) handles "parent is a sampler, but wrong sampler_type" plus value-set alignment.
Short-circuit on col.column_type != "sampler" is what keeps the col.sampler_type access safe — only sampler columns carry that attribute. The ordering of the two clauses is load-bearing; worth a brief inline comment or just leaving it as-is since it's idiomatic Pydantic.
by_name is built once per validator invocation (O(n)) and the outer loop is O(n), so the whole check is O(n) in column count. Fine.

Error message — slightly ambiguous

The message says Subcategory parents must be category-sampler columns. Readers unfamiliar with the distinction between column_type="sampler" and sampler_type="category" may read "category-sampler" as a single compound term. Consider tightening to something like:

Subcategory parents must be sampler columns with sampler_type='category'.

This also mirrors the exact phrasing the downstream validator at schema.py:116-119 uses when the parent is a sampler of the wrong type, so users who trip both validators over time see consistent terminology.

Test coverage — good for the happy/unhappy paths, one gap

test_subcategory_parent_must_be_a_sampler_column covers the Slack case (parent is llm-text). ✓
test_subcategory_parent_as_category_sampler_is_valid covers the happy path. ✓
Missing: a test where the parent is a seed-dataset column (or another non-sampler, non-llm type). Since the validator treats all non-sampler column_types uniformly, one extra parametrized case (or a single additional test with seed-dataset) would lock in that contract. The llm-text-only test leaves room for a future refactor to narrow the check to llm columns without any test failing.
Nice to have: a test asserting that if the parent column name doesn't exist at all, the new validator does not raise (i.e. it defers to the existing schema-level error). This documents the intentional scope of the new check.

Style / conventions — conforming

from __future__ import annotations present. ✓
Absolute imports. ✓
typing_extensions.Self used for the return type (matches existing pattern in schema.py:113). ✓
The string literal "sampler" is consistent with the Literal["sampler"] discriminator on SamplerColumnConfig (column_configs.py:82). Using DataDesignerColumnType.SAMPLER would add an import without improving safety, so the literal is fine.
Validator name _check_subcategory_parents_are_samplers is more restrictive than what it actually enforces (it only catches the non-sampler case, not the wrong-sampler-type case). The leading underscore and the docstring-less body are acceptable for a private validator, but a one-line comment above it pointing at the companion validator in schema.py would save the next reader a grep.

Risk / blast radius — low

Config-only change in data-designer-config; no import-direction concerns (config package imports nothing from engine/interface).
The new check is strictly a narrowing of what used to succeed at config-validation time. A prior config that passed config validation and then failed at schema validation will now fail a tick earlier with a better error — no false positives are possible because the only configs it rejects are ones that would have failed downstream anyway.
No changes to public API, builder, or CLI; no import-time cost beyond SamplerType (already a lightweight enum).

Verdict

Approve with minor suggestions. The fix is well-scoped, matches the existing validator conventions in the codebase, and genuinely improves the error surface for a documented user-reported issue. The two non-blocking requests: (1) tighten the error wording to disambiguate "category-sampler," and (2) add a test case for a non-llm, non-sampler parent (e.g. seed-dataset) to pin the behavior for all non-sampler column types.

greptile-apps · 2026-05-07T16:59:15Z

Greptile Summary

This PR adds a model validator to DataDesignerConfig that catches the case where a subcategory sampler references a non-sampler column (e.g. llm-text) as its parent, surfacing a clear, actionable error instead of the confusing downstream Column not found in schema message.

Adds _validate_subcategory_parents as a mode=\"after\" model validator that builds a name→config lookup, iterates only subcategory sampler columns, and raises when the referenced parent column exists but is not a \"sampler\" column type.
Deliberately skips the missing-parent case, delegating it to the existing validate_subcategory_columns_if_present schema-level validator, keeping the two validators' responsibilities distinct.
Adds three unit tests: the triggering error case (llm-text parent), a valid category-sampler parent (happy path), and explicit confirmation that a missing parent defers to the downstream validator.

Confidence Score: 5/5

Safe to merge — the change is confined to a new model validator that only raises on a previously unhandled misconfiguration, leaving all valid configs unaffected.

The validator is narrowly scoped: it iterates already-validated column objects, short-circuits correctly on non-sampler columns, and only raises for the exact bad-state it targets. The missing-parent case is deliberately delegated downstream and is explicitly tested. No existing behavior is changed.

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-config/src/data_designer/config/data_designer_config.py	Adds _validate_subcategory_parents model validator; logic is correct, short-circuits on non-sampler columns, accesses params.category only after confirming subcategory sampler type.
packages/data-designer-config/tests/config/test_data_designer_config.py	Three new tests cover the error case, happy path, and missing-parent deferral — all scenarios are meaningful and the assertions are appropriate.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[model_validate called on DataDesignerConfig] --> B[Pydantic validates individual fields and columns]
    B --> C[_validate_subcategory_parents runs]
    C --> D[Build name to column map]
    D --> E{For each column}
    E --> F{column_type == sampler AND sampler_type == SUBCATEGORY?}
    F -- No --> E
    F -- Yes --> G{parent = by_name.get col.params.category}
    G -- parent is None --> H[Skip - defer to validate_subcategory_columns_if_present]
    H --> E
    G -- parent found --> I{parent.column_type != sampler?}
    I -- No, parent is a sampler --> J[Pass - proceed]
    J --> E
    I -- Yes, parent is non-sampler e.g. llm-text --> K[Raise ValueError with clear column type message]

_{Reviews (4): Last reviewed commit: "chore: trim trailing whitespace introduc..." | Re-trigger Greptile}

johnnygreco · 2026-05-07T18:44:02Z

Nice fix on this one, @amanoel — turning a confusing "column not in schema" error into a precise, user-facing message is exactly the kind of polish that pays for itself in support time.

Summary

Adds a model_validator(mode="after") on DataDesignerConfig that detects subcategory sampler columns whose parent is a non-sampler column (e.g. an llm-text column) and raises a clear ValueError naming the offending parent and its actual column_type. Previously this misconfiguration leaked through to the sampler-only DataSchema validation and surfaced as a misleading "Column 'X' not found in schema" error.

Findings

Suggestions — Take it or leave it

packages/data-designer-config/src/data_designer/config/data_designer_config.py:56-59 — Error message is slightly broader than the check

What: The check is parent.column_type != "sampler" (any sampler), but the message asserts that "Subcategory parents must be category-sampler columns." If a user picks, say, a uniform sampler as the parent, this validator passes and the engine-level validate_subcategory_columns_if_present then raises the more specific "must be a category source type" error. The wording in the new validator slightly overpromises what it enforces.
Why: Future maintainers reading just this validator may assume it enforces the category-sampler rule; in practice that rule lives in engine/sampling_gen/schema.py:113. Two near-but-not-identical assertions split across packages can drift.

Suggestion: Either tighten the message to match what's checked here ("Subcategory parents must be sampler columns; the category-sampler rule is enforced separately"), or — cleaner — extend this validator to also catch the wrong-sampler-type case so the user gets a single, authoritative error at the config layer:

if parent is not None:
    if parent.column_type != "sampler":
        raise ValueError(
            f"Subcategory column '{col.name}' has parent '{parent.name}', which is a "
            f"'{parent.column_type}' column. Subcategory parents must be category-sampler columns."
        )
    if parent.sampler_type != SamplerType.CATEGORY:
        raise ValueError(
            f"Subcategory column '{col.name}' has parent '{parent.name}' of sampler type "
            f"'{parent.sampler_type}'. Subcategory parents must be category samplers."
        )

Take it or leave it — keeping the layers of validation separate is also a defensible choice.

packages/data-designer-config/src/data_designer/config/data_designer_config.py:49 — Validator naming

What: The new validator is named _check_subcategory_parents_are_samplers. STYLEGUIDE.md (§Pydantic Models) suggests validate_* as the convention for check-style validators.
Why: Other validators in this same package use _validate_* (private) or validate_* (public). Consistency makes the intent scannable.
Suggestion: Consider renaming to _validate_subcategory_parents. Both forms exist in the codebase, so this is purely a polish.

packages/data-designer-config/tests/config/test_data_designer_config.py — Missing test for the "parent doesn't exist at all" path

What: The validator's if parent is not None branch is only exercised by the two new tests when a parent does exist. The "parent name typo / missing column" case (parent is None) isn't covered.
Why: It's a code path that silently falls through, and it's worth a regression test to lock in that this validator deliberately defers to downstream validation in that case (vs. accidentally swallowing the typo).
Suggestion: Add a small test that points category at a non-existent column name and asserts the existing downstream validation still raises (or — even simpler — that this new validator does not raise its specific message).

What Looks Good

Catching this at the DataDesignerConfig layer is the right call — it's the only layer that has visibility into all column types, so the error message can be precise about the parent's actual column_type.
The commit message is excellent: it explains the user-facing symptom, the root cause, and the fix. Future readers will appreciate this.
The two new tests cover both directions (failing case asserts on the specific error message via regex; passing case verifies a valid config round-trips).
from typing_extensions import Self matches the convention used across the rest of the config package.

Verdict

Ship it (with nits) — Only Suggestions. Nothing blocking. The error-message-vs-check mismatch is the one I'd most want to see addressed before merge, but a defensible answer is "we deliberately keep the category-sampler check at the engine layer," in which case just softening the message wording is enough.

This review was generated by an AI assistant.

- Tighten the error message to match what the validator actually checks ("sampler columns with sampler_type='category'") and mirror the wording the engine-level companion validator at schema.py uses. - Rename `_check_subcategory_parents_are_samplers` to `_validate_subcategory_parents` to follow STYLEGUIDE convention (`validate_*` for check-style validators). - Add a regression test pinning the deliberate scope: when the parent column name does not exist at all, this validator does not raise and defers to the existing engine-level "Column not found" path.

andreatgretel · 2026-05-07T20:35:37Z

Thanks for the careful reviews, @johnnygreco and the bots. Pushed 4751535 addressing the three findings that converged across the reviews:

1. Error message tightened - The previous wording ("category-sampler columns") overpromised what the check enforces. Now mirrors the engine-level validator's phrasing so users see consistent terminology if they trip both:

Subcategory parents must be sampler columns with sampler_type='category'.

I went with softening the message rather than extending the validator to also catch wrong-sampler-type - keeps the layer split clean, and the engine-level check at schema.py:116-119 already owns that contract.

2. Validator renamed - _check_subcategory_parents_are_samplers → _validate_subcategory_parents, per STYLEGUIDE.md:296 (validate_* for check-style validators).

3. Regression test added - test_subcategory_parent_missing_defers_to_schema_validator pins the intentional scope of the new validator: when the parent name doesn't exist at all, this validator does not raise and defers to the engine-level "Column not found" path.

Skipped: the suggestion to add a separate seed-dataset parent test - the validator treats all non-sampler column_types uniformly via one comparison, so a second case wouldn't add real coverage. Happy to parametrize if you'd prefer it locked in explicitly.

510/510 tests pass; ruff clean.

andreatgretel requested a review from a team as a code owner May 7, 2026 16:56

andreatgretel temporarily deployed to agentic-ci May 7, 2026 16:56 — with GitHub Actions Inactive

johnnygreco previously approved these changes May 7, 2026

View reviewed changes

Merge branch 'main' into andreatgretel/fix/subcategory-parent-validation

5102a32

andreatgretel dismissed johnnygreco’s stale review via 5102a32 May 7, 2026 22:50

chore: trim trailing whitespace introduced in merge resolution

cdb0c5c

andreatgretel requested a review from johnnygreco May 7, 2026 22:52

johnnygreco approved these changes May 8, 2026

View reviewed changes

andreatgretel merged commit 6cbbb7d into main May 8, 2026
49 checks passed

andreatgretel deleted the andreatgretel/fix/subcategory-parent-validation branch May 14, 2026 01:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: validate subcategory parents are sampler columns#614

fix: validate subcategory parents are sampler columns#614
andreatgretel merged 4 commits into
mainfrom
andreatgretel/fix/subcategory-parent-validation

andreatgretel commented May 7, 2026

Uh oh!

github-actions Bot commented May 7, 2026

Uh oh!

greptile-apps Bot commented May 7, 2026 •

edited

Loading

Confidence Score: 5/5

Flowchart

Uh oh!

johnnygreco commented May 7, 2026

Uh oh!

andreatgretel commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andreatgretel commented May 7, 2026

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented May 7, 2026

PR #614 Review — fix: validate subcategory parents are sampler columns

Summary

Findings

Correctness — looks right

Error message — slightly ambiguous

Test coverage — good for the happy/unhappy paths, one gap

Style / conventions — conforming

Risk / blast radius — low

Verdict

Uh oh!

greptile-apps Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

johnnygreco commented May 7, 2026

Summary

Findings

Suggestions — Take it or leave it

What Looks Good

Verdict

Uh oh!

andreatgretel commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented May 7, 2026 •

edited

Loading