feat: add dropped column preservation toggle#691
Conversation
Closes #690 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Greptile SummaryThis PR adds a
|
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/run_config.py | Adds preserve_dropped_columns: bool = True field to RunConfig with a Pydantic Field and docstring; backward-compatible default. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py | Gates the dropped-column parquet write on run_config.preserve_dropped_columns; columns are still removed from the main dataframe regardless of the flag. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Persists preserve_dropped_columns in metadata defaults and adds _dropped_column_artifact_policy_matches() for resume-compatibility checks in both _check_resume_config_compatibility() and _load_resume_state(). |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Adds two resume-policy mismatch tests: ALWAYS raises, IF_POSSIBLE downgrades to NEVER and starts fresh. |
| packages/data-designer/tests/interface/test_data_designer.py | Adds two end-to-end interface tests validating opt-out (no dropped-column artifacts written) and default preservation (dropped columns present only in artifact files). |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DatasetBuilder.build] --> B{resume in IF_POSSIBLE / ALWAYS?}
B -- Yes --> C[_check_resume_config_compatibility]
C --> D[_dropped_column_artifact_policy_matches metadata]
D -- mismatch --> E{resume mode?}
E -- ALWAYS --> F[raise DatasetGenerationError]
E -- IF_POSSIBLE --> G[downgrade resume to NEVER / start fresh run]
D -- match --> H[check config hash]
H --> I{compatible?}
I -- Yes --> J[resume = ALWAYS]
I -- No --> G
B -- No --> K[_set_metadata_defaults writes preserve_dropped_columns to metadata]
J --> K
G --> K
K --> L[_build_with_resume / _build_async]
L --> M[_load_resume_state / _dropped_column_artifact_policy_matches again]
M --> N[DropColumnsProcessor.process_after_batch]
N --> O{preserve_dropped_columns?}
O -- True --> P[save dropped-column parquet artifact]
O -- False --> Q[skip artifact write]
P --> R[drop columns from main DataFrame]
Q --> R
Reviews (5): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..." | Re-trigger Greptile
Review: PR #691 — feat: add dropped column preservation toggleSummaryAdds a new The change is a +112/-1 surgical addition: one new config field, one short-circuit in FindingsCorrectness
Conventions / style
TestsCoverage is appropriate for the change size:
The existing One small gap: there is no explicit positive-path unit test asserting that Performance / risk
Documentation
SecurityNo security-relevant changes. The toggle only affects whether intermediate parquet files are written; secrets handling, sandboxing, and Jinja rendering are untouched. VerdictApprove with optional follow-ups. This is a clean, narrowly scoped, backwards-compatible feature with proportionate test coverage. No correctness or architectural concerns. Optional, non-blocking suggestions:
|
johnnygreco
left a comment
There was a problem hiding this comment.
Thanks for putting this together, @nabinchha!
Summary
This adds a backwards-compatible RunConfig.preserve_dropped_columns toggle and wires it into DropColumnsProcessor so users can skip writing dropped-column parquet artifacts while still dropping those columns from the final dataset. The implementation matches the stated intent for fresh runs, but there is a resume/artifact compatibility edge case worth fixing before merge.
Findings
Warnings — Worth addressing
Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.
packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py:38 — Preserve toggle can leave resume artifacts inconsistent
- What: When
preserve_dropped_columns=False, new batches skip dropped-column parquet writes, but resume compatibility and artifact cleanup do not know the artifact policy changed. If a previous compatible run wrote dropped-column artifacts with the defaultTrue, then a resumed or extended run switches this flag toFalse, the olddropped-columns-parquet-filesremain while the new batches have none. After the build,load_dataset_with_dropped_columns()tries to concatenate a shorter dropped-column dataset andDataDesigner.create()fails with a row-count mismatch. - Why: This can bite users who add the new opt-out to an existing interrupted or extendable dataset. The dataset config fingerprint still matches, so
ResumeMode.ALWAYSproceeds and only fails after generation. The opposite direction (False->True) is also unsafe because the old dropped columns cannot be reconstructed for already-written batches. - Suggestion: Make
preserve_dropped_columnspart of the resume/artifact compatibility contract, for example by persisting it in metadata and treating changes as incompatible forResumeMode.ALWAYS/ fresh forIF_POSSIBLE. If switchingTrue->Falseshould be allowed, explicitly clear or ignore stale dropped-column artifacts when disabled and still rejectFalse->True. A regression test around resuming/extending adrop=Truedataset after toggling this flag would lock in the expected behavior.
What Looks Good
- The default remains
True, which preserves the current behavior for existing users on fresh runs. - The processor change is nicely scoped and keeps the declarative config / imperative engine split intact.
- The new tests cover both the default preservation path and the storage-saving opt-out path through the public
DataDesignerflow.
Verdict
Needs changes — please address the resume/artifact compatibility issue before merge.
This review was generated by an AI assistant.
Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
|
Addressed the resume/artifact compatibility feedback in What changed:
Validation:
|
…eserve-dropped-columns # Conflicts: # packages/data-designer-config/tests/config/test_run_config.py
johnnygreco
left a comment
There was a problem hiding this comment.
Thanks for the follow-up. The metadata-backed policy check addresses the resume/artifact consistency issue, and the focused regressions plus local repro look good to me.
📋 Summary
Adds a backwards-compatible
RunConfig.preserve_dropped_columnstoggle so users can opt out of writing dropped-column parquet artifacts. This keepsdrop=Truecolumns out of the main dataset parquet files while allowing storage-heavy intermediate columns, such as base64 image payloads, to be discarded entirely.🔗 Related Issue
Closes #690
🔄 Changes
preserve_dropped_columns: bool = TruetoRunConfig.DropColumnsProcessoron the new RunConfig option.metadata.jsonso resume can detect incompatible policy changes.preserve_dropped_columnschanges as resume-incompatible forResumeMode.ALWAYS, whileResumeMode.IF_POSSIBLEstarts a fresh dataset automatically.🧪 Testing
make testpasses (not run; focused tests run instead)UV_FROZEN=1 uv run pytest packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_build_resume_always_raises_on_dropped_column_artifact_policy_mismatch packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_build_if_possible_starts_fresh_on_dropped_column_artifact_policy_mismatch packages/data-designer/tests/interface/test_data_designer.py::test_create_with_drop_true_can_skip_dropped_column_artifacts packages/data-designer/tests/interface/test_data_designer.py::test_create_with_drop_true_preserves_columns_only_in_dropped_artifacts -qUV_FROZEN=1 uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -qUV_FROZEN=1 uv run ruff check packages/data-designer-config/src/data_designer/config/run_config.py packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer/tests/interface/test_data_designer.pyUV_FROZEN=1 uv run ruff format --check packages/data-designer-config/src/data_designer/config/run_config.py packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer/tests/interface/test_data_designer.py✅ Checklist