
feat: add dropped column preservation toggle #691

Merged
nabinchha merged 5 commits into main from nmulepati/feat-690-preserve-dropped-columns
May 21, 2026
@nabinchha
Contributor

@nabinchha nabinchha commented May 20, 2026

📋 Summary

Adds a backwards-compatible RunConfig.preserve_dropped_columns toggle so users can opt out of writing dropped-column parquet artifacts. This keeps drop=True columns out of the main dataset parquet files while allowing storage-heavy intermediate columns, such as base64 image payloads, to be discarded entirely.

🔗 Related Issue

Closes #690

🔄 Changes

  • Add preserve_dropped_columns: bool = True to RunConfig.
  • Gate dropped-column parquet writes in DropColumnsProcessor on the new RunConfig option.
  • Persist the dropped-column artifact policy in metadata.json so resume can detect incompatible policy changes.
  • Treat preserve_dropped_columns changes as resume-incompatible for ResumeMode.ALWAYS, while ResumeMode.IF_POSSIBLE starts a fresh dataset automatically.
  • Add unit and interface regression coverage for default preservation, opt-out behavior, main parquet non-duplication, metadata persistence, and resume policy mismatches.
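As a rough illustration of the first bullet, the toggle can be pictured as a boolean field that defaults to `True`. The sketch below is a dependency-free stand-in: `RunConfigSketch` is a hypothetical name, and a dataclass is used here as an assumption in place of the Pydantic model the real `RunConfig` is described as using.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfigSketch:
    """Hypothetical stand-in for RunConfig; only the new field is modeled."""

    # Default True keeps the historical behavior: dropped columns are still
    # written to their own parquet artifact.
    preserve_dropped_columns: bool = True


# Existing callers that never pass the field keep the old behavior.
default_config = RunConfigSketch()

# Opting out is an explicit choice by the caller.
opt_out_config = RunConfigSketch(preserve_dropped_columns=False)

print(default_config.preserve_dropped_columns)
print(opt_out_config.preserve_dropped_columns)
```

Because the default matches the pre-existing behavior, only callers that set the field to `False` see any change.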

🧪 Testing

  • make test passes (not run; focused tests run instead)
  • Unit tests added/updated
  • E2E tests added/updated (N/A - no E2E surface)
  • UV_FROZEN=1 uv run pytest packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_build_resume_always_raises_on_dropped_column_artifact_policy_mismatch packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_build_if_possible_starts_fresh_on_dropped_column_artifact_policy_mismatch packages/data-designer/tests/interface/test_data_designer.py::test_create_with_drop_true_can_skip_dropped_column_artifacts packages/data-designer/tests/interface/test_data_designer.py::test_create_with_drop_true_preserves_columns_only_in_dropped_artifacts -q
  • UV_FROZEN=1 uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -q
  • UV_FROZEN=1 uv run ruff check packages/data-designer-config/src/data_designer/config/run_config.py packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer/tests/interface/test_data_designer.py
  • UV_FROZEN=1 uv run ruff format --check packages/data-designer-config/src/data_designer/config/run_config.py packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/processing/processors/test_drop_columns.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer/tests/interface/test_data_designer.py

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A - no architecture doc impact)

Closes #690

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner May 20, 2026 17:27
@greptile-apps
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR adds a preserve_dropped_columns: bool = True toggle to RunConfig that lets users opt out of writing dropped-column parquet artifacts, which is useful for avoiding storage of large intermediate columns (e.g. base64 image payloads). The default is True, keeping existing behavior intact.

  • DropColumnsProcessor: The parquet save is gated on run_config.preserve_dropped_columns; columns are still removed from the main dataset regardless.
  • DatasetBuilder: The flag is written into metadata.json defaults and a _dropped_column_artifact_policy_matches() helper enforces resume-compatibility — ALWAYS raises on mismatch, IF_POSSIBLE downgrades to a fresh run.
  • Tests: Unit, processor, dataset-builder, and interface tests cover default preservation, opt-out, metadata persistence, and both resume-mode mismatch scenarios.

Confidence Score: 5/5

Safe to merge — the change is fully backward-compatible, the flag defaults to the historical behavior, and resume incompatibility is handled correctly for both ALWAYS and IF_POSSIBLE modes.

The feature flag defaults to True, so existing users and runs are unaffected. The resume-compatibility logic correctly short-circuits before any dataset mutation occurs when a policy mismatch is detected. The _dropped_column_artifact_policy_matches helper defaults missing metadata to True, which correctly represents pre-existing datasets that always preserved dropped columns. Test coverage spans the processor, builder, and interface layers including both resume-mode mismatch branches.
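The missing-metadata default described above can be sketched as follows. This is a minimal illustration, not the actual `_dropped_column_artifact_policy_matches` implementation: the function name `policy_matches` and the assumption that the metadata key is spelled `preserve_dropped_columns` are both illustrative.

```python
from typing import Any, Mapping


def policy_matches(metadata: Mapping[str, Any], preserve_dropped_columns: bool) -> bool:
    """Return True when the stored artifact policy equals the requested one.

    A missing key is read as True, because datasets written before the toggle
    existed always preserved dropped columns.
    """
    stored = metadata.get("preserve_dropped_columns", True)  # key name is an assumption
    return stored == preserve_dropped_columns


# Metadata written before the toggle existed has no entry for the policy.
legacy_metadata: dict[str, Any] = {}
# Metadata written by a run that opted out of dropped-column artifacts.
opt_out_metadata: dict[str, Any] = {"preserve_dropped_columns": False}

print(policy_matches(legacy_metadata, True))
print(policy_matches(legacy_metadata, False))
```

Reading an absent key as `True` means a legacy dataset resumes cleanly under the default and is flagged as a mismatch only when the caller newly opts out.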

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/run_config.py | Adds preserve_dropped_columns: bool = True field to RunConfig with a Pydantic Field and docstring; backward-compatible default. |
| packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py | Gates the dropped-column parquet write on run_config.preserve_dropped_columns; columns are still removed from the main dataframe regardless of the flag. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Persists preserve_dropped_columns in metadata defaults and adds _dropped_column_artifact_policy_matches() for resume-compatibility checks in both _check_resume_config_compatibility() and _load_resume_state(). |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Adds two resume-policy mismatch tests: ALWAYS raises, IF_POSSIBLE downgrades to NEVER and starts fresh. |
| packages/data-designer/tests/interface/test_data_designer.py | Adds two end-to-end interface tests validating opt-out (no dropped-column artifacts written) and default preservation (dropped columns present only in artifact files). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DatasetBuilder.build] --> B{resume in IF_POSSIBLE / ALWAYS?}
    B -- Yes --> C[_check_resume_config_compatibility]
    C --> D[_dropped_column_artifact_policy_matches metadata]
    D -- mismatch --> E{resume mode?}
    E -- ALWAYS --> F[raise DatasetGenerationError]
    E -- IF_POSSIBLE --> G[downgrade resume to NEVER / start fresh run]
    D -- match --> H[check config hash]
    H --> I{compatible?}
    I -- Yes --> J[resume = ALWAYS]
    I -- No --> G
    B -- No --> K[_set_metadata_defaults writes preserve_dropped_columns to metadata]
    J --> K
    G --> K
    K --> L[_build_with_resume / _build_async]
    L --> M[_load_resume_state / _dropped_column_artifact_policy_matches again]
    M --> N[DropColumnsProcessor.process_after_batch]
    N --> O{preserve_dropped_columns?}
    O -- True --> P[save dropped-column parquet artifact]
    O -- False --> Q[skip artifact write]
    P --> R[drop columns from main DataFrame]
    Q --> R
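
The resume branch of the flowchart can be approximated in a few lines. This is a sketch under stated assumptions: `ResumeModeSketch`, `ResumeIncompatibleError`, and `resolve_resume_mode` are hypothetical names, the enum values are invented, and the config-hash branch follows the flowchart literally (which does not distinguish resume modes at that step), so the real builder may behave differently there.

```python
from enum import Enum


class ResumeModeSketch(Enum):
    """Hypothetical mirror of the ResumeMode members named in the flowchart."""

    NEVER = "never"
    IF_POSSIBLE = "if_possible"
    ALWAYS = "always"


class ResumeIncompatibleError(Exception):
    """Stand-in for DatasetGenerationError."""


def resolve_resume_mode(
    requested: ResumeModeSketch,
    stored_policy: bool,
    current_policy: bool,
    config_hash_matches: bool,
) -> ResumeModeSketch:
    """Decide the effective resume mode before any batch is generated."""
    if requested is ResumeModeSketch.NEVER:
        return ResumeModeSketch.NEVER
    if stored_policy != current_policy:
        # Policy mismatch: ALWAYS fails fast, IF_POSSIBLE starts a fresh run.
        if requested is ResumeModeSketch.ALWAYS:
            raise ResumeIncompatibleError(
                "preserve_dropped_columns changed since the dataset was started"
            )
        return ResumeModeSketch.NEVER
    # Policy matches: fall through to the config-hash check, as drawn above.
    return ResumeModeSketch.ALWAYS if config_hash_matches else ResumeModeSketch.NEVER


print(resolve_resume_mode(ResumeModeSketch.IF_POSSIBLE, True, False, True))
```

The point of resolving the mode up front is that a mismatch is detected before generation starts, rather than after batches with inconsistent artifacts have been written.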

Reviews (5): Last reviewed commit: "Merge branch 'main' into nmulepati/feat-..."

@github-actions
Contributor

Review: PR #691 — feat: add dropped column preservation toggle

Summary

Adds a new RunConfig.preserve_dropped_columns: bool = True setting that gates the persistence of dropped-column parquet artifacts. When True (default), behavior is unchanged: columns marked drop=True are removed from the main dataset but written to a separate dropped-columns parquet. When False, the dropped columns are removed from the main dataset and not written anywhere, which lets users avoid storing heavy intermediate columns (e.g. base64 image payloads).

The change is a +112/-1 surgical addition: one new config field, one short-circuit in DropColumnsProcessor, and tests at three layers (config unit, processor unit, interface integration).

Findings

Correctness

  • Backwards compatible. Default True preserves the existing artifact-writing behavior. Existing dropped-column workflows are unaffected. ✅
  • Drop semantics preserved when disabled. With the flag off, data.drop(columns=resolved, inplace=True) (drop_columns.py:41) still runs, so dropped columns are removed from the active DataFrame regardless of artifact persistence. ✅
  • Preview mode unaffected. The new condition is current_batch_number is not None and ...preserve_dropped_columns, so the existing preview-mode short-circuit (where current_batch_number is None) still skips the save. ✅
  • Access path is canonical. self.resource_provider.run_config.preserve_dropped_columns matches the established pattern used elsewhere in the engine (e.g. llm_completion.py:45, validation.py:116, dataset_builder.py:308). ✅
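
The condition discussed in these bullets can be illustrated with a simplified model. This is a sketch, not the code in `drop_columns.py`: a plain dict of lists stands in for the DataFrame, and `process_after_batch`, `record`, and `make_batch` as written here are hypothetical helpers with assumed signatures.

```python
from typing import Callable, Optional

Table = dict[str, list]  # column name -> values; stand-in for a DataFrame


def process_after_batch(
    data: Table,
    columns_to_drop: list[str],
    current_batch_number: Optional[int],
    preserve_dropped_columns: bool,
    write_artifact: Callable[[Table, int], None],
) -> Table:
    # Save only for real batches (preview passes None) and only when enabled.
    if current_batch_number is not None and preserve_dropped_columns:
        write_artifact({name: data[name] for name in columns_to_drop}, current_batch_number)
    # The drop itself is unconditional, whatever the toggle says.
    for name in columns_to_drop:
        del data[name]
    return data


written: list[tuple[int, Table]] = []


def record(table: Table, batch_number: int) -> None:
    """Collect artifact writes in memory instead of touching disk."""
    written.append((batch_number, table))


def make_batch() -> Table:
    return {"id": [1, 2], "image_b64": ["aaa", "bbb"]}


kept = process_after_batch(make_batch(), ["image_b64"], 0, True, record)
skipped = process_after_batch(make_batch(), ["image_b64"], 1, False, record)
preview = process_after_batch(make_batch(), ["image_b64"], None, True, record)
```

Only the first call records an artifact; all three return a table without the dropped column.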

Conventions / style

  • from __future__ import annotations, type annotations, and absolute imports are all in place.
  • Field uses Field(default=True, description=...) consistent with neighboring fields (jinja_rendering_engine, progress_interval). The class-level docstring entry is also added — slight duplication with the description= arg, but matches the style of the other fields in this class, so this is consistent.
  • Naming nit: preserve_dropped_columns could be misread as "preserve dropped columns in the main dataset." A name like write_dropped_column_artifacts would be unambiguous about what is preserved (the artifacts vs. the columns themselves). The docstring clarifies, so this is minor — but worth considering before the API ships, since the field name is part of the public RunConfig surface.

Tests

Coverage is appropriate for the change size:

  • test_run_config_preserves_dropped_columns_by_default — locks in the default.
  • test_run_config_accepts_disabled_dropped_column_preservation — locks in the opt-out.
  • test_process_after_batch_does_not_save_when_preservation_disabled — verifies the processor still drops columns but does not call write_parquet_file when disabled.
  • Two interface-level integration tests exercise the end-to-end behavior, including verifying that dropped_columns_dataset_path does not exist when the toggle is off, and the inverse — that the path does contain the dropped column under the default.

The existing test_drop_columns.py fixture stub_processor was updated to attach mock_resource_provider.run_config = RunConfig(), so all pre-existing assertions implicitly continue to exercise the preserve_dropped_columns=True path. Good.

One small gap: there is no explicit positive-path unit test asserting that write_parquet_file is called when preserve_dropped_columns=True. The existing test_process_after_batch cases do this implicitly via the default RunConfig(), so this isn't a blocker — just noting the symmetry could be tighter.
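
A symmetric pair of tests might look roughly like the following. It is only a sketch: `save_then_drop` is a hypothetical stand-in for the processor method, and the `write_parquet_file` call signature used here is an assumption rather than the project's actual storage API.

```python
from unittest.mock import MagicMock


def save_then_drop(data, columns, batch_number, run_config, storage):
    """Hypothetical stand-in for DropColumnsProcessor.process_after_batch."""
    if batch_number is not None and run_config.preserve_dropped_columns:
        storage.write_parquet_file({c: data[c] for c in columns}, batch_number)
    return {k: v for k, v in data.items() if k not in columns}


def test_saves_when_preservation_enabled():
    storage = MagicMock()
    run_config = MagicMock(preserve_dropped_columns=True)
    result = save_then_drop({"a": [1], "b": [2]}, ["b"], 0, run_config, storage)
    # Positive path: the artifact write happens exactly once.
    storage.write_parquet_file.assert_called_once_with({"b": [2]}, 0)
    assert result == {"a": [1]}


def test_skips_when_preservation_disabled():
    storage = MagicMock()
    run_config = MagicMock(preserve_dropped_columns=False)
    result = save_then_drop({"a": [1], "b": [2]}, ["b"], 0, run_config, storage)
    # Negative path: no write, but the column is still dropped.
    storage.write_parquet_file.assert_not_called()
    assert result == {"a": [1]}
```

Keeping the two tests side by side makes the toggle's contract explicit instead of relying on the default `RunConfig()` in a shared fixture.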

Performance / risk

  • Negligible runtime impact: one bool check on the run-config object per drop-columns batch.
  • Storage impact changes only when users opt out (preserve_dropped_columns=False); the default preserves current behavior, so no risk to existing pipelines.
  • No reverse imports — config field added in data-designer-config, consumed in data-designer-engine, which respects the interface → engine → config direction.

Documentation

  • The Fern docs page at fern/versions/latest/pages/code_reference/config/run_config.mdx is auto-generated from the field metadata, so the new field's description= will flow through. No manual doc edits are needed.
  • Worth considering (follow-up, not blocking): a one-line mention in user-facing docs about when to disable preservation (the "base64 image payloads" use case from the PR description) would help users discover the toggle. Today it lives only in API reference.

Security

No security-relevant changes. The toggle only affects whether intermediate parquet files are written; secrets handling, sandboxing, and Jinja rendering are untouched.

Verdict

Approve with optional follow-ups. This is a clean, narrowly scoped, backwards-compatible feature with proportionate test coverage. No correctness or architectural concerns.

Optional, non-blocking suggestions:

  1. Consider renaming the field to something like write_dropped_column_artifacts for unambiguous semantics. (If shipping the current name, the wording is fine — just a name-clarity thought.)
  2. Add a positive-path unit test in test_drop_columns.py that asserts write_parquet_file is called when the toggle is True, mirroring the new negative-path test.
  3. Consider a brief usage note in the Fern docs about when to disable preservation.

Contributor

@johnnygreco johnnygreco left a comment

Thanks for putting this together, @nabinchha!

Summary

This adds a backwards-compatible RunConfig.preserve_dropped_columns toggle and wires it into DropColumnsProcessor so users can skip writing dropped-column parquet artifacts while still dropping those columns from the final dataset. The implementation matches the stated intent for fresh runs, but there is a resume/artifact compatibility edge case worth fixing before merge.

Findings

Warnings — Worth addressing

Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.

packages/data-designer-engine/src/data_designer/engine/processing/processors/drop_columns.py:38 — Preserve toggle can leave resume artifacts inconsistent

  • What: When preserve_dropped_columns=False, new batches skip dropped-column parquet writes, but resume compatibility and artifact cleanup do not know the artifact policy changed. If a previous compatible run wrote dropped-column artifacts with the default True, then a resumed or extended run switches this flag to False, the old dropped-columns-parquet-files remain while the new batches have none. After the build, load_dataset_with_dropped_columns() tries to concatenate a shorter dropped-column dataset and DataDesigner.create() fails with a row-count mismatch.
  • Why: This can bite users who add the new opt-out to an existing interrupted or extendable dataset. The dataset config fingerprint still matches, so ResumeMode.ALWAYS proceeds and only fails after generation. The opposite direction (False -> True) is also unsafe because the old dropped columns cannot be reconstructed for already-written batches.
  • Suggestion: Make preserve_dropped_columns part of the resume/artifact compatibility contract, for example by persisting it in metadata and treating changes as incompatible for ResumeMode.ALWAYS / fresh for IF_POSSIBLE. If switching True -> False should be allowed, explicitly clear or ignore stale dropped-column artifacts when disabled and still reject False -> True. A regression test around resuming/extending a drop=True dataset after toggling this flag would lock in the expected behavior.
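
The row-count mismatch described in this finding can be reproduced with simple arithmetic. The sketch below is a toy model under stated assumptions: `simulate_run` and `can_join` are hypothetical helpers, and the rule that a dataset either has no dropped-column artifacts or exactly one dropped row per main row is an assumption about what the loader expects.

```python
def simulate_run(batch_sizes: list[int], preserve_flags: list[bool]) -> tuple[int, int]:
    """Count rows in the main dataset and in the dropped-column artifacts.

    preserve_flags[i] is the toggle value in effect when batch i was written.
    """
    main_rows = 0
    dropped_rows = 0
    for size, preserve in zip(batch_sizes, preserve_flags):
        main_rows += size
        if preserve:
            dropped_rows += size
    return main_rows, dropped_rows


def can_join(main_rows: int, dropped_rows: int) -> bool:
    # Either no artifacts exist at all, or there is one dropped row per main row.
    return dropped_rows in (0, main_rows)


# Two batches written with the default, then a resumed batch with the flag off.
true_then_false = simulate_run([100, 100, 100], [True, True, False])
# The opposite direction: the first batch has no artifact to reconstruct.
false_then_true = simulate_run([100, 100, 100], [False, True, True])

print(true_then_false, can_join(*true_then_false))
print(false_then_true, can_join(*false_then_true))
```

Both directions leave 200 dropped rows against 300 main rows, which is why a mid-run policy change is treated as incompatible in either direction.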

What Looks Good

  • The default remains True, which preserves the current behavior for existing users on fresh runs.
  • The processor change is nicely scoped and keeps the declarative config / imperative engine split intact.
  • The new tests cover both the default preservation path and the storage-saving opt-out path through the public DataDesigner flow.

Verdict

Needs changes — please address the resume/artifact compatibility issue before merge.


This review was generated by an AI assistant.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha
Contributor Author

Addressed the resume/artifact compatibility feedback in dca2ee3f.

What changed:

  • Persist preserve_dropped_columns in metadata.json via dataset metadata defaults.
  • Treat changes to that policy as resume-incompatible so ResumeMode.ALWAYS fails before mixing stale/new dropped-column artifacts.
  • Allow ResumeMode.IF_POSSIBLE to fall back to a fresh dataset when the policy differs.
  • Added regression tests for both ALWAYS and IF_POSSIBLE, plus metadata assertions on the public create path.
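
Persisting the policy alongside other dataset metadata might look roughly like this. It is a sketch only: `persist_artifact_policy` is a hypothetical helper, and the flat layout of `metadata.json` with a top-level `preserve_dropped_columns` key is an assumption, not the file's confirmed schema.

```python
import json
import tempfile
from pathlib import Path


def persist_artifact_policy(dataset_dir: Path, preserve_dropped_columns: bool) -> Path:
    """Record the dropped-column artifact policy in metadata.json."""
    path = dataset_dir / "metadata.json"
    # Keep whatever metadata already exists and add the policy next to it.
    metadata = json.loads(path.read_text()) if path.exists() else {}
    metadata["preserve_dropped_columns"] = preserve_dropped_columns
    path.write_text(json.dumps(metadata, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    metadata_path = persist_artifact_policy(Path(tmp), preserve_dropped_columns=False)
    stored = json.loads(metadata_path.read_text())

print(stored)
```

Writing the policy to disk is what lets a later run compare its own setting against the one the dataset was started with.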

Validation:

  • UV_FROZEN=1 uv run pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py -q -> 99 passed
  • Focused preservation/resume suite -> 23 passed
  • Ruff check/format on touched files passed

@nabinchha nabinchha requested a review from johnnygreco May 20, 2026 17:42
nabinchha added 2 commits May 20, 2026 16:30
…eserve-dropped-columns

# Conflicts:
#	packages/data-designer-config/tests/config/test_run_config.py
Contributor

@johnnygreco johnnygreco left a comment

Thanks for the follow-up. The metadata-backed policy check addresses the resume/artifact consistency issue, and the focused regressions plus local repro look good to me.

@nabinchha nabinchha merged commit 2a487cd into main May 21, 2026
50 checks passed
@nabinchha nabinchha deleted the nmulepati/feat-690-preserve-dropped-columns branch May 21, 2026 19:19

Development

Successfully merging this pull request may close these issues.

Add RunConfig option to skip dropped-column parquet outputs
