Summary
When a column config has drop=True, DataDesigner correctly strips that column from the main dataset parquet files. However, the dropped column values are still written to separate parquet files. Users need a RunConfig-level switch to opt out of writing those dropped-column parquet artifacts.
Proposed behavior
- Add a preserve_dropped_columns: bool = True option to RunConfig.
- When preserve_dropped_columns=True, keep the current behavior: dropped columns are omitted from the main dataset parquet files and preserved in separate parquet files.
- When preserve_dropped_columns=False, columns with drop=True should still be omitted from the main dataset parquet files, but the separate dropped-column parquet files should not be written.
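The proposed behavior can be sketched roughly as follows. This is an illustration only, not DataDesigner's actual code: the RunConfig class below is a hypothetical stand-in that carries only the proposed field, split_outputs is an invented helper name, and plain lists of dicts stand in for the parquet data.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical stand-in for the real RunConfig; only the proposed field is shown."""

    preserve_dropped_columns: bool = True  # default keeps the current behavior


def split_outputs(rows, drop_columns, run_config):
    """Return (main_rows, dropped_rows), where dropped_rows may be None.

    Columns marked for dropping are always removed from the main rows.
    Their values are returned for separate persistence only when
    preserve_dropped_columns is True.
    """
    drop = set(drop_columns)
    main_rows = [{k: v for k, v in row.items() if k not in drop} for row in rows]
    if not run_config.preserve_dropped_columns or not drop:
        # Opt-out (or nothing to drop): no separate dropped-column artifact.
        return main_rows, None
    dropped_rows = [{k: v for k, v in row.items() if k in drop} for row in rows]
    return main_rows, dropped_rows


rows = [
    {"prompt": "p1", "scratch": "s1"},
    {"prompt": "p2", "scratch": "s2"},
]

# Default: dropped values are kept for a separate artifact.
main_default, dropped_default = split_outputs(rows, ["scratch"], RunConfig())

# Opt-out: the main output is identical, but nothing is kept for the dropped column.
main_opt_out, dropped_opt_out = split_outputs(
    rows, ["scratch"], RunConfig(preserve_dropped_columns=False)
)
```

In both calls the main rows contain only the prompt column; the flag only controls whether the dropped values are handed on for writing.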
Motivation
Some dropped columns are only intermediate generation inputs and do not need to be persisted. Skipping the dropped-column parquet files can reduce storage use and help avoid persisting sensitive or unnecessary intermediate values.
Acceptance criteria
- RunConfig exposes a preserve_dropped_columns boolean with a backwards-compatible default of True.
- RunConfig(..., preserve_dropped_columns=False) prevents separate dropped-column parquet files from being written.
- drop=True columns remain excluded from the main dataset parquet files in both modes.
- Existing behavior is unchanged when the option is omitted.
- Tests cover both the default preserve behavior and the opt-out behavior.
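One possible shape for the tests in the last criterion is sketched below. Everything here is an assumption made for illustration: RunConfig is a hypothetical stand-in, write_run is an invented toy writer, and the JSON files main.json and dropped.json stand in for the real parquet artifacts, whose actual names and layout are not specified in this issue.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical stand-in carrying only the proposed option."""

    preserve_dropped_columns: bool = True


def write_run(rows, drop_columns, run_config, out_dir):
    """Toy writer: JSON files stand in for the parquet artifacts."""
    out_dir = Path(out_dir)
    drop = set(drop_columns)
    main = [{k: v for k, v in r.items() if k not in drop} for r in rows]
    (out_dir / "main.json").write_text(json.dumps(main))
    if run_config.preserve_dropped_columns and drop:
        dropped = [{k: v for k, v in r.items() if k in drop} for r in rows]
        (out_dir / "dropped.json").write_text(json.dumps(dropped))


ROWS = [{"question": "q1", "draft": "d1"}]


def check_default_preserves_dropped_columns(out_dir):
    # Option omitted: the default must match the existing behavior.
    write_run(ROWS, ["draft"], RunConfig(), out_dir)
    main = json.loads((Path(out_dir) / "main.json").read_text())
    dropped = json.loads((Path(out_dir) / "dropped.json").read_text())
    assert main == [{"question": "q1"}]
    assert dropped == [{"draft": "d1"}]


def check_opt_out_skips_dropped_file(out_dir):
    # Opt-out: the main output is unchanged, and no dropped-column file exists.
    write_run(ROWS, ["draft"], RunConfig(preserve_dropped_columns=False), out_dir)
    main = json.loads((Path(out_dir) / "main.json").read_text())
    assert main == [{"question": "q1"}]
    assert not (Path(out_dir) / "dropped.json").exists()


# Run each check in its own temporary directory.
for check in (check_default_preserves_dropped_columns, check_opt_out_skips_dropped_file):
    with tempfile.TemporaryDirectory() as tmp:
        check(tmp)
```

In a real test suite these checks would more likely be pytest functions using the tmp_path fixture and asserting on the actual parquet outputs.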