Add RunConfig option to skip dropped-column parquet outputs #690

@nabinchha

Summary

When a column config has drop=True, DataDesigner correctly strips that column from the main dataset parquet files. However, the dropped column values are still written to separate parquet files. Users need a RunConfig-level switch to opt out of writing those dropped-column parquet artifacts.

Proposed behavior

Add a preserve_dropped_columns: bool = True option to RunConfig.

When preserve_dropped_columns=True, keep the current behavior: dropped columns are omitted from the main dataset parquet files and preserved in separate parquet files.

When preserve_dropped_columns=False, columns with drop=True should still be omitted from the main dataset parquet files, but the separate dropped-column parquet files should not be written.
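The two modes can be sketched as below. This is a minimal illustration, not DataDesigner's implementation: `RunConfigSketch` and `split_outputs` are hypothetical names, and plain lists of dicts stand in for the parquet data. Only the field name `preserve_dropped_columns` and its default come from this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfigSketch:
    """Hypothetical stand-in for RunConfig; only the proposed field is modeled."""

    preserve_dropped_columns: bool = True


def split_outputs(records, dropped_columns, run_config):
    """Return (main_records, dropped_records).

    dropped_records is None when no separate dropped-column output
    should be written.
    """
    dropped = set(dropped_columns)

    # drop=True columns never reach the main dataset, in either mode.
    main = [{k: v for k, v in row.items() if k not in dropped} for row in records]

    # Opt-out (or nothing to preserve): skip the separate output entirely.
    if not run_config.preserve_dropped_columns or not dropped:
        return main, None

    # Default: keep the dropped values in a separate output.
    side = [{k: v for k, v in row.items() if k in dropped} for row in records]
    return main, side
```

In this sketch the main output is computed the same way in both branches, so the flag only controls whether the separate dropped-column output exists.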

Motivation

Some dropped columns are only intermediate generation inputs and do not need to be persisted. Skipping the dropped-column parquet files can reduce storage use and help avoid persisting sensitive or unnecessary intermediate values.

Acceptance criteria

  • RunConfig exposes a preserve_dropped_columns boolean with a backwards-compatible default of True.
  • RunConfig(..., preserve_dropped_columns=False) prevents separate dropped-column parquet files from being written.
  • drop=True columns remain excluded from the main dataset parquet files in both modes.
  • Existing behavior is unchanged when the option is omitted.
  • Tests cover both the default preserve behavior and the opt-out behavior.
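
One possible shape for the tests in the last criterion is sketched below. Everything here is an assumption made for illustration: `FakeRunConfig`, `fake_write_run`, and the `main.json` / `dropped.json` file names are hypothetical, and JSON files stand in for the parquet artifacts so the sketch needs only the standard library. Real tests would call the actual writer and check for its actual dropped-column parquet paths.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FakeRunConfig:
    # Hypothetical stand-in; only the proposed option is modeled.
    preserve_dropped_columns: bool = True


def fake_write_run(rows, drop_flags, run_config, out_dir):
    """Stand-in writer: JSON files play the role of parquet artifacts."""
    out_dir = Path(out_dir)
    dropped = {name for name, drop in drop_flags.items() if drop}
    main = [{k: v for k, v in r.items() if k not in dropped} for r in rows]
    (out_dir / "main.json").write_text(json.dumps(main))
    if dropped and run_config.preserve_dropped_columns:
        side = [{k: v for k, v in r.items() if k in dropped} for r in rows]
        (out_dir / "dropped.json").write_text(json.dumps(side))


ROWS = [{"prompt": "p1", "scratch": "s1"}, {"prompt": "p2", "scratch": "s2"}]
DROP_FLAGS = {"prompt": False, "scratch": True}


def run_case(run_config):
    """Write one run into a temp dir; return (main columns, dropped file exists)."""
    with tempfile.TemporaryDirectory() as tmp:
        fake_write_run(ROWS, DROP_FLAGS, run_config, tmp)
        main = json.loads((Path(tmp) / "main.json").read_text())
        return set(main[0]), (Path(tmp) / "dropped.json").exists()


def test_default_preserves_dropped_columns():
    # Option omitted: dropped values are kept in a separate artifact.
    columns, has_dropped_file = run_case(FakeRunConfig())
    assert columns == {"prompt"}
    assert has_dropped_file


def test_opt_out_skips_dropped_columns():
    # Opt-out: main output is unchanged, separate artifact is not written.
    columns, has_dropped_file = run_case(FakeRunConfig(preserve_dropped_columns=False))
    assert columns == {"prompt"}
    assert not has_dropped_file
```

Both tests assert the same main-dataset columns, which covers the criterion that drop=True columns stay excluded in both modes.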
