Fix removal of internal metadata information in 350 shim #10630

Merged: 3 commits, Mar 27, 2024

@parthosa parthosa commented Mar 25, 2024

Fixes #8844. This PR removes internal metadata information from the schema when using CTAS (CREATE TABLE AS SELECT). Spark added this behavior in 3.5.0.

Changes:

Added a new class, SchemaMetadataShims, with a new method getCleanedSchema() that calls Spark's removeInternalMetadata().

  • This function removes metadata properties whose keys are meant for internal use (e.g., FILE_SOURCE_METADATA_COL_ATTR_KEY). It is called as follows:
    schema = SchemaMetadataShims.getCleanedSchema(result.schema)
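
As a rough illustration of what stripping internal metadata keys from a schema involves, here is a minimal sketch built only on Spark's public MetadataBuilder API. It is not the code from this PR, nor Spark's removeInternalMetadata(). The object name CleanSchemaSketch, the helper stripInternalKeys, and the key "__example_internal_key" are hypothetical, and the sketch handles top-level fields only (nested struct, array, and map fields are left untouched).

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

object CleanSchemaSketch {
  // Hypothetical stand-in for the list of internal metadata keys;
  // Spark maintains its own list, which is not reproduced here.
  val internalKeys: Seq[String] = Seq("__example_internal_key")

  // Return a copy of the schema whose top-level fields no longer carry
  // any of the keys above. All other metadata entries are preserved.
  def stripInternalKeys(schema: StructType): StructType = {
    StructType(schema.fields.map { field =>
      val builder = new MetadataBuilder().withMetadata(field.metadata)
      internalKeys.foreach(key => builder.remove(key))
      field.copy(metadata = builder.build())
    })
  }

  def main(args: Array[String]): Unit = {
    // Build a field that carries one internal key and one user-visible key.
    val metadata = new MetadataBuilder()
      .putString("__example_internal_key", "dummy")
      .putString("comment", "user-visible")
      .build()
    val schema = StructType(Seq(StructField("a", StringType, nullable = true, metadata)))

    val cleaned = stripInternalKeys(schema)
    assert(!cleaned("a").metadata.contains("__example_internal_key"))
    assert(cleaned("a").metadata.contains("comment"))
  }
}
```

Copying the existing metadata into a builder and removing keys one by one keeps any user-supplied entries, such as column comments, intact, which matters when the cleaned schema is later written to a catalog.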

Testing

  • Added a unit test that runs a CTAS query to create a new table by selecting from another DataFrame.
  • Added a metadata property with an internal key to the source schema.
  • Verified that the metadata is removed (set to NULL) in the resulting table.
  • Verified that on Spark 3.4.0 the metadata is not removed.

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa added the audit_3.5.0 and Spark 3.5+ (Spark 3.5+ issues) labels Mar 25, 2024
@parthosa parthosa self-assigned this Mar 25, 2024
…ting code

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa parthosa requested a review from razajafri March 27, 2024 15:55
@parthosa parthosa changed the title Add fix for removing internal metadata information from 350 shim Fix removal of internal metadata information in 350 shim Mar 27, 2024

@razajafri razajafri left a comment

LGTM

@razajafri

build

@parthosa parthosa merged commit b57ffe0 into NVIDIA:branch-24.04 Mar 27, 2024
43 checks passed
@parthosa parthosa deleted the spark-rapids-8844 branch March 27, 2024 22:54
Development

Successfully merging this pull request may close these issues.

[AUDIT][SPARK-43123][SQL] Internal field metadata should not be leaked to catalogs
3 participants