Pass metadata extractors to FileScanRDD [databricks] #10616

Merged (4 commits) on Mar 22, 2024

razajafri
Collaborator

This PR handles a change that was made in Spark: we now pass the metadata extractors through from the fileFormat to the FileScanRDD.

Changes

  • Created a shim for 350+ to pass the metadata extractors
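
For readers unfamiliar with the shim pattern, here is a minimal sketch of the idea, not the code in this PR. `PartitionedFile`, `ScanRDD`, `FileScanShim`, and the two shim objects are hypothetical stand-ins rather than Spark or spark-rapids classes, and the extractor type `Map[String, PartitionedFile => Any]` is an assumption about what the 3.5.0+ `FileScanRDD` accepts.

```scala
object ShimSketch {
  // Hypothetical stand-ins for Spark's classes, kept tiny so the sketch compiles on its own.
  final case class PartitionedFile(path: String, length: Long)
  type Extractors = Map[String, PartitionedFile => Any]
  final case class ScanRDD(extractors: Extractors)

  // Version-independent code calls this interface only.
  trait FileScanShim {
    def getFileScanRDD(extractors: Extractors): ScanRDD
  }

  // Older versions: the scan RDD takes no extractors, so the argument is dropped.
  object PreSpark350Shim extends FileScanShim {
    private val none: Extractors = Map.empty
    def getFileScanRDD(extractors: Extractors): ScanRDD = ScanRDD(none)
  }

  // 3.5.0 and later: the extractors are forwarded to the scan RDD.
  object Spark350PlusShim extends FileScanShim {
    def getFileScanRDD(extractors: Extractors): ScanRDD = ScanRDD(extractors)
  }

  // Example extractor map with a single, made-up metadata field.
  val sampleExtractors: Extractors =
    Map("file_size" -> ((f: PartitionedFile) => f.length))
}
```

Only the shim objects differ between versions, so callers stay the same regardless of which Spark version the build targets.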

fixes #8766

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
metadataColumns: Seq[AttributeReference] = Seq.empty): RDD[InternalRow] = {
if (relation.isDefined) {
new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)
Collaborator

If we pass relation only to reach fileFormat, and through it fileConstantMetadataExtractors, and only the latter is shim-specific, should we just pass fileFormat as an Option to getFileScanRDD?

Comment on lines 45 to 50
if (relation.isDefined) {
new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)
} else {
new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns)
}
Collaborator

Typically in Scala calling .get on an Option is an anti-pattern.

Suggested change
- if (relation.isDefined) {
-   new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
-     metadataExtractors = relation.get.fileFormat.fileConstantMetadataExtractors)
- } else {
-   new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns)
- }
+ new FileScanRDD(sparkSession, readFunction, filePartitions, readDataSchema, metadataColumns,
+   relation.map(_.fileFormat.fileConstantMetadataExtractors).getOrElse(Map.empty))
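
To illustrate the point about `.get`, here is a small self-contained comparison of the two styles. `FileFormat` and `Relation` below are hypothetical stand-ins, not Spark's classes, and the extractor type is simplified to `String => Any` for the sketch.

```scala
object OptionSketch {
  type Extractors = Map[String, String => Any]

  // Hypothetical stand-ins for the real relation and file format.
  final case class FileFormat(fileConstantMetadataExtractors: Extractors)
  final case class Relation(fileFormat: FileFormat)

  private val noExtractors: Extractors = Map.empty

  // isDefined + get: correct, but the check and the access are easy to separate by accident.
  def extractorsWithGet(relation: Option[Relation]): Extractors =
    if (relation.isDefined) relation.get.fileFormat.fileConstantMetadataExtractors
    else noExtractors

  // map + getOrElse: the empty case is handled in the same expression, with no .get call.
  def extractorsWithMap(relation: Option[Relation]): Extractors =
    relation.map(_.fileFormat.fileConstantMetadataExtractors).getOrElse(noExtractors)

  // Example relation with one made-up extractor.
  val sampleRelation: Relation =
    Relation(FileFormat(Map("file_name" -> ((path: String) => path))))
}
```

Both functions return the same result for `Some` and `None`; the second simply cannot throw if a later edit drops the `isDefined` check.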

@@ -78,6 +78,7 @@ trait SparkShims {
readFunction: (PartitionedFile) => Iterator[InternalRow],
filePartitions: Seq[FilePartition],
readDataSchema: StructType,
relation: Option[HadoopFsRelation],
Collaborator

If you move this arg to the last position and give it a default value of None, you will probably have fewer lines to modify, since existing call sites can stay as they are.
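
A rough sketch of why a trailing default helps. `Relation`, `ScanArgs`, and this simplified `getFileScanRDD` signature are hypothetical and only mirror the shape of the real method.

```scala
object DefaultArgSketch {
  // Hypothetical stand-ins; the real method takes more parameters and returns an RDD.
  final case class Relation(name: String)
  final case class ScanArgs(partitionCount: Int, relation: Option[Relation])

  // The new parameter comes last and defaults to None,
  // so call sites written before it existed still compile.
  def getFileScanRDD(
      filePartitions: Seq[String],
      readDataSchema: Seq[String],
      relation: Option[Relation] = None): ScanArgs =
    ScanArgs(filePartitions.size, relation)

  // An existing call site: unchanged, relation falls back to None.
  val oldCall: ScanArgs = getFileScanRDD(Seq("p0", "p1"), Seq("a"))

  // A new call site that opts in to passing the relation.
  val newCall: ScanArgs =
    getFileScanRDD(Seq("p0"), Seq("a"), relation = Some(Relation("hadoopFs")))
}
```

With the parameter in the middle of the list and no default, every caller would need an extra argument.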

@razajafri
Collaborator Author

Thanks for the review and suggestions. PTAL again

Collaborator

gerashegalov left a comment

LGTM, but the copyrights need to be updated.

You can either use a pre-commit hook or invoke the script directly:

export SPARK_RAPIDS_AUTO_COPYRIGHTER=ON 
git diff origin/branch-24.04..HEAD --name-status | \
  awk '/^M\s+/ { print $2}' | \
  xargs ./scripts/auto-copyrighter.sh

Collaborator

Update copyright

Collaborator

Update copyright year

Collaborator

gerashegalov left a comment

LGTM

@gerashegalov
Collaborator

build

razajafri merged commit 09a0081 into NVIDIA:branch-24.04 on Mar 22, 2024
42 of 43 checks passed
razajafri deleted the SP-8766-file-source-scan branch on March 22, 2024 16:42
razajafri restored the SP-8766-file-source-scan branch on April 23, 2024 17:30
Linked issue: [AUDIT][SPARK-43226] Define extractors for file-constant metadata