
[BUG] MultiFileParquetPartitionReader can fail while trying to write the footer #3516

Closed
revans2 opened this issue Sep 16, 2021 · 1 comment · Fixed by #3666
Assignees: jlowe
Labels: bug (Something isn't working), P0 (Must have for release), Spark 3.2+

Comments

revans2 (Collaborator) commented Sep 16, 2021

Describe the bug
I was trying to test out #3509 and was not able to reproduce it locally. Instead, on Spark 3.2 I ran into issues reading parquet data. Reading works fine with the PERFILE reader type, but when I set it to AUTO or COALESCING it fails with errors from parquet.

It turns out that to get the size of the footer we actually write the footer out, and newer versions of parquet validate the footer as it is serialized, checking that there is no overlap between the blocks. That validation fails because we are using the block metadata unmodified from the different input files.
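The shape of that check can be illustrated with a small model. This is not the actual parquet-mr code — `BlockMeta` and `isValid` are hypothetical stand-ins for the validation done inside `ParquetMetadataConverter.toParquetMetadata`:

```java
import java.util.List;

// Hypothetical model of the footer sanity check newer parquet-mr performs;
// BlockMeta and isValid are simplified stand-ins, not real parquet APIs.
public class BlockCheck {
    public static class BlockMeta {
        final long startingPos;
        final long compressedSize;
        public BlockMeta(long startingPos, long compressedSize) {
            this.startingPos = startingPos;
            this.compressedSize = compressedSize;
        }
    }

    // Each block must start at or after the end of the previous block,
    // i.e. block extents in the file may not overlap.
    public static boolean isValid(List<BlockMeta> blocks) {
        long minStart = 0;
        for (BlockMeta b : blocks) {
            if (b.startingPos < minStart) {
                return false; // parquet-mr throws IllegalStateException here
            }
            minStart = b.startingPos + b.compressedSize;
        }
        return true;
    }

    public static void main(String[] args) {
        // Blocks copied unmodified from two different input files can both
        // claim to start near the beginning of their original files, which
        // is exactly the overlap the check rejects.
        BlockMeta fromFileA = new BlockMeta(4, 1000);
        BlockMeta fromFileB = new BlockMeta(4, 2000);
        System.out.println(isValid(List.of(fromFileA, fromFileB))); // false
    }
}
```

With per-file reading each footer only describes blocks from one file, so the invariant holds; coalescing breaks it because every input file's blocks start at small offsets.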

Steps/Code to reproduce bug
On Spark 3.2.1-SNAPSHOT (3.2.0), run locally with 1 executor and maxPartitionBytes set to 1g. Then set up TPCDS at Scale Factor 200 on parquet and run the query:

spark.time(spark.sql("select count(*) from store_sales, date_dim d1 where ss_sold_date_sk = d1.d_date_sk and d1.d_year between 1999 AND 1999 + 2").show)

It fails 100% of the time for me with errors like

Caused by: java.lang.IllegalStateException: Invalid block starting position:2106143682
  at org.apache.parquet.Preconditions.checkState(Preconditions.java:93)
  at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:216)
  at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:197)
  at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter(GpuParquetScan.scala:517)
  at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter$(GpuParquetScan.scala:486)

Expected behavior
The query should pass.

I am not 100% sure how to fix this. I could go muck around with the blocks passed in, but the starting offset comes from the first column's metadata, which means I would have to adjust each column's metadata as well.
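A rough sketch of what adjusting the blocks might look like — rebasing each coalesced block so its column offsets describe positions in the combined file rather than the original files. All names and types here are hypothetical simplifications, not the plugin's or parquet-mr's real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of rebasing coalesced block metadata; ColumnMeta,
// BlockMeta, and rebase are illustrative stand-ins only.
public class RebaseBlocks {
    public static class ColumnMeta {
        final long firstDataPageOffset;
        public ColumnMeta(long firstDataPageOffset) {
            this.firstDataPageOffset = firstDataPageOffset;
        }
    }

    public static class BlockMeta {
        final List<ColumnMeta> columns;
        final long compressedSize;
        public BlockMeta(List<ColumnMeta> columns, long compressedSize) {
            this.columns = columns;
            this.compressedSize = compressedSize;
        }
        // As the comment above notes, the block's starting position is
        // derived from the first column's metadata (simplified here).
        public long startingPos() { return columns.get(0).firstDataPageOffset; }
    }

    // Shift every column's offsets so blocks line up back-to-back,
    // starting at firstBlockStart in the combined output file.
    public static List<BlockMeta> rebase(List<BlockMeta> blocks, long firstBlockStart) {
        List<BlockMeta> out = new ArrayList<>();
        long nextStart = firstBlockStart;
        for (BlockMeta b : blocks) {
            long delta = nextStart - b.startingPos();
            List<ColumnMeta> shifted = new ArrayList<>();
            for (ColumnMeta c : b.columns) {
                shifted.add(new ColumnMeta(c.firstDataPageOffset + delta));
            }
            out.add(new BlockMeta(shifted, b.compressedSize));
            nextStart += b.compressedSize;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two blocks taken from different files, each starting at offset 4.
        List<BlockMeta> blocks = List.of(
            new BlockMeta(List.of(new ColumnMeta(4)), 100),
            new BlockMeta(List.of(new ColumnMeta(4)), 200));
        List<BlockMeta> rebased = rebase(blocks, 4);
        System.out.println(rebased.get(0).startingPos()); // 4
        System.out.println(rebased.get(1).startingPos()); // 104
    }
}
```

The real fix would have to touch every offset field in each column chunk (dictionary pages, data pages, etc.), which is the per-column bookkeeping the comment above anticipates.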

revans2 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and Spark 3.2+ labels on Sep 16, 2021
abellina (Collaborator) commented Sep 24, 2021

We just hit this in spark2a. It seems we should prioritize a fix, or disable the AUTO/COALESCING reader. The PERFILE reader does work in that case too.

jlowe added this to To do in Release 21.10 via automation on Sep 24, 2021
jlowe added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Sep 24, 2021
jlowe self-assigned this on Sep 24, 2021
Salonijain27 added this to the Sep 27 - Oct 1 milestone on Sep 24, 2021
Release 21.10 automation moved this from To do to Done on Sep 27, 2021