
[BUG] MultiFileParquetPartitionReader can fail while trying to write the footer #3516

Closed
revans2 opened this issue Sep 16, 2021 · 1 comment · Fixed by #3666
Assignees: jlowe
Labels: bug (Something isn't working), P0 (Must have for release), Spark 3.2+

Comments

revans2 (Collaborator) commented Sep 16, 2021

Describe the bug
I was trying to test out #3509 and was not able to reproduce it locally. Instead, on Spark 3.2 I ran into issues reading parquet data. Reading works fine with the PERFILE reader type, but when I set it to AUTO or COALESCING it fails with errors from parquet.

It turns out that to get the size of the footer we actually write the footer out, and newer versions of parquet validate the footer as it is serialized, checking that there is no overlap between the blocks. That validation fails because we are using the block metadata unmodified from the different input files.
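The shape of that check can be illustrated with a small model. This is not the actual parquet-mr code — `BlockMeta` and `isValid` are hypothetical stand-ins for the validation done inside `ParquetMetadataConverter.toParquetMetadata`:

```java
import java.util.List;

// Hypothetical model of the footer sanity check newer parquet-mr performs;
// BlockMeta and isValid are simplified stand-ins, not real parquet APIs.
public class BlockCheck {
    public static class BlockMeta {
        final long startingPos;
        final long compressedSize;
        public BlockMeta(long startingPos, long compressedSize) {
            this.startingPos = startingPos;
            this.compressedSize = compressedSize;
        }
    }

    // Each block must start at or after the end of the previous block,
    // i.e. block extents in the file may not overlap.
    public static boolean isValid(List<BlockMeta> blocks) {
        long minStart = 0;
        for (BlockMeta b : blocks) {
            if (b.startingPos < minStart) {
                return false; // parquet-mr throws IllegalStateException here
            }
            minStart = b.startingPos + b.compressedSize;
        }
        return true;
    }

    public static void main(String[] args) {
        // Blocks copied unmodified from two different input files can both
        // claim to start near the beginning of their original files, which
        // is exactly the overlap the check rejects.
        BlockMeta fromFileA = new BlockMeta(4, 1000);
        BlockMeta fromFileB = new BlockMeta(4, 2000);
        System.out.println(isValid(List.of(fromFileA, fromFileB))); // false
    }
}
```

With per-file reading each footer only describes blocks from one file, so the invariant holds; coalescing breaks it because every input file's blocks start at small offsets.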

Steps/Code to reproduce bug
On Spark 3.2.1-SNAPSHOT (3.2.0), run locally with 1 executor and maxPartitionBytes set to 1g. Then set up TPCDS at Scale Factor 200 on parquet and run the query:

spark.time(spark.sql("select count(*) from store_sales, date_dim d1 where ss_sold_date_sk = d1.d_date_sk and d1.d_year between 1999 AND 1999 + 2").show)

It fails 100% of the time for me with errors like

Caused by: java.lang.IllegalStateException: Invalid block starting position:2106143682
  at org.apache.parquet.Preconditions.checkState(Preconditions.java:93)
  at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:216)
  at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:197)
  at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter(GpuParquetScan.scala:517)
  at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter$(GpuParquetScan.scala:486)

Expected behavior
The query should pass.

I am not 100% sure how to fix this. I could go muck around with the blocks passed in, but the starting offset comes from the first column's metadata, which means I would have to adjust each column's metadata as well.
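A rough sketch of what adjusting the blocks might look like — rebasing each coalesced block so its column offsets describe positions in the combined file rather than the original files. All names and types here are hypothetical simplifications, not the plugin's or parquet-mr's real classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of rebasing coalesced block metadata; ColumnMeta,
// BlockMeta, and rebase are illustrative stand-ins only.
public class RebaseBlocks {
    public static class ColumnMeta {
        final long firstDataPageOffset;
        public ColumnMeta(long firstDataPageOffset) {
            this.firstDataPageOffset = firstDataPageOffset;
        }
    }

    public static class BlockMeta {
        final List<ColumnMeta> columns;
        final long compressedSize;
        public BlockMeta(List<ColumnMeta> columns, long compressedSize) {
            this.columns = columns;
            this.compressedSize = compressedSize;
        }
        // As the comment above notes, the block's starting position is
        // derived from the first column's metadata (simplified here).
        public long startingPos() { return columns.get(0).firstDataPageOffset; }
    }

    // Shift every column's offsets so blocks line up back-to-back,
    // starting at firstBlockStart in the combined output file.
    public static List<BlockMeta> rebase(List<BlockMeta> blocks, long firstBlockStart) {
        List<BlockMeta> out = new ArrayList<>();
        long nextStart = firstBlockStart;
        for (BlockMeta b : blocks) {
            long delta = nextStart - b.startingPos();
            List<ColumnMeta> shifted = new ArrayList<>();
            for (ColumnMeta c : b.columns) {
                shifted.add(new ColumnMeta(c.firstDataPageOffset + delta));
            }
            out.add(new BlockMeta(shifted, b.compressedSize));
            nextStart += b.compressedSize;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two blocks taken from different files, each starting at offset 4.
        List<BlockMeta> blocks = List.of(
            new BlockMeta(List.of(new ColumnMeta(4)), 100),
            new BlockMeta(List.of(new ColumnMeta(4)), 200));
        List<BlockMeta> rebased = rebase(blocks, 4);
        System.out.println(rebased.get(0).startingPos()); // 4
        System.out.println(rebased.get(1).startingPos()); // 104
    }
}
```

The real fix would have to touch every offset field in each column chunk (dictionary pages, data pages, etc.), which is the per-column bookkeeping the comment above anticipates.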

revans2 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and Spark 3.2+ labels on Sep 16, 2021
abellina (Collaborator) commented Sep 24, 2021

We just hit this in spark2a. It seems we should prioritize a fix, or disable the AUTO/COALESCING reader. The PERFILE reader does work in that case too.

jlowe added this to To do in Release 21.10 via automation on Sep 24, 2021
jlowe added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Sep 24, 2021
jlowe self-assigned this on Sep 24, 2021
Salonijain27 added this to the Sep 27 - Oct 1 milestone on Sep 24, 2021
Release 21.10 automation moved this from To do to Done on Sep 27, 2021