Describe the bug
I was trying to test out #3509 and was not able to reproduce it locally. Instead, on 3.2 I ran into some issues reading parquet data. Reading works fine when using PERFILE, but when I set it to AUTO or COALESCING it fails with errors inside parquet.
It turns out that to get the size of the footer we serialize the footer, but newer versions of parquet now validate that the footer looks correct, i.e. that there is no overlap between the blocks. This fails because we are using the block metadata unmodified from the different input files.
Steps/Code to reproduce bug
On Spark 3.2.1-SNAPSHOT (3.2.0) run locally with 1 executor and maxPartitionBytes of 1g. Then set up TPCDS at Scale Factor 200 on parquet and run the query.
spark.time(spark.sql("select count(*) from store_sales, date_dim d1 where ss_sold_date_sk = d1.d_date_sk and d1.d_year between 1999 AND 1999 + 2").show)
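For reference, the setup above can be sketched as spark-shell conf flags. This is a hedged sketch: the reader-type key shown here is the spark-rapids config I believe controls the PERFILE/AUTO/COALESCING choice; double-check the exact key against the plugin docs for your version.

```shell
# Config fragment (sketch): repro settings for the failure above.
# spark.rapids.sql.format.parquet.reader.type is assumed to be the knob
# selecting PERFILE / AUTO / COALESCING; PERFILE works, the other two fail.
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.files.maxPartitionBytes=1g \
  --conf spark.rapids.sql.format.parquet.reader.type=COALESCING
```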
It fails 100% of the time for me with errors like
Caused by: java.lang.IllegalStateException: Invalid block starting position:2106143682
at org.apache.parquet.Preconditions.checkState(Preconditions.java:93)
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:216)
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:197)
at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter(GpuParquetScan.scala:517)
at com.nvidia.spark.rapids.ParquetPartitionReaderBase.writeFooter$(GpuParquetScan.scala:486)
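The invariant that trips here can be illustrated with a simplified check (this is not the actual parquet-mr code in ParquetMetadataConverter, just a sketch of the rule it enforces: each block's starting position must be positive and must not overlap the previous block):

```scala
// Sketch of parquet's footer sanity check, not the real implementation.
// Each entry is (startingPos, compressedSize) for one row-group block.
def checkBlocks(blocks: Seq[(Long, Long)]): Unit = {
  var prevEnd = 0L
  blocks.foreach { case (start, size) =>
    // Blocks copied unmodified from different input files keep their
    // original file offsets, so a later block can start before the
    // previous block's end and fail this check.
    require(start > 0 && start >= prevEnd,
      s"Invalid block starting position:$start")
    prevEnd = start + size
  }
}
```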
Expected behavior
The query should pass.
I am not 100% sure how to fix this. I could adjust the blocks passed in, but each block's starting offset comes from its first column's metadata, which means I would have to rewrite each column's metadata as well.
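One possible direction, sketched with hypothetical stand-in types (ColumnMeta/BlockMeta below are simplified placeholders, not the real parquet-mr BlockMetaData/ColumnChunkMetaData classes): re-base each coalesced block's column offsets so the blocks are laid out contiguously in the combined output, instead of keeping the original per-file offsets.

```scala
// Hypothetical simplified stand-ins for parquet block/column metadata.
case class ColumnMeta(startingPos: Long, totalSize: Long)
case class BlockMeta(columns: Seq[ColumnMeta]) {
  // A block's starting position is derived from its first column,
  // which is why every column's metadata has to be shifted.
  def startingPos: Long = columns.head.startingPos
  def compressedSize: Long = columns.map(_.totalSize).sum
}

/** Shift blocks taken from different input files so their offsets are
  * contiguous in the coalesced output and pass parquet's overlap check. */
def rebaseBlocks(blocks: Seq[BlockMeta], firstBlockStart: Long): Seq[BlockMeta] = {
  var pos = firstBlockStart
  blocks.map { b =>
    val delta = pos - b.startingPos
    val shifted = BlockMeta(b.columns.map(c => c.copy(startingPos = c.startingPos + delta)))
    pos += b.compressedSize
    shifted
  }
}
```

The same shifted metadata would then have to be what writeFooter serializes, so the footer's block layout matches the coalesced buffer rather than the source files.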
We just hit this in spark2a as well. It seems we should prioritize a fix, or disable the AUTO/COALESCING reader until then. The PERFILE reader does work in that case too.