Vectorized Parquet Read In Spark DataSource #90
Isn't this more an issue in the Spark codebase? I think a key prerequisite is to have the Parquet Data Source in Spark on the Data Source v2 API first.
That's one way to do it; right now, though, Iceberg implements its own Parquet I/O and it doesn't do vectorized reads. It also depends on whether Iceberg continues to handle the file I/O layer or just serves as the metastore / catalog API and delegates to the Spark data source.
Good point, I'm seeing a lot of problems now with not using the "native" Parquet data source.
I think there's some merit to passing off all the low-level optimizations to parquet-mr. If we did it that way, the delta between using Spark's Parquet data source and implementing one's own isn't that large. It becomes more a question of flavor around how to bootstrap the parquet-mr modules.
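As a point of reference, bootstrapping a record-at-a-time read through parquet-mr looks roughly like the sketch below. It uses parquet-mr's example `GroupReadSupport`, and the file path is a placeholder:

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ParquetMrReadSketch {
  public static void main(String[] args) throws Exception {
    // Record-oriented read through parquet-mr; this is the non-vectorized
    // path that the rest of this thread contrasts with.
    Path file = new Path("/path/to/file.parquet");  // placeholder path
    try (ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), file).build()) {
      for (Group record = reader.read(); record != null; record = reader.read()) {
        System.out.println(record);
      }
    }
  }
}
```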
There are a few problems with the vectorized read path in Spark: it only supports flat schemas and has no support for schema evolution. What we've been planning on is to use the Arrow integration in Spark to get vectorized reads. There's already a wrapper that translates an Arrow RowBatch to Spark's ColumnarBatch for the PySpark/Pandas integration. Having deserialization directly to Arrow would be a benefit to multiple engines. We're considering using it in the Presto connector as well, and it would work well in a record service (possibly using Flight).
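To illustrate that wrapper, here is a minimal sketch (not Iceberg or Spark source code, and assuming a recent Spark release with the Arrow Java libraries on the classpath) of handing an Arrow vector to Spark as a ColumnarBatch; the integer values stand in for data a Parquet reader would have decoded:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToColumnarBatchSketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("id", allocator)) {
      // Pretend these values were decoded from a Parquet column chunk.
      ids.allocateNew(3);
      ids.set(0, 1);
      ids.set(1, 2);
      ids.set(2, 3);
      ids.setValueCount(3);

      // Wrap the Arrow vector so Spark can scan it as a columnar batch.
      ColumnVector sparkColumn = new ArrowColumnVector(ids);
      ColumnarBatch batch = new ColumnarBatch(new ColumnVector[] {sparkColumn});
      batch.setNumRows(3);

      System.out.println("rows in batch: " + batch.numRows());
    }
  }
}
```

Because ArrowColumnVector only wraps the Arrow buffers, no copy is needed between the deserialized batch and what Spark scans.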
How would that integrate with the existing Parquet file format? This would be a blocker for many places that rely on Parquet in existing data sources and want to port over to Iceberg (this is what we're thinking of doing). We can't afford to lose the read optimizations there; that would be a performance regression for our Spark applications. Are there ways to patch the vectorized read path in Spark to address those issues? Or can we support vectorized reads for the limited subset of cases where it does work?
We could create vectorized readers in Iceberg that do the same thing as Spark, but I think it is a better use of time to deserialize to Arrow, which has support for memory optimizations like dictionary-encoded row batches. All of this work can be thought of as choosing an in-memory representation. Iceberg has an API for deserializing pages to rows; it would add one for deserializing to a column batch. Agreed that this is a performance blocker if you have flat tables and don't need schema evolution. But I don't think that producing Spark's ColumnarBatch would be much easier than producing Arrow.
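To make "adding one for deserializing to a column batch" concrete, a purely hypothetical shape for such an API might look like the following; the `ColumnBatchReader` name and design are illustrative, not an actual Iceberg interface:

```java
import java.io.Closeable;
import java.util.Iterator;
import org.apache.arrow.vector.VectorSchemaRoot;

// Hypothetical sketch: alongside the existing record-oriented reader, a
// batch-oriented reader could yield one Arrow batch per Parquet row group
// instead of one row at a time, and engines would wrap or consume the
// Arrow memory directly.
interface ColumnBatchReader extends Iterator<VectorSchemaRoot>, Closeable {
}
```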
Just wanted to clarify (I'm also not too familiar with the Arrow technology): would supporting the Arrow case give us vectorized reads for Parquet files? And would writing the Arrow integration give performance comparable to what existing Spark-on-Parquet users see?
Iceberg would deserialize to Arrow's RowBatch, which is a more useful in-memory representation because it could support non-Spark engines. Spark can use an Arrow RowBatch because there's a wrapper that translates it to Spark's internal columnar representation, ColumnarBatch. The vectorization part is the materialization to Arrow. Performance should be comparable, and it would also support projection and nested data.
I should also note a few things about performance:
Cool, we can experiment with that, though we always use the vectorized reader today. Is there an expected timeline for the Arrow vectorized read to land? Would also be happy to help move that forward.
I don't have a timeline for Parquet-to-Arrow reads. @julienledem has some experience with that and can probably comment on the amount of effort it would require. I built the record-oriented read path here in a couple of weekends, so I think it would go fairly quickly. There's a lot of work we could do here, though, including getting the APIs right to take advantage of page skipping.
It looks like Parquet-to-Arrow conversion is already available for Python at least: https://arrow.apache.org/docs/python/parquet.html. I think there's something similar for C++, but I am still exploring the tech: https://github.com/apache/arrow/tree/master/cpp. I wonder what it would take to port this work over to Java? I would hope the algorithms in those implementations would translate between languages nicely.
I've opened a new issue for this in the ASF project: apache/iceberg#9. Let's move further discussion there. Thanks!
The Parquet file format reader that is available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one would be concerned about the gap in performance that comes from lacking Spark's numerous optimizations in this space.
It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark into parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could also implement vectorized reading directly in Iceberg, but that would largely amount to reinventing the wheel.
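For reference, a minimal sketch of how that Spark optimization is toggled today; `spark.sql.parquet.enableVectorizedReader` is the existing Spark SQL setting (on by default in recent releases), and the table path is a placeholder:

```java
import org.apache.spark.sql.SparkSession;

public class VectorizedReadCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("vectorized-read-check")
        .master("local[*]")
        // Existing Spark SQL setting; it defaults to true in recent releases.
        .config("spark.sql.parquet.enableVectorizedReader", "true")
        .getOrCreate();

    // Spark's built-in Parquet source only uses the vectorized columnar reader
    // when the schema is flat and every column type is supported; otherwise it
    // falls back to the record-at-a-time parquet-mr path.
    spark.read().parquet("/path/to/flat_table").show();  // placeholder path

    spark.stop();
  }
}
```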