Vectorized Parquet Read In Spark DataSource #90

Closed
mccheah opened this issue Nov 10, 2018 · 14 comments

@mccheah
Contributor

mccheah commented Nov 10, 2018

The Parquet file format reader available in core Spark includes a number of optimizations, the main one being vectorized columnar reading. In considering a potential migration from the old Spark readers to Iceberg, one would be concerned about the performance gap that comes from lacking Spark's numerous optimizations in this space.

It is not clear what the best way is to incorporate these optimizations into Iceberg. One option would be to propose moving this code from Spark to parquet-mr. Another would be to invoke Spark's Parquet reader directly here, but that is an internal API. We could implement vectorized reading directly in Iceberg, but that would largely mean reinventing the wheel.

@felixcheungu

Isn't this more of an issue in the Spark codebase? I think a key prerequisite is to have the Parquet Data Source in Spark on the Data Source v2 API first.

@mccheah
Contributor Author

mccheah commented Nov 10, 2018

That's one way to do it; right now, though, Iceberg implements its own Parquet I/O and it doesn't do vectorized reads. It also depends on whether Iceberg continues to handle the file I/O layer or whether Iceberg just serves as the metastore / catalog API and delegates to the Spark data source.

@felixcheungu

Good point. I'm seeing a lot of problems now with not using the "native" Parquet data source.

@mccheah
Contributor Author

mccheah commented Nov 10, 2018

I think there's some merit to passing off all the low-level optimizations to parquet-mr; that way they're available everywhere: Spark, Iceberg, and elsewhere. Much of that work isn't specific to Spark, from my understanding.

If we did it that way, the delta between using Spark's Parquet data source and implementing one's own wouldn't be that large. It becomes mostly a matter of flavor around how to bootstrap the parquet-mr modules.

@rdblue
Contributor

rdblue commented Nov 16, 2018

There are a few problems with the vectorized read path in Spark: it only supports flat schemas and has no support for schema evolution.

What we've been planning is to use the Arrow integration in Spark to get vectorized reads. There's already an implementation of Spark's ColumnarBatch that translates from an Arrow record batch for the PySpark/Pandas integration. Deserializing directly to Arrow would benefit multiple engines: we're considering using it in the Presto connector as well, and it would work well in a record service (possibly using Flight).
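
For reference, a minimal sketch of the kind of wrapper being described: Spark's public ArrowColumnVector (available since Spark 2.3) adapts an Arrow vector so it can be exposed through Spark's ColumnarBatch. This is not code from this thread; the class names come from the public Spark and Arrow Java APIs and should be checked against the versions in use.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToColumnarBatchSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector ids = new IntVector("id", allocator)) {
      // Materialize a small Arrow column; a vectorized Parquet reader would
      // produce vectors like this in batches.
      ids.allocateNew(3);
      for (int i = 0; i < 3; i++) {
        ids.set(i, i * 10);
      }
      ids.setValueCount(3);

      // Wrap the Arrow vector with Spark's adapter and expose it as a ColumnarBatch.
      ColumnVector[] columns = new ColumnVector[] { new ArrowColumnVector(ids) };
      ColumnarBatch batch = new ColumnarBatch(columns);
      batch.setNumRows(ids.getValueCount());
      System.out.println("rows = " + batch.numRows());
    }
  }
}
```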

@mccheah
Contributor Author

mccheah commented Nov 16, 2018

How would that integrate with the existing Parquet file format? This would be a blocker for many places that rely on Parquet in existing data sources and want to port over to Iceberg (this is what we're thinking of doing). We can't afford to lose the read optimizations there; that would be a performance regression for our Spark applications.

Are there ways to patch the vectorized read path in Spark to address those issues? Or can we support vectorized reads in the limited subset of cases where they do work?

@rdblue
Contributor

rdblue commented Nov 16, 2018

We could create vectorized readers in Iceberg that do the same thing as Spark, but I think it is a better use of time to deserialize to Arrow, which has support for memory optimizations like dictionary-encoded row batches.

All of this work can be thought of as producing an in-memory representation. Iceberg has an API for deserializing pages to rows; it would add one for deserializing to a column batch.

Agreed that this is a performance blocker if you have flat tables and don't need schema evolution. But I don't think that producing Spark's ColumnarBatch would be much easier than producing Arrow.
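
To make that concrete, a batch-oriented read hook could sit alongside the existing row-oriented reader function. The sketch below is purely hypothetical; the interface and method names are illustrative and are not Iceberg's actual API.

```java
import java.io.Closeable;
import java.util.Iterator;
import org.apache.parquet.schema.MessageType;

// Hypothetical: given the Parquet file schema, build a reader that yields
// column batches (for example, Arrow record batches) instead of single rows.
interface BatchReaderFunction<B> {
  CloseableBatchIterator<B> read(MessageType fileSchema);
}

// Hypothetical iterator type that owns the underlying column memory and
// releases it when closed.
interface CloseableBatchIterator<B> extends Iterator<B>, Closeable {
}
```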

@mccheah
Contributor Author

mccheah commented Nov 16, 2018

Just wanted to clarify (I'm also not too familiar with the Arrow technology): would supporting the Arrow path give us vectorized reads for Parquet files? And if so, would writing the Arrow integration give performance comparable to what existing Spark-on-Parquet users get?

@rdblue
Contributor

rdblue commented Nov 16, 2018

Iceberg would deserialize to an Arrow record batch, which is a more useful in-memory representation because it could support non-Spark engines. Spark can use Arrow record batches because there's a wrapper that translates to its internal columnar representation, ColumnarBatch. The vectorization part is the materialization to Arrow.

Performance should be comparable, and this approach would also support projection and nested data.
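
As a rough illustration of what "the materialization to Arrow" means, the sketch below fills an Arrow IntVector from values decoded a batch at a time. The ColumnDecoder interface is hypothetical and only stands in for whatever page/column decoder the Parquet reader would expose; the Arrow calls are real API.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.IntVector;

// Hypothetical stand-in for a Parquet column decoder that returns many values
// per call instead of materializing one row at a time.
interface ColumnDecoder {
  int readBatch(int maxValues, int[] buffer); // returns the number of values read
}

final class ArrowIntColumnLoader {
  // Decode up to batchSize values and materialize them into an Arrow vector.
  static IntVector load(ColumnDecoder decoder, BufferAllocator allocator, int batchSize) {
    IntVector vector = new IntVector("col", allocator);
    vector.allocateNew(batchSize);
    int[] buffer = new int[batchSize];
    int count = decoder.readBatch(batchSize, buffer);
    for (int i = 0; i < count; i++) {
      vector.set(i, buffer[i]);
    }
    vector.setValueCount(count);
    return vector;
  }
}
```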

@rdblue
Contributor

rdblue commented Nov 16, 2018

I should also note a few things about performance:

  • Iceberg can prune splits more aggressively using min/max stats, which can be a huge speed-up for selective queries even without vectorization (a rough sketch of this kind of pruning follows after this list).
  • Iceberg should be faster than the native Parquet read path in Spark when not using the vectorized reader. The native read path uses a slower record materialization API and also makes 2 copies of every row before passing it up to the next operator. See the v2 InternalRow PR for more detail.
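
As referenced in the first bullet, here is a generic sketch of min/max-based pruning. It is not Iceberg's actual implementation; the FileStats class and the predicate shape are illustrative only.

```java
import java.util.List;
import java.util.stream.Collectors;

final class MinMaxPruningSketch {
  // Per-file column stats of the kind kept in table metadata (illustrative).
  static final class FileStats {
    final String path;
    final long min;
    final long max;

    FileStats(String path, long min, long max) {
      this.path = path;
      this.min = min;
      this.max = max;
    }
  }

  // Keep only files whose [min, max] range can contain rows matching "col > value";
  // a file with max <= value cannot contain a matching row and is skipped entirely.
  static List<FileStats> filesForGreaterThan(List<FileStats> files, long value) {
    return files.stream()
        .filter(f -> f.max > value)
        .collect(Collectors.toList());
  }
}
```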

@mccheah
Contributor Author

mccheah commented Nov 16, 2018

Cool, we can experiment with that, though we always use the vectorized reader. Is there an expected timeline for the Arrow vectorized read to land? We'd also be happy to help move that forward.

@rdblue
Contributor

rdblue commented Nov 19, 2018

I don't have a timeline for Parquet to Arrow reads. @julienledem has some experience with that and can probably comment on the amount of effort it would require.

I built the record-oriented read path here in a couple of weekends, so I think it would go fairly quickly. There's a lot of work we could do here, though, including getting the APIs right to take advantage of page skipping.

@mccheah
Contributor Author

mccheah commented Nov 19, 2018

It looks like Parquet-to-Arrow conversion is already available for Python at least: https://arrow.apache.org/docs/python/parquet.html. I think there's something similar for C++, but I'm still exploring the tech: https://github.com/apache/arrow/tree/master/cpp

I wonder what it would take to port this work over to Java? I would hope the algorithms in the implementations would translate between languages nicely.

@rdblue
Contributor

rdblue commented Dec 8, 2018

I've opened a new issue for this in the ASF project: apache/iceberg#9

Let's move further discussion there. Thanks!
