
Research Apache Arrow to improve in-memory data model #1469

Open
thadguidry opened this Issue Feb 10, 2018 · 1 comment


thadguidry commented Feb 10, 2018

Goal: Increase the capacity of OpenRefine's data processing by moving to an alternative, higher-performance in-memory / on-disk data storage technology. See the "Planned in Phase 2" enhancements document.

We could probably lower our memory overhead with Apache Arrow's columnar memory layout: https://arrow.apache.org/docs/memory_layout.html
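To make the memory-overhead point concrete, here is a rough pure-Python sketch (deliberately not using the Arrow library) of the difference between a row-oriented model with one boxed object per cell and Arrow-style contiguous, fixed-width column buffers; all names and sizes are illustrative:

```python
import array

# Row-oriented: one boxed object per cell, analogous to a per-cell
# object model -- each value pays object-header overhead.
rows = [{"id": i, "score": i * 1.5} for i in range(1000)]

# Column-oriented: each column is a single contiguous, fixed-width
# buffer, which is the core idea behind Arrow's memory layout.
ids = array.array("q", range(1000))                  # 8 bytes per value
scores = array.array("d", (i * 1.5 for i in range(1000)))

# The packed buffer costs exactly 8 bytes per value, with no
# per-cell object overhead.
assert ids.itemsize * len(ids) == 8000
assert scores[2] == 3.0
```

Contiguous buffers are also what makes Arrow's zero-copy interchange with other tools possible, since consumers can read the buffers directly without deserializing.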

Thad chatted with the Arrow folks, and they think Arrow is a great fit for OpenRefine, especially given our columnar operations.

Here's a bit more info on how the Apache Arrow folks say we might leverage it:

https://apachearrow.slack.com/archives/C0S8Z7VBK/p1516934654000036


Thad Guidry [8:44 PM] Anyone know about us at OpenRefine ? We're curious if/how Arrow might be useful for us to improve our ancient in-memory data model and processing so desktop and laptop users can work with more local data in OpenRefine ?

bhulette [10:28 AM] @thad Guidry I hadn't heard of OpenRefine before, but it definitely looks like something that could benefit from the Arrow format. The biggest selling point would be the ability to easily interoperate with other tools that use Arrow (e.g. Spark, pandas, etc...) without any serialization costs.

[10:29 AM] I don't know what your current data model looks like, but there could be performance benefits from the columnar layout as well

Thad Guidry [10:53 AM] @bhulette That's described here https://github.com/OpenRefine/OpenRefine/wiki/Server-Side-Architecture

bhulette [11:05 AM] yeah that looks pretty amenable to the arrow format - a loose analogy could be that "column models" are specified by the arrow schema, and the "raw data" is stored in record batches/dictionary batches
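The analogy above — "column models" as an Arrow schema, "raw data" as record batches — can be sketched with plain Python stand-ins (hypothetical names, not the Arrow API; the city data is purely illustrative):

```python
# A schema describes each column's name and type, in order --
# roughly the role OpenRefine's column model would play.
schema = [("name", "utf8"), ("population", "int64")]

# A record batch holds one contiguous chunk of values per column,
# in schema order -- roughly the role of the raw project data.
batch = {
    "name": ["Austin", "Dallas", "Houston"],
    "population": [964254, 1288457, 2304580],
}

num_rows = len(batch["name"])
assert [field for field, _ in schema] == list(batch)
assert num_rows == 3
```

In real Arrow, a table is a sequence of such record batches sharing one schema, so large projects can be processed chunk by chunk.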

Thad Guidry [11:06 AM] @bhulette gotcha

bhulette [11:07 AM] the column groups idea for storing a tree is pretty interesting

[11:08 AM] you would be able to specify blank cells in arrow using validity buffers
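A minimal sketch of what an Arrow-style validity buffer looks like (again not the Arrow API): blank cells are marked by a bitmap with one bit per row, kept alongside the values buffer, so the values buffer stays fixed-width:

```python
# Values buffer: slots for blank rows still exist (fixed width).
values = [10, 0, 30, 0, 50]

# Validity bitmap: bit i set => row i is non-blank (LSB = row 0).
validity = 0b10101

def is_blank(row):
    """A cell is blank when its validity bit is unset."""
    return not (validity >> row) & 1

assert is_blank(1) and is_blank(3)
assert not is_blank(0)
# Blank count = total rows minus set bits in the bitmap.
assert len(values) - bin(validity).count("1") == 2
```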

Thad Guidry [11:11 AM] @bhulette keep the ideas coming ! (fyi, we had also thought of Apache Ignite)

bhulette [11:12 AM] I'm not sure if arrow could help with storing changes

[11:13 AM] but that could be a welcome addition to the project, if people like @wesmckinn think it's in scope 🙂

[11:15 AM] is the OpenRefine server distributed?

Thad Guidry [11:16 AM] @bhulette no

[11:18 AM] @bhulette OpenRefine is used locally (desktop/laptop) to clean data. We eventually want to separate the backend and frontend, so that we can do large transformations via streaming/batching against Apache Beam, etc. But our users, once their data gets that big, typically use other tools.


thadguidry commented Aug 4, 2018

We might also leverage a wider surface than Apache Arrow itself and instead incorporate Apache Drill (Apache Arrow plus support for Hadoop, NoSQL stores, and others):

https://drill.apache.org/

In fact, Arrow was spun off as a new project using code donated from Drill [1] and has since grown to include much more. [1] https://drill.apache.org/docs/value-vectors/

I've taken the embedded Docker version of Drill out for a spin, and it's really fast and performant, even when given only 4 GB of RAM. I was impressed by the ability to connect a few JSON files along with a large 500 MB CSV file, all three joined and compared with SQL operations at the same time.

I'd like to see others research Apache Drill as a complete data backend for OpenRefine.

I also thought this was interesting: someone connected Drill to Spark via Spark's DataFrame API.
