Research Apache Arrow to improve in-memory data model #1469
Goal: Increase OpenRefine’s data-processing capacity by moving to an alternative, higher-performance in-memory / on-disk data storage technology. See the “Planned in Phase 2” enhancements document.
We can probably lower our memory overhead with Apache Arrow: https://arrow.apache.org/docs/memory_layout.html
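To make the memory-layout argument concrete, here is an illustrative sketch in plain Python (not the Arrow API): an Arrow-style column keeps its values in one contiguous buffer plus a validity bitmap with one bit per cell, instead of allocating an object per cell the way a row-oriented model does.

```python
class ColumnSketch:
    """Toy model of an Arrow column: a values buffer + a validity bitmap."""

    def __init__(self, cells):
        self.values = []                                   # contiguous values buffer
        self.validity = bytearray((len(cells) + 7) // 8)   # 1 bit per cell
        for i, cell in enumerate(cells):
            if cell is None:                   # blank cell -> bit stays 0
                self.values.append(0)          # placeholder slot in the buffer
            else:
                self.values.append(cell)
                self.validity[i // 8] |= 1 << (i % 8)      # mark cell as valid

    def is_valid(self, i):
        return bool(self.validity[i // 8] & (1 << (i % 8)))

    def get(self, i):
        return self.values[i] if self.is_valid(i) else None


col = ColumnSketch([7, None, 42])
print([col.get(i) for i in range(3)])  # [7, None, 42]
```

The point is the overhead ratio: blank cells cost one bit in the bitmap plus one placeholder slot, rather than a boxed object, which is where the memory savings over our current per-cell model would come from.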
Thad chatted with the Arrow folks, and they think Arrow is a great fit for OpenRefine, especially considering our columnar operations.
Here's a bit more info on how the Apache Arrow folks say we might leverage them...
Thad Guidry [8:44 PM] Anyone know about us at OpenRefine? We're curious if/how Arrow might be useful for us to improve our ancient in-memory data model and processing, so desktop and laptop users can work with more local data in OpenRefine?
bhulette [10:28 AM] @thad Guidry I hadn't heard of OpenRefine before, but it definitely looks like something that could benefit from the Arrow format. The biggest selling point would be the ability to easily interoperate with other tools that use Arrow (e.g. Spark, pandas, etc...) without any serialization costs.
[10:29 AM] I don't know what your current data model looks like, but there could be performance benefits from the columnar layout as well
Thad Guidry [10:53 AM] @bhulette That's described here https://github.com/OpenRefine/OpenRefine/wiki/Server-Side-Architecture
bhulette [11:05 AM] yeah that looks pretty amenable to the arrow format - a loose analogy could be that "column models" are specified by the arrow schema, and the "raw data" is stored in record batches/dictionary batches
Thad Guidry [11:06 AM] @bhulette gotcha
bhulette [11:07 AM] the column groups idea for storing a tree is pretty interesting
[11:08 AM] you would be able to specify blank cells in arrow using validity buffers
Thad Guidry [11:11 AM] @bhulette keep the ideas coming ! (fyi, we had also thought of Apache Ignite)
bhulette [11:12 AM] I'm not sure if arrow could help with storing changes
[11:13 AM] but that could be a welcome addition to the project, if people like @wesmckinn think it's in scope
[11:15 AM] is the OpenRefine server distributed?
Thad Guidry [11:16 AM] @bhulette no
[11:18 AM] @bhulette OpenRefine is used locally (desktop/laptop) to clean data. We eventually want to separate the backend and frontend, so that we can do large transformations via streaming/batching against Apache Beam, etc. But once their data gets that big, our users typically move to other tools.
We might also just leverage a wider surface than Apache Arrow itself and instead incorporate Apache Drill (Apache Arrow plus support for Hadoop, NoSQL, and others).
In fact, Arrow was spun off as a new project using code donated from Drill, and it has since grown to include a lot more. https://drill.apache.org/docs/value-vectors/
I've taken the Docker embedded version of Drill out for a spin, and it's really fast and performant, even when given only 4 GB of RAM. I was impressed with the ability to connect a few JSON files along with a large 500 MB CSV file, with all three being joined and compared with SQL operations at the same time.
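For anyone who wants to reproduce that kind of experiment programmatically, here is a hedged sketch of querying an embedded Drill instance over its REST API (the default endpoint is `http://localhost:8047/query.json`). The file paths, column names, and the running local server are assumptions for illustration; the point is that Drill lets you join raw JSON and CSV files with plain SQL.

```python
import json
import urllib.request

DRILL_URL = "http://localhost:8047/query.json"  # Drill's default REST port

# Hypothetical SQL joining a CSV file against a JSON file straight from disk
SQL = """
SELECT c.columns[0] AS city, j.population
FROM dfs.`/data/cities.csv` AS c
JOIN dfs.`/data/population.json` AS j
  ON c.columns[0] = j.city
"""

def build_query(sql):
    """Build the JSON payload Drill's REST query endpoint expects."""
    return json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")

def run_query(sql):
    """POST the query to a running Drill instance and return the result rows."""
    req = urllib.request.Request(
        DRILL_URL,
        data=build_query(sql),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires Drill running locally
        return json.load(resp)["rows"]
```

`run_query(SQL)` would only work against a live embedded Drill; `build_query` shows the payload shape on its own.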
I'd like to see more research by others to look into Apache Drill as a complete data backend for OpenRefine.
I thought this was also interesting... someone connected Drill to Spark via Spark's DataFrame API.