
columnar support for Arrow tables #2030

Open · Fil wants to merge 4 commits into main
Conversation

@Fil (Contributor) commented Mar 21, 2024

Detect Arrow tables and use direct access to the columns as much as we can—first and foremost, by not materializing the data on mark.initialize, and by routing a string accessor to getChild.

We don't add apache-arrow as a dependency (which means detection is done with duck typing of the methods we use… we could reinforce this a bit if needed, but I think that's fine).
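The duck-typing approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function name and the exact set of probed methods are assumptions, chosen to mirror the methods the comment says Plot relies on (getChild, numRows, iteration).

```javascript
// Hedged sketch: detect an Arrow-like table without depending on
// apache-arrow, by probing only the surface we actually use.
// `maybeArrowTable` is an illustrative name, not Plot's real helper.
function maybeArrowTable(data) {
  return (
    data != null &&
    typeof data.getChild === "function" &&
    typeof data.numRows === "number" &&
    typeof data[Symbol.iterator] === "function"
  );
}

// A plain array is not mistaken for a table…
console.log(maybeArrowTable([1, 2, 3])); // false
// …while any object exposing the expected surface is accepted.
const fakeTable = {
  numRows: 2,
  getChild: (name) => null,
  [Symbol.iterator]: function* () {}
};
console.log(maybeArrowTable(fakeTable)); // true
```

As the comment notes, this could be reinforced (e.g. by probing more of the Table API), at the cost of more coupling to Arrow internals.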

The story is a bit complicated in the group transform (and maybe other places?) because we're actually making new output data, which currently uses take and map to create a new dataset that "looks like" the original data array.

In the Arrow table case, we might want take to return a "filtered" table, but I don't think that exists (see the API reference). Instead we return an array of Row objects (which are Proxy objects into the columns); it's probably the best choice memory-wise, even though I don't like the looks of it. Anyway, they're easy to convert to regular objects by writing ({...d}).
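To illustrate why ({...d}) detaches a row from its columns, here is a toy stand-in for an Arrow-style Row proxy. The makeRow helper is hypothetical (Arrow's real Row is more elaborate); it only shows the mechanism: reads are forwarded into column storage until a shallow spread copies the enumerable fields into a plain object.

```javascript
// Hedged sketch: a Row-like Proxy that reads lazily from columnar storage.
// `makeRow` is illustrative, not Arrow's actual Row implementation.
function makeRow(columns, i) {
  return new Proxy({}, {
    get: (_, key) => columns[key]?.[i], // lazy read into the column
    ownKeys: () => Object.keys(columns), // enumerate the column names
    getOwnPropertyDescriptor: () => ({enumerable: true, configurable: true})
  });
}

const columns = {x: [1, 2], y: [10, 20]};
const row = makeRow(columns, 1);
const plain = {...row}; // copies values out, detaching from the columns
console.log(plain); // { x: 2, y: 20 }
```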

For a more thorough investigation of the places where we assume that values are arrays, I ran all the unit tests after replacing the data with an Arrow table (in Mark and facet data). This resulted in 25 "changed" snapshots (see diff); it seems all of them are due only to dates that change during the conversion to Arrow. None of them were crashes.

I still need to investigate why the dates are modified (I'm thinking that may be because Arrow coerces them to Date32<day> — nope, they are DateMillisecond).

closes #191

cc: @jheer

@Fil Fil requested a review from mbostock March 21, 2024 12:49
@Fil Fil marked this pull request as draft March 21, 2024 12:58
@Fil (Contributor, Author) commented Mar 21, 2024

The issue with dates can be reduced to this (which is independent of Plot):

import * as Arrow from "apache-arrow";
const data = [{date: new Date(1950, 1, 2)}];
console.log(data, [...Arrow.tableFromJSON(data)].map((d) => ({...d})));

// [ { date: 1950-02-01T23:00:00.000Z } ] [ { date: 1950-03-23T16:02:47.296Z } ]

You can see that the date is off by 50 days. Am I missing something, or should I open an issue on https://github.com/apache/arrow? cc @domoritz

version information:

apache-arrow@^15.0.2:
  resolved "https://registry.yarnpkg.com/apache-arrow/-/apache-arrow-15.0.2.tgz#d87c6447d64d6fab34aa70119362680b6617ce63"

@jheer commented Mar 21, 2024

Hmm, we’ve had success reading Date values from DuckDB transferred in Arrow format. So I’d first check whether this is an encoding problem (Date to Arrow) or a decoding problem (Arrow to Date). Some encoders in Arrow JS (e.g. for Decimal) are known to be broken. I have sometimes had to use DuckDB or pyarrow to generate Arrow bytes for testing.

On a related note, in Mosaic we special case Timestamp types as Arrow JS returns those as numbers; we then instantiate Date objects ourselves.
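The Timestamp special-casing jheer describes can be sketched like this. The toDates helper is an illustrative name (not Mosaic's actual API); it assumes, per the comment, that a Timestamp column has already been decoded to epoch-millisecond numbers.

```javascript
// Hedged sketch of the Mosaic-style workaround: Arrow JS may yield
// Timestamp values as epoch-millisecond numbers, so we wrap them in Date
// objects ourselves. `toDates` is hypothetical, for illustration only.
function toDates(epochMillis) {
  return Array.from(epochMillis, (t) => (t == null ? null : new Date(t)));
}

const decoded = [0, 86400000, null]; // what a decoded Timestamp column might yield
console.log(toDates(decoded).map((d) => d && d.toISOString()));
// [ '1970-01-01T00:00:00.000Z', '1970-01-02T00:00:00.000Z', null ]
```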


// Arrayify arrow tables. We try to avoid materializing the values, but the
// Proxy might be used by the group reducer to construct groupData.
function arrowTableProxy(data) {
A Contributor commented on this code:
Doesn't this negate most of the benefit of using arrow? I think proxies add a lot of overhead, and arrow already has fast proxies for e.g. iteration. Can you defer the conversion to the group reducer?

@Fil (Contributor, Author) replied Mar 21, 2024:

I think it works. We still read the named channels with getChild(name), and Symbol.iterator is also delegated directly to the _Table object, so there shouldn't be any waste here. The only place where this is not great is in the group transform, which assembles a new array of data points for each group (with the take function—that's what I think Arrow doesn't allow). I made a Proxy because I didn't want to add methods to the source _Table, but we could just slice() it defensively and not use a Proxy?
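The routing described here can be sketched as a single accessor that prefers columnar access when the data exposes getChild and falls back to per-row reads otherwise. The channelValues name and the fallback shape are assumptions for illustration, not Plot's real internals.

```javascript
// Hedged sketch: resolve a string channel accessor against either an
// Arrow-like table (columnar, via getChild) or a plain array of objects.
// `channelValues` is an illustrative name.
function channelValues(data, field) {
  if (typeof data.getChild === "function") {
    const column = data.getChild(field);
    return column ? [...column] : []; // columnar path, no row materialization
  }
  return Array.from(data, (d) => d[field]); // row-by-row fallback
}

// Works on plain arrays…
console.log(channelValues([{v: 1}, {v: 2}], "v")); // [ 1, 2 ]
// …and on anything exposing getChild, like an Arrow table.
const table = {getChild: (name) => (name === "v" ? [3, 4] : null)};
console.log(channelValues(table, "v")); // [ 3, 4 ]
```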

The Contributor replied:

Oh, I see. So in most cases you use columnar access anyway. Then maybe ignore my comment.

@Fil (Contributor, Author) commented Mar 21, 2024

The internal data is like this:

 Data {
    type: DateMillisecond [Date] { typeId: 8, unit: 1 },
    children: [],
    dictionary: undefined,
    offset: 0,
    length: 1,
    _nullCount: 0,
    stride: 2,
    values: Int32Array(16) [
      -1314529784,
      -146,
      0,
     …

The date is encoded in the first two 32-bit words, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.

So it's apparently the decoding that fails.
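The manual decode above can be reproduced with BigInt arithmetic, which avoids Number precision loss when combining the two 32-bit words into 64-bit epoch milliseconds. This is a generic sketch of the arithmetic in the comment, not Arrow's decoder.

```javascript
// Hedged sketch: reconstruct a 64-bit timestamp stored as two 32-bit words
// (stride 2), as high * 2^32 plus the signed low word. Illustrative only.
function decodeInt64Pair(low, high) {
  return Number(BigInt(high) * 2n ** 32n + BigInt(low));
}

// Reproducing the arithmetic from the comment above:
console.log(decodeInt64Pair(-1498374784, -146)); // -628563600000
```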

@jheer commented Mar 21, 2024

> The date is encoded in the first two 32-bit words, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.
>
> So it's apparently the decoding that fails.

I think DuckDB produces either Date32 or Timestamp values, so I haven't tripped over this yet! Thanks for documenting it.

@Fil (Contributor, Author) commented Mar 21, 2024

I've reported the Date issue at apache/arrow#40718. I think it is orthogonal to this PR, since I can get the same error with Plot#main and an arrow table — though it does not explain all 25 differences :(

@Fil (Contributor, Author) commented Mar 21, 2024

Tested with apache/arrow#40725: everything works smoothly (except for the test "mark data parallel to facet data triggers a warning", which is not relevant). Thanks for the super quick fix @trxcllnt and @domoritz!

@Fil Fil marked this pull request as ready for review March 21, 2024 20:48
@mbostock (Member) left a comment

I love the minimally-invasive nature of this change. I also worry that it might be brittle — the proxy masquerading as an Array does “just enough” for it to pass tests, but is this likely to cause problems in the future if the code assumes that the data is an array?

As a thought exercise, what would it look like if we allowed the data to be an arrow table throughout the code base? How many places need to read the materialized mark data as an array and would need to be changed to check isArrowTable? Given the weight behind Apache Arrow, I’m more inclined to make it a first-class thing. Maybe someday Plot uses Arrow internally as the native data representation.

In the case of the group transform, I would love to avoid materializing the array-of-objects, too. (I think at some point I explored having the group transform not materialize the grouped data by default.)

I’m pretty close to approving this PR as-is, but maybe you can do a little more research before we do so? I’d love to understand the alternatives a little more.

@Fil (Contributor, Author) commented Mar 21, 2024

In the PR that fixes the Date bug (apache/arrow#40725), @domoritz also changes the .get(i) value accessor into .at(i), which means that everywhere we use (array)[i] vs (arrow vector).get(i), we could now unify with (array or vector).at(i). This is probably the change that will help us the most.
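The unification enabled by .at(i) can be shown in a few lines. The valueAt name is illustrative; the point is that a single code path serves both plain arrays (which have Array.prototype.at) and anything vector-like that exposes the same method.

```javascript
// Hedged sketch: once Arrow vectors expose at(i) (apache/arrow#40725),
// one accessor serves both plain arrays and vectors. `valueAt` is an
// illustrative name, not Plot's real helper.
function valueAt(data, i) {
  return data.at(i);
}

console.log(valueAt([10, 20, 30], 1)); // 20 (plain array)
const vectorLike = {at: (i) => i * 100}; // stand-in for an Arrow vector
console.log(valueAt(vectorLike, 2)); // 200
```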

An example of a (custom) data transform that breaks with arrow tables is here. It uses data.flatMap, which does not exist on the "fake array". We could add it easily.

(I don't think we need to rush this, we should probably wait for 40725 to land.)

@domoritz (Contributor) commented:

Yeah, compatibility with native arrays was my main goal with supporting at. I'd be supportive of adding map and flatMap.

The next arrow release is in April, so we could try to get the change in there (releases are roughly every three months). However, you probably don't want to rely on the latest arrow library being used, so having a stopgap until the new library is common makes sense.

Successfully merging this pull request may close these issues.

Optimize field shorthand for columnar data?