
columnar support for Arrow tables #2030

Open · Fil wants to merge 4 commits into main
Conversation

@Fil (Contributor) commented Mar 21, 2024

Detect Arrow tables and use direct access to the columns as much as we can—first and foremost, by not materializing the data on mark.initialize, and by routing a string accessor to getChild.

We don't add apache-arrow as a dependency (which means detection is done with duck typing of the methods we use… we could reinforce this a bit if needed, but I think that's fine).
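The duck-typing approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function name and the exact set of probed methods are assumptions, chosen to mirror the methods the comment says Plot relies on (getChild, numRows, iteration).

```javascript
// Hedged sketch: detect an Arrow-like table without depending on
// apache-arrow, by probing only the surface we actually use.
// `maybeArrowTable` is an illustrative name, not Plot's real helper.
function maybeArrowTable(data) {
  return (
    data != null &&
    typeof data.getChild === "function" &&
    typeof data.numRows === "number" &&
    typeof data[Symbol.iterator] === "function"
  );
}

// A plain array is not mistaken for a table…
console.log(maybeArrowTable([1, 2, 3])); // false
// …while any object exposing the expected surface is accepted.
const fakeTable = {
  numRows: 2,
  getChild: (name) => null,
  [Symbol.iterator]: function* () {}
};
console.log(maybeArrowTable(fakeTable)); // true
```

As the comment notes, this could be reinforced (e.g. by probing more of the Table API), at the cost of more coupling to Arrow internals.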

The story is a bit complicated in the group transform (and maybe other places?) because we're actually making new output data, which currently uses take and map to create a new dataset that "looks like" the original data array.

In the Arrow table case, we might want take to return a "filtered" table, but I don't think that exists (see the API reference). Instead we return an array of Row objects (which are Proxy objects into the columns); it's probably the best choice memory-wise, even though I don't like the looks of it. Anyway, they're easy to convert to regular objects by writing ({...d}).
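To illustrate why ({...d}) detaches a row from its columns, here is a toy stand-in for an Arrow-style Row proxy. The makeRow helper is hypothetical (Arrow's real Row is more elaborate); it only shows the mechanism: reads are forwarded into column storage until a shallow spread copies the enumerable fields into a plain object.

```javascript
// Hedged sketch: a Row-like Proxy that reads lazily from columnar storage.
// `makeRow` is illustrative, not Arrow's actual Row implementation.
function makeRow(columns, i) {
  return new Proxy({}, {
    get: (_, key) => columns[key]?.[i], // lazy read into the column
    ownKeys: () => Object.keys(columns), // enumerate the column names
    getOwnPropertyDescriptor: () => ({enumerable: true, configurable: true})
  });
}

const columns = {x: [1, 2], y: [10, 20]};
const row = makeRow(columns, 1);
const plain = {...row}; // copies values out, detaching from the columns
console.log(plain); // { x: 2, y: 20 }
```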

For a more thorough investigation of the places where we assume that values are arrays, I ran all the unit tests after replacing the data with an Arrow table (in Mark and facet data). This resulted in 25 "changed" snapshots (see diff); it seems all of them are due only to dates that change during the conversion to Arrow. None of them were crashes.

I still need to investigate why the dates are modified (I'm thinking that may be because Arrow coerces them to Date32<day> — nope, they are DateMillisecond).

closes #191

cc: @jheer

@Fil Fil requested a review from mbostock March 21, 2024 12:49
@Fil Fil marked this pull request as draft March 21, 2024 12:58
@Fil (Contributor, Author) commented Mar 21, 2024

The issue with dates can be reduced to this (which is independent of Plot):

import * as Arrow from "apache-arrow";
const data = [{date: new Date(1950, 1, 2)}];
console.log(data, [...Arrow.tableFromJSON(data)].map((d) => ({...d})));

// [ { date: 1950-02-01T23:00:00.000Z } ] [ { date: 1950-03-23T16:02:47.296Z } ]

You can see that the date is off by 50 days. Am I missing something, or should I open an issue on https://github.com/apache/arrow? cc @domoritz

version information:

apache-arrow@^15.0.2:
  resolved "https://registry.yarnpkg.com/apache-arrow/-/apache-arrow-15.0.2.tgz#d87c6447d64d6fab34aa70119362680b6617ce63"

@jheer commented Mar 21, 2024

Hmm, we’ve had success reading Date values from DuckDB transferred in Arrow format. So I’d first check whether this is an encoding problem (Date to Arrow) or a decoding problem (Arrow to Date). Some encoders in Arrow JS (e.g. for Decimal) are known to be broken. I have sometimes had to use DuckDB or pyarrow to generate Arrow bytes for testing.

On a related note, in Mosaic we special case Timestamp types as Arrow JS returns those as numbers; we then instantiate Date objects ourselves.
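The Timestamp special-casing jheer describes can be sketched like this. The toDates helper is an illustrative name (not Mosaic's actual API); it assumes, per the comment, that a Timestamp column has already been decoded to epoch-millisecond numbers.

```javascript
// Hedged sketch of the Mosaic-style workaround: Arrow JS may yield
// Timestamp values as epoch-millisecond numbers, so we wrap them in Date
// objects ourselves. `toDates` is hypothetical, for illustration only.
function toDates(epochMillis) {
  return Array.from(epochMillis, (t) => (t == null ? null : new Date(t)));
}

const decoded = [0, 86400000, null]; // what a decoded Timestamp column might yield
console.log(toDates(decoded).map((d) => d && d.toISOString()));
// [ '1970-01-01T00:00:00.000Z', '1970-01-02T00:00:00.000Z', null ]
```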


// Arrayify arrow tables. We try to avoid materializing the values, but the
// Proxy might be used by the group reducer to construct groupData.
function arrowTableProxy(data) {
A Contributor commented on this code:
Doesn't this negate most of the benefit of using arrow? I think proxies add a lot of overhead, and arrow already has fast proxies for e.g. iteration. Can you defer the conversion to the group reducer?

@Fil (Contributor, Author) replied Mar 21, 2024:

I think it works. We still read the named channels with getChild(name), and Symbol.iterator is also delegated directly to the _Table object, so there shouldn't be any waste here. The only place where this is not great is in the group transform, which assembles a new array of data points for each group (with the take function—that's what I think Arrow doesn't allow). I made a Proxy because I didn't want to add methods to the source _Table, but we could just slice() it defensively and not use a Proxy?
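The routing described here can be sketched as a single accessor that prefers columnar access when the data exposes getChild and falls back to per-row reads otherwise. The channelValues name and the fallback shape are assumptions for illustration, not Plot's real internals.

```javascript
// Hedged sketch: resolve a string channel accessor against either an
// Arrow-like table (columnar, via getChild) or a plain array of objects.
// `channelValues` is an illustrative name.
function channelValues(data, field) {
  if (typeof data.getChild === "function") {
    const column = data.getChild(field);
    return column ? [...column] : []; // columnar path, no row materialization
  }
  return Array.from(data, (d) => d[field]); // row-by-row fallback
}

// Works on plain arrays…
console.log(channelValues([{v: 1}, {v: 2}], "v")); // [ 1, 2 ]
// …and on anything exposing getChild, like an Arrow table.
const table = {getChild: (name) => (name === "v" ? [3, 4] : null)};
console.log(channelValues(table, "v")); // [ 3, 4 ]
```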

The Contributor replied:

Oh, I see. So in most cases you use columnar access anyway. Then maybe ignore my comment.

@Fil (Contributor, Author) commented Mar 21, 2024

The internal data is like this:

 Data {
    type: DateMillisecond [Date] { typeId: 8, unit: 1 },
    children: [],
    dictionary: undefined,
    offset: 0,
    length: 1,
    _nullCount: 0,
    stride: 2,
    values: Int32Array(16) [
      -1314529784,
      -146,
      0,
     …

The date is encoded in the first two 32-bit words, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.

So it's apparently the decoding that fails.
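The manual decode above can be reproduced with BigInt arithmetic, which avoids Number precision loss when combining the two 32-bit words into 64-bit epoch milliseconds. This is a generic sketch of the arithmetic in the comment, not Arrow's decoder.

```javascript
// Hedged sketch: reconstruct a 64-bit timestamp stored as two 32-bit words
// (stride 2), as high * 2^32 plus the signed low word. Illustrative only.
function decodeInt64Pair(low, high) {
  return Number(BigInt(high) * 2n ** 32n + BigInt(low));
}

// Reproducing the arithmetic from the comment above:
console.log(decodeInt64Pair(-1498374784, -146)); // -628563600000
```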

@jheer commented Mar 21, 2024

> The date is encoded in the first two 32-bit words, which I decode manually to (-146*(2**32)) - 1498374784 = -628563600000, which is my initial date.
>
> So it's apparently the decoding that fails.

I think DuckDB produces either Date32 or Timestamp values, so I haven't tripped over this yet! Thanks for documenting it.

@Fil (Contributor, Author) commented Mar 21, 2024

I've reported the Date issue at apache/arrow#40718. I think it is orthogonal to this PR, since I can get the same error with Plot#main and an arrow table — though it does not explain all 25 differences :(

@Fil (Contributor, Author) commented Mar 21, 2024

Tested with apache/arrow#40725: everything works smoothly (except for the test "mark data parallel to facet data triggers a warning", which is not relevant). Thanks for the super quick fix @trxcllnt and @domoritz!

@Fil Fil marked this pull request as ready for review March 21, 2024 20:48
@mbostock (Member) left a comment

I love the minimally-invasive nature of this change. I also worry that it might be brittle — the proxy masquerading as an Array does “just enough” for it to pass tests, but is this likely to cause problems in the future if the code assumes that the data is an array?

As a thought exercise, what would it look like if we allowed the data to be an arrow table throughout the code base? How many places need to read the materialized mark data as an array and would need to be changed to check isArrowTable? Given the weight behind Apache Arrow, I’m more inclined to make it a first-class thing. Maybe someday Plot uses Arrow internally as the native data representation.

In the case of the group transform, I would love to avoid materializing the array-of-objects, too. (I think at some point I explored having the group transform not materialize the grouped data by default.)

I’m pretty close to approving this PR as-is, but maybe you can do a little more research before we do so? I’d love to understand the alternatives a little more.

@Fil (Contributor, Author) commented Mar 21, 2024

In the PR that fixes the Date bug (apache/arrow#40725), @domoritz also changes the .get(i) value accessor into .at(i), which means that everywhere we use (array)[i] vs (arrow vector).get(i), we could now unify with (array or vector).at(i). This is probably the change that will help us the most.
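The unification enabled by .at(i) can be shown in a few lines. The valueAt name is illustrative; the point is that a single code path serves both plain arrays (which have Array.prototype.at) and anything vector-like that exposes the same method.

```javascript
// Hedged sketch: once Arrow vectors expose at(i) (apache/arrow#40725),
// one accessor serves both plain arrays and vectors. `valueAt` is an
// illustrative name, not Plot's real helper.
function valueAt(data, i) {
  return data.at(i);
}

console.log(valueAt([10, 20, 30], 1)); // 20 (plain array)
const vectorLike = {at: (i) => i * 100}; // stand-in for an Arrow vector
console.log(valueAt(vectorLike, 2)); // 200
```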

An example of a (custom) data transform that breaks with arrow tables is here. It uses data.flatMap, which does not exist on the "fake array". We could add it easily.

(I don't think we need to rush this, we should probably wait for 40725 to land.)

@domoritz (Contributor) commented:

Yeah, compatibility with native arrays was my main goal with supporting at. I'd be supportive of adding map and flatMap.

The next arrow release is in April, so we could try to get the change in there (releases are roughly every three months). However, you probably don't want to rely on the latest arrow library being used, so having a stopgap until the new library is common makes sense.

Successfully merging this pull request may close these issues.

Optimize field shorthand for columnar data?