
reading only a subset of columns #36

Closed

ExpandingMan opened this issue Jan 11, 2017 · 5 comments

@ExpandingMan
Collaborator

Obviously one can simply load the whole dataframe and select the desired columns, but I was wondering whether there's room to improve performance when only a subset of columns is needed. If I'm understanding the format correctly, I think the answer is "yes", in which case it would be really nice if we could do something like

df = Feather.read("file.feather", columns=[:x1, :x2, :x4])

which would be useful on dataframes with large numbers of columns.
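For comparison, the current workaround versus the proposed call would look like this (the columns keyword is, of course, hypothetical at this point):

# current workaround: read everything, then subset,
# paying the full I/O and memory cost up front
df = Feather.read("file.feather")[[:x1, :x2, :x4]]

# proposed: only the requested columns are ever read off disk
df = Feather.read("file.feather", columns=[:x1, :x2, :x4])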

@quinnj
Member

quinnj commented Jan 11, 2017

Yep, definitely possible. I have some notes somewhere from when I started to sketch this out for all of DataStreams. Let me try to find those notes and see about applying this generically across all the packages.

@ExpandingMan
Collaborator Author

Great, that'd be a really valuable feature. It would be especially good if one could also select only a subset of rows; for me that's really the only advantage a database still has over this type of format.

By the way, I can deserialize an 11GB file in about 17s, so I don't have any real need to resort to another format until I reach something like 60GB, which is fantastic!

@ExpandingMan
Collaborator Author

Another extremely important feature would be reading/writing only specific rows as well. Is this possible? I'm not even really sure whether this format is designed for that sort of thing.
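For what it's worth, here is a rough sketch of why row subsets ought to be cheap in a memory-mapped columnar file, at least for a non-nullable fixed-width column. The offset and nrows values here are made up; in reality they would come from the file's metadata:

using Mmap

io = open("file.feather")
m = Mmap.mmap(io, Vector{UInt8})                  # maps the file; nothing is read yet
offset, nrows = 8, 1_000_000                      # hypothetical metadata values
ptr = convert(Ptr{Float64}, pointer(m) + offset)
col = unsafe_wrap(Array, ptr, nrows)              # the whole column, still backed by disk
rows = view(col, 1000:2000)                       # a row subset as a zero-copy view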

@ExpandingMan
Collaborator Author

ExpandingMan commented Apr 7, 2017

So, I've been doing some exploring. Right now it seems that @quinnj has implemented Data.streamfrom(source, Data.Column, ...), so to read a whole column you can do, for instance:

using Feather, DataStreams

src = Feather.Source(filename)    # opens the file; no column data is read yet
sch = Data.schema(src)

col = "A"
colnum = sch[col]                           # column index from its name
dtype = Data.types(sch)[colnum]             # element type from the schema
A = Data.streamfrom(src, Data.Column, Vector{dtype}, colnum)
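Building on that, one can already assemble a column-subset DataFrame by hand, roughly like this (glossing over the Nullable element types the streamed columns may actually carry):

using DataFrames

subset = ["x1", "x2", "x4"]
df = DataFrame()
for name in subset
    i = sch[name]                  # index from the schema, as above
    T = Data.types(sch)[i]
    df[Symbol(name)] = Data.streamfrom(src, Data.Column, Vector{T}, i)
end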

However, if you attempt to use the method for reading single fields, Data.streamfrom(src, Data.Field, ...), what happens is that the entire column is read in and stored in the Feather.Source. So, at the moment, one cannot read an individual field without pulling in the whole column.

Edit: OK, I think I understand what is going on a little better now. If memory mapping is enabled, then when the column is first wrapped as a Julia array via unsafe_wrap, I believe the data still exists only on disk. However, Data.streamfrom also calls transform! on the result, which, I think in all cases, winds up copying the entire column into memory. It should be easy to take the result of unsafe_wrap and convert just one of its elements to the appropriate datatype without doing so for the entire column. Then we'd have a Data.streamfrom method that only actually reads the requested fields off disk.
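In sketch form the idea is something like the following, where read_field is a hypothetical helper and offset/nrows would come from the column's metadata:

# wrap the on-disk bytes without copying, then touch only one element
function read_field{T}(m::Vector{UInt8}, ::Type{T}, offset::Int, nrows::Int, row::Int)
    ptr = convert(Ptr{T}, pointer(m) + offset)
    col = unsafe_wrap(Array, ptr, nrows)   # no copy; still backed by the mmap
    col[row]                               # only this element's page gets read
end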

I'd be happy to make a PR if I'm understanding this correctly, but I'm afraid I might still not be. It's pretty hard to test, because Julia's memory-reporting tools like sizeof and whos also count the memory-mapped data, so I'm having a hard time knowing for sure what's on disk and what's in RAM.
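One trick that may help: measure allocations instead of sizes. Using the Data.Column signature from the snippet above, something like

Data.streamfrom(src, Data.Column, Vector{Float64}, 1)            # warm up / compile
bytes = @allocated Data.streamfrom(src, Data.Column, Vector{Float64}, 1)

should allocate on the order of the full column size if a copy is being made, and almost nothing if it isn't.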

Edit: I did indeed make a PR; see #45. It requires some fixes, which I'll make soon.

@ExpandingMan
Collaborator Author

This issue is no longer relevant.
