
reading only a subset of columns #36

Closed

ExpandingMan opened this issue Jan 11, 2017 · 5 comments

@ExpandingMan
Collaborator

Obviously one can simply load the whole dataframe and select the desired columns, but I was wondering whether there's room to improve performance when only a subset of columns is needed. If I'm understanding the format correctly, I think the answer is "yes", in which case it would be really nice if we could do something like

df = Feather.read("file.feather", columns=[:x1, :x2, :x4])

which would be useful on dataframes with large numbers of columns.
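For comparison, the current workaround versus the proposed call would look like this (the columns keyword is, of course, hypothetical at this point):

# current workaround: read everything, then subset,
# paying the full I/O and memory cost up front
df = Feather.read("file.feather")[[:x1, :x2, :x4]]

# proposed: only the requested columns are ever read off disk
df = Feather.read("file.feather", columns=[:x1, :x2, :x4])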

@quinnj
Member

quinnj commented Jan 11, 2017

Yep, definitely possible. I have some notes somewhere from when I started to sketch this out for all of DataStreams. Let me try to find those notes and see about applying this generically across all the packages.

@ExpandingMan
Collaborator Author

Great, that'd be a really valuable feature. It would be especially good if one could also select only a subset of rows; for me that's really the only advantage a database still has over this type of format.

By the way, I can deserialize an 11GB file in about 17s, so I don't have any real need to resort to another format until I reach something like 60GB, which is fantastic!

@ExpandingMan
Collaborator Author

Another extremely important feature would be reading/writing only specific rows as well. Is this possible? I'm not even really sure whether this format is designed for that sort of thing.
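For what it's worth, here is a rough sketch of why row subsets ought to be cheap in a memory-mapped columnar file, at least for a non-nullable fixed-width column. The offset and nrows values here are made up; in reality they would come from the file's metadata:

using Mmap

io = open("file.feather")
m = Mmap.mmap(io, Vector{UInt8})                  # maps the file; nothing is read yet
offset, nrows = 8, 1_000_000                      # hypothetical metadata values
ptr = convert(Ptr{Float64}, pointer(m) + offset)
col = unsafe_wrap(Array, ptr, nrows)              # the whole column, still backed by disk
rows = view(col, 1000:2000)                       # a row subset as a zero-copy view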

@ExpandingMan
Collaborator Author

ExpandingMan commented Apr 7, 2017

So, I've been doing some exploring. Right now it seems that @quinnj has implemented Data.streamfrom(source, Data.Column, ...), so to read a whole column you can do, for instance:

using Feather, DataStreams

src = Feather.Source(filename)    # opens the file; no column data is read yet
sch = Data.schema(src)

col = "A"
colnum = sch[col]                           # column index from its name
dtype = Data.types(sch)[colnum]             # element type from the schema
A = Data.streamfrom(src, Data.Column, Vector{dtype}, colnum)
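Building on that, one can already assemble a column-subset DataFrame by hand, roughly like this (glossing over the Nullable element types the streamed columns may actually carry):

using DataFrames

subset = ["x1", "x2", "x4"]
df = DataFrame()
for name in subset
    i = sch[name]                  # index from the schema, as above
    T = Data.types(sch)[i]
    df[Symbol(name)] = Data.streamfrom(src, Data.Column, Vector{T}, i)
end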

However, if you attempt to use the method for reading single fields, Data.streamfrom(src, Data.Field, ...), what happens is that the entire column is read in and stored in the Feather.Source. So, at the moment, one cannot read an individual field without pulling in the whole column.

Edit: OK, I think I understand what is going on a little better now. If memory mapping is enabled, then when the column is first wrapped as a Julia array via unsafe_wrap, I believe the data still exists only on disk. However, Data.streamfrom also calls transform! on the result, which, I think in all cases, winds up copying the entire column into memory. It should be easy to take the result of unsafe_wrap and convert just one of its elements to the appropriate datatype without doing so for the entire column. Then we'd have a Data.streamfrom method that only actually reads the requested fields off disk.
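In sketch form the idea is something like the following, where read_field is a hypothetical helper and offset/nrows would come from the column's metadata:

# wrap the on-disk bytes without copying, then touch only one element
function read_field{T}(m::Vector{UInt8}, ::Type{T}, offset::Int, nrows::Int, row::Int)
    ptr = convert(Ptr{T}, pointer(m) + offset)
    col = unsafe_wrap(Array, ptr, nrows)   # no copy; still backed by the mmap
    col[row]                               # only this element's page gets read
end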

I'd be happy to make a PR if I'm understanding this correctly, but I'm afraid I might still not be. It's pretty hard to test, because Julia's memory-reporting tools like sizeof and whos also count the memory-mapped data, so I'm having a hard time knowing for sure what's on disk and what's in RAM.
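One trick that may help: measure allocations instead of sizes. Using the Data.Column signature from the snippet above, something like

Data.streamfrom(src, Data.Column, Vector{Float64}, 1)            # warm up / compile
bytes = @allocated Data.streamfrom(src, Data.Column, Vector{Float64}, 1)

should allocate on the order of the full column size if a copy is being made, and almost nothing if it isn't.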

Edit: I did indeed make a PR; see #45. It requires some fixes, which I'll make soon.

@ExpandingMan
Collaborator Author

This issue is no longer relevant.
