reading only a subset of columns #36
Yep, definitely possible. I have some notes somewhere from when I started to sketch this out for all of DataStreams. Let me try to find those notes and see about applying this generically across all packages.
Great, that'd be a really good feature. It would be especially good if one could also select only a subset of rows; for me, that's really the only advantage a database has over this type of format. By the way, I can deserialize an 11 GB file in about 17 s, so I have no real need to resort to another format until I reach something like 60 GB, which is fantastic!
Another extremely important feature would be reading/writing only specific rows as well. Is this possible? I'm not even sure whether this format is designed for that sort of thing.
So, I've been doing some exploring. Right now it seems that @quinnj has implemented

```julia
src = Feather.Source(filename)
sch = Data.schema(src)
col = "A"
colnum = sch[col]
dtype = Data.types(sch)[colnum]
A = Data.streamfrom(src, Data.Column, Vector{dtype}, colnum)
```

However, if you attempt to use the method for reading single features

Edit: Ok, I think I understand what is going on slightly better now. If memory mapping is enabled, when the column is first converted into a Julia type using

I'd be happy to make a PR if I'm understanding this correctly, but I'm afraid I might still not be. It's pretty hard to test because all of Julia's memory reporting stuff like

Edit: I did indeed make a PR. See #45. Requires some fixes which I'll make soon.
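The snippet above can be wrapped into a small single-column reader. This is only a sketch based on that snippet; the function name `read_column` is my own illustration, not part of Feather.jl, and it assumes the DataStreams `Data` API shown above:

```julia
# Hypothetical helper assembled from the snippet in the comment above;
# `read_column` is illustrative, not an actual Feather.jl function.
function read_column(filename::AbstractString, col::String)
    src = Feather.Source(filename)       # open the file without materializing all columns
    sch = Data.schema(src)               # fetch the schema
    colnum = sch[col]                    # look up the column index by name
    dtype = Data.types(sch)[colnum]      # element type of that column
    # stream only the requested column into a Vector
    Data.streamfrom(src, Data.Column, Vector{dtype}, colnum)
end

# usage: A = read_column("data.feather", "A")
```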
This issue is no longer relevant. |
Obviously one can simply load the dataframe and select the desired columns, but I was wondering whether there would be room for a performance gain when only a subset of columns is needed. If I'm understanding the format correctly, I think the answer to this question is "yes", in which case it would be really nice if we could do something like
which would be useful on dataframes with large numbers of columns.
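A hypothetical interface for this (the `columns` keyword is my illustration only, not an existing Feather.jl API) might look like:

```julia
# Hypothetical API sketch; the `columns` keyword is illustrative, not real.
df = Feather.read("data.feather"; columns = ["A", "B", "C"])
```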