DTable implementation #230
jpsamaroo left a comment:
Great work so far!
@krynju when you get the chance, let's try benchmarking this against raw DataFrames and see how we do. I suspect we'll probably want some scheduler changes to optimize concurrent thunk submission (which currently takes a global scheduler lock).
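For reference, a benchmark along those lines could look roughly like this (the `DTable` constructor, row-wise `map`, and `fetch` usage are assumptions about this PR's in-flux API):

```julia
using Dagger, DataFrames, BenchmarkTools

n = 1_000_000
df = DataFrame(a = rand(n), b = rand(n))

# Baseline: plain DataFrames, row-wise map
@btime map(r -> r.a + r.b, eachrow($df))

# DTable: same row-wise map over ~10 chunks of 100k rows
# (constructor signature assumed from this PR)
dt = DTable(df, 100_000)
@btime fetch(map(r -> r.a + r.b, $dt))
```

Comparing the two at several chunk sizes should also surface how much of the gap comes from the global scheduler lock.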
Hey guys, glad to see this exciting work; thanks for all the effort! I feel a little awkward making such a radical suggestion as merely an interested observer, but are you sure it would not be better to split this into a new package that depends on Dagger, rather than relying on `Requires`? I realize that currently this isn't a lot of code, so splitting it may seem premature, but it could easily grow over time into an enormous code base, so splitting it can make dependency logic much more straightforward, and it is often easier to split projects in their nascent stages than to do a massive reorganization later on. Another thing that separate packages can make much easier to manage is exports: it may be inappropriate to export some table-specific functions from Dagger (because they may use common symbols), but it may make more sense from a tables-specific package. Lastly, I'm not entirely sure of the status of …
I'm fine with moving this functionality out once it's ready to go; in the meantime we can keep it here for easy access to CI, and for testing in case we need/want to make scheduler changes as part of this work.
Yeah, I think Dagger can exist without a table, so the table functionality should be moved out as well. For practicality reasons, though, perhaps it's OK to keep it here for CI etc.
One alternative is to make it a subpackage: slightly annoying CI setup, but it makes joint development easier without bringing in unnecessary dependencies or using Requires.
@jpsamaroo I've been thinking about managing the chunks that contain the partitions in a scenario where the table is bigger than the available memory. My idea for now is to first adjust the chunk size by memory footprint instead of row count, to enable the rest of the functionality.
Both of these assume we would have some form of visibility into the memory available to a Processor, to figure out when to cache the chunks and when not to load them into memory. I would probably declare/infer that information at the OSProc level, so that memory availability could be handled even in a Distributed environment. A different problem is where to cache the data: the local environment is easy, but with Distributed we should probably cache the table chunks at some commonly available location.

I'm wondering how much of this should be part of the Dagger internals/scheduler and what should be part of the DTable. I could imagine a CachableChunk type for which you could define cache/restore functions, with the scheduler figuring out when to call cache and restore on these in order to make space in physical memory and prepare the chunks for the Tasks to execute. With something like that, the DTable wouldn't have to do much work to provide out-of-core processing. What's your take on this?
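To make the proposed contract concrete, here is a minimal sketch of the idea; `CachableChunk`, `cache!`, and `restore!` are all hypothetical names, not existing Dagger API:

```julia
using Serialization

# Hypothetical: a chunk whose data can be swapped out to an at-rest location
mutable struct CachableChunk{T}
    data::Union{T,Nothing}   # `nothing` while swapped out
    path::String             # local file here; could be a commonly available location
end

# The scheduler would call this to free physical memory
function cache!(c::CachableChunk)
    open(io -> serialize(io, c.data), c.path, "w")
    c.data = nothing
    return c
end

# The scheduler would call this before a task needs the chunk's data
function restore!(c::CachableChunk)
    c.data === nothing || return c   # already in memory
    c.data = open(deserialize, c.path)
    return c
end
```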
I've thought about swap-to-disk a lot recently; some initial thoughts are here: #214 (comment), which sound quite similar to what you're suggesting.
This is definitely a thing that the scheduler should know about and handle automatically. It's also relevant to the DArray, and generally to other kinds of operations that generate lots of data.
We already have cache/restore in the form of checkpointing available to thunks (#183), although what you describe is more of an automatic operation, whereas checkpointing is manual (although that can change). Regardless, I don't think we need a new chunk type; the existing `Chunk` should suffice.
It's my hope that we can handle all of these details in the scheduler, so that any Dagger computation can benefit from swap-to-disk. I don't think exposing it to the DTable would be necessary; any knobs could be controlled via multiple dispatch on the chunked data type.
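For context, the checkpointing mentioned above (#183) is driven by per-thunk functions; a rough sketch based on that API (keyword names as I understand them from the docs; returning `nothing` from `restore` is assumed to skip restoration):

```julia
using Dagger, Serialization

t = Dagger.delayed(sum;
    checkpoint = (thunk, result) ->
        open(io -> serialize(io, result), "checkpoint-sum.jls", "w"),
    restore = thunk ->
        isfile("checkpoint-sum.jls") ? open(deserialize, "checkpoint-sum.jls") :
                                       nothing)(1:1000)
collect(t)   # first run computes and checkpoints; later runs restore from disk
```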
So essentially there are three separate topics:
So 2 and 3 are separate because, in my mind, a chunk from point 2 becomes a chunk from point 3 after being successfully loaded once (and in case the partition gets dropped accidentally, it can be loaded again from that source).
I think we need that handle type (which can just be `Chunk`). What's missing here, in my mind, is an extra abstract type …
Given the above, the scheduler could be taught to move data from disk to memory when the data is necessary for a computation. It can also be taught that, when bringing a new piece of data into memory, if the memory requirement estimated for that data is greater than the memory limit of the processor (CPU RAM, GPU VRAM, etc.), then other chunks that are in memory right now will need to first be moved to disk and cleared from memory. An inability to perform this operation (say, because disk space ran out) would result in the scheduler setting an error state on that task of the DAG and anything that depends on it. In an interactive setting, the user could clean up more disk space or memory and then re-run the computation, and the scheduler should then retry from where it left off.

So what will be the format on disk for some arbitrary blob of data currently in memory? This should probably be determined by a hypothetical …

So how does all this work for the DTable? The DTable will need to auto-detect (based on the input table type or other mechanism) what …

Thoughts?
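A sketch of that eviction decision, with every name (`memory_limit`, `estimate_size`, `swap_to_disk!`) hypothetical:

```julia
# Before materializing `incoming` on `proc`, evict resident chunks until it fits.
# Returns false if eviction can't make enough room (e.g. disk full), in which
# case the scheduler would set an error state on the dependent tasks.
function ensure_memory!(proc, incoming, resident_chunks)
    needed = estimate_size(incoming)                       # e.g. Base.summarysize
    avail  = memory_limit(proc) -
             sum(estimate_size, resident_chunks; init = 0)
    for c in resident_chunks                               # naive order; LRU would be smarter
        avail >= needed && break
        swap_to_disk!(c) || return false                   # move to disk, clear from memory
        avail += estimate_size(c)
    end
    return avail >= needed
end
```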
Makes sense to me with my current understanding of the internals, although I have 3 questions:
I would make it a …
It would be placed straight into the DTable as-is.
At-rest compression of data would be best handled by a dedicated …
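As an illustration of what that at-rest layer might do (a sketch only; the dedicated type's name is elided above, and CodecZstd is just one possible codec):

```julia
using Serialization, CodecZstd

# Serialize through a Zstd compressor on the way to disk
function write_compressed(path, data)
    open(path, "w") do io
        stream = ZstdCompressorStream(io)
        serialize(stream, data)
        close(stream)   # flushes the compressor
    end
end

# Decompress and deserialize on the way back
function read_compressed(path)
    open(path, "r") do io
        stream = ZstdDecompressorStream(io)
        obj = deserialize(stream)
        close(stream)
        return obj
    end
end
```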
Are there any plans to support indexing? How would groupby and join work efficiently? I spoke to @quinnj last year about possibly adding partition and indexing helpers to Tables.jl; it might be a good time to restart that discussion.
Indexing: yes, I think it will be of great use for optimizing some operations. Groupby and join I'll really have to brainstorm later, once I'm done testing the ground with the "simple ops". In the context of these operations I'm mostly worried about memory management: once this shuffle/sort or any transformation happens, we'll be duplicating the table in memory. With no stable execution plan it's probably harder to decide when to free the previous chunks, especially in an interactive session.
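To illustrate the duplication concern: groupby/join generally needs a hash shuffle, and until the input chunks are freed, both the old and new partitionings are alive. A toy model (all names hypothetical, not the eventual DTable implementation):

```julia
# Re-partition rows by key hash; `chunks` is a vector of row vectors here.
function shuffle_by_key(chunks, key::Symbol, nparts::Int)
    out = [NamedTuple[] for _ in 1:nparts]
    for chunk in chunks
        for row in chunk
            p = Int(mod(hash(row[key]), nparts)) + 1   # destination partition
            push!(out[p], row)
        end
    end
    return out   # transient ~2x memory until `chunks` can be freed
end

chunks = [[(a = i, b = i % 3) for i in 1:100],
          [(a = i, b = i % 3) for i in 101:200]]
parts = shuffle_by_key(chunks, :b, 4)
```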
We might want to consider what's available with Tables and TableOperations now: https://youtu.be/Ryp0F0dHnLA |
I've seen that part of the interface, and it will be very useful here, both as a consumer of it and as a source. I've just added some small docs, fixed some consistency issues, and put the DataFrames dependency behind Requires. I think I can consider this initial scope ready as it is. The next TODOs I think can be summarized by this list: …
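The conditional dependency then looks roughly like this (a sketch; the included file name is hypothetical, and the UUID should be copied from DataFrames' Project.toml):

```julia
using Requires

function __init__()
    # Only load the DataFrames-specific constructors/sinks once the user
    # does `using DataFrames` themselves
    @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" begin
        include("dataframes_interop.jl")   # hypothetical file
    end
end
```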
This is exciting! Feel free to ping me on Tables.jl review stuff; happy to help. Or even to brainstorm the implementation; happy to chat on Slack or Discourse.
jpsamaroo left a comment:
Looking good, although we need to be careful about using closures, since they can add unacceptable compile-time costs.
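The underlying issue (a generic illustration, not code from this PR): every anonymous-function literal is its own type, so generic code receiving a fresh closure specializes and compiles anew, while a named top-level function is compiled once and reused:

```julia
rows = [(a = i, b = 2i) for i in 1:10]

# Fresh closure type at this call site: new specialization downstream
map(r -> r.a + r.b, rows)

# Named function: one stable type, compiled once
rowsum(r) = r.a + r.b
map(rowsum, rows)
```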
@jpsamaroo applied requested changes
jpsamaroo left a comment:
Great work! Just a few nitpicks and comments, and then I'd be happy to merge this! 😄
Thanks a ton! I'm looking forward to seeing this develop 😄