DTable interface consistency and initial docs #265
Conversation
krynju commented on Aug 14, 2021
- Made the map and reduce functions act more like their TableOperations and Base equivalents instead of trying to mimic DataFrames.
- map now returns a DTable.
- reduce returns a NamedTuple with the results of the per-column reductions. It works with init as in Base, and the columns to reduce can also be selected (see the sketch below).
- Added a documentation page about the DTable - it still needs better usage examples.
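A rough usage sketch of the interface described above (the DTable constructor arguments, the cols keyword name, and the need to fetch the reduce result are assumptions here for illustration, not confirmed API):

using Dagger: DTable

# Hypothetical input table, split into chunks of 100 rows
# (the chunksize argument is an assumption for illustration).
nt = (a = collect(1:1000), b = rand(1000))
dt = DTable(nt, 100)

# map now returns a DTable, so the result stays distributed
dt2 = map(row -> (a = row.a + 1, b = row.b), dt)

# reduce returns a NamedTuple of per-column results; init works as in
# Base.reduce, and the (assumed) cols keyword selects columns to reduce
r = reduce(+, dt2; init = 0.0, cols = [:a, :b])
fetch(r)  # e.g. (a = ..., b = ...)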
Note to self: adjust indexing for the Julia 1.3 AppVeyor CI, as v[begin] is not supported on 1.3.
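For context, begin inside an indexing expression was only introduced in Julia 1.4, so on 1.3 an equivalent spelling is needed, e.g.:

v = [10, 20, 30]
v[begin]          # errors on Julia 1.3; supported from 1.4 onwards
v[firstindex(v)]  # equivalent, works on 1.3
first(v)          # also fine when only the first element is needed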
Codecov Report

@@            Coverage Diff             @@
##           master     #265      +/-   ##
==========================================
  Coverage    0.00%     0.00%
==========================================
  Files          35        34       -1
  Lines        2724      2748      +24
==========================================
- Misses       2724      2748      +24

Continue to review full report at Codecov.
src/table/operations.jl (outdated)
mapped = chunk |> TableOperations.map(x -> (result = f_row(x)))
reduce(f_reduce, mapped; init=init)

chunk_reduce = (_f, _chunk, _cols, _init) -> begin
    values = [reduce(_f, Tables.getcolumn(_chunk, c); init=deepcopy(_init)) for c in _cols]
@jpsamaroo Do you think the per-column-within-chunk reduction should be done in separate tasks as well? I think this could be a potential performance improvement with bigger chunks.
We should definitely parallelize more rather than less, and parallelizing per-column within a chunk might give the scheduler better information about compute costs for reducing that column, since columns can have different types, and thus take more or less time to compute.
I've added the spawns per column inside, and:
- Performance-wise it's probably only an improvement with many columns (I need to look for a threshold sometime); for 2-4 columns it was usually a downgrade.
- Stability-wise it causes the "Eager scheduler error when @spawn inside a thunk" issue (#267), so for now I'll keep this commented out:
col_in_chunk_reduce = (_f, _c, _init, _chunk) -> reduce(_f, Tables.getcolumn(_chunk, _c); init=deepcopy(_init))

chunk_reduce = (_f, _chunk, _cols, _init) -> begin
    if length(_cols) <= 1
        v = [col_in_chunk_reduce(_f, c, _init, _chunk) for c in _cols]
    else
        values = [Dagger.spawn(col_in_chunk_reduce, _f, c, _init, _chunk) for c in _cols]
        v = fetch.(values)
    end
    (; zip(_cols, v)...)
end

construct_single_column = (_col, _chunk_results...) -> getindex.(_chunk_results, _col)
result_columns = [Dagger.@spawn construct_single_column(c, chunk_reduce_results...) for c in columns]

reduce_result_column = (_f, _c, _init) -> reduce(_f, _c; init=_init)
reduce_chunks = [Dagger.@spawn reduce_result_column(f, c, deepcopy(init)) for c in result_columns]
So this part first takes the per-chunk results, builds columns out of them, and then reduces those columns.
I tried treereduce instead of this and it was noticeably slower.
For the DTable, where we know there won't be more than a reasonable number of chunks, this could potentially always be faster than treereduce.
Is there any case where treereduce should be used instead? Maybe multi-machine distributed setups?
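For comparison, a tree reduction over the already-computed per-chunk NamedTuples would look roughly like the sketch below. This is only a generic illustration of the pattern being compared against, not Dagger's actual treereduce; it assumes f is associative (e.g. + or max) so that combining two partial results with f is valid. Each level spawns a thunk per pair, so the thunk count grows with the number of chunks, which relates to the scheduler-pressure point in the reply below.

# assumes Dagger is loaded (using Dagger)

# Combine two per-column partial results column by column with f.
combine = (f, x, y) -> NamedTuple{keys(x)}(map(k -> f(x[k], y[k]), keys(x)))

# Pairwise (tree) reduction over a vector of per-chunk NamedTuples.
tree_reduce_chunks = (f, chunk_results) -> begin
    while length(chunk_results) > 1
        parts = Iterators.partition(chunk_results, 2)
        chunk_results = [length(p) == 2 ? Dagger.@spawn(combine(f, p[1], p[2])) : p[1]
                         for p in parts]
    end
    fetch(only(chunk_results))
end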
Generating too many thunks is definitely detrimental to the scheduler right now, which I assume is what treereduce is doing. In the future I'll add support for lazy representations of operations directly in the scheduler, which will let us tell the scheduler, "Here's all the possible ways you can parallelize this operation, do what you think is most efficient".
jpsamaroo left a comment
Awesome work!
Co-authored-by: Julian Samaroo <jpsamaroo@jpsamaroo.me>
Thanks again!