DTable interface consistency and initial docs #265
Conversation
krynju commented on Aug 14, 2021
- Made the map and reduce functions act more like their TableOperations and Base equivalents instead of trying to mimic DataFrames.
- map now returns a DTable.
- reduce returns a NamedTuple with the results of the per-column reductions. It works with init as in Base, and the columns to reduce can also be selected (see the sketch below).
- Added a documentation page about the DTable - it still needs better usage examples.
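A rough usage sketch of the interface described above (the DTable constructor arguments, the cols keyword name, and the need to fetch the reduce result are assumptions here for illustration, not confirmed API):

using Dagger: DTable

# Hypothetical input table, split into chunks of 100 rows
# (the chunksize argument is an assumption for illustration).
nt = (a = collect(1:1000), b = rand(1000))
dt = DTable(nt, 100)

# map now returns a DTable, so the result stays distributed
dt2 = map(row -> (a = row.a + 1, b = row.b), dt)

# reduce returns a NamedTuple of per-column results; init works as in
# Base.reduce, and the (assumed) cols keyword selects columns to reduce
r = reduce(+, dt2; init = 0.0, cols = [:a, :b])
fetch(r)  # e.g. (a = ..., b = ...)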
Note to self: adjust indexing for the Julia 1.3 AppVeyor CI, as v[begin] is not supported on 1.3.
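For context, begin inside an indexing expression was only introduced in Julia 1.4, so on 1.3 an equivalent spelling is needed, e.g.:

v = [10, 20, 30]
v[begin]          # errors on Julia 1.3; supported from 1.4 onwards
v[firstindex(v)]  # equivalent, works on 1.3
first(v)          # also fine when only the first element is needed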
Codecov Report

@@            Coverage Diff             @@
##           master     #265      +/-   ##
==========================================
  Coverage    0.00%     0.00%
==========================================
  Files          35        34       -1
  Lines        2724      2748      +24
==========================================
- Misses       2724      2748      +24

Continue to review full report at Codecov.
src/table/operations.jl (outdated)
mapped = chunk |> TableOperations.map(x -> (result = f_row(x)))
reduce(f_reduce, mapped; init=init)

chunk_reduce = (_f, _chunk, _cols, _init) -> begin
    values = [reduce(_f, Tables.getcolumn(_chunk, c); init=deepcopy(_init)) for c in _cols]
@jpsamaroo Do you think the per-column-within-chunk reduction should be done in separate tasks as well? I think this could be a potential performance improvement with bigger chunks.
We should definitely parallelize more rather than less, and parallelizing per-column within a chunk might give the scheduler better information about compute costs for reducing that column, since columns can have different types, and thus take more or less time to compute.
I've added the spawns per column inside, and:
- Performance-wise it's probably only an improvement with many columns (I need to look for a threshold sometime); for 2-4 columns it was usually a downgrade.
- Stability-wise it causes the "Eager scheduler error when @spawn inside a thunk" issue (#267), so for now I'll keep this commented out:
col_in_chunk_reduce = (_f, _c, _init, _chunk) -> reduce(_f, Tables.getcolumn(_chunk, _c); init=deepcopy(_init))

chunk_reduce = (_f, _chunk, _cols, _init) -> begin
    if length(_cols) <= 1
        v = [col_in_chunk_reduce(_f, c, _init, _chunk) for c in _cols]
    else
        values = [Dagger.spawn(col_in_chunk_reduce, _f, c, _init, _chunk) for c in _cols]
        v = fetch.(values)
    end
    (; zip(_cols, v)...)
end

construct_single_column = (_col, _chunk_results...) -> getindex.(_chunk_results, _col)
result_columns = [Dagger.@spawn construct_single_column(c, chunk_reduce_results...) for c in columns]

reduce_result_column = (_f, _c, _init) -> reduce(_f, _c; init=_init)
reduce_chunks = [Dagger.@spawn reduce_result_column(f, c, deepcopy(init)) for c in result_columns]
So this part first takes the per-chunk results, builds columns out of them, and then reduces those columns.
I tried treereduce instead of this and it was noticeably slower.
For the DTable, where we know there won't be more than a reasonable number of chunks, this could potentially always be faster than treereduce.
Is there any case where treereduce should be used instead? Maybe multi-machine distributed setups?
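For comparison, a tree reduction over the already-computed per-chunk NamedTuples would look roughly like the sketch below. This is only a generic illustration of the pattern being compared against, not Dagger's actual treereduce; it assumes f is associative (e.g. + or max) so that combining two partial results with f is valid. Each level spawns a thunk per pair, so the thunk count grows with the number of chunks, which relates to the scheduler-pressure point in the reply below.

# assumes Dagger is loaded (using Dagger)

# Combine two per-column partial results column by column with f.
combine = (f, x, y) -> NamedTuple{keys(x)}(map(k -> f(x[k], y[k]), keys(x)))

# Pairwise (tree) reduction over a vector of per-chunk NamedTuples.
tree_reduce_chunks = (f, chunk_results) -> begin
    while length(chunk_results) > 1
        parts = Iterators.partition(chunk_results, 2)
        chunk_results = [length(p) == 2 ? Dagger.@spawn(combine(f, p[1], p[2])) : p[1]
                         for p in parts]
    end
    fetch(only(chunk_results))
end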
Generating too many thunks is definitely detrimental to the scheduler right now, which I assume is what treereduce is doing. In the future I'll add support for lazy representations of operations directly in the scheduler, which will let us tell the scheduler, "Here's all the possible ways you can parallelize this operation, do what you think is most efficient".
jpsamaroo left a comment
Awesome work!
Co-authored-by: Julian Samaroo <jpsamaroo@jpsamaroo.me>
Thanks again!