port CJ to C #3596
Conversation
fixed header
Codecov Report
@@            Coverage Diff             @@
##           master    #3596      +/-   ##
==========================================
+ Coverage   98.23%   98.24%   +<.01%
==========================================
  Files          66       67       +1
  Lines       12928    12972      +44
==========================================
+ Hits        12700    12744      +44
  Misses        228      228
Continue to review full report at Codecov.
      SET_VECTOR_ELT(out, j, this_col);
    } break;
    default:
      error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v)));
MichaelChirico
May 25, 2019
Author
Member
do I need to UNPROTECT in an error branch?
mattdowle
May 25, 2019
Member
That's one nice thing: R will clean up all objects on error so you don't need UNPROTECT.
Yay -- you're into C! Party!
We've been taking R API usage outside loops recently since R 3.5 added overhead. So take the REAL(), INTEGER() and LOGICAL() calls outside (see other C code in data.table for examples but look at files that have been more recently revised).
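For illustration, a minimal sketch of that hoisting (the variable names here are made up, not from the PR):
// Before: INTEGER() is called on every iteration, and since R 3.5 each call carries overhead.
for (int i = 0; i < n; ++i) INTEGER(dest)[i] = INTEGER(src)[i];
// After: call the R API once, then use plain C pointers inside the hot loop.
int *dp = INTEGER(dest);
const int *sp = INTEGER(src);
for (int i = 0; i < n; ++i) dp[i] = sp[i];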
MichaelChirico
May 25, 2019
Author
Member
Only the REAL calls? not INTEGER/LOGICAL?
jangorecki
May 25, 2019
Member
REAL, INTEGER and LOGICAL
}

# apply sorting
if (sorted && any(idx <- vapply_1b(l, is.list))) stop("'sorted' is TRUE but element ", which(idx), " is a list, which can't be sorted; try setting sorted = FALSE")
MichaelChirico
May 25, 2019
Author
Member
so it's not missed because of the diff from moving CJ to its own file -- this is the part that closes #3597
quick suggestion: use
case INTSXP: {
  SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++;
  int *this_v_ptr = INTEGER(this_v);
  int *this_col_ptr = INTEGER(this_col);
MichaelChirico
May 25, 2019
Author
Member
@mattdowle is this the proper way? (also for VECSXP/STRSXP cases it seems there's no analog?)
mattdowle
May 25, 2019
Member
Correct. You can't do it for VECSXP/STRSXP because that would be what's known as "behind the write barrier". You can partially do it though by reading (RHS) but not writing (LHS =).
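For illustration, a rough sketch of how the STRSXP branch could look (a sketch assuming the variable names from the surrounding diff, not the PR's exact code): the source element can be read directly on the right-hand side, but every write must go through SET_STRING_ELT because of the write barrier, so there is no output pointer to hoist.
case STRSXP: {
  SEXP this_col = PROTECT(allocVector(STRSXP, NN)); nprotect++;
  for (int i = 0; i < NN; ++i) {
    // RHS read is a plain accessor; LHS write must use the write-barrier API
    SET_STRING_ELT(this_col, i, STRING_ELT(this_v, (i % modulo) / div));
  }
  SET_VECTOR_ELT(out, j, this_col);
} break;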
mattdowle
May 25, 2019
Member
I'm off to bed now and it's a long weekend here -- so have fun in C! Where did you get the sense that CJ() was slow and that it could be sped up -- is the benchmark you put in this PR what motivated you or is there another one?
The idea from way back was that CJ() would not actually materialize the full table. It would just mark itself as a "CJ" object and then bmerge would know how to recycle from the irregular, non-expanded structure appropriately without needing to expand it.
MichaelChirico
May 25, 2019
Author
Member
No benchmark, just from staring at the code now and then and thinking "there's gotta be a faster way", and then happening upon the modulo/div version here in my scratch work on an unrelated issue, but alas. I'm still surprised how fast it is... rep.int is a beast. I guess rep.int is doing most of the work, and the recursion is only touching scalars instead of the full data (so it's not slowing things down).
The alt-rep-y form does sound promising (certainly way more memory efficient), though I do often use CJ outside of i, for populating cross-join tables for other reasons.
MichaelChirico
May 25, 2019
Author
Member
enjoy the weekend!
mattdowle
May 25, 2019
Member
yep: CJ uses rep.int, which is implemented in C and is an .Internal() R function too (which makes it fast to call in a loop). So really you're trying to beat C with C. Rprof on the current CJ might actually be misleading, as I think someone said Rprof doesn't count .Internal() R functions.
Are you definitely getting the same result in the same order? Doing "(i % modulo) / div" on every single iteration is inefficient because it's really just batches of increasing i. It's possible that all those % and / add up in this case (it's contiguous assign so cache effects shouldn't come into play).
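To illustrate the "batches of increasing i" point, a hedged sketch of a fill loop that avoids the per-row % and / entirely by writing contiguous runs (names follow the diff; nv and the run length div are assumptions):
const int nv = LENGTH(this_v);        // number of input values for this column
for (int i = 0; i < NN; ) {           // NN = total number of output rows
  for (int k = 0; k < nv; ++k) {
    const int val = this_v_ptr[k];
    for (int r = 0; r < div; ++r) this_col_ptr[i++] = val;  // one contiguous run of div copies
  }
}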
MichaelChirico
May 25, 2019
Author
Member
it's really just batches of increasing i
Not sure I follow this.
Latest commit splits out the first and last columns as separate cases to skip one of the modulo/division operations; gave about a 10% boost.
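A hedged sketch of those special cases (the j == 0 branch matches the diff further down; the last-column branch and the names ncols/nv are assumptions, not the PR's exact code):
if (j == 0) {
  // first column: only a division is needed
  for (int i = 0; i < NN; ++i) this_col_ptr[i] = this_v_ptr[i / div];
} else if (j == ncols - 1) {
  // last column: div == 1, so only a modulo is needed
  for (int i = 0; i < NN; ++i) this_col_ptr[i] = this_v_ptr[i % nv];
} else {
  // middle columns: both operations
  for (int i = 0; i < NN; ++i) this_col_ptr[i] = this_v_ptr[(i % modulo) / div];
}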
Updated timing after moving R API calls out of the loop as suggested. Getting close to matching the pure R version on current master.

I can't make heads or tails of the

It's a good case to try parallelizing. Follow the #omp pragma examples in reorder.c or subset.c maybe. Each column of CJ() result can be done independently in parallel. There isn't any random cache access (all sequential) so you should get good speedups; e.g. 2 threads should be close to twice as fast in this case. Can't go parallel for strings in this case, but logical, integer and real should be good (and factor is integer).

I think I'm not following how to parallelize across columns then... first iterate over the columns to see which can be done in parallel, then batch it out? Maybe parallelizing rows is the easier route?

OK I tried parallelizing across rows on the Since really what's been parallelized is the row index, are we safe to include the

Now that the basic idea is working & improves the current implementation, should we try and bring more of the logic into C? Not sure there's much benefit...

very nice PR
if (j == 0) {
  #pragma omp parallel for num_threads(getDTthreads())
  for (int i = 0; i < NN; i++) {
    this_col_ptr[i] = this_v_ptr[i / div];
jangorecki
May 25, 2019
Member
You iterate over all rows of the output; maybe we could instead memcpy blocks of rows, then iterate over the blocks? That can also be done in parallel, one block per thread. Whether we can do that depends on the expected order of results.
CJ(1:3, 1:5)
using pseudo code
for (i=0; i<3; i++) memcpy(addr_of_res() + i*size_of_int*5, addr_of(1:5), size_of_int*5)
i*size_of_int*5 is the memory offset: we write to the same vector but at different locations in it, which AFAIK is safe even from multiple threads because the blocks will not overlap.
MichaelChirico
May 25, 2019
Author
Member
Not quite sure I follow you here, sorry
jangorecki
May 25, 2019
Member
here is R code for that
library(data.table)
x = 1:3; nx = length(x)
y = 1:5; ny = length(y)
ans_y = vector("integer", nx*ny)
# pragma here
for (i in seq.int(nx)) ans_y[seq.int(ny)+(i-1L)*length(y)] = y
all.equal(CJ(x, y)$y, ans_y)
#[1] TRUE
MichaelChirico
May 25, 2019
Author
Member
ohh gotcha. just loop over y once, figure out where each element goes, and copy all at once, makes sense
jangorecki
May 25, 2019
Member
Looks trivial for 2 columns; for 3+ columns it gets slightly more complicated, but I believe this is the way to go if we want to do this optimally in C.
CJ(1:2, 1:3, 1:4)
V1 V2 V3
<int> <int> <int>
1: 1 1 1
2: 1 1 2
3: 1 1 3
4: 1 1 4
5: 1 2 1
6: 1 2 2
7: 1 2 3
8: 1 2 4
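For concreteness, a C sketch of the block-copy idea for the innermost column (not the PR's code; the function and variable names are illustrative, and getDTthreads() is the data.table thread helper seen in the diff above):
#include <string.h>       // memcpy
#include <Rinternals.h>

// Fill the last column of the cross join by copying the whole source vector
// once per block instead of computing an index per output row.
static void fill_last_column(SEXP ans_y, SEXP y, int nblocks) {
  const int ny = LENGTH(y);
  int *out = INTEGER(ans_y);
  const int *src = INTEGER(y);
  #pragma omp parallel for num_threads(getDTthreads())
  for (int i = 0; i < nblocks; ++i) {
    // blocks never overlap, so parallel writes are safe
    memcpy(out + (size_t)i * ny, src, sizeof(int) * (size_t)ny);
  }
}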
Please don't move CJ() at the R level into its own file. This breaks the commit history: CJ.R looks like new code and we can't see what changed inside it. If we did this for all functions then we'd have a ton of files. It's not perfect currently, but I don't struggle with the way things are organized: fairly similar functions live together in one file.
I can move CJ.R back.
With c41c1a4 and default 50% threads (4 out of 8 on my laptop):
Would be good to compare character type too.
With character ids:
ids = as.vector(outer(LETTERS, LETTERS, paste0))
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74)) # 5GB; 250m rows
#  user  system elapsed
# 3.030   0.950   3.963   # v1.12.2
# 1.563   1.298   1.910   # now
ids = as.factor(ids)
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))
#  user  system elapsed
# 2.334   0.833   3.152   # v1.12.2
# 0.513   1.122   0.409   # now
It just struck me to worry about how this interacts with ALTREP, are we covered there? Particularly since
Yes should be fine. INTEGER(), REAL() etc auto-expand ALTREPs. To work with ALTREPs without expanding them, you have to switch to use different R API calls. In this case of
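A small illustration of the distinction (an assumption added for illustration, not the PR's code): INTEGER(x) materializes a compact ALTREP such as 1:n before handing back a pointer, whereas the element accessor INTEGER_ELT() can read it without expanding.
#include <Rinternals.h>

// Sum an integer vector element by element via INTEGER_ELT, which works with
// a compact ALTREP; a single INTEGER(x) call here would materialize x instead.
long long sum_no_expand(SEXP x) {
  long long s = 0;
  const R_xlen_t n = XLENGTH(x);
  for (R_xlen_t i = 0; i < n; ++i) s += INTEGER_ELT(x, i);
  return s;
}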
Also closes #3597
Putting on my C training wheels... guidance on ways to improve appreciated🙌
Follow-up to https://github.com/orgs/Rdatatable/teams/project-members/discussions/17
Benchmarking with setDTthreads(2L). Parallelization working well here:
For the record the speed of the pure-R CJ is really quite impressive! It's basically the same speed as declaring an empty list with that number of rows: