
port CJ to C #3596

Merged: 17 commits merged into master (from cj_speedup), Jun 17, 2019

Conversation

MichaelChirico (Member) commented May 25, 2019:

Also closes #3597

Putting on my C training wheels... guidance on ways to improve appreciated 🙌

Follow-up to https://github.com/orgs/Rdatatable/teams/project-members/discussions/17

Benchmarking with setDTthreads(2L)

setDTthreads(2L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 2.823   0.578   1.707 
nrow(DT)
rm(DT); gc()
system.time(CJ_newR(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
 #   user  system elapsed 
 # 16.663   2.202  15.514
gc()
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.959   0.637   1.953 

Parallelization works well here:

setDTthreads(8L)
system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
  #  user  system elapsed 
  # 5.506   1.215   0.911 

For the record, the speed of the pure-R CJ is really quite impressive! It's basically the same speed as declaring an empty list with that number of rows:

system.time(replicate(10L, numeric(39916800L), simplify = FALSE))
#    user  system elapsed 
#   0.899   0.912   1.891 
system.time(CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
#    user  system elapsed 
#   5.994   0.648   1.920 
Commit: fixed header
MichaelChirico force-pushed the cj_speedup branch from 096bbef to 1dd3c40, May 25, 2019
codecov bot commented May 25, 2019:

Codecov Report

Merging #3596 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3596      +/-   ##
==========================================
+ Coverage   98.23%   98.24%   +<.01%     
==========================================
  Files          66       67       +1     
  Lines       12928    12972      +44     
==========================================
+ Hits        12700    12744      +44     
  Misses        228      228
Impacted Files Coverage Δ
src/init.c 100% <ø> (ø) ⬆️
R/setkey.R 98.69% <100%> (-0.06%) ⬇️
src/cj.c 100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eea0a7b...c16fa25.

src/cj.c Outdated
SET_VECTOR_ELT(out, j, this_col);
} break;
default:
error("Type '%s' not supported by CJ.", type2char(TYPEOF(this_v)));

MichaelChirico (Author), May 25, 2019:

do I need to UNPROTECT in an error branch?

mattdowle (Member), May 25, 2019:

That's one nice thing: R cleans up all PROTECTed objects on error, so you don't need to UNPROTECT in an error branch.
Yay -- you're into C! Party!
We've been moving R API calls outside loops recently, since R 3.5 added per-call overhead. So hoist the REAL(), INTEGER() and LOGICAL() calls out of the loop (see other C code in data.table for examples, but look at files that have been revised more recently).
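The hoisting pattern can be illustrated in plain C. This is a minimal sketch, not data.table's code: `get_int_ptr` is a hypothetical stand-in for R's `INTEGER()`/`REAL()`/`LOGICAL()` accessors (which since R 3.5 do extra bookkeeping on every call), with a counter so we can see how often the accessor actually runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for R's INTEGER()/REAL()/LOGICAL() accessors.
 * The counter lets us verify how many times the accessor runs. */
static int accessor_calls = 0;
static int data_buf[5] = {10, 20, 30, 40, 50};

static int *get_int_ptr(void) {   /* analogous to INTEGER(x) */
  accessor_calls++;
  return data_buf;
}

/* Slow pattern: accessor inside the loop -- called n times. */
static long sum_slow(size_t n) {
  long s = 0;
  for (size_t i = 0; i < n; i++) s += get_int_ptr()[i];
  return s;
}

/* Fast pattern: hoist the accessor out of the loop -- called once,
 * like writing `int *ptr = INTEGER(x);` before the loop. */
static long sum_fast(size_t n) {
  const int *p = get_int_ptr();
  long s = 0;
  for (size_t i = 0; i < n; i++) s += p[i];
  return s;
}
```

Both functions return the same sum; the only difference is how many times the (per-call-overhead) accessor fires.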

MichaelChirico (Author), May 25, 2019:

Only the REAL calls? not INTEGER/LOGICAL?

jangorecki (Member), May 25, 2019:

REAL, INTEGER and LOGICAL

R/CJ.R Outdated
}

# apply sorting
if (sorted && any(idx <- vapply_1b(l, is.list))) stop("'sorted' is TRUE but element ", which(idx), " is a list, which can't be sorted; try setting sorted = FALSE")

MichaelChirico (Author), May 25, 2019:

so it's not missed because of the diff from moving CJ to its own file -- this is the part that closes #3597

mattdowle (Member) commented May 25, 2019:

quick suggestion: use Rprof() to confirm the timing you've shown is all spent in the C code. It's feasible there's some surprising overhead at the R level before it gets to C. Best to double-check.

src/cj.c (review thread, resolved)
src/cj.c Outdated
case INTSXP: {
SEXP this_col = PROTECT(allocVector(INTSXP, NN)); nprotect++;
int *this_v_ptr = INTEGER(this_v);
int *this_col_ptr = INTEGER(this_col);

MichaelChirico (Author), May 25, 2019:

@mattdowle is this the proper way? (also for VECSXP/STRSXP cases it seems there's no analog?)

mattdowle (Member), May 25, 2019:

Correct. You can't do it for VECSXP/STRSXP because that would be what's known as "behind the write barrier". You can partially do it though by reading (RHS) but not writing (LHS =).

mattdowle (Member), May 25, 2019:

I'm off to bed now and it's a long weekend here -- so have fun in C! Where did you get the sense that CJ() was slow and that it could be sped up -- is the benchmark you put in this PR what motivated you or is there another one?
The idea from way-back was that CJ() would not actually materialize the full table. It would just mark itself as "CJ" object and then bmerge would know how to recycle from the irregular non-expanded structure appropriately without needing to expand it.

MichaelChirico (Author), May 25, 2019:

No benchmark, just from staring at the code now and then and thinking "there's gotta be a faster way", and then happening upon the modulo/div version in my scratch work on an unrelated issue. I'm still surprised how fast it is... rep.int is a beast. I guess rep.int is doing most of the work, and the recursion only touches scalars instead of the full data (so it's not slowing things down).

The alt-rep-y form does sound promising (certainly way more memory efficient), though I do use CJ often outside of i for populating the crossjoin tables for other reasons.

MichaelChirico (Author), May 25, 2019:

enjoy the weekend!

mattdowle (Member), May 25, 2019:

yep: CJ uses rep.int, which is implemented in C and is an .Internal() R function too (which makes it fast to call in a loop). So really you're trying to beat C with C. Rprof on the current CJ might actually be misleading, as I think someone said Rprof doesn't count .Internal() R functions.
Are you definitely getting the same result in the same order? Doing "(i % modulo) / div" on every single iteration is inefficient, because it's really just batches of increasing i. It's possible all those % and / add up in this case (it's a contiguous assign, so cache effects shouldn't come into play).

MichaelChirico (Author), May 25, 2019:

it's really just batches of increasing i

Not sure I follow this.

Latest commit splits out the first and last columns as separate cases to skip one of the % and / operations; that gave about a 10% boost.
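For reference, the modulo/div indexing being discussed can be sketched as follows. This is a simplified, R-API-free version (plain int arrays, no SEXP/PROTECT): for column j, `div` is the product of the lengths of all inputs to its right and `NN` is the total row count. For the first column the `%` can be dropped, and for the last column (`div == 1`) the `/` can be dropped, which are the special cases mentioned above.

```c
#include <assert.h>

/* Fill output column of a cross join using (i / div) % len indexing.
 * div = product of lengths of inputs to the right of this column;
 * NN  = total number of output rows (product of all lengths). */
static void fill_col(int *out, int NN, const int *v, int len, int div) {
  for (int i = 0; i < NN; i++)
    out[i] = v[(i / div) % len];
}
```

For CJ(1:2, 1:3) this gives column 1 as 1,1,1,2,2,2 (len=2, div=3) and column 2 as 1,2,3,1,2,3 (len=3, div=1), matching CJ's sorted output order.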

MichaelChirico (Member, Author) commented May 25, 2019:

Updated timing after moving R API calls out of the loop as suggested. Getting close to matching the pure R version on current master.

MichaelChirico (Member, Author) commented May 25, 2019:

I can't make heads or tails of the Rprof output, though it does look like the new version is (slightly) more memory efficient:

Rprof(interval = .005, memory.profiling = TRUE)
CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ"                3.195    99.22      3.200     99.38    1370.5
# "print.default"     0.010     0.31      0.010      0.31       0.2
# "[.data.table"      0.005     0.16      0.005      0.16       0.3
# "options"           0.005     0.16      0.005      0.16       0.3
# "pmatch"            0.005     0.16      0.005      0.16       0.0
# 
# $by.total
#                     total.time total.pct mem.total self.time self.pct
# "CJ"                     3.200     99.38    1370.5     3.195    99.22
# "<Anonymous>"            0.020      0.62       0.7     0.000     0.00
# "print.data.table"       0.020      0.62       0.7     0.000     0.00
# "print.default"          0.010      0.31       0.2     0.010     0.31
# "print"                  0.010      0.31       0.2     0.000     0.00
# "[.data.table"           0.005      0.16       0.3     0.005     0.16
# "options"                0.005      0.16       0.3     0.005     0.16
# "pmatch"                 0.005      0.16       0.0     0.005     0.16
# ".deparseOpts"           0.005      0.16       0.0     0.000     0.00
# "["                      0.005      0.16       0.3     0.000     0.00
# "char.trunc"             0.005      0.16       0.3     0.000     0.00
# "deparse"                0.005      0.16       0.0     0.000     0.00
# "do.call"                0.005      0.16       0.3     0.000     0.00
# "format.data.table"      0.005      0.16       0.3     0.000     0.00
# "format"                 0.005      0.16       0.3     0.000     0.00
# "FUN"                    0.005      0.16       0.3     0.000     0.00
# "lapply"                 0.005      0.16       0.3     0.000     0.00
# "name_dots"              0.005      0.16       0.0     0.000     0.00
# "rbindlist"              0.005      0.16       0.3     0.000     0.00
# "suppressWarnings"       0.005      0.16       0.3     0.000     0.00
# "tail.data.table"        0.005      0.16       0.3     0.000     0.00
# "tail"                   0.005      0.16       0.3     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 3.22

Rprof(interval = .005, memory.profiling = TRUE)
CJ_old(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11)
Rprof(NULL)
summaryRprof(memory = 'both')
# $by.self
#                 self.time self.pct total.time total.pct mem.total
# "CJ_old"            6.295    99.76      6.300     99.84    1472.0
# "[.data.table"      0.005     0.08      0.005      0.08       0.2
# "%in%"              0.005     0.08      0.005      0.08       0.1
# "print.default"     0.005     0.08      0.005      0.08       0.5
# 
# $by.total
#                    total.time total.pct mem.total self.time self.pct
# "CJ_old"                6.300     99.84    1472.0     6.295    99.76
# "<Anonymous>"           0.010      0.16       0.7     0.000     0.00
# "print.data.table"      0.010      0.16       0.7     0.000     0.00
# "[.data.table"          0.005      0.08       0.2     0.005     0.08
# "%in%"                  0.005      0.08       0.1     0.005     0.08
# "print.default"         0.005      0.08       0.5     0.005     0.08
# "["                     0.005      0.08       0.2     0.000     0.00
# "deparse"               0.005      0.08       0.1     0.000     0.00
# "head.data.table"       0.005      0.08       0.2     0.000     0.00
# "head"                  0.005      0.08       0.2     0.000     0.00
# "mode"                  0.005      0.08       0.1     0.000     0.00
# "name_dots"             0.005      0.08       0.1     0.000     0.00
# "print"                 0.005      0.08       0.5     0.000     0.00
# "rbindlist"             0.005      0.08       0.2     0.000     0.00
# 
# $sample.interval
# [1] 0.005
# 
# $sampling.time
# [1] 6.31
mattdowle (Member) commented May 25, 2019:

It's a good case to try parallelizing. Follow the #omp pragma examples in reorder.c or subset.c, maybe. Each column of the CJ() result can be filled independently in parallel. There isn't any random cache access (it's all sequential), so you should get good speedups; e.g. 2 threads should be close to twice as fast in this case. Can't go parallel for strings here, but logical, integer and real should be good (and factor is integer).
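A minimal sketch of what such a parallel fill might look like, again without the R API (the real code would follow the reorder.c / subset.c pragma style and use getDTthreads(); `NTHREADS` here is a stand-in). Each iteration writes a distinct `out[i]`, so the rows can be split across threads with a plain omp for; the code compiles and gives the same answer with or without -fopenmp.

```c
#include <assert.h>

#define NTHREADS 2  /* stand-in for getDTthreads() */

/* Parallel column fill: iterations are independent (each writes its own
 * out[i]), so a straightforward omp parallel for is safe. */
static void fill_col_parallel(int *out, int NN, const int *v, int len, int div) {
  #pragma omp parallel for num_threads(NTHREADS)
  for (int i = 0; i < NN; i++)
    out[i] = v[(i / div) % len];
}
```

Strings (STRSXP) can't go in such a loop because SET_STRING_ELT must be called from the main thread, which is why that case stays serial.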

Michael Chirico added 2 commits May 25, 2019
MichaelChirico (Member, Author) commented May 25, 2019:

Can't go parallel for strings in this case

I think I'm not following how to parallelize across columns then... first iterate over the columns to see which can be done in parallel, then batch it out?

Maybe parallelizing rows is the easier route?

MichaelChirico (Member, Author) commented May 25, 2019:

OK I tried parallelizing across rows on the INTEGER/REAL/LOGICAL branches, I think it worked well. With 2 threads, we equal the speed of the current implementation and still get the 10% memory savings. More threads and we do better than currently.

Since really what's been parallelized is the row index, are we safe to include the VECSXP and STRSXP cases in parallel as well?

MichaelChirico (Member, Author) commented May 25, 2019:

Now that the basic idea is working & improves the current implementation, should we try and bring more of the logic into C? Not sure there's much benefit...

@MichaelChirico MichaelChirico changed the title port CJ to C port CJ to C; also closes #3597 May 25, 2019
src/cj.c (review thread, resolved)
jangorecki (Member) commented May 25, 2019:

very nice PR

src/cj.c Outdated
if (j == 0) {
#pragma omp parallel for num_threads(getDTthreads())
for (int i = 0; i < NN; i++) {
this_col_ptr[i] = this_v_ptr[i / div];

jangorecki (Member), May 25, 2019:

You iterate over all rows of the output; maybe we could memcpy blocks of rows instead? Then iterate over blocks, which can also be done in parallel, one block per thread. Whether we can do that depends on the expected order of the results.

CJ(1:3, 1:5)

using pseudo code

for (i = 0; i < 3; i++) memcpy(addr_of_res() + i*size_of_int*5, addr_of(1:5), size_of_int*5)

i*size_of_int*5 is the memory offset: we write to the same vector but at different locations within it, which AFAIK is safe even from multiple threads because the blocks don't overlap.
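The block-copy idea above, as a runnable plain-C sketch for the second column of CJ(1:3, 1:5): that column is just the vector 1:5 repeated 3 times, so instead of computing an index per row we memcpy the whole source vector into 3 non-overlapping destination slots. Since the blocks never overlap, each memcpy could in principle be issued from a different thread.

```c
#include <assert.h>
#include <string.h>

/* Fill `out` (length len*reps) with `reps` back-to-back copies of v.
 * Non-overlapping destination blocks, so parallelizable per block. */
static void fill_by_blocks(int *out, const int *v, int len, int reps) {
  for (int i = 0; i < reps; i++)
    memcpy(out + (size_t)i * len, v, (size_t)len * sizeof(int));
}
```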

MichaelChirico (Author), May 25, 2019:

Not quite sure I follow you here, sorry

jangorecki (Member), May 25, 2019:

here is R code for that

library(data.table)
x = 1:3; nx = length(x)
y = 1:5; ny = length(y)
ans_y = vector("integer", nx*ny)
# pragma here
for (i in seq.int(nx)) ans_y[seq.int(ny)+(i-1L)*length(y)] = y

all.equal(CJ(x, y)$y, ans_y)
#[1] TRUE

MichaelChirico (Author), May 25, 2019:

ohh gotcha. just loop over y once, figure out where each element goes, and copy all at once, makes sense

jangorecki (Member), May 25, 2019:

Looks trivial for 2 columns; for 3+ columns it gets slightly more complicated, but I believe this is the way to go if we want to do it optimally in C.

CJ(1:2, 1:3, 1:4)
      V1    V2    V3
    <int> <int> <int>
 1:     1     1     1
 2:     1     1     2
 3:     1     1     3
 4:     1     1     4
 5:     1     2     1
 6:     1     2     2
 7:     1     2     3
 8:     1     2     4
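One way to state the 3+ column generalization: for column j, each element is repeated `each` times (the product of the lengths of the columns to its right) and the whole pattern is repeated `reps` times (the product of the lengths to its left), i.e. `rep(rep(v, each = each), times = reps)` in R terms. A minimal sketch (not the PR's code) where the inner runs of identical values are the spans a memcpy-style approach could copy in bulk:

```c
#include <assert.h>

/* Fill a cross-join column as rep(rep(v, each=each), times=reps).
 * each = product of lengths to the right; reps = product to the left.
 * Output length is reps * len * each. */
static void fill_col_rep(int *out, const int *v, int len, int each, int reps) {
  int k = 0;
  for (int r = 0; r < reps; r++)       /* whole-pattern repeats */
    for (int j = 0; j < len; j++)      /* each source element */
      for (int e = 0; e < each; e++)   /* run of identical values */
        out[k++] = v[j];
}
```

For CJ(1:2, 1:3, 1:4) above, column V2 has len=3, each=4, reps=2, producing 1111 2222 3333 twice, matching the printed table.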
src/cj.c (review thread, resolved)
@mattdowle mattdowle changed the title port CJ to C; also closes #3597 port CJ to C May 29, 2019
mattdowle (Member) commented May 29, 2019:

Please don't move CJ() at R level into its own file. This breaks the commit history: CJ.R looks like new code and we can't see what changed inside it. If we did this for all functions we'd have a ton of files. It's not perfect currently, but I don't struggle with the way things are organized: fairly similar functions live together in one file. I can move CJ.R back. cj.c, on the other hand, is new and appropriate as its own file. And in the end CJ() at R level will become much smaller as it moves down into cj.c, so it makes sense to keep the (shrinking) CJ() where it is (especially as we'll want to see in the diffs how it has shrunk).

@mattdowle mattdowle added this to the 1.12.4 milestone Jun 15, 2019
mattdowle (Member) commented Jun 15, 2019:

With c41c1a4 and the default 50% threads (4 out of 8 on my laptop):

> system.time(DT <- CJ(1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9, 1:10, 1:11))
   user  system elapsed   # v1.12.2
  0.966   0.288   1.216 

   user  system elapsed   # now
  0.212   0.434   0.205

Would be good to compare character type too.

@Rdatatable Rdatatable deleted a comment from MichaelChirico Jun 15, 2019
mattdowle (Member) commented Jun 17, 2019:

With character ids :

ids = as.vector(outer(LETTERS, LETTERS, paste0))
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))   # 5GB; 250m rows
#   user  system elapsed   
#  3.030   0.950   3.963  # v1.12.2
#  1.563   1.298   1.910  # now

ids = as.factor(ids)
system.time(DT <- CJ(ids, 1:72, 1:73, 1:74))
#   user  system elapsed
#  2.334   0.833   3.152  # v1.12.2
#  0.513   1.122   0.409  # now
@mattdowle mattdowle merged commit c9c8d09 into master Jun 17, 2019
4 checks passed
codecov/patch 100% of diff hit (target 98.23%)
Details
codecov/project 98.24% (+<.01%) compared to eea0a7b
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mattdowle mattdowle deleted the cj_speedup branch Jun 17, 2019
MichaelChirico (Member, Author) commented Jun 18, 2019:

It just struck me to worry about how this interacts with ALTREP; are we covered there? Particularly since CJ will often be used like CJ(1:10, 1:20)...

mattdowle (Member) commented Jun 18, 2019:

Yes, should be fine. INTEGER(), REAL() etc. auto-expand ALTREPs. To work with ALTREPs without expanding them, you'd have to switch to different R API calls. In the case of CJ() it's definitely not worth it, because the work grows as the product of the lengths.

3 participants