Skip to content

Added (local) kendall tau#18

Closed
CGMossa wants to merge 6 commits intoMoseleyBioinformaticsLab:masterfrom
CGMossa:rusty_kendall
Closed

Added (local) kendall tau#18
CGMossa wants to merge 6 commits intoMoseleyBioinformaticsLab:masterfrom
CGMossa:rusty_kendall

Conversation

@CGMossa
Copy link
Copy Markdown

@CGMossa CGMossa commented Sep 29, 2020

Just to get to the most important part:
A benchmark of the current rust implementation of this yields these results:

ici-kendall-tau 1000    time:   [8.9580 ms 8.9799 ms 9.0039 ms]
                        change: [-5.1194% -3.1332% -1.7031%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

There's something not right, and I cannot seem to decipher it:

0.019283283283283284
-0.044724724724724725
-0.015171171171171172
-0.0065025025025025026
0.022894894894894893
-0.030114114114114115
-0.006366366366366366
-0.028236236236236237
0.03296496496496497
0.04845645645645646

This is output from repeated simulation from standard normal for both x and y.
If I do the same with the r/rcpp-version I get:

   0.001641642
8598.523827828
   0.003627628
8598.495775776
8598.528076076
   0.023367367
   0.032224224
   0.030966967
   0.003787788
8598.512412412

i.e. replicate(10, ici_kendallt(rnorm(1000), rnorm(1000))).

@CGMossa
Copy link
Copy Markdown
Author

CGMossa commented Sep 29, 2020

A comment on the Rcpp version: I think the default type for the sum stuff should be integers. Summing integers should be faster than summing floats, but that might just be me.

@CGMossa
Copy link
Copy Markdown
Author

CGMossa commented Sep 29, 2020

So I tried to do this change to the rcpp code:


  unsigned int sum_concordant = 0;
  unsigned int sum_discordant = 0;
  unsigned int sum_x_ties = 0;
  unsigned int sum_y_ties = 0;
  unsigned int sum_tied_x = 0;
  unsigned int sum_tied_y = 0;
  unsigned int sum_tied_x_na = 0;
  unsigned int sum_tied_y_na = 0;
  unsigned int sum_all_na = 0;

And I think that is what broke the rcpp implementation. Reverting back yields results comparable to what my rust version is outputting.

@hunter-moseley
Copy link
Copy Markdown
Member

hunter-moseley commented Sep 29, 2020 via email

@CGMossa
Copy link
Copy Markdown
Author

CGMossa commented Sep 29, 2020 via email

@rmflight
Copy link
Copy Markdown
Member

Yes, my initial implementation used unsigned integers as well, but there is a division that happens in there by 2, and it doesn't always result in an integer (that is another discussion as to whether it should or not, currently it does not), and then gets added to everything else, and then we have a division at the end, so it was easier to just use floats.

@CGMossa
Copy link
Copy Markdown
Author

CGMossa commented Sep 30, 2020

Irregardless: I don't know if my intuition is correct anyways.

Parallel version

Outputs from parallel version:

0.027227227227227226
0.039711711711711714
0.0037877877877877876
-0.013341341341341342
-0.01497097097097097
0.029773773773773774
-0.01877877877877878
-0.018686686686686688
-0.00546946946946947
-0.0013013013013013013

The benchmarks from the parallel version:

ici-kendall-tau 1000    time:   [2.2719 ms 2.2907 ms 2.3115 ms]
                        change: [-75.978% -75.667% -75.401%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

Interesting, huh?

Just remove the .par_into_iter() and the version stops being the parallelized version..
@rmflight rmflight closed this Jan 6, 2021
@rmflight rmflight deleted the branch MoseleyBioinformaticsLab:master January 6, 2021 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants