Sugar functions - rowSums, colSums, rowMeans, colMeans #549

nathan-russell · 2016-09-03T02:01:49Z

I put together sugar functions for rowSums, colSums, rowMeans, and colMeans (for use with matrices, not data.frames) which all seem to be working as expected. However, early on in the process I switched from using Rcpp::traits::is_na<> to the R macro ISNAN to check for NAs / NaNs in numeric matrices because I noticed a large difference in performance. As an example,

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::LogicalVector macro_na(Rcpp::NumericVector x) {
    R_xlen_t i = 0, sz = x.size();
    Rcpp::LogicalVector res(sz);

    for ( ; i < sz; i++) {
        res[i] = ISNAN(x[i]);
    }

    return res;
}

// [[Rcpp::export]]
Rcpp::LogicalVector traits_na(Rcpp::NumericVector x) {
    R_xlen_t i = 0, sz = x.size();
    Rcpp::LogicalVector res(sz);

    for ( ; i < sz; i++) {
        res[i] = Rcpp::traits::is_na<REALSXP>(x[i]);
    }

    return res;
}

/*** R

set.seed(123); x <- sample(c(rnorm(4), NA), 1e6, TRUE)

microbenchmark::microbenchmark(
    "Macro" = macro_na(x),
    "Traits" = traits_na(x),
    times = 200L
)
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#   Macro  6.075743  6.158497  7.524288  6.259146  8.587615 26.61165   200
#  Traits 13.508780 13.621339 16.530404 13.753101 15.818473 92.80678   200

*/

I believe the difference is due to the memcmp call here. @kevinushey It looks like you authored this, so can you comment on whether or not it is safe to be using ISNAN in place of traits::is_na? I read the explanatory comment,

motivation: on 32bit architectures, we only see 'LargeNA'
as defined ahead; on 64bit architectures, R defaults to
'SmallNA' for R_NaReal, but this can get promoted to 'LargeNA'
if a certain operation can create a 'signalling' NA, e.g. NA_real_+1

and it made sense to me, but despite the fact that I am on a 64-bit machine with unsigned long long support, I was not able to reproduce the situation described. If I'm only using ISNAN to check values on input objects, is there any risk in it giving erroneous results? If so, is there some way to test this directly?

The text was updated successfully, but these errors were encountered:

kevinushey · 2016-09-03T23:40:28Z

I originally added that code as it was supposed to be faster that the macro version, so I'm pretty surprised that's not the case here!

FWIW I think that these optimizations only kick in when compiling in C++11 mode (since we condition on RCPP_HAS_LONG_LONG); e.g. if I add // [[Rcpp::plugins(cpp11)]] I see the expected performance benefit:

> set.seed(123); x <- sample(c(rnorm(4), NA), 1e6, TRUE)

>     microbenchmark::microbenchmark(
+         "Macro" = macro_na(x),
+         "Traits" = traits_na(x),
+         times = 200L
+     )
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
  Macro 26.04574 26.63211 27.79379 26.87088 27.71247 71.07792   200
 Traits 13.27945 13.66297 14.23938 13.81028 14.02067 17.47713   200

When compiling in C++98 / the regular mode, we just delegate to R_IsNA / R_IsNaN, and IIUC this might confer worse performance as they may not be inlined relative to their macro counterparts.

kevinushey · 2016-09-03T23:51:58Z

FWIW, this code:

Rcpp/inst/include/Rcpp/traits/is_na.h

Lines 40 to 43 in 3c09586

    
           template <> 
        
           inline bool is_na<REALSXP>(double x) { 
        
               return internal::Rcpp_IsNA(x) || internal::Rcpp_IsNaN(x); 
        
           }

could likely be improved to only check for NaN. I remember there being a lot of confusion here though, since there's a bunch of things to keep in mind.

Firstly, R's NA is really a 'special kind' of floating point NaN value; that is, it's NaN with a magic number in the lower bits;

Secondly, it's often not clear what a particular R API does for missing values; I usually have to jump back to the sources to remind myself whether a particular routine checks:

Whether a number is an IEEE floating point 'missing value' (NaN),
Whether a number is NaN but not NA,
Whether a number is specifically NA.

(note that the C version of the API is not the same as the R version of the API; I outlined a bit of that here: http://stackoverflow.com/questions/26241085/rcpp-function-check-if-missing-value/26262984#26262984)

nathan-russell · 2016-09-04T01:21:12Z

Hmm well now I'm starting to wonder if the performance difference is platform-related. In a separate file I compiled traits_na with the C++11 plugin, and compared it against the other two. With GCC 4.9 on Debian 8 I get

microbenchmark::microbenchmark(
    "Macro" = macro_na(x),
    "Traits" = traits_na(x),
    "C++11 Traits" = traits_na11(x),
    times = 300L
)
# Unit: milliseconds
#          expr     min      lq    mean median      uq    max neval
#         Macro  6.0555  6.1323  6.6602  6.220  6.8115 12.828   300
#        Traits 13.4736 13.5947 14.6054 13.720 14.7225 68.902   300
#  C++11 Traits 18.3558 18.4797 19.6536 18.846 19.9843 36.756   300

which is strangely even slower than the C++98 traits version. Running this test on Windows 7 I see the same pattern; macro version < C++98 traits version < C++11 traits version. At any rate, I can see why the C++98 version of Rcpp::traits::na would be slower, as calling R_IsNA and R_IsNaN involves a lot more than ISNAN, which ultimately seems to amount to a single call to C's isnan.

Circling back to my original question, I just want to make sure that it is okay to use ISNAN in lieu of traits::is_na for these functions. I feel that since the R versions are using ISNAN, then doing this is in an Rcpp implementation should at least produce consistent results, right?

kevinushey · 2016-09-04T03:14:30Z

ISNAN should be safe to use as a substitute here (it returns true for both of R's NA and NaN values). Or even std::isnan if you have included the appropriate headers, should be fine.

kevinushey · 2016-09-04T03:17:41Z

Now that I think about it, we should probably just delegate to std::isnan for our traits::is_na implementation. EDIT: oops, it's a C++11-only addition. :/

nathan-russell · 2016-09-04T03:39:20Z

What about falling back on R_isnancpp instead?

#include <Rcpp.h>

// [[Rcpp::export]]
bool is_nan(double x) {
    return R_isnancpp(x);
}

/*** R

is_nan(NA_real_)
# [1] TRUE

is_nan(NaN)
# [1] TRUE

is_nan(pi)
# [1] FALSE

*/

It seems like someone already did the hard work of making this portable.

kevinushey · 2016-09-04T05:41:35Z

Sounds good to me; I'd be open to accepting a PR if that change was made.

Honestly, the code I added for 'optimizing' NA checks in hindsight just looks dangerous / weird (ie, the code is too clever for its own good); I'd be open to simplifying things and calling pre-established APIs.

eddelbuettel · 2016-09-04T12:34:52Z

@nathan-russell Do you consider the rowSums, rowMeans, colSums, colMeans, ... to be urgent? I am otherwise at the point of calling what we have 0.12.7 and shipping it, meaning this would wait til 0.12.8 presumably in November. Is that ok with you?

nathan-russell · 2016-09-04T12:56:00Z

@eddelbuettel No it's not urgent, especially since the code hasn't been vetted yet. Please feel free to proceed with releasing 0.12.7 without this addition.

nathan-russell · 2016-09-04T12:58:50Z

@kevinushey Re: the traits::na code -- That sounds good. I'm going to try to wrap up this issue and see about looking into that separately, as the scope of that change is presumably much larger.

row/col-Sums/Means and unit tests - for #549

eddelbuettel · 2016-09-05T12:46:27Z

This should now be taken care of.

nathan-russell added a commit to nathan-russell/Rcpp that referenced this issue Sep 4, 2016

row/col-Sums/Means and unit tests - RcppCore#549

e92b696

nathan-russell mentioned this issue Sep 4, 2016

row/col-Sums/Means and unit tests - for #549 #551

Merged

eddelbuettel added a commit that referenced this issue Sep 5, 2016

Merge pull request #551 from nathan-russell/master

f75ff3b

row/col-Sums/Means and unit tests - for #549

eddelbuettel closed this as completed Sep 5, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sugar functions - rowSums, colSums, rowMeans, colMeans #549

Sugar functions - rowSums, colSums, rowMeans, colMeans #549

nathan-russell commented Sep 3, 2016

kevinushey commented Sep 3, 2016

kevinushey commented Sep 3, 2016

nathan-russell commented Sep 4, 2016 •

edited

Loading

kevinushey commented Sep 4, 2016 •

edited

Loading

kevinushey commented Sep 4, 2016 •

edited

Loading

nathan-russell commented Sep 4, 2016

kevinushey commented Sep 4, 2016

eddelbuettel commented Sep 4, 2016

nathan-russell commented Sep 4, 2016

nathan-russell commented Sep 4, 2016

eddelbuettel commented Sep 5, 2016

Sugar functions - rowSums, colSums, rowMeans, colMeans #549

Sugar functions - rowSums, colSums, rowMeans, colMeans #549

Comments

nathan-russell commented Sep 3, 2016

kevinushey commented Sep 3, 2016

kevinushey commented Sep 3, 2016

nathan-russell commented Sep 4, 2016 • edited Loading

kevinushey commented Sep 4, 2016 • edited Loading

kevinushey commented Sep 4, 2016 • edited Loading

nathan-russell commented Sep 4, 2016

kevinushey commented Sep 4, 2016

eddelbuettel commented Sep 4, 2016

nathan-russell commented Sep 4, 2016

nathan-russell commented Sep 4, 2016

eddelbuettel commented Sep 5, 2016

nathan-russell commented Sep 4, 2016 •

edited

Loading

kevinushey commented Sep 4, 2016 •

edited

Loading

kevinushey commented Sep 4, 2016 •

edited

Loading