Sugar functions - rowSums, colSums, rowMeans, colMeans #549

Closed
nathan-russell opened this Issue Sep 3, 2016 · 11 comments

Projects

None yet

3 participants

@nathan-russell
Contributor

I put together sugar functions for rowSums, colSums, rowMeans, and colMeans (for use with matrices, not data.frames) which all seem to be working as expected. However, early on in the process I switched from using Rcpp::traits::is_na<> to the R macro ISNAN to check for NAs / NaNs in numeric matrices because I noticed a large difference in performance. As an example,

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::LogicalVector macro_na(Rcpp::NumericVector x) {
    R_xlen_t i = 0, sz = x.size();
    Rcpp::LogicalVector res(sz);

    for ( ; i < sz; i++) {
        res[i] = ISNAN(x[i]);
    }

    return res;
}

// [[Rcpp::export]]
Rcpp::LogicalVector traits_na(Rcpp::NumericVector x) {
    R_xlen_t i = 0, sz = x.size();
    Rcpp::LogicalVector res(sz);

    for ( ; i < sz; i++) {
        res[i] = Rcpp::traits::is_na<REALSXP>(x[i]);
    }

    return res;
}

/*** R

set.seed(123); x <- sample(c(rnorm(4), NA), 1e6, TRUE)

microbenchmark::microbenchmark(
    "Macro" = macro_na(x),
    "Traits" = traits_na(x),
    times = 200L
)
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#   Macro  6.075743  6.158497  7.524288  6.259146  8.587615 26.61165   200
#  Traits 13.508780 13.621339 16.530404 13.753101 15.818473 92.80678   200

*/ 

I believe the difference is due to the memcmp call here. @kevinushey It looks like you authored this, so can you comment on whether or not it is safe to be using ISNAN in place of traits::is_na? I read the explanatory comment,

motivation: on 32bit architectures, we only see 'LargeNA'
as defined ahead; on 64bit architectures, R defaults to
'SmallNA' for R_NaReal, but this can get promoted to 'LargeNA'
if a certain operation can create a 'signalling' NA, e.g. NA_real_+1

and it made sense to me, but despite the fact that I am on a 64-bit machine with unsigned long long support, I was not able to reproduce the situation described. If I'm only using ISNAN to check values on input objects, is there any risk in it giving erroneous results? If so, is there some way to test this directly?

@kevinushey
Contributor

I originally added that code as it was supposed to be faster that the macro version, so I'm pretty surprised that's not the case here!

FWIW I think that these optimizations only kick in when compiling in C++11 mode (since we condition on RCPP_HAS_LONG_LONG); e.g. if I add // [[Rcpp::plugins(cpp11)]] I see the expected performance benefit:

> set.seed(123); x <- sample(c(rnorm(4), NA), 1e6, TRUE)

>     microbenchmark::microbenchmark(
+         "Macro" = macro_na(x),
+         "Traits" = traits_na(x),
+         times = 200L
+     )
Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval
  Macro 26.04574 26.63211 27.79379 26.87088 27.71247 71.07792   200
 Traits 13.27945 13.66297 14.23938 13.81028 14.02067 17.47713   200

When compiling in C++98 / the regular mode, we just delegate to R_IsNA / R_IsNaN, and IIUC this might confer worse performance as they may not be inlined relative to their macro counterparts.

@kevinushey
Contributor

FWIW, this code:

https://github.com/RcppCore/Rcpp/blob/3c09586a15bb565dd106fc8a39460f686588e74a/inst/include/Rcpp/traits/is_na.h#L40-L43

could likely be improved to only check for NaN. I remember there being a lot of confusion here though, since there's a bunch of things to keep in mind.

Firstly, R's NA is really a 'special kind' of floating point NaN value; that is, it's NaN with a magic number in the lower bits;

Secondly, it's often not clear what a particular R API does for missing values; I usually have to jump back to the sources to remind myself whether a particular routine checks:

  1. Whether a number is an IEEE floating point 'missing value' (NaN),
  2. Whether a number is NaN but not NA,
  3. Whether a number is specifically NA.

(note that the C version of the API is not the same as the R version of the API; I outlined a bit of that here: http://stackoverflow.com/questions/26241085/rcpp-function-check-if-missing-value/26262984#26262984)

@nathan-russell
Contributor
nathan-russell commented Sep 4, 2016 edited

Hmm well now I'm starting to wonder if the performance difference is platform-related. In a separate file I compiled traits_na with the C++11 plugin, and compared it against the other two. With GCC 4.9 on Debian 8 I get

microbenchmark::microbenchmark(
    "Macro" = macro_na(x),
    "Traits" = traits_na(x),
    "C++11 Traits" = traits_na11(x),
    times = 300L
)
# Unit: milliseconds
#          expr     min      lq    mean median      uq    max neval
#         Macro  6.0555  6.1323  6.6602  6.220  6.8115 12.828   300
#        Traits 13.4736 13.5947 14.6054 13.720 14.7225 68.902   300
#  C++11 Traits 18.3558 18.4797 19.6536 18.846 19.9843 36.756   300 

which is strangely even slower than the C++98 traits version. Running this test on Windows 7 I see the same pattern; macro version < C++98 traits version < C++11 traits version. At any rate, I can see why the C++98 version of Rcpp::traits::na would be slower, as calling R_IsNA and R_IsNaN involves a lot more than ISNAN, which ultimately seems to amount to a single call to C's isnan.


Circling back to my original question, I just want to make sure that it is okay to use ISNAN in lieu of traits::is_na for these functions. I feel that since the R versions are using ISNAN, then doing this is in an Rcpp implementation should at least produce consistent results, right?

@kevinushey
Contributor
kevinushey commented Sep 4, 2016 edited

ISNAN should be safe to use as a substitute here (it returns true for both of R's NA and NaN values). Or even std::isnan if you have included the appropriate headers, should be fine.

@kevinushey
Contributor
kevinushey commented Sep 4, 2016 edited

Now that I think about it, we should probably just delegate to std::isnan for our traits::is_na implementation. EDIT: oops, it's a C++11-only addition. :/

@nathan-russell
Contributor

What about falling back on R_isnancpp instead?

#include <Rcpp.h>

// [[Rcpp::export]]
bool is_nan(double x) {
    return R_isnancpp(x);
}

/*** R

is_nan(NA_real_)
# [1] TRUE

is_nan(NaN)
# [1] TRUE

is_nan(pi)
# [1] FALSE

*/

It seems like someone already did the hard work of making this portable.

@kevinushey
Contributor

Sounds good to me; I'd be open to accepting a PR if that change was made.

Honestly, the code I added for 'optimizing' NA checks in hindsight just looks dangerous / weird (ie, the code is too clever for its own good); I'd be open to simplifying things and calling pre-established APIs.

@eddelbuettel
Member

@nathan-russell Do you consider the rowSums, rowMeans, colSums, colMeans, ... to be urgent? I am otherwise at the point of calling what we have 0.12.7 and shipping it, meaning this would wait til 0.12.8 presumably in November. Is that ok with you?

@nathan-russell
Contributor

@eddelbuettel No it's not urgent, especially since the code hasn't been vetted yet. Please feel free to proceed with releasing 0.12.7 without this addition.

@nathan-russell
Contributor

@kevinushey Re: the traits::na code -- That sounds good. I'm going to try to wrap up this issue and see about looking into that separately, as the scope of that change is presumably much larger.

@nathan-russell nathan-russell added a commit to nathan-russell/Rcpp that referenced this issue Sep 4, 2016
@nathan-russell nathan-russell row/col-Sums/Means and unit tests - #549 e92b696
@eddelbuettel
Member

This should now be taken care of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment