Sugar function sample with unit tests #610

Merged
merged 1 commit into from Dec 11, 2016

Projects

None yet

3 participants

@nathan-russell
Contributor

This PR adds a sugar function sample, with overloads for

  • a single integer as the first argument (behavior of base::sample.int)
  • a vector as the first argument (behavior of base::sample)
@codecov-io

Current coverage is 62.42% (diff: 100%)

Merging #610 into master will increase coverage by 0.02%

@@             master       #610   diff @@
==========================================
  Files            69         69          
  Lines          4798       4798          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2994       2995     +1   
+ Misses         1804       1803     -1   
  Partials          0          0          

Powered by Codecov. Last update 8e3076d...166b750

@eddelbuettel
Member
eddelbuettel commented Dec 11, 2016 edited

What you are doing to opening { is a criminal offence is most states but otherwise looks good at a first glance.

What will this mean for Christian's sample() in RcppArmadillo? They'll just be alternatives?

@nathan-russell
Contributor

Skimming through RcppArmadilloExtensions/sample.h, that also seems to be based on the code in src/main/random.c, so they should produce the same results. That version is slightly less efficient, however, since it makes two passes -- once to calculate the indices, and again to populate the result vector (by index lookup). At any rate, I think it is important to have a sample in Rcpp itself. For the most part having two versions should not cause any problems, since access to Christian's function requires an additional #include <RcppArmadilloExtensions/sample.h>, and it lives in the namespace Rcpp::RcppArmadillo::. The only exception might be code with gratuitous using namespace declarations, but that issue is addressable if need be.

@eddelbuettel
Member

Agree mostly -- the fact that @helmingstay (ie Christian) tucked his away carefully in an additional opt-in header file is good as will not clutter by default. User are more likely to find your proposed sample() now, and I agree also that it is good to have one here.

What I am currently (hey, barely into first coffee of the day) unsure about is whether we should mimick the results of R's sample() the way Christian's does. I mostly want to say "yep we do" as we are genereally trying to be faster replacements. Now, for something probabilistic, does not have to mean identical? Convince me why would not want to replicate (apart from the obvious "gives us easier/faster implementations".

@nathan-russell
Contributor

To clarify, I am generally in favor of replicating R's behavior as much as possible, the current situation included. Most importantly I think it is what users typically expect; and of somewhat lesser significance, it makes writing unit tests a bit easier, as the objectives are more defined.

My remark about slight algorithmic differences between what is used in RcppArmadillo (and similarly in base::sample) vs. the code in this PR was just that the former does random sampling of indices (only), and then uses those sampled indices to generate the result via index lookup (thus requiring a second loop), whereas I'm combining this into one step.

For example, R's version doesn't actually perform sampling on different SEXPs directly, it just uses the result of sample.int (which is calling the C function do_sample):

base::sample
# function (x, size, replace = FALSE, prob = NULL) 
# ...
#     else {
#         if (missing(size)) 
#             size <- length(x)
#         x[sample.int(length(x), size, replace, prob)]
#         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# ... 

Similarly, this approach is used in RcppArmadillo:

// ...
// copy the results into the return vector
for (ii=0; ii<size; ii++) {
    jj = index(ii);  // arma 
    
    ret[ii] = x[jj]; // templated
}
// ...

On the other hand, for each of the index-generating subroutines in src/main/random.c / RcppArmadilloExtensions/sample.h, I have an additional version that performs element lookup in the same loop where the sample index is originally determined. E.g. for the EmpiricalSample routine:

// ...
// `ref` is a Vector<RTYPE> 
// `ians` is a Vector<RTYPE>::iterator 
for ( ; ians != eans; ++ians) {
    int j = static_cast<int>(n * unif_rand());
    *ians = ref[x[j]];
//  ^^^^^^^^^^^^^^^^^^     
    x[j] = x[--n];
}
// ...

This of course required code duplication, but it seemed wasteful to calculate the indices and then make a separate pass to actually look up the result values.


Regardless of the approach taken, I think the ultimate goal should be for the result of [C++ namespace]::sample to match the result of base::sample, and based on the testing that I have done, the Rcpp::sample that I'm proposing does that. For the most part RcppArmadillo::sample also yields the same result as base::sample, but I've come across a few cases where it does not:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#include <RcppArmadilloExtensions/sample.h>

using namespace Rcpp;

// [[Rcpp::export]]
CharacterVector arma_sample(CharacterVector x, int sz, bool replace, NumericVector p) {
    return RcppArmadillo::sample(x, sz, replace, p);
}

// [[Rcpp::export]]
CharacterVector rcpp_sample(CharacterVector x, int sz, bool replace, NumericVector p) {
    return sample(x, sz, replace, p);
}
# small sample, no replacement
set.seed(123); s1 <- arma_sample(letters, 10, FALSE, rep(1, 26))
set.seed(123); s2 <- rcpp_sample(letters, 10, FALSE, rep(1, 26))
set.seed(123); s3 <- sample(letters, 10, FALSE, rep(1, 26))

all.equal(s1, s3)
# [1] "10 string mismatches"

all.equal(s2, s3)
# [1] TRUE


# small sample, with replacement
set.seed(123); s1 <- arma_sample(letters, 10, TRUE, rep(1, 26))
set.seed(123); s2 <- rcpp_sample(letters, 10, TRUE, rep(1, 26))
set.seed(123); s3 <- sample(letters, 10, TRUE, rep(1, 26))

all.equal(s1, s3)
# [1] "10 string mismatches"

all.equal(s2, s3)
# [1] TRUE


# should use Walker's Alias method
x <- rep(letters, length.out = 1e4)
px <- rep(1.0, length(x))
set.seed(123); s1 <- arma_sample(x, 10, TRUE, px)
set.seed(123); s2 <- rcpp_sample(x, 10, TRUE, px)
set.seed(123); s3 <- sample(x, 10, TRUE, px)

all.equal(s1, s3)
# [1] "3 string mismatches"

all.equal(s2, s3)
# [1] TRUE

I haven't investigated the full scope of the problem, but the RcppArmadillo version seems to be having some trouble with (some of) the versions that perform probabilistic sampling.

@eddelbuettel eddelbuettel merged commit 39edf69 into RcppCore:master Dec 11, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@eddelbuettel
Member

@nathan-russell I think I am seeing a regression. Conrad sent me a test release I am currently checking with rev.deps -- using Rcpp from master as well. Package markovchain now goes pair-shaped with

ccache g++ -std=c++11 -I/usr/share/R/include -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/tmp/RcppDepends/lib/RcppParallel/include" -I"/usr/local/lib/R/site-library/RcppArmadillo/include"   -fpic  -g -O3 -Wall -pipe -Wno-unused -pedantic -c 1_functions4Fitting.cpp -o 1_functions4Fitting.o
In file included from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/functions.h:89:0,
                 from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/sugar.h:31,
                 from /usr/local/lib/R/site-library/Rcpp/include/Rcpp.h:73,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadillo.h:34,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadilloExtensions/fixprob.h:25,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadilloExtensions/sample.h:30,
                 from 1_functions4Fitting.cpp:6:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h: In function ‘Rcpp::Vector<13> Rcpp::sugar::WalkerSample(const Rcpp::Vector<14, Rcpp::PreserveStorage>&, int, int, bool)’:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:160:42: error: expected primary-expression before ‘int’
         HL = static_cast<int*>(Calloc(n, int));
                                          ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:160:45: error: ‘Calloc’ was not declared in this scope
         HL = static_cast<int*>(Calloc(n, int));
                                             ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:161:44: error: expected primary-expression before ‘double’
         q = static_cast<double*>(Calloc(n, double));
                                            ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:200:16: error: ‘Free’ was not declared in this scope
         Free(HL);
                ^
In file included from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/functions.h:89:0,
                 from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/sugar.h:31,
                 from /usr/local/lib/R/site-library/Rcpp/include/Rcpp.h:73,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadillo.h:34,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadilloExtensions/fixprob.h:25,
                 from /usr/local/lib/R/site-library/RcppArmadillo/include/RcppArmadilloExtensions/sample.h:30,
                 from 1_functions4Fitting.cpp:6:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h: In function ‘Rcpp::Vector<RTYPE> Rcpp::sugar::WalkerSample(const Rcpp::Vector<14, Rcpp::PreserveStorage>&, int, const Rcpp::Vector<RTYPE>&)’:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:225:42: error: expected primary-expression before ‘int’
         HL = static_cast<int*>(Calloc(n, int));
                                          ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:225:45: error: there are no arguments to ‘Calloc’ that depend on a template parameter, so a declaration of ‘Calloc’ must be available [-fpermissive]
         HL = static_cast<int*>(Calloc(n, int));
                                             ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:225:45: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:226:44: error: expected primary-expression before ‘double’
         q = static_cast<double*>(Calloc(n, double));
                                            ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:226:50: error: there are no arguments to ‘Calloc’ that depend on a template parameter, so a declaration of ‘Calloc’ must be available [-fpermissive]
         q = static_cast<double*>(Calloc(n, double));
                                                  ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:265:16: error: there are no arguments to ‘Free’ that depend on a template parameter, so a declaration of ‘Free’ must be available [-fpermissive]
         Free(HL);
                ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:266:15: error: there are no arguments to ‘Free’ that depend on a template parameter, so a declaration of ‘Free’ must be available [-fpermissive]
         Free(q);
               ^
/usr/lib/R/etc/Makeconf:141: recipe for target '1_functions4Fitting.o' failed
make: *** [1_functions4Fitting.o] Error 1

If you have a moment, can you take a peek?

Tests are running on Ubuntu 16.04 with g++-5.4.0.

@eddelbuettel
Member

If this is one of these "intersections of sample vs sample" that we feared, should we maybe test for both header guards at the top of your implementation and hence not include if RcppArmadillo's -- which is after all opt-in -- got included already?

@nathan-russell
Contributor
nathan-russell commented Dec 12, 2016 edited

@eddelbuettel Yes I think this is the offending file in markovchain. Do you mean doing something like this in the sugar sample.h?

#if !defined(Rcpp__sugar__sample_h) && !defined(COMPILING_RCPPARMADILLO)
#define Rcpp__sugar__sample_h
// rest of sugar/functions/sample.h as before

Edit: Scratch that suggestion, I still get the error.

@eddelbuettel
Member
eddelbuettel commented Dec 12, 2016 edited

Yes, the

using namespace Rcpp;
using namespace RcppArmadillo;

is what we were afraid of. Question is how to move forward. Adding a very simple, very brute

#ifndef RCPPARMADILLO__EXTENSIONS__SAMPLE_H

at the top (and a matching #endif at the bottom) of your sample.h helps. It now builds again. We could at a minimum add this now, and then try to talk to the author of markovchains.

Better ideas?

@nathan-russell
Contributor

I think that's a great idea, as it will only skip over sugar/functions/sample.h if the RcppArmadillo extenstions file is requested, whereas checking for COMPILING_RCPPARMADILLO was more of a scorched-earth approach. I'll have a patch submitted in a couple of minutes.

@eddelbuettel
Member

I already tested it. I can commit straight -- less work for you. Let me know.

But if you're in a playful mood we could add a #warning or alike to yell at the user. Thoughts?

@nathan-russell
Contributor

Yes if you're ready to commit it then please do. While I do enjoy yelling at people, I think the #warning may generate a warning itself in -pedantic mode, warning: #warning is a GCC extension [enabled by default] according to this post, so it may not be worth the trouble.

@eddelbuettel
Member

Done in 07e045e.

@nathan-russell
Contributor

Awesome; thanks for catching this!

@eddelbuettel
Member
eddelbuettel commented Dec 12, 2016 edited

Side benefit of the (roughly monthly) RcppArmadillo update.

Sadly, that one threw up its own issue in code by Martin Vincent. But we'll get there.

@nathan-russell
Contributor

Another issue related to sample, or from updates in Armadillo itself?

@eddelbuettel
Member

Martin has code in sglOptim he uses in two other packages; now it is sort not sample but similar. Conrad added a few goodies for sparse matrices etc pp.

I'm sure it's simple too; we'll get it fixed.

@eddelbuettel
Member

@nathan-russell Another one, though I may be reading the tea leaves wrong.

Context: Aforementioned tests of RcppArmadillo (as in master right now); failure with three packages by @vincent-dk which are interdependent. The first is sglOptim which (in GitHub) is currently at 1.3.5. We need that updated (1 of 3) to check his other two lsgl and msgl (2 and 3 of 3).

Failure:

/tmp/RcppDepends/sglOptim/src(master)$ cat ../../sglOptim.Rcheck/00install.out * installing *source* package ‘sglOptim’ ...
** libs
ccache g++ -I/usr/share/R/include -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/tmp/RcppDepends/lib/RcppProgress/include" -I"/usr/local/lib/R/site-library/RcppArmadillo/include" -I"/usr/local/lib/R/site-library/BH/include"  -I'../inst/include' -fopenmp -DNDEBUG -fpic  -g -O3 -Wall -pipe -Wno-unused -pedantic -c only_for_testing.cpp -o only_for_testing.o
In file included from /usr/local/lib/R/site-library/Rcpp/include/Rcpp.h:68:0,
                 from /tmp/RcppDepends/lib/RcppProgress/include/progress.hpp:18,
                 from ../inst/include/sgl.h:24,
                 from only_for_testing.cpp:44:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/Nullable.h: In instantiation of ‘bool Rcpp::Nullable<T>::Rf_isNull() const [with T = Rcpp::Vector<14, Rcpp::PreserveStorage>]’:
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/Nullable.h:117:31:   required from ‘bool Rcpp::Nullable<T>::isNotNull() const [with T = Rcpp::Vector<14, Rcpp::PreserveStorage>]’
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/sugar/functions/sample.h:424:25:   required from here
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/Nullable.h:108:29: error: no matching function for call to ‘Rcpp::Nullable<Rcpp::Vector<14, Rcpp::PreserveStorage> >::Rf_isNull(SEXPREC* const&) const’
             return Rf_isNull(m_sexp);
                             ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/Nullable.h:106:21: note: candidate: bool Rcpp::Nullable<T>::Rf_isNull() const [with T = Rcpp::Vector<14, Rcpp::PreserveStorage>]
         inline bool isNull() const {
                     ^
/usr/local/lib/R/site-library/Rcpp/include/Rcpp/Nullable.h:106:21: note:   candidate expects 0 arguments, 1 provided
/usr/lib/R/etc/Makeconf:141: recipe for target 'only_for_testing.o' failed
make: *** [only_for_testing.o] Error 1
ERROR: compilation failed for package ‘sglOptim’
* removing ‘/tmp/RcppDepends/sglOptim.Rcheck/sglOptim’
/tmp/RcppDepends/sglOptim/src(master)$ 

I think this also points to your sample.h which has

typedef Nullable< Vector<REALSXP> > probs_t;

redefining Nullable which bites us later. I may be wrong here, of course. Other ideas? It is early and I had little coffee yet...

@nathan-russell
Contributor

Oh boy. I actually just found another couple of issues (on Windows only) -- one related to the inclusion of <alloca.h> in sample, and one related to time.h in the Date.h and Datetime.h headers -- both stemming from MinGW. I think I have patches for these -- just running the unit tests as we speak to comfirm -- which I can submit in a minute. Right after that, I will take a closer look at error related to Nullable.

@eddelbuettel
Member

Awesome. I cannot make sense of the Nullable issue. When @dcdillon and I were mucking with this and setting this up over a release or two we never ran into this :-/

MinGW probing much appreciated too.

@nathan-russell
Contributor
nathan-russell commented Dec 13, 2016 edited

Still trying to narrow things down here, but one of the build errors in sglOptim related to Nullable is stemming from its RcppProgress dependency. Specifically, there is a using namespace Rcpp; in the header file (header file!!!) progress.hpp:

// I must keep this because now some depedent packages rely on Rcpp namespace to be available
#include <Rcpp.h>
using namespace Rcpp;

While I'm not precisely sure how this relates to sample.h yet, this is a red flag, as anyone who includes progress.h is effectively pulling in the entire Rcpp namespace -- minus the namespace itself. Coupled with that there is a reference to the expansion of the R macro isNull:

C:/R/R-33~1.1/include/Rinternals.h:1192:18: note: bool Rcpp::Nullable::Rf_isNull() const [with T = Rcpp::Vector<14, Rcpp::PreserveStorage>]
#define isNull Rf_isNull
^
C:/R/R-3.3.1/library/Rcpp/include/Rcpp/Nullable.h:106:21: note: in expansion of macro 'isNull'
inline bool isNull() const {

which is defined in <Rinternals.h>. That file is being included in interrupts.hpp, which is included in the previously mentioned progress.h -- so that ends up in sglOptim as well.

@eddelbuettel
Member

Good work. I noticed that RcppProgress.h was pulled in, but didn't open it. What its author does there is almost criminal. Well, or comical. Depends on your mood.

@eddelbuettel
Member

Now reported here at RcppProgress

@nathan-russell
Contributor

Getting back to how the #include <Rinternals.h> before #include <Rcpp.h> problem relates to sample.h, it looks like that is the only header file using the Nullable class so far, which would explain why this situation is occurring for the first time.

@eddelbuettel
Member

And I guess it is in conjunction with the (forcefully) flattened namespace--else the R symbols would not bite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment