Description
In light of a recent Stack Overflow question, I put together a sugar function corresponding to stats::median (more specifically, median.default). At the moment, I have two implementations worked out: one which has an extra boolean argument for specifying whether or not to remove missing values (like median.default's na.rm option), and one which does not (it just returns NA for input containing one or more missing values). I've posted what are more or less final drafts of these source files here.
With the na.rm option
The main benefit of this version is of course having the option to remove missing values, thus more closely conforming to median.default. As a result, however, the interface function is a little ugly:
template <int RTYPE, bool NA, typename T>
inline typename sugar::median_detail::result<RTYPE>::type
median(const Rcpp::VectorBase<RTYPE, NA, T>& x, bool na_rm = false) {
    switch (static_cast<int>(na_rm)) {
        case 0:
            return sugar::Median<RTYPE, NA, T, false>(x);
        case 1:
            return sugar::Median<RTYPE, NA, T, true>(x);
        default:
            return Rcpp::traits::get_na<RTYPE>();
    }
}
The na.rm option is encoded as a template parameter, and after playing around with some TMP for a while and failing to find a more elegant solution, I settled on the switch statement logic above. I'm actually not sure that this is even avoidable, but am certainly open to suggestions if there are better alternatives.
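For discussion, here is a minimal standalone sketch of the same runtime-to-compile-time dispatch without Rcpp. The names median_impl and median_dispatch are hypothetical, NaN stands in for NA, and a plain std::vector<double> stands in for the sugar expression; none of this is taken from the posted source files. Since a bool has only two values, an if/else covers every case, and keeping a separate return statement per branch still works when the two instantiations are distinct types.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the sugar class: NA_RM is a compile-time flag.
// NaN plays the role of NA in this sketch.
template <bool NA_RM>
double median_impl(std::vector<double> v) {
    const double na = std::numeric_limits<double>::quiet_NaN();
    if (NA_RM) {
        // Drop missing values before computing the median.
        v.erase(std::remove_if(v.begin(), v.end(),
                               [](double d) { return std::isnan(d); }),
                v.end());
    } else {
        // Propagate NA if any element is missing.
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (std::isnan(v[i])) return na;
        }
    }
    if (v.empty()) return na;
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// The runtime flag selects one of the two instantiations.
inline double median_dispatch(const std::vector<double>& v, bool na_rm = false) {
    if (na_rm) {
        return median_impl<true>(v);
    }
    return median_impl<false>(v);
}
```

The branch itself does not go away, because a runtime value has to pick an instantiation somewhere, but the unreachable default case does.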
Without the na.rm option
While this version obviously lacks some of the above functionality, it is more consistent with similar sugar functions such as mean and max, which do not provide an na.rm option. Accordingly, the interface is more sugar-esque:
template <int RTYPE, bool NA, typename T>
inline sugar::Median<RTYPE, NA, T>
median(const Rcpp::VectorBase<RTYPE, NA, T>& x) {
    return sugar::Median<RTYPE, NA, T>(x);
}
General remarks
In the testing I've done so far, both versions seem to mirror median.default pretty closely. These are some slight differences that I've come across:
- Attributes are not preserved on Date and POSIXt objects; this is also true of Rcpp::mean, however.
- For even-length character vectors, R returns NA and issues a warning, whereas my versions simply return NA. I'm not sure if median.default was actually intended to work with character vectors; I think it may be the case that odd-length character vectors just happen to work with the other generic functions used inside of median.default, because the warning that is issued comes from mean.default, not median.default. I still question whether there are legitimate use cases for calling median on a character vector, but regardless, I can add a warning message if desired.
- median.default has a very, very strange way of handling logical vectors, sometimes returning a logical, sometimes an integer, and sometimes a floating point number. I don't think it's possible to replicate this from C++, nor do I think it's desirable:
median(c(FALSE, FALSE, TRUE, TRUE))
#[1] 0.5
median(c(FALSE, FALSE, TRUE))
#[1] FALSE
median(c(FALSE, TRUE, TRUE))
#[1] TRUE
median(c(FALSE, TRUE, TRUE, TRUE))
#[1] 1
median(c(FALSE, FALSE, FALSE, TRUE))
#[1] 0
This may be the quirkiest aspect of R I've come across to date.
The only other thing I want to address is the handling of (odd-length) character vectors, which is actually what caused me to bump into the issue with std::sort in #419. The INTSXP, REALSXP, and CPLXSXP versions of median utilize std::nth_element and std::max_element for a more efficient calculation. Since nth_element won't work with the string_proxy class, I just fell back to the Vector::sort member function for a less efficient, but safer, approach to getting the median value. But again, I can't imagine there are many warranted uses of median for character vectors, so hopefully no one will be too put off by the less efficient full sort used in the STRSXP specializations.
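To make the nth_element / max_element approach concrete, here is a rough standalone sketch for doubles. The function name median_nth is hypothetical, the input is assumed to contain no missing values, and this is not the code from the posted drafts. After std::nth_element places the upper-middle element, everything before it is no larger, so the lower-middle element of an even-length input is just the maximum of that left half, and no full sort is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: median via partial selection instead of a full sort.
// Takes the vector by value because nth_element reorders its input.
// Assumes the input contains no NaN / NA values.
inline double median_nth(std::vector<double> v) {
    if (v.empty()) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    const std::size_t n = v.size();
    const std::size_t half = n / 2;

    // Put the element of rank `half` in its sorted position;
    // everything before it is <= v[half].
    std::nth_element(v.begin(), v.begin() + half, v.end());
    const double upper = v[half];
    if (n % 2 == 1) {
        return upper;
    }

    // Even length: the lower-middle element is the largest of the left part.
    const double lower = *std::max_element(v.begin(), v.begin() + half);
    return (lower + upper) / 2.0;
}
```

For example, median_nth({4, 1, 3, 2}) selects 3 as the upper-middle element, finds 2 as the maximum of the left part, and averages them to 2.5.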
Any thoughts or concerns? (Apologies for the essay.)