Skip to content

Sugar median function #424

@nathan-russell

Description

@nathan-russell

In light of a recent Stack Overflow question, I put together a sugar function corresponding to stats::median - more specifically, median.default. At the moment, I have two implementations worked out: one which has an extra boolean argument for specifying whether or not to remove missing values (like median.default's na.rm option), and one which does not (it just returns NA for input containing one or more missing values). I've posted what are more or less final drafts of these source files here.


With the na.rm option

The main benefit of this version is of course having the option to remove missing values - thus more closely conforming to median.default. As a result, however, the interface function is a little ugly:

template <int RTYPE, bool NA, typename T>
inline typename sugar::median_detail::result<RTYPE>::type
median(const Rcpp::VectorBase<RTYPE, NA, T>& x, bool na_rm = false) {
    switch (static_cast<int>(na_rm)) {
        case 0:
            return sugar::Median<RTYPE, NA, T, false>(x);
        case 1:
            return sugar::Median<RTYPE, NA, T, true>(x);
        default:
            return Rcpp::traits::get_na<RTYPE>();
    }
}

The na.rm is encoded as a template parameter, and after playing around with some TMP for a while and failing to find a more elegant solution, I settled on the switch statement logic above. I'm actually not sure that this is even avoidable, but am certainly open to suggestions if there are better alternatives.


Without the na.rm option

While this version obviously lacks some of the above functionality, it is more consistent with similar sugar functions such as mean and max which do not provide an na.rm option. Accordingly, the interface is more sugar-esque:

template <int RTYPE, bool NA, typename T>
inline sugar::Median<RTYPE, NA, T>
median(const Rcpp::VectorBase<RTYPE, NA, T>& x) {
    return sugar::Median<RTYPE, NA, T>(x);
}

General remarks

In the testing I've done so far, both versions seem to mirror median.default pretty closely. These are some slight differences that I've come across:

  • Attributes are not preserved on Date and POSIXt objects; this is also true of the Rcpp::mean, however.
  • For even-length character vectors, R returns NA and issues a warning, whereas my versions simply return NA. I'm not sure if median.default was actually intended to work with character vectors; I think it may be the case that odd-length character vectors just happen to work with the other generic functions used inside of median.default, because the warning that is issued comes from mean.default, not median.default. I still question whether there are legitimate use cases for calling median on a character vector, but regardless, I can add a warning message is desired.
  • median.default has a very, very strange way of handling logical vectors, sometimes returning a logical, sometimes an integer, and sometimes a floating point number. I don't think it's possible to replicate this from C++, nor do I think it's desirable:
median(c(FALSE, FALSE, TRUE, TRUE))
#[1] 0.5

median(c(FALSE, FALSE, TRUE))
#[1] FALSE

median(c(FALSE, TRUE, TRUE))
#[1] TRUE

median(c(FALSE, TRUE, TRUE, TRUE))
#[1] 1

median(c(FALSE, FALSE, FALSE, TRUE))
#[1] 0

This may be the quirkiest aspect of R I've come across to date.

The only other thing I want to address is the handling of (odd-length) character vectors, which is actually what caused me to bump into the issue with std::sort in #419. The INTSXP, REALSXP, and CPLXSXP versions of median utilize std::nth_element and std::max_element for a more efficient calculation. Since nth_element won't work with the string_proxy class, I just fell back to the Vector::sort member function for a less efficient, but safer, approach to getting the median value. But again, I can't imagine there are many warranted uses of median for character vectors, so hopefully no one will be too put off by the less efficient full sort used in the STRSXP specializations.

Any thoughts or concerns? (Apologies for the essay.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions