Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sugar median function #424

Closed
nathan-russell opened this issue Jan 18, 2016 · 7 comments
Closed

Sugar median function #424

nathan-russell opened this issue Jan 18, 2016 · 7 comments

Comments

@nathan-russell
Copy link
Contributor

In light of a recent Stack Overflow question, I put together a sugar function corresponding to stats::median - more specifically, median.default. At the moment, I have two implementations worked out: one which has an extra boolean argument for specifying whether or not to remove missing values (like median.default's na.rm option), and one which does not (it just returns NA for input containing one or more missing values). I've posted what are more or less final drafts of these source files here.


With the na.rm option

The main benefit of this version is of course having the option to remove missing values - thus more closely conforming to median.default. As a result, however, the interface function is a little ugly:

template <int RTYPE, bool NA, typename T>
inline typename sugar::median_detail::result<RTYPE>::type
median(const Rcpp::VectorBase<RTYPE, NA, T>& x, bool na_rm = false) {
    switch (static_cast<int>(na_rm)) {
        case 0:
            return sugar::Median<RTYPE, NA, T, false>(x);
        case 1:
            return sugar::Median<RTYPE, NA, T, true>(x);
        default:
            return Rcpp::traits::get_na<RTYPE>();
    }
}

The na.rm is encoded as a template parameter, and after playing around with some TMP for a while and failing to find a more elegant solution, I settled on the switch statement logic above. I'm actually not sure that this is even avoidable, but am certainly open to suggestions if there are better alternatives.


Without the na.rm option

While this version obviously lacks some of the above functionality, it is more consistent with similar sugar functions such as mean and max which do not provide an na.rm option. Accordingly, the interface is more sugar-esque:

template <int RTYPE, bool NA, typename T>
inline sugar::Median<RTYPE, NA, T>
median(const Rcpp::VectorBase<RTYPE, NA, T>& x) {
    return sugar::Median<RTYPE, NA, T>(x);
}

General remarks

In the testing I've done so far, both versions seem to mirror median.default pretty closely. These are some slight differences that I've come across:

  • Attributes are not preserved on Date and POSIXt objects; this is also true of the Rcpp::mean, however.
  • For even-length character vectors, R returns NA and issues a warning, whereas my versions simply return NA. I'm not sure if median.default was actually intended to work with character vectors; I think it may be the case that odd-length character vectors just happen to work with the other generic functions used inside of median.default, because the warning that is issued comes from mean.default, not median.default. I still question whether there are legitimate use cases for calling median on a character vector, but regardless, I can add a warning message is desired.
  • median.default has a very, very strange way of handling logical vectors, sometimes returning a logical, sometimes an integer, and sometimes a floating point number. I don't think it's possible to replicate this from C++, nor do I think it's desirable:
median(c(FALSE, FALSE, TRUE, TRUE))
#[1] 0.5

median(c(FALSE, FALSE, TRUE))
#[1] FALSE

median(c(FALSE, TRUE, TRUE))
#[1] TRUE

median(c(FALSE, TRUE, TRUE, TRUE))
#[1] 1

median(c(FALSE, FALSE, FALSE, TRUE))
#[1] 0

This may be the quirkiest aspect of R I've come across to date.

The only other thing I want to address is the handling of (odd-length) character vectors, which is actually what caused me to bump into the issue with std::sort in #419. The INTSXP, REALSXP, and CPLXSXP versions of median utilize std::nth_element and std::max_element for a more efficient calculation. Since nth_element won't work with the string_proxy class, I just fell back to the Vector::sort member function for a less efficient, but safer, approach to getting the median value. But again, I can't imagine there are many warranted uses of median for character vectors, so hopefully no one will be too put off by the less efficient full sort used in the STRSXP specializations.

Any thoughts or concerns? (Apologies for the essay.)

@eddelbuettel
Copy link
Member

I like median() a lot. There was also an older discussion on rolling median on SO that I was part of.

I do recall from those discussions, and from having looked at the stats::median code that there was quite a bit of detail to cover. Or am I now confusing median with running median? In any event, it would be nice to have something in sugar, and it would be equally nice it was rigorously tested against what base R has. As we have good reason to believe that is once again a reference implementation.

Oh, and what you have above for logicals may be bug-report worthy.

@nathan-russell
Copy link
Contributor Author

Thanks for the feedback Dirk. Srunmed looks quite a bit more involved, but could probably be cut much more cleanly in C++ thanks to the STL. At any rate, I can start putting together some more formal unit tests and a PR if desired. Which version of median would you prefer to go with?

And I will look into submitting something on the R Bugs tracker. I can't imagine that behavior is intended...

@eddelbuettel
Copy link
Member

I think static (ie frozen data) median is probably much easier -- let's start there. I was obsessing over running medians, and those are harder as need to maintain data structures to easily add and drop. No good solution there yet...

@nathan-russell
Copy link
Contributor Author

That sounds good - but I was more referring to the two ordinary median implementations I linked to here. Would you rather go with the version that has the option of dropping NAs, e.g.

template <int RTYPE, bool NA, typename T>
inline typename sugar::median_detail::result<RTYPE>::type
median(const Rcpp::VectorBase<RTYPE, NA, T>& x, bool na_rm = false) {
    switch (static_cast<int>(na_rm)) {
        case 0:
            return sugar::Median<RTYPE, NA, T, false>(x);
        case 1:
            return sugar::Median<RTYPE, NA, T, true>(x);
        default:
            return Rcpp::traits::get_na<RTYPE>();
    }
}

or the cleaner version which just accepts a single vector argument:

template <int RTYPE, bool NA, typename T>
inline sugar::Median<RTYPE, NA, T>
median(const Rcpp::VectorBase<RTYPE, NA, T>& x) {
    return sugar::Median<RTYPE, NA, T>(x);
}

@eddelbuettel
Copy link
Member

I know, sorry. I don't find the first one that distasteful. And coping with NA is useful.

Either one is fine.

@nathan-russell
Copy link
Contributor Author

Okay that works for me.

@eddelbuettel
Copy link
Member

This can be closed now. Thanks again for the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants