Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sorting in CharacterVector is not equal to sorting in R #251

Closed
emiliotorres opened this issue Feb 4, 2015 · 4 comments
Closed

sorting in CharacterVector is not equal to sorting in R #251

emiliotorres opened this issue Feb 4, 2015 · 4 comments

Comments

@emiliotorres
Copy link

Lexicographic order in CharacterVector differs from the result obtained in R. See example.
Is this result the expected behaviour?
Thank you!

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector sortcpp(CharacterVector x) {
  x.sort();
  return x;
  }

/*** R
     x <-c("B", "b", "c","A","a")
     sort(x)    ## "a" "A" "b" "B" "c"
     sortcpp(x) ## "A" "B" "a" "b" "c"
*/
@emiliotorres
Copy link
Author

The character comparison in Rcpp is made by the strcmp C function. R uses a Scollate function that depends on locale. I am not sure if it is possible to redefine the StrCmp function using Scollate instead of strcmp.

In Rcpp/inst/include/Rcpp/internal/NAComparator.h:

inline int StrCmp(SEXP x, SEXP y) {
if (x == NA_STRING) return (y == NA_STRING ? 0 : 1);
if (y == NA_STRING) return -1;
if (x == y) return 0; // same string in cache
return strcmp(char_nocheck(x), char_nocheck(y));
}

In r-source/src/main/sort.c

static int scmp(SEXP x, SEXP y, Rboolean nalast)
{
if (x == NA_STRING && y == NA_STRING) return 0;
if (x == NA_STRING) return nalast ? 1 : -1;
if (y == NA_STRING) return nalast ? -1 : 1;
if (x == y) return 0; /* same string in cache */
return Scollate(x, y);
}

In r-source/src/main/util.c

int Scollate(SEXP a, SEXP b)
{
if (!collationLocaleSet) {
collationLocaleSet = 1;
#ifndef Win32
if (strcmp("C", getLocale()) ) {
#else
const char *p = getenv("R_ICU_LOCALE");
...

@kevinushey
Copy link
Contributor

This one is quite thorny, as R does not export Scollate. I think the best we can do is try to mimic what Scollate does with our own copy, but that is ugly and error prone especially if Scollate were to change.

@coatless
Copy link
Contributor

coatless commented Mar 27, 2017

Last entry before I send a PR later tonight.

Sixth entry in Section: Known Issues

Title: Lexicographic Order of String Sorting Differs Due to Capitalization

Text:

Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison on locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together followed by the sorting of lowercase words instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) {
  X.sort();
  return X;
}

/*** R
x <- c("B", "b", "c", "A", "a")

# Using R's sort
sort(x)    
## "a" "A" "b" "B" "c"

# Using Rcpp's sort
sortcpp(x) 
## "A" "B" "a" "b" "c"
*/

@coatless
Copy link
Contributor

@eddelbuettel: This issue can now be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants