Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sorting in CharacterVector is not equal to sorting in R #251

Closed
emiliotorres opened this issue Feb 4, 2015 · 4 comments
Closed

sorting in CharacterVector is not equal to sorting in R #251

emiliotorres opened this issue Feb 4, 2015 · 4 comments
Labels

Comments

@emiliotorres
Copy link

@emiliotorres emiliotorres commented Feb 4, 2015

Lexicographic order in CharacterVector differs from the result obtained in R. See example.
Is this result the expected behaviour?
Thank you!

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector sortcpp(CharacterVector x) {
  x.sort();
  return x;
  }

/*** R
     x <-c("B", "b", "c","A","a")
     sort(x)    ## "a" "A" "b" "B" "c"
     sortcpp(x) ## "A" "B" "a" "b" "c"
*/
@emiliotorres
Copy link
Author

@emiliotorres emiliotorres commented Feb 5, 2015

The character comparison in Rcpp is made by the strcmp C function. R uses a Scollate function that depends on locale. I am not sure if it is possible to redefine the StrCmp function using Scollate instead of strcmp.

In Rcpp/inst/include/Rcpp/internal/NAComparator.h:

inline int StrCmp(SEXP x, SEXP y) {
if (x == NA_STRING) return (y == NA_STRING ? 0 : 1);
if (y == NA_STRING) return -1;
if (x == y) return 0; // same string in cache
return strcmp(char_nocheck(x), char_nocheck(y));
}

In r-source/src/main/sort.c

static int scmp(SEXP x, SEXP y, Rboolean nalast)
{
if (x == NA_STRING && y == NA_STRING) return 0;
if (x == NA_STRING) return nalast ? 1 : -1;
if (y == NA_STRING) return nalast ? -1 : 1;
if (x == y) return 0; /* same string in cache */
return Scollate(x, y);
}

In r-source/src/main/util.c

int Scollate(SEXP a, SEXP b)
{
if (!collationLocaleSet) {
collationLocaleSet = 1;
#ifndef Win32
if (strcmp("C", getLocale()) ) {
#else
const char *p = getenv("R_ICU_LOCALE");
...
@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Feb 6, 2015

This one is quite thorny, as R does not export Scollate. I think the best we can do is try to mimic what Scollate does with our own copy, but that is ugly and error prone especially if Scollate were to change.

@coatless
Copy link
Contributor

@coatless coatless commented Mar 27, 2017

Last entry before I send a PR later tonight.

Sixth entry in Section: Known Issues

Title: Lexicographic Order of String Sorting Differs Due to Capitalization

Text:

Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison on locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together followed by the sorting of lowercase words instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) {
  X.sort();
  return X;
}

/*** R
x <- c("B", "b", "c", "A", "a")

# Using R's sort
sort(x)    
## "a" "A" "b" "B" "c"

# Using Rcpp's sort
sortcpp(x) 
## "A" "B" "a" "b" "c"
*/
eddelbuettel added a commit that referenced this issue Mar 31, 2017
Adds Known Issues section to Rcpp FAQ (closes #628, #563, #552, #460, #419, and #251)
@coatless
Copy link
Contributor

@coatless coatless commented Mar 31, 2017

@eddelbuettel: This issue can now be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.