# sorting in CharacterVector is not equal to sorting in R #251

emiliotorres opened this issue Feb 4, 2015 · 4 comments

### emiliotorres commented Feb 4, 2015

Lexicographic order in `CharacterVector` differs from the result obtained in R; see the example below. Is this the expected behaviour? Thank you!

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterVector sortcpp(CharacterVector x) {
    x.sort();
    return x;
}

/*** R
x <- c("B", "b", "c", "A", "a")
sort(x)
## "a" "A" "b" "B" "c"
sortcpp(x)
## "A" "B" "a" "b" "c"
*/
```

### emiliotorres commented Feb 5, 2015

The character comparison in Rcpp is made by the C function `strcmp`. R uses a function, `Scollate`, that depends on the locale. I am not sure whether it is possible to redefine `StrCmp` to use `Scollate` instead of `strcmp`.

In `Rcpp/inst/include/Rcpp/internal/NAComparator.h`:

```cpp
inline int StrCmp(SEXP x, SEXP y) {
    if (x == NA_STRING) return (y == NA_STRING ? 0 : 1);
    if (y == NA_STRING) return -1;
    if (x == y) return 0; // same string in cache
    return strcmp(char_nocheck(x), char_nocheck(y));
}
```

In `r-source/src/main/sort.c`:

```c
static int scmp(SEXP x, SEXP y, Rboolean nalast)
{
    if (x == NA_STRING && y == NA_STRING) return 0;
    if (x == NA_STRING) return nalast ? 1 : -1;
    if (y == NA_STRING) return nalast ? -1 : 1;
    if (x == y) return 0; /* same string in cache */
    return Scollate(x, y);
}
```

In `r-source/src/main/util.c`:

```c
int Scollate(SEXP a, SEXP b)
{
    if (!collationLocaleSet) {
        collationLocaleSet = 1;
#ifndef Win32
        if (strcmp("C", getLocale()) ) {
#else
        const char *p = getenv("R_ICU_LOCALE");
        ...
```

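
To make the `strcmp` behaviour concrete, here is a minimal standalone sketch in plain C++ (no Rcpp). The helper name `sort_bytewise` is made up for illustration and is not part of Rcpp. `strcmp` compares raw byte values, and in ASCII every uppercase letter has a smaller code than every lowercase letter, which is why the uppercase strings come out first:

```cpp
#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not an Rcpp function): sort strings by raw byte
// value, the same criterion strcmp applies.
std::vector<std::string> sort_bytewise(std::vector<std::string> x) {
    std::sort(x.begin(), x.end(),
              [](const std::string& a, const std::string& b) {
                  // strcmp returns a negative value when a orders before b
                  return std::strcmp(a.c_str(), b.c_str()) < 0;
              });
    return x;
}
```

Calling `sort_bytewise({"B", "b", "c", "A", "a"})` returns `"A" "B" "a" "b" "c"`, the same order that `sortcpp(x)` produced above.
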
### kevinushey commented Feb 6, 2015

 This one is quite thorny, as R does not export `Scollate`. I think the best we can do is try to mimic what `Scollate` does with our own copy, but that is ugly and error-prone, especially if `Scollate` were to change.
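
As a rough illustration of what locale-dependent comparison looks like outside of R, the sketch below sorts with the collation rules of a `std::locale` from the C++ standard library. This is not a copy of `Scollate` and not something Rcpp provides; the helper name `sort_with_locale` is hypothetical, and the resulting order depends on which locales the platform has installed.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp or R): sort strings using the
// collate facet of the given locale. std::locale::operator() compares
// two strings according to that locale's collation rules.
std::vector<std::string> sort_with_locale(std::vector<std::string> x,
                                          const std::locale& loc) {
    std::sort(x.begin(), x.end(),
              [&loc](const std::string& a, const std::string& b) {
                  return loc(a, b);  // true if a collates before b
              });
    return x;
}
```

With `std::locale::classic()` (the "C" locale) this reproduces the byte-wise order `"A" "B" "a" "b" "c"`. With a named locale such as `std::locale("en_US.UTF-8")`, assuming it is installed, the order may instead interleave upper and lower case as R does; constructing a locale that is not available throws `std::runtime_error`.
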

### coatless commented Mar 27, 2017

Last entry before I send a PR later tonight.

Sixth entry in Section: Known Issues

Title: Lexicographic Order of String Sorting Differs Due to Capitalization

Text:

Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison according to the locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together, followed by the lowercase words, instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example:

```cpp
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) {
    X.sort();
    return X;
}

/*** R
x <- c("B", "b", "c", "A", "a")

# Using R's sort
sort(x)
## "a" "A" "b" "B" "c"

# Using Rcpp's sort
sortcpp(x)
## "A" "B" "a" "b" "c"
*/
```

### eddelbuettel added a commit that referenced this issue Mar 31, 2017

 Merge pull request #661 from coatless/feature/faq-init-known-issues 
Adds Known Issues section to Rcpp FAQ (closes #628, #563, #552, #460, #419, and #251)
 886f5df 
### coatless commented Mar 31, 2017

 @eddelbuettel: This issue can now be closed.

