# sorting in CharacterVector is not equal to sorting in R #251

Closed
opened this Issue Feb 4, 2015 · 4 comments

Projects
None yet
5 participants

### emiliotorres commented Feb 4, 2015

 Lexicographic order in CharacterVector differs from the result obtained in R. See example. Is this result the expected behaviour? Thank you! #include using namespace Rcpp ; // [[Rcpp::export]] CharacterVector sortcpp(CharacterVector x) { x.sort(); return x; } /*** R x <-c("B", "b", "c","A","a") sort(x) ## "a" "A" "b" "B" "c" sortcpp(x) ## "A" "B" "a" "b" "c" */ 

### emiliotorres commented Feb 5, 2015

 The character comparison in Rcpp is made by the strcmp C function. R uses a Scollate function that depends on locale. I am not sure if it is possible to redefine the StrCmp function using Scollate instead of strcmp. In Rcpp/inst/include/Rcpp/internal/NAComparator.h: inline int StrCmp(SEXP x, SEXP y) { if (x == NA_STRING) return (y == NA_STRING ? 0 : 1); if (y == NA_STRING) return -1; if (x == y) return 0; // same string in cache return strcmp(char_nocheck(x), char_nocheck(y)); }  In r-source/src/main/sort.c static int scmp(SEXP x, SEXP y, Rboolean nalast) { if (x == NA_STRING && y == NA_STRING) return 0; if (x == NA_STRING) return nalast ? 1 : -1; if (y == NA_STRING) return nalast ? -1 : 1; if (x == y) return 0; /* same string in cache */ return Scollate(x, y); }  In r-source/src/main/util.c int Scollate(SEXP a, SEXP b) { if (!collationLocaleSet) { collationLocaleSet = 1; #ifndef Win32 if (strcmp("C", getLocale()) ) { #else const char *p = getenv("R_ICU_LOCALE"); ... 
Contributor

### kevinushey commented Feb 6, 2015

 This one is quite thorny, as R does not export Scollate. I think the best we can do is try to mimic what Scollate does with our own copy, but that is ugly and error prone especially if Scollate were to change.

Closed

Merged

Closed

Contributor

### coatless commented Mar 27, 2017 • edited Edited 1 time coatless edited Mar 27, 2017 (most recent)

 Last entry before I send a PR later tonight. Sixth entry in Section: Known Issues Title: Lexicographic Order of String Sorting Differs Due to Capitalization Text: Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison on locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together followed by the sorting of lowercase words instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example: #include // [[Rcpp::export]] Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) { X.sort(); return X; } /*** R x <- c("B", "b", "c", "A", "a") # Using R's sort sort(x) ## "a" "A" "b" "B" "c" # Using Rcpp's sort sortcpp(x) ## "A" "B" "a" "b" "c" */

Merged

### eddelbuettel added a commit that referenced this issue Mar 31, 2017

 Merge pull request #661 from coatless/feature/faq-init-known-issues 
Adds Known Issues section to Rcpp FAQ (closes #628, #563, #552, #460, #419, and #251)
 886f5df 
Contributor

### coatless commented Mar 31, 2017

 @eddelbuettel: This issue can now be closed.

Closed