New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sorting in CharacterVector is not equal to sorting in R #251

Closed
emiliotorres opened this Issue Feb 4, 2015 · 4 comments

Comments

Projects
None yet
5 participants
@emiliotorres

Lexicographic order in CharacterVector differs from the result obtained in R. See example.
Is this result the expected behaviour?
Thank you!

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector sortcpp(CharacterVector x) {
  x.sort();
  return x;
  }

/*** R
     x <-c("B", "b", "c","A","a")
     sort(x)    ## "a" "A" "b" "B" "c"
     sortcpp(x) ## "A" "B" "a" "b" "c"
*/
@emiliotorres

This comment has been minimized.

Show comment
Hide comment
@emiliotorres

emiliotorres Feb 5, 2015

The character comparison in Rcpp is made by the strcmp C function. R uses a Scollate function that depends on locale. I am not sure if it is possible to redefine the StrCmp function using Scollate instead of strcmp.

In Rcpp/inst/include/Rcpp/internal/NAComparator.h:

inline int StrCmp(SEXP x, SEXP y) {
if (x == NA_STRING) return (y == NA_STRING ? 0 : 1);
if (y == NA_STRING) return -1;
if (x == y) return 0; // same string in cache
return strcmp(char_nocheck(x), char_nocheck(y));
}

In r-source/src/main/sort.c

static int scmp(SEXP x, SEXP y, Rboolean nalast)
{
if (x == NA_STRING && y == NA_STRING) return 0;
if (x == NA_STRING) return nalast ? 1 : -1;
if (y == NA_STRING) return nalast ? -1 : 1;
if (x == y) return 0; /* same string in cache */
return Scollate(x, y);
}

In r-source/src/main/util.c

int Scollate(SEXP a, SEXP b)
{
if (!collationLocaleSet) {
collationLocaleSet = 1;
#ifndef Win32
if (strcmp("C", getLocale()) ) {
#else
const char *p = getenv("R_ICU_LOCALE");
...

The character comparison in Rcpp is made by the strcmp C function. R uses a Scollate function that depends on locale. I am not sure if it is possible to redefine the StrCmp function using Scollate instead of strcmp.

In Rcpp/inst/include/Rcpp/internal/NAComparator.h:

inline int StrCmp(SEXP x, SEXP y) {
if (x == NA_STRING) return (y == NA_STRING ? 0 : 1);
if (y == NA_STRING) return -1;
if (x == y) return 0; // same string in cache
return strcmp(char_nocheck(x), char_nocheck(y));
}

In r-source/src/main/sort.c

static int scmp(SEXP x, SEXP y, Rboolean nalast)
{
if (x == NA_STRING && y == NA_STRING) return 0;
if (x == NA_STRING) return nalast ? 1 : -1;
if (y == NA_STRING) return nalast ? -1 : 1;
if (x == y) return 0; /* same string in cache */
return Scollate(x, y);
}

In r-source/src/main/util.c

int Scollate(SEXP a, SEXP b)
{
if (!collationLocaleSet) {
collationLocaleSet = 1;
#ifndef Win32
if (strcmp("C", getLocale()) ) {
#else
const char *p = getenv("R_ICU_LOCALE");
...
@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Feb 6, 2015

Contributor

This one is quite thorny, as R does not export Scollate. I think the best we can do is try to mimic what Scollate does with our own copy, but that is ugly and error prone especially if Scollate were to change.

Contributor

kevinushey commented Feb 6, 2015

This one is quite thorny, as R does not export Scollate. I think the best we can do is try to mimic what Scollate does with our own copy, but that is ugly and error prone especially if Scollate were to change.

@coatless

This comment has been minimized.

Show comment
Hide comment
@coatless

coatless Mar 27, 2017

Contributor

Last entry before I send a PR later tonight.

Sixth entry in Section: Known Issues

Title: Lexicographic Order of String Sorting Differs Due to Capitalization

Text:

Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison on locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together followed by the sorting of lowercase words instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) {
  X.sort();
  return X;
}

/*** R
x <- c("B", "b", "c", "A", "a")

# Using R's sort
sort(x)    
## "a" "A" "b" "B" "c"

# Using Rcpp's sort
sortcpp(x) 
## "A" "B" "a" "b" "c"
*/
Contributor

coatless commented Mar 27, 2017

Last entry before I send a PR later tonight.

Sixth entry in Section: Known Issues

Title: Lexicographic Order of String Sorting Differs Due to Capitalization

Text:

Comparing strings within \R hinges on the ability to process the locale or native-language environment of the string. In \R, there is a function called \code{Scollate} that performs the comparison on locale. Unfortunately, this function has not been made publicly available and, thus, Rcpp does \textit{not} have access to it within its implementation of \code{StrCmp}. As a result, strings that are sorted under the \code{.sort()} member function are ordered improperly. Specifically, if capitalization is present, then capitalized words are sorted together followed by the sorting of lowercase words instead of a mixture of capitalized and lowercase words. The issue is illustrated by the following code example:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector sortcpp(Rcpp::CharacterVector X) {
  X.sort();
  return X;
}

/*** R
x <- c("B", "b", "c", "A", "a")

# Using R's sort
sort(x)    
## "a" "A" "b" "B" "c"

# Using Rcpp's sort
sortcpp(x) 
## "A" "B" "a" "b" "c"
*/
@coatless

This comment has been minimized.

Show comment
Hide comment
@coatless

coatless Mar 31, 2017

Contributor

@eddelbuettel: This issue can now be closed.

Contributor

coatless commented Mar 31, 2017

@eddelbuettel: This issue can now be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment