Strange behavior with CharacterVector and std::sort #419

@nathan-russell

I believe this is distinct from the collation issue described in #251. Calling std::sort on an Rcpp::CharacterVector produces very unexpected results on my machine (Ubuntu 14.04):

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::CharacterVector RcppSort(Rcpp::CharacterVector x) {
  Rcpp::CharacterVector y = Rcpp::clone(x);
  y.sort();
  return y;
}

// [[Rcpp::export]]
Rcpp::CharacterVector StdSort(Rcpp::CharacterVector x) {
  Rcpp::CharacterVector y = Rcpp::clone(x); 
  std::sort(y.begin(), y.end());
  return y;
}

// [[Rcpp::export]]
std::vector<std::string> StdSort2(Rcpp::CharacterVector x) {
  std::vector<std::string> y = Rcpp::as<std::vector<std::string> >(x);
  std::sort(y.begin(), y.end());
  return y;
}

/*** R

set.seed(123)
(xx <- sample(c(LETTERS[1:5], letters[1:6]), 11))
#[1] "D" "c" "f" "e" "b" "A" "C" "d" "B" "a" "E"

RcppSort(xx)
#[1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f"

StdSort(xx)
#[1] "f" "f" "f" "f" "f" "f" "D" "c" "f" "f" "f"
##    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 

StdSort2(xx)
#[1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f"

*/

I consistently get the same strange output from StdSort(xx) whether it is compiled with clang (5.3) or gcc (4.9.3). Presumably this is the comparator being used in StdSort:

bool operator<(const Rcpp::String& other) const {
  return strcmp(get_cstring(), other.get_cstring()) < 0;
}

which does not seem to be doing anything unusual. Unfortunately I'm not terribly familiar with the internals of Rcpp::String / Rcpp::string_proxy<>, so I really can't imagine what could be causing this behavior, but it looked like something worth pointing out.
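As a sanity check on the comparator alone, here is a standalone sketch that does not use Rcpp at all. It applies the same strcmp-based ordering to plain std::string values, using the sample vector from the R session above. The names sort_with_strcmp and check_sample are hypothetical helpers written for this illustration, not part of Rcpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): sorts a copy of the input using the
// same strcmp-based "less than" as the operator< quoted above.
std::vector<std::string> sort_with_strcmp(std::vector<std::string> v) {
  std::sort(v.begin(), v.end(),
            [](const std::string& a, const std::string& b) {
              return std::strcmp(a.c_str(), b.c_str()) < 0;
            });
  return v;
}

// Hypothetical check: runs the helper on the sample vector from the R session
// and asserts the byte-wise order (uppercase before lowercase in ASCII).
void check_sample() {
  const std::vector<std::string> xx = {"D", "c", "f", "e", "b", "A",
                                       "C", "d", "B", "a", "E"};
  const std::vector<std::string> expected = {"A", "B", "C", "D", "E", "a",
                                             "b", "c", "d", "e", "f"};
  assert(sort_with_strcmp(xx) == expected);
}
```

With plain strings this ordering matches what RcppSort and StdSort2 return, so the comparison logic by itself looks fine. Whatever goes wrong in StdSort presumably happens somewhere else.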

My session info:

#R version 3.2.3 (2015-12-10)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Ubuntu 14.04.3 LTS
#
#locale:
#[1]  LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
#[4]  LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#[7]  LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
#[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#loaded via a namespace (and not attached):
#[1] tools_3.2.3 Rcpp_0.12.3
