# RcppCore/Rcpp

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

# Strange behavior with CharacterVector and std::sort #419

Closed
opened this Issue Jan 15, 2016 · 21 comments

## Comments

Projects
None yet
5 participants
Contributor

### nathan-russell commented Jan 15, 2016

 I believe this is distinct from the collation issue described in #251. Calling std::sort on an Rcpp::CharacterVector produces very unexpected results on my machine (Ubuntu 14.04): #include // [[Rcpp::export]] Rcpp::CharacterVector RcppSort(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); y.sort(); return y; } // [[Rcpp::export]] Rcpp::CharacterVector StdSort(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); std::sort(y.begin(), y.end()); return y; } // [[Rcpp::export]] std::vector StdSort2(Rcpp::CharacterVector x) { std::vector y = Rcpp::as >(x); std::sort(y.begin(), y.end()); return y; } /*** R set.seed(123) (xx <- sample(c(LETTERS[1:5], letters[1:6]), 11)) #[1] "D" "c" "f" "e" "b" "A" "C" "d" "B" "a" "E" RcppSort(xx) #[1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f" StdSort(xx) #[1] "f" "f" "f" "f" "f" "f" "D" "c" "f" "f" "f" ## ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ StdSort2(xx) #[1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f" */ I'm consistently getting the same strange output from StdSort(xx) whether compiled with clang (5.3) or gcc (4.9.3). Presumably this is the comparator being used in StdSort bool operator<(const Rcpp::String& other) const { return strcmp(get_cstring(), other.get_cstring()) < 0; } which does not seem to be doing anything unusual. Unfortunately I'm not terribly familiar with the internals of Rcpp::String / Rcpp::string_proxy<>, so I really can't imagine what could be causing this behavior, but it looked like something worth pointing out. My session info: #R version 3.2.3 (2015-12-10) #Platform: x86_64-pc-linux-gnu (64-bit) #Running under: Ubuntu 14.04.3 LTS # #locale: #[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 #[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C #[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C # #attached base packages: #[1] stats graphics grDevices utils datasets methods base # #loaded via a namespace (and not attached): #[1] tools_3.2.3 Rcpp_0.12.3
Contributor

### kevinushey commented Jan 15, 2016

 FWIW, this works correctly for me on OS X. Perhaps we're bumping into a weird libc++ vs. libstdc++ issue?
Member

### eddelbuettel commented Jan 15, 2016

 It's borked on Ubuntu 15.04 (my server) with g++-4.9.
Contributor

### nathan-russell commented Jan 15, 2016

 Hmm it looks like OS X is the odd man out (or rather, odd man in); I get the same gibberish output ([1] "f" "f" "f" "f" "f" "f" "D" "c" "f" "f" "f") on CentOS 7 (gcc and clang), as well as Windows 7 (x86_64-w64-mingw32/x64 (64-bit)). Interestingly, std::stable_sort produces a different incorrect output: // [[Rcpp::export]] Rcpp::CharacterVector StableSort(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); std::stable_sort(y.begin(), y.end()); return y; } // "d" "D" "c" "d" "d" "d" "d" "d" "f" "f" "f"
Contributor

### dcdillon commented Jan 15, 2016

 I haven't looked at the code, but I would suspect that this has to do with iterator behavior in Rcpp::CharacterVector not being 100% compliant with stl standards. std::sort and std::stable_sort require 100% stl compliant containers and iterators. Specifically the offset dereference operator comes to mind http://www.cplusplus.com/reference/iterator/
Contributor

### kevinushey commented Jan 15, 2016

 FWIW, it does work correctly with libc++ (tested on Ubuntu): CXX=clang++-3.6 CXXFLAGS=-g -O3 -Wall -march=native -stdlib=libc++ LDFLAGS=-stdlib=libc++ kevin@KBOX:~/git/Rcpp [master] \$ R --vanilla --slave -e 'Rcpp::sourceCpp("ouch.cpp")' > set.seed(123) > (xx <- sample(c(LETTERS[1:5], letters[1:6]), 11)) [1] "D" "c" "f" "e" "b" "A" "C" "d" "B" "a" "E" > RcppSort(xx) [1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f" > StdSort(xx) [1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f" > StdSort2(xx) [1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f"  Given that IIUC libc++ and libstdc++ diverge a bit in their implementations of iterators, I think @dcdillon is probably right.
Contributor

### nathan-russell commented Jan 16, 2016

 Is it possible that libstdc++ isn't using the offset dereference operator in their sort implementations, and libc++ is? Because after trying out several modifications to Proxy_Iterator<>::operator[] and not seeing any change from my above results, I recompiled Rcpp with that operator completely commented out, reran the example, and got the same output. So perhaps this is caused by some other aspect of the string_proxy or Proxy_Iterator classes?
Contributor

### dcdillon commented Jan 16, 2016

 I can show what's happening (I believe) // [[Rcpp::export]] Rcpp::CharacterVector LessTest(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); Rcpp::internal::string_proxy<16> tmp(*y.begin()); *y.begin() = *(y.begin() + 1); *(y.begin() + 1) = tmp; return y; } > LessTest(xx) [1] "c" "c" "f" "e" "b" "A" "C" "d" "B" "a" "E"  I don't know why, but it seems that std::sort in gcc is not using std::swap (or possibly not locating the correct std::swap for Rcpp::string_proxy. Running the same code with std::swap seems to work as evidenced below: // [[Rcpp::export]] Rcpp::CharacterVector LessTest(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); //Rcpp::internal::string_proxy<16> tmp(*y.begin()); //*y.begin() = *(y.begin() + 1); //*(y.begin() + 1) = tmp; std::swap(*y.begin(), *(y.begin() + 1)); return y; } > LessTest(xx) [1] "c" "D" "f" "e" "b" "A" "C" "d" "B" "a" "E"  So the good news is I can explain what is happening. The bad news is I have no idea how to fix it.
Contributor

### dcdillon commented Jan 16, 2016

 And a brief (and probably incomplete) look through the gcc code would indicate that it's probably using std::iter_swap instead of std::swap which doesn't have a proper overload.
Contributor

### dcdillon commented Jan 16, 2016

 OK. After further research the problem is this. Rcpp::string_proxy needs template specializations for every sequence modifying STL "primitive" (e.g. std::swap, std::move, std::copy) and all their variants. The reason is that between different versions of libstdc++ and even differences in range sizes, different optimizations are implemented in terms of the "primitives". Taking for instance the example that we had above. On a sequence of length 11 as in the example, std::sort uses an optimization that uses std::copy_backward and std::copy. If we bump that size to 50, it uses std::iter_swap instead. If we switch to C++11, it uses std::move_backward and std::move. The problem just continues and continues. I am willing to play with this a bit and try to find something that will work, however, without a full blown specialization of every STL sequence modifying "primitive", this will not work across all versions and all compilers. Let me know what you think.
Member

### eddelbuettel commented Jan 16, 2016

 Thanks so much for digging in. And this seems super annoying. Is there another string / container / iterator library we could borrow an idea from?
Contributor

### dcdillon commented Jan 16, 2016

 We could also simply specialize sort for Rcpp::CharacterVector. That's actually probably the easiest.
Contributor

### dcdillon commented Jan 16, 2016

 Actually it might be the only possible way. Another problem is...behind the scenes...the STL implementation assumes that using the copy constructor of a Rcpp::string_proxy creates a copy that is not affected by assigning to the original value. This is not the case given the implementation of Rcpp::string_proxy so even properly specializing ALL the STL stuff won't save it. I give up.
Contributor

### dcdillon commented Jan 17, 2016

 And because I cannot leave well enough alone (from the libstdc++ code). Please note lines 1849 - 1851. This is unsafe with Rcpp::internals::string_proxy regardless of whether we implement all the sequence modifying STL stuff appropriately.  1836 /// This is a helper function for the sort routine. 1837 template 1838 void 1839 __insertion_sort(_RandomAccessIterator __first, 1840 _RandomAccessIterator __last, _Compare __comp) 1841 { 1842 if (__first == __last) return; 1843 1844 for (_RandomAccessIterator __i = __first + 1; __i != __last; ++__i) 1845 { 1846 if (__comp(__i, __first)) 1847 { 1848 typename iterator_traits<_RandomAccessIterator>::value_type 1849 __val = _GLIBCXX_MOVE(*__i); 1850 _GLIBCXX_MOVE_BACKWARD3(__first, __i, __i + 1); 1851 *__first = _GLIBCXX_MOVE(__val); 1852 } 1853 else 1854 std::__unguarded_linear_insert(__i, 1855 __gnu_cxx::__ops::__val_comp_iter(__comp)); 1856 } 1857 }
Contributor

### nathan-russell commented Jan 17, 2016

 @dcdillon Thank you for taking the time to dig into this. To clarify, what is the motivation for having the copy constructor of string_proxy behave as such? Is it somehow necessitated by R's internal use of a global string cache, or is it simply a performance feature?
Contributor

### dcdillon commented Jan 17, 2016

 The formal reason why this fails is that std::sort according to the C++ standard is that Rcpp::internal::string_proxy is NOT MoveAssignable (as defined on page 466 of the standard in Table 22). This is because of the way operator=(const string_proxy &rhs) works. The left hand side of that expression is NOT "equivalent" to the right hand side before the operation. We should mark this as "impossible" for now (without a huge change to string_proxy) and add documentation indicating that using STL algorithms on Rcpp containers is not supported.
Member

### eddelbuettel commented Jan 17, 2016

 I am fine with documenting it. There are some things that remain simply hard because of internal implementation details on either the R (and its C internals) and C++ library side. Question is where to document this (best) and how. Probably couple of places...

Closed

Contributor

### dcdillon commented Mar 21, 2016

 Should we mark this as documentation or close it as currently nothing short of writing a specialization of std::sort can fix it?
Member

### eddelbuettel commented Mar 21, 2016

 May make sense to document, ie in the Rcpp FAQ. Then at least we can lean back and say you had been warned ...
Contributor

### coatless commented Jul 15, 2016

 To Do: Needs documentation flag and add a section under Examples in FAQ.

### coatless referenced this issue Jul 16, 2016

Open

#### Documentation Update / Issue Tracker Cleaning #506

39 of 43 tasks complete

Closed

Contributor

### coatless commented Mar 24, 2017

 Third entry for Section: Known Issues Title: Sorting with STL on a \code{CharacterVector} produces problematic results Text: The Standard Template Library (STL) \code{std::sort} algorithm performs adequately for the majority of \pkg{Rcpp} data types. The notable exception that makes what would otherwise be a universal quantifier into an existential quantifier is the \code{CharacterVector} data type. Chiefly, the issue with sorting strings is related to how the \code{CharacterVector} relies upon the use of \code{Rcpp::internal::string_proxy}. In particular, \code{Rcpp::internal::string_proxy} is \textit{not} MoveAssignable since the left hand side of \code{operator=(const string_proxy &rhs)} is \textit{not} viewed as equivalent to the right hand side before the operation \citep[][p. 466, Table 22]{Cpp11}. This further complicates matters when using \code{CharacterVector} with \code{std::swap}, \code{std::move}, \code{std::copy} and their variants. To avoid unwarranted pain with sorting, the preferred approach is to use the \code{.sort()} member function of \pkg{Rcpp} objects. The member function correctly applies the sorting procedure to \pkg{Rcpp} objects regardless of type. The following code example illustrates the preferred approach alongside the problematic approach: #include // [[Rcpp::export]] Rcpp::CharacterVector member_sort(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); y.sort(); return y; } // [[Rcpp::export]] Rcpp::CharacterVector stl_sort(Rcpp::CharacterVector x) { Rcpp::CharacterVector y = Rcpp::clone(x); std::sort(y.begin(), y.end()); return y; } /*** R set.seed(123) (X <- sample(c(LETTERS[1:5], letters[1:6]), 11)) #[1] "D" "c" "f" "e" "b" "A" "C" "d" "B" "a" "E" member_sort(X) #[1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e" "f" stl_sort(X) #[1] "f" "f" "f" "f" "f" "f" "D" "c" "f" "f" "f" */ In closing, the results of using the STL approach do change depending on whether \code{libc++} or \code{libstdc++} standard library is used to compile the code. When debugging, this does make the issue particularly complex to sort out. Principally, compilation with \code{libc++} and STL has been shown to yield the correct results. However, it is not wise to rely upon this library as a majority of code is compiled against \code{libstdc++} as it more complete.

Merged

### eddelbuettel added a commit that referenced this issue Mar 31, 2017

 Merge pull request #661 from coatless/feature/faq-init-known-issues 
Adds Known Issues section to Rcpp FAQ (closes #628, #563, #552, #460, #419, and #251)
 886f5df 
Contributor

### coatless commented Mar 31, 2017

 @eddelbuettel: This issue can now be closed.

### eddelbuettel closed this Mar 31, 2017

to join this conversation on GitHub. Already have an account? Sign in to comment