Rcpp::String silently drops embedded NUL bytes #916

Closed
kevinushey opened this issue Oct 25, 2018 · 1 comment

@kevinushey
Contributor

For example:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP embeddedNul() {
  std::string hasNullByte("abc\0abc", 7);
  String converted(hasNullByte);
  return wrap(converted);
}

/*** R
embeddedNul()
*/

Sourcing this gives me:

> embeddedNul()
[1] "abc"

I would hope for either:

  1. An error, since we shouldn't silently lose data following the NUL byte; or
  2. A conversion to e.g. a raw vector, preserving the embedded NUL.
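
To make option 1 concrete, here is a minimal sketch in plain C++ of the kind of check that could run before conversion. Both helper names (`hasEmbeddedNul`, `requireNoEmbeddedNul`) are hypothetical and not part of Rcpp; inside Rcpp code one would presumably raise the error with `Rcpp::stop()` rather than throwing `std::runtime_error` directly.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Rcpp): true if `s` contains a NUL byte
// anywhere within its stored length. std::string::find searches the full
// size() of the string, so it does not stop at the first NUL.
bool hasEmbeddedNul(const std::string& s) {
    return s.find('\0') != std::string::npos;
}

// Hypothetical helper: returns `s` unchanged, or throws if treating it as a
// NUL-terminated C string would silently drop data.
const std::string& requireNoEmbeddedNul(const std::string& s) {
    if (hasEmbeddedNul(s)) {
        throw std::runtime_error("embedded nul in string");
    }
    return s;
}
```

With the string from the example above, `requireNoEmbeddedNul(std::string("abc\0abc", 7))` would throw, while a plain `"abc"` passes through untouched.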

It's worth noting that attempting the equivalent conversion on the R side gives an error, e.g.

data <- as.raw(c(97, 98, 99, 0, 97, 98, 99))
rawToChar(data)

giving

> data <- as.raw(c(97, 98, 99, 0, 97, 98, 99))
> data
[1] 61 62 63 00 61 62 63
> rawToChar(data)
Error in rawToChar(data) : embedded nul in string: 'abc\0abc'
@kevinushey
Contributor Author

I believe this occurs because we use Rf_mkCharCE() throughout the String class's implementation:

$ ag -Q "mkCharCE"
inst/include/Rcpp/String.h
55:        String(): data(Rf_mkCharCE("", CE_UTF8)), buffer(), valid(true), buffer_ready(true), enc(CE_UTF8) {
369:            return valid ? data : Rf_mkCharCE(buffer.c_str(), enc);
398:                data = Rcpp_ReplaceObject(data, Rf_mkCharCE(Rf_translateCharUTF8(data), encoding));
400:                data = Rf_mkCharCE(buffer.c_str(), encoding);
472:                data = Rf_mkCharCE(buffer.c_str(), enc);

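Rf_mkCharCE() takes only a NUL-terminated C string, so everything after the first NUL byte is lost. But since buffer itself is just a std::string, we should pass both the c_str() and the length() on and use Rf_mkCharLenCE(). I'll try making a PR.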
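
The difference between the two call styles can be seen without R at all. The sketch below is plain C++ with two hypothetical helper names (`lengthViaCStr`, `lengthViaSize`); it only shows how many bytes a pointer-only API versus a pointer-plus-length API would see for the same `std::string`.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: the length seen by an API that receives only a
// NUL-terminated `const char*` (the style of Rf_mkCharCE). Scanning stops at
// the first NUL byte.
std::size_t lengthViaCStr(const std::string& s) {
    return std::strlen(s.c_str());
}

// Hypothetical helper: the length seen by an API that receives the pointer
// plus an explicit length (the style of Rf_mkCharLenCE).
std::size_t lengthViaSize(const std::string& s) {
    return s.size();
}

// In String.h the corresponding call would presumably look like this
// (an assumption, not the actual patch):
//   Rf_mkCharLenCE(buffer.c_str(), static_cast<int>(buffer.size()), enc);
```

For `std::string("abc\0abc", 7)`, `lengthViaCStr` reports 3 and `lengthViaSize` reports 7. As far as I know, an R CHARSXP cannot hold an embedded NUL, so passing the full length to Rf_mkCharLenCE() would likely surface an error (option 1 above) rather than preserve the byte; that is worth verifying against the R sources.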
