Rcpp::String silently drops embedded NUL bytes #916

Closed
kevinushey opened this Issue Oct 25, 2018 · 1 comment


kevinushey commented Oct 25, 2018

For example:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP embeddedNul() {
  std::string hasNullByte("abc\0abc", 7);
  String converted(hasNullByte);
  return wrap(converted);
}

/*** R
embeddedNul()
*/

Sourcing this gives me:

> embeddedNul()
[1] "abc"

I would hope for either:

  1. An error, since we shouldn't silently lose data following the NUL byte; or
  2. A conversion to, e.g., a raw vector, preserving the embedded NUL.

It's worth stating that attempting to do this on the R side would give an error, e.g.

data <- as.raw(c(97, 98, 99, 0, 97, 98, 99))
rawToChar(data)

giving

> data <- as.raw(c(97, 98, 99, 0, 97, 98, 99))
> data
[1] 61 62 63 00 61 62 63
> rawToChar(data)
Error in rawToChar(data) : embedded nul in string: 'abc\0abc'

kevinushey commented Oct 25, 2018

I believe this occurs because we use Rf_mkCharCE(), which takes only a NUL-terminated C string and so stops at the first NUL byte, throughout the String class's implementation:

$ ag -Q "mkCharCE"
inst/include/Rcpp/String.h
55:        String(): data(Rf_mkCharCE("", CE_UTF8)), buffer(), valid(true), buffer_ready(true), enc(CE_UTF8) {
369:            return valid ? data : Rf_mkCharCE(buffer.c_str(), enc);
398:                data = Rcpp_ReplaceObject(data, Rf_mkCharCE(Rf_translateCharUTF8(data), encoding));
400:                data = Rf_mkCharCE(buffer.c_str(), encoding);
472:                data = Rf_mkCharCE(buffer.c_str(), enc);

But since buffer itself is just a std::string, which tracks its own length, we should pass both buffer.c_str() and buffer.length() on and use Rf_mkCharLenCE() instead. I'll try making a PR.
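
To illustrate the difference without depending on R's headers, here is a minimal standalone sketch. The functions fromCString(), fromBuffer() and hasEmbeddedNul() are hypothetical stand-ins invented for this example; they only mimic the argument shapes of Rf_mkCharCE() (pointer only) and Rf_mkCharLenCE() (pointer plus length) and are not part of Rcpp or the R API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a pointer-only API (the shape of Rf_mkCharCE):
// the callee has to recover the length with strlen(), which stops at the
// first NUL byte.
std::string fromCString(const char* data) {
    assert(data != nullptr);
    return std::string(data, std::strlen(data));
}

// Hypothetical stand-in for a pointer-plus-length API (the shape of
// Rf_mkCharLenCE): the caller supplies the length, so no byte is hidden.
std::string fromBuffer(const char* data, std::size_t length) {
    assert(data != nullptr);
    return std::string(data, length);
}

// Reports whether a buffer contains an embedded NUL, which a caller could
// use to raise an error rather than silently lose data.
bool hasEmbeddedNul(const std::string& buffer) {
    return buffer.find('\0') != std::string::npos;
}
```

With std::string buffer("abc\0abc", 7), fromCString(buffer.c_str()) keeps only the 3 bytes before the NUL, while fromBuffer(buffer.c_str(), buffer.size()) keeps all 7. Whether the real fix should then error or convert to a raw vector is left open here, as in the report above.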
