Add encoding in Rcpp::String class #310
Thanks, @thirdwing -- this looks pretty from afar (Hi from Zuerich). I'll try to set up a full revdep run "just in case" but because of travels it may take me a moment to get to it.
I was just about to start a full rev.dep run when I noticed that
edd@max:~/git$ R CMD INSTALL Rcpp_0.11.6.2.1.tar.gz
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘Rcpp’ ...
** libs
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/ -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -Wno-unused -pedantic -c barrier.cpp -o barrier.o
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/ -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -Wno-unused -pedantic -c Module.cpp -o Module.o
In file included from ../inst/include/Rcpp/Vector.h:67:0,
from ../inst/include/Rcpp.h:38,
from Module.cpp:24:
../inst/include/Rcpp/String.h: In constructor ‘Rcpp::String::String()’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
bool buffer_ready ;
^
../inst/include/Rcpp/String.h:411:18: warning: ‘cetype_t Rcpp::String::enc’ [-Wreorder]
cetype_t enc;
^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
from ../inst/include/Rcpp.h:38,
from Module.cpp:24:
../inst/include/Rcpp/String.h:56:9: warning: when initialized here [-Wreorder]
String( ): data( Rf_mkChar("") ), buffer(), valid(true), buffer_ready(true), enc(CE_NATIVE) {
^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
from ../inst/include/Rcpp.h:38,
from Module.cpp:24:
../inst/include/Rcpp/String.h: In copy constructor ‘Rcpp::String::String(const Rcpp::String&)’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
bool buffer_ready ;
^
../inst/include/Rcpp/String.h:411:18: warning: ‘cetype_t Rcpp::String::enc’ [-Wreorder]
cetype_t enc;
^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
from ../inst/include/Rcpp.h:38,
from Module.cpp:24:
../inst/include/Rcpp/String.h:61:9: warning: when initialized here [-Wreorder]
String( const String& other) : data( other.get_sexp()), valid(true), buffer_ready(false), enc(Rf_getCharCE(other.get_sexp())) {
^
[... 1600+ lines ... ] My |
It is my fault. The class members should be initialized in the same order as they are declared.
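The -Wreorder warnings above come from an initializer list that names buffer_ready before enc, while enc is declared first. A minimal sketch of the rule in standard C++ follows; the names EncState, note and init_order are hypothetical stand-ins for illustration and are not part of Rcpp. Members are always initialized in declaration order, so writing the initializer list in that same order keeps the code and its actual behavior in agreement.

```cpp
#include <string>
#include <vector>

// Records the order in which member initializers actually run.
static int note(std::vector<std::string>& log, const char* name, int value) {
    log.push_back(name);
    return value;
}

// Hypothetical stand-in for the two members named in the warning.
// 'enc' is declared before 'buffer_ready', so it is initialized first,
// and the initializer list below is written in that same order.
// Listing buffer_ready first would not change the order of execution;
// it would only trigger -Wreorder under g++ -Wall.
struct EncState {
    int enc;
    bool buffer_ready;

    explicit EncState(std::vector<std::string>& log)
        : enc(note(log, "enc", 0)),
          buffer_ready(note(log, "buffer_ready", 1) != 0) {}
};

// Builds one EncState and returns the observed initialization order.
std::vector<std::string> init_order() {
    std::vector<std::string> log;
    EncState state(log);
    (void)state;  // only the side effects on 'log' matter here
    return log;
}
```

Calling init_order() returns the member names in the order their initializers ran, which matches the declaration order.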
Thanks for the quick fix! Updated the branch here, and now launched a test. May not see the end of it before I go to bed here in Copenhagen...
I think we can merge this. The revdep check had 382 good, 21 bad -- and the first 'bad' ones I looked at yesterday were all the usual test failures when suggested packages (which I do not install) were expected (as the package did not test for them) -- a bug in that package in my book. More importantly, no new build issues as best as I can tell. I'll try to put up the usual summary in a bit.
JJ had a question regarding a |
I think you mean this line https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L370 It follows the Sometimes the We need to make sure the |
Thanks for the follow-up, KK. To me this looks good, but we may as well let JJ have another look at it. He just scored one really good comment the other day so his karma is higher than ever :) Greetings from Denmark to you. Keep us posted too on the compiler endeavour if there is anything new, and hope you get to enjoy Seattle.
Of course I need his comments. For me comments are more important than accepting one PR. It is the way to learn from accomplished programmers. ;)
enc = encoding;

if (valid) {
    data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
How is the underlying SEXP for the String class managed? When we create it, do we PROTECT it and/or do we need to? When we assign over the top of it (as is done here), do we need to do any additional UNPROTECT?
I have to admit I am not sure about this problem. According to some similar SEXP assignment code, I don't think we need additional UNPROTECT here.
@eddelbuettel, can you please give us a clear answer? Thank you!
Because the underlying SEXP is a CHARSXP, I think its lifetime is essentially 'infinite' since once a string enters the string pool R won't clean it out. At least, that's the assumption the String class is making all over the place; this probably won't be a safe assumption 'some day' but is probably permissible for now.
Okay, agreed. I think we are good to merge (if I hear no objections by
tomorrow AM I'll go ahead and merge).
On Wed, Jul 1, 2015 at 6:19 PM, Kevin Ushey notifications@github.com
wrote:
In inst/include/Rcpp/String.h
#310 (comment):
case CE_BYTES:
return "bytes";
case CE_LATIN1:
return "latin1";
case CE_UTF8:
return "UTF-8";
default:
return "unknown";
}
}
inline void set_encoding( cetype_t encoding ) {
enc = encoding;
if (valid) {
data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
@kevinushey are you sure that's true? I thought CHARSXPs were garbage collected, but normally you don't have to worry about PROTECT() because they immediately go into a STRSXP. (But I can't think of any easy way to test that hypothesis)
I think CHARSXPs enter a global string cache (it's something I vaguely remember reading about a long time ago) but I think it's worth confirming / testing. It should be easy to make an example with e.g. an unprotected Rf_mkChar + a following Rf_allocVector + gctorture; presumably that CHARSXP would be cleaned up if the GC touched it.
From Writing R Extensions (http://cran.rstudio.com/doc/manuals/r-devel/R-exts.html#Handling-character-data):
CHARSXPs are read-only objects and must never be modified. In particular, the C-style string contained in a CHARSXP should be treated as read-only and for this reason the CHAR function used to access the character data of a CHARSXP returns (const char *) (this also allows compilers to issue warnings about improper use). Since CHARSXPs are immutable, the same CHARSXP can be shared by any STRSXP needing an element representing the same string. R maintains a global cache of CHARSXPs so that there is only ever one CHARSXP representing a given string in memory.
On Mon, Jul 6, 2015 at 1:25 PM, Kevin Ushey notifications@github.com
wrote:
In inst/include/Rcpp/String.h
#310 (comment):
case CE_BYTES:
return "bytes";
case CE_LATIN1:
return "latin1";
case CE_UTF8:
return "UTF-8";
default:
return "unknown";
}
}
inline void set_encoding( cetype_t encoding ) {
enc = encoding;
if (valid) {
data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
I don't think either of those speak to whether CHARSXPs are gc'd or not. This PR by @kevinushey suggests that they are: hadley/reshape#40
I just added a comment at the end of the PR showing that they indeed can be GCed, so we'll have to handle that. :(
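The distinction argued over in this thread (a global cache guarantees one object per string, but that alone does not keep the object alive) can be shown with a toy model in standard C++. This is only an analogy and not R's implementation: InternPool, intern and contains are hypothetical names, and the pool is assumed to hold weak references so that an entry vanishes once its last outside owner is gone.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Toy model only, NOT R's implementation. The pool hands out one shared
// object per distinct string, but stores only weak references, so being
// in the cache does not keep an entry alive.
class InternPool {
public:
    // Returns the shared object for 's', creating it if no live one exists.
    std::shared_ptr<const std::string> intern(const std::string& s) {
        auto it = cache_.find(s);
        if (it != cache_.end()) {
            if (auto live = it->second.lock()) {
                return live;  // still owned elsewhere: reuse it
            }
        }
        std::shared_ptr<const std::string> fresh =
            std::make_shared<std::string>(s);
        cache_[s] = fresh;  // weak reference: does not extend the lifetime
        return fresh;
    }

    // True only while some outside owner still holds the entry for 's'.
    bool contains(const std::string& s) const {
        auto it = cache_.find(s);
        return it != cache_.end() && !it->second.expired();
    }

private:
    std::unordered_map<std::string, std::weak_ptr<const std::string>> cache_;
};
```

In this model, two calls to intern("ouch") return the same object while either result is held, and contains("ouch") turns false as soon as both are released. The analogous step for a wrapper class would be to hold its own protecting reference rather than rely on the cache.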
Add encoding in Rcpp::String class
No objections at all ... and merged!
Doh. I think @hadley is correct about protection of the underlying CHARSXP.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector test() {
    Function gc("gc");

    // neither object is PROTECTed here
    SEXP character = Rf_mkChar("ouch");
    SEXP string = Rf_mkString("ouch");

    // force big allocation + gc
    Shield<SEXP> other(Rf_allocVector(INTSXP, 1E6));
    gc();

    Rprintf("CHARSXP:\n");
    Rf_PrintValue(character);
    Rprintf("STRSXP:\n");
    Rf_PrintValue(string);
    return wrap(other);
}

/*** R
gctorture(TRUE)
head(test())
gctorture(FALSE)
*/
Odds are you will see garbage in the
which is not The |
Is the |
@jeroenooms String is really a wrapper around a CHARSXP, an equivalent of
Encoding info in String class and getter/setter.
CE_NATIVE is used if no encoding info provided.