Skip to content

Conversation

thirdwing
Copy link
Member

Encoding info in String class and getter/setter.

CE_NATIVE is used if no encoding info provided.

@eddelbuettel
Copy link
Member

Thanks, @thirdwing -- this looks pretty from afar (Hi from Zuerich).

I'll try to set up full revdep run "just in case" but because of travels it may take me a moment to get to it.

@eddelbuettel
Copy link
Member

I was just about to start a full rev.dep run when I noticed that -Wreorder issues about 1700 lines of warnings. Here is the beginning of it:

edd@max:~/git$ R CMD INSTALL Rcpp_0.11.6.2.1.tar.gz                                                                                                                                                    
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘Rcpp’ ...
** libs
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/     -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -O3 -Wall -pipe -Wno-unused -pedanti\
c -c barrier.cpp -o barrier.o
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/     -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -O3 -Wall -pipe -Wno-unused -pedanti\
c -c Module.cpp -o Module.o
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h: In constructor ‘Rcpp::String::String()’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
         bool buffer_ready ;
              ^
../inst/include/Rcpp/String.h:411:18: warning:   ‘cetype_t Rcpp::String::enc’ [-Wreorder]
         cetype_t enc;
                  ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h:56:9: warning:   when initialized here [-Wreorder]
         String( ): data( Rf_mkChar("") ), buffer(), valid(true), buffer_ready(true), enc(CE_NATIVE) {
         ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h: In copy constructor ‘Rcpp::String::String(const Rcpp::String&)’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
         bool buffer_ready ;
              ^
../inst/include/Rcpp/String.h:411:18: warning:   ‘cetype_t Rcpp::String::enc’ [-Wreorder]
         cetype_t enc;
                  ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h:61:9: warning:   when initialized here [-Wreorder]
         String( const String& other) : data( other.get_sexp()), valid(true), buffer_ready(false), enc(Rf_getCharCE(other.get_sexp())) {
         ^
[... 1600+ lines ... ]

My CXXFLAGS here at home is CXXFLAGS += -O3 -Wall -pipe -Wno-unused -pedantic

@thirdwing
Copy link
Member Author

It is my fault. The class members should be initialized in the same order of declaration.

@eddelbuettel
Copy link
Member

Thanks for the quick fix!

Updated the branch here, and now launched a test. May not see the end of it before I go to bed here in Copenhagen...

@eddelbuettel
Copy link
Member

I think we can merge this. The revdep check had 382 good, 21 bad -- and the first 'bad' ones I looked at yesterday where all the usual test failures when suggested packages (which I do not install) were expected (as the package did not test for them) -- a bug in that package in my book. More importantly, no new build issues as best as I can tell. I'll try to put up the usual summary in a bit.

@eddelbuettel
Copy link
Member

JJ had a question regarding a if (valid), but we convinced / reminded ourselves that that is simply the toggle for wide vs normal strings (btw: should we do s/valid/wide/ in String.h ?). So I think we're still good.

@thirdwing
Copy link
Member Author

I think you mean this line https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L370

It follows the setData(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L429) function.

Sometimes the CHARSXP is not valid when constructing a String, like from const char*(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L110) or const std::string(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L101).

We need to make sure the CHARSXP is valid when setting encoding.

@eddelbuettel
Copy link
Member

Thanks for the follow-up, KK. To me this looks good, but we may as well let JJ have another look at it. He just scored one really good comment the other day so his karma is higher than ever :)

Greetings from Denmark to you. Keep us posted too on the compiler endeavour if there is anything new, and hope you get to enjoy Seattle.

@thirdwing
Copy link
Member Author

Of course I need his comments.

For me comments are more important than accepting one PR. It is the way to learn from accomplished programmers. ;)

enc = encoding;

if (valid) {
data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the underlying SEXP for the String class managed? When we create it to do we PROTECT it and/or do we need to? When we assign over the top of it (as is done here) do we need to do any additional UNPROTECT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit I am not sure about this problem.

According to some similar SEXP assignment code, I don't think we need additional UNPROTECT here.

@eddelbuettel , can you please give us a clear answer? Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the underlying SEXP is a CHARSXP, I think its lifetime is essentially 'infinite' since once a string enters the string pool R won't clean it out. At least, that's the assumption the String class is making all over the place; this probably won't be safe assumption 'some day' but is probably permissible for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, agreed. I think we are good to merge (if I hear no objections by
tomorrow AM I'll go ahead and merge).

On Wed, Jul 1, 2015 at 6:19 PM, Kevin Ushey notifications@github.com
wrote:

In inst/include/Rcpp/String.h
#310 (comment):

  •            case CE_BYTES:
    
  •                return "bytes";
    
  •            case CE_LATIN1:
    
  •                return "latin1";
    
  •            case CE_UTF8:
    
  •                return "UTF-8";
    
  •            default:
    
  •                return "unknown";
    
  •        }
    
  •    }
    
  •    inline void set_encoding( cetype_t encoding ) {
    
  •        enc = encoding;
    
  •        if (valid) {
    
  •            data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
    

Because the underlying SEXP is a CHARSXP, I think its lifetime is
essentially 'infinite' since once a string enters the string pool R won't
clean it out. At least, that's the assumption the String class is making
all over the place; this probably won't be safe assumption 'some day' but
is probably permissible for now.


Reply to this email directly or view it on GitHub
https://github.com/RcppCore/Rcpp/pull/310/files#r33695590.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinushey are you sure that's true? I thought CHARSXPs were garbage collected, but normally you don't have to worry about PROTECT() because they immediately go into a STRSXP. (But I can't think of any easy way to test that hypothesis)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CHARSXPs enter a global string cache (it's something I vaguely remember reading about a long time ago) but I think it's worth confirming / testing. It should be easy to make an example with e.g. an unprotected Rf_mkChar + a following Rf_allocVector + gctorture; presumedly that CHARSXP would be cleaned up if the GC touched it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Writing R Extensions (
http://cran.rstudio.com/doc/manuals/r-devel/R-exts.html#Handling-character-data
):

CHARSXPs are read-only objects and must never be modified. In particular,
the C-style string contained in a CHARSXP should be treated as read-only
and for this reason the CHAR function used to access the character data of
a CHARSXP returns (const char *) (this also allows compilers to issue
warnings about improper use). Since CHARSXPs are immutable, the same
CHARSXP can
be shared by any STRSXP needing an element representing the same string. R
maintains a global cache of CHARSXPs so that there is only ever one
CHARSXP representing
a given string in memory.

On Mon, Jul 6, 2015 at 1:25 PM, Kevin Ushey notifications@github.com
wrote:

In inst/include/Rcpp/String.h
#310 (comment):

  •            case CE_BYTES:
    
  •                return "bytes";
    
  •            case CE_LATIN1:
    
  •                return "latin1";
    
  •            case CE_UTF8:
    
  •                return "UTF-8";
    
  •            default:
    
  •                return "unknown";
    
  •        }
    
  •    }
    
  •    inline void set_encoding( cetype_t encoding ) {
    
  •        enc = encoding;
    
  •        if (valid) {
    
  •            data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
    

I think CHARSXPs enter a global string cache (it's something I vaguely
remember reading about a long time ago) but I think it's worth confirming /
testing. It should be easy to make an example with e.g. an unprotected
Rf_mkChar + a following Rf_allocVector + gctorture; presumedly that
CHARSXP would be cleaned up if the GC touched it.


Reply to this email directly or view it on GitHub
https://github.com/RcppCore/Rcpp/pull/310/files#r33959013.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think either of those speak to whether CHARSXPs are gc'd or not. This PR by @kevinushey suggests that they are: hadley/reshape#40

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added a comment at the end of the PR showing that they indeed can be GCed, so we'll have to handle that. :(

eddelbuettel added a commit that referenced this pull request Jul 1, 2015
Add encoding in Rcpp::String class
@eddelbuettel eddelbuettel merged commit 23f2b07 into RcppCore:master Jul 1, 2015
@eddelbuettel
Copy link
Member

No objections at all ... and merged!

@kevinushey
Copy link
Contributor

Doh. I think @hadley is correct about protection of the underlying CHARSXP. Try running this code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector test() {
    Function gc("gc");

    SEXP character = Rf_mkChar("ouch");
    SEXP string = Rf_mkString("ouch");

    // force big allocation + gc
    Shield<SEXP> other(Rf_allocVector(INTSXP, 1E6));
    gc();

    Rprintf("CHARSXP:\n");
    Rf_PrintValue(character);

    Rprintf("STRSXP:\n");
    Rf_PrintValue(string);

    return wrap(other);
}

/*** R
gctorture(TRUE)
head(test())
gctorture(FALSE)
*/

Odds are you will see garbage in the CHARSXP / STRSXP output; e.g. I see:

> Rcpp::sourceCpp('~/scratch/ouch.cpp')
> gctorture(TRUE)
> head(test())
CHARSXP:
[1] "error"
STRSXP:
[1] 2 6
[1] 362789616         1 362789760         1 362789904         1
> gctorture(FALSE)

which is not "ouch" as I guessed (at least for the CHARSXP).

The String class needs to be modified to ensure that the underlying SEXPs lifetime is properly managed.

@jeroen
Copy link
Contributor

jeroen commented Jul 7, 2015

Is the String class not vectorized?

@hadley
Copy link
Contributor

hadley commented Jul 7, 2015

@jeroenooms String is a really wrapper around a CHARSXP, an equivalent of std::string for R strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants