Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encoding in Rcpp::String class #310

Merged
merged 7 commits into from Jul 1, 2015
Merged

Conversation

@thirdwing
Copy link
Member

@thirdwing thirdwing commented Jun 24, 2015

Encoding info in String class and getter/setter.

CE_NATIVE is used if no encoding info provided.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 25, 2015

Thanks, @thirdwing -- this looks pretty from afar (Hi from Zuerich).

I'll try to set up full revdep run "just in case" but because of travels it may take me a moment to get to it.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 28, 2015

I was just about to start a full rev.dep run when I noticed that -Wreorder issues about 1700 lines of warnings. Here is the beginning of it:

edd@max:~/git$ R CMD INSTALL Rcpp_0.11.6.2.1.tar.gz                                                                                                                                                    
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘Rcpp’ ...
** libs
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/     -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -O3 -Wall -pipe -Wno-unused -pedanti\
c -c barrier.cpp -o barrier.o
ccache g++ -I/usr/share/R/include -DNDEBUG -I../inst/include/     -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g  -O3 -Wall -pipe -Wno-unused -pedanti\
c -c Module.cpp -o Module.o
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h: In constructor ‘Rcpp::String::String()’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
         bool buffer_ready ;
              ^
../inst/include/Rcpp/String.h:411:18: warning:   ‘cetype_t Rcpp::String::enc’ [-Wreorder]
         cetype_t enc;
                  ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h:56:9: warning:   when initialized here [-Wreorder]
         String( ): data( Rf_mkChar("") ), buffer(), valid(true), buffer_ready(true), enc(CE_NATIVE) {
         ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h: In copy constructor ‘Rcpp::String::String(const Rcpp::String&)’:
../inst/include/Rcpp/String.h:420:14: warning: ‘Rcpp::String::buffer_ready’ will be initialized after [-Wreorder]
         bool buffer_ready ;
              ^
../inst/include/Rcpp/String.h:411:18: warning:   ‘cetype_t Rcpp::String::enc’ [-Wreorder]
         cetype_t enc;
                  ^
In file included from ../inst/include/Rcpp/Vector.h:67:0,
                 from ../inst/include/Rcpp.h:38,
                 from Module.cpp:24:
../inst/include/Rcpp/String.h:61:9: warning:   when initialized here [-Wreorder]
         String( const String& other) : data( other.get_sexp()), valid(true), buffer_ready(false), enc(Rf_getCharCE(other.get_sexp())) {
         ^
[... 1600+ lines ... ]

My CXXFLAGS here at home is CXXFLAGS += -O3 -Wall -pipe -Wno-unused -pedantic

Qiang Kou
@thirdwing
Copy link
Member Author

@thirdwing thirdwing commented Jun 28, 2015

It is my fault. The class members should be initialized in the same order of declaration.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 28, 2015

Thanks for the quick fix!

Updated the branch here, and now launched a test. May not see the end of it before I go to bed here in Copenhagen...

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 29, 2015

I think we can merge this. The revdep check had 382 good, 21 bad -- and the first 'bad' ones I looked at yesterday where all the usual test failures when suggested packages (which I do not install) were expected (as the package did not test for them) -- a bug in that package in my book. More importantly, no new build issues as best as I can tell. I'll try to put up the usual summary in a bit.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 29, 2015

JJ had a question regarding a if (valid), but we convinced / reminded ourselves that that is simply the toggle for wide vs normal strings (btw: should we do s/valid/wide/ in String.h ?). So I think we're still good.

@thirdwing
Copy link
Member Author

@thirdwing thirdwing commented Jun 29, 2015

I think you mean this line https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L370

It follows the setData(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L429) function.

Sometimes the CHARSXP is not valid when constructing a String, like from const char*(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L110) or const std::string(https://github.com/thirdwing/Rcpp/blob/master/inst/include/Rcpp/String.h#L101).

We need to make sure the CHARSXP is valid when setting encoding.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jun 29, 2015

Thanks for the follow-up, KK. To me this looks good, but we may as well let JJ have another look at it. He just scored one really good comment the other day so his karma is higher than ever :)

Greetings from Denmark to you. Keep us posted too on the compiler endeavour if there is anything new, and hope you get to enjoy Seattle.

@thirdwing
Copy link
Member Author

@thirdwing thirdwing commented Jun 29, 2015

Of course I need his comments.

For me comments are more important than accepting one PR. It is the way to learn from accomplished programmers. ;)

enc = encoding;

if (valid) {
data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);

This comment has been minimized.

@jjallaire

jjallaire Jun 30, 2015
Member

How is the underlying SEXP for the String class managed? When we create it to do we PROTECT it and/or do we need to? When we assign over the top of it (as is done here) do we need to do any additional UNPROTECT?

This comment has been minimized.

@thirdwing

thirdwing Jul 1, 2015
Author Member

I have to admit I am not sure about this problem.

According to some similar SEXP assignment code, I don't think we need additional UNPROTECT here.

@eddelbuettel , can you please give us a clear answer? Thank you!

This comment has been minimized.

@kevinushey

kevinushey Jul 1, 2015
Contributor

Because the underlying SEXP is a CHARSXP, I think its lifetime is essentially 'infinite' since once a string enters the string pool R won't clean it out. At least, that's the assumption the String class is making all over the place; this probably won't be safe assumption 'some day' but is probably permissible for now.

This comment has been minimized.

@jjallaire

jjallaire Jul 1, 2015
Member

Okay, agreed. I think we are good to merge (if I hear no objections by
tomorrow AM I'll go ahead and merge).

On Wed, Jul 1, 2015 at 6:19 PM, Kevin Ushey notifications@github.com
wrote:

In inst/include/Rcpp/String.h
#310 (comment):

  •            case CE_BYTES:
    
  •                return "bytes";
    
  •            case CE_LATIN1:
    
  •                return "latin1";
    
  •            case CE_UTF8:
    
  •                return "UTF-8";
    
  •            default:
    
  •                return "unknown";
    
  •        }
    
  •    }
    
  •    inline void set_encoding( cetype_t encoding ) {
    
  •        enc = encoding;
    
  •        if (valid) {
    
  •            data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
    

Because the underlying SEXP is a CHARSXP, I think its lifetime is
essentially 'infinite' since once a string enters the string pool R won't
clean it out. At least, that's the assumption the String class is making
all over the place; this probably won't be safe assumption 'some day' but
is probably permissible for now.


Reply to this email directly or view it on GitHub
https://github.com/RcppCore/Rcpp/pull/310/files#r33695590.

This comment has been minimized.

@hadley

hadley Jul 6, 2015
Contributor

@kevinushey are you sure that's true? I thought CHARSXPs were garbage collected, but normally you don't have to worry about PROTECT() because they immediately go into a STRSXP. (But I can't think of any easy way to test that hypothesis)

This comment has been minimized.

@kevinushey

kevinushey Jul 6, 2015
Contributor

I think CHARSXPs enter a global string cache (it's something I vaguely remember reading about a long time ago) but I think it's worth confirming / testing. It should be easy to make an example with e.g. an unprotected Rf_mkChar + a following Rf_allocVector + gctorture; presumedly that CHARSXP would be cleaned up if the GC touched it.

This comment has been minimized.

@jjallaire

jjallaire Jul 6, 2015
Member

From Writing R Extensions (
http://cran.rstudio.com/doc/manuals/r-devel/R-exts.html#Handling-character-data
):

CHARSXPs are read-only objects and must never be modified. In particular,
the C-style string contained in a CHARSXP should be treated as read-only
and for this reason the CHAR function used to access the character data of
a CHARSXP returns (const char *) (this also allows compilers to issue
warnings about improper use). Since CHARSXPs are immutable, the same
CHARSXP can
be shared by any STRSXP needing an element representing the same string. R
maintains a global cache of CHARSXPs so that there is only ever one
CHARSXP representing
a given string in memory.

On Mon, Jul 6, 2015 at 1:25 PM, Kevin Ushey notifications@github.com
wrote:

In inst/include/Rcpp/String.h
#310 (comment):

  •            case CE_BYTES:
    
  •                return "bytes";
    
  •            case CE_LATIN1:
    
  •                return "latin1";
    
  •            case CE_UTF8:
    
  •                return "UTF-8";
    
  •            default:
    
  •                return "unknown";
    
  •        }
    
  •    }
    
  •    inline void set_encoding( cetype_t encoding ) {
    
  •        enc = encoding;
    
  •        if (valid) {
    
  •            data = Rf_mkCharCE(Rf_translateCharUTF8(data), encoding);
    

I think CHARSXPs enter a global string cache (it's something I vaguely
remember reading about a long time ago) but I think it's worth confirming /
testing. It should be easy to make an example with e.g. an unprotected
Rf_mkChar + a following Rf_allocVector + gctorture; presumedly that
CHARSXP would be cleaned up if the GC touched it.


Reply to this email directly or view it on GitHub
https://github.com/RcppCore/Rcpp/pull/310/files#r33959013.

This comment has been minimized.

@hadley

hadley Jul 6, 2015
Contributor

I don't think either of those speak to whether CHARSXPs are gc'd or not. This PR by @kevinushey suggests that they are: hadley/reshape#40

This comment has been minimized.

@kevinushey

kevinushey Jul 6, 2015
Contributor

I just added a comment at the end of the PR showing that they indeed can be GCed, so we'll have to handle that. :(

eddelbuettel added a commit that referenced this pull request Jul 1, 2015
Add encoding in Rcpp::String class
@eddelbuettel eddelbuettel merged commit 23f2b07 into RcppCore:master Jul 1, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Jul 1, 2015

No objections at all ... and merged!

eddelbuettel added a commit that referenced this pull request Jul 2, 2015
@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jul 6, 2015

Doh. I think @hadley is correct about protection of the underlying CHARSXP. Try running this code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector test() {
    Function gc("gc");

    SEXP character = Rf_mkChar("ouch");
    SEXP string = Rf_mkString("ouch");

    // force big allocation + gc
    Shield<SEXP> other(Rf_allocVector(INTSXP, 1E6));
    gc();

    Rprintf("CHARSXP:\n");
    Rf_PrintValue(character);

    Rprintf("STRSXP:\n");
    Rf_PrintValue(string);

    return wrap(other);
}

/*** R
gctorture(TRUE)
head(test())
gctorture(FALSE)
*/

Odds are you will see garbage in the CHARSXP / STRSXP output; e.g. I see:

> Rcpp::sourceCpp('~/scratch/ouch.cpp')
> gctorture(TRUE)
> head(test())
CHARSXP:
[1] "error"
STRSXP:
[1] 2 6
[1] 362789616         1 362789760         1 362789904         1
> gctorture(FALSE)

which is not "ouch" as I guessed (at least for the CHARSXP).

The String class needs to be modified to ensure that the underlying SEXPs lifetime is properly managed.

@jeroen
Copy link
Contributor

@jeroen jeroen commented Jul 7, 2015

Is the String class not vectorized?

@hadley
Copy link
Contributor

@hadley hadley commented Jul 7, 2015

@jeroenooms String is a really wrapper around a CHARSXP, an equivalent of std::string for R strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.