String class needs someway to declare encoding of string #263

Closed
hadley opened this Issue Feb 23, 2015 · 19 comments

Projects

None yet

5 participants

@hadley
Contributor
hadley commented Feb 23, 2015

e.g. instead of Rf_mkCharCE("abc", CE_UTF8)

@thirdwing thirdwing added the encoding label Aug 2, 2016
@thirdwing
Member

Fixed in #310

@thirdwing thirdwing closed this Aug 2, 2016
@hadley
Contributor
hadley commented Aug 2, 2016

How do you create a String from a std::string() with a specified encoding? I didn't see a new constructor for that.

@thirdwing
Member

I think I misunderstood why you opened this issue.

Right now we have String(SEXP charsxp, const std::string& enc) (https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/String.h#L91).

If you think String(const std::string& s, const std::string& enc) is necessary, I can add now.

@kevinushey
Contributor

It sounds like we want the whole matrix for:

Inputs: const std::string&, const char*
Encodings: const std::string&, ce_type

@hadley
Contributor
hadley commented Aug 2, 2016 edited

Yeah, exactly (except I don't believe that representing the encoding as a string is really that useful)

@hadley
Contributor
hadley commented Aug 2, 2016

Also, personally, setting the default encoding that doesn't vary from platform to platform would reduce a huge number of bugs in my own code.

@thirdwing
Member

OK. I will update this.

(1) the ctor matrix;

(2) when constructing from const char* or std::string, let's use CE_UTF8 as default.

Any more suggestions?

@hadley
Contributor
hadley commented Aug 2, 2016

I also worry a little that s.set_encoding("UTF8") will silently do the wrong thing

@thirdwing
Member

Can I have more info?

@hadley
Contributor
hadley commented Aug 2, 2016

I made a typo - I should have done s.set_encoding("UTF-8"). It would take a veeeeery long time to spot the problem in code because set_encoding() doesn't raise an error if you supply an encoding name that it doesn't know about.

@jjallaire
Member

I won't pretend to speak with authority here, but one issue I think we will
have if we make CE_UTF8 the default underneath existing code is that code
which currently works with calls to R and calls to the system (all of which
expect the system native encoding) will start breaking when we feed them
the CE_UTF8 encoded strings. This can be remediated by adding a
String::toSystemEncoding method and asking all existing users of the class
to add this wherever they pass a string back to R or to the operating
system.

I think it's definitely desirable if self-contained libraries try to get
everything into UTF-8, but as soon we start forcing this as the default on
all existing code everyone will need to start being aware of the necessary
conversions whenever they interact with software coded in the old-style of
"I know not of encoding" / "I only know of the system encoding" (which is a
lot of software including the OS itself and R).

On Tue, Aug 2, 2016 at 1:00 PM, Qiang Kou (KK) notifications@github.com
wrote:

OK. I will update this.

(1) the ctor matrix;

(2) when constructing from const char* or std::string, let's use CE_UTF8
as default.

Any more suggestions?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXx0ASezE1M5F3TK2tClSxboLbjEK2ks5qb3fAgaJpZM4DkfGf
.

@thirdwing
Member
thirdwing commented Aug 2, 2016 edited

@hadley I know what you mean. Let's remove the string representation of encoding or add an assert.

@jjallaire What about using macros? Using CE_UTF8 as default if RCPP_CE_UTF8 is defined, otherwise, using CE_NATIVE?

@kevinushey
Contributor
kevinushey commented Aug 2, 2016 edited

Most R APIs will attempt to translate from UTF-8 to the system encoding when making system calls; the main issue is that translation can often fail for UTF-8 characters not representable in the system encoding. R typically handles these errors by rendering the unicode code point for the non-translatable UTF-8 character, e.g. converting to <U+9B3C>.

The other commonly seen issue is that many R routines that use system calls will attempt to round-trip strings through the system encoding, e.g. 'UTF-8' => 'system' => 'UTF-8', and since the translation to the system encoding can be lossy (mainly on Windows) you get incorrect results. (One example of this I bumped into recently was with path.expand(); even though in principle all we're trying to do is substitute a leading ~ with some 'home' directory, that path string gets round-tripped and so loses UTF-8 characters not representable in the system encoding)

For what it's worth, I think in RStudio the primary issue we have is that we traffic around UTF-8 strings by default, but when passing those strings to R functions we don't mark the encoding. This means that, on Windows, we might end up with strings with contents that are UTF-8 encoded, but the CHARSXP itself has unknown encoding, and so R assumes our UTF-8 string is actually encoded in the system encoding (which is incorrect on Windows, but is usually correct on systems with UTF-8 locales). We typically resolve this with utf8ToSystem, or by explicitly re-marking the encoding on the R side post-hoc.

@jjallaire
Member

So it sounds like R will mostly do the right thing if the SEXP is marked
with UTF8 encoding. That's good.

I still wonder if a macro might be the way to activate the UTF8 by default
behavior (especially given existing code running with a different default
behavior).

On Tue, Aug 2, 2016 at 1:17 PM, Qiang Kou (KK) notifications@github.com
wrote:

@hadley https://github.com/hadley I know what you mean. Let's remove
the string representation of encoding or add an assert.

@jjallaire https://github.com/jjallaire What about using macros? Using
CE_UTF8 if RCPP_CE_UTF8 is defined, otherwise, using CE_NATIVE?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXx7qmk1r-2DUz9LGfa4NNBXRxI7mjks5qb3u9gaJpZM4DkfGf
.

@hadley
Contributor
hadley commented Aug 2, 2016

@jjallaire I think that change is relatively unlikely to affect much code because few people explicitly construct Strings. There's a separate issue of what this should do:

CharacterVector x(1);
x[0] = "abc"

I don't know whether that implicitly constructs a String or if it's handled internally by CharacterVector.

@jjallaire
Member

It seems quite likely it's handled internally by CharacterVector.

If String is indeed not used that much it wouldn't be much of a band-aid to
rip off to say it's UTF-8 by default. We certainly wouldn't break any
compilations -- there might be subtle runtime differences in a package or
two but this might be a small price to pay for longer-term sanity.

On Tue, Aug 2, 2016 at 2:56 PM, Hadley Wickham notifications@github.com
wrote:

@jjallaire https://github.com/jjallaire I think that change is
relatively unlikely to affect much code because few people explicitly
construct Strings. There's a separate issue of what this should do:

CharacterVector x(1);
x[0] = "abc"

I don't know whether that implicitly constructs a String or if it's
handled internally by CharacterVector.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXxxQa70JlHwzaAlxgKjggY6ohLLJYks5qb5LZgaJpZM4DkfGf
.

@hadley
Contributor
hadley commented Aug 2, 2016

Also the current behaviour makes it very easy to introduce bugs that only appear on Windows with non-English text. So if things do break with the change, it's likely to be just making existing breakage less silent.

@thirdwing
Member

After defining RCPP_STRING_DEBUG_LEVEL,

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
void test(){
  Rcout << "No String ctor" << std::endl;
  CharacterVector x(1);
  x[0] = "abc";
  Rcout << "String ctor" << std::endl;
  String y("abc");
  String z = "abc";
}
/*
> Rcpp::sourceCpp("test.cpp")
> test()
No String ctor
String ctor
                           Rcpp/String.h: 145 String(const char*, cetype_t)
                           Rcpp/String.h: 145 String(const char*, cetype_t)
*/
@eddelbuettel
Member
eddelbuettel commented Aug 2, 2016 edited

Reopening at @coatless 's suggestion.

@eddelbuettel eddelbuettel reopened this Aug 2, 2016
@eddelbuettel eddelbuettel added a commit that closed this issue Aug 3, 2016
@eddelbuettel eddelbuettel Merging #529 (Close #263)
From branch 'string_ctor2' of https://github.com/thirdwing/Rcpp into feature/pr529
0460f08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment