String class needs someway to declare encoding of string #263

hadley · 2015-02-23T19:43:08Z

e.g. instead of Rf_mkCharCE("abc", CE_UTF8)

The text was updated successfully, but these errors were encountered:

thirdwing · 2016-08-02T16:22:08Z

Fixed in #310

hadley · 2016-08-02T16:40:18Z

How do you create a String from a std::string() with a specified encoding? I didn't see a new constructor for that.

thirdwing · 2016-08-02T16:49:58Z

I think I misunderstood why you opened this issue.

Right now we have String(SEXP charsxp, const std::string& enc) (https://github.com/RcppCore/Rcpp/blob/master/inst/include/Rcpp/String.h#L91).

If you think String(const std::string& s, const std::string& enc) is necessary, I can add now.

kevinushey · 2016-08-02T16:51:02Z

It sounds like we want the whole matrix for:

Inputs: const std::string&, const char*
Encodings: const std::string&, ce_type

hadley · 2016-08-02T16:51:45Z

Yeah, exactly (except I don't believe that representing the encoding as a string is really that useful)

hadley · 2016-08-02T16:52:37Z

Also, personally, setting the default encoding that doesn't vary from platform to platform would reduce a huge number of bugs in my own code.

thirdwing · 2016-08-02T17:00:48Z

OK. I will update this.

(1) the ctor matrix;

(2) when constructing from const char* or std::string, let's use CE_UTF8 as default.

Any more suggestions?

hadley · 2016-08-02T17:03:52Z

I also worry a little that s.set_encoding("UTF8") will silently do the wrong thing

thirdwing · 2016-08-02T17:05:10Z

Can I have more info?

hadley · 2016-08-02T17:06:20Z

I made a typo - I should have done s.set_encoding("UTF-8"). It would take a veeeeery long time to spot the problem in code because set_encoding() doesn't raise an error if you supply an encoding name that it doesn't know about.

jjallaire · 2016-08-02T17:08:53Z

I won't pretend to speak with authority here, but one issue I think we will
have if we make CE_UTF8 the default underneath existing code is that code
which currently works with calls to R and calls to the system (all of which
expect the system native encoding) will start breaking when we feed them
the CE_UTF8 encoded strings. This can be remediated by adding a
String::toSystemEncoding method and asking all existing users of the class
to add this wherever they pass a string back to R or to the operating
system.

I think it's definitely desirable if self-contained libraries try to get
everything into UTF-8, but as soon we start forcing this as the default on
all existing code everyone will need to start being aware of the necessary
conversions whenever they interact with software coded in the old-style of
"I know not of encoding" / "I only know of the system encoding" (which is a
lot of software including the OS itself and R).

On Tue, Aug 2, 2016 at 1:00 PM, Qiang Kou (KK) notifications@github.com
wrote:

OK. I will update this.

(1) the ctor matrix;

(2) when constructing from const char* or std::string, let's use CE_UTF8
as default.

Any more suggestions?

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXx0ASezE1M5F3TK2tClSxboLbjEK2ks5qb3fAgaJpZM4DkfGf
.

thirdwing · 2016-08-02T17:17:48Z

@hadley I know what you mean. Let's remove the string representation of encoding or add an assert.

@jjallaire What about using macros? Using CE_UTF8 as default if RCPP_CE_UTF8 is defined, otherwise, using CE_NATIVE?

kevinushey · 2016-08-02T17:23:15Z

Most R APIs will attempt to translate from UTF-8 to the system encoding when making system calls; the main issue is that translation can often fail for UTF-8 characters not representable in the system encoding. R typically handles these errors by rendering the unicode code point for the non-translatable UTF-8 character, e.g. converting 鬼 to <U+9B3C>.

The other commonly seen issue is that many R routines that use system calls will attempt to round-trip strings through the system encoding, e.g. 'UTF-8' => 'system' => 'UTF-8', and since the translation to the system encoding can be lossy (mainly on Windows) you get incorrect results. (One example of this I bumped into recently was with path.expand(); even though in principle all we're trying to do is substitute a leading ~ with some 'home' directory, that path string gets round-tripped and so loses UTF-8 characters not representable in the system encoding)

For what it's worth, I think in RStudio the primary issue we have is that we traffic around UTF-8 strings by default, but when passing those strings to R functions we don't mark the encoding. This means that, on Windows, we might end up with strings with contents that are UTF-8 encoded, but the CHARSXP itself has unknown encoding, and so R assumes our UTF-8 string is actually encoded in the system encoding (which is incorrect on Windows, but is usually correct on systems with UTF-8 locales). We typically resolve this with utf8ToSystem, or by explicitly re-marking the encoding on the R side post-hoc.

jjallaire · 2016-08-02T17:25:18Z

So it sounds like R will mostly do the right thing if the SEXP is marked
with UTF8 encoding. That's good.

I still wonder if a macro might be the way to activate the UTF8 by default
behavior (especially given existing code running with a different default
behavior).

On Tue, Aug 2, 2016 at 1:17 PM, Qiang Kou (KK) notifications@github.com
wrote:

@hadley https://github.com/hadley I know what you mean. Let's remove
the string representation of encoding or add an assert.

@jjallaire https://github.com/jjallaire What about using macros? Using
CE_UTF8 if RCPP_CE_UTF8 is defined, otherwise, using CE_NATIVE?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXx7qmk1r-2DUz9LGfa4NNBXRxI7mjks5qb3u9gaJpZM4DkfGf
.

hadley · 2016-08-02T18:56:25Z

@jjallaire I think that change is relatively unlikely to affect much code because few people explicitly construct Strings. There's a separate issue of what this should do:

CharacterVector x(1);
x[0] = "abc"

I don't know whether that implicitly constructs a String or if it's handled internally by CharacterVector.

jjallaire · 2016-08-02T19:01:00Z

It seems quite likely it's handled internally by CharacterVector.

If String is indeed not used that much it wouldn't be much of a band-aid to
rip off to say it's UTF-8 by default. We certainly wouldn't break any
compilations -- there might be subtle runtime differences in a package or
two but this might be a small price to pay for longer-term sanity.

On Tue, Aug 2, 2016 at 2:56 PM, Hadley Wickham notifications@github.com
wrote:

@jjallaire https://github.com/jjallaire I think that change is
relatively unlikely to affect much code because few people explicitly
construct Strings. There's a separate issue of what this should do:

CharacterVector x(1);
x[0] = "abc"

I don't know whether that implicitly constructs a String or if it's
handled internally by CharacterVector.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#263 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAGXxxQa70JlHwzaAlxgKjggY6ohLLJYks5qb5LZgaJpZM4DkfGf
.

hadley · 2016-08-02T19:06:08Z

Also the current behaviour makes it very easy to introduce bugs that only appear on Windows with non-English text. So if things do break with the change, it's likely to be just making existing breakage less silent.

thirdwing · 2016-08-02T19:09:01Z

After defining RCPP_STRING_DEBUG_LEVEL,

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
void test(){
  Rcout << "No String ctor" << std::endl;
  CharacterVector x(1);
  x[0] = "abc";
  Rcout << "String ctor" << std::endl;
  String y("abc");
  String z = "abc";
}
/*
> Rcpp::sourceCpp("test.cpp")
> test()
No String ctor
String ctor
                           Rcpp/String.h: 145 String(const char*, cetype_t)
                           Rcpp/String.h: 145 String(const char*, cetype_t)
*/

eddelbuettel · 2016-08-02T20:40:02Z

Reopening at @coatless 's suggestion.

thirdwing added the encoding label Aug 2, 2016

thirdwing closed this as completed Aug 2, 2016

thirdwing mentioned this issue Aug 2, 2016

Rcpp::String ctor update(close #263) #529

Merged

eddelbuettel reopened this Aug 2, 2016

eddelbuettel closed this as completed in 0460f08 Aug 3, 2016

eddelbuettel added a commit that referenced this issue Aug 3, 2016

Second step of merging #529 (close #263)

c0ad307

coatless mentioned this issue Aug 12, 2019

String encoding is dropped in C++ round-trip through String class #988

Closed

knapply mentioned this issue Nov 3, 2019

Encoding, std::string, and Rcpp::String SymbolixAU/jsonify#56

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String class needs someway to declare encoding of string #263

String class needs someway to declare encoding of string #263

hadley commented Feb 23, 2015

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

kevinushey commented Aug 2, 2016

hadley commented Aug 2, 2016 •

edited

Loading

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

jjallaire commented Aug 2, 2016

thirdwing commented Aug 2, 2016 •

edited

Loading

kevinushey commented Aug 2, 2016 •

edited

Loading

jjallaire commented Aug 2, 2016

hadley commented Aug 2, 2016

jjallaire commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

eddelbuettel commented Aug 2, 2016 •

edited

Loading

String class needs someway to declare encoding of string #263

String class needs someway to declare encoding of string #263

Comments

hadley commented Feb 23, 2015

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

kevinushey commented Aug 2, 2016

hadley commented Aug 2, 2016 • edited Loading

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

hadley commented Aug 2, 2016

jjallaire commented Aug 2, 2016

thirdwing commented Aug 2, 2016 • edited Loading

kevinushey commented Aug 2, 2016 • edited Loading

jjallaire commented Aug 2, 2016

hadley commented Aug 2, 2016

jjallaire commented Aug 2, 2016

hadley commented Aug 2, 2016

thirdwing commented Aug 2, 2016

eddelbuettel commented Aug 2, 2016 • edited Loading

hadley commented Aug 2, 2016 •

edited

Loading

thirdwing commented Aug 2, 2016 •

edited

Loading

kevinushey commented Aug 2, 2016 •

edited

Loading

eddelbuettel commented Aug 2, 2016 •

edited

Loading