Using UTF-8 in String ctor #280


Closed
thirdwing wants to merge 2 commits


thirdwing
Member

This is for #189. The String ctor now uses UTF-8.

There seems to be much more work to do if we want to support different encodings in String.

Feel free to close this PR.

@eddelbuettel
Member

Looks good to me, though UTF-8 support probably needs a lot more work in more places.

Any thoughts on how to get from here to there?

Any seconds on whether to fold this in, @kevinushey or @jjallaire?

@kevinushey
Contributor

I think this is a step in the right direction, although I think we need to find a compromise between:

  1. String objects that just store some bytes, alongside whatever declared encoding they came with, and
  2. String objects that are always internally stored as UTF-8, but are converted back to the appropriate locale on demand.

I think performance-intensive applications will want to avoid round-trip translations between various encodings, so I am somewhat hesitant to accept this PR right away. Thoughts, @jjallaire?
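
To make the trade-off between the two options concrete, here is a minimal standalone sketch. It is not Rcpp code: `Encoding`, `TaggedString`, `Utf8String`, and the two conversion helpers are hypothetical names, and only Latin-1 is handled as a stand-in for "some non-UTF-8 encoding" because its bytes map directly to code points U+0000 to U+00FF.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tag (illustrative only, not Rcpp's API).
enum class Encoding { Latin1, UTF8 };

// Option 1: store the bytes untouched, next to their declared encoding.
// Construction never transcodes, so there is no hidden conversion cost.
struct TaggedString {
    std::string bytes;
    Encoding    enc;
    TaggedString(std::string b, Encoding e) : bytes(std::move(b)), enc(e) {}
};

// Latin-1 -> UTF-8: bytes below 0x80 are copied, the rest become two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// UTF-8 -> Latin-1 for code points up to U+00FF; anything else becomes '?'.
// Assumes well-formed UTF-8 input.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size()) {
            unsigned char d = static_cast<unsigned char>(in[++i]);
            out.push_back(static_cast<char>(((c & 0x03) << 6) | (d & 0x3F)));
        } else {
            out.push_back('?');
            // Skip the continuation bytes of the unrepresentable code point.
            while (i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}

// Option 2: always UTF-8 internally, converted back on demand.
// Every non-UTF-8 input pays for a conversion in, and another one out.
class Utf8String {
    std::string utf8_;
public:
    Utf8String(const std::string& bytes, Encoding e)
        : utf8_(e == Encoding::Latin1 ? latin1_to_utf8(bytes) : bytes) {}
    const std::string& utf8() const { return utf8_; }
    std::string as_latin1() const { return utf8_to_latin1(utf8_); }
};
```

With option 1, a Latin-1 string that is only ever passed back to Latin-1 consumers is never touched. With option 2, the same string is transcoded once in the constructor and again in `as_latin1()`, which is the round trip the comment above is worried about.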

@jjallaire
Member

The other problem with this is that on Windows the OS interfaces, R itself, and many other libraries assume that single-byte character (char) strings use the system encoding. If we start auto-converting to UTF-8 we will surely break things that currently work.

So any conversion to UTF-8 will need to be explicit (e.g. an encoding parameter on the constructor, a static construction function, etc.).
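
One possible shape for such an explicit API is sketched below. This is an assumption-laden illustration, not Rcpp's `String` class: `EncodedString`, `Encoding`, and `fromUTF8` are hypothetical names, and the sketch only records the declared encoding without performing any transcoding.

```cpp
#include <string>
#include <utility>

// Hypothetical encoding tag (illustrative only).
enum class Encoding { Native, UTF8 };

class EncodedString {
public:
    // Encoding parameter on the constructor. The default keeps today's
    // behavior: bytes are assumed to be in the system encoding and are
    // stored as-is, so existing callers are unaffected.
    explicit EncodedString(std::string bytes, Encoding enc = Encoding::Native)
        : bytes_(std::move(bytes)), enc_(enc) {}

    // Static construction function: the caller states at the call site
    // that the bytes are UTF-8.
    static EncodedString fromUTF8(std::string bytes) {
        return EncodedString(std::move(bytes), Encoding::UTF8);
    }

    const std::string& bytes() const { return bytes_; }
    Encoding encoding() const { return enc_; }

private:
    std::string bytes_;
    Encoding    enc_;
};
```

Under this design `EncodedString s("abc");` behaves as before, while `EncodedString::fromUTF8(...)` or `EncodedString(bytes, Encoding::UTF8)` opts in explicitly, so nothing is converted behind the caller's back.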

(In reply to Kevin Ushey's comment of Sat, Mar 21, 2015 at 5:22 PM.)

@thirdwing
Member Author

Thanks, @jjallaire and @kevinushey. There are many things I didn't know before.


4 participants