broken unicode uses <U+884C> which needs to be escaped #21

Open
flying-sheep opened this issue Dec 4, 2015 · 21 comments

Comments

@flying-sheep
Member

@takluyver said in IRkernel/IRkernel#224 (comment):

I ran into another unicode issue while testing this. If R thinks it can't display a character, it escapes it like this: <U+884C> (vs Python style \u884c). These sequences are being included raw in the HTML that repr produces, so the browser tries to interpret them as HTML tags and doesn't show anything. repr should probably be escaping strings for the HTML representation.

please tell me what makes R output this.

probably a good idea to html-encode all character arrays before repr_htmling them, but still…

@takluyver
Member

I came across it on Windows - e.g. by trying print("行政法") in a notebook. I would assume that R tries to determine what encoding the system uses, and if that encoding can't handle the code point in question, it escapes it to the <U+884C> format.

Ideally, our output should bypass that unicode escaping and just send the real unicode code points. But either way, strings in the HTML output need to be HTML escaped so that you can use <, > and & in strings and have them display correctly.
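As a rough illustration of the escaping described above, here is a minimal sketch in base R. The helper name `html_escape` is hypothetical and is not claimed to be what repr actually uses; it only shows the idea of replacing `&`, `<` and `>` before a string is placed into HTML.

```r
# Hypothetical helper (not repr's actual implementation):
# escape the three characters that are special in HTML text.
html_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must come first, otherwise
                                            # the entities below get re-escaped
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# An escape sequence as R prints it no longer looks like an HTML tag:
html_escape("<U+884C> & friends")
```

With this in place the browser would show the literal text `<U+884C>` instead of silently dropping an unknown tag.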

@jankatins
Contributor

See also #28

@jankatins
Contributor

Ok, this works in RStudio:

> print("行政法")
[1] "行政法"

but not in the notebook:

> print("行政法")
[1] "<U+884C><U+653F><U+6CD5>"

@flying-sheep
Member Author

[screenshot: spectacle t23011]

works for me. also we send encoding now. hmm. are you sure this happens with newest everything?

@jankatins
Contributor

Still happening with the newest everything...

@flying-sheep You are on a non-windows system?

@jankatins
Contributor

Found this blog post mentioning the problem, but haven't looked deep enough to understand what's going on... https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

@flying-sheep
Member Author

> You are on a non-windows system?

Yeh

@jankatins
Contributor

Yikes:

x = "A行政法ß"
nchar(x)
x
26
"A<U+884C><U+653F><U+6CD5>ß"

My interpretation is that the string is already wrong when it comes in?
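The length of 26 is consistent with each of the three CJK characters having been replaced by an 8-character `<U+XXXX>` sequence, while "A" and "ß" stay as single characters. The sketch below imitates that format to check the arithmetic. `escape_like_r` is a hypothetical helper, not R's internal code, and the cut-off at code point 255 is an assumption standing in for "representable in a Latin-1 native encoding".

```r
# Hypothetical imitation of the <U+XXXX> escape format.
# Assumption: code points up to 255 count as representable (Latin-1).
escape_like_r <- function(x) {
  cps <- utf8ToInt(x)  # one integer code point per character
  parts <- vapply(cps, function(cp) {
    if (cp <= 255L) intToUtf8(cp) else sprintf("<U+%04X>", cp)
  }, character(1))
  paste(parts, collapse = "")
}

# \u escapes keep this snippet ASCII-only: A, U+884C, U+653F, U+6CD5, sharp s
x <- "A\u884c\u653f\u6cd5\u00df"
nchar(x)                 # 5 characters in the real string
nchar(escape_like_r(x))  # 1 + 3 * 8 + 1 = 26
```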

@jankatins
Contributor

Even clearer:

"\u8FDB"

Produces this:

"<U+6CD5>进"

@jankatins
Contributor

My current guess is that this is happening in evaluate -> see last element...

Input cell:

x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)

Output in the notebook:

8
1
"<U+6CD5>"
"进"
[1] "<U+6CD5>"
[1] "<U+8FDB>"

Using IRkernel/IRkernel#293, this is what ends up in the file log:

2016-04-10 22:26:10 DEBUG: main loop: after poll
2016-04-10 22:26:10 DEBUG: main loop: shell
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_input
2016-04-10 22:26:10 DEBUG: Executing code: x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 8"
 $ text/html    : chr "8"
 $ text/markdown: chr "8"
 $ text/latex   : chr "8"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 1"
 $ text/html    : chr "1"
 $ text/markdown: chr "1"
 $ text/latex   : chr "1"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+6CD5>\""
 $ text/html    : chr "\"&lt;U+6CD5&gt;\""
 $ text/markdown: chr "\"&lt;U+6CD5&gt;\""
 $ text/latex   : chr "\"<U+6CD5>\""
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+8FDB>\""
 $ text/html    : chr "\"<U+8FDB>\"""| __truncated__
 $ text/markdown: chr "\"<U+8FDB>\"""| __truncated__
 $ text/latex   : chr "\"<U+8FDB>\"""| __truncated__
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+6CD5>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+8FDB>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_reply
2016-04-10 22:26:10 DEBUG: main loop: beginning

Guess:

8 # it's fine when it comes from zmq (see log), but it's already screwed up when it gets executed
1 # evaluate parses the unicode escape to a single value -> everything is fine
"<U+6CD5>" # ditto above
"进" # printing in the context of the kernel of a returned value is ok
[1] "<U+6CD5>" # no change...
[1] "<U+8FDB>" # but printing in evaluate will screw up the unicode again

So it looks like evaluate needs some encoding handling, both in and out?

@jankatins
Contributor

https://stat.ethz.ch/R-manual/R-devel/library/base/html/source.html -> Encoding section

This is what I get on my windows R:

> localeToCharset()
[1] "ISO8859-1"

And this is what I get on my NAS (linux based, hadleyverse docker image):

> localeToCharset()
[1] "UTF-8"     "ISO8859-1"
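A related check that may be handy when reproducing this: base R's `l10n_info()` reports whether the current session runs in a UTF-8 locale. The wrapper name `session_is_utf8` is made up for this sketch.

```r
# Hypothetical wrapper around base R's l10n_info().
# Returns TRUE when the session's native encoding is UTF-8.
session_is_utf8 <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

session_is_utf8()
localeToCharset()  # the charset(s) R derives from the current locale
```

On a session where this returns FALSE, characters outside the native encoding are the ones likely to end up as `<U+XXXX>`.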

@jankatins
Contributor

And here is an example of the evaluate problem (both executed in an RStudio window...):

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"

l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)

x <- evaluate(code, output_handler = oh)
l

Windows:

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '<U+6CD5>', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 8\n" #> bad in

[[2]]
[1] "[1] 1\n" # ok if escaped...

[[3]]
[1] "[1] \"<U+6CD5>\"\n" # -> Just the bad in

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> but here it's bad out...

Linux (NAS):

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '法', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 1\n"

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"法\"\n"

[[4]]
[1] "[1] \"进\"\n"

@jankatins
Contributor

And even further down for the input problem:

Windows:

> parse(text='"法 \\u8FDB"')
expression("<U+6CD5> \u8FDB")

Linux:

> parse(text='"法 \\u8FDB"')
expression("法 \u8FDB")

@jankatins
Contributor

If someone wants to have fun: c sources of parse: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/main/source.c#L193

I tried to set my locale, but everything I tried was rejected by Sys.setlocale(...).

@flying-sheep
Member Author

thanks for digging into this. i think you were almost there. parse has an encoding argument.

i filed r-lib/evaluate#66.

depending on how it is resolved (automatically/manually) we might need to extract and specify the encoding when calling a fixed/enhanced version of evaluate or not.
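For reference, a minimal sketch of passing the encoding explicitly to `parse()`. The `\u` escape keeps the snippet itself ASCII-only. Whether the argument changes the result on a non-UTF-8 Windows session is the open question here, so no expected output is shown.

```r
# Source text containing one non-ASCII character (U+6CD5).
# A string literal with a \u escape is marked as UTF-8 by R.
code <- "x <- '\u6cd5'"
Encoding(code)

# parse() takes an `encoding` argument describing the input text.
expr <- parse(text = code, encoding = "UTF-8")

# Evaluate the parsed assignment and inspect the result.
eval(expr[[1]])
nchar(x)  # 1 if the character survived, 8 if it became "<U+6CD5>"
```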

@jankatins
Contributor

I tried that argument and it didn't make any difference :-(

@jankatins
Contributor

I updated r-lib/evaluate#66 with code examples which demonstrate what goes wrong here...

@jankatins
Contributor

Current status here: it's an upstream bug and we have some workarounds (warn on unicode input and don't send the ellipsis char on such systems). So not a blocker for the next release IMO -> restore the milestone if you have a different opinion...
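A rough sketch of what a "warn on unicode input" check could look like. This is not the kernel's actual code: the function name `warn_on_non_ascii` and the warning text are invented for illustration, and it assumes the incoming code arrives as a single string.

```r
# Hypothetical check, assuming `code` is a length-1 character vector.
# Warns when the code contains non-ASCII characters but the session
# is not running in a UTF-8 locale.
warn_on_non_ascii <- function(code) {
  non_ascii <- any(utf8ToInt(enc2utf8(code)) > 127L)
  if (non_ascii && !isTRUE(l10n_info()[["UTF-8"]])) {
    warning("Input contains non-ASCII characters, but the session is not ",
            "UTF-8; they may be escaped as <U+XXXX>.", call. = FALSE)
  }
  invisible(non_ascii)  # TRUE if any non-ASCII character was found
}

warn_on_non_ascii("x <- 1")
warn_on_non_ascii("x <- '\u6cd5'")
```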

jankatins removed this from the 0.5 milestone Apr 21, 2016

@takluyver
Member

But HTML output is now being escaped, right? So you can at least see <U+884C>?

@jankatins
Contributor

jankatins commented Apr 21, 2016

> But HTML output is now being escaped, right? So you can at least see <U+884C>?

Yes and no: yes because html is escaped and no, because of #43 I see three dots (=3 chars).

But "OUT" is not the problem: you always see something, it's just escaped in the funny <U+xxxx> format and therefore not copy-and-pasteable... "IN" is the bigger problem, but that was taken care of in IRkernel/IRkernel#296
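If the escaped output needs to be turned back into real characters after the fact, something along these lines might work. `decode_u_escapes` is a hypothetical helper and not part of repr or IRkernel. It assumes the escapes have the `<U+XXXX>` shape with 4 to 8 hex digits, and on a session whose native encoding cannot represent the characters, printing the result may simply escape them again.

```r
# Hypothetical helper: turn "<U+XXXX>" sequences back into characters.
decode_u_escapes <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    hex <- substr(esc, 4L, nchar(esc) - 1L)  # strip "<U+" and ">"
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

decoded <- decode_u_escapes("A<U+884C><U+653F><U+6CD5>")
utf8ToInt(decoded)  # code points of the decoded string
```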

@flying-sheep
Member Author

Since version 4.2, R supports UTF-8 on Windows. Is there anything one needs to do there, or will it just work?
