Unicode characters in the JavaScript kata output are being mangled.
Each byte of the UTF-8 encoding seems to be printing as a separate Unicode character. So the Chinese greeting 你好 displays as ä½ å¥½.
This is very bad for anyone needing more than 7-bit ASCII.
Now for some examples of what I think may be happening. This code in a JavaScript kata:
console.log("£");
displays as the two characters Â£ ("\u00c2\u00a3"). The Unicode code point for £ ("\u00a3") is normally encoded in UTF-8 as the two bytes 0xc2 0xa3. But Codewars apparently treats each of those bytes, 0xc2 and 0xa3, as a code point of its own and re-encodes it, to get Â£.
This:
console.log("\uffff")
is displayed as three characters ï¿¿ ("\u00ef\u00bf\u00bf"). The Unicode code point 0xffff is normally encoded in UTF-8 as the three bytes 0xef 0xbf 0xbf. But as above, Codewars then seems to re-encode each of 0xef, 0xbf, 0xbf separately to get ï¿¿.
I could give as many examples as there are multiple-byte UTF-8 encodings, but this suffices to show the pattern for a single character. Longer strings just repeat the problem, so that console.log("£££££"); displays as Â£Â£Â£Â£Â£ for example.
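To make the suspected pattern concrete, here is a minimal sketch that reproduces the same output locally. It assumes Node.js and its built-in Buffer; the helper names mangle and unmangle are made up for illustration, and this is only a guess at what happens inside Codewars, not its actual code.

```javascript
// Sketch of the suspected bug. `mangle` and `unmangle` are hypothetical
// helper names, not anything from Codewars itself.
// Assumes Node.js, whose Buffer supports the "utf8" and "latin1" encodings.

// Encode the string as UTF-8, then misread every byte as one Latin-1 character.
function mangle(text) {
  return Buffer.from(text, "utf8").toString("latin1");
}

// Reverse the damage: turn each character back into a byte, then decode as UTF-8.
function unmangle(text) {
  return Buffer.from(text, "latin1").toString("utf8");
}

console.log(mangle("£") === "\u00c2\u00a3");                        // true
console.log(mangle("\uffff") === "\u00ef\u00bf\u00bf");             // true
console.log(mangle("你好") === "\u00e4\u00bd\u00a0\u00e5\u00a5\u00bd"); // true
console.log(unmangle(mangle("£££££")) === "£££££");                 // true
```

If the guess is right, the output is produced correctly as UTF-8 and only decoded with the wrong character set somewhere afterwards, which would also explain why the damage can be reversed.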
As I said, this seems to be pretty serious for anyone needing Unicode.
Note: I discovered this while completing Simple Change Machine, which uses the pound symbol.
The Chinese Numeral Encoder kata has lots of Unicode characters, so it is affected by this bug. In this screenshot it is noticeable in the test output.