-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle lone surrogates #17
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left some feedback. Thanks for your work on this!
src/index.ts
Outdated
result += '\\"'; | ||
} else if (char in escaped) { | ||
result += escaped[char]; | ||
} else if ((code >= 0xD800 && code <= 0xDBFF)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: redundant parens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, thanks
test/test.ts
Outdated
describe('strings', () => { | ||
test('newline', 'a\nb', JSON.stringify('a\nb')); | ||
test('double quotes', '"yar"', JSON.stringify('"yar"')); | ||
test('lone surrogate', "\uD800", '"\\uD800"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests I'd consider adding for this change (with the ones that are already there marked as completed):
- a surrogate pair, e.g.
'a\uD800\uDC00b'
- a surrogate pair in the wrong order, e.g.
'a\uDC00\uD800b'
- a lone high surrogate, e.g.
'a\uD800b'
- a lone low surrogate, e.g.
'a\uDC00b'
- two lone high surrogates in a row, e.g.
'a\uD800\uD800b'
- two lone low surrogates in a row, e.g.
'a\uDC00\uDC00b'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added these tests, though I'm not entirely clear on what the expected values are 😂 — should any combination of surrogates other than [high, low] be replaced with \\u[XXXX]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly; anything that is not [high, low] is not a valid surrogate pair, but rather a series of lone surrogates.
When in doubt, you can compare the results to JSON.stringify
in a recent V8 build (perhaps using jsvu to easily grab the latest binary).
src/index.ts
Outdated
} else if ((code >= 0xD800 && code <= 0xDBFF)) { | ||
const next = str.charCodeAt(i + 1); | ||
if (next >= 0xDC00 && next <= 0xDFFF) { | ||
result += char; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could append the next char here as well and increment i
, saving a loop iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/index.ts
Outdated
'\n': '\\n', | ||
'\r': '\\r', | ||
'\t': '\\t', | ||
'\0': '\\u0000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just escape this as \0
? (It's not an octal escape in JS.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the goal to match JSON.stringify
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not an explicit goal, I just didn't realise it wasn't necessary
src/index.ts
Outdated
@@ -11,7 +11,6 @@ const escaped: Record<string, string> = { | |||
'\n': '\\n', | |||
'\r': '\\r', | |||
'\t': '\\t', | |||
'\0': '\\u0000', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being unclear earlier; I meant “why not produce the escape sequence for \0
(i.e. '\\0'
) instead of the long-form \u
one”. I think escaping U+0000 makes a lot of sense and is preferable to not escaping it.
test/test.ts
Outdated
test('surrogate pair', '𝌆', JSON.stringify('𝌆')); | ||
test('nul', '\0', JSON.stringify('\0')); | ||
test('surrogate pair in wrong order', 'a\uDC00\uD800b', '"a\uDC00\uD800b"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these need to be escaped
Thank you @mathiasbynens — so I think I understand now: surrogates need to be escaped unless they're part of a [low, high] pair. Did I get that right? |
Gah, I've just re-read your earlier comments (and my own!) that were hidden because 'outdated' — I seem to have it exactly backwards... |
Actually, I'm still a little confused... earlier you said
but the character '𝌆', which doesn't seem to need to be escaped, appears to be a [low, high] pair: '𝌆' === String.fromCharCode(55348) + String.fromCharCode(57094); // true Where have I gone wrong? Sorry, you probably get very bored of explaining this stuff. I owe you a beer next time I see you! |
See https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs:
In other words, I think you have the terminology backwards. The "high" in "high surrogate" doesn't refer to the numeric code point value; those are actually smaller ("lower") than those of "low surrogates". If it helps, use the terms "lead/trail surrogate" instead. |
Ha, ok — yep, I thought high/low referred to code point value. Makes sense now, cheers! |
Fixes (I think!) #13