Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle lone surrogates #17

Merged
merged 8 commits into from
Oct 26, 2018
Merged

Handle lone surrogates #17

merged 8 commits into from
Oct 26, 2018

Conversation

Rich-Harris
Copy link
Owner

Fixes (I think!) #13

Copy link

@mathiasbynens mathiasbynens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some feedback. Thanks for your work on this!

src/index.ts Outdated
result += '\\"';
} else if (char in escaped) {
result += escaped[char];
} else if ((code >= 0xD800 && code <= 0xDBFF)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: redundant parens

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thanks

test/test.ts Outdated
describe('strings', () => {
test('newline', 'a\nb', JSON.stringify('a\nb'));
test('double quotes', '"yar"', JSON.stringify('"yar"'));
test('lone surrogate', "\uD800", '"\\uD800"');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests I'd consider adding for this change (with the ones that are already there marked as completed):

  • a surrogate pair, e.g. 'a\uD800\uDC00b'
  • a surrogate pair in the wrong order, e.g. 'a\uDC00\uD800b'
  • a lone high surrogate, e.g. 'a\uD800b'
  • a lone low surrogate, e.g. 'a\uDC00b'
  • two lone high surrogates in a row, e.g. 'a\uD800\uD800b'
  • two lone low surrogates in a row, e.g. 'a\uDC00\uDC00b'

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added these tests, though I'm not entirely clear on what the expected values are 😂 — should any combination of surrogates other than [high, low] be replaced with \\u[XXXX]?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly; anything that is not [high, low] is not a valid surrogate pair, but rather a series of lone surrogates.

When in doubt, you can compare the results to JSON.stringify in a recent V8 build (perhaps using jsvu to easily grab the latest binary).

src/index.ts Outdated
} else if ((code >= 0xD800 && code <= 0xDBFF)) {
const next = str.charCodeAt(i + 1);
if (next >= 0xDC00 && next <= 0xDFFF) {
result += char;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could append the next char here as well and increment i, saving a loop iteration.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

src/index.ts Outdated
'\n': '\\n',
'\r': '\\r',
'\t': '\\t',
'\0': '\\u0000',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just escape this as \0? (It's not an octal escape in JS.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal to match JSON.stringify?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an explicit goal, I just didn't realise it wasn't necessary

src/index.ts Outdated
@@ -11,7 +11,6 @@ const escaped: Record<string, string> = {
'\n': '\\n',
'\r': '\\r',
'\t': '\\t',
'\0': '\\u0000',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being unclear earlier; I meant “why not produce the escape sequence for \0 (i.e. '\\0') instead of the long-form \u one”. I think escaping U+0000 makes a lot of sense and is preferable to not escaping it.

test/test.ts Outdated
test('surrogate pair', '𝌆', JSON.stringify('𝌆'));
test('nul', '\0', JSON.stringify('\0'));
test('surrogate pair in wrong order', 'a\uDC00\uD800b', '"a\uDC00\uD800b"');

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need to be escaped

@Rich-Harris
Copy link
Owner Author

Thank you @mathiasbynens — so I think I understand now: surrogates need to be escaped unless they're part of a [low, high] pair. Did I get that right?

@Rich-Harris
Copy link
Owner Author

Gah, I've just re-read your earlier comments (and my own!) that were hidden because 'outdated' — I seem to have it exactly backwards...

@Rich-Harris
Copy link
Owner Author

Actually, I'm still a little confused... earlier you said

anything that is not [high, low] is not a valid surrogate pair

but the character '𝌆', which doesn't seem to need to be escaped, appears to be a [low, high] pair:

'𝌆' === String.fromCharCode(55348) + String.fromCharCode(57094); // true

Where have I gone wrong? Sorry, you probably get very bored of explaining this stuff. I owe you a beer next time I see you!

@mathiasbynens
Copy link

See https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs:

The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.

The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.

In other words, I think you have the terminology backwards. The "high" in "high surrogate" doesn't refer to the numeric code point value; those are actually smaller ("lower") than those of "low surrogates". If it helps, use the terms "lead/trail surrogate" instead.

@Rich-Harris
Copy link
Owner Author

Ha, ok — yep, I thought high/low referred to code point value. Makes sense now, cheers!

@Rich-Harris Rich-Harris merged commit 0fc4067 into master Oct 26, 2018
@Rich-Harris Rich-Harris deleted the gh-13 branch October 26, 2018 13:33
@mathiasbynens
Copy link

https://www.youtube.com/watch?v=dv6pJ2D_Sek

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants