Handle lone surrogates #17

Rich-Harris · 2018-10-26T02:34:23Z

Fixes (I think!) #13

mathiasbynens

LGTM. Left some feedback. Thanks for your work on this!

mathiasbynens · 2018-10-26T08:15:56Z

src/index.ts

+			result += '\\"';
+		} else if (char in escaped) {
+			result += escaped[char];
+		} else if ((code >= 0xD800 && code <= 0xDBFF)) {


Nit: redundant parens

good catch, thanks

mathiasbynens · 2018-10-26T08:31:18Z

test/test.ts

+	describe('strings', () => {
+		test('newline', 'a\nb', JSON.stringify('a\nb'));
+		test('double quotes', '"yar"', JSON.stringify('"yar"'));
+		test('lone surrogate', "\uD800", '"\\uD800"');


Tests I'd consider adding for this change (with the ones that are already there marked as completed):

a surrogate pair, e.g. 'a\uD800\uDC00b'

a surrogate pair in the wrong order, e.g. 'a\uDC00\uD800b'

a lone high surrogate, e.g. 'a\uD800b'

a lone low surrogate, e.g. 'a\uDC00b'

two lone high surrogates in a row, e.g. 'a\uD800\uD800b'

two lone low surrogates in a row, e.g. 'a\uDC00\uDC00b'

I've added these tests, though I'm not entirely clear on what the expected values are 😂 — should any combination of surrogates other than [high, low] be replaced with \\u[XXXX]?

Exactly; anything that is not [high, low] is not a valid surrogate pair, but rather a series of lone surrogates.

When in doubt, you can compare the results to JSON.stringify in a recent V8 build (perhaps using jsvu to easily grab the latest binary).
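As a rough illustration of the comparison suggested above, the six cases can be run through JSON.stringify directly. This is only a sketch: it assumes an engine that implements well-formed JSON.stringify (ES2019), and `surrogateCases` is a name made up for the example. Such engines emit lowercase hex digits, whereas the test in this PR expects uppercase (`\\uD800`), so the expected strings here differ from the PR's in letter case only.

```typescript
// Each entry: [description, input, expected JSON.stringify output].
// Assumption: the engine implements well-formed JSON.stringify (ES2019),
// which escapes lone surrogates using lowercase hex digits.
const surrogateCases: Array<[string, string, string]> = [
	['surrogate pair', 'a\uD800\uDC00b', '"a\uD800\uDC00b"'],
	['surrogate pair in wrong order', 'a\uDC00\uD800b', '"a\\udc00\\ud800b"'],
	['lone high surrogate', 'a\uD800b', '"a\\ud800b"'],
	['lone low surrogate', 'a\uDC00b', '"a\\udc00b"'],
	['two lone high surrogates', 'a\uD800\uD800b', '"a\\ud800\\ud800b"'],
	['two lone low surrogates', 'a\uDC00\uDC00b', '"a\\udc00\\udc00b"'],
];

surrogateCases.forEach(([name, input, expected]) => {
	const actual = JSON.stringify(input);
	console.log(`${name}: ${actual === expected ? 'matches' : 'differs'}`);
});
```

Only the first case passes through unescaped; every other combination is treated as a series of lone surrogates.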

mathiasbynens · 2018-10-26T08:32:52Z

src/index.ts

+		} else if ((code >= 0xD800 && code <= 0xDBFF)) {
+			const next = str.charCodeAt(i + 1);
+			if (next >= 0xDC00 && next <= 0xDFFF) {
+				result += char;


You could append the next char here as well and increment i, saving a loop iteration.
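A minimal sketch of that suggestion, not the code that was merged: when a high surrogate is followed by a low surrogate, both code units are appended at once and `i` is advanced past the pair. The function name `escapeLoneSurrogates` is hypothetical, and the sketch deliberately ignores quotes, newlines and the other escapes handled in src/index.ts.

```typescript
// Hypothetical helper: escapes lone surrogates as \uXXXX and leaves
// well-formed [high, low] pairs untouched. Other escapes are out of scope.
function escapeLoneSurrogates(str: string): string {
	let result = '';
	for (let i = 0; i < str.length; i += 1) {
		const code = str.charCodeAt(i);
		if (code >= 0xd800 && code <= 0xdbff) {
			// charCodeAt returns NaN past the end, so both comparisons fail there
			const next = str.charCodeAt(i + 1);
			if (next >= 0xdc00 && next <= 0xdfff) {
				// Valid pair: append both code units and skip the low surrogate
				result += str[i] + str[i + 1];
				i += 1;
			} else {
				result += `\\u${code.toString(16).toUpperCase()}`;
			}
		} else if (code >= 0xdc00 && code <= 0xdfff) {
			// A low surrogate reached here was not preceded by a high surrogate
			result += `\\u${code.toString(16).toUpperCase()}`;
		} else {
			result += str[i];
		}
	}
	return result;
}

console.log(escapeLoneSurrogates('a\uDC00\uD800b'));
```

Because the low surrogate of a valid pair is consumed together with its high surrogate, any low surrogate the loop encounters on its own must be a lone one.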

mathiasbynens · 2018-10-26T08:34:03Z

src/index.ts

+	'\n': '\\n',
+	'\r': '\\r',
+	'\t': '\\t',
+	'\0': '\\u0000',


Why not just escape this as \0? (It's not an octal escape in JS.)

Is the goal to match JSON.stringify?

It's not an explicit goal, I just didn't realise it wasn't necessary

mathiasbynens · 2018-10-26T11:10:51Z

src/index.ts

@@ -11,7 +11,6 @@ const escaped: Record<string, string> = {
 	'\n': '\\n',
 	'\r': '\\r',
 	'\t': '\\t',
-	'\0': '\\u0000',


Sorry for being unclear earlier; I meant “why not produce the escape sequence for \0 (i.e. '\\0') instead of the long-form \u one”. I think escaping U+0000 makes a lot of sense and is preferable to not escaping it.
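For illustration, a small sketch of producing the short escape. `escapeNul` is a hypothetical name, not something from this PR. The sketch also adds one guard the thread does not discuss: in JavaScript source, `\0` only denotes U+0000 when it is not followed by a decimal digit, so the sketch assumes the output will be evaluated as JS and falls back to the long form in that case.

```typescript
// Hypothetical helper: escapes U+0000 as \0, falling back to \u0000 when the
// next character is a decimal digit (where \0 plus the digit would be parsed
// as a legacy octal escape, or rejected in strict mode).
function escapeNul(str: string): string {
	let result = '';
	for (let i = 0; i < str.length; i += 1) {
		const char = str[i];
		if (char === '\0') {
			const next = str[i + 1];
			const digitFollows = next !== undefined && next >= '0' && next <= '9';
			result += digitFollows ? '\\u0000' : '\\0';
		} else {
			result += char;
		}
	}
	return result;
}

console.log(escapeNul('a\0b'));
```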

mathiasbynens · 2018-10-26T11:11:33Z

test/test.ts

 		test('surrogate pair', '𝌆', JSON.stringify('𝌆'));
-		test('nul', '\0', JSON.stringify('\0'));
+		test('surrogate pair in wrong order', 'a\uDC00\uD800b', '"a\uDC00\uD800b"');


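These need to be escaped: the expected value should contain the escape sequences (`\\uDC00\\uD800`), not the raw lone surrogates.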

Rich-Harris · 2018-10-26T12:43:10Z

Thank you @mathiasbynens — so I think I understand now: surrogates need to be escaped unless they're part of a [low, high] pair. Did I get that right?

Rich-Harris · 2018-10-26T13:05:04Z

Gah, I've just re-read your earlier comments (and my own!) that were hidden because 'outdated' — I seem to have it exactly backwards...

Rich-Harris · 2018-10-26T13:12:51Z

Actually, I'm still a little confused... earlier you said

anything that is not [high, low] is not a valid surrogate pair

but the character '𝌆', which doesn't seem to need to be escaped, appears to be a [low, high] pair:

'𝌆' === String.fromCharCode(55348) + String.fromCharCode(57094); // true

Where have I gone wrong? Sorry, you probably get very bored of explaining this stuff. I owe you a beer next time I see you!

mathiasbynens · 2018-10-26T13:29:17Z

See https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs:

The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.

The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.

In other words, I think you have the terminology backwards. The "high" in "high surrogate" doesn't refer to the numeric code point value: high surrogates (0xD800 to 0xDBFF) are actually numerically smaller ("lower") than low surrogates (0xDC00 to 0xDFFF). If it helps, use the terms "lead/trail surrogate" instead.
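To make the ranges concrete, here is a small sketch that classifies the two code units of '𝌆' from the earlier comment. The helper names `isLeadSurrogate` and `isTrailSurrogate` are made up for the example; the decoding formula is the standard UTF-16 one.

```typescript
// Hypothetical helpers for the two ranges quoted above.
const isLeadSurrogate = (code: number): boolean => code >= 0xd800 && code <= 0xdbff;
const isTrailSurrogate = (code: number): boolean => code >= 0xdc00 && code <= 0xdfff;

const tetragram = '\uD834\uDF06'; // '𝌆', U+1D306
const lead = tetragram.charCodeAt(0); // 55348 (0xD834)
const trail = tetragram.charCodeAt(1); // 57094 (0xDF06)

// Standard UTF-16 decoding of a surrogate pair into a code point
const codePoint = (lead - 0xd800) * 0x400 + (trail - 0xdc00) + 0x10000;

console.log(isLeadSurrogate(lead), isTrailSurrogate(trail)); // true true
console.log(lead < trail); // true: the "high" surrogate has the smaller value
console.log(codePoint.toString(16)); // 1d306
```

So 55348 followed by 57094 is a [high, low] pair, even though the first number is the smaller of the two.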

Rich-Harris · 2018-10-26T13:33:41Z

Ha, ok — yep, I thought high/low referred to code point value. Makes sense now, cheers!

mathiasbynens · 2018-10-26T13:34:30Z

https://www.youtube.com/watch?v=dv6pJ2D_Sek

Rich-Harris added 2 commits October 25, 2018 21:31

reimplement JSON.stringify

bd8029f

fix lone surrogate handling, add tests

207040a

Rich-Harris mentioned this pull request Oct 26, 2018

Consider escaping lone surrogates #13

Closed

mathiasbynens approved these changes Oct 26, 2018

make suggested changes

0f2501c

mathiasbynens reviewed Oct 26, 2018

Rich-Harris added 3 commits October 26, 2018 08:41

escape expected string

0d7c36a

escape surrogates except in [low, high] pair

df1848a

remove some unused code

081f528

Rich-Harris added 2 commits October 26, 2018 09:32

escape U+0000

bb328ea

up is down and black is white

9463b0f

Rich-Harris merged commit 0fc4067 into master Oct 26, 2018

Rich-Harris deleted the gh-13 branch October 26, 2018 13:33
