Improper Decoding Algorithm #5

Dylan16807 · 2014-11-26T22:03:48Z

In section 4.2 the decoding algorithm says to unconditionally consume the character after a lead surrogate unit. This results in improper decoding when there is a sequence of two lead surrogate units followed by a trail surrogate unit. Three code points will be emitted instead of two. The note at the end of the section suggests that only two code points should be emitted, and in fact the two implementations I checked (wtf-8.js and rust-wtf8) will emit two code points in this situation.

The algorithm should consume the next code unit if and only if it is a trail surrogate unit.
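
To make the proposed rule concrete, here is a minimal sketch of a decoder that consumes the next code unit only when it is a trail surrogate. This is an illustration, not the spec's wording: the function and helper names are hypothetical, and the input is assumed to be a list of 16-bit integers.

```python
def is_lead_surrogate(unit):
    # Lead (high) surrogates occupy 0xD800..0xDBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_trail_surrogate(unit):
    # Trail (low) surrogates occupy 0xDC00..0xDFFF.
    return 0xDC00 <= unit <= 0xDFFF


def decode_potentially_ill_formed_utf16(units):
    """Decode 16-bit code units to code points, keeping lone surrogates.

    Hypothetical helper for illustration; `units` is assumed to be a
    list of integers in the range 0..0xFFFF.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        # Pair up only when the NEXT unit really is a trail surrogate.
        if (
            is_lead_surrogate(unit)
            and i + 1 < len(units)
            and is_trail_surrogate(units[i + 1])
        ):
            trail = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
            )
            i += 2  # consumed both the lead and the trail
        else:
            # Lone surrogate or BMP code unit: emit as-is, consume one unit.
            code_points.append(unit)
            i += 1
    return code_points


# Two leads followed by a trail: the first lead stays lone,
# the second pairs with the trail.
result = decode_potentially_ill_formed_utf16([0xD83D, 0xD83D, 0xDCA9])
print([hex(cp) for cp in result])
```

With this rule the second lead surrogate is left in place to be examined on the next iteration, so it can still pair with the trail surrogate that follows it.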

SimonSapin · 2014-11-26T22:13:47Z

Good catch, thanks!

@Dylan16807

For example, [0xD83D, 0xD83D, 0xDCA9] would have incorrectly decoded to [U+D83D, U+D83D, U+DCA9] rather than [U+D83D, U+1F4A9]. Thanks @Dylan16807 for the #5 bug report.

SimonSapin · 2014-11-26T22:27:38Z

I pushed a fix in 7cf8092#diff-1. How does it look?

SimonSapin · 2014-11-27T11:46:55Z

Closing as fixed. Please comment here if it still looks wrong.

SimonSapin added a commit that referenced this issue Nov 26, 2014

Fix a bug in the UTF-16 decoding algorithm.

7cf8092

SimonSapin closed this as completed Nov 27, 2014
