Improper Decoding Algorithm #5

Closed
Dylan16807 opened this issue Nov 26, 2014 · 3 comments

Comments

@Dylan16807

In section 4.2 the decoding algorithm says to unconditionally consume the character after a lead surrogate unit. This results in improper decoding when there is a sequence of two lead surrogate units followed by a trail surrogate unit. Three code points will be emitted instead of two. The note at the end of the section suggests that only two code points should be emitted, and in fact the two implementations I checked (wtf-8.js and rust-wtf8) will emit two code points in this situation.

The algorithm should consume the next code unit if and only if it is a trail surrogate unit.
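To make the proposed rule concrete, here is a minimal Python sketch of a decoder that consumes the following code unit only when it is a trail surrogate. It is not the spec's wording or either library's code; the function name `decode_potentially_ill_formed_utf16` and the list-of-integers input are assumptions made for illustration.

```python
def decode_potentially_ill_formed_utf16(units):
    """Decode a list of 16-bit code units into a list of code points.

    Hypothetical helper for illustration. Unpaired surrogates are passed
    through unchanged as surrogate code points.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        i += 1
        is_lead = 0xD800 <= unit <= 0xDBFF
        # Peek at the next unit and consume it only if it is a trail surrogate.
        if is_lead and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            trail = units[i]
            i += 1
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
            )
        else:
            # Non-surrogate, lone trail, or lead not followed by a trail.
            code_points.append(unit)
    return code_points


# Two lead surrogates followed by a trail surrogate yield two code points.
decoded = decode_potentially_ill_formed_utf16([0xD83D, 0xD83D, 0xDCA9])
print([f"U+{cp:04X}" for cp in decoded])  # ['U+D83D', 'U+1F4A9']
```

The first lead surrogate is emitted on its own because the unit after it is not a trail surrogate, and that unit stays available to pair with the trail surrogate that follows.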

@SimonSapin
Owner

Good catch, thanks!

SimonSapin added a commit that referenced this issue Nov 26, 2014
For example, [0xD83D, 0xD83D, 0xDCA9] would have incorrectly decoded to
[U+D83D, U+D83D, U+DCA9] rather than [U+D83D, U+1F4A9].

Thanks @Dylan16807 for the #5 bug report.
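
For readers who want to check the U+1F4A9 value in the commit message, the standard UTF-16 surrogate pair arithmetic can be worked through as below. This is only a sketch; `combine_surrogates` is a hypothetical name, not something taken from the commit.

```python
def combine_surrogates(lead, trail):
    """Combine a lead and trail surrogate into a supplementary code point.

    Hypothetical helper for illustration.
    """
    high_bits = lead - 0xD800   # upper 10 bits of the offset
    low_bits = trail - 0xDC00   # lower 10 bits of the offset
    return 0x10000 + (high_bits << 10) + low_bits


# The pair from the commit message: 0xD83D followed by 0xDCA9.
print(hex(0xD83D - 0xD800))                          # 0x3d
print(hex(0xDCA9 - 0xDC00))                          # 0xa9
print(f"U+{combine_surrogates(0xD83D, 0xDCA9):X}")   # U+1F4A9
```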
@SimonSapin
Owner

I pushed a fix in 7cf8092#diff-1. How does it look?

@SimonSapin
Owner

Closing as fixed. Please comment here if it still looks wrong.
