Does not correctly parse surrogate pairs #42

Closed
johnezang opened this issue Feb 3, 2011 · 2 comments

@johnezang
Contributor

The following is not parsed correctly:

{ "MATHEMATICAL ITALIC CAPITAL ALPHA": "\uD835\uDEE2" }

Expected result:

{ "MATHEMATICAL ITALIC CAPITAL ALPHA": "𝛢" }

(Note: GitHub seems to have problems with Unicode characters above U+FFFF, which is why this looks funky. I did my best with what I could.)

Using the following code:

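// Note: `writer` is not defined in this snippet; presumably it is an SBJsonWriter created earlier.
NSString *json = [NSString stringWithUTF8String:"{ \"MATHEMATICAL ITALIC CAPITAL ALPHA\": \"\\uD835\\uDEE2\" }"];
id obj = [json JSONValue];  // parse the JSON text into an object graph
NSLog(@"stringWithObject: %@", [writer stringWithObject:obj]);  // serialize it back to JSON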

... produces the following:

stringWithObject: {"MATHEMATICAL ITALIC CAPITAL ALPHA":"훢"}
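
For reference, here is a minimal sketch of how a `\uXXXX\uXXXX` surrogate pair is normally combined into a single code point. `combine_surrogates` is a hypothetical helper written for illustration, not code from the library.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a UTF-16 high/low surrogate pair into one Unicode code point.
   Hypothetical helper for illustration; not taken from the library. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);  /* high (lead) surrogate range */
    assert(lo >= 0xDC00 && lo <= 0xDFFF);  /* low (trail) surrogate range */

    /* Each surrogate carries 10 payload bits; the pair encodes cp - 0x10000. */
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

With the input from this report, `combine_surrogates(0xD835, 0xDEE2)` gives 0x1D6E2, which is MATHEMATICAL ITALIC CAPITAL ALPHA. Truncating that value to 16 bits gives 0xD6E2, the Hangul syllable that shows up in the output above. That is consistent with the upper bits being lost somewhere, though it does not by itself show where.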

Also, the code in parseUnicodeEscape and decodeHexQuad may (to a zeroth-order approximation) have corner cases that read past the end of the array, in particular when dealing with surrogate pairs. The code that calls parseUnicodeEscape seems to have an explicit length variable, while the Unicode parsing code does not and instead relies on \0 termination. It's not clear to me whether that assumption is guaranteed to hold, and it looks very suspicious.

@stig
Collaborator

stig commented Feb 3, 2011

There is a hack in -appendBytes: that appends a \0 to make sure decodeHexQuad works. Let me stress again that it's a hack. One of these days I want to make the code completely length-based.
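
A length-based variant could look roughly like the sketch below. It assumes the caller passes the buffer length explicitly, so no \0 terminator is needed. The names `hex_value` and `decode_hex_quad_n` and the signatures are hypothetical, not the library's actual API.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode the four hex digits at buf[pos..pos+3] into *out.
   Returns 1 on success, 0 if fewer than four bytes remain or a byte is
   not a hex digit. Never reads at or beyond buf[len].
   Hypothetical signature for illustration only. */
static int decode_hex_quad_n(const char *buf, size_t len, size_t pos, uint16_t *out)
{
    if (pos > len || len - pos < 4) return 0;  /* bounds check before any read */

    uint16_t value = 0;
    for (size_t i = 0; i < 4; i++) {
        int digit = hex_value((unsigned char)buf[pos + i]);
        if (digit < 0) return 0;
        value = (uint16_t)((value << 4) | digit);
    }
    *out = value;
    return 1;
}
```

A surrogate-pair parser built on this would call it twice and check the remaining length before looking for the second `\u`, so a truncated escape at the end of the input fails cleanly instead of reading past the buffer.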

@stig
Collaborator

stig commented Feb 13, 2011

Thanks. Having looked into this, the decoding of the code point seems to work, but my conversion from the code point to a string does not. I'll try to fix this.
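
For the conversion step, a minimal sketch of turning a code point back into UTF-16 code units (the unit NSString works in) is shown below. `encode_utf16` is a hypothetical helper; how the library actually performs this conversion is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-16 into out[0..1].
   Returns the number of code units written (1 or 2), or 0 if cp is not a
   valid Unicode scalar value. Hypothetical helper for illustration. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* Reject values above the Unicode range and lone surrogates. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    if (cp < 0x10000) {            /* BMP: fits in a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                 /* 20 payload bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+1D6E2 this writes the two units 0xD835 and 0xDEE2. Storing the code point in a single 16-bit unit instead would keep only 0xD6E2, which matches the wrong character reported above.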

This issue was closed.