How to properly deal with emoji / 4 byte utf8? #348

Closed
osheroff opened this issue Jan 25, 2017 · 6 comments

Comments

@osheroff

Hi, I write https://github.com/zendesk/maxwell, and am having trouble with emoji characters in my json output.

If I send the string "We are the robots.🤖🤖🤖🤖" through the system, the output I get out of jackson is odd, I get:

"We are the robots.\uD83E\uDD16\uD83E\uDD16\uD83E\uDD16\uD83E\uDD16'"

I have varying degrees of success parsing this json. ruby barfs, chrome and scala and python appear to be fine, but I'd prefer going to 4 byte utf8 if possible.

I suspect this is the same issue as #223, if there's a workaround other than just a +1 to that discussion, it'd be great to know.

Thanks!
-ben
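
For readers landing on this thread: the escaped output above is a UTF-16 surrogate pair per emoji, which a conforming JSON parser combines back into one code point. The sketch below uses Python's standard `json` module purely as an illustration (it is not Jackson, and the variable names are only for this example) to show that `\uD83E\uDD16` decodes to U+1F916, whose UTF-8 form is the 4-byte sequence asked about.

```python
import json

# The escaped form from the issue, shortened to a single emoji:
# one \uXXXX escape per UTF-16 surrogate.
escaped = '"We are the robots.\\uD83E\\uDD16"'

# A conforming JSON parser joins the high and low surrogate
# into a single code point.
decoded = json.loads(escaped)
robot = decoded[-1]

# Encoding that code point as UTF-8 gives the 4-byte form.
utf8_bytes = robot.encode("utf-8")

print(hex(ord(robot)))   # 0x1f916
print(utf8_bytes.hex())  # f09fa496
```

So the escaped and the raw 4-byte forms carry the same string; a consumer that fails on one of them is more likely mishandling surrogates than receiving invalid JSON.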

@cowtowncoder
Member

@osheroff Yes, I think this is #223. Depending on how one reads the JSON spec (or which version of it, as there are several by now), Jackson's behavior is either exactly what the original spec calls for, or possibly not. Unfortunately, allowing alternate output is rather non-trivial because of the way Java's internal representation (UTF-16 code units, with surrogate pairs for characters outside the BMP) interacts with input/output buffer boundaries.

@osheroff
Author

gotcha. is there an easy way to get at or reproduce the utter basics of json encoding, leaving multi-byte chars alone? my input is mostly-trusted, as mysql should probably have sanitized out invalid chars.

@cowtowncoder
Member

@osheroff You mean to pass content as is? If you know what you are doing, you can use writeRawValue() (and remember to include the enclosing double quotes) to force exact output.
Otherwise, a CharacterEscapes implementation may be registered, but I am not 100% sure it would get called appropriately. If it does, you could make it indicate that surrogate characters (two 2-byte code units that form one logical character) are NOT to be escaped, and that might work as well. Note, however, that this would not produce the same 4-byte sequence as proper UTF-8 encoding, but rather a 6-byte sequence in which the surrogates themselves are encoded (in direct violation of the UTF-8 encoding... yet many decoders may be fine with it).
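
To make the 4-byte versus 6-byte distinction concrete, here is a minimal sketch in Python. It does not use Jackson or `writeRawValue()`; it only illustrates the byte-level difference with Python's standard library, and `strict_utf8_accepts` is a hypothetical helper written for this example. The `surrogatepass` error handler is used to deliberately encode each surrogate on its own, which is roughly what happens when unescaped surrogates are written out individually.

```python
import json

text = "We are the robots.\U0001F916"

# Default: ASCII-only output, the emoji becomes two \uXXXX escapes.
escaped = json.dumps(text)

# ensure_ascii=False leaves non-ASCII characters alone; encoding the
# result as UTF-8 then yields the proper 4-byte sequence.
raw = json.dumps(text, ensure_ascii=False)
raw_utf8 = raw.encode("utf-8")

# Proper UTF-8 for U+1F916: 4 bytes.
proper_bytes = "\U0001F916".encode("utf-8")

# Each surrogate encoded separately: 3 bytes each, 6 in total.
# This is not valid UTF-8, even though some decoders tolerate it.
surrogate_bytes = "\ud83e\udd16".encode("utf-8", "surrogatepass")


def strict_utf8_accepts(data):
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(proper_bytes.hex(), strict_utf8_accepts(proper_bytes))
print(surrogate_bytes.hex(), strict_utf8_accepts(surrogate_bytes))
```

A strict decoder accepts the first sequence and rejects the second, which is why suppressing the escaping alone is not equivalent to emitting real 4-byte UTF-8.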

@osheroff
Author

@cowtowncoder cool, thanks for the info. actually even ruby parses the escape sequence just fine (I was just confused because it has its own \u), so I'm gonna leave maxwell as-is and go hunt whatever parser or encoder issue I have downstream.

thanks again for your time!

@osheroff
Author

(fwiw, if anyone ever finds this thread again: I was not checking for surrogate pairs before doing .slice on the string in the code that picked up the JSON)
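
The thread does not say which language the downstream `.slice` call was in, so the following is only a sketch of the general pitfall, assuming a runtime that indexes strings by UTF-16 code unit. Python indexes by code point, so the code units are simulated explicitly; `utf16_units`, `units_to_text`, and `safe_prefix` are hypothetical helpers for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def units_to_text(units):
    """Decode UTF-16 code units; raises UnicodeDecodeError on a lone surrogate."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


def safe_prefix(units, end):
    """Take units[:end], backing off one unit if the cut would split a pair."""
    end = max(0, min(end, len(units)))
    if end > 0 and 0xD800 <= units[end - 1] <= 0xDBFF:
        end -= 1  # last kept unit is a high surrogate; drop it too
    return units[:end]


units = utf16_units("robots.\U0001F916")  # 7 ASCII units + 2 surrogates

# A naive cut after 8 units keeps the high surrogate but not the low one.
naive = units[:8]

# The guarded cut backs off to the last complete character.
safe = safe_prefix(units, 8)
```

Here `naive` ends in a lone high surrogate (0xD83E) and cannot be decoded or re-encoded cleanly, while `units_to_text(safe)` gives back `"robots."`.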

@cowtowncoder
Member

@osheroff Glad you could make it work!
