How to properly deal with emoji / 4 byte utf8? #348

osheroff · 2017-01-25T18:30:39Z

Hi, I write https://github.com/zendesk/maxwell, and am having trouble with emoji characters in my json output.

If I send the string "We are the robots.🤖🤖🤖🤖" through the system, the output I get out of Jackson is odd:

"We are the robots.\uD83E\uDD16\uD83E\uDD16\uD83E\uDD16\uD83E\uDD16'"

I have varying degrees of success parsing this json. ruby barfs, chrome and scala and python appear to be fine, but I'd prefer going to 4 byte utf8 if possible.

I suspect this is the same issue as #223, if there's a workaround other than just a +1 to that discussion, it'd be great to know.

Thanks!
-ben


cowtowncoder · 2017-01-25T18:37:33Z

@osheroff Yes, I think this is #223. Depending on how you read the JSON spec (and which version of it, as there are several by now), Jackson's behavior is either what the original spec describes as expected, or possibly not. Unfortunately, allowing alternate output is rather non-trivial because of the way Java's internal representation (UTF-16, where a character outside the Basic Multilingual Plane is stored as a surrogate pair of two chars) interacts with input/output buffer boundaries.
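
For readers landing here later, a minimal sketch of where the `\uD83E\uDD16` in the output comes from. It uses only the JDK, not Jackson; the class and method names (`SurrogateDemo`, `escapeUnits`) are made up for illustration. It shows that U+1F916 is one code point but two `char`s in a Java `String`, and that a standard UTF-8 encoder turns the pair into 4 bytes.

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Hypothetical helper: formats every UTF-16 code unit of s as a \\uXXXX escape.
    static String escapeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F916 ROBOT FACE, built from its code point to avoid source-encoding issues.
        String robot = new String(Character.toChars(0x1F916));

        System.out.println(robot.length());                                // 2 (UTF-16 code units)
        System.out.println(robot.codePointCount(0, robot.length()));       // 1 (code point)
        System.out.println(escapeUnits(robot));                            // \uD83E\uDD16
        System.out.println(robot.getBytes(StandardCharsets.UTF_8).length); // 4 (UTF-8 bytes)
    }
}
```

Because the two halves are separate `char`s, a writer that processes the string in fixed-size chunks can see the high surrogate at the end of one chunk and the low surrogate at the start of the next, which is presumably the buffer-boundary difficulty mentioned above.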

osheroff · 2017-01-25T18:42:47Z

gotcha. is there an easy way to get at or reproduce the utter basics of json encoding, leaving multi-byte chars alone? my input is mostly-trusted, as mysql should probably have sanitized out invalid chars.

cowtowncoder · 2017-01-25T21:14:41Z

@osheroff You mean to pass content as-is? If you know what you are doing, you can use writeRawValue() (and remember to include the enclosing double quotes) to force exact output.
Otherwise, a CharacterEscapes implementation may be registered, but I am not 100% sure it would get called appropriately. If it does, you could make it indicate that surrogate characters (two 2-byte chars that form one logical character) are NOT to be escaped, and that might work as well. Note, however, that this would not produce the same 4-byte sequence as proper UTF-8 encoding, but rather a 6-byte sequence where the surrogates themselves are encoded (in direct violation of UTF-8 encoding... yet many decoders may be fine with it).
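
As a rough illustration of the "utter basics" asked about above, here is a sketch of a string quoter that escapes only what JSON requires (the quote, the backslash, and control characters below U+0020) and copies everything else through unchanged, surrogate pairs included. `RawJsonString.quote` is a hypothetical helper, not part of Jackson, and it assumes the input is trusted: it does not check for unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

class RawJsonString {
    // Hypothetical helper: wraps s in double quotes and escapes only the
    // characters JSON forbids inside a string. Non-ASCII chars are left alone.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2);
        sb.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be \\u-escaped.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Everything else, including surrogate halves, is copied as-is.
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        String robot = new String(Character.toChars(0x1F916));
        String json = quote("We are the robots." + robot);
        // Encoding the result with the JDK's UTF-8 encoder yields 4 bytes per emoji.
        System.out.println(json.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The quoted result already includes the enclosing double quotes, so it is the kind of value one would hand to `writeRawValue()`. How the generator then encodes the raw surrogate pair to bytes is not verified here and may depend on the generator and Jackson version in use.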

osheroff · 2017-01-25T22:39:27Z

@cowtowncoder cool, thanks for the info. actually even ruby parses the escape sequence just fine (I was just confused because it has its own \u), so I'm gonna leave maxwell as-is and go hunt whatever parser or encoder issue I have downstream.

thanks again for your time!

osheroff · 2017-01-25T22:55:57Z

(fwiw, if anyone ever finds this thread again; I was not checking for surrogate pairs before doing .slice on the string in the code that picked up the json)
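
The actual downstream code is not shown in this thread, so the following is only a sketch of the kind of check described, written in Java for consistency with the rest of the discussion. `SafeSlice.truncate` is a hypothetical helper: it cuts a string to at most `maxUnits` UTF-16 code units and backs off by one when the cut would otherwise fall between the two halves of a surrogate pair.

```java
class SafeSlice {
    // Hypothetical helper: returns a prefix of s that is at most maxUnits
    // UTF-16 code units long and never ends in the middle of a surrogate pair.
    static String truncate(String s, int maxUnits) {
        if (maxUnits >= s.length()) {
            return s;
        }
        if (maxUnits <= 0) {
            return "";
        }
        int end = maxUnits;
        // If the last kept char is a high surrogate and the first dropped char
        // is its low surrogate, drop the high surrogate as well.
        if (Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String robot = new String(Character.toChars(0x1F916));
        String s = "ab" + robot;

        // A naive slice keeps a lone high surrogate, which is not valid text.
        String naive = s.substring(0, 3);
        System.out.println(Character.isHighSurrogate(naive.charAt(2))); // true

        System.out.println(truncate(s, 3)); // ab
    }
}
```

An alternative is to count in code points rather than `char`s, for example with `String.offsetByCodePoints`, if the limit is meant to be in characters rather than code units.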

cowtowncoder · 2017-01-26T05:55:53Z

@osheroff Glad you can make it work!

osheroff closed this as completed Jan 25, 2017
