
from_json and utf8 #771

Open
pdl opened this Issue Apr 3, 2012 · 7 comments

4 participants

@pdl
pdl commented Apr 3, 2012

I have run into a problem when I POST characters above \x7F to Dancer 1.3093 and then apply from_json to them.

I have discussed the issue with @ambs, who reported an issue with to_json in versions prior to 1.3093, and he has helpfully (and speedily) found a fix, but it's still unclear why it needs fixing, and whether this is something that Dancer should fix. I posted this to the dancer-users list but have not had any reply, so I'm adding it to GitHub so the details are easier to find.

To recreate:

Create a new app:

dancer -a MyWeb::App

and apply the diff at https://gist.github.com/2293055

Load it into your browser, click the button, and it sends {"q":"café"} to the server, which processes it fine and returns that word. All good so far.

Notice that in MyWeb/App.pm, the from_json call has a flag utf8=>0. This is the mysterious fix.

Now, remove that flag, so the line reads

my $data = from_json( param('json'));

... and reload the app, click the button, and you will get a 500 internal error response, reading:

{
   "exception" : "malformed UTF-8 character in JSON string, at character offset 9 (before \"\\x{98bd}\") at /usr/lib/perl5/site_perl/5.10/JSON.pm line 171.\n",
   "error" : "malformed UTF-8 character in JSON string, at character offset 9 (before \"\\x{98bd}\") at /usr/lib/perl5/site_perl/5.10/JSON.pm line 171.\n"
}

(NB: in earlier versions, such as 1.3072, you won't get an error.)

What puzzles me most here is the reference to \x{98bd} - I have no idea how from_json is getting \x{98bd}. What gets sent is json=%7B%22q%22%3A%22caf%C3%A9%22%7D - %C3%A9 being the UTF-8 encoding of \xe9, i.e. é.
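
One possible explanation for \x{98bd}, offered as a sketch rather than a confirmed diagnosis: assume the parser was handed the single byte \xE9 (é as one character, not the two bytes \xC3\xA9) while expecting UTF-8. \xE9 looks like the lead byte of a three-byte UTF-8 sequence, so the parser would try to combine it with the next two bytes, the closing quote (0x22) and the closing brace (0x7D). The arithmetic below shows what that combination yields. It is also consistent with the reported offset, since é is character 9 of {"q":"café"}.

```perl
use strict;
use warnings;

# Assumption: the parser saw the byte \xE9 followed by the closing
# quote (0x22) and closing brace (0x7D) of {"q":"café"}.
my @bytes = (0xE9, 0x22, 0x7D);

# Combine them the way a three-byte UTF-8 sequence is combined:
# 4 payload bits from the lead byte, 6 from each following byte.
my $code_point = (($bytes[0] & 0x0F) << 12)
               | (($bytes[1] & 0x3F) << 6)
               |  ($bytes[2] & 0x3F);

printf "\\x{%04x}\n", $code_point;    # prints \x{98bd}
```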

@ambs says

Now, why you need to make utf8 to false, because the string is in UTF8 but doesn't have the utf8 flag on. So, when asking to parse it as utf8 it will double encode the thing (I think).
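
The distinction between bytes and characters can be sketched as follows. This is a minimal illustration, not Dancer's code: it uses the core JSON::PP module's object interface in place of JSON.pm, on the understanding that from_json($text, { utf8 => $flag }) corresponds to JSON->new->utf8($flag)->decode($text). The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

# The UTF-8 bytes a browser sends for {"q":"café"} ("é" is \xC3\xA9).
my $bytes = qq({"q":"caf\xC3\xA9"});

# The same text as a character string, as a framework might hand it
# over after decoding the request parameters.
my $chars = decode('UTF-8', $bytes);

# utf8(1) expects UTF-8 bytes; utf8(0) expects already-decoded characters.
my $from_bytes = JSON::PP->new->utf8(1)->decode($bytes);
my $from_chars = JSON::PP->new->utf8(0)->decode($chars);

# Both matched pairs give the four-character string "café".
print length($from_bytes->{q}), "\n";
print length($from_chars->{q}), "\n";

# The mismatched pair: characters handed to a decoder that expects bytes.
my $mismatch_ok = eval { JSON::PP->new->utf8(1)->decode($chars); 1 } ? 1 : 0;
print "characters with utf8(1) decoded cleanly: $mismatch_ok\n";
```

If the parameter has already been decoded to characters, utf8 => 0 is the matching setting, which would explain why the flag makes the error go away.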

The question I have is "Should the utf8 flag be on anyway?" - is this something Dancer should be doing?

It seems odd to me that Dancer makes available to the user a utf8 string without the utf8 flag, but perhaps there is a good reason for it? (or I have misunderstood?)

Possibly Relevant links...

@nicolasfranck

Dancer has decoded the binary utf8 into a string, where it uses special marks like \x{ }
to refer to special characters. This is the internal representation of utf8.

But this should normally work, because from_json works on utf8-strings (whereas decode_json works on binary utf8).

I have the same problem when using this code (even with "use utf8"):

from_json("{ \"title\":\"café\" }");

But this works:

JSON::from_json("{ \"title\":\"café\" }");

although the documentation states that all parameters are passed on to JSON::from_json.

Any idea?

@yanick
yanick commented Sep 22, 2013

The documentation makes one (fairly important) omission: if the option utf8 is not passed, it is assumed to be set to true (instead of JSON's default of false). So by default Dancer's to_json and from_json actually behave like 'encode_json' and 'decode_json'.

@yanick
yanick commented Sep 22, 2013

Aargh. We deserialize with utf8 set to 1, but don't serialize with utf8 from the get-go. That's confusing. :-P

@yanick
yanick commented Sep 22, 2013

Which means that the following fails:

use 5.10.0;

use Dancer::Serializer::JSON;
use utf8;

my $data = { foo => 'café' };

$data = Dancer::Serializer::JSON::from_json( Dancer::Serializer::JSON::to_json( $data ) );

say $data->{foo};
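
The asymmetry can be reproduced outside Dancer. The sketch below assumes, per the comments above, that serialization runs with utf8 off and deserialization with utf8 on. It uses the core JSON::PP module rather than Dancer::Serializer::JSON, so it illustrates the mismatch and is not the serializer's actual code.

```perl
use strict;
use warnings;
use JSON::PP;

# "café" written with an escape so the example does not depend on
# the encoding of the source file.
my $data = { foo => "caf\x{E9}" };

# utf8(0) produces a character string; utf8(1) produces UTF-8 bytes.
my $as_chars = JSON::PP->new->utf8(0)->encode($data);
my $as_bytes = JSON::PP->new->utf8(1)->encode($data);

# The byte form is one longer because "é" takes two bytes in UTF-8.
print length($as_chars), " characters, ", length($as_bytes), " bytes\n";

# Symmetric round trip: encode and decode both with utf8(1).
my $round_trip = JSON::PP->new->utf8(1)->decode($as_bytes);
print "symmetric round trip preserved the value\n"
    if $round_trip->{foo} eq $data->{foo};

# Asymmetric round trip: encode with utf8(0), decode with utf8(1).
my $asymmetric_ok = eval { JSON::PP->new->utf8(1)->decode($as_chars); 1 } ? 1 : 0;
print "asymmetric round trip decoded cleanly: $asymmetric_ok\n";
```

Using the same utf8 setting on both sides is what makes the round trip safe, whichever value is chosen.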
@yanick added a commit that referenced this issue Sep 22, 2013
encode/decode as binary utf8
Deals with the funny stuff seen in #771.

The documentation should also make clear that we turn the
utf8 flag on by default.

We should also make sure this doesn't
break other stuff at a distance -- I'm quite wary of the
comment in 'serialize' that we don't utf8 the thing there
because it's done "later on"...
18c5517
@nicolasfranck

OK, thanks for your quick reply ;-). But I think using from_json and to_json in a binary context as well is the confusing part, especially when the documentation states that the parameters are passed to JSON::from_json. Anyway ;-)

@yanick
yanick commented Sep 22, 2013

This is unicode: confusing is par for the course. :-)

I think the underlying intent here is to have the default serialization be utf8 binary, so that when peeps do:

get '/foo' => sub { return to_json $sumfin };

it'll do The Right Thing(tm) and make sure that the data bandied back and forth is ferried using a sane format used by all. I can dig that, but yeah, I totally agree that behavior must be made crystal-clear in the docs.

@andrei-cacio

I ran into the same problem on Dancer2, but the from_json method works fine when I include the JSON package. I have no idea why, but it works this way.
