Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF damage #36

Open
Rhaley opened this issue Jun 19, 2015 · 10 comments
Open

UTF damage #36

Rhaley opened this issue Jun 19, 2015 · 10 comments

Comments

@Rhaley
Copy link

Rhaley commented Jun 19, 2015

in the subroutine '_dump_scalar', at line 675, there's a test which my UTF passed, as it has bytes that match the test. Once the code in that block is finished, my UTF isn't UTF anymore.

I fixed this on my own machine by adding ! utf8::valid($string) before the test, and my UTF stayed UTF.

However, the side effect is that some characters which were formerly escaped in that block are creeping through and causing the code to die on 'illegal characters'. I haven't gotten back to looking at which characters, but I suspect quotes.

@writch
Copy link

writch commented Jul 8, 2015

I'm not Rhaley any longer, BTW.

@karenetheridge
Copy link
Member

as on irc: what do you mean by "my UTF isn't UTF anymore"?

Can you provide a code snippet that demonstrates your issue?

@writch
Copy link

writch commented Jul 8, 2015

I've placed the following branch up:

writch@450eef5

In the test, t/36_dump_utf8.t, I take the same string and utf8 encode one element of a hash while leaving the other in wide char mode before calling YAML::Tiny::Dump.

@writch
Copy link

writch commented Jul 8, 2015

I also noted that the test was which I moved to allow escaping of newlines and quotes was testing for single quote and then escaping double-quotes.

I changed it to test for and escape all three.

@karenetheridge
Copy link
Member

thanks, I'll take a closer look tonight.

I tidied and simplified the test to this, to more clearly indicate what is happening:

use strict;
use warnings;
use utf8;
use Test::More 0.88;
use YAML::Tiny;

my $string = "美国的私有退休金体制\n";

my %hash = (
    octets => $string,
    characters => $string,
);

utf8::encode($hash{octets});

my $yaml = YAML::Tiny::Dump(\%hash);

my($t1) = $yaml =~ /characters: (.*)/;
my($t2) = $yaml =~ /octets: (.*)/;

utf8::encode($t1);
is($t1, $t2, "didn't corrupt the utf8-encoded string");

done_testing;

@miyagawa
Copy link
Member

miyagawa commented Jul 8, 2015

if you have a string with utf8 pragma ($string in your example), and then another with utf8::encode, they are not supposed to be the same string, perl considers the latter as latin-1 strings.

There's no way YAML or any other libraries can handle them, and since YAML requires all text as just text, you have to pass them in perl text strings. The current behavior of YAML::Tiny that renders some byte format might have a bit to be desired (like, properly encoding them using YAML's !binary tag?) but "fixing" it to output the same string sounds not right to me.

@karenetheridge
Copy link
Member

@miyagawa I'm not sure of his intent, but my (perhaps naive) expectation is that even encoded strings should properly round-trip. But to test that, we should not be extracting a snippet of the output from Dump() using a regex, but rather call Load() on the dumped output and check that we got the same series of octets back again. That's where we'd use the !binary tag -- a utf8-encoded string is not a string, so we cannot simply Dump() it and expect to the see the same thing as the unicode character string.

@karenetheridge
Copy link
Member

the PR is in #40

@ingydotnet
Copy link
Contributor

I'll have a look at this tomorrow.

@writch
Copy link

writch commented Jul 13, 2015

I already submitted the code which deals with the actual problem.

Instead of utf8::valid, I decode it and check the length. If it changes, it's encoded, if not, it's not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants