Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

Closed
eserte opened this issue May 3, 2021 · 8 comments
Closed

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

eserte opened this issue May 3, 2021 · 8 comments
Assignees
Labels
dist-Data-Dumper issues in the dual-life blead-first Data-Dumper distribution

Comments

@eserte
Copy link
Contributor

eserte commented May 3, 2021

The following script indicates a problem since 5.33.8:

use strict;
use warnings;
use Data::Dumper;
use Devel::Peek;

my $utf8_qr = "\x{20ac}"; $utf8_qr = qr{$utf8_qr};
my $s = [$utf8_qr, "\344"];
my $d = Data::Dumper->new([$s])->Dump;
Dump $d;
#binmode STDOUT, ':utf8'; print $d;

Output is (with 5.33.8, 5.33.9):

SV = PV(0x56407351a490) at 0x5640733eb400
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x56407341c9a0 "$VAR1 = [\n          qr/\342\202\254/u,\n          '\344'\n        ];\n"\0Malformed UTF-8 character: \xe4\x27\x0a (unexpected non-continuation byte 0x27, immediately after start byte 0xe4; need 3 bytes, got 1) in Dump at /tmp/dd.pl line 11.
 [UTF8 "$VAR1 = [\n          qr/\x{20ac}/u,\n          '\x{0}        ];\n"]
  CUR = 55
  LEN = 64

With 5.33.7 it looks OK:

SV = PV(0x55dd1ca94ec0) at 0x55dd1c966400
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55dd1c975960 "$VAR1 = [\n          qr/\342\202\254/u,\n          '\344'\n        ];\n"\0
  CUR = 55
  LEN = 64

Printing the dumped string gives me a � instead of a-umlaut.

I suspect this could be due to #18619 --- @arc, will you take a look?

@khwilliamson
Copy link
Contributor

khwilliamson commented May 3, 2021 via email

@arc
Copy link
Contributor

arc commented May 3, 2021

Sorry, I'm not sure when I'll have chance to look into this.

@khwilliamson
Copy link
Contributor

I single stepped through and found where the fault occurs, but I don't know enough to know what the proper course of action should be.

In Dumper.xs, there is this code

   integer_came_from_string:
       c = SvPV(val, i);
       /* the pure perl and XS non-qq outputs have historically been
        * different in this case, but for useqq, let's try to match
        * the pure perl code.
        * see [perl #74798]
        */
       if (style->useqq && safe_decimal_number(c, i)) {
           sv_catsv(retval, val);
       }
       else if (DO_UTF8(val) || style->useqq)
           i += esc_q_utf8(aTHX_ retval, c, i, DO_UTF8(val), style->useqq);
       else {
           sv_grow(retval, SvCUR(retval)+3+2*i); /* 3: ""\0 */
           r = SvPVX(retval) + SvCUR(retval);
           r[0] = '\'';
           i += esc_q(r+1, c, i);
           ++i;
           r[i++] = '\'';
           r[i] = '\0';
           SvCUR_set(retval, SvCUR(retval)+i);
       }

What is happening is that val contains "\344" which c is set to, and in the final else that is concatenated to retval which has previously been upgraded to UTF-8. Thus we concatenate an illegal byte sequence to it.

@arc
Copy link
Contributor

arc commented May 4, 2021

What happens if we add || SvUTF8(retval) to the else if condition?

Alternatively, should we change dump_regexp() to esc_q_utf8() on sv_pattern if it's SvUTF8?

@khwilliamson
Copy link
Contributor

khwilliamson commented May 4, 2021 via email

@arc
Copy link
Contributor

arc commented May 4, 2021

I don't think this is about names of dumped objects — the output in the original bug report (correctly AFAICT) uses $VAR1.

(Having looked, I think there's a separate bug in supplying supra-ASCII names for dumped objects, but I don't think that's what's going on here.)

The original report says that the 5.33.7 output looks good, but I'm unconvinced: the U+20AC is dumped as the three bytes of its UTF-8 representation, and the U+00E4 is dumped as the single byte of its Latin-1 representation. The bug has existed throughout; it's just that now its symptom is a "malformed UTF-8" error rather than a more silent "surprise mojibake" error.

Possibly the change in the bisected-to commit should be adjusted to quote the literal parts of supra-ASCII regexes using backslash notation (just as happens for strings). That might be an easier fix than ensuring everything else knows that the generated output might be SvUTF8.

arc added a commit that referenced this issue May 6, 2021
The previous approach was to upgrade the output to the internal UTF-8
encoding when dumping a regex containing supra-Latin-1 characters. That
has the disadvantage that nothing else generates wide characters in the
output, or even knows that the output might be upgraded.

A better approach, and one that's more consistent with the one taken for
string literals, is to use `\x{…}` notation where needed.

Closes #18764
@arc
Copy link
Contributor

arc commented May 6, 2021

I've opened PR #18771 to address this issue.

I can't guarantee I'll be able to get that merged (once blead reopens), so if someone else could do that, I'd appreciate it greatly.

@nwc10 nwc10 self-assigned this May 12, 2021
nwc10 added a commit that referenced this issue May 13, 2021
This reverts the XS code change from March 2021 from commit c71f1f2:
    Make Data::Dumper mark regex output as UTF-8 if needed

retains the new tests, but skips them for Dumpxs.

The change fixed one bug, but introduced another (GH #18764). The fix for
both seems a little too risky this late in the release cycle, so revert to
the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix
with a CPAN release very soon, which likely will also be in v5.34.1
nwc10 added a commit that referenced this issue May 13, 2021
This approach (and this commit message) are based on Aaron Crane's original
in GH #18771. However, we leave the pure-Perl Dump unchanged (which means
changing the tests somewhat), and need to handle one more corner case
(\x{...} escaping a Unicode character that follows a backslash).

The previous approach was to upgrade the output to the internal UTF-8
encoding when dumping a regex containing supra-Latin-1 characters. That
has the disadvantage that nothing else generates wide characters in the
output, or even knows that the output might be upgraded.

A better approach, and one that's more consistent with the one taken for
string literals, is to use `\x{…}` notation where needed.

Closes #18764
@nwc10
Copy link
Contributor

nwc10 commented May 13, 2021

Proposed fix in #18793

nwc10 added a commit that referenced this issue May 14, 2021
This approach (and this commit message) are based on Aaron Crane's original
in GH #18771. However, we leave the pure-Perl Dump unchanged (which means
changing the tests somewhat), and need to handle one more corner case
(\x{...} escaping a Unicode character that follows a backslash).

The previous approach was to upgrade the output to the internal UTF-8
encoding when dumping a regex containing supra-Latin-1 characters. That
has the disadvantage that nothing else generates wide characters in the
output, or even knows that the output might be upgraded.

A better approach, and one that's more consistent with the one taken for
string literals, is to use `\x{…}` notation where needed.

Closes #18764
rjbs pushed a commit that referenced this issue May 15, 2021
This reverts the XS code change from March 2021 from commit c71f1f2:
    Make Data::Dumper mark regex output as UTF-8 if needed

retains the new tests, but skips them for Dumpxs.

The change fixed one bug, but introduced another (GH #18764). The fix for
both seems a little too risky this late in the release cycle, so revert to
the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix
with a CPAN release very soon, which likely will also be in v5.34.1
@jkeenan jkeenan added dist-Data-Dumper issues in the dual-life blead-first Data-Dumper distribution and removed affects-5.34 labels May 20, 2021
@nwc10 nwc10 closed this as completed in d305535 May 22, 2021
Corion pushed a commit to Corion/perl5 that referenced this issue Jun 20, 2021
…ssion.

This reverts the XS code change from March 2021 from commit c71f1f2:
    Make Data::Dumper mark regex output as UTF-8 if needed

retains the new tests, but skips them for Dumpxs.

The change fixed one bug, but introduced another (GH Perl#18764). The fix for
both seems a little too risky this late in the release cycle, so revert to
the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix
with a CPAN release very soon, which likely will also be in v5.34.1
Corion pushed a commit to Corion/perl5 that referenced this issue Jun 20, 2021
This approach (and this commit message) are based on Aaron Crane's original
in GH Perl#18771. However, we leave the pure-Perl Dump unchanged (which means
changing the tests somewhat), and need to handle one more corner case
(\x{...} escaping a Unicode character that follows a backslash).

The previous approach was to upgrade the output to the internal UTF-8
encoding when dumping a regex containing supra-Latin-1 characters. That
has the disadvantage that nothing else generates wide characters in the
output, or even knows that the output might be upgraded.

A better approach, and one that's more consistent with the one taken for
string literals, is to use `\x{…}` notation where needed.

Closes Perl#18764
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dist-Data-Dumper issues in the dual-life blead-first Data-Dumper distribution
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants