Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

eserte · 2021-05-03T14:40:53Z

The following script indicates a problem since 5.33.8:

use strict;
use warnings;
use Data::Dumper;
use Devel::Peek;

my $utf8_qr = "\x{20ac}"; $utf8_qr = qr{$utf8_qr};
my $s = [$utf8_qr, "\344"];
my $d = Data::Dumper->new([$s])->Dump;
Dump $d;
#binmode STDOUT, ':utf8'; print $d;

Output is (with 5.33.8, 5.33.9):

SV = PV(0x56407351a490) at 0x5640733eb400
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x56407341c9a0 "$VAR1 = [\n          qr/\342\202\254/u,\n          '\344'\n        ];\n"\0Malformed UTF-8 character: \xe4\x27\x0a (unexpected non-continuation byte 0x27, immediately after start byte 0xe4; need 3 bytes, got 1) in Dump at /tmp/dd.pl line 11.
 [UTF8 "$VAR1 = [\n          qr/\x{20ac}/u,\n          '\x{0}        ];\n"]
  CUR = 55
  LEN = 64

With 5.33.7 it looks OK:

SV = PV(0x55dd1ca94ec0) at 0x55dd1c966400
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55dd1c975960 "$VAR1 = [\n          qr/\342\202\254/u,\n          '\344'\n        ];\n"\0
  CUR = 55
  LEN = 64

Printing the dumped string gives me a � instead of a-umlaut.

I suspect this could be due to #18619 --- @arc, will you take a look?

The text was updated successfully, but these errors were encountered:

khwilliamson · 2021-05-03T17:21:05Z

On 5/3/21 8:41 AM, Slaven Rezić wrote: The following script indicates a problem since 5.33.8: |use strict; use warnings; use Data::Dumper; use Devel::Peek; my $utf8_qr = "\x{20ac}"; $utf8_qr = qr{$utf8_qr}; my $s = [$utf8_qr, "\344"]; my $d = Data::Dumper->new([$s])->Dump; Dump $d; #binmode STDOUT, ':utf8'; print $d; | Output is (with 5.33.8, 5.33.9): |SV = PV(0x56407351a490) at 0x5640733eb400 REFCNT = 1 FLAGS = (POK,pPOK,UTF8) PV = 0x56407341c9a0 "$VAR1 = [\n qr/\342\202\254/u,\n '\344'\n ];\n"\0Malformed UTF-8 character: \xe4\x27\x0a (unexpected non-continuation byte 0x27, immediately after start byte 0xe4; need 3 bytes, got 1) in Dump at /tmp/dd.pl line 11. [UTF8 "$VAR1 = [\n qr/\x{20ac}/u,\n '\x{0} ];\n"] CUR = 55 LEN = 64 | With 5.33.7 it looks OK: |SV = PV(0x55dd1ca94ec0) at 0x55dd1c966400 REFCNT = 1 FLAGS = (POK,pPOK) PV = 0x55dd1c975960 "$VAR1 = [\n qr/\342\202\254/u,\n '\344'\n ];\n"\0 CUR = 55 LEN = 64 | Printing the dumped string gives me a � instead of a-umlaut. I suspect this could be due to #18619 <#18619> --- @arc <https://github.com/arc>, will you take a look?

I bisected this to c71f1f2 is the first bad commit commit c71f1f2 Author: Aaron Crane ***@***.***> Date: Wed Mar 3 18:11:45 2021 +0000 Make Data::Dumper mark regex output as UTF-8 if needed

arc · 2021-05-03T20:59:37Z

Sorry, I'm not sure when I'll have chance to look into this.

khwilliamson · 2021-05-04T02:39:16Z

I single stepped through and found where the fault occurs, but I don't know enough to know what the proper course of action should be.

In Dumper.xs, there is this code

   integer_came_from_string:
       c = SvPV(val, i);
       /* the pure perl and XS non-qq outputs have historically been
        * different in this case, but for useqq, let's try to match
        * the pure perl code.
        * see [perl #74798]
        */
       if (style->useqq && safe_decimal_number(c, i)) {
           sv_catsv(retval, val);
       }
       else if (DO_UTF8(val) || style->useqq)
           i += esc_q_utf8(aTHX_ retval, c, i, DO_UTF8(val), style->useqq);
       else {
           sv_grow(retval, SvCUR(retval)+3+2*i); /* 3: ""\0 */
           r = SvPVX(retval) + SvCUR(retval);
           r[0] = '\'';
           i += esc_q(r+1, c, i);
           ++i;
           r[i++] = '\'';
           r[i] = '\0';
           SvCUR_set(retval, SvCUR(retval)+i);
       }

What is happening is that val contains "\344" which c is set to, and in the final else that is concatenated to retval which has previously been upgraded to UTF-8. Thus we concatenate an illegal byte sequence to it.

arc · 2021-05-04T09:58:10Z

What happens if we add || SvUTF8(retval) to the else if condition?

Alternatively, should we change dump_regexp() to esc_q_utf8() on sv_pattern if it's SvUTF8?

khwilliamson · 2021-05-04T14:32:53Z

On 5/4/21 3:58 AM, Aaron Crane wrote: What happens if we add ||| SvUTF8(retval)| to the |else if| condition? Alternatively, should we change |dump_regexp()| to |esc_q_utf8()| on |sv_pattern| if it's SvUTF8?

The first thing doesn't work, and the 2nd I don't understand enough to try, but I think it's barking up the wrong tree. The problem here, from reading the man page, is that the user is giving a non-ASCII single character name to a variable, so it doesn't print as the default VAR1, etc. Your first suggestion still is about retval, and not about 'val', which is the name. (Better naming of variables would have helped here, unless this is a more general routine, of which this is a particular usage.) The name looks to be appended to retval, so both must be in UTF-8 or both not. I would think that the name should be displayed as-is, without being turned into a \x{...} output. I think that what should be done is for someone who understands the intent of the program to look at this. I wonder if the original author(s) even contemplated the possibility of a non-ASCII name.

arc · 2021-05-04T17:33:36Z

I don't think this is about names of dumped objects — the output in the original bug report (correctly AFAICT) uses $VAR1.

(Having looked, I think there's a separate bug in supplying supra-ASCII names for dumped objects, but I don't think that's what's going on here.)

The original report says that the 5.33.7 output looks good, but I'm unconvinced: the U+20AC is dumped as the three bytes of its UTF-8 representation, and the U+00E4 is dumped as the single byte of its Latin-1 representation. The bug has existed throughout; it's just that now its symptom is a "malformed UTF-8" error rather than a more silent "surprise mojibake" error.

Possibly the change in the bisected-to commit should be adjusted to quote the literal parts of supra-ASCII regexes using backslash notation (just as happens for strings). That might be an easier fix than ensuring everything else knows that the generated output might be SvUTF8.

The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed. Closes #18764

arc · 2021-05-06T19:40:46Z

I've opened PR #18771 to address this issue.

I can't guarantee I'll be able to get that merged (once blead reopens), so if someone else could do that, I'd appreciate it greatly.

This reverts the XS code change from March 2021 from commit c71f1f2: Make Data::Dumper mark regex output as UTF-8 if needed retains the new tests, but skips them for Dumpxs. The change fixed one bug, but introduced another (GH #18764). The fix for both seems a little too risky this late in the release cycle, so revert to the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix with a CPAN release very soon, which likely will also be in v5.34.1

This approach (and this commit message) are based on Aaron Crane's original in GH #18771. However, we leave the pure-Perl Dump unchanged (which means changing the tests somewhat), and need to handle one more corner case (\x{...} escaping a Unicode character that follows a backslash). The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed. Closes #18764

nwc10 · 2021-05-13T19:44:48Z

Proposed fix in #18793

This approach (and this commit message) are based on Aaron Crane's original in GH #18771. However, we leave the pure-Perl Dump unchanged (which means changing the tests somewhat), and need to handle one more corner case (\x{...} escaping a Unicode character that follows a backslash). The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed. Closes #18764

This reverts the XS code change from March 2021 from commit c71f1f2: Make Data::Dumper mark regex output as UTF-8 if needed retains the new tests, but skips them for Dumpxs. The change fixed one bug, but introduced another (GH #18764). The fix for both seems a little too risky this late in the release cycle, so revert to the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix with a CPAN release very soon, which likely will also be in v5.34.1

…ssion. This reverts the XS code change from March 2021 from commit c71f1f2: Make Data::Dumper mark regex output as UTF-8 if needed retains the new tests, but skips them for Dumpxs. The change fixed one bug, but introduced another (GH Perl#18764). The fix for both seems a little too risky this late in the release cycle, so revert to the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fix with a CPAN release very soon, which likely will also be in v5.34.1

This approach (and this commit message) are based on Aaron Crane's original in GH Perl#18771. However, we leave the pure-Perl Dump unchanged (which means changing the tests somewhat), and need to handle one more corner case (\x{...} escaping a Unicode character that follows a backslash). The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed. Closes Perl#18764

eserte added the Needs Triage label May 3, 2021

jkeenan added affects-5.34 and removed Needs Triage labels May 3, 2021

arc mentioned this issue May 6, 2021

Data::Dumper: rework Unicode-in-qr support #18771

Closed

nwc10 self-assigned this May 12, 2021

jkeenan added dist-Data-Dumper issues in the dual-life blead-first Data-Dumper distribution and removed affects-5.34 labels May 20, 2021

nwc10 closed this as completed in d305535 May 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

eserte commented May 3, 2021

khwilliamson commented May 3, 2021 via email

arc commented May 3, 2021

khwilliamson commented May 4, 2021

arc commented May 4, 2021

khwilliamson commented May 4, 2021 via email

arc commented May 4, 2021

arc commented May 6, 2021

nwc10 commented May 13, 2021

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764

Comments

eserte commented May 3, 2021

khwilliamson commented May 3, 2021 via email

arc commented May 3, 2021

khwilliamson commented May 4, 2021

arc commented May 4, 2021

khwilliamson commented May 4, 2021 via email

arc commented May 4, 2021

arc commented May 6, 2021

nwc10 commented May 13, 2021