Data::Dumper: Malformed UTF-8 character since 5.33.8 #18764
On 5/3/21 8:41 AM, Slaven Rezić wrote:
The following script indicates a problem since 5.33.8:

```perl
use strict;
use warnings;
use Data::Dumper;
use Devel::Peek;
my $utf8_qr = "\x{20ac}";
$utf8_qr = qr{$utf8_qr};
my $s = [$utf8_qr, "\344"];
my $d = Data::Dumper->new([$s])->Dump;
Dump $d;
#binmode STDOUT, ':utf8';
print $d;
```
Output is (with 5.33.8, 5.33.9):

```
SV = PV(0x56407351a490) at 0x5640733eb400
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x56407341c9a0 "$VAR1 = [\n qr/\342\202\254/u,\n '\344'\n ];\n"\0
Malformed UTF-8 character: \xe4\x27\x0a (unexpected non-continuation byte 0x27, immediately after start byte 0xe4; need 3 bytes, got 1) in Dump at /tmp/dd.pl line 11.
  [UTF8 "$VAR1 = [\n qr/\x{20ac}/u,\n '\x{0} ];\n"]
  CUR = 55
  LEN = 64
```
With 5.33.7 it looks OK:

```
SV = PV(0x55dd1ca94ec0) at 0x55dd1c966400
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x55dd1c975960 "$VAR1 = [\n qr/\342\202\254/u,\n '\344'\n ];\n"\0
  CUR = 55
  LEN = 64
```
Printing the dumped string gives me a � instead of an a-umlaut.
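[Editorial sketch] The failure mode can be reproduced in pure Perl: turning on the UTF8 flag on a buffer that still holds a raw Latin-1 byte, which is (as far as I can tell) what the bisected XS change does via `SvUTF8_on`, yields exactly this kind of malformed string. `Encode::_utf8_on` is an internals-poking function used here only to mimic the flag flip:

```perl
use strict;
use warnings;
use Encode ();

# A string holding the single Latin-1 byte \344 (a-umlaut).
my $s = "\344";

# Claim the buffer is already UTF-8 without re-encoding its bytes,
# mimicking what SvUTF8_on does on the XS side.
Encode::_utf8_on($s);

print utf8::is_utf8($s) ? "UTF8 flag set\n" : "UTF8 flag off\n";
print utf8::valid($s)   ? "well-formed\n"   : "malformed UTF-8\n";
```

The flag is on, but the byte sequence is not valid UTF-8, which is precisely what the "Malformed UTF-8 character" warning is complaining about.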
I suspect this could be due to #18619 --- @arc, will you take a look?
Sorry, I'm not sure when I'll have a chance to look into this.
I single-stepped through and found where the fault occurs, but I don't know enough to know what the proper course of action should be. In Dumper.xs, there is this code:
On 5/4/21 3:58 AM, Aaron Crane wrote:
What happens if we add `|| SvUTF8(retval)` to the `else if` condition? Alternatively, should we change `dump_regexp()` to use `esc_q_utf8()` on `sv_pattern` if it's SvUTF8?
The first thing doesn't work, and the second I don't understand enough to try, but I think it's barking up the wrong tree.
The problem here, from reading the man page, is that the user is giving a single-character non-ASCII name to a variable, so it doesn't print as the default `$VAR1`, etc.
Your first suggestion is still about `retval`, and not about `val`, which is the name. (Better naming of variables would have helped here, unless this is a more general routine, of which this is a particular usage.) The name looks to be appended to `retval`, so both must be in UTF-8 or both not.
I would think that the name should be displayed as-is, without being turned into `\x{...}` output.
I think that what should be done is for someone who understands the
intent of the program to look at this. I wonder if the original
author(s) even contemplated the possibility of a non-ASCII name.
I don't think this is about names of dumped objects: the output in the original bug report (correctly AFAICT) uses the default `$VAR1` naming. (Having looked, I think there's a separate bug in supplying supra-ASCII names for dumped objects, but I don't think that's what's going on here.)

The original report says that the 5.33.7 output looks OK, but I'm unconvinced: the U+20AC is dumped as the three bytes of its UTF-8 representation, and the U+00E4 is dumped as the single byte of its Latin-1 representation. The bug has existed throughout; it's just that now its symptom is a "malformed UTF-8" error rather than a more silent "surprise mojibake" error.

Possibly the change in the bisected-to commit should be adjusted to quote the literal parts of supra-ASCII regexes using backslash notation (just as happens for strings). That might be an easier fix than ensuring everything else knows that the generated output might be SvUTF8.
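[Editorial sketch] The point about the 5.33.7 output can be checked directly: the PV mixes the three UTF-8 bytes of U+20AC with the single Latin-1 byte of U+00E4, so no single encoding decodes the whole buffer. The byte string below is copied from the Devel::Peek output above:

```perl
use strict;
use warnings;
use Encode ();

# The interesting bytes from the 5.33.7 PV: "\342\202\254" (the UTF-8
# encoding of U+20AC) followed by "\344" (the Latin-1 byte for U+00E4).
my $bytes = "\342\202\254\344";

# Decoding the whole buffer as strict UTF-8 fails on the trailing \344.
# (Decode on a copy: with a CHECK argument, decode modifies its input.)
my $copy    = $bytes;
my $as_utf8 = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
print defined $as_utf8 ? "decodes cleanly as UTF-8\n" : "not valid UTF-8\n";

# Decoding it as Latin-1 instead turns the euro sign into mojibake:
# four characters come out where the intended string has only two.
my $as_latin1 = Encode::decode('ISO-8859-1', $bytes);
printf "%d characters when read as Latin-1 (intended string has 2)\n",
    length $as_latin1;
```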
The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed. Closes #18764
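[Editorial sketch] The `\x{…}` strategy described above can be illustrated in a few lines of pure Perl. `escape_pattern` is a hypothetical helper, not the actual Dumper.xs code (the real change lives in `dump_regexp()`): every supra-ASCII character in the pattern text is rewritten as a `\x{…}` escape, so the dumped string stays plain ASCII and never needs the UTF8 flag.

```perl
use strict;
use warnings;

# Hypothetical helper sketching the escaping strategy: rewrite every
# character outside the ASCII range as a \x{...} escape.
sub escape_pattern {
    my ($pat) = @_;
    $pat =~ s/([^\x00-\x7f])/sprintf '\\x{%x}', ord $1/ge;
    return $pat;
}

print escape_pattern("\x{20ac}"), "\n";   # prints: \x{20ac}
print escape_pattern("abc"), "\n";        # ASCII passes through: abc
```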
I've opened PR #18771 to address this issue. I can't guarantee I'll be able to get that merged (once blead reopens), so if someone else could do that, I'd appreciate it greatly.
This reverts the XS code change from March 2021 from commit c71f1f2 ("Make Data::Dumper mark regex output as UTF-8 if needed"); it retains the new tests, but skips them for Dumpxs. The change fixed one bug, but introduced another (GH #18764). The fix for both seems a little too risky this late in the release cycle, so revert to the v5.32.0 behaviour for the v5.34.0 release itself. Both bugs will be fixed with a CPAN release very soon, which will likely also be in v5.34.1.
This approach (and this commit message) are based on Aaron Crane's original in GH #18771. However, we leave the pure-Perl Dump unchanged (which means changing the tests somewhat), and need to handle one more corner case (`\x{...}` escaping a Unicode character that follows a backslash).

The previous approach was to upgrade the output to the internal UTF-8 encoding when dumping a regex containing supra-Latin-1 characters. That has the disadvantage that nothing else generates wide characters in the output, or even knows that the output might be upgraded. A better approach, and one that's more consistent with the one taken for string literals, is to use `\x{…}` notation where needed.

Closes #18764
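[Editorial sketch] The backslash corner case mentioned in the commit message can be seen with a hypothetical per-character escaper (not the actual fix): if the pattern contains a literal backslash followed by a supra-Latin-1 character, naive escaping produces `\\x{e4}`, where the original backslash now escapes the backslash of the new `\x{e4}` sequence.

```perl
use strict;
use warnings;

# Naive per-character escaping (hypothetical sketch; the real fix in
# Dumper.xs has to treat a preceding backslash specially).
sub naive_escape {
    my ($pat) = @_;
    $pat =~ s/([^\x00-\x7f])/sprintf '\\x{%x}', ord $1/ge;
    return $pat;
}

# A pattern text consisting of a literal backslash, then a-umlaut.
my $pat = "\\\x{e4}";

# Naive escaping yields \\x{e4}: a regex reading that text sees an
# escaped backslash followed by the literal characters x{e4}, not a
# backslash followed by the character U+00E4.
print naive_escape($pat), "\n";   # prints: \\x{e4}
```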
Proposed fix in #18793