Detect invalid UTF-8 data at end of file when using PerlIO :encoding(utf-8) #59

Open
hakonhagland opened this issue Aug 4, 2016 · 7 comments

Comments

@hakonhagland

The PerlIO layer :encoding(utf-8) seems to fail to report malformed data at the end of a file.
Suppose a file $fn contains valid UTF-8 except for its final character, which has an invalid UTF-8 encoding. I would like a warning about invalid UTF-8 to be printed to STDERR when reading this file, but strangely this does not seem possible.
For example:

use feature qw(say);
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $bytes = "\x{61}\x{E5}";  # 2 bytes in iso 8859-1: aå
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;

Now $fn contains invalid UTF-8 (the last byte). If I then try to read the file using the PerlIO layer :encoding(utf-8):

my $str = '';
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
$str = do { local $/; <$fh> };
close $fh;
say "Read string: '$str'";

the output is

Read string: 'a'

Note that no warning such as "\xE5" does not map to Unicode is printed in this case.

However, if I read the file as bytes and then use Encode::decode() on the raw data, the warning is printed:

# requires "use Encode qw(decode);" near the top of the script
open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
my $raw_data = do { local $/; <$fh> };
close $fh;
my $str2 = decode( 'utf-8', $raw_data, Encode::FB_WARN | Encode::LEAVE_SRC );
# warning is printed to STDERR

Why can't the same thing be achieved with PerlIO::encoding? Is it a bug?
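
As a stopgap, the raw-read approach described above can be wrapped in a small helper. This is only a minimal sketch: read_utf8_strict is a hypothetical name, not part of Encode or PerlIO, and it uses Encode::FB_CROAK instead of FB_WARN on the assumption that malformed input should be fatal rather than just warned about.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): slurp a file as raw
# bytes, then decode the whole buffer in one strict pass.
sub read_utf8_strict {
    my ($path) = @_;
    open( my $in, '<:raw', $path )
        or die "Could not open file '$path': $!";
    my $raw = do { local $/; <$in> };
    close $in;
    $raw = '' unless defined $raw;

    # FB_CROAK makes decode() die on the first malformed sequence,
    # including one at the very end of the buffer.
    return decode( 'UTF-8', $raw, Encode::FB_CROAK() );
}
```

Wrapping the call in eval { ... } lets the caller turn the exception into a warning or a custom error message. Because the whole file is decoded at once, this avoids the buffered decoding done by the layer, at the cost of holding the raw bytes in memory.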

@pali
Contributor

pali commented Aug 5, 2016

See https://metacpan.org/pod/PerlIO::encoding. There is a variable $PerlIO::encoding::fallback, and by default the WARN_ON_ERR bit is set.

So yes, it is a bug, since you did not get a warning.

@hakonhagland
Author

hakonhagland commented Aug 5, 2016

@pali Yes. When I add the following to the code above (before starting to read the file):

use PerlIO::encoding;
printf "Current value of \$PerlIO::encoding::fallback is '0x%X'\n", $PerlIO::encoding::fallback;

The output is

Current value of $PerlIO::encoding::fallback is '0x902'

which shows that the bitmask constants WARN_ON_ERR and PERLQQ are set by default. There is also an apparently undocumented bit, 0x800, that is set by default: (0x902 & 0x800) == 0x800.

Interestingly, if I try to change the value to a code ref before reading:

$PerlIO::encoding::fallback = sub{ sprintf "<U+%04X>", shift };

The code hangs at readline (i.e. at <$fh>). Is this another bug?

@pali
Contributor

pali commented Aug 5, 2016

Look at the PerlIO::encoding source code. These bits are set by default:

our $fallback =
    Encode::PERLQQ()|Encode::WARN_ON_ERR()|Encode::STOP_AT_PARTIAL();

A coderef as the check value is supported only by some XS Encode modules, and probably not by PerlIO::encoding.
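
To see which flags a value such as the 0x902 reported earlier contains, it can be tested against the check constants that Encode provides. A minimal sketch; the value is hard-coded here as an assumption, and in a real script it could be read from $PerlIO::encoding::fallback after loading PerlIO::encoding.

```perl
use strict;
use warnings;
use Encode ();

# Value printed earlier in this thread (hard-coded for illustration)
my $fallback = 0x902;

# Check constants provided by Encode, as [ name, value ] pairs
my @bits = (
    [ DIE_ON_ERR      => Encode::DIE_ON_ERR() ],
    [ WARN_ON_ERR     => Encode::WARN_ON_ERR() ],
    [ RETURN_ON_ERR   => Encode::RETURN_ON_ERR() ],
    [ LEAVE_SRC       => Encode::LEAVE_SRC() ],
    [ PERLQQ          => Encode::PERLQQ() ],
    [ HTMLCREF        => Encode::HTMLCREF() ],
    [ XMLCREF         => Encode::XMLCREF() ],
    [ STOP_AT_PARTIAL => Encode::STOP_AT_PARTIAL() ],
);

# Keep the names whose bit is set in $fallback
my @set = map { $_->[0] } grep { $fallback & $_->[1] } @bits;
print join( ' | ', @set ), "\n";
```

For 0x902 this should list the same three constants as the default shown above, with 0x800 being STOP_AT_PARTIAL.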

@pali
Contributor

pali commented Nov 3, 2016

It looks like this is not an Encode bug but a PerlIO::encoding bug, and PerlIO is part of Perl itself. Please report this bug directly to Perl.

I used this test script:

use strict;
use warnings;
use Encode;

binmode STDOUT, ':utf8';

my $bytes = "\x{61}\x{E5}";
my $fh;

my $buf;
open $fh, '>:raw', \$buf;
print $fh $bytes;
close $fh;

open $fh, "<:encoding(UTF-8)", \$buf;
my $str = do { local $/; <$fh> };
close $fh;

print "$str\n";

open $fh, "<:raw", \$buf;
my $raw = do { local $/; <$fh> };
close $fh;
my $str2 = decode('UTF-8', $raw, Encode::FB_WARN | Encode::LEAVE_SRC);
print "$str2\n";

@tonycoz
Contributor

tonycoz commented Dec 13, 2016

It turns out this is partly an Encode issue too.

PerlIO::encoding "renew"s the encoding object to ensure it has its own encoding object (per Encode::Encoding), but Encode::decode_xs() treats such a renewed object as always stop_at_partial, which means that PerlIO::encoding can't use that encoding object to process the small amount of leftover data at EOF.

So I'm stuck trying to fix this on the PerlIO::encoding side.

Unfortunately, simply removing that renewed -> stop_at_partial will break PerlIO::encoding on validly encoded files on older perls, so I don't see a simple fix.

@pali
Contributor

pali commented Jan 26, 2017

The bug is in PerlIO::scalar and was fixed in perl 5.25.8 by this commit:
https://perl5.git.perl.org/perl.git/commit/c47992b404786dcb8752239045e21cbcd7e3d103

@tonycoz
Contributor

tonycoz commented Jan 26, 2017

There's an issue in PerlIO::encoding and the way it interacts with Encode too:

$ ./perl -e 'print "\xef\xbe"' >shortuni.txt
$ hd shortuni.txt
00000000 ef be |..|
00000002
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
(no output)

but it should be outputting a warning and \x{00EF}, like the following does:

$ ./perl -e 'print "\xef\xbeA"' >shortuni.txt
$ ./perl -Ilib -e 'binmode STDIN, ":encoding(UTF-8)"; while (<STDIN>) { print }' <shortuni.txt
utf8 "\xEF" does not map to Unicode at -e line 1.
\x{00EF}A

This is blead at v5.25.9-35-g32207c6 which includes the (irrelevant) PerlIO::scalar fix.
