Encode (and core), wrong handling of STOP_AT_PARTIAL #18908

Closed
wolfdiem opened this issue Jun 19, 2021 · 1 comment

@wolfdiem

Summary:

The STOP_AT_PARTIAL bit in the CHECK parameter is ignored by
(at least) some encodings in Encode::encode() and iso-8859-1 in
core. The analysis below suggests that the action expected from
STOP_AT_PARTIAL == 0 or 1 is instead governed by some internally
generated decision (based on the ordinal value of the partial
character and on properties of the encoding).

The bug appears on encode(), when a partial multibyte character
is located at the end of the input buffer.

The incorrect handling of STOP_AT_PARTIAL also affects some other
values in CHECK. Processing of WARN_ON_ERR, PERLQQ, HTMLCREF, XMLCREF
and CODEREF is never executed for any partial character that results
in a "yes" decision (see description below).
It also causes a malfunction in PerlIO::encoding for characters
split at buffer boundaries (see at bottom).

The current behaviour is buggy, independent of the fact that
STOP_AT_PARTIAL is not really specified and documented, as Leont
noted in github #16345. This missing specification is discussed
near the end of this report.

==================================================================
Expected behaviour:

encode() shall obey the STOP_AT_PARTIAL bit when a partial character
(according to either of the choices (b) or (c), cf. discussion near the
end) is found at the end of the input buffer, and

  • if "1", shall not process this partial character, but instead
    return it in the input buffer to the caller,
  • if "0", shall perform the fallback as given in the CHECK
    parameter, i.e. replace the partial (and thus malformed) character
    with e.g.:
    CHECK == 0: substitution character "?"
    PERLQQ : Unicode replacement character and apply the fallback
    "\x{hhhh}" (= "\x{fffd}") to that (HTMLCREF and XMLCREF similar).

==================================================================
Description of bug:

The expected behaviour is not followed at least in the tested encodings,
i.e. all iso-8859-x encodings (and also a few test encodings generated
over the enc2xs path as described in the enc2xs documentation).

The actual behaviour (seen from a user level) can be described by the
following logical steps in processing (maybe they are not intended in
the implementation, and only happen as side effect):
  1. It seems that all Unicode characters to be mapped (in the .ucm file)
     are "grouped" according to their start byte (\xE0, \xE1, \xE2, ...,
     \xEF, \xF0 ... ???). The only exception is the lowest "group",
     comprising all 1- and 2-byte characters. Whether this "grouping" is
     done explicitly or as a side effect of the implementation is not clear.
  2. If there is a partial character at the end of the buffer, go to 3.
     Otherwise follow the ordinary procedures, which are not covered in
     this issue.
  3. Get the start byte of the partial character and look up whether there
     exists at least one Unicode character to be mapped
     (<U....> in .ucm) with the same start byte (i.e. in the same
     "group"; 2-byte characters are in one separate "group").
     This look-up decision is also referred to below as the "yes/no"
     decision, as it governs the action taken.
  4. If "yes", act as if the STOP_AT_PARTIAL bit was "1".
     If "no", act as if the STOP_AT_PARTIAL bit was "0".

Steps 3 and 4 are executed regardless of the actual value of the
STOP_AT_PARTIAL bit ("0" or "1"). This means that for step 4 the
STOP_AT_PARTIAL bit in CHECK is completely ignored and it is replaced
by the "yes/no" decision of step 3.

This is not the expected (and clearly not a documented) behaviour.
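
The four steps above can be modelled in a few lines of Perl. This is only a sketch of the behaviour as observed from user level, not Encode's implementation. The helper names (group_of_codepoint, group_of_start_byte, observed_decision) are hypothetical, and the iso-8859-15 table is reduced to iso-8859-1 plus the Euro Sign, which is the only entry that matters for this example.

```perl
use strict;
use warnings;

# Hypothetical model of the observed "yes/no" decision (not Encode's code).

# "Group" of a code point: its UTF-8 start byte, except that all
# 1- and 2-byte characters (<= U+07FF) share one lowest group.
sub group_of_codepoint {
    my ($cp) = @_;
    return 'low' if $cp <= 0x7FF;
    my $utf8 = chr($cp);
    utf8::encode($utf8);                 # character string -> UTF-8 bytes
    return ord(substr($utf8, 0, 1));     # start byte as a number
}

# "Group" of the start byte of a partial character.
sub group_of_start_byte {
    my ($byte) = @_;
    return $byte < 0xE0 ? 'low' : $byte;
}

# "yes" if at least one mapped code point falls into the same group
# as the start byte of the partial character, otherwise "no".
sub observed_decision {
    my ($start_byte, @mapped) = @_;
    my %groups;
    $groups{ group_of_codepoint($_) } = 1 for @mapped;
    return $groups{ group_of_start_byte($start_byte) } ? 'yes' : 'no';
}

my @latin1 = (0x00 .. 0xFF);             # all mapped code points <= U+07FF
my @latin9 = (@latin1, 0x20AC);          # simplified: adds the Euro Sign

print observed_decision(0xE2, @latin1), "\n";   # prints "no"
print observed_decision(0xE2, @latin9), "\n";   # prints "yes"
```

In this model the value of STOP_AT_PARTIAL does not appear at all, which is exactly the problem described above.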

==================================================================
Some examples reproducing the bug:

The test program "part.pl" (see below) shows this behaviour, used
with STOP_AT_PARTIAL = "0" / "1" and with the encodings iso-8859-1
and iso-8859-15. These two encodings were taken as example, as they
are similar but differ in having a Unicode character in "group \xE2"
or not. The output should depend on STOP_AT_PARTIAL, but instead it
is only dependent on the encoding. iso-8859-1 has characters in the
lowest group <= \x{7FF} only, thus "yes" (as if STOP_AT_PARTIAL)
is only applied to partial characters with start byte < \xE0.
iso-8859-15 additionally has the Euro Sign \x{20AC} in "group \xE2",
thus "yes" is also applied to all partial characters with
start byte \xE2.
(below "x=" shows the input buffer given back to the caller,
which should be empty on check == 0x0102,
and should contain the partial character on check == 0x0902.)

D:\Perl_bug\encode_xs>part.pl
"\x{fffd}" does not map to iso-8859-1 at ...\part.pl line 19.
ab\x{fffd}
check=0x0102, encoding=iso-8859-1, bytes=\x61\x62\xE2\x99, x=""

ab
check=0x0102, encoding=iso-8859-15, bytes=\x61\x62\xE2\x99, x="\xE2\x99"

"\x{fffd}" does not map to iso-8859-1 at ...\part.pl line 19.
ab\x{fffd}
check=0x0902, encoding=iso-8859-1, bytes=\x61\x62\xE2\x99, x=""

ab
check=0x0902, encoding=iso-8859-15, bytes=\x61\x62\xE2\x99, x="\xE2\x99"

If CHECK == 0 or CHECK == CODEREF or STOP_AT_PARTIAL only is set
(and not PERLQQ etc.), the partial character is printed (with "?"
or according to CODEREF) on iso-8859-1, but nothing is printed on
iso-8859-15. Also here STOP_AT_PARTIAL ("0" or "1") is ignored,
and instead "yes/no" determines the response.
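
For reference, the check values shown in the output are plain bitwise ORs of the Encode constants. The short sketch below only recomputes the three values that appear in this report (0x0102, 0x0902 and 0x0006) and makes no statement about other flags.

```perl
use strict;
use warnings;
use Encode qw(WARN_ON_ERR RETURN_ON_ERR PERLQQ STOP_AT_PARTIAL);

# part.pl without STOP_AT_PARTIAL
my $plain   = WARN_ON_ERR | PERLQQ;
# part.pl with STOP_AT_PARTIAL
my $partial = WARN_ON_ERR | PERLQQ | STOP_AT_PARTIAL;
# ret.pl
my $ret     = WARN_ON_ERR | RETURN_ON_ERR;

printf "0x%04X\n", $_ for $plain, $partial, $ret;
```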


The test program "ret.pl" (see below) shows that the same bug also
causes a malfunction with 'RETURN_ON_ERR | WARN_ON_ERR' or with
'PERLQQ etc.' only. The response should be independent of the
encoding, but on "yes" 'as if STOP_AT_PARTIAL' is invoked, and thus
the malformed (partial) character is not processed (i.e. no output of
warning or fallback), but it is only given back to the caller.

D:\Perl_bug\encode_xs>ret.pl
"\x{fffd}" does not map to iso-8859-1 at ...\ret.pl line 14.
ab
check=0x0006, encoding=iso-8859-1, bytes=\x61\x62\xE2\x99, x="\xE2\x99"

ab
check=0x0006, encoding=iso-8859-15, bytes=\x61\x62\xE2\x99, x="\xE2\x99"

======================================================================
Quick test of UTF8 encodings for comparison:

  • "utf8" (loose) seems to completely ignore STOP_AT_PARTIAL, as it
    never returned a partial character to the caller. It always assumes
    'as if not STOP_AT_PARTIAL'. In addition, it also ignores any
    fallback, thus partial characters are just passed through as bytes.
  • "utf-8-strict" seems to obey STOP_AT_PARTIAL = "0" or "1" for all
    partial characters according to choice (c) (see below) and for the
    indecisive single start byte \xF4. Even with STOP_AT_PARTIAL = "1"
    it does apply fallback without returning anything to the caller for
    some formally correct, but invalid partial characters (i.e.
    utf8-loose, utf8-Perl-extended and non-canonical characters, e.g.
    start bytes \xC0 and \xC1). Whether this is fine or questionable
    depends on the (future) exact definition of 'STOP_AT_PARTIAL'. If the
    remaining bytes at the end of the buffer do not constitute a
    formally correct partial character, then "fallback" is correctly
    applied.
    A disadvantage of the definition of 'partial character' used here is
    the inconsistent handling of invalid characters with
    PerlIO::encoding. If such a character is split over two buffers, two
    or even more replacement characters (and warnings) are returned;
    otherwise only one replacement character (and warning) is given.

=================================================================
Impact on PerlIO::encoding:

This strange behaviour also leads to surprising results when the
encoding is used in PerlIO::encoding. E.g. a Unicode character in
range \N{U+2000}-\N{U+2FFF} (when split by a buffer boundary) will
give a different response depending on the encoding and the split
position (example: \N{U+2660} == \xE2\x99\xA0 in test program
"perlio.pl" below):

for split after 2nd byte (to iso-8859-1):
"\x{fffd}" does not map to iso-8859-1 at ...\perlio.pl line 14.
"\x{fffd}" does not map to iso-8859-1 at ...\perlio.pl line 14.
output: '...ab\x{fffd}\x{fffd}c'

for split after 1st byte (to iso-8859-1):
"\x{fffd}" does not map to iso-8859-1 at ...\perlio.pl line 14.
"\x{fffd}" does not map to iso-8859-1 at ...\perlio.pl line 14.
"\x{fffd}" does not map to iso-8859-1 at ...\perlio.pl line 14.
output: '...ab\x{fffd}\x{fffd}\x{fffd}c'

for any split position (to iso-8859-15):
"\x{2660}" does not map to iso-8859-15 at ...\perlio.pl line 14.
output: '...ab\x{2660}c'

The two upper (unexpected) responses only stem from the fact that
here STOP_AT_PARTIAL is ignored by encode() and on "no" (in
iso-8859-1) is replaced by 'as if not STOP_AT_PARTIAL'.
Thus the partial character is not given back to the caller and the
complete character is never seen for fallback handling.

BTW, this bug can only come up if the (split-up) Unicode character
needs fallback handling. Characters resulting in a "yes" (including
all mappable characters) are handled 'as if STOP_AT_PARTIAL' is
set, which coincides with the mandatory setting for PerlIO::encoding.

(On the other hand, this bug automatically provides in all affected
encode() modules the 'as if STOP_AT_PARTIAL' for all "yes"
characters (includes all mappable characters). Thus not setting
STOP_AT_PARTIAL does not even lead to faulty behaviour for these
characters. E.g. for the iso-8859-15 case above the partial
character is given back to the caller even with STOP_AT_PARTIAL == 0,
and thus the response with \x{2660} is given. The correct response
would have been \x{fffd}\x{fffd} (or \x{fffd}\x{fffd}\x{fffd}), as
the partial character should not have been given back to the caller.
This is surely not the intention of the bug, but the bug may be
a remnant of trying to adapt encode() to PerlIO::encoding before
STOP_AT_PARTIAL was introduced.)

Note on Perl v5.34: github #18496 was applied with the right
intention, but because of this bug in Encode it can not help with
problems around STOP_AT_PARTIAL, as this bit is just ignored.
Note that github #18496 always enforces STOP_AT_PARTIAL = "1" in
$PerlIO::encoding::fallback, thus tests with STOP_AT_PARTIAL = "0"
in $PerlIO::encoding::fallback are no longer possible in v5.34.
(Not tested, as strawberry perl v5.34 is not yet available.)

(Test with $PerlIO::encoding::fallback == default 0x0912 / 0x0902)

=============================================================
Specification of STOP_AT_PARTIAL and 'partial character':

This issue should be decided, probably before fixing the bug to
avoid any later adaptations.

It is not specified which byte sequence (not representing a
complete Unicode character) is seen as a "partial character", or
where it can appear. The following tries to discuss the issues
(for encode() only):

  1. Position of partial character:
    STOP_AT_PARTIAL is only mentioned, but not really documented.
    The use case given for FB_QUIET (== RETURN_ON_ERR) is more an
    example for STOP_AT_PARTIAL, but this bit was probably
    introduced later than the writing of the FB_QUIET documentation.
    From the use cases e.g. in PerlIO::encoding one can assume that
    STOP_AT_PARTIAL shall only act for partial characters at the end
    of the buffer given to encode(). This allows correct action for
    PerlIO::encoding in contrast to RETURN_ON_ERR, which also acts
    in the middle of the buffer or on any character not convertible,
    where returning the unprocessed part to the caller would not
    allow faultless continuation of the conversion anyhow. So let's
    assume 'end of input buffer' for the further discussion.
  2. But what are partial characters?
    They may be defined differently based on the complete character
    they may belong to:
    a) any byte sequence which does not give a complete internal
    character (this is not really 'partial', so let's drop this)
    b) formally correct beginning of a multibyte internal character
    (start byte \xC0-\xFF followed by 0 to n-2 continuation
    bytes \x80-\xBF with n = length of complete character)
    c) (b) and excluding non-canonical codings and/or surrogates
    and/or utf8-loose and/or utf8-Perl-extended
    Anyhow the definition of "partial character" only depends on the
    internal native encoding (when in Unicode). It has no dependence
    on the target encoding of encode().
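
Choice (b) can be tested purely on the bytes at the end of the buffer. The following is a minimal sketch under stated assumptions, not a proposal for the actual implementation: the function names are hypothetical, the input is a byte string, and the sequence lengths for start bytes \xFE and \xFF (7 and 13) are assumed to follow Perl's extended UTF-8 on ASCII platforms.

```perl
use strict;
use warnings;

# Total length of a multibyte sequence, derived from its start byte.
# Returns 0 for bytes that cannot start a multibyte character.
sub sequence_length {
    my ($start) = @_;
    return 0 if $start < 0xC0;    # ASCII or continuation byte
    return 2 if $start < 0xE0;
    return 3 if $start < 0xF0;
    return 4 if $start < 0xF8;
    return 5 if $start < 0xFC;
    return 6 if $start < 0xFE;
    return 7 if $start == 0xFE;   # assumption: Perl-extended UTF-8
    return 13;                    # assumption: Perl-extended UTF-8 (\xFF)
}

# Number of bytes at the end of $bytes that form a formally correct
# partial character according to choice (b), or 0 if there is none.
sub partial_tail_length {
    my ($bytes) = @_;
    my $len = length $bytes;
    my $i   = $len - 1;
    # Walk back over trailing continuation bytes \x80-\xBF.
    $i-- while $i >= 0 && (ord(substr($bytes, $i, 1)) & 0xC0) == 0x80;
    return 0 if $i < 0;                       # no start byte found
    my $need = sequence_length(ord(substr($bytes, $i, 1)));
    my $have = $len - $i;                     # start byte + continuations
    return ($need && $have < $need) ? $have : 0;
}

print partial_tail_length("ab\xE2\x99"), "\n";      # prints 2
print partial_tail_length("ab\xE2\x99\xA0"), "\n";  # prints 0 (complete)
```

Note that the target encoding is not an input to this check, in line with the statement above that the definition depends only on the internal encoding.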

STOP_AT_PARTIAL itself does not originate any mapping, conversion
or output, but it only allows handling complete Unicode characters
despite these being split over two buffers. Processing for mapping
or warning/error can always be done when the input buffer with the
complete Unicode character is received again from the caller.

Proposal:
Thus it seems best to "STOP" for any formally correct partial
character, i.e. choice (b) above, if STOP_AT_PARTIAL = "1".
Besides making the handling of STOP_AT_PARTIAL simpler, this
also unifies the processing of characters being split or not split
over two buffers. Such consistent handling is important for
PerlIO::encoding in particular, as the user is not (and should not
be) aware of the PerlIO buffer structure (see the disadvantage of
the "utf-8-strict" encoding in the description above).

================================================================
Tested configurations:

Most tests done with "strawberry-perl-5.32.1.1-32bit.msi".
Tests with other versions gave the same results from
"strawberry-perl-5.32.1.1-32bit-portable.zip" down to
"strawberry-perl-5.16.3.1-32bit-portable.zip".
Probably "strawberry-perl-5.14.4.1-32bit-portable.zip" shows the
same bug, but that is masked by the "panic" fixed in Perl v5.16.

On quick testing, all iso-8859-x encodings and a few encodings
generated via enc2xs from .ucm for testing Unicode characters with
other start bytes (i.e. in other "groups") showed the same behaviour.
No other encodings (except 'utf8' and 'utf-8-strict') were tested,
but maybe this bug occurs in all encodings generated via enc2xs.
It was not tested in the context of this bug report whether decode()
shows a similar bug for multi-byte encodings, as partial
characters may occur there also.

=================================================================
perl -V (for "strawberry-perl-5.32.1.1-32bit.msi")

Summary of my perl5 (revision 5 version 32 subversion 1) configuration:

Platform:
osname=MSWin32
osvers=10.0.19042.746
archname=MSWin32-x86-multi-thread-64int
uname='Win32 strawberry-perl 5.32.1.1 #1 Sun Jan 24 12:17:47 2021 i386'
config_args='undef'
hint=recommended
useposix=true
d_sigaction=undef
useithreads=define
usemultiplicity=define
use64bitint=define
use64bitall=undef
uselongdouble=undef
usemymalloc=n
default_inc_excludes_dot=define
bincompat5005=undef
Compiler:
cc='gcc'
ccflags =' -DWIN32 -D__USE_MINGW_ANSI_STDIO -DPERL_TEXTMODE_SCRIPTS
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv
-fno-strict-aliasing -mms-bitfields'
optimize='-s -O2'
cppflags='-DWIN32'
ccversion=''
gccversion='8.3.0'
gccosandvers=''
intsize=4
longsize=4
ptrsize=4
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=12
longdblkind=3
ivtype='long long'
ivsize=8
nvtype='double'
nvsize=8
Off_t='long long'
lseeksize=8
alignbytes=8
prototype=define
Linker and Libraries:
ld='g++'
ldflags ='-s -L"C:\Perl\perl\lib\CORE" -L"C:\Perl\c\lib"'
libpth=C:\Perl\c\lib C:\Perl\c\i686-w64-mingw32\lib
C:\Perl\c\lib\gcc\i686-w64-mingw32\8.3.0
libs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr
-lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
perllibs= -lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr
-lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
libc=
so=dll
useshrplib=true
libperl=libperl532.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs
dlext=xs.dll
d_dlsymun=undef
ccdlflags=' '
cccdlflags=' '
lddlflags='-mdll -s -L"C:\Perl\perl\lib\CORE" -L"C:\Perl\c\lib"'

Characteristics of this binary (from libperl):
Compile-time options:
HAS_TIMES
HAVE_INTERP_INTERN
MULTIPLICITY
PERLIO_LAYERS
PERL_COPY_ON_WRITE
PERL_DONT_CREATE_GVSV
PERL_IMPLICIT_CONTEXT
PERL_IMPLICIT_SYS
PERL_MALLOC_WRAP
PERL_OP_PARENT
PERL_PRESERVE_IVUV
USE_64_BIT_INT
USE_ITHREADS
USE_LARGE_FILES
USE_LOCALE
USE_LOCALE_COLLATE
USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC
USE_LOCALE_TIME
USE_PERLIO
USE_PERL_ATOF
Built under MSWin32
Compiled at Jan 24 2021 12:22:49
@inc:
C:/Perl/perl/site/lib
C:/Perl/perl/vendor/lib
C:/Perl/perl/lib

== test program ===== part.pl ===================================

# test for PARTIAL, encoding
use warnings;
use strict;
use Encode qw'encode WARN_ON_ERR STOP_AT_PARTIAL PERLQQ HTMLCREF XMLCREF';

my $chk = WARN_ON_ERR; # $chk != 0 !!!
$chk |= PERLQQ; # comment this out if not wanted
#$chk |= HTMLCREF; # comment this out if not wanted
#$chk |= XMLCREF; # comment this out if not wanted
#$chk = 0; # special case. comment this out if not wanted

my $byt = "ab\xE2\x99";
utf8::downgrade($byt); # force bytes
for (0..3) {
    my $x = $byt;
    Encode::_utf8_on($x); # partial UTF-8 character "\xE2\x99" at end
    my $enc = $_ % 2 ? 'iso-8859-15' : 'iso-8859-1'; # even / odd: Latin1 / 9
    my $check = $_ > 1 ? $chk | STOP_AT_PARTIAL : $chk; # $_ > 1: STOP_AT_PARTIAL
    print encode($enc,$x,$check); # line 19
    Encode::_utf8_off($x); # for byte output of $x below
    print sprintf("\n check=0x%04X, encoding=%s, bytes=",
        $check,$enc),map(sprintf('\x%02X',ord),split '',$byt),
        ', x="',map(sprintf('\x%02X',ord),split '',$x),'"',
        "\n",'-'x30,"\n";
}

== test program ===== ret.pl ====================================

# test for RETURN, encoding
use warnings;
use strict;
use Encode qw'encode WARN_ON_ERR RETURN_ON_ERR';

my $check = WARN_ON_ERR;
$check |= RETURN_ON_ERR; # comment this out if not wanted

my $byt = "ab\xE2\x99";
utf8::downgrade($byt); # force bytes
for ('iso-8859-1','iso-8859-15') {
    my $x = $byt;
    Encode::_utf8_on($x); # partial UTF-8 character "\xE2\x99" at end
    print encode($_,$x,$check); # line 14
    Encode::_utf8_off($x); # for byte output of $x below
    print sprintf("\n check=0x%04X, encoding=%s, bytes=",
        $check,$_),map(sprintf('\x%02X',ord),split '',$byt),
        ', x="',map(sprintf('\x%02X',ord),split '',$x),'"',
        "\n",'-'x30,"\n";
}

== test program ===== perlio.pl ==================================

# test for PerlIO::encoding
use warnings;
use strict;
use Encode qw'WARN_ON_ERR STOP_AT_PARTIAL';
use PerlIO::encoding;
#$PerlIO::encoding::fallback &= ~STOP_AT_PARTIAL; # uncomment to "mask off" for test purpose
#$PerlIO::encoding::fallback &= ~WARN_ON_ERR; # if not masked off, prints warning between output buffers

my $str = ' 'x1020 ."ab\N{U+2660}c\n"; # buffer split after 2nd byte of '\xE2\x99\xA0'
#$str = ' 'x1021 ."ab\N{U+2660}c\n"; # uncomment for buffer split after 1st byte

for ('iso-8859-1', 'iso-8859-15') {
    binmode STDOUT,':bytes:encoding('.$_.')';
    print $str; # line 14
    use bytes; # for 'length $str'
    print sprintf "--- encoding = %s, fallback = 0x%04X, Length (Bytes) = %u\n-------\n",
        $_,$PerlIO::encoding::fallback,length $str;
}

@wolfdiem
Author

This issue will be followed on rt.cpan.org for module Encode.

For the case 'as if STOP_AT_PARTIAL' with STOP_AT_PARTIAL = "0",
see https://rt.cpan.org/Public/Bug/Display.html?id=136983

For the case 'as if not STOP_AT_PARTIAL' with STOP_AT_PARTIAL = "1",
the issue is still not clarified. If a path forward is found, it will also be handled in module Encode.
