
Inconsistent handling of characters with value > 0x7FFF_FFFF and other issues #9260

Closed
p5pRT opened this issue Mar 20, 2008 · 12 comments

Comments


@p5pRT p5pRT commented Mar 20, 2008

Migrated from rt.perl.org#51936 (status was 'resolved')

Searchable as RT51936$


@p5pRT p5pRT commented Mar 20, 2008

From chris.hall@highwayman.com

Created by chris.hall@highwayman.com

Amongst the issues:

  * Character values > 0x7FFF_FFFF are not consistently handled.

  IMO: the handling is so broken that it would be much better
  to draw the line at 0x7FFF_FFFF.

  * chr and pack respond differently to large and out of range
  values.

  * pack can generate strings that unpack will not process.

  * warnings about 'illegal' non-characters are arguably spurious.
  Certainly there are many cases which are more illegal where
  no warnings are issued.

  Treating 0xFFFF_FFFF as a non-character is interesting.

  * IMO: chr(-1) is complete nonsense == undef, not "a character I
  cannot handle" == U+FFFD.

Perl strings containing characters >0x7FFF_FFFF use a non-standard
extension to UTF-8. Strictly speaking, UTF-8 stops at U+10FFFF.
However, sequences up to 0x7FFF_FFFF are well defined.
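For reference, the well-defined extension mentioned here uses one to six
bytes per code point; a minimal sketch of the length rules (the helper
name is my own, purely illustrative):

```perl
# Byte length of a code point in the classic 1-6 byte extended
# UTF-8 scheme (illustrative only; Perl's internal format differs
# again for values above 0x7FFF_FFFF).
sub ext_utf8_len {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;
    return 2 if $cp <= 0x7FF;
    return 3 if $cp <= 0xFFFF;
    return 4 if $cp <= 0x1F_FFFF;
    return 5 if $cp <= 0x3FF_FFFF;
    return 6 if $cp <= 0x7FFF_FFFF;
    return undef;    # beyond the well-defined extension
}
```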

Some parts of Perl are happier with these non-standard sequences than
others.

Consider:

  1: use strict ;
  2: use warnings ;
  3:
  4: warn "__Runtime__" ;
  5:
  6: my $q = chr(0x7FFF_FFFF).chr(0xE0).chr(0x8000_0000).chr(0xFFFF_FFFD) ;
  7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
  8: my $l = length($q) ;
  9: my $r = $q.$q ;
  10: $q =~ s/\x{E0}/ / ;
  11: $q =~ s/\x{7FFF_FFFF}/Hello/ ;
  12: $q =~ s/\x{8000_0000}/World/ ;
  13: $q =~ s/\x{FFFF_FFFD}/ !/ ;
  14: print "$v($l): '$q'\n" ;
  15:
  16: $r = substr($r, 3, 4) ;
  17: print "\$r=", hx(sc($r)), "\n" ;
  18: my @w = unpack('U*', $r) ;
  19: print "\@w=", hx(@w), "\n" ;
  20:
  21: $r = pack('U*', sc($r), 0x1_1234_5678) ;
  22: print "\$r=", hx(sc($r)), "\n" ;
  23: @w = unpack('U*', $r) ;
  24: print "\@w=", hx(@w), "\n" ;
  25:
  26: sub sc { map ord, split(//, $_[0]) ; } ;
  27: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

which generates:

  A: Unicode character 0x7fffffff is illegal at tb.pl line 11.
  B: Malformed UTF-8 character (byte 0xfe) at tb.pl line 12.
  C: Malformed UTF-8 character (byte 0xfe) at tb.pl line 13.
  D: Integer overflow in hexadecimal number at tb.pl line 21.
  E: Hexadecimal number > 0xffffffff non-portable at tb.pl line 21.
  --: __Runtime__ at tb.pl line 4.
  a: Unicode character 0x7fffffff is illegal at tb.pl line 6.
  b: Invalid(4): 'Hello World !'
  c: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
  d: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
  e: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
  : preceding start byte) in unpack at tb.pl line 18.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

  f: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 18.
  g: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
  : preceding start byte) in unpack at tb.pl line 18.

  ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

  h: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
  : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}
  i: Unicode character 0x7fffffff is illegal at tb.pl line 21.
  j: Unicode character 0xffffffff is illegal at tb.pl line 21.
  k: $r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{FFFFFFFF}
  l: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  m: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
  : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbd

  n: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  o: Malformed UTF-8 character (unexpected continuation byte 0x82, with no
  : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0x80, 0x80, 0x80, 0x80, 0x80

  p: Malformed UTF-8 character (byte 0xfe) in unpack at tb.pl line 23.
  q: Malformed UTF-8 character (unexpected continuation byte 0x83, with no
  : preceding start byte) in unpack at tb.pl line 23.

  ... repeated for 0xbf, 0xbf, 0xbf, 0xbf, 0xbf

  r: @w=\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{7FFFFFFF}\x{E0}
  : \x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}\x{0}

NOTES​:

  1. chr(n) is happy with characters > 0x7FFF_FFFF

  BUT: note the runtime warning about 0x7FFF_FFFF itself -- output line a.

  Unicode defines characters U+xxFFFF as non-characters, for all xx from
  0x00 to 0x10 -- the (current) Unicode range.

  These characters are NOT illegal. Unicode states:

  "Noncharacter code points are reserved for internal use, such as
  for sentinel values. They should never be interchanged. They do,
  however, have well-formed representations in Unicode encoding
  forms and survive conversions between encoding forms. This allows
  sentinel values to be preserved internally across Unicode
  encoding forms, even though they are not designed to be used in
  open interchange."

  Characters > 0x10_FFFF are not known to Unicode.

  IMO, chr(n) should not be issuing warnings about non-characters at all.

  IMO, to project non-characters beyond the Unicode range is doubly
  perverse.

  FURTHER: although characters > 0x10_FFFF are beyond Unicode, and
  characters > 0x7FFF_FFFF are beyond UTF-8, chr(n) is only warning
  about actual and invented non-characters (and surrogates).
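The noncharacter set described above can be tested directly. A sketch
(the helper name is mine) that follows the Unicode definition --
U+FDD0..U+FDEF plus U+xxFFFE/U+xxFFFF within the Unicode range -- and
deliberately does not project noncharacters beyond U+10FFFF:

```perl
# Is $cp a Unicode noncharacter? (Sketch; values beyond U+10FFFF
# are simply not Unicode, so they are not noncharacters either.)
sub is_noncharacter {
    my ($cp) = @_;
    return 0 if $cp > 0x10_FFFF;                  # not Unicode at all
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;   # contiguous noncharacter block
    return 1 if ($cp & 0xFFFF) >= 0xFFFE;         # U+xxFFFE / U+xxFFFF
    return 0;
}
```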

  2. Similarly "\x{8000_0000}" and "\x{7FFF_FFFF}" -- output line A.

  3. HOWEVER: utf8::valid() considers a string containing characters
  which are > 0x7FFF_FFFF to be *invalid* -- see code lines 7 & 14 and
  output line b.

  IMO, allowing characters > 0x7FFF_FFFF in the first place is a mistake.

  But having allowed them, why flag the string as invalid ?

  4. However: length() is happy, and issues no warning.

  Either length() is accepting the non-standard encoding, or some other
  mechanism means that it's not scanning the string.

  5. Lines 12 & 13 generate warnings about malformed UTF-8, at compile time.

  However, the run-time copes with these super-large characters.

  6. substr is happy with the super-large characters -- line 16.

  7. split is happy with the super-large characters -- line 26.

  8. ord is happy with the super-large characters -- line 26.

  9. unpack 'U' throws up all over super-large characters !

  See lines 18 & 23, and output d-h and l-r.

  unpack has no idea about the non-standard encoding of characters
  greater than 0x7FFF_FFFF, and unpacks each 'invalid' byte as
  0x00.

10. pack 'U' complains about character values in much the same way as
  chr does -- output i & j.

  However, pack and chr are by no means consistent with each other,
  see below.

11. pack 'U' is generating stuff that unpack 'U' cannot cope with !

  See lines 21-24 and output k-r
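One way to sidestep the pack/unpack disagreement, at least for in-range
values, is to round-trip code points with chr/ord instead (a workaround
sketch, not something from the report):

```perl
# Round-trip a list of code points through a string without
# pack/unpack 'U'.
no warnings 'utf8';    # silence the 'illegal'/non-character warnings
my @cps  = (0x41, 0x10FFFF);
my $str  = join '', map { chr } @cps;
my @back = map { ord } split //, $str;    # recovers (0x41, 0x10FFFF)
```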

___________________________________________________________

Looking further at chr and pack:

  1: use strict ;
  2: use warnings ;
  3:
  4: warn "__Runtime__" ;
  5:
  6: my $q = chr(0xD800).chr(0xFFFF).chr(0x7FFF_FFFF) ;
  7: my $v = utf8::valid($q) ? 'Valid' : 'Invalid' ;
  8: print "\$q = ", hx(sc($q)), " -- $v\n" ;
  9:
  10: my @t = (0x1_2345_6789, -1, -10, 0xD800) ;
  11: my $r = join '', map(chr, @t) ;
  12: print "\$r=", hx(sc($r)), "\n" ;
  13:
  14: my $s = pack('U*', @t) ;
  15: print "\$s=", hx(sc($s)), "\n" ;
  16:
  17: sub sc { map ord, split(//, $_[0]) ; } ;
  18: sub hx { map sprintf('\\x{%X}', $_), @_ ; } ;

On a 64-bit v5.8.8:

  A: UTF-16 surrogate 0xd800 at tb2.pl line 6.
  B: Unicode character 0xffff is illegal at tb2.pl line 6.
  C: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
  D: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
  -- __Runtime__ at tb2.pl line 4.
  a: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
  b: Unicode character 0xffffffffffffffff is illegal at tb2.pl line 11.
  c: UTF-16 surrogate 0xd800 at tb2.pl line 11.
  d: $r=\x{123456789}\x{FFFFFFFFFFFFFFFF}\x{FFFFFFFFFFFFFFF6}\x{D800}
  e: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  f: UTF-16 surrogate 0xd800 at tb2.pl line 14.
  g: $s=\x{23456789}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

  * chr(-1) generates a warning, not because it's complete rubbish,
  but because 0xffffffffffffffff is a non-character !!!

  chr(-3) doesn't merit a warning.

  * note that surrogates and non-characters are OK as far as utf8::valid
  is concerned -- no warnings, even.

  * pack is masking stuff to 32 bit unsigned !!

  * both chr and pack are throwing warnings about surrogates

On a 32-bit v5.10.0:

  A: Integer overflow in hexadecimal number at tb2.pl line 10.
  B: Hexadecimal number > 0xffffffff non-portable at tb2.pl line 10.
  -- __Runtime__ at tb2.pl line 4.
  a: UTF-16 surrogate 0xd800 at tb2.pl line 6.
  b: Unicode character 0xffff is illegal at tb2.pl line 6.
  c: Unicode character 0x7fffffff is illegal at tb2.pl line 6.
  d: $q = \x{D800}\x{FFFF}\x{7FFFFFFF} -- Valid
  e: Unicode character 0xffffffff is illegal at tb2.pl line 11.
  f: UTF-16 surrogate 0xd800 at tb2.pl line 11.
  g: $r=\x{FFFFFFFF}\x{FFFD}\x{FFFD}\x{D800}
  h: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  i: Unicode character 0xffffffff is illegal at tb2.pl line 14.
  j: UTF-16 surrogate 0xd800 at tb2.pl line 14.
  k: $s=\x{FFFFFFFF}\x{FFFFFFFF}\x{FFFFFFF6}\x{D800}

  * chr is mapping -ve values to U+FFFD -- without warning.

  This is as per documentation.

  However, character 0xFFFF_FFFF merits a warning, but does NOT
  get translated to U+FFFD !!

  IMO: this is a dog's dinner. I think:

  - non-characters and surrogates should not trouble chr
  (any more than they trouble utf8::valid)

  - values that are invalid should generate undef, not U+FFFD
  replacement characters:

  a) cannot distinguish chr(0xFFFD) and chr(-10)

  b) U+FFFD is a replacement for a character that we don't
  know -- it's not a replacement for something that
  just isn't a character in the first place !

  [-1 is a banana. U+FFFD is an orange, which we may
  substitute for another form of orange.]

  - limiting characters to 0x7FFF_FFFF is no great loss, and
  avoids a ton of portability and non-standard-ness issues.

  * pack 'U' is NOT mapping -ve values to U+FFFD !!
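The behaviour argued for here can be sketched as a wrapper (safe_chr is
a hypothetical name, not a Perl builtin): values that are not characters
at all yield undef rather than U+FFFD or a warning, drawing the line at
0x7FFF_FFFF as suggested above.

```perl
# chr that returns undef for out-of-range values instead of
# the U+FFFD replacement character.
sub safe_chr {
    my ($n) = @_;
    return undef if !defined $n || $n < 0 || $n > 0x7FFF_FFFF;
    no warnings 'utf8';    # no non-character/surrogate policing
    return chr $n;
}
```

This keeps chr(0xFFFD) and chr(-10) distinguishable, addressing point (a) above.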

Perl Info

Flags:
     category=core
     severity=medium

Site configuration information for perl 5.10.0:

Configured by SYSTEM at Thu Jan 10 11:00:30 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
   Platform:
     osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
     uname=''
     config_args='undef'
     hint=recommended, useposix=true, d_sigaction=undef
     useithreads=define, usemultiplicity=define
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=undef, use64bitall=undef, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -
DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
     optimize='-MD -Zi -DNDEBUG -O1',
     cppflags='-DWIN32'
     ccversion='12.00.8804', gccversion='', gccosandvers=''
     intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
     d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
     ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf  -libpath:"C:\Program Files\Perl\lib\CORE"  -machine:x86'
     libpth=\lib
     libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
     perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
     libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
     gnulibc_version=''
   Dynamic Linking:
     dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
     cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf  -libpath:"C:\Program Files\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
     ACTIVEPERL_LOCAL_PATCHES_ENTRY
     32809 Load 'loadable object' with non-default file extension
     32728 64-bit fix for Time::Local


@INC for perl 5.10.0:
     d:\gmch_root\gmch perl lib
     d:\gmch_root\gmch perl lib\windows
     C:/Program Files/Perl/site/lib
     C:/Program Files/Perl/lib
     .


Environment for perl 5.10.0:
     HOME (unset)
     LANG (unset)
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=C:\Program Files\Perl\site\bin;C:\Program Files\Perl\bin;C:\PROGRAM FILES\_BATCH;C:\PROGRAM FILES\_BIN;C:\PROGRAM FILES\ARM\BIN\WIN_32-
PENTIUM;C:\PROGRAM FILES\PERL\BIN\;C:\WINDOWS\SYSTEM32;C:\WINDOWS;C:\WINDOWS\SYSTEM32\WBEM;C:\PROGRAM FILES\ATI TECHNOLOGIES\ATI CONTROL
PANEL;C:\PROGRAM FILES\MICROSOFT SQL SERVER\80\TOOLS\BINN\;C:\PROGRAM FILES\ARM\UTILITIES\FLEXLM\10.8.0\12\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\R
VCT\PROGRAMS\3.0\441\EVAL2-SC\WIN_32-PENTIUM;C:\PROGRAM FILES\ARM\RVD\CORE\3.0\675\EVAL2-SC\WIN_32-PENTIUM\BIN;C:\PROGRAM FILES\SUPPORT
TOOLS\;C:\Program Files\QuickTime\QTSystem\
     PERLLIB=d:\gmch_root\gmch perl lib;d:\gmch_root\gmch perl lib\windows
     PERL_BADLANG (unset)
     SHELL (unset)
-- 
Chris Hall               highwayman.com


@p5pRT p5pRT commented Mar 20, 2008

From jgmyers@proofpoint.com

This is similar to bug #43294.

I agree that allowing characters above the Unicode maximum of U+10FFFF
is a mistake. It serves no useful purpose and just causes trouble for
those of us who are trying to process externally-provided UTF-8 data.
To safely process untrusted UTF-8 data, we poor implementors need to
learn all of the dark corners of Perl's nonstandard UTF-8 processing and
somehow deal with the fact that Perl doesn't even agree with itself as
to what is valid UTF-8. (see also bug 38722).

Allowing surrogates and the non-character U+FFFE in UTF-8 is a security
problem in much the same way that allowing non-shortest form sequences
(such as C0 80) is. For that reason, chr() should not be permitted to
create a surrogate or noncharacter--such a character cannot be
represented in a well-formed UTF-8 sequence.


@p5pRT p5pRT commented Mar 20, 2008

The RT System itself - Status changed from 'new' to 'open'


@p5pRT p5pRT commented Mar 21, 2008

From chris.hall@highwayman.com

On Thu Mar 20 15:25:57 2008, jgmyers@proofpoint.com wrote:

This is similar to bug #43294.

I agree that allowing characters above the Unicode maximum of U+10FFFF
is a mistake. It serves no useful purpose and just causes trouble for
those of us who are trying to process externally-provided UTF-8 data.
To safely process untrusted UTF-8 data, we poor implementors need to
learn all of the dark corners of Perl's nonstandard UTF-8 processing and
somehow deal with the fact that Perl doesn't even agree with itself as
to what is valid UTF-8. (see also bug 38722).

Oh dear. I was actually trying to argue for decoupling general
characters from Unicode and strict UTF-8.

I think Perl's general character handling has been mixed up
with the handling of Unicode exchange formats. Partly because the
internal form is utf8-like and stuff is called utf8 !

IMO Perl should, internally, handle characters with values
0x0..0x7FFF_FFFF -- interpreting the subset which is Unicode as
Unicode when required, and only when required.

I would dispense with all the broken and incomplete handling
of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I
imagine would simplify things.

Separately there is clearly the need to filter strict UTF-8 for
exchange. Encode's strict-UTF handling isn't complete, but I don't
think the requirements are either simple or consistent across
applications.

Seems to me that the current code falls between two stools and is not
fully satisfying either the needs of general character string handling
or the needs of strict interchange.

Allowing surrogates and the non-character U+FFFE in UTF-8 is a security
problem in much the same way that allowing non-shortest form sequences
(such as C0 80) is. For that reason, chr() should not be permitted to
create a surrogate or noncharacter--such a character cannot be
represented in a well-formed UTF-8 sequence.

I don't think it's chr's job to police anything. The current
inconsistencies etc IMO indicate that once you start trying to police
these things you hit conflicting requirements, e.g.:

  * non-characters are OK for internal use, but not for external
  interchange.

  * use of strings for things other than Unicode

  [I note that printf '%vX' is suggested for IPv6. This implies
  holding IPv6 addresses in 8 characters, each 0x0..0xFFFF.
  Which would be impossible if strings refused to allow
  values that aren't kosher UTF-8 !]

  * processing UTF-16 as strings of characters 0..0xFFFF.

and trying to do two things at once -- e.g. allowing chr() to generate
surrogates but throw warnings about them I doubt satisfies anyone !
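The IPv6 point can be made concrete with a v-string (the example address
2001:DB8::1 is my own, purely illustrative):

```perl
# An IPv6 address held as a string of eight "characters",
# each in 0x0..0xFFFF, printed with the '%vX' format.
my $ip   = v8193.3512.0.0.0.0.0.1;    # 0x2001, 0x0DB8, 0, ..., 0, 1
my $text = sprintf '%vX', $ip;        # "2001.DB8.0.0.0.0.0.1"
```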

Applications that need strict UTF-8 (and possibly subsets thereof) need
a layer of support for filtering and dealing with stuff that is
application-invalid.

But I don't think the needs of strict UTF-8 should get in the way of
simple, general string handling.

--
Chris Hall


@p5pRT p5pRT commented Mar 21, 2008

From jgmyers@proofpoint.com

On Thu Mar 20 17:56:04 2008, chris_hall wrote:

I think Perl's general character handling has been mixed up
with the handling of Unicode exchange formats. Partly because the
internal form is utf8-like and stuff is called utf8 !

IMO Perl should, internally, handle characters with values
0x0..0x7FFF_FFFF -- interpreting the subset which is Unicode as
Unicode when required, and only when required.

I disagree--Perl should adopt and conform to the Unicode standard.
Implementing something which is similar but nonconforming is just laying
traps for unwary developers.

Particularly heinous is the concept of calling something "utf8" that
violates the conformance requirements that Unicode places on UTF-8.
Some of the conformance requirements that Perl violates, those against
decoding surrogates or U+FFFE, are necessary for security.

(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as
being ill-formed, even though doing so is necessary for security. If
you have a security syntax check followed by encoding and decoding in
UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder
into byteswapping the data and having it interpreted differently than
what was checked. I have not been able to get the Unicode Consortium to
acknowledge this error.)

I would dispense with all the broken and incomplete handling
of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I
imagine would simplify things.

By allowing values that are not permitted by Unicode, you are laying a
trap for developers not wary of getting such illegal input.

Separately there is clearly the need to filter strict UTF-8 for
exchange. Encode's strict-UTF handling isn't complete, but I don't
think the requirements are either simple or consistent across
applications.

The requirements with respect to ill-formed sequences, including
surrogates and values above 10FFFF, are clearly specified in the Unicode
standard.

The requirements with respect to noncharacters are admittedly complex
and obscure. Because of the U+FFFE issue, my experience has been that
it is best to simply disallow them all.

Seems to me that the current code falls between two stools and is not
fully satisfying either the needs of general character string handling
or the needs of strict interchange.

I would agree with this, though I disagree about the need for
non-interchange strings. How are developers to keep straight which
interfaces are to allow or disallow non-interchange strings?

I don't think it's chr's job to police anything.

I disagree. It is chr's job to police chr('orange'). Similarly, it
should police chr(0x7FFF_FFFF).

The current
inconsistencies etc IMO indicate that once you start trying to police
these things you hit conflicting requirements, e.g.​:

* non-characters are OK for internal use, but not for external
interchange.

Which then begs the question of what is "internal use" versus "external
interchange" and what a module must do when given "internal" data and
must "interchange" it.

* use of strings for things other than Unicode

  [I note that printf '%vX' is suggested for IPv6. This implies
  holding IPv6 addresses in 8 characters, each 0x0..0xFFFF.
  Which would be impossible if strings refused to allow
  values that aren't kosher UTF-8 !]

To use strings for things other than Unicode, one should use byte
sequences. To use Unicode characters for things that are not Unicode
characters is a mistake.

* processing UTF-16 as strings of characters 0..0xFFFF.

UTF-16 is a character encoding scheme and should be processed as a byte
sequence just like every other character encoding scheme is. Surrogates
are not characters as defined in chapter 2 of Unicode 5.0.


@p5pRT p5pRT commented Mar 23, 2008

From chris.hall@highwayman.com

On Fri, 21 Mar 2008 you wrote

On Thu Mar 20 17:56:04 2008, chris_hall wrote:
...
(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as
being ill-formed, even though doing so is necessary for security. If
you have a security syntax check followed by encoding and decoding in
UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder
into byteswapping the data and having it interpreted differently than
what was checked. I have not been able to get the Unicode Consortium to
acknowledge this error.)

The standard already says that non-characters should not be exchanged
externally -- so a careful UTF-8 decoder would intercept U+FFFE (what it
would do with it might vary from application to application).

I don't quite understand why you'd want to apply a UTF-16 decoder after
a UTF-8 one. Or why a UTF-16 decoder would worry about byte-swapping
after the BOM. What am I missing, here ?

I would dispense with all the broken and incomplete handling
of "illegal" Unicode values, and the OTT values > 0x7FFF_FFFF, which I
imagine would simplify things.

By allowing values that are not permitted by Unicode, you are laying a
trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character
handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I
think need to be treated separately, with facilities for different
applications to specify (a) what is invalid for them and (b) how to deal
with invalid input.

Separately there is clearly the need to filter strict UTF-8 for
exchange. Encode's strict-UTF handling isn't complete, but I don't
think the requirements are either simple or consistent across
applications.

The requirements with respect to ill-formed sequences, including
surrogates and values above 10FFFF, are clearly specified in the Unicode
standard.

It says they are ill-formed. It doesn't mandate what your application
might do with them when they appear.

A quick and dirty application might just throw rubbish away, and might
get away with it.

Another application might convert rubbish to U+FFFD and later whinge
about unspecified errors in the input.

Yet another application might wish to give more specific diagnostics,
either at the time the data is first received, or at some later time.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though
UTF-8 declares them ill-formed (while noting that ISO-10646 accepts
them). Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be
treated as ill-formed, or each individual byte as being part of an
ill-formed sequence ?

Similarly, the redundant longer forms, which UTF-8 says are ill-formed,
different applications may wish to handle differently.

The requirements with respect to noncharacters are admittedly complex
and obscure. Because of the U+FFFE issue, my experience has been that
it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to
some applications.

Then there's what to do with (a) unassigned characters, (b) private use
characters when exchanging data between unconnected parties, (c)
characters not known to the recipient, (d) control characters, etc. etc.

...

The current
inconsistencies etc IMO indicate that once you start trying to police
these things you hit conflicting requirements, e.g.​:

* non-characters are OK for internal use, but not for external
interchange.

Which then begs the question of what is "internal use" versus "external
interchange" and what a module must do when given "internal" data and
must "interchange" it.

Indeed, and that may vary from application to application.

So, not only is it (a) more general and (b) conceptually simpler to
treat strings as sequences of abstract entities, but we can see that as
soon as we try to do more, we run into (i) definition issues and (ii)
application-dependent issues.

* use of strings for things other than Unicode

  [I note that printf '%vX' is suggested for IPv6. This implies
  holding IPv6 addresses in 8 characters, each 0x0..0xFFFF.
  Which would be impossible if strings refused to allow
  values that aren't kosher UTF-8 !]

To use strings for things other than Unicode, one should use byte
sequences. To use Unicode characters for things that are not Unicode
characters is a mistake.

I don't see why handling an IPv6 address as a short sequence of 16 bit
"characters" (that is, things that go in strings) is any less reasonable
than handling IPv4 addresses as short sequences of 8 bit "characters".

In the old 8-bit character world what one did with characters was not
limited by any given character set interpretation. The new world of 31
(or more) bit characters should not be limited either.

* processing UTF-16 as strings of characters 0..0xFFFF.

UTF-16 is a character encoding scheme and should be processed as a byte
sequence just like every other character encoding scheme is.

Not really. UTF-16 is defined in terms of 16 bit values. I can use
strings for byte sequences. Why not word (16 bit) sequences ?
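That idea is easy to sketch: unpack 'n*' turns UTF-16BE code units into
a list of 16-bit values, which can then live in a string as
"characters" (the byte string below is my own example):

```perl
# UTF-16BE code units as 16-bit "characters" in a Perl string.
my $bytes = "\x00\x41\xD8\x01";     # 'A' followed by a lone high surrogate
my @units = unpack 'n*', $bytes;    # big-endian 16-bit words
no warnings 'utf8';                 # a surrogate "character" is fine internally
my $str16 = join '', map { chr } @units;
```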

Surrogates
are not characters as defined in chapter 2 of Unicode 5.0.

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components
of Perl strings) are, or should be, defined to be Unicode characters. I
think they should be abstract values -- with a 1-to-1 mapping to/from
31- or 32- bit unsigned integers.

[OK, the size here is arbitrary... 31-bits fits with the well
understood encoding, 32-bits would be trivially portable, 64-bits seems
OTT, but you could argue for that too.]

[Even if Perl characters were exactly Unicode characters, there would
still be the application specific issues of what to do with
non-characters, private use characters, undefined characters, etc. etc.]

On top of a generic string data structure there should clearly be
extensive support for Unicode. On top of that comes the need for
controlled exchange of the various Unicode encoding formats -- for which
different applications may have different requirements.

Chris
--
Chris Hall highwayman.com


@p5pRT p5pRT commented Mar 25, 2008

From jgmyers@proofpoint.com

Chris Hall via RT wrote:

It says they are ill-formed. It doesn't mandate what your application
might do with them when they appear.

Unicode 5.0 conformance requirement C10 does mandate a restriction on
what an application might do with ill-formed sequences. It states
"When a process interprets a code unit sequence which purports to be in
a Unicode character encoding form, it shall treat ill-formed code unit
sequences as an error condition and shall not interpret such sequences
as characters."

A quick and dirty application might just throw rubbish away, and might
get away with it.

Another application might convert rubbish to U+FFFD and later whinge
about unspecified errors in the input.

Unicode permits such behavior.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though
UTF-8 declares them ill-formed (while noting that ISO-10646 accepts
them).

ISO-10646 Amendment 2 no longer accepts characters above U+10FFFF.

Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be
treated as ill-formed, or each individual byte as being part of an
ill-formed sequence ?

Unicode permits either behavior.

Similarly, the redundant longer forms, which UTF-8 says are ill-formed,
different applications may wish to handle differently.

Again, Unicode conformance requirement C10 prohibits applications from
interpreting such sequence of characters. To interpret such sequences
as characters leaves applications vulnerable to serious security holes.
See Unicode 5.0 section 5.19, Unicode Security, which addresses this
very issue.

(Actually, Unicode does not define the UTF-8 byte sequence for U+FFFE as
being ill-formed, even though doing so is necessary for security. If
you have a security syntax check followed by encoding and decoding in
UTF-16, then an attacker could use U+FFFE to trick the UTF-16 decoder
into byteswapping the data and having it interpreted differently than
what was checked. I have not been able to get the Unicode Consortium to
acknowledge this error.)

I don't quite understand why you'd want to apply a UTF-16 decoder after
a UTF-8 one. Or why a UTF-16 decoder would worry about byte-swapping
after the BOM. What am I missing, here ?

Software is sometimes constructed by connecting together modules that
were previously made elsewhere. So an application might, after passing
data through the security syntax check (or "security watchdog module" in
Unicode section 5.19), process it through a separate module that writes
the data out in UTF-16BE. That UTF-16BE data might in turn be processed
by a third module that interprets its input as UTF-16.

The potential attack does require a UTF-16 decoder that is sloppier than
the one in Encode--either willing to switch endianness mid-stream or
willing to treat an initial BOM as optional.

By allowing values that are not permitted by Unicode, you are laying a
trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character
handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I
think need to be treated separately, with facilities for different
applications to specify (a) what is invalid for them and (b) how to deal
with invalid input.

Ill-formed sequences are invalid for everybody. By pushing the
responsibility for handling such non-obvious character handling issues
from the Perl core to individual applications, you would be
significantly increasing the number of applications that fail to handle
such issues as needed. This is laying traps.

Even you seem to have been unaware of the seriously adverse security
impact of handling the redundant longer forms as characters. How do you
expect a run-of-the-mill Perl script writer to even know that they might
have to run extra Unicode-specific validity checks? The current
distinction that Perl makes between "utf8" and "utf-8" is quite obscure.

The requirements with respect to noncharacters are admittedly complex
and obscure. Because of the U+FFFE issue, my experience has been that
it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to
some applications.

Please provide an example of a reasonable application to which a
non-character is essential. There is no shortage of private use
characters--I find it hard to believe that the loss of 66 potential
characters is quite so catastrophic.
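The 66 noncharacters under discussion are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, so testing for them is cheap. A sketch of the predicate, in Python for illustration:

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacter code points:
    U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in every plane."""
    return (0xFDD0 <= cp <= 0xFDEF
            or ((cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF))
```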

Then there's what to do with (a) unassigned characters, (b) private use
characters when exchanging data between unconnected parties, (c)
characters not known to the recipient, (d) control characters, etc. etc.

There are no such problems with any of these categories.

So, not only is it (a) more general and (b) conceptually simpler to
treat strings as sequences of abstract entities, but we can see that as
soon as we try to do more, we run into (i) definition issues and (ii)
application-dependent issues.

No, it's the converse. When you fail to provide a consistent definition
across the language, you run into issues with mismatched and
inconsistent definitions within and across applications.

I don't see why handling an IPv6 address as a short sequence of 16 bit
"characters" (that is, things that go in strings) is any less reasonable
than handling IPv4 addresses as short sequences of 8 bit "characters".

In neither case are they characters.

In the old 8-bit character world what one did with characters was not
limited by any given character set interpretation. The new world of 31
(or more) bit characters should not be limited either.

The old 8-bit character world is hardly a model of reasonableness. One
didn't necessarily know what the character encoding scheme was, so one
was quite likely to give the data the wrong interpretation. Some
schemes, such as ISO 2022, were an absolute nightmare.

The ability to store arbitrary 16 bit quantities in UTF-8 strings is
hardly an overriding concern. The overriding concern is to handle text
correctly.

UTF-16 is a character encoding scheme and should be processed as a byte
sequence just like every other character encoding scheme is.

Not really. UTF-16 is defined in terms of 16 bit values. I can use
strings for byte sequences. Why not word (16 bit) sequences ?

Perl strings are not word (16 bit) sequences.

UTF-16 was a bad idea to begin with. Let it die a natural death, just
like UTF-7.

Surrogates are not characters as defined in chapter 2 of Unicode 5.0.

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components
of Perl strings) are, or should be, defined to be Unicode characters. I
think they should be abstract values -- with a 1-to-1 mapping to/from
31- or 32- bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice​: On one hand, it
could adopt and conform to Unicode, taking advantage of all the work and
expertise put into the foremost international standard for character
encoding. On the other hand, Perl could decide that it somehow knows
more about character encoding than the Unicode Consortium (and the
subject experts that contributed to their standard) and go off and
invent something new and inconsistent with the constraints the Unicode
Consortium found it necessary to impose.


@p5pRT p5pRT commented Mar 26, 2008

From perl@nevcal.com

On approximately 3/25/2008 3​:27 PM, came the following characters from
the keyboard of John Gardiner Myers​:

Chris Hall via RT wrote​:

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are
components of Perl strings) are, or should be, defined to be Unicode
characters. I think they should be abstract values -- with a 1-to-1
mapping to/from 31- or 32- bit unsigned integers.
This is indeed it in a nut-shell. Perl has a choice​: On one hand, it
could adopt and conform to Unicode, taking advantage of all the work
and expertise put into the foremost international standard for
character encoding. On the other hand, Perl could decide that it
somehow knows more about character encoding than the Unicode
Consortium (and the subject experts that contributed to their
standard) and go off and invent something new and inconsistent with
the constraints the Unicode Consortium found it necessary to impose.

Perl seems to have already made that choice... and chose TMTOWTDI.

The language implements an extension of UTF-8 encoding rules for 70**
bit integers (which is very space inefficient above 31 bit integers,
even more so than UTF-8 itself) which it calls utf8.

The language has certain string operations* that implement certain
Unicode semantics for strings stored in utf8 format.

Module Encode implements (as best as Dan and kibitzers can) UTF-8
encoding and decoding and validity checking.

So people that want to use utf8 strings as containers for 16-bit
integers are welcome to, as Chris suggests. And people that want to
conform to strict Unicode interpretations have the tools to do so. And
people that choose to use utf8 strings only for Unicode codepoints can
restrict themselves to doing so.

It appears that reported bugs get fixed, as time permits. It appears
that the goal is to conform to Unicode semantics in Module Encode, and
certain string operations* within the language.

* This list is fairly well known, including "\l\L\u\U" uc ucfirst lc
lcfirst and certain regexp operations, all of which have different
semantics when applied to utf8 strings vs non-utf8 strings. This is
considered a bug by some, and a deficiency by most of the rest.

** maybe it is 72? Larger than 64, apparently, and such values higher
than the platform's native integer size (usually 32 or 64) are hard to
access... chr and ord can't deal with them.

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


@p5pRT p5pRT commented Mar 27, 2008

From chris.hall@highwayman.com

On Tue, 25 Mar 2008 you wrote

On approximately 3/25/2008 3​:27 PM, came the following characters from
the keyboard of John Gardiner Myers​:

Chris Hall via RT wrote​:

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are
components of Perl strings) are, or should be, defined to be Unicode
characters. I think they should be abstract values -- with a 1-to-1
mapping to/from 31- or 32- bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice​: On one hand, it
could adopt and conform to Unicode, taking advantage of all the work
and expertise put into the foremost international standard for
character encoding. On the other hand, Perl could decide that it
somehow knows more about character encoding than the Unicode
Consortium (and the subject experts that contributed to their
standard) and go off and invent something new and inconsistent with
the constraints the Unicode Consortium found it necessary to impose.

Perl seems to have already made that choice... and chose TMTOWTDI.

The language implements an extension of UTF-8 encoding rules for 70**
bit integers (which is very space inefficient above 31 bit integers,
even more so that UTF-8 itself) which it calls utf8.

As reported​: the 7 and 13 byte extended sequences are not properly
handled everywhere. The documentation is coy about characters greater
than 31 bits.

IMO things are so broken (not even utf8​::valid() likes the 7- and
13-byte sequences !) that it's not too late to row back from this....

  - stopping at 31 bit integers is at least consistent with well-known
  4, 5 and 6 byte "UTF-8" sequences.

  - 32 bit integers could be supported in 6 byte sequences (by treating
  0xFC..0xFF prefixes as containing the MS 2 bits) and would be
  portable (and remains reasonably practical, space-wise).

....the extent of brokenness recalls the guiding mantra​: KISS.
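The 1- to 6-byte layout the bullets above refer to can be written down as a plain encoder. The sketch below (Python, for illustration; it follows the well-known extended-UTF-8 byte layout, not any particular Perl internal) covers code points up to 0x7FFF_FFFF and always picks the shortest form:

```python
def encode_ext_utf8(cp: int) -> bytes:
    """Encode a 31-bit value using the classic 1..6-byte
    extended UTF-8 layout (shortest form only)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("out of 31-bit range")
    if cp < 0x80:
        return bytes([cp])
    # Smallest number of bytes that can hold the value (no overlongs).
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    n = next(i + 1 for i, lim in enumerate(limits) if cp < lim)
    # Lead byte: n high bits set, then the top bits of the value.
    lead = (0x100 - (1 << (8 - n))) | (cp >> (6 * (n - 1)))
    # Continuation bytes: 10xxxxxx, six payload bits each.
    tail = [0x80 | ((cp >> (6 * k)) & 0x3F) for k in range(n - 2, -1, -1)]
    return bytes([lead] + tail)
```

For values inside the Unicode range this agrees with standard UTF-8; 0x7FFF_FFFF comes out as the six-byte sequence FD BF BF BF BF BF.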

The language has certain string operations* that implement certain
Unicode semantics for strings stored in utf8 format.

Module Encode implements (as best as Dan and kibitzers can) UTF-8
encoding and decoding and validity checking.

So people that want to use utf8 strings as containers for 16-bit
integers are welcome to, as Chris suggests. And people that want to
conform to strict Unicode interpretations have the tools to do so. And
people that choose to use utf8 strings only for Unicode codepoints can
restrict themselves to doing so.

The separation between the content of strings and Unicode is unclear.

The name utf8 doesn't help !

A good example of this is chr(n) which​:

  - issues warnings if 'n' is a Unicode surrogate or non-character.

  These warnings are a nuisance for people using strings as containers
  for n-bit integers.

  Those wanting help with strict Unicode aren't materially helped by
  this behaviour.

  - accepts characters beyond the Unicode range without warning.

  So isn't consistent in its "Unicode support".

  - generates a chr(0xFFFD) in response to chr(-1).

  Which makes no sense where strings are used as containers for n-bit
  integers !

It appears that reported bugs get fixed, as time permits. It appears
that the goal is to conform to Unicode semantics in Module Encode, and
certain string operations* within the language.

There are plenty of bugs to go round :-}

I hope that has not been lost in the discussion.

I'm not sure that the Encode Module is the right place for all support
for Unicode.

It seems to me that Encode is to do with mapping between Perl characters
(interpreted as Unicode code-points) and a variety of Character Encoding
Schemes, including UTF-8. On input this concerns itself with ill-formed
stuff. On output it concerns itself with things that cannot be output.
In both cases there is mapping between different Coded Character Sets,
and coping with impossible mappings.

With Unicode there are additional, specific options required either to
allow or do something else with​:

  * the non-characters -- all of them.

  * the Replacement Character -- perhaps should not send these, or
  perhaps do not wish to receive these.

  * private use characters -- which may or may not be suitable for
  exchange.

  * perhaps more general support for sub-sets of Unicode.

  * dealing with canonical equivalences.

Now I suppose that a lot can be done by regular expressions and other
such processing. This looks like hard work. And might not be terribly
efficient ? (Certainly John Gardiner Myers wants utf8​::valid to do very
strict UTF-8 checking as an efficient test for whether other processing
is required.)
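The sort of efficient strict check just mentioned can lean on a decoder that already enforces well-formedness. As an illustration, Python's UTF-8 codec rejects overlong forms, surrogates, and values above U+10FFFF (though it does not reject noncharacters), so a strict well-formedness test reduces to:

```python
def is_strict_utf8(data: bytes) -> bool:
    """Well-formedness check: shortest form only, no surrogates,
    nothing above U+10FFFF. Noncharacters are NOT rejected."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```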

* This list is fairly well known, including "\l\L\u\U" uc ucfirst lc
lcfirst and certain regexp operations, all of which have different
semantics when applied to utf8 strings vs non-utf8 strings. This is
considered a bug by some, and a deficiency by most of the rest.

** maybe it is 72? Larger than 64, apparently, and such values higher
than the platform's native integer size (usually 32 or 64) are hard to
access... chr and ord can't deal with them.

It's 72​: 13 byte sequence, starting 0xFF followed by 12 bytes carrying 6
bits of the value each.

FWIW, on 64-bit integer machine​:

  $v = 0xFFFF_FFFF_FFFF_FFFD ;
  if ($v != ord(chr($v))) { die ; } ;

works just fine. Though Perl whimpers​:

  Hexadecimal number > 0xffffffff non-portable at .... (compile time)

While​:

  $v = 0xFFFF_FFFF_FFFF_FFFF ;
  if ($v != ord(chr($v))) { die ; } ;

whimpers​:

  Hexadecimal number > 0xffffffff non-portable at .... (compile time)
  Unicode character 0xffffffffffffffff is illegal at .... (run time)

where the second whinge is baroque.

Chris
--
Chris Hall highwayman.com +44 7970 277 383


@p5pRT p5pRT commented Mar 27, 2008

From chris.hall@highwayman.com

On Tue, 25 Mar 2008 John Gardiner Myers wrote

Chris Hall via RT wrote​:

It says they are ill-formed. It doesn't mandate what your application
might do with them when they appear.

Unicode 5.0 conformance requirement C10 does mandate a restriction on
what an application might do with ill-formed sequences. It states
"When a process interprets a code unit sequence which purports to be in
a Unicode character encoding form, it shall treat ill-formed code unit
sequences as an error condition and shall not interpret such sequences
as characters."

Sure. But the real point is that this doesn't specify how the error
condition must be dealt with.

And, as previously discussed, the issue of ill-formed UTF-8 is only part
of the problem.

A quick and dirty application might just throw rubbish away, and might
get away with it.

Another application might convert rubbish to U+FFFD and later whinge
about unspecified errors in the input.

Unicode permits such behavior.

Sure. But the point is that there isn't a single correct approach, it
depends on the application.

Sequences between 0x10_FFFF and 0x7FFF_FFFF are well understood, though
UTF-8 declares them ill-formed (while noting that ISO-10646 accepts
them).

ISO-10646 Amendment 2 no longer accepts characters above U+10FFFF.

OK. I was going by what Unicode 5.0 says​:

  "The definition of UTF-8 in Annex D of ISO/IEC 10646​:2003 also allows
  for the use of five and six-byte sequences to encode characters that
  are outside the range of the Unicode character set; those five- and
  six-byte sequences are illegal for the use of UTF-8 as an encoding
  form of Unicode characters."

Should an entire 4, 5 or 6 byte sequence beyond 0x10_FFFF be
treated as ill-formed, or each individual byte as being part of an
ill-formed sequence ?

Unicode permits either behavior.

And any given application may wish to do one or the other.

Similarly, the redundant longer forms, which UTF-8 says are ill-formed,
different applications may wish to handle differently.

Again, Unicode conformance requirement C10 prohibits applications from
interpreting such sequence of characters. To interpret such sequences
as characters leaves applications vulnerable to serious security holes.
See Unicode 5.0 section 5.19, Unicode Security, which addresses this
very issue.

It takes a narrow view of this. Obviously it is good to encourage
strict encoding ! If one wanted to be "generous in what one accepts"
one might accept and recode the redundant but longer forms -- which
deals with the security issue. But this is not a big requirement.
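To illustrate the point in Python: a strict decoder rejects the redundant two-byte form of '/', while a "generous" decoder could recover the intended code point and re-emit it in shortest form, so that a later '/'-based security check still fires.

```python
overlong = b"\xc0\xaf"        # redundant two-byte encoding of U+002F '/'

# A strict decoder treats it as ill-formed:
try:
    overlong.decode("utf-8")
    raise AssertionError("strict decoder accepted an overlong form")
except UnicodeDecodeError:
    pass

# A "generous" decoder recovers the intended code point and recodes it
# in shortest form, removing the ambiguity the attack relies on:
cp = ((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F)
assert chr(cp) == "/"
```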

....

By allowing values that are not permitted by Unicode, you are laying a
trap for developers not wary of getting such illegal input.

No, I'm suggesting removing all the clutter from simple character
handling, which gets in the way of some applications.

Applications that don't trust their input have their own issues, which I
think need to be treated separately, with facilities for different
applications to specify (a) what is invalid for them and (b) how to deal
with invalid input.

Ill-formed sequences are invalid for everybody. By pushing the
responsibility for handling such non-obvious character handling issues
from the Perl core to individual applications, you would be
significantly increasing the number of applications that fail to handle
such issues as needed. This is laying traps.

The existing handling is in a mess. I suggest that this is partly
because the problem is not straightforward, and there is no single,
universal solution. The problem is not simply ill-formed sequences.

IMO the solution is (a) to simplify the base string and character data
structures -- so that they are not confused by the conflicting
requirements, and (b) to beef up support for strict Unicode character
and UTF handling, with sufficient flexibility to allow for different
applications to do different things, and with a sensible set of defaults
for straightforward use.

Even you seem to have been unaware of the seriously adverse security
impact of handling the redundant longer forms as characters.

As above. I grant that handling the redundant longer forms is not a big
requirement, but if handled correctly the security issue is dealt with.

How do you
expect a run-of-the-mill Perl script writer to even know that they might
have to run extra Unicode-specific validity checks? The current
distinction that Perl makes between "utf8" and "utf-8" is quite obscure.

Yes, it doesn't help clarify things.

The requirements with respect to noncharacters are admittedly complex
and obscure. Because of the U+FFFE issue, my experience has been that
it is best to simply disallow them all.

Except that non-characters are entirely legal, and may be essential to
some applications.

Please provide an example of a reasonable application to which a
non-character is essential. There is no shortage of private use
characters--I find it hard to believe that the loss of 66 potential
characters is quite so catastrophic.

Except that it would no longer be Unicode conformant.

If you want to argue that non-characters are a Bad Thing, that's a
separate topic.

Using private use characters instead simply moves the problem. If I use
non-characters as delimiters in my application, I should remove them
before sending the text to somebody who does not expect them. If I were
to use some private-use characters for the same thing, I should still
remove them, shouldn't I ?

Then there's what to do with (a) unassigned characters, (b) private use
characters when exchanging data between unconnected parties, (c)
characters not known to the recipient, (d) control characters, etc. etc.

There are no such problems with any of these categories.

Well... if you're troubled by the exchange of 66 non-character values
I'm surprised you're not troubled by the huge number of private use
characters ! If my system were to place some internal significance on
some private use characters, it might be a security issue if these were
not filtered out on exchange with third parties -- much like the
non-characters.

An application that was *really* worried about what it was being sent
might wish to filter any or all of these things. It might wish to
filter down to some supported sub-set. Not to mention the reduction to
canonical form(s).

So, not only is it (a) more general and (b) conceptually simpler to
treat strings as sequences of abstract entities, but we can see that as
soon as we try to do more, we run into (i) definition issues and (ii)
application-dependent issues.

No, it's the converse. When you fail to provide a consistent definition
across the language, you run into issues with mismatched and
inconsistent definitions within and across applications.

I agree that without a consistent definition you get a mess.

I don't see why handling an IPv6 address as a short sequence of 16 bit
"characters" (that is, things that go in strings) is any less reasonable
than handling IPv4 addresses as short sequences of 8 bit "characters".

In neither case are they characters.

Looking at the "Character Encoding Model", where I said "characters" a
little loosely, the jargon suggests 'code units'. But in any case, if a
thing that is an element of a string is not a "character" what would you
recommend I call it ?

In the old 8-bit character world what one did with characters was not
limited by any given character set interpretation. The new world of 31
(or more) bit characters should not be limited either.

The old 8-bit character world is hardly a model of reasonableness. One
didn't necessarily know what the character encoding scheme was, so one
was quite likely to give the data the wrong interpretation. Some
schemes, such as ISO 2022, were an absolute nightmare.

Granted that character encodings in the 8-bit world were tricky.

But chr() didn't get upset about, for example, DEL (0x7F) or DLE (0x10)
despite the obvious issues of interpretation. Core Perl does not
attempt to intervene here. I realise that this may appear trivial, but
it illustrates the difference between treating strings as sequences of
generic 'code units' and treating them as characters according to some
specific 'coded character set'.

.....

Well, this is it in a nut-shell.

I don't think that Perl characters (that is, things that are components
of Perl strings) are, or should be, defined to be Unicode characters. I
think they should be abstract values -- with a 1-to-1 mapping to/from
31- or 32- bit unsigned integers.

This is indeed it in a nut-shell. Perl has a choice​: On one hand, it
could adopt and conform to Unicode, taking advantage of all the work and
expertise put into the foremost international standard for character
encoding.

Leaving to one side any questions about ill-formed sequences. What
should be done with​:

  * non-characters -- allow, filter out, replace, ... ?

  * private-use characters -- allow, filter out, replace, ... ?

  * unassigned characters -- allow, filter out, replace, ... ?

  * canonical equivalences -- allow, filter out, replace, ... ?

  The standard acknowledges a security issue here, but punts it​:

  "However, another level of alternate representation has raised
  other security questions​: the canonical equivalences between
  precomposed characters and combining character sequences that
  represent the same abstract characters. .... The conformance
  requirement, however, is that conforming implementations cannot
  be required to make an interpretation distinction between
  canonically equivalent representations. The way for a security-
  conscious application to guarantee this is to carefully observe
  the normalization specifications (see Unicode Standard Annex
  #15, “Unicode Normalization Forms”) so that data is handled
  consistently in a normalized form."

  * requirements to handle only sub-sets of characters.

  * other things, perhaps ?
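The canonical-equivalence point above is easy to demonstrate, for instance with Python's unicodedata module: the precomposed and combining-sequence spellings of the same abstract character compare unequal until both sides are normalized.

```python
import unicodedata

precomposed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent, yet distinct as code point sequences:
assert precomposed != decomposed

# A security-conscious application compares in a normalization form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```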

Even surrogates are potentially tricky...

... in UTF-8 surrogate values are explicitly ill-formed.

... in UTF-16 they should travel in pairs, but I guess decoders need to
  do something with poorly formed or possibly incomplete input.

... but it appears that some code will combine surrogate code points
  even after decoding the UTF -- I suspect that this is a hangover
  from older systems where a 16-bit internal character form looked
  like a reasonable compromise.

... so banning these values from Perl strings is problematic.
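Combining surrogate code points after decoding, as described above, amounts to applying the UTF-16 pairing arithmetic to values already sitting in a string. As a sketch (Python for illustration):

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """UTF-16 pairing: map a high/low surrogate pair to a code point
    in the range U+10000..U+10FFFF."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a well-ordered surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
```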

With ill-formed sequences the question is how to deal with the error
condition(s).

The point here is that the requirements are not simple and not
universal.

There is, absolutely, a crying need for clear and effective support for
handling Unicode and the UTFs -- especially UTF-8, given its increasing
dominance.

On the other hand, Perl could decide that it somehow knows
more about character encoding than the Unicode Consortium (and the
subject experts that contributed to their standard) and go off and
invent something new and inconsistent with the constraints the Unicode
Consortium found it necessary to impose.

This is a false alternative.

Supporting generic "character" and string primitives does not preclude
layering strong and flexible UTF and Unicode handling on top, allowing
different applications to take more or less control over the various
options/ambiguities.

At present Perl is achieving neither.

Chris

PS​: big-endian integers are sinful.
--
Chris Hall highwayman.com +44 7970 277 383


@p5pRT p5pRT commented Jan 10, 2011

From @khwilliamson

After much further discussion and gnashing of teeth, this has been
resolved. The output of the original program in this ticket on current
blead is​:

Hexadecimal number > 0xffffffff non-portable at 51936.pl line 21.
__Runtime__ at 51936.pl line 4.
Valid(4)​: 'Hello World !'
$r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
@​w=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}
$r=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{112345678}
@​w=\x{FFFFFFFD}\x{7FFFFFFF}\x{E0}\x{80000000}\x{112345678}

The decision was made to allow any unsigned value to be stored in
strings internally in Perl. The non-character code points are all recognized,
and allowed. When printing any of the surrogates, non-char code points,
or above-legal-Unicode code points, a warning is raised. All are
prohibited under strict UTF-8 input. There has been discussion and some
work on making the :utf8 layer more strict. I believe this will happen.

When doing an operation that requires Unicode semantics on an
above-Unicode code point, a warning is raised. An example is changing
the case, and this is a no-op.

Unicode doesn't actually forbid the use of isolated surrogates in
strings inside languages, even though a non-lawyer, such as myself,
reading the standard would think that it did. There is some text that
allows it. I posted to p5p a portion of an email from the president of
Unicode that reiterated this (sent to someone on another project). The
clincher is that ICU, the semi-official Unicode implementation does
allow isolated surrogates in strings. And, Unicode as of version 5.1
does give property values for every property for the surrogates. At
this time, we are warning on surrogates if a case change (including /i
regular expression matching) is done on them. I'm not sure that this is
correct, as Unicode does furnish casing semantics for them, but it is
easier to remove a warning later than to add one.

The portion of the original ticket involving chr(-1) has not been
resolved. I submitted a bug report for just that, but have not gotten a
reply back as to the number assigned to it.

In any event, I believe much of the inglorious handling of this whole
situation is now fixed.

--Karl Williamson


@p5pRT p5pRT commented Jan 10, 2011

@khwilliamson - Status changed from 'open' to 'resolved'
