utf8 fatal warning #9659

Closed
p5pRT opened this issue Feb 24, 2009 · 48 comments

Comments

@p5pRT p5pRT commented Feb 24, 2009

Migrated from rt.perl.org#63446 (status was 'resolved')

Searchable as RT63446$

@p5pRT p5pRT commented Feb 24, 2009

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
no
$ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
Malformed UTF-8 character (fatal) at -e line 1.
$

Turning warnings on makes the regexp operation die, whereas with warnings
off it produced the correct answer. If 'no warnings "utf8"' is in scope
at the regexp op, then the error does not occur.
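
The one-liners above can also be written as a small script, which makes it easier to tell a match that dies from one that merely fails. This is only a sketch: the `outcome` helper is a name made up here, and lexical `use warnings` / `no warnings` stand in for the `-w` switch used in the report. What it prints depends on the perl version; on a perl without this problem both lines should say `no`.

```perl
use strict;

# Build the lone surrogate with utf8 warnings off, as the report does.
my $str = do { no warnings 'utf8'; "\x{d800}" };

# Hypothetical helper: run a match and report 'yes', 'no', or the error.
sub outcome {
    my ($code) = @_;
    my $result = eval { $code->() };
    return defined $result ? $result : "died: $@";
}

# Same pattern as the report, once with warnings off and once with them on.
my $quiet = outcome(sub { no warnings;  $str =~ /\A[\x{123}]/ ? 'yes' : 'no' });
my $loud  = outcome(sub { use warnings; $str =~ /\A[\x{123}]/ ? 'yes' : 'no' });

print "warnings off: $quiet\n";
print "warnings on:  $loud\n";
```

Wrapping each match in `eval` keeps the script alive when the match dies, so the two cases can be compared side by side.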

The form of the regexp affects behaviour. If the regexp is /\A\x{123}/
(not a character class) then there is no error or warning. If the regexp
is /\A\x{23}/ (ASCII character, no character class) then a *warning* is
issued and the right answer is returned. If the regexp is /\A[\x{23}]/
(ASCII character, character class) then the error occurs on 5.8.8 and
a warning is issued on 5.10.0 or 5.8.9. Non-ASCII Latin-1 characters
behave the same as ASCII characters.

The characters in the string that can be complained about this way
are U+d800 to U+dfff (the surrogates) and U+ffff (one of many reserved
non-characters).

The character is in fact encoded correctly in Perl-internal UTF-8.
The error message is wrong. Curiously, and probably related,
Devel::Peek::Dump() suffers the same kind of problem when dumping the
string: it generates a warning and gives the wrong UTF-8 decode iff
'use warnings "utf8"' is in scope at its call site.
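
One way to check the claim about the encoding is to look at the bytes directly. The sketch below assumes an ASCII-based platform and uses a made-up helper name, `internal_bytes`; it relies on `utf8::encode`, which as far as I know converts the characters to their UTF-8 octets without validating the code points.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the UTF-8 octets Perl uses for one code point.
sub internal_bytes {
    my ($code_point) = @_;
    no warnings 'utf8';      # surrogates and non-characters may warn
    my $str = chr $code_point;
    utf8::encode($str);      # turn the characters into their UTF-8 octets
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $str);
}

print internal_bytes(0xD800), "\n";   # ED A0 80
print internal_bytes(0xFFFF), "\n";   # EF BF BF
```

Both results follow the ordinary three-byte UTF-8 bit layout, which appears to be what the report means by "encoded correctly".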

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.8:

Configured by Debian Project at Fri Dec 19 00:43:54 EST 2008.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18-6-686, archname=i486-linux-gnu-thread-multi
    uname='linux etch 2.6.18-6-686 #1 smp fri dec 12 16:48:28 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-21)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    .


Environment for perl v5.8.8:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_GB
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

@p5pRT p5pRT commented Mar 22, 2010

From @khwilliamson

On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:

> This is a bug report for perl from zefram@fysh.org,
> generated with the help of perlbug 1.35 running under perl v5.8.8.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> no
> $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> Malformed UTF-8 character (fatal) at -e line 1.
> $
>
> Turning warnings on makes the regexp operation die, whereas with warnings
> off it produced the correct answer. If 'no warnings "utf8"' is in scope
> at the regexp op, then the error does not occur.
>
> The form of the regexp affects behaviour. If the regexp is /\A\x{123}/
> (not a character class) then there is no error or warning. If the regexp
> is /\A\x{23}/ (ASCII character, no character class) then a *warning* is
> issued and the right answer is returned. If the regexp is /\A[\x{23}]/
> (ASCII character, character class) then the error occurs on 5.8.8 and
> a warning is issued on 5.10.0 or 5.8.9. Non-ASCII Latin-1 characters
> behave the same as ASCII characters.
>
> The characters in the string that can be complained about this way
> are U+d800 to U+dfff (the surrogates) and U+ffff (one of many reserved
> non-characters).
>
> The character is in fact encoded correctly in Perl-internal UTF-8.
> The error message is wrong. Curiously, and probably related,
> Devel::Peek::Dump() suffers the same kind of problem when dumping the
> string: it generates a warning and gives the wrong UTF-8 decode iff
> 'use warnings "utf8"' is in scope at its call site.

I'm not sure what to do about this ticket. The basics of it anyway are
behaving as designed, which is that non-characters and surrogates
generate errors unless warnings are turned off, but then things should
work. The message in 5.12 for U+FFFF has been clarified that this
character is illegal for interchange. This should be extended in a
later release to the other 65 noncharacters.

Surrogates, on the other hand, should never appear in well-formed utf8,
and there are security considerations for doing so that I don't fully
understand but can see why. It seems to me that the current design is
sufficient.

@p5pRT p5pRT commented Mar 22, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT commented Oct 19, 2010

From @khwilliamson

Zefram,

Is it ok if I close this ticket? The Unicode standard says, "Because
surrogate code points are not Unicode scalar values, any UTF-8 byte
sequence that would otherwise map to code points D800..DFFF is ill-formed."

The message for FFFF has been changed to be correct.

--Karl Williamson

@p5pRT p5pRT commented Oct 20, 2010

From zefram@fysh.org

Karl Williamson via RT wrote:

> Is it ok if I close this ticket?

No. It's not OK for a warning to be fatal. The situation should either
be a fatal error (regardless of warning flags) or a non-fatal warning
(controlled by warning flags). A warning would make a lot more sense,
because Perl is generally happy to process codepoints in ways that
Unicode does not permit.

-zefram

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

On Sun Mar 21 19:08:59 2010, khw wrote:

> On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
>
> > $ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > no
> > $ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
> > Malformed UTF-8 character (fatal) at -e line 1.
> > $
> >
> > Turning warnings on makes the regexp operation die, whereas with warnings
> > off it produced the correct answer. If 'no warnings "utf8"' is in scope
> > at the regexp op, then the error does not occur.
>
> I'm not sure what to do about this ticket. The basics of it anyway are
> behaving as designed, which is that non-characters and surrogates
> generate errors unless warnings are turned off, but then things should
> work.

It may be working as designed, but it was not designed very well.

> The message in 5.12 for U+FFFF has been clarified that this
> character is illegal for interchange. This should be extended in a
> later release to the other 65 noncharacters.
>
> Surrogates, on the other hand, should never appear in well-formed utf8,
> and there are security considerations for doing so that I don't fully
> understand but can see why.

The regular expression engine is not a security layer. It should not
pretend to be one. If I want to implement a security layer using regular
expressions, then this bug (yes, I do consider it a bug) will get in the
way.

Furthermore, Perl’s strings are not just Unicode. Unicode strings are
merely a subset of the strings that Perl supports.

Regular expressions are for looking at strings. So it should not warn or
die based on the contents of the string, as long as it is a valid Perl
string.

perl already warns for "\x{d800}" and chr 0xd800. So if such a string is
passed to a regular expression, we get multiple warnings for the same
character.

I use Perl’s strings for storing 16-bit binary data. The result is that
not only the code creating such strings, but any code looking at the
strings, has to turn off utf8 warnings. So I can’t use any CPAN modules
such as Data::Dump::Streamer.

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.
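
To make the 16-bit use case concrete, here is a rough sketch of the workaround being objected to: every piece of code that merely inspects such a string has to disable utf8 warnings itself. `has_surrogate_unit` and the sample data are invented for illustration and are not taken from any real module.

```perl
use strict;
use warnings;

# Hypothetical helper: does a string of 16-bit units contain any unit
# in the surrogate range D800..DFFF?
sub has_surrogate_unit {
    my ($data) = @_;
    no warnings 'utf8';    # the workaround: scoped to this one match
    return $data =~ /[\x{D800}-\x{DFFF}]/ ? 1 : 0;
}

# Sample 16-bit data; creating it needs the same exemption.
my $data = do { no warnings 'utf8'; join '', map { chr($_) } 0x0041, 0xD83D, 0xDC7E };

print has_surrogate_unit($data) ? "found\n" : "none\n";
```

The exemption is easy to add in code you own, but it cannot be added to a third-party module that inspects the same string, which is the complaint above.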

There are three patches attached that fix a few cases. There will be
more to come.

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

Inline Patch
diff -up blead-63446.base/MANIFEST blead-63446-utf8-warnings/MANIFEST
--- blead-63446.base/MANIFEST	2010-11-26 08:06:10.000000000 -0800
+++ blead-63446-utf8-warnings/MANIFEST	2010-11-27 21:36:33.000000000 -0800
@@ -4804,6 +4804,7 @@ t/porting/podcheck.t		Test the POD of sh
 t/porting/regen.t		Check that regen.pl doesn't need running
 t/porting/test_bootstrap.t	Test that the instructions for test bootstrapping aren't accidentally overlooked.
 t/README			Instructions for regression tests
+t/re/beyond_unicode.t		See if regexps work with all characters
 t/re/fold_grind.t		See if case folding works properly
 t/re/overload.t		Test against string corruption in pattern matches on overloaded objects
 t/re/pat_advanced.t		See if advanced esoteric patterns work
diff -up blead-63446.base/regcomp.c blead-63446-utf8-warnings/regcomp.c
--- blead-63446.base/regcomp.c	2010-11-24 09:59:12.000000000 -0800
+++ blead-63446-utf8-warnings/regcomp.c	2010-11-28 05:37:38.000000000 -0800
@@ -3038,7 +3038,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_
 	    if (UTF) {
 		const U8 * const s = (U8*)STRING(scan);
 		l = utf8_length(s, s + l);
-		uc = utf8_to_uvchr(s, NULL);
+		uc =
+		  utf8n_to_uvchr(s, UTF8_MAXBYTES, NULL, UTF8_ALLOW_ANYUV);
 	    } else {
 		uc = *((U8*)STRING(scan));
 	    }
@@ -7779,7 +7780,7 @@ tryagain:
 		    if (UTF8_IS_START(*p) && UTF) {
 			STRLEN numlen;
 			ender = utf8n_to_uvchr((U8*)p, RExC_end - p,
-					       &numlen, UTF8_ALLOW_DEFAULT);
+					       &numlen, UTF8_ALLOW_ANYUV);
 			p += numlen;
 		    }
 		    else
@@ -9078,7 +9079,10 @@ S_reguni(pTHX_ const RExC_state_t *pRExC
 
     PERL_ARGS_ASSERT_REGUNI;
 
-    return SIZE_ONLY ? UNISKIP(uv) : (uvchr_to_utf8((U8*)s, uv) - (U8*)s);
+    return
+      SIZE_ONLY
+           ? UNISKIP(uv)
+           : (uvuni_to_utf8_flags((U8*)s, uv, UNICODE_ALLOW_ANY) - (U8*)s);
 }
 
 /*
diff -Nurp blead-63446.base/t/re/beyond_unicode.t blead-63446-utf8-warnings/t/re/beyond_unicode.t
--- blead-63446.base/t/re/beyond_unicode.t	1969-12-31 16:00:00.000000000 -0800
+++ blead-63446-utf8-warnings/t/re/beyond_unicode.t	2010-11-28 05:49:47.000000000 -0800
@@ -0,0 +1,30 @@
+#!./perl -w
+
+# This script tests that the regular expression engine can handle all Perl
+# characters, including those that are not Unicode. Unicode characters are
+# merely a subset of Perl characters.
+
+BEGIN {
+	chdir 't' if -d 't';
+	@INC = '../lib';
+	require './test.pl';
+}
+
+plan 1;
+
+my @bad;
+
+sub report_bad {
+ if(@bad) {
+  diag "Bad ranges: ", join " ", map sprintf("%x00..%x00",$_,$_+1), @bad;
+ }
+}
+
+@bad = ();
+for(0..0x1200) {
+  next if rand > .25;
+  my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+  push @bad, $_ if $c !~ quotemeta $c;
+}
+ok !@bad, 'quotemeta $foo matches $foo for every character';
+report_bad;
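
Outside the core test harness, the idea behind this test (a string should always match its own quotemeta'd form, whatever code points it contains) can be tried with a standalone script. This sketch uses Test::More instead of t/test.pl and a handful of hand-picked code points instead of random 256-character ranges; both are substitutions made here, not part of the patch.

```perl
use strict;
use warnings;
use Test::More;

# Hand-picked samples: ASCII, Latin-1, surrogates, non-characters,
# a supplementary-plane character, and a code point above Unicode.
my @code_points = (0x23, 0xE9, 0x123, 0xD800, 0xDFFF, 0xFDD0, 0xFFFF,
                   0x1F47E, 0x110000);

my @bad;
for my $cp (@code_points) {
    # Only the creation of the string is exempted from utf8 warnings.
    my $char = do { no warnings 'utf8'; chr $cp };
    # eval turns a fatal error inside the match into a plain failure.
    my $matched = eval { $char =~ /\A\Q$char\E\z/ };
    push @bad, sprintf('%X', $cp) unless $matched;
}

ok !@bad, 'every sample character matches its own quotemeta'
    or diag "Bad code points: @bad";
done_testing;
```

Because the match itself runs with warnings enabled, a perl that still treats these code points as fatal in the regexp engine would show up as a failure here rather than aborting the script.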

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

Inline Patch
diff -up blead-63446-utf8-warnings2/regcomp.c blead-63446-utf8-warnings3/regcomp.c
--- blead-63446-utf8-warnings2/regcomp.c	2010-11-28 06:27:23.000000000 -0800
+++ blead-63446-utf8-warnings3/regcomp.c	2010-11-28 11:03:40.000000000 -0800
@@ -8313,7 +8313,7 @@ parseit:
 	if (UTF) {
 	    value = utf8n_to_uvchr((U8*)RExC_parse,
 				   RExC_end - RExC_parse,
-				   &numlen, UTF8_ALLOW_DEFAULT);
+				   &numlen, UTF8_ALLOW_ANYUV);
 	    RExC_parse += numlen;
 	}
 	else
diff -up blead-63446-utf8-warnings2/regexec.c blead-63446-utf8-warnings3/regexec.c
--- blead-63446-utf8-warnings2/regexec.c	2010-11-28 06:32:01.000000000 -0800
+++ blead-63446-utf8-warnings3/regexec.c	2010-11-28 11:08:39.000000000 -0800
@@ -6217,10 +6217,8 @@ S_reginclass(pTHX_ const regexp * const 
     /* If c is not already the code point, get it */
     if (utf8_target && !UTF8_IS_INVARIANT(c)) {
 	c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &c_len,
-		(UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV)
-		| UTF8_ALLOW_FFFF | UTF8_CHECK_ONLY);
-		/* see [perl #37836] for UTF8_ALLOW_ANYUV; [perl #38293] for
-		 * UTF8_ALLOW_FFFF */
+		  UTF8_ALLOW_ANYUV | UTF8_CHECK_ONLY);
+		/* see [perl #37836], [perl #38293] and [perl #63446] */
 	if (c_len == (STRLEN)-1)
 	    Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
     }
diff -up blead-63446-utf8-warnings2/utf8.c blead-63446-utf8-warnings3/utf8.c
--- blead-63446-utf8-warnings2/utf8.c	2010-11-28 06:26:01.000000000 -0800
+++ blead-63446-utf8-warnings3/utf8.c	2010-11-28 12:40:01.000000000 -0800
@@ -2046,8 +2046,7 @@ Perl_swash_fetch(pTHX_ SV *swash, const 
 	       Unicode tables, not a native character number.
 	     */
 	    const UV code_point = utf8n_to_uvuni(ptr, UTF8_MAXBYTES, 0,
-					   ckWARN(WARN_UTF8) ?
-					   0 : UTF8_ALLOW_ANY);
+						 UTF8_ALLOW_ANYUV);
 	    swatch = swash_get(swash,
 		    /* On EBCDIC & ~(0xA0-1) isn't a useful thing to do */
 				(klen) ? (code_point & ~(needents - 1)) : 0,

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

Inline Patch
diff -up blead-63446-utf8-warnings/regcomp.c blead-63446-utf8-warnings2/regcomp.c
--- blead-63446-utf8-warnings/regcomp.c	2010-11-28 05:37:38.000000000 -0800
+++ blead-63446-utf8-warnings2/regcomp.c	2010-11-28 06:27:23.000000000 -0800
@@ -1348,7 +1348,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_st
     HV *widecharmap = NULL;
     AV *revcharmap = newAV();
     regnode *cur;
-    const U32 uniflags = UTF8_ALLOW_DEFAULT;
+    const U32 uniflags = UTF8_ALLOW_ANYUV;
     STRLEN len = 0;
     UV uvc = 0;
     U16 curword = 0;
diff -up blead-63446-utf8-warnings/regexec.c blead-63446-utf8-warnings2/regexec.c
--- blead-63446-utf8-warnings/regexec.c	2010-11-24 05:45:11.000000000 -0800
+++ blead-63446-utf8-warnings2/regexec.c	2010-11-28 06:32:01.000000000 -0800
@@ -1752,7 +1752,7 @@ S_find_byclass(pTHX_ regexp * prog, cons
 
                  */
                 while (s <= last_start) {
-                    const U32 uniflags = UTF8_ALLOW_DEFAULT;
+                    const U32 uniflags = UTF8_ALLOW_ANYUV;
                     U8 *uc = (U8*)s;
                     U16 charid = 0;
                     U32 base = 1;
@@ -2948,7 +2948,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo,
 #endif
     dVAR;
     register const bool utf8_target = PL_reg_match_utf8;
-    const U32 uniflags = UTF8_ALLOW_DEFAULT;
+    const U32 uniflags = UTF8_ALLOW_ANYUV;
     REGEXP *rex_sv = reginfo->prog;
     regexp *rex = (struct regexp *)SvANY(rex_sv);
     RXi_GET_DECL(rex,rexi);
diff -up blead-63446-utf8-warnings/t/re/beyond_unicode.t blead-63446-utf8-warnings2/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings/t/re/beyond_unicode.t	2010-11-28 06:08:59.000000000 -0800
+++ blead-63446-utf8-warnings2/t/re/beyond_unicode.t	2010-11-28 06:09:42.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
 	require './test.pl';
 }
 
-plan 1;
+plan 2;
 
 my @bad;
 
@@ -26,5 +26,13 @@ for(0..0x1200) {
   my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
   push @bad, $_ if $c !~ quotemeta $c;
 }
-ok !@bad, 'quotemeta $foo matches $foo for every character';
+ok !@bad, '$foo =~ quotemeta $foo for every character';
+report_bad;
+
+for(0..0x1200) {
+  next if rand > .25;
+  my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+  push @bad, $_ if $c !~ /\Q$c\E|a/;
+}
+ok !@bad, '$foo =~ /\Q$foo\E|a/ for every character';
 report_bad;

@p5pRT p5pRT commented Nov 28, 2010

From tchrist@perl.com

In-Reply-To: Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org>
of "Sun, 28 Nov 2010 13:16:27 PST." <rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org>

> Furthermore, Perl’s strings are not just Unicode. Unicode strings
> are merely a subset of the strings that Perl supports.

Quite.

> Regular expressions are for looking at strings. So it should not warn
> or die based on the contents of the string, as long as it is a valid
> Perl string.
>
> perl already warns for "\x{d800}" and chr 0xd800. So if such a string
> is passed to a regular expression, we get multiple warnings for the
> same character.

It’s true.

> I use Perl’s strings for storing 16-bit binary data. The result is
> that not only the code creating such strings, but any code looking at
> the strings, has to turn off utf8 warnings. So I can’t use any CPAN
> modules such as Data::Dump::Streamer.

That seems unfortunate.

> I propose we stop the regular expression engine from rejecting or
> warning about these characters altogether. The only checking should be
> for code that creates such characters or for I/O layers.

What you’ve written seems perfectly reasonable, even desirable.

However, I am rather concerned that this could lead to anomalous behavior.
Here’s the kind of thing I don’t believe we want to see happen in Perl.

Java’s pattern matching acts completely nutty when presented with the kind
of data you’re talking about. It never warns or dies, just gives logically
inconsistent results. I hope we do not embark down a road that leads to
this kind of nonsense.

To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher
does with that. I’ve all‐capped results I find most troubling below.
A surrogate pair tests as a single character, but so does half of that
pair. Plus if you flip the order of the surrogates, you now get utterly
illogical results, with the matcher claiming it has neither surrogates nor
nonsurrogates in it, nor any characters at all, yet still allowing some
things to match nonetheless, but others to senselessly fail.

  * Correct surrogate order:

  true   "\uD83D\uDC7E" =~ /./
  true   "\uD83D\uDC7E" =~ /^.$/
  false  "\uD83D\uDC7E" =~ /^..$/
  false  "\uD83D\uDC7E" =~ /\p{Cs}/
  true   "\uD83D\uDC7E" =~ /\P{Cs}/
  true   "\uD83D\uDC7E" =~ /\uD83D\uDC7E/
  false  "\uD83D\uDC7E" =~ /\uDC7E/
  false  "\uD83D\uDC7E" =~ /\uD83D/

  * Half a surrogate pair:

  TRUE   "\uD83D" =~ /./
  TRUE   "\uD83D" =~ /^.$/
  false  "\uD83D" =~ /^..$/
  true   "\uD83D" =~ /\p{Cs}/
  false  "\uD83D" =~ /\P{Cs}/
  true   "\uD83D" =~ /\uD83D/

  * The other half of a surrogate pair:

  TRUE   "\uDC7E" =~ /./
  TRUE   "\uDC7E" =~ /^.$/
  false  "\uDC7E" =~ /^..$/
  true   "\uDC7E" =~ /\p{Cs}/
  false  "\uDC7E" =~ /\P{Cs}/
  true   "\uDC7E" =~ /\uDC7E/

  * Surrogates in backwards order:

  FALSE  "\uDC7E\uD83D" =~ /./
  false  "\uDC7E\uD83D" =~ /^.$/
  true   "\uDC7E\uD83D" =~ /^..$/
  FALSE  "\uDC7E\uD83D" =~ /\p{Cs}/
  FALSE  "\uDC7E\uD83D" =~ /\P{Cs}/
  true   "\uDC7E\uD83D" =~ /\uDC7E\uD83D/
  FALSE  "\uDC7E\uD83D" =~ /\uD83D/
  FALSE  "\uDC7E\uD83D" =~ /\uDC7E/

See what I mean? Isn’t that loony? I’m not sure what you would
see done with “raw data”, but I sure do hope it’s nothing at all
like *that*!

--tom

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

On Sun Nov 28 14:42:02 2010, tom christiansen wrote:

> In-Reply-To: Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org>
> of "Sun, 28 Nov 2010 13:16:27 PST." <rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@perl.org>
>
> > Furthermore, Perl’s strings are not just Unicode. Unicode strings
> > are merely a subset of the strings that Perl supports.
>
> Quite.
>
> > Regular expressions are for looking at strings. So it should not warn
> > or die based on the contents of the string, as long as it is a valid
> > Perl string.
> >
> > perl already warns for "\x{d800}" and chr 0xd800. So if such a string
> > is passed to a regular expression, we get multiple warnings for the
> > same character.
>
> It’s true.
>
> > I use Perl’s strings for storing 16-bit binary data. The result is
> > that not only the code creating such strings, but any code looking at
> > the strings, has to turn off utf8 warnings. So I can’t use any CPAN
> > modules such as Data::Dump::Streamer.
>
> That seems unfortunate.
>
> > I propose we stop the regular expression engine from rejecting or
> > warning about these characters altogether. The only checking should be
> > for code that creates such characters or for I/O layers.
>
> What you’ve written seems perfectly reasonable, even desirable.
>
> However, I am rather concerned that this could lead to anomalous behavior.
> Here’s the kind of thing I don’t believe we want to see happen in Perl.
>
> Java’s pattern matching acts completely nutty when presented with the kind
> of data you’re talking about. It never warns or dies, just gives logically
> inconsistent results. I hope we do not embark down a road that leads to
> this kind of nonsense.
>
> To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
> 0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher
> does with that. I’ve all‐capped results I find most troubling below.
> A surrogate pair tests as a single character, but so does half of that
> pair. Plus if you flip the order of the surrogates, you now get utterly
> illogical results, with the matcher claiming it has neither surrogates nor
> nonsurrogates in it, nor any characters at all, yet still allowing some
> things to match nonetheless, but others to senselessly fail.

None of that will happen in perl, because 0xDC7E and U+1F47E are
completely unrelated characters, as far as it is concerned.

$ perl -le' print "yes" if "\x{1F47E}" =~ /\p{Cs}/'
$ perl -le' print "yes" if "\x{DC7E}" =~ /\p{Cs}/'
yes

$ perl -le' print "yes" if "\x{1F47E}" =~ /^.\z/'
yes
$ perl -le' print "yes" if "\x{D83D}\x{DC7E}" =~ /^.\z/'

> I’m not sure what you would
> see done with “raw data”, but I sure do hope it’s nothing at all
> like *that*!

It will be treated the same way as \x{110000}-\x{ffffffff}, except that
\p{Cs} can match a surrogate and there is no such shorthand for
\x{110000}-\x{ffffffff}.

I’m just making the utf8-warning implementation the same as the
non-utf8-warning implementation.

BTW, here are two more patches.

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

From: Father Chrysostomos <sprout@cpan.org>

[perl #63446] "x" =~ /\x/ for all characters

This makes "x" =~ /\x/ work for all characters that are not ASCII
letters or numbers, regardless of utf8 warnings.

Inline Patch
diff -up blead-63446-utf8-warnings4/regcomp.c blead-63446-utf8-warnings5/regcomp.c
--- blead-63446-utf8-warnings4/regcomp.c	2010-11-28 11:03:40.000000000 -0800
+++ blead-63446-utf8-warnings5/regcomp.c	2010-11-28 14:24:16.000000000 -0800
@@ -8326,7 +8326,7 @@ parseit:
 	    if (UTF) {
 		value = utf8n_to_uvchr((U8*)RExC_parse,
 				   RExC_end - RExC_parse,
-				   &numlen, UTF8_ALLOW_DEFAULT);
+				   &numlen, UTF8_ALLOW_ANYUV);
 		RExC_parse += numlen;
 	    }
 	    else
diff -up blead-63446-utf8-warnings4/t/re/beyond_unicode.t blead-63446-utf8-warnings5/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings4/t/re/beyond_unicode.t	2010-11-28 14:06:55.000000000 -0800
+++ blead-63446-utf8-warnings5/t/re/beyond_unicode.t	2010-11-28 14:44:59.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
 	require './test.pl';
 }
 
-plan 3;
+plan 4;
 
 my @bad;
 
@@ -18,7 +18,7 @@ sub test_against_many_chars(&$) {
  my($test, $name) = @::_;
  @bad = ();
  for(0..0x1200) {
-  next if rand > .25;
+  next if rand > .125;
   &$test([ do { no warnings 'utf8'; map chr, $_<<8 .. $_+1<<8 } ]);
  }
  ok !@bad, $name;
@@ -42,3 +42,10 @@ test_against_many_chars {
   my $c = join "", @{$_[0]};
   push @bad, $_ if $c !~ "^[\Q$c\E]+\\z";
 } '$foo =~ /[$foo]/ for every character';
+
+test_against_many_chars {
+  # Skip this for the ASCII range, as "a" =~ /\a/ obviously does not match.
+  return if !$_;
+  my $c = join "", @{$_[0]};
+  push @bad, $_ if $c !~ ("^[" . ($c =~ s/(.)/\\$1/gross) . "]+\\z");
+} '"x" =~ /[\x]/ for every character';

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

On Sun Nov 28 14:54:51 2010, sprout wrote:

> BTW, here are two more patches.

RT did not like those files. Let’s try this again:

@p5pRT p5pRT commented Nov 28, 2010

From @cpansprout

From: Father Chrysostomos <sprout@cpan.org>

Make t/re/beyond_unicode.t less repetitive

Inline Patch
diff -up blead-63446-utf8-warnings3/t/re/beyond_unicode.t blead-63446-utf8-warnings4/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings3/t/re/beyond_unicode.t	2010-11-28 12:47:07.000000000 -0800
+++ blead-63446-utf8-warnings4/t/re/beyond_unicode.t	2010-11-28 14:06:55.000000000 -0800
@@ -14,33 +14,31 @@ plan 3;
 
 my @bad;
 
-sub report_bad {
+sub test_against_many_chars(&$) {
+ my($test, $name) = @::_;
+ @bad = ();
+ for(0..0x1200) {
+  next if rand > .25;
+  &$test([ do { no warnings 'utf8'; map chr, $_<<8 .. $_+1<<8 } ]);
+ }
+ ok !@bad, $name;
+
  if(@bad) {
   diag "Bad ranges: ", join " ", map sprintf("%x00..%x00",$_,$_+1), @bad;
  }
 }
 
-@bad = ();
-for(0..0x1200) {
-  next if rand > .25;
-  my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+  my $c = join "", @{$_[0]};
   push @bad, $_ if $c !~ quotemeta $c;
-}
-ok !@bad, '$foo =~ quotemeta $foo for every character';
-report_bad;
+} '$foo =~ quotemeta $foo for every character';
 
-for(0..0x1200) {
-  next if rand > .25;
-  my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+  my $c = join "", @{$_[0]};
   push @bad, $_ if $c !~ /\Q$c\E|a/;
-}
-ok !@bad, '$foo =~ /$foo|a/ for every character';
-report_bad;
+} '$foo =~ /$foo|a/ for every character';
 
-for(0..0x1200) {
-  next if rand > .25;
-  my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+  my $c = join "", @{$_[0]};
   push @bad, $_ if $c !~ "^[\Q$c\E]+\\z";
-}
-ok !@bad, '$foo =~ /[$foo]/ for every character';
-report_bad;
+} '$foo =~ /[$foo]/ for every character';

@p5pRT p5pRT commented Nov 29, 2010

From @khwilliamson

Father Chrysostomos via RT wrote​:

On Sun Nov 28 14​:42​:02 2010, tom christiansen wrote​:

In-Reply-To​:

Message from "Father Chrysostomos via RT" \<perlbug\-followup@&#8203;perl\.org>
   of "Sun\, 28 Nov 2010 13&#8203;:16&#8203;:27 PST\."

<rt-3.6.HEAD-13564-1290978986-1559.63446-15-0@​perl.org>

Furthermore, Perl’s strings are not just Unicode. Unicode strings
are merely a subset of the strings that Perl supports.
Quite.

Regular expressions are for looking at strings. So it should not warn
or die based on the contents of the string, as long as it is a valid
Perl string.
perl already warns for "\x{d800}" and chr 0xd800. So if such a string
is passed to a regular expression, we get multiple warnings for the
same character.
It’s true.

I use Perl’s strings for storing 16-bit binary data. The result is
that not only the code creating such strings, but any code looking at
the strings, has to turn off utf8 warnings. So I can’t use any CPAN
modules such as Data​::Dump​::Streamer.
That seems unfortunate.

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.
What you’ve written seems perfectly reasonable — even desirable.

However, I am rather concerned that this could lead to anomalous behavior.
Here’s the kind of thing I don’t believe we want to see happen in Perl.

Java’s pattern matching acts completely nutty when presented with the kind
of data you’re talking about. It never warns or dies, just gives logically
inconsistent results. I hope we do not embark down a road that leads to
this kind of nonsense.

To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
0xDC7E in UTF-16BE. Here are the crazy things the Java pattern matcher
does with that. I’ve all‐capped results I find most troubling below.
A surrogate pair tests as a single character, but so does half of that
pair. Plus if you flip the order of the surrogates, you now get utterly
illogical results, with the matcher claiming it has neither surrogates nor
nonsurrogates in it, nor any characters at all, yet still allowing some
things to match nonetheless, but others to senselessly fail.

None of that will happen in perl, because 0xDC7E and U+1F47E are
completely unrelated characters, as far as it is concerned.

$ perl -le' print "yes" if "\x{1F47E}" =~ /\p{Cs}/'
$ perl -le' print "yes" if "\x{DC7E}" =~ /\p{Cs}/'
yes

$ perl -le' print "yes" if "\x{1F47E}" =~ /^.\z/'
yes
$ perl -le' print "yes" if "\x{D83D}\x{DC7E}" =~ /^.\z/'

I’m not sure what you would
see done with “raw data”, but I sure do hope it’s nothing at all
like *that*!

It will be treated the same way as \x{110000}-\x{ffffffff}, except that
\p{Cs} can match a surrogate and there is no such shorthand for
\x{110000}-\x{ffffffff}.

I’m just making the utf8-warning implementation the same as the
non-utf8-warning implementation.

BTW, here are two more patches.

I have some uneasiness about this. It needs ample vetting here.

First, to make sure you know, I am planning to shortly change things so
that the non-characters and above-Unicode code points do not by default
warn except in I/O. The fixes to do that are more minimal than your
patches.

I had thought of doing that with surrogates as well, but this met with
resistance. This was some months back. So I didn't even propose it
with my most recent postings, the last one of which got no response,
which I take to mean that I had finally addressed all the concerns
expressed earlier.

It seems to me that the best solution would be a way to declare a binary
string, and it would be illegal to operate on it using things that
require semantics beyond the ordinal. So /i would not be valid, nor
uc(), nor /\w/, etc, etc. But that might be construed as being against
Perl philosophy.

@p5pRT p5pRT commented Nov 29, 2010

From @cpansprout

On Sun Nov 28 18​:32​:05 2010, public@​khwilliamson.com wrote​:

I have some uneasiness about this. It needs ample vetting here.

First, to make sure you know, I am planning to shortly change things
so
that the non-characters and above-Unicode code points do not by
default
warn except in I/O.

If warnings are on, right?

The fixes to do that are more minimal than your
patches.

I’d better stop, then. :-)

I had thought of doing that with surrogates as well, but this met with
resistance.

Can you give me a reference?

This was some months back. So I didn't even propose it
with my most recent postings, the last one of which got no response,
which I take to mean that I had finally addressed all the concerns
expressed earlier.

It seems to me that the best solution would be a way to declare a
binary
string, and it would be illegal to operate on it using things that
require semantics beyond the ordinal. So /i would not be valid, nor
uc(), nor /\w/, etc, etc. But that might be construed as being
against
Perl philosophy.

/i and \x{d800} are orthogonal, so neither one should stop the other
from working.

Whether I/O, chr and "\x{...}" warn or not, as long as I can turn off
the warning with ‘no warnings "utf8"’, does not matter to me.

But I reiterate that regular expressions should never warn or die for
valid Perl strings. That’s a bit like adding uninitialized warnings to
‘defined’.

@p5pRT p5pRT commented Nov 29, 2010

From @demerphq

On 28 November 2010 22​:16, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sun Mar 21 19​:08​:59 2010, khw wrote​:

On Tue Feb 24 13​:27​:21 2009, zefram@​fysh.org wrote​:

$ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~
   /\A[\x{123}]/ ? "yes" : "no"'
no
$ perl -lwe 'print do { no warnings "utf8"; "\x{d800}" } =~
   /\A[\x{123}]/ ? "yes" : "no"'
Malformed UTF-8 character (fatal) at -e line 1.
$

Turning warnings on makes the regexp operation die, whereas with
   warnings
off it produced the correct answer.  If 'no warnings "utf8"' is in
   scope
at the regexp op, then the error does not occur.

I'm not sure what to do about this ticket.  The basics of it anyway are
behaving as designed, which is that non-characters and surrogates
generate errors unless warnings are turned off, but then things should
work.

It may be working as designed, but it was not designed very well.

The message in 5.12 for U+FFFF has been clarified that this
character is illegal for interchange.  This should be extended in a
later release to the other 65 noncharacters.

Surrogates, on the other hand, should never appear in well-formed utf8,
and there are security considerations for doing so that I don't fully
understand but can see why.

The regular expression engine is not a security layer. It should not
pretend to be one. If I want to implement a security layer using regular
expressions, then this bug (yes, I do consider it a bug) will get in the
way.

Furthermore, Perl’s strings are not just Unicode. Unicode strings are
merely a subset of the strings that Perl supports.

Regular expressions are for looking at strings. So it should not warn or
die based on the contents of the string, as long as it is a valid Perl
string.

perl already warns for "\x{d800}" and chr 0xd800. So if such a string is
passed to a regular expression, we get multiple warnings for the same
character.

I use Perl’s strings for storing 16-bit binary data. The result is that
not only the code creating such strings, but any code looking at the
strings, has to turn off utf8 warnings. So I can’t use any CPAN modules
such as Data::Dump::Streamer.

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.

I agree, except that I would include /i matches.

Using /i on a unicode flagged string implies you want (our brand of)
unicode folding semantics.

In order to make that work effectively we need to be able to depend on
the utf8 data following the rules.

So I think it's just fine if the case-folding logic warns about something.

But I agree that the regex engine should not block case sensitive matches.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Nov 29, 2010

From @khwilliamson

Father Chrysostomos via RT wrote​:

On Sun Nov 28 18​:32​:05 2010, public@​khwilliamson.com wrote​:

I have some uneasiness about this. It needs ample vetting here.

First, to make sure you know, I am planning to shortly change things
so
that the non-characters and above-Unicode code points do not by
default
warn except in I/O.

If warnings are on, right?

Yes, I keep forgetting to say that.

The fixes to do that are more minimal than your
patches.

I’d better stop, then. :-)

At least until we see how this resolves, anyway.

I had thought of doing that with surrogates as well, but this met with
resistance.

Can you give me a reference?

The only relatively recent one I can find is a really mild comment from
Yves saying he would need to think about the ramifications. I
thought there were more. There certainly is discussion on the recent
thread
http://groups.google.com/group/perl.perl5.porters/browse_thread/thread/501f0059709a973b/2599e7219597cec4?lnk=gst&q=non-character

But I ran across this very similar discussion from two years ago​:
http://rt.perl.org/rt3//Public/Bug/Display.html?id=51936

I'm willing to make surrogates internally allowed by default, like
non-characters if the consensus is it's ok to do so. They most
definitely would continue to be warned about on I/O. Part of the
problem with them is that the Unicode standard says they should not be
in well-formed utf8. John G. Myers can address this (cc'd)

This was some months back. So I didn't even propose it
with my most recent postings, the last one of which got no response,
which I take to mean that I had finally addressed all the concerns
expressed earlier.

It seems to me that the best solution would be a way to declare a
binary
string, and it would be illegal to operate on it using things that
require semantics beyond the ordinal. So /i would not be valid, nor
uc(), nor /\w/, etc, etc. But that might be construed as being
against
Perl philosophy.

/i and \x{d800} are orthogonal, so neither one should stop the other
from working.

This brings up another question that occurred to me. Didn't you say you
were processing binary data? If so, then why is it encoded in utf8?

Whether I/O, chr and "\x{...}" warn or not, as long as I can turn off
the warning with ‘no warnings "utf8"’, does not matter to me.

But I reiterate that regular expressions should never warn or die for
valid Perl strings. That’s a bit like adding uninitialized warnings to
‘defined’.

I think most of us agree. If Perl stored its strings internally in U32
words instead of U8 utf8 bytes, I don't think there would be this
discussion, or the earlier ones.

@p5pRT p5pRT commented Dec 12, 2010

From @cpansprout

On Mon Nov 29 11​:53​:45 2010, public@​khwilliamson.com wrote​:

This brings up another question that occurred to me. Didn't you say
you
were processing binary data? If so, then why is it encoded in utf8?

By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It
is perl that happens to use utf8 internally to represent it, but I
should not have to worry about that.

@p5pRT p5pRT commented Dec 12, 2010

From @cpansprout

On Mon Nov 29 02​:06​:16 2010, demerphq wrote​:

On 28 November 2010 22​:16, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.

I agree, except that I would include /i matches.

Using /i on a unicode flagged string implies you want (our brand of)
unicode folding semantics.

In order to make that work effectively we need to be able to depend on
the utf8 data following the rules.

So I think it's just fine if the case-folding logic warns about something.

That makes case-tolerance conceptually more complex than it needs to be.
I thought /σ/i was supposed to be equivalent to /[Σσς]/, but you seem to
be saying that the former would warn while the latter would not.

@p5pRT p5pRT commented Dec 13, 2010

From @demerphq

On 12 December 2010 22​:04, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Mon Nov 29 02​:06​:16 2010, demerphq wrote​:

On 28 November 2010 22​:16, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

I propose we stop the regular expression engine from rejecting or
warning about these characters altogether. The only checking should be
for code that creates such characters or for I/O layers.

I agree, except that I would include /i matches.

Using /i on a unicode flagged string implies you want (our brand of)
unicode folding semantics.

In order to make that work effectively we need to be able to depend on
the utf8 data following the rules.

So I think it's just fine if the case-folding logic warns about something.

That makes case-tolerance conceptually more complex than it needs to be.
I thought /σ/i was supposed to be equivalent to /[Σσς]/, but you seem to
be saying that the former would warn while the latter would not.

What I was saying was that if we are doing a case insensitive match
and you put a \x{D800} in your string or pattern, then we would be
entitled to warn, as that codepoint is reserved for representing high
value codepoints that cannot be expressed in 16 bits, and does not
represent a "character" at all, and thus cannot be folded. Similar
story for codepoints > 10FFFF etc.

In other words, we should NOT warn if someone wants to match a string
against \x{D800} or match a string containing \x{D800} against a
case-sensitive pattern, as there is no reason to ascribe semantic
meaning to the codepoints when doing case-sensitive matching. However
when we case fold we must ascribe semantic meaning to the codepoints,
and when we encounter one that is illegal it makes sense to say so.

Also, just as a note, in the early days the character class notation
was created so that people had a shorthand way to write (a|b|c|d) type
constructs. With Unicode folding rules, where the folded version of a
string can be longer than the original, this doesn't make quite as much
sense. For instance /[\x{DF}-\x{FF}]/i becomes problematic, as \xDF
folds to 'ss', which a character class can't match, and even more
bizarre, what exactly does it mean to have a range with a
multi-codepoint string as the start point?
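
To make the length change under folding concrete, here is a minimal sketch. It deliberately swaps in U+FB01 (LATIN SMALL LIGATURE FI) for \xDF, because a code point above 0xFF always gets Unicode rules and so sidesteps the question of which rules apply to native 8-bit strings. It assumes perl 5.16 or later for fc(); the variable names are only for illustration.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs perl 5.16 or later

# U+FB01 LATIN SMALL LIGATURE FI is one code point whose full case fold is "fi".
my $lig    = "\x{FB01}";
my $folded = fc($lig);

printf "before fold: %d code point(s)\n", length($lig);
printf "after fold:  %d code point(s)\n", length($folded);

# Under /i the single-character pattern is compared in folded form,
# so it can match a two-character string.
my $matches = ("fi" =~ /\A\x{FB01}\z/i) ? 1 : 0;
print "fi against /\\A\\x{FB01}\\z/i: ", ($matches ? "match" : "no match"), "\n";
```

The point is only that one pattern character can correspond to two string characters once folding is involved, which is what makes a range endpoint like \x{DF} under /i hard to define.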

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 14, 2010

From @chipdude

On 12/12/2010 1​:02 PM, Father Chrysostomos via RT wrote​:

On Mon Nov 29 11​:53​:45 2010, public@​khwilliamson.com wrote​:

This brings up another question that occurred to me. Didn't you say
you
were processing binary data? If so, then why is it encoded in utf8?
By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It
is perl that happens to use utf8 internally to represent it, but I
should not have to worry about that.

I hate to have to disagree, but: "UTF8" means "UCS Transformation Format -
8-bit", and "UCS" means "Universal Character Set", i.e. Unicode.
Unicode semantics _are_ part of what Perl supports, so Perl is entitled
to give Unicode-specific meaning to the code points it finds therein.
What you seem to want is for Perl to support "arbitrary integers encoded
as variable-length byte strings using the same encoding tricks as UTF8",
and of course it is possible that this could have been done, but that's
not what Perl actually promises to do.

So complaining when Perl takes seriously the "u" in "utf8" seems
ill-founded. Unless, that is, I have been grievously misinformed?

@p5pRT p5pRT commented Dec 14, 2010

From @ikegami

On Mon, Dec 13, 2010 at 7​:14 PM, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/12/2010 1​:02 PM, Father Chrysostomos via RT wrote​:

By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It
is perl that happens to use utf8 internally to represent it, but I
should not have to worry about that.

I hate to have to disagree, but: "UTF8" means "UCS Transformation Format -
8-bit", and "UCS" means "Universal Character Set", i.e. Unicode.

The name of some internal flag is of very little importance.

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I don't
know of any documentation to the contrary (but I'm not familiar with the
latest).

$ perl -E'say 0xFFFFFFFFFFFFFFFF'
18446744073709551615

$ perl -E'say ord chr 0xFFFFFFFFFFFFFFFF'
18446744073709551615

$ perl -MEncode -E'$x=chr 0xFFFFFFFFFFFFFFFF; Encode::_utf8_off($x); say length($x)'
13

Despite being named "UTF8", the flag clearly does not imply adherence to
UTF-8.

(Obviously, uc() and the regex engine will assign meaning to those numbers,
but that's unrelated.)

It may be that Perl should be changed so its strings are confined to strings
of Unicode characters, but basing your argument on the name of some internal
flag makes the argument unconvincing.

- Eric

@p5pRT p5pRT commented Dec 14, 2010

From @chipdude

On 12/13/2010 5​:24 PM, Eric Brine wrote​:

On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com> wrote:

On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote:

By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It
is perl that happens to use utf8 internally to represent it, but I
should not have to worry about that.

I hate to have to disagree, but: "UTF8" means "UCS Transformation Format -
8-bit", and "UCS" means "Universal Character Set", i.e. Unicode.

The name of some internal flag is of very little importance.

That would be true, if it were purely an internal flag. But the flag
is very external, both in code ("utf8::upgrade()") and in
documentation. Please try harder.

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I
don't know of any documentation to the contrary...

Well of course Perl is designed to perform as gracefully as possible as
the Unicode committee(s) assign new code points; to do otherwise would
be downright stupid. But that forward-looking design is irrelevant to
the fact that Perl knows the strings _are_ Unicode. As for documentary
evidence, from the many possible choices I pick mostly at random this
quotation from perlunicode -- which is referenced from utf8's
documentation, natch​:

  "Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data."

The above-quoted objections are hardly worth knocking down. Please,
please try harder.

@p5pRT p5pRT commented Dec 14, 2010

From @demerphq

On 14 December 2010 07​:22, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/13/2010 5​:24 PM, Eric Brine wrote​:

On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com> wrote:

    On 12/12/2010 1​:02 PM, Father Chrysostomos via RT wrote​:
    > By 16-bit binary data, I mean sequences of unsigned 16-bit
    integers. It
    > is perl that happens to use utf8 internally to represent it, but I
    > should not have to worry about that.

    I hate to have to disagree, but:  "UTF8" means "UCS Transformation
    Format -
    8-bit", and "UCS" means "Universal Character Set", i.e. Unicode.

The name of some internal flag is of very little importance.

That would be true, if it were purely an internal flag.  But the flag
is very external, both in code ("utf8​::upgrade()") and in
documentation.  Please try harder.

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I
don't know of any documentation to the contrary...

Well of course Perl is designed to perform as gracefully as possible as
the Unicode committee(s) assign new code points; to do otherwise would
be downright stupid.  But that forward-looking design is irrelevant to
the fact that Perl knows the strings _are_ Unicode.   As for documentary
evidence, from the many possible choices I pick mostly at random this
quotation from perlunicode -- which is referenced from utf8's
documentation, natch​:

   "Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data."

You are a bit misinformed. The internals specifically contemplated
handling the utf8 encoding as a way to implement packed arrays of 32
bit integers.

It is only when we must ascribe meaning to codepoints, such as when we
do case change operations, or case insensitive matching that we
ascribe semantic meaning to the values.

There is no reason not to allow \x{D800} to be stored in a utf8
string, except if someone wants to treat that string as having meaning
under Unicode. It's not Perl's nature to say "you can't do that - Unicode
doesn't agree" except when we have no other choice.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 14, 2010

From @chipdude

On 12/13/2010 11​:24 PM, demerphq wrote​:

On 14 December 2010 07​:22, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/13/2010 5​:24 PM, Eric Brine wrote​:

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I
don't know of any documentation to the contrary...
Well of course Perl is designed to perform as gracefully as possible as
the Unicode committee(s) assign new code points; to do otherwise would
be downright stupid. But that forward-looking design is irrelevant to
the fact that Perl knows the strings _are_ Unicode. As for documentary
evidence, from the many possible choices I pick mostly at random this
quotation from perlunicode -- which is referenced from utf8's
documentation, natch​:

"Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data."
You are a bit misinformed. The internals specifically contemplated
handling the utf8 encoding as a way to implement packed arrays of 32
bit integers.

Code cannot contemplate. What are you trying to say? A hypothetical
leveraging of the utf8 support code for some other purpose is off topic.

It is only when we must ascribe meaning to codepoints, such as when we
do case change operations, or case insensitive matching that we
ascribe semantic meaning to the values.

Well, of course. Unnecessary validation work is unnecessary. Still,
Perl knows it's Unicode.

There is no reason not to allow \x{D800} to be stored in a utf8
string, except if someone wants to treat that string as having meaning
under unicode.

Perl does treat the string as having meaning under Unicode. This is
established. Now if a programmer decides to play a game in which he
puts illegal code points into Unicode strings because Perl's validation
is lazy, well, that's a game that programmer may win and may lose; but
in any case, he has no grounds to complain when Perl's validation
catches up with him.

It's not Perl's nature to say "you can't do that - Unicode
doesn't agree" except when we have no other choice.

Perl's nature includes both compliance and integrity. It's established
and documented that Perl's "utf8" is a representation of Unicode; that's
not a lie, but a truth that some people are in denial about. Perl
interprets your commands within that context. So its compliance has
limits and conditions. It has always been thus.

@p5pRT p5pRT commented Dec 14, 2010

From @ikegami

On Tue, Dec 14, 2010 at 1​:46 PM, Reverend Chip <rev.chip@​gmail.com> wrote​:

Perl is made of its operators, in part. If operators treat the strings
as Unicode, then Perl does.

But substr, length and index don't treat strings as Unicode. It doesn't
assign any meaning to the characters. That's why I can use substr, length
and index on strings of arbitrary integers (e.g. iso-latin-15, JFIF image,
etc).

I hope you're not saying I'm misusing substr by using it on binary data. So
the only question that leaves is what's the limit in the size of the
integers. Nowhere does it mention that the integers are limited to 8-bits,
and it's not limited to 8-bits in practice. (It's limited to UV.) That's an
assumption you're carrying over from C or something.

I'm not saying we should support more than Unicode or not, just that we
currently do support more than Unicode.
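
A minimal sketch of that claim, assuming a reasonably recent perl: the integers below are arbitrary picks that include a surrogate, a noncharacter and an above-Unicode value, and the variable names are only for illustration. The utf8 warnings are switched off so the example is quiet whichever perl version runs it.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence complaints about surrogates and above-Unicode values

# A string built from arbitrary unsigned integers, not all of them Unicode characters.
my @ints = (0x41, 0xD800, 0xFFFF, 0x110000);
my $s    = join '', map { chr($_) } @ints;

# length, substr, index and ord see one element per integer.
my $len   = length($s);
my @back  = map { ord(substr($s, $_, 1)) } 0 .. $len - 1;
my $where = index($s, chr(0xFFFF));

print "length: $len\n";
print "round trip: ", join(' ', map { sprintf('%X', $_) } @back), "\n";
print "index of 0xFFFF: $where\n";
```

None of these operators needs to know what the integers mean; only the hex values are printed, so no I/O layer ever sees the string itself.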

- Eric

@p5pRT p5pRT commented Dec 14, 2010

From @chipdude

On 12/14/2010 12​:31 PM, Eric Brine wrote​:

On Tue, Dec 14, 2010 at 1:46 PM, Reverend Chip <rev.chip@gmail.com> wrote:

Perl is made of its operators, in part. If operators treat the strings
as Unicode, then Perl does.

But substr, length and index don't treat strings as Unicode.

Yes, they do. But their error checking is minimal for performance
reasons. So you're getting away with cheating.

It doesn't assign any meaning to the characters.

For the value of "it" that is Perl as a whole, I've already proven this
point wrong; please spare us the repetition. If you mean each
individual operator, then see above.

@p5pRT p5pRT commented Dec 15, 2010

From @khwilliamson

I'd like to come to some closure on this discussion. Let me start by
stepping back and summarizing, first quoting from the Unicode Standard​:

"2.7 Unicode Strings

"A Unicode string data type is simply an ordered sequence of code units.
Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units,
a Unicode 16-bit string is an ordered sequence of 16-bit code units, and
a Unicode 32-bit string is an ordered sequence of 32-bit code units.

"Depending on the programming environment, a Unicode string may or may
not be required to be in the corresponding Unicode encoding form. For
example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings,
but are not necessarily well-formed UTF-16 sequences. In normal
processing, it can be far more efficient to allow such strings to
contain code unit sequences that are not well-formed UTF-16—that is,
isolated surrogates. Because strings are such a fundamental component
of every program, checking for isolated surrogates in every operation
that modifies strings can create significant overhead, especially
because supplementary characters are extremely rare as a percentage of
overall text in programs worldwide.

"It is straightforward to design basic string manipulation libraries
that handle isolated surrogates in a consistent and straightforward
manner. They cannot ever be interpreted as abstract characters, but they
can be internally handled the same way as noncharacters where they
occur. Typically they occur only ephemerally, such as in dealing with
keyboard events. While an ideal protocol would allow keyboard events to
contain complete strings, many allow only a single UTF-16 code unit per
event. As a sequence of events is transmitted to the application, a
string that is being built up by the application in response to those
events may contain isolated surrogates at any particular point in time."

And the definition of "abstract character"​:

"D7 Abstract character​: A unit of information used for the organization,
control, or representation of textual data.

"When representing data, the nature of that data is generally symbolic
as opposed to some other kind of data (for example, aural or visual).
Examples of such symbolic data include letters, ideographs, digits,
punctuation, technical symbols, and dingbats.

"An abstract character has no concrete form and should not be confused
with a glyph.

"An abstract character does not necessarily correspond to what a user
thinks of as a “character” and should not be confused with a grapheme.

"The abstract characters encoded by the Unicode Standard are known as
Unicode abstract characters.

"Abstract characters not directly encoded by the Unicode Standard can
often be represented by the use of combining character sequences."

What that bureaucratese comes down to is that an abstract character is a
code point that has been assigned a meaning, like LINE FEED or LATIN
CAPITAL LETTER A. There are 0x110000 (1,114,112) code points in Unicode,
ranging from 0 to 0x10FFFF; somewhat less than a quarter are currently assigned.

There are 4 categories of code points that are not abstract characters
(private use are separate, and not an issue)​:

1) Those that may be assigned in the future; they have General Category Cn.

2) Noncharacters, also having Gc=Cn. These are reserved for internal
use by an application; and are illegal for interchange between
applications. Perl handles these improperly, treating them like it does
surrogates, except that for conversion from utf8 to an unsigned value,
it knows about only one of the 66 of these. Asymmetrically, going from
uv to utf8, it does know about all 66, but splits them into two groups,
a distinction not present in the standard.

3) Beyond Unicode code points. These are the code points above 0x10FFFF
but fitting into whatever size word is available. Unicode has said that
these will never be used by it.

4) Surrogates, having Gc=Cs, which are reserved for use in pairs in
UTF-16 to allow > 16 bit code points to be specified.
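
As a rough illustration of the categories above, here is a small Perl sketch that classifies a code point purely by its ordinal value. The sub name classify_cp is hypothetical, and the sketch lumps assigned and unassigned code points together as 'other', because telling those apart needs the Unicode database rather than arithmetic.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a code point by ordinal alone.
sub classify_cp {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;
    return 'surrogate'     if $cp >= 0xD800 && $cp <= 0xDFFF;
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    return 'noncharacter'  if ($cp >= 0xFDD0 && $cp <= 0xFDEF)
                           || ($cp & 0xFFFE) == 0xFFFE;
    return 'other';
}

# Tally the whole Unicode range, 0 through 0x10FFFF.
my %count;
$count{ classify_cp($_) }++ for 0 .. 0x10FFFF;
printf "%-13s %d\n", $_, $count{$_} for sort keys %count;
```

The tally gives the 66 noncharacters mentioned in item 2 (32 in the U+FDD0 block plus two at the end of each of the 17 planes) and 2048 surrogates.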

My original proposal this time round was to fix the noncharacters to
operate as the standard says, by allowing them by default internally.
Only during I/O would they be checked for. It seems like a small
extension to give this behavior as well to the beyond-Unicode code
points. And it has been suggested that this behavior extend as well to
the surrogates, so that any unsigned value can be represented as a "Perl
string". Note that no one is proposing that any of these values be
legal upon I/O.

In this thread, I don't think I've heard what the harm is of allowing
surrogates internally. The above text from the standard seems to allow
that possibility, as long as they don't represent an abstract character.
  So why not allow them (as they mostly are now when warnings are off)?

@p5pRT p5pRT commented Dec 16, 2010

From @chipdude

On 12/15/2010 11​:55 AM, karl williamson wrote​:

"As a sequence of events is transmitted to the application, a string
that is being built up by the application in response to those events
may contain isolated surrogates at any particular point in time."

That is a much better explanation than I have previously made as to why
Perl is so lax with its code point checking while it still knows,
certainly, that its strings are Unicode. It makes sense that Perl must
allow arbitrary sequences of code points because Perl can't know what
the string is being used for; and too much error checking would render
Perl less useful. For example, when translating JS/JSON/Java strings to
Perl utf8 strings, it shouldn't be surprising or alarming to find at
some point surrogates as 'characters' in the Perl string. Any time Perl
code copes with UTF-16 might also have this happen. So Perl can't do
much error checking on contained code points.

My original proposal this time round was to fix the noncharacters to
operate as the standard says, by allowing them by default internally.
Only during I/O would they be checked for. It seems like a small
extension to give this behavior as well to the beyond-Unicode code
points. And it has been suggested that this behavior extend as well
to the surrogates, so that any unsigned value can be represented as a
"Perl string". Note that no one is proposing that any of these values
be legal upon I/O.

In this thread, I don't think I've heard what the harm is of allowing
surrogates internally. The above text from the standard seems to
allow that possibility, as long as they don't represent an abstract
character. So why not allow them (as they mostly are now when
warnings are off)?

I think "allow" is too overloaded; perhaps I'm misunderstanding what you
mean by it. Maybe this is violent agreement. :-( That said, "harm"
isn't the only relevant standard. There's also "correctness." If you
ask Perl to do something with data and the data are not properly formed
for what you ask it to do -- if Perl can't do it _correctly_ -- then we
expect Perl to tell us. Compare C<"xyz" + 1>.

A utf8 string where some of the code points are surrogates, which is
being processed in a way that requires knowing the code points'
semantics, like \p{whatever} or case operations, cannot be processed
_correctly_ because there is no _correct_ answer. Therein may lie the
harm you seek, btw​: Perl silently acting on invalid data and producing
invalid results without the programmer getting a warning about it.

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

@p5pRT p5pRT commented Dec 17, 2010

From @khwilliamson

Reverend Chip wrote​:

On 12/15/2010 11​:55 AM, karl williamson wrote​:

"As a sequence of events is transmitted to the application, a string
that is being built up by the application in response to those events
may contain isolated surrogates at any particular point in time."

That is a much better explanation than I have previously made as to why
Perl is so lax with its code point checking while it still knows,
certainly, that its strings are Unicode. It makes sense that Perl must
allow arbitrary sequences of code points because Perl can't know what
the string is being used for; and too much error checking would render
Perl less useful. For example, when translating JS/JSON/Java strings to
Perl utf8 strings, it shouldn't be surprising or alarming to find at
some point surrogates as 'characters' in the Perl string. Any time Perl
code copes with UTF-16 might also have this happen. So Perl can't do
much error checking on contained code points.

My original proposal this time round was to fix the noncharacters to
operate as the standard says, by allowing them by default internally.
Only during I/O would they be checked for. It seems like a small
extension to give this behavior as well to the beyond-Unicode code
points. And it has been suggested that this behavior extend as well
to the surrogates, so that any unsigned value can be represented as a
"Perl string". Note that no one is proposing that any of these values
be legal upon I/O.

In this thread, I don't think I've heard what the harm is of allowing
surrogates internally. The above text from the standard seems to
allow that possibility, as long as they don't represent an abstract
character. So why not allow them (as they mostly are now when
warnings are off)?

I think "allow" is too overloaded; perhaps I'm misunderstanding what you
mean by it. Maybe this is violent agreement. :-( That said, "harm"
isn't the only relevant standard. There's also "correctness." If you
ask Perl to do something with data and the data are not properly formed
for what you ask it to do -- if Perl can't do it _correctly_ -- then we
expect Perl to tell us. Compare C<"xyz" + 1>.

A utf8 string where some of the code points are surrogates, which is
being processed in a way that requires knowing the code points'
semantics, like \p{whatever} or case operations, cannot be processed
_correctly_ because there is no _correct_ answer. Therein may lie the
harm you seek, btw​: Perl silently acting on invalid data and producing
invalid results without the programmer getting a warning about it.

But surrogates do have semantics. The standard is kind of
self-contradictory about these things. It says that surrogate code
points are not legal Unicode code points, the same as for those above
10FFFF. But the data files give a property definition for surrogates
for every property in Unicode. The upper case of a surrogate is itself.
  It has general category 'Cs' or 'Surrogate'. It's in one of the
surrogate blocks. It case folds to itself. These definitions are not
just by-products of having to have place-markers in the data files​: In
the standard's Section "D.4 Changes from Version 5.0 to Version 5.1", it
says​: "In UAX #24, “Unicode Script Property,” added surrogates to the
list of code points which get the “Unknown” script value ...".

So they are actively maintaining the properties for surrogate code points.
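
A minimal sketch of how those property definitions can be observed from Perl. It assumes a perl recent enough (5.14 or later) that surrogates are accepted in strings; the helper name `surrogate_report` is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silences the surrogate warnings that uc/lc may raise

# Hypothetical helper: reports a few Unicode properties of one code point.
sub surrogate_report {
    my ($cp) = @_;
    my $ch = chr $cp;
    return {
        is_Cs          => ($ch =~ /\p{Gc=Cs}/                ? 1 : 0),
        high_block     => ($ch =~ /\p{Block=High_Surrogates}/ ? 1 : 0),
        uc_is_self     => (uc($ch) eq $ch                      ? 1 : 0),
        lc_is_self     => (lc($ch) eq $ch                      ? 1 : 0),
        script_unknown => ($ch =~ /\p{Script=Unknown}/        ? 1 : 0),
    };
}

my $r = surrogate_report(0xD800);
printf "%-15s %d\n", $_, $r->{$_} for sort keys %$r;
```

The point is only that each question has a defined answer for U+D800, not that any of these operations is a sensible thing to do to a surrogate.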

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

A problem is that it doesn't currently work any way like this when
converting from utf8 to code point. Unless warnings are off, or called
with specific flags, the common function that does this will return 0
instead of the code point for surrogates, one of the non-character code
points, and any values that don't fit into 31 bits. All the other
non-character code points and above 10FFFF but fitting into 31 bits are
unchecked.

(It is beyond me as to why the 31-bit limit, unless there was concern
that UV's didn't work and IV's would be required, or I suppose it could
be that testing for this was the most efficient, requiring only a
single-byte compare, to make sure that it would fit into 32 bits.)
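
The arithmetic behind that guess can be sketched as follows. This assumes the original (pre-RFC 3629) UTF-8 layout that Perl's extended utf8 follows, where a start byte with n leading 1 bits opens an n-byte sequence; `payload_bits` is a hypothetical helper, not Perl's actual decoder.

```perl
use strict;
use warnings;
no warnings 'utf8';

# Hypothetical helper: number of code point bits an extended-UTF-8
# sequence can carry, judged from its start byte alone.
sub payload_bits {
    my ($start) = @_;
    return 7     if $start < 0x80;     # ASCII: 0xxxxxxx
    return undef if $start < 0xC0;     # continuation byte, not a start byte
    return undef if $start == 0xFF;    # Perl's further extension, not covered here
    my $n = 0;                         # leading 1 bits = sequence length
    $n++ while $n < 8 && ($start & (0x80 >> $n));
    return (7 - $n) + 6 * ($n - 1);    # start-byte bits + 6 per continuation byte
}

printf "start 0x%02X: %2d bits\n", $_, payload_bits($_)
    for 0xDF, 0xEF, 0xF7, 0xFB, 0xFD, 0xFE;

# Cross-check against perl's own encoder for the largest 31-bit value.
my $bytes = chr(0x7FFF_FFFF);
utf8::encode($bytes);
printf "U+7FFFFFFF: %d bytes, start byte 0x%02X\n", length $bytes, ord $bytes;
```

Under that layout, any sequence whose start byte is 0xFD or lower carries at most 31 bits, so a single comparison on the first byte is enough to enforce the limit.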

@p5pRT p5pRT commented Dec 17, 2010

From @chipdude

On 12/17/2010 1​:51 PM, karl williamson wrote​:

Reverend Chip wrote​:

A utf8 string where some of the code points are surrogates, which is
being processed in a way that requires knowing the code points'
semantics, like \p{whatever} or case operations, cannot be processed
_correctly_ because there is no _correct_ answer. Therein may lie the
harm you seek, btw​: Perl silently acting on invalid data and producing
invalid results without the programmer getting a warning about it.

But surrogates do have semantics. The standard is kind of
self-contradictory about these things. It says that surrogate code
points are not legal Unicode code points, the same as for those above
10FFFF. But the data files give a property definition for surrogates
for every property in Unicode. The upper case of a surrogate is
itself. It has general category 'Cs' or 'Surrogate'. It's in one of
the surrogate blocks. It case folds to itself. These definitions are
not just by-products of having to have place-markers in the data
files​: In the standard's Section "D.4 Changes from Version 5.0 to
Version 5.1", it says​: "In UAX #24, “Unicode Script Property,” added
surrogates to the list of code points which get the “Unknown” script
value ...". So they are actively maintaining the properties for
surrogate code points.

Only a committee would declare something illegal, and also specify how
it should work.
I left that whole paragraph in place because I can hardly believe it, so
I want to remind myself that it's true.

So​: Under the "generous in what you accept" principle, if Unicode has
gone so far as to maintain the code point properties, then we really
should play along, I guess. \p{foo} for everyone!

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

A problem is that it doesn't currently work any way like this when
converting from utf8 to code point. Unless warnings are off, or
called with specific flags, the common function that does this will
return 0 instead of the code point for surrogates, one of the
non-character code points, and any values that don't fit into 31
bits. All the other non-character code points and above 10FFFF but
fitting into 31 bits are unchecked.

(It is beyond me as to why the 31-bit limit, unless there was concern
that UV's didn't work and IV's would be required, or I suppose it
could be that testing for this was the most efficient, requiring only
a single-byte compare, to make sure that it would fit into 32 bits.)

Erm, that sounds like something that should change, then. If current
Unicode defines properties for surrogate code points (and presumably
that one non-character code point?) then we need to be able to decode
them. I suspect that the 31-bit limit probably does relate to IVs; but
given that UV support is quite good I think allowing 32 bits is a good
idea, but I imagine that might have cascading implications throughout
the code that processes characters. (Why signed integers are the
default in C is something I hope never to truly understand.)

@p5pRT p5pRT commented Dec 17, 2010

From tchrist@perl.com

Only a committee would declare something illegal, and also
specify how it should work.

I left that whole paragraph in place because I can hardly
believe it, so I want to remind myself that it's true.

I thought the same thing. Remember, Java is the place where both a single
surrogate (half a character, as it were) tests true for /^.$/, but so does
a surrogate pair​: you can't distinguish them! There's a disturbing amount
of Java mumblese in the Unicode docs. It's disturbing because they're
citing people who clearly don't understand important matters, which shows
that they themselves are thinking fuzzily. This does not inspire confidence.

--tom

@p5pRT p5pRT commented Dec 17, 2010

From @chipdude

On 12/17/2010 2​:34 PM, Tom Christiansen wrote​:

Only a committee would declare something illegal, and also
specify how it should work.
I left that whole paragraph in place because I can hardly
believe it, so I want to remind myself that it's true.
I thought the same thing. Remember, Java is the place where both a single
surrogate (half a character, as it were) tests true for /^.$/, but so does
a surrogate pair​: you can't distinguish them!

That's remarkable. I'll bet that a reversed surrogate pair matches
/^.{2}$/.

There's a disturbing amount
of Java mumblese in the Unicode docs. It's disturbing because they're
citing people who clearly don't understand important matters, which shows
that they themselves are thinking fuzzily. This does not inspire confidence.

Semantic bleed back from Java makes sense; has the Java world been a
political force in Unicode, perhaps?

@p5pRT p5pRT commented Dec 18, 2010

From tchrist@perl.com

That's remarkable. I'll bet that a reversed surrogate pair matches
/^.{2}$/.

Clever guess! But it's worse than that​:

  U+D83D TRUE​: "?" =~ /./
  U+D83D TRUE​: "?" =~ /^.$/
  U+D83D false​: "?" =~ /../
  U+D83D false​: "?" =~ /^..$/
  U+D83D TRUE​: "?" =~ /\pC/
  U+D83D TRUE​: "?" =~ /\p{Cs}/
  U+D83D TRUE​: "?" =~ /\p{InHighSurrogates}/
  U+D83D false​: "?" =~ /\p{InLowSurrogates}/
  U+D83D false​: "?" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

  U+1F47E TRUE​: "👾" =~ /./
  U+1F47E TRUE​: "👾" =~ /^.$/
  U+1F47E false​: "👾" =~ /../
  U+1F47E false​: "👾" =~ /^..$/
  U+1F47E TRUE​: "👾" =~ /\pC/
  U+1F47E TRUE​: "👾" =~ /\p{Cs}/
  U+1F47E false​: "👾" =~ /\p{InHighSurrogates}/
  U+1F47E TRUE​: "👾" =~ /\p{InLowSurrogates}/
  U+1F47E TRUE​: "👾" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

  U+DC7E TRUE​: "?" =~ /./
  U+DC7E TRUE​: "?" =~ /^.$/
  U+DC7E false​: "?" =~ /../
  U+DC7E false​: "?" =~ /^..$/
  U+DC7E TRUE​: "?" =~ /\pC/
  U+DC7E TRUE​: "?" =~ /\p{Cs}/
  U+DC7E false​: "?" =~ /\p{InHighSurrogates}/
  U+DC7E TRUE​: "?" =~ /\p{InLowSurrogates}/
  U+DC7E false​: "?" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

  U+DC7E.D83D TRUE​: "??" =~ /./
  U+DC7E.D83D false​: "??" =~ /^.$/
  U+DC7E.D83D TRUE​: "??" =~ /../
  U+DC7E.D83D TRUE​: "??" =~ /^..$/
  U+DC7E.D83D TRUE​: "??" =~ /\pC/
  U+DC7E.D83D TRUE​: "??" =~ /\p{Cs}/
  U+DC7E.D83D TRUE​: "??" =~ /\p{InHighSurrogates}/
  U+DC7E.D83D TRUE​: "??" =~ /\p{InLowSurrogates}/
  U+DC7E.D83D false​: "??" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

  U+FDDD TRUE​: "�" =~ /./
  U+FDDD TRUE​: "�" =~ /^.$/
  U+FDDD false​: "�" =~ /../
  U+FDDD false​: "�" =~ /^..$/
  U+FDDD false​: "�" =~ /\pC/
  U+FDDD false​: "�" =~ /\p{Cs}/
  U+FDDD false​: "�" =~ /\p{InHighSurrogates}/
  U+FDDD false​: "�" =~ /\p{InLowSurrogates}/
  U+FDDD TRUE​: "�" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

  U+FFFF TRUE​: "#" =~ /./
  U+FFFF TRUE​: "#" =~ /^.$/
  U+FFFF false​: "#" =~ /../
  U+FFFF false​: "#" =~ /^..$/
  U+FFFF false​: "#" =~ /\pC/
  U+FFFF false​: "#" =~ /\p{Cs}/
  U+FFFF false​: "#" =~ /\p{InHighSurrogates}/
  U+FFFF false​: "#" =~ /\p{InLowSurrogates}/
  U+FFFF TRUE​: "#" =~ /[^\pL\pM\pN\pP\pS\pZ\pC]/

That's with a UTF-8 output encoding.

Anything smell fishy to you? I mean, more than once every few lines? :)

--tom

@p5pRT p5pRT commented Dec 18, 2010

From @ap

* Reverend Chip <rev.chip@​gmail.com> [2010-12-14 01​:15]​:

I hate to have to disagree, but​:

You are disagreeing with Larry.

"UTF8" means "UCS Translation Format - 8-bit", and "UCS" means
"Universal Character Set", i.e. Unicode. Unicode semantics
_are_ part of what Perl supports, so Perl is entitled to give
Unicode-specific meaning to the code points it finds therein.
What you seem to want is for Perl to support "arbitrary
integers encoded as variable-length byte strings using the same
encoding tricks as UTF8", and of course it is possible that
this could have been done, but that's not what Perl actually
promises to do.

It actually does, and has documented that it does.

http​://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8

So complaining when Perl takes seriously the "u" in "utf8"
seems ill-founded. Unless, that is, I have been grievously
misinformed?

Personally I like that Perl is lax there.

In fact I want a particular UTF-8 encoder/decoder for Perl at
some point, which can fully round-trip binary data, and is
known as UTF-8b.

In <http​://bsittler.livejournal.com/10381.html> it is described
briefly. It comes from a long and thorough analysis detailed in
<http​://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>. And
<http​://hyperreal.org/~est/freeware/> has implementations for
C and Python.
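
A rough sketch of the idea in Perl, assuming the mapping is the one described at the first link: every byte that is not part of a valid UTF-8 sequence becomes the lone low surrogate U+DC00 + byte, and encoding reverses that. The names `utf8b_decode` and `utf8b_encode` are hypothetical, and this is not the referenced C or Python code.

```perl
use strict;
use warnings;
no warnings 'utf8';

# One well-formed UTF-8 sequence (surrogates and overlongs excluded).
my $VALID = qr/
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Bytes -> characters; stray bytes become U+DC80..U+DCFF.
sub utf8b_decode {
    my ($bytes) = @_;
    utf8::downgrade($bytes);              # insist on a byte string
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A((?:$VALID)+)//) {
            my $chunk = $1;
            utf8::decode($chunk);         # well-formed run: decode normally
            $out .= $chunk;
        }
        else {
            # escape exactly one offending byte, then resynchronise
            $out .= chr(0xDC00 + ord(substr($bytes, 0, 1, '')));
        }
    }
    return $out;
}

# Characters -> bytes; U+DC80..U+DCFF turn back into the original bytes.
sub utf8b_encode {
    my ($str) = @_;
    my $out = '';
    for my $ch (split //, $str) {
        my $cp = ord $ch;
        if ($cp >= 0xDC80 && $cp <= 0xDCFF) {
            $out .= chr($cp - 0xDC00);
        }
        else {
            my $enc = $ch;
            utf8::encode($enc);
            $out .= $enc;
        }
    }
    return $out;
}

my $raw     = "caf\xC3\xA9 \xFF\x80";     # valid UTF-8, then two stray bytes
my $decoded = utf8b_decode($raw);
my $back    = utf8b_encode($decoded);
printf "%d bytes in, %d characters, round trip %s\n",
    length $raw, length $decoded, ($back eq $raw ? "ok" : "FAILED");
```

Note that this only works because Perl lets a lone surrogate sit in a string, which is the laxness under discussion.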

It is part of Perl’s heritage and appeal that it allows your code
to deal with the outside world in as messy or tidy a manner as
necessary to get the job done – the glue language. I have grown
some distaste for technologies that fail on that count, eg. the
restriction in XML that you can’t represent control characters in
a well-formed document, in any way. I remember the problems this
caused at a site I worked on, where we could not provide feeds of
user posts without hacky workarounds, since HTTP POSTs can
contain anything. When you are forbidden from saying something at
all, it will sooner or later become a problem.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

@p5pRT p5pRT commented Dec 18, 2010

From @demerphq

On 14 December 2010 19​:39, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/13/2010 11​:24 PM, demerphq wrote​:

On 14 December 2010 07​:22, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/13/2010 5​:24 PM, Eric Brine wrote​:

Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
builds, and strings of arbitrary 64-bit numbers in 64-bit builds. I
don't know of any documentation to the contrary...
Well of course Perl is designed to perform as gracefully as possible as
the Unicode committee(s) assign new code points; to do otherwise would
be downright stupid.  But that forward-looking design is irrelevant to
the fact that Perl knows the strings _are_ Unicode.   As for documentary
evidence, from the many possible choices I pick mostly at random this
quotation from perlunicode -- which is referenced from utf8's
documentation, natch​:

   "Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data."
You are a bit misinformed. The internals specifically contemplated
handling the utf8 encoding as a way to implement packed arrays of 32
bit integers.

Code cannot contemplate.  What are you trying to say?  A hypothetical
leveraging of the utf8 support code for some other purpose is off topic.

The coder can contemplate, and then implement, support outside of pure
requirements.

So in particular (and this is documented) we use the term "utf8" to
represent "Perl's private extension of UTF8".

utf8 is equivalent to "UTF8" in that all legal (canonical) UTF8
sequences are legal utf8, however not all legal utf8 sequences are in
UTF8 as utf8 supports a larger range of codepoints, and codepoints
which are illegal in UTF8, like the codepoints reserved for UTF-16
(surrogate pairs).
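
The difference is easy to see with Encode. A minimal sketch, assuming an Encode new enough (2.10 or later) to distinguish the two names; the variable names are only for illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';
use Encode qw(encode);

my $surrogate = "\x{D800}";

# Lax "utf8": the surrogate is encoded like any other code point.
my $lax = encode('utf8', $surrogate);
printf "utf8:  %s\n", uc unpack 'H*', $lax;

# Strict "UTF-8" with no CHECK argument: a substitution character is emitted.
my $strict = encode('UTF-8', $surrogate);
printf "UTF-8: %s\n", uc unpack 'H*', $strict;

# Strict "UTF-8" with FB_CROAK: the call dies. LEAVE_SRC keeps $surrogate intact.
my $ok = eval {
    encode('UTF-8', $surrogate, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "UTF-8 with FB_CROAK: ", ($ok ? "accepted" : "croaked"), "\n";
```

The lax encoder produces the three bytes ED A0 80, while the strict one falls back to U+FFFD (EF BF BD) or croaks, depending on CHECK.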

It is only when we must ascribe meaning to codepoints, such as when we
do case change operations, or case insensitive matching that we
ascribe semantic meaning to the values.

Well, of course.  Unnecessary validation work is unnecessary.  Still,
Perl knows it's Unicode.

"Perl knows it's Unicode" is an insufficiently well defined expression
for this discussion.

Flipping the utf8 bit on a SV tells perl that the integers stored
therein are to be decoded as utf8, and that if certain operations are
performed to use special Unicode routines to do so. When the latter
are invoked Perl has the right to complain about problems with the
contents of the utf8 string. If they are not it has no business doing
so.
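
A small sketch of the first half of that statement, assuming perl 5.14 or later: to operations that need no Unicode lookup, a utf8 string is just a sequence of integers, whatever Unicode thinks of them. The variable names are only for illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';

# An ordinary letter, a surrogate, a noncharacter, and a value past U+10FFFF.
my @code_points = (0x41, 0xD800, 0x10FFFF, 0x110000);
my $s = join '', map { chr($_) } @code_points;

# length, split, ord and reverse never consult the Unicode database.
my $dump     = join ' ', map { sprintf('%X', ord($_)) } split //, $s;
my $reversed = join ' ', map { sprintf('%X', ord($_)) } split //, reverse $s;

printf "utf8 flag: %s\n", (utf8::is_utf8($s) ? "on" : "off");
printf "length:    %d\n", length $s;
printf "forward:   %s\n", $dump;
printf "reversed:  %s\n", $reversed;
```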

There is no reason not to allow \x{D800} to be stored in a utf8
string, except if someone wants to treat that string as having meaning
under unicode.

Perl does treat the string as having meaning under Unicode.

Only when performing an operation that requires lookup into the
Unicode database. Which is actually rarely. The rest of the time it
knows it is utf8. Which as I explain above is /not/ Unicode.

This is established.

No, it is not.

Now if a programmer decides to play a game in which he
puts illegal code points into Unicode strings because Perl's validation
is lazy,

It is not lazy. It is deliberately designed to do this. Read the code
and the comments.

well, that's a game that programmer may win and may lose; but
in any case, he has no grounds to complain when Perl's validation
catches up with him.

No. Again, you haven't read the docs. We document that utf8 is not
UTF8, and that you can do things with utf8 that are strictly speaking
illegal in UTF8.

It's not Perl's nature to say "you can't do that - Unicode
doesn't agree" except when we have no other choice.

Perl's nature both includes compliance and integrity.  It's established
and documented that Perl's "utf8" is a representation of Unicode; that's
not a lie, but a truth that some people are in denial about.  Perl
interprets your commands within that context.  So its compliance has
limits and conditions.  It has always been thus.

You are misinformed. See "perlunifaq"

<quote>

=head2 What's the difference between C<UTF-8> and C<utf8>?

C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
what it accepts. If you have to communicate with things that aren't so liberal,
you may want to consider using C<UTF-8>. If you have to communicate with things
that are too liberal, you may have to use C<utf8>. The full explanation is in
L<Encode>.

C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
consistently, even where utf8 is actually used internally, because the
distinction can be hard to make, and is mostly irrelevant.

For example, utf8 can be used for code points that don't exist in Unicode, like
9999999, but if you encode that to UTF-8, you get a substitution character (by
default; see L<Encode/"Handling Malformed Data"> for more ways of dealing with
this.)

Okay, if you insist​: the "internal format" is utf8, not UTF-8. (When it's not
some other encoding.)

</quote>

And further from Encode​:

<quote>
=head1 UTF-8 vs. utf8 vs. UTF8

  ....We now view strings not as sequences of bytes, but as sequences
  of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
  computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

That has been the perl's notion of UTF-8 but official UTF-8 is more
strict; Its ranges is much narrower (0 .. 10FFFF), some sequences are
not allowed (i.e. Those used in the surrogate pair, 0xFFFE, et al).

Now that is overruled by Larry Wall himself.

  From​: Larry Wall <larry@​wall.org>
  Date​: December 04, 2004 11​:51​:58 JST
  To​: perl-unicode@​perl.org
  Subject​: Re​: Make Encode.pm support the real UTF-8
  Message-Id​: <20041204025158.GA28754@​wall.org>

  On Fri, Dec 03, 2004 at 10​:12​:12PM +0000, Tim Bunce wrote​:
  : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
  : but "UTF-8" is the name of the standard and should give the
  : corresponding behaviour.

  For what it's worth, that's how I've always kept them straight in my
  head.

  Also for what it's worth, Perl 6 will mostly default to strict but
  make it easy to switch back to lax.

  Larry

Do you copy? As of Perl 5.8.7, B<UTF-8> means strict, official UTF-8
while B<utf8> means liberal, lax, version thereof. And Encode version
2.10 or later thus groks the difference between C<UTF-8> and C<utf8>.

  encode("utf8", "\x{FFFF_FFFF}", 1); # okay
  encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

C<UTF-8> in Encode is actually a canonical name for C<utf-8-strict>.
Yes, the hyphen between "UTF" and "8" is important. Without it Encode
goes "liberal"

  find_encoding("UTF-8")->name # is 'utf-8-strict'
  find_encoding("utf-8")->name # ditto. names are case insensitive
  find_encoding("utf_8")->name # ditto. "_" are treated as "-"
  find_encoding("UTF8")->name # is 'utf8'.

The UTF8 flag is internally called UTF8, without a hyphen. It indicates
whether a string is internally encoded as utf8, also without a hyphen.

</quote>

Seems to me you have some reading to do.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 18, 2010

From @demerphq

On 17 December 2010 00​:53, Reverend Chip <rev.chip@​gmail.com> wrote​:

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

That is what I, and others have been saying all along.

And is actually the only way for things to work which complies with
existing documentation and code.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 19, 2010

From @cpansprout

On Dec 16, 2010, at 3​:53 PM, Reverend Chip wrote​:

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

Warnings for surrogates may seem logical at first, but they do not solve my problem of not being able to use modules that I didn’t write. So I’ll just have to continue monkey-patching warnings​::import. (I’m getting used to that sort of thing.)

\p{foo} should definitely be exempt, as we have \p{Cs} specifically for matching surrogates.

@p5pRT p5pRT commented Dec 20, 2010

From @chipdude

On 12/18/2010 1​:21 PM, demerphq wrote​:

On 17 December 2010 00​:53, Reverend Chip <rev.chip@​gmail.com> wrote​:

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.
That is what I, and others have been saying all along.

Indeed, you told me so.

And is actually the only way for things to work which complies with
existing documentation and code.

I think the existing docs are ambiguous. No matter now, of course.

@p5pRT p5pRT commented Dec 20, 2010

From @chipdude

On 12/19/2010 2​:40 PM, Father Chrysostomos wrote​:

On Dec 16, 2010, at 3​:53 PM, Reverend Chip wrote​:

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

Warnings for surrogates may seem logical at first, but they do not solve my problem of not being able to use modules that I didn’t write. So I’ll just have to continue monkey-patching warnings​::import. (I’m getting used to that sort of thing.)

Sorry for the trouble. This seems like a hack needed whenever warnings
are triggered by bad data (or at least data Perl would call bad), e.g
math on non-numbers.

@p5pRT p5pRT commented Dec 20, 2010

From @chipdude

On 12/18/2010 1​:18 PM, demerphq wrote​:

See "perlunifaq"

<quote>

=head2 What's the difference between C<UTF-8> and C<utf8>?

C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
what it accepts. [...]

OK, fine, I was mistaken. All code points are welcome. The fact that
"utf8" doesn't mean the same as "UTF-8" is remarkably annoying, and I
don't blame myself for falling for it. I will read perlunitut (again),
though, just in case there are more land mines I might step on.

@p5pRT p5pRT commented Dec 20, 2010

From @ikegami

On Mon, Dec 20, 2010 at 6​:45 AM, Reverend Chip <rev.chip@​gmail.com> wrote​:

On 12/19/2010 2​:40 PM, Father Chrysostomos wrote​:

On Dec 16, 2010, at 3​:53 PM, Reverend Chip wrote​:

So it seems to me, in the end, that the warnings on surrogates in
\p{foo}, //i, lc, uc, etc. are important; but that we could document
that set of operations that will warn, and guarantee to programmers that
if they stay clear of those operators, they can put any pseudo-character
in a utf8 string and we will promise to avert our collective gaze.

Warnings for surrogates may seem logical at first, but they do not solve
my problem of not being able to use modules that I didn’t write. So I’ll
just have to continue monkey-patching warnings​::import. (I’m getting used to
that sort of thing.)

Sorry for the trouble. This seems like a hack needed whenever warnings
are triggered by bad data (or at least data Perl would call bad), e.g
math on non-numbers.

It's going to be like math on NaN. There are billions of different NaN, just
like there are billions of non-Unicode characters.

@p5pRT p5pRT commented Dec 20, 2010

From tchrist@perl.com

Anything smell fishy to you? I mean, more than once every few lines? :)

I'm not sure what it means. I thought we were talking Java but then I
see all the \p patterns; does Java implement those?

Yes, *some* of them. The pre-3.1 ones only, mostly. It's a
massive flustercluck, and they don't give a damn.

* Like they have no way to convert between code points and character
  names.

* Like they ignore the rules about loose interpretation of properties.

* Like they have \p{Alpha} for [A-Za-z] *only*, and no \p{Alphabetic}.

* Like they have \p{InGreek} but not \p{IsGreek}.

* Like they have \p{javaWhiteSpace}, which they *CLAIM* maps to Unicode
  whitespace, but they lie lie lie, because they exempt the super-common
  U+A0 plus U+2007 and U+202F from being white space.

* Like the fools use such lovely things as \p{javaJavaIdentifierStart},
  which besides its evil name, isn't something they actually use.
  Meaning, I can put control characters *including NULs and ESCs!!!!*
  in my Java idents, and everything chugs along just fine until I
  trigger your terminal's answer-back sequence.

* They can't even get casing right when they claim to.

* Their canonical equivalence is broken because of the damned
  Java preprocessor.

* You can't get back out the pattern that you originally compiled, and
  which the regex compiler changed without telling you how.

This is just the very tip of a huge festering pool of poorly engineered
errors that they're loth even to admit to, and next to intransigent about
fixing, for that would require acknowledging their own errors.

Java's claims of Unicode support are nothing but that​: claims. They
are unsubstantiable, and there is no will to fix the *SEVERAL DOZEN*
bugs I have discovered. The answer is inevitably one of these​:

  (1) Sure, but does anybody really care?

  (2) It's documented not to work, so we never have to make it work.

  (3) Yes, that's a shame, but fixing it would break backward
  compatibility.

  (4) That's only required for Level N+1 compliance, so we don't have
  to fix it.

There is an arrogance and insularity, and NIH ignorance, amongst the Sun
Java people the likes of which I don't think I've seen since the old IBM
days before the DEC wars. It is infuriating beyond description. I hope
Java dies dies dies.

Since they refuse to act on my double-digit worth of Unicode
bugs, I'm seriously considering creating a "tribute site" where I
air out their dirty laundry. That's how angry I am about this
putrid pile of Certifiably Regressive Agonizing Puke.

Sure, I may not be able to shame them into fixing anything, but I
can certainly shame them. That may be some small solace to me as
well as warning to others who've been suckered into this insane
piece of mindless bloatware.

  RESOLVED: DO NOT USE JAVA IF YOU NEED TO DEAL WITH UNICODE FULL STOP

Also, is the UTF-8 output encoding relevant to the match
results, or just their display?

Just to their display. The internals are all converted into UTFMH-16.

Here​: read 'em and weep.

--tom

@p5pRT p5pRT commented Dec 20, 2010

From tchrist@perl.com

import java.io.*;
import java.util.regex.*;

public class surotest {

private static PrintStream stdout;

private static void testmatch(String s, String re) {

  boolean found = Pattern.compile(re).matcher(s).find();

  stdout.printf("U+");
  for (int i = 0; i < s.length(); i++) {
    stdout.printf("%04X", s.codePointAt(i));
    if (s.codePointAt(i) > Character.MAX_VALUE) { i++; } // idiots
    if (i+1 < s.length()) { stdout.printf("."); }
  }

  stdout.printf(" %s​: ", found ? " TRUE" : "false", s);
  stdout.printf(" \"%s\" =~ /%s/\n", s, re);
}

public static void main(String[ ] args)
  throws IOException

{ // yes, this is intentionally outdented

  /*
  * note that encoding-mapping failures are suppressed,
  * as in fact are all errors with this PoS interface
  *
  * What fools these morons be!
  */

  stdout = new PrintStream(System.out, true, "UTF-8");

  String[] slist = {
  "\uD83D", // high surrogate half - invalid
  "\uD83D\uDC7E", // U+1F47E
  "\uDC7E", // low surrogate half - invalid
  "\uDC7E\uD83D", // wrong order
  "\uFDDD", // invalid
  "\uFFFF", // invalid
  };

  for (String s : slist) {
    stdout.println("");

    testmatch(s, ".");
    testmatch(s, "^.$");
    testmatch(s, "..");
    testmatch(s, "^..$");
    testmatch(s, "\\pC");
    testmatch(s, "\\p{Cs}");
    testmatch(s, "\\p{InHighSurrogates}");
    testmatch(s, "\\p{InLowSurrogates}");
    testmatch(s, "[^\\pL\\pM\\pN\\pP\\pS\\pZ\\pC]");
  }
}

}

@p5pRT p5pRT commented Jan 10, 2011

From @khwilliamson

The acceptance of surrogates no longer is dependent on warnings being
enabled or not. Now they are accepted except under strict input rules.

--Karl Williamson
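
For anyone wanting to check a given perl, the match from the original report can be rerun as a short script. This is only a sketch and assumes a perl that includes the change (5.14 or later should); the `eval` is there so that an old perl reports the fatal error instead of aborting.

```perl
use strict;
use warnings;    # warnings on, as in the failing one-liner from the report

# Build the string with utf8 warnings off, exactly as the report did.
my $s = do { no warnings 'utf8'; "\x{d800}" };

# The match itself runs with warnings enabled.
my $result = eval { $s =~ /\A[\x{123}]/ ? "yes" : "no" };
print defined $result ? "match result: $result\n" : "died: $@";
```

With the change in place the match should simply fail to match, with no fatal error, whether or not warnings are enabled.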

@p5pRT p5pRT commented Jan 10, 2011

@khwilliamson - Status changed from 'open' to 'resolved'
