quotemeta() fails to quote literal non-word character under utf8 #10602

p5pRT · 2010-09-02T19:58:13Z

Migrated from rt.perl.org#77654 (status was 'resolved')

Searchable as RT77654$

p5pRT · 2010-09-02T19:58:15Z

From mncharity@vendian.org

Created by mncharity@vendian.org

quotemeta() fails to quote a CENT SIGN when,
using utf8, the string is created with
a literal CENT SIGN character, instead of with \xA2 .

----

use utf8;
use Test;
plan( tests => 21 );

# Bug Synopsis

# quotemeta() fails to quote a CENT SIGN when,
# using utf8, the string is created with
# a literal CENT SIGN character, instead of with \xA2 .

ok("¢","\xA2"); # ok
ok(quotemeta("\xA2"),"\\¢"); # ok

ok(quotemeta("¢"),"\\¢"); # NOT OK
ok(quotemeta("¢"),quotemeta("\xA2")); # NOT OK

# Bug Demonstration

my $a = "¢";
my $b = "\xA2";
ok($a,$b);
ok(($a eq $b),1);
ok(quotemeta($a),quotemeta($b)); # NOT OK
my $quoted = "\\\xA2";
ok("\\".$a,$quoted);
ok("\\".$b,$quoted);
ok(quotemeta($a),$quoted); # NOT OK
ok(quotemeta($b),$quoted); # ok

# Additional notes

# CENT SIGN is \xA2
ok("¢","\xA2");
# CENT SIGN is not a word character
ok("a"=~/\w/,1);
ok("a"=~/\W/,"");
ok("¢"=~/\p{IsWord}/,"");
ok("¢"=~/\P{IsWord}/,1);
ok("¢"=~/\w/,"");
ok("¢"=~/\W/,1);
# Regexps behave correctly
my $s;
$s = "¢"; $s =~ s/([^A-Za-z_0-9])/\\$1/g;
ok($s,$quoted);
$s = "¢"; $s =~ s/(\P{IsWord})/\\$1/g;
ok($s,$quoted);
$s = "¢"; $s =~ s/(\W)/\\$1/g;
ok($s,$quoted);

----

1..21
# Running under perl version 5.012001 for linux
# Current time local: Thu Sep 2 15:42:28 2010
# Current time GMT: Thu Sep 2 19:42:28 2010
# Using Test.pm version 1.25_02
ok 1
ok 2
not ok 3
# Test 3 got: "\xA2" (./bug.pl at line 14)
# Expected: "\\\xA2"
# ./bug.pl line 14 is: ok(quotemeta("¢"),"\\¢"); # NOT OK
not ok 4
# Test 4 got: "\xA2" (./bug.pl at line 15)
# Expected: "\\\xA2"
# ./bug.pl line 15 is: ok(quotemeta("¢"),quotemeta("\xA2")); # NOT OK
ok 5
ok 6
not ok 7
# Test 7 got: "\xA2" (./bug.pl at line 24)
# Expected: "\\\xA2"
# ./bug.pl line 24 is: ok(quotemeta($a),quotemeta($b)); # NOT OK
ok 8
ok 9
not ok 10
# Test 10 got: "\xA2" (./bug.pl at line 28)
# Expected: "\\\xA2"
# ./bug.pl line 28 is: ok(quotemeta($a),$quoted); # NOT OK
ok 11
ok 12
ok 13
ok 14
ok 15
ok 16
ok 17
ok 18
ok 19
ok 20
ok 21

Perl Info


Flags:
     category=core
     severity=medium

Site configuration information for perl 5.12.1:

Configured by mncharity at Sun Jul  4 18:40:05 EDT 2010.

Summary of my perl5 (revision 5 version 12 subversion 1) configuration:

   Platform:
     osname=linux, osvers=2.6.32-22-generic, archname=x86_64-linux
     uname='linux pencil 2.6.32-22-generic #36-ubuntu smp thu jun 3 
19:31:57 utc 2010 x86_64 gnulinux '
     config_args='-des -Dprefix=/usr/local/perl512'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=undef, usemultiplicity=undef
     useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector 
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-fno-strict-aliasing -pipe -fstack-protector 
-I/usr/local/include'
     ccversion='', gccversion='4.4.3', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
     libs=-lnsl -ldl -lm -lcrypt -lutil -lc
     perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
     libc=/lib/libc-2.11.1.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.11.1'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib 
-fstack-protector'

Locally applied patches:



@INC for perl 5.12.1:
     /usr/local/perl512/lib/site_perl/5.12.1/x86_64-linux
     /usr/local/perl512/lib/site_perl/5.12.1
     /usr/local/perl512/lib/5.12.1/x86_64-linux
     /usr/local/perl512/lib/5.12.1
     .


Environment for perl 5.12.1:
     HOME=/home/mncharity
     LANG=en_US.utf8
     LANGUAGE (unset)
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
     PERL_BADLANG (unset)
     SHELL=/bin/bash

p5pRT · 2010-12-16T17:56:58Z

From @iabyn

On Thu, Sep 02, 2010 at 12:58:16PM -0700, Mitchell N Charity wrote:

quotemeta() fails to quote a CENT SIGN when,
using utf8, the string is created with
a literal CENT SIGN character, instead of with \xA2 .

This appears to be down to a difference in behaviour of quotemeta
depending on whether the string is internally UTF-8 encoded or not.

For non-utf8 strings, all chars *except* isALNUM() are \\-escaped; in
particular, chars with ords in the range 128-255 are always quoted.

For utf8 strings, chars with ord > 127 are never quoted. I think this
is a bug that needs fixing, but can anyone confirm or deny?
In particular this would be a significant change in behaviour, since
currently the myriad of codepoints above 255 are not escaped, including
"letters" from non-latin character ranges. I would assume that all these
should be quoted.

The current docs make it clear that all chars except [A-Za-z_0-9] should
be escaped.
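
The dependence on internal encoding described above can be checked directly. What follows is a minimal sketch, not code from the ticket: it uses the core utf8::upgrade and utf8::is_utf8 functions to build two equal strings that differ only in their internal representation. The variable names are illustrative. Whether the two quotemeta results agree depends on the perl version in use, so the script reports that rather than assuming it.

```perl
use strict;
use warnings;

# Two copies of the same one-character string (CENT SIGN, U+00A2).
my $bytes = "\xA2";      # stored internally as a single byte
my $wide  = "\xA2";
utf8::upgrade($wide);    # same character, now stored internally as UTF-8

# The strings compare equal; only the internal UTF-8 flag differs.
my $same_text  = ($bytes eq $wide)       ? 1 : 0;
my $bytes_flag = utf8::is_utf8($bytes)   ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)    ? 1 : 0;

printf "equal=%d bytes_flag=%d wide_flag=%d\n",
    $same_text, $bytes_flag, $wide_flag;

# Version dependent: on an affected perl these two results differ.
printf "quotemeta agrees: %s\n",
    quotemeta($bytes) eq quotemeta($wide) ? "yes" : "no";
```

A "no" on the last line would correspond to the behaviour reported in this ticket.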

--
Monto Blanco... scorchio!

p5pRT · 2010-12-16T17:56:58Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2010-12-16T18:07:28Z

From tchrist@perl.com

For utf8 strings, chars with ord > 127 are never quoted. I think this
is a bug that needs fixing, but can anyone confirm or deny?

I believe Unicode makes some guarantees regarding the stability of the
Pattern_Syntax set of characters for just such an occasion, but one should
ask Karl for details there.

--tom

p5pRT · 2010-12-16T19:48:10Z

From tchrist@perl.com

For utf8 strings, chars with ord > 127 are never quoted. I think this
is a bug that needs fixing, but can anyone confirm or deny?

I've been thinking about this a bit more, rereading UAX#31, UTS#18, and
UTR#18. My first thought was that for Unicode, that all \W characters
should be quotemeta'd no matter what block their code points fall into.
(That may be a problem, though, as I'll explain below.)

I *think* that is what Dave is suggesting, not that merely all [^\x00-\x7F]
also be quoted, since that would violate certain first principles of what
is and is not a metacharacter in a Perl pattern: see elaboration underneath
my signature.

But I have encountered a problem with that idea. Unicode defines certain
characters as being Pattern_Syntax characters. It also defines certain
characters as being Pattern_White_Space characters. It further guarantees
that this set will never change, so that you can future-proof your program.

The important bits are from:

http://unicode.org/reports/tr31/#Pattern_Syntax

As of Unicode4.1, two Unicode character properties are defined to
provide for stable syntax: Pattern_White_Space and Pattern_Syntax.
Particular pattern languages may, of course, override these
recommendations, for example, by adding or removing other characters
for compatibility with ASCII usage.

For stability, the values of these properties are absolutely invariant,
not changing with successive versions of Unicode. Of course, this does
not limit the ability of the Unicode Standard to encode more symbol or
whitespace characters, but the syntax and whitespace code points
recommended for use in patterns will not change.

When *generating* rules or patterns, all whitespace and syntax code
points that are to be literals require quoting, using whatever
quoting mechanism is available. For readability, it is recommended
practice to quote or escape all literal whitespace and default
ignorable code points as well.

There's more there, which should probably be studied before we
do anything.

One would think that backslashing all \p{Pattern_Syntax} characters
would be the right thing to do. There are 2417 Pattern_Syntax code
points, all of which are in the BMP, none in the astral planes. But
there is one code point which is both \w and yet considered pattern
syntax; it's a \p{Lm} character:

% unichars -c '\p{Pattern_Syntax}' '\w'
ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

There are also two code points that are Pattern_White_Space but not
White_Space:

% unichars -c '\p{Pattern_White_Space}' '\P{White_Space}'
-- 8206 200E GC=Cf LEFT-TO-RIGHT MARK
-- 8207 200F GC=Cf RIGHT-TO-LEFT MARK

I'm not sure what to make of these.
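
The property memberships mentioned above can be probed from Perl itself. This is a small sketch, assuming a perl whose Unicode tables expose the Pattern_Syntax, Pattern_White_Space and White_Space properties under those names; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Code points from the discussion, written as escapes so the source stays ASCII.
my $cent = "\x{A2}";      # CENT SIGN
my $lrm  = "\x{200E}";    # LEFT-TO-RIGHT MARK

my $cent_is_syntax    = $cent =~ /\p{Pattern_Syntax}/      ? 1 : 0;
my $letter_is_syntax  = "a"   =~ /\p{Pattern_Syntax}/      ? 1 : 0;
my $lrm_is_pattern_ws = $lrm  =~ /\p{Pattern_White_Space}/ ? 1 : 0;
my $lrm_is_ws         = $lrm  =~ /\p{White_Space}/         ? 1 : 0;

print "CENT SIGN Pattern_Syntax:           $cent_is_syntax\n";
print "'a' Pattern_Syntax:                 $letter_is_syntax\n";
print "U+200E Pattern_White_Space:         $lrm_is_pattern_ws\n";
print "U+200E White_Space:                 $lrm_is_ws\n";
```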

For what it's worth (which probably is nothing), there are 63
\p{Default_Ignorable_Code_Point} chars in the BMP, or 49 if you
discount \p{Cn} and HANGUL FILLER. There are rather more than
that up in the astral planes because of the TAG and VARIATION
SELECTOR stuff in the 0E0000 plane.

I believe there to be no changes to the sets of things I've
talked about here for Unicode 6.0; at least, I could find none.

--tom

ELABORATION:

The reason Perl quotes all \W characters is because of first principles
about what is and is not able to be a metacharacter in Perl patterns.

That principle is that, in patterns:

* a \w character never means anything special
* a \W character might mean something special

Whence it follows that

* backslashing a \w character might mean something special
* backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

\ | ( ) [ { ^ $ * + ? .
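
As a minimal sketch of the principle above (an illustration, not code from the thread): each of the twelve metacharacters, once backslashed by quotemeta, matches only itself, while ASCII word characters pass through quotemeta unchanged. The variable names are illustrative.

```perl
use strict;
use warnings;

# The twelve metacharacters listed above.
my @dozen = ('\\', '|', '(', ')', '[', '{', '^', '$', '*', '+', '?', '.');

# Each one should come back as backslash + itself, and the quoted
# form should match the bare character literally.
my $all_literal = 1;
for my $c (@dozen) {
    my $re = quotemeta($c);
    $all_literal = 0 unless $re eq "\\$c" && $c =~ /\A$re\z/;
}

# ASCII word characters are left alone.
my $word = join '', 'A' .. 'Z', 'a' .. 'z', '0' .. '9', '_';
my $word_untouched = quotemeta($word) eq $word ? 1 : 0;

print "metacharacters quoted literally: $all_literal\n";
print "word characters untouched:       $word_untouched\n";
```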

The question becomes whether we want the flexibility to someday extend
our set of metacharacters beyond those 12. The quotemeta behavior of
backslashing any and all \W characters no matter what, while always
leaving inviolate all \w characters, was designed to provide for that.

We've never drawn upon the \W reservoir for other pattern matching
operations in Perl5, but Perl6 has. For one thing, it uses "<expr>"
for circumfix quoting of subrules, as in Perl5 one uses "(?&expr)", so
both "<" and ">" are metachars. It also uses this notation for
Unicode properties, with a colon in front of the property name:

<:Letter> # \pL
<:!Letter> # \PL

<:East_Asian_Width<Narrow>> # \p{EA=N}
<:!Blk<ASCII>> # \P{Blk=ASCII}

For another thing, Perl6 uses "~" for matching nested subrules and uses
"&" for conjunction. Both of those, and "|", can also be doubled, but
doubling doesn't extend the set. The "~~" can be negated with a "!~~",
but all those are still ASCII \W characters.

I do not know whether one can add new metacharacters in Perl6 patterns.
I wouldn't put it past them, considering you can do so for regular
operators, but I can't figure out whether you can. If somebody reading
http://perlcabal.org/syn/S05.html can find something that says for sure
that this either *is* or else that it is *not* possible, I'd be
interested in knowing.

I have proposed that we adopt a way to specify character class union,
intersection, and subtraction. The Unicode documents talk about these
using simple + and -, which one can actually use in Perl when defining
one's own property subroutines, like

sub IsKana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

which was used back before we had a proper Kana property (we now do).
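
For illustration, here is a sketch of how such a user-defined property is used in a match. It assumes a perl that still resolves the utf8::InHiragana, utf8::InKatakana and utf8::IsCn names inside a property subroutine; the test characters are chosen only as examples.

```perl
use strict;
use warnings;

# User-defined property: the union of two blocks, minus unassigned code points.
# The subroutine name must start with "In" or "Is".
sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
my $katakana_a = "\x{30A2}";    # KATAKANA LETTER A

my $hira_ok  = $hiragana_a =~ /\p{IsKana}/ ? 1 : 0;
my $kata_ok  = $katakana_a =~ /\p{IsKana}/ ? 1 : 0;
my $latin_ok = "a"         =~ /\p{IsKana}/ ? 1 : 0;

print "hiragana=$hira_ok katakana=$kata_ok latin=$latin_ok\n";
```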

Even if we did something like Java's character class set mechanics (as I
have informally proposed), which uses [a[b]] for union, [a&&b] for
intersection, and the ungainly [a&&[^b]] for subtraction because [a-b]
was already taken, whatever we may elect to do would likely fall within
square brackets and so follow a different ruleset.

The Unicode documents use a cleaner syntax than Java's for talking
about these things, looking more like Perl6, although not quite the
same; for example, Unicode has separate "--" and "~~" operators for set
and symmetric difference respectively.

p5pRT · 2010-12-16T21:09:23Z

From @khwilliamson

Tom Christiansen wrote:

For utf8 strings, chars with ord > 127 are never quoted. I think this
is a bug that needs fixing, but can anyone confirm or deny?

I've been thinking about this a bit more, rereading UAX#31, UTS#18, and
UTR#18.

I believe that UTR18 and UTS18 are now the same document.

My first thought was that for Unicode, that all \W characters

should be quotemeta'd no matter what block their code points fall into.
(That may be a problem, though, as I'll explain below.)

I *think* that is what Dave is suggesting, not that merely all [^\x00-\x7F]
also be quoted, since that would violate certain first principles of what
is and is not a metacharacter in a Perl pattern: see elaboration underneath
my signature.

But I have encountered a problem with that idea. Unicode defines certain
characters as being Pattern_Syntax characters. It also defines certain
characters as being Pattern_White_Space characters. It further guarantees
that this set will never change, so that you can future-proof your program.

Note that some of the code points in the sets are still unassigned, so
that gives Unicode some leeway to add things.

The important bits are from:

http://unicode.org/reports/tr31/#Pattern_Syntax

As of Unicode4.1, two Unicode character properties are defined to
provide for stable syntax: Pattern_White_Space and Pattern_Syntax.
Particular pattern languages may, of course, override these
recommendations, for example, by adding or removing other characters
for compatibility with ASCII usage.

For stability, the values of these properties are absolutely invariant,
not changing with successive versions of Unicode. Of course, this does
not limit the ability of the Unicode Standard to encode more symbol or
whitespace characters, but the syntax and whitespace code points
recommended for use in patterns will not change.

When *generating* rules or patterns, all whitespace and syntax code
points that are to be literals require quoting, using whatever
quoting mechanism is available. For readability, it is recommended
practice to quote or escape all literal whitespace and default
ignorable code points as well.

There's more there, which should probably be studied before we
do anything.

One would think that backslashing all \p{Pattern_Syntax} characters
would be the right thing to do. There are 2417 Pattern_Syntax code
points, all of which are in the BMP, none in the astral planes. But
there is one code point which is both \w and yet considered pattern
syntax; it's a \p{Lm} character:

% unichars -c '\p{Pattern_Syntax}' '\w'
 ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

I have emailed Unicode about this apparent discrepancy.

There are also two code points that are Pattern_White_Space but not
White_Space:

% unichars -c '\p{Pattern_White_Space}' '\P{White_Space}'
-- 8206 200E GC=Cf LEFT-TO-RIGHT MARK
-- 8207 200F GC=Cf RIGHT-TO-LEFT MARK

I'm not sure what to make of these.

They are, however, default ignorable code points, so it is recommended
that they be quoted. See the discussion in section 2.3 of #31. Some
implementations might want to allow them; I imagine that is why they
aren't pattern white space.

For what it's worth (which probably is nothing), there are 63
\p{Default_Ignorable_Code_Point} chars in the BMP, or 49 if you
discount \p{Cn} and HANGUL FILLER. There are rather more than
that up in the astral planes because of the TAG and VARIATION
SELECTOR stuff in the 0E0000 plane.

I believe there to be no changes to the sets of things I've
talked about here for Unicode 6.0; at least, I could find none.

--tom

ELABORATION:

The reason Perl quotes all \W characters is because of first principles
about what is and is not able to be a metacharacter in Perl patterns.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

The question becomes whether we want the flexibility to someday extend
our set of metacharacters beyond those 12. The quotemeta behavior of
backslashing any and all \W characters no matter what, while always
leaving inviolate all \w characters, was designed to provide for that.

We've never drawn upon the \W reservoir for other pattern matching
operations in Perl5, but Perl6 has. For one thing, it uses "<expr>"
for circumfix quoting of subrules, as in Perl5 one uses "(?&expr)", so
both "<" and ">" are metachars. It also uses this notation for
Unicode properties, with a colon in front of the property name:

    <:Letter>                           # \pL
    <:!Letter>                          # \PL

    <:East_Asian_Width<Narrow>>         # \p{EA=N}
    <:!Blk<ASCII>>                      # \P{Blk=ASCII}

For another thing, Perl6 uses "~" for matching nested subrules and uses
"&" for conjunction. Both of those, and "|", can also be doubled, but
doubling doesn't extend the set. The "~~" can be negated with a "!~~",
but all those are still ASCII \W characters.

I do not know whether one can add new metacharacters in Perl6 patterns.
I wouldn't put it past them, considering you can do so for regular
operators, but I can't figure out whether you can. If somebody reading
http://perlcabal.org/syn/S05.html can find something that says for sure
that this either *is* or else that it is *not* possible, I'd be
interested in knowing.

I have proposed that we adopt a way to specify character class union,
intersection, and subtraction. The Unicode documents talk about these
using simple + and -, which one can actually use in Perl when defining
one's own property subroutines, like

sub IsKana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

which was used back before we had a proper Kana property (we now do).

Even if we did something like Java's character class set mechanics (as I
have informally proposed), which uses [a[b]] for union, [a&&b] for
intersection, and the ungainly [a&&[^b]] for subtraction because [a-b]
was already taken, whatever we may elect to do would likely fall within
square brackets and so follow a different ruleset.

The Unicode documents use a cleaner syntax than Java's for talking
about these things, looking more like Perl6, although not quite the
same; for example, Unicode has separate "--" and "~~" operators for set
and symmetric difference respectively.

So I don't know what to do. This may be complicated by the fact that
Perl botched what are considered identifiers. My guess from the
comments is that it stems from the fact that Unicode botched the
definition of alpha between v1.9 and 3.0.1. sprout has gone in in 5.13
and fixed the definition so that it doesn't hang the parser, but for
backwards compatibility, it doesn't match the Unicode identifier
definition, and that is somewhat bothersome to me.

The Unicode recommendation is to only quote the pattern white space and
identifier characters plus the default ignorable code points. That
means most controls would not get quoted.

I believe Tom has a better handle on the implications than me. I await
his further ideas.

p5pRT · 2010-12-16T22:31:02Z

From tchrist@perl.com

SUMMARY: I believe that if nothing substantial can be gained by
using the broader \W over what UAX#31 says to quote, we
should use UAX#31's suggestions to implement quotemeta()
and \Q on Unicode.

I also think we should use those suggestions if there were some
error that \W might introduce. The code points I show below
which are both on UAX#31's things to quote list but which also
happen to be \w characters suggest that there may be.

Karl wrote:

One would think that backslashing all \p{Pattern_Syntax} characters
would be the right thing to do. There are 2417 Pattern_Syntax code
points, all of which are in the BMP, none in the astral planes. But
there is one code point which is both \w and yet considered pattern
syntax; it's a \p{Lm} character:
% unichars -c '\p{Pattern_Syntax}' '\w'
 ⸯ 11823 2E2F GC=Lm VERTICAL TILDE
I don't know whether that is a mistake or not. Karl?

I have emailed Unicode about this apparent discrepancy.

Good, thank you.

So I don't know what to do. This may be complicated by the fact that
Perl botched what are considered identifiers. My guess from the
comments is that it stems from the fact that Unicode botched the
definition of alpha between v1.9 and 3.0.1. sprout has gone in in
5.13 and fixed the definition so that it doesn't hang the parser, but
for backwards compatibility, it doesn't match the Unicode identifier
definition, and that is somewhat bothersome to me.

Could you please explain what that means, that Unicode botched the
definition of alpha between v1.9 and 3.0.1?

My working definitions of an alpha and an identifier charclass
in Java work out to these:

alphabetic_charclass =
"["
+ "\\pL" /* all Letters */
+ "\\pM" /* all Marks */
+ "\\p{Nl}" /* Letter Number */
+ "]";

identifier_charclass =
"["
+ "\\pL" /* all Letters */
+ "\\pM" /* all Marks */
+ "\\p{Nd}" /* Decimal Number */
+ "\\p{Nl}" /* Letter Number */
+ "\\p{Pc}" /* Connector Punctuation */
+ "[" /* or else chars which are both */
+ "\\p{InEnclosedAlphanumerics}"
+ "&&" /* and also */
+ "\\p{So}" /* Other Symbol */
+ "]"
+ "]";

Now, that's not quite the way #31's section 2 reads, but it may be
(close to) equivalent; I haven't checked. Hm, I'm pretty sure that I
have ZWJ and ZWNJ issues there, something that I addressed in working
out extended grapheme clusters but never backported to regular old
identifier-class characters.

What part of the sense of "alpha" or "identifier" did Perl and Unicode
part ways on? Is this perhaps only in Perl's parser, not in its notions
of properties? Does it have to do with #31 section 2, or something else?

The Unicode recommendation is to only quote the pattern white space and
identifier characters plus the default ignorable code points. That
means most controls would not get quoted.

That would save on space.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

Modulo the problematic U+2E2F, I believe that quoting all \W characters
is both safe and a proper superset of UAX#31. My only question is whether
there is anything to be gained by reducing that superset down to quoting
only those code points with any of

Pattern_Syntax
Pattern_White_Space
Default_Ignorable_Code_Point

Let's for this discussion call those the Pattern_Quotable set, or PQ.
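
To make the comparison concrete, here is a hedged sketch of the two candidate quoting rules written as plain substitutions. The function names quote_pq and quote_nonword are hypothetical and exist only for this illustration; neither is how quotemeta is implemented. The sketch assumes a perl that knows the three properties by the names used in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash only the "PQ" code points.
sub quote_pq {
    my ($s) = @_;
    $s =~ s/([\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}])/\\$1/g;
    return $s;
}

# Hypothetical helper: backslash every \W character.
sub quote_nonword {
    my ($s) = @_;
    $s =~ s/(\W)/\\$1/g;
    return $s;
}

# Punctuation and whitespace are quoted under both rules.
my $pq_plain = quote_pq('a.b c');
my $nw_plain = quote_nonword('a.b c');

# A control character such as U+0001 is \W but not in the PQ set,
# so only the \W rule quotes it.
my $pq_ctrl = quote_pq("\x01");
my $nw_ctrl = quote_nonword("\x01");

printf "PQ rule: %s   \\W rule: %s\n", $pq_plain, $nw_plain;
printf "control quoted by PQ rule: %s, by \\W rule: %s\n",
    ($pq_ctrl ne "\x01" ? "yes" : "no"),
    ($nw_ctrl ne "\x01" ? "yes" : "no");
```

The second pair shows the space trade-off under discussion: the \W rule adds a backslash in cases where the PQ rule leaves the character alone.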

The considerations are time and space. On space, there are certainly more
\W characters than there are PQ characters. In the BMP:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
2475

% unichars '\W' | wc -l
4137

Adding Unassigned, PrivateUse, Han, and InHangulSyllables produces
a slight gain on the PQ set:

% unichars -u '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
2832

And of course a substantial gain on the \W set:

% unichars -u '\W' | wc -l
13556

I am somewhat surprised to see more identifier characters
in the PQ set than I had reckoned with:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' '\w'
͏ 847 034F COMBINING GRAPHEME JOINER
ᅟ 4447 115F HANGUL CHOSEONG FILLER
ᅠ 4448 1160 HANGUL JUNGSEONG FILLER
᠋ 6155 180B MONGOLIAN FREE VARIATION SELECTOR ONE
᠌ 6156 180C MONGOLIAN FREE VARIATION SELECTOR TWO
᠍ 6157 180D MONGOLIAN FREE VARIATION SELECTOR THREE
ⸯ 11823 2E2F VERTICAL TILDE
ㅤ 12644 3164 HANGUL FILLER
︀ 65024 FE00 VARIATION SELECTOR-1
︁ 65025 FE01 VARIATION SELECTOR-2
︂ 65026 FE02 VARIATION SELECTOR-3
︃ 65027 FE03 VARIATION SELECTOR-4
︄ 65028 FE04 VARIATION SELECTOR-5
︅ 65029 FE05 VARIATION SELECTOR-6
︆ 65030 FE06 VARIATION SELECTOR-7
︇ 65031 FE07 VARIATION SELECTOR-8
︈ 65032 FE08 VARIATION SELECTOR-9
︉ 65033 FE09 VARIATION SELECTOR-10
︊ 65034 FE0A VARIATION SELECTOR-11
︋ 65035 FE0B VARIATION SELECTOR-12
︌ 65036 FE0C VARIATION SELECTOR-13
︍ 65037 FE0D VARIATION SELECTOR-14
︎ 65038 FE0E VARIATION SELECTOR-15
️ 65039 FE0F VARIATION SELECTOR-16
ﾠ 65440 FFA0 HALFWIDTH HANGUL FILLER

I believe Tom has a better handle on the implications than me.

Maybe, maybe not.

I await his further ideas.

It would save us on space to make quotemeta work on the smaller PQ set
than on the entire \W set, but would it save us anything on time?

I mean apart from the obvious that it takes time to allocate more stuff
that would need quoting; I mean the time involved in looking up properties.

I don't really know much how the swatches work, nor the true costs of
looking up properties in general, so I can't guess there. My instinct
is just to quickly backslash any \W, but that's an ASCII-only instinct
established before we had guidelines for the PQ set, let alone for valid
identifier characters.

If there is nothing substantial to be gained by using the broader \W over
the previously defined Pattern_Quotable set, then I think we should use PQ.
I also think we should use PQ if there were some error that \W might
introduce; the outliers that are both PQ and \w suggest there might be.

--tom

p5pRT · 2010-12-16T23:25:56Z

From @khwilliamson

Tom Christiansen wrote:

SUMMARY: I believe that if nothing substantial can be gained by
using the broader \W over what UAX#31 says to quote, we
should use UAX#31's suggestions to implement quotemeta()
and \Q on Unicode.

I also think we should use those suggestions if there were some
error that \W might introduce. The code points I show below
which are both on UAX#31's things to quote list but which also
happen to be \w characters suggest that there may be.

Karl wrote:
One would think that backslashing all \p{Pattern_Syntax} characters
would be the right thing to do. There are 2417 Pattern_Syntax code
points, all of which are in the BMP, none in the astral planes. But
there is one code point which is both \w and yet considered pattern
syntax; it's a \p{Lm} character:
% unichars -c '\p{Pattern_Syntax}' '\w'
 ⸯ 11823 2E2F GC=Lm VERTICAL TILDE
I don't know whether that is a mistake or not. Karl?
I have emailed Unicode about this apparent discrepancy.

Good, thank you.

I've gotten a (rapid) preliminary response. Their definition of \w
appears to be flawed, and likely should be revised to exclude U+2E2F.

So I don't know what to do. This may be complicated by the fact that
Perl botched what are considered identifiers. My guess from the
comments is that it stems from the fact that Unicode botched the
definition of alpha between v1.9 and 3.0.1. sprout has gone in in
5.13 and fixed the definition so that it doesn't hang the parser, but
for backwards compatibility, it doesn't match the Unicode identifier
definition, and that is somewhat bothersome to me.

Could you please explain what that means, that Unicode botched the
definition of alpha between v1.9 and 3.0.1?

Here are my comments in mktables, added when I researched the problem:
# The number of code points in \p{alpha} halved in 2.1.9. It turns out
# that the reason is that the CJK block starting at 4E00 was removed
# from PropList, and was not put back in until 3.1.0

And here are the comments from handy.h:
/* The ID_Start of Unicode was originally quite limiting: it assumed an
* L-class character (meaning that you could not have, say, a CJK charac-
* ter). So, instead, perl has for a long time allowed ID_Continue but
* not digits.
* We still preserve that for backward compatibility. But we also make sure
* that it is alphanumeric, so S_scan_word in toke.c will not hang. See
* http://rt.perl.org/rt3/Ticket/Display.html?id=74022
* for more detail than you ever wanted to know about. */

My working definitions of an alpha and an identifier charclass
in Java work out to these:

alphabetic_charclass =
       "["
     +      "\\pL"            /* all Letters    */
     +      "\\pM"            /* all Marks      */
     +      "\\p{Nl}"         /* Letter Number  */
     + "]";

identifier_charclass =
        "["
     +      "\\pL"          /* all Letters      */
     +      "\\pM"          /* all Marks        */
     +      "\\p{Nd}"       /* Decimal Number   */
     +      "\\p{Nl}"       /* Letter Number    */
     +      "\\p{Pc}"       /* Connector Punctuation           */
     +      "["             /*    or else chars which are both */
     +          "\\p{InEnclosedAlphanumerics}"
     +          "&&"          /*    and also      */
     +          "\\p{So}"   /* Other Symbol     */
     +      "]"
     +  "]";

Now, that's not quite the way #31's section 2 reads, but it may be
(close to) equivalent; I haven't checked. Hm, I'm pretty sure that I
have ZWJ and ZWNJ issues there, something that I addressed in working
out extended grapheme clusters but never backported to regular old
identifier-class characters.

What part of the sense of "alpha" or "identifier" did Perl and Unicode
part ways on? Is this perhaps only in Perl's parser, not in its notions
of properties? Does it have to do with #31 section 2, or something else?

See the comments above. Perl doesn't use IDStart at all. Instead it
currently uses:
#define isIDFIRST_utf8(p) \
(is_utf8_idcont(p) && !is_utf8_digit(p) && is_utf8_alnum(p))

That definition has only been in place for some of the 5.13.X releases.
Prior to that, the definition was:
#define isIDFIRST_utf8(p) (is_utf8_idcont(p) && !is_utf8_digit(p))

This caused the parser to loop on some inputs. The details are in the
trouble ticket mentioned above. My own view is that it would be better
to move to the Unicode definition, but there is the backward
compatibility issue.
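
As a rough Perl-level approximation of the newer macro above (a sketch only, not the C implementation): the function name is hypothetical, and the mapping of is_utf8_idcont to \p{ID_Continue}, is_utf8_digit to \p{Nd}, and is_utf8_alnum to \w is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical approximation of the 5.13-era isIDFIRST_utf8 test:
# ID_Continue, but not a decimal digit, and also a word character.
sub is_idfirst_sketch {
    my ($ch) = @_;
    return 0 unless defined $ch && length($ch) == 1;
    return ( $ch =~ /\p{ID_Continue}/
          && $ch !~ /\p{Nd}/
          && $ch =~ /\w/ ) ? 1 : 0;
}

my $letter = is_idfirst_sketch('a');    # a letter can start an identifier
my $digit  = is_idfirst_sketch('1');    # a digit cannot
my $hyphen = is_idfirst_sketch('-');    # neither can punctuation outside ID_Continue

print "a=$letter 1=$digit -=$hyphen\n";
```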

The Unicode recommendation is to only quote the pattern white space and
identifier characters plus the default ignorable code points. That
means most controls would not get quoted.

That would save on space.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

Modulo the problematic U+2E2F, I believe that quoting all \W characters
is both safe and a proper superset of UAX#31. My only question is whether
there is anything to be gained by reducing that superset down to quoting
only those code points with any of

Pattern_Syntax
Pattern_White_Space
Default_Ignorable_Code_Point

Let's for this discussion call those the Pattern_Quotable set, or PQ.

The considerations are time and space. On space, there are certainly more
\W characters than there are PQ characters. In the BMP:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
    2475

% unichars '\W' | wc -l
    4137

Adding Unassigned, PrivateUse, Han, and InHangulSyllables produces
a slight gain on the PQ set:

% unichars -u '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
    2832

And of course a substantial gain on the \W set:

% unichars -u '\W' | wc -l
   13556
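
For anyone without unichars installed, a rough equivalent of these counts can be sketched in core Perl. This is only an approximation: it walks every BMP code point (assigned or not), so the totals will not match the figures above and will vary with the Unicode version bundled with the perl in use. The /u flag assumes 5.14 or later, and the hash keys are just illustrative names.

```perl
use strict;
use warnings;

# The three properties that make up the "PQ" set in this discussion.
my $pq      = qr/[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]/u;
my $nonword = qr/\W/u;   # /u forces Unicode rules even for chr(128)..chr(255)

my %count = (pq => 0, nonword => 0, both_pq_and_word => 0);
for my $cp (0 .. 0xFFFF) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
    my $ch    = chr $cp;
    my $is_pq = $ch =~ $pq;
    $count{pq}++               if $is_pq;
    $count{nonword}++          if $ch =~ $nonword;
    $count{both_pq_and_word}++ if $is_pq && $ch =~ /\w/u;
}
printf "%-17s %d\n", $_, $count{$_} for sort keys %count;
```

The last counter corresponds to the outliers that are both PQ and \w.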

I am somewhat surprised to see more identifier characters
in the PQ set than I had reckoned with:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' '\w'
 ͏   847 034F COMBINING GRAPHEME JOINER
 ᅟ  4447 115F HANGUL CHOSEONG FILLER
 ᅠ  4448 1160 HANGUL JUNGSEONG FILLER
 ᠋  6155 180B MONGOLIAN FREE VARIATION SELECTOR ONE
 ᠌  6156 180C MONGOLIAN FREE VARIATION SELECTOR TWO
 ᠍  6157 180D MONGOLIAN FREE VARIATION SELECTOR THREE
 ⸯ 11823 2E2F VERTICAL TILDE
 ㅤ 12644 3164 HANGUL FILLER
 ︀ 65024 FE00 VARIATION SELECTOR-1
 ︁ 65025 FE01 VARIATION SELECTOR-2
 ︂ 65026 FE02 VARIATION SELECTOR-3
 ︃ 65027 FE03 VARIATION SELECTOR-4
 ︄ 65028 FE04 VARIATION SELECTOR-5
 ︅ 65029 FE05 VARIATION SELECTOR-6
 ︆ 65030 FE06 VARIATION SELECTOR-7
 ︇ 65031 FE07 VARIATION SELECTOR-8
 ︈ 65032 FE08 VARIATION SELECTOR-9
 ︉ 65033 FE09 VARIATION SELECTOR-10
 ︊ 65034 FE0A VARIATION SELECTOR-11
 ︋ 65035 FE0B VARIATION SELECTOR-12
 ︌ 65036 FE0C VARIATION SELECTOR-13
 ︍ 65037 FE0D VARIATION SELECTOR-14
 ︎ 65038 FE0E VARIATION SELECTOR-15
 ️ 65039 FE0F VARIATION SELECTOR-16
 ﾠ 65440 FFA0 HALFWIDTH HANGUL FILLER

There's something wrong if this includes only the first 16 variation
selectors as all 256 are Default Ignorable. If you don't include the DI
characters in PQ, I suspect you get close to your reckoning.

I believe Tom has a better handle on the implications than me.

Maybe, maybe not.

I await his further ideas.

It would save us on space to make quotemeta working on the smaller PQ set
than on the entire \W set, but would it save us anything on time?

I mean apart from the obvious that it takes time to allocate more stuff
that would need quoting; I mean the time involved in looking up properties.

I don't really know much about how the swashes work, nor the true costs of
looking up properties in general, so I can't guess there. My instinct
is to just quickly backslash any \W, but that's an ASCII-only instinct
established before we had guidelines for the PQ set, let alone for valid
identifier characters.

If there is nothing substantial to be gained by using the broader \W over
the previously defined Pattern_Quotable set, then I think we should use PQ.
I also think we should use PQ if there were some error that \W might
introduce; the outliers that are both PQ and \w suggest there might be.

I think the differences in time/space are in the noise. Swashes aren't
the right data structure to use anyway, and I'm planning to replace them
for 5.16, but even if that doesn't happen in that release, we shouldn't
base this decision on something we intend to remove.

--tom

p5pRT · 2010-12-17T00:13:29Z

From @ikegami

On Thu, Dec 16, 2010 at 2:47 PM, Tom Christiansen <tchrist@perl.com> wrote:

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:
    \ | ( ) [ { ^ $ * + ? .

"-" and "^" are meta in certain positions.

/[^a]/ vs /[\^a]/
/[a-c]/ vs /[a\-c]/
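
The character-class point is easy to check. The following is a minimal sketch (variable names are only illustrative) that interpolates a string containing "^" and "-" into a bracketed class, with and without quotemeta:

```perl
use strict;
use warnings;

# Characters we want to match literally inside a bracketed class.
my $raw    = '^a-c';
my $quoted = quotemeta($raw);     # ^ and - each gain a backslash

# Unquoted: "^" negates the class and "a-c" is a range.
my $unsafe = qr/[$raw]/;
# Quoted: the class holds the four literal characters ^ a - c.
my $safe   = qr/[$quoted]/;

for my $ch ('^', 'a', 'b', '-', 'z') {
    printf "%-3s unquoted=%d quoted=%d\n", $ch,
        ($ch =~ $unsafe ? 1 : 0), ($ch =~ $safe ? 1 : 0);
}
```

Because quotemeta backslashes every non-word ASCII character, it covers these position-dependent cases as well as the twelve metacharacters listed above.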

p5pRT · 2010-12-17T00:14:34Z

From tchrist@perl.com

"-" and "^" are meta in certain positions.

/[^a]/ vs /[\^a]/
/[a-c]/ vs /[a\-c]/

I elsewhere wrote that charclasses operate under different rules.

--tom

p5pRT · 2010-12-17T10:06:12Z

From @Abigail

On Thu, Dec 16, 2010 at 12:47:21PM -0700, Tom Christiansen wrote:

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

I've always wondered why a lone } or ] does not need escaping (they're
only special after an opening { or [ has been seen), but a lone ) does.

The question becomes whether we want the flexibility to someday extend
our set of metacharacters beyond those 12.  The quotemeta behavior of
backslashing any and all \W characters no matter what, while always
leaving inviolate all \w characters, was designed to provide for that.

We've never drawn upon the \W reservoir for other pattern matching
operations in Perl5, but Perl6 has.

And I don't think Perl5 ever will. There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
that it seems unlikely p5p will ever judge the advantages of using a new
\W character for metapurposes to outweigh the downside of breaking code.

Abigail

p5pRT · 2010-12-17T15:12:12Z

From tchrist@perl.com

I've always wondered why a lone } or ] does not need escaping (they're
only special after an opening { or [ has been seen), but a lone ) does.

So have I. It could be worse: things like quantifiers still
need escaping to be made literals even if they couldn't quantify
something, such as at the beginning of a string. A (poor) argument
could be made that in such a position, escaping isn't necessary
to infer function, and it seems to me some nasty regex dialects
do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

--tom

p5pRT · 2010-12-29T10:58:22Z

From @iabyn

On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:

I've always wondered why a lone } or ] does not need escaping (they're
only special after an opening { or [ has been seen), but a lone ) does.

So have I. It could be worse: things like quantifiers still
need escaping to be made literals even if they couldn't quantify
something, such as at the beginning of a string. A (poor) argument
could be made that in such a position, escaping isn't necessary
to infer function, and it seems to me some nasty regex dialects
do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

Ok. How about the following resolution: we change it so that utf8 strings
get chr(128)-chr(255) escaped, so that it matches the non-utf8 case, and
leave chars > 255 unescaped. In some future world if chars > 255 start
having special meaning to the regex engine, then we start escaping them
too.
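
As a rough illustration of that resolution, here is a pure-Perl sketch. The function name is hypothetical and this is not the core implementation; it only shows the intended set of escaped code points.

```perl
use strict;
use warnings;

# Hypothetical helper, not core code: escape every non-word ASCII
# character and everything in chr(128)..chr(255), whatever the internal
# encoding of the string, and leave code points above 255 alone.
sub quotemeta_upto_255 {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9_\x{100}-\x{10FFFF}])/\\$1/g;
    return $str;
}

my $bytes = "\xA2";                       # CENT SIGN, not UTF-8 encoded
my $chars = "\xA2";
utf8::upgrade($chars);                    # same character, UTF-8 encoded

print quotemeta_upto_255($bytes) eq quotemeta_upto_255($chars)
    ? "same result for both encodings\n"
    : "results differ\n";
```

The explicit ranges in the class are compared by code point, so the result does not depend on whether the string happens to be UTF-8 encoded internally.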

--
Technology is dominated by two types of people: those who understand what
they do not manage, and those who manage what they do not understand.

p5pRT · 2012-02-06T02:29:02Z

From @khwilliamson

On 12/29/2010 03:57 AM, Dave Mitchell wrote:

On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:

I've always wondered why a lone } or ] does not need escaping (they're
only special after an opening { or [ has been seen), but a lone ) does.

So have I. It could be worse: things like quantifiers still
need escaping to be made literals even if they couldn't quantify
something, such as at the beginning of a string. A (poor) argument
could be made that in such a position, escaping isn't necessary
to infer function, and it seems to me some nasty regex dialects
do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

Ok. How about the following resolution: we change it so that utf8 strings
get chr(128)-chr(255) escaped, so that it matches the non-utf8 case, and
leave chars > 255 unescaped. In some future world if chars > 255 start
having special meaning to the regex engine, then we start escaping them
too.

This proposal and all others died in 5.14 for lack of consensus. This
leaves the Unicode bug extant for quotemeta, and I would like to get it
fixed. Tom has told me privately that he's ok with changing things to
get consistent rules for UTF8- vs non-UTF8 encoded strings.

I'm thinking we should just do what the original trouble ticket asks
for, and what the documentation has always said, and that is to quote
everything that matches [^a-zA-Z0-9_]. This agrees with the first part
of Dave's proposal, but makes all above Latin1 chars also escaped.

I'm reopening this publicly now, in order to try to get resolution in
the next week or so, so that we can do something for 5.16. Either
proposal is easy to implement, and fast in cpu cycles.
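
To make the documented rule concrete, here is a minimal pure-Perl sketch of "quote everything that matches [^a-zA-Z0-9_]". The function name is made up for illustration; it is not the pp_quotemeta code.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash every character outside [a-zA-Z0-9_],
# independent of how the string is stored internally.
sub quotemeta_nonword_ascii {
    my ($str) = @_;
    $str =~ s/([^a-zA-Z0-9_])/\\$1/g;
    return $str;
}

my $bytes = "\xA2";          # CENT SIGN as a non-UTF-8 string
my $chars = "\xA2";
utf8::upgrade($chars);       # same character, UTF-8 encoded internally

for my $s ($bytes, $chars, "\x{263A}", "caf\xE9") {
    printf "%vX -> %vX\n", $s, quotemeta_nonword_ascii($s);
}
```

Under this rule every character above 127 is escaped, including letters such as U+00E9 that match \w under Unicode rules; that is the difference from a \W-based definition.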

If we do this, does that close the door on later changing to use the
pattern syntax should it ever become necessary? I think that it
doesn't. This thread included extensive discussion on that.

p5pRT · 2012-02-06T13:41:47Z

From @demerphq

On 29 December 2010 11:57, Dave Mitchell <davem@iabyn.com> wrote:

On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:

I've always wondered why a lone } or ] does not need escaping (they're
only special after an opening { or [ has been seen), but a lone ) does.

So have I. It could be worse: things like quantifiers still
need escaping to be made literals even if they couldn't quantify
something, such as at the beginning of a string. A (poor) argument
could be made that in such a position, escaping isn't necessary
to infer function, and it seems to me some nasty regex dialects
do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

Ok. How about the following resolution: we change it so that utf8 strings
get chr(128)-chr(255) escaped, so that it matches the non-utf8 case, and
leave chars > 255 unescaped. In some future world if chars > 255 start
having special meaning to the regex engine, then we start escaping them
too.

I think it depends on what we want to do. If quotemeta() is intended
for escaping content in a regex, then we could make it escape ONLY
known regex metacharacters, which would mean very little gets escaped
at all.

Some of the options are:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta characters.

In terms of back-compat your suggestion (2) or my suggestion (1) are
the only viable choices...

BUT, option 3 has some things to be said for it. Specifically, its
output would be parsed by the regex engine much more efficiently,
which is also why I think that option 1 has a slight edge over option
2.

The efficiency point is also why I think that escaping codepoints we
know will never be part of Perl 5's internal regex engine syntax is a
bad idea. So for me escaping ALL codepoints larger than 255 is a
mistake.

Also, as an aside to the cc list: I do not think that what Unicode
considers to be pattern syntax is particularly relevant to Perl.
While it is something we should consider, just as we consider
precedent by other regex engines, it is much like a judge in one
jurisdiction encountering an unusual case using precedent from another
jurisdiction: it may be useful advice, but it is not at all binding or
authoritative. And lastly, Unicode is a moving target, I personally
would have big reservations in using its definitions for something
like this. We have, apparently, wasted a LOT of time trying to be
compliant with Unicode, only to learn that the Unicode proposals don't
make sense and then see them deprecated or changed over time. Case
folding rules are a particular example that really makes me
disinclined to treat Unicode as an authority on what Perl should do in
the regex engine.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-06T13:42:58Z

From @demerphq

On 6 February 2012 14:41, demerphq <demerphq@gmail.com> wrote:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless

To be clear I meant codepoints where: 127 < codepoint < 256

Sorry for the extra mail...

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-06T14:14:25Z

From tchrist@perl.com

Also, as an aside to the cc list: I do not think that what Unicode considers
to be pattern syntax is particularly relevant to Perl. While it is something
we should consider, just as we consider precedent by other regex engines, it
is much like a judge in one jurisdiction encountering an unusual case using
precedent from another jurisdiction: it may be useful advice, but it is not at
all binding or authoritative. And lastly, Unicode is a moving target, I
personally would have big reservations in using its definitions for something
like this. We have, apparently, wasted a LOT of time trying to be compliant
with Unicode, only to learn that the Unicode proposals don't make sense and
then see them deprecated or changed over time. Case folding rules are a
particular example that really makes me disinclined to treat Unicode as an
authority on what Perl should do in the regex engine.

Disagree, several times over.

First of all, the Pattern_Syntax Unicode character property is *not*
defined by UTS#18 on Unicode Regular Expressions, a document that is by
virtue of being a UTS/UTR inherently informative in nature.

Rather, it is defined in UAX#44, the Unicode Character Database. That
means it is part of the Unicode Standard itself. It is also a *normative*
property, not an informative, contributory, or provisional one. So is
Pattern_White_Space.

Lastly, both those properties are, like the names of the characters
themselves, *immutable*. They are guaranteed to be closed sets
that will never change. No new character will ever gain either of those
two properties, nor shall any old character that has one of those
immutable normative properties ever lose that property.

Now please stop repeating this nonsense about Unicode being a moving
target. Things that can change are clearly marked as such, and things
that cannot change similarly. You need to understand which is which,
and why. Unicode has a very clear stability policy. Please familiarize
yourself with it.

As for casefolding, we have *not* "wasted a lot of time". But I am not
free to waste my time explaining this just right now.

--tom

p5pRT · 2012-02-06T16:14:25Z

From @demerphq

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

Also, as an aside to the cc list: I do not think that what Unicode considers
to be pattern syntax is particularly relevant to Perl. While it is something
we should consider, just as we consider precedent by other regex engines, it
is much like a judge in one jurisdiction encountering an unusual case using
precedent from another jurisdiction: it may be useful advice, but it is not at
all binding or authoritative. And lastly, Unicode is a moving target, I
personally would have big reservations in using its definitions for something
like this. We have, apparently, wasted a LOT of time trying to be compliant
with Unicode, only to learn that the Unicode proposals don't make sense and
then see them deprecated or changed over time. Case folding rules are a
particular example that really makes me disinclined to treat Unicode as an
authority on what Perl should do in the regex engine.

Disagree, several times over.

First of all, the Pattern_Syntax Unicode character property is *not*
defined by UTS#18 on Unicode Regular Expressions, a document that is by
virtue of being a UTS/UTR inherently informative in nature.

Not sure how this is relevant.

Rather, it is defined in UAX#44, the Unicode Character Database. That
means it is part of the Unicode Standard itself. It is also a *normative*
property, not an informative, contributory, or provisional one. So is
Pattern_White_Space.

Lastly, both those properties are, like the names of the characters
themselves, *immutable*. They are guaranteed to be closed sets
that will never change. No new character will ever gain either of those
two properties, nor shall any old character that has one of those
immutable normative properties ever lose that property.

Given that we reserve the right to add new regex meta characters if we
wish, it seems like this supports my position. Or am I missing something
here? (Probably)

Now please stop repeating this nonsense about Unicode being a moving
target. Things that can change are clearly marked as such, and things
that cannot change similarly. You need to understand which is which,
and why. Unicode has a very clear stability policy. Please familiarize
yourself with it.

I have personal experience with Unicode being a moving target. For
instance the introduction of an upper case sharp-ess. Perhaps this is
not relevant to the instant case, but that is not clear to me.

As for casefolding, we have *not* "wasted a lot of time". But I am not
free to waste my time explaining this just right now.

It is entirely possible I am misinformed, but this is my impression of
what I recall of Karl's comments on this subject. Some of what I have
heard on the subject makes me think I wasted some of my time trying to
make it work properly.

Anyway, as and when you have time I would like to hear more of your
thoughts on this. No rush tho, I am available sporadically this week
due to family reasons.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-06T17:04:21Z

From @nwc10

On Mon, Feb 06, 2012 at 05:13:47PM +0100, demerphq wrote:

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

I have personal experience with Unicode being a moving target. For
instance the introduction of an upper case sharp-ess. Perhaps this is
not relevant to the instant case, but that is not clear to me.

Aspects of Unicode aren't fixed yet. Being on the bleeding edge of
implementation means that one discovers things like:

"ß" =~ /ss/i
"ß" =~ /(s)(s)/i

um, has "issues" about what exactly $1 and $2 should be for the capturing
variant, and that these may turn out not to be intuitive:

"s" =~ /^[^ß]/
"ss" =~ /^[^ß]/
"s" =~ /^[^ß]/i
"ss" =~ /^[^ß]/i

As for casefolding, we have *not* "wasted a lot of time". But I am not
free to waste my time explaining this just right now.

It is entirely possible I am misinformed, but this is my impression of
what I recall of Karl's comments on this subject. Some of what I have
heard on the subject makes me think I wasted some of my time trying to
make it work properly.

Anyway, as and when you have time I would like to hear more of your
thoughts on this. No rush tho, I am available sporadically this week
due to family reasons.

In the passing mailing list traffic I didn't spot anything that made me
think that any decisions the Unicode consortium took wasted anyone here's
time as far as casefolding went.

Mainly it's that a lot of what they define is fundamentally *hard* to
implement, at least in any scalable performant fashion, and as it's new,
we don't have a choice of existing implementations to steal from.

Nicholas Clark

p5pRT · 2012-02-06T17:25:58Z

From @khwilliamson

On 02/06/2012 10:03 AM, Nicholas Clark wrote:

On Mon, Feb 06, 2012 at 05:13:47PM +0100, demerphq wrote:

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

I have personal experience with Unicode being a moving target. For
instance the introduction of an upper case sharp-ess. Perhaps this is
not relevant to the instant case, but that is not clear to me.

Aspects of Unicode aren't fixed yet. Being on the bleeding edge of
implementation means that one discovers things like:

"ß" =~ /ss/i
"ß" =~ /(s)(s)/i

um, has "issues" about what exactly $1 and $2 should be for the capturing
variant, and that these may turn out not to be intuitive:

"s" =~ /^[^ß]/
"ss" =~ /^[^ß]/
"s" =~ /^[^ß]/i
"ss" =~ /^[^ß]/i

As for casefolding, we have *not* "wasted a lot of time". But I am not
free to waste my time explaining this just right now.

It is entirely possible I am misinformed, but this is my impression of
what I recall of Karl's comments on this subject. Some of what I have
heard on the subject makes me think I wasted some of my time trying to
make it work properly.

Anyway, as and when you have time I would like to hear more of your
thoughts on this. No rush tho, I am available sporadically this week
due to family reasons.

In the passing mailing list traffic I didn't spot anything that made me
think that any decisions the Unicode consortium took wasted anyone here's
time as far as casefolding went.

Mainly it's that a lot of what they define is fundamentally *hard* to
implement, at least in any scalable performant fashion, and as it's new,
we don't have a choice of existing implementations to steal from.

Nicholas Clark

I believe it's decisions they haven't finalized yet. Indications are
that they are backing away from suggesting that regexes use full case
folding, because of things like the
"ß" =~ /(s)(s)/i
anomaly. What Yves may be referring to is that I've mentioned this
several times on the list. But Unicode hasn't updated their TR18. I
don't know what the hold up is. And what Tom meant is that TR18 is not
a part of the Standard, but merely recommendations. (If it had been
part of the Standard, they would be in a world of hurt with their
ill-advised encoding of BELL to mean something other than what it has
always meant; perhaps they would have taken better care to not break
TR18 if it had been part of the standard. I note that it does introduce
a bug into their own CLDR POSIX locales, as they have to use the term
BELL there to mean U+0007)

If they do back away, then perhaps we will have wasted some effort.

p5pRT · 2012-02-06T20:19:58Z

From @ikegami

On Mon, Feb 6, 2012 at 8:41 AM, demerphq <demerphq@gmail.com> wrote:

I think it depends on what we want to do. If quotemeta() is intended
for escaping content in a regex, then we could make it escape ONLY
known regex metacharacters, which would mean very little gets escaped
at all.

That's not very safe. It prevents storing the escaped pattern and using it
with a different version of Perl. It is not forward-compatible.

Some of the options are:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta characters.

Also mentioned was:

4) make quotemeta() escape some code-points above 127. (\W,
\p{Pattern_syntax} or some other group to be determined).

Analysis: (worst-to-best)

(3) is the least forward-compatible.
(2) is forward-compatible for as long as we don't start using characters
above 127 as "special escapes".
(1) is forward-compatible for as long as we don't start using characters
above 127 as meta characters.
(4) is the most forward-compatible.

(3) is the least backward-compatible (e.g. it would no longer escape "&").
(2) and (4) are backward-compatible with characters below 127
(1) is backward-compatible with characters below 127 and above 255

(3) is the most dangerous, affecting characters below 127 (e.g. some might
expect "&" to be escaped by quotemeta).
(2) and (4) only affect characters above 127.
(1) only affects characters for which behaviour was "undefined" (for lack
of a better word).

(3) is faster than (1), (2) and (4) if you think the time spent parsing "\"
is noticeable.
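
To make the trade-off around (3) concrete, here is a small sketch comparing a hypothetical "known metacharacters only" quoter with the built-in quotemeta on ASCII input. The helper name and the choice of exactly the twelve characters listed earlier in the thread are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper for option (3): escape only the twelve
# metacharacters  \ | ( ) [ { ^ $ * + ? .
sub quote_known_meta {
    my ($str) = @_;
    $str =~ s/([\\|()\[{^\$*+?.])/\\$1/g;
    return $str;
}

my $input = 'a&b.c';
printf "%-18s %s\n", 'quote_known_meta:', quote_known_meta($input);
printf "%-18s %s\n", 'quotemeta:',        quotemeta($input);

# Both results still match the original string literally.
for my $quoted (quote_known_meta($input), quotemeta($input)) {
    print "matches literally\n" if $input =~ /\A$quoted\z/;
}
```

On this input only the "&" is treated differently, which is the backward-compatibility concern raised for (3).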

- Eric

p5pRT · 2012-02-06T23:39:37Z

From @khwilliamson

On 02/06/2012 01:19 PM, Eric Brine wrote:

On Mon, Feb 6, 2012 at 8:41 AM, demerphq <demerphq@gmail.com> wrote:

I think it depends on what we want to do. If quotemeta() is intended
for escaping content in a regex, then we could make it escape ONLY
known regex metacharacters, which would mean very little gets escaped
at all.

That's not very safe. It prevents storing the escaped pattern and using
it with a different version of Perl. It is not forward-compatible.

Some of the options are:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta
characters.

Also mentioned was:

4) make quotemeta() escape some code-points above 127. (\W,
\p{Pattern_syntax} or some other group to be determined).

Analysis: (worst-to-best)

(3) is the least forward-compatible.
(2) is forward-compatible for as long as we don't start using characters
above 127 as "special escapes".
(1) is forward-compatible for as long as we don't start using characters
above 127 as meta characters.
(4) is the most forward-compatible.

(3) is the least backward-compatible (e.g. it would no longer escape "&").
(2) and (4) are backward-compatible with characters below 127
(1) is backward-compatible with characters below 127 and above 255

(3) is the most dangerous, affecting characters below 127 (e.g. some
might expect "&" to be escaped by quotemeta).
(2) and (4) only affect characters above 127.
(1) only affects characters for which behaviour was "undefined" (for
lack of a better word).

(3) is faster than (1), (2) and (4) if you think the time spent parsing
"\" is noticeable.

- Eric

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

p5pRT · 2012-02-07T00:35:23Z

From @ikegami

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible is
not an important factor since there's so much existing regex code that
isn't forward-compatible.

p5pRT · 2012-02-07T01:20:56Z

From @demerphq

On 6 February 2012 21:19, Eric Brine <ikegami@adaelis.com> wrote:

(3) is faster than (1), (2) and (4) if you think the time spent parsing "\"
is noticeable.

I do not have stats to back me up, but knowing how the code handles
escapes I am pretty confident that a string with lots of unnecessarily
escaped characters will be visibly slower than one without.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-07T01:23:03Z

From @demerphq

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com>
wrote:

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible is
not an important factor since there's so much existing regex code that isn't
forward-compatible.

devil's advocate:
Or is it that it actually is forward compatible because it leaves escaped
\W chars available for use by the regex engine?

:-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-07T02:07:22Z

From @Abigail

On Tue, Feb 07, 2012 at 02:22:34AM +0100, demerphq wrote:

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com>
wrote:

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible is
not an important factor since there's so much existing regex code that isn't
forward-compatible.

devil's advocate:
Or is it that it actually is forward compatible because it leaves escaped
\W chars available for use by the regex engine?

And break an ancient promise? [1] ;-)

Unlike some other regular expression languages, there are no backslashed
symbols that aren’t alphanumeric. So anything that looks like \\,
\(, \), \<, \>, \{, or \} is always interpreted as a literal character,
not a metacharacter.

This is in the current manual page, but the exact same phrasing already
appears in the manual pages of perl-3.000.

[1] 22 years counts as ancient.

Abigail

p5pRT · 2012-02-07T02:25:42Z

From @demerphq

On 7 February 2012 03:06, Abigail <abigail@abigail.be> wrote:

On Tue, Feb 07, 2012 at 02:22:34AM +0100, demerphq wrote:

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com>
wrote:

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible is
not an important factor since there's so much existing regex code that isn't
forward-compatible.

devil's advocate:
Or is it that it actually is forward compatible because it leaves escaped
\W chars available for use by the regex engine?

And break an ancient promise? [1] ;-)

Unlike some other regular expression languages, there are no backslashed
symbols that aren’t alphanumeric. So anything that looks like \\,
\(, \), \<, \>, \{, or \} is always interpreted as a literal character,
not a metacharacter.

This is in the current manual page, but the exact same phrasing already
appears in the manual pages of perl-3.000.

Yes right, my bad. Did not think my post through before I sent it.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2012-02-07T19:25:47Z

From @khwilliamson

On 02/06/2012 05:34 PM, Eric Brine wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this
thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that
doesn't escape \W characters outside of the dozen mentioned above (and
if we see a newbie escaping a \W outside of the dozen, we pick on him),
Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible
is not an important factor since there's so much existing regex code
that isn't forward-compatible.

I've looked over this thread now several times and re-read Unicode's UAX
31. I'll try to succinctly summarize the relevant portions of it.

It essentially suggests that characters that are \p{Pattern_Syntax} are
the only ones that could ever have metacharacter meaning. A complete
list of these is attached. This list is claimed to be absolutely
stable. But you may note that there are several hundred unassigned code
points in it. I expect and hope that this means that Unicode will only
ever use those code points for characters that it thinks would be
appropriate for use as metacharacters.

Unicode also defines a few characters (also attached, and also
completely stable) as \p{Pattern_White_Space}. It says that these should
not appear in a pattern as literals unless escaped. But Perl does allow
some of these as literals unescaped, except under /x. But for purposes
of quotemeta, which is what is being discussed here, these should be
escaped as well.

UAX 31 also suggests that for readability all other white space (6.1
list also attached) be escaped, as well as all characters matching
\p{Default_Ignorable_Code_Point} (6.1 list attached). Note that these
are not stable, and may grow over time, and much less likely, shrink.
Many of the default ignorables (DI for short) are usually
invisible in output, so it is a good idea to escape them. They don't
include the controls.
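
For anyone wanting to see how these four properties apply to particular code points, here is a minimal sketch. The helper name classify() and the sample code points are my own choices for illustration, not anything from the attachments; the property names are the ones discussed above.

```perl
use strict;
use warnings;

# classify() is a hypothetical helper for illustration only.
# It returns the names of the UAX 31-related properties that the
# given code point matches.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    my @props;
    push @props, 'Pattern_Syntax'               if $ch =~ /\p{Pattern_Syntax}/;
    push @props, 'Pattern_White_Space'          if $ch =~ /\p{Pattern_White_Space}/;
    push @props, 'White_Space'                  if $ch =~ /\p{White_Space}/;
    push @props, 'Default_Ignorable_Code_Point' if $ch =~ /\p{Default_Ignorable_Code_Point}/;
    return @props;
}

# CENT SIGN, NEL, NO-BREAK SPACE, SOFT HYPHEN, LATIN SMALL LETTER E WITH ACUTE
for my $cp (0x00A2, 0x0085, 0x00A0, 0x00AD, 0x00E9) {
    printf "U+%04X: %s\n", $cp, join(', ', classify($cp)) || '(none)';
}
```

A letter such as U+00E9 matches none of the four, which is why it would be left unquoted under a property-based rule.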

If Perl is willing to never use other than a pattern syntax character as
a metacharacter, then we can reasonably use UAX 31 as a basis for
quoting. If we decide, as some suggest, that we will never use anything
other than what we have already used, then it really doesn't matter, but
we do need to fix things so that non-utf8 encoded strings and
utf8-encoded strings behave the same on the same code points.

Another reasonable basis is to use \W, which Tom has pointed out earlier in
this thread comes from first principles of Perl's definition. Tom
however also pointed out that there is a single character that matches
\W that could cause problems, U+2E2F VERTICAL TILDE. When I emailed
Unicode a year ago about it, they said they would look into it; but
nothing happened. I just reminded them. But regardless, one of their
responses indicated that they did not see any anomaly here (I refer you
to Tom's post for details), so that even if they change this one, they
could encode new such characters in the future.

Thus, I'm coming down to Tom's conclusion that if we do quoting based on
code point properties, that it would be better to use UAX 31 instead of \W.

People have talked about the speed of parsing quoted characters. But
there is a cost that hasn't been mentioned, which is the speed of
figuring out by quotemeta if a code point should be quoted or not. It's
much faster to just quote all code points above 127 or 255 than to have
to parse and go out to disk to compute a swash. However, I have code
that is next in line to be fully smoked (having passed the quick
smoke-me's) that allows for compile time inclusion of property
definitions. I believe that using it would lower this cost to an
acceptable level. There still would be a swash, but its contents would
be known at compile time.

I have now formulated the following proposal:

Non-utf8 string, not feature unicode_strings:
quote \W ASCII range, plus all code points 128-255;
nothing else quoted.

Otherwise,
quote \W ASCII range, plus all pattern syntax, pattern white
space, regular white space, and default ignorable code points;
nothing else quoted.
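
The two rules above can be sketched in pure Perl roughly as follows. This only illustrates the proposal and is not how the core would implement it; proposed_quotemeta() is a hypothetical name, and its second argument is an assumption standing in for "the string is UTF-8 encoded or feature unicode_strings is in effect".

```perl
use strict;
use warnings;

# proposed_quotemeta() is a hypothetical illustration of the proposal,
# not a core function.  $unicode_rules is an assumed flag meaning "the
# string is UTF-8 encoded, or feature 'unicode_strings' is in effect".
sub proposed_quotemeta {
    my ($str, $unicode_rules) = @_;
    my $out = '';
    for my $ch (split //, $str) {
        my $cp = ord $ch;
        my $quote;
        if ($cp < 128) {
            # ASCII range: unchanged, quote anything that is not a word char
            $quote = $ch =~ /[^A-Za-z0-9_]/;
        }
        elsif (!$unicode_rules) {
            # Legacy rule: quote all of 128-255, nothing above that
            $quote = $cp < 256;
        }
        else {
            # Unicode rule: quote only code points with these properties
            $quote = $ch =~ /[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{White_Space}\p{Default_Ignorable_Code_Point}]/;
        }
        $out .= '\\' if $quote;
        $out .= $ch;
    }
    return $out;
}

# Show the results as code points to avoid output-encoding issues
for my $case (["a.b", 0], ["\xE9", 0], ["\xE9", 1], ["\xA2", 1]) {
    my ($s, $flag) = @$case;
    printf "rules=%d: %s\n", $flag,
        join ' ', map { sprintf 'U+%04X', ord } split //, proposed_quotemeta($s, $flag);
}
```

Under the second rule a Latin1 letter is no longer quoted, while a Pattern_Syntax character such as CENT SIGN still is.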

It may be that we decide we will never use anything outside the dozen we
already do; but it seems to me to be prudent to not box ourselves in
forever to this stance. Hence, I think we should do some quoting of
characters above ASCII.

This solution is completely backwards compatible in the ASCII range.
It is completely backwards compatible in the Latin1 range provided you
aren't using unicode_strings. unicode_strings was never advertised as
applying to quotemeta, but it seems like a reasonable extension of its
use to me; another alternative would be to come up with yet another
feature, say 'quote_unicode_strings'.

The solution isn't backwards compatible above Latin1; nothing we do is,
unless we create a new feature.

p5pRT · 2012-02-07T19:25:47Z

From @khwilliamson

Pat_Syn

p5pRT · 2012-02-07T19:25:47Z

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 11 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
# \p{Pattern_White_Space=Yes}
# \p{Pat_WS=Y}
# \p{Pattern_White_Space=T}
# \p{Pat_WS=True}
#
# \p{Pattern_White_Space}
# \p{Is_Pattern_White_Space}
# \p{Pat_WS}
# \p{Is_Pat_WS}
#
# perluniprops.pod should be consulted for the syntax rules for any of these,
# including if adding or subtracting white space, underscore, and hyphen
# characters matters or doesn't matter, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus,
# for example,
# \p{Pat_WS: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range, in hex; STOP is the ending point, or if
# omitted, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return <<'END' =~ s/\s*#.*//mgr;
0009 # CHARACTER TABULATION
000A # LINE FEED (LF)
000B # LINE TABULATION
000C # FORM FEED (FF)
000D # CARRIAGE RETURN (CR)
0020 # ' ' SPACE
0085 # NEXT LINE (NEL)
200E # '‎' LEFT-TO-RIGHT MARK
200F # '‏' RIGHT-TO-LEFT MARK
2028 # LINE SEPARATOR
2029 # PARAGRAPH SEPARATOR
END
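
As the header of the file says, the supported way to get at this list is Unicode::UCD::prop_invlist() rather than reading the file. A minimal sketch, assuming a Perl whose Unicode::UCD provides prop_invlist(); the returned inversion list alternates between the start of a range that has the property and the start of the following range that does not.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# Even-numbered elements start a range that matches the property;
# odd-numbered elements start the following range that does not.
my @invlist = prop_invlist('Pattern_White_Space');

my $count = 0;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $start = $invlist[$i];
    # A missing final element means the range runs to the last code point
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    $count += $end - $start + 1;
    printf "%04X%s\n", $start, ($end > $start ? sprintf(' %04X', $end) : '');
}
print "$count code points\n";
```

Because the property is stable, the total should come out as the same 11 code points listed in the file above.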

p5pRT · 2012-02-07T19:25:47Z

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 26 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
# \p{White_Space=Yes}
# \p{WSpace=Y}
# \p{Space=T}
# \p{White_Space=True}
#
# \p{White_Space}
# \p{Is_White_Space}
# \p{WSpace}
# \p{Is_WSpace}
#
# \p{Space}
# \p{XPosixSpace}
# \p{Is_Space}
# \p{Is_XPosixSpace}
#
# Meaning: \s including beyond ASCII plus vertical tab
#
# perluniprops.pod should be consulted for the syntax rules for any of these,
# including if adding or subtracting white space, underscore, and hyphen
# characters matters or doesn't matter, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus,
# for example,
# \p{Space: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range, in hex; STOP is the ending point, or if
# omitted, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return <<'END' =~ s/\s*#.*//mgr;
0009 # CHARACTER TABULATION
000A # LINE FEED (LF)
000B # LINE TABULATION
000C # FORM FEED (FF)
000D # CARRIAGE RETURN (CR)
0020 # ' ' SPACE
0085 # NEXT LINE (NEL)
00A0 # ' ' NO-BREAK SPACE
1680 # ' ' OGHAM SPACE MARK
180E # '᠎' MONGOLIAN VOWEL SEPARATOR
2000 # ' ' EN QUAD
2001 # ' ' EM QUAD
2002 # ' ' EN SPACE
2003 # ' ' EM SPACE
2004 # ' ' THREE-PER-EM SPACE
2005 # ' ' FOUR-PER-EM SPACE
2006 # ' ' SIX-PER-EM SPACE
2007 # ' ' FIGURE SPACE
2008 # ' ' PUNCTUATION SPACE
2009 # ' ' THIN SPACE
200A # ' ' HAIR SPACE
2028 # LINE SEPARATOR
2029 # PARAGRAPH SEPARATOR
202F # ' ' NARROW NO-BREAK SPACE
205F # ' ' MEDIUM MATHEMATICAL SPACE
3000 # '　' IDEOGRAPHIC SPACE
END

p5pRT · 2012-02-07T19:25:47Z

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 4167 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
# \p{Default_Ignorable_Code_Point=Yes}
# \p{DI=Y}
# \p{Default_Ignorable_Code_Point=T}
# \p{DI=True}
#
# \p{Default_Ignorable_Code_Point}
# \p{Is_Default_Ignorable_Code_Point}
# \p{DI}
# \p{Is_DI}
#
# perluniprops.pod should be consulted for the syntax rules for any of these,
# including if adding or subtracting white space, underscore, and hyphen
# characters matters or doesn't matter, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus,
# for example,
# \p{DI: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range, in hex; STOP is the ending point, or if
# omitted, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return <<'END' =~ s/\s*#.*//mgr;
00AD # '' SOFT HYPHEN
034F # '͏' COMBINING GRAPHEME JOINER
115F # 'ᅟ' HANGUL CHOSEONG FILLER
1160 # 'ᅠ' HANGUL JUNGSEONG FILLER
17B4 # '឴' KHMER VOWEL INHERENT AQ
17B5 # '឵' KHMER VOWEL INHERENT AA
180B # '᠋' MONGOLIAN FREE VARIATION SELECTOR ONE
180C # '᠌' MONGOLIAN FREE VARIATION SELECTOR TWO
180D # '᠍' MONGOLIAN FREE VARIATION SELECTOR THREE
200B # '' ZERO WIDTH SPACE
200C # '‌' ZERO WIDTH NON-JOINER
200D # '‍' ZERO WIDTH JOINER
200E # '‎' LEFT-TO-RIGHT MARK
200F # '‏' RIGHT-TO-LEFT MARK
202A # '‪' LEFT-TO-RIGHT EMBEDDING
202B # '‫' RIGHT-TO-LEFT EMBEDDING
202C # '‬' POP DIRECTIONAL FORMATTING
202D # '‭' LEFT-TO-RIGHT OVERRIDE
202E # '‮' RIGHT-TO-LEFT OVERRIDE
2060 # '⁠' WORD JOINER
2061 # '⁡' FUNCTION APPLICATION
2062 # '⁢' INVISIBLE TIMES
2063 # '⁣' INVISIBLE SEPARATOR
2064 # '⁤' INVISIBLE PLUS
2065 2069 # Unassigned, block=General_Punctuation [5]
206A # '⁪' INHIBIT SYMMETRIC SWAPPING
206B # '⁫' ACTIVATE SYMMETRIC SWAPPING
206C # '⁬' INHIBIT ARABIC FORM SHAPING
206D # '⁭' ACTIVATE ARABIC FORM SHAPING
206E # '⁮' NATIONAL DIGIT SHAPES
206F # '⁯' NOMINAL DIGIT SHAPES
3164 # 'ㅤ' HANGUL FILLER
FE00 # '︀' VARIATION SELECTOR-1
FE01 # '︁' VARIATION SELECTOR-2
FE02 # '︂' VARIATION SELECTOR-3
FE03 # '︃' VARIATION SELECTOR-4
FE04 # '︄' VARIATION SELECTOR-5
FE05 # '︅' VARIATION SELECTOR-6
FE06 # '︆' VARIATION SELECTOR-7
FE07 # '︇' VARIATION SELECTOR-8
FE08 # '︈' VARIATION SELECTOR-9
FE09 # '︉' VARIATION SELECTOR-10
FE0A # '︊' VARIATION SELECTOR-11
FE0B # '︋' VARIATION SELECTOR-12
FE0C # '︌' VARIATION SELECTOR-13
FE0D # '︍' VARIATION SELECTOR-14
FE0E # '︎' VARIATION SELECTOR-15
FE0F # '️' VARIATION SELECTOR-16
FEFF # '' ZERO WIDTH NO-BREAK SPACE
FFA0 # 'ﾠ' HALFWIDTH HANGUL FILLER
FFF0 FFF8 # Unassigned, block=Specials [9]
1D173 # '𝅳' MUSICAL SYMBOL BEGIN BEAM
1D174 # '𝅴' MUSICAL SYMBOL END BEAM
1D175 # '𝅵' MUSICAL SYMBOL BEGIN TIE
1D176 # '𝅶' MUSICAL SYMBOL END TIE
1D177 # '𝅷' MUSICAL SYMBOL BEGIN SLUR
1D178 # '𝅸' MUSICAL SYMBOL END SLUR
1D179 # '𝅹' MUSICAL SYMBOL BEGIN PHRASE
1D17A # '𝅺' MUSICAL SYMBOL END PHRASE
E0000 # Unassigned, block=Tags
E0001 # '󠀁' LANGUAGE TAG
E0002 E001F # Unassigned, block=Tags [30]
E0020 # '󠀠' TAG SPACE
E0021 # '󠀡' TAG EXCLAMATION MARK
E0022 # '󠀢' TAG QUOTATION MARK
E0023 # '󠀣' TAG NUMBER SIGN
E0024 # '󠀤' TAG DOLLAR SIGN
E0025 # '󠀥' TAG PERCENT SIGN
E0026 # '󠀦' TAG AMPERSAND
E0027 # '󠀧' TAG APOSTROPHE
E0028 # '󠀨' TAG LEFT PARENTHESIS
E0029 # '󠀩' TAG RIGHT PARENTHESIS
E002A # '󠀪' TAG ASTERISK
E002B # '󠀫' TAG PLUS SIGN
E002C # '󠀬' TAG COMMA
E002D # '󠀭' TAG HYPHEN-MINUS
E002E # '󠀮' TAG FULL STOP
E002F # '󠀯' TAG SOLIDUS
E0030 # '󠀰' TAG DIGIT ZERO
E0031 # '󠀱' TAG DIGIT ONE
E0032 # '󠀲' TAG DIGIT TWO
E0033 # '󠀳' TAG DIGIT THREE
E0034 # '󠀴' TAG DIGIT FOUR
E0035 # '󠀵' TAG DIGIT FIVE
E0036 # '󠀶' TAG DIGIT SIX
E0037 # '󠀷' TAG DIGIT SEVEN
E0038 # '󠀸' TAG DIGIT EIGHT
E0039 # '󠀹' TAG DIGIT NINE
E003A # '󠀺' TAG COLON
E003B # '󠀻' TAG SEMICOLON
E003C # '󠀼' TAG LESS-THAN SIGN
E003D # '󠀽' TAG EQUALS SIGN
E003E # '󠀾' TAG GREATER-THAN SIGN
E003F # '󠀿' TAG QUESTION MARK
E0040 # '󠁀' TAG COMMERCIAL AT
E0041 # '󠁁' TAG LATIN CAPITAL LETTER A
E0042 # '󠁂' TAG LATIN CAPITAL LETTER B
E0043 # '󠁃' TAG LATIN CAPITAL LETTER C
E0044 # '󠁄' TAG LATIN CAPITAL LETTER D
E0045 # '󠁅' TAG LATIN CAPITAL LETTER E
E0046 # '󠁆' TAG LATIN CAPITAL LETTER F
E0047 # '󠁇' TAG LATIN CAPITAL LETTER G
E0048 # '󠁈' TAG LATIN CAPITAL LETTER H
E0049 # '󠁉' TAG LATIN CAPITAL LETTER I
E004A # '󠁊' TAG LATIN CAPITAL LETTER J
E004B # '󠁋' TAG LATIN CAPITAL LETTER K
E004C # '󠁌' TAG LATIN CAPITAL LETTER L
E004D # '󠁍' TAG LATIN CAPITAL LETTER M
E004E # '󠁎' TAG LATIN CAPITAL LETTER N
E004F # '󠁏' TAG LATIN CAPITAL LETTER O
E0050 # '󠁐' TAG LATIN CAPITAL LETTER P
E0051 # '󠁑' TAG LATIN CAPITAL LETTER Q
E0052 # '󠁒' TAG LATIN CAPITAL LETTER R
E0053 # '󠁓' TAG LATIN CAPITAL LETTER S
E0054 # '󠁔' TAG LATIN CAPITAL LETTER T
E0055 # '󠁕' TAG LATIN CAPITAL LETTER U
E0056 # '󠁖' TAG LATIN CAPITAL LETTER V
E0057 # '󠁗' TAG LATIN CAPITAL LETTER W
E0058 # '󠁘' TAG LATIN CAPITAL LETTER X
E0059 # '󠁙' TAG LATIN CAPITAL LETTER Y
E005A # '󠁚' TAG LATIN CAPITAL LETTER Z
E005B # '󠁛' TAG LEFT SQUARE BRACKET
E005C # '󠁜' TAG REVERSE SOLIDUS
E005D # '󠁝' TAG RIGHT SQUARE BRACKET
E005E # '󠁞' TAG CIRCUMFLEX ACCENT
E005F # '󠁟' TAG LOW LINE
E0060 # '󠁠' TAG GRAVE ACCENT
E0061 # '󠁡' TAG LATIN SMALL LETTER A
E0062 # '󠁢' TAG LATIN SMALL LETTER B
E0063 # '󠁣' TAG LATIN SMALL LETTER C
E0064 # '󠁤' TAG LATIN SMALL LETTER D
E0065 # '󠁥' TAG LATIN SMALL LETTER E
E0066 # '󠁦' TAG LATIN SMALL LETTER F
E0067 # '󠁧' TAG LATIN SMALL LETTER G
E0068 # '󠁨' TAG LATIN SMALL LETTER H
E0069 # '󠁩' TAG LATIN SMALL LETTER I
E006A # '󠁪' TAG LATIN SMALL LETTER J
E006B # '󠁫' TAG LATIN SMALL LETTER K
E006C # '󠁬' TAG LATIN SMALL LETTER L
E006D # '󠁭' TAG LATIN SMALL LETTER M
E006E # '󠁮' TAG LATIN SMALL LETTER N
E006F # '󠁯' TAG LATIN SMALL LETTER O
E0070 # '󠁰' TAG LATIN SMALL LETTER P
E0071 # '󠁱' TAG LATIN SMALL LETTER Q
E0072 # '󠁲' TAG LATIN SMALL LETTER R
E0073 # '󠁳' TAG LATIN SMALL LETTER S
E0074 # '󠁴' TAG LATIN SMALL LETTER T
E0075 # '󠁵' TAG LATIN SMALL LETTER U
E0076 # '󠁶' TAG LATIN SMALL LETTER V
E0077 # '󠁷' TAG LATIN SMALL LETTER W
E0078 # '󠁸' TAG LATIN SMALL LETTER X
E0079 # '󠁹' TAG LATIN SMALL LETTER Y
E007A # '󠁺' TAG LATIN SMALL LETTER Z
E007B # '󠁻' TAG LEFT CURLY BRACKET
E007C # '󠁼' TAG VERTICAL LINE
E007D # '󠁽' TAG RIGHT CURLY BRACKET
E007E # '󠁾' TAG TILDE
E007F # '󠁿' CANCEL TAG
E0080 E00FF # Unassigned, block=No_Block [128]
E0100 # '󠄀' VARIATION SELECTOR-17
E0101 # '󠄁' VARIATION SELECTOR-18
E0102 # '󠄂' VARIATION SELECTOR-19
E0103 # '󠄃' VARIATION SELECTOR-20
E0104 # '󠄄' VARIATION SELECTOR-21
E0105 # '󠄅' VARIATION SELECTOR-22
E0106 # '󠄆' VARIATION SELECTOR-23
E0107 # '󠄇' VARIATION SELECTOR-24
E0108 # '󠄈' VARIATION SELECTOR-25
E0109 # '󠄉' VARIATION SELECTOR-26
E010A # '󠄊' VARIATION SELECTOR-27
E010B # '󠄋' VARIATION SELECTOR-28
E010C # '󠄌' VARIATION SELECTOR-29
E010D # '󠄍' VARIATION SELECTOR-30
E010E # '󠄎' VARIATION SELECTOR-31
E010F # '󠄏' VARIATION SELECTOR-32
E0110 # '󠄐' VARIATION SELECTOR-33
E0111 # '󠄑' VARIATION SELECTOR-34
E0112 # '󠄒' VARIATION SELECTOR-35
E0113 # '󠄓' VARIATION SELECTOR-36
E0114 # '󠄔' VARIATION SELECTOR-37
E0115 # '󠄕' VARIATION SELECTOR-38
E0116 # '󠄖' VARIATION SELECTOR-39
E0117 # '󠄗' VARIATION SELECTOR-40
E0118 # '󠄘' VARIATION SELECTOR-41
E0119 # '󠄙' VARIATION SELECTOR-42
E011A # '󠄚' VARIATION SELECTOR-43
E011B # '󠄛' VARIATION SELECTOR-44
E011C # '󠄜' VARIATION SELECTOR-45
E011D # '󠄝' VARIATION SELECTOR-46
E011E # '󠄞' VARIATION SELECTOR-47
E011F # '󠄟' VARIATION SELECTOR-48
E0120 # '󠄠' VARIATION SELECTOR-49
E0121 # '󠄡' VARIATION SELECTOR-50
E0122 # '󠄢' VARIATION SELECTOR-51
E0123 # '󠄣' VARIATION SELECTOR-52
E0124 # '󠄤' VARIATION SELECTOR-53
E0125 # '󠄥' VARIATION SELECTOR-54
E0126 # '󠄦' VARIATION SELECTOR-55
E0127 # '󠄧' VARIATION SELECTOR-56
E0128 # '󠄨' VARIATION SELECTOR-57
E0129 # '󠄩' VARIATION SELECTOR-58
E012A # '󠄪' VARIATION SELECTOR-59
E012B # '󠄫' VARIATION SELECTOR-60
E012C # '󠄬' VARIATION SELECTOR-61
E012D # '󠄭' VARIATION SELECTOR-62
E012E # '󠄮' VARIATION SELECTOR-63
E012F # '󠄯' VARIATION SELECTOR-64
E0130 # '󠄰' VARIATION SELECTOR-65
E0131 # '󠄱' VARIATION SELECTOR-66
E0132 # '󠄲' VARIATION SELECTOR-67
E0133 # '󠄳' VARIATION SELECTOR-68
E0134 # '󠄴' VARIATION SELECTOR-69
E0135 # '󠄵' VARIATION SELECTOR-70
E0136 # '󠄶' VARIATION SELECTOR-71
E0137 # '󠄷' VARIATION SELECTOR-72
E0138 # '󠄸' VARIATION SELECTOR-73
E0139 # '󠄹' VARIATION SELECTOR-74
E013A # '󠄺' VARIATION SELECTOR-75
E013B # '󠄻' VARIATION SELECTOR-76
E013C # '󠄼' VARIATION SELECTOR-77
E013D # '󠄽' VARIATION SELECTOR-78
E013E # '󠄾' VARIATION SELECTOR-79
E013F # '󠄿' VARIATION SELECTOR-80
E0140 # '󠅀' VARIATION SELECTOR-81
E0141 # '󠅁' VARIATION SELECTOR-82
E0142 # '󠅂' VARIATION SELECTOR-83
E0143 # '󠅃' VARIATION SELECTOR-84
E0144 # '󠅄' VARIATION SELECTOR-85
E0145 # '󠅅' VARIATION SELECTOR-86
E0146 # '󠅆' VARIATION SELECTOR-87
E0147 # '󠅇' VARIATION SELECTOR-88
E0148 # '󠅈' VARIATION SELECTOR-89
E0149 # '󠅉' VARIATION SELECTOR-90
E014A # '󠅊' VARIATION SELECTOR-91
E014B # '󠅋' VARIATION SELECTOR-92
E014C # '󠅌' VARIATION SELECTOR-93
E014D # '󠅍' VARIATION SELECTOR-94
E014E # '󠅎' VARIATION SELECTOR-95
E014F # '󠅏' VARIATION SELECTOR-96
E0150 # '󠅐' VARIATION SELECTOR-97
E0151 # '󠅑' VARIATION SELECTOR-98
E0152 # '󠅒' VARIATION SELECTOR-99
E0153 # '󠅓' VARIATION SELECTOR-100
E0154 # '󠅔' VARIATION SELECTOR-101
E0155 # '󠅕' VARIATION SELECTOR-102
E0156 # '󠅖' VARIATION SELECTOR-103
E0157 # '󠅗' VARIATION SELECTOR-104
E0158 # '󠅘' VARIATION SELECTOR-105
E0159 # '󠅙' VARIATION SELECTOR-106
E015A # '󠅚' VARIATION SELECTOR-107
E015B # '󠅛' VARIATION SELECTOR-108
E015C # '󠅜' VARIATION SELECTOR-109
E015D # '󠅝' VARIATION SELECTOR-110
E015E # '󠅞' VARIATION SELECTOR-111
E015F # '󠅟' VARIATION SELECTOR-112
E0160 # '󠅠' VARIATION SELECTOR-113
E0161 # '󠅡' VARIATION SELECTOR-114
E0162 # '󠅢' VARIATION SELECTOR-115
E0163 # '󠅣' VARIATION SELECTOR-116
E0164 # '󠅤' VARIATION SELECTOR-117
E0165 # '󠅥' VARIATION SELECTOR-118
E0166 # '󠅦' VARIATION SELECTOR-119
E0167 # '󠅧' VARIATION SELECTOR-120
E0168 # '󠅨' VARIATION SELECTOR-121
E0169 # '󠅩' VARIATION SELECTOR-122
E016A # '󠅪' VARIATION SELECTOR-123
E016B # '󠅫' VARIATION SELECTOR-124
E016C # '󠅬' VARIATION SELECTOR-125
E016D # '󠅭' VARIATION SELECTOR-126
E016E # '󠅮' VARIATION SELECTOR-127
E016F # '󠅯' VARIATION SELECTOR-128
E0170 # '󠅰' VARIATION SELECTOR-129
E0171 # '󠅱' VARIATION SELECTOR-130
E0172 # '󠅲' VARIATION SELECTOR-131
E0173 # '󠅳' VARIATION SELECTOR-132
E0174 # '󠅴' VARIATION SELECTOR-133
E0175 # '󠅵' VARIATION SELECTOR-134
E0176 # '󠅶' VARIATION SELECTOR-135
E0177 # '󠅷' VARIATION SELECTOR-136
E0178 # '󠅸' VARIATION SELECTOR-137
E0179 # '󠅹' VARIATION SELECTOR-138
E017A # '󠅺' VARIATION SELECTOR-139
E017B # '󠅻' VARIATION SELECTOR-140
E017C # '󠅼' VARIATION SELECTOR-141
E017D # '󠅽' VARIATION SELECTOR-142
E017E # '󠅾' VARIATION SELECTOR-143
E017F # '󠅿' VARIATION SELECTOR-144
E0180 # '󠆀' VARIATION SELECTOR-145
E0181 # '󠆁' VARIATION SELECTOR-146
E0182 # '󠆂' VARIATION SELECTOR-147
E0183 # '󠆃' VARIATION SELECTOR-148
E0184 # '󠆄' VARIATION SELECTOR-149
E0185 # '󠆅' VARIATION SELECTOR-150
E0186 # '󠆆' VARIATION SELECTOR-151
E0187 # '󠆇' VARIATION SELECTOR-152
E0188 # '󠆈' VARIATION SELECTOR-153
E0189 # '󠆉' VARIATION SELECTOR-154
E018A # '󠆊' VARIATION SELECTOR-155
E018B # '󠆋' VARIATION SELECTOR-156
E018C # '󠆌' VARIATION SELECTOR-157
E018D # '󠆍' VARIATION SELECTOR-158
E018E # '󠆎' VARIATION SELECTOR-159
E018F # '󠆏' VARIATION SELECTOR-160
E0190 # '󠆐' VARIATION SELECTOR-161
E0191 # '󠆑' VARIATION SELECTOR-162
E0192 # '󠆒' VARIATION SELECTOR-163
E0193 # '󠆓' VARIATION SELECTOR-164
E0194 # '󠆔' VARIATION SELECTOR-165
E0195 # '󠆕' VARIATION SELECTOR-166
E0196 # '󠆖' VARIATION SELECTOR-167
E0197 # '󠆗' VARIATION SELECTOR-168
E0198 # '󠆘' VARIATION SELECTOR-169
E0199 # '󠆙' VARIATION SELECTOR-170
E019A # '󠆚' VARIATION SELECTOR-171
E019B # '󠆛' VARIATION SELECTOR-172
E019C # '󠆜' VARIATION SELECTOR-173
E019D # '󠆝' VARIATION SELECTOR-174
E019E # '󠆞' VARIATION SELECTOR-175
E019F # '󠆟' VARIATION SELECTOR-176
E01A0 # '󠆠' VARIATION SELECTOR-177
E01A1 # '󠆡' VARIATION SELECTOR-178
E01A2 # '󠆢' VARIATION SELECTOR-179
E01A3 # '󠆣' VARIATION SELECTOR-180
E01A4 # '󠆤' VARIATION SELECTOR-181
E01A5 # '󠆥' VARIATION SELECTOR-182
E01A6 # '󠆦' VARIATION SELECTOR-183
E01A7 # '󠆧' VARIATION SELECTOR-184
E01A8 # '󠆨' VARIATION SELECTOR-185
E01A9 # '󠆩' VARIATION SELECTOR-186
E01AA # '󠆪' VARIATION SELECTOR-187
E01AB # '󠆫' VARIATION SELECTOR-188
E01AC # '󠆬' VARIATION SELECTOR-189
E01AD # '󠆭' VARIATION SELECTOR-190
E01AE # '󠆮' VARIATION SELECTOR-191
E01AF # '󠆯' VARIATION SELECTOR-192
E01B0 # '󠆰' VARIATION SELECTOR-193
E01B1 # '󠆱' VARIATION SELECTOR-194
E01B2 # '󠆲' VARIATION SELECTOR-195
E01B3 # '󠆳' VARIATION SELECTOR-196
E01B4 # '󠆴' VARIATION SELECTOR-197
E01B5 # '󠆵' VARIATION SELECTOR-198
E01B6 # '󠆶' VARIATION SELECTOR-199
E01B7 # '󠆷' VARIATION SELECTOR-200
E01B8 # '󠆸' VARIATION SELECTOR-201
E01B9 # '󠆹' VARIATION SELECTOR-202
E01BA # '󠆺' VARIATION SELECTOR-203
E01BB # '󠆻' VARIATION SELECTOR-204
E01BC # '󠆼' VARIATION SELECTOR-205
E01BD # '󠆽' VARIATION SELECTOR-206
E01BE # '󠆾' VARIATION SELECTOR-207
E01BF # '󠆿' VARIATION SELECTOR-208
E01C0 # '󠇀' VARIATION SELECTOR-209
E01C1 # '󠇁' VARIATION SELECTOR-210
E01C2 # '󠇂' VARIATION SELECTOR-211
E01C3 # '󠇃' VARIATION SELECTOR-212
E01C4 # '󠇄' VARIATION SELECTOR-213
E01C5 # '󠇅' VARIATION SELECTOR-214
E01C6 # '󠇆' VARIATION SELECTOR-215
E01C7 # '󠇇' VARIATION SELECTOR-216
E01C8 # '󠇈' VARIATION SELECTOR-217
E01C9 # '󠇉' VARIATION SELECTOR-218
E01CA # '󠇊' VARIATION SELECTOR-219
E01CB # '󠇋' VARIATION SELECTOR-220
E01CC # '󠇌' VARIATION SELECTOR-221
E01CD # '󠇍' VARIATION SELECTOR-222
E01CE # '󠇎' VARIATION SELECTOR-223
E01CF # '󠇏' VARIATION SELECTOR-224
E01D0 # '󠇐' VARIATION SELECTOR-225
E01D1 # '󠇑' VARIATION SELECTOR-226
E01D2 # '󠇒' VARIATION SELECTOR-227
E01D3 # '󠇓' VARIATION SELECTOR-228
E01D4 # '󠇔' VARIATION SELECTOR-229
E01D5 # '󠇕' VARIATION SELECTOR-230
E01D6 # '󠇖' VARIATION SELECTOR-231
E01D7 # '󠇗' VARIATION SELECTOR-232
E01D8 # '󠇘' VARIATION SELECTOR-233
E01D9 # '󠇙' VARIATION SELECTOR-234
E01DA # '󠇚' VARIATION SELECTOR-235
E01DB # '󠇛' VARIATION SELECTOR-236
E01DC # '󠇜' VARIATION SELECTOR-237
E01DD # '󠇝' VARIATION SELECTOR-238
E01DE # '󠇞' VARIATION SELECTOR-239
E01DF # '󠇟' VARIATION SELECTOR-240
E01E0 # '󠇠' VARIATION SELECTOR-241
E01E1 # '󠇡' VARIATION SELECTOR-242
E01E2 # '󠇢' VARIATION SELECTOR-243
E01E3 # '󠇣' VARIATION SELECTOR-244
E01E4 # '󠇤' VARIATION SELECTOR-245
E01E5 # '󠇥' VARIATION SELECTOR-246
E01E6 # '󠇦' VARIATION SELECTOR-247
E01E7 # '󠇧' VARIATION SELECTOR-248
E01E8 # '󠇨' VARIATION SELECTOR-249
E01E9 # '󠇩' VARIATION SELECTOR-250
E01EA # '󠇪' VARIATION SELECTOR-251
E01EB # '󠇫' VARIATION SELECTOR-252
E01EC # '󠇬' VARIATION SELECTOR-253
E01ED # '󠇭' VARIATION SELECTOR-254
E01EE # '󠇮' VARIATION SELECTOR-255
E01EF # '󠇯' VARIATION SELECTOR-256
E01F0 E0FFF # Unassigned, block=No_Block [3600]
END

p5pRT · 2012-02-08T11:03:23Z

From tchrist@perl.com

I never thought to check unassigned code points for properties.
Hadn't realized there were 308 unassigned code points that already
counted as PatSyn even though we don't know what they are yet.
That now makes more sense as to how they can have an immutable set:
they carved out a fixed place to grow into.

No room in PatWS, but LRM and RLM are \S.

(Well, so is \cK, but that's only because we haven't fixed that yet to make
it white space in Perl the way it is in Unicode. Larry said he thought we
should, because it seemed like a bug that Perl's WS != Unicode's WS.)
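
A quick way to see the mismatch between Perl's \s and Pattern_White_Space is to test both on the code points in question. This is only a sketch; the %report hash is my own scaffolding, and the \s result for \cK (U+000B) depends on the Perl version in use, so no output is claimed for it here.

```perl
use strict;
use warnings;

# %report is illustration-only scaffolding: for each code point, record
# whether it matches \p{Pattern_White_Space} and whether it matches \s.
my %report;
for my $cp (0x0B, 0x200E, 0x200F) {    # \cK, LRM, RLM
    my $ch = chr $cp;
    $report{$cp} = {
        pat_ws => ($ch =~ /\p{Pattern_White_Space}/ ? 1 : 0),
        perl_s => ($ch =~ /\s/ ? 1 : 0),
    };
    printf "U+%04X  Pattern_White_Space=%d  \\s=%d\n",
        $cp, $report{$cp}{pat_ws}, $report{$cp}{perl_s};
}
```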

Not sure what all the unassigned DI code points up in E0080–E00FF
or E01F0–E0FFF are meant to be used for someday; more variation
selectors, maybe?

--tom

p5pRT · 2012-02-08T11:37:06Z

From @nwc10

On Tue, Feb 07, 2012 at 12:22:30PM -0700, Karl Williamson wrote:

This solution is completely backwards compatible in the ASCII range.
It is completely backwards compatible in the Latin1 range provided you
aren't using unicode_strings. unicode_strings was never advertised as
applying to quotemeta, but it seems like a reasonable extension of its
use to me; another alternative would be to come up with yet another
feature, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as
not be adding one new feature each year (per major release) as we find
another small thing we'd like to regularise the behaviour of.

The solution isn't backwards compatible above Latin1; nothing we do is,
unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to
think of a use case where it's going to make a difference whether
quotemeta("£") is "£" or "\£", other than golden results in tests.

Nicholas Clark

p5pRT · 2012-02-08T17:25:45Z

From @khwilliamson

On 02/08/2012 04:36 AM, Nicholas Clark wrote:

On Tue, Feb 07, 2012 at 12:22:30PM -0700, Karl Williamson wrote:

This solution is completely backwards compatible in the ASCII range.
It is completely backwards compatible in the Latin1 range provided you
aren't using unicode_strings. unicode_strings was never advertised as
applying to quotemeta, but it seems like a reasonable extension of its
use to me; another alternative would be to come up with yet another
feature, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as
not be adding one new feature each year (per major release) as we find
another small thing we'd like to regularise the behaviour of.

I was hoping that would be people's sentiment about this. :)

The solution isn't backwards compatible above Latin1; nothing we do is,
unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to
think of a use case where it's going to make a difference whether
quotemeta("£") is "£" or "\£", other than golden results in tests.

Totally agree.

So yet another option is to just fix the Unicode bug portion of this for
now.

We could use unicode_strings as a flag for the upper Latin1 range
characters. If it is off, we treat them as we've always treated them:
quote them.

If it is on, we treat them as we've always treated above-Latin1 range
characters: don't quote them.

Thus the only inconsistency is between non-unicode_strings and
unicode_strings, and we could leave for another time worrying about
which of these we really want to quote going forwards.
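
To see which behaviour a given perl actually has for an upper-Latin1 character, something like the following can be run. It is a sketch only: the choice of U+00E9 is mine, and what gets printed depends on the Perl version, so no output is claimed.

```perl
use strict;
use warnings;

my $native   = "\xE9";        # stored as a single byte
my $upgraded = "\xE9";
utf8::upgrade($upgraded);     # same code point, UTF-8 encoded internally

print "the two strings compare equal\n" if $native eq $upgraded;

# Same code point, different internal encoding
for my $s ($native, $upgraded) {
    printf "utf8 flag %d: %s\n",
        (utf8::is_utf8($s) ? 1 : 0),
        (quotemeta($s) eq $s ? 'not quoted' : 'quoted');
}

# Same native string, but with the feature in scope
{
    use feature 'unicode_strings';
    printf "under unicode_strings: %s\n",
        (quotemeta($native) eq $native ? 'not quoted' : 'quoted');
}
```

If the three lines disagree for a string that compares equal to itself, that is the inconsistency being discussed.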

p5pRT · 2012-02-10T02:11:29Z

From @khwilliamson

On 02/08/2012 10:23 AM, Karl Williamson wrote:

On 02/08/2012 04:36 AM, Nicholas Clark wrote:

On Tue, Feb 07, 2012 at 12:22:30PM -0700, Karl Williamson wrote:

This solution is completely backwards compatible in the ASCII range.
It is completely backwards compatible in the Latin1 range provided you
aren't using unicode_strings. unicode_strings was never advertised as
applying to quotemeta, but it seems like a reasonable extension of its
use to me; another alternative would be to come up with yet another
feature, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as
not be adding one new feature each year (per major release) as we find
another small thing we'd like to regularise the behaviour of.

I was hoping that would be people's sentiment about this. :)

The solution isn't backwards compatible above Latin1; nothing we do is,
unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to
think of a use case where it's going to make a difference whether
quotemeta("£") is "£" or "\£", other than golden results in tests.

Totally agree.

So yet another option is to just fix the Unicode bug portion of this for
now.

We could use unicode_strings as a flag for the upper Latin1 range
characters. If it is off, we treat them as we've always treated them:
quote them.

If it is on, we treat them as we've always treated above-Latin1 range
characters: don't quote them.

Thus the only inconsistency is between non-unicode_strings and
unicode_strings, and we could leave for another time worrying about
which of these we really want to quote going forwards.

If we go the pattern syntax route, I think we should quote the controls
we wouldn't otherwise quote. This is the set of C1 controls (except NEL,
which is already quoted).

p5pRT · 2012-02-12T16:49:19Z

From @khwilliamson

I have mostly implemented what I last proposed, but attached is a doc
patch for comment on how it actually plays out, to verify that this
seems like an acceptable approach.

I'm also thinking that under locale, quotemeta should just quote \W for
code points < 256. I don't think it should be immune from locale, or
perhaps it doesn't much matter.

p5pRT · 2012-02-12T16:49:19Z

From @khwilliamson

0002-temp-for-comment.patch

From 1607ec47dcb28ecd2687333d4f0d759eb0479312 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sun, 12 Feb 2012 09:41:25 -0700
Subject: [PATCH 2/2] temp for comment

---
 pod/perlfunc.pod |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 591fa0d..ad8b7b5 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -4953,8 +4953,52 @@ input from the user, quotemeta() or C<\Q> must be used.
 
 In Perl v5.14, all non-ASCII characters are quoted in non-UTF-8-encoded
 strings, but not quoted in UTF-8 strings.
-It is planned to change this behavior in v5.16, but the exact rules
-haven't been determined yet.
+
+Starting in Perl v5.16, Perl adopted a Unicode-defined strategy
+for quoting non-ASCII characters; the quoting of ASCII characters is
+unchanged.
+
+Also unchanged is the quoting for non-UTF-8 strings when outside the
+scope of a C<use feature 'unicode_strings'>, which is to quote all
+characters in the upper Latin1 range.  This provides complete backwards
+compatibility for old programs which do not use Unicode (but note that
+C<unicode_strings> is automatically enabled within the scope of a
+S<C<use v5.12>> or greater).
+
+Otherwise, Perl quotes non-ASCII characters using an adaptation from
+Unicode (see L<http://www.unicode.org/reports/tr31/>.)
+The only code points that are quoted are those that have any of the
+Unicode properties Pattern_Syntax, Pattern_White_Space, White_Space,
+Default_Ignorable_Code_Point, or General_Category=Control.
+
+Of these properties, the two important ones are Pattern_Syntax and
+Pattern_White_Space.  They have been set up by Unicode for exactly this
+purpose of deciding which characters in a regular expression pattern
+should be quoted.  No character that can be in an identifier has these
+properties.
+
+Perl promises, that if we ever add regular expression pattern
+metacharacters to the dozen already defined
+(C<\ E<verbar> ( ) [ { ^ $ * + ? .>), that we will only use ones that have the
+Pattern_Syntax property.  Perl also promises, that if we ever add
+characters that are considered to be white space in regular expressions
+(currently mostly affected by C</x>), they will all have the
+Pattern_White_Space property.
+
+Unicode promises that the set of code points that have these two
+properties will never change, so something that is not quoted in v5.16
+will never need to be quoted in any future Perl release.  (Not all the
+code points that match Pattern_Syntax have actually had characters
+assigned to them; so there is room to grow, but they are quoted
+whether assigned or not.  Perl, of course, would never use an
+unassigned code point as an actual metacharacter.)
+
+Quoting characters that have the other 3 properties is done to enhance
+the readability of the regular expression and not because they actually
+need to be quoted (characters with the White_Space property are likely
+to be indistinguishable on the page or screen from those with the
+Pattern_White_Space property; and the other two properties contain
+non-printing characters).
 
 =item rand EXPR
 X<rand> X<random>
-- 
1.7.7.1
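
As a rough illustration of the rules the patch documents, here is a minimal sketch, assuming a perl of v5.16 or later. The `show` helper and the variable names are made up for this example; the expected values in the comments follow from the documented rules (CENT SIGN has Pattern_Syntax, LATIN SMALL LETTER E WITH ACUTE has none of the five listed properties).

```perl
use v5.16;      # enables strict and the 'unicode_strings' feature
use warnings;

my $cent    = "\x{A2}";   # CENT SIGN: has the Pattern_Syntax property
my $e_acute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE: a letter

# The same characters, but stored internally as UTF-8.
my ($cent_utf8, $e_acute_utf8) = ($cent, $e_acute);
utf8::upgrade($_) for $cent_utf8, $e_acute_utf8;

# Hypothetical helper: render a string as space-separated hex code points.
sub show { join ' ', map { sprintf('%02X', ord($_)) } split //, shift }

# Under unicode_strings the result no longer depends on the internal encoding.
say show(quotemeta $cent);          # 5C A2
say show(quotemeta $cent_utf8);     # 5C A2
say show(quotemeta $e_acute);       # E9
say show(quotemeta $e_acute_utf8);  # E9

{
    # Outside unicode_strings, non-UTF-8 strings keep the legacy rule:
    # every upper-Latin1 character is quoted.
    no feature 'unicode_strings';
    say show(quotemeta $e_acute);       # 5C E9
    say show(quotemeta $e_acute_utf8);  # E9
}
```

The last block is the backwards-compatibility case from the patch text: only the non-UTF-8 string is affected by leaving the scope of `unicode_strings`.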

p5pRT · 2012-02-13T22:56:53Z

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2012-02-12T11:47:28]

+Otherwise, Perl quotes non-ASCII characters using an adaptation from
+Unicode (see L<http://www.unicode.org/reports/tr31/>.)
+The only code points that are quoted are those that have any of the
+Unicode properties Pattern_Syntax, Pattern_White_Space, White_Space,
+Default_Ignorable_Code_Point, or General_Category=Control.
[...]
+Perl promises, that if we ever add regular expression pattern
+metacharacters to the dozen already defined
+(C<\ E<verbar> ( ) [ { ^ $ * + ? .>), that we will only use ones that have the
+Pattern_Syntax property.  Perl also promises, that if we ever add

...and I see that all characters that are ASCII and Pattern_Syntax are already
quoted by quotemeta. That comforts my initially-raised eyebrow.

Cool.

--
rjbs
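
The observation above can be checked with a short script. This is only a sketch: it relies on Perl's `\p{Pattern_Syntax}` property match, and `@unquoted` is a name chosen for this example.

```perl
use strict;
use warnings;

# Collect every ASCII character that has Pattern_Syntax
# but that quotemeta() leaves without a backslash.
my @unquoted = grep { quotemeta($_) eq $_ }
               grep { /\p{Pattern_Syntax}/ }
               map  { chr($_) } 0x00 .. 0x7F;

print @unquoted
    ? "not quoted: @unquoted\n"
    : "all ASCII Pattern_Syntax characters are quoted\n";
```

Since quotemeta() backslashes every ASCII character outside `[A-Za-z_0-9]`, and no ASCII word character has Pattern_Syntax, the list should come back empty.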

p5pRT · 2012-02-16T01:04:30Z

From @khwilliamson

Now fixed by commit 2e2b257
--
Karl Williamson

p5pRT · 2012-02-16T01:04:30Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Feb 16, 2012

p5pRT added Severity Low type-Unicode type-core labels Oct 18, 2019
