Inconsistent and wrong handling of 8th bit set chars with no locale #9455

Closed
p5pRT opened this issue Aug 20, 2008 · 267 comments

@p5pRT p5pRT commented Aug 20, 2008

Migrated from rt.perl.org#58182 (status was 'resolved')

Searchable as RT58182$

@p5pRT p5pRT commented Aug 20, 2008

From @khwilliamson

This is a bug report for perl from corporate@khwilliamson.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.


Characters in the range U+0080 through U+00FF behave inconsistently
depending on whether or not they are part of a string which also
includes a character above that range, and in some cases they behave
incorrectly even when part of such a string. The problems I will
concentrate on in this report are those involving case.

I presume that they do work properly when a locale is set, but I haven't
tested that.

print uc("\x{e0}"), "\n"; # (a with grave accent)

yields itself instead of a capital A with grave accent (U+00C0). This
is true whether or not the character is part of a string which includes
a character not storable in a single byte. Similarly

print "\x{e0}" =~ /\x{c0}/i, "\n";

will print a null string on a line, as the match fails.

The same behavior occurs for all characters in this range that are
marked in the Unicode standard as lower case and have single letter
upper case equivalents.

The behavior that is inconsistent mostly occurs with upper case letters
being mapped to lower case.

print lcfirst("\x{c0}aaaaa"), "\n";

doesn't change the first character. But

print lcfirst("\x{c0}aaaaa\x{101}"), "\n";

does change it. There is something seriously wrong when a character
separated by an arbitrarily large distance from another one can affect
what case the latter is considered to be. Similarly,

print "\x{c0}aaaaaa" =~ /^\x{e0}/i, "\n";

will show the match failing, but

print "\x{c0}aaaaaa\x{101}" =~ /^\x{e0}/i, "\n";

will show the match succeeding. Again a character maybe hundreds of
positions further along in a string can affect whether the first
character in said string matches its lower case equivalent when case is
ignored.

The same behavior occurs for all characters in this range that are
marked in the Unicode standard as upper case and have lower case
equivalents, as well as U+00DF which is lower case and has an upper case
equivalent of the string 'SS'.
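
The dependence on internal storage can be reproduced without any distant wide character. The following is a minimal sketch, assuming neither `use locale` nor the later `unicode_strings` feature is in effect; the variable names are only illustrative, and `utf8::upgrade`/`utf8::downgrade` are used solely to force one representation or the other.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00C0 (capital A with grave).
my $native   = "\x{c0}";
my $upgraded = "\x{c0}";
utf8::downgrade($native);   # stored as a single byte
utf8::upgrade($upgraded);   # stored internally as UTF-8

# Perl considers the two strings equal ...
my $same = ($native eq $upgraded) ? 1 : 0;

# ... yet case mapping and case-insensitive matching treat them differently.
my $lc_native      = lcfirst $native;     # stays U+00C0
my $lc_upgraded    = lcfirst $upgraded;   # becomes U+00E0
my $match_native   = ($native   =~ /^\x{e0}/i) ? 1 : 0;
my $match_upgraded = ($upgraded =~ /^\x{e0}/i) ? 1 : 0;

printf "equal: %d\n", $same;
printf "lcfirst native:   U+%04X, /i match: %d\n", ord($lc_native),   $match_native;
printf "lcfirst upgraded: U+%04X, /i match: %d\n", ord($lc_upgraded), $match_upgraded;
```

Printing code points rather than the characters themselves keeps the output independent of the terminal's encoding.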

Also, the byte character classes inconsistently match characters in this
range, again depending on whether or not the character is part of a
larger string that contains a character greater than the range. So, for
example, for a non-breaking space,

print "\xa0" =~ /^\s/, "\n";

will show that the match returns false but

print "\xa0\x{101}" =~ /^\s/, "\n";

will show that the match returns true. But this behavior is sort-of
documented, and there is a work-around, which is to use the '\p{}'
classes instead. Note that calling them byte character classes is
wrong; they really are 7-bit classes.
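
The `\p{}` work-around mentioned above can be sketched as follows. This assumes no locale and no `unicode_strings` feature, and uses `\p{Space}` as one plausible replacement for `\s`; the hash keys are just labels for the four combinations.

```perl
use strict;
use warnings;

# The same no-break space, once stored as a byte and once as UTF-8.
my $nbsp_native   = "\xa0";
my $nbsp_upgraded = "\xa0";
utf8::downgrade($nbsp_native);
utf8::upgrade($nbsp_upgraded);

my %result = (
    s_native   => ($nbsp_native   =~ /^\s/        ? 1 : 0),
    s_upgraded => ($nbsp_upgraded =~ /^\s/        ? 1 : 0),
    p_native   => ($nbsp_native   =~ /^\p{Space}/ ? 1 : 0),
    p_upgraded => ($nbsp_upgraded =~ /^\p{Space}/ ? 1 : 0),
);

# \s depends on the representation; \p{Space} does not.
printf "%-10s %d\n", $_, $result{$_} for sort keys %result;
```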

From reading the documentation, I presume that the inconsistent behavior
is a result of the decision to have perl not switch to wide-character
mode in storing its strings unless necessary. I like that decision for
efficiency reasons. But what has happened is that the code points in
the range 128 - 255 have been orphaned, when they aren't part of strings
that force the switch. Again, I presume but haven't tested, that using
a locale causes them to work properly for that locale, but in the
absence of a locale they should be treated as Unicode code points (or
equivalently for characters in this range, as iso-8859-1). Storing as
wide-characters is supposed to be transparent to users, but this bug
belies that and yields very inconsistent and unexpected behavior.
(This doesn't explain the lower to upper case translation bug, which is
wrong even in wide-character mode.)

I am frankly astonished that this bug exists, as I have come to expect
perl to "Do the Right Thing" over the course of many years of using it.
I did see one bug report of something similar to this when searching for
this, but it apparently was misunderstood and went nowhere, and wasn't
in the perl bug database.



Flags​:
  category=core
  severity=high


Site configuration information for perl 5.10.0​:

Configured by ActiveState at Wed May 14 05​:06​:16 PDT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration​:
  Platform​:
  osname=linux, osvers=2.4.21-297-default,
archname=i686-linux-thread-multi
  uname='linux gila 2.4.21-297-default #1 sat jul 23 07​:47​:39 utc
2005 i686 i686 i386 gnulinux '
  config_args='-ders -Dcc=gcc -Dusethreads -Duseithreads
-Ud_sigsetjmp -Uinstallusrbinperl -Ulocincpth= -Uloclibpth=
-Accflags=-DUSE_SITECUSTOMIZE -Duselargefiles
-Accflags=-DPRIVLIB_LAST_IN_INC -Dprefix=/opt/ActivePerl-5.10
-Dprivlib=/opt/ActivePerl-5.10/lib -Darchlib=/opt/ActivePerl-5.10/lib
-Dsiteprefix=/opt/ActivePerl-5.10/site
-Dsitelib=/opt/ActivePerl-5.10/site/lib
-Dsitearch=/opt/ActivePerl-5.10/site/lib -Dsed=/bin/sed -Duseshrplib
-Dcf_by=ActiveState -Dcf_email=support@​ActiveState.com'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
-DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
-DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe'
  ccversion='', gccversion='3.3.1 (SuSE Linux)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='gcc', ldflags =''
  libpth=/lib /usr/lib /usr/local/lib
  libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=, so=so, useshrplib=true, libperl=libperl.so
  gnulibc_version='2.3.2'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E
-Wl,-rpath,/opt/ActivePerl-5.10/lib/CORE'
  cccdlflags='-fPIC', lddlflags='-shared -O2'

Locally applied patches​:
  ACTIVEPERL_LOCAL_PATCHES_ENTRY
  33741 avoids segfaults invoking S_raise_signal() (on Linux)
  33763 Win32 process ids can have more than 16 bits
  32809 Load 'loadable object' with non-default file extension
  32728 64-bit fix for Time​::Local


@​INC for perl 5.10.0​:
  /opt/ActivePerl-5.10/site/lib
  /opt/ActivePerl-5.10/lib
  .


Environment for perl 5.10.0​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)

PATH=/opt/ActivePerl-5.10/bin​:/home/khw/bin​:/home/khw/print/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

@p5pRT p5pRT commented Aug 21, 2008

From @moritz

karl williamson wrote:

> # New Ticket Created by karl williamson
> # Please include the string: [perl #58182]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=58182 >
>
> This is a bug report for perl from corporate@khwilliamson.com,
> generated with the help of perlbug 1.36 running under perl 5.10.0.
>
> -----------------------------------------------------------------
> Characters in the range U+0080 through U+00FF behave inconsistently
> depending on whether or not they are part of a string which also
> includes a character above that range, and in some cases they behave
> incorrectly even when part of such a string. The problems I will
> concentrate on in this report are those involving case.
>
> I presume that they do work properly when a locale is set, but I haven't
> tested that.
>
> print uc("\x{e0}"), "\n"; # (a with grave accent)
>
> yields itself instead of a capital A with grave accent (U+00C0). This
> is true whether or not the character is part of a string which includes
> a character not storable in a single byte. Similarly

This is a known bug, and probably not fixable, because too much code
depends on it. See http://search.cpan.org/perldoc?Unicode::Semantics

A possible workaround is
use feature 'say';
my $x = "\x{e0}"; utf8::upgrade($x); say uc($x); # yields À (U+00C0)

Cheers,
Moritz

@p5pRT p5pRT commented Aug 21, 2008

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT commented Aug 21, 2008

From @druud62

karl williamson wrote:

> The behavior that is inconsistent mostly occurs with upper case
> letters being mapped to lower case.
>
> print lcfirst("\x{c0}aaaaa"), "\n";
>
> doesn't change the first character. But
>
> print lcfirst("\x{c0}aaaaa\x{101}"), "\n";
>
> does change it.

To me that is as expected.

print lcfirst substr "\x{100}\x{c0}aaaaa", 1;

Lowercasing isn't defined for as many characters in ASCII or Latin-1 as
it is in Unicode.
Unicode semantics get activated when a codepoint above 255 is involved.

--
Affijn, Ruud

"Gewoon is een tijger."

@p5pRT p5pRT commented Aug 21, 2008

From @nothingmuch

On Thu, Aug 21, 2008 at 13:22:36 +0200, Dr.Ruud wrote:

> Unicode semantics get activated when a codepoint above 255 is involved.

Or a code point above 127 with use utf8 or use encoding

--
  Yuval Kogman <nothingmuch@woobling.org>
http://nothingmuch.woobling.org 0xEBD27418

@p5pRT p5pRT commented Aug 21, 2008

From @nothingmuch

On Thu, Aug 21, 2008 at 14:31:40 +0300, Yuval Kogman wrote:

> Or a code point above 127 with use utf8 or use encoding

I should clarify that this is only in the context of the string
constants.

A code point above 127 will be treated as unicode if the string is
properly marked as such, and the way to achieve that for string
constants is 'use utf8'.

--
  Yuval Kogman <nothingmuch@woobling.org>
http://nothingmuch.woobling.org 0xEBD27418

@p5pRT p5pRT commented Sep 20, 2008

From @khwilliamson

I'm the person who submitted this bug report. I think this bug should
be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I
downloaded the Perl 5.10 source and hacked up an experimental version
that seems to fix it. And now I've joined this list to see how to
proceed. I don't know the protocol involved, so I'll just jump in, and
hopefully that will be all right.

To refresh your memory, the current implementation of perl on non-EBCDIC
machines is problematic for characters in the range 128-255 when no
locale is set.

The slides from the talk "Working around *the* Unicode bug" during
YAPC::Europe 2007 in Vienna:
http://juerd.nl/files/slides/2007yapceu/unicodesemantics.html
give more cases of problems than were in my bug report.

The crux of the problem is that on non-EBCDIC machines, in the absence
of locale, in order to have meaningful semantics, a character (or code
point) has to be stored in utf8, except in pattern matching the \h, \H,
\v and \V or any of the \p{} patterns. (This leads to an anomaly with
the no-break space which is considered to be horizontal space (\h), but
not space (\s).) (The characters also always have base semantics of
having an ordinal number, and also of being not-a-anything (meaning that
they all pattern match \W, \D, \S, [[:^punct:]], etc.))

Perl stores characters as utf8 automatically if a string contains any
code points above 255, and it is trivially true for ascii code points.
That leaves a hole-in-the-doughnut of characters between 128 and 255
with behavior that varies depending on whether they are stored as utf8
or not. This is contrary, for example, to the Camel book: "character
semantics are preserved at an abstract level regardless of
representation" (p.403). (How they get stored depends on how they were
input, or whether or not they are part of a longer string containing
code points larger than 255, or if they have been explicitly set by
using utf8::upgrade or utf8::downgrade.)
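
A small sketch of how the storage ends up being chosen, using `utf8::is_utf8` to inspect the internal flag. The variable names are illustrative, and the case-mapping results assume no locale and no `unicode_strings` feature.

```perl
use strict;
use warnings;

my $latin1 = "\x{e0}bc";            # every code point is below 256
my $wide   = $latin1 . "\x{101}";   # appending a code point above 255
my $tail   = substr($wide, 0, 3);   # same characters as $latin1 again
my $forced = $latin1;
utf8::upgrade($forced);             # representation changed explicitly

for my $case ([latin1 => $latin1], [wide => $wide],
              [tail   => $tail],   [forced => $forced]) {
    my ($name, $str) = @$case;
    printf "%-7s utf8 flag: %d\n", $name, utf8::is_utf8($str) ? 1 : 0;
}

# $tail and $latin1 hold identical characters but are upper-cased differently.
my $tail_equals_latin1 = ($tail eq $latin1) ? 1 : 0;
my $uc_latin1 = uc $latin1;   # first character stays U+00E0
my $uc_tail   = uc $tail;     # first character becomes U+00C0
printf "equal: %d, uc: U+%04X vs U+%04X\n",
    $tail_equals_latin1, ord($uc_latin1), ord($uc_tail);
```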

I know of three areas where this leads to problems.

The first is the pattern matching already alluded to. This is at least
documented (though somewhat confusingly). And one can use the \p{}
constructs to avoid the issue.

The second is case changing functions, like lcfirst() or \U in pattern
substitutions.

And the third is ignoring case in pattern matches.

There may be others which I haven't looked for yet. I think, for
example, that quotemeta() will escape all these characters, though I
don't believe that this causes a real problem.

One response I got to my bug report was that a lot of code depends on
things working the way they currently do. I'm wondering if that applies
to all three of the areas, or just the first?

Also, from reading the perl source, it appears to me that EBCDIC
machines may work differently (and more correctly to my way of thinking)
than Ascii-ish ones.

An idea I've had is to add a pragma like "use latin1", or maybe "use
locale unicode", or something else as a way of not breaking existing
application code.

Anyway, I'm hoping to get some sort of fix in for this. In my
experimental implementation (which currently doesn't change EBCDIC
handling), it is mostly just extending the existing definitions of ascii
semantics to include the 128..255 latin1 range. Code logic changes were
required only in the uc and ucfirst functions (to accommodate 3
characters which require special handling), and in the regular
expression compilation (to accommodate 2 characters which need special
handling). Obviously, in my ignorance, I may be missing things that
others can enlighten me on.

So I'd like to know how to proceed.

Karl Williamson

@p5pRT p5pRT commented Sep 20, 2008

From perl@nevcal.com

I applaud your willingness to dive in.

For compatibility reasons, as has been discussed on this list
previously, a pragma of some sort must be used to request the
incompatible enhancement (which you call a fix).

N.B. There are lots of discussions about it in the archive, some
recently, if you haven't found them, you should; if you find it hard to
find them, ask, and I (or someone) will try to find the starting points
for you, perhaps the summaries would be a good place to look to find the
discussions; I participated in most of them.

Those discussions are lengthy reading, unfortunately, but they do point
out an extensive list of issues, perhaps approaching completeness.

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

@p5pRT p5pRT commented Sep 21, 2008

From @andk

On Sat, 20 Sep 2008 16:52:02 -0600, karl williamson <contact@khwilliamson.com> said:

  > I'm the person who submitted this bug report. I think this bug should
  > be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I
  > downloaded the Perl 5.10 source and hacked up an experimental version
  > that seems to fix it. And now I've joined this list to see how to
  > proceed. I don't know the protocol involved, so I'll just jump in, and
  > hopefully that will be all right.

Thank you! As for the protocol: do not patch 5.10, patch bleadperl
instead.

--
andreas

@p5pRT p5pRT commented Sep 22, 2008

From @rgs

2008/9/21 karl williamson <contact@khwilliamson.com>:

> The crux of the problem is that on non-EBCDIC machines, in the absence
> of locale, in order to have meaningful semantics, a character (or code
> point) has to be stored in utf8, except in pattern matching the \h, \H, \v
> and \V or any of the \p{} patterns. (This leads to an anomaly with the
> no-break space which is considered to be horizontal space (\h), but not
> space (\s).) (The characters also always have base semantics of having an
> ordinal number, and also of being not-a-anything (meaning that they all
> pattern match \W, \D, \S, [[:^punct:]], etc.))
>
> Perl stores characters as utf8 automatically if a string contains any
> code points above 255, and it is trivially true for ascii code points.
> That leaves a hole-in-the-doughnut of characters between 128 and 255
> with behavior that varies depending on whether they are stored as utf8
> or not. This is contrary, for example, to the Camel book: "character
> semantics are preserved at an abstract level regardless of
> representation" (p.403). (How they get stored depends on how they were
> input, or whether or not they are part of a longer string containing
> code points larger than 255, or if they have been explicitly set by
> using utf8::upgrade or utf8::downgrade.)
>
> I know of three areas where this leads to problems.
>
> The first is the pattern matching already alluded to. This is at least
> documented (though somewhat confusingly). And one can use the \p{}
> constructs to avoid the issue.
>
> The second is case changing functions, like lcfirst() or \U in pattern
> substitutions.
>
> And the third is ignoring case in pattern matches.
>
> There may be others which I haven't looked for yet. I think, for
> example, that quotemeta() will escape all these characters, though I
> don't believe that this causes a real problem.

This is a good summary of the issues.

> One response I got to my bug report was that a lot of code depends on
> things working the way they currently do. I'm wondering if that applies
> to all three of the areas, or just the first?

In general, one finds that people write code relying on almost anything...

> Also, from reading the perl source, it appears to me that EBCDIC
> machines may work differently (and more correctly to my way of thinking)
> than Ascii-ish ones.

That's in theory probable, but we don't have testers on EBCDIC
machines these days...

> An idea I've had is to add a pragma like "use latin1", or maybe "use
> locale unicode", or something else as a way of not breaking existing
> application code.

I think that the current Unicode bugs are annoying enough to deserve
an incompatible change in perl 5.12.
However, for perl 5.10.x, something could be added to switch to a more
correct behaviour, if possible without slowing everything down...

> Anyway, I'm hoping to get some sort of fix in for this. In my
> experimental implementation (which currently doesn't change EBCDIC
> handling), it is mostly just extending the existing definitions of ascii
> semantics to include the 128..255 latin1 range. Code logic changes were
> required only in the uc and ucfirst functions (to accommodate 3
> characters which require special handling), and in the regular
> expression compilation (to accommodate 2 characters which need special
> handling). Obviously, in my ignorance, I may be missing things that
> others can enlighten me on.
>
> So I'd like to know how to proceed

If you're a git user, you can work on a branch cloned from
git://perl5.git.perl.org/perl.git

Do not hesitate to ask questions here.

@p5pRT p5pRT commented Sep 22, 2008

From @Juerd

Moritz Lenz wrote 2008-08-21 9:50 (+0200):

> This is a known bug, and probably not fixable, because too much code
> depends on it.

It is fixable, and the backwards incompatibility has already been
announced in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

It will be fixed, and it's wonderful to have a volunteer for that!
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>
1;

@p5pRT p5pRT commented Sep 22, 2008

From @Juerd

Dr.Ruud wrote 2008-08-21 13:22 (+0200):

> Unicode semantics get activated when a codepoint above 255 is involved.

No, unicode semantics get activated when the internal encoding of the
string is utf8, even if it contains no character above 255, and even if
it only contains ASCII characters.

It's a bug. A known and old bug, but it must be fixed some time.
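
A minimal sketch of that point, assuming no locale and no `unicode_strings` feature: the flag can be set on an ASCII-only string, and a string with no character above 255 changes its `\w` behavior once it is upgraded. The variable names are illustrative only.

```perl
use strict;
use warnings;

# An ASCII-only string can carry the internal UTF-8 flag.
my $ascii = "abc";
utf8::upgrade($ascii);
my $ascii_flagged = utf8::is_utf8($ascii) ? 1 : 0;

# "caf\x{e9}" contains no code point above 255 in either copy.
my $native   = "caf\x{e9}";
my $upgraded = "caf\x{e9}";
utf8::upgrade($upgraded);

my $word_native   = ($native   =~ /^\w+$/) ? 1 : 0;
my $word_upgraded = ($upgraded =~ /^\w+$/) ? 1 : 0;

printf "ascii flagged: %d\n", $ascii_flagged;
printf "\\w+ matches native: %d, upgraded: %d\n", $word_native, $word_upgraded;
```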
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>
1;

@p5pRT p5pRT commented Sep 22, 2008

From @Juerd

karl williamson wrote 2008-09-20 16:52 (-0600):

> One response I got to my bug report was that a lot of code depends on
> things working the way they currently do. I'm wondering if that applies
> to all three of the areas, or just the first?

All three, but rest assured that this has already been discussed in
great detail, and that the pumpking's decision was that backwards
incompatibility would be better than keeping the bug.

This decision is clearly reflected in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

Please proceed with fixing the bug. I am very happy with your offer to
smash this one.

> Also, from reading the perl source, it appears to me that EBCDIC
> machines may work differently (and more correctly to my way of thinking)
> than Ascii-ish ones.

As always, I refrain from thinking about EBCDIC. I'd say​: keep the
current behavior for EBCDIC platforms - there haven't been *any*
complaints from them as far as I've heard.

> An idea I've had is to add a pragma like "use latin1", or maybe "use
> locale unicode", or something else as a way of not breaking existing
> application code.

Please do break existing code, harsh as that may be. It is much more
likely that broken code magically starts working correctly, by the way.

Pragmas have problems, especially in regular expressions. And it's very
hard to load a pragma conditionally, which makes writing version
portable code hard. Besides that, any pragma affecting regex matches
needs to be carried in qr//, which in this case means new regex flags to
indicate the behavior for (?i:...). According to dmq, adding flags is
hard.

> Obviously, in my ignorance, I may be missing things that others can
> enlighten me on.

Please feel free to copy the unit tests in Unicode::Semantics!
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>
1;

@p5pRT p5pRT commented Sep 22, 2008

From @Juerd

Glenn Linderman wrote 2008-09-20 16:31 (-0700):

> For compatibility reasons, as has been discussed on this list
> previously, a pragma of some sort must be used to request the
> incompatible enhancement (which you call a fix).

As the current behavior is a bug, the enhancement can rightfully be
called a fix.

What's this about the pragma that "must be used"? Yes, it has been
discussed, but no consensus has pointed in that direction.

In fact, perl5100delta clearly announces backwards incompatibility.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>
1;

@p5pRT p5pRT commented Sep 22, 2008

From @druud62

Juerd Waalboer wrote:

> Dr.Ruud:
>
> > Unicode semantics get activated when a codepoint above 255 is
> > involved.
>
> No, unicode semantics get activated when the internal encoding of the
> string is utf8, even if it contains no character above 255, and even
> if it only contains ASCII characters.

Yes, Unicode semantics get activated when a codepoint above 255 is
involved.

Yes, there are other ways too, like:

perl -Mstrict -Mwarnings -Mencoding=utf8 -le'
  my $s = chr(65);
  print utf8::is_utf8($s);
'
1

--
Affijn, Ruud

"Gewoon is een tijger."

@p5pRT p5pRT commented Sep 23, 2008

From @ikegami

On Sat, Sep 20, 2008 at 6:52 PM, karl williamson
<contact@khwilliamson.com> wrote:

> There may be others which I haven't looked for yet. I think, for
> example, that quotemeta() will escape all these characters, though I
> don't believe that this causes a real problem.

There are inconsistencies with quotemeta (and therefore \Q)

perl -wle"utf8::downgrade( $x = chr(130) ); print quotemeta $x"

perl -wle"utf8::upgrade( $x = chr(130) ); print quotemeta $x"
é
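
The same inconsistency can be shown without depending on how the console renders byte 130, by comparing lengths instead of printing. This is a sketch under a few assumptions: it uses U+00E9 rather than chr(130), because, as far as the perlfunc entry for quotemeta describes it, Perl 5.16 and later quote control characters such as U+0082 in both representations; it also assumes no locale and no `unicode_strings` feature.

```perl
use strict;
use warnings;

# U+00E9 (e with acute), once stored as a byte and once as UTF-8.
my $down = chr(0xE9);
my $up   = chr(0xE9);
utf8::downgrade($down);
utf8::upgrade($up);

my $quoted_down = quotemeta $down;   # backslash added
my $quoted_up   = quotemeta $up;     # left alone

printf "downgraded: %d char(s) after quotemeta\n", length $quoted_down;
printf "upgraded:   %d char(s) after quotemeta\n", length $quoted_up;
```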

@p5pRT p5pRT commented Sep 23, 2008

From @iabyn

On Mon, Sep 22, 2008 at 09:55:23PM +0200, Juerd Waalboer wrote:

> It's a bug. A known and old bug, but it must be fixed some time.

Here's a general suggestion related to fixing Unicode-related issues.

A well-known issue is that the SVf_UTF8 flag means two different things:

  1) whether the 'sequence of integers' are stored one per byte, or use
  the variable-length utf-8 encoding scheme;

  2) what semantics apply to that sequence of integers.
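
Both meanings can be observed from Perl code. The sketch below is illustrative only and assumes no locale and no `unicode_strings` feature; `bytes::length` is used purely as a debugging aid to expose the storage.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the pragma

my $native   = "\x{e0}";
my $upgraded = "\x{e0}";
utf8::upgrade($upgraded);

# The same single character in both strings.
my @chars   = (length($native), length($upgraded));
# Meaning 1: how the integers are stored (one byte versus two).
my @storage = (bytes::length($native), bytes::length($upgraded));
# Meaning 2: which semantics apply to them.
my @upper   = (ord(uc $native), ord(uc $upgraded));

printf "characters: %d / %d\n", @chars;
printf "bytes:      %d / %d\n", @storage;
printf "uc:         U+%04X / U+%04X\n", @upper;
```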

We also have various bodges, such as attaching magic to cache utf8
indexes.

All this stems from the fact that there's no space in an SV to store all
the information we want. So....

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxiliary structure (cf the hv_aux struct) that contains all the
extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc. This then both allows us to disambiguate the meaning of
SVf_UTF8 (in the aux structure there would be two different flags for the
two meanings), but would also provide room for future enhancements (eg
space for a UTF32 flag should someone wish to implement that storage
format).

Just a thought...

--
"I do not resent criticism, even when, for the sake of emphasis,
it parts for the time with reality".
  -- Winston Churchill, House of Commons, 22nd Jan 1941.

@p5pRT p5pRT commented Sep 23, 2008

From @Juerd

Dave Mitchell wrote 2008-09-23 17:03 (+0100):

> How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
> Extended String flag. This flag indicates that prepended to the SvPVX
> string is an auxiliary structure (cf the hv_aux struct) that contains all the
> extra needed unicodish info, such as encoding, charset, locale, cached
> indexes etc etc.

It sounds rather complicated, whereas the current plan would be to
continue with the single bit flag, and only remove one of its meanings.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer​: Perl hacker <#####@​juerd.nl> <http​://juerd.nl/sig>
  Convolution​: ICT solutions and consultancy <sales@​convolution.nl>
1;

@p5pRT p5pRT commented Sep 23, 2008

From perl@nevcal.com

On approximately 9/23/2008 9:58 AM, came the following characters from
the keyboard of Juerd Waalboer:

> Dave Mitchell wrote 2008-09-23 17:03 (+0100):
>
> > How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
> > Extended String flag. This flag indicates that prepended to the SvPVX
> > string is an auxiliary structure (cf the hv_aux struct) that contains all the
> > extra needed unicodish info, such as encoding, charset, locale, cached
> > indexes etc etc.

It is not at all clear to me that encoding, charset, and locale are
Unicodish info... Unicode frees us from such stuff, except at boundary
conditions, where we must deal with devices or formats that have
limitations. This extra information seems more appropriately bound to
file/device handles than to strings.

Cached indexes are a nice performance help, I don't know enough about
the internals to know if reworking them from being done as magic, to
being done in some frightfully (in thinking of XS) new structure would
be an overall win or loss.

It sounds rather complicated, whereas the current plan would be to
continue with the single bit flag, and only remove one of its meanings.

I guess Juerd is referring to removing any semantic meaning of the flag,
and leaving it to simply be a representational flag? That
representational flag would indicate that the structure of the string is
single-byte oriented (no individual characters exceed a numeric value of
255), or multi-byte oriented (characters may exceed a numeric value of
255, and characters greater than a numeric value of 127 will be stored
in multiple, sequential bytes).
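
To make the purely representational reading of the flag concrete, here is a minimal sketch. It assumes only the core utf8::upgrade and utf8::is_utf8 functions; the variable names are illustrative and not taken from any patch under discussion.

```perl
use strict;
use warnings;

# One character, U+00E0, initially stored in the single-byte representation.
my $native = "\x{e0}";

# Copy it and switch the copy to the multi-byte (UTF-8) representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The internal flag differs between the two copies...
my $flag_native   = utf8::is_utf8($native)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# ...but the string value and its length in characters do not.
my $same_string = ($native eq $upgraded) ? 1 : 0;
my $char_length = length $upgraded;

# Only a peek at the internal storage shows the extra byte.
my $byte_length = do { use bytes; length $upgraded };

print "flag (native/upgraded): $flag_native/$flag_upgraded\n";
print "equal: $same_string, characters: $char_length, bytes: $byte_length\n";
```

If the flag were only representational, every string operation would treat $native and $upgraded identically, as eq and length already do here.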

After such a removal, present-perl would reach the idyllic state
(idyllic-perl) of implementing only Unicode semantics for all string
operations. (Even the EBCDIC port should reach that idyllic state,
although it would use a different encoding of numbers to characters,
UTF-EBCDIC instead of UTF-8.) If other encodings are desired/used,
there would be two application approaches to dealing with it​:

1) convert all other encodings to Unicode, perform semantic operations
as needed, convert the results to some other encoding. This is already
the recommended approach, although present-perl's attempt to retain the
single-byte oriented representational format as much as possible
presently makes this a bit tricky.

2) leave data in other encodings, but avoid the use of Perl operations
that apply Unicode semantics in manners that are inconsistent with the
semantics of the other encoding. Write specific code to implement the
semantics of the other encoding as needed, without doing the re-coding.
  This could be somewhat error prone, but could be achieved, since,
after all, strings are simply an ordered list of numbers, to which any
application semantics that are desired can be applied. Idyllic-perl
simply provides a fairly large collection of string operations that have
Unicode semantics, which are inappropriate for use with strings having
other semantics.

Note that binary data in strings is simply a special case of strings
with non-Unicode semantics... In present-perl, there are three sets of
string semantics selected by the representation, ASCII (operations like
character classes and case shifting), Latin-1 (the only operation that
supports Latin-1 semantics is the conversion from single-byte
representation to multi-byte representation), and Unicode (operations
like character classes and case shifting). It is already inappropriate
to apply operations that imply ASCII or Unicode semantics to binary
strings of either representation. Applying the representation
conversion operation to binary data is perfectly legal, and doesn't
change the binary values in any way... but is generally not a mental
shift that most programmers wish to make in dealing with binary
data--most prefer their binary data to remain in the single-byte
oriented representation, and they are welcome to code in such a manner
that they do.

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

@p5pRT p5pRT commented Sep 23, 2008

From @khwilliamson

Glenn,

The reason I called it a bug is that I, an experienced Perl programmer,
attempted to enhance an application to understand unicode. I read the
Camel book and the on-line documentation and came to a very different
expectation as to how it worked than it does in reality. I then thought
I was scouring the documentation when things went wrong, and still
didn't get it. It was only after a lot of experimentation and some
internet searches that I started to see the cause of the problem. I was
using 5.8.8; perhaps the documentation has changed in 5.10. And perhaps
my own expectations of how I thought it should work caused me to be
blind to things in the documentation that were contrary to my
preconceived notions.

Whatever one calls it, there does seem to be some support for changing
the behavior. After reading your response and further reflection, I
think that Goal #1 of not breaking old programs is contradictory to the
other ones. Indeed, a few regression tests fail with my experimental
implementation. Some of them are commented that they are taking
advantage of the anomaly to verify that the operation they test doesn't
change the utf8-ness of the data. Others explicitly are testing that,
for example, taking lc(E with an accent) returns itself unless an
appropriate locale is specified. I doubt that the code that test was
for really cares, but if so, why put in the test? There are a couple of
failures which are obtuse, and uncommented, so I haven't yet tried to
figure out what was going on. I wanted to see if I should proceed at
all before doing so.

I have looked in the archive and found some discussions about this
problem, but certainly not a lot. Please let me know of ones you think
important that I read.

Karl Williamson

Glenn Linderman wrote​:

For compatibility reasons, as has been discussed on this list
previously, a pragma of some sort must be used to request the
incompatible enhancement (which you call a fix).

N.B. There are lots of discussions about it in the archive, some
recently, if you haven't found them, you should; if you find it hard to
find them, ask, and I (or someone) will try to find the starting points
for you, perhaps the summaries would be a good place to look to find the
discussions; I participated in most of them.

Those discussions are lengthy reading, unfortunately, but they do point
out an extensive list of issues, perhaps approaching completeness.

@p5pRT p5pRT commented Sep 23, 2008

From perl@nevcal.com

On approximately 9/23/2008 10​:33 AM, came the following characters from
the keyboard of karl williamson​:

Glenn,

The reason I called it a bug is that I, an experienced Perl
programmer, attempted to enhance an application to understand
unicode. I read the Camel book and the on-line documentation and came
to a very different expectation as to how it worked than it does in
reality.

The behavior is non-obvious. I may be blind to the deficiencies of the
documentation, because of knowing roughly how it works, due to hanging
out on p5p too long :) It has been an open discussion about whether it
is working as designed (with lots of gotchas for the application
programmer), or whether the design, in fact, is the bug. It seems Rafael
has declared it to be a bug in the 5.10 release notes, and something
that can/should be incompatibly changed/fixed for 5.12, but I missed
that declaration. Any solution for 5.8.x or 5.10.x, though, would have to
be treated as an enhancement, turned on by a pragma, because the current
design, buggy or not, is the current design for which applications are
coded.

I then thought I was scouring the documentation when things went
wrong, and still didn't get it. It was only after a lot of
experimentation and some internet searches that I started to see the
cause of the problem. I was using 5.8.8; perhaps the documentation
has changed in 5.10. And perhaps my own expectations of how I thought
it should work caused me to be blind to things in the documentation
that were contrary to my preconceived notions.

The documentation has been in as much flux as the code, from 5.6.x to
5.8.x to 5.10.x. Unfortunately, there are enough warts in the design
that it is hard to find all the places where the documentation should be
clarified. My most recent message to p5p clarifies what I think is the
idyllic state that I hope is the one that you share, and will achieve
for 5.12 (or for a pragma-enabled 5.10) redesign/bug-fix.

Whatever one calls it, there does seem to be some support for changing
the behavior. After reading your response and further reflection, I
think that Goal #1 of not breaking old programs is contradictory to
the other ones.

Yes, there is a definite conflict between those goals, and from that
conflict arise many of the behaviours that are not expected by
reasonable programmers when designing their application code.

Indeed, a few regression tests fail with my experimental
implementation. Some of them are commented that they are taking
advantage of the anomaly to verify that the operation they test
doesn't change the utf8-ness of the data. Others explicitly are
testing that, for example, taking lc(E with an accent) returns itself
unless an appropriate locale is specified. I doubt that the code that
test was for really cares, but if so, why put in the test? There are
a couple of failures which are obtuse, and uncommented, so I haven't
yet tried to figure out what was going on. I wanted to see if I
should proceed at all before doing so.

Sure. Please proceed. Especially with Rafael's openness to
incompatible changes in this area for 5.12, it would be possible to
remove all of the warts, conflicts, and unexpected behaviours. Of
course, incompatible changes are always considered for major releases,
but not always accepted. But this area seems to have a green light.
The current situation is very painful, compared to other languages that
implement Unicode. The compatibility issue was very real, however, when
the original design was done, no doubt partly due to Perl's extensive
CPAN collection, particularly the XS part of that CPAN collection. Some
of that concern has been alleviated due to enhancements to the XS code
in the intervening years, although no doubt you may encounter bugs in
some of those enhancements, also.

I have looked in the archive and found some discussions about this
problem, but certainly not a lot. Please let me know of ones you
think important that I read.

The discussions are more lengthy (per post, and per number of posts),
than numerous (by thread count)... and contain more heat than light,
often. Perhaps you've found them all.

Given Rafael's green light, and if you are pointed at changes to Perl
5.12, the most important thing is to cover all the relevant operations,
so that all string operations apply Unicode semantics to all their
operands, regardless of their representational format.

Here is one thread​:

Subject​: "on the almost impossibility to write correct XS modules"
started by Marc Lehmann on April 25, 2008, and lasted with that subject
line until at least May 22! So almost a whole month!

demerphq spawned a related thread subject​: "On the problem of strings
and binary data in Perl." on May 20, 2008. This attempted to deal with
multi-lingual strings; there is more to the issue of proper handling of
multi-lingual strings than being able to represent all the characters
that each one uses, but that is a very specialized type of program; at
least being able to represent the characters is a good start; being able
to pass "language" as an operand to certain semantic operations would be
good (implicitly, via locale, or explicitly, via a parameter).

Another related issue is that various operations that attempt to
implement Unicode semantics don't go the whole way, and have interesting
semantics for when strings (even strings represented in multi-byte
format) don't actually contain Unicode. Idyllic-perl should have chr/ord
as simple ways to convert between numbers and characters, and not burden
them with any sort of Unicode semantics. See bug #51936, and the p5p
thread it spawned (search for the bug number in the archives). See also
bug #51710 and the threads it spawned, about utf8::valid. While
utf8::valid probably should be enhanced, its existence is probably
reasonable justification to not burden chr/ord with Unicode validity checks.

Let's not forget Pack & Unpack. There's one thread about that with
Subject​: Perl 5.8 and perl 5.10 differences on UTF/Pack things started
on June 18, 2008... and a much older one started by Marc Lehmann (not
sure what that subject line was, but it resulted in a fairly recent
change to Pack, by Marc).

Other related threads have the following subject lines​:
use encoding 'utf8' bug for Latin-1 range
proposed change in utf8 filename semantics
Compress​::Zlib, pack "C" and utf-8
Smack! (this spawned some other threads that left Smack! in their
subjects, but which were added to)
perl, the data, and the tf8 flag
the utf8 flag
encoding neutral unpack

The philosophy should be that no Perl operations should have different
semantics based on the representation of the string being single-byte or
multi-byte format. Operations and places to watch out for include (one
of these threads attempted a complete list of operations that had
different semantics, this is my memory of some of them)​:

String constant metacharacters such as \u \U \l \L
case shifting code such as uc & lc
regexp case insensitivity and character classes
chr/ord
utf8::valid

pack/unpack - packing should always produce a single-byte string, and
unpack should generally expect a single-byte string... but if, for some
reason, unpack is handed a multi-byte string, it should not pretend it
really should have been a single-byte string, but instead, it should
interpret the string as input characters. If there are any input
characters actually greater than 255, this should probably be considered
a bug, because pack doesn't produce such. Perhaps Marc's fix was the
last issue along that line for unpack...
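
A small sketch of the pack/unpack behaviour described above, assuming Perl 5.10 or later (where unpack works character by character); the variable names are illustrative only.

```perl
use strict;
use warnings;

# A byte-oriented template yields a string in the single-byte representation.
my $packed = pack("C2", 0xe0, 0xff);
my $packed_is_utf8 = utf8::is_utf8($packed) ? 1 : 0;

# Force the multi-byte representation without changing the characters.
my $upgraded = $packed;
utf8::upgrade($upgraded);

# unpack reads characters, so both representations give the same values.
my @from_native   = unpack("C*", $packed);
my @from_upgraded = unpack("C*", $upgraded);

print "native:   @from_native\n";
print "upgraded: @from_upgraded\n";
```

On 5.8.x the second unpack may instead see the underlying UTF-8 bytes, which is the kind of representation-dependent difference this thread is about.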

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

@p5pRT p5pRT commented Sep 26, 2008

From @rgs

2008/9/23 Dave Mitchell <davem@​iabyn.com>​:

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxiliary structure (cf the hv_aux struct) that contains all the
extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc.

I don't think we want to store the charset/locale with the string.

Consider the string "istanbul". If you're treating this string as
English, you'll capitalize it as "ISTANBUL", but if you want to follow
the Stambouliot spelling, it's "İSTANBUL".

Now consider the string "Consider the string "istanbul"". Shall we
capitalize it as "CONSİDER THE STRİNG "İSTANBUL"" ? Obviously
attaching a language to a string is going to be a problem when you
have to handle multi-language strings.

So the place that makes sense to provide this information is, in my
opinion, in lc and uc (and derivatives)​: in the code, not the data.
(So a pragma can be used, too.)
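
A minimal sketch of what "in the code, not the data" could look like. The helper uc_for_language and its language codes are hypothetical, not an existing Perl API, and it handles only the dotted-i rule from the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: the language is an argument to the operation,
# not a property stored with the string.
sub uc_for_language {
    my ($text, $lang) = @_;
    if ($lang eq 'tr') {
        # Turkish: small dotted i uppercases to U+0130
        # (LATIN CAPITAL LETTER I WITH DOT ABOVE).
        $text =~ s/i/\x{130}/g;
    }
    return uc $text;
}

# Show code points rather than printing wide characters directly.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $english = uc_for_language("istanbul", "en");
my $turkish = uc_for_language("istanbul", "tr");

print "en: ", code_points($english), "\n";
print "tr: ", code_points($turkish), "\n";
```

The same input string gives two different results depending only on the argument, so a multi-language string can be handled piece by piece without tagging the data.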

@p5pRT p5pRT commented Sep 26, 2008

From @khwilliamson

I have been studying some of the discussions in this group about this
problem, and find them overwhelming. So, I'm going to just put forth a
simple straw proposal that doesn't address a number of the things that
people were talking about, but does solve a lot of things.

This is a very concrete proposal, and I would like to get agreement on
the semantics involved​:
There will be a new mode of operation which will be enabled or
disabled by means yet to be decided. When enabled, the new behavior
will be that a character in a scalar or pattern will have the same
semantics whether or not it is stored as utf8. The operations that
are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and
pattern matching (including \U, \u, \L, and \l, and matching of
things like \w, [[​:punct​:]]). This is effectively what would happen
if we were operating under an iso-8859-1 locale with the following
modifications to get full unicode semantics​:
1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL
  LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in
  pattern substitutions. The result will be in utf8, since the
  capital letter is above 0xff.
2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for
  ucfirst, and \U and \u in pattern substitutions. The result
  will be in utf8, since the capital letter is above 0xff.
3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of
  LATIN CAPITAL LETTER S followed by itself; ie, 'SS'. Same for
  \U in pattern substitutions. The result will have the same utf8-ness
  as the original.
4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting
  of the two characters LATIN CAPITAL LETTER S followed by LATIN
  SMALL LETTER S; ie, 'Ss'. Same for \u in pattern substitutions.
  The result will have the same utf8-ness as the original.
5) If the MICRO SIGN is in a pattern with case ignored, it will
  match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL
  LETTER MU.
6) If the LATIN SMALL LETTER SHARP S is in a pattern with case
  ignored, it will match itself and any of 'SS', 'Ss', 'ss'.
7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with
  case ignored, it will match itself and LATIN CAPITAL LETTER Y
  WITH DIAERESIS
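
Items 1 to 4 can be previewed today by forcing the strings into the utf8 representation first, since that is when present-perl applies Unicode case rules. This is only a sketch under that assumption; the helper names upgraded and code_points are illustrative.

```perl
use strict;
use warnings;

# Return a copy of the string in the utf8 representation, so that
# uc() and ucfirst() use Unicode rules on it.
sub upgraded {
    my ($str) = @_;
    utf8::upgrade($str);
    return $str;
}

# Show code points rather than printing wide characters directly.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $uc_y_diaeresis  = code_points(uc(upgraded("\x{ff}")));  # item 1
my $uc_micro        = code_points(uc(upgraded("\x{b5}")));  # item 2
my $uc_sharp_s      = uc(upgraded("\x{df}"));               # item 3
my $ucfirst_sharp_s = ucfirst(upgraded("\x{df}"));          # item 4

print "uc(y with diaeresis) = $uc_y_diaeresis\n";
print "uc(micro sign)       = $uc_micro\n";
print "uc(sharp s)          = $uc_sharp_s\n";
print "ucfirst(sharp s)     = $ucfirst_sharp_s\n";
```

Under the proposed mode the upgraded() step would be unnecessary: the same results would come back for the single-byte strings as well.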

This mode would not impose a compile-time latin1-like locale on the
perl program. For example, whether perl identifiers could
have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would
not be affected by this mode.

I do not propose to automatically convert ("downgrade") strings from
utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL
LETTER Y WITH DIAERESIS) would return a string still in utf8-encoding.

I don't know what to do about EBCDIC machines. I propose leaving
them to work the way they currently do.

I don't know what to do about interacting with "use bytes". One
option is for them to be mutually incompatible, that is, if you
turn one on, it turns the other off. Another option is if both
are in effect that it would be exactly the same as if a latin1
run-time locale was set, without any of the modifications listed
above.

Are there other interactions that we need to worry about?

I would like to defer how this mode gets enabled or disabled until we
agree on the semantics of what happens when it is enabled.

I think that a number of the issues that have been raised in the past
are in some way independent of this proposal. We may want to do some of
them, but should we do at least this much, or not?

@p5pRT p5pRT commented Sep 26, 2008

From vadim@vkonovalov.ru

karl williamson wrote​:

I have been studying some of the discussions in this group about this
problem, and find them overwhelming. So, I'm going to just put forth
a simple straw proposal that doesn't address a number of the things
that people were talking about, but does solve a lot of things.

This is a very concrete proposal, and I would like to get agreement on
the semantics involved​:
There will be a new mode of operation which will be enabled or
disabled by means yet to be decided. When enabled, the new behavior
will be that a character in a scalar or pattern will have the same
semantics whether or not it is stored as utf8. The operations that
are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and
pattern matching (including \U, \u, \L, and \l, and matching of
things like \w, [[​:punct​:]]). This is effectively what would happen
if we were operating under an iso-8859-1 locale

What exactly does "under an iso-8859-1 locale" mean?

reading perllocale gives me​:

USING LOCALES
  The use locale pragma

  By default, Perl ignores the current locale. The "use locale" pragma
  tells Perl to use the current locale for some operations:

Do I understand correctly that your proposal will never touch me
provided that I never do "use locale;"?
You do not mean a POSIX locale, do you?

Do I remember correctly that using locales is not recommended in Perl?

with the following
modifications to get full unicode semantics​:
1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL
LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in
pattern substitutions. The result will be in utf8, since the
capital letter is above 0xff.

Could you please be more precise with uc(blablabal)?

What you currently wrote is a syntax error.

....

Best regards,
Vadim.

@p5pRT p5pRT commented Sep 26, 2008

From @khwilliamson

What I meant is not a literal locale, but that the semantics would be
the same as iso-8859-1 characters but with the listed modifications. I
was trying to avoid listing all the 8859-1 semantics. But in brief,
there are 128 characters above ascii in 8859-1, and they each have
semantics. 0xC0 for example is a latin capital letter A with a grave
accent. Its lower case is 0xE0. If you are on a Un*x like system, you
can type 'man latin1' at a command line prompt to get the entire list.
It doesn't however say which things are punctuation, which are word
characters, etc. But they are the same in Unicode, so the Unicode
standard lists all of them. Characters that are listed in the man page
that are marked as capital all have corresponding lower case versions
that are easy to figure out by their names. The three characters I
mentioned as modifications to get unicode are considered lower case and
have either multiple character upper case versions, or their upper case
version is not in latin1.

My proposal would touch you UNLESS you do have a 'use locale'. Your
locale would override my proposal. In other words, by specifying "use
locale", my proposal would not touch your program. The documentation
does say not to use locales, but in looking at the code, it appears to
me that a locale takes precedence, and does work ok. I believe that you
can get many of the Perl glitches to go away by having a locale which
specifies iso-8859-1. But I haven't actually tried it.

@p5pRT p5pRT commented Sep 26, 2008

From perl@nevcal.com

On approximately 9/26/2008 11​:44 AM, came the following characters from
the keyboard of karl williamson​:

I have been studying some of the discussions in this group about this
problem, and find them overwhelming. So, I'm going to just put forth a
simple straw proposal that doesn't address a number of the things that
people were talking about, but does solve a lot of things.

Yeah, I gave you a lot of reading material. I hoped not to scare you
off, but I didn't want you to be ignorant of the previous discussions, do a
bunch of work that only solved part of the problems, and have it
rejected because it wasn't a complete solution.

This is a very concrete proposal, and I would like to get agreement on
the semantics involved​:
There will be a new mode of operation which will be enabled or
disabled by means yet to be decided.

This makes it sound like you are targeting 5.10.x; since you are talking
about modes of operation. On the other hand, if the implementation
isn't significantly more complex than current code, keeping the current
behavior might be a safe approach, even if somewhere, somehow, the new
behavior decides to become the default behavior.

When enabled, the new behavior
will be that a character in a scalar or pattern will have the same
semantics whether or not it is stored as utf8. The operations that
are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and
pattern matching (including \U, \u, \L, and \l, and matching of
things like \w, [[​:punct​:]]).

This sounds like it might be a complete list of operations. I think \u,
\U, \l, and \L are more string interpolation operators rather than
pattern matching operators, but that is just terminology.

This is effectively what would happen
if we were operating under an iso-8859-1 locale with the following
modifications to get full unicode semantics​:
1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL
LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in
pattern substitutions. The result will be in utf8, since the
capital letter is above 0xff.
2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for
ucfirst, and \U and \u in pattern substitutions. The result
will be in utf8, since the capital letter is above 0xff.
3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of
LATIN CAPITAL LETTER S followed by itself; ie, 'SS'. Same for
\U in pattern substitutions. The result will have the utf8-ness
as the original.
4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting
of the two characters LATIN CAPITAL LETTER S followed by LATIN
SMALL LETTER S; ie, 'Ss'. Same for \u in pattern substitutions.
The result will have the utf8-ness as the original.
5) If the MICRO SIGN is in a pattern with case ignored, it will
match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL
LETTER MU.
6) If the LATIN SMALL LETTER SHARP S is in a pattern with case
ignored, it will match itself and any of 'SS', 'Ss', 'ss'.
7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with
case ignored, it will match itself and LATIN CAPITAL LETTER Y
WITH DIAERESIS

This mode would not impose a compile-time latin1-like locale on
the perl program. For example, whether perl identifiers could
have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would
not be affected by this mode

These all sound like appropriate behaviors to implement for a Unicode
semantics mode. However, I wouldn't know (or particularly care) if it
is a complete list of differences between Latin-1 and Unicode semantics. I'm
not at all interested in Latin-1 semantics. Today, the operators you
list all have ASCII semantics, most everyone seems to agree that Unicode
semantics would be preferred. Latin-1 semantics are only used in
upgrade/downgrade operations, at present. (Unless someone says "use
locale;", and, as you say, using locales is not recommended.)

I do not propose to automatically convert ("downgrade") strings from
utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL
LETTER Y WITH DIAERESIS) would return a string still in utf8-encoding

Fine. All else being equal (utf8 just being a representation) it
shouldn't make any difference.

I don't know what to do about EBCDIC machines. I propose leaving
them to work the way they currently do.

Best effort non-breakage seems to be the best we can currently expect...

I don't know what to do about interacting with "use bytes". One
option is for them to be mutually incompatible, that is, if you
turn one on, it turns the other off. Another option is if both
are in effect that it would be exactly the same as if a latin1
run-time locale was set, without any of the modifications listed
above.

Another possibility would be that all the above listed operations would
be noops or produce errors, because they all imply Unicode character
semantics, whereas use bytes declares that the data is binary.

"\U\x45\x23\x37" should just be "\x45\x23\x37" for example of a noop.

Are there other interactions that we need to worry about?

Probably. Every XS writer under the sun has assumed different things
about utf8 flag semantics, I'm sure. So you should worry about handling
the flak.

I would like to defer how this mode gets enabled or disabled until we
agree on the semantics of what happens when it is enabled.

Sure, but if you target 5.10.x you need some way of enabling or
disabling. If you target 5.12, enabling may happen because it is 5.12.

I think that a number of the issues that have been raised in the past
are in some way independent of this proposal. We may want to do some of
them, but should we do at least this much, or not?

It might be nice to recap anything that isn't being addressed, at least
in general terms, so that someone doesn't "remember" it at the last
minute, and claim that your proposal is worthless without a solution in
that area.

Unicode filename handling, especially on Windows, might be a contentious
point, as it is also basically broken. In fact, once Perl has Unicode
semantics for all strings, then it would be basically appropriate for
the Windows port to start using the "wide" UTF-16 APIs, instead of the
"byte" APIs for all OS API calls. This might be a fair-size bullet
to chew on, but it would be extremely useful; today, it is extremely
difficult to write multilingual programs using Perl on Windows, and the
biggest culprit is the use of the 8-bit APIs, with the _UNICODE (I
think) define not being turned on when compiling perl and extensions.
Enough that I have had to learn Python for a recent project. In large
part, one could claim that this is a Windows port issue, not a core perl
issue, of course... there is no reason that the Windows port couldn't
have already started using wide APIs, even with the limited Unicode
support in perl proper... everyone (here at least) knows the kludges to
use to get perl proper to use Unicode consistently enough to get work
done, but the I/O boundary on Windows is a real problem.

You'll need to give this proposal a week or so of discussion time before
you can be sure that everyone that cares has commented, or longer
(perhaps much longer) if there is dissension. However, I think a lot of
the dissension has been beaten out in earlier discussions, so perhaps
the time is ripe that a fresh voice with motivation to make some fixes,
can actually make progress on this topic.

--
Glenn -- http​://nevcal.com/

A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

@p5pRT p5pRT commented Sep 26, 2008

From @Juerd

Hello Karl,

I strongly agree with your proposed solutions. (I'm ambivalent only
about the 4th​: ucfirst "ß".)

Thank you for the summary.

karl williamson skribis 2008-09-26 12​:44 (-0600)​:

1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL
LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in
pattern substitutions. The result will be in utf8, since the
capital letter is above 0xff.

"in utf8" is ambiguous. It can mean either length(uc($y_umlaut)) == 2 or
is_utf8(uc($y_umlaut)). The former would be wrong, the latter would be
correct.

May I suggest including the words "upgrade" and "internal"?

  The resulting string will be upgraded to utf8 internally, ...
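
The difference between the two readings can be checked directly. The following is only a minimal sketch: it assumes the string is upgraded by hand with utf8::upgrade (so that Unicode case mapping applies on perls of this era), and the variable names are illustrative. It reports the character count, the code point, and the internal flag of the result.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER Y WITH DIAERESIS, stored as a single byte at first.
my $y_umlaut = "\xff";

# Upgrading changes only the internal representation, not the characters.
utf8::upgrade($y_umlaut);

# An upgraded string gets Unicode case mapping, so this should be U+0178.
my $upper = uc($y_umlaut);

printf "length:  %d\n", length($upper);        # counted in characters
printf "ord:     U+%04X\n", ord($upper);
printf "is_utf8: %s\n", utf8::is_utf8($upper) ? "yes" : "no";
```

The point is that length() counts characters, so the "correct" reading is one character whose internal storage happens to be utf8.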

I don't know what to do about interacting with "use bytes". One
option is for them to be mutually incompatible, that is, if you
turn one on, it turns the other off.
(...)
I would like to defer how this mode gets enabled or disabled until we
agree on the semantics of what happens when it is enabled.

Turning your solutions on explicitly is probably wrong, at least for
5.12.

Using a pragma is problematic because of qr//, and because it cannot be
enabled conditionally (in any reasonably easy way).

I'd prefer to skip any discussion about how to enable or disable this -
enable it by default and don't provide any way to disable it.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
  Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;


@p5pRT p5pRT commented Sep 27, 2008

From @druud62

karl williamson wrote:

I would like to defer how this mode gets enabled or disabled
until we agree on the semantics of what happens when it is
enabled.

  use kurila; # ;-)

--
Affijn, Ruud

"Gewoon is een tijger."


@p5pRT p5pRT commented Sep 27, 2008

From vadim@vkonovalov.ru

Dr.Ruud wrote:

karl williamson wrote:

I would like to defer how this mode gets enabled or disabled
until we agree on the semantics of what happens when it is
enabled.

use kurila; # ;-)

kurila is so largely incompatible that it is even off-topic!

(Initially I thought it was on-topic, but then responders convinced me
it isn't, and looking at the direction it is going, it really is not
on-topic on p5p.)

BR,
Vadim.


@p5pRT p5pRT commented Oct 7, 2008

From @khwilliamson

My proposal from a week and a half ago hasn't spawned much
dissension--yet. I'll take that as a good sign, and proceed.

Here's a hodge-podge of my thoughts about it, but most important, I am
concerned about the enabling and disabling of this. I think there has
to be some way to disable it in case current code has come to rely on
what I call broken behavior.

It looks like in 5.12, Rafael wants the new mode to be default behavior.
  But he also said that a switch could be added in 5.10.x to turn it on,
as long as performance doesn't suffer.

Glenn, "use bytes" doesn't necessarily mean binary. For example,

  use bytes;
  print lc('A'), "\n";

prints 'a'. It does mean ASCII semantics even for utf8::upgraded strings.
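
A slightly fuller sketch of the same point, with illustrative variable names and assuming no locale is in effect: inside the lexical scope of "use bytes", an ASCII letter is still case-mapped, while a byte such as 0xC0 is left alone.

```perl
use strict;
use warnings;

my ($ascii, $accented);
{
    use bytes;                  # lexically scoped to this block
    $ascii    = lc('A');        # ASCII letters are still case-mapped
    $accented = lc("\xC0");     # byte 0xC0 is not an ASCII letter
}

printf "lc('A')  -> %s\n", $ascii;
printf "lc(0xC0) -> 0x%02X\n", ord($accented);
```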

If there is a way to en/dis-able this mode, doesn't that have to be a
pragma? Doesn't it have to be lexically scoped? And if the answers to
these are yes, what do we do with things that are created under one mode
and then executed in the other?

Juerd wrote:

Pragmas have problems, especially in regular expressions. And it's very
hard to load a pragma conditionally, which makes writing version
portable code hard. Besides that, any pragma affecting regex matches
needs to be carried in qr//, which in this case means new regex flags to
indicate the behavior for (?i:...). According to dmq, adding flags is
hard.

I don't understand what you mean that pragmas have problems, esp in
re's. Please explain.

I had thought I had this solved for qr//i. The way I was planning to
implement this for pattern matching is quite simple. First, by changing
the existing fold table definitions to include the Unicode semantics,
the pattern matching magically starts working without any code logic
changes for all but two characters: the German sharp s (ß) and the micro
sign. For these, I was planning to use the existing mechanisms to
compile the re as utf8, so it wouldn't require any new flags. Thus qr//
would be utf8 if it contained either of these two characters. And it works today
to match such a pattern against both non-utf8 and utf8 strings. I
haven't tested to see what happens when such a pattern is executed under
use bytes. I was presuming it did something reasonable. But now I'm
not so sure, as I've found a number of bugs in the re code in my
testing, and some are of a nature that I don't feel comfortable with my
level of knowledge about how it works to dive in and fix them. They
should be fixed anyway, and I'm hoping some expert will undertake that.
  I think that once they're fixed, I could extend them to work in
the latin1 range quite easily. So the bottom line is that qr//i may or
may not be a problem.
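
To make the qr//i case concrete, here is a hedged sketch of the inconsistency as seen from Perl code. It assumes a perl new enough to accept the (?u:...) modifier (5.14 or later, which is not what was available when this was written), and the variable names are illustrative. The same character is matched case-insensitively three ways: as a native byte string, as an upgraded string, and with Unicode rules requested in the pattern.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the legacy default is in effect

my $native   = "\xe0";          # a with grave accent, one byte, not utf8
my $upgraded = "\xe0";
utf8::upgrade($upgraded);       # same character, utf8 internally

# Legacy rules: a native target and a native pattern fold ASCII only.
my $native_hit   = ($native   =~ /\xc0/i)      ? 1 : 0;

# A utf8 target brings in Unicode folding even though the pattern is native.
my $upgraded_hit = ($upgraded =~ /\xc0/i)      ? 1 : 0;

# Asking for Unicode rules in the pattern makes the native target match too.
my $forced_hit   = ($native   =~ /(?u:\xc0)/i) ? 1 : 0;

print "native: $native_hit, upgraded: $upgraded_hit, forced: $forced_hit\n";
```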

For the other interactions, I'm not sure there is a problem. If one
creates a string whether or not this mechanism is on, it remains 8 bits,
unless it has a code point above 255. If one operates on it while this
mechanism is on, it gets unicode semantics, which in a few cases
irretrievably convert it to utf8 because the result is above 255. If
one operates on it while this mechanism is off, you get ASCII semantics.
  I don't really see a problem with that.
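
As an illustration of that on/off behavior, the sketch below uses the "unicode_strings" feature, which is the lexically scoped switch that shows up later in this thread's test patches. Treat it as an assumption about how the mode is spelled (it needs perl 5.12 or later), not as something settled at this point in the discussion.

```perl
use strict;
use warnings;
no feature 'unicode_strings';        # make the starting state explicit

my $s = "\xe0";                      # a with grave accent, stored as one byte

my $off = uc($s);                    # mode off: only ASCII letters change case
my ($on, $wide);
{
    use feature 'unicode_strings';   # lexically scoped, needs perl 5.12+
    $on   = uc($s);                  # mode on: U+00E0 maps to U+00C0
    $wide = uc("\xff");              # the mapping lands above 0xFF
}

printf "off:  U+%04X\n", ord($off);
printf "on:   U+%04X\n", ord($on);
printf "wide: U+%04X (utf8 internally: %s)\n",
    ord($wide), utf8::is_utf8($wide) ? "yes" : "no";
```

The string created outside the block is untouched by the mode; only the operation performed inside the block sees Unicode semantics.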

I think it would be easy to extend this to EBCDIC, at least the three
encodings perl has compiled-in tables for. The problem is that Rafael
said that there's no one testing on EBCDIC machines, so I couldn't know
if it worked or not before releasing it.

I'm also thinking that the Windows file name problems can be considered
independent of this, and addressed at a later time.

I also agree with Glenn's and Juerd's wording changes.

I saw nothing in my reading of the code that would lead me to touch the
utf8 flag's meaning. But I am finding weird bugs in which Perl
apparently gets mixed up about the flag. These vanish if I rearrange
the order of supposedly independent lines in the program. It looks like
it could be a wild write. I wrote a bug report [perl #59378], but I
think that the description of that is wrong.

So the bottom line for now is that I'd like to get some consensus about
how to turn it on and off (and whether to; I think the answer is that
there has to be a way to turn it off). I guess I would claim that in 5.12,
"use bytes" could be used to turn it off. But that may be
controversial, and doesn't address backporting it.


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0006-regcomp.c-Use-latin1-folding-in-synthetic-start-cla.patch
From be110e8fe5a08bd964d4bb091aef4daa3212950b Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 19:00:00 -0700
Subject: [PATCH] regcomp.c: Use latin1 folding in synthetic start class

This is because the pattern may not specify unicode semantics, but if
the target matching string is in utf8, then unicode semantics may be
needed nonetheless.  So to avoid the regexec optimizer rejecting the
match, we need to allow for a possible false positive.
---
 regcomp.c |   34 +++++++++++++++++++---------------
 1 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 79623d2..392b075 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3073,11 +3073,18 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		/* Check whether it is compatible with what we know already! */
 		int compat = 1;
 
+
+		/* If compatibile, we or it in below.  It is compatible if is
+		 * in the bitmp and either 1) its bit or its fold is set, or 2)
+		 * it's for a locale.  Even if there isn't unicode semantics
+		 * here, at runtime there may be because of matching against a
+		 * utf8 string, so accept a possible false positive for
+		 * latin1-range folds */
 		if (uc >= 0x100 ||
 		    (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
 		    && !ANYOF_BITMAP_TEST(data->start_class, uc)
 		    && (!(data->start_class->flags & ANYOF_FOLD)
-			|| !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
+			|| !ANYOF_BITMAP_TEST(data->start_class, PL_fold_latin1[uc])))
                     )
 		    compat = 0;
 		ANYOF_CLASS_ZERO(data->start_class);
@@ -3119,12 +3126,13 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 	    if (flags & SCF_DO_STCLASS_AND) {
 		/* Check whether it is compatible with what we know already! */
 		int compat = 1;
-
 		if (uc >= 0x100 ||
-		    (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
-		    && !ANYOF_BITMAP_TEST(data->start_class, uc)
-		     && !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
+		 (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
+		  && !ANYOF_BITMAP_TEST(data->start_class, uc)
+		  && !ANYOF_BITMAP_TEST(data->start_class, PL_fold_latin1[uc])))
+		{
 		    compat = 0;
+		}
 		ANYOF_CLASS_ZERO(data->start_class);
 		ANYOF_BITMAP_ZERO(data->start_class);
 		if (compat) {
@@ -3136,13 +3144,11 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
 		    }
 		    else {
 
-			/* Also set the other member of the fold pair.  Can't
-			 * do this for locale, because not known until runtime
-			 */
-			ANYOF_BITMAP_SET(data->start_class,
-					 (OP(scan) == EXACTFU)
-						    ? PL_fold_latin1[uc]
-						    : PL_fold[uc]);
+			/* Also set the other member of the fold pair.  In case
+			 * that unicode semantics is called for at runtime, use
+			 * the full latin1 fold.  (Can't do this for locale,
+			 * because not known until runtime */
+			ANYOF_BITMAP_SET(data->start_class, PL_fold_latin1[uc]);
 		    }
 		}
 	    }
@@ -3158,9 +3164,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
                              * can't do that in locale because not known until
                              * run-time */
                             ANYOF_BITMAP_SET(data->start_class,
-                                            (OP(scan) == EXACTFU)
-                                                        ? PL_fold_latin1[uc]
-                                                        : PL_fold[uc]);
+					     PL_fold_latin1[uc]);
                         }
 		    }
 		    data->start_class->flags &= ~ANYOF_EOS;
-- 
1.5.6.3


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0007-regcomp.sym-update-comment.patch
From 444f010a3c52b735e4bdd29220cb10b3f384bc18 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:38:09 -0700
Subject: [PATCH] regcomp.sym: update comment

---
 regcomp.sym |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/regcomp.sym b/regcomp.sym
index a85d33f..ab57929 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -145,7 +145,7 @@ RENUM       BRANCHJ,    off 1 . 1 ; Group with independently numbered parens.
 # inline charclass data (ascii only), the 'C' store it in the structure.
 # NOTE: the relative order of the TRIE-like regops  is signifigant
 
-TRIE        TRIE,       trie 1    ; Match many EXACT(FL?)? at once. flags==type
+TRIE        TRIE,       trie 1    ; Match many EXACT(F[LU]?)? at once. flags==type
 TRIEC       TRIE,trie charclass   ; Same as TRIE, but with embedded charclass data
 
 # For start classes, contains an added fail table.
-- 
1.5.6.3


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0008-regcomp.sym-Add-REFFU-and-NREFFU-nodes.patch
From 76bd258db2ca18264a7ee18f0655a55a47ce5cb5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:39:16 -0700
Subject: [PATCH] regcomp.sym: Add REFFU and NREFFU nodes

These will be used for matching capture buffers case-insensitively using
Unicode semantics.

make regen will regenerate the delivered regnodes.h
---
 regcomp.sym |    7 +++++++
 regnodes.h  |   29 ++++++++++++++++++++---------
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/regcomp.sym b/regcomp.sym
index ab57929..4e787a7 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -194,6 +194,13 @@ NHORIZWS    NHORIZWS,   none 0 S  ; not horizontal whitespace   (Perl 6)
 FOLDCHAR    FOLDCHAR,   codepoint 1 ; codepoint with tricky case folding properties.
 EXACTFU     EXACT,      str	    ; Match this string, folded, Unicode semantics for non-utf8 (prec. by length).
 
+# These could have been implemented using the FLAGS field of the regnode, but
+# by having a separate node type, we can use the existing switch statement to
+# avoid some tests
+REFFU       REF,        num 1 V   ; Match already matched string, folded using unicode semantics for non-utf8
+NREFFU       REF,        num 1 V   ; Match already matched string, folded using unicode semantics for non-utf8
+
+
 # NEW STUFF ABOVE THIS LINE  
 
 ################################################################################
diff --git a/regnodes.h b/regnodes.h
index 97ac607..09ab661 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -6,8 +6,8 @@
 
 /* Regops and State definitions */
 
-#define REGNODE_MAX           	91
-#define REGMATCH_STATE_MAX    	131
+#define REGNODE_MAX           	93
+#define REGMATCH_STATE_MAX    	133
 
 #define	END                   	0	/* 0000 End of program. */
 #define	SUCCEED               	1	/* 0x01 Return from a subroutine, basically. */
@@ -70,7 +70,7 @@
 #define	MINMOD                	58	/* 0x3a Next operator is not greedy. */
 #define	LOGICAL               	59	/* 0x3b Next opcode should set the flag only. */
 #define	RENUM                 	60	/* 0x3c Group with independently numbered parens. */
-#define	TRIE                  	61	/* 0x3d Match many EXACT(FL?)? at once. flags==type */
+#define	TRIE                  	61	/* 0x3d Match many EXACT(F[LU]?)? at once. flags==type */
 #define	TRIEC                 	62	/* 0x3e Same as TRIE, but with embedded charclass data */
 #define	AHOCORASICK           	63	/* 0x3f Aho Corasick stclass. flags==type */
 #define	AHOCORASICKC          	64	/* 0x40 Same as AHOCORASICK, but with embedded charclass data */
@@ -99,8 +99,10 @@
 #define	NHORIZWS              	87	/* 0x57 not horizontal whitespace   (Perl 6) */
 #define	FOLDCHAR              	88	/* 0x58 codepoint with tricky case folding properties. */
 #define	EXACTFU               	89	/* 0x59 Match this string, folded, Unicode semantics for non-utf8 (prec. by length). */
-#define	OPTIMIZED             	90	/* 0x5a Placeholder for dump. */
-#define	PSEUDO                	91	/* 0x5b Pseudo opcode for internal use. */
+#define	REFFU                 	90	/* 0x5a Match already matched string, folded using unicode semantics for non-utf8 */
+#define	NREFFU                	91	/* 0x5b Match already matched string, folded using unicode semantics for non-utf8 */
+#define	OPTIMIZED             	92	/* 0x5c Placeholder for dump. */
+#define	PSEUDO                	93	/* 0x5d Pseudo opcode for internal use. */
 	/* ------------ States ------------- */
 #define	TRIE_next             	(REGNODE_MAX + 1)	/* state for TRIE */
 #define	TRIE_next_fail        	(REGNODE_MAX + 2)	/* state for TRIE */
@@ -239,6 +241,8 @@ EXTCONST U8 PL_regkind[] = {
 	NHORIZWS, 	/* NHORIZWS               */
 	FOLDCHAR, 	/* FOLDCHAR               */
 	EXACT,    	/* EXACTFU                */
+	REF,      	/* REFFU                  */
+	REF,      	/* NREFFU                 */
 	NOTHING,  	/* OPTIMIZED              */
 	PSEUDO,   	/* PSEUDO                 */
 	/* ------------ States ------------- */
@@ -379,6 +383,8 @@ static const U8 regarglen[] = {
 	0,                                   	/* NHORIZWS     */
 	EXTRA_SIZE(struct regnode_1),        	/* FOLDCHAR     */
 	0,                                   	/* EXACTFU      */
+	EXTRA_SIZE(struct regnode_1),        	/* REFFU        */
+	EXTRA_SIZE(struct regnode_1),        	/* NREFFU       */
 	0,                                   	/* OPTIMIZED    */
 	0,                                   	/* PSEUDO       */
 };
@@ -476,6 +482,8 @@ static const char reg_off_by_arg[] = {
 	0,	/* NHORIZWS     */
 	0,	/* FOLDCHAR     */
 	0,	/* EXACTFU      */
+	0,	/* REFFU        */
+	0,	/* NREFFU       */
 	0,	/* OPTIMIZED    */
 	0,	/* PSEUDO       */
 };
@@ -578,8 +586,10 @@ EXTCONST char * const PL_reg_name[] = {
 	"NHORIZWS",              	/* 0x57 */
 	"FOLDCHAR",              	/* 0x58 */
 	"EXACTFU",               	/* 0x59 */
-	"OPTIMIZED",             	/* 0x5a */
-	"PSEUDO",                	/* 0x5b */
+	"REFFU",                 	/* 0x5a */
+	"NREFFU",                	/* 0x5b */
+	"OPTIMIZED",             	/* 0x5c */
+	"PSEUDO",                	/* 0x5d */
 	/* ------------ States ------------- */
 	"TRIE_next",             	/* REGNODE_MAX +0x01 */
 	"TRIE_next_fail",        	/* REGNODE_MAX +0x02 */
@@ -674,7 +684,8 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__;
 #else
 EXTCONST U8 PL_varies[] __attribute__deprecated__ = {
     CLUMP, BRANCH, BACK, STAR, PLUS, CURLY, CURLYN, CURLYM, CURLYX, WHILEM,
-    REF, REFF, REFFL, SUSPEND, IFTHEN, BRANCHJ, NREF, NREFF, NREFFL,
+    REF, REFF, REFFL, SUSPEND, IFTHEN, BRANCHJ, NREF, NREFF, NREFFL, REFFU,
+    NREFFU,
     0
 };
 #endif /* DOINIT */
@@ -683,7 +694,7 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__ = {
 EXTCONST U8 PL_varies_bitmask[];
 #else
 EXTCONST U8 PL_varies_bitmask[] = {
-    0x00, 0x00, 0x00, 0xC0, 0xC1, 0x9F, 0x33, 0x01, 0x38, 0x00, 0x00, 0x00
+    0x00, 0x00, 0x00, 0xC0, 0xC1, 0x9F, 0x33, 0x01, 0x38, 0x00, 0x00, 0x0C
 };
 #endif /* DOINIT */
 
-- 
1.5.6.3


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0009-re-fold_grind.t-Refactor-to-test-utf8-patterns.patch
From 59b2b252ef94dc19789543fd6664953e3ae2a671 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:49:20 -0700
Subject: [PATCH] re/fold_grind.t: Refactor to test utf8 patterns.

The previous version wasn't really testing utf8 patterns.
---
 t/re/fold_grind.t |   25 ++++++++++++++-----------
 1 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index 13fdd3c..fd69cdb 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -13,6 +13,7 @@ BEGIN {
 
 use strict;
 use warnings;
+use Encode;
 
 # Tests both unicode and not, so make sure not implicitly testing unicode
 no feature 'unicode_strings';
@@ -238,7 +239,8 @@ foreach my $test (sort { numerically } keys %tests) {
     #diag $progress;
 
     # Now grind out tests, using various combinations.
-    foreach my $uni_semantics ("", 'u') {   # Both non- and uni semantics
+    # XXX foreach my $charset ('d', 'u', 'l') {
+    foreach my $charset ('d', 'u') {
       foreach my $utf8_target (0, 1) {    # Both utf8 and not, for
                                           # code points < 256
         my $upgrade_target = "";
@@ -247,17 +249,17 @@ foreach my $test (sort { numerically } keys %tests) {
         # something above latin1.  So impossible to test if to not to be in
         # utf8; and otherwise, no upgrade is needed.
         next if $target_above_latin1 && ! $utf8_target;
-        $upgrade_target = '; utf8::upgrade($c)' if ! $target_above_latin1 && $utf8_target;
+        $upgrade_target = ' utf8::upgrade($c);' if ! $target_above_latin1 && $utf8_target;
 
-        foreach my $uni_pattern (0, 1) {
-          next if $pattern_above_latin1 && ! $uni_pattern;
+        foreach my $utf8_pattern (0, 1) {
+          next if $pattern_above_latin1 && ! $utf8_pattern;
+          my $uni_semantics = $utf8_target || $charset eq 'u' || ($charset eq 'd' && $utf8_pattern);
           my $upgrade_pattern = "";
-          $upgrade_pattern = '; use re "/u"' if ! $pattern_above_latin1 && $uni_pattern;
+          $upgrade_pattern = ' utf8::upgrade($p);' if ! $pattern_above_latin1 && $utf8_pattern;
 
           my $lhs = join "", @x_target;
           my @rhs = @x_pattern;
-          #print "$lhs: ", "/@rhs/\n";
-
+          my $should_fail = ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self;
           foreach my $bracketed (0, 1) {   # Put rhs in [...], or not
             foreach my $inverted (0,1) {
                 next if $inverted && ! $bracketed;
@@ -314,9 +316,9 @@ foreach my $test (sort { numerically } keys %tests) {
                           # something on one or both sides that force it to.
                           my $must_match = ! $can_match_null || ($l_anchor && $r_anchor) || ($l_anchor && $append) || ($r_anchor && $prepend) || ($prepend && $append);
                           #next unless $must_match;
-                          my $quantified = "(?$uni_semantics:$l_anchor$prepend$interior${quantifier}$append$r_anchor)";
+                          my $quantified = "(?$charset:$l_anchor$prepend$interior${quantifier}$append$r_anchor)";
                           my $op;
-                          if ($must_match && ! $utf8_target && ! $uni_pattern && ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self)  {
+                          if ($must_match && $should_fail)  {
                               $op = 0;
                           } else {
                               $op = 1;
@@ -324,8 +326,9 @@ foreach my $test (sort { numerically } keys %tests) {
                           $op = ! $op if $must_match && $inverted;
                           $op = ($op) ? '=~' : '!~';
 
-                          my $stuff .= " utf8_target=$utf8_target, uni_semantics=$uni_semantics, uni_pattern=$uni_pattern, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, quantifier=$quantifier, l_anchor=$l_anchor, r_anchor=$r_anchor";
-                          my $eval = "my \$c = \"$prepend$lhs$append\"$upgrade_target; $upgrade_pattern; \$c $op /$quantified/i;";
+                          my $stuff .= " uni_semantics=$uni_semantics, should_fail=$should_fail, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, quantifier=$quantifier, l_anchor=$l_anchor, r_anchor=$r_anchor";
+                          $stuff .= "; pattern_above_latin1=$pattern_above_latin1; utf8_pattern=$utf8_pattern";
+                          my $eval = "my \$c = \"$prepend$lhs$append\"; my \$p = qr/$quantified/i;$upgrade_target$upgrade_pattern \$c $op \$p;";
 
                           # XXX Doesn't currently test multi-char folds
                           next if @pattern != 1;
-- 
1.5.6.3


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0010-regexec.c-Handle-REFFU-and-NREFFU-refactor.patch
From a6dcef1fd3ecd8d8374c56a89632cb19b590264d Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:05:25 -0700
Subject: [PATCH] regexec.c:  Handle REFFU and NREFFU; refactor

This adds handling of the Unicode folding semantics capture buffer
backreferences.  I've refactored the code so that the case statements
set up the type of folding, to avoid having to test for which in the
common code.

Also, the previous code was confusing fold case and lowercase.  There is
already a routine to handle the fold case, so that simplified things.
---
 regexec.c |  115 ++++++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 72 insertions(+), 43 deletions(-)

diff --git a/regexec.c b/regexec.c
index ff76c84..ffa2da4 100644
--- a/regexec.c
+++ b/regexec.c
@@ -3927,31 +3927,69 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
 	    break;
             
 	case NREFFL:
-	{
+	{   /* The capture buffer cases.  The ones beginning with N for the
+	       named buffers just convert to the equivalent numbered and
+	       pretend they were called as the corresponding numbered buffer
+	       op.  */
 	    char *s;
 	    char type;
+	    I32 (*folder)() = NULL;	/* NULL assumes will be NREF, REF: no
+					   folding */
+	    const U8 * fold_array = NULL;
+
 	    PL_reg_flags |= RF_tainted;
-	    /* FALL THROUGH */
-	case NREF:
+	    folder = foldEQ_locale;
+	    fold_array = PL_fold_locale;
+	    type = REFFL;
+	    goto do_nref;
+
+	case NREFFU:
+	    folder = foldEQ_latin1;
+	    fold_array = PL_fold_latin1;
+	    type = REFFU;
+	    goto do_nref;
+
 	case NREFF:
-	    type = OP(scan);
+	    folder = foldEQ;
+	    fold_array = PL_fold;
+	    type = REFF;
+	    goto do_nref;
+
+	case NREF:
+	    type = REF;
+	  do_nref:
+
+	    /* For the named back references, find the corresponding buffer
+	     * number */
 	    n = reg_check_named_buff_matched(rex,scan);
 
-            if ( n ) {
-                type = REF + ( type - NREF );
-                goto do_ref;
-            } else {
+            if ( ! n ) {
                 sayNO;
-            }
-            /* unreached */
+	    }
+	    goto do_nref_ref_common;
+
 	case REFFL:
 	    PL_reg_flags |= RF_tainted;
+	    folder = foldEQ_locale;
+	    fold_array = PL_fold_locale;
+	    goto do_ref;
+
+	case REFFU:
+	    folder = foldEQ_latin1;
+	    fold_array = PL_fold_latin1;
+	    goto do_ref;
+
+	case REFF:
+	    folder = foldEQ;
+	    fold_array = PL_fold;
 	    /* FALL THROUGH */
+
         case REF:
-	case REFF: 
-	    n = ARG(scan);  /* which paren pair */
+	  do_ref:
 	    type = OP(scan);
-	  do_ref:  
+	    n = ARG(scan);  /* which paren pair */
+
+	  do_nref_ref_common:
 	    ln = PL_regoffs[n].start;
 	    PL_reg_leftiter = PL_reg_maxiter;		/* Void cache */
 	    if (*PL_reglastparen < n || ln == -1)
@@ -3960,49 +3998,40 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
 		break;
 
 	    s = PL_bostr + ln;
-	    if (utf8_target && type != REF) {	/* REF can do byte comparison */
-		char *l = locinput;
-		const char *e = PL_bostr + PL_regoffs[n].end;
-		/*
-		 * Note that we can't do the "other character" lookup trick as
-		 * in the 8-bit case (no pun intended) because in Unicode we
-		 * have to map both upper and title case to lower case.
-		 */
-		if (type == REFF) {
-		    while (s < e) {
-			STRLEN ulen1, ulen2;
-			U8 tmpbuf1[UTF8_MAXBYTES_CASE+1];
-			U8 tmpbuf2[UTF8_MAXBYTES_CASE+1];
-
-			if (l >= PL_regeol)
-			    sayNO;
-			toLOWER_utf8((U8*)s, tmpbuf1, &ulen1);
-			toLOWER_utf8((U8*)l, tmpbuf2, &ulen2);
-			if (ulen1 != ulen2 || memNE((char *)tmpbuf1, (char *)tmpbuf2, ulen1))
-			    sayNO;
-			s += ulen1;
-			l += ulen2;
-		    }
+	    if (type != REF	/* REF can do byte comparison */
+		&& (utf8_target
+                    || (type == REFFU
+                        && (*s == (char) LATIN_SMALL_LETTER_SHARP_S
+                            || *locinput == (char) LATIN_SMALL_LETTER_SHARP_S))))
+	    { /* XXX handle REFFL better */
+		char * limit = PL_regeol;
+
+		/* This call case insensitively compares the entire buffer
+		    * at s, with the current input starting at locinput, but
+		    * not going off the end given by PL_regeol, and returns in
+		    * limit upon success, how much of the current input was
+		    * matched */
+		if (! foldEQ_utf8(s, NULL, PL_regoffs[n].end - ln, utf8_target,
+				    locinput, &limit, 0, utf8_target))
+		{
+		    sayNO;
 		}
-		locinput = l;
+		locinput = limit;
 		nextchr = UCHARAT(locinput);
 		break;
 	    }
 
-	    /* Inline the first character, for speed. */
+	    /* Not utf8:  Inline the first character, for speed. */
 	    if (UCHARAT(s) != nextchr &&
 		(type == REF ||
-		 (UCHARAT(s) != (type == REFF
-				  ? PL_fold : PL_fold_locale)[nextchr])))
+		 UCHARAT(s) != fold_array[nextchr]))
 		sayNO;
 	    ln = PL_regoffs[n].end - ln;
 	    if (locinput + ln > PL_regeol)
 		sayNO;
 	    if (ln > 1 && (type == REF
 			   ? memNE(s, locinput, ln)
-			   : (type == REFF
-			      ? ! foldEQ(s, locinput, ln)
-			      : ! foldEQ_locale(s, locinput, ln))))
+			   : ! folder(s, locinput, ln)))
 		sayNO;
 	    locinput += ln;
 	    nextchr = UCHARAT(locinput);
-- 
1.5.6.3
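
Seen from Perl code, the effect of the REFFU and NREFFU handling can be sketched roughly as follows. This is an illustration, not part of the patch: it assumes a perl that accepts the (?d:...) and (?u:...) modifiers, and the capture name "grind" is borrowed from the test file. The target holds a small and a capital a-grave as native bytes, and the backreference has to match the second one case-insensitively.

```perl
use strict;
use warnings;
no feature 'unicode_strings';

my $target = "\xe0\xc0";   # small then capital a with grave, native bytes

# Legacy rules: the backreference folds ASCII only, so 0xC0 is not 0xE0.
my $d_numbered = ($target =~ /^(?d:(\xe0)\1)$/i)                 ? 1 : 0;

# Unicode rules: numbered and named backreferences fold in the latin1 range.
my $u_numbered = ($target =~ /^(?u:(\xe0)\1)$/i)                 ? 1 : 0;
my $u_named    = ($target =~ /^(?u:(?<grind>\xe0)\k<grind>)$/i)  ? 1 : 0;

print "d: $d_numbered, u numbered: $u_numbered, u named: $u_named\n";
```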


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0011-regcomp.c-Generate-REFFU-and-NREFFU.patch
From 603718e020407d784c920301500232e5bd8902bf Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:35:13 -0700
Subject: [PATCH] regcomp.c: Generate REFFU and NREFFU

This causes the new nodes that denote Unicode semantics in
backreferences to be generated when appropriate.

Because the addition of these nodes was at the end of the node list, the
arithmetic relation that previously was valid no longer is.
---
 regcomp.c |   34 ++++++++++++++++++++++++++--------
 1 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 392b075..2df0a6e 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -5849,9 +5849,15 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                         SvREFCNT_inc_simple_void(sv_dat);
                     }
                     RExC_sawback = 1;
-                    ret = reganode(pRExC_state,
-                    	   (U8)(FOLD ? (LOC ? NREFFL : NREFF) : NREF),
-                    	   num);
+		    ret = reganode(pRExC_state,
+				   ((! FOLD)
+				     ? NREF
+				     : (UNI_SEMANTICS)
+				       ? NREFFU
+				       : (LOC)
+				         ? NREFFL
+					 : NREFF),
+				    num);
                     *flagp |= HASWIDTH;
 
                     Set_Node_Offset(ret, parse_start+1);
@@ -7531,8 +7537,14 @@ tryagain:
 
                 RExC_sawback = 1;
                 ret = reganode(pRExC_state,
-                	   (U8)(FOLD ? (LOC ? NREFFL : NREFF) : NREF),
-                	   num);
+                               ((! FOLD)
+                                 ? NREF
+                                 : (UNI_SEMANTICS)
+                                   ? NREFFU
+                                   : (LOC)
+                                     ? NREFFL
+                                     : NREFF),
+                                num);
                 *flagp |= HASWIDTH;
 
                 /* override incorrect value set in reganode MJD */
@@ -7593,8 +7605,14 @@ tryagain:
 		    }
 		    RExC_sawback = 1;
 		    ret = reganode(pRExC_state,
-				   (U8)(FOLD ? (LOC ? REFFL : REFF) : REF),
-				   num);
+				   ((! FOLD)
+				     ? REF
+				     : (UNI_SEMANTICS)
+				       ? REFFU
+				       : (LOC)
+				         ? REFFL
+					 : REFF),
+				    num);
 		    *flagp |= HASWIDTH;
 
                     /* override incorrect value set in reganode MJD */
@@ -9594,7 +9612,7 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o)
     else if (k == REF || k == OPEN || k == CLOSE || k == GROUPP || OP(o)==ACCEPT) {
 	Perl_sv_catpvf(aTHX_ sv, "%d", (int)ARG(o));	/* Parenth number */
 	if ( RXp_PAREN_NAMES(prog) ) {
-            if ( k != REF || OP(o) < NREF) {	    
+            if ( k != REF || (OP(o) != NREF && OP(o) != NREFF && OP(o) != NREFFL && OP(o) != NREFFU)) {
 	        AV *list= MUTABLE_AV(progi->data->data[progi->name_list_idx]);
 	        SV **name= av_fetch(list, ARG(o), 0 );
 	        if (name)
-- 
1.5.6.3


@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0012-re-fold_grind.t-Add-tests-for-NREFFU-REFFU.patch
From 82e2266183ac5b10b6dcfc4d165545629ccd227a Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:58:37 -0700
Subject: [PATCH] re/fold_grind.t: Add tests for NREFFU, REFFU

This adds simple tests for these.  Inspection of the code indicated to
me that more complex tests were not warranted.
---
 t/re/fold_grind.t |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index fd69cdb..55241e3 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -1,5 +1,4 @@
-# Grind out a lot of combinatoric tests for folding.  Still missing are
-# testing backreferences and tries.
+# Grind out a lot of combinatoric tests for folding.
 
 use charnames ":full";
 
@@ -259,7 +258,28 @@ foreach my $test (sort { numerically } keys %tests) {
 
           my $lhs = join "", @x_target;
           my @rhs = @x_pattern;
+          my $rhs = join "", @rhs;
           my $should_fail = ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self;
+
+          # Do simple tests of referencing capture buffers, named and
+          # numbered.
+          my $op = '=~';
+          $op = '!~' if $should_fail;
+          my $eval = "my \$c = \"$lhs$rhs\"; my \$p = qr/(?$charset:^($rhs)\\1\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+          push @eval_tests, qq[ok(eval '$eval', '$eval')];
+          $eval = "my \$c = \"$lhs$rhs\"; my \$p = qr/(?$charset:^(?<grind>$rhs)\\k<grind>\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+          push @eval_tests, qq[ok(eval '$eval', '$eval')];
+          $count += 2;
+          if ($lhs ne $rhs) {
+            $eval = "my \$c = \"$rhs$lhs\"; my \$p = qr/(?$charset:^($rhs)\\1\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+            push @eval_tests, qq[ok(eval '$eval', '$eval')];
+            $eval = "my \$c = \"$rhs$lhs\"; my \$p = qr/(?$charset:^(?<grind>$rhs)\\k<grind>\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+            push @eval_tests, qq[ok(eval '$eval', '$eval')];
+            $count += 2;
+          }
+          #diag $eval_tests[-1];
+          #next;
+
           foreach my $bracketed (0, 1) {   # Put rhs in [...], or not
             foreach my $inverted (0,1) {
                 next if $inverted && ! $bracketed;
-- 
1.5.6.3

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0013-Nit-in-perlunicode.pod.patch
From 442698edd07704c7fbcd83ba3c1a0d3fed06373f Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:15:18 -0700
Subject: [PATCH] Nit in perlunicode.pod

---
 pod/perlunicode.pod |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b950f7b..20acb55 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -23,7 +23,7 @@ Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
 
 Perl knows when a filehandle uses Perl's internal Unicode encodings
 (UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
-the ":utf8" layer.  Other encodings can be converted to Perl's
+the ":encoding(utf8)" layer.  Other encodings can be converted to Perl's
 encoding on input or from Perl's encoding on output by use of the
 ":encoding(...)"  layer.  See L<open>.
 
-- 
1.5.6.3

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0014-Document-Unicode-doc-fix.patch
From 371a6b022abefe8c1377d3d8811431654d1da46d Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:33:54 -0700
Subject: [PATCH] Document Unicode doc fix

---
 lib/feature.pm      |   21 ++++++++++++++----
 pod/perldelta.pod   |   33 +++++++++++++++++++++++------
 pod/perlre.pod      |   44 ++++++++++++++++++++++-----------------
 pod/perlunicode.pod |   57 +++++++++++++++-----------------------------------
 pod/perlunifaq.pod  |   42 ++++++++++++++++++------------------
 5 files changed, 105 insertions(+), 92 deletions(-)

diff --git a/lib/feature.pm b/lib/feature.pm
index f8a9078..c70010d 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -105,11 +105,22 @@ See L<perlsub/"Persistent Private Variables"> for details.
 
 =head2 the 'unicode_strings' feature
 
-C<use feature 'unicode_strings'> tells the compiler to treat
-all strings outside of C<use locale> and C<use bytes> as Unicode. It is
-available starting with Perl 5.11.3, but is not fully implemented.
-
-See L<perlunicode/The "Unicode Bug"> for details.
+C<use feature 'unicode_strings'> tells the compiler to use Unicode semantics
+in all string operations executed within its scope (unless they are also
+within the scope of either C<use locale> or C<use bytes>).  The same applies
+to all regular expressions compiled within the scope, even if executed outside
+it.
+
+C<no feature 'unicode_strings'> tells the compiler to use the traditional
+Perl semantics wherein the native character set semantics is used unless it is
+clear to Perl that Unicode is desired.  This can lead to some surprises
+when the behavior suddenly changes.  (See
+L<perlunicode/The "Unicode Bug"> for details.)  For this reason, if you are
+potentially using Unicode in your program, the
+C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
+
+This subpragma is available starting with Perl 5.11.3, but was not fully
+implemented until 5.13.8.
 
 =head1 FEATURE BUNDLES
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index cfeff1f..b7d710b 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -2,7 +2,6 @@
 
 =for comment
 This has been completed up to 779bcb7d, except for:
-1b9f127-fad448f (Karl Williamson says he will do this)
 ad9e76a8629ed1ac483f0a7ed0e4da40ac5a1a00
 d9a4b459f94297889956ac3adc42707365f274c2
 
@@ -81,6 +80,18 @@ method support still works as expected:
   open my $fh, ">", $file;
   $fh->autoflush(1);        # IO::File not loaded
 
+=head2 Full functionality for C<use feature 'unicode_strings'>
+
+This release provides full functionality for C<use feature
+'unicode_strings'>.  Under its scope, all string operations executed and
+regular expressions compiled (even if executed outside its scope) have
+Unicode semantics.   See L<feature>.
+
+This feature avoids the "Unicode Bug" (See
+L<perlunicode/The "Unicode Bug"> for details.)  If there is a
+possibility that your code will process Unicode strings, you are
+B<strongly> encouraged to use this subpragma to avoid nasty surprises.
+
 =head1 Security
 
 XXX Any security-related notices go here.  In particular, any security
@@ -492,12 +503,6 @@ L<[perl #79178]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79178>.
 
 =item *
 
-A number of bugs with regular expression bracketed character classes
-have been fixed, mostly having to do with matching characters in the
-non-ASCII Latin-1 range.
-
-=item *
-
 A closure containing an C<if> statement followed by a constant or variable
 is no longer treated as a constant
 L<[perl #63540]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=63540>.
@@ -514,6 +519,20 @@ A regular expression optimisation would sometimes cause a match with a
 C<{n,m}> quantifier to fail when it should match
 L<[perl #79152]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79152>.
 
+=item *
+
+What has become known as the "Unicode Bug" is resolved in this release.
+Under C<use feature 'unicode_strings'>, the internal storage format of a
+string no longer affects the external semantics.  There are two known
+exceptions.  User-defined case changing functions, which are planned to
+be deprecated in 5.14, require utf8-encoded strings to function; and the
+character C<LATIN SMALL LETTER SHARP S> in regular expression
+case-insensitive matching has a somewhat different set of bugs depending
+on the internal storage format.  Case-insensitive matching of all
+characters that have multi-character matches, as this one does, is
+problematical in Perl.
+L<[perl #58182]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index acc1ad5..f415a16 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -646,31 +646,37 @@ locale, and can differ from one match to another if there is an
 intervening call of the
 L<setlocale() function|perllocale/The setlocale function>.
 This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma.
+within the scope of a C<"use locale"> pragma.  Results are not
+well-defined when using this and matching against a utf8-encoded string.
 
 C<"u"> means to use Unicode semantics when pattern matching.  It is
-automatically set if the regular expression is compiled within the scope
-of a L<C<"use feature 'unicode_strings">|feature> pragma (and isn't
-also in the scope of L<C<"use locale">|locale> nor
-L<C<"use bytes">|bytes> pragmas.  It is not fully implemented at the
-time of this writing, but work is being done to complete the job.  On
-EBCDIC platforms this currently has no effect, but on ASCII platforms,
-it effectively turns them into Latin-1 platforms.  That is, the ASCII
-characters remain as ASCII characters (since ASCII is a subset of
-Latin-1), but the non-ASCII code points are treated as Latin-1
-characters.  Right now, this only applies to the C<"\b">, C<"\s">, and
-C<"\w"> pattern matching operators, plus their complements.  For
-example, when this option is not on, C<"\w"> matches precisely
-C<[A-Za-z0-9_]> (on a non-utf8 string).  When the option is on, it
-matches not just those, but all the Latin-1 word characters (such as an
-"n" with a tilde).  It thus matches exactly the same set of code points
-from 0 to 255 as it would if the string were encoded in utf8.
+automatically set if the regular expression is encoded in utf8, or is
+compiled within the scope of a
+L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
+the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes>
+pragmas).  On ASCII platforms, the code points between 128 and 255 take on their
+Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
+in strict ASCII their meanings are undefined.  Thus the platform
+effectively becomes a Unicode platform.  The ASCII characters remain as
+ASCII characters (since ASCII is a subset of Latin-1 and Unicode).  For
+example, when this option is not on, on a non-utf8 string, C<"\w">
+matches precisely C<[A-Za-z0-9_]>.  When the option is on, it matches
+not just those, but all the Latin-1 word characters (such as an "n" with
+a tilde).  On EBCDIC platforms, which already are equivalent to Latin-1,
+this modifier changes behavior only when the C<"/i"> modifier is also
+specified, and affects only two characters, giving them full Unicode
+semantics: the C<MICRO SIGN> will match the Greek capital and
+small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
+S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
+(This last case is buggy, however.)
 
 C<"d"> means to use the traditional Perl pattern matching behavior.
 This is dualistic (hence the name C<"d">, which also could stand for
-"default").  When this is in effect, Perl matches utf8-encoded strings
+"depends").  When this is in effect, Perl matches utf8-encoded strings
 using Unicode rules, and matches non-utf8-encoded strings using the
-platform's native character set rules.
+platform's native character set rules.  (If the regular expression
+itself is encoded in utf8, Unicode rules are used regardless of the
+target string's encoding.)
 See L<perlunicode/The "Unicode Bug">.  It is automatically selected by
 default if the regular expression is compiled neither within the scope
 of a C<"use locale"> pragma nor a <C<"use feature 'unicode_strings">
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 20acb55..925ae36 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1450,7 +1450,8 @@ The term, the "Unicode bug" has been applied to an inconsistency with the
 Unicode characters whose ordinals are in the Latin-1 Supplement block, that
 is, between 128 and 255.  Without a locale specified, unlike all other
 characters or code points, these characters have very different semantics in
-byte semantics versus character semantics.
+byte semantics versus character semantics, unless
+C<use feature 'unicode_strings'> is specified.
 
 In character semantics they are interpreted as Unicode code points, which means
 they have the same semantics as Latin-1 (ISO-8859-1).
@@ -1514,45 +1515,21 @@ ASCII range (except in a locale), along with Perl's desire to add Unicode
 support seamlessly.  The result wasn't seamless: these characters were
 orphaned.
 
-Work is being done to correct this, but only some of it is complete.
-What has been finished is:
-
-=over
-
-=item *
-
-the matching of C<\b>, C<\s>, C<\w> and the Posix
-character classes and their complements in regular expressions
-
-=item *
-
-case changing (but not user-defined casing)
-
-=item *
-
-case-insensitive (C</i>) regular expression matching for [bracketed
-character classes] only, except for some bugs with C<LATIN SMALL
-LETTER SHARP S> (which is supposed to match the two character sequence
-"ss" (or "Ss" or "sS" or "SS"), but Perl has a number of bugs for all
-such multi-character case insensitive characters, of which this is just
-one example.
-
-=back
-
-Due to concerns, and some evidence, that older code might
-have come to rely on the existing behavior, the new behavior must be explicitly
-enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
-no new syntax is involved.
-
-See L<perlfunc/lc> for details on how this pragma works in combination with
-various others for casing.
-
-Even though the implementation is incomplete, it is planned to have this
-pragma affect all the problematic behaviors in later releases: you can't
-have one without them all.
-
-In the meantime, a workaround is to always call utf8::upgrade($string), or to
-use the standard module L<Encode>.   Also, a scalar that has any characters
+Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
+cause Perl to use Unicode semantics on all string operations within the
+scope of the feature subpragma.  Regular expressions compiled in its
+scope retain that behavior even when executed or compiled into larger
+regular expressions outside the scope.  (The pragma does not, however,
+affect user-defined case changing operations.  These still require a
+UTF-8 encoded string to operate.)
+
+In Perl 5.12, the subpragma affected casing changes, but not regular
+expressions.  See L<perlfunc/lc> for details on how this pragma works in
+combination with various others for casing.
+
+For earlier Perls, or when a string is passed to a function outside the
+subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
+or to use the standard module L<Encode>.   Also, a scalar that has any characters
 whose ordinal is above 0x100, or which were specified using either of the
 C<\N{...}> notations will automatically have character semantics.
 
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 877e4d1..9fd2b38 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -138,27 +138,27 @@ concern, and you can just C<eval> dumped data as always.
 
 =head2 Why do some characters not uppercase or lowercase correctly?
 
-It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support.  The plan is to fix this
-in the future, and the casing component has in fact mostly been fixed, but we
-have to deal with the fact that Perl treats equal strings differently,
-depending on the internal state.
-
-First the casing.  Just put a C<use feature 'unicode_strings'> near the
-beginning of your program.  Within its lexical scope, C<uc>, C<lc>, C<ucfirst>,
-C<lcfirst>, and the regular expression escapes C<\U>, C<\L>, C<\u>, C<\l> use
-Unicode semantics for changing case regardless of whether the UTF8 flag is on
-or not.  However, if you pass strings to subroutines in modules outside the
-pragma's scope, they currently likely won't behave this way, and you have to
-try one of the solutions below.  There is another exception as well:  if you
-have furnished your own casing functions to override the default, these will
-not be called unless the UTF8 flag is on)
-
-This remains a problem for the regular expression constructs
-C</.../i>, C<(?i:...)>, and C</[[:posix:]]/>.
-
-To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This can be used
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+C<use feature 'unicode_strings'> near the beginning of your program.
+Within its lexical scope you shouldn't have this problem.  It also is
+automatically enabled under C<use feature ':5.12'> or using C<-E> on the
+command line for Perl 5.12 or higher.
+
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along.  Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters.  When a string is encoded in UTF-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
+platform) is wanted, and so those characters that are not ASCII
+characters aren't recognized as to what they would be in Unicode.
+C<use feature 'unicode_strings'> tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
+the problem.
+
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode semantics by changing the
+encoding to UTF-8 by doing C<utf8::upgrade($string)>. This can be used
 safely on any string, as it checks and does not change strings that have
 already been upgraded.
 
-- 
1.5.6.3
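
The documentation changes above describe how `use feature 'unicode_strings'` removes the dependence on internal storage. The following is a minimal sketch of that behavior, not part of the patch; the variable names are illustrative, and it assumes an ASCII platform, no `use locale`, and Perl 5.14 or later.

```perl
use strict;
use warnings;

# "\xe0" is LATIN SMALL LETTER A WITH GRAVE.  It is stored as a single
# byte (UTF8 flag off) because its ordinal is below 0x100.
my $native   = "\xe0";
my $upgraded = "\xe0";
utf8::upgrade($upgraded);    # same character, UTF-8 internal storage

my ($uc_native, $uc_upgraded, $uc_feature, $word_native, $word_feature);

{
    no feature 'unicode_strings';    # traditional semantics
    $uc_native   = uc $native;       # case change depends on storage
    $uc_upgraded = uc $upgraded;
    $word_native = ($native =~ /\w/) ? 1 : 0;
}
{
    use feature 'unicode_strings';   # Unicode semantics regardless of storage
    $uc_feature   = uc $native;
    $word_feature = ($native =~ /\w/) ? 1 : 0;
}

printf "no feature, native:   U+%04X, \\w=%d\n", ord($uc_native), $word_native;
printf "no feature, upgraded: U+%04X\n",         ord($uc_upgraded);
printf "feature, native:      U+%04X, \\w=%d\n", ord($uc_feature), $word_feature;
```

Under those assumptions the first line should report U+00E0 with `\w=0`, while the other two report U+00C0, and the last one `\w=1`.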

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0015-Nit-in-perlre.pod.patch
From 6536d050580ef103778c3163f0fcf213580f1445 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:34:25 -0700
Subject: [PATCH] Nit in perlre.pod

---
 pod/perlre.pod |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/pod/perlre.pod b/pod/perlre.pod
index f415a16..b74618f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -686,7 +686,7 @@ Note that the C<d>, C<l>, C<p>, and C<u> modifiers are special in that
 they can only be enabled, not disabled, and the C<d>, C<l>, and C<u>
 modifiers are mutually exclusive: specifying one de-specifies the
 others, and a maximum of one may appear in the construct.  Thus, for
-example, C<(?-p)>, C<(?-d:...)>, and C<(?-dl:...)> will warn when
+example, C<(?-p)>, C<(?-d:...)>, and C<(?dl:...)> will warn when
 compiled under C<use warnings>.
 
 Note also that the C<p> modifier is special in that its presence
-- 
1.5.6.3

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0016-Nit-in-perlunicode.pod.patch
From aed0c30ba7ea67ac1704251c054a48138084596c Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:34:58 -0700
Subject: [PATCH] Nit in perlunicode.pod

---
 pod/perlunicode.pod |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 925ae36..242238f 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -101,9 +101,9 @@ or from literals and constants in the source text.
 The C<bytes> pragma will always, regardless of platform, force byte
 semantics in a particular lexical scope.  See L<bytes>.
 
-The C<use feature 'unicode_strings'> pragma is intended to always, regardless
-of platform, force character (Unicode) semantics in a particular lexical scope.
-In release 5.12, it is partially implemented, applying only to case changes.
+The C<use feature 'unicode_strings'> pragma is intended always,
+regardless of platform, to force character (Unicode) semantics in a
+particular lexical scope.
 See L</The "Unicode Bug"> below.
 
 The C<utf8> pragma is primarily a compatibility device that enables
-- 
1.5.6.3

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

0017-Nit-in-perluniintro.pod.patch
From 3fcce5accbef27d94c7a970a42ab4b580440bf33 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:36:44 -0700
Subject: [PATCH] Nit in perluniintro.pod

---
 pod/perluniintro.pod |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index f0b2be5..6a8c07d 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -83,7 +83,7 @@ Because of backward compatibility with legacy encodings, the "a unique
 number for every character" idea breaks down a bit: instead, there is
 "at least one number for every character".  The same character could
 be represented differently in several legacy encodings.  The
-converse is also not true: some code points do not have an assigned
+converse is not also true: some code points do not have an assigned
 character.  Firstly, there are unallocated code points within
 otherwise used blocks.  Secondly, there are special Unicode control
 characters that do not represent true characters.
-- 
1.5.6.3

@p5pRT p5pRT commented Dec 2, 2010

From tchrist@perl.com

Thank you, Karl. Thank you very much.

--tom

@p5pRT p5pRT commented Dec 2, 2010

From @xdg

On Wed, Dec 1, 2010 at 7:28 PM, Tom Christiansen <tchrist@perl.com> wrote:

Thank you, Karl.  Thank you very much.

Likewise, thank you, Karl for all your work on these (and other)
Unicode related bugs. In working with you on various patches, I had
to learn a lot more about Unicode than I did and I'm probably better
off as a programmer for it.

-- David

@p5pRT p5pRT commented Dec 2, 2010

From @rjbs

* karl williamson <public@khwilliamson.com> [2010-12-01T19:14:58]

This series of commits, along with many previous ones, resolves
[perl #58182], the "Unicode Bug".

I am full of glee!

Thanks, Karl! Your work has been amazing, educational, and inspirational.
Please stick around longer!

--
rjbs

@p5pRT p5pRT commented Dec 2, 2010

From @cpansprout

On Wed Dec 01 16:16:32 2010, public@khwilliamson.com wrote:

This series of commits, along with many previous ones, resolves [perl
#58182], the "Unicode Bug".

Thank you. Applied as 164739 to 35146e3.

@p5pRT p5pRT commented Dec 2, 2010

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this Dec 2, 2010

@p5pRT p5pRT commented Dec 2, 2010

From @khwilliamson

Juerd Waalboer wrote:

karl williamson skribis 2010-12-01 17:14 (-0700):

This series of commits, along with many previous ones, resolves
[perl #58182], the "Unicode Bug".

Your work on this set of issues has been wonderful from the beginning.
I'm very happy to see #58182 resolved. This will make programming Perl
a whole lot easier.

Thank you so much!

As explained in the perldelta, there are still two known minor areas
where the behavior varies depending on the utf8ness of the
underlying string

Two minor areas is almost infinitely better than a dozen major ones!

I thought at the time that it would take a few weeks at most.

:)

Perhaps you could find some time in the next 3 months to look at
revising the pods you wrote to take into consideration this new wrinkle.

@p5pRT p5pRT commented Dec 2, 2010

From @demerphq

On 23 September 2008 18:03, Dave Mitchell <davem@iabyn.com> wrote:

On Mon, Sep 22, 2008 at 09:55:23PM +0200, Juerd Waalboer wrote:

It's a bug. A known and old bug, but it must be fixed some time.

Here's a general suggestion related to fixing Unicode-related issues.

A well-known issue is that the SVf_UTF8 flag means two different things:

   1) whether the 'sequence of integers' are stored one per byte, or use
   the variable-length utf-8 encoding scheme;

   2) what semantics apply to that sequence of integers.

We also have various bodges, such as attaching magic to cache utf8
indexes.

All this stems from the fact that there's no space in an SV to store all
the information we want. So....

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxilliary structure (cf the hv_aux struct) that contains all the
extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc. This then both allows us to disambiguate the meaning of
SVf_UTF8 (in the aux structure there would be two different flags for the
two meanings), but would also provide room for future enhancements (eg
space for a UTF32 flag should someone wish to implement that storage
format).

Just a thought...

++

yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 2, 2010

From @demerphq

2008/9/26 Rafael Garcia-Suarez <rgarciasuarez@gmail.com>:

2008/9/23 Dave Mitchell <davem@iabyn.com>:

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxilliary structure (cf the hv_aux struct) that contains all the
extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc.

I don't think we want to store the charset/locale with the string.

Consider the string "istanbul". If you're treating this string as
English, you'll capitalize it as "ISTANBUL", but if you want to follow
the Stambouliot spelling, it's "İSTANBUL".

Now consider the string "Consider the string "istanbul"". Shall we
capitalize it as "CONSİDER THE STRİNG "İSTANBUL"" ? Obviously
attaching a language to a string is going to be a problem when you
have to handle multi-language strings.

So the place that makes sense to provide this information is, in my
opinion, in lc and uc (and derivatives): in the code, not the data.
(So a pragma can be used, too.)

Could you expand on this? When I try to reason it through I see so
many issues I'm wondering if I'm missing something.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Dec 2, 2010

From @rgs

2010/12/2 demerphq <demerphq@gmail.com>:

2008/9/26 Rafael Garcia-Suarez <rgarciasuarez@gmail.com>:

2008/9/23 Dave Mitchell <davem@iabyn.com>:

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxilliary structure (cf the hv_aux struct) that contains all the
extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc.

I don't think we want to store the charset/locale with the string.

Consider the string "istanbul". If you're treating this string as
English, you'll capitalize it as "ISTANBUL", but if you want to follow
the Stambouliot spelling, it's "İSTANBUL".

Now consider the string "Consider the string "istanbul"". Shall we
capitalize it as "CONSİDER THE STRİNG "İSTANBUL"" ? Obviously
attaching a language to a string is going to be a problem when you
have to handle multi-language strings.

So the place that makes sense to provide this information is, in my
opinion, in lc and uc (and derivatives): in the code, not the data.
(So a pragma can be used, too.)

Could you expand on this? When I try to reason it through I see so
many issues I'm wondering if I'm missing something.

In short, locale is not a property of a string, but of the code that
processes it.
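
A minimal sketch of that point, using the "istanbul" example from the quoted message: the same string is uppercased differently depending on which routine the code chooses to call, with nothing attached to the string itself. The helper `uc_turkish` is hypothetical and written only for illustration; it is not an existing Perl or CPAN function.

```perl
use strict;
use warnings;
use utf8;                              # source contains literal İ and ı
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: uppercase using Turkish dotted/dotless i rules.
sub uc_turkish {
    my ($s) = @_;
    $s =~ tr/iı/İI/;    # i -> İ (U+0130), ı (U+0131) -> I
    return uc $s;       # the remaining letters use the default mapping
}

my $word = "istanbul";
print uc($word), "\n";            # default Unicode mapping
print uc_turkish($word), "\n";    # Turkish mapping chosen by the caller
```

The string carries no language tag here; the caller decides which convention applies, which is also how a lexically scoped pragma would work.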

@p5pRT p5pRT commented Dec 2, 2010

From @rjbs

* demerphq <demerphq@gmail.com> [2010-12-02T09:34:18]

On 23 September 2008 18:03, Dave Mitchell <davem@iabyn.com> wrote:

All this stems from the fact that there's no space in an SV to store all
the information we want. So....

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxilliary structure (cf the hv_aux struct) that contains all
the extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc. This then both allows us to disambiguate the meaning of
SVf_UTF8 (in the aux structure there would be two different flags for the
two meanings), but would also provide room for future enhancements (eg
space for a UTF32 flag should someone wish to implement that storage
format).

++

Yes, ++ indeed.

We've been looking at storing something like this with ad hoc magic, but magic
isn't copied, which led to looking at hacks atop hacks.

If one could look at a scalar and know:

  1. it's text
  2. it's binary
  3. it's binary, but specifically text encoded in XYZ
  4. we don't know

...it would be *massively* *incredibly* useful at fixing *many* bugs in dealing
with encoded text.

Consider some sort of significant, potentially beer-related award offered to
the porters who get such a feature produced, landed, and into a production
release.

--
rjbs
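
To illustrate why a scalar cannot currently be classified as text or binary by inspection, here is a small sketch using the standard `Encode` module. The variable names are illustrative only. Both the decoded text and its UTF-8 encoded bytes can have the UTF8 flag off, so the flag alone does not say which one is which.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "caf\x{e9}";               # four characters of text
my $bytes = encode('UTF-8', $text);    # the same text as five octets
my $round = decode('UTF-8', $bytes);   # back to four characters

printf "text:  %d chars,  UTF8 flag %s\n",
    length($text),  utf8::is_utf8($text)  ? "on" : "off";
printf "bytes: %d octets, UTF8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? "on" : "off";
printf "round trip equal: %s\n", $round eq $text ? "yes" : "no";
```

Nothing in either scalar records that `$bytes` is "text encoded in UTF-8" rather than text in its own right, which is the gap the proposal above is meant to close.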

@p5pRT p5pRT commented Dec 2, 2010

From juerd@tnx.nl

karl williamson skribis 2010-12-01 17:14 (-0700):

This series of commits, along with many previous ones, resolves
[perl #58182], the "Unicode Bug".

Your work on this set of issues has been wonderful from the beginning.
I'm very happy to see #58182 resolved. This will make programming Perl
a whole lot easier.

Thank you so much!

As explained in the perldelta, there are still two known minor areas
where the behavior varies depending on the utf8ness of the
underlying string

Two minor areas is almost infinitely better than a dozen major ones!

I thought at the time that it would take a few weeks at most.

:)
--
Met vriendelijke groet, // Kind regards, // Korajn salutojn,

Juerd Waalboer <juerd@tnx.nl>
TNX

@p5pRT p5pRT commented Dec 10, 2010

From @khwilliamson

Ricardo Signes wrote:

* demerphq <demerphq@gmail.com> [2010-12-02T09:34:18]

On 23 September 2008 18:03, Dave Mitchell <davem@iabyn.com> wrote:

All this stems from the fact that there's no space in an SV to store all
the information we want. So....

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Extended String flag. This flag indicates that prepended to the SvPVX
string is an auxilliary structure (cf the hv_aux struct) that contains all
the extra needed unicodish info, such as encoding, charset, locale, cached
indexes etc etc. This then both allows us to disambiguate the meaning of
SVf_UTF8 (in the aux structure there would be two different flags for the
two meanings), but would also provide room for future enhancements (eg
space for a UTF32 flag should someone wish to implement that storage
format).
++

Yes, ++ indeed.

We've been looking at storing something like this with ad hoc magic, but magic
isn't copied, which led to looking at hacks atop hacks.

If one could look at a scalar and know:

1. it's text
2. it's binary
3. it's binary, but specifically text encoded in XYZ
4. we don't know

...it would be *massively* *incredibly* useful at fixing *many* bugs in dealing
with encoded text.

Consider some sort of significant, potentially beer-related award offered to
the porters who get such a feature produced, landed, and into a production
release.

Shouldn't this be added to perltodo if it really should get done?
