Inconsistent and wrong handling of 8th bit set chars with no locale #9455
Comments
From @khwilliamson
This is a bug report for perl from corporate@khwilliamson.com,
Characters in the range U+0080 through U+00FF behave inconsistently
I presume that they do work properly when a locale is set, but I haven't
print uc("\x{e0}"), "\n"; # (a with grave accent)
yields itself instead of a capital A with grave accent (U+00C0). This
print "\x{e0}" =~ /\x{c0}/i, "\n";
will print a null string on a line, as the match fails.
The same behavior occurs for all characters in this range that are
The behavior that is inconsistent mostly occurs with upper case letters
print lcfirst("\x{c0}aaaaa"), "\n";
doesn't change the first character. But
print lcfirst("\x{c0}aaaaa\x{101}"), "\n";
does change it. There is something seriously wrong when a character
print "\x{c0}aaaaaa" =~ /^\x{e0}/i, "\n";
will show the match failing, but
print "\x{c0}aaaaaa\x{101}" =~ /^\x{e0}/i, "\n";
will show the match succeeding. Again a character maybe hundreds of
The same behavior occurs for all characters in this range that are
Also, the byte character classes inconsistently match characters in this
print "\xa0" =~ /^\s/, "\n";
will show that the match returns false but
print "\xa0\x{101}" =~ /^\s/, "\n";
will show that the match returns true. But this behavior is sort-of
From reading the documentation, I presume that the inconsistent behavior
I am frankly astonished that this bug exists, as I have come to expect
Flags:
Site configuration information for perl 5.10.0:
Configured by ActiveState at Wed May 14 05:06:16 PDT 2008.
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
Locally applied patches:
@INC for perl 5.10.0:
Environment for perl 5.10.0:
PATH=/opt/ActivePerl-5.10/bin:/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
|
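The behaviour reported above can be pictured with a toy model in C. This is only a sketch of the symptom, not perl's implementation: `toy_uc` and its `stored_as_utf8` flag are hypothetical names, and the model covers only code points below 0x100. The flag stands for "some other character in the string forced the internal utf8 representation", which is what appending `\x{101}` does in the examples.

```c
#include <stdint.h>

/* Toy model only. toy_uc() and its stored_as_utf8 flag are hypothetical
 * names for illustration; they are not perl internals. */

/* Uppercase a code point below 0x100 by Latin-1 rules. */
static uint32_t latin1_upper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    /* U+00E0..U+00FE map to U+00C0..U+00DE, except the division sign
     * U+00F7, which is not a letter. */
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
        return cp - 0x20;
    /* U+00B5, U+00DF and U+00FF have no single-character uppercase inside
     * Latin-1, so this toy leaves them (and everything else) unchanged. */
    return cp;
}

/* Model of the reported symptom: ASCII is always case-mapped, but
 * U+0080..U+00FF is case-mapped only when the string happens to be
 * stored as utf8. */
uint32_t toy_uc(uint32_t cp, int stored_as_utf8)
{
    if (cp < 0x80 || stored_as_utf8)
        return latin1_upper(cp);
    return cp;   /* U+0080..U+00FF in a byte string: left alone */
}
```

In this model `toy_uc(0xE0, 0)` leaves the character alone while `toy_uc(0xE0, 1)` yields 0xC0, mirroring how the result in the report depends on an unrelated character elsewhere in the string.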
From @moritz
karl williamson wrote:
This is a known bug, and probably not fixable, because too much code
A possible workaround is
Cheers,
|
The RT System itself - Status changed from 'new' to 'open' |
From @druud62
karl williamson wrote:
To me that is as expected.
print lcfirst substr "\x{100}\x{c0}aaaaa", 1;
Lowercasing isn't defined for as many characters in ASCII or Latin-1 as
--
"Gewoon is een tijger."
|
From @nothingmuch
On Thu, Aug 21, 2008 at 13:22:36 +0200, Dr.Ruud wrote:
Or a code point above 127 with use utf8 or use encoding
--
|
From @nothingmuch
On Thu, Aug 21, 2008 at 14:31:40 +0300, Yuval Kogman wrote:
I should clarify that this is only in the context of the string
A code point above 127 will be treated as unicode if the string is
--
|
From @khwilliamson
I'm the person who submitted this bug report. I think this bug should
To refresh your memory, the current implementation of perl on non-EBCDIC
The slides from the talk "Working around *the* Unicode bug" during
The crux of the problem is that on non-EBCDIC machines, in the absence
Perl stores characters as utf8 automatically if a string contains any
I know of three areas where this leads to problems.
The first is the pattern matching already alluded to. This is at least
The second is case changing functions, like lcfirst() or \U in pattern
And the third is ignoring case in pattern matches.
There may be others which I haven't looked for yet. I think, for
One response I got to my bug report was that a lot of code depends on
Also, from reading the perl source, it appears to me that EBCDIC
An idea I've had is to add a pragma like "use latin1", or maybe "use
Anyway, I'm hoping to get some sort of fix in for this. In my
So I'd like to know how to proceed
Karl Williamson
|
From perl@nevcal.com
On approximately 9/20/2008 3:52 PM, came the following characters from
I applaud your willingness to dive in.
For compatibility reasons, as has been discussed on this list
N.B. There are lots of discussions about it in the archive, some
Those discussions are lengthy reading, unfortunately, but they do point
--
|
From @andk
> I'm the person who submitted this bug report. I think this bug should
Thank you! As for the protocol: do not patch 5.10, patch bleadperl
--
|
From @rgs
2008/9/21 karl williamson <contact@khwilliamson.com>:
This is a good summary of the issues.
In general, one finds that people write code relying on almost anything...
That's in theory probable, but we don't have testers on EBCDIC
I think that the current Unicode bugs are annoying enough to deserve
If you're a git user, you can work on a branch cloned from
Do not hesitate to ask questions here.
|
From @Juerd
Moritz Lenz wrote 2008-08-21 9:50 (+0200):
It is fixable, and the backwards incompatibility has already been
| The handling of Unicode still is unclean in several places, where it's
It will be fixed, and it's wonderful to have a volunteer for that!
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From @Juerd
Dr.Ruud wrote 2008-08-21 13:22 (+0200):
No, unicode semantics get activated when the internal encoding of the
It's a bug. A known and old bug, but it must be fixed some time.
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From @Juerd
karl williamson wrote 2008-09-20 16:52 (-0600):
All three, but rest assured that this has already been discussed in
This decision is clearly reflected in perl5100delta:
| The handling of Unicode still is unclean in several places, where it's
Please proceed with fixing the bug. I am very happy with your offer to
As always, I refrain from thinking about EBCDIC. I'd say: keep the
Please do break existing code, harsh as that may be. It is much more
Pragmas have problems, especially in regular expressions. And it's very
Please feel free to copy the unit tests in Unicode::Semantics!
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From @Juerd
Glenn Linderman wrote 2008-09-20 16:31 (-0700):
As the current behavior is a bug, the enhancement can rightfully be
What's this about the pragma that "must be used"? Yes, it has been
In fact, perl5100delta clearly announces backwards incompatibility.
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From @druud62
Juerd Waalboer wrote:
Yes, Unicode semantics get activated when a codepoint above 255 is
Yes, there are other ways too, like:
perl -Mstrict -Mwarnings -Mencoding=utf8 -le'
--
"Gewoon is een tijger."
|
From @ikegami
On Sat, Sep 20, 2008 at 6:52 PM, karl williamson
There are inconsistencies with quotemeta (and therefore \Q)
|
From @iabyn
On Mon, Sep 22, 2008 at 09:55:23PM +0200, Juerd Waalboer wrote:
Here's a general suggestion related to fixing Unicode-related issues.
A well-known issue is that the SVf_UTF8 flag means two different things:
1) whether the 'sequence of integers' are stored one per byte, or use
2) what semantics apply to that sequence of integers.
We also have various bodges, such as attaching magic to cache utf8
All this stems from the fact that there's no space in an SV to store all
How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an
Just a thought...
--
|
From @Juerd
Dave Mitchell wrote 2008-09-23 17:03 (+0100):
It sounds rather complicated, whereas the current plan would be to
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From perl@nevcal.com
On approximately 9/23/2008 9:58 AM, came the following characters from
It is not at all clear to me that encoding, charset, and locale are
Cached indexes are a nice performance help, I don't know enough about
I guess Juerd is referring to removing any semantic meaning of the flag,
After such a removal, present-perl would reach the idyllic state
1) convert all other encodings to Unicode, perform semantic operations
2) leave data in other encodings, but avoid the use of Perl operations
Note that binary data in strings is simply a special case of strings
--
|
From @khwilliamson
Glenn,
The reason I called it a bug is that I, an experienced Perl programmer,
Whatever one calls it, there does seem to be some support for changing
I have looked in the archive and found some discussions about this
Karl Williamson
Glenn Linderman wrote:
|
From perl@nevcal.com
On approximately 9/23/2008 10:33 AM, came the following characters from
The behavior is non-obvious. I may be blind to the deficiencies of the
The documentation has been in as much flux as the code, from 5.6.x to
Yes, there is a definite conflict between those goals, and from that
Sure. Please proceed. Especially with Rafael's openness to
The discussions are more lengthy (per post, and per number of posts),
Given Rafael's green light, and if you are pointed at changes to Perl
Here is one thread:
Subject: "on the almost impossibility to write correct XS modules"
demerphq spawned a related thread subject: "On the problem of strings
Another related issue is that various operations that attempt to
Let's not forget Pack & Unpack. There's one thread about that with
Other related threads have the following subject lines:
The philosophy should be that no Perl operations should have different
String constant metacharacters such as \u \U \l \L
pack/unpack - packing should always produce a single-byte string, and
--
|
From @rgs
2008/9/23 Dave Mitchell <davem@iabyn.com>:
I don't think we want to store the charset/locale with the string.
Consider the string "istanbul". If you're treating this string as
Now consider the string "Consider the string "istanbul"". Shall we
So the place that makes sense to provide this information is, in my
|
From @khwilliamson
I have been studying some of the discussions in this group about this
This is a very concrete proposal, and I would like to get agreement on
This mode would not impose a compile-time latin1-like locale on the
I do not propose to automatically convert ("downgrade") strings from
I don't know what to do about EBCDIC machines. I propose leaving
I don't know what to do about interacting with "use bytes". One
Are there other interactions that we need to worry about?
I would like to defer how this mode gets enabled or disabled until we
I think that a number of the issues that have been raised in the past
|
From vadim@vkonovalov.ru
karl williamson wrote:
what is "under an iso-8859-1 locale" exactly?
reading perllocale gives me:
USING LOCALES
By default, Perl ignores the current locale. The "use locale"
Do I understand correctly that your proposal will never touch me
Do I remember correctly that using locales is not recommended in Perl?
could you please be more precise with uc(blablabal)? what you currently wrote is a syntax error
Best regards, |
From @khwilliamson
What I meant is not a literal locale, but that the semantics would be
My proposal would touch you UNLESS you do have a 'use locale'. Your
Vadim Konovalov wrote:
|
From perl@nevcal.com
On approximately 9/26/2008 11:44 AM, came the following characters from
Yeah, I gave you a lot of reading material. I hoped not to scare you
This makes it sound like you are targeting 5.10.x; since you are talking
This sounds like it might be a complete list of operations. I think \u,
These all sound like appropriate behaviors to implement for a Unicode
Fine. All else being equal (utf8 just being a representation) it
Best effort non-breakage seems to be the best we can currently expect...
Another possibility would be that all the above listed operations would
"\U\x45\x23\x37" should just be "\x45\x23\x37" for example of a noop.
Probably. Every XS writer under the sun has assumed different things
Sure, but if you target 5.10.x you need some way of enabling or
It might be nice to recap anything that isn't being addressed, at least
Unicode filename handling, especially on Windows, might be a contentious
You'll need to give this proposal a week or so of discussion time before
--
|
From @Juerd
Hello Karl,
I strongly agree with your proposed solutions. (I'm ambivalent only
Thank you for the summary.
karl williamson wrote 2008-09-26 12:44 (-0600):
"in utf8" is ambiguous. It can mean either length(uc($y_umlaut)) == 2 or
May I suggest including the words "upgrade" and "internal"?
The resulting string will be upgraded to utf8 internally, ...
Turning your solutions on explicitly is probably wrong, at least for
Using a pragma is problematic because of qr//, and because it cannot be
I'd prefer to skip any discussion about how to enable or disable this -
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
|
From @druud62
karl williamson wrote:
use kurila; # ;-)
--
"Gewoon is een tijger."
|
From vadim@vkonovalov.ru
Dr.Ruud wrote:
kurila is so largely incompatible, it is even off-topicable!
(initially I thought it's on-topic but then I was convinced by responders
BR,
|
From @khwilliamson
My proposal from a week and a half ago hasn't spawned much
Here's a hodge-podge of my thoughts about it, but most important, I am
It looks like in 5.12, Rafael wants the new mode to be default behavior.
Glenn, "use bytes" doesn't mean necessarily binary. For example,
prints 'a'. It does mean ASCII semantics even for utf8::upgraded strings.
If there is a way to en/dis-able this mode, doesn't that have to be a
Juerd wrote:
Pragmas have problems, especially in regular expressions. And it's very
|
From @khwilliamson
0006-regcomp.c-Use-latin1-folding-in-synthetic-start-cla.patch
From be110e8fe5a08bd964d4bb091aef4daa3212950b Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 19:00:00 -0700
Subject: [PATCH] regcomp.c: Use latin1 folding in synthetic start class
This is because the pattern may not specify unicode semantics, but if
the target matching string is in utf8, then unicode semantics may be
needed nonetheless. So to avoid the regexec optimizer rejecting the
match, we need to allow for a possible false positive.
---
regcomp.c | 34 +++++++++++++++++++---------------
1 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 79623d2..392b075 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -3073,11 +3073,18 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
/* Check whether it is compatible with what we know already! */
int compat = 1;
+
+ /* If compatibile, we or it in below. It is compatible if is
+ * in the bitmp and either 1) its bit or its fold is set, or 2)
+ * it's for a locale. Even if there isn't unicode semantics
+ * here, at runtime there may be because of matching against a
+ * utf8 string, so accept a possible false positive for
+ * latin1-range folds */
if (uc >= 0x100 ||
(!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
&& !ANYOF_BITMAP_TEST(data->start_class, uc)
&& (!(data->start_class->flags & ANYOF_FOLD)
- || !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
+ || !ANYOF_BITMAP_TEST(data->start_class, PL_fold_latin1[uc])))
)
compat = 0;
ANYOF_CLASS_ZERO(data->start_class);
@@ -3119,12 +3126,13 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
if (flags & SCF_DO_STCLASS_AND) {
/* Check whether it is compatible with what we know already! */
int compat = 1;
-
if (uc >= 0x100 ||
- (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
- && !ANYOF_BITMAP_TEST(data->start_class, uc)
- && !ANYOF_BITMAP_TEST(data->start_class, (UNI_SEMANTICS) ? PL_fold_latin1[uc] : PL_fold[uc])))
+ (!(data->start_class->flags & (ANYOF_CLASS | ANYOF_LOCALE))
+ && !ANYOF_BITMAP_TEST(data->start_class, uc)
+ && !ANYOF_BITMAP_TEST(data->start_class, PL_fold_latin1[uc])))
+ {
compat = 0;
+ }
ANYOF_CLASS_ZERO(data->start_class);
ANYOF_BITMAP_ZERO(data->start_class);
if (compat) {
@@ -3136,13 +3144,11 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
}
else {
- /* Also set the other member of the fold pair. Can't
- * do this for locale, because not known until runtime
- */
- ANYOF_BITMAP_SET(data->start_class,
- (OP(scan) == EXACTFU)
- ? PL_fold_latin1[uc]
- : PL_fold[uc]);
+ /* Also set the other member of the fold pair. In case
+ * that unicode semantics is called for at runtime, use
+ * the full latin1 fold. (Can't do this for locale,
+ * because not known until runtime */
+ ANYOF_BITMAP_SET(data->start_class, PL_fold_latin1[uc]);
}
}
}
@@ -3158,9 +3164,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
* can't do that in locale because not known until
* run-time */
ANYOF_BITMAP_SET(data->start_class,
- (OP(scan) == EXACTFU)
- ? PL_fold_latin1[uc]
- : PL_fold[uc]);
+ PL_fold_latin1[uc]);
}
}
data->start_class->flags &= ~ANYOF_EOS;
--
1.5.6.3
|
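The idea behind the start-class patch above can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the regcomp.c code: `other_case_ascii` and `other_case_latin1` are hypothetical stand-ins for the roles of `PL_fold` and `PL_fold_latin1`, and `start_class` is a plain 256-bit bitmap rather than perl's ANYOF node. The point is that recording the Latin-1 fold partner may admit a false positive, but it never lets the optimizer reject a match that Unicode semantics at run time would accept.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the fold tables: each returns the other
 * member of the character's fold pair, or the character itself if it
 * has no partner in that repertoire. */
static uint8_t other_case_ascii(uint8_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return (uint8_t)(cp + 0x20);
    if (cp >= 'a' && cp <= 'z') return (uint8_t)(cp - 0x20);
    return cp;
}

static uint8_t other_case_latin1(uint8_t cp)
{
    /* U+00C0..U+00DE pair with U+00E0..U+00FE, except the multiplication
     * sign (0xD7) and the division sign (0xF7). */
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return (uint8_t)(cp + 0x20);
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return (uint8_t)(cp - 0x20);
    return other_case_ascii(cp);
}

/* Simplified start class: one bit per code point below 256. */
typedef struct { uint8_t bits[32]; } start_class;

void start_class_init(start_class *sc)
{
    memset(sc->bits, 0, sizeof sc->bits);
}

static void set_bit(start_class *sc, uint8_t cp)
{
    sc->bits[cp >> 3] |= (uint8_t)(1u << (cp & 7));
}

/* Record the first character of a case-insensitive literal together with
 * its fold partner under the chosen fold. */
void start_class_add_folded(start_class *sc, uint8_t cp, int use_latin1_fold)
{
    set_bit(sc, cp);
    set_bit(sc, use_latin1_fold ? other_case_latin1(cp) : other_case_ascii(cp));
}

/* Could a match start with this code point? */
int start_class_test(const start_class *sc, uint8_t cp)
{
    return (sc->bits[cp >> 3] >> (cp & 7)) & 1;
}
```

With the ASCII-only fold, a pattern starting with 0xE0 produces a start class that excludes 0xC0, so a target beginning with 0xC0 would be rejected before the real matcher runs. With the Latin-1 fold both code points are present and the decision is left to the matcher.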
From @khwilliamson
0007-regcomp.sym-update-comment.patch
From 444f010a3c52b735e4bdd29220cb10b3f384bc18 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:38:09 -0700
Subject: [PATCH] regcomp.sym: update comment
---
regcomp.sym | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/regcomp.sym b/regcomp.sym
index a85d33f..ab57929 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -145,7 +145,7 @@ RENUM BRANCHJ, off 1 . 1 ; Group with independently numbered parens.
# inline charclass data (ascii only), the 'C' store it in the structure.
# NOTE: the relative order of the TRIE-like regops is signifigant
-TRIE TRIE, trie 1 ; Match many EXACT(FL?)? at once. flags==type
+TRIE TRIE, trie 1 ; Match many EXACT(F[LU]?)? at once. flags==type
TRIEC TRIE,trie charclass ; Same as TRIE, but with embedded charclass data
# For start classes, contains an added fail table.
--
1.5.6.3
|
From @khwilliamson
0008-regcomp.sym-Add-REFFU-and-NREFFU-nodes.patch
From 76bd258db2ca18264a7ee18f0655a55a47ce5cb5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:39:16 -0700
Subject: [PATCH] regcomp.sym: Add REFFU and NREFFU nodes
These will be used for matching capture buffers case-insensitively using
Unicode semantics.
make regen will regenerate the delivered regnodes.h
---
regcomp.sym | 7 +++++++
regnodes.h | 29 ++++++++++++++++++++---------
2 files changed, 27 insertions(+), 9 deletions(-)
diff --git a/regcomp.sym b/regcomp.sym
index ab57929..4e787a7 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -194,6 +194,13 @@ NHORIZWS NHORIZWS, none 0 S ; not horizontal whitespace (Perl 6)
FOLDCHAR FOLDCHAR, codepoint 1 ; codepoint with tricky case folding properties.
EXACTFU EXACT, str ; Match this string, folded, Unicode semantics for non-utf8 (prec. by length).
+# These could have been implemented using the FLAGS field of the regnode, but
+# by having a separate node type, we can use the existing switch statement to
+# avoid some tests
+REFFU REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8
+NREFFU REF, num 1 V ; Match already matched string, folded using unicode semantics for non-utf8
+
+
# NEW STUFF ABOVE THIS LINE
################################################################################
diff --git a/regnodes.h b/regnodes.h
index 97ac607..09ab661 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -6,8 +6,8 @@
/* Regops and State definitions */
-#define REGNODE_MAX 91
-#define REGMATCH_STATE_MAX 131
+#define REGNODE_MAX 93
+#define REGMATCH_STATE_MAX 133
#define END 0 /* 0000 End of program. */
#define SUCCEED 1 /* 0x01 Return from a subroutine, basically. */
@@ -70,7 +70,7 @@
#define MINMOD 58 /* 0x3a Next operator is not greedy. */
#define LOGICAL 59 /* 0x3b Next opcode should set the flag only. */
#define RENUM 60 /* 0x3c Group with independently numbered parens. */
-#define TRIE 61 /* 0x3d Match many EXACT(FL?)? at once. flags==type */
+#define TRIE 61 /* 0x3d Match many EXACT(F[LU]?)? at once. flags==type */
#define TRIEC 62 /* 0x3e Same as TRIE, but with embedded charclass data */
#define AHOCORASICK 63 /* 0x3f Aho Corasick stclass. flags==type */
#define AHOCORASICKC 64 /* 0x40 Same as AHOCORASICK, but with embedded charclass data */
@@ -99,8 +99,10 @@
#define NHORIZWS 87 /* 0x57 not horizontal whitespace (Perl 6) */
#define FOLDCHAR 88 /* 0x58 codepoint with tricky case folding properties. */
#define EXACTFU 89 /* 0x59 Match this string, folded, Unicode semantics for non-utf8 (prec. by length). */
-#define OPTIMIZED 90 /* 0x5a Placeholder for dump. */
-#define PSEUDO 91 /* 0x5b Pseudo opcode for internal use. */
+#define REFFU 90 /* 0x5a Match already matched string, folded using unicode semantics for non-utf8 */
+#define NREFFU 91 /* 0x5b Match already matched string, folded using unicode semantics for non-utf8 */
+#define OPTIMIZED 92 /* 0x5c Placeholder for dump. */
+#define PSEUDO 93 /* 0x5d Pseudo opcode for internal use. */
/* ------------ States ------------- */
#define TRIE_next (REGNODE_MAX + 1) /* state for TRIE */
#define TRIE_next_fail (REGNODE_MAX + 2) /* state for TRIE */
@@ -239,6 +241,8 @@ EXTCONST U8 PL_regkind[] = {
NHORIZWS, /* NHORIZWS */
FOLDCHAR, /* FOLDCHAR */
EXACT, /* EXACTFU */
+ REF, /* REFFU */
+ REF, /* NREFFU */
NOTHING, /* OPTIMIZED */
PSEUDO, /* PSEUDO */
/* ------------ States ------------- */
@@ -379,6 +383,8 @@ static const U8 regarglen[] = {
0, /* NHORIZWS */
EXTRA_SIZE(struct regnode_1), /* FOLDCHAR */
0, /* EXACTFU */
+ EXTRA_SIZE(struct regnode_1), /* REFFU */
+ EXTRA_SIZE(struct regnode_1), /* NREFFU */
0, /* OPTIMIZED */
0, /* PSEUDO */
};
@@ -476,6 +482,8 @@ static const char reg_off_by_arg[] = {
0, /* NHORIZWS */
0, /* FOLDCHAR */
0, /* EXACTFU */
+ 0, /* REFFU */
+ 0, /* NREFFU */
0, /* OPTIMIZED */
0, /* PSEUDO */
};
@@ -578,8 +586,10 @@ EXTCONST char * const PL_reg_name[] = {
"NHORIZWS", /* 0x57 */
"FOLDCHAR", /* 0x58 */
"EXACTFU", /* 0x59 */
- "OPTIMIZED", /* 0x5a */
- "PSEUDO", /* 0x5b */
+ "REFFU", /* 0x5a */
+ "NREFFU", /* 0x5b */
+ "OPTIMIZED", /* 0x5c */
+ "PSEUDO", /* 0x5d */
/* ------------ States ------------- */
"TRIE_next", /* REGNODE_MAX +0x01 */
"TRIE_next_fail", /* REGNODE_MAX +0x02 */
@@ -674,7 +684,8 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__;
#else
EXTCONST U8 PL_varies[] __attribute__deprecated__ = {
CLUMP, BRANCH, BACK, STAR, PLUS, CURLY, CURLYN, CURLYM, CURLYX, WHILEM,
- REF, REFF, REFFL, SUSPEND, IFTHEN, BRANCHJ, NREF, NREFF, NREFFL,
+ REF, REFF, REFFL, SUSPEND, IFTHEN, BRANCHJ, NREF, NREFF, NREFFL, REFFU,
+ NREFFU,
0
};
#endif /* DOINIT */
@@ -683,7 +694,7 @@ EXTCONST U8 PL_varies[] __attribute__deprecated__ = {
EXTCONST U8 PL_varies_bitmask[];
#else
EXTCONST U8 PL_varies_bitmask[] = {
- 0x00, 0x00, 0x00, 0xC0, 0xC1, 0x9F, 0x33, 0x01, 0x38, 0x00, 0x00, 0x00
+ 0x00, 0x00, 0x00, 0xC0, 0xC1, 0x9F, 0x33, 0x01, 0x38, 0x00, 0x00, 0x0C
};
#endif /* DOINIT */
--
1.5.6.3
|
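The change to the last byte of `PL_varies_bitmask` in the patch above (0x00 to 0x0C) is what one would expect if each regnode number owns one bit, least significant bit first. The sketch below assumes that layout; `varies_mask_set` is a hypothetical helper, not a function from the perl source, and only the node numbers 90 (REFFU) and 91 (NREFFU) are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set the bit for regnode number `node` in a byte-array bitmask:
 * byte index node / 8, bit position node % 8 (LSB first).
 * Hypothetical helper; the layout is an assumption consistent with the
 * diff, not a quote from the perl source. */
void varies_mask_set(uint8_t *mask, size_t nbytes, unsigned node)
{
    assert(node / 8 < nbytes);
    mask[node >> 3] |= (uint8_t)(1u << (node & 7));
}
```

Node 90 lands in byte 11 as bit 2 (0x04) and node 91 in byte 11 as bit 3 (0x08), which together give the 0x0C seen in the regenerated table.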
From @khwilliamson
0009-re-fold_grind.t-Refactor-to-test-utf8-patterns.patch
From 59b2b252ef94dc19789543fd6664953e3ae2a671 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 21:49:20 -0700
Subject: [PATCH] re/fold_grind.t: Refactor to test utf8 patterns.
The previous version wasn't really testing utf8 patterns.
---
t/re/fold_grind.t | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index 13fdd3c..fd69cdb 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -13,6 +13,7 @@ BEGIN {
use strict;
use warnings;
+use Encode;
# Tests both unicode and not, so make sure not implicitly testing unicode
no feature 'unicode_strings';
@@ -238,7 +239,8 @@ foreach my $test (sort { numerically } keys %tests) {
#diag $progress;
# Now grind out tests, using various combinations.
- foreach my $uni_semantics ("", 'u') { # Both non- and uni semantics
+ # XXX foreach my $charset ('d', 'u', 'l') {
+ foreach my $charset ('d', 'u') {
foreach my $utf8_target (0, 1) { # Both utf8 and not, for
# code points < 256
my $upgrade_target = "";
@@ -247,17 +249,17 @@ foreach my $test (sort { numerically } keys %tests) {
# something above latin1. So impossible to test if to not to be in
# utf8; and otherwise, no upgrade is needed.
next if $target_above_latin1 && ! $utf8_target;
- $upgrade_target = '; utf8::upgrade($c)' if ! $target_above_latin1 && $utf8_target;
+ $upgrade_target = ' utf8::upgrade($c);' if ! $target_above_latin1 && $utf8_target;
- foreach my $uni_pattern (0, 1) {
- next if $pattern_above_latin1 && ! $uni_pattern;
+ foreach my $utf8_pattern (0, 1) {
+ next if $pattern_above_latin1 && ! $utf8_pattern;
+ my $uni_semantics = $utf8_target || $charset eq 'u' || ($charset eq 'd' && $utf8_pattern);
my $upgrade_pattern = "";
- $upgrade_pattern = '; use re "/u"' if ! $pattern_above_latin1 && $uni_pattern;
+ $upgrade_pattern = ' utf8::upgrade($p);' if ! $pattern_above_latin1 && $utf8_pattern;
my $lhs = join "", @x_target;
my @rhs = @x_pattern;
- #print "$lhs: ", "/@rhs/\n";
-
+ my $should_fail = ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self;
foreach my $bracketed (0, 1) { # Put rhs in [...], or not
foreach my $inverted (0,1) {
next if $inverted && ! $bracketed;
@@ -314,9 +316,9 @@ foreach my $test (sort { numerically } keys %tests) {
# something on one or both sides that force it to.
my $must_match = ! $can_match_null || ($l_anchor && $r_anchor) || ($l_anchor && $append) || ($r_anchor && $prepend) || ($prepend && $append);
#next unless $must_match;
- my $quantified = "(?$uni_semantics:$l_anchor$prepend$interior${quantifier}$append$r_anchor)";
+ my $quantified = "(?$charset:$l_anchor$prepend$interior${quantifier}$append$r_anchor)";
my $op;
- if ($must_match && ! $utf8_target && ! $uni_pattern && ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self) {
+ if ($must_match && $should_fail) {
$op = 0;
} else {
$op = 1;
@@ -324,8 +326,9 @@ foreach my $test (sort { numerically } keys %tests) {
$op = ! $op if $must_match && $inverted;
$op = ($op) ? '=~' : '!~';
- my $stuff .= " utf8_target=$utf8_target, uni_semantics=$uni_semantics, uni_pattern=$uni_pattern, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, quantifier=$quantifier, l_anchor=$l_anchor, r_anchor=$r_anchor";
- my $eval = "my \$c = \"$prepend$lhs$append\"$upgrade_target; $upgrade_pattern; \$c $op /$quantified/i;";
+ my $stuff .= " uni_semantics=$uni_semantics, should_fail=$should_fail, bracketed=$bracketed, prepend=$prepend, append=$append, parend=$parend, quantifier=$quantifier, l_anchor=$l_anchor, r_anchor=$r_anchor";
+ $stuff .= "; pattern_above_latin1=$pattern_above_latin1; utf8_pattern=$utf8_pattern";
+ my $eval = "my \$c = \"$prepend$lhs$append\"; my \$p = qr/$quantified/i;$upgrade_target$upgrade_pattern \$c $op \$p;";
# XXX Doesn't currently test multi-char folds
next if @pattern != 1;
--
1.5.6.3
|
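The two expectations the refactored test computes can be restated as small predicates. The following C sketch only transliterates the `$uni_semantics` and `$should_fail` expressions from the patch above; the function names are made up for illustration, and `is_self` is read here as "the pattern character is being matched against itself", which is an interpretation rather than something the patch states.

```c
#include <stdbool.h>

/* Transliteration of the test's $uni_semantics expression: Unicode
 * semantics apply if the target is utf8, if /u is given, or if the
 * default ('d') charset is combined with a utf8 pattern. */
bool expect_uni_semantics(bool utf8_target, char charset, bool utf8_pattern)
{
    return utf8_target
        || charset == 'u'
        || (charset == 'd' && utf8_pattern);
}

/* Transliteration of $should_fail: without Unicode semantics, a code
 * point in 128..255 cannot fold-match anything but itself. */
bool expect_should_fail(bool uni_semantics, unsigned ord, bool is_self)
{
    return !uni_semantics && ord >= 128 && ord < 256 && !is_self;
}
```

For example, under the default charset with neither side in utf8, matching 0xE0 against 0xC0 is expected to fail, while the same pair under 'u' is expected to match.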
From @khwilliamson
0010-regexec.c-Handle-REFFU-and-NREFFU-refactor.patch
From a6dcef1fd3ecd8d8374c56a89632cb19b590264d Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:05:25 -0700
Subject: [PATCH] regexec.c: Handle REFFU and NREFFU; refactor
This adds handling of the Unicode folding semantics capture buffer
backreferences. I've refactored the code so that the case statements
set up the type of folding, to avoid having to test for which in the
common code.
Also, the previous code was confusing fold case and lowercase. There is
already a routine to handle the fold case, so that simplified things.
---
regexec.c | 115 ++++++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 72 insertions(+), 43 deletions(-)
diff --git a/regexec.c b/regexec.c
index ff76c84..ffa2da4 100644
--- a/regexec.c
+++ b/regexec.c
@@ -3927,31 +3927,69 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
break;
case NREFFL:
- {
+ { /* The capture buffer cases. The ones beginning with N for the
+ named buffers just convert to the equivalent numbered and
+ pretend they were called as the corresponding numbered buffer
+ op. */
char *s;
char type;
+ I32 (*folder)() = NULL; /* NULL assumes will be NREF, REF: no
+ folding */
+ const U8 * fold_array = NULL;
+
PL_reg_flags |= RF_tainted;
- /* FALL THROUGH */
- case NREF:
+ folder = foldEQ_locale;
+ fold_array = PL_fold_locale;
+ type = REFFL;
+ goto do_nref;
+
+ case NREFFU:
+ folder = foldEQ_latin1;
+ fold_array = PL_fold_latin1;
+ type = REFFU;
+ goto do_nref;
+
case NREFF:
- type = OP(scan);
+ folder = foldEQ;
+ fold_array = PL_fold;
+ type = REFF;
+ goto do_nref;
+
+ case NREF:
+ type = REF;
+ do_nref:
+
+ /* For the named back references, find the corresponding buffer
+ * number */
n = reg_check_named_buff_matched(rex,scan);
- if ( n ) {
- type = REF + ( type - NREF );
- goto do_ref;
- } else {
+ if ( ! n ) {
sayNO;
- }
- /* unreached */
+ }
+ goto do_nref_ref_common;
+
case REFFL:
PL_reg_flags |= RF_tainted;
+ folder = foldEQ_locale;
+ fold_array = PL_fold_locale;
+ goto do_ref;
+
+ case REFFU:
+ folder = foldEQ_latin1;
+ fold_array = PL_fold_latin1;
+ goto do_ref;
+
+ case REFF:
+ folder = foldEQ;
+ fold_array = PL_fold;
/* FALL THROUGH */
+
case REF:
- case REFF:
- n = ARG(scan); /* which paren pair */
+ do_ref:
type = OP(scan);
- do_ref:
+ n = ARG(scan); /* which paren pair */
+
+ do_nref_ref_common:
ln = PL_regoffs[n].start;
PL_reg_leftiter = PL_reg_maxiter; /* Void cache */
if (*PL_reglastparen < n || ln == -1)
@@ -3960,49 +3998,40 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
break;
s = PL_bostr + ln;
- if (utf8_target && type != REF) { /* REF can do byte comparison */
- char *l = locinput;
- const char *e = PL_bostr + PL_regoffs[n].end;
- /*
- * Note that we can't do the "other character" lookup trick as
- * in the 8-bit case (no pun intended) because in Unicode we
- * have to map both upper and title case to lower case.
- */
- if (type == REFF) {
- while (s < e) {
- STRLEN ulen1, ulen2;
- U8 tmpbuf1[UTF8_MAXBYTES_CASE+1];
- U8 tmpbuf2[UTF8_MAXBYTES_CASE+1];
-
- if (l >= PL_regeol)
- sayNO;
- toLOWER_utf8((U8*)s, tmpbuf1, &ulen1);
- toLOWER_utf8((U8*)l, tmpbuf2, &ulen2);
- if (ulen1 != ulen2 || memNE((char *)tmpbuf1, (char *)tmpbuf2, ulen1))
- sayNO;
- s += ulen1;
- l += ulen2;
- }
+ if (type != REF /* REF can do byte comparison */
+ && (utf8_target
+ || (type == REFFU
+ && (*s == (char) LATIN_SMALL_LETTER_SHARP_S
+ || *locinput == (char) LATIN_SMALL_LETTER_SHARP_S))))
+ { /* XXX handle REFFL better */
+ char * limit = PL_regeol;
+
+ /* This call case insensitively compares the entire buffer
+ * at s, with the current input starting at locinput, but
+ * not going off the end given by PL_regeol, and returns in
+ * limit upon success, how much of the current input was
+ * matched */
+ if (! foldEQ_utf8(s, NULL, PL_regoffs[n].end - ln, utf8_target,
+ locinput, &limit, 0, utf8_target))
+ {
+ sayNO;
}
- locinput = l;
+ locinput = limit;
nextchr = UCHARAT(locinput);
break;
}
- /* Inline the first character, for speed. */
+ /* Not utf8: Inline the first character, for speed. */
if (UCHARAT(s) != nextchr &&
(type == REF ||
- (UCHARAT(s) != (type == REFF
- ? PL_fold : PL_fold_locale)[nextchr])))
+ UCHARAT(s) != fold_array[nextchr]))
sayNO;
ln = PL_regoffs[n].end - ln;
if (locinput + ln > PL_regeol)
sayNO;
if (ln > 1 && (type == REF
? memNE(s, locinput, ln)
- : (type == REFF
- ? ! foldEQ(s, locinput, ln)
- : ! foldEQ_locale(s, locinput, ln))))
+ : ! folder(s, locinput, ln)))
sayNO;
locinput += ln;
nextchr = UCHARAT(locinput);
--
1.5.6.3
|
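The refactoring described in the patch above, where each case label picks the folding routine up front so that the shared comparison code need not test the node type again, can be sketched as follows. All names here (`backref_kind`, `backref_matches`, the `fold_eq_*` helpers) are hypothetical, the folds are simplified single-byte ones, and multi-character folds such as the sharp s (which the patch routes to `foldEQ_utf8`) are deliberately ignored.

```c
#include <stddef.h>
#include <string.h>

typedef int (*fold_eq_fn)(const unsigned char *a, const unsigned char *b,
                          size_t len);

static unsigned char lower_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 0x20) : c;
}

static unsigned char lower_latin1(unsigned char c)
{
    /* U+00C0..U+00DE lowercase to U+00E0..U+00FE, except 0xD7. */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 0x20);
    return lower_ascii(c);
}

static int fold_eq_ascii(const unsigned char *a, const unsigned char *b,
                         size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (lower_ascii(a[i]) != lower_ascii(b[i]))
            return 0;
    return 1;
}

static int fold_eq_latin1(const unsigned char *a, const unsigned char *b,
                          size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (lower_latin1(a[i]) != lower_latin1(b[i]))
            return 0;
    return 1;
}

typedef enum { BR_EXACT, BR_FOLD_ASCII, BR_FOLD_LATIN1 } backref_kind;

/* Compare a captured buffer with the current input. The switch only
 * selects the comparison routine; the comparison itself is shared. */
int backref_matches(backref_kind kind, const unsigned char *captured,
                    const unsigned char *input, size_t len)
{
    fold_eq_fn folder = NULL;   /* NULL means exact, byte-wise comparison */

    switch (kind) {
    case BR_FOLD_LATIN1: folder = fold_eq_latin1; break;
    case BR_FOLD_ASCII:  folder = fold_eq_ascii;  break;
    case BR_EXACT:       break;
    }
    return folder ? folder(captured, input, len)
                  : memcmp(captured, input, len) == 0;
}
```

A capture of the bytes C0 62 63 compared against E0 42 43 matches only with `BR_FOLD_LATIN1`, which is the difference the new node type is meant to express for non-utf8 strings.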
From @khwilliamson
0011-regcomp.c-Generate-REFFU-and-NREFFU.patch
From 603718e020407d784c920301500232e5bd8902bf Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:35:13 -0700
Subject: [PATCH] regcomp.c: Generate REFFU and NREFFU
This causes the new nodes that denote Unicode semantics in
backreferences to be generated when appropriate.
Because the addition of these nodes was at the end of the node list, the
arithmetic relation that previously was valid no longer is.
---
regcomp.c | 34 ++++++++++++++++++++++++++--------
1 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index 392b075..2df0a6e 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -5849,9 +5849,15 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
SvREFCNT_inc_simple_void(sv_dat);
}
RExC_sawback = 1;
- ret = reganode(pRExC_state,
- (U8)(FOLD ? (LOC ? NREFFL : NREFF) : NREF),
- num);
+ ret = reganode(pRExC_state,
+ ((! FOLD)
+ ? NREF
+ : (UNI_SEMANTICS)
+ ? NREFFU
+ : (LOC)
+ ? NREFFL
+ : NREFF),
+ num);
*flagp |= HASWIDTH;
Set_Node_Offset(ret, parse_start+1);
@@ -7531,8 +7537,14 @@ tryagain:
RExC_sawback = 1;
ret = reganode(pRExC_state,
- (U8)(FOLD ? (LOC ? NREFFL : NREFF) : NREF),
- num);
+ ((! FOLD)
+ ? NREF
+ : (UNI_SEMANTICS)
+ ? NREFFU
+ : (LOC)
+ ? NREFFL
+ : NREFF),
+ num);
*flagp |= HASWIDTH;
/* override incorrect value set in reganode MJD */
@@ -7593,8 +7605,14 @@ tryagain:
}
RExC_sawback = 1;
ret = reganode(pRExC_state,
- (U8)(FOLD ? (LOC ? REFFL : REFF) : REF),
- num);
+ ((! FOLD)
+ ? REF
+ : (UNI_SEMANTICS)
+ ? REFFU
+ : (LOC)
+ ? REFFL
+ : REFF),
+ num);
*flagp |= HASWIDTH;
/* override incorrect value set in reganode MJD */
@@ -9594,7 +9612,7 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o)
else if (k == REF || k == OPEN || k == CLOSE || k == GROUPP || OP(o)==ACCEPT) {
Perl_sv_catpvf(aTHX_ sv, "%d", (int)ARG(o)); /* Parenth number */
if ( RXp_PAREN_NAMES(prog) ) {
- if ( k != REF || OP(o) < NREF) {
+ if ( k != REF || (OP(o) != NREF && OP(o) != NREFF && OP(o) != NREFFL && OP(o) != NREFFU)) {
AV *list= MUTABLE_AV(progi->data->data[progi->name_list_idx]);
SV **name= av_fetch(list, ARG(o), 0 );
if (name)
--
1.5.6.3
|
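Editorial sketch (not part of the patch): the nested conditional added to regcomp.c picks the backreference node in a fixed order of precedence. The Perl below only models that ordering, assuming the three flags behave as plain booleans. The sub name `backref_node` and its hash keys are hypothetical and do not exist in the Perl source.

```perl
use strict;
use warnings;

# Hypothetical model of the node choice made in the patched reganode() calls.
# Order of precedence, as in the C ternary chain:
#   no folding -> REF; Unicode semantics -> REFFU; locale -> REFFL; else REFF.
# Named backreferences use the same names with an "N" prefix.
sub backref_node {
    my (%flag) = @_;
    my $prefix = $flag{named} ? 'N' : '';
    return $prefix . 'REF'   unless $flag{fold};
    return $prefix . 'REFFU' if $flag{unicode};
    return $prefix . 'REFFL' if $flag{locale};
    return $prefix . 'REFF';
}

print backref_node(fold => 1, unicode => 1), "\n";             # REFFU
print backref_node(fold => 1, locale => 1, named => 1), "\n";  # NREFFL
print backref_node(fold => 0, unicode => 1), "\n";             # REF
```

Because the Unicode test comes before the locale test, a pattern with both flags set would get the REFFU form in this model, mirroring the order of the conditions in the diff.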
From @khwilliamson
0012-re-fold_grind.t-Add-tests-for-NREFFU-REFFU.patch
From 82e2266183ac5b10b6dcfc4d165545629ccd227a Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Tue, 30 Nov 2010 22:58:37 -0700
Subject: [PATCH] re/fold_grind.t: Add tests for NREFFU, REFFU
This adds simple tests for these. Inspection of the code indicated to
me that more complex tests were not warranted.
---
t/re/fold_grind.t | 24 ++++++++++++++++++++++--
1 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/t/re/fold_grind.t b/t/re/fold_grind.t
index fd69cdb..55241e3 100644
--- a/t/re/fold_grind.t
+++ b/t/re/fold_grind.t
@@ -1,5 +1,4 @@
-# Grind out a lot of combinatoric tests for folding. Still missing are
-# testing backreferences and tries.
+# Grind out a lot of combinatoric tests for folding.
use charnames ":full";
@@ -259,7 +258,28 @@ foreach my $test (sort { numerically } keys %tests) {
my $lhs = join "", @x_target;
my @rhs = @x_pattern;
+ my $rhs = join "", @rhs;
my $should_fail = ! $uni_semantics && $ord >= 128 && $ord < 256 && ! $is_self;
+
+ # Do simple tests of referencing capture buffers, named and
+ # numbered.
+ my $op = '=~';
+ $op = '!~' if $should_fail;
+ my $eval = "my \$c = \"$lhs$rhs\"; my \$p = qr/(?$charset:^($rhs)\\1\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+ push @eval_tests, qq[ok(eval '$eval', '$eval')];
+ $eval = "my \$c = \"$lhs$rhs\"; my \$p = qr/(?$charset:^(?<grind>$rhs)\\k<grind>\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+ push @eval_tests, qq[ok(eval '$eval', '$eval')];
+ $count += 2;
+ if ($lhs ne $rhs) {
+ $eval = "my \$c = \"$rhs$lhs\"; my \$p = qr/(?$charset:^($rhs)\\1\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+ push @eval_tests, qq[ok(eval '$eval', '$eval')];
+ $eval = "my \$c = \"$rhs$lhs\"; my \$p = qr/(?$charset:^(?<grind>$rhs)\\k<grind>\$)/i;$upgrade_target$upgrade_pattern \$c $op \$p";
+ push @eval_tests, qq[ok(eval '$eval', '$eval')];
+ $count += 2;
+ }
+ #diag $eval_tests[-1];
+ #next;
+
foreach my $bracketed (0, 1) { # Put rhs in [...], or not
foreach my $inverted (0,1) {
next if $inverted && ! $bracketed;
--
1.5.6.3
|
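Editorial sketch (not part of the patch): the generated tests above boil down to checks like the following, shown here for a single character pair. It assumes an ASCII platform and a Perl new enough to accept the `(?d:...)` and `(?u:...)` modifiers (5.14 or later). The variable names are illustrative only.

```perl
use strict;
use warnings;

# a-grave followed by A-grave; the literal is not UTF-8 encoded internally.
my $target = "\xE0\xC0";

# Same characters, but stored UTF-8 encoded internally.
my $upgraded = $target;
utf8::upgrade($upgraded);

# /d rules: native (ASCII-only) folding for a non-UTF-8 target,
# so the backreference does not match the differently cased A-grave.
my $d_native   = $target   =~ /(?d:^(\xE0)\1$)/i ? 1 : 0;

# /d rules on an upgraded target: Unicode folding applies (the "Unicode Bug").
my $d_upgraded = $upgraded =~ /(?d:^(\xE0)\1$)/i ? 1 : 0;

# /u rules: Unicode folding regardless of the internal encoding.
my $u_numbered = $target =~ /(?u:^(\xE0)\1$)/i ? 1 : 0;
my $u_named    = $target =~ /(?u:^(?<grind>\xE0)\k<grind>$)/i ? 1 : 0;

print "d native=$d_native d upgraded=$d_upgraded ",
      "u numbered=$u_numbered u named=$u_named\n";
```

The first result is the `$should_fail` case in the test file: without Unicode semantics, a code point between 128 and 255 only matches itself.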
From @khwilliamson
0013-Nit-in-perlunicode.pod.patch
From 442698edd07704c7fbcd83ba3c1a0d3fed06373f Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:15:18 -0700
Subject: [PATCH] Nit in perlunicode.pod
---
pod/perlunicode.pod | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index b950f7b..20acb55 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -23,7 +23,7 @@ Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
-the ":utf8" layer. Other encodings can be converted to Perl's
+the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
--
1.5.6.3
|
From @khwilliamson
0014-Document-Unicode-doc-fix.patch
From 371a6b022abefe8c1377d3d8811431654d1da46d Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:33:54 -0700
Subject: [PATCH] Document Unicode doc fix
---
lib/feature.pm | 21 ++++++++++++++----
pod/perldelta.pod | 33 +++++++++++++++++++++++------
pod/perlre.pod | 44 ++++++++++++++++++++++-----------------
pod/perlunicode.pod | 57 +++++++++++++++-----------------------------------
pod/perlunifaq.pod | 42 ++++++++++++++++++------------------
5 files changed, 105 insertions(+), 92 deletions(-)
diff --git a/lib/feature.pm b/lib/feature.pm
index f8a9078..c70010d 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -105,11 +105,22 @@ See L<perlsub/"Persistent Private Variables"> for details.
=head2 the 'unicode_strings' feature
-C<use feature 'unicode_strings'> tells the compiler to treat
-all strings outside of C<use locale> and C<use bytes> as Unicode. It is
-available starting with Perl 5.11.3, but is not fully implemented.
-
-See L<perlunicode/The "Unicode Bug"> for details.
+C<use feature 'unicode_strings'> tells the compiler to use Unicode semantics
+in all string operations executed within its scope (unless they are also
+within the scope of either C<use locale> or C<use bytes>). The same applies
+to all regular expressions compiled within the scope, even if executed outside
+it.
+
+C<no feature 'unicode_strings'> tells the compiler to use the traditional
+Perl semantics wherein the native character set semantics is used unless it is
+clear to Perl that Unicode is desired. This can lead to some surprises
+when the behavior suddenly changes. (See
+L<perlunicode/The "Unicode Bug"> for details.) For this reason, if you are
+potentially using Unicode in your program, the
+C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
+
+This subpragma is available starting with Perl 5.11.3, but was not fully
+implemented until 5.13.8.
=head1 FEATURE BUNDLES
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index cfeff1f..b7d710b 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -2,7 +2,6 @@
=for comment
This has been completed up to 779bcb7d, except for:
-1b9f127-fad448f (Karl Williamson says he will do this)
ad9e76a8629ed1ac483f0a7ed0e4da40ac5a1a00
d9a4b459f94297889956ac3adc42707365f274c2
@@ -81,6 +80,18 @@ method support still works as expected:
open my $fh, ">", $file;
$fh->autoflush(1); # IO::File not loaded
+=head2 Full functionality for C<use feature 'unicode_strings'>
+
+This release provides full functionality for C<use feature
+'unicode_strings'>. Under its scope, all string operations executed and
+regular expressions compiled (even if executed outside its scope) have
+Unicode semantics. See L<feature>.
+
+This feature avoids the "Unicode Bug" (See
+L<perlunicode/The "Unicode Bug"> for details.) If there is a
+possibility that your code will process Unicode strings, you are
+B<strongly> encouraged to use this subpragma to avoid nasty surprises.
+
=head1 Security
XXX Any security-related notices go here. In particular, any security
@@ -492,12 +503,6 @@ L<[perl #79178]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79178>.
=item *
-A number of bugs with regular expression bracketed character classes
-have been fixed, mostly having to do with matching characters in the
-non-ASCII Latin-1 range.
-
-=item *
-
A closure containing an C<if> statement followed by a constant or variable
is no longer treated as a constant
L<[perl #63540]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=63540>.
@@ -514,6 +519,20 @@ A regular expression optimisation would sometimes cause a match with a
C<{n,m}> quantifier to fail when it should match
L<[perl #79152]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79152>.
+=item *
+
+What has become known as the "Unicode Bug" is resolved in this release.
+Under C<use feature 'unicode_strings'>, the internal storage format of a
+string no longer affects the external semantics. There are two known
+exceptions. User-defined case changing functions, which are planned to
+be deprecated in 5.14, require utf8-encoded strings to function; and the
+character C<LATIN SMALL LETTER SHARP S> in regular expression
+case-insensitive matching has a somewhat different set of bugs depending
+on the internal storage format. Case-insensitive matching of all
+characters that have multi-character matches, as this one does, is
+problematical in Perl.
+L<[perl #58182]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182>.
+
=back
=head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index acc1ad5..f415a16 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -646,31 +646,37 @@ locale, and can differ from one match to another if there is an
intervening call of the
L<setlocale() function|perllocale/The setlocale function>.
This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma.
+within the scope of a C<"use locale"> pragma. Results are not
+well-defined when using this and matching against a utf8-encoded string.
C<"u"> means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is compiled within the scope
-of a L<C<"use feature 'unicode_strings">|feature> pragma (and isn't
-also in the scope of L<C<"use locale">|locale> nor
-L<C<"use bytes">|bytes> pragmas. It is not fully implemented at the
-time of this writing, but work is being done to complete the job. On
-EBCDIC platforms this currently has no effect, but on ASCII platforms,
-it effectively turns them into Latin-1 platforms. That is, the ASCII
-characters remain as ASCII characters (since ASCII is a subset of
-Latin-1), but the non-ASCII code points are treated as Latin-1
-characters. Right now, this only applies to the C<"\b">, C<"\s">, and
-C<"\w"> pattern matching operators, plus their complements. For
-example, when this option is not on, C<"\w"> matches precisely
-C<[A-Za-z0-9_]> (on a non-utf8 string). When the option is on, it
-matches not just those, but all the Latin-1 word characters (such as an
-"n" with a tilde). It thus matches exactly the same set of code points
-from 0 to 255 as it would if the string were encoded in utf8.
+automatically set if the regular expression is encoded in utf8, or is
+compiled within the scope of a
+L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
+the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes>
+pragmas). On ASCII platforms, the code points between 128 and 255 take on their
+Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
+in strict ASCII their meanings are undefined. Thus the platform
+effectively becomes a Unicode platform. The ASCII characters remain as
+ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For
+example, when this option is not on, on a non-utf8 string, C<"\w">
+matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches
+not just those, but all the Latin-1 word characters (such as an "n" with
+a tilde). On EBCDIC platforms, which already are equivalent to Latin-1,
+this modifier changes behavior only when the C<"/i"> modifier is also
+specified, and affects only two characters, giving them full Unicode
+semantics: the C<MICRO SIGN> will match the Greek capital and
+small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
+S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
+(This last case is buggy, however.)
C<"d"> means to use the traditional Perl pattern matching behavior.
This is dualistic (hence the name C<"d">, which also could stand for
-"default"). When this is in effect, Perl matches utf8-encoded strings
+"depends"). When this is in effect, Perl matches utf8-encoded strings
using Unicode rules, and matches non-utf8-encoded strings using the
-platform's native character set rules.
+platform's native character set rules. (If the regular expression
+itself is encoded in utf8, Unicode rules are used regardless of the
+target string's encoding.)
See L<perlunicode/The "Unicode Bug">. It is automatically selected by
default if the regular expression is compiled neither within the scope
of a C<"use locale"> pragma nor a <C<"use feature 'unicode_strings">
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 20acb55..925ae36 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1450,7 +1450,8 @@ The term, the "Unicode bug" has been applied to an inconsistency with the
Unicode characters whose ordinals are in the Latin-1 Supplement block, that
is, between 128 and 255. Without a locale specified, unlike all other
characters or code points, these characters have very different semantics in
-byte semantics versus character semantics.
+byte semantics versus character semantics, unless
+C<use feature 'unicode_strings'> is specified.
In character semantics they are interpreted as Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).
@@ -1514,45 +1515,21 @@ ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.
-Work is being done to correct this, but only some of it is complete.
-What has been finished is:
-
-=over
-
-=item *
-
-the matching of C<\b>, C<\s>, C<\w> and the Posix
-character classes and their complements in regular expressions
-
-=item *
-
-case changing (but not user-defined casing)
-
-=item *
-
-case-insensitive (C</i>) regular expression matching for [bracketed
-character classes] only, except for some bugs with C<LATIN SMALL
-LETTER SHARP S> (which is supposed to match the two character sequence
-"ss" (or "Ss" or "sS" or "SS"), but Perl has a number of bugs for all
-such multi-character case insensitive characters, of which this is just
-one example.
-
-=back
-
-Due to concerns, and some evidence, that older code might
-have come to rely on the existing behavior, the new behavior must be explicitly
-enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
-no new syntax is involved.
-
-See L<perlfunc/lc> for details on how this pragma works in combination with
-various others for casing.
-
-Even though the implementation is incomplete, it is planned to have this
-pragma affect all the problematic behaviors in later releases: you can't
-have one without them all.
-
-In the meantime, a workaround is to always call utf8::upgrade($string), or to
-use the standard module L<Encode>. Also, a scalar that has any characters
+Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
+cause Perl to use Unicode semantics on all string operations within the
+scope of the feature subpragma. Regular expressions compiled in its
+scope retain that behavior even when executed or compiled into larger
+regular expressions outside the scope. (The pragma does not, however,
+affect user-defined case changing operations. These still require a
+UTF-8 encoded string to operate.)
+
+In Perl 5.12, the subpragma affected casing changes, but not regular
+expressions. See L<perlfunc/lc> for details on how this pragma works in
+combination with various others for casing.
+
+For earlier Perls, or when a string is passed to a function outside the
+subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
+or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is above 0x100, or which were specified using either of the
C<\N{...}> notations will automatically have character semantics.
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 877e4d1..9fd2b38 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -138,27 +138,27 @@ concern, and you can just C<eval> dumped data as always.
=head2 Why do some characters not uppercase or lowercase correctly?
-It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support. The plan is to fix this
-in the future, and the casing component has in fact mostly been fixed, but we
-have to deal with the fact that Perl treats equal strings differently,
-depending on the internal state.
-
-First the casing. Just put a C<use feature 'unicode_strings'> near the
-beginning of your program. Within its lexical scope, C<uc>, C<lc>, C<ucfirst>,
-C<lcfirst>, and the regular expression escapes C<\U>, C<\L>, C<\u>, C<\l> use
-Unicode semantics for changing case regardless of whether the UTF8 flag is on
-or not. However, if you pass strings to subroutines in modules outside the
-pragma's scope, they currently likely won't behave this way, and you have to
-try one of the solutions below. There is another exception as well: if you
-have furnished your own casing functions to override the default, these will
-not be called unless the UTF8 flag is on)
-
-This remains a problem for the regular expression constructs
-C</.../i>, C<(?i:...)>, and C</[[:posix:]]/>.
-
-To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This can be used
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+C<use feature 'unicode_strings'> near the beginning of your program.
+Within its lexical scope you shouldn't have this problem. It also is
+automatically enabled under C<use feature ':5.12'> or using C<-E> on the
+command line for Perl 5.12 or higher.
+
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along. Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters. When a string is encoded in UTF-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
+platform) is wanted, and so those characters that are not ASCII
+characters aren't recognized as to what they would be in Unicode.
+C<use feature 'unicode_strings'> tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
+the problem.
+
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode semantics by changing the
+encoding to UTF-8 by doing C<utf8::upgrade($string)>. This can be used
safely on any string, as it checks and does not change strings that have
already been upgraded.
--
1.5.6.3
|
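Editorial sketch (not part of the patch): the behaviour these documentation changes describe can be observed with a single Latin-1 character. It assumes an ASCII platform, Perl 5.14 or later, and no `use locale` or `use bytes` in effect. The variable names are illustrative only.

```perl
use strict;
use warnings;

# LATIN SMALL LETTER A WITH GRAVE, created as a non-UTF-8-encoded string.
my $a_grave = chr(0xE0);

my ($native_uc, $native_w);
{
    no feature 'unicode_strings';
    $native_uc = uc($a_grave);                 # native rules: unchanged
    $native_w  = $a_grave =~ /^\w$/ ? 1 : 0;   # not a word character
}

my ($uni_uc, $uni_w);
{
    use feature 'unicode_strings';
    $uni_uc = uc($a_grave);                    # becomes U+00C0
    $uni_w  = $a_grave =~ /^\w$/ ? 1 : 0;      # is a word character
}

# Workaround for code outside the feature's scope: upgrade the string first.
my $upgraded = $a_grave;
utf8::upgrade($upgraded);
my $upgraded_uc;
{
    no feature 'unicode_strings';
    $upgraded_uc = uc($upgraded);              # Unicode rules via UTF-8 storage
}

printf "native:   uc=U+%04X \\w=%d\n", ord($native_uc), $native_w;
printf "feature:  uc=U+%04X \\w=%d\n", ord($uni_uc), $uni_w;
printf "upgraded: uc=U+%04X\n", ord($upgraded_uc);
```

The explicit `no feature` blocks are there only to keep the comparison visible even if the snippet is pasted into a file that already enables the feature.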
From @khwilliamson
0015-Nit-in-perlre.pod.patch
From 6536d050580ef103778c3163f0fcf213580f1445 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:34:25 -0700
Subject: [PATCH] Nit in perlre.pod
---
pod/perlre.pod | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/pod/perlre.pod b/pod/perlre.pod
index f415a16..b74618f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -686,7 +686,7 @@ Note that the C<d>, C<l>, C<p>, and C<u> modifiers are special in that
they can only be enabled, not disabled, and the C<d>, C<l>, and C<u>
modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one may appear in the construct. Thus, for
-example, C<(?-p)>, C<(?-d:...)>, and C<(?-dl:...)> will warn when
+example, C<(?-p)>, C<(?-d:...)>, and C<(?dl:...)> will warn when
compiled under C<use warnings>.
Note also that the C<p> modifier is special in that its presence
--
1.5.6.3
|
From @khwilliamson
0016-Nit-in-perlunicode.pod.patch
From aed0c30ba7ea67ac1704251c054a48138084596c Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:34:58 -0700
Subject: [PATCH] Nit in perlunicode.pod
---
pod/perlunicode.pod | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 925ae36..242238f 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -101,9 +101,9 @@ or from literals and constants in the source text.
The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.
-The C<use feature 'unicode_strings'> pragma is intended to always, regardless
-of platform, force character (Unicode) semantics in a particular lexical scope.
-In release 5.12, it is partially implemented, applying only to case changes.
+The C<use feature 'unicode_strings'> pragma is intended always,
+regardless of platform, to force character (Unicode) semantics in a
+particular lexical scope.
See L</The "Unicode Bug"> below.
The C<utf8> pragma is primarily a compatibility device that enables
--
1.5.6.3
|
From @khwilliamson
0017-Nit-in-perluniintro.pod.patch
From 3fcce5accbef27d94c7a970a42ab4b580440bf33 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Wed, 1 Dec 2010 16:36:44 -0700
Subject: [PATCH] Nit in perluniintro.pod
---
pod/perluniintro.pod | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/pod/perluniintro.pod b/pod/perluniintro.pod
index f0b2be5..6a8c07d 100644
--- a/pod/perluniintro.pod
+++ b/pod/perluniintro.pod
@@ -83,7 +83,7 @@ Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The
-converse is also not true: some code points do not have an assigned
+converse is not also true: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
--
1.5.6.3
|
From tchrist@perl.com
Thank you, Karl. Thank you very much. --tom |
From @xdg
On Wed, Dec 1, 2010 at 7:28 PM, Tom Christiansen <tchrist@perl.com> wrote:
Likewise, thank you, Karl for all your work on these (and other) -- David |
From @rjbs
* karl williamson <public@khwilliamson.com> [2010-12-01T19:14:58]
I am full of glee! Thanks, Karl! Your work has been amazing, educational, and inspirational. -- |
From @cpansprout
On Wed Dec 01 16:16:32 2010, public@khwilliamson.com wrote:
Thank you. Applied as 164739 to 35146e3. |
@cpansprout - Status changed from 'open' to 'resolved' |
From @khwilliamson
Juerd Waalboer wrote:
Perhaps you could find some time in the next 3 months to look at |
From @demerphq
On 23 September 2008 18:03, Dave Mitchell <davem@iabyn.com> wrote:
++ yves |
From @demerphq
2008/9/26 Rafael Garcia-Suarez <rgarciasuarez@gmail.com>:
Could you expand on this? When I try to reason it through I see so -- |
From @rgs
2010/12/2 demerphq <demerphq@gmail.com>:
In short, locale is not a property of a string, but of the code that |
From @rjbs
* demerphq <demerphq@gmail.com> [2010-12-02T09:34:18]
Yes, ++ indeed. We've been looking at storing something like this with ad hoc magic, but magic If one could look at a scalar and know: 1. it's text ...it would be *massively* *incredibly* useful at fixing *many* bugs in dealing Consider some sort of significant, potentially beer-related award offered to -- |
From juerd@tnx.nl
karl williamson skribis 2010-12-01 17:14 (-0700):
Your work on this set of issues has been wonderful from the beginning. Thank you so much!
Two minor areas is almost infinitely better than a dozen major ones!
:) Juerd Waalboer <juerd@tnx.nl> |
From @khwilliamson
Ricardo Signes wrote:
Shouldn't this be added to perltodo if it really should get done? |
Migrated from rt.perl.org#58182 (status was 'resolved')
Searchable as RT58182$