utf8 fatal warning #9659
From zefram@fysh.org
Created by zefram@fysh.org
$ perl -le 'print do { no warnings "utf8"; "\x{d800}" } =~ /\A[\x{123}]/ ? "yes" : "no"'
Turning warnings on makes the regexp operation die, whereas with warnings The form of the regexp affects behaviour. If the regexp is /\A\x{123}/ The characters in the string that can be complained about this way The character is in fact encoded correctly in Perl-internal UTF-8. Perl Info
|
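The report above can be probed on whatever perl is at hand with a small harness. This is only a sketch and not code from the ticket: `try_match` is a hypothetical helper that wraps the match in `eval`, so a fatal error shows up as "died" instead of aborting the script. What it prints depends on the perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the ticket): run the match inside eval so
# that a fatal "Malformed UTF-8 character" error is reported, not thrown.
sub try_match {
    my ($string, $pattern) = @_;
    my $matched = eval { $string =~ $pattern ? 1 : 0 };
    return "died" unless defined $matched;
    return $matched ? "yes" : "no";
}

# Build the lone surrogate with utf8 warnings off, as in the report.
my $surrogate = do { no warnings 'utf8'; "\x{d800}" };

# The two pattern forms the report contrasts: bracketed class vs. literal.
for my $pattern (qr/\A[\x{123}]/, qr/\A\x{123}/) {
    print "$pattern: ", try_match($surrogate, $pattern), "\n";
}
```

Because the match runs in a scope where `use warnings` is in effect, this exercises the "warnings on" case from the report; the string is never expected to match either pattern, so anything other than "no" points at the behaviour being discussed.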
From @khwilliamson
On Tue Feb 24 13:27:21 2009, zefram@fysh.org wrote:
I'm not sure what to do about this ticket. The basics of it anyway are Surrogates, on the other hand, should never appear in well-formed utf8, |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamson
Zefram, Is it OK if I close this ticket? The Unicode standard says, "Because The message for FFFF has been changed to be correct. --Karl Williamson |
From zefram@fysh.org
Karl Williamson via RT wrote:
No. It's not OK for a warning to be fatal. The situation should either -zefram |
From @cpansprout
On Sun Mar 21 19:08:59 2010, khw wrote:
It may be working as designed, but it was not designed very well.
The regular expression engine is not a security layer. It should not Furthermore, Perl’s strings are not just Unicode. Unicode strings are Regular expressions are for looking at strings. So it should not warn or perl already warns for "\x{d800}" and chr 0xd800. So if such a string is I use Perl’s strings for storing 16-bit binary data. The result is that I propose we stop the regular expression engine from rejecting or There are three patches attached that fix a few cases. There will be |
From @cpansprout
Inline Patch
diff -up blead-63446.base/MANIFEST blead-63446-utf8-warnings/MANIFEST
--- blead-63446.base/MANIFEST 2010-11-26 08:06:10.000000000 -0800
+++ blead-63446-utf8-warnings/MANIFEST 2010-11-27 21:36:33.000000000 -0800
@@ -4804,6 +4804,7 @@ t/porting/podcheck.t Test the POD of sh
t/porting/regen.t Check that regen.pl doesn't need running
t/porting/test_bootstrap.t Test that the instructions for test bootstrapping aren't accidentally overlooked.
t/README Instructions for regression tests
+t/re/beyond_unicode.t See if regexps work with all characters
t/re/fold_grind.t See if case folding works properly
t/re/overload.t Test against string corruption in pattern matches on overloaded objects
t/re/pat_advanced.t See if advanced esoteric patterns work
diff -up blead-63446.base/regcomp.c blead-63446-utf8-warnings/regcomp.c
--- blead-63446.base/regcomp.c 2010-11-24 09:59:12.000000000 -0800
+++ blead-63446-utf8-warnings/regcomp.c 2010-11-28 05:37:38.000000000 -0800
@@ -3038,7 +3038,8 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_
if (UTF) {
const U8 * const s = (U8*)STRING(scan);
l = utf8_length(s, s + l);
- uc = utf8_to_uvchr(s, NULL);
+ uc =
+ utf8n_to_uvchr(s, UTF8_MAXBYTES, NULL, UTF8_ALLOW_ANYUV);
} else {
uc = *((U8*)STRING(scan));
}
@@ -7779,7 +7780,7 @@ tryagain:
if (UTF8_IS_START(*p) && UTF) {
STRLEN numlen;
ender = utf8n_to_uvchr((U8*)p, RExC_end - p,
- &numlen, UTF8_ALLOW_DEFAULT);
+ &numlen, UTF8_ALLOW_ANYUV);
p += numlen;
}
else
@@ -9078,7 +9079,10 @@ S_reguni(pTHX_ const RExC_state_t *pRExC
PERL_ARGS_ASSERT_REGUNI;
- return SIZE_ONLY ? UNISKIP(uv) : (uvchr_to_utf8((U8*)s, uv) - (U8*)s);
+ return
+ SIZE_ONLY
+ ? UNISKIP(uv)
+ : (uvuni_to_utf8_flags((U8*)s, uv, UNICODE_ALLOW_ANY) - (U8*)s);
}
/*
diff -Nurp blead-63446.base/t/re/beyond_unicode.t blead-63446-utf8-warnings/t/re/beyond_unicode.t
--- blead-63446.base/t/re/beyond_unicode.t 1969-12-31 16:00:00.000000000 -0800
+++ blead-63446-utf8-warnings/t/re/beyond_unicode.t 2010-11-28 05:49:47.000000000 -0800
@@ -0,0 +1,30 @@
+#!./perl -w
+
+# This script tests that the regular expression engine can handle all Perl
+# characters, including those that are not Unicode. Unicode characters are
+# merely a subset of Perl characters.
+
+BEGIN {
+ chdir 't' if -d 't';
+ @INC = '../lib';
+ require './test.pl';
+}
+
+plan 1;
+
+my @bad;
+
+sub report_bad {
+ if(@bad) {
+ diag "Bad ranges: ", join " ", map sprintf("%x00..%x00",$_,$_+1), @bad;
+ }
+}
+
+@bad = ();
+for(0..0x1200) {
+ next if rand > .25;
+ my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+ push @bad, $_ if $c !~ quotemeta $c;
+}
+ok !@bad, 'quotemeta $foo matches $foo for every character';
+report_bad; |
From @cpansprout
Inline Patch
diff -up blead-63446-utf8-warnings2/regcomp.c blead-63446-utf8-warnings3/regcomp.c
--- blead-63446-utf8-warnings2/regcomp.c 2010-11-28 06:27:23.000000000 -0800
+++ blead-63446-utf8-warnings3/regcomp.c 2010-11-28 11:03:40.000000000 -0800
@@ -8313,7 +8313,7 @@ parseit:
if (UTF) {
value = utf8n_to_uvchr((U8*)RExC_parse,
RExC_end - RExC_parse,
- &numlen, UTF8_ALLOW_DEFAULT);
+ &numlen, UTF8_ALLOW_ANYUV);
RExC_parse += numlen;
}
else
diff -up blead-63446-utf8-warnings2/regexec.c blead-63446-utf8-warnings3/regexec.c
--- blead-63446-utf8-warnings2/regexec.c 2010-11-28 06:32:01.000000000 -0800
+++ blead-63446-utf8-warnings3/regexec.c 2010-11-28 11:08:39.000000000 -0800
@@ -6217,10 +6217,8 @@ S_reginclass(pTHX_ const regexp * const
/* If c is not already the code point, get it */
if (utf8_target && !UTF8_IS_INVARIANT(c)) {
c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &c_len,
- (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV)
- | UTF8_ALLOW_FFFF | UTF8_CHECK_ONLY);
- /* see [perl #37836] for UTF8_ALLOW_ANYUV; [perl #38293] for
- * UTF8_ALLOW_FFFF */
+ UTF8_ALLOW_ANYUV | UTF8_CHECK_ONLY);
+ /* see [perl #37836], [perl #38293] and [perl #63446] */
if (c_len == (STRLEN)-1)
Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
}
diff -up blead-63446-utf8-warnings2/utf8.c blead-63446-utf8-warnings3/utf8.c
--- blead-63446-utf8-warnings2/utf8.c 2010-11-28 06:26:01.000000000 -0800
+++ blead-63446-utf8-warnings3/utf8.c 2010-11-28 12:40:01.000000000 -0800
@@ -2046,8 +2046,7 @@ Perl_swash_fetch(pTHX_ SV *swash, const
Unicode tables, not a native character number.
*/
const UV code_point = utf8n_to_uvuni(ptr, UTF8_MAXBYTES, 0,
- ckWARN(WARN_UTF8) ?
- 0 : UTF8_ALLOW_ANY);
+ UTF8_ALLOW_ANYUV);
swatch = swash_get(swash,
/* On EBCDIC & ~(0xA0-1) isn't a useful thing to do */
(klen) ? (code_point & ~(needents - 1)) : 0, |
From @cpansprout
Inline Patch
diff -up blead-63446-utf8-warnings/regcomp.c blead-63446-utf8-warnings2/regcomp.c
--- blead-63446-utf8-warnings/regcomp.c 2010-11-28 05:37:38.000000000 -0800
+++ blead-63446-utf8-warnings2/regcomp.c 2010-11-28 06:27:23.000000000 -0800
@@ -1348,7 +1348,7 @@ S_make_trie(pTHX_ RExC_state_t *pRExC_st
HV *widecharmap = NULL;
AV *revcharmap = newAV();
regnode *cur;
- const U32 uniflags = UTF8_ALLOW_DEFAULT;
+ const U32 uniflags = UTF8_ALLOW_ANYUV;
STRLEN len = 0;
UV uvc = 0;
U16 curword = 0;
diff -up blead-63446-utf8-warnings/regexec.c blead-63446-utf8-warnings2/regexec.c
--- blead-63446-utf8-warnings/regexec.c 2010-11-24 05:45:11.000000000 -0800
+++ blead-63446-utf8-warnings2/regexec.c 2010-11-28 06:32:01.000000000 -0800
@@ -1752,7 +1752,7 @@ S_find_byclass(pTHX_ regexp * prog, cons
*/
while (s <= last_start) {
- const U32 uniflags = UTF8_ALLOW_DEFAULT;
+ const U32 uniflags = UTF8_ALLOW_ANYUV;
U8 *uc = (U8*)s;
U16 charid = 0;
U32 base = 1;
@@ -2948,7 +2948,7 @@ S_regmatch(pTHX_ regmatch_info *reginfo,
#endif
dVAR;
register const bool utf8_target = PL_reg_match_utf8;
- const U32 uniflags = UTF8_ALLOW_DEFAULT;
+ const U32 uniflags = UTF8_ALLOW_ANYUV;
REGEXP *rex_sv = reginfo->prog;
regexp *rex = (struct regexp *)SvANY(rex_sv);
RXi_GET_DECL(rex,rexi);
diff -up blead-63446-utf8-warnings/t/re/beyond_unicode.t blead-63446-utf8-warnings2/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings/t/re/beyond_unicode.t 2010-11-28 06:08:59.000000000 -0800
+++ blead-63446-utf8-warnings2/t/re/beyond_unicode.t 2010-11-28 06:09:42.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
require './test.pl';
}
-plan 1;
+plan 2;
my @bad;
@@ -26,5 +26,13 @@ for(0..0x1200) {
my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
push @bad, $_ if $c !~ quotemeta $c;
}
-ok !@bad, 'quotemeta $foo matches $foo for every character';
+ok !@bad, '$foo =~ quotemeta $foo for every character';
+report_bad;
+
+for(0..0x1200) {
+ next if rand > .25;
+ my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+ push @bad, $_ if $c !~ /\Q$c\E|a/;
+}
+ok !@bad, '$foo =~ /\Q$foo\E|a/ for every character';
report_bad; |
From tchrist@perl.com
In-Reply-To: Message from "Father Chrysostomos via RT" <perlbug-followup@perl.org>
Quite.
It’s true.
That seems unfortunate.
What you’ve written seems perfectly reasonable, even desirable. However, I am rather concerned that this could lead to anomalous behavior. Java’s pattern matching acts completely nutty when presented with the kind To demonstrate, I’ll use U+1F47E, ALIEN MONSTER. That’s a 0xD83D and a
* Correct surrogate order: true "\uD83D\uDC7E" =~ /./
* Half a surrogate pair: TRUE "\uD83D" =~ /./
* The other half of a surrogate pair: TRUE "\uDC7E" =~ /./
* Surrogates in backwards order: FALSE "\uDC7E\uD83D" =~ /./
See what I mean? Isn’t that loony? I’m not sure what you would --tom |
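The 0xD83D / 0xDC7E pair quoted above for U+1F47E follows from the standard UTF-16 arithmetic. A minimal sketch of that calculation, where `to_surrogate_pair` is a hypothetical helper name and not anything from the perl core:

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
# into its UTF-16 high and low surrogate code units.
sub to_surrogate_pair {
    my ($cp) = @_;
    die sprintf("U+%04X is not a supplementary code point\n", $cp)
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $offset = $cp - 0x10000;               # leaves a 20-bit value
    my $high   = 0xD800 + ($offset >> 10);    # top 10 bits
    my $low    = 0xDC00 + ($offset & 0x3FF);  # bottom 10 bits
    return ($high, $low);
}

printf "U+1F47E => %04X %04X\n", to_surrogate_pair(0x1F47E);
# prints "U+1F47E => D83D DC7E"
```

Perl's own strings store U+1F47E as a single code point, so this split only matters when comparing against UTF-16 based systems such as Java.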
From @cpansprout
On Sun Nov 28 14:42:02 2010, tom christiansen wrote:
None of that will happen in perl, because 0xDC7E and U+1F47E are
$ perl -le' print "yes" if "\x{1F47E}" =~ /\p{Cs}/'
$ perl -le' print "yes" if "\x{1F47E}" =~ /^.\z/'
It will be treated the same way as \x{110000}-\x{ffffffff}, except that I’m just making the utf8-warning implementation the same as the BTW, here are two more patches. |
From @cpansprout
From: Father Chrysostomos <sprout@cpan.org>
[perl #63446] "x" =~ /\x/ for all characters
This makes "x" =~ /\x/ work for all characters that are not ASCII
Inline Patch
diff -up blead-63446-utf8-warnings4/regcomp.c blead-63446-utf8-warnings5/regcomp.c
--- blead-63446-utf8-warnings4/regcomp.c 2010-11-28 11:03:40.000000000 -0800
+++ blead-63446-utf8-warnings5/regcomp.c 2010-11-28 14:24:16.000000000 -0800
@@ -8326,7 +8326,7 @@ parseit:
if (UTF) {
value = utf8n_to_uvchr((U8*)RExC_parse,
RExC_end - RExC_parse,
- &numlen, UTF8_ALLOW_DEFAULT);
+ &numlen, UTF8_ALLOW_ANYUV);
RExC_parse += numlen;
}
else
diff -up blead-63446-utf8-warnings4/t/re/beyond_unicode.t blead-63446-utf8-warnings5/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings4/t/re/beyond_unicode.t 2010-11-28 14:06:55.000000000 -0800
+++ blead-63446-utf8-warnings5/t/re/beyond_unicode.t 2010-11-28 14:44:59.000000000 -0800
@@ -10,7 +10,7 @@ BEGIN {
require './test.pl';
}
-plan 3;
+plan 4;
my @bad;
@@ -18,7 +18,7 @@ sub test_against_many_chars(&$) {
my($test, $name) = @::_;
@bad = ();
for(0..0x1200) {
- next if rand > .25;
+ next if rand > .125;
&$test([ do { no warnings 'utf8'; map chr, $_<<8 .. $_+1<<8 } ]);
}
ok !@bad, $name;
@@ -42,3 +42,10 @@ test_against_many_chars {
my $c = join "", @{$_[0]};
push @bad, $_ if $c !~ "^[\Q$c\E]+\\z";
} '$foo =~ /[$foo]/ for every character';
+
+test_against_many_chars {
+ # Skip this for the ASCII range, as "a" =~ /\a/ obviously does not match.
+ return if !$_;
+ my $c = join "", @{$_[0]};
+ push @bad, $_ if $c !~ ("^[" . ($c =~ s/(.)/\\$1/gross) . "]+\\z");
+} '"x" =~ /[\x]/ for every character'; |
From @cpansprout
On Sun Nov 28 14:54:51 2010, sprout wrote:
RT did not like those files. Let’s try this again: |
From @cpansprout
From: Father Chrysostomos <sprout@cpan.org>
Make t/re/beyond_unicode.t less repetitive
Inline Patch
diff -up blead-63446-utf8-warnings3/t/re/beyond_unicode.t blead-63446-utf8-warnings4/t/re/beyond_unicode.t
--- blead-63446-utf8-warnings3/t/re/beyond_unicode.t 2010-11-28 12:47:07.000000000 -0800
+++ blead-63446-utf8-warnings4/t/re/beyond_unicode.t 2010-11-28 14:06:55.000000000 -0800
@@ -14,33 +14,31 @@ plan 3;
my @bad;
-sub report_bad {
+sub test_against_many_chars(&$) {
+ my($test, $name) = @::_;
+ @bad = ();
+ for(0..0x1200) {
+ next if rand > .25;
+ &$test([ do { no warnings 'utf8'; map chr, $_<<8 .. $_+1<<8 } ]);
+ }
+ ok !@bad, $name;
+
if(@bad) {
diag "Bad ranges: ", join " ", map sprintf("%x00..%x00",$_,$_+1), @bad;
}
}
-@bad = ();
-for(0..0x1200) {
- next if rand > .25;
- my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+ my $c = join "", @{$_[0]};
push @bad, $_ if $c !~ quotemeta $c;
-}
-ok !@bad, '$foo =~ quotemeta $foo for every character';
-report_bad;
+} '$foo =~ quotemeta $foo for every character';
-for(0..0x1200) {
- next if rand > .25;
- my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+ my $c = join "", @{$_[0]};
push @bad, $_ if $c !~ /\Q$c\E|a/;
-}
-ok !@bad, '$foo =~ /$foo|a/ for every character';
-report_bad;
+} '$foo =~ /$foo|a/ for every character';
-for(0..0x1200) {
- next if rand > .25;
- my $c = do { no warnings 'utf8'; join "", map chr, $_<<8 .. $_+1<<8 };
+test_against_many_chars {
+ my $c = join "", @{$_[0]};
push @bad, $_ if $c !~ "^[\Q$c\E]+\\z";
-}
-ok !@bad, '$foo =~ /[$foo]/ for every character';
-report_bad;
+} '$foo =~ /[$foo]/ for every character'; |
From @khwilliamson
Father Chrysostomos via RT wrote:
I have some uneasiness about this. It needs ample vetting here. First, to make sure you know, I am planning to shortly change things so I had thought of doing that with surrogates as well, but this met with It seems to me that the best solution would be a way to declare a binary |
From @cpansprout
On Sun Nov 28 18:32:05 2010, public@khwilliamson.com wrote:
If warnings are on, right?
I’d better stop, then. :-)
Can you give me a reference?
/i and \x{d800} are orthogonal, so neither one should stop the other Whether I/O, chr and "\x{...}" warn or not, as long as I can turn off But I reiterate that regular expressions should never warn or die for |
From @demerphq
On 28 November 2010 22:16, Father Chrysostomos via RT
I agree, except that I would include /i matches. Using /i on a unicode flagged string implies you want (our brand of) In order to make that work effectively we need to be able to depend on So I think it's just fine if the case-folding logic warns about something. But I agree that the regex engine should not block case sensitive matches. Yves -- |
From @khwilliamson
Father Chrysostomos via RT wrote:
Yes, I keep forgetting to say that.
At least until we see how this resolves, anyway.
The only relatively recent one I can find is a really mild comment from But I ran across this very similar discussion from two years ago: I'm willing to make surrogates internally allowed by default, like
This brings up another question that occurred to me. Didn't you say you
I think most of us agree. If Perl stored its strings internally in U32 |
From @cpansprout
On Mon Nov 29 11:53:45 2010, public@khwilliamson.com wrote:
By 16-bit binary data, I mean sequences of unsigned 16-bit integers. It |
From @cpansprout
On Mon Nov 29 02:06:16 2010, demerphq wrote:
That makes case-tolerance conceptually more complex than it needs to be. |
From @demerphq
On 12 December 2010 22:04, Father Chrysostomos via RT
What I was saying was that if we are doing a case insensitive match In other words, we should NOT warn if someone wants to match a string Also, just as a note, in the early days the character class notation cheers, -- |
From @chipdude
On 12/12/2010 1:02 PM, Father Chrysostomos via RT wrote:
I hate to have to disagree, but: "UTF8" means "UCS Translation Format - So complaining when Perl takes seriously the "u" in "utf8" seems |
From @ikegami
On Mon, Dec 13, 2010 at 7:14 PM, Reverend Chip <rev.chip@gmail.com> wrote:
The name of some internal flag is of very little importance. Perl currently supports strings of arbitrary 32-bit numbers in 32-bit
$ perl -E'say 0xFFFFFFFFFFFFFFFF'
$ perl -E'say ord chr 0xFFFFFFFFFFFFFFFF'
$ perl -MEncode -E'$x=chr 0xFFFFFFFFFFFFFFFF; Encode::_utf8_off($x); say
Despite being named "UTF8", the flag clearly does not imply adherence to (Obviously, uc() and the regex engine will assign meaning to those numbers, It may be that Perl should be changed so its strings are confined to strings - Eric |
From @chipdude
On 12/13/2010 5:24 PM, Eric Brine wrote:
That would be true, if it were purely an internal flag. But the flag
Well of course Perl is designed to perform as gracefully as possible as "Unless explicitly stated, Perl operators use character semantics The above-quoted objections are hardly worth knocking down. Please, |
From @demerphq
On 14 December 2010 07:22, Reverend Chip <rev.chip@gmail.com> wrote:
You are a bit misinformed. The internals specifically contemplated It is only when we must ascribe meaning to codepoints, such as when we There is no reason not to allow \x{D800} to be stored in a utf8 cheers, -- |
From @chipdude
On 12/13/2010 11:24 PM, demerphq wrote:
Code cannot contemplate. What are you trying to say? A hypothetical
Well, of course. Unnecessary validation work is unnecessary. Still,
Perl does treat the string as having meaning under Unicode. This is
Perl's nature includes both compliance and integrity. It's established |
From @ikegami
On Tue, Dec 14, 2010 at 1:46 PM, Reverend Chip <rev.chip@gmail.com> wrote:
as Unicode, then Perl does. But substr, length and index don't treat strings as Unicode. It doesn't I hope you're not saying I'm misusing substr by using it on binary data. So I'm not saying we should support more than Unicode or not, just that we - Eric |
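The point about `substr`, `length` and `index` can be checked directly: they operate on a string as a sequence of integers, whatever Unicode thinks of those integers. A minimal sketch, with sample values chosen here for illustration (a lone surrogate, a value above U+10FFFF, and ASCII "A"):

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence complaints about non-Unicode code points

# A string built from arbitrary integers.
my @values = (0xD800, 0x110000, 0x41);
my $s = join "", map { chr } @values;

print  "length: ", length($s), "\n";               # length: 3
printf "middle: 0x%X\n", ord(substr($s, 1, 1));    # middle: 0x110000
print  "index of 'A': ", index($s, "A"), "\n";     # index of 'A': 2
```

None of these three operations needs to ascribe a Unicode meaning to the elements, which is the distinction being drawn between them and operations such as `uc` or case-insensitive matching.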
From @chipdude
On 12/14/2010 12:31 PM, Eric Brine wrote:
Yes, they do. But their error checking is minimal for performance
For the value of "it" that is Perl as a whole, I've already proven this |
From @khwilliamson
I'd like to come to some closure on this discussion. Let me start by
"2.7 Unicode Strings
"A Unicode string data type is simply an ordered sequence of code units.
"Depending on the programming environment, a Unicode string may or may
"It is straightforward to design basic string manipulation libraries
And the definition of "abstract character":
"D7 Abstract character: A unit of information used for the organization,
"When representing data, the nature of that data is generally symbolic
"An abstract character has no concrete form and should not be confused
"An abstract character does not necessarily correspond to what a user
"The abstract characters encoded by the Unicode Standard are known as
"Abstract characters not directly encoded by the Unicode Standard can
What that bureaucratese comes down to is that an abstract character is a There are 4 categories of code points that are not abstract characters
1) Those that may be assigned in the future; they have General Category Cn.
2) Noncharacters, also having Gc=Cn. These are reserved for internal
3) Beyond Unicode code points. These are the code points above 0x10FFFF
4) Surrogates, having Gc=Cs, which are reserved for use in pairs in
My original proposal this time round was to fix the noncharacters to In this thread, I don't think I've heard what the harm is of allowing |
From @chipdude
On 12/15/2010 11:55 AM, karl williamson wrote:
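Three of the four categories listed above can be recognised from the code point value alone. The following is a rough sketch, assuming the standard ranges (surrogates U+D800..U+DFFF; noncharacters U+FDD0..U+FDEF plus the last two code points of each plane); `classify_code_point` is a hypothetical name, and category 1 (unassigned, Gc=Cn) is deliberately not detected because it needs the Unicode Character Database.

```perl
use strict;
use warnings;

# Hypothetical classifier for categories 2, 3 and 4. Anything not caught
# here is reported as "assigned or unassigned" without consulting the UCD.
sub classify_code_point {
    my ($cp) = @_;
    return "beyond Unicode" if $cp > 0x10FFFF;
    return "surrogate"      if $cp >= 0xD800 && $cp <= 0xDFFF;
    return "noncharacter"
        if ($cp >= 0xFDD0 && $cp <= 0xFDEF)   # contiguous block of 32
        || ($cp & 0xFFFE) == 0xFFFE;          # xxFFFE and xxFFFF in each plane
    return "assigned or unassigned";
}

printf "%-8s %s\n", sprintf("0x%X", $_), classify_code_point($_)
    for 0x41, 0xD800, 0xFDD0, 0xFFFF, 0x1FFFE, 0x110000;
```

The "beyond Unicode" test comes first so that values above 0x10FFFF whose low 16 bits happen to be FFFE or FFFF are not misreported as noncharacters.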
That is a much better explanation than I have previously made as to why
I think "allow" is too overloaded; perhaps I'm misunderstanding what you A utf8 string where some of the code points are surrogates, which is So it seems to me, in the end, that the warnings on surrogates in |
From @khwilliamson
Reverend Chip wrote:
But surrogates do have semantics. The standard is kind of So they are actively maintaining the properties for surrogate code points.
A problem is that it doesn't currently work any way like this when (It is beyond me as to why the 31-bit limit, unless there was concern |
From @chipdude
On 12/17/2010 1:51 PM, karl williamson wrote:
Only a committee would declare something illegal, and also specify how So: Under the "generous in what you accept" principle, if Unicode has
Erm, that sounds like something that should change, then. If current |
From tchrist@perl.com
I thought the same thing. Remember, Java is the place where both a single --tom |
From @chipdude
On 12/17/2010 2:34 PM, Tom Christiansen wrote:
That's remarkable. I'll bet that a reversed surrogate pair matches
Semantic bleed back from Java makes sense; has the Java world been a |
From tchrist@perl.com
Clever guess! But it's worse than that:
U+D83D TRUE: "?" =~ /./
U+1F47E TRUE: "👾" =~ /./
U+DC7E TRUE: "?" =~ /./
U+DC7E.D83D TRUE: "??" =~ /./
U+FDDD TRUE: "�" =~ /./
U+FFFF TRUE: "#" =~ /./
That's with a UTF-8 output encoding. Anything smell fishy to you? I mean, more than once every few lines? :) --tom |
From @ap
* Reverend Chip <rev.chip@gmail.com> [2010-12-14 01:15]:
You are disagreeing with Larry.
It actually does, and has documented that it does. http://perldoc.perl.org/Encode.html#UTF-8-vs.-utf8-vs.-UTF8
Personally I like that Perl is lax there. In fact I want a particular UTF-8 encoder/decoder for Perl at In <http://bsittler.livejournal.com/10381.html> it is described It is part of Perl’s heritage and appeal that it allows your code Regards, |
From @demerphq
On 14 December 2010 19:39, Reverend Chip <rev.chip@gmail.com> wrote:
The coder can contemplate, and then implement, support outside of pure So in particular (and this is documented) we use the term "utf8" to utf8 is equivalent to "UTF8" in that all legal (canonical) UTF8
"Perl knows it's Unicode" is an insufficiently well defined expression Flipping the utf8 bit on a SV tells perl that the integers stored
Only when performing an operation that requires lookup into the
No, it is not.
It is not lazy. It is deliberately designed to do this. Read the code
No. Again, you haven't read the docs. We document that utf8 is not
You are misinformed. See "perlunifaq"
<quote>
=head2 What's the difference between C<UTF-8> and C<utf8>?
C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
For example, utf8 can be used for code points that don't exist in Unicode, like
Okay, if you insist: the "internal format" is utf8, not UTF-8. (When it's not
</quote>
And further from Encode:
<quote>
....We now view strings not as sequences of bytes, but as sequences
That has been the perl's notion of UTF-8 but official UTF-8 is more
Now that is overruled by Larry Wall himself.
From: Larry Wall <larry@wall.org>
On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
For what it's worth, that's how I've always kept them straight in my
Also for what it's worth, Perl 6 will mostly default to strict but
Larry
Do you copy? As of Perl 5.8.7, B<UTF-8> means strict, official UTF-8
encode("utf8", "\x{FFFF_FFFF}", 1); # okay
C<UTF-8> in Encode is actually a canonical name for C<utf-8-strict>.
find_encoding("UTF-8")->name # is 'utf-8-strict'
The UTF8 flag is internally called UTF8, without a hyphen. It indicates
</quote>
Seems to me you have some reading to do. Yves -- |
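The naming distinction quoted above can be observed through Encode's `find_encoding`, which the quoted documentation itself uses. A minimal sketch; the list of names tried is just an illustrative choice:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" (hyphenated, any case) resolves to the strict encoding,
# while "utf8" resolves to Perl's lax encoding.
for my $name ("UTF-8", "utf-8", "utf8") {
    printf "%-6s => %s\n", $name, find_encoding($name)->name;
}
```

With the Encode versions this thread refers to (Perl 5.8.7 onwards), the first two names should report `utf-8-strict` and the last should report `utf8`.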
From @demerphq
On 17 December 2010 00:53, Reverend Chip <rev.chip@gmail.com> wrote:
That is what I, and others have been saying all along. And is actually the only way for things to work which complies with cheers, -- |
From @cpansprout
On Dec 16, 2010, at 3:53 PM, Reverend Chip wrote:
Warnings for surrogates may seem logical at first, but they do not solve my problem of not being able to use modules that I didn’t write. So I’ll just have to continue monkey-patching warnings::import. (I’m getting used to that sort of thing.) \p{foo} should definitely be exempt, as we have \p{Cs} specifically for matching surrogates. |
From @chipdude
On 12/18/2010 1:21 PM, demerphq wrote:
Indeed, you told me so.
I think the existing docs are ambiguous. No matter now, of course. |
From @chipdude
On 12/19/2010 2:40 PM, Father Chrysostomos wrote:
Sorry for the trouble. This seems like a hack needed whenever warnings |
From @chipdude
On 12/18/2010 1:18 PM, demerphq wrote:
OK, fine, I was mistaken. All code points are welcome. The fact that |
From @ikegami
On Mon, Dec 20, 2010 at 6:45 AM, Reverend Chip <rev.chip@gmail.com> wrote:
It's going to be like math on NaN. There are billions of different NaN, just |
From tchrist@perl.com
Yes, *some* of them. The pre-3.1 ones only, mostly. It's a
* Like they have no way to convert between code points and character
* Like they ignore the rules about loose interpretation of properties.
* Like they have \p{Alpha} for [A-Za-z] *only*, and no \p{Alphabetic}.
* Like they have \p{InGreek} but not \p{IsGreek}.
* Like they have \p{javaWhiteSpace}, which they *CLAIM* maps to Unicode
* Like the fools use such lovely things as \p{javaJavaIdentifierStart},
* They can't even get casing right when they claim to.
* Their canonical equivalence is broken because of the damned
* You can't get back out the pattern that you originally compiled, and
This is just the very tip of a huge festering pool of poorly engineered Java's claims of Unicode support are nothing but that: claims. They
(1) Sure, but does anybody really care?
(2) It's documented not to work, so we never have to make it work.
(3) Yes, that's a shame, but fixing it would break backward
(4) That's only required for Level N+1 compliance, so we don't have
There is an arrogance and insularity, and NIH ignorance, amongst the Sun Since they refuse to act on my double-digit worth of Unicode Sure, I may not be able to shame them into fixing anything, but I RESOLVED: DO NOT USE JAVA IF YOU NEED TO DEAL WITH UNICODE F S
Just to their display. The internals are all converted into UTFMH-16. Here: read 'em and weep. --tom |
From tchrist@perl.com
import java.io.*;
public class surotest {
private static PrintStream stdout;
private static void testmatch(String s, String re) {
boolean found = Pattern.compile(re).matcher(s).find();
stdout.printf("U+");
stdout.printf(" %s: ", found ? " TRUE" : "false", s);
public static void main(String[ ] args) {
// yes, this is intentionally outdented
/* stdout = new PrintStream(System.out, true, "UTF-8");
String[] slist = {
for (String s : slist) { testmatch(s, "."); } |
From @khwilliamson
The acceptance of surrogates is no longer dependent on warnings being --Karl Williamson |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#63446 (status was 'resolved')
Searchable as RT63446