utf8.c(): Default to allow problematic code points

Surrogates, non-character code points, and code points that aren't in Unicode are now allowed by default; callers no longer have to specify a flag to allow them. (Most code did specify those flags anyway.) This affects uvuni_to_utf8_flags(), utf8n_to_uvuni(), and the various routines that are specialized interfaces to them. There is now a new set of flags to disallow those code points instead. Further, all 66 of the non-character code points are now known about and handled consistently, instead of just U+FFFF.

Code that requires these code points to be forbidden will have to change to use the new flags. I have looked at all the (few) instances in CPAN where these routines are used; the only one I found that appears to need this, Encode, has already been patched to accommodate this change. Of course, I may have overlooked some subtleties.
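
To make the three categories concrete, here is a minimal standalone sketch of the classification described above. It is not the code in utf8.c: the names DISALLOW_SURROGATE, DISALLOW_NONCHAR, DISALLOW_ABOVE_UNICODE, cp_allowed() and count_nonchars() are hypothetical and chosen only for illustration. The ranges are the standard Unicode ones: surrogates are U+D800..U+DFFF, and the 66 non-characters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch only; they are NOT Perl's flags. */
#define DISALLOW_SURROGATE      0x1u
#define DISALLOW_NONCHAR        0x2u
#define DISALLOW_ABOVE_UNICODE  0x4u

/* U+D800..U+DFFF are reserved for UTF-16 surrogate pairs. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Anything past U+10FFFF is outside Unicode. */
bool is_above_unicode(uint32_t cp)
{
    return cp > 0x10FFFF;
}

/* The 66 non-characters: U+FDD0..U+FDEF (32 of them), plus
 * U+xxFFFE and U+xxFFFF in each of the 17 planes (34 of them). */
bool is_nonchar(uint32_t cp)
{
    if (is_above_unicode(cp))
        return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* A mask of 0 accepts every code point, mirroring the new default;
 * each bit set in 'disallow' rejects one category. */
bool cp_allowed(uint32_t cp, unsigned disallow)
{
    if ((disallow & DISALLOW_SURROGATE) && is_surrogate(cp))
        return false;
    if ((disallow & DISALLOW_NONCHAR) && is_nonchar(cp))
        return false;
    if ((disallow & DISALLOW_ABOVE_UNICODE) && is_above_unicode(cp))
        return false;
    return true;
}

/* Counts the non-characters in the Unicode range by brute force. */
int count_nonchars(void)
{
    int n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_nonchar(cp))
            n++;
    return n;
}
```

With this sketch, cp_allowed(0xD800, 0) accepts the surrogate, while cp_allowed(0xD800, DISALLOW_SURROGATE) rejects it; a caller that needs the old strictness would pass all three bits.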
Perl · Jan 10, 2011 · 949cf49
1 parent 6ee84de
commit 949cf49
Showing 4 changed files with 228 additions and 110 deletions.
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
@@ -5208,15 +5208,16 @@ C<HERE> was retained; anything to the right was discarded.
 
 =item Unicode surrogate U+%X is illegal in UTF-8
 
-=item UTF-16 surrogate 0x%x
-
-(W utf8) You tried to generate half of a UTF-16 surrogate by
-requesting a Unicode character between the code points 0xD800 and
-0xDFFF (inclusive).  That range is reserved exclusively for the use of
-UTF-16 encoding (by having two 16-bit UCS-2 characters); but Perl
-encodes its characters in UTF-8, so what you got is a very illegal
-character.  If you really really know what you are doing you can turn off
-this warning by C<no warnings 'utf8';>.
+=item UTF-16 surrogate U+%X
+
+(W utf8) You had a UTF-16 surrogate in a context where they are
+not considered acceptable.  These code points, between U+D800 and
+U+DFFF (inclusive), are used by Unicode only for UTF-16.  However, Perl
+internally allows all unsigned integer code points (up to the size limit
+available on your platform), including surrogates.  But these can cause
+problems when being input or output, which is likely where this message
+came from.  If you really really know what you are doing you can turn
+off this warning by C<no warnings 'utf8';>.
 
 =item Value of %s can be "0"; test with defined()
 

diff --git a/t/lib/warnings/utf8 b/t/lib/warnings/utf8
@@ -60,12 +60,6 @@ my $hex5  = chr(0x100000);
 my $maxm1 = chr(0x10FFFE);
 my $max   = chr(0x10FFFF);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
 ########
 use warnings 'utf8';
 my $d7ff  = pack("U", 0xD7FF);
@@ -94,12 +88,6 @@ my $hex5  = pack("U", 0x100000);
 my $maxm1 = pack("U", 0x10FFFE);
 my $max   = pack("U", 0x10FFFF);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
 ########
 use warnings 'utf8';
 my $d7ff  = "\x{D7FF}";
@@ -130,10 +118,3 @@ my $maxm1 = "\x{10FFFE}";
 my $max   = "\x{10FFFF}";
 uc($ffff);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
-Unicode non-character 0xffff is illegal for interchange in uc at - line 14.