utf8.c(): Default to allow problematic code points
Surrogates, non-character code points, and code points that aren't in Unicode
are now allowed by default, instead of having to specify a flag to allow them.
(Most code did specify those flags anyway.)

This affects uvuni_to_utf8_flags(), utf8n_to_uvuni() and various routines that
are specialized interfaces to them.

Now there is a new set of flags to disallow those code points.  Further, all 66
of the non-character code points are known about and handled consistently,
instead of just U+FFFF.
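
The permissive-by-default behaviour described above can be illustrated with a small standalone C sketch. It is not the code in utf8.c: the flag names and function names below are hypothetical, chosen only for this illustration. It assumes the standard Unicode definition of the 66 non-characters (U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch; NOT the identifiers in utf8.h. */
enum {
    DISALLOW_SURROGATE = 1u << 0,  /* U+D800..U+DFFF */
    DISALLOW_NONCHAR   = 1u << 1,  /* the 66 non-character code points */
    DISALLOW_SUPER     = 1u << 2   /* anything above U+10FFFF */
};

static bool is_surrogate(uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

static bool is_super(uint32_t cp) { return cp > 0x10FFFF; }

/* 32 code points in U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of
 * the 17 planes (34 more), for 66 in total. */
static bool is_nonchar(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Permissive by default: flags == 0 accepts every code point.
 * A caller that needs strictness must opt in with DISALLOW_* flags. */
bool code_point_allowed(uint32_t cp, unsigned flags)
{
    if ((flags & DISALLOW_SURROGATE) && is_surrogate(cp)) return false;
    if ((flags & DISALLOW_NONCHAR)   && is_nonchar(cp))   return false;
    if ((flags & DISALLOW_SUPER)     && is_super(cp))     return false;
    return true;
}

/* Counts the non-characters by brute force over the Unicode range. */
unsigned count_nonchars(void)
{
    unsigned n = 0;
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++)
        if (is_nonchar(cp)) n++;
    return n;
}

/* Small self-check showing the intended use. */
void self_check(void)
{
    assert(code_point_allowed(0xD800, 0));                    /* default: allowed */
    assert(!code_point_allowed(0xD800, DISALLOW_SURROGATE));  /* opt-in strictness */
    assert(!code_point_allowed(0x1FFFE, DISALLOW_NONCHAR));   /* not just U+FFFF */
    assert(count_nonchars() == 66);
}
```

In this sketch, a strict caller such as an encoder for interchange would pass all three flags, while a caller that passes 0 gets the new default.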

Code that requires these code points to be forbidden will have to change to use
the new flags.  I have looked at all the (few) instances in CPAN where these
routines are used, and the only one I found that appears to need this,
Encode, has already been patched to accommodate this change.  Of course,
I may have overlooked some subtleties.
Karl Williamson committed Jan 10, 2011
1 parent 6ee84de commit 949cf49
Showing 4 changed files with 228 additions and 110 deletions.
19 changes: 10 additions & 9 deletions pod/perldiag.pod
@@ -5208,15 +5208,16 @@ C<HERE> was retained; anything to the right was discarded.
 
 =item Unicode surrogate U+%X is illegal in UTF-8
 
-=item UTF-16 surrogate 0x%x
-
-(W utf8) You tried to generate half of a UTF-16 surrogate by
-requesting a Unicode character between the code points 0xD800 and
-0xDFFF (inclusive). That range is reserved exclusively for the use of
-UTF-16 encoding (by having two 16-bit UCS-2 characters); but Perl
-encodes its characters in UTF-8, so what you got is a very illegal
-character. If you really really know what you are doing you can turn off
-this warning by C<no warnings 'utf8';>.
+=item UTF-16 surrogate U+%X
+
+(W utf8) You had a UTF-16 surrogate in a context where they are
+not considered acceptable. These code points, between U+D800 and
+U+DFFF (inclusive), are used by Unicode only for UTF-16. However, Perl
+internally allows all unsigned integer code points (up to the size limit
+available on your platform), including surrogates. But these can cause
+problems when being input or output, which is likely where this message
+came from. If you really really know what you are doing you can turn
+off this warning by C<no warnings 'utf8';>.
 
 =item Value of %s can be "0"; test with defined()
 
19 changes: 0 additions & 19 deletions t/lib/warnings/utf8
@@ -60,12 +60,6 @@ my $hex5 = chr(0x100000);
 my $maxm1 = chr(0x10FFFE);
 my $max = chr(0x10FFFF);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
 ########
 use warnings 'utf8';
 my $d7ff = pack("U", 0xD7FF);
@@ -94,12 +88,6 @@ my $hex5 = pack("U", 0x100000);
 my $maxm1 = pack("U", 0x10FFFE);
 my $max = pack("U", 0x10FFFF);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
 ########
 use warnings 'utf8';
 my $d7ff = "\x{D7FF}";
@@ -130,10 +118,3 @@ my $maxm1 = "\x{10FFFE}";
 my $max = "\x{10FFFF}";
 uc($ffff);
 EXPECT
-UTF-16 surrogate 0xd800 at - line 3.
-UTF-16 surrogate 0xdfff at - line 4.
-Unicode non-character 0xfffe is illegal for interchange at - line 8.
-Unicode non-character 0xffff is illegal for interchange at - line 9.
-Unicode non-character 0x10fffe is illegal for interchange at - line 12.
-Unicode non-character 0x10ffff is illegal for interchange at - line 13.
-Unicode non-character 0xffff is illegal for interchange in uc at - line 14.
