From 54343421d50fdd812b50142b0fc96a8548f69b9a Mon Sep 17 00:00:00 2001
From: Felipe Gasper
Date: Thu, 20 May 2021 10:22:07 -0400
Subject: [PATCH] Clarify descriptions of unicode_eval and evalbytes.

Issue #18801
---
 lib/feature.pm   |  4 +--
 pod/perlfunc.pod | 79 +++++++++++++++++++----------------------------
 regen/feature.pl |  4 +--
 3 files changed, 35 insertions(+), 52 deletions(-)

diff --git a/lib/feature.pm b/lib/feature.pm
index 5ebb4a3f789c..61261ee41d2f 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -209,8 +209,8 @@ couldn't be changed without breaking some things that had come to rely on
 them, so the feature can be enabled and disabled. Details are at
 L.
 
-C is like string C, but operating on a byte stream that is
-not UTF-8 encoded. Details are at L. Without a
+C is like string C, but it treats its argument as a byte
+string. Details are at L. Without a
 S> nor a S> (or higher) declaration in
 the current scope, you can still access it by instead writing
 C.
diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 47958b285174..8481946012bd 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -2199,29 +2199,13 @@ format definitions remain afterwards.
 =item Under the L feature|feature/The 'unicode_eval' and 'evalbytes' features>
 
 If this feature is enabled (which is the default under a C or
-higher declaration), EXPR is considered to be
-in the same encoding as the surrounding program. Thus if
-S|utf8>> is in effect, the string will be treated as being
-UTF-8 encoded. Otherwise, the string is considered to be a sequence of
-independent bytes. Bytes that correspond to ASCII-range code points
-will have their normal meanings for operators in the string. The
-treatment of the other bytes depends on if the
-L feature|feature/The 'unicode_strings' feature> is
-in effect.
-
-In a plain C without an EXPR argument, being in S> or
-not is irrelevant; the UTF-8ness of C<$_> itself determines the
-behavior.
-
-Any S> or S> declarations within the string have
-no effect, and source filters are forbidden. (C,
-however, can appear within the string.) See also the
-L|/evalbytes EXPR> operator, which works properly with
-source filters.
-
-Variables defined outside the C and used inside it retain their
-original UTF-8ness. Everything inside the string follows the normal
-rules for a Perl program with the given state of S>.
+higher declaration), Perl assumes that EXPR is a character string.
+Any S> or S> declarations within
+the string thus have no effect. Source filters are forbidden as well.
+(C, however, can appear within the string.)
+
+See also the L|/evalbytes EXPR> operator, which works properly
+with source filters.
 
 =item Outside the C<"unicode_eval"> feature
 
@@ -2233,8 +2217,26 @@ breaking existing programs:
 
 =item *
 
-It can lose track of whether something should be encoded as UTF-8 or
-not.
+Perl's internal storage of EXPR affects the behavior of the executed code.
+For example:
+
+    my $v = eval "use utf8; '$expr'";
+
+If $expr is C<"\xc4\x80"> (U+0100 in UTF-8), then the value stored in C<$v>
+will depend on whether Perl stores $expr "upgraded" (cf. L) or
+not:
+
+=over
+
+=item * If upgraded, C<$v> will be C<"\xc4\x80"> (i.e., the
+C has no effect.)
+
+=item * If non-upgraded, C<$v> will be C<"\x{100}">.
+
+=back
+
+This is undesirable since being
+upgraded or not should not affect a string's behavior.
 
 =item *
 
@@ -2360,30 +2362,11 @@ X
 
 This function is similar to a L, except it
 always parses its argument (or L|perlvar/$_> if EXPR is omitted)
-as a string of independent bytes.
-
-If called when S> is in effect, the string will be assumed
-to be encoded in UTF-8, and C will make a temporary copy to
-work from, downgraded to non-UTF-8. If this is not possible
-(because one or more characters in it require UTF-8), the C
-will fail with the error stored in C<$@>.
-
-Bytes that correspond to ASCII-range code points will have their normal
-meanings for operators in the string. The treatment of the other bytes
-depends on if the L feature|feature/The
-'unicode_strings' feature> is in effect.
-
-Of course, variables that are UTF-8 and are referred to in the string
-retain that:
-
-    my $a = "\x{100}";
-    evalbytes 'print ord $a, "\n"';
-
-prints
-
-    256
+as a byte string. If the string contains any code points above 255, then
+it cannot be a byte string, and the C will fail with the error
+stored in C<$@>.
 
-and C<$@> is empty.
+C and C within the string have their usual effect.
 
 Source filters activated within the evaluated code apply to the code
 itself.
diff --git a/regen/feature.pl b/regen/feature.pl
index 1186cc3d03e5..4c9b57d627df 100755
--- a/regen/feature.pl
+++ b/regen/feature.pl
@@ -615,8 +615,8 @@ =head2 The 'unicode_eval' and 'evalbytes' features
 them, so the feature can be enabled and disabled. Details are at
 L.
 
-C is like string C, but operating on a byte stream that is
-not UTF-8 encoded. Details are at L. Without a
+C is like string C, but it treats its argument as a byte
+string. Details are at L. Without a
 S> nor a S> (or higher) declaration in
 the current scope, you can still access it by instead writing
 C.
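
Reviewer note: the reworded evalbytes description can be exercised with a short script. This is a minimal sketch and not part of the patch. The variable names ($bytes, $plain, $decoded, $value, $error) are made up for illustration, and the script assumes Perl 5.16 or later. The wide-character case is wrapped in a block eval as a precaution, on the assumption that the failure may be raised as an exception and not only recorded in $@.

```perl
use strict;
use warnings;
use feature 'evalbytes';

# Source text held as two bytes: the UTF-8 encoding of U+0100.
my $bytes = "\xc4\x80";

# Without "use utf8" in the evaluated string, the literal is two characters.
my $plain = evalbytes "'$bytes'";

# With "use utf8" in the evaluated string, the same bytes decode to one character.
my $decoded = evalbytes "use utf8; '$bytes'";

# A code point above 255 cannot be part of a byte string, so this must fail.
# The block eval traps the failure if it is thrown; otherwise $@ is copied
# before the block eval resets it on success.
my ($value, $error);
my $returned = eval { $value = evalbytes "'\x{100}'"; $error = $@; 1 };
$error = $@ unless $returned;

printf "plain: %d, decoded: %d (U+%04X)\n",
    length($plain), length($decoded), ord($decoded);
print "wide input rejected\n" if !defined($value) && length($error);
```

The first two calls show that the result depends only on the bytes and on whether "use utf8" appears inside the evaluated string, which is the behavior the new wording describes.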