Skip to content

Commit

Permalink
Change syntax of script runs
Browse files Browse the repository at this point in the history
The new syntax is (*script_run:...)
and a shortcut (*sr:...)

See http://nntp.perl.org/group/perl.perl5.porters/246762
  • Loading branch information
khwilliamson committed Feb 19, 2018
1 parent 948f26d commit d979061
Show file tree
Hide file tree
Showing 6 changed files with 155 additions and 88 deletions.
11 changes: 11 additions & 0 deletions pod/perldelta.pod
Expand Up @@ -44,6 +44,17 @@ The implication is that you are now free to use locales and changes them
in a threaded environment. Your changes affect only your thread.
See L<perllocale/Multi-threaded operation>

=head2 Script runs now are specified with a different syntax

This isn't really an enhancement, but is being put in this category
because it changes an enhancement from 5.27.8, and there is a new
abbreviated form for it. The syntax is now either of:

(*script_run:...)
(*sr:...)

Previously a C<"+"> was used instead of the C<"*">.

=head1 Security

XXX Any security-related notices go here. In particular, any security
Expand Down
32 changes: 24 additions & 8 deletions pod/perldiag.pod
Expand Up @@ -3009,12 +3009,13 @@ expression pattern should be an indivisible token, with nothing
intervening between the C<"("> and the C<"?">, but you separated them
with whitespace.

=item In '(+...)', the '(' and '+' must be adjacent in regex;
=item In '(*...)', the '(' and '*' must be adjacent in regex;
marked by S<<-- HERE> in m/%s/

(F) The two-character sequence C<"(+"> in this context in a regular
(F) The two-character sequence C<"(*"> in this context in a regular
expression pattern should be an indivisible token, with nothing
intervening between the C<"("> and the C<"+">, but you separated them.
intervening between the C<"("> and the C<"*">, but you separated them.
Fix the pattern and retry.

=item Invalid %s attribute: %s

Expand Down Expand Up @@ -5423,6 +5424,11 @@ terminates. You might use ^# instead. See L<perlform>.
search list. So the additional elements in the replacement list
are meaningless.

=item '(*%s' requires a terminating ':' in regex; marked by <-- HERE in m/%s/

(F) You used a construct that needs a colon and pattern argument.
Supply these or check that you are using the right construct.

=item '%s' resolved to '\o{%s}%d'

(W misc, regexp) You wrote something like C<\08>, or C<\179> in a
Expand Down Expand Up @@ -6625,6 +6631,11 @@ exactly, regardless of whether C<:loose> is used or not.) This error may
also happen if the C<\N{}> is not in the scope of the corresponding
C<S<use charnames>>.

=item Unknown '(*...)' construct '%s' in regex; marked by <-- HERE in m/%s/

(F) The C<(*> was followed by something that the regular expression
compiler does not recognize. Check your spelling.

=item Unknown error

(P) Perl was about to print an error message in C<$@>, but the C<$@> variable
Expand All @@ -6644,11 +6655,6 @@ your needs.
of valid modes: C<< < >>, C<< > >>, C<<< >> >>>, C<< +< >>,
C<< +> >>, C<<< +>> >>>, C<-|>, C<|->, C<< <& >>, C<< >& >>.

=item Unknown (+ pattern in regex; marked by S<<-- HERE> in m/%s/

(F) The C<(+> was followed by something that the regular expression
compiler does not recognize. Check your spelling.

=item Unknown PerlIO layer "%s"

(W layer) An attempt was made to push an unknown layer onto the Perl I/O
Expand Down Expand Up @@ -6841,6 +6847,11 @@ declares it to be in a Unicode encoding that Perl cannot read.
(F) Your machine doesn't support the Berkeley socket mechanism, or at
least that's what Configure thought.

=item Unterminated '(*...' argument in regex; marked by <-- HERE in m/%s/

(F) You used a pattern of the form C<(*...:...)> but did not terminate
the pattern with a C<)>. Fix the pattern and retry.

=item Unterminated attribute list

(F) The lexer found something other than a simple identifier at the
Expand All @@ -6861,6 +6872,11 @@ character to get your parentheses to balance. See L<attributes>.
compressed integer format and could not be converted to an integer.
See L<perlfunc/pack>.

=item Unterminated '(*...' construct in regex; marked by <-- HERE in m/%s/

(F) You used a pattern of the form C<(*...)> but did not terminate
the pattern with a C<)>. Fix the pattern and retry.

=item Unterminated delimiter for here document

(F) This message occurs when a here document label has an initial
Expand Down
13 changes: 8 additions & 5 deletions pod/perlre.pod
Expand Up @@ -708,7 +708,7 @@ the pattern uses L</C<(?[ ])>>

=item 8

the pattern uses L<C<(+script_run: ...)>|/Script Runs>
the pattern uses L<C<(*script_run: ...)>|/Script Runs>

=back

Expand Down Expand Up @@ -2421,6 +2421,7 @@ where side-effects of lookahead I<might> have influenced the
following match, see L</C<< (?>pattern) >>>.

=head2 Script Runs
X<(*script_run:...)> X<(sr:...)>

A script run is basically a sequence of characters, all from the same
Unicode script (see L<perlunicode/Scripts>), such as Latin or Greek. In
Expand All @@ -2438,9 +2439,11 @@ the real Paypal website, but an attacker would craft a look-alike one to
attempt to gather sensitive information from the person.

Starting in Perl 5.28, it is now easy to detect strings that aren't
script runs. Simply enclose just about any pattern like this:
script runs. Simply enclose just about any pattern like either of
these:

(+script_run:pattern)
(*script_run:pattern)
(*sr:pattern)

What happens is that after I<pattern> succeeds in matching, it is
subjected to the additional criterion that every character in it must be
Expand All @@ -2451,7 +2454,7 @@ backtracking, but generally, only malicious input will result in this,
though the slow down could cause a denial of service attack. If your
needs permit, it is best to make the pattern atomic.

(+script_run:(?>pattern))
(*script_run:(?>pattern))

(See L</C<(?E<gt>pattern)>>.)

Expand All @@ -2470,7 +2473,7 @@ own set. This is because these are often used in commerce even in such
scripts. But any mixing of the ASCII and other digits will cause the
sequence to not be a script run, failing the match. As an example,

qr/(+script_run: \d+ \b )/x
qr/(*script_run: \d+ \b )/x

guarantees that the digits matched will all be from the same set of 10.
You won't get a look-alike digit from a different script that has a
Expand Down
164 changes: 97 additions & 67 deletions regcomp.c
Expand Up @@ -10699,45 +10699,48 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
* here (if paren ==2). The forms '(*VERB' and '(?...' disallow such
* intervening space, as the sequence is a token, and a token should be
* indivisible */
bool has_intervening_patws = (paren == 2 || paren == 's')
bool has_intervening_patws = (paren == 2)
&& *(RExC_parse - 1) != '(';

if (RExC_parse >= RExC_end) {
vFAIL("Unmatched (");
}

if (paren == 's') {

/* A nested script run is a no-op besides clustering */
if (RExC_in_script_run) {
paren = ':';
nextchar(pRExC_state);
ret = NULL;
goto parse_rest;
}
RExC_in_script_run = 1;

ret = reg_node(pRExC_state, SROPEN);
is_open = 1;
}
else if ( *RExC_parse == '*') { /* (*VERB:ARG) */
if ( *RExC_parse == '*') { /* (*VERB:ARG), (*construct:...) */
char *start_verb = RExC_parse + 1;
STRLEN verb_len;
char *start_arg = NULL;
unsigned char op = 0;
int arg_required = 0;
int internal_argval = -1; /* if >-1 we are not allowed an argument*/
bool has_upper = FALSE;

if (has_intervening_patws) {
RExC_parse++; /* past the '*' */
vFAIL("In '(*VERB...)', the '(' and '*' must be adjacent");

/* For strict backwards compatibility, don't change the message
* now that we also have lowercase operands */
if (isUPPER(*RExC_parse)) {
vFAIL("In '(*VERB...)', the '(' and '*' must be adjacent");
}
else {
vFAIL("In '(*...)', the '(' and '*' must be adjacent");
}
}
while (RExC_parse < RExC_end && *RExC_parse != ')' ) {
if ( *RExC_parse == ':' ) {
start_arg = RExC_parse + 1;
break;
}
RExC_parse += UTF ? UTF8SKIP(RExC_parse) : 1;
else if (! UTF) {
if (isUPPER(*RExC_parse)) {
has_upper = TRUE;
}
RExC_parse++;
}
else {
RExC_parse += UTF8SKIP(RExC_parse);
}
}
verb_len = RExC_parse - start_verb;
if ( start_arg ) {
Expand All @@ -10746,16 +10749,27 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
}

RExC_parse += UTF ? UTF8SKIP(RExC_parse) : 1;
while ( RExC_parse < RExC_end && *RExC_parse != ')' )
while ( RExC_parse < RExC_end && *RExC_parse != ')' ) {
RExC_parse += UTF ? UTF8SKIP(RExC_parse) : 1;
if ( RExC_parse >= RExC_end || *RExC_parse != ')' )
}
if ( RExC_parse >= RExC_end || *RExC_parse != ')' ) {
unterminated_verb_pattern:
vFAIL("Unterminated verb pattern argument");
if ( RExC_parse == start_arg )
start_arg = NULL;
if (has_upper) {
vFAIL("Unterminated verb pattern argument");
}
else {
vFAIL("Unterminated '(*...' argument");
}
}
} else {
if ( RExC_parse >= RExC_end || *RExC_parse != ')' )
vFAIL("Unterminated verb pattern");
if ( RExC_parse >= RExC_end || *RExC_parse != ')' ) {
if (has_upper) {
vFAIL("Unterminated verb pattern");
}
else {
vFAIL("Unterminated '(*...' construct");
}
}
}

/* Here, we know that RExC_parse < RExC_end */
Expand Down Expand Up @@ -10798,13 +10812,68 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
RExC_seen |= REG_CUTGROUP_SEEN;
}
break;
}
case 's':
if ( memEQs(start_verb, verb_len, "sr")
|| memEQs(start_verb, verb_len, "script_run"))
{
paren = 's';

/* This indicates Unicode rules. */
REQUIRE_UNI_RULES(flagp, NULL);

if (! start_arg) {
goto no_colon;
}

RExC_parse = start_arg;

if (PASS2) {
Perl_ck_warner_d(aTHX_
packWARN(WARN_EXPERIMENTAL__SCRIPT_RUN),
"The script_run feature is experimental"
REPORT_LOCATION, REPORT_LOCATION_ARGS(RExC_parse));

}

if (RExC_in_script_run) {
paren = ':';
nextchar(pRExC_state);
ret = NULL;
goto parse_rest;
}
RExC_in_script_run = 1;

ret = reg_node(pRExC_state, SROPEN);

is_open = 1;
goto parse_rest;
}

break;

no_colon:
vFAIL2utf8f(
"'(*%" UTF8f "' requires a terminating ':'",
UTF8fARG(UTF, verb_len, start_verb));
NOT_REACHED; /*NOTREACHED*/

} /* End of switch */
if ( ! op ) {
RExC_parse += UTF ? UTF8SKIP(RExC_parse) : 1;
vFAIL2utf8f(
if (has_upper || verb_len == 0) {
vFAIL2utf8f(
"Unknown verb pattern '%" UTF8f "'",
UTF8fARG(UTF, verb_len, start_verb));
}
else {
vFAIL2utf8f(
"Unknown '(*...)' construct '%" UTF8f "'",
UTF8fARG(UTF, verb_len, start_verb));
}
}
if ( RExC_parse == start_arg ) {
start_arg = NULL;
}
if ( arg_required && !start_arg ) {
vFAIL3("Verb pattern '%.*s' has a mandatory argument",
verb_len, start_verb);
Expand Down Expand Up @@ -10832,45 +10901,6 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
nextchar(pRExC_state);
return ret;
}
else if (*RExC_parse == '+') { /* (+...) */
RExC_parse++;

if (has_intervening_patws) {
/* XXX Note that a potential gotcha is that outside of /x '( +
* ...)' means to match a space at least once ... This is a
* problem elsewhere too */
vFAIL("In '(+...)', the '(' and '+' must be adjacent");
}

if (! memBEGINPs(RExC_parse, (STRLEN) (RExC_end - RExC_parse),
"script_run:"))
{
RExC_parse += strcspn(RExC_parse, ":)");
vFAIL("Unknown (+ pattern");
}
else {

/* This indicates Unicode rules. */
REQUIRE_UNI_RULES(flagp, NULL);

RExC_parse += sizeof("script_run:") - 1;

if (PASS2) {
Perl_ck_warner_d(aTHX_
packWARN(WARN_EXPERIMENTAL__SCRIPT_RUN),
"The script_run feature is experimental"
REPORT_LOCATION, REPORT_LOCATION_ARGS(RExC_parse));
}

ret = reg(pRExC_state, 's', &flags, depth+1);
if (flags & (RESTART_PASS1|NEED_UTF8)) {
*flagp = flags & (RESTART_PASS1|NEED_UTF8);
return NULL;
}

return ret;
}
}
else if (*RExC_parse == '?') { /* (?...) */
bool is_logical = 0;
const char * const seqstart = RExC_parse;
Expand Down Expand Up @@ -11476,7 +11506,7 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
paren = ':';
ret = NULL;
}
}
}
}
else /* ! paren */
ret = NULL;
Expand Down

0 comments on commit d979061

Please sign in to comment.