-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document formal grammar for regex sets #16779
Comments
From @khwilliamsonThis is a bug report for perl from khw@khw-xps-8930.(none), A message that was part of the security ticket [perl #133649] asked that https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131649#txn-1471088 Flags: Site configuration information for perl 5.29.6: Configured by khw at Fri Nov 30 08:02:34 MST 2018. Summary of my perl5 (revision 5 version 29 subversion 6) configuration: @INC for perl 5.29.6: Environment for perl 5.29.6: PATH=/usr/lib/ccache:/home/khw/bin:/home/khw/perl5/perlbrew/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/usr/local/games:/snap/bin:/home/khw/iands/www:/home/khw/cxoffice/bin |
Here is my try at a formal grammar. I haven't done this kind of thing in a long time, so it may be flawed. primary-expression: unary-expression: multiplicative-expression: additive-expression: expression: My sense is that @demerphq was expecting there to be lots of gotchas when he requested this, around the fact that you can interpolate in an already-compiled regex set expression. And in fact it doesn't check that the code came from interpolation of such. Only that it is a syntactically valid compiled expression displayed out '(?^:...)' I would like to make the regex sets feature non-experimental. And I believe this is the final road block to that. |
On Mon, 9 Dec 2019, 04:27 Karl Williamson, ***@***.***> wrote:
Here is my try at a formal grammar. I haven't done this kind of thing in a
long time, so it may be flawed.
primary-expression:
compiled set-expression, which is typically an interpolated scalar
[ perl character class ]
[:posix:], which is one of [:alpha:], ... [:digit:], ... [:word:]
\perl backslash escape sequence like \n, \p{...} \N{...} \x{...}, etc
( expression )
Normally a grammar is expressed as a set of rules, with quoted text being
literals, and unquoted text being other rules, usually with at least the
'|' alternation meta-character similar to regex'es, and sometime with
additional support for '+' (one or more), '*' (zero or more) and '?' (zero
or one) meta-characters [grammars which support these additional items are
MUCH easier to read and write even if they aren't strictly necessary from a
logical point of view]. In the following ':' is used to separate a rule
name from its definition, and the rule definition is terminated with a ';',
I used '...' to mean "not completed yet" as one might expect.
any_char_class:
traditional_char_class |
extended_char_class;
traditional_char_class: '[' traditional_cc_spec* ']';
extended_char_class: '(?[' extended_cc_spec* '])';
traditional_cc_spec: trad_cc_range |
trad_cc_codepoint |
trad_cc_posix |
trad_cc_meta |
... ;
trad_cc_range: trad_cc_codepoint '-' trad_cc_codepoint;
trad_cc_codepoint: 'a' | 'b' | ... | '\n' | '\t' | '\x{' hex-digits '}' |
... ;
trad_cc_posix: '[:' posix_negate? posix_name ':]';
posix_negate: '^'
posix_name: 'digit' | ... | 'lower';
trad_cc_meta: '\d' | ... | '\s';
extended_cc_spec: ... ;
The part that is really interesting IMO/IIRC is the exact rules for set
operations and nested set definitions. IIRC they were sufficiently
surprising the exact rules weren't intuitive, so they may be difficult to
express as a grammar - which I suspect would suggest that the rule is
difficult to remember, and thus should be changed to simplify things.
Cheers,
Yves
… |
There have been various changes over the years to make things behave properly. I am unaware of anything at this time that would likely be a surprise. It doesn't sound like you have any examples immediately at hand. My guess is that the ones you knew of have been fixed. |
On Mon, 9 Dec 2019 at 19:02, Karl Williamson ***@***.***> wrote:
There have been various changes over the years to make things behave
properly. I am unaware of anything at this time that would likely be a
surprise. It doesn't sound like you have any examples immediately at hand.
My guess is that the ones you knew of have been fixed.
Maybe. :-)
But lets write the rules down properly so we know what you expect, and then
we can generate test cases to see if there are any surprises. I remember
being very surprised about where (?[]) were required or not.
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
On 12/9/19 11:11 AM, Yves Orton wrote:
I remember
being very surprised about where (?[]) were required or not.
That doesn't make sense to me. They are always required for this
feature to kick in.
|
On Mon, 9 Dec 2019 at 19:31, Karl Williamson ***@***.***> wrote:
On 12/9/19 11:11 AM, Yves Orton wrote:
> I remember
> being very surprised about where (?[]) were required or not.
That doesn't make sense to me. They are always required for this
feature to kick in.
IIRC, they are also required sometimes inside of a (?[]) as well. That
is the part that is surprising.
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
On Mon, 9 Dec 2019 at 19:32, demerphq ***@***.***> wrote:
On Mon, 9 Dec 2019 at 19:31, Karl Williamson ***@***.***> wrote:
>
> On 12/9/19 11:11 AM, Yves Orton wrote:
> > I remember
> > being very surprised about where (?[]) were required or not.
>
> That doesn't make sense to me. They are always required for this
> feature to kick in.
IIRC, they are also required sometimes inside of a (?[]) as well. That
is the part that is surprising.
FWIW, this discussion and me saying we should write a grammar is related to
19a498a
commit 19a498a (yves/fix_131649)
Author: Yves Orton <demerphq@gmail.com>
Date: Mon Jun 26 13:19:55 2017 +0200
fix #131649 - extended charclass can trigger assert
The extended charclass parser makes some assumptions during the
first pass which are only true on well structured input, and it
does not properly catch various errors. later on the code assumes
that things the first pass will let through are valid, when in
fact they should trigger errors.
While I was debugging that ticket and fixing those issues I noticed
the rules about how regexes inside of (?[ ... ]) are not as intuitive
as I thought, and that wrappers are required inside sometimes. For
instance I think I encountered these issues while leading up to the
creation of the following error messages:
+=item Unexpected ']' with no following ')' in (?[... in regex; marked
by <-- HERE in m/%s/
+
+(F) While parsing an extended character class a ']' character was encountered
+at a point in the definition where the only legal use of ']' is to close the
+character class definition as part of a '])', you may have forgotten the close
+paren, or otherwise confused the parser.
+
+=item Expecting close paren for nested extended charclass in regex;
marked by <-- HERE in m/%s/
+
+(F) While parsing a nested extended character class like:
+
+ (?[ ... (?flags:(?[ ... ])) ... ])
+ ^
+
+we expected to see a close paren ')' (marked by ^) but did not.
+
+=item Expecting close paren for wrapper for nested extended charclass
in regex; marked by <-- HERE in m/%s/
+
+(F) While parsing a nested extended character class like:
+
+ (?[ ... (?flags:(?[ ... ])) ... ])
+ ^
+
+we expected to see a close paren ')' (marked by ^) but did not.
I dont remember more details, which is why I would like you to write
the grammar down. For instance your original does not seem to account
for (?[ (?flags:(?[ ... ])) ]) at all.
What flags are legal? What things are illegal after a (?flags: ...)
that is inside of an extended char-class, etc.
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
On 12/9/19 11:32 AM, Yves Orton wrote:
On Mon, 9 Dec 2019 at 19:31, Karl Williamson ***@***.***>
wrote:
>
> On 12/9/19 11:11 AM, Yves Orton wrote:
> > I remember
> > being very surprised about where (?[]) were required or not.
>
> That doesn't make sense to me. They are always required for this
> feature to kick in.
IIRC, they are also required sometimes inside of a (?[]) as well. That
is the part that is surprising.
That would be surprising to me too. But looking at the code, and doing
a few experiments, I don't see how that could happen.
…
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16779?email_source=notifications&email_token=AAA2DH5Y337R66HMS3FNVMTQX2FMDA5CNFSM4JYCTEFKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGKFVCQ#issuecomment-563370634>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA2DHZQGOYWSO4LCDBL3IDQX2FMDANCNFSM4JYCTEFA>.
|
On 12/9/19 11:42 AM, Yves Orton wrote:
On Mon, 9 Dec 2019 at 19:32, demerphq ***@***.***> wrote:
>
> On Mon, 9 Dec 2019 at 19:31, Karl Williamson
***@***.***> wrote:
> >
> > On 12/9/19 11:11 AM, Yves Orton wrote:
> > > I remember
> > > being very surprised about where (?[]) were required or not.
> >
> > That doesn't make sense to me. They are always required for this
> > feature to kick in.
>
> IIRC, they are also required sometimes inside of a (?[]) as well. That
> is the part that is surprising.
FWIW, this discussion and me saying we should write a grammar is related to
19a498a
commit 19a498a (yves/fix_131649)
Author: Yves Orton ***@***.***>
Date: Mon Jun 26 13:19:55 2017 +0200
fix #131649 - extended charclass can trigger assert
The extended charclass parser makes some assumptions during the
first pass which are only true on well structured input, and it
does not properly catch various errors. later on the code assumes
that things the first pass will let through are valid, when in
fact they should trigger errors.
While I was debugging that ticket and fixing those issues I noticed
the rules about how regexes inside of (?[ ... ]) are not as intuitive
as I thought, and that wrappers are required inside sometimes. For
instance I think I encountered these issues while leading up to the
creation of the following error messages:
+=item Unexpected ']' with no following ')' in (?[... in regex; marked
by <-- HERE in m/%s/
+
+(F) While parsing an extended character class a ']' character was
encountered
+at a point in the definition where the only legal use of ']' is to
close the
+character class definition as part of a '])', you may have forgotten
the close
+paren, or otherwise confused the parser.
+
+=item Expecting close paren for nested extended charclass in regex;
marked by <-- HERE in m/%s/
+
+(F) While parsing a nested extended character class like:
+
+ (?[ ... (?flags:(?[ ... ])) ... ])
+ ^
+
+we expected to see a close paren ')' (marked by ^) but did not.
+
+=item Expecting close paren for wrapper for nested extended charclass
in regex; marked by <-- HERE in m/%s/
+
+(F) While parsing a nested extended character class like:
+
+ (?[ ... (?flags:(?[ ... ])) ... ])
+ ^
+
+we expected to see a close paren ')' (marked by ^) but did not.
I dont remember more details, which is why I would like you to write
the grammar down. For instance your original does not seem to account
for (?[ (?flags:(?[ ... ])) ]) at all.
What flags are legal? What things are illegal after a (?flags: ...)
that is inside of an extended char-class, etc.
Yves
So, the grammar I wrote earlier in this thread has this rule:
primary-expression:
compiled set-expression, which is typically an interpolated scalar
...
By that, I mean that you could say something like
my $foo = qr/([ ... ])/;
and later include $foo in another set expression
qr/([ ... + $foo + ... ])/;
But by the time I see the interpolated $foo, it looks like
qr/([ ... + (?^flags:(?[...])) + ... ])/;
The boiler plate of the expansion of $foo, including the flags, are
placed there by regcomp.c during the previous compilation of $foo. It
is essential that the pattern accept an interpolated previously compiled
set expression, but the programmer should not be constructing such
things manually. So the grammar merely refers to the interpolated
scalar without specifying the details, such as the flags. Those details
have changed before, and its possible they could change again.
Since you wrote that ticket, there's been a tightening of the checking
that the interpolated code is actually from a previously compiled regex,
but AFAIK, there's no way for me to know its genesis for sure. If
someone does have a way, I'm all ears.
Maybe the diagnostic messages and explanations can be fixed to be more
clear. But the bottom line is that the programmer should not be
specifying flags, etc inside one of these expressions, and hence the
grammar shouldn't include that.
…
--
perl -Mre=debug -e "/just|another|perl|hacker/"
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#16779?email_source=notifications&email_token=AAA2DH5HBGRYZOS6CECZUXTQX2GRBA5CNFSM4JYCTEFKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGKGRYQ#issuecomment-563374306>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA2DH6JUV6BHLF3RP7DHILQX2GRBANCNFSM4JYCTEFA>.
|
On Mon, 9 Dec 2019 at 20:26, Karl Williamson <notifications@github.com>
wrote:
On 12/9/19 11:42 AM, Yves Orton wrote:
> On Mon, 9 Dec 2019 at 19:32, demerphq ***@***.***> wrote:
> >
> > On Mon, 9 Dec 2019 at 19:31, Karl Williamson
> ***@***.***> wrote:
> > >
> > > On 12/9/19 11:11 AM, Yves Orton wrote:
> > > > I remember
> > > > being very surprised about where (?[]) were required or not.
> > >
> > > That doesn't make sense to me. They are always required for this
> > > feature to kick in.
> >
> > IIRC, they are also required sometimes inside of a (?[]) as well. That
> > is the part that is surprising.
>
> FWIW, this discussion and me saying we should write a grammar is related
to
>
> 19a498a
>
> commit 19a498a (yves/fix_131649)
> Author: Yves Orton ***@***.***>
> Date: Mon Jun 26 13:19:55 2017 +0200
>
> fix #131649 - extended charclass can trigger assert
>
> The extended charclass parser makes some assumptions during the
> first pass which are only true on well structured input, and it
> does not properly catch various errors. later on the code assumes
> that things the first pass will let through are valid, when in
> fact they should trigger errors.
>
> While I was debugging that ticket and fixing those issues I noticed
> the rules about how regexes inside of (?[ ... ]) are not as intuitive
> as I thought, and that wrappers are required inside sometimes. For
> instance I think I encountered these issues while leading up to the
> creation of the following error messages:
>
> +=item Unexpected ']' with no following ')' in (?[... in regex; marked
> by <-- HERE in m/%s/
> +
> +(F) While parsing an extended character class a ']' character was
> encountered
> +at a point in the definition where the only legal use of ']' is to
> close the
> +character class definition as part of a '])', you may have forgotten
> the close
> +paren, or otherwise confused the parser.
> +
> +=item Expecting close paren for nested extended charclass in regex;
> marked by <-- HERE in m/%s/
> +
> +(F) While parsing a nested extended character class like:
> +
> + (?[ ... (?flags:(?[ ... ])) ... ])
> + ^
> +
> +we expected to see a close paren ')' (marked by ^) but did not.
> +
> +=item Expecting close paren for wrapper for nested extended charclass
> in regex; marked by <-- HERE in m/%s/
> +
> +(F) While parsing a nested extended character class like:
> +
> + (?[ ... (?flags:(?[ ... ])) ... ])
> + ^
> +
> +we expected to see a close paren ')' (marked by ^) but did not.
>
> I dont remember more details, which is why I would like you to write
> the grammar down. For instance your original does not seem to account
> for (?[ (?flags:(?[ ... ])) ]) at all.
>
> What flags are legal? What things are illegal after a (?flags: ...)
> that is inside of an extended char-class, etc.
>
> Yves
So, the grammar I wrote earlier in this thread has this rule:
primary-expression:
compiled set-expression, which is typically an interpolated scalar
...
By that, I mean that you could say something like
my $foo = qr/([ ... ])/;
and later include $foo in another set expression
qr/([ ... + $foo + ... ])/;
But by the time I see the interpolated $foo, it looks like
qr/([ ... + (?^flags:(?[...])) + ... ])/;
The boiler plate of the expansion of $foo, including the flags, are
placed there by regcomp.c during the previous compilation of $foo. It
is essential that the pattern accept an interpolated previously compiled
set expression, but the programmer should not be constructing such
things manually. So the grammar merely refers to the interpolated
scalar without specifying the details, such as the flags. Those details
have changed before, and its possible they could change again.
Since you wrote that ticket, there's been a tightening of the checking
that the interpolated code is actually from a previously compiled regex,
but AFAIK, there's no way for me to know its genesis for sure. If
someone does have a way, I'm all ears.
Maybe the diagnostic messages and explanations can be fixed to be more
clear. But the bottom line is that the programmer should not be
specifying flags, etc inside one of these expressions, and hence the
grammar shouldn't include that.
It is allowed, and implied by the fact that concatenation of text to
construct a pattern and concatenation of regex objects are and must be
indistinguishable.
./perl -Ilib -lE'print "x"=~/(?[ (?^ux:(?[ [x] ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE [ x ] ])/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^ux:(?[ [x] ])) ])/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ (?^ux:(?[ <-- HERE [x] ])) ])/ at -e line 1.
yes
So IMO it needs to be in the grammar. I have to say I find the rules of the
syntax really confusing and the error messages that are produced relatively
unhelpful. I actually feel this sufficiently to say we shouldnt change the
experimental syntax status until we are sure that we are happy with the
rules. The basis of this is great, some of the rough edges are IMO too
rough. Do we need to see multiple warnings about the experimental status
from a single pattern?
./perl -Ilib -lE'print "x"=~/(?[ (?x:(?[ ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?x:(?[ ])) ])/ at -e line 1.
Unexpected character in regex; marked by <-- HERE in m/(?[ (? <-- HERE
x:(?[ ])) ])/ at -e line 1.
$ ./perl -le'my $s=qr/(?x:(?[ [ x ] + [ y ] ]))/'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?x:(?[ <-- HERE [ x ] + [ y ] ]))/ at -e line 1.
$ ./perl -le'my $s=qr/(?x:(?[ [ x ] + [ y ] ]))/; print "x"=~/(?[ $s ])/'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?x:(?[ <-- HERE [ x ] + [ y ] ]))/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^:(?x:(?[ [ x ] + [ y ] ]))) ])/ at -e line 1.
Expecting '(?flags:(?[...' in regex; marked by <-- HERE in m/(?[ (?^:(?x
<-- HERE :(?[ [ x ] + [ y ] ]))) ])/ at -e line 1.
./perl -le'my $s=qr/(?x:(?[ [ x ] ]))/; print "x"=~/(?[ $s ])/'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?x:(?[ <-- HERE [ x ] ]))/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^:(?x:(?[ [ x ] ]))) ])/ at -e line 1.
Expecting '(?flags:(?[...' in regex; marked by <-- HERE in m/(?[ (?^:(?x
<-- HERE :(?[ [ x ] ]))) ])/ at -e line 1.
$ ./perl -le'my $s=qr/(?[ [ x ] ])/x; print "x"=~/(?[ $s ])/'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE [ x ] ])/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^x:(?[ [ x ] ])) ])/ at -e line 1.
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ (?^x:(?[ <-- HERE [ x ] ])) ])/ at -e line 1.
1
The last two for me are *extremely* problematic. Saying:
qr/PAT/x
and saying
qr/(?x:PAT)/
should produce identical results. Given these issues IMO unfortunately this
feature is not ready to be marked un-experimental, and i think writing down
a grammar and figuring out what is and should be allowed in a formal sense
is required to wrap our heads around it and understand if it is right for
us to support long term.
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
That could be fixed, but it will be fixed automatically without extra effort by making it non-experimental. As to the rest, @demerphq found some real bugs. I have a WIP at smoke-me/khw-sets which I believe addresses his examples. I apologize for the awful error messages. I think this branch cleans them up. The stuff about flags, etc. were extremely misleading. Before I proceed down that branch, I'm hoping @demerphq will play with it a little to see if he can find some results that are still confusing. Flags should now be irrelevant. |
A set operations expression can contain a previously-compiled one interpolated in. Prior to this commit, some heuristics were employed to verify it actually was such a thing, and not a sort of look-alike that wasn't necessarily valid. The heuristics actually forbade legal ones. I don't know of any illegal ones that were let through, but it is certainly possible. Also, the error/warning messages referred to the heuristics, and were unhelpful at best. The technique used instead in this commit is to return a regop only used by this feature for any nested compilations. This guarantees that the caller can determine if the result is valid, and what that result is without having to do any heuristics or inspecting any flags. The error/warning messages are changed to reflect this, and I believe are now helpful. This fixes the bugs in #16779 #16779 (comment)
A set operations expression can contain a previously-compiled one interpolated in. Prior to this commit, some heuristics were employed to verify it actually was such a thing, and not a sort of look-alike that wasn't necessarily valid. The heuristics actually forbade legal ones. I don't know of any illegal ones that were let through, but it is certainly possible. Also, the error/warning messages referred to the heuristics, and were unhelpful at best. The technique used instead in this commit is to return a regop only used by this feature for any nested compilations. This guarantees that the caller can determine if the result is valid, and what that result is without having to do any heuristics or inspecting any flags. The error/warning messages are changed to reflect this, and I believe are now helpful. This fixes the bugs in #16779 #16779 (comment)
It seems a bit weird that I cant do this: my $cc1= qr/[a-c]/; but I can do this: my $cc3= qr/(?[ [a-c] + [x-z] ])/; and I can do this: my $ecc1= qr/(?[ [a-c] ])/; Also shouldn't these two produce the same errors? $ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ ])) ])/ ? "yes" : "no";' $ ./perl -Ilib -lE'print "x"=~/(?[ (?x:(?[ ])) ])/ ? "yes" : "no";' Why does (?x:(?[ ])) produce an "Unexpected character in regex" and (?^x:(?[ ])) produce a Incomplete expression within. Similarly why does: (?x:(?[ [x] ])) produce an exception but (?^x:(?[ [x] ])) does not? The docs say "(?^x:foo) is equivalent to (?x-imns:foo)" so why is one accepted and one is not? $ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ [x] ])) ])/ ? "yes" : "no";' $ ./perl -Ilib -lE'print "x"=~/(?[ (?x-imns:(?[ [x] ])) ])/ ? "yes" : "no";' |
On 2/19/22 06:22, Yves Orton wrote:
It seems a bit weird that I cant do this:
my $cc1= qr/[a-c]/;
my $cc2= qr/(?[ $cc1 + [x-z] ])/;
but I can do this:
my $cc3= qr/(?[ [a-c] + [x-z] ])/;
and I can do this:
my $ecc1= qr/(?[ [a-c] ])/;
my $ecc2= qr/(?[ $ecc1 + [x-z] ])/;
This has caused me to realize that the pod for this feature (in
perlrecharclass) should be rewritten as now in
#19459
The idea of assembling a pattern from smaller interpolated components
(usually scalars) led to problems in implementation and use. We had to
add \g{} beyond the initial \g, as an example, so as to allow the
programmer to write something that wouldn't get an entirely different
meaning depending on the interpolated string.
regex sets are even more vulnerable to this than other pattern
constructs. There are various operators, with precedence to worry
about. After several false starts, the current implementation only
guarantees compilation of interpolated strings inside a regex set if
each such string is itself a legal regex set. It makes sense that a
larger set would be constructed out of the basic building blocks of
smaller sets.
Your tests below all violate that (not until now documented) rule. I
don't think this constraint actually hinders building up a set to do the
complete set of its capabilities. It might be nice to extend things so
it also could accept other interpolated strings, but this would be a
nicety, not necessary for operation, and doing so would not change the
behavior of anything successfully working on the current regime. Thus I
don't think that should stand in the way of de-experimentalizing these.
Also shouldn't these two produce the same errors?
$ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^x:(?[ ])) ])/ at -e line 1.
Incomplete expression within '(?[ ])' in regex; marked by <-- HERE in
m/(?[ (?^x:(?[ <-- HERE ])) ])/ at -e line 1.
$ ./perl -Ilib -lE'print "x"=~/(?[ (?x:(?[ ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?x:(?[ ])) ])/ at -e line 1.
Unexpected character in regex; marked by <-- HERE in m/(?[ (? <-- HERE
x:(?[ ])) ])/ at -e line 1.
Why does (?x:(?[ ])) produce an "Unexpected character in regex" and
(?^x:(?[ ])) produce a Incomplete expression within.
Similarly why does:
(?x:(?[ [x] ]))
produce an exception but
(?^x:(?[ [x] ]))
does not?
I don't get an exception with either of those in blead. Both produce
Final program:
1: EXACT <x> (3)
3: END (0)
anchored "x" at 0..0 (checking anchored isall) minlen 1
The docs say "(?^x:foo) is equivalent to (?x-imns:foo)" so why is one
accepted and one is not?
$ ./perl -Ilib -lE'print "x"=~/(?[ (?^x:(?[ [x] ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?^x:(?[ [x] ])) ])/ at -e line 1.
yes
$ ./perl -Ilib -lE'print "x"=~/(?[ (?x-imns:(?[ [x] ])) ])/ ? "yes" : "no";'
The regex_sets feature is experimental in regex; marked by <-- HERE in
m/(?[ <-- HERE (?x-imns:(?[ [x] ])) ])/ at -e line 1.
Unexpected character in regex; marked by <-- HERE in m/(?[ (? <-- HERE
x-imns:(?[ [x] ])) ])/ at -e line 1.
I thought I had sufficiently fixed up the pod, but I haven't.
It should say that it is illegal to nest these. But you can achieve the
same effect by interpolating one that has already been compiled. In the
case you present:
my $nested = qr/(?x-imns:(?[ [x] ]))/;
print "x"=~/(?[ $nested ])/ ? "yes" : "no";
…
—
|
On Mon, 9 Dec 2019 at 20:26, Karl Williamson ***@***.***>
wrote:
On 12/9/19 11:42 AM, Yves Orton wrote:
> On Mon, 9 Dec 2019 at 19:32, demerphq ***@***.***> wrote:
> >
> > On Mon, 9 Dec 2019 at 19:31, Karl Williamson
> ***@***.***> wrote:
> > >
> > > On 12/9/19 11:11 AM, Yves Orton wrote:
> > > > I remember
> > > > being very surprised about where (?[]) were required or not.
> > >
> > > That doesn't make sense to me. They are always required for this
> > > feature to kick in.
> >
> > IIRC, they are also required sometimes inside of a (?[]) as well. That
> > is the part that is surprising.
>
> FWIW, this discussion and me saying we should write a grammar is related
to
>
> 19a498a
>
> commit 19a498a (yves/fix_131649)
> Author: Yves Orton ***@***.***>
> Date: Mon Jun 26 13:19:55 2017 +0200
>
> fix #131649 - extended charclass can trigger assert
>
> The extended charclass parser makes some assumptions during the
> first pass which are only true on well structured input, and it
> does not properly catch various errors. later on the code assumes
> that things the first pass will let through are valid, when in
> fact they should trigger errors.
>
> While I was debugging that ticket and fixing those issues I noticed
> the rules about how regexes inside of (?[ ... ]) are not as intuitive
> as I thought, and that wrappers are required inside sometimes. For
> instance I think I encountered these issues while leading up to the
> creation of the following error messages:
>
> +=item Unexpected ']' with no following ')' in (?[... in regex; marked
> by <-- HERE in m/%s/
> +
> +(F) While parsing an extended character class a ']' character was
> encountered
> +at a point in the definition where the only legal use of ']' is to
> close the
> +character class definition as part of a '])', you may have forgotten
> the close
> +paren, or otherwise confused the parser.
> +
> +=item Expecting close paren for nested extended charclass in regex;
> marked by <-- HERE in m/%s/
> +
> +(F) While parsing a nested extended character class like:
> +
> + (?[ ... (?flags:(?[ ... ])) ... ])
> + ^
> +
> +we expected to see a close paren ')' (marked by ^) but did not.
> +
> +=item Expecting close paren for wrapper for nested extended charclass
> in regex; marked by <-- HERE in m/%s/
> +
> +(F) While parsing a nested extended character class like:
> +
> + (?[ ... (?flags:(?[ ... ])) ... ])
> + ^
> +
> +we expected to see a close paren ')' (marked by ^) but did not.
>
> I dont remember more details, which is why I would like you to write
> the grammar down. For instance your original does not seem to account
> for (?[ (?flags:(?[ ... ])) ]) at all.
>
> What flags are legal? What things are illegal after a (?flags: ...)
> that is inside of an extended char-class, etc.
>
> Yves
So, the grammar I wrote earlier in this thread has this rule:
primary-expression:
compiled set-expression, which is typically an interpolated scalar
...
By that, I mean that you could say something like
my $foo = qr/([ ... ])/;
and later include $foo in another set expression
qr/([ ... + $foo + ... ])/;
But by the time I see the interpolated $foo, it looks like
qr/([ ... + (?^flags:(?[...])) + ... ])/;
But this assumes that it is coming from a stringified qr// object from the
current process.
Something like Sereal will extract the modifiers and then store them
independently from the pattern so that when it constructs a qr// it is
constructed properly using a function like regexp_pattern.
use re qw(regexp_pattern);
my $qr=qr/[a-z]/ix;
my ($pat,$mods)= re::regexp_pattern($qr);
my $new="(?$mods:$pat)";
should be a legal way to construct a qr// like but non-object based
equivalent of the original. It wont be (?^ix:...) it will be (?ix:...) and
I think it should work the same.
I really dont understand why you are objecting to this kind of thing. Why
should the caret notation be ok but the non-caret is not? That is the core
of my objection here.
The boiler plate of the expansion of $foo, including the flags, are
placed there by regcomp.c during the previous compilation of $foo. It
is essential that the pattern accept an interpolated previously compiled
set expression, but the programmer should not be constructing such
things manually. So the grammar merely refers to the interpolated
scalar without specifying the details, such as the flags. Those details
have changed before, and its possible they could change again.
Which is why I think the expectation that they can only contain (?^flags:
... ) is wrong.
(?^flags: ) is defined to be equivalent to a specific set of flags. I dont
see why the construct that does not use the '^' abbreviation should behave
differently from one that explicitly spells out the flags. That is my
concern here.
Since you wrote that ticket, there's been a tightening of the checking
that the interpolated code is actually from a previously compiled regex,
but AFAIK, there's no way for me to know its genesis for sure. If
someone does have a way, I'm all ears.
I dont think you should care. If I stringify the pattern to save it on
disk, and then bring it back as a string instead of a qr// object it should
still work. If I use our apis to compose the string into modifiers and
pattern then construct it back as a string it should operate just the same
as if it was qr// object.
Maybe the diagnostic messages and explanations can be fixed to be more
clear. But the bottom line is that the programmer should not be
specifying flags, etc inside one of these expressions, and hence the
grammar shouldn't include that.
That just doesn't make sense to me. If we can parse the flags properly, and
we can, then we should. Lets say we introduce a new flag in the future
that changes how character classes are introduced (lets NOT, but let us say
we did for the sake of the argument) then that flag should change how the
pattern is processed.
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
See Perl#16779 (comment) This commit extends the flags accepted by nested calls to regex sets to any legal set. Previously it allowed only '^', as that is what an actual compilation would return. But as pointed out in the conversation in that ticket, it's possible to detach flags from the rest of the pattern, and to write equivalent ones in multiple ways. This commit just changes one line to look for any legal flag, not restricting it to '^'.
On 2/24/22 21:36, Yves Orton wrote:
On Mon, 9 Dec 2019 at 20:26, Karl Williamson ***@***.***>
wrote:
> On 12/9/19 11:42 AM, Yves Orton wrote:
> > On Mon, 9 Dec 2019 at 19:32, demerphq ***@***.***> wrote:
> > >
> > > On Mon, 9 Dec 2019 at 19:31, Karl Williamson
> > ***@***.***> wrote:
> > > >
> > > > On 12/9/19 11:11 AM, Yves Orton wrote:
> > > > > I remember
> > > > > being very surprised about where (?[]) were required or not.
> > > >
> > > > That doesn't make sense to me. They are always required for this
> > > > feature to kick in.
> > >
> > > IIRC, they are also required sometimes inside of a (?[]) as well.
That
> > > is the part that is surprising.
> >
> > FWIW, this discussion and me saying we should write a grammar is
related
> to
> >
> > 19a498a
> >
> > commit 19a498a (yves/fix_131649)
> > Author: Yves Orton ***@***.***>
> > Date: Mon Jun 26 13:19:55 2017 +0200
> >
> > fix #131649 - extended charclass can trigger assert
> >
> > The extended charclass parser makes some assumptions during the
> > first pass which are only true on well structured input, and it
> > does not properly catch various errors. later on the code assumes
> > that things the first pass will let through are valid, when in
> > fact they should trigger errors.
> >
> > While I was debugging that ticket and fixing those issues I noticed
> > the rules about how regexes inside of (?[ ... ]) are not as intuitive
> > as I thought, and that wrappers are required inside sometimes. For
> > instance I think I encountered these issues while leading up to the
> > creation of the following error messages:
> >
> > +=item Unexpected ']' with no following ')' in (?[... in regex; marked
> > by <-- HERE in m/%s/
> > +
> > +(F) While parsing an extended character class a ']' character was
> > encountered
> > +at a point in the definition where the only legal use of ']' is to
> > close the
> > +character class definition as part of a '])', you may have forgotten
> > the close
> > +paren, or otherwise confused the parser.
> > +
> > +=item Expecting close paren for nested extended charclass in regex;
> > marked by <-- HERE in m/%s/
> > +
> > +(F) While parsing a nested extended character class like:
> > +
> > + (?[ ... (?flags:(?[ ... ])) ... ])
> > + ^
> > +
> > +we expected to see a close paren ')' (marked by ^) but did not.
> > +
> > +=item Expecting close paren for wrapper for nested extended charclass
> > in regex; marked by <-- HERE in m/%s/
> > +
> > +(F) While parsing a nested extended character class like:
> > +
> > + (?[ ... (?flags:(?[ ... ])) ... ])
> > + ^
> > +
> > +we expected to see a close paren ')' (marked by ^) but did not.
> >
> > I dont remember more details, which is why I would like you to write
> > the grammar down. For instance your original does not seem to account
> > for (?[ (?flags:(?[ ... ])) ]) at all.
> >
> > What flags are legal? What things are illegal after a (?flags: ...)
> > that is inside of an extended char-class, etc.
> >
> > Yves
>
> So, the grammar I wrote earlier in this thread has this rule:
>
> primary-expression:
> compiled set-expression, which is typically an interpolated scalar
> ...
>
> By that, I mean that you could say something like
>
> my $foo = qr/([ ... ])/;
>
> and later include $foo in another set expression
>
> qr/([ ... + $foo + ... ])/;
>
> But by the time I see the interpolated $foo, it looks like
>
> qr/([ ... + (?^flags:(?[...])) + ... ])/;
>
But this assumes that it is coming from a stringified qr// object from the
current process.
Something like Sereal will extract the modifiers and then store them
independently from the pattern so that when it constructs a qr// it is
constructed properly using a function like regexp_pattern.
use re qw(regexp_pattern);
my $qr=qr/[a-z]/ix;
my ($pat,$mods)= re::regexp_pattern($qr);
my $new="(?$mods:$pat)";
should be a legal way to construct a qr// like but non-object based
equivalent of the original. It wont be (?^ix:...) it will be (?ix:...) and
I think it should work the same.
I really dont understand why you are objecting to this kind of thing. Why
should the caret notation be ok but the non-caret is not? That is the core
of my objection here.
I now think I understand your motivations, and have pushed a new commit
to address your issues.
>
> The boiler plate of the expansion of $foo, including the flags, are
> placed there by regcomp.c during the previous compilation of $foo. It
> is essential that the pattern accept an interpolated previously compiled
> set expression, but the programmer should not be constructing such
> things manually. So the grammar merely refers to the interpolated
> scalar without specifying the details, such as the flags. Those details
> have changed before, and its possible they could change again.
>
Which is why I think the expectation that they can only contain (?^flags:
... ) is wrong.
Fixed
(?^flags: ) is defined to be equivalent to a specific set of flags. I dont
see why the construct that does not use the '^' abbreviation should behave
differently from one that explicitly spells out the flags. That is my
concern here.
Fixed
>
> Since you wrote that ticket, there's been a tightening of the checking
> that the interpolated code is actually from a previously compiled regex,
> but AFAIK, there's no way for me to know its genesis for sure. If
> someone does have a way, I'm all ears.
>
I dont think you should care. If I stringify the pattern to save it on
disk, and then bring it back as a string instead of a qr// object it should
still work. If I use our apis to compose the string into modifiers and
pattern then construct it back as a string it should operate just the same
as if it was qr// object.
The reason I care is that the sets code builds up larger sets from
simple components or other sets. Each such set must be a single
character wide. Most compiled patterns don't fit that. Extra work
could allow some of that, but I don't see that as necessary to remove
experimental status
>
> Maybe the diagnostic messages and explanations can be fixed to be more
> clear. But the bottom line is that the programmer should not be
> specifying flags, etc inside one of these expressions, and hence the
> grammar shouldn't include that.
>
That just doesn't make sense to me. If we can parse the flags properly, and
we can, then we should. Lets say we introduce a new flag in the future
that changes how character classes are introduced (lets NOT, but let us say
we did for the sake of the argument) then that flag should change how the
pattern is processed.
You're right; fixed
…
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
—
Reply to this email directly, view it on GitHub
<#16779 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAA2DH6ZRCKRG2ZFZ32DJRDU44BONANCNFSM4JYCTEFA>.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
On Tue, 1 Mar 2022 at 00:48, Karl Williamson ***@***.***>
wrote:
On 2/24/22 21:36, Yves Orton wrote:
> ... Why
> should the caret notation be ok but the non-caret is not? That is the
core
> of my objection here.
I now think I understand your motivations, and have pushed a new commit
to address your issues.
Yay! Thank you Karl. Sorry I didn't manage to explain myself better
earlier.
> Which is why I think the expectation that they can only contain (?^flags:
> ... ) is wrong.
Fixed
Summary response to all the "Fixed" responses you made: Yay!
> I dont think you should care. If I stringify the pattern to save it on
> disk, and then bring it back as a string instead of a qr// object it
should
> still work. If I use our apis to compose the string into modifiers and
> pattern then construct it back as a string it should operate just the
same
> as if it was qr// object.
The reason I care is that the sets code builds up larger sets from
simple components or other sets. Each such set must be a single
character wide. Most compiled patterns don't fit that. Extra work
could allow some of that, but I don't see that as necessary to remove
experimental status
Of course I meant *only* such patterns that ONLY contain a charclass
definition. At least now. I think it would be nice if we could handle
/(a|b|c)/ and realize it is the same as /[abc]/. But I agree such a thing
would be a bonus, and is not necessary for removing the experimental status.
My general concern is that we don't do anything that breaks the "a string
is a regex just as much as qr// object is, provided they are textually
logically equivalent". (I include the word logical because of equivalencies
like (?^i: ... ) and (?i: ... ), etc, but that is not meant to meant to
include the pattern matching equivalence of /(a|b|c)/ and /[abc]/. And of
course this comment overlooks such things as (?{ }). I would prefer a
better term but I couldn't think of one. )
I believe with your latest patches I have no further objections to removing
the experimental status, although I say that without trying the latest
patch sequence out. I'll do my best to do so today. But based on your
responses to this mail I expect I will have no further objections.
Thanks again for your patience on this and apologies again for not coming
up with a better way to express my concerns that would have sped the
process up.
Cheers,
Yves
…--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
See #16779 (comment) This commit extends the flags accepted by nested calls to regex sets to any legal set. Previously it allowed only '^', as that is what an actual compilation would return. But as pointed out in the conversation in that ticket, it's possible to detach flags from the rest of the pattern, and to write equivalent ones in multiple ways. This commit just changes one line to look for any legal flag, not restricting it to '^'.
Migrated from rt.perl.org#133707 (status was 'new')
Searchable as RT133707$
The text was updated successfully, but these errors were encountered: