From faafdfb357bd58c5ef82de6fef358bb6a3baea1e Mon Sep 17 00:00:00 2001
From: Nicholas Wilson
Date: Fri, 28 Mar 2025 12:04:24 +0000
Subject: [PATCH] Add documentation for subroutine return values
---
doc/html/pcre2pattern.html | 84 ++++--
doc/html/pcre2syntax.html | 26 +-
doc/pcre2.txt | 574 ++++++++++++++++++++-----------------
doc/pcre2pattern.3 | 47 ++-
doc/pcre2syntax.3 | 26 +-
testdata/testinput2 | 41 +++
testdata/testoutput2 | 72 +++++
7 files changed, 592 insertions(+), 278 deletions(-)
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 62406efaa..3279ddcfa 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -41,14 +41,12 @@ pcre2pattern man page
CONDITIONAL GROUPS
COMMENTS
RECURSIVE PATTERNS
-GROUPS AS SUBROUTINES
-ONIGURUMA SUBROUTINE SYNTAX
-CALLOUTS
-BACKTRACKING CONTROL
-EBCDIC ENVIRONMENTS
-SEE ALSO
-AUTHOR
-REVISION
+CALLOUTS
+BACKTRACKING CONTROL
+EBCDIC ENVIRONMENTS
+SEE ALSO
+AUTHOR
+REVISION
@@ -3399,7 +3397,9 @@
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
-
+
+Groups as subroutines
+
If the syntax for a recursive group call (either by number or by name) is used
outside the parentheses to which it refers, it operates a bit like a subroutine
@@ -3446,8 +3446,60 @@
in groups when called as subroutines is described in the section entitled
"Backtracking verbs in subroutines"
below.
+
+
+Recursion and subroutines with returned capture groups
+
+
+Since PCRE2 10.46, recursion and subroutine calls may also specify a list of
+capture groups to return. This is a PCRE2 syntax extension not supported by
+Perl. Pattern matching recurses into the referenced expression as described
+above. However, when the recursion returns to the calling expression, the
+groups captured during the recursion can be retained when the calling
+expression's context is restored.
+
+
+When used as a subroutine, this allows the subroutine's capture groups to
+be used as return values.
+
+
+Only the specific capture groups listed by the caller will be retained, using
+the following syntax:
+
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (?n(grouplist)) )
+ (?+n(grouplist)) )
+ (?-n(grouplist)) ) call subroutine, returning capture groups
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+
+
+
+The list of capture groups "grouplist" is a comma-separated list of (absolute
+or relative) group numbers, and group names enclosed in single quotes or angle
+brackets.
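To make the "grouplist" grammar concrete, here is a minimal Python sketch that splits such a list into its items. It is not PCRE2's parser: the function name `parse_grouplist` is hypothetical, names are simplified to runs of word characters, and no whitespace is accepted around items (the text above does not say whether PCRE2 allows any).

```python
import re

# One grouplist item: a signed (relative) number, an absolute number,
# or a name in angle brackets or single quotes.
_ITEM = re.compile(r"""
    (?P<rel>[+-]\d+)        # +n or -n: relative reference
  | (?P<abs>\d+)            # n: absolute reference
  | <(?P<angle>\w+)>        # name in angle brackets
  | '(?P<quote>\w+)'        # name in single quotes
""", re.VERBOSE)


def parse_grouplist(text):
    """Split a grouplist such as "1,-2,<short>" into (kind, value) pairs."""
    items = []
    for raw in text.split(","):
        m = _ITEM.fullmatch(raw)
        if m is None:
            raise ValueError(f"bad grouplist item: {raw!r}")
        if m.group("rel") is not None:
            items.append(("relative", int(m.group("rel"))))
        elif m.group("abs") is not None:
            items.append(("absolute", int(m.group("abs"))))
        else:
            items.append(("name", m.group("angle") or m.group("quote")))
    return items
```

For example, `parse_grouplist("1,-2,<short>")` yields one absolute reference, one relative reference, and one name.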
+
+
+Here is an example which first uses the DEFINE condition to create a re-usable
+routine for matching a weekend day, then calls that subroutine and retains the
+groups it captures for use later:
+
+ (?x: # ignore whitespace for clarity
+ # Define the routine "weekendday" which matches Saturday or
+ # Sunday, and returns the Sat/Sun prefix as \k<short>.
+ (?(DEFINE) (?<weekendday>
+ (?|(?<short>Sat)urday|(?<short>Sun)day) ) )
+ # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+ (?&weekendday(<short>)),\k<short> )
+
+
+
+This feature is not available using the Oniguruma syntax \g<...> or \g'...'
+described below.
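The effect of the weekendday example above can be emulated in plain Python. This is only a sketch of the semantics under stated assumptions, not how PCRE2 implements them: the functions `weekendday` and `match_pattern` are hypothetical, the match is anchored at offset 0, and the `returned` tuple stands in for the grouplist.

```python
def weekendday(subject, pos):
    """Emulate the "weekendday" routine: return (end, captures) or None."""
    for full, short in (("Saturday", "Sat"), ("Sunday", "Sun")):
        if subject.startswith(full, pos):
            return pos + len(full), {"short": short}
    return None


def match_pattern(subject, returned=("short",)):
    """Emulate the calling expression: call the routine, then match a comma
    followed by a backreference to the group "short"."""
    captures = {}                      # the caller's own capture state
    result = weekendday(subject, 0)
    if result is None:
        return None
    pos, inner = result
    # On return, only the groups listed by the caller are retained.
    for name in returned:
        if name in inner:
            captures[name] = inner[name]
    # A backreference to an unset group fails (PCRE2's default behaviour).
    if "short" not in captures:
        return None
    tail = "," + captures["short"]
    if subject.startswith(tail, pos):
        return subject[: pos + len(tail)]
    return None
```

With the default `returned=("short",)`, the strings "Saturday,Sat" and "Sunday,Sun" match. Passing `returned=()` models an ordinary subroutine call, where the captures revert on return and the backreference therefore fails.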
-
+
+Oniguruma subroutine syntax
+
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
@@ -3465,7 +3517,7 @@
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a backreference; the latter is a subroutine call.
-
+
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@@ -3543,7 +3595,7 @@
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
@@ -4071,7 +4123,7 @@
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
-EBCDIC ENVIRONMENTS
+EBCDIC ENVIRONMENTS
Differences in the way PCRE behaves when it is running in an EBCDIC environment
are covered in this section.
@@ -4115,12 +4167,12 @@
points. However, if the range is specified numerically, for example,
[\x88-\x92] or [h-\x92], all code points are included.
-SEE ALSO
+SEE ALSO
pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2syntax(3), pcre2(3).
-AUTHOR
+AUTHOR
Philip Hazel
@@ -4129,7 +4181,7 @@
AUTHOR
Cambridge, England.
-REVISION
+REVISION
Last updated: 27 November 2024
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index 72a23214f..f5e757427 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -566,14 +566,14 @@
SUBSTRING SCAN ASSERTION
(*scan_substring:(grouplist)...) scan captured substring
(*scs:(grouplist)...) scan captured substring
-The comma-separated list may identify groups in any of the following ways:
+The comma-separated list "grouplist" may identify groups in any of the
+following ways:
n absolute reference
+n relative reference
-n relative reference
<name> name
'name' name
-
SCRIPT RUNS
@@ -621,6 +621,28 @@ SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
\g<-n> call subroutine by relative number (PCRE2 extension)
\g'-n' call subroutine by relative number (PCRE2 extension)
+The variants using parentheses (?...) may also specify a list of capture groups
+to return, which shall be retained in the calling subexpression if set during
+the recursion (this feature is not supported by Perl).
+
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (PCRE2 extension)
+ (?n(grouplist)) )
+ (?+n(grouplist)) ) call subroutine, returning capture groups
+ (?-n(grouplist)) ) (PCRE2 extension)
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+
+The comma-separated list "grouplist" uses the same syntax as
+(*scan_substring:(grouplist)...), and may identify groups in any of the
+following ways:
+
+ n absolute reference
+ +n relative reference
+ -n relative reference
+ <name> name
+ 'name' name
+
CONDITIONAL PATTERNS
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 46c2364e4..df707b6b0 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -10018,8 +10018,7 @@ RECURSIVE PATTERNS
\1 does now match "b" and so the whole match succeeds. This match used
to fail in Perl, but in later versions (I tried 5.024) it now works.
-
-GROUPS AS SUBROUTINES
+ Groups as subroutines
If the syntax for a recursive group call (either by number or by name)
is used outside the parentheses to which it refers, it operates a bit
@@ -10064,80 +10063,120 @@ GROUPS AS SUBROUTINES
subroutines is described in the section entitled "Backtracking verbs in
subroutines" below.
+ Recursion and subroutines with returned capture groups
-ONIGURUMA SUBROUTINE SYNTAX
+ Since PCRE2 10.46, recursion and subroutine calls may also specify a
+ list of capture groups to return. This is a PCRE2 syntax extension not
+ supported by Perl. Pattern matching recurses into the referenced
+ expression as described above. However, when the recursion returns to
+ the calling expression, the groups captured during the recursion can
+ be retained when the calling expression's context is restored.
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ When used as a subroutine, this allows the subroutine's capture groups
+ to be used as return values.
+
+ Only the specific capture groups listed by the caller will be retained,
+ using the following syntax:
+
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (?n(grouplist)) )
+ (?+n(grouplist)) )
+ (?-n(grouplist)) ) call subroutine, returning capture groups
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+
+ The list of capture groups "grouplist" is a comma-separated list of
+ (absolute or relative) group numbers, and group names enclosed in sin-
+ gle quotes or angle brackets.
+
+ Here is an example which first uses the DEFINE condition to create a
+ re-usable routine for matching a weekend day, then calls that subroutine
+ and retains the groups it captures for use later:
+
+ (?x: # ignore whitespace for clarity
+ # Define the routine "weekendday" which matches Saturday or
+ # Sunday, and returns the Sat/Sun prefix as \k<short>.
+ (?(DEFINE) (?<weekendday>
+ (?|(?<short>Sat)urday|(?<short>Sun)day) ) )
+ # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+ (?&weekendday(<short>)),\k<short> )
+
+ This feature is not available using the Oniguruma syntax \g<...> or
+ \g'...' described below.
+
+ Oniguruma subroutine syntax
+
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for calling a group as a subroutine, possibly re-
- cursively. Here are two of the examples used above, rewritten using
+ cursively. Here are two of the examples used above, rewritten using
this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE2 supports an extension to Oniguruma: if a number is preceded by a
+ PCRE2 supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a backreference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a backreference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
- PCRE2 provides a similar feature, but of course it cannot obey arbi-
- trary Perl code. The feature is called "callout". The caller of PCRE2
- provides an external function by putting its entry point in a match
- context using the function pcre2_set_callout(), and then passing that
- context to pcre2_match() or pcre2_dfa_match(). If no match context is
- passed, or if the callout entry point is set to NULL, callout points
- will be passed over silently during matching. To disallow callouts in
+ PCRE2 provides a similar feature, but of course it cannot obey arbi-
+ trary Perl code. The feature is called "callout". The caller of PCRE2
+ provides an external function by putting its entry point in a match
+ context using the function pcre2_set_callout(), and then passing that
+ context to pcre2_match() or pcre2_dfa_match(). If no match context is
+ passed, or if the callout entry point is set to NULL, callout points
+ will be passed over silently during matching. To disallow callouts in
the pattern syntax, you may use the PCRE2_EXTRA_NEVER_CALLOUT option.
- Within a regular expression, (?C) indicates a point at which the
- external function is to be called. There are two kinds of callout:
- those with a numerical argument and those with a string argument. (?C)
- on its own with no argument is treated as (?C0). A numerical argument
- allows the application to distinguish between different callouts.
- String arguments were added for release 10.20 to make it possible for
- script languages that use PCRE2 to embed short scripts within patterns
+ Within a regular expression, (?C) indicates a point at which the
+ external function is to be called. There are two kinds of callout:
+ those with a numerical argument and those with a string argument. (?C)
+ on its own with no argument is treated as (?C0). A numerical argument
+ allows the application to distinguish between different callouts.
+ String arguments were added for release 10.20 to make it possible for
+ script languages that use PCRE2 to embed short scripts within patterns
in a similar way to Perl.
During matching, when PCRE2 reaches a callout point, the external func-
- tion is called. It is provided with the number or string argument of
- the callout, the position in the pattern, and one item of data that is
+ tion is called. It is provided with the number or string argument of
+ the callout, the position in the pattern, and one item of data that is
also set in the match block. The callout function may cause matching to
proceed, to backtrack, or to fail.
- By default, PCRE2 implements a number of optimizations at matching
- time, and one side-effect is that sometimes callouts are skipped. If
- you need all possible callouts to happen, you need to set options that
- disable the relevant optimizations. More details, including a complete
- description of the programming interface to the callout function, are
+ By default, PCRE2 implements a number of optimizations at matching
+ time, and one side-effect is that sometimes callouts are skipped. If
+ you need all possible callouts to happen, you need to set options that
+ disable the relevant optimizations. More details, including a complete
+ description of the programming interface to the callout function, are
given in the pcre2callout documentation.
Callouts with numerical arguments
- If you just want to have a means of identifying different callout
- points, put a number less than 256 after the letter C. For example,
+ If you just want to have a means of identifying different callout
+ points, put a number less than 256 after the letter C. For example,
this pattern has two callout points:
(?C1)abc(?C2)def
- If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
- callouts are automatically installed before each item in the pattern.
- They are all numbered 255. If there is a conditional group in the pat-
+ If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
+ callouts are automatically installed before each item in the pattern.
+ They are all numbered 255. If there is a conditional group in the pat-
tern whose condition is an assertion, an additional callout is inserted
- just before the condition. An explicit callout may also be set at this
+ just before the condition. An explicit callout may also be set at this
position, as in this example:
(?(?C9)(?=a)abc|def)
@@ -10147,79 +10186,79 @@ CALLOUTS
Callouts with string arguments
- A delimited string may be used instead of a number as a callout argu-
- ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
+ A delimited string may be used instead of a number as a callout argu-
+ ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
ending delimiter is the same as the start, except for {, where the end-
- ing delimiter is }. If the ending delimiter is needed within the
+ ing delimiter is }. If the ending delimiter is needed within the
string, it must be doubled. For example:
(?C'ab ''c'' d')xyz(?C{any text})pqr
- The doubling is removed before the string is passed to the callout
+ The doubling is removed before the string is passed to the callout
function.
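The delimiter rules above can be sketched in a few lines of Python. This is an illustration of the rules as stated, not PCRE2's own code; the function name `parse_callout_string` is hypothetical, and its input is assumed to be the pattern text immediately following "(?C".

```python
def parse_callout_string(text):
    """Decode a delimited callout string.

    Returns (decoded_string, number_of_characters_consumed).
    """
    start = text[0]
    if start not in "`'\"^%#${":
        raise ValueError(f"invalid starting delimiter: {start!r}")
    # The ending delimiter equals the start, except that { is closed by }.
    end = "}" if start == "{" else start
    out = []
    i = 1
    while i < len(text):
        if text[i] == end:
            if text[i + 1 : i + 2] == end:   # doubled delimiter: a literal
                out.append(end)
                i += 2
                continue
            return "".join(out), i + 1       # single delimiter: string ends
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```

Applied to the text after the first "(?C" in the example pattern, the sketch returns the string with the doubling removed: ab 'c' d.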
BACKTRACKING CONTROL
- There are a number of special "Backtracking Control Verbs" (to use
- Perl's terminology) that modify the behaviour of backtracking during
- matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
+ There are a number of special "Backtracking Control Verbs" (to use
+ Perl's terminology) that modify the behaviour of backtracking during
+ matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
verbs take either form, and may behave differently depending on whether
- or not a name argument is present. The names are not required to be
+ or not a name argument is present. The names are not required to be
unique within the pattern.
- By default, for compatibility with Perl, a name is any sequence of
+ By default, for compatibility with Perl, a name is any sequence of
characters that does not include a closing parenthesis. The name is not
- processed in any way, and it is not possible to include a closing
- parenthesis in the name. This can be changed by setting the
- PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
+ processed in any way, and it is not possible to include a closing
+ parenthesis in the name. This can be changed by setting the
+ PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
ble.
- When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
- verb names and only an unescaped closing parenthesis terminates the
- name. However, the only backslash items that are permitted are \Q, \E,
- and sequences such as \x{100} that define character code points. Char-
+ When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
+ verb names and only an unescaped closing parenthesis terminates the
+ name. However, the only backslash items that are permitted are \Q, \E,
+ and sequences such as \x{100} that define character code points. Char-
acter type escapes such as \d are faulted.
A closing parenthesis can be included in a name either as \) or between
- \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
- or PCRE2_EXTENDED_MORE option is also set, unescaped white space in
+ \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
+ or PCRE2_EXTENDED_MORE option is also set, unescaped white space in
verb names is skipped, and #-comments are recognized, exactly as in the
rest of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not af-
fect verb names unless PCRE2_ALT_VERBNAMES is also set.
- The maximum length of a name is 255 in the 8-bit library and 65535 in
- the 16-bit and 32-bit libraries. If the name is empty, that is, if the
- closing parenthesis immediately follows the colon, the effect is as if
+ The maximum length of a name is 255 in the 8-bit library and 65535 in
+ the 16-bit and 32-bit libraries. If the name is empty, that is, if the
+ closing parenthesis immediately follows the colon, the effect is as if
the colon were not there. Any number of these verbs may occur in a pat-
tern. Except for (*ACCEPT), they may not be quantified.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using the tra-
- ditional matching function or JIT, because they use backtracking algo-
- rithms. With the exception of (*FAIL), which behaves like a failing
- negative assertion, the backtracking control verbs cause an error if
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using the tra-
+ ditional matching function or JIT, because they use backtracking algo-
+ rithms. With the exception of (*FAIL), which behaves like a failing
+ negative assertion, the backtracking control verbs cause an error if
encountered by the DFA matching function.
- The behaviour of these verbs in repeated groups, assertions, and in
- capture groups called as subroutines (whether or not recursively) is
+ The behaviour of these verbs in repeated groups, assertions, and in
+ capture groups called as subroutines (whether or not recursively) is
documented below.
Optimizations that affect backtracking verbs
PCRE2 contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
- may know the minimum length of matching subject, or that a particular
+ may know the minimum length of matching subject, or that a particular
character must be present. When one of these optimizations bypasses the
- running of a match, any included backtracking verbs will not, of
+ running of a match, any included backtracking verbs will not, of
course, be processed. You can suppress the start-of-match optimizations
- by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
+ by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
pile(), by calling pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF
- directive, or by starting the pattern with (*NO_START_OPT). There is
- more discussion of this option in the section entitled "Compiling a
+ directive, or by starting the pattern with (*NO_START_OPT). There is
+ more discussion of this option in the section entitled "Compiling a
pattern" in the pcre2api documentation.
- Experiments with Perl suggest that it too has similar optimizations,
+ Experiments with Perl suggest that it too has similar optimizations,
and like PCRE2, turning them off can change the result of a match.
Verbs that act immediately
@@ -10228,77 +10267,77 @@ BACKTRACKING CONTROL
(*ACCEPT) or (*ACCEPT:NAME)
- This verb causes the match to end successfully, skipping the remainder
- of the pattern. However, when it is inside a capture group that is
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. However, when it is inside a capture group that is
called as a subroutine, only that group is ended successfully. Matching
then continues at the outer level. If (*ACCEPT) is triggered in a posi-
- tive assertion, the assertion succeeds; in a negative assertion, the
+ tive assertion, the assertion succeeds; in a negative assertion, the
assertion fails.
- If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
+ If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
tured. For example:
A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
- (*ACCEPT) is the only backtracking verb that is allowed to be quanti-
- fied because an ungreedy quantification with a minimum of zero acts
+ (*ACCEPT) is the only backtracking verb that is allowed to be quanti-
+ fied because an ungreedy quantification with a minimum of zero acts
only when a backtrack happens. Consider, for example,
(A(*ACCEPT)??B)C
- where A, B, and C may be complex expressions. After matching "A", the
- matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
- is triggered and the match succeeds. In both cases, all but C is cap-
- tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re-
+ where A, B, and C may be complex expressions. After matching "A", the
+ matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT)
+ is triggered and the match succeeds. In both cases, all but C is cap-
+ tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re-
peated (*ACCEPT) of this type means "succeed on backtrack".
- Warning: (*ACCEPT) should not be used within a script run group, be-
- cause it causes an immediate exit from the group, bypassing the script
+ Warning: (*ACCEPT) should not be used within a script run group, be-
+ cause it causes an immediate exit from the group, bypassing the script
run checking.
(*FAIL) or (*FAIL:NAME)
- This verb causes a matching failure, forcing backtracking to occur. It
- may be abbreviated to (*F). It is equivalent to (?!) but easier to
+ This verb causes a matching failure, forcing backtracking to occur. It
+ may be abbreviated to (*F). It is equivalent to (?!) but easier to
read. The Perl documentation notes that it is probably useful only when
combined with (?{}) or (??{}). Those are, of course, Perl features that
- are not present in PCRE2. The nearest equivalent is the callout fea-
+ are not present in PCRE2. The nearest equivalent is the callout fea-
ture, as for example in this pattern:
a+(?C)(*FAIL)
- A match with the string "aaaa" always fails, but the callout is taken
+ A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
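The figure of 10 can be checked with a small Python model of the backtracking. It is a sketch that assumes no optimization removes any match attempt and that a+ gives back one character at a time; the function name `count_callouts` is hypothetical.

```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) would be reached."""
    callouts = 0
    for start in range(len(subject) + 1):        # each starting position
        # Length of the run of "a" characters beginning at this position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time.
        # Every length from run down to 1 reaches (?C) before (*FAIL)
        # forces the next backtrack.
        callouts += run
    return callouts
```

For "aaaa" the starting positions contribute 4, 3, 2, and 1 callouts, and the position at the end of the subject contributes none, giving 10 in total.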
- (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC-
- CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is
+ (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC-
+ CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is
recorded just before the verb acts.
Recording which path was taken
- There is one verb whose main purpose is to track how a match was ar-
- rived at, though it also has a secondary use in conjunction with ad-
+ There is one verb whose main purpose is to track how a match was ar-
+ rived at, though it also has a secondary use in conjunction with ad-
vancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
- A name is always required with this verb. For all the other backtrack-
+ A name is always required with this verb. For all the other backtrack-
ing control verbs, a NAME argument is optional.
- When a match succeeds, the name of the last-encountered mark name on
+ When a match succeeds, the name of the last-encountered mark name on
the matching path is passed back to the caller as described in the sec-
tion entitled "Other information about the match" in the pcre2api docu-
- mentation. This applies to all instances of (*MARK) and other verbs,
+ mentation. This applies to all instances of (*MARK) and other verbs,
including those inside assertions and atomic groups. However, there are
- differences in those cases when (*MARK) is used in conjunction with
+ differences in those cases when (*MARK) is used in conjunction with
(*SKIP) as described below.
- The mark name that was last encountered on the matching path is passed
- back. A verb without a NAME argument is ignored for this purpose. Here
- is an example of pcre2test output, where the "mark" modifier requests
+ The mark name that was last encountered on the matching path is passed
+ back. A verb without a NAME argument is ignored for this purpose. Here
+ is an example of pcre2test output, where the "mark" modifier requests
the retrieval and outputting of (*MARK) data:
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
@@ -10310,77 +10349,77 @@ BACKTRACKING CONTROL
MK: B
The (*MARK) name is tagged with "MK:" in this output, and in this exam-
- ple it indicates which of the two alternatives matched. This is a more
- efficient way of obtaining this information than putting each alterna-
+ ple it indicates which of the two alternatives matched. This is a more
+ efficient way of obtaining this information than putting each alterna-
tive in its own capturing parentheses.
- If a verb with a name is encountered in a positive assertion that is
- true, the name is recorded and passed back if it is the last-encoun-
+ If a verb with a name is encountered in a positive assertion that is
+ true, the name is recorded and passed back if it is the last-encoun-
tered. This does not happen for negative assertions or failing positive
assertions.
- After a partial match or a failed match, the last encountered name in
+ After a partial match or a failed match, the last encountered name in
the entire match process is returned. For example:
re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
data> XP
No match, mark = B
- Note that in this unanchored example the mark is retained from the
+ Note that in this unanchored example the mark is retained from the
match attempt that started at the letter "X" in the subject. Subsequent
match attempts starting at "P" and then with an empty string do not get
as far as the (*MARK) item, but nevertheless do not reset it.
- If you are interested in (*MARK) values after failed matches, you
- should probably either set the PCRE2_NO_START_OPTIMIZE option or call
- pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF directive (see
+ If you are interested in (*MARK) values after failed matches, you
+ should probably either set the PCRE2_NO_START_OPTIMIZE option or call
+ pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF directive (see
above) to ensure that the match is always attempted.
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
- tinues with what follows, but if there is a subsequent match failure,
- causing a backtrack to the verb, a failure is forced. That is, back-
- tracking cannot pass to the left of the verb. However, when one of
- these verbs appears inside an atomic group or in an atomic lookaround
- assertion that is true, its effect is confined to that group, because
- once the group has been matched, there is never any backtracking into
- it. Backtracking from beyond an atomic assertion or group ignores the
+ tinues with what follows, but if there is a subsequent match failure,
+ causing a backtrack to the verb, a failure is forced. That is, back-
+ tracking cannot pass to the left of the verb. However, when one of
+ these verbs appears inside an atomic group or in an atomic lookaround
+ assertion that is true, its effect is confined to that group, because
+ once the group has been matched, there is never any backtracking into
+ it. Backtracking from beyond an atomic assertion or group ignores the
entire group, and seeks a preceding backtracking point.
- These verbs differ in exactly what kind of failure occurs when back-
- tracking reaches them. The behaviour described below is what happens
- when the verb is not in a subroutine or an assertion. Subsequent sec-
+ These verbs differ in exactly what kind of failure occurs when back-
+ tracking reaches them. The behaviour described below is what happens
+ when the verb is not in a subroutine or an assertion. Subsequent sec-
tions cover these special cases.
(*COMMIT) or (*COMMIT:NAME)
- This verb causes the whole match to fail outright if there is a later
+ This verb causes the whole match to fail outright if there is a later
matching failure that causes backtracking to reach it. Even if the pat-
- tern is unanchored, no further attempts to find a match by advancing
- the starting point take place. If (*COMMIT) is the only backtracking
+ tern is unanchored, no further attempts to find a match by advancing
+ the starting point take place. If (*COMMIT) is the only backtracking
verb that is encountered, once it has been passed pcre2_match() is com-
mitted to finding a match at the current starting point, or not at all.
For example:
a+(*COMMIT)b
- This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish."
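The two subjects above can be traced with a small Python emulation of this one pattern. It is a sketch under the assumption that start-of-match optimizations are off, so matching begins at offset 0; the function name `match_a_plus_commit_b` is hypothetical, and the standard `re` module is used only to show the contrasting behaviour of plain a+b.

```python
import re


def match_a_plus_commit_b(subject):
    """Emulate the unanchored pattern a+(*COMMIT)b; return the match or None."""
    for start in range(len(subject)):
        # Greedy a+: take the whole run of "a" characters.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        if run == 0:
            continue        # a+ failed before (*COMMIT) was reached: bump along
        end = start + run
        if subject[end : end + 1] == "b":
            return subject[start : end + 1]
        # "b" failed, so backtracking reaches (*COMMIT): the whole match
        # fails and no later starting point is tried.
        return None
    return None


def match_a_plus_b(subject):
    """The same pattern without (*COMMIT), for comparison."""
    m = re.search(r"a+b", subject)
    return m.group() if m else None
```

For "xxaab" both functions return "aab". For "aacaab" the emulation returns None because the attempt starting at the first "a" passes (*COMMIT) and then fails, whereas plain a+b goes on to find "aab" later in the subject.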
- The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
- MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
- ing back to the caller. However, (*SKIP:NAME) searches only for names
+ The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
+ MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
+ ing back to the caller. However, (*SKIP:NAME) searches only for names
that are set with (*MARK), ignoring those set by any of the other back-
tracking verbs.
- If there is more than one backtracking verb in a pattern, a different
- one that follows (*COMMIT) may be triggered first, so merely passing
+ If there is more than one backtracking verb in a pattern, a different
+ one that follows (*COMMIT) may be triggered first, so merely passing
(*COMMIT) during a match does not always guarantee that a match must be
at this starting point.
Note that (*COMMIT) at the start of a pattern is not the same as an an-
- chor, unless PCRE2's start-of-match optimizations are turned off, as
+ chor, unless PCRE2's start-of-match optimizations are turned off, as
shown in this output from pcre2test:
re> /(*COMMIT)abc/
@@ -10391,68 +10430,68 @@ BACKTRACKING CONTROL
data> xyzabc
No match
- For the first pattern, PCRE2 knows that any match must start with "a",
- so the optimization skips along the subject to "a" before applying the
- pattern to the first set of data. The match attempt then succeeds. The
- second pattern disables the optimization that skips along to the first
- character. The pattern is now applied starting at "x", and so the
- (*COMMIT) causes the match to fail without trying any other starting
+ For the first pattern, PCRE2 knows that any match must start with "a",
+ so the optimization skips along the subject to "a" before applying the
+ pattern to the first set of data. The match attempt then succeeds. The
+ second pattern disables the optimization that skips along to the first
+ character. The pattern is now applied starting at "x", and so the
+ (*COMMIT) causes the match to fail without trying any other starting
points.
(*PRUNE) or (*PRUNE:NAME)
- This verb causes the match to fail at the current starting position in
+ This verb causes the match to fail at the current starting position in
the subject if there is a later matching failure that causes backtrack-
- ing to reach it. If the pattern is unanchored, the normal "bumpalong"
- advance to the next starting character then happens. Backtracking can
- occur as usual to the left of (*PRUNE), before it is reached, or when
- matching to the right of (*PRUNE), but if there is no match to the
- right, backtracking cannot cross (*PRUNE). In simple cases, the use of
- (*PRUNE) is just an alternative to an atomic group or possessive quan-
+ ing to reach it. If the pattern is unanchored, the normal "bumpalong"
+ advance to the next starting character then happens. Backtracking can
+ occur as usual to the left of (*PRUNE), before it is reached, or when
+ matching to the right of (*PRUNE), but if there is no match to the
+ right, backtracking cannot cross (*PRUNE). In simple cases, the use of
+ (*PRUNE) is just an alternative to an atomic group or possessive quan-
tifier, but there are some uses of (*PRUNE) that cannot be expressed in
- any other way. In an anchored pattern (*PRUNE) has the same effect as
+ any other way. In an anchored pattern (*PRUNE) has the same effect as
(*COMMIT).
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
It is like (*MARK:NAME) in that the name is remembered for passing back
- to the caller. However, (*SKIP:NAME) searches only for names set with
+ to the caller. However, (*SKIP:NAME) searches only for names set with
(*MARK), ignoring those set by other backtracking verbs.
(*SKIP)
- This verb, when given without a name, is like (*PRUNE), except that if
- the pattern is unanchored, the "bumpalong" advance is not to the next
+ This verb, when given without a name, is like (*PRUNE), except that if
+ the pattern is unanchored, the "bumpalong" advance is not to the next
character, but to the position in the subject where (*SKIP) was encoun-
- tered. (*SKIP) signifies that whatever text was matched leading up to
- it cannot be part of a successful match if there is a later mismatch.
+ tered. (*SKIP) signifies that whatever text was matched leading up to
+ it cannot be part of a successful match if there is a later mismatch.
Consider:
a+(*SKIP)b
- If the subject is "aaaac...", after the first match attempt fails
- (starting at the first character in the string), the starting point
+ If the subject is "aaaac...", after the first match attempt fails
+ (starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive quan-
tifier does not have the same effect as this example; although it would
- suppress backtracking during the first match attempt, the second at-
- tempt would start at the second character instead of skipping on to
+ suppress backtracking during the first match attempt, the second at-
+ tempt would start at the second character instead of skipping on to
"c".
- If (*SKIP) is used to specify a new starting position that is the same
- as the starting position of the current match, or (by being inside a
- lookbehind) earlier, the position specified by (*SKIP) is ignored, and
+ If (*SKIP) is used to specify a new starting position that is the same
+ as the starting position of the current match, or (by being inside a
+ lookbehind) earlier, the position specified by (*SKIP) is ignored, and
instead the normal "bumpalong" occurs.
(*SKIP:NAME)
- When (*SKIP) has an associated name, its behaviour is modified. When
- such a (*SKIP) is triggered, the previous path through the pattern is
- searched for the most recent (*MARK) that has the same name. If one is
- found, the "bumpalong" advance is to the subject position that corre-
- sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
+ When (*SKIP) has an associated name, its behaviour is modified. When
+ such a (*SKIP) is triggered, the previous path through the pattern is
+ searched for the most recent (*MARK) that has the same name. If one is
+ found, the "bumpalong" advance is to the subject position that corre-
+ sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
no (*MARK) with a matching name is found, the (*SKIP) is ignored.
- The search for a (*MARK) name uses the normal backtracking mechanism,
- which means that it does not see (*MARK) settings that are inside
+ The search for a (*MARK) name uses the normal backtracking mechanism,
+ which means that it does not see (*MARK) settings that are inside
atomic groups or assertions, because they are never re-entered by back-
tracking. Compare the following pcre2test examples:
@@ -10466,105 +10505,105 @@ BACKTRACKING CONTROL
0: b
1: b
- In the first example, the (*MARK) setting is in an atomic group, so it
+ In the first example, the (*MARK) setting is in an atomic group, so it
is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
- This allows the second branch of the pattern to be tried at the first
- character position. In the second example, the (*MARK) setting is not
- in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
+ This allows the second branch of the pattern to be tried at the first
+ character position. In the second example, the (*MARK) setting is not
+ in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
backtracks, and this causes a new matching attempt to start at the sec-
- ond character. This time, the (*MARK) is never seen because "a" does
+ ond character. This time, the (*MARK) is never seen because "a" does
not match "b", so the matcher immediately jumps to the second branch of
the pattern.
- Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
+ Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
ignores names that are set by other backtracking verbs.
(*THEN) or (*THEN:NAME)
- This verb causes a skip to the next innermost alternative when back-
- tracking reaches it. That is, it cancels any further backtracking
- within the current alternative. Its name comes from the observation
+ This verb causes a skip to the next innermost alternative when back-
+ tracking reaches it. That is, it cancels any further backtracking
+ within the current alternative. Its name comes from the observation
that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
- If the COND1 pattern matches, FOO is tried (and possibly further items
- after the end of the group if FOO succeeds); on failure, the matcher
- skips to the second alternative and tries COND2, without backtracking
- into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
- quently BAZ fails, there are no more alternatives, so there is a back-
- track to whatever came before the entire group. If (*THEN) is not in-
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure, the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
+ quently BAZ fails, there are no more alternatives, so there is a back-
+ track to whatever came before the entire group. If (*THEN) is not in-
side an alternation, it acts like (*PRUNE).
- The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
+ The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
It is like (*MARK:NAME) in that the name is remembered for passing back
- to the caller. However, (*SKIP:NAME) searches only for names set with
+ to the caller. However, (*SKIP:NAME) searches only for names set with
(*MARK), ignoring those set by other backtracking verbs.
- A group that does not contain a | character is just a part of the en-
- closing alternative; it is not a nested alternation with only one al-
+ A group that does not contain a | character is just a part of the en-
+ closing alternative; it is not a nested alternation with only one al-
ternative. The effect of (*THEN) extends beyond such a group to the en-
- closing alternative. Consider this pattern, where A, B, etc. are com-
- plex pattern fragments that do not contain any | characters at this
+ closing alternative. Consider this pattern, where A, B, etc. are com-
+ plex pattern fragments that do not contain any | characters at this
level:
A (B(*THEN)C) | D
- If A and B are matched, but there is a failure in C, matching does not
+ If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
- However, if the group containing (*THEN) is given an alternative, it
+ However, if the group containing (*THEN) is given an alternative, it
behaves differently:
A (B(*THEN)C | (*FAIL)) | D
The effect of (*THEN) is now confined to the inner group. After a fail-
- ure in C, matching moves to (*FAIL), which causes the whole group to
- fail because there are no more alternatives to try. In this case,
+ ure in C, matching moves to (*FAIL), which causes the whole group to
+ fail because there are no more alternatives to try. In this case,
matching does backtrack into A.
- Note that a conditional group is not considered as having two alterna-
- tives, because only one is ever used. In other words, the | character
- in a conditional group has a different meaning. Ignoring white space,
+ Note that a conditional group is not considered as having two alterna-
+ tives, because only one is ever used. In other words, the | character
+ in a conditional group has a different meaning. Ignoring white space,
consider:
^.*? (?(?=a) a | b(*THEN)c )
If the subject is "ba", this pattern does not match. Because .*? is un-
- greedy, it initially matches zero characters. The condition (?=a) then
- fails, the character "b" is matched, but "c" is not. At this point,
- matching does not backtrack to .*? as might perhaps be expected from
- the presence of the | character. The conditional group is part of the
- single alternative that comprises the whole pattern, and so the match
- fails. (If there was a backtrack into .*?, allowing it to match "b",
+ greedy, it initially matches zero characters. The condition (?=a) then
+ fails, the character "b" is matched, but "c" is not. At this point,
+ matching does not backtrack to .*? as might perhaps be expected from
+ the presence of the | character. The conditional group is part of the
+ single alternative that comprises the whole pattern, and so the match
+ fails. (If there was a backtrack into .*?, allowing it to match "b",
the match would succeed.)
- The verbs just described provide four different "strengths" of control
+ The verbs just described provide four different "strengths" of control
when subsequent matching fails. (*THEN) is the weakest, carrying on the
- match at the next alternative. (*PRUNE) comes next, failing the match
- at the current starting position, but allowing an advance to the next
- character (for an unanchored pattern). (*SKIP) is similar, except that
+ match at the next alternative. (*PRUNE) comes next, failing the match
+ at the current starting position, but allowing an advance to the next
+ character (for an unanchored pattern). (*SKIP) is similar, except that
the advance may be more than one character. (*COMMIT) is the strongest,
causing the entire match to fail.
More than one backtracking verb
- If more than one backtracking verb is present in a pattern, the one
- that is backtracked onto first acts. For example, consider this pat-
+ If more than one backtracking verb is present in a pattern, the one
+ that is backtracked onto first acts. For example, consider this pat-
tern, where A, B, etc. are complex pattern fragments:
(A(*COMMIT)B(*THEN)C|ABD)
- If A matches but B fails, the backtrack to (*COMMIT) causes the entire
+ If A matches but B fails, the backtrack to (*COMMIT) causes the entire
match to fail. However, if A and B match, but C fails, the backtrack to
- (*THEN) causes the next alternative (ABD) to be tried. This behaviour
- is consistent, but is not always the same as Perl's. It means that if
- two or more backtracking verbs appear in succession, all but the last
+ (*THEN) causes the next alternative (ABD) to be tried. This behaviour
+ is consistent, but is not always the same as Perl's. It means that if
+ two or more backtracking verbs appear in succession, all but the last
of them has no effect. Consider this example:
...(*COMMIT)(*PRUNE)...
If there is a matching failure to the right, backtracking onto (*PRUNE)
- causes it to be triggered, and its action is taken. There can never be
+ causes it to be triggered, and its action is taken. There can never be
a backtrack onto (*COMMIT).
Backtracking verbs in repeated groups
@@ -10574,52 +10613,52 @@ BACKTRACKING CONTROL
/(a(*COMMIT)b)+ac/
- If the subject is "abac", Perl matches unless its optimizations are
- disabled, but PCRE2 always fails because the (*COMMIT) in the second
+ If the subject is "abac", Perl matches unless its optimizations are
+ disabled, but PCRE2 always fails because the (*COMMIT) in the second
repeat of the group acts.
Backtracking verbs in assertions
- (*FAIL) in any assertion has its normal effect: it forces an immediate
- backtrack. The behaviour of the other backtracking verbs depends on
- whether or not the assertion is standalone or acting as the condition
+ (*FAIL) in any assertion has its normal effect: it forces an immediate
+ backtrack. The behaviour of the other backtracking verbs depends on
+ whether or not the assertion is standalone or acting as the condition
in a conditional group.
- (*ACCEPT) in a standalone positive assertion causes the assertion to
- succeed without any further processing; captured strings and a mark
- name (if set) are retained. In a standalone negative assertion, (*AC-
+ (*ACCEPT) in a standalone positive assertion causes the assertion to
+ succeed without any further processing; captured strings and a mark
+ name (if set) are retained. In a standalone negative assertion, (*AC-
CEPT) causes the assertion to fail without any further processing; cap-
tured substrings and any mark name are discarded.
- If the assertion is a condition, (*ACCEPT) causes the condition to be
- true for a positive assertion and false for a negative one; captured
+ If the assertion is a condition, (*ACCEPT) causes the condition to be
+ true for a positive assertion and false for a negative one; captured
substrings are retained in both cases.
The remaining verbs act only when a later failure causes a backtrack to
- reach them. This means that, for the Perl-compatible assertions, their
+ reach them. This means that, for the Perl-compatible assertions, their
effect is confined to the assertion, because Perl lookaround assertions
are atomic. A backtrack that occurs after such an assertion is complete
- does not jump back into the assertion. Note in particular that a
- (*MARK) name that is set in an assertion is not "seen" by an instance
+ does not jump back into the assertion. Note in particular that a
+ (*MARK) name that is set in an assertion is not "seen" by an instance
of (*SKIP:NAME) later in the pattern.
- PCRE2 now supports non-atomic positive assertions and also "scan sub-
- string" assertions, as described in the sections entitled "Non-atomic
- assertions" and "Scan substring assertions" above. These assertions
+ PCRE2 now supports non-atomic positive assertions and also "scan sub-
+ string" assertions, as described in the sections entitled "Non-atomic
+ assertions" and "Scan substring assertions" above. These assertions
must be standalone (not used as conditions). They are not Perl-compati-
- ble. For these assertions, a later backtrack does jump back into the
- assertion, and therefore verbs such as (*COMMIT) can be triggered by
+ ble. For these assertions, a later backtrack does jump back into the
+ assertion, and therefore verbs such as (*COMMIT) can be triggered by
backtracks from later in the pattern.
- The effect of (*THEN) is not allowed to escape beyond an assertion. If
- there are no more branches to try, (*THEN) causes a positive assertion
- to be false, and a negative assertion to be true. This behaviour dif-
+ The effect of (*THEN) is not allowed to escape beyond an assertion. If
+ there are no more branches to try, (*THEN) causes a positive assertion
+ to be false, and a negative assertion to be true. This behaviour dif-
fers from Perl when the assertion has only one branch.
- The other backtracking verbs are not treated specially if they appear
- in a standalone positive assertion. In a conditional positive asser-
+ The other backtracking verbs are not treated specially if they appear
+ in a standalone positive assertion. In a conditional positive asser-
tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
- or (*PRUNE) causes the condition to be false. However, for both stand-
+ or (*PRUNE) causes the condition to be false. However, for both stand-
alone and conditional negative assertions, backtracking into (*COMMIT),
(*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
ing any further alternative branches.
@@ -10629,19 +10668,19 @@ BACKTRACKING CONTROL
These behaviours occur whether or not the group is called recursively.
(*ACCEPT) in a group called as a subroutine causes the subroutine match
- to succeed without any further processing. Matching then continues af-
- ter the subroutine call. Perl documents this behaviour. Perl's treat-
+ to succeed without any further processing. Matching then continues af-
+ ter the subroutine call. Perl documents this behaviour. Perl's treat-
ment of the other verbs in subroutines is different in some cases.
- (*FAIL) in a group called as a subroutine has its normal effect: it
+ (*FAIL) in a group called as a subroutine has its normal effect: it
forces an immediate backtrack.
- (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
- when triggered by being backtracked to in a group called as a subrou-
+ (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
+ when triggered by being backtracked to in a group called as a subrou-
tine. There is then a backtrack at the outer level.
(*THEN), when triggered, skips to the next alternative in the innermost
- enclosing group that has alternatives (its normal behaviour). However,
+ enclosing group that has alternatives (its normal behaviour). However,
if there is no such group within the subroutine's group, the subroutine
match fails and there is a backtrack at the outer level.
@@ -10653,44 +10692,44 @@ EBCDIC ENVIRONMENTS
Escape sequences
- When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
+ When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
The \c escape is processed as specified for Perl in the perlebcdic doc-
- ument. The only characters that are allowed after \c are A-Z, a-z, or
- one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
- time error. The sequence \c@ encodes character code 0; after \c the
- letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
- \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
+ ument. The only characters that are allowed after \c are A-Z, a-z, or
+ one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
+ time error. The sequence \c@ encodes character code 0; after \c the
+ letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
+ \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be-
comes either 255 (hex FF) or 95 (hex 5F).
- Thus, apart from \c?, these escapes generate the same character code
- values as they do in an ASCII or Unicode environment, though the mean-
- ings of the values mostly differ. For example, \cG always generates
+ Thus, apart from \c?, these escapes generate the same character code
+ values as they do in an ASCII or Unicode environment, though the mean-
+ ings of the values mostly differ. For example, \cG always generates
code value 7, which is BEL in ASCII but DEL in EBCDIC.
- The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
- but because 127 is not a control character in EBCDIC, Perl makes it
- generate the APC character. Unfortunately, there are several variants
- of EBCDIC. In most of them the APC character has the value 255 (hex
- FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
+ The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
+ but because 127 is not a control character in EBCDIC, Perl makes it
+ generate the APC character. Unfortunately, there are several variants
+ of EBCDIC. In most of them the APC character has the value 255 (hex
+ FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
certain other characters have POSIX-BC values, PCRE2 makes \c? generate
95; otherwise it generates 255.
Character classes
In character classes there is a special case in EBCDIC environments for
- ranges whose end points are both specified as literal letters in the
- same case. For compatibility with Perl, EBCDIC code points within the
+ ranges whose end points are both specified as literal letters in the
+ same case. For compatibility with Perl, EBCDIC code points within the
range that are not letters are omitted. For example, [h-k] matches only
- four characters, even though the EBCDIC codes for h and k are 0x88 and
+ four characters, even though the EBCDIC codes for h and k are 0x88 and
0x92, a range of 11 code points. However, if the range is specified nu-
- merically, for example, [\x88-\x92] or [h-\x92], all code points are
+ merically, for example, [\x88-\x92] or [h-\x92], all code points are
included.
SEE ALSO
- pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
+ pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
pcre2(3).
@@ -12127,8 +12166,8 @@ SUBSTRING SCAN ASSERTION
(*scan_substring:(grouplist)...) scan captured substring
(*scs:(grouplist)...) scan captured substring
- The comma-separated list may identify groups in any of the following
- ways:
+ The comma-separated list "grouplist" may identify groups in any of the
+ following ways:
n absolute reference
+n relative reference
@@ -12179,6 +12218,29 @@ SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
\g<-n> call subroutine by relative number (PCRE2 extension)
\g'-n' call subroutine by relative number (PCRE2 extension)
+ The variants using parentheses (?...) may also specify a list of cap-
+ ture groups to return, which shall be retained in the calling subex-
+ pression if set during the recursion (this feature is not supported by
+ Perl).
+
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (PCRE2 extension)
+ (?n(grouplist)) )
+ (?+n(grouplist)) ) call subroutine, returning capture groups
+ (?-n(grouplist)) ) (PCRE2 extension)
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+
+ The comma-separated list "grouplist" uses the same syntax as
+ (*scan_substring:(grouplist)...), and may identify groups in any of the
+ following ways:
+
+ n absolute reference
+ +n relative reference
+ -n relative reference
+ <name> name
+ 'name' name
+
CONDITIONAL PATTERNS
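The "grouplist" forms listed above can be illustrated with a small parser. This is only a sketch under stated assumptions: parse_grouplist and GroupRef are hypothetical names invented for the illustration, not part of PCRE2; group names are approximated as \w+; and the rejection of an empty list is modelled on the "expected capture group number or name" error shown in the test output of this patch.

```python
import re
from typing import NamedTuple, Union


class GroupRef(NamedTuple):
    """One entry of a grouplist (hypothetical type, for illustration only)."""
    kind: str                 # "absolute", "relative", or "name"
    value: Union[int, str]    # group number / signed offset, or group name


# One alternative per documented form. Names are approximated as \w+.
_ITEM = re.compile(
    r"""
      (?P<rel>[+-]\d+)      # +n or -n   relative reference
    | (?P<abs>\d+)          # n          absolute reference
    | <(?P<angle>\w+)>      # <name>     name in angle brackets
    | '(?P<quoted>\w+)'     # 'name'     name in single quotes
    """,
    re.VERBOSE,
)


def parse_grouplist(text: str) -> list:
    """Split a comma-separated grouplist into GroupRef entries."""
    if text == "":
        # Assumption: an empty list is an error, as in the (?1()) test case.
        raise ValueError("expected capture group number or name")
    refs = []
    for item in text.split(","):
        m = _ITEM.fullmatch(item)
        if m is None:
            raise ValueError(f"bad grouplist item: {item!r}")
        if m.group("rel") is not None:
            refs.append(GroupRef("relative", int(m.group("rel"))))
        elif m.group("abs") is not None:
            refs.append(GroupRef("absolute", int(m.group("abs"))))
        else:
            refs.append(GroupRef("name", m.group("angle") or m.group("quoted")))
    return refs
```

For example, parse_grouplist("2,3") yields two absolute references, which corresponds to the list used in (?1(2,3)) in the tests.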
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 54e86f190..e0d8ca639 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -3442,7 +3442,7 @@ later versions (I tried 5.024) it now works.
.
.
.\" HTML
-.SH "GROUPS AS SUBROUTINES"
+.SS "Groups as subroutines"
.rs
.sp
If the syntax for a recursive group call (either by number or by name) is used
@@ -3495,8 +3495,51 @@ in groups when called as subroutines is described in the section entitled
below.
.
.
+.SS "Recursion and subroutines with returned capture groups"
+.rs
+.sp
+Since PCRE2 10.46, recursion and subroutine calls may also specify a list of
+capture groups to return. This is a PCRE2 syntax extension not supported by
+Perl. Matching recurses into the referenced expression as described above.
+However, when the recursion returns to the calling expression, the groups
+captured during the recursion can be retained when the calling expression's
+context is restored.
+.P
+When used as a subroutine, this allows the subroutine's capture groups to
+be used as return values.
+.P
+Only the specific capture groups listed by the caller will be retained, using
+the following syntax:
+.sp
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (?n(grouplist)) )
+ (?+n(grouplist)) )
+ (?-n(grouplist)) ) call subroutine, returning capture groups
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+.P
+The list of capture groups "grouplist" is a comma-separated list of (absolute
+or relative) group numbers, and group names enclosed in single quotes or angle
+brackets.
+.P
+Here is an example which first uses the DEFINE condition to create a re-usable
+routine for matching a weekend day, then calls that subroutine and retains the
+groups it captures for use later:
+.sp
+ (?x: # ignore whitespace for clarity
+ # Define the routine "weekendday" which matches Saturday or
+ # Sunday, and returns the Sat/Sun prefix as \ek<short>.
+ (?(DEFINE) (?<weekendday>
+ (?|(?<short>Sat)urday|(?<short>Sun)day) ) )
+ # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+ (?&weekendday(<short>)),\ek<short> )
+.P
+This feature is not available using the Oniguruma syntax \eg<...> or \eg'...'
+below.
+.
+.
.\" HTML
-.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.SS "Oniguruma subroutine syntax"
.rs
.sp
For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3
index bc3168aeb..cb62a3b20 100644
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -543,14 +543,14 @@ This feature is not Perl-compatible.
(*scan_substring:(grouplist)...) scan captured substring
(*scs:(grouplist)...) scan captured substring
.sp
-The comma-separated list may identify groups in any of the following ways:
+The comma-separated list "grouplist" may identify groups in any of the
+following ways:
.sp
n absolute reference
+n relative reference
-n relative reference
 <name> name
'name' name
-.sp
.
.
.SH "SCRIPT RUNS"
@@ -597,6 +597,28 @@ The comma-separated list may identify groups in any of the following ways:
\eg'+n' call subroutine by relative number (PCRE2 extension)
\eg<-n> call subroutine by relative number (PCRE2 extension)
\eg'-n' call subroutine by relative number (PCRE2 extension)
+.sp
+The variants using parentheses (?...) may also specify a list of capture groups
+to return, which shall be retained in the calling subexpression if set during
+the recursion (this feature is not supported by Perl).
+.sp
+ (?R(grouplist)) recurse whole pattern, returning capture groups
+ (PCRE2 extension)
+ (?n(grouplist)) )
+ (?+n(grouplist)) ) call subroutine, returning capture groups
+ (?-n(grouplist)) ) (PCRE2 extension)
+ (?&name(grouplist)) )
+ (?P>name(grouplist)) )
+.sp
+The comma-separated list "grouplist" uses the same syntax as
+(*scan_substring:(grouplist)...), and may identify groups in any of the
+following ways:
+.sp
+ n absolute reference
+ +n relative reference
+ -n relative reference
+ <name> name
+ 'name' name
.
.
.SH "CONDITIONAL PATTERNS"
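The retention rule described above (listed groups are kept "if set during the recursion", everything else reverts when the caller's context is restored) can be modelled in a few lines. This is a sketch of the documented semantics only, not PCRE2's implementation: call_subroutine and weekend_body are hypothetical names, and captures are represented as a plain dict keyed by absolute group number.

```python
def call_subroutine(caller_captures, body, returned=()):
    """Model a subroutine call that returns selected capture groups.

    caller_captures: dict mapping group number -> captured text (caller's view)
    body:            function that mutates a captures dict, standing in for
                     the subroutine's matching
    returned:        group numbers listed in the call's grouplist
    """
    inner = dict(caller_captures)   # the callee works on its own copy
    body(inner)                     # the callee may set or overwrite groups
    result = dict(caller_captures)  # the caller's context is restored...
    for group in returned:
        if group in inner:          # ...but listed groups are kept if set
            result[group] = inner[group]
    return result


def weekend_body(captures):
    # Stand-in for ((Sat)(urday)) matching "Saturday".
    captures.update({1: "Saturday", 2: "Sat", 3: "urday"})


# Like (?1(2,3)): groups 2 and 3 survive the call, group 1 does not.
with_list = call_subroutine({}, weekend_body, returned=(2, 3))

# Like plain (?1): nothing is retained, so a later \2 has nothing to match.
without_list = call_subroutine({}, weekend_body)
```

The two calls mirror the test cases in this patch where (?1(2,3)) leaves group 1 unset while groups 2 and 3 are set, and where plain (?1) followed by \2 does not match.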
diff --git a/testdata/testinput2 b/testdata/testinput2
index 1105e96bc..72a4864a6 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -7951,6 +7951,47 @@ a)"xI
abc#abcdef#defghi#ghijkl
abc#abcdef#defghi#ghXjkl#
+% # Define the routine "weekendday" which matches Saturday or Sunday, and
+ # returns the Sat/Sun prefix as \k<short>.
+ (?(DEFINE)(?<weekendday>(?|(?<short>Sat)urday|(?<short>Sun)day)))
+ # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+ (?&weekendday(<short>)),\k<short> %x
+ Saturday,Sat
+ Sunday,Sun
+\= Expect no match
+ Saturday,Sun
+
+# Test each syntax used for recursion
+
+/(?(R)(Sat)urday|(?R(1)),\1)/
+ Saturday,Sat
+
+/(?(DEFINE)((Sat)urday))(?1(2)),\2/
+ Saturday,Sat
+
+/(?(DEFINE)((Sat)urday))(?-2(-1)),\2/
+ Saturday,Sat
+
+/(?+1(+2)),\2(?(DEFINE)((Sat)urday))/
+ Saturday,Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))(?&fn('ret')),\k<ret>/
+ Saturday,Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))(?P>fn(<ret>)),\k<ret>/
+ Saturday,Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))\g<fn('ret')>,\k<ret>/
+
+/(?(DEFINE)((Sat)urday))(?1),\2/
+\= Expect no match
+ Saturday,Sat
+
+/(?(DEFINE)((Sat)urday))(?1()),\2/
+
+/(?(DEFINE)((Sat)(urday)))(?1(2,3)),\2,\3/
+ Saturday,Sat,urday
+
# --------------
# End of testinput2
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index f1317b4d1..c92ae3285 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -22391,6 +22391,78 @@ No match
abc#abcdef#defghi#ghXjkl#
No match
+% # Define the routine "weekendday" which matches Saturday or Sunday, and
+ # returns the Sat/Sun prefix as \k<short>.
+ (?(DEFINE)(?<weekendday>(?|(?<short>Sat)urday|(?<short>Sun)day)))
+ # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+ (?&weekendday(<short>)),\k<short> %x
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1: <unset>
+ 2: Sat
+ Sunday,Sun
+ 0: Sunday,Sun
+ 1: <unset>
+ 2: Sun
+\= Expect no match
+ Saturday,Sun
+No match
+
+# Test each syntax used for recursion
+
+/(?(R)(Sat)urday|(?R(1)),\1)/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1: Sat
+
+/(?(DEFINE)((Sat)urday))(?1(2)),\2/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1: <unset>
+ 2: Sat
+
+/(?(DEFINE)((Sat)urday))(?-2(-1)),\2/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1:
+ 2: Sat
+
+/(?+1(+2)),\2(?(DEFINE)((Sat)urday))/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1:
+ 2: Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))(?&fn('ret')),\k<ret>/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1:
+ 2: Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))(?P>fn(<ret>)),\k<ret>/
+ Saturday,Sat
+ 0: Saturday,Sat
+ 1:
+ 2: Sat
+
+/(?(DEFINE)(?<fn>(?<ret>Sat)urday))\g<fn('ret')>,\k<ret>/
+Failed: error 142 at offset 39: syntax error in subpattern name (missing terminator?)
+
+/(?(DEFINE)((Sat)urday))(?1),\2/
+\= Expect no match
+ Saturday,Sat
+No match
+
+/(?(DEFINE)((Sat)urday))(?1()),\2/
+Failed: error 217 at offset 27: expected capture group number or name
+
+/(?(DEFINE)((Sat)(urday)))(?1(2,3)),\2,\3/
+ Saturday,Sat,urday
+ 0: Saturday,Sat,urday
+ 1:
+ 2: Sat
+ 3: urday
+
# --------------
# End of testinput2