From faafdfb357bd58c5ef82de6fef358bb6a3baea1e Mon Sep 17 00:00:00 2001
From: Nicholas Wilson
Date: Fri, 28 Mar 2025 12:04:24 +0000
Subject: [PATCH] Add documentation for subroutine return values

---
 doc/html/pcre2pattern.html |  84 ++++--
 doc/html/pcre2syntax.html  |  26 +-
 doc/pcre2.txt              | 574 ++++++++++++++++++++-----------------
 doc/pcre2pattern.3         |  47 ++-
 doc/pcre2syntax.3          |  26 +-
 testdata/testinput2        |  41 +++
 testdata/testoutput2       |  72 +++++
 7 files changed, 592 insertions(+), 278 deletions(-)

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 62406efaa..3279ddcfa 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -41,14 +41,12 @@

pcre2pattern man page

  • CONDITIONAL GROUPS
  • COMMENTS
  • RECURSIVE PATTERNS
- • GROUPS AS SUBROUTINES
- • ONIGURUMA SUBROUTINE SYNTAX
- • CALLOUTS
- • BACKTRACKING CONTROL
- • EBCDIC ENVIRONMENTS
- • SEE ALSO
- • AUTHOR
- • REVISION
+ • CALLOUTS
+ • BACKTRACKING CONTROL
+ • EBCDIC ENVIRONMENTS
+ • SEE ALSO
+ • AUTHOR
+ • REVISION

    PCRE2 REGULAR EXPRESSION DETAILS

    @@ -3399,7 +3397,9 @@

    "b" and so the whole match succeeds. This match used to fail in Perl, but in later versions (I tried 5.024) it now works.

    -

    GROUPS AS SUBROUTINES

    +

    +Groups as subroutines +

    If the syntax for a recursive group call (either by number or by name) is used outside the parentheses to which it refers, it operates a bit like a subroutine @@ -3446,8 +3446,60 @@

    GROUPS AS SUBROUTINES

    in groups when called as subroutines is described in the section entitled "Backtracking verbs in subroutines" below. +

    +

    +Recursion and subroutines with returned capture groups +

    +

    +Since PCRE2 10.46, recursion and subroutine calls may also specify a list of +capture groups to return. This is a PCRE2 syntax extension not supported by +Perl. The pattern matching recurses into the referenced expression as described +above; however, when the recursion returns to the calling expression, the +subgroups captured during the recursion can be retained when the calling +expression's context is restored. +

    +

    +When used as a subroutine, this allows the subroutine's capture groups to +be used as return values. +

    +

    +Only the specific capture groups listed by the caller will be retained, using +the following syntax: +

    +  (?R(grouplist))       recurse whole pattern, returning capture groups
    +  (?n(grouplist))       )
    +  (?+n(grouplist))      )
    +  (?-n(grouplist))      ) call subroutine, returning capture groups
    +  (?&name(grouplist))   )
    +  (?P>name(grouplist))  )
    +
    +

    +

    +The list of capture groups "grouplist" is a comma-separated list of (absolute +or relative) group numbers, and group names enclosed in single quotes or angle +brackets. +

    +

    +Here is an example which first uses the DEFINE condition to create a re-usable +routine for matching a weekend day, then calls that subroutine and retains the +groups it captures for use later: +

    +  (?x: # ignore whitespace for clarity
    +    # Define the routine "weekendday" which matches Saturday or
    +    # Sunday, and returns the Sat/Sun prefix as \k<short>.
    +    (?(DEFINE) (?<weekendday>
    +        (?|(?<short>Sat)urday|(?<short>Sun)day) ) )
    +    # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
    +    (?&weekendday(<short>)),\k<short> )
    +
    +

    +

    +This feature is not available using the Oniguruma syntax \g<...> or \g'...' +below.
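To make the accepted subjects concrete, here is a rough emulation of the "weekendday" example in Python. It is only a sketch under stated assumptions: the standard-library `re` module has no subroutine calls, no `(?|...)` branch reset and no returned capture groups, so the routine is inlined and the `<short>` capture is modelled with two numbered groups. The name `returned_short` is hypothetical, and `fullmatch` is stricter than the unanchored PCRE2 pattern.

```python
import re

# Emulation only: stdlib `re` cannot express (?&weekendday(<short>)).
# The routine body is inlined; group 1 stands in for <short> on the
# "Saturday" branch and group 2 on the "Sunday" branch.
WEEKEND_DAY = re.compile(r"(?:(Sat)urday|(Sun)day),(?:\1|\2)")


def returned_short(subject):
    """Return the emulated <short> capture, or None when nothing matches.

    Hypothetical helper for illustration; not part of PCRE2.
    """
    m = WEEKEND_DAY.fullmatch(subject)
    if m is None:
        return None
    # Exactly one of the two groups is set after a successful match.
    return m.group(1) or m.group(2)


print(returned_short("Saturday,Sat"))   # Sat
print(returned_short("Sunday,Sun"))     # Sun
print(returned_short("Saturday,Sun"))   # None
```

The third call fails because the text after the comma must repeat whatever the routine captured, which is the role the returned group plays in the PCRE2 pattern.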

    -

    ONIGURUMA SUBROUTINE SYNTAX

    +

    +Oniguruma subroutine syntax +

    For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative @@ -3465,7 +3517,7 @@

    ONIGURUMA SUBROUTINE SYNTAX

    Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not synonymous. The former is a backreference; the latter is a subroutine call.

    -

    CALLOUTS

    +

    CALLOUTS

    Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -3543,7 +3595,7 @@

    The doubling is removed before the string is passed to the callout function.

    -

    BACKTRACKING CONTROL

    +

    BACKTRACKING CONTROL

    There are a number of special "Backtracking Control Verbs" (to use Perl's terminology) that modify the behaviour of backtracking during matching. They @@ -4071,7 +4123,7 @@

    is no such group within the subroutine's group, the subroutine match fails and there is a backtrack at the outer level.

    -

    EBCDIC ENVIRONMENTS

    +

    EBCDIC ENVIRONMENTS

    Differences in the way PCRE behaves when it is running in an EBCDIC environment are covered in this section. @@ -4115,12 +4167,12 @@

    points. However, if the range is specified numerically, for example, [\x88-\x92] or [h-\x92], all code points are included.

    -

    SEE ALSO

    +

    SEE ALSO

    pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3).

    -

    AUTHOR

    +

    AUTHOR

    Philip Hazel
    @@ -4129,7 +4181,7 @@

    AUTHOR

    Cambridge, England.

    -

    REVISION

    +

    REVISION

    Last updated: 27 November 2024
    diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index 72a23214f..f5e757427 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -566,14 +566,14 @@

    SUBSTRING SCAN ASSERTION

    (*scan_substring:(grouplist)...) scan captured substring (*scs:(grouplist)...) scan captured substring -The comma-separated list may identify groups in any of the following ways: +The comma-separated list "grouplist" may identify groups in any of the +following ways:
       n       absolute reference
       +n      relative reference
       -n      relative reference
       <name>  name
       'name'  name
    -
     

    SCRIPT RUNS

    @@ -621,6 +621,28 @@

    SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
      \g<-n>  call subroutine by relative number (PCRE2 extension)
      \g'-n'  call subroutine by relative number (PCRE2 extension)
    +The variants using parentheses (?...) may also specify a list of capture groups
    +to return, which shall be retained in the calling subexpression if set during
    +the recursion (this feature is not supported by Perl).
    +
    +  (?R(grouplist))       recurse whole pattern, returning capture groups
    +                          (PCRE2 extension)
    +  (?n(grouplist))       )
    +  (?+n(grouplist))      ) call subroutine, returning capture groups
    +  (?-n(grouplist))      )   (PCRE2 extension)
    +  (?&name(grouplist))   )
    +  (?P>name(grouplist))  )
    +
    +The comma-separated list "grouplist" uses the same syntax as +(*scan_substring:(grouplist)...), and may identify groups in any of the +following ways: +
    +  n       absolute reference
    +  +n      relative reference
    +  -n      relative reference
    +  <name>  name
    +  'name'  name
    +
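The five grouplist forms listed above can be told apart mechanically. The following Python sketch classifies the entries of a grouplist string. It is not PCRE2's parser: the function name `parse_grouplist` is hypothetical, and the rules that names look like identifiers and that no whitespace is allowed around entries are simplifying assumptions.

```python
import re

# One grouplist entry: n, +n, -n, <name> or 'name'.
# Assumption: names are identifier-like ([A-Za-z_] followed by word chars).
_ENTRY = re.compile(
    r"""
      (?P<rel>[+-]\d+)                # +n or -n : relative reference
    | (?P<abs>\d+)                    # n        : absolute reference
    | <(?P<angle>[A-Za-z_]\w*)>       # <name>
    | '(?P<quote>[A-Za-z_]\w*)'       # 'name'
    """,
    re.VERBOSE,
)


def parse_grouplist(text):
    """Split a comma-separated grouplist and classify each entry.

    Returns a list of (kind, value) tuples where kind is "absolute",
    "relative" or "name". Raises ValueError on an unrecognised entry.
    """
    result = []
    for item in text.split(","):
        m = _ENTRY.fullmatch(item)
        if m is None:
            raise ValueError(f"bad grouplist entry: {item!r}")
        if m.group("rel") is not None:
            result.append(("relative", int(m.group("rel"))))
        elif m.group("abs") is not None:
            result.append(("absolute", int(m.group("abs"))))
        else:
            result.append(("name", m.group("angle") or m.group("quote")))
    return result


print(parse_grouplist("1,+2,-1,<short>,'short'"))
```

Both name forms map to the same kind here because the syntax summary treats `<name>` and `'name'` as equivalent spellings.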

    CONDITIONAL PATTERNS

diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index 46c2364e4..df707b6b0 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -10018,8 +10018,7 @@ RECURSIVE PATTERNS
        \1 does now match "b" and so the whole match succeeds. This match used
        to fail in Perl, but in later versions (I tried 5.024) it now works.

-
-GROUPS AS SUBROUTINES
+   Groups as subroutines

        If the syntax for a recursive group call (either by number or by name)
        is used outside the parentheses to which it refers, it operates a bit
@@ -10064,80 +10063,120 @@ GROUPS AS SUBROUTINES
        subroutines is described in the section entitled "Backtracking verbs in
        subroutines" below.

+   Recursion and subroutines with returned capture groups

-ONIGURUMA SUBROUTINE SYNTAX
+       Since PCRE2 10.46, recursion and subroutine calls may also specify a
+       list of capture groups to return. This is a PCRE2 syntax extension not
+       supported by Perl. The pattern matching recurses into the referenced
+       expression as described above; however, when the recursion returns to
+       the calling expression, the subgroups captured during the recursion can
+       be retained when the calling expression's context is restored.

-       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+       When used as a subroutine, this allows the subroutine's capture groups
+       to be used as return values.
+
+       Only the specific capture groups listed by the caller will be retained,
+       using the following syntax:
+
+         (?R(grouplist))       recurse whole pattern, returning capture groups
+         (?n(grouplist))       )
+         (?+n(grouplist))      )
+         (?-n(grouplist))      ) call subroutine, returning capture groups
+         (?&name(grouplist))   )
+         (?P>name(grouplist))  )
+
+       The list of capture groups "grouplist" is a comma-separated list of
+       (absolute or relative) group numbers, and group names enclosed in sin-
+       gle quotes or angle brackets.
+
+       Here is an example which first uses the DEFINE condition to create a
+       re-usable routine for matching a weekend day, then calls that subroutine
+       and retains the groups it captures for use later:
+
+         (?x: # ignore whitespace for clarity
+           # Define the routine "weekendday" which matches Saturday or
+           # Sunday, and returns the Sat/Sun prefix as \k<short>.
+           (?(DEFINE) (?<weekendday>
+               (?|(?<short>Sat)urday|(?<short>Sun)day) ) )
+           # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun".
+           (?&weekendday(<short>)),\k<short> )
+
+       This feature is not available using the Oniguruma syntax \g<...> or
+       \g'...' below.
+
+   Oniguruma subroutine syntax
+
+       For compatibility with Oniguruma, the non-Perl syntax \g followed by a
        name or a number enclosed either in angle brackets or single quotes, is
        an alternative syntax for calling a group as a subroutine, possibly re-
-       cursively. Here are two of the examples used above, rewritten using
+       cursively. Here are two of the examples used above, rewritten using
        this syntax:

          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
          (sens|respons)e and \g'1'ibility

-       PCRE2 supports an extension to Oniguruma: if a number is preceded by a
+       PCRE2 supports an extension to Oniguruma: if a number is preceded by a
        plus or a minus sign it is taken as a relative reference. For example:

          (abc)(?i:\g<-1>)

-       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
-       synonymous. The former is a backreference; the latter is a subroutine
+       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+       synonymous. The former is a backreference; the latter is a subroutine
        call.


 CALLOUTS

        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
-       Perl code to be obeyed in the middle of matching a regular expression.
+       Perl code to be obeyed in the middle of matching a regular expression.
        This makes it possible, amongst other things, to extract different sub-
        strings that match the same pair of parentheses when there is a repeti-
        tion.
- PCRE2 provides a similar feature, but of course it cannot obey arbi- - trary Perl code. The feature is called "callout". The caller of PCRE2 - provides an external function by putting its entry point in a match - context using the function pcre2_set_callout(), and then passing that - context to pcre2_match() or pcre2_dfa_match(). If no match context is - passed, or if the callout entry point is set to NULL, callout points - will be passed over silently during matching. To disallow callouts in + PCRE2 provides a similar feature, but of course it cannot obey arbi- + trary Perl code. The feature is called "callout". The caller of PCRE2 + provides an external function by putting its entry point in a match + context using the function pcre2_set_callout(), and then passing that + context to pcre2_match() or pcre2_dfa_match(). If no match context is + passed, or if the callout entry point is set to NULL, callout points + will be passed over silently during matching. To disallow callouts in the pattern syntax, you may use the PCRE2_EXTRA_NEVER_CALLOUT option. - Within a regular expression, (?C) indicates a point at which the - external function is to be called. There are two kinds of callout: - those with a numerical argument and those with a string argument. (?C) - on its own with no argument is treated as (?C0). A numerical argument - allows the application to distinguish between different callouts. - String arguments were added for release 10.20 to make it possible for - script languages that use PCRE2 to embed short scripts within patterns + Within a regular expression, (?C) indicates a point at which the + external function is to be called. There are two kinds of callout: + those with a numerical argument and those with a string argument. (?C) + on its own with no argument is treated as (?C0). A numerical argument + allows the application to distinguish between different callouts. 
+ String arguments were added for release 10.20 to make it possible for + script languages that use PCRE2 to embed short scripts within patterns in a similar way to Perl. During matching, when PCRE2 reaches a callout point, the external func- - tion is called. It is provided with the number or string argument of - the callout, the position in the pattern, and one item of data that is + tion is called. It is provided with the number or string argument of + the callout, the position in the pattern, and one item of data that is also set in the match block. The callout function may cause matching to proceed, to backtrack, or to fail. - By default, PCRE2 implements a number of optimizations at matching - time, and one side-effect is that sometimes callouts are skipped. If - you need all possible callouts to happen, you need to set options that - disable the relevant optimizations. More details, including a complete - description of the programming interface to the callout function, are + By default, PCRE2 implements a number of optimizations at matching + time, and one side-effect is that sometimes callouts are skipped. If + you need all possible callouts to happen, you need to set options that + disable the relevant optimizations. More details, including a complete + description of the programming interface to the callout function, are given in the pcre2callout documentation. Callouts with numerical arguments - If you just want to have a means of identifying different callout - points, put a number less than 256 after the letter C. For example, + If you just want to have a means of identifying different callout + points, put a number less than 256 after the letter C. For example, this pattern has two callout points: (?C1)abc(?C2)def - If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical - callouts are automatically installed before each item in the pattern. - They are all numbered 255. 
If there is a conditional group in the pat- + If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical + callouts are automatically installed before each item in the pattern. + They are all numbered 255. If there is a conditional group in the pat- tern whose condition is an assertion, an additional callout is inserted - just before the condition. An explicit callout may also be set at this + just before the condition. An explicit callout may also be set at this position, as in this example: (?(?C9)(?=a)abc|def) @@ -10147,79 +10186,79 @@ CALLOUTS Callouts with string arguments - A delimited string may be used instead of a number as a callout argu- - ment. The starting delimiter must be one of ` ' " ^ % # $ { and the + A delimited string may be used instead of a number as a callout argu- + ment. The starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is the same as the start, except for {, where the end- - ing delimiter is }. If the ending delimiter is needed within the + ing delimiter is }. If the ending delimiter is needed within the string, it must be doubled. For example: (?C'ab ''c'' d')xyz(?C{any text})pqr - The doubling is removed before the string is passed to the callout + The doubling is removed before the string is passed to the callout function. BACKTRACKING CONTROL - There are a number of special "Backtracking Control Verbs" (to use - Perl's terminology) that modify the behaviour of backtracking during - matching. They are generally of the form (*VERB) or (*VERB:NAME). Some + There are a number of special "Backtracking Control Verbs" (to use + Perl's terminology) that modify the behaviour of backtracking during + matching. They are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form, and may behave differently depending on whether - or not a name argument is present. The names are not required to be + or not a name argument is present. The names are not required to be unique within the pattern. 
- By default, for compatibility with Perl, a name is any sequence of + By default, for compatibility with Perl, a name is any sequence of characters that does not include a closing parenthesis. The name is not - processed in any way, and it is not possible to include a closing - parenthesis in the name. This can be changed by setting the - PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- + processed in any way, and it is not possible to include a closing + parenthesis in the name. This can be changed by setting the + PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati- ble. - When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to - verb names and only an unescaped closing parenthesis terminates the - name. However, the only backslash items that are permitted are \Q, \E, - and sequences such as \x{100} that define character code points. Char- + When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to + verb names and only an unescaped closing parenthesis terminates the + name. However, the only backslash items that are permitted are \Q, \E, + and sequences such as \x{100} that define character code points. Char- acter type escapes such as \d are faulted. A closing parenthesis can be included in a name either as \) or between - \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED - or PCRE2_EXTENDED_MORE option is also set, unescaped white space in + \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED + or PCRE2_EXTENDED_MORE option is also set, unescaped white space in verb names is skipped, and #-comments are recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not af- fect verb names unless PCRE2_ALT_VERBNAMES is also set. - The maximum length of a name is 255 in the 8-bit library and 65535 in - the 16-bit and 32-bit libraries. 
If the name is empty, that is, if the - closing parenthesis immediately follows the colon, the effect is as if + The maximum length of a name is 255 in the 8-bit library and 65535 in + the 16-bit and 32-bit libraries. If the name is empty, that is, if the + closing parenthesis immediately follows the colon, the effect is as if the colon were not there. Any number of these verbs may occur in a pat- tern. Except for (*ACCEPT), they may not be quantified. - Since these verbs are specifically related to backtracking, most of - them can be used only when the pattern is to be matched using the tra- - ditional matching function or JIT, because they use backtracking algo- - rithms. With the exception of (*FAIL), which behaves like a failing - negative assertion, the backtracking control verbs cause an error if + Since these verbs are specifically related to backtracking, most of + them can be used only when the pattern is to be matched using the tra- + ditional matching function or JIT, because they use backtracking algo- + rithms. With the exception of (*FAIL), which behaves like a failing + negative assertion, the backtracking control verbs cause an error if encountered by the DFA matching function. - The behaviour of these verbs in repeated groups, assertions, and in - capture groups called as subroutines (whether or not recursively) is + The behaviour of these verbs in repeated groups, assertions, and in + capture groups called as subroutines (whether or not recursively) is documented below. Optimizations that affect backtracking verbs PCRE2 contains some optimizations that are used to speed up matching by running some checks at the start of each match attempt. For example, it - may know the minimum length of matching subject, or that a particular + may know the minimum length of matching subject, or that a particular character must be present. 
When one of these optimizations bypasses the - running of a match, any included backtracking verbs will not, of + running of a match, any included backtracking verbs will not, of course, be processed. You can suppress the start-of-match optimizations - by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- + by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- pile(), by calling pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF - directive, or by starting the pattern with (*NO_START_OPT). There is - more discussion of this option in the section entitled "Compiling a + directive, or by starting the pattern with (*NO_START_OPT). There is + more discussion of this option in the section entitled "Compiling a pattern" in the pcre2api documentation. - Experiments with Perl suggest that it too has similar optimizations, + Experiments with Perl suggest that it too has similar optimizations, and like PCRE2, turning them off can change the result of a match. Verbs that act immediately @@ -10228,77 +10267,77 @@ BACKTRACKING CONTROL (*ACCEPT) or (*ACCEPT:NAME) - This verb causes the match to end successfully, skipping the remainder - of the pattern. However, when it is inside a capture group that is + This verb causes the match to end successfully, skipping the remainder + of the pattern. However, when it is inside a capture group that is called as a subroutine, only that group is ended successfully. Matching then continues at the outer level. If (*ACCEPT) is triggered in a posi- - tive assertion, the assertion succeeds; in a negative assertion, the + tive assertion, the assertion succeeds; in a negative assertion, the assertion fails. - If (*ACCEPT) is inside capturing parentheses, the data so far is cap- + If (*ACCEPT) is inside capturing parentheses, the data so far is cap- tured. 
For example: A((?:A|B(*ACCEPT)|C)D) - This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- + This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- tured by the outer parentheses. - (*ACCEPT) is the only backtracking verb that is allowed to be quanti- - fied because an ungreedy quantification with a minimum of zero acts + (*ACCEPT) is the only backtracking verb that is allowed to be quanti- + fied because an ungreedy quantification with a minimum of zero acts only when a backtrack happens. Consider, for example, (A(*ACCEPT)??B)C - where A, B, and C may be complex expressions. After matching "A", the - matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT) - is triggered and the match succeeds. In both cases, all but C is cap- - tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re- + where A, B, and C may be complex expressions. After matching "A", the + matcher processes "BC"; if that fails, causing a backtrack, (*ACCEPT) + is triggered and the match succeeds. In both cases, all but C is cap- + tured. Whereas (*COMMIT) (see below) means "fail on backtrack", a re- peated (*ACCEPT) of this type means "succeed on backtrack". - Warning: (*ACCEPT) should not be used within a script run group, be- - cause it causes an immediate exit from the group, bypassing the script + Warning: (*ACCEPT) should not be used within a script run group, be- + cause it causes an immediate exit from the group, bypassing the script run checking. (*FAIL) or (*FAIL:NAME) - This verb causes a matching failure, forcing backtracking to occur. It - may be abbreviated to (*F). It is equivalent to (?!) but easier to + This verb causes a matching failure, forcing backtracking to occur. It + may be abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl documentation notes that it is probably useful only when combined with (?{}) or (??{}). Those are, of course, Perl features that - are not present in PCRE2. 
The nearest equivalent is the callout fea- + are not present in PCRE2. The nearest equivalent is the callout fea- ture, as for example in this pattern: a+(?C)(*FAIL) - A match with the string "aaaa" always fails, but the callout is taken + A match with the string "aaaa" always fails, but the callout is taken before each backtrack happens (in this example, 10 times). - (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC- - CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is + (*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*AC- + CEPT) and (*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before the verb acts. Recording which path was taken - There is one verb whose main purpose is to track how a match was ar- - rived at, though it also has a secondary use in conjunction with ad- + There is one verb whose main purpose is to track how a match was ar- + rived at, though it also has a secondary use in conjunction with ad- vancing the match starting point (see (*SKIP) below). (*MARK:NAME) or (*:NAME) - A name is always required with this verb. For all the other backtrack- + A name is always required with this verb. For all the other backtrack- ing control verbs, a NAME argument is optional. - When a match succeeds, the name of the last-encountered mark name on + When a match succeeds, the name of the last-encountered mark name on the matching path is passed back to the caller as described in the sec- tion entitled "Other information about the match" in the pcre2api docu- - mentation. This applies to all instances of (*MARK) and other verbs, + mentation. This applies to all instances of (*MARK) and other verbs, including those inside assertions and atomic groups. However, there are - differences in those cases when (*MARK) is used in conjunction with + differences in those cases when (*MARK) is used in conjunction with (*SKIP) as described below. 
- The mark name that was last encountered on the matching path is passed - back. A verb without a NAME argument is ignored for this purpose. Here - is an example of pcre2test output, where the "mark" modifier requests + The mark name that was last encountered on the matching path is passed + back. A verb without a NAME argument is ignored for this purpose. Here + is an example of pcre2test output, where the "mark" modifier requests the retrieval and outputting of (*MARK) data: re> /X(*MARK:A)Y|X(*MARK:B)Z/mark @@ -10310,77 +10349,77 @@ BACKTRACKING CONTROL MK: B The (*MARK) name is tagged with "MK:" in this output, and in this exam- - ple it indicates which of the two alternatives matched. This is a more - efficient way of obtaining this information than putting each alterna- + ple it indicates which of the two alternatives matched. This is a more + efficient way of obtaining this information than putting each alterna- tive in its own capturing parentheses. - If a verb with a name is encountered in a positive assertion that is - true, the name is recorded and passed back if it is the last-encoun- + If a verb with a name is encountered in a positive assertion that is + true, the name is recorded and passed back if it is the last-encoun- tered. This does not happen for negative assertions or failing positive assertions. - After a partial match or a failed match, the last encountered name in + After a partial match or a failed match, the last encountered name in the entire match process is returned. For example: re> /X(*MARK:A)Y|X(*MARK:B)Z/mark data> XP No match, mark = B - Note that in this unanchored example the mark is retained from the + Note that in this unanchored example the mark is retained from the match attempt that started at the letter "X" in the subject. Subsequent match attempts starting at "P" and then with an empty string do not get as far as the (*MARK) item, but nevertheless do not reset it. 
- If you are interested in (*MARK) values after failed matches, you - should probably either set the PCRE2_NO_START_OPTIMIZE option or call - pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF directive (see + If you are interested in (*MARK) values after failed matches, you + should probably either set the PCRE2_NO_START_OPTIMIZE option or call + pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF directive (see above) to ensure that the match is always attempted. Verbs that act after backtracking The following verbs do nothing when they are encountered. Matching con- - tinues with what follows, but if there is a subsequent match failure, - causing a backtrack to the verb, a failure is forced. That is, back- - tracking cannot pass to the left of the verb. However, when one of - these verbs appears inside an atomic group or in an atomic lookaround - assertion that is true, its effect is confined to that group, because - once the group has been matched, there is never any backtracking into - it. Backtracking from beyond an atomic assertion or group ignores the + tinues with what follows, but if there is a subsequent match failure, + causing a backtrack to the verb, a failure is forced. That is, back- + tracking cannot pass to the left of the verb. However, when one of + these verbs appears inside an atomic group or in an atomic lookaround + assertion that is true, its effect is confined to that group, because + once the group has been matched, there is never any backtracking into + it. Backtracking from beyond an atomic assertion or group ignores the entire group, and seeks a preceding backtracking point. - These verbs differ in exactly what kind of failure occurs when back- - tracking reaches them. The behaviour described below is what happens - when the verb is not in a subroutine or an assertion. Subsequent sec- + These verbs differ in exactly what kind of failure occurs when back- + tracking reaches them. 
The behaviour described below is what happens + when the verb is not in a subroutine or an assertion. Subsequent sec- tions cover these special cases. (*COMMIT) or (*COMMIT:NAME) - This verb causes the whole match to fail outright if there is a later + This verb causes the whole match to fail outright if there is a later matching failure that causes backtracking to reach it. Even if the pat- - tern is unanchored, no further attempts to find a match by advancing - the starting point take place. If (*COMMIT) is the only backtracking + tern is unanchored, no further attempts to find a match by advancing + the starting point take place. If (*COMMIT) is the only backtracking verb that is encountered, once it has been passed pcre2_match() is com- mitted to finding a match at the current starting point, or not at all. For example: a+(*COMMIT)b - This matches "xxaab" but not "aacaab". It can be thought of as a kind + This matches "xxaab" but not "aacaab". It can be thought of as a kind of dynamic anchor, or "I've started, so I must finish." - The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM- - MIT). It is like (*MARK:NAME) in that the name is remembered for pass- - ing back to the caller. However, (*SKIP:NAME) searches only for names + The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM- + MIT). It is like (*MARK:NAME) in that the name is remembered for pass- + ing back to the caller. However, (*SKIP:NAME) searches only for names that are set with (*MARK), ignoring those set by any of the other back- tracking verbs. - If there is more than one backtracking verb in a pattern, a different - one that follows (*COMMIT) may be triggered first, so merely passing + If there is more than one backtracking verb in a pattern, a different + one that follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a match does not always guarantee that a match must be at this starting point. 
Note that (*COMMIT) at the start of a pattern is not the same as an an- - chor, unless PCRE2's start-of-match optimizations are turned off, as + chor, unless PCRE2's start-of-match optimizations are turned off, as shown in this output from pcre2test: re> /(*COMMIT)abc/ @@ -10391,68 +10430,68 @@ BACKTRACKING CONTROL data> xyzabc No match - For the first pattern, PCRE2 knows that any match must start with "a", - so the optimization skips along the subject to "a" before applying the - pattern to the first set of data. The match attempt then succeeds. The - second pattern disables the optimization that skips along to the first - character. The pattern is now applied starting at "x", and so the - (*COMMIT) causes the match to fail without trying any other starting + For the first pattern, PCRE2 knows that any match must start with "a", + so the optimization skips along the subject to "a" before applying the + pattern to the first set of data. The match attempt then succeeds. The + second pattern disables the optimization that skips along to the first + character. The pattern is now applied starting at "x", and so the + (*COMMIT) causes the match to fail without trying any other starting points. (*PRUNE) or (*PRUNE:NAME) - This verb causes the match to fail at the current starting position in + This verb causes the match to fail at the current starting position in the subject if there is a later matching failure that causes backtrack- - ing to reach it. If the pattern is unanchored, the normal "bumpalong" - advance to the next starting character then happens. Backtracking can - occur as usual to the left of (*PRUNE), before it is reached, or when - matching to the right of (*PRUNE), but if there is no match to the - right, backtracking cannot cross (*PRUNE). In simple cases, the use of - (*PRUNE) is just an alternative to an atomic group or possessive quan- + ing to reach it. 
If the pattern is unanchored, the normal "bumpalong" + advance to the next starting character then happens. Backtracking can + occur as usual to the left of (*PRUNE), before it is reached, or when + matching to the right of (*PRUNE), but if there is no match to the + right, backtracking cannot cross (*PRUNE). In simple cases, the use of + (*PRUNE) is just an alternative to an atomic group or possessive quan- tifier, but there are some uses of (*PRUNE) that cannot be expressed in - any other way. In an anchored pattern (*PRUNE) has the same effect as + any other way. In an anchored pattern (*PRUNE) has the same effect as (*COMMIT). The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name is remembered for passing back - to the caller. However, (*SKIP:NAME) searches only for names set with + to the caller. However, (*SKIP:NAME) searches only for names set with (*MARK), ignoring those set by other backtracking verbs. (*SKIP) - This verb, when given without a name, is like (*PRUNE), except that if - the pattern is unanchored, the "bumpalong" advance is not to the next + This verb, when given without a name, is like (*PRUNE), except that if + the pattern is unanchored, the "bumpalong" advance is not to the next character, but to the position in the subject where (*SKIP) was encoun- - tered. (*SKIP) signifies that whatever text was matched leading up to - it cannot be part of a successful match if there is a later mismatch. + tered. (*SKIP) signifies that whatever text was matched leading up to + it cannot be part of a successful match if there is a later mismatch. Consider: a+(*SKIP)b - If the subject is "aaaac...", after the first match attempt fails - (starting at the first character in the string), the starting point + If the subject is "aaaac...", after the first match attempt fails + (starting at the first character in the string), the starting point skips on to start the next attempt at "c". 
Note that a possessive quan- tifier does not have the same effect as this example; although it would - suppress backtracking during the first match attempt, the second at- - tempt would start at the second character instead of skipping on to + suppress backtracking during the first match attempt, the second at- + tempt would start at the second character instead of skipping on to "c". - If (*SKIP) is used to specify a new starting position that is the same - as the starting position of the current match, or (by being inside a - lookbehind) earlier, the position specified by (*SKIP) is ignored, and + If (*SKIP) is used to specify a new starting position that is the same + as the starting position of the current match, or (by being inside a + lookbehind) earlier, the position specified by (*SKIP) is ignored, and instead the normal "bumpalong" occurs. (*SKIP:NAME) - When (*SKIP) has an associated name, its behaviour is modified. When - such a (*SKIP) is triggered, the previous path through the pattern is - searched for the most recent (*MARK) that has the same name. If one is - found, the "bumpalong" advance is to the subject position that corre- - sponds to that (*MARK) instead of to where (*SKIP) was encountered. If + When (*SKIP) has an associated name, its behaviour is modified. When + such a (*SKIP) is triggered, the previous path through the pattern is + searched for the most recent (*MARK) that has the same name. If one is + found, the "bumpalong" advance is to the subject position that corre- + sponds to that (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a matching name is found, the (*SKIP) is ignored. 
- The search for a (*MARK) name uses the normal backtracking mechanism, - which means that it does not see (*MARK) settings that are inside + The search for a (*MARK) name uses the normal backtracking mechanism, + which means that it does not see (*MARK) settings that are inside atomic groups or assertions, because they are never re-entered by back- tracking. Compare the following pcre2test examples: @@ -10466,105 +10505,105 @@ BACKTRACKING CONTROL 0: b 1: b - In the first example, the (*MARK) setting is in an atomic group, so it + In the first example, the (*MARK) setting is in an atomic group, so it is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. - This allows the second branch of the pattern to be tried at the first - character position. In the second example, the (*MARK) setting is not - in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it + This allows the second branch of the pattern to be tried at the first + character position. In the second example, the (*MARK) setting is not + in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new matching attempt to start at the sec- - ond character. This time, the (*MARK) is never seen because "a" does + ond character. This time, the (*MARK) is never seen because "a" does not match "b", so the matcher immediately jumps to the second branch of the pattern. - Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It + Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores names that are set by other backtracking verbs. (*THEN) or (*THEN:NAME) - This verb causes a skip to the next innermost alternative when back- - tracking reaches it. That is, it cancels any further backtracking - within the current alternative. Its name comes from the observation + This verb causes a skip to the next innermost alternative when back- + tracking reaches it. 
That is, it cancels any further backtracking + within the current alternative. Its name comes from the observation that it can be used for a pattern-based if-then-else block: ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... - If the COND1 pattern matches, FOO is tried (and possibly further items - after the end of the group if FOO succeeds); on failure, the matcher - skips to the second alternative and tries COND2, without backtracking - into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- - quently BAZ fails, there are no more alternatives, so there is a back- - track to whatever came before the entire group. If (*THEN) is not in- + If the COND1 pattern matches, FOO is tried (and possibly further items + after the end of the group if FOO succeeds); on failure, the matcher + skips to the second alternative and tries COND2, without backtracking + into COND1. If that succeeds and BAR fails, COND3 is tried. If subse- + quently BAZ fails, there are no more alternatives, so there is a back- + track to whatever came before the entire group. If (*THEN) is not in- side an alternation, it acts like (*PRUNE). - The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). + The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is remembered for passing back - to the caller. However, (*SKIP:NAME) searches only for names set with + to the caller. However, (*SKIP:NAME) searches only for names set with (*MARK), ignoring those set by other backtracking verbs. - A group that does not contain a | character is just a part of the en- - closing alternative; it is not a nested alternation with only one al- + A group that does not contain a | character is just a part of the en- + closing alternative; it is not a nested alternation with only one al- ternative. The effect of (*THEN) extends beyond such a group to the en- - closing alternative. Consider this pattern, where A, B, etc. 
are com- - plex pattern fragments that do not contain any | characters at this + closing alternative. Consider this pattern, where A, B, etc. are com- + plex pattern fragments that do not contain any | characters at this level: A (B(*THEN)C) | D - If A and B are matched, but there is a failure in C, matching does not + If A and B are matched, but there is a failure in C, matching does not backtrack into A; instead it moves to the next alternative, that is, D. - However, if the group containing (*THEN) is given an alternative, it + However, if the group containing (*THEN) is given an alternative, it behaves differently: A (B(*THEN)C | (*FAIL)) | D The effect of (*THEN) is now confined to the inner group. After a fail- - ure in C, matching moves to (*FAIL), which causes the whole group to - fail because there are no more alternatives to try. In this case, + ure in C, matching moves to (*FAIL), which causes the whole group to + fail because there are no more alternatives to try. In this case, matching does backtrack into A. - Note that a conditional group is not considered as having two alterna- - tives, because only one is ever used. In other words, the | character - in a conditional group has a different meaning. Ignoring white space, + Note that a conditional group is not considered as having two alterna- + tives, because only one is ever used. In other words, the | character + in a conditional group has a different meaning. Ignoring white space, consider: ^.*? (?(?=a) a | b(*THEN)c ) If the subject is "ba", this pattern does not match. Because .*? is un- - greedy, it initially matches zero characters. The condition (?=a) then - fails, the character "b" is matched, but "c" is not. At this point, - matching does not backtrack to .*? as might perhaps be expected from - the presence of the | character. The conditional group is part of the - single alternative that comprises the whole pattern, and so the match - fails. 
(If there was a backtrack into .*?, allowing it to match "b", + greedy, it initially matches zero characters. The condition (?=a) then + fails, the character "b" is matched, but "c" is not. At this point, + matching does not backtrack to .*? as might perhaps be expected from + the presence of the | character. The conditional group is part of the + single alternative that comprises the whole pattern, and so the match + fails. (If there was a backtrack into .*?, allowing it to match "b", the match would succeed.) - The verbs just described provide four different "strengths" of control + The verbs just described provide four different "strengths" of control when subsequent matching fails. (*THEN) is the weakest, carrying on the - match at the next alternative. (*PRUNE) comes next, failing the match - at the current starting position, but allowing an advance to the next - character (for an unanchored pattern). (*SKIP) is similar, except that + match at the next alternative. (*PRUNE) comes next, failing the match + at the current starting position, but allowing an advance to the next + character (for an unanchored pattern). (*SKIP) is similar, except that the advance may be more than one character. (*COMMIT) is the strongest, causing the entire match to fail. More than one backtracking verb - If more than one backtracking verb is present in a pattern, the one - that is backtracked onto first acts. For example, consider this pat- + If more than one backtracking verb is present in a pattern, the one + that is backtracked onto first acts. For example, consider this pat- tern, where A, B, etc. are complex pattern fragments: (A(*COMMIT)B(*THEN)C|ABD) - If A matches but B fails, the backtrack to (*COMMIT) causes the entire + If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to fail. However, if A and B match, but C fails, the backtrack to - (*THEN) causes the next alternative (ABD) to be tried. 
This behaviour - is consistent, but is not always the same as Perl's. It means that if - two or more backtracking verbs appear in succession, all but the last + (*THEN) causes the next alternative (ABD) to be tried. This behaviour + is consistent, but is not always the same as Perl's. It means that if + two or more backtracking verbs appear in succession, all but the last of them has no effect. Consider this example: ...(*COMMIT)(*PRUNE)... If there is a matching failure to the right, backtracking onto (*PRUNE) - causes it to be triggered, and its action is taken. There can never be + causes it to be triggered, and its action is taken. There can never be a backtrack onto (*COMMIT). Backtracking verbs in repeated groups @@ -10574,52 +10613,52 @@ BACKTRACKING CONTROL /(a(*COMMIT)b)+ac/ - If the subject is "abac", Perl matches unless its optimizations are - disabled, but PCRE2 always fails because the (*COMMIT) in the second + If the subject is "abac", Perl matches unless its optimizations are + disabled, but PCRE2 always fails because the (*COMMIT) in the second repeat of the group acts. Backtracking verbs in assertions - (*FAIL) in any assertion has its normal effect: it forces an immediate - backtrack. The behaviour of the other backtracking verbs depends on - whether or not the assertion is standalone or acting as the condition + (*FAIL) in any assertion has its normal effect: it forces an immediate + backtrack. The behaviour of the other backtracking verbs depends on + whether or not the assertion is standalone or acting as the condition in a conditional group. - (*ACCEPT) in a standalone positive assertion causes the assertion to - succeed without any further processing; captured strings and a mark - name (if set) are retained. In a standalone negative assertion, (*AC- + (*ACCEPT) in a standalone positive assertion causes the assertion to + succeed without any further processing; captured strings and a mark + name (if set) are retained. 
In a standalone negative assertion, (*AC- CEPT) causes the assertion to fail without any further processing; cap- tured substrings and any mark name are discarded. - If the assertion is a condition, (*ACCEPT) causes the condition to be - true for a positive assertion and false for a negative one; captured + If the assertion is a condition, (*ACCEPT) causes the condition to be + true for a positive assertion and false for a negative one; captured substrings are retained in both cases. The remaining verbs act only when a later failure causes a backtrack to - reach them. This means that, for the Perl-compatible assertions, their + reach them. This means that, for the Perl-compatible assertions, their effect is confined to the assertion, because Perl lookaround assertions are atomic. A backtrack that occurs after such an assertion is complete - does not jump back into the assertion. Note in particular that a - (*MARK) name that is set in an assertion is not "seen" by an instance + does not jump back into the assertion. Note in particular that a + (*MARK) name that is set in an assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern. - PCRE2 now supports non-atomic positive assertions and also "scan sub- - string" assertions, as described in the sections entitled "Non-atomic - assertions" and "Scan substring assertions" above. These assertions + PCRE2 now supports non-atomic positive assertions and also "scan sub- + string" assertions, as described in the sections entitled "Non-atomic + assertions" and "Scan substring assertions" above. These assertions must be standalone (not used as conditions). They are not Perl-compati- - ble. For these assertions, a later backtrack does jump back into the - assertion, and therefore verbs such as (*COMMIT) can be triggered by + ble. For these assertions, a later backtrack does jump back into the + assertion, and therefore verbs such as (*COMMIT) can be triggered by backtracks from later in the pattern. 
- The effect of (*THEN) is not allowed to escape beyond an assertion. If - there are no more branches to try, (*THEN) causes a positive assertion - to be false, and a negative assertion to be true. This behaviour dif- + The effect of (*THEN) is not allowed to escape beyond an assertion. If + there are no more branches to try, (*THEN) causes a positive assertion + to be false, and a negative assertion to be true. This behaviour dif- fers from Perl when the assertion has only one branch. - The other backtracking verbs are not treated specially if they appear - in a standalone positive assertion. In a conditional positive asser- + The other backtracking verbs are not treated specially if they appear + in a standalone positive assertion. In a conditional positive asser- tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP), - or (*PRUNE) causes the condition to be false. However, for both stand- + or (*PRUNE) causes the condition to be false. However, for both stand- alone and conditional negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes the assertion to be true, without consider- ing any further alternative branches. @@ -10629,19 +10668,19 @@ BACKTRACKING CONTROL These behaviours occur whether or not the group is called recursively. (*ACCEPT) in a group called as a subroutine causes the subroutine match - to succeed without any further processing. Matching then continues af- - ter the subroutine call. Perl documents this behaviour. Perl's treat- + to succeed without any further processing. Matching then continues af- + ter the subroutine call. Perl documents this behaviour. Perl's treat- ment of the other verbs in subroutines is different in some cases. - (*FAIL) in a group called as a subroutine has its normal effect: it + (*FAIL) in a group called as a subroutine has its normal effect: it forces an immediate backtrack. 
- (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail - when triggered by being backtracked to in a group called as a subrou- + (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail + when triggered by being backtracked to in a group called as a subrou- tine. There is then a backtrack at the outer level. (*THEN), when triggered, skips to the next alternative in the innermost - enclosing group that has alternatives (its normal behaviour). However, + enclosing group that has alternatives (its normal behaviour). However, if there is no such group within the subroutine's group, the subroutine match fails and there is a backtrack at the outer level. @@ -10653,44 +10692,44 @@ EBCDIC ENVIRONMENTS Escape sequences - When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. + When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c escape is processed as specified for Perl in the perlebcdic doc- - ument. The only characters that are allowed after \c are A-Z, a-z, or - one of @, [, \, ], ^, _, or ?. Any other character provokes a compile- - time error. The sequence \c@ encodes character code 0; after \c the - letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [, - \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be- + ument. The only characters that are allowed after \c are A-Z, a-z, or + one of @, [, \, ], ^, _, or ?. Any other character provokes a compile- + time error. The sequence \c@ encodes character code 0; after \c the + letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [, + \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be- comes either 255 (hex FF) or 95 (hex 5F). - Thus, apart from \c?, these escapes generate the same character code - values as they do in an ASCII or Unicode environment, though the mean- - ings of the values mostly differ. 
For example, \cG always generates + Thus, apart from \c?, these escapes generate the same character code + values as they do in an ASCII or Unicode environment, though the mean- + ings of the values mostly differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC. - The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, - but because 127 is not a control character in EBCDIC, Perl makes it - generate the APC character. Unfortunately, there are several variants - of EBCDIC. In most of them the APC character has the value 255 (hex - FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If + The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, + but because 127 is not a control character in EBCDIC, Perl makes it + generate the APC character. Unfortunately, there are several variants + of EBCDIC. In most of them the APC character has the value 255 (hex + FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC values, PCRE2 makes \c? generate 95; otherwise it generates 255. Character classes In character classes there is a special case in EBCDIC environments for - ranges whose end points are both specified as literal letters in the - same case. For compatibility with Perl, EBCDIC code points within the + ranges whose end points are both specified as literal letters in the + same case. For compatibility with Perl, EBCDIC code points within the range that are not letters are omitted. For example, [h-k] matches only - four characters, even though the EBCDIC codes for h and k are 0x88 and + four characters, even though the EBCDIC codes for h and k are 0x88 and 0x92, a range of 11 code points. However, if the range is specified nu- - merically, for example, [\x88-\x92] or [h-\x92], all code points are + merically, for example, [\x88-\x92] or [h-\x92], all code points are included. 
SEE ALSO - pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), + pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3), pcre2(3). @@ -12127,8 +12166,8 @@ SUBSTRING SCAN ASSERTION (*scan_substring:(grouplist)...) scan captured substring (*scs:(grouplist)...) scan captured substring - The comma-separated list may identify groups in any of the following - ways: + The comma-separated list "grouplist" may identify groups in any of the + following ways: n absolute reference +n relative reference @@ -12179,6 +12218,29 @@ SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) \g<-n> call subroutine by relative number (PCRE2 extension) \g'-n' call subroutine by relative number (PCRE2 extension) + The variants using parentheses (?...) may also specify a list of cap- + ture groups to return, which shall be retained in the calling subex- + pression if set during the recursion (this feature is not supported by + Perl). + + (?R(grouplist)) recurse whole pattern, returning capture groups + (PCRE2 extension) + (?n(grouplist)) ) + (?+n(grouplist)) ) call subroutine, returning capture groups + (?-n(grouplist)) ) (PCRE2 extension) + (?&name(grouplist)) ) + (?P>name(grouplist)) ) + + The comma-separated list "grouplist" uses the same syntax as + (*scan_substring:(grouplist)...), and may identify groups in any of the + following ways: + + n absolute reference + +n relative reference + -n relative reference + <name> name + 'name' name + CONDITIONAL PATTERNS diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 54e86f190..e0d8ca639 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -3442,7 +3442,7 @@ later versions (I tried 5.024) it now works. . . .\" HTML -.SH "GROUPS AS SUBROUTINES" +.SS "Groups as subroutines" .rs .sp If the syntax for a recursive group call (either by number or by name) is used @@ -3495,8 +3495,51 @@ in groups when called as subroutines is described in the section entitled below. . .
+.SS "Recursion and subroutines with returned capture groups" +.rs +.sp +Since PCRE2 10.46, recursion and subroutine calls may also specify a list of +capture groups to return. This is a PCRE2 syntax extension not supported by +Perl. The pattern matching recurses into the referenced expression as described +above; however, when the recursion returns to the calling expression, the +subgroups captured during the recursion can be retained when the calling +expression's context is restored. +.P +When used as a subroutine, this allows the subroutine's capture groups to +be used as return values. +.P +Only the specific capture groups listed by the caller will be retained, using +the following syntax: +.sp + (?R(grouplist)) recurse whole pattern, returning capture groups + (?n(grouplist)) ) + (?+n(grouplist)) ) + (?-n(grouplist)) ) call subroutine, returning capture groups + (?&name(grouplist)) ) + (?P>name(grouplist)) ) +.P +The list of capture groups "grouplist" is a comma-separated list of (absolute +or relative) group numbers, and group names enclosed in single quotes or angle +brackets. +.P +Here is an example which first uses the DEFINE condition to create a re-usable +routine for matching a weekend day, then calls that subroutine and retains the +groups it captures for use later: +.sp + (?x: # ignore whitespace for clarity + # Define the routine "weekendday" which matches Saturday or + # Sunday, and returns the Sat/Sun prefix as \ek. + (?(DEFINE) (? + (?|(?Sat)urday|(?Sun)day) ) ) + # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun". + (?&weekendday()),\ek ) +.P +This feature is not available using the Oniguruma syntax \eg<...> or \eg'...' +below. +. +.
.\" HTML -.SH "ONIGURUMA SUBROUTINE SYNTAX" +.SS "Oniguruma subroutine syntax" .rs .sp For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index bc3168aeb..cb62a3b20 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -543,14 +543,14 @@ This feature is not Perl-compatible. (*scan_substring:(grouplist)...) scan captured substring (*scs:(grouplist)...) scan captured substring .sp -The comma-separated list may identify groups in any of the following ways: +The comma-separated list "grouplist" may identify groups in any of the +following ways: .sp n absolute reference +n relative reference -n relative reference <name> name 'name' name -.sp . . .SH "SCRIPT RUNS" @@ -597,6 +597,28 @@ The comma-separated list may identify groups in any of the following ways: \eg'+n' call subroutine by relative number (PCRE2 extension) \eg<-n> call subroutine by relative number (PCRE2 extension) \eg'-n' call subroutine by relative number (PCRE2 extension) +.sp +The variants using parentheses (?...) may also specify a list of capture groups +to return, which shall be retained in the calling subexpression if set during +the recursion (this feature is not supported by Perl). +.sp + (?R(grouplist)) recurse whole pattern, returning capture groups + (PCRE2 extension) + (?n(grouplist)) ) + (?+n(grouplist)) ) call subroutine, returning capture groups + (?-n(grouplist)) ) (PCRE2 extension) + (?&name(grouplist)) ) + (?P>name(grouplist)) ) +.sp +The comma-separated list "grouplist" uses the same syntax as +(*scan_substring:(grouplist)...), and may identify groups in any of the +following ways: +.sp + n absolute reference + +n relative reference + -n relative reference + <name> name + 'name' name . .
.SH "CONDITIONAL PATTERNS" diff --git a/testdata/testinput2 b/testdata/testinput2 index 1105e96bc..72a4864a6 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -7951,6 +7951,47 @@ a)"xI abc#abcdef#defghi#ghijkl abc#abcdef#defghi#ghXjkl# +% # Define the routine "weekendday" which matches Saturday or Sunday, and + # returns the Sat/Sun prefix as \k. + (?(DEFINE)(?(?|(?Sat)urday|(?Sun)day))) + # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun". + (?&weekendday()),\k %x + Saturday,Sat + Sunday,Sun +\= Expect no match + Saturday,Sun + +# Test each syntax used for recursion + +/(?(R)(Sat)urday|(?R(1)),\1)/ + Saturday,Sat + +/(?(DEFINE)((Sat)urday))(?1(2)),\2/ + Saturday,Sat + +/(?(DEFINE)((Sat)urday))(?-2(-1)),\2/ + Saturday,Sat + +/(?+1(+2)),\2(?(DEFINE)((Sat)urday))/ + Saturday,Sat + +/(?(DEFINE)(?(?Sat)urday))(?&fn('ret')),\k/ + Saturday,Sat + +/(?(DEFINE)(?(?Sat)urday))(?P>fn()),\k/ + Saturday,Sat + +/(?(DEFINE)(?(?Sat)urday))\g,\k/ + +/(?(DEFINE)((Sat)urday))(?1),\2/ +\= Expect no match + Saturday,Sat + +/(?(DEFINE)((Sat)urday))(?1()),\2/ + +/(?(DEFINE)((Sat)(urday)))(?1(2,3)),\2,\3/ + Saturday,Sat,urday + # -------------- # End of testinput2 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index f1317b4d1..c92ae3285 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -22391,6 +22391,78 @@ No match abc#abcdef#defghi#ghXjkl# No match +% # Define the routine "weekendday" which matches Saturday or Sunday, and + # returns the Sat/Sun prefix as \k. + (?(DEFINE)(?(?|(?Sat)urday|(?Sun)day))) + # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun". 
+ (?&weekendday()),\k %x + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + Sunday,Sun + 0: Sunday,Sun + 1: + 2: Sun +\= Expect no match + Saturday,Sun +No match + +# Test each syntax used for recursion + +/(?(R)(Sat)urday|(?R(1)),\1)/ + Saturday,Sat + 0: Saturday,Sat + 1: Sat + +/(?(DEFINE)((Sat)urday))(?1(2)),\2/ + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + +/(?(DEFINE)((Sat)urday))(?-2(-1)),\2/ + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + +/(?+1(+2)),\2(?(DEFINE)((Sat)urday))/ + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + +/(?(DEFINE)(?(?Sat)urday))(?&fn('ret')),\k/ + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + +/(?(DEFINE)(?(?Sat)urday))(?P>fn()),\k/ + Saturday,Sat + 0: Saturday,Sat + 1: + 2: Sat + +/(?(DEFINE)(?(?Sat)urday))\g,\k/ +Failed: error 142 at offset 39: syntax error in subpattern name (missing terminator?) + +/(?(DEFINE)((Sat)urday))(?1),\2/ +\= Expect no match + Saturday,Sat +No match + +/(?(DEFINE)((Sat)urday))(?1()),\2/ +Failed: error 217 at offset 27: expected capture group number or name + +/(?(DEFINE)((Sat)(urday)))(?1(2,3)),\2,\3/ + Saturday,Sat,urday + 0: Saturday,Sat,urday + 1: + 2: Sat + 3: urday + # -------------- # End of testinput2
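Reviewer note on the numbered-group tests above: the expected captures follow from one rule stated in the new documentation, namely that when a call returns, the caller's capture context is restored, except that groups named in the grouplist keep the value they were given during the call. The Python sketch below is only a model of that rule for reading the test output. It is not PCRE2's implementation, and `return_from_call` and the dictionary representation of captures are hypothetical names introduced here for illustration. It also assumes that a listed group which was not set during the call leaves the caller's value untouched.

```python
from typing import Dict, Iterable, Optional

# Hypothetical representation: group number -> captured text, or None if unset.
Captures = Dict[int, Optional[str]]


def return_from_call(caller: Captures, callee: Captures,
                     grouplist: Iterable[int]) -> Captures:
    """Model the capture state seen by the caller after a subroutine returns.

    Start from the caller's saved captures (the behaviour of a plain (?n)
    call), then copy back only the listed groups that were set in the call.
    """
    restored = dict(caller)
    for group in grouplist:
        value = callee.get(group)
        if value is not None:  # assumption: unset groups are not returned
            restored[group] = value
    return restored


if __name__ == "__main__":
    before = {1: None, 2: None}            # nothing captured before the call
    inside = {1: "Saturday", 2: "Sat"}     # captures at the end of the call

    # Mirrors /(?(DEFINE)((Sat)urday))(?1(2)),\2/ : only group 2 is returned.
    print(return_from_call(before, inside, [2]))   # {1: None, 2: 'Sat'}

    # Mirrors the plain (?1) call: nothing is returned, so \2 stays unset.
    print(return_from_call(before, inside, []))    # {1: None, 2: None}
```

Under this model, the `(?1(2,3))` test leaves group 1 unset and returns groups 2 and 3, which agrees with the expected output recorded in testoutput2.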