Skip to content

Incorrect backtracking over lookahead with captures which are later backreferenced #355

@liquidcake1

Description

@liquidcake1

This does not match:

PCRE2 version 10.42 2022-12-11
  re> /(?=.*(.))(?=.*\1.*\1).*(?!\1)(.).*\2/ 
data> 22331
No match

This matches:

  re> /(?=.*(.).*\1)(?=.*\1.*\1).*(?!\1)(.).*\2/ 
data> 22331
 0: 22
 1: 3
 2: 2

The only difference is an additional constraint which forces the initial lookahead not to pick 1.

Basically I think PCRE doesn't understand that the initial lookahead is a dependency of subsequent behaviour, and hence the matcher must exhaust its possible matches before declaring a failure. Interestingly, changing the matcher does seem to support backtracking to some extent, as for example the following works:

PCRE2 version 10.42 2022-12-11
  re> /(?=.*(.))(?=(.)).*\2(?!.*\1$)/
data> 23321
 0: 1
 1: 1
 2: 1

Weirdly, moving the \2 to the end, which shouldn't affect the match (again, this should in principle actually increase the options available to the matcher as it allows the .* in the negative lookahead to start with the \2 while in the matching variant, it couldn't) causes this to no longer match.

I assume I'm hitting some kind of backtracking safety mechanism, but if so it'd be preferable for the engine to just refuse to compile such monstrosities, or at least error at runtime when the constraint is hit. Better that than execute them incorrectly.

Also tested on latest master (10.43-DEV 2023-04-14). Same results.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions