Skip to content

Conversation

@NWilson
Copy link
Member

@NWilson NWilson commented Mar 18, 2025

Fixes #685

  • The primary purpose of pcre2_next_match() is to make it much easier for
    PCRE2 clients to iterate over matches, without needing an advanced knowledge
    of regular expressions. The current amount of code, and comments, in
    pcre2demo makes it look like we are placing a heavy burden on clients, when
    we could take that burden away by providing a function to do the work for them.
  • Secondly, we can simplify our own code by merging the three duplicate
    implementations of the /g global match behaviour: pcre2demo, pcre2_substitute,
    and pcre2test.
  • Thirdly, as I look closely at the issue, I can improve the documentation.
  • Fourthly, I would like to actually simplify the logic, removing a complex loop
    which makes several match attempts, swallows duplicate matches, and more.
    We can have identical behaviour with a simple retry using
    PCRE2_NOTEMPTY_ATSTART.
  • Implementation:
    • Add a new field start_offset to pcre2_match_data, so that pcre2_next_match()
      can inspect where the match attempt began, and ensure that the next attempt
      advances through the subject.

@zherczeg
Copy link
Collaborator

This is a good idea

@NWilson NWilson force-pushed the user/niwilson/match-iterator branch from da3d047 to 0136d7e Compare March 19, 2025 20:31
@NWilson
Copy link
Member Author

NWilson commented Mar 19, 2025

PR description updated.

I have polished the documentation, which I think is now clearer and more helpful.

The improvement in readability in pcre2demo is very nice: clients can actually use this API without pain!

The improved handling in pcre2_substitute is an extra bonus.

I am finished with this now.

  • Check coverage report to make sure the tests hit all the branches
  • Maybe add an additional testcase where the matches are found backwards? Eg /(?(?=\Gc)(?<=\Kab)|(?<=\Kb))/g,allow_lookaround_bsk applied to "abc".
    • First match will match "b" at offset 2, beginning from start_offset=0 (evil \K lookbehind)
    • Second match will match "ab" at offset 2, beginning from start_offset=2 (evil \K lookbehind)
    • The disgusting thing here is that the second match found is before the first match, because we can use \G to detect that start_offset has advanced and conditionally reach back further into the string with \K.

@NWilson NWilson requested a review from zherczeg March 24, 2025 10:10
@NWilson
Copy link
Member Author

NWilson commented Mar 24, 2025

Zoltan, are you happy for me to merge this new API function? I want to make sure you have a chance to check/veto it.

I'm happy to self-review the implementation, if you're OK with the API design. (I have also added a field to match_data, and I think I have correctly updated the three locations that need to set it, whenever one of the three matching functions returns.)

@zherczeg
Copy link
Collaborator

I haven't checked everything in detail, but I think this is a useful user function, and the patch looks good. The overhead of setting the start_offset probably negligible.

@NWilson
Copy link
Member Author

NWilson commented Mar 24, 2025

Thank you very much Zoltan!

@NWilson NWilson merged commit eb3bd3c into master Mar 24, 2025
35 checks passed
@NWilson NWilson deleted the user/niwilson/match-iterator branch March 24, 2025 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a match-iterator API

3 participants