Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp pattern matching repeat-loop exit condition checking #18414

Closed
khwilliamson opened this issue Dec 20, 2020 · 2 comments
Closed

Revamp pattern matching repeat-loop exit condition checking #18414

khwilliamson opened this issue Dec 20, 2020 · 2 comments

Comments

@khwilliamson
Copy link
Contributor

This issue is in response to a comment on #18334 asking that the lengthy commit message be moved to a ticket.

Consider the pattern /A*B/ where A and B are arbitrary. The pattern
matching code tries to make a tight loop to match the span of A's.

The commit revamps the whole mechanism so that now UTF-8 is a first
class citizen, and improves the loop exit criteria for non-UTF-8 as
well.

The old code returned FALSE if there was no possibility of
anything matching (such as the match requiring UTF-8 to represent and
the target isn't UTF-8 encoded). If there was a match possibility it
returned TRUE and 1 or 2 code points that could be the first things
in B.

If more than 2 code points could be matched, it gave up
and returned 0 code points. In this case, the caller had to check for
every eventuality, and in the case of /i had to re-fold each backtrack.

Like the old one, the revampled function introduced here returns FALSE if
there is no possibility of a match succeeding, but unlike the old one it
never gives up. If the return is TRUE, It also always returns:
1) A string and a mask to quickly rule out non-matches
2) A count of initial bytes that are exact matches, and a count of initial definitive matches
3) The first character of all possible matches

The caller does a bitwise 'and' of the target string with the returned
mask. If that doesn't match the returned string, then the caller can
immediately rule out this match. If the mask is all one bits, then
there are no false positives. This will always be the case if the
function is called on an exact match, one without /i. Likewise, if just
a single 0 bit is in the mask, there are just two possible matches, and
the match is definitive for these, with no false positives. The more 0
bits in the mask, the more false positives there are, but it still
quickly rules out the remaining ones. This is graceful degradation,
unlike the mechanism it replaces.

When the counts of initial exact and definitive match bytes are non-zero
(a likely scenario), the caller can use this to cut down the number of
brute force tests it must do.

Finally, having the exact strings of what can possibly match means no
need to refold when backtracking.

Here are some examples. In an EXACT node, the mask returned will be all
one bits, and the returned string will be the first character of the
node. The caller will AND the mask with the target string and see if it
matches the returned string. This is not very different from before,
except that the caller doesn't have to check each time if 0, 1, or 2
characters were returned from the function, and then behave accordingly.

Only if the match succeeds, does it need to check if this was a
definitive match or not.

More complex is

qr/.*\N{U+100}/i

Only U+100 and U+101 match. The old way knew this and returned those
code points and the two UTF-8 strings that represent them: \xC4\x80 and
\xC4\x81. There were a bunch of conditionals in the old way to
distinguish between the UTF-8 case from non-UTF-8, and to compare each one separately.
The mask and string returned in the new way uses a single compare and
mask, dispensing with all this.

The old way was optimized for ASCII characters, and the new way doesn't
improve that for the ones who don't have matches outside the ASCII range. But almost have
versions. But a significant number do have matches outside of ASCII. For these, the old code gave
up, and the regexec.c caller had to resort to brute force matching each time,
whereas the new one always returns something that can quickly rule out
many possibilities. An example is

qr/.*\N{LATIN SMALL LETTER K}/i

'K', 'k', and the KELVIN SIGN all match. Here is a detailed analysis of
this case to show the complexities involved.

This pattern is too complicated for the code prior to this commit to
deal with, so returned to the caller that it would need to check every eventuality.
The new code returns a single byte mask 0x56, and corresponding string 0x42.
There are 4 zero bits in this mask. That means that the code can
immediately reject all but 16 of the possible bytes that could follow
it. The ones it accepts are:

0x82 83 8A 8B
  A2 A3 AA AB
  C2 C3 CA CB
  E2 E3 EA EB

If the target string isn't UTF-8 encoded, this excludes 240 of the 256 possible
bytes, all non-ASCII, and 4 of which are extremely rare control
characters, a false positive rate of 94% or 97%, depending on if you
count those controls. If you know you only have ASCII inputs, it's a
0% false positive rate.

If the target string is UTF-8 encoded, the first 8 bytes in the list
above are illegal start bytes. The 4 Cx bytes are start bytes that begin 2-byte UTF-8 characters.
Each of these start bytes has 64 possible continuation
bytes, so any of 256 characters could match, out of the possible 1792
legal 2-UTF-8-byte characters, a 14% false positive rate

The 4 Ex bytes are the beginning of 3-byte representations, each of
which has 64-squared possible continuation bytes, so any of 16384
characters could match, out the possible 63488 legal 3-UTF-8-byte
characters, a 26% false positive rate.

All 4-byte, 5-byte, etc., representations are excluded, so the total
false positive rate approaches 0% if we include all possible UTF-8
sequences. If we confine the input domain to just the million legal
Unicode code points, the false positive rate is 1%, but, except for
emoji, most characters in any sort of common use are represented by 3 or
fewer bytes, and the false positive rate is about 25%, but only the
characters common to multiple scripts or Latin ones in that domain
are likely to occur in text with 'k' and 'K'. There are 4758 of these
in the current version of the Unicode standard, of which this could
potentially match 3500, or 76%, if all were equally likely to occur
immediately after one of these 'K'-characters. One could do further
restrictions, such as using only modern, non-technical characters to get
wildly differing false positive rates.

But whatever the false positive rate, it is better than the brute force
method required prior to this commit.

Finally, perhaps the worst case fold in Unicode is that of
\N{GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA}. There are 6
characters that match this under /i. As a result, the mask and returned
string have lots of false positives. But they still rule out all
characters that aren't represented by 2 or 3 bytes, and any 2 byte
characters that begin with D0-DF.

Module:

Description

Steps to Reproduce

Expected behavior

Perl configuration

# perl -V output goes here
khwilliamson added a commit that referenced this issue Dec 20, 2020
Consider the pattern /A*B/ where A and B are arbitrary.  The pattern
matching code tries to make a tight loop to match the span of A's.  The
logic of this was not really updated when UTF-8 was added.  I did
revamp it some releases ago to fix some bugs and to at least consider
UTF-8.

This commit changes it so that Unicode is now a first class citizen.
Some details are listed in the ticket GH #18414
@khwilliamson
Copy link
Contributor Author

Commit bb38256 addresses this.

@hvds
Copy link
Contributor

hvds commented Mar 23, 2022

Thanks for this @khwilliamson.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants