
Speed up /[[:posix:]]/a UTF-8 matching

Before this commit, when a POSIX class (not its negation) was matched
against a UTF-8 string, the match ignored the UTF-8ness, because it's
irrelevant to the success or failure of the match.  However, I now
realize it is relevant to the speed of the match, hence this commit.

In particular, when scanning a string for the next occurrence of, say,
an ASCII word character, the code tested every byte, including every
continuation byte.  Each test is relatively cheap, because it is an
array lookup, but it's work that would be avoided if we skipped
continuation bytes.

However, to skip continuation bytes, we have to look up how many to
skip.  This is extra work that the previous version avoided.  So, if the
string is entirely ASCII (but with the UTF-8 bit set), this will be a
loser, as every byte still has to be examined anyway.  The more
continuation bytes get skipped, the more of a win this will be.  Below
are some measurements from Porting/bench.pl on a 64-bit Linux g++ -O2
system.  Each figure is an improvement ratio in percent: 100.0 means
unchanged, and higher is better.  The numbers are for very long strings,
as otherwise the delta due solely to this change is masked by the
overhead around pattern matching in general.

All 1-byte characters   improvement ratio
                   Ir   100.0
                   Dr    80.0
                   Dw   100.0
                 COND   100.0

All 2-byte characters   improvement ratio
                   Ir   200.0
                   Dr   114.3
                   Dw   200.0
                 COND   200.0

All 3-byte characters   improvement ratio
                   Ir   300.0
                   Dr   171.4
                   Dw   300.0
                 COND   300.0

All 4-byte characters   improvement ratio
                   Ir   400.0
                   Dr   228.6
                   Dw   400.0
                 COND   400.0

This means that if a string consists entirely of ASCII (1-byte chars),
this patch lowers the data-read (Dr) ratio by 20%, while the other
measurements are unchanged.  But if the target string consists entirely
of multi-byte characters, every measurement improves, and by much larger
amounts.

The most important measurement here is COND, the number of conditional
branches.  One can extrapolate that if the string contains even just a
few multi-byte characters, COND improves over the previous behavior, and
hence this patch is worth doing.

It's unclear what the average number of bytes per character in a string
might be.  If one is processing text with a large number of bytes per
character, why would one be using /a?  But if everything is just ASCII,
why is the UTF-8 bit set?  So my best guess is that the strings that
this patch will affect are mostly, but not entirely, ASCII, and the
number of conditional branches goes down in this case.

bench.pl returns other measurements which I omitted above, because they
either have unchanged performance or involve a trivial number of total
instructions.
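
The trade-off described in this message can be sketched in a few lines of standalone C.  This is only an illustration under stated assumptions, not the code in regexec.c: is_ascii_word(), utf8_char_len(), scan_bytewise() and scan_charwise() are hypothetical names, the class test is a hand-written stand-in for the real array lookup, and the skip length is computed with comparisons rather than looked up in a table.  Each scanner counts how many class tests it performs, so the two approaches can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ASCII-only class test such as \w under /a.
 * Any byte >= 0x80 fails, so continuation bytes can never match. */
static int is_ascii_word(unsigned char b)
{
    return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
        || (b >= '0' && b <= '9') || b == '_';
}

/* Length in bytes of the UTF-8 character whose first byte is b.
 * Assumes well-formed input; a stray or invalid byte advances by one. */
static size_t utf8_char_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Byte-wise scan: tests every byte, continuation bytes included.
 * Returns the offset of the first match, or -1 if there is none. */
static ptrdiff_t scan_bytewise(const unsigned char *s, size_t len,
                               size_t *tests)
{
    assert(s != NULL && tests != NULL);
    for (size_t i = 0; i < len; i++) {
        (*tests)++;
        if (is_ascii_word(s[i]))
            return (ptrdiff_t)i;
    }
    return -1;
}

/* Character-wise scan: tests only the first byte of each character,
 * then skips the rest of it.  The skip length is the extra work. */
static ptrdiff_t scan_charwise(const unsigned char *s, size_t len,
                               size_t *tests)
{
    assert(s != NULL && tests != NULL);
    size_t i = 0;
    while (i < len) {
        (*tests)++;
        if (is_ascii_word(s[i]))
            return (ptrdiff_t)i;
        i += utf8_char_len(s[i]);
    }
    return -1;
}
```

For the 7-byte input "\xC3\xA9\xC3\xA9\xC3\xA9x" (three 2-byte characters followed by "x"), both scanners return offset 6, but the byte-wise one performs 7 class tests and the character-wise one performs 4.  For an all-ASCII input such as three spaces followed by "x", both perform 4 tests, and the character-wise scan additionally pays for computing each skip length.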
khwilliamson committed Dec 29, 2017
1 parent 9bbdd1c commit 156701f3132dee0163dcd26ce187a2509d417e1d
Showing with 10 additions and 3 deletions.
  1. +10 −3 regexec.c
@@ -2412,12 +2412,19 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
         }
 
         to_complement = 1;
-        /* FALLTHROUGH */
+        goto posixa;
 
     case POSIXA:
-      posixa:
         /* Don't need to worry about utf8, as it can match only a single
-         * byte invariant character. */
+         * byte invariant character. But we do anyway for performance reasons,
+         * as otherwise we would have to examine all the continuation
+         * characters */
+        if (utf8_target) {
+            REXEC_FBC_UTF8_CLASS_SCAN(_generic_isCC_A(*s, FLAGS(c)));
+            break;
+        }
+
+      posixa:
         REXEC_FBC_CLASS_SCAN(
                         to_complement ^ cBOOL(_generic_isCC_A(*s, FLAGS(c))));
         break;
