Speed up /[[:posix:]]/a UTF-8 matching
Before this commit, when a Posix class (not the negation) was matched against a UTF-8 string, the match ignored the UTF-8ness, because it is irrelevant to the success or failure of the match. However, I now realize it is relevant to the speed of the match, hence this commit.

In particular, when scanning a string for the next occurrence of, say, an ASCII word character, the old code tested every byte, including UTF-8 continuation bytes, even though a continuation byte can never match an ASCII-range class. Each such test is relatively efficient, because it is an array lookup, but it is work that can be avoided by skipping continuation bytes. However, to skip continuation bytes, we have to look up how many to skip, which is extra work that the previous version avoided. So, if the string is entirely ASCII (but with the UTF-8 flag set), this will be a loser, as you still have to examine every byte anyway. The more continuation bytes get skipped, the more of a win this will be. (A minimal sketch of the two scanning loops is appended at the end of this message.)

Below are some measurements from Porting/bench.pl on a 64-bit Linux g++ -O2 system. The numbers are for very long strings, as otherwise the delta due solely to this change is masked by the overhead around pattern matching in general.

    All 1-byte characters
          improvement ratio
    Ir        100.0
    Dr         80.0
    Dw        100.0
    COND      100.0

    All 2-byte characters
          improvement ratio
    Ir        200.0
    Dr        114.3
    Dw        200.0
    COND      200.0

    All 3-byte characters
          improvement ratio
    Ir        300.0
    Dr        171.4
    Dw        300.0
    COND      300.0

    All 4-byte characters
          improvement ratio
    Ir        400.0
    Dr        228.6
    Dw        400.0
    COND      400.0

This means that if a string consists entirely of ASCII (1-byte) characters, this patch worsens the number of data read instructions by 20%, while the other measurements are unchanged. But if the target string consists entirely of multi-byte characters, the performance in all the other criteria goes up by much larger amounts. The most important measurement here is COND. One can extrapolate that if the string contains even just a few multi-byte characters, COND improves over the previous behavior, and hence this patch is worth doing. (A back-of-the-envelope version of this extrapolation is also appended below.)

It's unclear what mix of character lengths the affected strings will typically have. If one is processing text with a large number of bytes per character, why would one be using /a? But if everything is just ASCII, why is the UTF-8 flag set? So my best guess is that the strings this patch will affect are mostly, but not entirely, ASCII, and the number of conditional branches goes down in that case.

bench.pl returns other measurements, which I omitted above because they either have unchanged performance or involve a trivial number of total instructions.
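Appendix: to make the trade-off concrete, here is a minimal standalone C sketch of the two scanning strategies. This is my illustration, not the actual regex-engine code: is_word, scan_bytewise, scan_charwise, and utf8_skip are hypothetical names, with utf8_skip standing in for Perl's UTF8SKIP() macro.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* 256-entry class table, standing in for the regex engine's
     * bitmap: nonzero for the ASCII word characters [A-Za-z0-9_].
     * Continuation bytes (0x80-0xBF) are never in the table. */
    static unsigned char is_word[256];

    static void
    init_table(void)
    {
        int i;
        for (i = 0; i < 256; i++)
            is_word[i] = (i == '_')
                      || (i >= '0' && i <= '9')
                      || (i >= 'A' && i <= 'Z')
                      || (i >= 'a' && i <= 'z');
    }

    /* Old strategy: one table lookup per byte.  The lookups on
     * continuation bytes can never succeed, so they are wasted. */
    static const unsigned char *
    scan_bytewise(const unsigned char *s, const unsigned char *end)
    {
        while (s < end) {
            if (is_word[*s])
                return s;
            s++;
        }
        return NULL;
    }

    /* UTF-8 sequence length from its start byte.  Assumes s always
     * points at a character boundary of well-formed UTF-8, so a
     * continuation byte never reaches this function. */
    static size_t
    utf8_skip(unsigned char b)
    {
        return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }

    /* New strategy: one table lookup per *character*, at the cost
     * of an extra length lookup to find the next start byte. */
    static const unsigned char *
    scan_charwise(const unsigned char *s, const unsigned char *end)
    {
        while (s < end) {
            if (is_word[*s])
                return s;
            s += utf8_skip(*s);
        }
        return NULL;
    }

    int
    main(void)
    {
        /* Two 2-byte characters, a space, then the word char 'x'.
         * Bytewise tests 6 bytes before finding it; charwise tests 4. */
        const unsigned char str[] = "\xC3\xA9\xC3\xA8 x";
        const unsigned char *end = str + strlen((const char *)str);
        const unsigned char *p1, *p2;

        init_table();
        p1 = scan_bytewise(str, end);
        p2 = scan_charwise(str, end);
        printf("bytewise found: %s\n", p1 ? (const char *)p1 : "(none)");
        printf("charwise found: %s\n", p2 ? (const char *)p2 : "(none)");
        return 0;
    }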
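And as a rough sanity check of the COND extrapolation (my arithmetic, not a bench.pl measurement): for a string of N characters in which a fraction f of the characters are 2 bytes long, the old loop performs about N(1 + f) class tests (one per byte), while the new loop performs about N (one per character), giving a COND improvement ratio of roughly

    100 * (1 + f)

That matches the measured 200.0 at f = 1 (all 2-byte characters) and the 100.0 at f = 0, and predicts about 110 for a string with only 10% 2-byte characters, i.e. a win even for mostly-ASCII strings.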