Add case insensitive and dot-all modes to RegExpTree dictionary #50906
The new per-dictionary settings control regex match semantics around case sensitivity and the '.' wildcard with newlines. They must be set at the dictionary level since they're applied to regex engines at pattern-compile-time.
- regexp_dict_flag_case_insensitive: case insensitive matching
- regexp_dict_flag_dotall: '.' matches all characters including newlines
They correspond to HS_FLAG_CASELESS and HS_FLAG_DOTALL in Vectorscan and case_sensitive and dot_nl in RE2. These are the most useful options compatible with the internal behavior of RegExpTreeDictionary around splitting up simple and complex patterns between Vectorscan and RE2.
The alternative is to use (?i) and/or (?s) for all patterns. However, (?s) isn't handled properly by OptimizedRegularExpression::analyze(). And while (?i) is, it still causes the dictionary to treat the pattern as "complex" for sequential scanning with RE2 rather than multi-matching with Vectorscan, even though Vectorscan supports case insensitive literal matching. Setting dictionary-wide flags is both more convenient, and circumvents these problems.
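To make the two match modes concrete, here is a minimal sketch of the semantics using Python's standard re module as a stand-in. It is not the dictionary's code and does not touch Vectorscan or RE2. The helper first_match, its keyword arguments, and the sample text are hypothetical names chosen for illustration. re.IGNORECASE and re.DOTALL play the role of the two dictionary-wide flags and, like them, are applied when the pattern is compiled.

```python
import re

# Hypothetical sample input: two lines, with the device name in upper case.
TEXT = "User-Agent: Mozilla\nX-Device: ANDROID tablet"


def first_match(pattern, text, *, case_insensitive=False, dotall=False):
    """Return the first matched substring, or None if nothing matches.

    The two keyword arguments mimic dictionary-wide settings: they are
    turned into compile-time flags instead of being written in the pattern.
    """
    flags = 0
    if case_insensitive:
        flags |= re.IGNORECASE  # analogous in spirit to case insensitive mode
    if dotall:
        flags |= re.DOTALL      # '.' also matches newline characters
    match = re.compile(pattern, flags).search(text)
    return match.group(0) if match else None


# Default semantics: case-sensitive, and '.' stops at a newline.
default_case = first_match(r"android", TEXT)
default_span = first_match(r"Mozilla.*tablet", TEXT)

# Flags supplied at compile time, outside the pattern text.
caseless = first_match(r"android", TEXT, case_insensitive=True)
spanning = first_match(r"Mozilla.*tablet", TEXT, dotall=True)

# The inline alternative: (?i) and (?s) written into the pattern itself.
inline = first_match(r"(?is)mozilla.*TABLET", TEXT)

print(default_case, default_span, caseless, repr(spanning), repr(inline))
```

With the flags kept outside the pattern, the pattern text stays a plain literal or simple expression. That is the property the description relies on for keeping patterns eligible for multi-matching.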
Branch updated from 25383ca to bcb058f.
(No substantive changes, just updating with upstream. Still ready for review, pending automated test results.)
Sorry to bother you @hanfei1991, but it seems this PR may have been missed when I put it up 3 months ago (it hasn't even been cleared for testing). I'm sure you're busy, but if you have time, do you have any thoughts here? It seems like a relatively minimal change that could still be pretty useful in specific cases.
Fixes #50905.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support case-insensitive and dot-all matching modes in RegExpTree dictionaries.
Documentation entry for user-facing changes