Support case-insensitive and dot-all compilation flags with RegExpTree dictionaries #50905

Closed
johanngan opened this issue Jun 12, 2023 · 0 comments · Fixed by #50906

Use case

The Vectorscan and RE2 regex engines both support compiling patterns with certain flags that change the semantics around matching. For example, you can set HS_FLAG_DOTALL in Vectorscan and dot_nl in RE2 to force the . wildcard character to match all characters including newlines, which is especially useful when matching against raw bytes rather than single-line strings.
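To make the two semantics concrete, here is a minimal sketch using Python's built-in `re` module as a stand-in. This is an analogy only: `re.DOTALL` and `re.IGNORECASE` are Python's flags, not the Vectorscan or RE2 ones, and the sample bytes are made up for illustration.

```python
import re

# Hypothetical raw-bytes payload containing an embedded newline.
data = b"begin\nend"

# Default semantics: '.' does not match a newline, so the pattern
# cannot span the two lines.
default_match = re.search(rb"begin.end", data)

# Dot-all semantics: '.' matches every byte, including the newline.
dotall_match = re.search(rb"begin.end", data, flags=re.DOTALL)

# Case-insensitive semantics: 'BEGIN' matches 'begin'.
caseless_match = re.search(rb"BEGIN", data, flags=re.IGNORECASE)

print(default_match is None)
print(dotall_match is not None)
print(caseless_match is not None)
```

The pattern itself is unchanged in all three calls; only the compilation flags differ, which is the behavior the requested dictionary-level settings would expose.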

It would be great if regexp tree dictionaries supported such flags to the extent possible. Judging by the docs for Hyperscan and RE2, the most useful flags that are readily compatible with RegExpTreeDictionary's implementation are case-insensitive mode (HS_FLAG_CASELESS/case_sensitive) and dot-all mode (HS_FLAG_DOTALL/dot_nl). Others, such as UTF-8 mode, would require more work to support, since the OptimizedRegularExpression analyzer doesn't really know how to handle such cases.

Describe the solution you'd like

The following dictionary-level settings for regexp tree dictionaries:

  • regexp_dict_flag_case_insensitive: Use case-insensitive matching
  • regexp_dict_flag_dotall: Allow . to match newlines

Describe alternatives you've considered

Both Vectorscan and RE2 support the (?i) and (?-i) constructs to enable/disable case-insensitivity at a granular level. Similarly, RE2 supports the (?s) and (?-s) constructs to enable/disable dot-all mode. However:

  • (?s) doesn't work properly with the OptimizedRegularExpression analyzer and gets mangled.
  • (?i) does work with OptimizedRegularExpression, but it forces all such patterns to be classified as "complex" by RegExpTreeDictionary, meaning they'll be scanned one-by-one with RE2 instead of all at once with Vectorscan, even if the pattern is otherwise simple. This is slow.
  • If you want all your patterns to be case-insensitive and/or match '.' against newlines, it's annoying to have to prepend (?i) and/or (?s) to every one of them.
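
The difference between the per-pattern workaround and a single global setting can be sketched as follows, again with Python's `re` module as a stand-in rather than Vectorscan, RE2, or ClickHouse itself. The helper `with_inline_flags`, the pattern list, and the subject string are all hypothetical, and the sketch does not reproduce the OptimizedRegularExpression issues described above; it only shows the ergonomic gap.

```python
import re

# Hypothetical pattern set, standing in for the regexps of a dictionary.
patterns = [r"foo.bar", r"hello\s+world"]


def with_inline_flags(pattern, case_insensitive=True, dotall=True):
    """Prepend (?i) and/or (?s) to a single pattern (the manual workaround)."""
    flags = ("i" if case_insensitive else "") + ("s" if dotall else "")
    return f"(?{flags}){pattern}" if flags else pattern


# Workaround: every pattern has to be rewritten before compilation.
inline = [re.compile(with_inline_flags(p)) for p in patterns]

# Requested behavior: patterns stay untouched, flags are set once.
global_flags = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in patterns]

subject = "FOO\nBAR and Hello   World"
inline_hits = [bool(r.search(subject)) for r in inline]
global_hits = [bool(r.search(subject)) for r in global_flags]

print(inline_hits == global_hits)
```

Both approaches match the same inputs here, but only the second leaves the stored patterns free of flag prefixes.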