Use case
The Vectorscan and RE2 regex engines both support compiling patterns with certain flags that change the semantics around matching. For example, you can set HS_FLAG_DOTALL in Vectorscan and dot_nl in RE2 to force the . wildcard character to match all characters including newlines, which is especially useful when matching against raw bytes rather than single-line strings.
It would be great if regexp tree dictionaries supported such flags to the extent possible. Judging by the docs for Hyperscan and RE2, the most useful flags that are readily compatible with RegExpTreeDictionary's implementation are case-insensitive mode (HS_FLAG_CASELESS/case_sensitive) and dot-all mode (HS_FLAG_DOTALL/dot_nl). Others, such as UTF-8 mode, would require more work to support, since the OptimizedRegularExpression analyzer doesn't really know how to handle such cases.
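To make the two semantics concrete, here is a minimal sketch using Python's standard `re` module as a stand-in. It is neither Vectorscan nor RE2; the assumption is only that `re.DOTALL` and `re.IGNORECASE` behave analogously to the dot-all and caseless flags named above. The sample input is made up for illustration.

```python
import re

# Illustrative multi-line input (made up for this sketch).
data = "GET /index\nHost: Example.COM"

# Without flags, "." does not match a newline and letters are case-sensitive.
plain = re.compile(r"index.host")

# re.DOTALL plays the role of HS_FLAG_DOTALL / dot_nl,
# re.IGNORECASE plays the role of HS_FLAG_CASELESS / case-insensitive RE2.
flagged = re.compile(r"index.host", re.DOTALL | re.IGNORECASE)

plain_hit = plain.search(data) is not None
flagged_hit = flagged.search(data) is not None
print(plain_hit, flagged_hit)
```

The same pattern text matches or fails depending only on the compile-time flags, which is the behavior this request would like to control per dictionary.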
Describe the solution you'd like
The following dictionary-level settings for regexp tree dictionaries:
regexp_dict_flag_case_insensitive: Use case-insensitive matching
regexp_dict_flag_dotall: Allow . to match newlines
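A rough sketch of the intended semantics, again in Python's `re` rather than the real engines: the two proposed settings are read once and applied to every pattern at compile time. The `RegexpDictSettings` class and `compile_all` function are hypothetical names used only for illustration; only the two setting names come from this proposal.

```python
import re
from dataclasses import dataclass


@dataclass
class RegexpDictSettings:
    # Field names mirror the proposed settings; the class itself is hypothetical.
    regexp_dict_flag_case_insensitive: bool = False
    regexp_dict_flag_dotall: bool = False


def compile_all(patterns, settings):
    """Compile every pattern with the same dictionary-level flags."""
    flags = 0
    if settings.regexp_dict_flag_case_insensitive:
        flags |= re.IGNORECASE
    if settings.regexp_dict_flag_dotall:
        flags |= re.DOTALL
    return [re.compile(p, flags) for p in patterns]


patterns = [r"mozilla", r"begin.end"]
subject = "Mozilla/5.0 BEGIN\nEND"

enabled = RegexpDictSettings(
    regexp_dict_flag_case_insensitive=True,
    regexp_dict_flag_dotall=True,
)
matches = [bool(rx.search(subject)) for rx in compile_all(patterns, enabled)]
defaults = [bool(rx.search(subject)) for rx in compile_all(patterns, RegexpDictSettings())]
print(matches, defaults)
```

The pattern strings stay untouched; only the flags passed to the compiler change, so simple patterns would remain simple from the analyzer's point of view.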
Describe alternatives you've considered
Both Vectorscan and RE2 support the (?i) and (?-i) constructs to enable/disable case-insensitivity at a granular level. Similarly, RE2 supports the (?s) and (?-s) constructs to enable/disable dot-all mode. However:
(?s) doesn't work properly with the OptimizedRegularExpression analyzer and gets mangled.
(?i) does work with OptimizedRegularExpression, but it forces all such patterns to be classified as "complex" by RegExpTreeDictionary, meaning they'll be scanned one-by-one with RE2 instead of all at once with Vectorscan, even if the pattern is otherwise simple. This is slow.
If you want all your patterns to be case-insensitive and/or match '.' against newlines, it's annoying to have to prepend (?i) and/or (?s) to every one of them.
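For comparison, the per-pattern workaround described above could look like the following sketch. It uses Python's `re` as a stand-in, so it only illustrates the rewriting burden and the equivalent match results; it says nothing about how OptimizedRegularExpression treats the prefixes. The helper `with_inline_flags` is a hypothetical name, and it emits the combined form `(?is)`, which in Python is equivalent to `(?i)(?s)`.

```python
import re


def with_inline_flags(pattern, case_insensitive=True, dotall=True):
    """Hypothetical helper: prepend inline flags to a single pattern."""
    letters = ""
    if case_insensitive:
        letters += "i"
    if dotall:
        letters += "s"
    # "(?is)" is the combined spelling of "(?i)(?s)".
    return f"(?{letters}){pattern}" if letters else pattern


patterns = [r"curl/\d+", r"start.stop"]
subject = "CURL/8 START\nSTOP"

# Workaround: every pattern string has to be rewritten.
rewritten = [with_inline_flags(p) for p in patterns]
inline_hits = [bool(re.search(p, subject)) for p in rewritten]

# Flag-based alternative: pattern strings are left as they are.
flag_hits = [bool(re.search(p, subject, re.IGNORECASE | re.DOTALL)) for p in patterns]
print(rewritten, inline_hits, flag_hits)
```

Both approaches produce the same matches here; the difference is that the workaround changes every pattern string, which is what triggers the problems listed above.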