Improve robustness of numeric regex normalization #11
Merged
Currently, normalization of numeric regex patterns is quite fragile and supports only a limited set of valid regex operations. Add the ability to extract and encode numeric patterns within a regex pattern while preserving and supporting additional regex operations such as grouping, character lists, quantifiers, wildcards, and pipe ORs. Fixes #1565
ivakegg
reviewed
Sep 8, 2022
src/main/java/datawave/data/normalizer/NumericRegexNormalizer.java
ivakegg
reviewed
Sep 8, 2022
src/test/java/datawave/data/normalizer/NumericRegexNormalizerTest.java
ivakegg
requested changes
Oct 19, 2022
src/test/java/datawave/data/normalizer/NumericRegexNormalizerTest.java
Some regex forms that I am seeing:
Note that the ? is redundant after .* but apparently that is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
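To make the redundancy concrete, here is a small sketch using plain `java.util.regex` on unencoded strings. It assumes the `?` is read as the reluctant modifier and that patterns are applied as full matches; the class name `ReviewFormsSketch`, the helper `sameVerdict`, and the sample inputs are made up for illustration and are not part of this PR.

```java
import java.util.regex.Pattern;

/** Illustration only: plain java.util.regex behaviour, unrelated to the encoded form. */
final class ReviewFormsSketch {

    /** True when both patterns give the same full-match verdict for the input. */
    static boolean sameVerdict(String first, String second, String input) {
        return Pattern.matches(first, input) == Pattern.matches(second, input);
    }

    public static void main(String[] args) {
        String[] inputs = {"12", "1234", "12.5", "312"};
        for (String input : inputs) {
            // ".*?" is the reluctant form of ".*"; under a full match both accept the same strings.
            System.out.println(input + " -> " + sameVerdict("12.*", "12.*?", input)); // true for each
        }
        // A digit range inside a character class.
        System.out.println(Pattern.matches("1[3-6]5", "145")); // true
        System.out.println(Pattern.matches("1[3-6]5", "175")); // false
    }
}
```

Greedy versus reluctant matching only changes which substring is consumed first, so with a full match the set of accepted strings is the same.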
Refactor the approach for normalizing numeric regex to use a tree-based parsing approach for validation, optimization, and encoding.
ivakegg
reviewed
Sep 6, 2023
…tive Co-authored-by: Ivan Bella <ivan@bella.name>
ivakegg
previously approved these changes
Dec 16, 2023
ivakegg
approved these changes
Jun 21, 2024
avgAGB
approved these changes
Jun 24, 2024
apmoriarty
approved these changes
Jun 24, 2024
This PR improves functionality for encoding numeric regexes that are meant to match against numbers that were previously encoded via NumericalEncoder.
Requirements
- [...] `\d`.
- [...] `\.`, `\-`, or `\d`.
- [...] `(45.*)`.
- [...] `?`, `*`, `+`, or a repetition quantifier such as `{2}`.
Supported Regex Features
The following regex features are supported, with any noted caveats.
- Wildcard `.`.
- Digit character class `\d`.
- Character class `[]`. CAVEAT: Digit characters only. Ranges are supported.
- Zero-or-more quantifier `*`.
- One-or-more quantifier `+`.
- Repetition quantifiers `{x}`, `{x,}`, and `{x,y}`. Ranges are supported.
- Anchors `^` and `$`. CAVEAT: Technically not truly supported as they are ultimately removed during the pre-optimization process. However, using them will not result in an error.
- Alternation `|`.
Additionally, in order to mark a regex pattern as intended to match negative numbers only, a minus sign should be placed at the beginning of the regex pattern, e.g. `-34.*`, or at the beginning of each desired alternated pattern.
Optimizations
Before encoding the incoming regex, it will undergo the following modifications to make it easier to encode:
- Characters with the repetition quantifier `{0}` or `{0,0}` will be removed as they are expected to occur zero times. This does not apply to characters with the repetition quantifier `{0,}` or a variation of `{0,x}`.
- Regex patterns starting with `.*` or `.+` will result in the addition of an alternation of the same pattern with a minus sign in front of it to ensure a variant for matching negative numbers is added. This does not apply to any regex patterns already starting with `-.*` or `-.+`.
- [...] `[0-9].*`. In this case, an alternation for the character `0` will be added (i.e. `[0-9].*|0`) to ensure that the ability to match zero is not lost when enriching the pattern with the required exponential bins to target the appropriate encoded numbers.
A strong effort has been made to make the resulting encoded patterns as accurate as possible, but there is always a chance of some inaccuracy given how numbers are encoded, particularly for numbers that differ only in the location of a decimal point. I tried to cover a wide breadth of different regex patterns and edge cases, but I welcome any help finding inconsistencies that need to be corrected.
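Two of the pre-optimizations listed above (dropping `{0}`/`{0,0}` elements and adding a negative variant for patterns that start with `.*` or `.+`) can be approximated at the string level as follows. This is only a sketch: the PR works on a parsed tree rather than on raw strings, and the class `PreOptimizationSketch`, its method names, and the assumption that every `|` is a top-level alternation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** String-level approximation of two pre-optimizations; hypothetical helper, not the PR's code. */
final class PreOptimizationSketch {

    // One element (escaped char, character class, or single plain char) followed by {0} or {0,0}.
    private static final Pattern ZERO_TIMES =
            Pattern.compile("(?:\\\\.|\\[[^\\]]*\\]|[^\\\\\\[\\]{}()|])\\{0(?:,0)?\\}");

    /** Drops elements quantified with {0} or {0,0}; {0,} and {0,x} are left alone. */
    static String removeZeroRepetitions(String regex) {
        return ZERO_TIMES.matcher(regex).replaceAll("");
    }

    /** Adds a "-"-prefixed twin for each alternative that starts with .* or .+ */
    static String addNegativeVariants(String regex) {
        List<String> branches = new ArrayList<>();
        // Assumption: every '|' is a top-level alternation (none inside [] or groups).
        for (String branch : regex.split("\\|", -1)) {
            branches.add(branch);
            if (branch.startsWith(".*") || branch.startsWith(".+")) {
                branches.add("-" + branch);
            }
        }
        return String.join("|", branches);
    }

    public static void main(String[] args) {
        System.out.println(removeZeroRepetitions("12.{0}3"));   // 123
        System.out.println(removeZeroRepetitions("1{0,}2"));    // 1{0,}2
        System.out.println(addNegativeVariants(".*54"));        // .*54|-.*54
        System.out.println(addNegativeVariants("-.*54"));       // -.*54
    }
}
```

Patterns that already start with `-.*` or `-.+` fall through unchanged because the `startsWith` checks look only at the first two characters of each alternative.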
Additionally, I have allowed for the possibility of leading zeros when matching against numbers, e.g. `00054`, since the class NumericalEncoder supports encoding numbers in this form. As an example, `.{3}54` is currently assumed to be able to match `00054`. I expect a discussion with end-users will be needed on whether this is the desired behavior going forward, or whether a regex like `.{3}54` must only match against numbers five digits long, such as `12354`.
A brief overview of the process whereby regexes are encoded via `NumericRegexEncoder.encode(String regex)`:
- The regex is parsed into a `Node` tree structure via `RegexParser`. Different regex elements are represented by various subclasses of `Node`.
- The tree is then passed through `Visitor` classes that either check a condition or return an updated regex tree.
Fixes #1565
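The encoding overview above (parse into a tree, then run visitors that either check a condition or return an updated tree) might be sketched roughly as below. Everything here is a hypothetical, heavily simplified stand-in: the toy `parse` method handles only a flat run of characters and `^`/`$` anchors, and the node and visitor classes are invented for illustration rather than taken from the PR's `RegexParser`, `Node`, or `Visitor` types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, heavily simplified node/visitor layout; not the PR's actual classes. */
final class RegexTreeSketch {

    interface Visitor<T> {
        T visitSequence(SequenceNode node);
        T visitChar(CharNode node);
        T visitAnchor(AnchorNode node);
    }

    interface Node {
        <T> T accept(Visitor<T> visitor);
    }

    static final class SequenceNode implements Node {
        final List<Node> children;
        SequenceNode(List<Node> children) { this.children = children; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitSequence(this); }
    }

    static final class CharNode implements Node {
        final char value;
        CharNode(char value) { this.value = value; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitChar(this); }
    }

    static final class AnchorNode implements Node {
        final char value;
        AnchorNode(char value) { this.value = value; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitAnchor(this); }
    }

    /** Toy stand-in for a parser: handles only a flat run of characters and ^/$ anchors. */
    static Node parse(String regex) {
        List<Node> children = new ArrayList<>();
        for (char c : regex.toCharArray()) {
            if (c == '^' || c == '$') {
                children.add(new AnchorNode(c));
            } else {
                children.add(new CharNode(c));
            }
        }
        return new SequenceNode(children);
    }

    /** A visitor that checks a condition: does the tree contain any anchor? */
    static final class ContainsAnchor implements Visitor<Boolean> {
        public Boolean visitSequence(SequenceNode node) {
            for (Node child : node.children) {
                if (child.accept(this)) {
                    return true;
                }
            }
            return false;
        }
        public Boolean visitChar(CharNode node) { return false; }
        public Boolean visitAnchor(AnchorNode node) { return true; }
    }

    /** A visitor that returns an updated tree: a copy with anchors dropped. */
    static final class RemoveAnchors implements Visitor<Node> {
        public Node visitSequence(SequenceNode node) {
            List<Node> kept = new ArrayList<>();
            for (Node child : node.children) {
                Node updated = child.accept(this);
                if (updated != null) {
                    kept.add(updated);
                }
            }
            return new SequenceNode(kept);
        }
        public Node visitChar(CharNode node) { return node; }
        public Node visitAnchor(AnchorNode node) { return null; } // null marks "drop this node"
    }

    /** A visitor that renders the tree back to a regex string. */
    static final class ToRegex implements Visitor<String> {
        public String visitSequence(SequenceNode node) {
            StringBuilder sb = new StringBuilder();
            for (Node child : node.children) {
                sb.append(child.accept(this));
            }
            return sb.toString();
        }
        public String visitChar(CharNode node) { return String.valueOf(node.value); }
        public String visitAnchor(AnchorNode node) { return String.valueOf(node.value); }
    }

    public static void main(String[] args) {
        Node tree = parse("^34.*$");
        System.out.println(tree.accept(new ContainsAnchor()));               // true
        System.out.println(tree.accept(new RemoveAnchors()).accept(new ToRegex())); // 34.*
    }
}
```

Keeping each check or rewrite in its own visitor means validation, optimization, and encoding steps can be added or reordered without touching the node classes.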
classes that either check a condition or return an updated regex tree.Fixes #1565