Improve robustness of numeric regex normalization #11

lbschanno · 2022-08-30T15:07:11Z

This PR improves functionality for encoding numeric regexes that are meant to match against numbers that were previously encoded via NumericalEncoder.

Requirements

The following requirements apply to all incoming regexes:
Patterns may not be blank.
Patterns may not contain whitespace.
Patterns must be compilable.
Patterns may not contain any letters other than \d.
Patterns may not contain any escaped characters other than \., \-, or \d.
Patterns may not contain any groups, e.g. (45.*).
Patterns may not contain any decimal points that are followed by ?, *, +, or a repetition quantifier such as {2}.
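
As a rough illustration of the requirements above, the checks could be approximated at the string level as shown below. This is only a sketch under stated assumptions: the class name NumericRegexRequirements and the method check are hypothetical, a decimal point is taken to mean the escaped form \., and the order of the checks is a guess. The PR itself describes validating a parsed regex tree rather than scanning characters.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of the PR: approximates the stated requirements.
final class NumericRegexRequirements {

    /** Returns null when the pattern passes every check, otherwise a short reason. */
    static String check(String regex) {
        if (regex == null || regex.trim().isEmpty()) {
            return "blank";
        }
        for (int i = 0; i < regex.length(); i++) {
            if (Character.isWhitespace(regex.charAt(i))) {
                return "whitespace";
            }
        }
        try {
            Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            return "not compilable";
        }
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                char next = i + 1 < regex.length() ? regex.charAt(i + 1) : '\0';
                // Only \. \- and \d are allowed escapes.
                if (next != '.' && next != '-' && next != 'd') {
                    return "unsupported escape";
                }
                // Assumption: a "decimal point" is the escaped form \.
                if (next == '.' && i + 2 < regex.length()
                        && "?*+{".indexOf(regex.charAt(i + 2)) >= 0) {
                    return "quantified decimal point";
                }
                i++; // skip the escaped character
            } else if (Character.isLetter(c)) {
                return "letter";
            } else if (c == '(' || c == ')') {
                return "group";
            }
        }
        return null;
    }
}
```

For example, check("1111[0-9]{5}") returns null, while check("(45.*)") reports a group.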

Supported Regex Features
The following regex features are supported, with any noted caveats.

Wildcards ..
Digit character class \d.
Character class lists []. CAVEAT: Digit characters only. Ranges are supported.
Zero or more quantifier *.
One or more quantifier +.
Repetition quantifier {x}, {x,}, and {x,y}. Ranges are supported.
Anchors ^ and $. CAVEAT: Technically not truly supported as they are ultimately removed during the pre-optimization process. However, using them will not result in an error.
Alternations |.

Additionally, in order to mark a regex pattern as intended to match negative numbers only, a minus sign should be placed at the beginning of the regex pattern, e.g. -34.*, or at the beginning of each desired alternated pattern.

Optimizations
Before the incoming regex is encoded, it undergoes the following modifications to simplify encoding:

Any empty alternations will be removed.
Any occurrences of the anchors ^ or $ will be removed. These will need to be added back into the returned encoded regex pattern afterwards if desired.
Optional variants (characters followed by ?) will be expanded into additional alternations, one with the character and one without it. This will not apply to any ? instances that directly follow a *, +, or {x}, as the ? in this case modifies the greediness of the matching rather than whether or not a character can be present.
Any characters immediately followed by the repetition quantifier {0} or {0,0} will be removed as they are expected to occur zero times. This does not apply to characters with the repetition quantifier {0,} or a variation of {0,x}.
Any patterns starting with .* or .+ will result in the addition of an alternation of the same pattern with a minus sign in front of it to ensure a variant for matching negative numbers is added. This does not apply to any regex patterns already starting with -.* or -.+.
In some cases a pattern may match both exactly zero and another number greater than one, e.g. the pattern [0-9].*. In this case, an alternation for the character 0 will be added (i.e. [0-9].*|0) to ensure that the ability to match zero is not lost when enriching the pattern with the required exponential bins to target the appropriate encoded numbers.
Pattern alternations will be de-duped.
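
A few of the modifications listed above (anchor removal, dropping empty alternations, adding a negative variant for leading .* or .+, and de-duplication) can be sketched on plain strings as follows. This is a simplified illustration only: the class NumericRegexPreOptimizer and its optimize method are hypothetical, the sketch only strips a leading ^ and a trailing $ from each alternative, and it assumes that | never appears inside a character class. The PR describes performing these steps on a parsed tree.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical string-level approximation of some pre-encoding steps.
final class NumericRegexPreOptimizer {

    static String optimize(String regex) {
        // LinkedHashSet de-duplicates alternatives while keeping their order.
        Set<String> alternatives = new LinkedHashSet<>();
        for (String alt : regex.split("\\|", -1)) {
            // Strip a leading ^ and a trailing $ anchor.
            if (alt.startsWith("^")) {
                alt = alt.substring(1);
            }
            if (alt.endsWith("$")) {
                alt = alt.substring(0, alt.length() - 1);
            }
            // Drop empty alternations.
            if (alt.isEmpty()) {
                continue;
            }
            alternatives.add(alt);
            // Leading .* or .+ also gets a negative variant.
            if (alt.startsWith(".*") || alt.startsWith(".+")) {
                alternatives.add("-" + alt);
            }
        }
        return String.join("|", alternatives);
    }
}
```

Under these assumptions, optimize(".*1111") yields .*1111|-.*1111, and optimize("12||12|34") yields 12|34.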

A strong effort has been made to make the resulting encoded patterns as accurate as possible, but given how numbers are encoded there is always a chance of some inaccuracy, particularly for numbers that differ only in the position of a decimal point. I tried to cover a wide breadth of regex patterns and edge cases, but I welcome any help finding inconsistencies that need to be corrected.

Additionally, I have allowed for the possibility of leading zeros when matching against numbers, e.g. 00054, since the class NumericalEncoder supports encoding numbers with these forms. As an example, currently .{3}54 is assumed to be able to match 00054. I expect a discussion with end-users will need to be had on whether this is the desired behavior going forward, or if a regex like .{3}54 must only match against numbers five digits long, such as 12354.
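
To make the leading-zero question concrete, the snippet below applies .{3}54 to unencoded text with plain java.util.regex. It does not involve NumericalEncoder or the encoded form at all; the class name LeadingZeroExample is hypothetical and exists only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: matches a pattern against raw, unencoded number text.
final class LeadingZeroExample {

    static boolean matches(String regex, String number) {
        // Pattern.matches requires the whole input to match.
        return Pattern.matches(regex, number);
    }

    public static void main(String[] args) {
        System.out.println(matches(".{3}54", "00054")); // true: three leading zeros fill .{3}
        System.out.println(matches(".{3}54", "12354")); // true: a genuine five-digit number
        System.out.println(matches(".{3}54", "54"));    // false: the raw text is too short
    }
}
```

The open question is whether the number 54, which can be written as 00054, should be matched by the encoded form of .{3}54, or only numbers such as 12354.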

A brief overview of the process whereby regexes are encoded via NumericRegexEncoder.encode(String regex):

Patterns are checked for any initial failure conditions that would preclude encoding.
Patterns are parsed to a Node tree structure via RegexParser. Different regex elements are represented by various subclasses of Node.
The regex tree undergoes validation, normalization, and finally encoding by being passed to a series of different Visitor classes that either check a condition or return an updated regex tree.
The regex tree is converted to a regex string and is returned.
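
The overall shape of that process (a tree passed through a series of steps that either check a condition or return an updated tree) might look roughly like the sketch below. Everything here is hypothetical: the Node class, its type tags, and the methods rejectGroups, removeAnchors, and run are stand-ins invented for illustration and are not the PR's Node, RegexParser, or Visitor classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of a tree-plus-visitors pipeline; not the PR's classes.
final class RegexPipelineSketch {

    /** Minimal stand-in for a regex tree node. */
    static final class Node {
        final String type; // e.g. "EXPRESSION", "ANCHOR", "LITERAL", "GROUP"
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Converts the tree back to a regex string. */
        String toRegex() {
            StringBuilder sb = new StringBuilder(text);
            for (Node child : children) {
                sb.append(child.toRegex());
            }
            return sb.toString();
        }
    }

    /** A step that only checks a condition and returns the tree unchanged. */
    static Node rejectGroups(Node node) {
        if (node.type.equals("GROUP")) {
            throw new IllegalArgumentException("groups are not supported");
        }
        for (Node child : node.children) {
            rejectGroups(child);
        }
        return node;
    }

    /** A step that returns an updated copy of the tree without anchors. */
    static Node removeAnchors(Node node) {
        Node copy = new Node(node.type, node.text);
        for (Node child : node.children) {
            if (!child.type.equals("ANCHOR")) {
                copy.add(removeAnchors(child));
            }
        }
        return copy;
    }

    /** Runs each step in order and converts the final tree to a string. */
    static String run(Node root, List<UnaryOperator<Node>> steps) {
        Node current = root;
        for (UnaryOperator<Node> step : steps) {
            current = step.apply(current);
        }
        return current.toRegex();
    }

    /** Builds a small tree for ^11.* by hand (no parser here) and runs the steps. */
    static String demo() {
        Node root = new Node("EXPRESSION", "")
                .add(new Node("ANCHOR", "^"))
                .add(new Node("LITERAL", "11"))
                .add(new Node("LITERAL", ".*"));
        List<UnaryOperator<Node>> steps = new ArrayList<>();
        steps.add(RegexPipelineSketch::rejectGroups);
        steps.add(RegexPipelineSketch::removeAnchors);
        return run(root, steps);
    }
}
```

Keeping each step as a function from tree to tree is one way to let validation and normalization steps be ordered and tested independently.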

Fixes #1565

Currently, normalization of numeric regex patterns is quite fragile and limited in the range of valid regex operations it can leverage. Add the ability to extract and encode numeric patterns within a regex pattern while preserving and supporting additional regex operations such as grouping, character lists, quantifiers, wildcards, and pipe ORs. Fixes #1565

ivakegg · 2022-10-26T13:54:07Z

Some regex forms that I am seeing:

1111.*
1111.*?
1111\d*
.*?1111
.*1111
.*1111.*
^11[23].*
1111[0-9]{5}
.*1111\..*

Note that the ? is redundant after .* but apparently that is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
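
The redundancy of the trailing ? can be checked with plain java.util.regex when the whole input must match: the lazy and greedy forms accept exactly the same strings. This is a small sketch on unencoded text; the class LazyQuantifierExample and the method sameFullMatch are hypothetical names used only here.

```java
import java.util.regex.Pattern;

// Hypothetical demo: compares full-match results of two patterns on one input.
final class LazyQuantifierExample {

    static boolean sameFullMatch(String first, String second, String input) {
        // Pattern.matches requires the entire input to match, so greediness
        // changes how the match is found but not whether one exists.
        return Pattern.matches(first, input) == Pattern.matches(second, input);
    }

    public static void main(String[] args) {
        for (String input : new String[] {"1111", "11112", "1111.5", "2111"}) {
            System.out.println(input + " -> " + sameFullMatch("1111.*", "1111.*?", input));
        }
    }
}
```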

lbschanno requested review from ivakegg, keith-ratcliffe and jwomeara August 30, 2022 15:07

Fix code formatting

afec108

lbschanno mentioned this pull request Aug 30, 2022

Numeric regex normalization could be made more robust NationalSecurityAgency/datawave#1565

Closed

lbschanno added 4 commits August 31, 2022 13:51

Add tests for \d character

fa0a959

Fix handling . wildcards

c681a8b

Code cleanup

6805ab4

Add test for permutations with multiple decimal points

e920401

ivakegg reviewed Sep 8, 2022

src/main/java/datawave/data/type/util/NumericalEncoder.java

src/main/java/datawave/data/type/util/NumericalEncoder.java

src/main/java/datawave/data/normalizer/NumericRegexNormalizer.java

ivakegg reviewed Sep 8, 2022

src/test/java/datawave/data/normalizer/NumericRegexNormalizerTest.java

lbschanno added 3 commits October 18, 2022 09:02

Fix bug with implementing trailing wildcards

e37afa9

Add fidelity tests and make note of a case where fidelity fails

cf0e4cb

Add more fidelity tests

5e3d7a4

ivakegg requested changes Oct 19, 2022

src/main/java/datawave/data/type/util/NumericalEncoder.java

src/test/java/datawave/data/normalizer/NumericRegexNormalizerTest.java

lbschanno added 3 commits December 8, 2022 15:56

Generate the exponents programmatically without the need for a hardco…

3c28641

…ded list

Fix code formatting

1587329

Refactor numeric regex normalization API

a13a87c

Refactor the approach for normalizing numeric regex to use a tree-based parsing approach for validation, optimization, and encoding.

lbschanno marked this pull request as ready for review August 15, 2023 10:47

lbschanno added 6 commits August 15, 2023 11:26

Merge branch 'main' into task/numeric-regex

501ac77

Remove sysout prints

6b7a073

Remove self-closing <p/> tags

c1b8881

Delete unused class and methods

2198541

Move visitor classes to visitor package

91e5b00

Improve documentation for NodeListIterator

431d79d

ivakegg reviewed Sep 6, 2023

src/main/java/datawave/data/normalizer/NumberNormalizer.java

lbschanno and others added 2 commits September 6, 2023 12:51

Make NumberNormalizer.normalizeRegex() exception message more descrip…

8ad38cd

…tive Co-authored-by: Ivan Bella <ivan@bella.name>

Added a random number/pattern generation to flesh out some edge cases

1f218fc

lbschanno and others added 8 commits September 12, 2023 08:40

Do not trim zeros from character classes that may match zero

2414a52

Merge branch 'main' into task/numeric-regex

08afde4

Added some additional random features to the numeric encoder test

0f4eed1

Numerous bug fixes and requirement tightening

72ac1a9

Added stopwatches to include operational time costs in logs

a334f43

Merge branch 'main' into task/numeric-regex

c3de572

Fix issues with encoding negative patterns

a890b48

Fix issues with merge from main

bbd1fae

ivakegg previously approved these changes Dec 16, 2023

Delete weird artifact

0c8f7f0

lbschanno dismissed ivakegg’s stale review via 0c8f7f0 June 3, 2024 22:48

Merge branch 'main' into task/numeric-regex

dcc6dd4

ivakegg approved these changes Jun 21, 2024

avgAGB approved these changes Jun 24, 2024

apmoriarty approved these changes Jun 24, 2024

lbschanno mentioned this pull request Jun 25, 2024

Make #MATCHES_IN_GROUP work with a number NationalSecurityAgency/datawave#2434

Merged

ivakegg merged commit fe9efe5 into main Jun 26, 2024
2 checks passed

ivakegg deleted the task/numeric-regex branch June 26, 2024 12:10
