Improve robustness of numeric regex normalization #11

Merged
merged 30 commits into from
Jun 26, 2024


Collaborator

@lbschanno lbschanno commented Aug 30, 2022

This PR improves functionality for encoding numeric regexes that are meant to match against numbers that were previously encoded via NumericalEncoder.

Requirements

The following requirements apply to all incoming regexes:

  • Patterns may not be blank.
  • Patterns may not contain whitespace.
  • Patterns must be compilable.
  • Patterns may not contain any letters other than \d.
  • Patterns may not contain any escaped characters other than \., \-, or \d.
  • Patterns may not contain any groups, e.g. (45.*).
  • Patterns may not contain any decimal points that are followed by ?, *, +, or a repetition quantifier such as {2}.
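
As a rough illustration of the requirements above, the following Python sketch checks a pattern against each rule. It is not the project's implementation: the function name, the violation labels, and the use of Python's `re` module for the compilability check (the project compiles patterns in Java, whose syntax differs in places) are all assumptions made for this example.

```python
import re

# Hypothetical names; the PR does not describe how these checks are coded.
ALLOWED_ESCAPES = {".", "-", "d"}
QUANTIFIER_STARTS = ("?", "*", "+", "{")


def find_violations(pattern):
    """Return a sorted list of requirement violations (empty if none)."""
    if not pattern.strip():
        return ["blank"]
    found = set()
    if any(ch.isspace() for ch in pattern):
        found.add("whitespace")
    try:
        re.compile(pattern)  # stand-in for the Java compilability check
    except re.error:
        found.add("not compilable")
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            escaped = pattern[i + 1:i + 2]
            if escaped not in ALLOWED_ESCAPES:
                found.add("disallowed escape")
            elif escaped == "." and pattern[i + 2:i + 3] in QUANTIFIER_STARTS:
                found.add("quantified decimal point")
            i += 2  # skip the escape sequence so \d is not flagged as a letter
            continue
        if ch.isalpha():
            found.add("letter")
        elif ch in "()":
            found.add("group")
        i += 1
    return sorted(found)
```

With this sketch, `find_violations("-34.*")` reports no violations, while `find_violations("(45.*)")` reports the group.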

Supported Regex Features
The following regex features are supported, with any noted caveats.

  • The wildcard character, i.e. a single dot (.).
  • Digit character class \d.
  • Character class lists []. CAVEAT: Digit characters only. Ranges are supported.
  • Zero or more quantifier *.
  • One or more quantifier +.
  • Repetition quantifiers {x}, {x,}, and {x,y}. Ranges are supported.
  • Anchors ^ and $. CAVEAT: Technically not truly supported as they are ultimately removed during the pre-optimization process. However, using them will not result in an error.
  • Alternations |.

Additionally, in order to mark a regex pattern as intended to match negative numbers only, a minus sign should be placed at the beginning of the regex pattern, e.g. -34.*, or at the beginning of each desired alternated pattern.
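
To make the per-alternative minus sign convention concrete, here is a small Python sketch that sorts the alternatives of a pattern by whether they are marked as negative-only. The function name is hypothetical, and splitting on every `|` is only reasonable under the stated requirements (no groups, no escaped pipes, digit-only character classes).

```python
def split_by_sign(pattern):
    """Separate alternatives that start with '-' from the rest."""
    negative, other = [], []
    for alternative in pattern.split("|"):
        if alternative.startswith("-"):
            negative.append(alternative)
        else:
            other.append(alternative)
    return negative, other
```

For `-34.*|12.*|-5`, only `-34.*` and `-5` are treated as negative-only; `12.*` is not, because the marker has to be repeated on each alternative.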

Optimizations
Before the incoming regex is encoded, it undergoes the following modifications to make encoding easier:

  1. Any empty alternations will be removed.
  2. Any occurrences of the anchors ^ or $ will be removed. These will need to be added back into the returned encoded regex pattern afterwards if desired.
  3. Optional variants (characters followed by ?) will be expanded into additional alternations. This does not apply to any ? that directly follows a *, +, or {x}, since in that position the ? modifies the greediness of the match rather than whether or not a character can be present.
  4. Any characters immediately followed by the repetition quantifier {0} or {0,0} will be removed as they are expected to occur zero times. This does not apply to characters with the repetition quantifier {0,} or a variation of {0,x}.
  5. Any patterns starting with .* or .+ will result in the addition of an alternation of the same pattern with a minus sign in front of it to ensure a variant for matching negative numbers is added. This does not apply to any regex patterns already starting with -.* or -.+.
  6. In some cases a pattern may match both exactly zero and another number greater than one, e.g. the pattern [0-9].*. In this case, an alternation for the character 0 will be added (i.e. [0-9].*|0) to ensure that the ability to match zero is not lost when enriching the pattern with the required exponential bins to target the appropriate encoded numbers.
  7. Pattern alternations will be de-duped.
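
A few of the steps above (1, 2, 5, and 7) can be sketched on plain strings in Python. This is only an illustration under assumptions: the PR applies these steps to a parsed regex tree, the function name is made up, anchors are handled only at the start and end of each alternative, and the position of the added negative variant is a guess. Steps 3, 4, and 6 are left out.

```python
def pre_optimize(pattern):
    """Apply a simplified subset of the pre-encoding modifications."""
    result = []
    for alternative in pattern.split("|"):
        # Step 2: drop a leading ^ and a trailing $.
        if alternative.startswith("^"):
            alternative = alternative[1:]
        if alternative.endswith("$"):
            alternative = alternative[:-1]
        # Step 1: skip empty alternations.
        if not alternative:
            continue
        candidates = [alternative]
        # Step 5: add a negative variant for patterns starting with .* or .+
        if alternative.startswith((".*", ".+")):
            candidates.append("-" + alternative)
        # Step 7: de-duplicate while keeping first-seen order.
        for candidate in candidates:
            if candidate not in result:
                result.append(candidate)
    return "|".join(result)
```

For example, `pre_optimize(".*54")` gives `.*54|-.*54`, while `pre_optimize("-.*54")` is returned unchanged because it does not start with `.*` or `.+`.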

A strong effort has been made to make the resulting encoded patterns as accurate as possible, but some inaccuracy is always possible given how numbers are encoded, particularly for numbers that differ only in the location of a decimal point. I tried to cover a wide breadth of regex patterns and edge cases, but I welcome any help finding inconsistencies that need to be corrected.

Additionally, I have allowed for the possibility of leading zeros when matching against numbers, e.g. 00054, since the class NumericalEncoder supports encoding numbers with these forms. As an example, currently .{3}54 is assumed to be able to match 00054. I expect a discussion with end-users will need to be had on whether this is the desired behavior going forward, or if a regex like .{3}54 must only match against numbers five digits long, such as 12354.
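
The leading-zero question can be seen on the plain (unencoded) strings with Python's `re` module. This sketch only shows what the raw pattern matches as text; it says nothing about how the encoded pattern behaves, and the helper name is made up.

```python
import re

PATTERN = re.compile(r".{3}54")


def matches_whole(text):
    """True if the entire string matches .{3}54."""
    return PATTERN.fullmatch(text) is not None


# Both strings match as text, but they denote very different numbers.
leading_zero_value = int("00054")   # 54
five_digit_value = int("12354")     # 12354
```

Here `matches_whole("00054")` and `matches_whole("12354")` are both true, while `matches_whole("54")` is false, which is why the choice of whether 00054 should count as a match for 54 matters.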

A brief overview of the process whereby regexes are encoded via NumericRegexEncoder.encode(String regex):

  1. Patterns are checked for any initial failure conditions that would prevent them from being encoded.
  2. Patterns are parsed to a Node tree structure via RegexParser. Different regex elements are represented by various subclasses of Node.
  3. The regex tree undergoes validation, normalization, and finally encoding by being passed to a series of different Visitor classes that either check a condition or return an updated regex tree.
  4. The regex tree is converted to a regex string and is returned.
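
The tree-and-visitor flow described above can be sketched in Python as follows. All class and function names here are hypothetical stand-ins (the PR only names Node, RegexParser, and Visitor), the tree is built by hand instead of being parsed, and the actual numeric encoding pass is omitted.

```python
class Node:
    """Base class for regex tree elements."""

    def __init__(self, *children):
        self.children = list(children)

    def to_regex(self):
        return "".join(child.to_regex() for child in self.children)


class Expression(Node):
    """A sequence of regex elements."""


class Leaf(Node):
    def __init__(self, text):
        super().__init__()
        self.text = text

    def to_regex(self):
        return self.text


class Literal(Leaf):
    """Digits, wildcards, quantifiers, etc., kept as raw text in this sketch."""


class Anchor(Leaf):
    """A ^ or $ anchor."""


class Group(Node):
    def to_regex(self):
        return "(" + super().to_regex() + ")"


def reject_groups(node):
    """Validation pass: raise if the tree contains a group."""
    if isinstance(node, Group):
        raise ValueError("groups are not supported")
    for child in node.children:
        reject_groups(child)
    return node


def remove_anchors(node):
    """Normalization pass: return a copy of the tree without anchors."""
    if isinstance(node, Leaf):
        return node
    kept = [remove_anchors(c) for c in node.children if not isinstance(c, Anchor)]
    return type(node)(*kept)


def run_passes(tree, passes):
    """Feed the tree through each pass in order, then render it as a string."""
    for visit in passes:
        tree = visit(tree)
    return tree.to_regex()
```

Usage: `run_passes(Expression(Anchor("^"), Literal("12"), Literal(".*"), Anchor("$")), [reject_groups, remove_anchors])` returns `12.*`. Each pass either checks a condition or returns an updated tree, so new steps can be added without touching the others.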

Fixes #1565

Currently normalization of numeric regex patterns is quite fragile and
limited in its ability to leverage valid regex operations. Add the ability
to extract and encode numeric patterns within a regex pattern while
preserving and supporting additional regex operations such as grouping,
character lists, quantifiers, wildcards, and pipe ORs.

Fixes #1565
@ivakegg
Collaborator

ivakegg commented Oct 26, 2022

Some regex forms that I am seeing:

1111.*
1111.*?
1111\d*
.*?1111
.*1111
.*1111.*
^11[23].*
1111[0-9]{5}
.*1111\..*

Note that the ? is redundant after .* but apparently that is used often. Also note that a numeric range for a digit or set of digits is not uncommon. That might be a nice one to attack next.
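
The redundancy of the ? after .* can be checked with a quick Python sketch, assuming the pattern must match the whole value (as `re.fullmatch` does). The sample strings are made up for illustration.

```python
import re

SAMPLES = ["1111", "11112", "1111.5", "2111", "111"]

# Greedy and lazy forms accept the same strings when the whole value must match.
greedy_matches = [s for s in SAMPLES if re.fullmatch(r"1111.*", s)]
lazy_matches = [s for s in SAMPLES if re.fullmatch(r"1111.*?", s)]
```

Both lists contain the same three samples (`1111`, `11112`, and `1111.5`); the lazy modifier only changes how much a partial match would consume, not which full strings are accepted.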

Refactor the approach for normalizing numeric regex to use a tree-based
parsing approach for validation, optimization, and encoding.
@lbschanno lbschanno marked this pull request as ready for review August 15, 2023 10:47
ivakegg previously approved these changes Dec 16, 2023
@ivakegg ivakegg merged commit fe9efe5 into main Jun 26, 2024
2 checks passed
@ivakegg ivakegg deleted the task/numeric-regex branch June 26, 2024 12:10
4 participants