Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex optimizations #940

Closed
4 of 6 tasks
ronawho opened this issue Oct 4, 2021 · 5 comments
Closed
4 of 6 tasks

Regex optimizations #940

ronawho opened this issue Oct 4, 2021 · 5 comments
Labels
performance Performance needs improving

Comments

@ronawho
Copy link
Contributor

ronawho commented Oct 4, 2021

While working on #917 and #930 I came up with some regular expression optimizations and simplifications I wanted to capture.

  • We're allocating too much -- there's allocations to interpret the segmented array as a string/bytes, we're working on optimizing these out (optimized with #935, chapel-lang/chapel#18465, #931, chapel-lang/chapel#18733)
  • Is utf8 matching required? If not we could interpret as bytes? If we can't intepret as bytes, I wonder if we could add an argument to createStringWith*Buffer to skip utf8-validation (offline I believer we decided utf8 is required, and in Chapel the code that validates utf-8 also calculates the number of codepoints, so skipping validation really doesn't save any time. I also don't believe this is particularly slow anyways.)
  • capturing an iterator into a variable captures that into an array today. So for instances var matches = myRegex.matches(...) allocates an array to store the result into, which adds unnecessary overhead. I wonder if we can switch the current uses to just iterate instead of capturing.
  • Can we simplify contains/startsWith/endsWith/match? Today each operation is implemented pretty differently (and the endsWith implementation ends up being slower because it captures an iterator into an array.) I think we can instead only implement search on the server side and have the client add the appropriate ^ and $ markers. e.g.
    use Regex;
    
    var contains   = compile(" string");
    var endsWith   = compile("string$");
    var startsWith = compile("^my");
    var match      = compile("^my string$");
    
    writeln(  contains.search("my string")); // (matched = true, offset = 2, size = 7)
    writeln(  endsWith.search("my string")); // (matched = true, offset = 3, size = 6)
    writeln(startsWith.search("my string")); // (matched = true, offset = 0, size = 2)
    writeln(     match.search("my string")); // (matched = true, offset = 0, size = 9)
  • I believe peel/stick still use unorderedCopy, which will be slow on networks with slow small messages rates
  • TODO need to better understand peel/stick/flatten
@ronawho
Copy link
Contributor Author

ronawho commented Oct 4, 2021

@pierce314159 particularly interested in your opinion on the "Can we simplify contains/startsWith/endsWith/match" bullet

@stress-tess
Copy link
Member

@pierce314159 particularly interested in your opinion on the "Can we simplify contains/startsWith/endsWith/match" bullet

yeah I think this should work! @glitch and I talked about something similar when I was first implementing the substring search methods, but I stayed away from it at first because I was unsure how to handle cases where ^/$ were already present in the regex pattern. After playing around with it, it seems that doubling up on these special characters doesn't change anything

@ronawho
Copy link
Contributor Author

ronawho commented Jan 19, 2022

@pierce314159 @glitch I'm updating issues that were impacted by the 1.25.1 upgrade. Any new thoughts on simplifying contains/startsWith/endsWith/match? Simplifying those would also remove the last captured regex iterator, which would take care of bullet 3 as well.

@stress-tess
Copy link
Member

stress-tess commented Jan 19, 2022

yeah I think this is a good idea! I'll see if I can get a PR for this up today

stress-tess pushed a commit to stress-tess/arkouda that referenced this issue Jan 19, 2022
mhmerrill pushed a commit that referenced this issue Jan 19, 2022
…se Regex.search (#1030)

Co-authored-by: Pierce Hayes <pierce314159@users.noreply.github.com>
@ronawho
Copy link
Contributor Author

ronawho commented Jan 20, 2022

Most of the core regex operations have been heavily optimized. The last remaining cases that I know of are peel/stick, but I believe the next step there is converting them to aggregation which I captured into a new and more specific issue -- #1039

@ronawho ronawho closed this as completed Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance needs improving
Projects
None yet
Development

No branches or pull requests

2 participants