Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Rewrite some regular expressions RLIKE cases to faster expressions #10741

Open
3 tasks
thirtiseven opened this issue Apr 25, 2024 · 3 comments
Open
3 tasks
Labels
feature request New feature or request performance A performance related task/issue

Comments

@thirtiseven
Copy link
Collaborator

thirtiseven commented Apr 25, 2024

Is your feature request related to a problem? Please describe.

RLIKE, REGEXP or REGEXP_LIKE are widely used but very expensive, and many use cases are quite simple and similar, like ^abc(.*), (.*)abc$, pattern, and ^pattern$.

These pattern cases can be replaced with faster expressions like GpuStartsWith, GpuEndsWith or GpuContains when overriding.

Some commonly used patterns are even worth to writing a custom kernel to match, such as pattern[0-9]{3,4} (some digits followed by a string) and [\u4e00-\u9fa5]+ (any Chinese character).

Describe the solution you'd like
We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.

Here is a list of planned/possible tasks:

@thirtiseven thirtiseven added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue labels Apr 25, 2024
@revans2
Copy link
Collaborator

revans2 commented Apr 25, 2024

Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.

I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.

@thirtiseven
Copy link
Collaborator Author

Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.

pattern[0-9]{3,4} is a misleading example, I am working on a kernel that will match pattern[0-9]{X, Y} (so it will also match [0-9]{X, Y} by setting pattern = "").

I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.

Yes, this kernel can be more general so these two cases can be combined. Thanks for catching this!

@revans2
Copy link
Collaborator

revans2 commented Apr 26, 2024

I still don't see a lot of generality for STATIC_STRING[RANGE]{X, Y}.

@NVIDIA NVIDIA deleted a comment from thirtiseven Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request performance A performance related task/issue
Projects
None yet
Development

No branches or pull requests

3 participants