You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
RLIKE, REGEXP or REGEXP_LIKE are widely used but very expensive, and many use cases are quite simple and similar, like ^abc(.*), (.*)abc$, pattern, and ^pattern$.
These pattern cases can be replaced with faster expressions like GpuStartsWith, GpuEndsWith or GpuContains when overriding.
Some commonly used patterns are even worth to writing a custom kernel to match, such as pattern[0-9]{3,4} (some digits followed by a string) and [\u4e00-\u9fa5]+ (any Chinese character).
Describe the solution you'd like
We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.
Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.
I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.
Actually pattern[0-9]{3,4} is a string followed by 3 or 4 digits. I don't know if we want to write a custom kernel for that just yet. I don't know how common something like that is, and I don't want to write a custom kernel for a single operation in a single customer query.
pattern[0-9]{3,4} is a misleading example, I am working on a kernel that will match pattern[0-9]{X, Y} (so it will also match [0-9]{X, Y} by setting pattern = "").
I would rather see us write a custom kernel for [A-B]{X,Y} where A and B are any character value and X and Y are any integer value, including null which would indicate unbounded. I see this as much more common/generic. That would let us match any Chinese character [\u4e00-\u9fa5]{1,} ({1,} is the same as +, one or more). It would also let us do things like 0 or more digits [0-9]{0,} ({0,} is the same as *, zero or more). Or any printable ASCII character. Or really any possible range of values.
Yes, this kernel can be more general so these two cases can be combined. Thanks for catching this!
Is your feature request related to a problem? Please describe.
RLIKE
,REGEXP
orREGEXP_LIKE
are widely used but very expensive, and many use cases are quite simple and similar, like^abc(.*)
,(.*)abc$
,pattern
, and^pattern$
.These pattern cases can be replaced with faster expressions like
GpuStartsWith
,GpuEndsWith
orGpuContains
when overriding.Some commonly used patterns are even worth to writing a custom kernel to match, such as
pattern[0-9]{3,4}
(some digits followed by a string) and[\u4e00-\u9fa5]+
(any Chinese character).Describe the solution you'd like
We have a regex parser in plugin code here to translate a regex pattern to cudf supported style and check fallback. We can reuse it if possible to match if it is a simple pattern that can be replaced, and replace that case to the faster expressions.
Here is a list of planned/possible tasks:
pattern[A-B]{X,Y}
(a pattern string followed by X to Y chars in range A - B) inRLIKE
to a custom kernelThe text was updated successfully, but these errors were encountered: