Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regressions related to cuDF changes in handline of end-of-line/string anchors #7211

Merged
merged 16 commits into from
Jan 27, 2023

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Nov 30, 2022

Closes #7090

Rationale

Now that rapidsai/cudf#11979 is resolved using the fix described rapidsai/cudf#11979 (comment), the regular expression transpiler code needed to be updated for new handling of $, \z and \Z.

Follow-on issue:

Changes in this PR:

  • Update transpilation for $ to reflect recent changes in cuDF
  • Fall back to CPU for \z because there is no longer an equivalent in cuDF that we can transpile to
  • Add test for \z fallback to CPU
  • Add some new tests for handling of \u0085, \u2028, and \u2029
  • Update compatibility guide

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove changed the base branch from branch-22.12 to branch-23.02 November 30, 2022 18:31
@andygrove andygrove changed the title Transpile dollar Update transpilation of $ in regexp Nov 30, 2022
@andygrove andygrove self-assigned this Nov 30, 2022
@sameerz sameerz added the task Work required that improves the product but is not user facing label Nov 30, 2022
@andygrove andygrove changed the title Update transpilation of $ in regexp Fix regressions related to cuDF changes in handline of end-of-line/string anchors Nov 30, 2022
@andygrove andygrove marked this pull request as ready for review December 16, 2022 20:08
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

There was a test failure against Spark 3.1.1. I will investigate:

test_re_replace_anchors - AssertionError: GPU and CPU string values are different at [0, 'regexp_replace(a, TEST$, , 1)']

@andygrove andygrove marked this pull request as draft January 12, 2023 17:36
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove marked this pull request as ready for review January 25, 2023 23:35
docs/compatibility.md Outdated Show resolved Hide resolved
Comment on lines -460 to -463
The following regular expression patterns are known to potentially produce different results on the GPU
vs the CPU.

- Word and non-word boundaries, `\b` and `\B`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had two sections listing known edge cases, so I consolidated them by moving this content.

@andygrove
Copy link
Contributor Author

@NVnavkumar This is ready for review now

Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some nits on docs and error messaging

- Word and non-word boundaries, `\b` and `\B`
- Line anchor `$` will incorrectly match any of the unicode characters `\u0085`, `\u2028`, or `\u2029` followed by
another line-terminator, such as `\n`. For example, the pattern `TEST$` will match `TEST\u0085\n` on the GPU but
not on the CPU ([#7585](https://github.com/NVIDIA/spark-rapids/issues/7585)).

The following regular expression patterns are not yet supported on the GPU and will fall back to the CPU.

- Line anchor `^` is not supported in some contexts, such as when combined with a choice (`^|a`).
- Line anchor `$` is not supported by `regexp_replace`, and in some rare contexts.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true now? We still have integration tests for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Updated this line.

@@ -1144,8 +1143,8 @@ class CudfRegexTranspiler(mode: RegexMode) {
case 'z' if mode == RegexSplitMode =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We left this split case for \z, but the error message just claims that \z is not supported on GPU. We should either remove this case, or clarify the error messaging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Nit: Looks like compatibility doc wasn't update for \z, but I don't know if it makes sense to call out a feature that only works in StringSplit.

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit 3398daa into NVIDIA:branch-23.02 Jan 27, 2023
@andygrove andygrove deleted the transpile-dollar branch January 27, 2023 19:02
andygrove added a commit to andygrove/spark-rapids that referenced this pull request Jan 27, 2023
andygrove added a commit to andygrove/spark-rapids that referenced this pull request Jan 27, 2023
…-line/string anchors (NVIDIA#7211)"

This reverts commit 3398daa.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Refactor line terminator handling code
3 participants