Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators #7585

Open
andygrove opened this issue Jan 25, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Contributor

Describe the bug
Line anchor $ will incorrectly match any of the unicode characters \u0085, \u2028, or \u2029 followed by
another line-terminator, such as \n. For example, the pattern TEST$ will match TEST\u0085\n on the GPU but
not on the CPU.

Steps/Code to reproduce bug
See new unit tests added in #7211

Expected behavior
GPU and CPU should match

Environment details (please complete the following information)
N/A

Additional context

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 25, 2023
@sameerz
Copy link
Collaborator

sameerz commented Feb 1, 2023

The handling of line terminators was documented in the compatibility guide in PR #7211

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Feb 7, 2023
@andygrove
Copy link
Contributor Author

andygrove commented Feb 14, 2023

To add more context here, cuDF has the correct behavior when passed TEST$. The problem is that we are transcoding TEST$ to TEST(?:\r|\u0085|\u2028|\u2029|\r\n)?$ to work around differences between Java's regexp engine and cuDF's regexp engine related to line terminators, and this transcoding does not provide the desired result for the example given in this issue.

Possible solutions to resolve this issue:

  1. Find a transcoding that works in all cases (I am not sure if this is possible).
  2. Scan the input data and look for cases where we would produce incorrect results and fail the query.
  3. Request new Java-compatibility features in cuDF's regexp engine so that we don't have to transcode in the first place.
  4. Fork cuDF's regexp kernels and make them compatible with Java as part of spark-rapids-jni.

If we remove the transcoding, then the test fails on a different input. In this case, Java $ ignores the final line terminator in the input, but cuDF does not.

javaPattern[0]=TEST$, cudfPattern=TEST$, input='aTEST\u0085', cpu=true, gpu=false

@andygrove
Copy link
Contributor Author

andygrove commented Feb 14, 2023

Here are some examples of mismatches when we remove the transcoding and use the same pattern TEST$ on CPU and GPU.

Input CPU GPU
TEST\r true false
TEST\r\n true false
TEST\u0085 true false
TEST\u2028 true false
TEST\u2029 true false

@davidwendt
Copy link

One option is to do a replace of these with a single \n on the input data before evaluating the regex.
This can be done with a single libcudf call to cudf::strings::replace like the following:

auto input   = cudf::test::strings_column_wrapper({"test\r\n", "test∞", "testφ", "test"});
auto targets = cudf::test::strings_column_wrapper({"\r\n", "∞", "φ"});
auto repls   = cudf::test::strings_column_wrapper({"\n", "\n", "\n"});

auto results = cudf::strings::replace(cudf::strings_column_view{input},
                                      cudf::strings_column_view(targets),
                                      cudf::strings_column_view(repls));
results => {"test\n", "test\n", "test\n", "test"}

And then apply test$ pattern to that.
I believe the \r\n to \n replacement may already be occurring based on the comment here which was resolved.
rapidsai/cudf#11979 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants