Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Support for Lazy Quantifier [databricks] #10208

Merged
merged 11 commits into from
Jan 24, 2024

Conversation

razajafri
Copy link
Collaborator

@razajafri razajafri commented Jan 17, 2024

This PR adds the lazy quantifier to the RegexParser

Changes Made:

  • RegexParser returns the needed transpiled AST needed
  • Test was added to regexp_test.py

Tests:

  • Integration test added to regexp_test.py

Integration tests were run locally

fixes #8806

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Copy link
Collaborator Author

I will be adding more fuzzer tests

@razajafri razajafri changed the title Added Lazy Quantifier [databricks] Added Support for Lazy Quantifier [databricks] Jan 17, 2024
@sameerz sameerz added the feature request New feature or request label Jan 17, 2024
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

Databricks cluster failed to start.

[2024-01-17T21:16:34.414Z] subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -p 2200 -i **** ubuntu@4.155.42.241 'SPARKSRCTGZ=/home/ubuntu/spark-rapids-ci.tgz BASE_SPARK_VERSION=3.2.1 BASE_SPARK_VERSION_TO_INSTALL_DATABRICKS_JARS=3.2.1 MVN_OPT= EXTRA_ENVS= bash /home/ubuntu/build.sh 2>&1 | tee buildout; if [ echo ${PIPESTATUS[0]} -ne 0 ]; then false; else true; fi'' returned non-zero exit status 1.

@razajafri
Copy link
Collaborator Author

build

andygrove
andygrove previously approved these changes Jan 18, 2024
Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I ran this locally and manually verified that the results from the integration test looked correct. I also tried some different datagen patterns to see if I could break the test but did not find anything.

…scala

Co-authored-by: Andy Grove <andygrove73@gmail.com>
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

If Databricks keeps failing I will just run the test manually and skip Databricks from the premerge

@razajafri razajafri changed the title Added Support for Lazy Quantifier [databricks] Added Support for Lazy Quantifier Jan 19, 2024
@razajafri
Copy link
Collaborator Author

build

2 similar comments
@razajafri
Copy link
Collaborator Author

build

@sameerz
Copy link
Collaborator

sameerz commented Jan 22, 2024

build

@NVnavkumar
Copy link
Collaborator

Looking at recent blossom failure:

2024-01-22T01:00:54.1200895Z [2024-01-22T00:55:30.814Z] �[31m- AST fuzz test - regexp_find *** FAILED ***�[0m
2024-01-22T01:00:54.1201603Z [2024-01-22T00:55:30.815Z] �[31m  cuDF failed to compile pattern: (2|\u2029??)+? (RegularExpressionTranspilerSuite.scala:954)

It looks like cuDF doesn't support the ?? or ? inside of the choice( |). Will need to handle that case explicitly and fall back to CPU. The +? after the group is fine.

@razajafri
Copy link
Collaborator Author

Looking at recent blossom failure:

2024-01-22T01:00:54.1200895Z [2024-01-22T00:55:30.814Z] �[31m- AST fuzz test - regexp_find *** FAILED ***�[0m
2024-01-22T01:00:54.1201603Z [2024-01-22T00:55:30.815Z] �[31m  cuDF failed to compile pattern: (2|\u2029??)+? (RegularExpressionTranspilerSuite.scala:954)

It looks like cuDF doesn't support the ?? or ? inside of the choice( |). Will need to handle that case explicitly and fall back to CPU. The +? after the group is fine.

Thanks for looking into this. I am struggling to see why the pattern is being generated

@NVnavkumar
Copy link
Collaborator

Looking at recent blossom failure:

2024-01-22T01:00:54.1200895Z [2024-01-22T00:55:30.814Z] �[31m- AST fuzz test - regexp_find *** FAILED ***�[0m
2024-01-22T01:00:54.1201603Z [2024-01-22T00:55:30.815Z] �[31m  cuDF failed to compile pattern: (2|\u2029??)+? (RegularExpressionTranspilerSuite.scala:954)

It looks like cuDF doesn't support the ?? or ? inside of the choice( |). Will need to handle that case explicitly and fall back to CPU. The +? after the group is fine.

Thanks for looking into this. I am struggling to see why the pattern is being generated

It's because it's actually a valid pattern in Java (and Python fwiw). CUDF is the only thing throwing an invalid pattern here. Since it's valid in Java, and we have not detected to fallback to CPU, that's why you are getting the test error here.

@andygrove
Copy link
Contributor

I was able to get past the failing test with this diff. Note that this is only checking the right side of the choice and we would need to look at the left as well.

$ git diff
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
index 2380eb623..7666f1ab9 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
@@ -1565,6 +1565,14 @@ class CudfRegexTranspiler(mode: RegexMode) {
             r.position)
         }
 
+        rr match {
+          case RegexSequence(ListBuffer(RegexRepetition(
+              RegexRepetition(_,SimpleQuantifier('?')),SimpleQuantifier('?')))) =>
+            throw new RegexUnsupportedException(
+              "cuDF does not support lazy quantifier inside choice", r.position)
+          case other =>
+        }
+
         RegexChoice(ll, rr)

@razajafri razajafri changed the title Added Support for Lazy Quantifier Added Support for Lazy Quantifier [databricks] Jan 23, 2024
@razajafri
Copy link
Collaborator Author

build

@razajafri
Copy link
Collaborator Author

Thanks @andygrove

@razajafri
Copy link
Collaborator Author

There are more test failures, I am looking into them

// rr = lazyQuantifier inside a choice
(_, RegexSequence(ListBuffer(RegexRepetition(
RegexRepetition(_, SimpleQuantifier('?')), SimpleQuantifier('?'))))) =>
throw new RegexUnsupportedException(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razajafri
Copy link
Collaborator Author

build

(ll, rr) match {
// ll = lazyQuantifier inside a choice
case (RegexSequence(ListBuffer(RegexRepetition(
RegexRepetition(_, SimpleQuantifier('?')), SimpleQuantifier('?')))), _) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent seems off here? I would expect this second line to be indented more that the case in the prev line. Same comment for other uses of this pattern in this PR.

Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Lett a comment about formatting.

@razajafri razajafri merged commit 5ea079a into NVIDIA:branch-24.02 Jan 24, 2024
40 checks passed
@razajafri razajafri deleted the SP-8806-lazy-quantifier branch January 24, 2024 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support lazy quantifier and specified group index in regexp_extract function
4 participants