
[BUG] from_json fails with cuDF error Invalid list size computation error #9212

Closed
andygrove opened this issue Sep 8, 2023 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug

I am testing with a custom build of spark-rapids-jni, where I am specifying RECOVER_WITH_NULL in the from_json function that gets called from extractRawMapFromJsonString.

A simple test of from_json fails with the cuDF error Invalid list size computation.

Steps/Code to reproduce bug

scala> val df = Seq("{'a': '1'}\n{'a': '2'}\n").toDF("str").repartition(2)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [str: string]

scala> df.createOrReplaceTempView("t")

scala> spark.sql("select from_json(str, 'MAP<STRING,STRING>') from t").show()

Fails with

ai.rapids.cudf.CudfException: CUDF failure at: /home/andy/git/nvidia/spark-rapids-jni/src/main/cpp/src/map_utils.cu:609: Invalid list size computation.
	at com.nvidia.spark.rapids.jni.MapUtils.extractRawMapFromJsonString(Native Method)
	at com.nvidia.spark.rapids.jni.MapUtils.extractRawMapFromJsonString(MapUtils.java:49)
	at org.apache.spark.sql.rapids.GpuJsonToStructs.doColumnar(GpuJsonToStructs.scala:153)

Expected behavior

Spark without the plugin produces:

+--------+
| entries|
+--------+
|{a -> 1}|
+--------+

Environment details (please complete the following information)
N/A

Additional context

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 8, 2023
@andygrove andygrove self-assigned this Sep 8, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Sep 12, 2023
@mattahrens
Collaborator

@ttnghia has worked on the json tokenization layer in spark-rapids-jni and can provide help as needed.

@ttnghia
Collaborator

ttnghia commented Sep 12, 2023

Will look into this.

@ttnghia
Collaborator

ttnghia commented Sep 14, 2023

This is not a bug but rather a limitation of the current implementation:

  • The cuDF JSON parser doesn't support single-quote characters.
  • from_json only works with input having one (string) JSON object per row.
  • Duplicate keys are not handled.
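As a hedged illustration of the first limitation: single-quoted input could in principle be normalized to standard double-quoted JSON on the CPU before parsing. `normalize_quotes` below is a hypothetical helper, not part of spark-rapids-jni, and a real implementation would need a proper tokenizer to handle quotes escaped or embedded inside string values:

```cpp
#include <string>

// Hypothetical helper (not spark-rapids-jni code): rewrite single-quoted
// JSON like {'a': '1'} into standard double-quoted JSON {"a": "1"} so a
// parser without single-quote support can accept it. This naive character
// swap ignores quotes inside values and is for illustration only.
std::string normalize_quotes(std::string s) {
  for (char& c : s) {
    if (c == '\'') c = '"';
  }
  return s;
}
```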

@andygrove
Contributor Author

andygrove commented Oct 31, 2023

I just tested this again, using the code from #9423, and it actually failed with a segmentation fault, which is concerning.

Stack: [0x00007f358ff00000,0x00007f3590000000],  sp=0x00007f358fffae48,  free space=1003k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcuda.so.1+0x186618]
C  [libcuda.so.1+0x277b5a]
C  [libcuda.so.1+0x4fae18]
C  [libcuda.so.1+0x13b116]
C  [libcuda.so.1+0x13b529]
C  [libcuda.so.1+0x13bdc7]
C  [libcuda.so.1+0x2dbca1]
C  [cudf5683805021365819471.so+0x3075821]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.nvidia.spark.rapids.jni.MapUtils.extractRawMapFromJsonString(J)J+0
j  com.nvidia.spark.rapids.jni.MapUtils.extractRawMapFromJsonString(Lai/rapids/cudf/ColumnView;)Lai/rapids/cudf/ColumnVector;+37
j  org.apache.spark.sql.rapids.GpuJsonToStructs.doColumnar(Lcom/nvidia/spark/rapids/GpuColumnVector;)Lai/rapids/cudf/ColumnVector;+18

@ttnghia
Collaborator

ttnghia commented Oct 31, 2023

I can reproduce it with the latest cudf code:

scala> val df = Seq("{'a': '1'}").toDF("str").repartition(2)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [str: string]

scala> df.createOrReplaceTempView("t")

scala> spark.sql("select from_json(str, 'MAP<STRING,STRING>') from t").show()
....
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5975570710, pid=468146, tid=0x00007f596c39a640
#

@ttnghia
Collaborator

ttnghia commented Oct 31, 2023

I realized that the issue is due to having repartition(2). Without it, the example works fine:
scala> val df = Seq("{'a': '1'}\n{'a': '2'}\n").toDF("str")
df: org.apache.spark.sql.DataFrame = [str: string]

scala> df.createOrReplaceTempView("t")

scala> spark.conf.set("spark.rapids.sql.expression.JsonToStructs","true")

scala> spark.sql("select from_json(str, 'MAP<STRING,STRING>') from t").show()
23/10/31 22:55:33 WARN GpuOverrides: 
! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
  @Expression <AttributeReference> entries#9 could run on GPU

+--------+
| entries|
+--------+
|{a -> 1}|
+--------+

So there should be something wrong with handling empty input somewhere.

@andygrove
Contributor Author

I realize that the issue is due to having repartition(2)

Without the repartition, the query falls back to the CPU (cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec)

@ttnghia
Collaborator

ttnghia commented Nov 1, 2023

Got it. So this is indeed a bug in from_json in spark-rapids-jni. The issue is caused by comparing signed (negative) and unsigned integers when an exception is thrown after an invalid token is detected.

I'll post a fix PR shortly.
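The signed-vs-unsigned pitfall described above can be sketched in plain C++. The names here are illustrative, not the actual map_utils.cu code: when an invalid token is flagged with a negative sentinel, subtraction of unsigned offsets or comparison against an unsigned count silently wraps around instead of reaching the intended error path:

```cpp
#include <cstdint>
#include <limits>

// Illustrative sketch of the signed-vs-unsigned bug class described above
// (hypothetical names, not the actual map_utils.cu code).

// Unsigned subtraction wraps: if an invalid token makes a list's end offset
// smaller than its begin offset, the "negative" size becomes a huge value,
// which a validity check then rejects with something like
// "Invalid list size computation" -- or, if unchecked, leads to an
// out-of-bounds access and a segfault.
std::uint32_t list_size(std::uint32_t begin, std::uint32_t end) {
  return end - begin;  // wraps around when end < begin
}

// A signed error sentinel compared against an unsigned count is implicitly
// converted, so -1 becomes UINT32_MAX and the comparison no longer behaves
// as the author intended.
bool sentinel_wraps(std::int32_t error_pos, std::uint32_t num_tokens) {
  return static_cast<std::uint32_t>(error_pos) > num_tokens;  // true for -1
}
```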

@ttnghia
Collaborator

ttnghia commented Nov 1, 2023

Alright, that crash issue should be fixed by NVIDIA/spark-rapids-jni#1536.

After the fix, the example in this issue causes a regular cuDF exception to be thrown instead of a crash.

@andygrove
Contributor Author

I just tested this on the latest branch-24.02 and it is no longer an issue.
