[BUG] test_get_json_object_single_quotes failure #10422

Closed
jlowe opened this issue Feb 14, 2024 · 3 comments · Fixed by #10423
Labels
bug Something isn't working

Comments

jlowe (Member) commented Feb 14, 2024

test_get_json_object_single_quotes failed in last night's Databricks 10.4 tests.

[2024-02-14T11:13:00.694Z] FAILED ../../src/main/python/get_json_test.py::test_get_json_object_single_quotes[DATAGEN_SEED=1707901022] - AssertionError: GPU and CPU are not both null at [1, 'sub_b']
Details
[2024-02-14T11:13:00.691Z] =================================== FAILURES ===================================
[2024-02-14T11:13:00.691Z] ______________________ test_get_json_object_single_quotes ______________________
[2024-02-14T11:13:00.691Z] [gw1] linux -- Python 3.8.10 /usr/bin/python
[2024-02-14T11:13:00.691Z] 
[2024-02-14T11:13:00.691Z]     def test_get_json_object_single_quotes():
[2024-02-14T11:13:00.691Z]         schema = StructType([StructField("jsonStr", StringType())])
[2024-02-14T11:13:00.691Z]         data = [[r'''{'a':'A'}'''],
[2024-02-14T11:13:00.691Z]                 [r'''{'b':'"B'}'''],
[2024-02-14T11:13:00.691Z]                 [r'''{"c":"'C"}''']]
[2024-02-14T11:13:00.691Z]     
[2024-02-14T11:13:00.691Z] >       assert_gpu_and_cpu_are_equal_collect(
[2024-02-14T11:13:00.691Z]             lambda spark: spark.createDataFrame(data,schema=schema).select(
[2024-02-14T11:13:00.691Z]             f.get_json_object('jsonStr',r'''$['a']''').alias('sub_a'),
[2024-02-14T11:13:00.691Z]             f.get_json_object('jsonStr',r'''$['b']''').alias('sub_b'),
[2024-02-14T11:13:00.691Z]             f.get_json_object('jsonStr',r'''$['c']''').alias('sub_c')),
[2024-02-14T11:13:00.691Z]             conf={'spark.rapids.sql.expression.GetJsonObject': 'true'})
[2024-02-14T11:13:00.691Z] 
[2024-02-14T11:13:00.691Z] ../../src/main/python/get_json_test.py:59: 
[2024-02-14T11:13:00.691Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-02-14T11:13:00.691Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2024-02-14T11:13:00.691Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2024-02-14T11:13:00.691Z] ../../src/main/python/asserts.py:517: in _assert_gpu_and_cpu_are_equal
[2024-02-14T11:13:00.691Z]     assert_equal(from_cpu, from_gpu)
[2024-02-14T11:13:00.691Z] ../../src/main/python/asserts.py:107: in assert_equal
[2024-02-14T11:13:00.691Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2024-02-14T11:13:00.691Z] ../../src/main/python/asserts.py:43: in _assert_equal
[2024-02-14T11:13:00.691Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-02-14T11:13:00.691Z] ../../src/main/python/asserts.py:36: in _assert_equal
[2024-02-14T11:13:00.691Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-02-14T11:13:00.691Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-02-14T11:13:00.691Z] 
[2024-02-14T11:13:00.691Z] cpu = None, gpu = '"B'
[2024-02-14T11:13:00.691Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f9488165310>
[2024-02-14T11:13:00.691Z] path = [1, 'sub_b']
[2024-02-14T11:13:00.691Z] 
[2024-02-14T11:13:00.691Z]     def _assert_equal(cpu, gpu, float_check, path):
[2024-02-14T11:13:00.692Z]         t = type(cpu)
[2024-02-14T11:13:00.692Z]         if (t is Row):
[2024-02-14T11:13:00.692Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-02-14T11:13:00.692Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2024-02-14T11:13:00.692Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2024-02-14T11:13:00.692Z]                 for field in cpu.__fields__:
[2024-02-14T11:13:00.692Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2024-02-14T11:13:00.692Z]             else:
[2024-02-14T11:13:00.692Z]                 for index in range(len(cpu)):
[2024-02-14T11:13:00.692Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-02-14T11:13:00.692Z]         elif (t is list):
[2024-02-14T11:13:00.692Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-02-14T11:13:00.692Z]             for index in range(len(cpu)):
[2024-02-14T11:13:00.692Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-02-14T11:13:00.692Z]         elif (t is tuple):
[2024-02-14T11:13:00.692Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2024-02-14T11:13:00.692Z]             for index in range(len(cpu)):
[2024-02-14T11:13:00.692Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2024-02-14T11:13:00.692Z]         elif (t is pytypes.GeneratorType):
[2024-02-14T11:13:00.692Z]             index = 0
[2024-02-14T11:13:00.692Z]             # generator has no zip :( so we have to do this the hard way
[2024-02-14T11:13:00.692Z]             done = False
[2024-02-14T11:13:00.692Z]             while not done:
[2024-02-14T11:13:00.692Z]                 sub_cpu = None
[2024-02-14T11:13:00.692Z]                 sub_gpu = None
[2024-02-14T11:13:00.692Z]                 try:
[2024-02-14T11:13:00.692Z]                     sub_cpu = next(cpu)
[2024-02-14T11:13:00.692Z]                 except StopIteration:
[2024-02-14T11:13:00.692Z]                     done = True
[2024-02-14T11:13:00.692Z]     
[2024-02-14T11:13:00.692Z]                 try:
[2024-02-14T11:13:00.692Z]                     sub_gpu = next(gpu)
[2024-02-14T11:13:00.692Z]                 except StopIteration:
[2024-02-14T11:13:00.692Z]                     done = True
[2024-02-14T11:13:00.692Z]     
[2024-02-14T11:13:00.692Z]                 if done:
[2024-02-14T11:13:00.692Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2024-02-14T11:13:00.692Z]                 else:
[2024-02-14T11:13:00.692Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2024-02-14T11:13:00.692Z]     
[2024-02-14T11:13:00.692Z]                 index = index + 1
[2024-02-14T11:13:00.692Z]         elif (t is dict):
[2024-02-14T11:13:00.692Z]             # The order of key/values is not guaranteed in python dicts, nor are they guaranteed by Spark
[2024-02-14T11:13:00.692Z]             # so sort the items to do our best with ignoring the order of dicts
[2024-02-14T11:13:00.692Z]             cpu_items = list(cpu.items()).sort(key=_RowCmp)
[2024-02-14T11:13:00.692Z]             gpu_items = list(gpu.items()).sort(key=_RowCmp)
[2024-02-14T11:13:00.692Z]             _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
[2024-02-14T11:13:00.692Z]         elif (t is int):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif (t is float):
[2024-02-14T11:13:00.692Z]             if (math.isnan(cpu)):
[2024-02-14T11:13:00.692Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]             else:
[2024-02-14T11:13:00.692Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, str):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, datetime):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, date):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, bool):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, Decimal):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU decimal values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, bytearray):
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU bytearray values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif isinstance(cpu, timedelta):
[2024-02-14T11:13:00.692Z]             # Used by interval type DayTimeInterval for Pyspark 3.3.0+
[2024-02-14T11:13:00.692Z]             assert cpu == gpu, "GPU and CPU timedelta values are different at {}".format(path)
[2024-02-14T11:13:00.692Z]         elif (cpu == None):
[2024-02-14T11:13:00.692Z] >           assert cpu == gpu, "GPU and CPU are not both null at {}".format(path)
[2024-02-14T11:13:00.692Z] E           AssertionError: GPU and CPU are not both null at [1, 'sub_b']
[2024-02-14T11:13:00.692Z] 
[2024-02-14T11:13:00.692Z] ../../src/main/python/asserts.py:100: AssertionError

SurajAralihalli (Collaborator) commented Feb 14, 2024

test_get_json_object_single_quotes is a new test case introduced in PR #10407 to verify get_json_object with single quotes. I tested this on Databricks 10.4 (Spark 3.2.1) against rapids-4-spark_2.12-23.12.2.jar, and it turns out we don't match the CPU results there either.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder \
    .appName("Test get_json_object") \
    .getOrCreate()

# Run the query on the GPU via the RAPIDS plugin
spark.conf.set("spark.rapids.sql.enabled", "true")

# Mix of single-quoted and double-quoted JSON, with embedded quote characters
json_data = [
    [r'''{'key':'A'}'''],
    [r'''{'key':'"B'}'''],
    [r'''{"key":"'C"}''']
]

df = spark.createDataFrame(json_data, ["json_str"])

result = df.select(
    expr("get_json_object(json_str, '$.key') AS key_value")
)

result.show()
| Input JSON      | CPU  | GPU 23.12 | GPU 24.04 SNAPSHOT |
|-----------------|------|-----------|--------------------|
| `{'key':'A'}`   | A    | null      | A                  |
| `{'key':'"B'}`  | null | null      | "B                 |
| `{"key":"'C"}`  | 'C   | 'C        | 'C                 |
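
The CPU column can be reproduced on the same cluster by turning the plugin off and rerunning the query; a minimal sketch, reusing the `spark`, `df`, and `expr` objects from the snippet above:

# Fall back to Spark's CPU implementation of get_json_object for the same query
spark.conf.set("spark.rapids.sql.enabled", "false")

df.select(
    expr("get_json_object(json_str, '$.key') AS key_value")
).show()

# Re-enable the plugin afterwards
spark.conf.set("spark.rapids.sql.enabled", "true")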

revans2 (Collaborator) commented Feb 15, 2024

So Databricks is different from Apache Spark when it comes to parsing JSON object values with single quotes? We match Apache Spark 3.4.2 exactly.
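
For anyone who wants to repeat that comparison against stock Apache Spark (no RAPIDS jars on the classpath), a minimal local check along these lines should be enough; the session setup here is only an illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

# Plain local Apache Spark session, no RAPIDS plugin involved
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("get_json_object single-quote check") \
    .getOrCreate()

# Same three inputs as the repro above
data = [
    [r'''{'key':'A'}'''],
    [r'''{'key':'"B'}'''],
    [r'''{"key":"'C"}''']
]

df = spark.createDataFrame(data, ["json_str"])
df.select(get_json_object("json_str", "$.key").alias("key_value")).show()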

SurajAralihalli (Collaborator) commented Feb 15, 2024

I think this issue is specific to Databricks 10.4, since the integration tests pass on the other Databricks runtimes. It appears to be a bug in DB 10.4.
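
If the discrepancy really is a DB 10.4 quirk rather than a plugin bug, one option would be to skip this case on that runtime in the integration tests and link back to this issue. A rough sketch only; the environment-variable check below is an assumption and not necessarily what #10423 does:

import os
import pytest

# Assumed runtime marker; the integration-test suite has its own runtime-detection helpers
_is_db_104 = os.environ.get('DATABRICKS_RUNTIME_VERSION', '').startswith('10.4')

@pytest.mark.skipif(_is_db_104,
                    reason='CPU get_json_object handles single quotes differently on DB 10.4, '
                           'see https://github.com/NVIDIA/spark-rapids/issues/10422')
def test_get_json_object_single_quotes():
    ...  # existing test body unchanged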

@mattahrens removed the "? - Needs Triage" label on Feb 16, 2024