[BUG] GetJsonObject does not normalize non-string output #10218

revans2 · 2024-01-18T19:40:37Z

Describe the bug
GetJsonObject on the CPU first parses the JSON and then, when it outputs the result, converts the parsed data back to a JSON string. As a result, the output string is normalized. The GPU implementation does none of this. Instead, it just copies the matched character range back out. The following bugs can show up because of this.

Unnecessary white space is removed

{ "a" : "A" } becomes {"a":"A"} on the CPU, but stays as { "a" : "A" } on the GPU
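
To make the difference concrete, here is a minimal sketch of the two strategies in Python. It is not the Spark or spark-rapids code. Both function names are hypothetical, and Python's `json` module stands in for the real parser and generator.

```python
import json


def copy_character_range(doc: str) -> str:
    # Behavior described for the GPU: return the matched text untouched.
    return doc


def parse_and_reserialize(doc: str) -> str:
    # Behavior described for the CPU: parse, then write the value back out.
    # Compact separators drop the insignificant white space.
    return json.dumps(json.loads(doc), separators=(",", ":"))


doc = '{ "a" : "A" }'
print(copy_character_range(doc))   # { "a" : "A" }
print(parse_and_reserialize(doc))  # {"a":"A"}
```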

Quotes are normalized

{'a':'A"'} becomes {"a":"A\""} on the CPU, but stays as {'a':'A"'} on the GPU

Numbers are normalized

In the simplest case Spark strips unneeded trailing zeros for floating point numbers.
[100.0,200.000,351.980] on the CPU becomes [100.0,200.0,351.98], but on the GPU it is unchanged

Larger floating point numbers can be converted to scientific notation, or have their notation normalized.
[12345678900000000000.0] becomes [1.23456789E19] on the CPU, but is unchanged on the GPU.
[1E308] becomes [1.0E308] on the CPU.

Floating point numbers that are too large or too small to fit in a double are turned into the strings "Infinity"/"-Infinity".
[1.0E309,-1E309,1E5000] becomes ["Infinity","-Infinity","Infinity"]

Integer-like numbers, however, are not modified: [12345678900000000000] just stays the same, even for numbers that are very, very large, e.g. "1" + ("0" * 400)
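
The number rules above can be sketched roughly as follows. This is a Python approximation, not the actual implementation. `normalize_number` and `java_style_double` are hypothetical names, and the formatting assumes the output follows Java's `Double.toString` conventions (plain decimal for magnitudes from 1e-3 up to but not including 1e7, scientific notation otherwise). Python's shortest round-trip digits may differ from Java's in rare cases.

```python
import math
from decimal import Decimal


def java_style_double(value: float) -> str:
    # Approximates Java's Double.toString for finite values (assumption).
    if value == 0:
        return "-0.0" if math.copysign(1.0, value) < 0 else "0.0"
    sign, digit_tuple, exp = Decimal(repr(value)).normalize().as_tuple()
    digits = "".join(map(str, digit_tuple))
    sci_exp = len(digits) - 1 + exp  # exponent in d.ddd x 10^n form
    prefix = "-" if sign else ""
    if -3 <= sci_exp < 7:
        if sci_exp < 0:
            body = "0." + "0" * (-sci_exp - 1) + digits
        else:
            int_part = digits[: sci_exp + 1].ljust(sci_exp + 1, "0")
            body = int_part + "." + (digits[sci_exp + 1:] or "0")
    else:
        body = digits[0] + "." + (digits[1:] or "0") + "E" + str(sci_exp)
    return prefix + body


def normalize_number(token: str) -> str:
    # Integer-like tokens are passed through untouched, however large.
    if not any(c in token for c in ".eE"):
        return token
    value = float(token)
    # Values that overflow a double become quoted infinity strings.
    if math.isinf(value):
        return '"-Infinity"' if value < 0 else '"Infinity"'
    return java_style_double(value)


for tok in ["200.000", "12345678900000000000.0", "1E308", "-1E309"]:
    print(tok, "->", normalize_number(tok))
```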

Unneeded escapes are removed

{"a":"B\'"} becomes {"a":"B'"}. Escaping the ' character is not needed here.
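
Quote normalization and escape removal can both be seen as "unescape the string, then write it back with standard JSON escaping". The sketch below illustrates that idea in Python for a single string token. `normalize_string_token` is a hypothetical helper, and it assumes the token is already known to be a well-formed single- or double-quoted string.

```python
import json

_SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f"}


def normalize_string_token(token: str) -> str:
    body = token[1:-1]  # drop the surrounding quotes, whichever kind they are
    chars = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == "u":
                # \uXXXX: decode the four hex digits
                chars.append(chr(int(body[i + 2:i + 6], 16)))
                i += 6
                continue
            # \' \" \\ \/ map to the character itself
            chars.append(_SIMPLE_ESCAPES.get(nxt, nxt))
            i += 2
        else:
            chars.append(ch)
            i += 1
    # json.dumps re-quotes with double quotes and adds only required escapes.
    return json.dumps("".join(chars), ensure_ascii=False)


print(normalize_string_token("'A\"'"))     # "A\""
print(normalize_string_token('"B\\\'"'))   # "B'"
```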

That said, we don't need to worry about normalizing nulls, as they are always null and nothing else is allowed, or booleans, because true and false are the only values supported.

GaryShen2008 · 2024-03-05T05:03:22Z

Hi @SurajAralihalli, as I confirmed with Chong, I'm assigning this issue to you.

res-life · 2024-03-18T08:26:47Z

This commit will solve item 1, item 2, and item 4.
Item 3 (numbers are normalized) still needs to be fixed.

thirtiseven · 2024-03-20T08:20:45Z

For floating point numbers, I think the rule is the same as double to string in Java. If so, we can reuse ftos_converter in JNI to get an almost bit-for-bit match with Spark. But first we need to convert the string to a double. A string to double kernel is also available in JNI.

revans2 added the bug and ? - Needs Triage labels Jan 18, 2024

This was referenced Jan 18, 2024

[BUG] GetJsonObject sees a double quote in a single quoted string as invalid #10219

Closed

[FEA] Fix GetJsonObject #10254

Open

GregoryKimball mentioned this issue Jan 24, 2024

[FEA] Add whitespace removal as a JSON reader preprocessing option rapidsai/cudf#14865

Closed

mattahrens removed the ? - Needs Triage label Jan 30, 2024

revans2 mentioned this issue Feb 21, 2024

[BUG] JsonToStructs and ScanJson do not normalize numeric output when read as a string #10458

Open

GaryShen2008 assigned SurajAralihalli Mar 5, 2024

This was referenced Mar 12, 2024

[BUG] JsonToStructs and ScanJson does not normalize non quoted strings when read as strings #10574

Open

[FEA] JSON number normalization when returned as a string rapidsai/cudf#15318

Open

res-life mentioned this issue Mar 18, 2024

[FEA] GetJsonObject: Implement JSON generator to print JSON items NVIDIA/spark-rapids-jni#1831

Closed

res-life assigned res-life and thirtiseven and unassigned SurajAralihalli, thirtiseven and res-life Mar 19, 2024

thirtiseven mentioned this issue Mar 25, 2024

Use new jni kernel for getJsonObject #10581

Merged

thirtiseven closed this as completed in #10581 Mar 28, 2024
