[BUG] GetJsonObject does not normalize non-string output #10218

Closed
revans2 opened this issue Jan 18, 2024 · 3 comments · Fixed by #10581
Labels
bug Something isn't working

Comments

@revans2
Collaborator

revans2 commented Jan 18, 2024

Describe the bug
GetJsonObject on the CPU first parses the JSON, and when it outputs the result it converts the parsed data back to a JSON string. As a result, the new string is normalized. The GPU implementation does none of this; it just copies the matching character range back out. The following bugs can show up because of this.

  1. Unnecessary white space is removed

{ "a" : "A" } becomes {"a":"A"} on the CPU, but stays as { "a" : "A" } on the GPU

  2. Quotes are normalized

{'a':'A"'} becomes {"a":"A\""} on the CPU, but stays as {'a':'A"'} on the GPU

  3. Numbers are normalized

In the simplest case Spark strips unneeded trailing zeros for floating point numbers.
[100.0,200.000,351.980] on the CPU becomes [100.0,200.0,351.98], but on the GPU it is unchanged

Larger floating point numbers can be converted to scientific notation, or have their existing notation normalized.
[12345678900000000000.0] becomes [1.23456789E19] on the CPU, but is unchanged on the GPU.
[1E308] becomes [1.0E308] on the CPU.

But floating point numbers whose magnitude is too large to fit in a double are turned into the strings "Infinity"/"-Infinity".
[1.0E309,-1E309,1E5000] becomes ["Infinity","-Infinity","Infinity"]

Integer-like numbers, however, are not modified. [12345678900000000000] stays the same, and so do numbers that are very, very large, e.g. "1" + ("0" * 400).

  4. Unneeded escapes are removed

{"a":"B\'"} becomes {"a":"B'"}. Escaping the ' character is not needed here.
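
To illustrate the difference between the two strategies, here is a minimal Python sketch. It is not the Spark or plugin code: `copy_range` and `parse_and_reserialize` are hypothetical names, and Python's `json` module is stricter than Spark's parser (it rejects single-quoted strings and the `\'` escape), so a `\u0027` escape stands in for the unneeded-escape case.

```python
import json


def copy_range(raw: str) -> str:
    """GPU-like behavior: return the matched characters untouched."""
    return raw


def parse_and_reserialize(raw: str) -> str:
    """CPU-like behavior: parse, then write the value back out compactly."""
    # separators=(",", ":") drops the spaces json.dumps adds by default
    return json.dumps(json.loads(raw), separators=(",", ":"))


spaced = '{ "a" : "A" }'
escaped = r'{"a":"B\u0027"}'  # \u0027 is an escaped ' character

print(copy_range(spaced))              # { "a" : "A" }
print(parse_and_reserialize(spaced))   # {"a":"A"}
print(parse_and_reserialize(escaped))  # {"a":"B'"}
```

Because the second function rebuilds the string from parsed data, whitespace and unneeded escapes disappear as a side effect, while the first keeps whatever the input contained.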

That said, we don't need to worry about normalizing nulls, since null is the only spelling allowed, or booleans, because true and false are the only values supported.

@GaryShen2008
Collaborator

Hi @SurajAralihalli, as I confirmed with Chong, this issue is assigned to you.

@res-life
Collaborator

This commit will solve item 1, item 2, and item 4.
Item 3 (numbers are normalized) still needs to be fixed.

@thirtiseven
Collaborator

thirtiseven commented Mar 20, 2024

For floating point numbers, I think the rule is the same as double-to-string in Java. If so, we can reuse ftos_converter in JNI to get an almost bit-for-bit match with Spark. But first we need to convert the string to a double; a string-to-double kernel is also available in JNI.
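
The two-step idea above (string to double, then double back to string) can be sketched in Python as follows. This is only an approximation under the assumption that Spark's output follows Java's `Double.toString` rules: decimal notation for magnitudes in [1e-3, 1e7) and `d.dddE±n` otherwise. The function names are hypothetical, the code is not `ftos_converter`, and some JDK versions may print different trailing digits for a few values.

```python
import math
from decimal import Decimal


def java_style_double_to_string(d: float) -> str:
    """Approximate Java's Double.toString (assumed rules, not the JNI kernel)."""
    if math.isnan(d):
        return "NaN"
    if math.isinf(d):
        return "Infinity" if d > 0 else "-Infinity"
    sign = "-" if math.copysign(1.0, d) < 0 else ""
    if d == 0:
        return sign + "0.0"
    # repr() gives the shortest round-tripping digits; Decimal exposes them
    _, digit_tuple, exp = Decimal(repr(abs(d))).as_tuple()
    digits = "".join(map(str, digit_tuple))
    point = len(digits) + exp  # number of digits before the decimal point
    digits = digits.rstrip("0") or "0"
    if 1e-3 <= abs(d) < 1e7:
        if point <= 0:
            body = "0." + "0" * (-point) + digits
        elif len(digits) <= point:
            body = digits + "0" * (point - len(digits)) + ".0"
        else:
            body = digits[:point] + "." + digits[point:]
    else:
        body = digits[0] + "." + (digits[1:] or "0") + "E" + str(point - 1)
    return sign + body


def normalize_number_token(token: str) -> str:
    """Normalize one JSON number token the way the issue describes the CPU."""
    if not any(c in token for c in ".eE"):
        return token  # integer-like tokens are left unchanged
    text = java_style_double_to_string(float(token))  # string -> double -> string
    if text.endswith("Infinity"):
        return '"' + text + '"'  # overflowed values become quoted strings
    return text


for tok in ["200.000", "12345678900000000000.0", "1E308", "-1E309"]:
    print(tok, "->", normalize_number_token(tok))
```

Integer-like tokens skip the double conversion entirely, which is why a value such as "1" followed by 400 zeros survives unchanged in this sketch.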
