Fix performance regression in from_json #10306

Merged
merged 5 commits into NVIDIA:branch-24.02 from the fix-from-json-perf2 branch
Jan 29, 2024

@andygrove andygrove commented Jan 27, 2024

Closes #10301

Old Approach

  • joinStrings with newline separator
  • stringConcat to add one more newline at the end; this operated on a column containing a single string, so it could not be parallelized

New Approach

  • stringConcat to append newline to each entry
  • joinStrings with empty string as separator

This fixes the performance regression. On the benchmark I have been using, we are now slightly faster than the CPU (instead of 4x slower).
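
To make the difference concrete, here is a minimal sketch of the two approaches using plain Scala collections rather than cuDF columns. The object and method names are illustrative and are not taken from the plugin; the real code operates on GPU columns, where the per-row append in the new approach is independent for every row.

```scala
// Plain-Scala model of the two strategies; no cuDF or GPU is involved.
// JoinLinesSketch, oldApproach and newApproach are hypothetical names.
object JoinLinesSketch {

  // Old approach: join the rows with "\n", then append one final "\n"
  // to the single joined string (one big string, so nothing to parallelize).
  def oldApproach(rows: Seq[String]): String = {
    val joined = rows.mkString("\n")
    joined + "\n"
  }

  // New approach: append "\n" to every row (independent per-row work),
  // then join with an empty separator.
  def newApproach(rows: Seq[String]): String = {
    val withNewline = rows.map(_ + "\n")
    withNewline.mkString("")
  }
}
```

For non-empty input both functions return the same string. In this plain-Scala model they differ for an empty input (the old version yields a single newline, the new one an empty string); that is a property of the sketch and says nothing about how the cuDF code handles that case.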

@andygrove andygrove self-assigned this Jan 27, 2024
@andygrove andygrove added the performance label (A performance related task/issue) Jan 27, 2024
@andygrove andygrove changed the title WIP: Fix performance regression in from_json Fix performance regression in from_json Jan 27, 2024
@andygrove andygrove added the bug label (Something isn't working) Jan 27, 2024
@andygrove

build

@andygrove andygrove marked this pull request as ready for review January 28, 2024 00:01
// join all the JSON lines into one string
val joined = withResource(withNewline) { _ =>
  withResource(Scalar.fromString("")) { emptyString =>
    withNewline.joinStrings(emptyString, emptyRow)
  }
}

Note that passing emptyRow as the naRep parameter here is redundant, since we already replaced nulls with emptyRow earlier in this code.
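
A rough illustration of why the argument has no effect, modeled with plain Scala rather than cuDF: nullable rows are represented as Option[String], and the joinStrings below is a hypothetical stand-in that substitutes naRep for missing rows. The value "{}" used for emptyRow is an assumption made for the example.

```scala
// Plain-Scala model; NaRepSketch and its methods are hypothetical names,
// not the cuDF API.
object NaRepSketch {

  // Models a join that substitutes naRep for null (None) rows.
  def joinStrings(rows: Seq[Option[String]], separator: String, naRep: String): String =
    rows.map(_.getOrElse(naRep)).mkString(separator)

  // Models the earlier step that replaces null rows with a fixed value.
  def replaceNulls(rows: Seq[Option[String]], replacement: String): Seq[Option[String]] =
    rows.map(r => Some(r.getOrElse(replacement)))

  // Models appending a newline to each (now non-null) row.
  def appendNewline(rows: Seq[Option[String]]): Seq[Option[String]] =
    rows.map(_.map(_ + "\n"))

  def main(args: Array[String]): Unit = {
    val emptyRow = "{}" // assumed value, for illustration only
    val input = Seq(Some("{\"a\":1}"), None, Some("{\"a\":2}"))
    val withNewline = appendNewline(replaceNulls(input, emptyRow))

    // No None rows remain, so the naRep argument is never consulted.
    val a = joinStrings(withNewline, "", emptyRow)
    val b = joinStrings(withNewline, "", "never used")
    println(a == b) // prints true
  }
}
```

Because the null replacement runs first, any value could be passed as naRep without changing the joined string in this model.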

@andygrove andygrove merged commit bc22bf8 into NVIDIA:branch-24.02 Jan 29, 2024
40 checks passed
@andygrove andygrove deleted the fix-from-json-perf2 branch January 29, 2024 14:53
Development

Successfully merging this pull request may close these issues.

[FEA] Improve performance of from_json
3 participants