
Investigate csv performance #3033

Closed
stress-tess opened this issue Mar 12, 2024 · 0 comments · Fixed by #3053
Assignees
stress-tess
Labels
performance (Performance needs improving) · User Reported (A user submitted the issue)

Comments

stress-tess (Member) commented Mar 12, 2024

A user reported that our CSV write speeds are worse than chunking up the data and using pandas. We should look into this further.

stress-tess added the performance (Performance needs improving) label Mar 12, 2024
stress-tess self-assigned this Mar 12, 2024
stress-tess changed the title from csv performance to Investigate csv performance Mar 12, 2024
stress-tess added a commit to stress-tess/arkouda that referenced this issue Mar 20, 2024
This PR (closes Bears-R-Us#3033) improves the performance of the CSV write code.

The performance increase comes from switching the inner and outer loops: instead of serially iterating over the row indices and doing an inner parallel loop over the columns, we serially iterate over the columns and use local slices / bulk communication to hopefully get less communication.

It's worth noting there is a time/memory tradeoff here. The old way only ever had one row's worth of data in a string at a time. The new way has a Chapel-native strings array which contains all the data the locale needs to write (index `i` contains all the data of the `i`th row for this file). The speedup seems well worth it since I'm under the impression that CSV is normally used for relatively smaller datasets.
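
Roughly, the new loop shape and the per-row strings array look like the toy Chapel sketch below. All the names (`colA`, `colB`, `rows`) are made up for illustration, everything runs on a single locale, and the serial loop over columns is unrolled into two passes; the real writer works on block-distributed arrays via `localSlice`/`localSubdomain` and writes to a file rather than stdout.

```chapel
// Toy, single-locale sketch of the loop interchange (illustrative names only;
// the real writer operates on block-distributed arrays and writes to a file).
const D = {0..3};
var colA: [D] int  = [i in D] (i + 1) * 10;   // stand-in for one column
var colB: [D] real = [i in D] (i + 0.5);      // stand-in for another column

// One string per row: index i accumulates every field of row i.
var rows: [D] string;

// The serial "outer loop over columns" is unrolled here: one parallel pass
// over the rows per column, appending that column's field to each row.
forall i in D do rows[i] += colA[i]:string + ",";
forall i in D do rows[i] += colB[i]:string;

// A single serial pass then emits each fully assembled row.
for r in rows do writeln(r);
```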

If this is a problem and we want to trade off some runtime for less memory usage, I *think* it should be pretty straightforward to write in chunks (but now that I say that, it definitely won't be). The idea is to take slices of size `localSubdomain.size/N`, so the strings array will only hold `(all data the locale needs to write) / N` at once.
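
A chunked variant might look something like the sketch below; `N`, the chunk arithmetic, and every name here are hypothetical, not code from this PR. Only one chunk's worth of row strings would exist at a time.

```chapel
// Hypothetical chunked variant (illustrative names; not code from this PR).
config const N = 4;                      // number of chunks: trade runtime for memory
const D = {0..99};                       // stand-in for a locale's localSubdomain
var col: [D] int = [i in D] i;           // stand-in for a column's local data

const chunkSize = (D.size + N - 1) / N;  // ceiling division so no rows are dropped
for lo in D.low..D.high by chunkSize {
  const chunk = lo..min(lo + chunkSize - 1, D.high);
  var rows: [chunk] string;              // only this chunk's rows are materialized
  forall i in chunk do rows[i] = col[i]:string;
  for r in rows do writeln(r);           // flush this chunk before building the next
}
```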
stress-tess added the User Reported (A user submitted the issue) label Mar 20, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 21, 2024
* Closes #3033: Optimize CSV write

* add const declaration

---------

Co-authored-by: Tess Hayes <stress-tess@users.noreply.github.com>