
Investigate csv performance #3033

Closed
stress-tess opened this issue Mar 12, 2024 · 0 comments · Fixed by #3053
Assignees
stress-tess
Labels
performance (Performance needs improving) · User Reported (A user submitted the issue)

Comments

stress-tess (Member) commented Mar 12, 2024

A user reported that our CSV write speeds are worse than chunking up the data and using pandas. We should look into this further.

stress-tess added the performance (Performance needs improving) label Mar 12, 2024
stress-tess self-assigned this Mar 12, 2024
stress-tess changed the title from csv performance to Investigate csv performance Mar 12, 2024
stress-tess added a commit to stress-tess/arkouda that referenced this issue Mar 20, 2024
This PR (closes Bears-R-Us#3033) improves the performance of the CSV write code.

The performance increase comes from switching the inner and outer loops: instead of serially iterating over the row indices and doing an inner parallel loop over the columns, we serially iterate over the columns and use local slices / bulk communication to hopefully get less communication.

It's worth noting there is a time/memory tradeoff here. The old way only ever had one row's worth of data in a string at a time. The new way has a Chapel-native strings array which contains all the data the locale needs to write (index `i` contains all the data of the `i`th row for this file). The speedup seems well worth it since I'm under the impression that CSV is normally used for relatively smaller datasets.
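
Roughly, the new loop shape and the per-row strings array look like the toy Chapel sketch below. All the names (`colA`, `colB`, `rows`) are made up for illustration, everything runs on a single locale, and the serial loop over columns is unrolled into two passes; the real writer works on block-distributed arrays via `localSlice`/`localSubdomain` and writes to a file rather than stdout.

```chapel
// Toy, single-locale sketch of the loop interchange (illustrative names only;
// the real writer operates on block-distributed arrays and writes to a file).
const D = {0..3};
var colA: [D] int  = [i in D] (i + 1) * 10;   // stand-in for one column
var colB: [D] real = [i in D] (i + 0.5);      // stand-in for another column

// One string per row: index i accumulates every field of row i.
var rows: [D] string;

// The serial "outer loop over columns" is unrolled here: one parallel pass
// over the rows per column, appending that column's field to each row.
forall i in D do rows[i] += colA[i]:string + ",";
forall i in D do rows[i] += colB[i]:string;

// A single serial pass then emits each fully assembled row.
for r in rows do writeln(r);
```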

If this is a problem and we want to trade off some runtime for less memory usage, I *think* it should be pretty straightforward to write in chunks (but now that I say that, it definitely won't be). The idea is to take slices of size `localSubdomain.size/N`, so the strings array will only hold `(all data the locale needs to write) / N` at once.
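
A chunked variant might look something like the sketch below; `N`, the chunk arithmetic, and every name here are hypothetical, not code from this PR. Only one chunk's worth of row strings would exist at a time.

```chapel
// Hypothetical chunked variant (illustrative names; not code from this PR).
config const N = 4;                      // number of chunks: trade runtime for memory
const D = {0..99};                       // stand-in for a locale's localSubdomain
var col: [D] int = [i in D] i;           // stand-in for a column's local data

const chunkSize = (D.size + N - 1) / N;  // ceiling division so no rows are dropped
for lo in D.low..D.high by chunkSize {
  const chunk = lo..min(lo + chunkSize - 1, D.high);
  var rows: [chunk] string;              // only this chunk's rows are materialized
  forall i in chunk do rows[i] = col[i]:string;
  for r in rows do writeln(r);           // flush this chunk before building the next
}
```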
stress-tess added the User Reported (A user submitted the issue) label Mar 20, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 21, 2024
* Closes #3033: Optimize CSV write

* add const declaration

---------

Co-authored-by: Tess Hayes <stress-tess@users.noreply.github.com>