
Continued issues with Reading Large SegArrays #2329

Closed
Ethan-DeBandi99 opened this issue Mar 31, 2023 · 3 comments · Fixed by #2330
Assignees: Ethan-DeBandi99
Labels: bug (Something isn't working) · File IO (Arkouda file IO capabilities) · important (high priority) · In Progress (Work on ticket is in progress / ticket is actively being worked) · performance (Performance needs improving)

Comments

Ethan-DeBandi99 (Contributor) commented Mar 31, 2023

A user reports that the issue in #2263 still appears to be present in v2023.03.24.

I suspect something else is causing an additional slowdown; we initially thought the culprit was fixupSegBoundaries, but that function has definitely been improved and now completes very quickly. I will start looking at what runs after it in the workflow.

Ethan-DeBandi99 added the bug (Something isn't working), performance (Performance needs improving), important (high priority), and File IO (Arkouda file IO capabilities) labels Mar 31, 2023
Ethan-DeBandi99 self-assigned this Mar 31, 2023
Ethan-DeBandi99 (Contributor, Author) commented:
I tracked a few things down this morning. First, and seemingly the largest issue (particularly when running multi-locale with a large dataset): the code was configured to write the Segments and Values arrays out to the log after they are read from the file. Removing both of these log writes produced roughly a 15x speedup on a single locale, and it allowed the same problem size to run with 5 locales.
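The Arkouda server is written in Chapel and the fix here simply removes the log writes, but the underlying cost is worth illustrating: stringifying a huge array just to build a log message is expensive even when the message is ultimately discarded. The Python sketch below (a generic illustration, not Arkouda code; `BigArray` is a hypothetical stand-in for the Segments/Values arrays) shows how eager f-string formatting pays that cost unconditionally, while lazy %-style arguments only pay it if the record is actually emitted:

```python
import logging

class BigArray:
    """Hypothetical stand-in for a large Segments/Values array:
    producing its string form is the expensive part."""
    def __init__(self):
        self.times_formatted = 0
    def __str__(self):
        self.times_formatted += 1  # pretend this walks 10**8 elements
        return "<segments: 10**8 elements>"

logging.basicConfig(level=logging.INFO)  # DEBUG records are discarded
log = logging.getLogger("segarray_io")

segments = BigArray()

# Eager: the f-string stringifies the array before logging ever checks the level.
log.debug(f"segments after read: {segments}")
eager_count = segments.times_formatted  # __str__ ran even though DEBUG is off

# Lazy: %-style arguments are only formatted if the record would be emitted,
# so at INFO level the expensive __str__ never runs.
log.debug("segments after read: %s", segments)
lazy_count = segments.times_formatted  # unchanged: formatting was skipped
```

Dropping the log statements entirely, as done in #2330, avoids even the lazy path's per-call overhead on the hot read path.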

Additionally, it appears that creating the SegArray object takes roughly as long when reading the object from a file as when creating it from scratch. We need to look into whether there is anything we can do to speed this up.

I know the initial issue was filed around DataFrame, but everything seems to boil down to SegArrays alone, and the behavior does not change when they are in a file with multiple columns. Currently, once the time to create the SegArray object is accounted for, read and write performance are fairly similar. I will be looking into ways to make SegArray creation faster.
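One way to back the claim above is to time the file-read path and the from-scratch construction path separately, so the SegArray construction cost can be isolated from the I/O cost. The sketch below is a generic timing harness; the two callables are hypothetical placeholders, not Arkouda API:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical placeholders standing in for the two SegArray paths under test.
def read_segarray_from_file():
    return "segarray-from-file"

def build_segarray_from_scratch():
    return "segarray-from-scratch"

read_result, read_seconds = timed(read_segarray_from_file)
build_result, build_seconds = timed(build_segarray_from_scratch)
# If read_seconds is close to build_seconds for large inputs, construction
# (not file I/O) dominates, matching the observation in this comment.
```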

Ethan-DeBandi99 (Contributor, Author) commented:

After reviewing the SegArray creation code, I am not sure there is much we can do to make it faster. I will definitely continue to look into it, but I am fairly certain the logging was the main problem: removing the 2 unnecessary log messages allows me to complete reading a problem size of 10**8 from 5 locales on my Mac, whereas with the log messages it would not complete after running for roughly 45+ minutes.

Ethan-DeBandi99 added the In Progress (Work on ticket is in progress / ticket is actively being worked) label Mar 31, 2023
21771 commented Apr 17, 2023

Thanks! Look forward to trying this release.
