
Continued issues with Reading Large SegArrays #2329

Closed
Ethan-DeBandi99 opened this issue Mar 31, 2023 · 3 comments · Fixed by #2330
Assignees: Ethan-DeBandi99
Labels: bug (Something isn't working) · File IO (Arkouda file IO capabilities) · important (high priority) · In Progress (Work on ticket is in progress / ticket is actively being worked) · performance (Performance needs improving)

Comments

Ethan-DeBandi99 (Contributor) commented Mar 31, 2023

A user reports that the issue in #2263 still appears to be present in v2023.03.24.

I suspect something else is causing an additional slowdown; we initially thought the culprit was fixupSegBoundaries, but that function has definitely been improved and now completes very quickly. I will start looking at what runs after it in the workflow.

Ethan-DeBandi99 added the bug (Something isn't working), performance (Performance needs improving), important (high priority), and File IO (Arkouda file IO capabilities) labels Mar 31, 2023
Ethan-DeBandi99 self-assigned this Mar 31, 2023
Ethan-DeBandi99 (Contributor, Author) commented:
I tracked a few things down this morning. First, and seemingly the largest issue (particularly when running multi-locale with a large dataset): the code was configured to write the Segments and Values arrays out to the log after they are read from the file. Removing both of these log writes produced roughly a 15x speedup on a single locale, and it allowed the same problem size to run with 5 locales.
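The Arkouda server is written in Chapel and the fix here simply removes the log writes, but the underlying cost is worth illustrating: stringifying a huge array just to build a log message is expensive even when the message is ultimately discarded. The Python sketch below (a generic illustration, not Arkouda code; `BigArray` is a hypothetical stand-in for the Segments/Values arrays) shows how eager f-string formatting pays that cost unconditionally, while lazy %-style arguments only pay it if the record is actually emitted:

```python
import logging

class BigArray:
    """Hypothetical stand-in for a large Segments/Values array:
    producing its string form is the expensive part."""
    def __init__(self):
        self.times_formatted = 0
    def __str__(self):
        self.times_formatted += 1  # pretend this walks 10**8 elements
        return "<segments: 10**8 elements>"

logging.basicConfig(level=logging.INFO)  # DEBUG records are discarded
log = logging.getLogger("segarray_io")

segments = BigArray()

# Eager: the f-string stringifies the array before logging ever checks the level.
log.debug(f"segments after read: {segments}")
eager_count = segments.times_formatted  # __str__ ran even though DEBUG is off

# Lazy: %-style arguments are only formatted if the record would be emitted,
# so at INFO level the expensive __str__ never runs.
log.debug("segments after read: %s", segments)
lazy_count = segments.times_formatted  # unchanged: formatting was skipped
```

Dropping the log statements entirely, as done in #2330, avoids even the lazy path's per-call overhead on the hot read path.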

Additionally, it appears that creating the SegArray object takes roughly as long when reading the object from a file as when creating it from scratch. We need to look into whether there is anything we can do to speed this up.

I know the initial issue was filed around DataFrame, but everything seems to boil down to SegArrays alone, and the behavior does not change when they are in a file with multiple columns. Currently, once the time to create the SegArray object is accounted for, read and write performance are fairly similar. I will be looking into ways to make SegArray creation faster.
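One way to back the claim above is to time the file-read path and the from-scratch construction path separately, so the SegArray construction cost can be isolated from the I/O cost. The sketch below is a generic timing harness; the two callables are hypothetical placeholders, not Arkouda API:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical placeholders standing in for the two SegArray paths under test.
def read_segarray_from_file():
    return "segarray-from-file"

def build_segarray_from_scratch():
    return "segarray-from-scratch"

read_result, read_seconds = timed(read_segarray_from_file)
build_result, build_seconds = timed(build_segarray_from_scratch)
# If read_seconds is close to build_seconds for large inputs, construction
# (not file I/O) dominates, matching the observation in this comment.
```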

Ethan-DeBandi99 (Contributor, Author) commented:

After reviewing the SegArray creation code, I am not sure there is much we can do to make it faster. I will definitely continue to look into it, but I am fairly certain the logging was the main problem: removing the 2 unnecessary log messages allows me to complete reading a problem size of 10**8 from 5 locales on my Mac, whereas with the log messages it would not complete after running for roughly 45+ minutes.

Ethan-DeBandi99 added the In Progress (Work on ticket is in progress / ticket is actively being worked) label Mar 31, 2023
21771 commented Apr 17, 2023

Thanks! Look forward to trying this release.
