
Add functionality that directly writes variables to a temporary zarr store #774

Merged
merged 42 commits into dev on Aug 11, 2022

Conversation

b-reyes
Contributor

@b-reyes b-reyes commented Aug 9, 2022

This PR addresses issues #489, #407, and #408. The primary concern in these issues is the large memory expansion that occurs when data is converted and organized into an EchoData object. To address this, this PR implements the following items:

  1. Adds the parameter offload_to_zarr to open_raw, which can be one of True, False, or ‘auto’.
  • If ‘auto’ is selected, a heuristically driven method determines whether variables with a large memory footprint should be written to a temporary zarr store.
  • If True, the heuristic method is skipped and the data variables with a large memory footprint are written directly to a temporary zarr store.
  • If False, echopype behaves as it previously did. This is very important because it allows our tests to run as they normally did, and additionally we can use it to verify that the True and ‘auto’ options produce the expected output.
  2. Adds the parameter max_zarr_mb to open_raw, the maximum number of MB each zarr chunk should hold when offloading variables with a large memory footprint to a temporary zarr store.
  3. If variables are written to a temporary zarr store, they are lazy-loaded as dask arrays, which are then assigned to the appropriate groups within set_groups.

Note: the above items were only implemented for the echosounders EK60, ES70, EK80, ES80, and EA640.
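The max_zarr_mb cap in item 2 amounts to picking a chunk size along the ping dimension so that each zarr chunk stays under the requested memory budget. A minimal sketch of that arithmetic (the function name and signature are hypothetical, not echopype's actual implementation):

```python
import numpy as np

def chunk_rows(shape, dtype, max_zarr_mb=100):
    """Number of leading-axis rows per chunk so each chunk stays
    under max_zarr_mb. Illustrative only, not echopype's actual code."""
    bytes_per_row = np.prod(shape[1:], dtype=np.int64) * np.dtype(dtype).itemsize
    max_bytes = max_zarr_mb * 1024 * 1024
    # At least one row per chunk, and never more rows than the array has.
    return int(max(1, min(shape[0], max_bytes // bytes_per_row)))

# e.g. a (10000 pings, 4000 range samples) float64 variable
print(chunk_rows((10000, 4000), np.float64, max_zarr_mb=100))
```

Whatever the exact heuristic, the key property is that chunk size scales inversely with the per-row memory footprint, so wide variables get proportionally fewer pings per chunk.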

In #489 a 725 MB OOI file and a 95 MB file (shared by @oftfrfbf) were provided. Below we provide the new memory profiling results for opening/parsing the files using open_raw and then writing the EchoData object to a zarr file using ed.to_zarr(). These results were obtained on a MacBook Pro with an Apple M1 Pro chip and 16GB of RAM.

  • 725 MB OOI file

mem_prof_new_mem_to_zarr_725MB

From the results above we see that we are able to successfully open the raw file and then write the EchoData object to a zarr file. The dip in memory usage around 28 seconds signifies that open_raw has completed and the subsequent spike in memory usage is caused by ed.to_zarr().

  • 95 MB file

mem_prof_new_mem_to_zarr_95MB

Compared with @oftfrfbf’s comment, the results above show that we no longer get a substantial increase in memory usage when creating the EchoData object, that writing the data to zarr (around 8 seconds) maintains a reasonable memory usage, and that the process is significantly faster.

@b-reyes b-reyes requested a review from lsetiawan August 9, 2022 00:21
@b-reyes
Contributor Author

b-reyes commented Aug 9, 2022

@lsetiawan, please note that this PR is currently in a draft state. Two tests are currently failing, and I have not yet implemented unit tests for the work above.

@lsetiawan
Member

@b-reyes could you resolve the branch conflict that's happening? Thanks!

@b-reyes b-reyes marked this pull request as ready for review August 10, 2022 18:10
@b-reyes
Contributor Author

b-reyes commented Aug 10, 2022

@lsetiawan this PR is ready for review. Offline @leewujung and I discussed and agreed that we should delay unit tests until after this release. We will consider the functionality introduced in this PR to be in beta until we implement these unit tests.

@codecov-commenter

codecov-commenter commented Aug 10, 2022

Codecov Report

Merging #774 (d9d06e2) into dev (96ed2fd) will decrease coverage by 18.03%.
The diff coverage is 67.29%.

@@             Coverage Diff             @@
##              dev     #774       +/-   ##
===========================================
- Coverage   82.16%   64.13%   -18.04%     
===========================================
  Files          48       52        +4     
  Lines        4244     4731      +487     
===========================================
- Hits         3487     3034      -453     
- Misses        757     1697      +940     
Flag Coverage Δ
unittests 64.13% <67.29%> (-18.04%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
echopype/echodata/api.py 83.33% <ø> (ø)
echopype/convert/parsed_to_zarr_ek80.py 17.64% <17.64%> (ø)
echopype/convert/set_groups_ek80.py 79.91% <59.09%> (-16.09%) ⬇️
echopype/convert/set_groups_base.py 82.83% <73.17%> (-7.59%) ⬇️
echopype/convert/api.py 82.46% <76.00%> (-2.32%) ⬇️
echopype/convert/parsed_to_zarr_ek60.py 77.92% <77.92%> (ø)
echopype/convert/parsed_to_zarr.py 85.29% <85.29%> (ø)
echopype/convert/parse_base.py 87.27% <96.87%> (+1.85%) ⬆️
echopype/convert/parse_ad2cp.py 91.41% <100.00%> (+0.02%) ⬆️
echopype/convert/parse_azfp.py 91.35% <100.00%> (+0.16%) ⬆️
... and 35 more


@lsetiawan
Member

Alright @b-reyes, I've run through the testing. You can find my line profiling results here: https://gist.github.com/lsetiawan/de779bf1e48c22c55b3f0616623d5225

Your offloading to zarr definitely works great, and I see that temp files don't stick around, at least when restarting after each run, so we should probably add another test later for parallel reads — future stuff. I don't think the 'auto' option worked for me, though. It seems to keep choosing offload_to_zarr=False instead of True, so something probably needs to be worked out there to predict future memory consumption.

Other than that, I really enjoyed reviewing this and I think it's probably good to go. Seems like all checks passed. Since the memory consumption is low enough for the OOI and NOAA files, and you said they're in Google Drive, could you create a quick test for these with offload_to_zarr=True? That way we know it works and doesn't crash in CI! 😄

@b-reyes
Contributor Author

b-reyes commented Aug 11, 2022

I don't think the 'auto' option worked for me. It seems to keep choosing offload_to_zarr=False instead of True, so something probably needs to be worked out there to predict future memory consumption.

Yes, I suspected there might be issues with this, as the heuristic was tuned on only my computer. As it stands, the choice of whether to offload to zarr depends on the percentage of memory that would be consumed if the variables were expanded; for example, see write_to_zarr for EK60. Currently, I am hard-coding this value to mem_mult=0.4, since that worked well for my machine. I think this may not be the best approach, as the amount of available memory can vary and downstream processes can push memory usage beyond this threshold.
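The mem_mult heuristic described above boils down to a fraction-of-total-memory check. A minimal sketch, assuming hypothetical names (this is not echopype's actual write_to_zarr signature):

```python
def should_offload(expansion_bytes, total_mem_bytes, mem_mult=0.4):
    """Heuristic sketch: offload to a temporary zarr store when the
    expanded variables would occupy more than mem_mult of total system
    memory. mem_mult=0.4 mirrors the hard-coded value discussed above;
    the name and signature are illustrative, not echopype's actual API."""
    return expansion_bytes > mem_mult * total_mem_bytes

# ~725 MB of expanded variables on a 16 GB machine is well under the
# 40% threshold, so this heuristic would choose not to offload.
print(should_offload(725 * 1024**2, 16 * 1024**3))
```

This also illustrates the failure mode noted in the review: if the estimate of expansion_bytes is too low, or the machine has plenty of RAM, 'auto' keeps resolving to False even when offloading would be the safer choice.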

The other idea I had (which may be more appropriate) is to write out those variables whose expanded size exceeds the user-provided value max_zarr_mb. @lsetiawan, what do you think about that approach? I think it would be more stable.
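The alternative proposed here is a per-variable absolute threshold rather than a fraction of system memory. A minimal sketch under that assumption (names and signature are hypothetical):

```python
def vars_to_offload(var_sizes_bytes, max_zarr_mb=100):
    """Alternative sketch: offload exactly those variables whose
    expanded size exceeds max_zarr_mb, independent of total system
    memory. var_sizes_bytes maps variable name -> expanded size in
    bytes; illustrative only, not echopype's actual implementation."""
    limit = max_zarr_mb * 1024 * 1024
    return [name for name, size in var_sizes_bytes.items() if size > limit]

sizes = {"backscatter_r": 900 * 1024**2, "angle_alongship": 50 * 1024**2}
print(vars_to_offload(sizes, max_zarr_mb=100))
```

The appeal of this design is reproducibility: the same file triggers the same offloading decision on every machine, instead of depending on how much RAM the local host happens to have.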

Could you create a quick test for these with offload_to_zarr=True?

@leewujung asked that I hold off on this, please see the comment above. Right now the CI will not crash because the default value for offload_to_zarr is False.

@lsetiawan
Member

The other idea I had (which may be more appropriate) is to write out those variables whose expanded size exceeds the user-provided value max_zarr_mb. @lsetiawan, what do you think about that approach? I think it would be more stable.

Hmm... I thought this was only for a chunk of the data, not the whole dataset, but I guess that would work. At this point, let's hold off on improving this so we can get this beta feature out. Can you turn off the 'auto' option for now, so nobody has it for this release until we work it out more? That would probably be better than someone getting confused about why it's not working. Thanks!

@leewujung asked that I hold off on this, please see the comment above. Right now the CI will not crash because the default value for offload_to_zarr is False.

Okay. And you're not using any of the memory test files in the current tests, right? I guess for now we've manually tested it, and we know it can work, so it's probably okay to hold off.

@b-reyes
Contributor Author

b-reyes commented Aug 11, 2022

Hmm... I thought this was only for a chunk of the data, not the whole dataset, but I guess that would work. At this point, let's hold off on improving this so we can get this beta feature out. Can you turn off the 'auto' option for now, so nobody has it for this release until we work it out more? That would probably be better than someone getting confused about why it's not working. Thanks!

I agree. For now, it is probably better to turn this option off.

And you're not using any of the memory test files in the current tests, right?

Correct, I did put the 95 MB file in the Google Drive, but I am never interacting with it.

@leewujung
Member

And you're not using any of the memory test files in the current tests, right?

Correct, I did put the 95 MB file in the Google Drive, but I am never interacting with it.

Yes, I did ask @b-reyes to hold off on the tests. IMHO the 95 MB file is too large; what we need are unit tests for the functions that the convert mechanisms use, rather than tests that actually convert the large files.

@lsetiawan
Member

lsetiawan commented Aug 11, 2022

IMHO the 95 MB file is too large; what we need are unit tests for the functions that the convert mechanisms use, rather than tests that actually convert the large files.

I added a simple test anyway since Brandon put it in the Google Drive 😛 haha, hope it works!

Member

@lsetiawan lsetiawan left a comment


All looks good to me. Thanks @b-reyes for working on this and implementing changes as I reviewed. I have tested the functionality manually and added a small CI test with the problematic NOAA data. The result is great: with offload_to_zarr=True, only minimal memory expansion occurred! Everything seems to work as expected at this state, and it's ready for merging with a beta stamp. 😄

@lsetiawan
Member

I'll wait for CI to finish to merge this to dev.

Status: Done
4 participants