
Add functionality that directly writes variables to a temporary zarr store #774

Merged
merged 42 commits into dev on Aug 11, 2022

Conversation

b-reyes
Contributor

@b-reyes b-reyes commented Aug 9, 2022

This PR addresses issues #489, #407, and #408. The primary concern in these issues is the large memory expansion that occurs when data is converted and organized into an EchoData object. To address this, this PR implements the following items:

  1. Adds the parameter offload_to_zarr to open_raw, which can be one of True, False, or ‘auto’.
  • If ‘auto’ is selected, a heuristically driven method determines whether variables with a large memory footprint should be written to a temporary zarr store.
  • If True, the heuristic method is skipped and the data variables with a large memory footprint are written directly to a temporary zarr store.
  • If False, echopype behaves as it previously did. This is very important because it allows our tests to run as they normally did, and additionally we can use it to verify that the True and ‘auto’ options produce the expected output.
  2. Adds the parameter max_zarr_mb to open_raw, the maximum number of MB each zarr chunk should hold when offloading variables with a large memory footprint to a temporary zarr store.
  3. If variables are written to a temporary zarr store, they are lazy-loaded as dask arrays, which are then assigned to the appropriate groups within set_groups.

Note: the above items were only implemented for the echosounders EK60, ES70, EK80, ES80, and EA640.
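The max_zarr_mb cap in item 2 amounts to picking a chunk size along the ping dimension so that each zarr chunk stays under the requested memory budget. A minimal sketch of that arithmetic (the function name and signature are hypothetical, not echopype's actual implementation):

```python
import numpy as np

def chunk_rows(shape, dtype, max_zarr_mb=100):
    """Number of leading-axis rows per chunk so each chunk stays
    under max_zarr_mb. Illustrative only, not echopype's actual code."""
    bytes_per_row = np.prod(shape[1:], dtype=np.int64) * np.dtype(dtype).itemsize
    max_bytes = max_zarr_mb * 1024 * 1024
    # At least one row per chunk, and never more rows than the array has.
    return int(max(1, min(shape[0], max_bytes // bytes_per_row)))

# e.g. a (10000 pings, 4000 range samples) float64 variable
print(chunk_rows((10000, 4000), np.float64, max_zarr_mb=100))
```

Whatever the exact heuristic, the key property is that chunk size scales inversely with the per-row memory footprint, so wide variables get proportionally fewer pings per chunk.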

In #489 a 725 MB OOI file and a 95 MB file (shared by @oftfrfbf) were provided. Below we provide the new memory profiling results for opening/parsing the files using open_raw and then writing the EchoData object to a zarr file using ed.to_zarr(). These results were obtained on a MacBook Pro with an Apple M1 Pro chip and 16GB of RAM.

  • 725 MB OOI file

mem_prof_new_mem_to_zarr_725MB

From the results above we see that we are able to successfully open the raw file and then write the EchoData object to a zarr file. The dip in memory usage around 28 seconds signifies that open_raw has completed and the subsequent spike in memory usage is caused by ed.to_zarr().

  • 95 MB file

mem_prof_new_mem_to_zarr_95MB

Compared with @oftfrfbf’s comment, the results above show that we no longer get a substantial increase in memory usage when creating the EchoData object, that writing the data to zarr (around 8 seconds) maintains a reasonable memory usage, and that the process is significantly faster.

@b-reyes b-reyes requested a review from lsetiawan August 9, 2022 00:21
@b-reyes
Contributor Author

b-reyes commented Aug 9, 2022

@lsetiawan, please note that this PR is currently in a draft state. Two tests are currently failing, and I have not yet implemented unit tests for the work above.

@lsetiawan
Member

@b-reyes could you resolve the branch conflict that's happening? Thanks!

@b-reyes b-reyes marked this pull request as ready for review August 10, 2022 18:10
@b-reyes
Contributor Author

b-reyes commented Aug 10, 2022

@lsetiawan this PR is ready for review. Offline @leewujung and I discussed and agreed that we should delay unit tests until after this release. We will consider the functionality introduced in this PR to be in beta until we implement these unit tests.

@codecov-commenter

codecov-commenter commented Aug 10, 2022

Codecov Report

Merging #774 (d9d06e2) into dev (96ed2fd) will decrease coverage by 18.03%.
The diff coverage is 67.29%.

@@             Coverage Diff             @@
##              dev     #774       +/-   ##
===========================================
- Coverage   82.16%   64.13%   -18.04%     
===========================================
  Files          48       52        +4     
  Lines        4244     4731      +487     
===========================================
- Hits         3487     3034      -453     
- Misses        757     1697      +940     
Flag Coverage Δ
unittests 64.13% <67.29%> (-18.04%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
echopype/echodata/api.py 83.33% <ø> (ø)
echopype/convert/parsed_to_zarr_ek80.py 17.64% <17.64%> (ø)
echopype/convert/set_groups_ek80.py 79.91% <59.09%> (-16.09%) ⬇️
echopype/convert/set_groups_base.py 82.83% <73.17%> (-7.59%) ⬇️
echopype/convert/api.py 82.46% <76.00%> (-2.32%) ⬇️
echopype/convert/parsed_to_zarr_ek60.py 77.92% <77.92%> (ø)
echopype/convert/parsed_to_zarr.py 85.29% <85.29%> (ø)
echopype/convert/parse_base.py 87.27% <96.87%> (+1.85%) ⬆️
echopype/convert/parse_ad2cp.py 91.41% <100.00%> (+0.02%) ⬆️
echopype/convert/parse_azfp.py 91.35% <100.00%> (+0.16%) ⬆️
... and 35 more


@lsetiawan
Member

Alright @b-reyes, I've run through the testing. You can find my line profiling results here: https://gist.github.com/lsetiawan/de779bf1e48c22c55b3f0616623d5225

Your offloading to zarr definitely works great, and I see that temp files don't stick around, at least when restarting after each run, so we should probably add another test later for parallel reads — future stuff. I don't think the 'auto' option worked for me, though. It seems to keep choosing offload_to_zarr=False instead of True, so something probably needs to be worked out there to predict future memory consumption.

Other than that, I really enjoyed reviewing this and I think it's probably good to go. Seems like all checks passed. Since the memory consumption is low enough for the OOI and NOAA files, and you said they're in Google Drive, could you create a quick test for these with offload_to_zarr=True? That way we know it works and doesn't crash in CI! 😄

@b-reyes
Contributor Author

b-reyes commented Aug 11, 2022

I don't think the 'auto' option worked for me. It seems to keep choosing offload_to_zarr=False instead of True, so something probably needs to be worked out there to predict future memory consumption.

Yes, I suspected there might be issues with this, as the heuristic was tuned on only my computer. As it stands, the choice of whether to offload to zarr depends on the percentage of memory that would be consumed if the variables were expanded; for example, see write_to_zarr for EK60. Currently, I am hard-coding this value to mem_mult=0.4, since that worked well for my machine. I think this may not be the best approach, as the amount of available memory can vary and downstream processes can push memory usage beyond this threshold.
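The mem_mult heuristic described above boils down to a fraction-of-total-memory check. A minimal sketch, assuming hypothetical names (this is not echopype's actual write_to_zarr signature):

```python
def should_offload(expansion_bytes, total_mem_bytes, mem_mult=0.4):
    """Heuristic sketch: offload to a temporary zarr store when the
    expanded variables would occupy more than mem_mult of total system
    memory. mem_mult=0.4 mirrors the hard-coded value discussed above;
    the name and signature are illustrative, not echopype's actual API."""
    return expansion_bytes > mem_mult * total_mem_bytes

# ~725 MB of expanded variables on a 16 GB machine is well under the
# 40% threshold, so this heuristic would choose not to offload.
print(should_offload(725 * 1024**2, 16 * 1024**3))
```

This also illustrates the failure mode noted in the review: if the estimate of expansion_bytes is too low, or the machine has plenty of RAM, 'auto' keeps resolving to False even when offloading would be the safer choice.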

The other idea I had (which may be more appropriate) is to write out those variables whose expanded size exceeds the user-provided value max_zarr_mb. @lsetiawan, what do you think about that approach? I think it would be more stable.
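The alternative proposed here is a per-variable absolute threshold rather than a fraction of system memory. A minimal sketch under that assumption (names and signature are hypothetical):

```python
def vars_to_offload(var_sizes_bytes, max_zarr_mb=100):
    """Alternative sketch: offload exactly those variables whose
    expanded size exceeds max_zarr_mb, independent of total system
    memory. var_sizes_bytes maps variable name -> expanded size in
    bytes; illustrative only, not echopype's actual implementation."""
    limit = max_zarr_mb * 1024 * 1024
    return [name for name, size in var_sizes_bytes.items() if size > limit]

sizes = {"backscatter_r": 900 * 1024**2, "angle_alongship": 50 * 1024**2}
print(vars_to_offload(sizes, max_zarr_mb=100))
```

The appeal of this design is reproducibility: the same file triggers the same offloading decision on every machine, instead of depending on how much RAM the local host happens to have.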

Could you create a quick test for these with offload_to_zarr=True?

@leewujung asked that I hold off on this, please see the comment above. Right now the CI will not crash because the default value for offload_to_zarr is False.

@lsetiawan
Member

The other idea I had (which may be more appropriate) is to write out those variables whose expanded size exceeds the user-provided value max_zarr_mb. @lsetiawan, what do you think about that approach? I think it would be more stable.

Hmm... I thought this was only for a chunk of the data, not the whole dataset, but I guess that would work. At this point, let's hold off on improving this so we can get this beta feature out. Can you turn off the 'auto' option for now, so nobody has it for this release until we work it out more? That would probably be better than someone getting confused about why it's not working. Thanks!

@leewujung asked that I hold off on this, please see the comment above. Right now the CI will not crash because the default value for offload_to_zarr is False.

Okay. And you're not using any of the memory test files in the current tests, right? I guess for now we've manually tested it, and we know it can work, so it's probably okay to hold off.

@b-reyes
Contributor Author

b-reyes commented Aug 11, 2022

Hmm... I thought this was only for a chunk of the data, not the whole dataset, but I guess that would work. At this point, let's hold off on improving this so we can get this beta feature out. Can you turn off the 'auto' option for now, so nobody has it for this release until we work it out more? That would probably be better than someone getting confused about why it's not working. Thanks!

I agree. For now, it is probably better to turn this option off.

And you're not using any of the memory test files in the current tests, right?

Correct, I did put the 95 MB file in the Google Drive, but I am never interacting with it.

@leewujung
Member

And you're not using any of the memory test files in the current tests, right?

Correct, I did put the 95 MB file in the Google Drive, but I am never interacting with it.

Yes, I did ask @b-reyes to hold off on the tests. IMHO the 95 MB file is too large; what we need are unit tests for the functions that the convert mechanisms use, rather than tests that actually convert the large files.

@lsetiawan
Member

lsetiawan commented Aug 11, 2022

IMHO the 95 MB file is too large; what we need are unit tests for the functions that the convert mechanisms use, rather than tests that actually convert the large files.

I added a simple test anyway since Brandon put it in the Google Drive 😛 haha, hope it works!

Member

@lsetiawan lsetiawan left a comment


All looks good to me. Thanks @b-reyes for working on this and implementing changes as I reviewed. I have tested the functionality manually and added a small CI test with the problematic NOAA data. The result is great: with offload_to_zarr=True, only minimal memory expansion occurred! Everything seems to work as expected at this state, and it's ready for merging with a beta stamp. 😄

@lsetiawan
Member

I'll wait for CI to finish to merge this to dev.

Status: Done
4 participants