
Fix data loss issue in combine_echodata #824

Merged · 100 commits · Oct 7, 2022

Conversation

@b-reyes (Contributor) commented Oct 1, 2022

This PR addresses #822. This is done by first creating a mapping between uniform chunks and the initial starting chunks. A Dask Lock is then assigned to each write of a starting chunk so that no two starting chunks are written to the same uniform chunk at the same time. To illustrate the approach, consider the following simplified example:

Say we have three files each with the variable 'back_r' that contain the following values (these would be the starting chunks):

file 1: back_r = [0,1,2]

file 2: back_r = [3,4,5]

file 3: back_r = [6,7]

We then want to combine all of these back_r variables into back_r_combined = [0,1,2,3,4,5,6,7] with uniform chunk size 2.

Thus, the chunks would be as follows:

chunk 1: [0,1]

chunk 2: [2,3]

chunk 3: [4,5]

chunk 4: [6,7]

For all chunks besides chunk 2, we can safely perform the writes in parallel. However, chunk 2 contains data from both file 1 and file 2, so two different processes would attempt to write to chunk 2 at the same time and data corruption would likely occur. To remedy this, we can assign a lock name to each write to a uniform chunk:

data write [0,1]  will be given lock name = "lock1"

data write [2] and [3]  will be given lock name = "lock2"

data write [4,5]  will be given lock name = "lock3"

data write [6,7]  will be given lock name = "lock4"

We then use a Dask Lock with the established lock names, which prevents two processes from writing to the same uniform chunk at the same time.
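To make this concrete, below is a minimal sketch of the locking idea, not the PR's actual code: the store name, the `ping_time` dimension name, and the pre-initialization step are assumptions, but it shows how writes that share a lock name (i.e. land in the same uniform chunk) are serialized.

```python
import numpy as np
import xarray as xr
import dask
from dask.distributed import Client, Lock

client = Client(processes=False)  # the named locks live on this scheduler

# Pre-initialize a hypothetical combined store with the full shape and
# a uniform chunk size of 2 along "ping_time" (assumed dimension name).
full = xr.Dataset({"back_r": ("ping_time", np.zeros(8))})
full.to_zarr("combined.zarr", mode="w", compute=False,
             encoding={"back_r": {"chunks": (2,)}})

# Starting chunks from the three hypothetical files.
file1, file2, file3 = np.array([0.0, 1, 2]), np.array([3.0, 4, 5]), np.array([6.0, 7])

@dask.delayed
def write_region(values, start, lock_name):
    # Writes sharing a lock name target the same uniform chunk and are
    # therefore serialized; e.g. file 1's [2] and file 2's [3] both use "lock2".
    piece = xr.Dataset({"back_r": ("ping_time", values)})
    with Lock(lock_name):
        piece.to_zarr("combined.zarr",
                      region={"ping_time": slice(start, start + len(values))})

tasks = [
    write_region(file1[:2], 0, "lock1"),  # [0, 1] -> chunk 1
    write_region(file1[2:], 2, "lock2"),  # [2]    -> chunk 2 (shared)
    write_region(file2[:1], 3, "lock2"),  # [3]    -> chunk 2 (shared)
    write_region(file2[1:], 4, "lock3"),  # [4, 5] -> chunk 3
    write_region(file3,     6, "lock4"),  # [6, 7] -> chunk 4
]
dask.compute(*tasks)
```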

b-reyes and others added 30 commits August 26, 2022 17:02
@b-reyes (Contributor, Author) commented Oct 5, 2022

@lsetiawan I agree with you that I was incorrectly using Optional for storage_options; it should always be a required variable with an empty dict for its value. I have hopefully changed all of the typing to reflect this.

I am slightly puzzled about your changes to the docstrings. Based on your comments, it appears that you do not want typing-module types in the docstrings, i.e. things like Dict, Optional, etc. If this is the case, can you explain why? I don't necessarily disagree with it, but I would like some guidelines for my future documentation of echopype.

@lsetiawan (Member) commented Oct 5, 2022

I am slightly puzzled about your changes to the docstrings. Based on your comments, it appears that you do not want typing-module types in the docstrings, i.e. things like Dict, Optional, etc. If this is the case, can you explain why? I don't necessarily disagree with it, but I would like some guidelines for my future documentation of echopype.

I do like the straight type hints; however, users who are not familiar with type hinting may be puzzled when reading them. Type hints are a relatively new feature in Python, and the primitive type names are far more familiar to most people and much more readable. Take, for example, a docstring with a parameter start_count that can take an int or a float. The type hint would be Union[int, float], which is not very intuitive; it is great for an IDE, but not for a new Python user who just wants to use the function based on the docstring. The more appropriate form would be start_count: int or float or start_count: int | float. Type hints can get cryptic very quickly, and I don't think the docstring is the right place for them.

@b-reyes (Contributor, Author) commented Oct 5, 2022

Type hints can get cryptic very quickly, and I don't think the docstring is the right place for them.

That is a fair point. I will go ahead and fix all docstrings that contain them.

To your point about user readability, I think I prefer start_count: int or float rather than start_count: int | float. Do you mind if I use the format start_count: int or float? For an optional, this would look like start_count: int or None.

@lsetiawan (Member) commented Oct 5, 2022

Do you mind if I use the format start_count: int or float?

Yeah, I think that's the best option too.

For an optional, this would look like start_count: int or None

For this, based on numpydoc convention, it should be int, optional.

See: https://numpydoc.readthedocs.io/en/latest/format.html#parameters

[Screenshot (2022-10-05): numpydoc Parameters format guidance]
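For illustration, here is a minimal sketch of a numpydoc-style docstring following that convention (the function and its parameters are hypothetical, not taken from echopype):

```python
def count_pings(data, start_count, storage_options=None):
    """
    Count pings in a dataset (hypothetical example).

    Parameters
    ----------
    data : xr.Dataset
        The dataset containing the pings.
    start_count : int or float
        Value to start counting from.
    storage_options : dict, optional
        Extra keywords passed to the storage backend.
    """
    ...
```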

@b-reyes (Contributor, Author) commented Oct 5, 2022

For this, based on numpydoc convention, it should be int, optional.

Thank you for that reference! Great, I will go with int, optional.

@lsetiawan (Member) commented:
Thanks! Sorry for all of these stylistic changes! I guess it's better cleaning them up now than later 😛

@b-reyes (Contributor, Author) commented Oct 5, 2022

Thanks! Sorry for all of these stylistic changes! I guess it's better cleaning them up now than later 😛

No worries. I agree, it is better to change them now.

@lsetiawan (Member) left a comment

After detailed testing, found here: https://nbviewer.org/gist/lsetiawan/ebb3faed65e53a3188518d62dbe0968a, I conclude that this combine_echodata implementation does not have any data loss issues and is able to combine a large amount of data with minimal impact on memory consumption (no spikes).

The example notebook that I tested converted 318 EK60 raw files from OOI in 16 min and then combined those files in about 15 min. These times may be limited by my CPU clock speed. Memory consumption on my machine never blew up, and at the end I was able to explore 133 GB of data with ease.

AWESOME WORK @b-reyes! This PR is ready for merging. Please put your testing results for the Hake data in this PR.

@b-reyes (Contributor, Author) commented Oct 7, 2022

@lsetiawan thank you very much for investigating this PR and testing out a large number of data files! I am glad to hear that memory consumption stayed steady and that the runtime for combining the files was short.

As @lsetiawan mentioned, I have also tested out Hake data. Specifically, I tested 107 files within s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60. It took roughly 7 minutes to convert the data and 5 minutes to combine them. The notebook to run this can be found here: https://nbviewer.org/gist/b-reyes/77bbede005f266509618d1430ae8f33e. Please be sure to provide a Client to combine_echodata before you run it!
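For reference, a minimal sketch of the workflow described above, assuming the echopype API around the time of this PR; the file paths and sonar model are placeholders, and the exact combine_echodata arguments (e.g. an output zarr path or storage_options) may differ between versions:

```python
from dask.distributed import Client
import echopype as ep

client = Client()  # provide a Dask Client before combining, as noted above

# Hypothetical local raw files; in the test above these came from S3.
raw_files = ["file1.raw", "file2.raw", "file3.raw"]
ed_list = [ep.open_raw(f, sonar_model="EK60") for f in raw_files]

# Combine the converted EchoData objects; additional arguments
# (such as an output path) are omitted here and may be required.
combined = ep.combine_echodata(ed_list)
```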
