Overhaul `combine_echodata` method #1042

lsetiawan · 2023-05-15T23:53:28Z

Overview

This PR overhauls the combine_echodata functionality to use xarray's functions to concatenate the datasets. This relates to issue #976. The changes in this PR are as follows:

Remove the usage of ZarrCombine object
Remove spinning up dask client under the hood during combine
Remove the need for zarr path as combine_echodata input
Use xr.concat to combine the datasets under the hood, so monotonic values are no longer required and the data stays lossless
Update the group_paths attribute to be a tuple rather than a set, to preserve the order of paths
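
As a rough illustration of the xr.concat point in the list above, here is a minimal sketch (not the code in this PR) that concatenates two small datasets whose ping_time values overlap and are out of order. The `make_ds` helper, the `backscatter_r` variable name, and the sample values are assumptions made up for the example.

```python
import numpy as np
import xarray as xr


def make_ds(times, values):
    """Build a tiny stand-in dataset (hypothetical helper, not echopype API)."""
    return xr.Dataset(
        {"backscatter_r": ("ping_time", np.asarray(values, dtype=float))},
        coords={"ping_time": np.asarray(times, dtype="datetime64[ns]")},
    )


# The second "file" starts before the first one ends, so the combined
# ping_time is neither sorted nor unique.
ds_a = make_ds(["2023-05-15T00:00:02", "2023-05-15T00:00:03"], [1.0, 2.0])
ds_b = make_ds(["2023-05-15T00:00:01", "2023-05-15T00:00:02"], [3.0, 4.0])

# xr.concat appends along the dimension in the order given; it does not
# sort the coordinate or drop any samples.
combined = xr.concat([ds_a, ds_b], dim="ping_time")

n_pings = combined.sizes["ping_time"]
is_sorted = bool(combined.indexes["ping_time"].is_monotonic_increasing)
values = combined["backscatter_r"].values.tolist()
```

All four pings survive in input order even though the time coordinate is not monotonic, which is the sense in which the combine stays lossless.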

NOTE: Currently this works for EK60, but there are still some issues with combining attributes within the Vendor_specific group for EK80, since some of the attribute values are arrays.

I will address the issue stated above in another PR.
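
For context on why array-valued attributes are awkward to combine: `==` on NumPy arrays returns an element-wise array rather than a single bool, so a plain equality check between two files' attributes does not give a yes/no answer. The sketch below is only an illustration under that assumption, not the implementation in this PR; `attrs_equal`, `merge_attrs_keep_first`, and the `coeffs` attribute are hypothetical names. It keeps the first value seen for each key, in the spirit of the "store first value" commit listed later in this PR.

```python
import numpy as np


def attrs_equal(a, b):
    """Compare two attribute values, handling arrays as a whole."""
    if isinstance(a, np.ndarray) or isinstance(b, np.ndarray):
        return np.array_equal(a, b)
    return a == b


def merge_attrs_keep_first(attrs_list):
    """Keep the first value seen for each key; report keys whose values differ.

    Hypothetical helper for illustration only.
    """
    merged, conflicts = {}, set()
    for attrs in attrs_list:
        for key, value in attrs.items():
            if key not in merged:
                merged[key] = value
            elif not attrs_equal(merged[key], value):
                conflicts.add(key)
    return merged, sorted(conflicts)


# Made-up attributes standing in for two files' Vendor_specific attrs.
file_1 = {"sonar_model": "EK80", "coeffs": np.array([0.1, 0.2])}
file_2 = {"sonar_model": "EK80", "coeffs": np.array([0.1, 0.3])}
merged, conflicts = merge_attrs_keep_first([file_1, file_2])
```

Here `conflicts` lists the keys that disagree across files, so the differing values could be recorded elsewhere (for example in provenance) instead of being silently dropped.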

codecov-commenter · 2023-05-16T18:54:31Z

Codecov Report

Merging #1042 (caa30cd) into dev (a09d8f3) will decrease coverage by 24.76%.
The diff coverage is 81.75%.

@@             Coverage Diff             @@
##              dev    #1042       +/-   ##
===========================================
- Coverage   80.79%   56.03%   -24.76%     
===========================================
  Files          67       18       -49     
  Lines        6086     1574     -4512     
===========================================
- Hits         4917      882     -4035     
+ Misses       1169      692      -477

Flag	Coverage Δ
unittests	`56.03% <81.75%> (-24.76%)`	⬇️

Impacted Files	Coverage Δ
echopype/utils/coding.py	`20.23% <0.00%> (-74.21%)`	⬇️
echopype/utils/io.py	`60.58% <0.00%> (-29.89%)`	⬇️
echopype/echodata/combine.py	`78.37% <95.08%> (+3.94%)`	⬆️
echopype/echodata/echodata.py	`77.35% <100.00%> (-1.65%)`	⬇️

... and 54 files with indirect coverage changes

lsetiawan · 2023-05-18T18:54:45Z

TODO

Benchmarking is needed before merging this PR into dev!
Address "Move filter coefficients and decimation factor to variables" #1044 (comment)

echopype/echodata/combine.py

aniketfadia96

LGTM

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>

lsetiawan · 2023-05-22T17:21:08Z

@leewujung This is currently failing because the Vendor-specific identical check fails for AZFP. Should AZFP have the same filter params for all files to be valid for merging?

emiliom · 2023-05-22T17:52:08Z

@leewujung This is currently failing because the Vendor-specific identical check fails for AZFP. Should AZFP have the same filter params for all files to be valid for merging?

I'll chime in, in case it helps. I believe the content of the Vendor group will be quite different for AZFP.

leewujung

Hey @lsetiawan : Thanks for this great PR! I've gone through the code and also went through a few test cases. I think there might be a potential bug in _merge_attributes and set_zarr_encodings, and something to discuss in _capture_prov_attrs regarding whether ED_GROUP should be a dimension there. All my other comments are minor. It'll be awesome once we get this merged!!

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

lsetiawan · 2023-06-02T20:29:53Z

Thank you both @emiliom and @leewujung for your helpful comments. I've integrated the changes needed from the comments. Please give it another look and let me know what you think! Thanks again 😄

leewujung

Hey @lsetiawan : Thanks for putting in the new tests and the revised chunking schemes. I like the flexibility they provide. We can look into giving users rechunking options later, since that would involve an additional input argument, though I think we can probably leave the rechunking to the Sv stage, where it makes sense to do it along with the regridding that brings everything onto the same timebase.

I just have a question about the pre-optimized chunk size of the combined ds: how does it relate to the chunk sizes of the individual ds (those that get combined)? I'll ask that on Slack.

leewujung · 2023-06-05T15:08:37Z

@lsetiawan put together this great gist
https://nbviewer.org/urls/gist.githubusercontent.com/lsetiawan/fcfe141579d05dacc1a8c0f5a82cbcec/raw/412ee4e4776af447ff3b78499621883c6747b810/ChunkingGist.ipynb
that illustrates what happened with the chunk size when the individual datasets were of different sizes and the importance of ensuring even chunk size for zarr.

My question was about what cell 13 and the text explanation showed: what does xarray use for the combined chunk when the datasets to be combined are of different lengths (before explicit rechunking)? Does it use the biggest one, the smallest one, or something in between? This example shows that the smallest one is selected, but I wonder what the rules are, or whether it is always the smallest.
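
To make the "even chunk size for zarr" point concrete, here is a small pure-Python sketch. It is not the code from the gist or from this PR, and it does not answer which rule xarray applies. It assumes each input dataset arrives as a single chunk along the concatenation dimension, that concatenation keeps those per-input chunks, and that Zarr wants one chunk size per dimension with only the final chunk allowed to be smaller. The function names and ping counts are made up.

```python
def concatenated_chunks(file_lengths):
    """Chunks along the concat dimension if each input is a single chunk."""
    return tuple(file_lengths)


def is_uniform(chunks):
    """True if all chunks but the last are equal and the last is no larger."""
    if len(chunks) <= 1:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[:-1]) and chunks[-1] <= first


def uniform_chunks(total, chunk_size):
    """Split `total` into equal chunks, with a smaller remainder at the end."""
    full, rest = divmod(total, chunk_size)
    return (chunk_size,) * full + ((rest,) if rest else ())


lengths = [1000, 400, 750]  # made-up ping counts per file
before = concatenated_chunks(lengths)
after = uniform_chunks(sum(lengths), chunk_size=500)
```

With these made-up lengths, `before` is irregular while `after` is regular, which is why an explicit rechunk (or chunk optimization when setting encodings) is needed before writing.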

We can investigate this more in the coming weeks as we handle more data, and more fun for testing the efficiency too! 😀

To make things easier to find in this large PR (since GitHub hides this one by default), the link below points to the benchmarking comment:
#1042 (comment)

Thanks @lsetiawan for this awesome PR!! I will merge this now.

* Initialize combine_echodata overhaul
* Remove print statement on test
* Update multi combine test and fix empty prov dims
* Modify group_paths to return tuple instead of set
* Move globar var
* Modify combine for provenance of attributes
* Modify echodata test to be a tuple
* Add more comments and docstrings
* Remove unneeded functions
* Remove extra filename setting in dataframe
* Update echopype/echodata/combine.py
  Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
* Add vendor specific group checking
* Update check_filter_params for group having ds_append_dims
* Fix how encodings are set for data chunks
* Apply suggestions from code review
  Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
* Apply suggestions from code review
  Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
* Rename echodatas
* Remove reference to zarr_path
* Fix ref to check_echodata_inputs
* Add comments to _merge_attributes
* Remove EK80 if statement
* Move echopype group to attribute within provenance
* Rename _check_filter_params and fix bugs appending multi combined
* Remove checking preferred_chunks
* Add chunk optimization during encoding determination
* Add another line of docstring to test
* Update attributes merging to store first value

---------

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

Initialize combine_echodata overhaul

90adeb9

lsetiawan self-assigned this May 15, 2023

lsetiawan changed the title ~~Initialize combine_echodata overhaul~~ Overhaul combine_echodata method May 15, 2023

lsetiawan added 2 commits May 15, 2023 16:55

Remove print statement on test

be6a7ca

Update multi combine test and fix empty prov dims

fd9c1ec

lsetiawan added 5 commits May 17, 2023 16:54

Modify group_paths to return tuple instead of set

9111feb

Move globar var

0d42b41

Modify combine for provenance of attributes

63f9912

Modify echodata test to be a tuple

5471077

Add more comments and docstrings

df6bcc7

lsetiawan marked this pull request as ready for review May 18, 2023 18:46

lsetiawan requested review from aniketfadia96 and leewujung May 18, 2023 18:46

aniketfadia96 reviewed May 18, 2023

echopype/echodata/combine.py

aniketfadia96 reviewed May 18, 2023

echopype/echodata/combine.py (outdated)

lsetiawan commented May 18, 2023

echopype/echodata/combine.py (outdated)

lsetiawan added 2 commits May 18, 2023 13:43

Remove unneeded functions

a608630

Remove extra filename setting in dataframe

ad3cdac

aniketfadia96 reviewed May 18, 2023

echopype/echodata/combine.py (outdated)

aniketfadia96 reviewed May 18, 2023

echopype/echodata/combine.py (outdated)

aniketfadia96 approved these changes May 18, 2023

Update echopype/echodata/combine.py

5a3c2a7

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>

lsetiawan mentioned this pull request May 19, 2023

Skip attribute comparison for Top-level #1035

Closed

lsetiawan added 3 commits May 19, 2023 15:06

Merge branch 'dev' into overhaul

944fc2c

Add vendor specific group checking

5b6bb73

Merge branch 'dev' into overhaul

2273c1e

leewujung reviewed May 30, 2023

lsetiawan and others added 13 commits June 1, 2023 10:01

Apply suggestions from code review

51529e8

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

Apply suggestions from code review

51840d9

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

Rename echodatas

0d53545

Remove reference to zarr_path

9d67034

Fix ref to check_echodata_inputs

2609ef2

Add comments to _merge_attributes

7b7a2b2

Remove EK80 if statement

6c2167e

Move echopype group to attribute within provenance

957a139

Rename _check_filter_params and fix bugs appending multi combined

c41bbc6

Remove checking preferred_chunks

164055a

Add chunk optimization during encoding determination

ba1131f

Add another line of docstring to test

b54ea47

Update attributes merging to store first value

caa30cd

leewujung approved these changes Jun 2, 2023

leewujung merged commit 7a49cc7 into OSOceanAcoustics:dev Jun 5, 2023

lsetiawan deleted the overhaul branch June 5, 2023 16:53

lsetiawan mentioned this pull request Jun 22, 2023

Dask vs Prefect-Dask OSOceanAcoustics/echodataflow#17

Closed

leewujung mentioned this pull request Jul 31, 2023

Small fixes for BB and splitbeam angle handling #1105

Merged

3 tasks

emiliom mentioned this pull request Aug 15, 2023

Remove zarr_combine.py and duplicate/old ping_time machinery #1122

Closed
