Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul combine_echodata method #1042

Merged
merged 29 commits into from
Jun 5, 2023
Merged

Conversation

lsetiawan
Copy link
Member

@lsetiawan lsetiawan commented May 15, 2023

Overview

This PR aims to overhaul the combine_echodata functionality to be in favor of using xarray's functions to concat the datasets. This relates to issue #976. The changes that happens in this PR are as follows:

  • Remove the usage of ZarrCombine object
  • Remove spinning up dask client under the hood during combine
  • Remove the need for zarr path as combine_echodata input
  • Utilizes xr.concat to combine the datasets under the hood so it doesn't require any monotonic values, keeping the data lossless
  • Updates group_paths attribute to be a tuple rather than set to keep order of paths.

NOTE: Currently this works for EK60, but there are still some issues with attributes combining within the Vendor_specific group for EK80 since some of the attribute values are arrays.

In another PR, I will address the issue as stated above.

@lsetiawan lsetiawan self-assigned this May 15, 2023
@lsetiawan lsetiawan changed the title Initialize combine_echodata overhaul Overhaul combine_echodata method May 15, 2023
@codecov-commenter
Copy link

codecov-commenter commented May 16, 2023

Codecov Report

Merging #1042 (caa30cd) into dev (a09d8f3) will decrease coverage by 24.76%.
The diff coverage is 81.75%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@             Coverage Diff             @@
##              dev    #1042       +/-   ##
===========================================
- Coverage   80.79%   56.03%   -24.76%     
===========================================
  Files          67       18       -49     
  Lines        6086     1574     -4512     
===========================================
- Hits         4917      882     -4035     
+ Misses       1169      692      -477     
Flag Coverage Δ
unittests 56.03% <81.75%> (-24.76%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
echopype/utils/coding.py 20.23% <0.00%> (-74.21%) ⬇️
echopype/utils/io.py 60.58% <0.00%> (-29.89%) ⬇️
echopype/echodata/combine.py 78.37% <95.08%> (+3.94%) ⬆️
echopype/echodata/echodata.py 77.35% <100.00%> (-1.65%) ⬇️

... and 54 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@lsetiawan lsetiawan marked this pull request as ready for review May 18, 2023 18:46
@lsetiawan
Copy link
Member Author

lsetiawan commented May 18, 2023

TODO

Copy link
Contributor

@aniketfadia96 aniketfadia96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
@lsetiawan
Copy link
Member Author

@leewujung This is currently failing because the Vendor specific identical check fails for AZFP. Should AZFP have same filter params for all files to be valid for merging?

@emiliom
Copy link
Collaborator

emiliom commented May 22, 2023

@leewujung This is currently failing because the Vendor specific identical check fails for AZFP. Should AZFP have same filter params for all files to be valid for merging?

I'll chime in, in case it helps. I believe the content of the Vendor group will be quite different for AZFP.

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lsetiawan : Thanks for this great PR! I've gone through the code and also went through a few test cases. I think there might be a potential bug in _merge_attributes and set_zarr_encodings , and something to discuss in _capture_prov_attrs regarding whether ED_GROUP should be a dimension there. All my other comments are minor. It'll be awesome once we get this merged!!

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lsetiawan : Thanks for this great PR! I've gone through the code and also went through a few test cases. I think there might be a potential bug in _merge_attributes and set_zarr_encodings , and something to discuss in _capture_prov_attrs regarding whether ED_GROUP should be a dimension there. All my other comments are minor. It'll be awesome once we get this merged!!

@lsetiawan
Copy link
Member Author

Thank you both @emiliom and @leewujung for your helpful comments. I've integrated the changes needed from the comments. Please give it another look and let me know what you think! Thanks again 😄

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lsetiawan : Thanks for putting in the new tests and the revised chunking schemes. I like the flexibility it provides, and we can see about allowing users rechunking options later since that would involve an additional input argument -- though I think we can probably leave the rechunking to the Sv stage since it probably makes sense to have that along with the regridding to bring everything onto the same timebase.

I just have a question about the pre-optimized chunk size of the combined ds -- how is that related to the chunk size of the individual ds (those that get combined), but I'll ask that on Slack.

@leewujung
Copy link
Member

leewujung commented Jun 5, 2023

@lsetiawan put together this great gist
https://nbviewer.org/urls/gist.githubusercontent.com/lsetiawan/fcfe141579d05dacc1a8c0f5a82cbcec/raw/412ee4e4776af447ff3b78499621883c6747b810/ChunkingGist.ipynb
that illustrates what happened with the chunk size when the individual datasets were of different sizes and the importance of ensuring even chunk size for zarr.

My question was what cell 13 and the text explanation showed -- what does xarray use for the combined chunk when the datasets to be combined are of different length (before explicit rechunking) -- does it use the biggest one, the smallest one, or somewhere in between. This example shows the selection of the smallest is one, but I wonder what the rules are, or is that always the smallest.

We can investigate this more in the coming weeks as we handle more data, and more fun for testing the efficiency too! 😀

To make things easier to find in this large PR (since GitHub hides this one by default), below links to the benchmarking comment:
#1042 (comment)

Thanks @lsetiawan for this awesome PR!! I will merge this now.

@leewujung leewujung merged commit 7a49cc7 into OSOceanAcoustics:dev Jun 5, 2023
@lsetiawan lsetiawan deleted the overhaul branch June 5, 2023 16:53
lsetiawan added a commit to lsetiawan/echopype that referenced this pull request Jul 28, 2023
* Initialize combine_echodata overhaul

* Remove print statement on test

* Update multi combine test and fix empty prov dims

* Modify group_paths to return tuple instead of set

* Move globar var

* Modify combine for provenance of attributes

* Modify echodata test to be a tuple

* Add more comments and docstrings

* Remove unneeded functions

* Remove extra filename setting in dataframe

* Update echopype/echodata/combine.py

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>

* Add vendor specific group checking

* Update check_filter_params for group having ds_append_dims

* Fix how encodings are set for data chunks

* Apply suggestions from code review

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

* Apply suggestions from code review

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

* Rename echodatas

* Remove reference to zarr_path

* Fix ref to check_echodata_inputs

* Add comments to _merge_attributes

* Remove EK80 if statement

* Move echopype group to attribute within provenance

* Rename _check_filter_params and fix bugs appending multi combined

* Remove checking preferred_chunks

* Add chunk optimization during encoding determination

* Add another line of docstring to test

* Update attributes merging to store first value

---------

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants