Overhaul combine_echodata method #1042
Conversation
Codecov Report
@@            Coverage Diff            @@
##              dev    #1042       +/- ##
===========================================
- Coverage   80.79%   56.03%    -24.76%
===========================================
  Files          67       18        -49
  Lines        6086     1574      -4512
===========================================
- Hits         4917      882      -4035
+ Misses       1169      692       -477
... and 54 files with indirect coverage changes
TODO
LGTM
Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
@leewujung This is currently failing because the ...
I'll chime in, in case it helps. I believe the content of the Vendor group will be quite different for AZFP.
Hey @lsetiawan: Thanks for this great PR! I've gone through the code and also went through a few test cases. I think there might be a potential bug in _merge_attributes and set_zarr_encodings, and something to discuss in _capture_prov_attrs regarding whether ED_GROUP should be a dimension there. All my other comments are minor. It'll be awesome once we get this merged!!
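
For context on the _capture_prov_attrs / ED_GROUP question, here is a minimal, hypothetical sketch of the general idea (the names attrs_per_file and the variable layout are illustrative, not echopype's actual implementation): per-file group attributes are stored as variables along an echodata_filename dimension, so values that differ between files are preserved side by side rather than overwritten. Whether the group name becomes a second dimension or stays folded into the variable names is exactly the design choice under discussion.

```python
import xarray as xr

# Hypothetical per-file attribute dicts gathered from one EchoData group
attrs_per_file = [
    {"sonar_model": "EK60", "keywords": "survey A"},
    {"sonar_model": "EK60", "keywords": "survey B"},
]

# Store each attribute as a variable dimensioned by the source file,
# so differing values across files are kept rather than overwritten.
prov = xr.Dataset(
    {
        key: ("echodata_filename", [d.get(key) for d in attrs_per_file])
        for key in attrs_per_file[0]
    },
    coords={"echodata_filename": ["file0.zarr", "file1.zarr"]},
)
print(prov)
```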
Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
Thank you both @emiliom and @leewujung for your helpful comments. I've integrated the changes needed from the comments. Please give it another look and let me know what you think! Thanks again 😄
Hey @lsetiawan: Thanks for putting in the new tests and the revised chunking schemes. I like the flexibility they provide. We can look into giving users rechunking options later, since that would involve an additional input argument -- though I think we can probably leave rechunking to the Sv stage, since it makes sense to do that along with the regridding that brings everything onto the same time base.
I just have a question about the pre-optimized chunk size of the combined dataset -- how is it related to the chunk sizes of the individual datasets that get combined? I'll ask that on Slack.
@lsetiawan put together this great gist. My question was about what cell 13 and its text explanation showed: what does xarray use for the combined chunk size when the datasets being combined have different lengths (before any explicit rechunking) -- the biggest one, the smallest one, or something in between? This example shows the smallest one being selected, but I wonder what the rules are, or whether it is always the smallest. We can investigate this more in the coming weeks as we handle more data, and it will be fun for testing efficiency too! 😀

To make things easier to find in this large PR (since GitHub hides this one by default), below links to the benchmarking comment:

Thanks @lsetiawan for this awesome PR!! I will merge this now.
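
As a quick way to probe the question above, here is a small self-contained sketch (toy data, not echopype files) that concatenates two dask-backed datasets with different chunk sizes and inspects the result. In my understanding, dask appends each input's chunk layout along the concatenated dimension rather than rechunking, but the gist remains the authoritative reference for what was observed there:

```python
import dask.array as da
import xarray as xr

# Two toy datasets of different lengths and chunk sizes
ds1 = xr.Dataset({"v": ("time", da.zeros(10, chunks=5))})
ds2 = xr.Dataset({"v": ("time", da.zeros(6, chunks=3))})

combined = xr.concat([ds1, ds2], dim="time")

# Chunk layout along "time": each input's chunks are appended,
# e.g. ((5, 5, 3, 3),) for this example -- no automatic rechunking.
print(combined["v"].chunks)
```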
* Initialize combine_echodata overhaul
* Remove print statement on test
* Update multi combine test and fix empty prov dims
* Modify group_paths to return tuple instead of set
* Move global var
* Modify combine for provenance of attributes
* Modify echodata test to be a tuple
* Add more comments and docstrings
* Remove unneeded functions
* Remove extra filename setting in dataframe
* Update echopype/echodata/combine.py (Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>)
* Add vendor specific group checking
* Update check_filter_params for group having ds_append_dims
* Fix how encodings are set for data chunks
* Apply suggestions from code review (Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>)
* Apply suggestions from code review (Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>)
* Rename echodatas
* Remove reference to zarr_path
* Fix ref to check_echodata_inputs
* Add comments to _merge_attributes
* Remove EK80 if statement
* Move echopype group to attribute within provenance
* Rename _check_filter_params and fix bugs appending multi combined
* Remove checking preferred_chunks
* Add chunk optimization during encoding determination
* Add another line of docstring to test
* Update attributes merging to store first value

---------

Co-authored-by: Aniket Fadia <fadiaaniket@gmail.com>
Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
Overview
This PR overhauls the combine_echodata functionality in favor of using xarray's functions to concatenate the datasets. This relates to issue #976. The changes in this PR are as follows:

- Changes to the combine_echodata input
- Use xr.concat to combine the datasets under the hood, so no monotonic values are required, keeping the data lossless (see the sketch below)
- Change the group_paths attribute to be a tuple rather than a set, to keep the order of paths

NOTE: Currently this works for EK60, but there are still some issues with combining attributes within the Vendor_specific group for EK80, since some of the attribute values are arrays.

In another PR, I will address the issue stated above.
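
To illustrate the lossless-concatenation point, here is a toy sketch with made-up data (not the PR's actual code path): xr.concat simply stacks datasets along a dimension, so overlapping or non-monotonic coordinates across files do not cause samples to be dropped.

```python
import numpy as np
import xarray as xr

# Two toy "files" whose ping_time coordinates overlap, i.e. are not
# monotonically increasing across the pair.
ds_a = xr.Dataset(
    {"backscatter": ("ping_time", np.arange(3.0))},
    coords={"ping_time": [0, 1, 2]},
)
ds_b = xr.Dataset(
    {"backscatter": ("ping_time", np.arange(3.0))},
    coords={"ping_time": [1, 2, 3]},  # overlaps ds_a
)

combined = xr.concat([ds_a, ds_b], dim="ping_time")
print(combined.sizes)  # ping_time: 6 -- every sample retained
```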