Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize compute_MVBS and improve its efficiency for Dask input #878

Merged
merged 29 commits into from
Nov 23, 2022

Conversation

b-reyes
Copy link
Contributor

@b-reyes b-reyes commented Nov 4, 2022

This PR addresses #662 and #845 by directly calculating MVBS in an efficient manner. This new approach allows us to apply compute_MVBS on echo_range data sizes that vary with ping_time and Sv values that are Dask arrays. In addition to the new core routines that compute MVBS, unit tests for the calculation of MVBS were also completed. These tests construct mock Sv values with known MVBS values.

Please note that this PR is currently a draft. The following items need to be completed:

  • Finish documenting the function get_MVBS_along_channels in preprocess/api.py
  • Replace the previous ds_MVBS calculation in compute_MVBS with the new and improved get_MVBS_along_channels function
  • Remove the restriction that echo_range must be the same for each ping_time
  • Make sure that the new compute_MVBS function works well with the 19 files in the OOI eclipse notebook
  • Possibly replace test_compute_MVBS with a more general test using the ideas in test_bin_and_mean_2d

@codecov-commenter
Copy link

codecov-commenter commented Nov 5, 2022

Codecov Report

Merging #878 (32093d1) into dev (757d6cd) will increase coverage by 12.51%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##              dev     #878       +/-   ##
===========================================
+ Coverage   82.21%   94.73%   +12.51%     
===========================================
  Files          53        1       -52     
  Lines        5022       38     -4984     
===========================================
- Hits         4129       36     -4093     
+ Misses        893        2      -891     
Flag Coverage Δ
unittests 94.73% <ø> (+12.51%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
echopype/preprocess/api.py
echopype/convert/parsed_to_zarr_ek60.py
echopype/convert/set_groups_ek80.py
echopype/convert/parse_ad2cp.py
echopype/calibrate/ecs_parser.py
echopype/echodata/widgets/widgets.py
echopype/core.py
echopype/utils/prov.py
echopype/convert/utils/ek_raw_parsers.py
echopype/utils/coding.py
... and 42 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@b-reyes b-reyes marked this pull request as ready for review November 9, 2022 17:30
@b-reyes
Copy link
Contributor Author

b-reyes commented Nov 9, 2022

@leewujung this PR is ready for a review. As we discussed yesterday, there are a couple of functions where docstrings/comments are needed. I have tried my best to mark these with TODO statements. I have decided not to incorporate the toggle for comprehensive/less comprehensive checks for echo_range into the compute_MVBS function. Instead, I have only included this option in bin_and_mean_2d.

After further discussion with @leewujung, we have decided to move the last bullet point above (Possibly replace test_compute_MVBS with a more general test using the ideas in test_bin_and_mean_2d) to a different PR.

@b-reyes
Copy link
Contributor Author

b-reyes commented Nov 11, 2022

@leewujung I have went through and documented/cleaned up the code I previously mentioned. This code is now in a state that I am happy with, please go ahead and review it.

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@b-reyes : I am done with my first round of review for now. I recommended some refactoring that should be trivial to do. I have some questions on why for loops instead of delayed functions are used in a couple of places when binned averages are computed. Is there a performance concern here if the bins are too small (i.e., contain too little number of values?). If that's case, I am also wondering about delaying bin_and_mean_2d for each channel instead.

I have not reviewed the tests and also the content of the 2 is_grouping_needed_* functions, but I can do that in the next round. Wanted to return this to you first so that I don't hold up the process. Thanks.

@leewujung
Copy link
Member

Ok @b-reyes and I resolved the main comments above and have the 2 items:

  • @leewujung : to add an issue for later enhancement re using either the temp zarr store approach or the scatter approach for the intermediate step from er_means to ping time bins
  • @b-reyes : to move most of the MVBS functions to a separate script (i.e. out of api.py)

echopype/preprocess/api.py Outdated Show resolved Hide resolved
@b-reyes
Copy link
Contributor Author

b-reyes commented Nov 23, 2022

@leewujung I have responded to or addressed your comments. This PR is ready for another round of review.

Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@b-reyes : Everything looks good! Once you added/changed the other two thing’s in the comments, please feel free to merge this. Great work!!

…ho_range to group_dig_er_bin_mean_echo_range
@b-reyes
Copy link
Contributor Author

b-reyes commented Nov 23, 2022

@leewujung I have addressed those comments. It looks like everything is in order here and the tests are passing. I will go ahead and merge it in shortly.

@b-reyes b-reyes merged commit 40e6982 into OSOceanAcoustics:dev Nov 23, 2022
@b-reyes b-reyes deleted the new-comp-mvbs branch December 14, 2022 23:46
lsetiawan pushed a commit to lsetiawan/echopype that referenced this pull request Feb 27, 2023
…SOceanAcoustics#878)

* establish core function for binning and computing the mean of a 2D array

* finish structure of bin_and_mean_2d and create simple tests for it

* modify bin_and_mean_2d so that it produces the appropriate values and add get_MVBS_along_channels, which computes MVBS for each channel

* investigate more efficient dask graph

* investigate chunking of MVBS computation

* add new approach that computes mean binned echorange values and then bins and means ping_time

* remove commented out lines and add a function for bin and meaning the times

* start creating a bin and mean 2d method that accounts for different echo_range values for each ping time

* replace previous bin and mean method with general method, construct mock sv values, and test general method against mock values

* document and clean up code associated with creating the mock data set for MVBS testing

* allow Sv and echo_range arrays to be dask arrays in mock data

* add pytest fixture to test_bin_and_mean_2d

* begin documenting all functions needed to compute MVBS along a channel

* finish documenting get_MVBS_along_channels function

* incorporate new bin and mean method into compute_MVBS and go back to old way of assuming echo_range is the same throughout ping_time

* make the MVBS output values of new method be equivalent to the old method by changing the bins produced by the resample method

* add code section that groups different echo_range values, computes them, and then constructs the appropriate er_means array

* add two functions (less and more comprehensive) that determine if grouping echo_range with respect to ping_time is needed

* modify get_unequal_rows to account for dask input and add bool input to bin_and_mean_2d

* add seed to test_bin_and_mean_2d and randomly choose a ping bin to be completely NaN

* remove old functions associated with compute_MVBS and use the new compute_MVBS methods only

* comment and add docstrings to previously undocumented functions required for compute_MVBS

* clean up bin_and_mean_2d code by creating a function for grouping, binning, and calculating the mean of an array with respect to echo_range

* catch warnings associated with taking the nanmean of an array filled with NaNs

* provide more specific docstring for arr in bin_and_mean_echo_range

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>

* refactor and rename code associated with determining if grouping of echo_range is needed

* move all core routines needed for compute_MVBS into the python script mvbs.py

* add note to docstring of bin_and_mean_2d and rename group_bin_mean_echo_range to group_dig_er_bin_mean_echo_range

Co-authored-by: Wu-Jung Lee <leewujung@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants