This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

investigation group idea: how to validate new features #24

ChristineStawitz-NOAA opened this issue Dec 2, 2021 · 10 comments

@ChristineStawitz-NOAA
Contributor

This came up in discussions of the software design spec #20: we want to avoid bloat, so we should only add features that are a documented best practice. We need some principles for what constitutes a best practice. Some ideas:

  1. A research study (ideally more than one) has shown this to be the most accurate approach.
  2. The feature is implemented in a tactical stock assessment modeling platform and is used in more than one real-world assessment.
  3. The feature follows sound statistical principles.
@ChristineStawitz-NOAA ChristineStawitz-NOAA added the question Further information is requested label Dec 2, 2021
@ChristineStawitz-NOAA ChristineStawitz-NOAA self-assigned this Dec 2, 2021
@Rick-Methot-NOAA
Collaborator

It is good to see attention to this issue. I think the criteria as described so far may not be strong enough to prevent bloat. (3) is quite weak. (1) is the strongest, especially if the study is explicitly designed to compare against other plausible methods.

Another consideration is the formal NOAA research-to-operations (R2O) transition track:
https://www.noaa.gov/organization/administration/nao-216-105b-policy-on-research-and-development-transitions

@Andrea-Havron-NOAA
Collaborator

I think we can define specific statistical criteria that need to be met to make (3) more robust (e.g., a new feature needs to show that all parameters are estimable and identifiable with minimal bias).
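
As a rough illustration of that kind of criterion, the sketch below simulates data from a toy logistic selectivity curve, refits it many times, and checks parameter bias and Hessian conditioning. The model, parameter values, and thresholds are illustrative assumptions, not part of the FIMS design; a real check would substitute the actual feature and estimator.

```python
# Illustrative simulate/refit self-check: are the feature's parameters
# recovered with minimal bias, and is the (approximate) Hessian well conditioned?
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
ages = np.arange(1.0, 21.0)
true_pars = np.array([8.0, 1.5])             # a50, slope (toy values)

def selectivity(pars, ages):
    a50, slope = pars
    return 1.0 / (1.0 + np.exp(-(ages - a50) / slope))

def nll(pars, obs, sigma=0.1):
    return 0.5 * np.sum(((obs - selectivity(pars, ages)) / sigma) ** 2)

estimates, conditions = [], []
for _ in range(200):                          # simulation replicates
    obs = selectivity(true_pars, ages) + rng.normal(0.0, 0.1, ages.size)
    fit = minimize(nll, x0=[6.0, 1.0], args=(obs,), method="BFGS")
    if fit.success:                           # keep converged fits only
        estimates.append(fit.x)
        conditions.append(np.linalg.cond(fit.hess_inv))

estimates = np.asarray(estimates)
rel_bias = (estimates.mean(axis=0) - true_pars) / true_pars
print("relative bias:", rel_bias)             # e.g. require |bias| < 0.05
print("worst Hessian condition number:", max(conditions))
```

An acceptance rule could then require, say, relative bias below a few percent and a well-conditioned Hessian across all converged replicates.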

@timjmiller

timjmiller commented Jan 26, 2022 via email

@Cole-Monnahan-NOAA

@timjmiller but then wouldn't that mean the feature is validated, if only unrelated features break?

As a concrete example, I tried self-testing the double normal selectivity in SS using Simulation Based Calibration here, and it failed really badly. This means the parameterization is fundamentally incompatible with integration, at least for some configurations. I suspect this would be true for MLEs too, i.e., there would be large biases in the parameters. So, presuming it fails the validation criteria, would we not include it in FIMS? I see a big gray area there. For me, the validation steps are to test for bugs in the code, which is a slightly different issue.
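
For readers unfamiliar with the method, the sketch below shows the mechanics of Simulation Based Calibration (Talts et al. 2018) on a toy normal-mean model, with a conjugate posterior standing in for MCMC output. It is not the test that was run on the double normal selectivity, just an illustration of the rank-uniformity check.

```python
# Illustrative Simulation Based Calibration: draw truth from the prior,
# simulate data, draw from the posterior, and record the rank of the truth.
# Under a correct model/sampler the ranks are uniform.
import numpy as np

rng = np.random.default_rng(2)
n_reps, n_draws, n_obs = 500, 99, 20
prior_mu, prior_sd, obs_sd = 0.0, 1.0, 1.0
ranks = []

for _ in range(n_reps):
    theta = rng.normal(prior_mu, prior_sd)            # draw from the prior
    y = rng.normal(theta, obs_sd, n_obs)              # simulate data
    # conjugate normal posterior for the mean (stand-in for MCMC draws)
    post_var = 1.0 / (1.0 / prior_sd**2 + n_obs / obs_sd**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + y.sum() / obs_sd**2)
    draws = rng.normal(post_mean, np.sqrt(post_var), n_draws)
    ranks.append(np.sum(draws < theta))               # rank of truth among draws

hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
expected = n_reps / 10
chi2 = np.sum((hist - expected) ** 2 / expected)
print("rank histogram:", hist, "chi-square vs uniform:", round(chi2, 1))
```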

@timjmiller

timjmiller commented Jan 27, 2022 via email

@Cole-Monnahan-NOAA

@timjmiller yeah, that's what I'm thinking too: that proves it is coded right. Presumably we can set up some data-saturated situations where we've hit asymptotic land, and everything should work there too.

@Andrea-Havron-NOAA
Collaborator

@timjmiller and @Cole-Monnahan-NOAA, I agree. As the complexity of a model grows, it becomes challenging to demonstrate estimability of parameters under all possible model configurations, and I don't think this bar should be a requirement for a feature to be included in FIMS. At a bare minimum, however, a new feature should demonstrate feature-specific parameter estimability under a suite of examples in which the feature is most likely to be used. Ideally, it would also be helpful to document cases where estimability fails.

@ChristineStawitz-NOAA
Contributor Author

We might be able to leverage the ROpenSci standards for statistical software - in particular, the time series, spatial, and Bayesian sections.

@Cole-Monnahan-NOAA

Those are new to me but definitely worth investigating and considering.

@Rick-Methot-NOAA
Collaborator

I like the idea of having, and archiving, feature-specific tests. Making those tests demonstrate performance is harder.

  • In SS3, our automated testing only checks whether existing model configurations get broken by a change (a minimal sketch of this kind of check follows this list).
  • We rely on the developer to show that the feature computes what it is intended to compute (and to document that result in a GitHub comment), but we have not yet started archiving those tests; we should.
  • When we add a feature at the request of a user, we rely on that user to do the performance test as they use the feature in a publication or assessment. We have not found a way to routinely get that documentation into an SS3-specific library of use cases.
  • Building up our library of standard tests to cover all features and possible interactions of features has not occurred; I think it would take an experienced user many person-months to develop that much-expanded library. It might need hundreds, rather than the current tens, of configurations. We have not figured out a smarter plan B yet.
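
As an illustration of the kind of archived regression check described in the first bullet, the sketch below reruns stored configurations and compares key derived quantities against reference values within a tolerance. The file layout, the quantity names, and the run_model() hook are placeholders, not the actual SS3 or FIMS test harness.

```python
# Illustrative regression check: rerun an archived configuration with a
# candidate build and compare derived quantities to saved reference values.
import json
import math

TOLERANCE = 1e-4   # relative/absolute tolerance; a project-specific choice

def compare(reference: dict, candidate: dict, tol: float = TOLERANCE) -> list:
    """Return the derived quantities whose difference exceeds the tolerance."""
    failures = []
    for key, ref in reference.items():
        new = candidate.get(key)
        if new is None or not math.isclose(ref, new, rel_tol=tol, abs_tol=tol):
            failures.append((key, ref, new))
    return failures

def check_configuration(config_name: str, run_model) -> bool:
    # run_model is a hypothetical hook that runs the model for a stored
    # configuration and returns {quantity_name: value}.
    with open(f"reference/{config_name}.json") as f:   # archived expected values
        reference = json.load(f)
    candidate = run_model(config_name)
    failures = compare(reference, candidate)
    for key, ref, new in failures:
        print(f"{config_name}: {key} changed from {ref} to {new}")
    return not failures
```

Whether drift like the 1e-05-level differences described in the war story below gets flagged depends entirely on how tight the tolerance is set, which is exactly the judgment call described here.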

Here's a war story regarding how subtle a difference can be:
The hake assessment team noticed that a new SS3 version gave different correlations in the MCMC chain (ask Kelli for details);
but the new code, run from the converged parameters of the old code, gave identical derived quantities;
we traced it to the new version taking 3 more iterations (out of ~450) before stopping;
differences in parameters and derived quantities were at the 1e-05 level and passed all of our filters for testing new versions;
it came down to a commit that re-ordered some operations; the assumption is that this affected the gradients subtly, which slightly changed the path to convergence.
Our conclusion was to move on with the new code.
