This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

investigation group idea: how to validate new features #24

ChristineStawitz-NOAA opened this issue Dec 2, 2021 · 10 comments

@ChristineStawitz-NOAA
Contributor

This came up in discussions of the software design spec #20: we want to avoid bloat, so we should only add features that are a documented best practice. We need some principles for what constitutes a best practice. Some ideas:

  1. A research study (ideally more than one) has shown this to be the most accurate approach.
  2. The feature is implemented in a tactical stock assessment modeling platform and is used in more than one real-world assessment.
  3. The feature follows sound statistical principles.
@ChristineStawitz-NOAA ChristineStawitz-NOAA added the question Further information is requested label Dec 2, 2021
@ChristineStawitz-NOAA ChristineStawitz-NOAA self-assigned this Dec 2, 2021
@Rick-Methot-NOAA
Collaborator

It is good to see attention to this issue. I think the criteria as described so far may not be strong enough to prevent bloat. (3) is quite weak. (1) is the strongest, especially if the study is explicitly designed to compare against other plausible methods.

Another consideration is the formal NOAA research-to-operations (R2O) transition track:
https://www.noaa.gov/organization/administration/nao-216-105b-policy-on-research-and-development-transitions

@Andrea-Havron-NOAA
Collaborator

I think we can define specific statistical criteria that need to be met to make (3) more robust (e.g., a new feature needs to show that all parameters are estimable and identifiable with minimal bias).
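
As a rough illustration of that kind of criterion, the sketch below simulates data from a toy logistic selectivity curve, refits it many times, and checks parameter bias and Hessian conditioning. The model, parameter values, and thresholds are illustrative assumptions, not part of the FIMS design; a real check would substitute the actual feature and estimator.

```python
# Illustrative simulate/refit self-check: are the feature's parameters
# recovered with minimal bias, and is the (approximate) Hessian well conditioned?
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
ages = np.arange(1.0, 21.0)
true_pars = np.array([8.0, 1.5])             # a50, slope (toy values)

def selectivity(pars, ages):
    a50, slope = pars
    return 1.0 / (1.0 + np.exp(-(ages - a50) / slope))

def nll(pars, obs, sigma=0.1):
    return 0.5 * np.sum(((obs - selectivity(pars, ages)) / sigma) ** 2)

estimates, conditions = [], []
for _ in range(200):                          # simulation replicates
    obs = selectivity(true_pars, ages) + rng.normal(0.0, 0.1, ages.size)
    fit = minimize(nll, x0=[6.0, 1.0], args=(obs,), method="BFGS")
    if fit.success:                           # keep converged fits only
        estimates.append(fit.x)
        conditions.append(np.linalg.cond(fit.hess_inv))

estimates = np.asarray(estimates)
rel_bias = (estimates.mean(axis=0) - true_pars) / true_pars
print("relative bias:", rel_bias)             # e.g. require |bias| < 0.05
print("worst Hessian condition number:", max(conditions))
```

An acceptance rule could then require, say, relative bias below a few percent and a well-conditioned Hessian across all converged replicates.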

@timjmiller

timjmiller commented Jan 26, 2022 via email

@Cole-Monnahan-NOAA

@timjmiller but then wouldn't that mean the feature is validated, if only unrelated features break?

As a concrete example, I tried self-testing the double normal selectivity in SS using Simulation Based Calibration here, and it failed really badly. This means the parameterization is fundamentally incompatible with integration, at least for some configurations. I suspect this would be true for MLEs too, i.e., there would be large biases in the parameters. So, presuming it fails the validation criteria, would we not include it in FIMS? I see a big gray area there. For me, the validation steps are to test for bugs in the code, which is a slightly different issue.
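
For readers unfamiliar with the method, the sketch below shows the mechanics of Simulation Based Calibration (Talts et al. 2018) on a toy normal-mean model, with a conjugate posterior standing in for MCMC output. It is not the test that was run on the double normal selectivity, just an illustration of the rank-uniformity check.

```python
# Illustrative Simulation Based Calibration: draw truth from the prior,
# simulate data, draw from the posterior, and record the rank of the truth.
# Under a correct model/sampler the ranks are uniform.
import numpy as np

rng = np.random.default_rng(2)
n_reps, n_draws, n_obs = 500, 99, 20
prior_mu, prior_sd, obs_sd = 0.0, 1.0, 1.0
ranks = []

for _ in range(n_reps):
    theta = rng.normal(prior_mu, prior_sd)            # draw from the prior
    y = rng.normal(theta, obs_sd, n_obs)              # simulate data
    # conjugate normal posterior for the mean (stand-in for MCMC draws)
    post_var = 1.0 / (1.0 / prior_sd**2 + n_obs / obs_sd**2)
    post_mean = post_var * (prior_mu / prior_sd**2 + y.sum() / obs_sd**2)
    draws = rng.normal(post_mean, np.sqrt(post_var), n_draws)
    ranks.append(np.sum(draws < theta))               # rank of truth among draws

hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
expected = n_reps / 10
chi2 = np.sum((hist - expected) ** 2 / expected)
print("rank histogram:", hist, "chi-square vs uniform:", round(chi2, 1))
```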

@timjmiller

timjmiller commented Jan 27, 2022 via email

@Cole-Monnahan-NOAA

@timjmiller yeah, that's what I'm thinking too: that proves it is coded right. Presumably we can set up some data-saturated situations where we've hit asymptotic land, and everything should work there too.

@Andrea-Havron-NOAA
Collaborator

@timjmiller and @Cole-Monnahan-NOAA, I agree. As the complexity of a model grows, it becomes challenging to demonstrate estimability of parameters under all possible model configurations, and I don't think this bar should be a requirement for a feature to be included in FIMS. At a bare minimum, however, a new feature should demonstrate feature-specific parameter estimability under a suite of examples in which the feature is most likely to be used. Ideally, it would also be helpful to document cases where estimability fails.

@ChristineStawitz-NOAA
Contributor Author

We might be able to leverage the ROpenSci standards for statistical software - in particular, the time series, spatial, and Bayesian sections.

@Cole-Monnahan-NOAA

Those are new to me but definitely worth investigating and considering.

@Rick-Methot-NOAA
Collaborator

I like the idea of having, and archiving, feature-specific tests. Making those tests demonstrate performance is harder.

  • In SS3, our automated testing only checks whether existing model configurations get broken by a change (a minimal sketch of this kind of check follows this list).
  • We rely on the developer to show that the feature computes what it is intended to compute (and to document that result in a GitHub comment), but we have not yet started archiving those tests; we should.
  • When we add a feature at the request of a user, we rely on that user to do the performance test as they use the feature in a publication or assessment. We have not found a way to routinely get that documentation into an SS3-specific library of use cases.
  • Building up our library of standard tests to cover all features and possible interactions of features has not occurred; I think it would take an experienced user many person-months to develop that much-expanded library. It might need hundreds, rather than the current tens, of configurations. We have not figured out a smarter plan B yet.
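
As an illustration of the kind of archived regression check described in the first bullet, the sketch below reruns stored configurations and compares key derived quantities against reference values within a tolerance. The file layout, the quantity names, and the run_model() hook are placeholders, not the actual SS3 or FIMS test harness.

```python
# Illustrative regression check: rerun an archived configuration with a
# candidate build and compare derived quantities to saved reference values.
import json
import math

TOLERANCE = 1e-4   # relative/absolute tolerance; a project-specific choice

def compare(reference: dict, candidate: dict, tol: float = TOLERANCE) -> list:
    """Return the derived quantities whose difference exceeds the tolerance."""
    failures = []
    for key, ref in reference.items():
        new = candidate.get(key)
        if new is None or not math.isclose(ref, new, rel_tol=tol, abs_tol=tol):
            failures.append((key, ref, new))
    return failures

def check_configuration(config_name: str, run_model) -> bool:
    # run_model is a hypothetical hook that runs the model for a stored
    # configuration and returns {quantity_name: value}.
    with open(f"reference/{config_name}.json") as f:   # archived expected values
        reference = json.load(f)
    candidate = run_model(config_name)
    failures = compare(reference, candidate)
    for key, ref, new in failures:
        print(f"{config_name}: {key} changed from {ref} to {new}")
    return not failures
```

Whether drift like the 1e-05-level differences described in the war story below gets flagged depends entirely on how tight the tolerance is set, which is exactly the judgment call described here.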

Here's a war story regarding how subtle a difference can be:
The hake assessment team noticed that a new SS3 version gave different correlations in the MCMC chain (ask Kelli for details);
but the new code, run from the converged parameters of the old code, gave identical derived quantities;
we traced it to the new version taking 3 more iterations (out of ~450) before stopping;
differences in parameters and derived quantities were at the 1e-05 level and passed all of our filters for testing new versions;
it came down to a commit that re-ordered some operations; the assumption is that this affected the gradients subtly, which slightly changed the path to convergence.
Our conclusion was to move on with the new code.
