
Regression testing updates #2459

Merged: 15 commits merged into ReactionMechanismGenerator:main on Jun 8, 2023

Conversation

@rwest (Member) commented Jun 7, 2023

Motivation or Problem

I suspect regression tests are often not looked at because the results are buried in the logs.

Description of Changes

The biggest change is that regression tests now actually report a failure if a model changes significantly: if the core or edge changes, or if an observable in the simulation changes significantly.
Many changes to RMG are deliberate improvements and ought to change things, so sometimes we will want to accept a change even though these "regression" tests fail. But that should be a deliberate act of an admin approving the pull request despite the regression, not just everyone ignoring the regression tests because nobody checks the logs. If there are too many failures and we end up with a bottleneck of approvals, or admins overriding tests too often, we can revisit the policy. At the very least, these changes give us the framework to detect regressions during testing.

Other changes include

  • the results from the matrix build jobs are stored separately
  • failures of regression tests are reported in the job step summary (see the sketch after this list)
  • only the official main ubuntu run is "blessed" as the stable job. All others are dynamic for comparison (including the 'main' branch on forks)
  • other small changes to the CI file.
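
For illustration, here is a minimal sketch of how a regression failure might be surfaced in the job step summary. The step names, the compare-regression.sh helper, and the result directories are assumptions for illustration only, not the actual contents of this PR's CI.yml:

```yaml
# Hypothetical sketch only: compare-regression.sh, the step names, and the result
# directories are illustrative stand-ins, not the real CI.yml from this PR.
- name: Regression tests
  id: regression
  run: |
    # Record pass/fail as a step output instead of failing immediately.
    if ./compare-regression.sh stable_regression_results dynamic_regression_results; then
      echo "status=pass" >> "$GITHUB_OUTPUT"
    else
      echo "status=fail" >> "$GITHUB_OUTPUT"
    fi

- name: Report regression results in the job step summary
  if: always()
  run: |
    # Text appended to $GITHUB_STEP_SUMMARY shows up on the job's summary page,
    # so reviewers can see regressions without digging through the logs.
    echo "## Regression tests: ${{ steps.regression.outputs.status }}" >> "$GITHUB_STEP_SUMMARY"
    if [ "${{ steps.regression.outputs.status }}" = "fail" ]; then
      echo "The core/edge or simulated observables changed significantly compared to the reference model." >> "$GITHUB_STEP_SUMMARY"
      exit 1  # surface the regression as a failed check
    fi
```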

Testing

Tested it in the CI workflow.

Reviewer Tips

I tried to keep each commit self-contained, with descriptive commit messages, and rebased a few times to merge in fixups.

.github/workflows/CI.yml: review comment (resolved)
@codecov (bot) commented Jun 7, 2023

Codecov Report

Merging #2459 (e44a4ee) into main (34d5d87) will decrease coverage by 0.03%.
The diff coverage is 37.50%.

```diff
@@            Coverage Diff             @@
##             main    #2459      +/-   ##
==========================================
- Coverage   48.13%   48.10%   -0.03%
==========================================
  Files         110      110
  Lines       30629    30627       -2
  Branches     7989     7989
==========================================
- Hits        14743    14734       -9
- Misses      14355    14364       +9
+ Partials     1531     1529       -2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| rmgpy/tools/observablesregression.py | 8.21% <0.00%> (+0.21%) ⬆️ |
| rmgpy/tools/regression.py | 30.23% <42.85%> (+0.47%) ⬆️ |

... and 3 files with indirect coverage changes


@rwest rwest force-pushed the regression branch 3 times, most recently from 17cb2ed to 887e287, June 8, 2023 02:13
@rwest rwest marked this pull request as ready for review June 8, 2023 02:18
@rwest rwest requested a review from hwpang June 8, 2023 02:22
@rwest rwest enabled auto-merge June 8, 2023 02:24
@rwest (Member, Author) commented Jun 8, 2023

I think this is working as intended.

A problem is that the CI has not been working properly for a while, and the last successful upload it's finding is from May 18:

==> (found) Run ID: 5012006801
==> (found) Run date: 2023-05-18T08:20:16Z

so the regressions are run against that. Some differences are detected (demonstrating that these tests work?)

I'm not 100% sure this PR will fix that uploading issue (it may be that the scheduled runs don't have the expected metadata to count as reference cases), but it's hard to know until we push this (or some test commits) to main.

@rwest rwest disabled auto-merge June 8, 2023 13:26
rwest added 13 commits June 8, 2023 09:46
Before it would upload the test folder on _any_ failure.
Now it's just if the regression tests fail.
Checking recent successful runs, this takes 20-45 minutes, so 120 minutes should be a generous upper bound. Any more than that and it has hung forever and should be killed.
This changes it so that only the ubuntu runner on the official main branch
is the reference case. It also stores that result in an environment variable
so we only need to evaluate the logic once.

For regression testing we need to determine if we're the reference case
(env.REFERENCE_JOB=='true') or the comparison case (env.REFERENCE_JOB=='false')

Because the value becomes a string, 'false' is then read as a
non-empty string and evaluates as true,
so we have to do a string comparison rather than negation.
i.e. instead of ${{ !( env.REFERENCE_JOB ) }}
     we must use ${{ env.REFERENCE_JOB == 'false' }}
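
A minimal sketch of that pattern (the expression used to pick the reference job here is a plausible guess, not necessarily the exact condition in CI.yml):

```yaml
# Illustrative sketch: the real REFERENCE_JOB condition in CI.yml may differ.
env:
  # The expression evaluates to a boolean, but storing it in an env variable
  # turns it into the string 'true' or 'false'.
  REFERENCE_JOB: ${{ matrix.os == 'ubuntu-latest' && github.ref == 'refs/heads/main' && github.repository == 'ReactionMechanismGenerator/RMG-Py' }}

steps:
  - name: Upload results as reference
    if: ${{ env.REFERENCE_JOB == 'true' }}  # string comparison
    run: echo "acting as the reference (stable) job"

  - name: Compare to reference results
    # Not ${{ !( env.REFERENCE_JOB ) }} -- the string 'false' is non-empty and
    # therefore truthy, so that negation would never let this step run.
    if: ${{ env.REFERENCE_JOB == 'false' }}
    run: echo "acting as a dynamic (comparison) job"
```
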
Artifacts overwrite each other if you don't specify the name uniquely,
and if multiple jobs write at once you can get corrupted files (see the sketch below).
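
For example, something along these lines gives each matrix job its own artifact (the artifact name and path are illustrative, not necessarily what this PR uses):

```yaml
# Illustrative only: a per-matrix-job artifact name stops parallel jobs from
# overwriting (and possibly corrupting) each other's uploaded results.
- name: Upload regression results
  uses: actions/upload-artifact@v3
  with:
    name: regression_results_${{ matrix.os }}  # unique name per matrix job
    path: test/regression
```
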
This change should let the CI actually notice whether the tests pass or not.
This will allow us to easily see if there's a difference while
using the script on the command line (or in the CI workflow)
…sion tests.

Now the regression tests compare the core and edge, not just the observables
from the simulations.
Could be helpful to know code coverage even in cases when the
regression tests are failing - sometimes (hopefully often!) a
"regression" is an improvement, and a "fail" is thus a success.
even though we have git logs
Mostly changing whitespace and blank lines so the logs are
easier to read.
@JacksonBurns (Contributor) commented:

> I think this is working as intended.
>
> A problem is that the CI has not been working properly for a while, and the last successful upload it's finding is from May 18:
>
> ==> (found) Run ID: 5012006801
> ==> (found) Run date: 2023-05-18T08:20:16Z
>
> so the regressions are run against that. Some differences are detected (demonstrating that these tests work?)
>
> I'm not 100% sure this PR will fix that uploading issue (it may be that the scheduled runs don't have the expected metadata to count as reference cases), but it's hard to know until we push this (or some test commits) to main.

As it turns out, this is a known bug in dawidd6/action-download-artifact@v2 (see this issue). Can you change that uses line in the action to `uses: dsnopek/action-download-artifact@91dda23aa09c68860977dd0ed11d93c0ed3795e7` (the patch that someone has proposed, but which has not yet been merged)?
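
That is, roughly this change in .github/workflows/CI.yml (the step name is illustrative; only the uses line comes from the suggestion above):

```yaml
- name: Retrieve last stable regression results  # illustrative step name
  # was: uses: dawidd6/action-download-artifact@v2
  uses: dsnopek/action-download-artifact@91dda23aa09c68860977dd0ed11d93c0ed3795e7
```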

rwest added 2 commits June 8, 2023 12:25
Otherwise, if it fails (e.g. the julia step hangs and times out), that automatically cancels the ubuntu-latest run, which was running along fine.
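
The commit message doesn't show the exact change, but the usual way to keep one failed matrix job from cancelling its siblings is the matrix fail-fast setting; a hedged guess at the relevant snippet (this PR may have solved it differently):

```yaml
# Guess at the relevant setting: with fail-fast disabled, a failed or timed-out
# matrix job no longer cancels the other jobs in the matrix. The os list is illustrative.
strategy:
  fail-fast: false
  matrix:
    os: [macos-latest, ubuntu-latest]
```
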
We're using a patched version of the action-download-artifact action
to try to make sure we get the latest results
before comparing the regression.
@rwest (Member, Author) commented Jun 8, 2023

I switched to the patched download-artifact action you suggested and turned on ensure_latest, but it still found one from 2023-05-18T08:20:16Z. I think that might actually be the most recent, because if you check, a lot of recent runs on the main branch have been skipping the "Upload Results as Reference" step.

I suggest that, once someone approves this, we merge it to main and see if it uploads a new regression result as reference - including the new RMS tests that were just merged?

@JacksonBurns (Contributor) commented:

> I switched to the patched download-artifact action you suggested and turned on ensure_latest, but it still found one from 2023-05-18T08:20:16Z. I think that might actually be the most recent, because if you check, a lot of recent runs on the main branch have been skipping the "Upload Results as Reference" step.

Ah, the upload results step has a syntax error in the if statement. This PR fixes that.

> I suggest that, once someone approves this, we merge it to main and see if it uploads a new regression result as reference - including the new RMS tests that were just merged?

Agreed.

@rwest rwest requested a review from JacksonBurns June 8, 2023 20:23
@JacksonBurns (Contributor) left a review comment:

Thanks for these updates - LGTM.

@rwest rwest merged commit 7234fb6 into ReactionMechanismGenerator:main Jun 8, 2023
2 of 7 checks passed
.github/workflows/CI.yml:

```diff
@@ -181,63 +187,81 @@ jobs:
name: stable_regression_results
path: stable_regression_results
search_artifacts: true # retrieves the last run result, either scheduled daily or on push to main
ensure_latest: true # ensures that the latest run is retrieved
```
@rwest (Member, Author) commented:

@JacksonBurns do you know how this is meant to be used? It's reporting `Warning: Unexpected input(s) 'ensure_latest', valid inputs are ['github_token', 'workflow', 'workflow_conclusion', 'repo', 'pr', 'commit', 'branch', 'event', 'run_id', 'run_number', 'name', 'name_is_regexp', 'path', 'check_artifacts', 'search_artifacts', 'skip_unpack', 'dry_run', 'if_no_artifact_found']` (but anyway, it's using a newer artifact now).

@JacksonBurns (Contributor) replied:

I'm not sure why it's doing that, but it is presumably something we can blame on having used a random version of a forked action. We could try removing this line and see if this action still works?

@rwest rwest deleted the regression branch June 9, 2023 13:13
@rwest (Member, Author) commented Jun 9, 2023

The overnight scheduled job uploaded a new reference result 🥳 , and the CI test on matt's PR fetched them for comparison 🥳 .

Unfortunately there seems to be a bug causing a crash when comparing some of the results (see #2446 (comment)), and quite a few of the regression tests have different cores and edges -- is this to be expected from random fluctuations?

@JacksonBurns (Contributor) replied:

> The overnight scheduled job uploaded a new reference result 🥳, and the CI test on matt's PR fetched them for comparison 🥳.

🎉

> Unfortunately there seems to be a bug causing a crash when comparing some of the results (see #2446 (comment)), and quite a few of the regression tests have different cores and edges -- is this to be expected from random fluctuations?

I can't comment on the science behind this, but I will say that what constituted a 'failure' in the old testing scheme was never established either. The output from the regression tests was just left for the user to read. The change you made in 1a424ef makes sense, but I think it's up to us now to decide what constitutes a failure.
