
Regression testing updates #2459

Merged: 15 commits merged into ReactionMechanismGenerator:main on Jun 8, 2023

Conversation

@rwest (Member) commented Jun 7, 2023

Motivation or Problem

I suspect regression tests are often not looked at because the results are buried in the logs.

Description of Changes

The biggest change is that regression tests now actually report a failure if a model changes significantly: if the core or edge changes, or if an observable in the simulation changes significantly.
Many changes to RMG are deliberate improvements and ought to change things, so sometimes we will want to accept a change even though these "regression" tests fail. But that should be a deliberate act of an admin approving the pull request despite the regression, not just everyone ignoring the regression tests because nobody checks the logs. If there are too many failures and we end up with a bottleneck of approvals, or admins overriding tests too often, we can revisit the policy. At the very least, these changes give us the framework to detect regressions during testing.

Other changes include

  • the results from the matrix build jobs are stored separately
  • failures of regression tests are reported in the job step summary (see the sketch after this list)
  • only the official main ubuntu run is "blessed" as the stable job. All others are dynamic for comparison (including the 'main' branch on forks)
  • other small changes to the CI file.
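
For illustration, here is a minimal sketch of how a regression failure might be surfaced in the job step summary. The step names, the compare-regression.sh helper, and the result directories are assumptions for illustration only, not the actual contents of this PR's CI.yml:

```yaml
# Hypothetical sketch only: compare-regression.sh, the step names, and the result
# directories are illustrative stand-ins, not the real CI.yml from this PR.
- name: Regression tests
  id: regression
  run: |
    # Record pass/fail as a step output instead of failing immediately.
    if ./compare-regression.sh stable_regression_results dynamic_regression_results; then
      echo "status=pass" >> "$GITHUB_OUTPUT"
    else
      echo "status=fail" >> "$GITHUB_OUTPUT"
    fi

- name: Report regression results in the job step summary
  if: always()
  run: |
    # Text appended to $GITHUB_STEP_SUMMARY shows up on the job's summary page,
    # so reviewers can see regressions without digging through the logs.
    echo "## Regression tests: ${{ steps.regression.outputs.status }}" >> "$GITHUB_STEP_SUMMARY"
    if [ "${{ steps.regression.outputs.status }}" = "fail" ]; then
      echo "The core/edge or simulated observables changed significantly compared to the reference model." >> "$GITHUB_STEP_SUMMARY"
      exit 1  # surface the regression as a failed check
    fi
```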

Testing

Tested it in the CI workflow.

Reviewer Tips

I tried to keep each commit self-contained, with descriptive commit messages, and rebased a few times to merge in fixups.

.github/workflows/CI.yml: review comment (resolved)
@codecov (bot) commented Jun 7, 2023

Codecov Report

Merging #2459 (e44a4ee) into main (34d5d87) will decrease coverage by 0.03%.
The diff coverage is 37.50%.

```diff
@@            Coverage Diff             @@
##             main    #2459      +/-   ##
==========================================
- Coverage   48.13%   48.10%   -0.03%
==========================================
  Files         110      110
  Lines       30629    30627       -2
  Branches     7989     7989
==========================================
- Hits        14743    14734       -9
- Misses      14355    14364       +9
+ Partials     1531     1529       -2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| rmgpy/tools/observablesregression.py | 8.21% <0.00%> (+0.21%) ⬆️ |
| rmgpy/tools/regression.py | 30.23% <42.85%> (+0.47%) ⬆️ |

... and 3 files with indirect coverage changes


@rwest rwest force-pushed the regression branch 3 times, most recently from 17cb2ed to 887e287, June 8, 2023 02:13
@rwest rwest marked this pull request as ready for review June 8, 2023 02:18
@rwest rwest requested a review from hwpang June 8, 2023 02:22
@rwest rwest enabled auto-merge June 8, 2023 02:24
@rwest (Member, Author) commented Jun 8, 2023

I think this is working as intended.

A problem is that the CI has not been working properly for a while, and the last successful upload it's finding is from May 18:

==> (found) Run ID: 5012006801
==> (found) Run date: 2023-05-18T08:20:16Z

so the regressions are run against that. Some differences are detected (demonstrating that these tests work?)

I'm not 100% sure this PR will fix that uploading issue (it may be that the scheduled runs don't have the expected metadata to count as reference cases), but it's hard to know until we push this (or some test commits) to main.

@rwest rwest disabled auto-merge June 8, 2023 13:26
rwest added 13 commits June 8, 2023 09:46
Before it would upload the test folder on _any_ failure.
Now it's just if the regression tests fail.
Checking recent successful runs, this takes 20-45 minutes, so 120 minutes should be a generous upper bound. Any more than that and it has hung forever and should be killed.
This changes it so that only the ubuntu runner on the official main branch
is the reference case. It also stores that result in an environment variable
so we only need to evaluate the logic once.

For regression testing we need to determine if we're the reference case
(env.REFERENCE_JOB=='true') or the comparison case (env.REFERENCE_JOB=='false')

Because the value becomes a string, 'false' is then read as a
non-empty string and evaluates as true,
so we have to do a string comparison rather than negation.
i.e. instead of ${{ !( env.REFERENCE_JOB ) }}
     we must use ${{ env.REFERENCE_JOB == 'false' }}
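
A minimal sketch of that pattern (the expression used to pick the reference job here is a plausible guess, not necessarily the exact condition in CI.yml):

```yaml
# Illustrative sketch: the real REFERENCE_JOB condition in CI.yml may differ.
env:
  # The expression evaluates to a boolean, but storing it in an env variable
  # turns it into the string 'true' or 'false'.
  REFERENCE_JOB: ${{ matrix.os == 'ubuntu-latest' && github.ref == 'refs/heads/main' && github.repository == 'ReactionMechanismGenerator/RMG-Py' }}

steps:
  - name: Upload results as reference
    if: ${{ env.REFERENCE_JOB == 'true' }}  # string comparison
    run: echo "acting as the reference (stable) job"

  - name: Compare to reference results
    # Not ${{ !( env.REFERENCE_JOB ) }} -- the string 'false' is non-empty and
    # therefore truthy, so that negation would never let this step run.
    if: ${{ env.REFERENCE_JOB == 'false' }}
    run: echo "acting as a dynamic (comparison) job"
```
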
Artifacts overwrite each other if you don't specify the name uniquely,
and if multiple jobs write at once you can get corrupted files (see the sketch below).
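
For example, something along these lines gives each matrix job its own artifact (the artifact name and path are illustrative, not necessarily what this PR uses):

```yaml
# Illustrative only: a per-matrix-job artifact name stops parallel jobs from
# overwriting (and possibly corrupting) each other's uploaded results.
- name: Upload regression results
  uses: actions/upload-artifact@v3
  with:
    name: regression_results_${{ matrix.os }}  # unique name per matrix job
    path: test/regression
```
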
This change should let the CI actually notice whether the tests pass or not.
This will allow us to easily see if there's a difference while
using the script on the command line (or in the CI workflow)
…sion tests.

Now the regression tests compare the core and edge, not just the observables
from the simulations.
Could be helpful to know code coverage even in cases when the
regression tests are failing - sometimes (hopefully often!) a
"regression" is an improvement, and a "fail" is thus a success.
even though we have git logs
Mostly changing whitespace and blank lines so the logs are
easier to read.
@JacksonBurns (Contributor) commented:

> I think this is working as intended.
>
> A problem is that the CI has not been working properly for a while, and the last successful upload it's finding is from May 18:
>
> ==> (found) Run ID: 5012006801
> ==> (found) Run date: 2023-05-18T08:20:16Z
>
> so the regressions are run against that. Some differences are detected (demonstrating that these tests work?)
>
> I'm not 100% sure this PR will fix that uploading issue (it may be that the scheduled runs don't have the expected metadata to count as reference cases), but it's hard to know until we push this (or some test commits) to main.

As it turns out, this is a known bug in dawidd6/action-download-artifact@v2 (see this issue). Can you change that uses line in the action to `uses: dsnopek/action-download-artifact@91dda23aa09c68860977dd0ed11d93c0ed3795e7` (the patch that someone has proposed, but which has not yet been merged)?
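
That is, roughly this change in .github/workflows/CI.yml (the step name is illustrative; only the uses line comes from the suggestion above):

```yaml
- name: Retrieve last stable regression results  # illustrative step name
  # was: uses: dawidd6/action-download-artifact@v2
  uses: dsnopek/action-download-artifact@91dda23aa09c68860977dd0ed11d93c0ed3795e7
```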

rwest added 2 commits June 8, 2023 12:25
Otherwise, if it fails (e.g. the julia step hangs and times out), that automatically cancels the ubuntu-latest run, which was running along fine.
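
The commit message doesn't show the exact change, but the usual way to keep one failed matrix job from cancelling its siblings is the matrix fail-fast setting; a hedged guess at the relevant snippet (this PR may have solved it differently):

```yaml
# Guess at the relevant setting: with fail-fast disabled, a failed or timed-out
# matrix job no longer cancels the other jobs in the matrix. The os list is illustrative.
strategy:
  fail-fast: false
  matrix:
    os: [macos-latest, ubuntu-latest]
```
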
We're using a patched version of the action-download-artifact action
to try to make sure we get the latest results
before comparing the regression.
@rwest (Member, Author) commented Jun 8, 2023

I switched to the patched download-artifact action you suggested and turned on ensure_latest, but it still found one from 2023-05-18T08:20:16Z. I think that might actually be the most recent, because if you check, a lot of recent runs on the main branch have been skipping the "Upload Results as Reference" step.

I suggest that, once someone approves this, we merge it to main and see if it uploads a new regression result as reference - including the new RMS tests that were just merged?

@JacksonBurns (Contributor) commented:

> I switched to the patched download-artifact action you suggested and turned on ensure_latest, but it still found one from 2023-05-18T08:20:16Z. I think that might actually be the most recent, because if you check, a lot of recent runs on the main branch have been skipping the "Upload Results as Reference" step.

Ah, the upload results step has a syntax error in the if statement. This PR fixes that.

> I suggest that, once someone approves this, we merge it to main and see if it uploads a new regression result as reference - including the new RMS tests that were just merged?

Agreed.

@rwest rwest requested a review from JacksonBurns June 8, 2023 20:23
@JacksonBurns (Contributor) left a review comment:

Thanks for these updates - LGTM.

@rwest rwest merged commit 7234fb6 into ReactionMechanismGenerator:main Jun 8, 2023
2 of 7 checks passed
.github/workflows/CI.yml:

```diff
@@ -181,63 +187,81 @@ jobs:
name: stable_regression_results
path: stable_regression_results
search_artifacts: true # retrieves the last run result, either scheduled daily or on push to main
ensure_latest: true # ensures that the latest run is retrieved
```
@rwest (Member, Author) commented:

@JacksonBurns do you know how this is meant to be used? It's reporting `Warning: Unexpected input(s) 'ensure_latest', valid inputs are ['github_token', 'workflow', 'workflow_conclusion', 'repo', 'pr', 'commit', 'branch', 'event', 'run_id', 'run_number', 'name', 'name_is_regexp', 'path', 'check_artifacts', 'search_artifacts', 'skip_unpack', 'dry_run', 'if_no_artifact_found']` (but anyway, it's using a newer artifact now).

@JacksonBurns (Contributor) replied:

I'm not sure why it's doing that, but it is presumably something we can blame on having used a random version of a forked action. We could try removing this line and see if this action still works?

@rwest rwest deleted the regression branch June 9, 2023 13:13
@rwest (Member, Author) commented Jun 9, 2023

The overnight scheduled job uploaded a new reference result 🥳 , and the CI test on matt's PR fetched them for comparison 🥳 .

Unfortunately there seems to be a bug causing a crash when comparing some of the results (see #2446 (comment)), and quite a few of the regression tests have different cores and edges -- is this to be expected from random fluctuations?

@JacksonBurns (Contributor) replied:

> The overnight scheduled job uploaded a new reference result 🥳, and the CI test on matt's PR fetched them for comparison 🥳.

🎉

> Unfortunately there seems to be a bug causing a crash when comparing some of the results (see #2446 (comment)), and quite a few of the regression tests have different cores and edges -- is this to be expected from random fluctuations?

I can't comment on the science behind this, but I will say that what constituted a 'failure' in the old testing scheme was never established either. The output from the regression tests was just left for the user to read. The change you made in 1a424ef makes sense, but I think it's up to us now to decide what constitutes a failure.
