Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable retrying flaky tests in nightly CI #4351

Merged
merged 1 commit into from
May 24, 2024

Conversation

neilmayhew
Copy link
Contributor

Nightly tests continue to fail randomly due to an excessive number of discards. Fixing the discards problem will be difficult, but we can work around it for now by retrying test suites that have failed for this specific reason. However, we do this only for nightly CI and not for other workflow runs.

@neilmayhew neilmayhew requested a review from lehins May 22, 2024 20:39
Copy link
Collaborator

@lehins lehins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach might result in longer nightly runs without really fixing the problem (no guarantee that 2 or 3 or even 5 retries is enough for nightly tests to succeed)

That being said, we can give it a try, since it only affects the nightly runs there is no danger in this.

.github/workflows/haskell.yml Outdated Show resolved Hide resolved
@neilmayhew
Copy link
Contributor Author

This approach might result in longer nightly runs without really fixing the problem (no guarantee that 2 or 3 or even 5 retries is enough for nightly tests to succeed)

From what I've observed, the individual test suites fail only occasionally, but when you're running 108 of them the chance of at least one failing is quite high. Rerunning just the failed ones seems to have a good chance of succeeding. (I tried that manually on last night's CI before starting this.)

Most of the the individual test suites are fairly quick to execute, so if only a couple are failing it doesn't add much additional time to run them twice. In fact, since there's a lot of suites running in parallel, running a shorter one twice might still finish before the longer ones.

@neilmayhew neilmayhew changed the title Enable retrying flaky tests in nightly CI Enable retrying flaky tests in night-ly CI May 22, 2024
@neilmayhew neilmayhew changed the title Enable retrying flaky tests in night-ly CI Enable retrying flaky tests in nightly CI May 22, 2024
@neilmayhew neilmayhew force-pushed the neilmayhew/retry-flaky-tests branch from dbd7762 to 0e255cc Compare May 23, 2024 02:16
@neilmayhew neilmayhew enabled auto-merge May 23, 2024 02:19
@MaximilianAlgehed
Copy link
Collaborator

MaximilianAlgehed commented May 23, 2024

FYI I took the time to look into what it would take to update the QuickCheck dependency to 2.15 to fix the flaky test in shelley and these are the packages for which you'd need allow-newer:

  ral:QuickCheck,
  fin:QuickCheck,
  bin:QuickCheck,
  tree-diff:QuickCheck,
  base64-bytestring-type:QuickCheck

The thing about this is that the only breaking change in the latest version of QuickCheck is that properties that don't do any random generation are no longer implicitly tested only once. This shouldn't break any of these packages in any significant way (as we aren't using any properties from those packages anyway) so it's up to @lehins if you think it's an acceptable trade-off to introduce these to cabal.project to (hopefully) fix the flaky test issue. However, I understand if allow-newer makes people queezy!

The QuickCheck bump is on branch https://github.com/IntersectMBO/cardano-ledger/tree/bump-quickcheck but when doing this I discovered a very stupid bug that I introduced in QuickCheck so we're going to have to wait for 2.15.1 anyway. That gives us some more time to wait for the maintainers of the above packages to catch up.

@lehins
Copy link
Collaborator

lehins commented May 23, 2024

it's up to @lehins if you think it's an acceptable trade-off to introduce these to cabal.project to (hopefully) fix the flaky test issue. However, I understand if allow-newer makes people queezy!

allow-newer would fix it for us, but then it could confuse ledger users into believing that they also need the newer version of QuickCheck.
I don't think there is any rush on this, eventually we will all get onto the new version of QuickCheck and this flakiness issue will go away. In other words IMHO there is no point in putting effort and causing disturbance for people if the issue will resolve itself with time, especially since this interim solution should unblock the nightly CI and recent improvements to CI make dealing with flaky tests a lot less painfull.

@lehins lehins force-pushed the neilmayhew/retry-flaky-tests branch from 0e255cc to 4b5d515 Compare May 23, 2024 18:25
@neilmayhew neilmayhew force-pushed the neilmayhew/retry-flaky-tests branch from 4b5d515 to a91b4e6 Compare May 23, 2024 22:10
@neilmayhew neilmayhew merged commit 509a6e3 into master May 24, 2024
125 checks passed
@neilmayhew neilmayhew deleted the neilmayhew/retry-flaky-tests branch May 24, 2024 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants