Flaky Tests on CI #2494

Closed
apruden2008 opened this issue Jun 14, 2024 · 3 comments · Fixed by #2523
Labels
bug: Something isn't working.
does not block mainnet: For when we make decisions that this will not block mainnet.

Comments

@apruden2008
Contributor

🐛 Bug Report

Issue: Flaky Tests on CI
Severity: Medium

Description

We have observed that some tests in our CI pipeline exhibit flaky behavior, requiring multiple runs to pass. This inconsistency is affecting the reliability and efficiency of our development process.

Affected Tests

algorithms - This test often fails unpredictably, and the root cause is currently unknown.

Other tests not listed here may also be flaky and need multiple attempts to pass.

Steps to Reproduce

  • Run the CI pipeline.
  • Observe the failure of the algorithms test (and potentially others) intermittently.
  • Re-run the failed tests.
  • Notice that the tests may pass on subsequent attempts.
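To put a number on the flakiness instead of relying on manual re-runs, the steps above can be sketched as a small Rust helper that repeats a test command and counts failures. This is only an illustration: `FlakinessReport`, `measure_flakiness` and `cargo_test_passes` are hypothetical names, not part of the project, and the sketch assumes the affected tests can be run locally with `cargo test -p <package>`.

```rust
use std::process::Command;

/// Outcome of repeating one test command several times (hypothetical helper type).
#[derive(Debug, PartialEq)]
pub struct FlakinessReport {
    pub runs: usize,
    pub failures: usize,
}

impl FlakinessReport {
    /// A test counts as flaky if it both passed and failed with no code change in between.
    pub fn is_flaky(&self) -> bool {
        self.failures > 0 && self.failures < self.runs
    }
}

/// Calls `attempt` exactly `runs` times and counts how often it reports failure (`false`).
pub fn measure_flakiness<F: FnMut() -> bool>(runs: usize, mut attempt: F) -> FlakinessReport {
    let failures = (0..runs).filter(|_| !attempt()).count();
    FlakinessReport { runs, failures }
}

/// Runs `cargo test -p <package>` once and returns whether it exited successfully.
/// Failing to launch cargo at all is counted as a failed attempt.
pub fn cargo_test_passes(package: &str) -> bool {
    Command::new("cargo")
        .args(["test", "-p", package])
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

A call such as `measure_flakiness(20, || cargo_test_passes("snarkvm-algorithms"))` would then report how many of 20 identical runs failed; the package name here is an assumption about how the algorithms crate is named in the workspace.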

Expected Behavior

All tests should pass consistently on the first run, provided that the code is correct.

Actual Behavior

The algorithms test (and potentially others) fails intermittently without any changes to the code.
These tests often require multiple attempts to pass, which wastes time and CI resources.

Impact

Decreases confidence in the CI results.
Slows down the development process due to the need for re-running tests.
Makes it difficult to identify genuine issues in the codebase.

Possible Causes

Race conditions or timing issues within the tests or the code being tested.
Environmental issues related to the CI infrastructure.
Dependencies on external services or resources that may not be consistently available.

Suggested Actions

Investigation and Diagnosis
- Conduct a thorough investigation to identify the root cause of the flakiness in the algorithms test.
- Review the test code and the associated application code for potential issues.

Test Stabilization
- Implement fixes to address any identified issues causing the flakiness.
- Ensure that tests do not have hidden dependencies on external resources or timing conditions.
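One common source of hidden timing dependencies is a test that sleeps for a fixed duration and then assumes some background work has finished. Whether the algorithms test does this is not known; the following is only a sketch of the usual alternative, polling a condition until a deadline. `wait_until` is a hypothetical helper, not an existing function in the codebase.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `condition` until it returns true or `timeout` elapses.
/// Returns true if the condition was met in time, false otherwise.
/// The condition is always checked at least once, even with a zero timeout.
pub fn wait_until<F: FnMut() -> bool>(
    timeout: Duration,
    poll_interval: Duration,
    mut condition: F,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(poll_interval);
    }
}
```

A test using this can set a generous timeout without slowing down the common case, because it returns as soon as the condition holds; a fixed sleep has to be long enough for the slowest CI machine and is still a guess.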

Enhancement of CI Infrastructure
- Ensure that the CI environment is consistent and reliable.
- Consider introducing additional logging or diagnostics to capture more information about the failures.

Documentation and Communication
- Document the findings and the steps taken to address the flaky tests.
- Communicate any changes to the team to ensure that everyone is aware of the improvements and any new best practices.

Additional Information

Please provide any logs or additional context that might help in diagnosing the issue.
If you have observed flaky behavior in other tests, please list them here as well.

apruden2008 added the bug and does not block mainnet labels on Jun 14, 2024
@zosorock
Contributor

Not sure if it helps but can we upgrade to Rust 1.79.0?

@vicsn
Contributor

vicsn commented Jun 14, 2024

Some comments:

  • Don't think a Rust upgrade will help; flakiness has been an issue for a while
  • One frequent cause of flakiness across all crates is that parameter downloads fail, perhaps because of AWS rate-limiting
  • Separately from the failing downloads, resource usage does seem to be too high for the algorithms crate. As this is heavily influenced by the particular environment, Provable will triage this on our own CI independently.
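If the download failures really are transient rate-limiting (which is a guess in this thread, not a confirmed diagnosis), one standard mitigation is to retry with exponential backoff. A minimal sketch, assuming the download can be wrapped in a closure that returns a `Result`; `retry_with_backoff` is a hypothetical helper and does not reflect how the project's parameter loading is actually implemented:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retries `operation` up to `max_attempts` times, doubling the delay after each failure.
/// The closure receives the 1-based attempt number.
/// Returns the first success, or the last error once all attempts are used up.
pub fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut operation: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match operation(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error to the caller.
            Err(error) if attempt >= max_attempts => return Err(error),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

This sketch has no jitter; when many CI jobs start at once, adding a random offset to each delay would help keep the retries from hitting the server in lockstep.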

@zosorock
Contributor

zosorock commented Jun 22, 2024

Funnily enough, after lowering the resource class (or perhaps thanks to a fix in one of the PRs), CI is now passing for algorithms:
https://app.circleci.com/pipelines/github/AleoNet/snarkVM/13211/workflows/44a17171-197b-4df2-95c3-58e4180b57f8/jobs/576904
