
[hold] Bug: Randomly failing E2E tests #376

Closed
wishfulthinkerme opened this issue Nov 22, 2018 · 14 comments
Assignees
Labels
bug Something isn't working CI Tasks related to the CI, pipeline, and releasing e2e-tests team/asterix
Milestone

Comments

@wishfulthinkerme
Contributor

Expected Results

Stop E2E tests from failing randomly.

Observed Results

Random tests fail intermittently, both locally and in the pipeline. There is no known case in which this can be reliably reproduced: sometimes a run passes, sometimes it doesn't. We have to check whether this is caused by our E2E tests, connection issues, or backend problems.

@wishfulthinkerme wishfulthinkerme added the bug Something isn't working label Nov 22, 2018
@Xymmer Xymmer added CI Tasks related to the CI, pipeline, and releasing e2e-tests labels Nov 22, 2018
@wishfulthinkerme wishfulthinkerme self-assigned this Nov 23, 2018
@Xymmer Xymmer added this to the ALPHA-0 milestone Nov 27, 2018
@Xymmer
Contributor

Xymmer commented Dec 5, 2018

Moving this back in; Gil is saying this still happens sometimes with the server.
Marcin, I can't remember why we moved this to Done. Do you?

@wishfulthinkerme
Contributor Author

I couldn't reproduce the random failures, so we decided to move it to Done for now.

@Xymmer Xymmer added this to To Do in Asterix SPRINT via automation Dec 6, 2018
@wishfulthinkerme wishfulthinkerme removed their assignment Dec 7, 2018
@marlass marlass moved this from To Do to In Progress in Asterix SPRINT Dec 11, 2018
@marlass marlass moved this from In Progress to To Do in Asterix SPRINT Dec 11, 2018
@marlass
Contributor

marlass commented Dec 11, 2018

First we need to resolve the issue with the wrong backend payment URL.

@Xymmer
Contributor

Xymmer commented Dec 11, 2018

Hey Marcin, E2E is fixed; can you resume the investigation?

@marlass marlass moved this from To Do to In Progress in Asterix SPRINT Dec 14, 2018
@marlass
Contributor

marlass commented Dec 18, 2018

I looked at it for a few hours and couldn't find the root cause. Failures are completely random: one time some test fails, then it works for a few runs; in the meantime another one fails and the story repeats.

I will look more into this issue in the next weeks and might check some alternatives to Protractor that are more stable and easier to debug and work with. The current solution slows down the whole development too much: we lose way too much time triggering pipelines, and the process of writing tests is also inefficient. We need to check our work line by line, and writing a whole test that passes on the first try is almost impossible. After finding a better replacement I will introduce it to a wider audience and we will decide the future of E2E tests in the project.

@marlass marlass moved this from In Progress to To Do in Asterix SPRINT Dec 18, 2018
@dunqan
Contributor

dunqan commented Dec 18, 2018

@marlass Please take into account that Protractor may not be to blame (or at least not the only culprit); the underlying Selenium may be. So if you decide on an alternative library that also uses Selenium, the same issues will probably resurface over time as the number of tests grows, especially in our app, where almost every part of the page is created dynamically (we need to make a backend call first).

What I wanted to point out is that there may be no easy or obvious solution to this, so please be cautious about any miraculous alternatives.

And yup, writing them is hard (and it would be good to make it easier), but take into account that testing all this stuff manually at the same rate isn't just hard, it's impossible.

Also, this is a good read:
https://sqa.stackexchange.com/questions/32542/how-to-make-selenium-tests-more-stable

Even Google themselves struggle with flaky tests.

@marlass
Contributor

marlass commented Dec 18, 2018

Yeah. I first want to check Cypress, which does not use Selenium under the hood. In a previous project we didn't have a failure rate anywhere close to the current situation, and one thing that it improved dramatically was the ease of debugging and writing tests. I will first try to move the happy path to it, and if that brings enough improvement, we will discuss it and decide whether we want to migrate.

@hackergil
Contributor

hackergil commented Dec 18, 2018

According to the link above that @dunqan provided (and a couple of others, if you do some research), flakiness in Selenium-based tests is normal and expected. Rewriting the tests in another framework is not in scope for this ticket, nor something we should consider at this point.

I'd propose to consider the following:

  • First, we can get some stats on the success/failure ratio of E2E tests vs. daily builds (Travis has APIs for this)
  • Once we know this, maybe we can do some analysis on the sample data to see whether it's the same test or tests that fail, and go from there
  • Also, we could consider retrying the failing E2E tests using protractor-retry or something similar: https://www.npmjs.com/package/protractor-retry
  • Additionally, we could even generate reports of the success/failure ratio
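For the retry idea, wiring protractor-retry into a Protractor config looks roughly like this (a minimal sketch based on the package's documented hooks; the spec path and the retry count of 2 are placeholder choices, not values from this project):

```javascript
// protractor.conf.js — sketch of protractor-retry integration.
// The three hooks below are the ones the package documents.
const retry = require('protractor-retry').retry;

exports.config = {
  specs: ['./e2e/**/*.e2e-spec.js'], // placeholder path

  onPrepare: () => {
    // Registers a reporter that records which specs failed.
    retry.onPrepare();
  },

  onCleanUp: (results) => {
    // Persists the failed spec files so afterLaunch can re-run them.
    retry.onCleanUp(results);
  },

  afterLaunch: () => {
    // Re-runs only the failed specs, up to 2 extra attempts.
    return retry.afterLaunch(2);
  },
};
```

This retries only the specs that failed rather than the whole suite, so a genuinely broken test still fails the build after the retries are exhausted.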

@dunqan
Contributor

dunqan commented Dec 18, 2018

I agree with @hackergil that changing the testing framework is not in the scope of this ticket, but...
if we plan to write many E2E tests, maybe it's not a bad idea to create another ticket, just to give Cypress a quick try and compare the results?
If we can live with the main drawback, which is the lack of support for Firefox/IE/Edge/Safari (we probably can), the potentially better debugging and easier test writing make it worth at least trying (not to mention better stability, thanks to not using Selenium).

@Xymmer Xymmer moved this from To Do to In Progress in Asterix SPRINT Dec 19, 2018
@Xymmer Xymmer moved this from In Progress to To Do in Asterix SPRINT Dec 19, 2018
@marlass marlass moved this from To Do to In Progress in Asterix SPRINT Dec 28, 2018
@marlass
Contributor

marlass commented Dec 28, 2018

I checked our build statistics with the Travis API and got the following results (the last 3,000 builds of 3,725 total):

[screenshot: Travis build success/failure statistics]

However, this doesn't give the whole picture. The Travis API doesn't return any information about job/build restarts, so we only see the final result (which might sometimes be the outcome of 3-4 restarts).

An almost 50% failure rate doesn't look good. Most failures happen at the 'Unit tests' stage, which probably points to the flaky E2E tests.

Source code: https://github.com/marlass/travis-build-stats

We can inspect the tests further, introduce some sort of auto-retry, or try some new solutions, because this really has a great impact on everyone's work. Sometimes waiting 30 minutes to merge, then finding out that your branch is again not up to date and having to retry the whole process, is extremely frustrating. In my opinion this one thing is the biggest drag on our current development speed.

Let me know what you think we should do next.
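The linked script isn't reproduced here, but the core of such an analysis is just tallying the `state` field of builds fetched from the Travis API v3 `/repo/{slug}/builds` endpoint (with the `Travis-API-Version: 3` header). A minimal sketch of that tallying step, with hypothetical function names not taken from the linked repo:

```javascript
// Tally final build states as returned by the Travis API v3.
// Each build object carries a `state` field such as
// "passed", "failed", or "errored".
function tallyBuildStates(builds) {
  const counts = {};
  for (const build of builds) {
    counts[build.state] = (counts[build.state] || 0) + 1;
  }
  return counts;
}

// Failure ratio over the builds that actually finished.
function failureRate(counts) {
  const failed = (counts.failed || 0) + (counts.errored || 0);
  const finished = failed + (counts.passed || 0);
  return finished === 0 ? 0 : failed / finished;
}
```

Note that this only sees each build's final state, so (as mentioned above) a build that passed after several manual restarts is still counted as a success.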

@dunqan
Contributor

dunqan commented Dec 31, 2018

As we discussed: because Travis runs for each commit (and even runs twice if there is a PR), those stats include failures from every commit on work-in-progress branches, which may not yet have adjusted unit (or E2E) tests.
And if those stats include only final results (without restarts), then they effectively exclude failures caused by flaky tests on the most representative (ready-to-merge) commits.

So I'd vote for implementing a more robust logging mechanism that takes into account the Protractor output, job ID, branch, and commit; then we could use that info to find the most frequently failing tests, the flakiest ones (those that finally pass after some restarts), etc.

And of course, test/implement a retry mechanism for Protractor as soon as possible, to ease developers' lives. It's part of this ticket (#580), but IMO it deserves its own ticket.
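As a sketch of what one record in such a logging mechanism might contain, combining a reporter's spec result with the metadata Travis exposes via environment variables (`TRAVIS_JOB_ID`, `TRAVIS_BRANCH`, and `TRAVIS_COMMIT` are real Travis CI variables; the function name and record shape are hypothetical):

```javascript
// Build one log record per spec result, enriched with CI metadata.
// `specName` and `status` would come from a Protractor/Jasmine
// reporter; `env` is typically `process.env` on the CI worker.
function buildLogRecord(specName, status, env) {
  return {
    spec: specName,
    status: status, // e.g. 'passed' or 'failed'
    jobId: env.TRAVIS_JOB_ID || null,
    branch: env.TRAVIS_BRANCH || null,
    commit: env.TRAVIS_COMMIT || null,
    timestamp: new Date().toISOString(),
  };
}
```

Aggregating these records by `spec` and `commit` would show which tests fail most often and which ones eventually pass after restarts.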

@marlass
Contributor

marlass commented Dec 31, 2018

After a quick search, I found that implementing a better logging mechanism is pretty easy.
Here is a PR that gathers all our Protractor results on S3, which we can use for advanced and reliable analysis: #790

@marlass
Contributor

marlass commented Jan 10, 2019

Status of the issue: I will review the stats next week and prepare a script for the future.

@marlass marlass moved this from In Progress to To Do in Asterix SPRINT Jan 10, 2019
@marlass marlass moved this from To Do to In Progress in Asterix SPRINT Jan 21, 2019
@marlass marlass moved this from In Progress to To Do in Asterix SPRINT Jan 22, 2019
@kacperknapik kacperknapik changed the title Bug: Randomly failing E2E tests [hold] Bug: Randomly failing E2E tests Jan 22, 2019
@kacperknapik kacperknapik removed this from To Do in Asterix SPRINT Jan 22, 2019
@Xymmer Xymmer modified the milestones: ALPHA-0, BETA-0 Mar 11, 2019
@Xymmer Xymmer modified the milestones: 1.0 Beta-0, 1.0 RC-0 Apr 25, 2019
@Xymmer Xymmer modified the milestones: 1.0 RC-0, 1.? Milestone TBD, graveyard May 22, 2019
@Xymmer
Contributor

Xymmer commented May 24, 2019

No longer needed, as we moved from Protractor to Cypress.

@Xymmer Xymmer closed this as completed May 24, 2019
@Xymmer Xymmer added this to the before-5.0 milestone Jun 5, 2022