GS flaky test #1332
Comments
Looks related:
Looks like I found another one:
And another one with Log
And yet another one, refs #2003 Log
One more: https://github.com/TheThingsNetwork/lorawan-stack/pull/2036/checks?check_run_id=475381911
This one might be serious as it may panic in production as well:
They just keep coming Log
@neoaggelos @KrishnaIyer can you coordinate who looks into this? We might have different things to fix.
There have been changes in #2108 that make #1332 (comment) and #1332 (comment) irrelevant. Also, even though it is not a real fix for the problem, adding a bigger timeout here https://github.com/TheThingsNetwork/lorawan-stack/pull/2108/files?file-filters%5B%5D=.go#diff-b91e68ddfed2ddb93b15d4b9ab05bde0R1045 seems to have cleaned up a lot of the flakiness. In general, everything related to statistics is being looked at as part of #2108 (and it consistently does not fail).
After a lot of testing, all of the tests above (and the ones that keep breaking the CI a few times a day) can be grouped into (so far) four categories:
Yeah, probably the gateway status is not the only thing, but I really cannot find what else could be causing the issue here. Could it be that the tests are somehow left in an illegal state (e.g. the same stats being updated by two sources at the same time, or the gc taking over and releasing things that it shouldn't)?
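To make that hypothesis concrete, here is a rough, purely illustrative sketch (none of these names exist in the GS code) of two sources updating the same stats concurrently. Without the mutex this is exactly the kind of data race that `go test -race` would flag and that can leave a test observing inconsistent values:

```go
// Hypothetical sketch of the suspected hazard; names are illustrative only.
package main

import (
	"fmt"
	"sync"
	"time"
)

type gatewayStats struct {
	mu           sync.Mutex
	uplinkCount  uint64
	lastStatusAt time.Time
}

// update serializes writers; without the mutex, a status handler and a traffic
// handler touching the same struct would be a data race that can corrupt the
// values a test later asserts on.
func (s *gatewayStats) update(uplinks uint64, at time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.uplinkCount += uplinks
	s.lastStatusAt = at
}

func main() {
	var s gatewayStats
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two concurrent "sources"
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.update(1, time.Now())
		}()
	}
	wg.Wait()
	fmt.Println(s.uplinkCount)
}
```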
I think this one is similar to the previous one; this is also doing a full
Note that GS tests already take 36 seconds, which is quite a lot. You might start experimenting with lowering the general GS timeout.
That could very well be. You can test that with
Which is why I am mostly reluctant to simply increase timeouts when Travis fails. I think the tests would benefit a great deal if we used events instead of timeouts. Also, I think decreasing the timeout will only lead to more flaky tests (I don't really know how the values being used came to be, so maybe @KrishnaIyer would be better suited for this one).
I'm fine with making better use of events, but we cannot have events for failed links if there are preconditions that are not met. It is inevitable that something happens asynchronously and there is no event for the failure while we want it to fail, i.e. we need to wait to be sure that it didn't succeed.
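As a sketch of what both cases could look like in a test, assuming a simple channel of events (hypothetical helpers, not the stack's actual events API): waiting for an expected event replaces a tuned sleep, while for the negative case the helper necessarily has to wait out a bounded window to be sure nothing happened:

```go
package gs_test

import (
	"testing"
	"time"
)

// assertEvent waits for an expected event instead of sleeping for a tuned
// duration; the timeout is only a generous upper bound hit on failure.
func assertEvent(t *testing.T, events <-chan string, want string) {
	t.Helper()
	select {
	case ev := <-events:
		if ev != want {
			t.Fatalf("expected event %q, got %q", want, ev)
		}
	case <-time.After(10 * time.Second):
		t.Fatalf("timed out waiting for event %q", want)
	}
}

// assertNoEvent covers the negative case: to be sure something did not happen,
// the test has no choice but to wait out a bounded window.
func assertNoEvent(t *testing.T, events <-chan string, window time.Duration) {
	t.Helper()
	select {
	case ev := <-events:
		t.Fatalf("expected no event, but got %q", ev)
	case <-time.After(window):
		// nothing arrived within the window, which is the expected outcome
	}
}
```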
@neoaggelos we have too many of these:
Please resolve.
New one:
This needs fine-tuning of a timeout used at this particular place.
#2624 fixed that last comment.
This looks like an int overflow, which is quite scary. Please ensure that this is caused by peculiarities of the test and not something that can happen in production.
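As an aside on why a silent overflow is scary: in Go a narrowing integer conversion wraps around rather than failing, so a counter that grows past the target type's range simply turns negative. This is only a generic illustration, not code from the GS (all names here are made up):

```go
// Purely illustrative example of a silent integer overflow in Go.
package main

import (
	"fmt"
	"math"
)

func main() {
	var received uint64 = math.MaxInt32 + 1 // e.g. an accumulated counter

	// A naive narrowing conversion wraps around instead of failing:
	asInt32 := int32(received)
	fmt.Println(asInt32) // prints a negative number

	// A guarded conversion makes the overflow explicit instead of silent.
	if received > math.MaxInt32 {
		fmt.Println("value does not fit in int32; handle explicitly")
	}
}
```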
Another one:
The last comment is very old; closing. Re-open as needed.
Summary
The GS test TestGatewayServer/Traffic/grpc/Upstream/OneValidFS is flaky.
Steps to Reproduce
What do you see now?
What do you want to see instead?
The test should not fail randomly.
Can you do this yourself and submit a Pull Request?
Can review.