[Bug] Some temporal tests can sometimes take more than 4 mins #2185
Got a timeout for a t.rast.what test:
Outputs from other timeouts:
...and 2 more similar. ...and 1 more from the full config test.
Timeout for a v.what.strds test:
Output from another timeout:
I guess it is now fair to say that multiple tests can get stuck. However, so far it has only been tests of temporal modules/tools.
I understand that tests are cancelled after 4 minutes now. Did they report errors before the change to cancel them, to see if something is wrong in the tests themselves after, e.g., the semantic label changes? I remember we once observed many temporal tests failing too; I think it was even before semantic labels, wasn't it?
You are right. Something has been wrong for some time already and it is not clear when it started. The errors looked pretty much the same before the timeout change. The output would be cut off somewhere in the middle without any indication of what happened (you had to check the HTML report to see that). After the timeout change, it is just more explicit: the error output is visible in the text output directly in the CI and you can also see that the other tests are running. We have never observed this outside of GitHub Actions, so it is possible it is specific to that environment, but if it can happen anywhere, we want to know. The number of observations we have for GitHub Actions is also much higher than anything else. If I ever had this in the daily tests I was running on my fatra machine between 2014 and 2019, I would probably just have restarted the tests or skipped a day without giving it a second thought. If I remember correctly, it actually took some time before we ever observed it on the main/master branch. At first, it seemed it was only happening in PRs, maybe because there are more runs (commits) happening there than on the main branch.
Timeouts for a t.rast.algebra test:
...and 1 more. ...and 4 more from the full config test. ...and 6 more from the min config test.
Please report here when you get a timeout for a test which is not yet reported. Now, this clearly applies to various temporal tests, but if it is not a temporal test, report it here too. That would broaden the scope of this issue.
Timeout for a t.rast.export test:
Timeout for a t.rast.to.vect test:
From the grass-addons repo (grass.gunittest is not configured to apply any timeout there, so the GitHub Actions timeout applies):
At least 1 other timeout like this was observed.
Timeout for a t.vect.extract test:
Timeout for a t.rast.contour test:
Temporal modules seem to get stuck under certain circumstances. I was recording here all the events when I noticed tests failing due to timeout, and it was always a temporal module. Even considering the high number of temporal module tests compared to other tests, I would say it is highly unlikely that the timeouts are due to the testing code or happen to anything other than temporal modules (29 recorded temporal timeouts versus no other timeouts).
Could we temporarily add in "main" some debug output to
The tests could run with debug on permanently (e.g., GRASS debug settings and compiler settings if there is any use for it). I would prefer having another test job for that so that one does not have to search through tons of debug output to find basic errors. This would be additional matrix settings in the Ubuntu workflow, similar to the minimal configuration I added recently to avoid the need to test each OpenMP PR locally without OpenMP on (ubuntu.yml, #2245). More debug messages in the code, that's another issue. Maybe more standard debug messages can go there permanently if they are not a performance issue. If they were really temporary, then perhaps a patch applied in CI before compilation would be a systematic way to add some arbitrary experimental code to tests (still perhaps just in that debug job). Locally, you could try some repeated runs either directly on the machine or in a container. Looking at the tests for main, you are likely to get at least one timeout after 30 runs. (I restarted most of the failed jobs on main, so you won't see that data anymore. However, I'm not sure how random this is, because it seemed to me that at one point there were not that many timeouts, while now I have a feeling I struggle to get a clean run; this can be just an incorrect impression or irrelevant random fluctuation.)
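As a concrete starting point, here is a minimal sketch of such repeated local runs. The test path, the run count, and the assumption that it is launched inside an existing GRASS session are all illustrative, not taken from the actual CI configuration:

```python
# Repeat one suspect test locally and flag any run that exceeds a CI-style
# per-test limit. Assumes an active GRASS session; the test path, number of
# runs, and limit below are assumptions for illustration only.
import subprocess
import time

TEST = "temporal/t.rast.algebra/testsuite/test_raster_algebra.py"  # hypothetical path
RUNS = 30    # roughly the number of runs after which a timeout showed up in CI
LIMIT = 300  # seconds, mirroring the per-test limit used in CI

for i in range(RUNS):
    start = time.monotonic()
    try:
        # Run the test file as a plain Python script.
        subprocess.run(["python3", TEST], check=True, timeout=LIMIT)
    except subprocess.TimeoutExpired:
        print(f"run {i}: exceeded {LIMIT}s, likely stuck")
        break
    except subprocess.CalledProcessError as err:
        print(f"run {i}: failed with exit code {err.returncode}")
    else:
        print(f"run {i}: passed in {time.monotonic() - start:.1f}s")
```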
I tried this on my local Fedora box (a recent Intel i7 processor):
Could the tests be split into two test scripts to not run into the timeout risk? Dunno if that makes sense and whether the CI supports it...
Are you thinking that the reason why test_raster_algebra.py shows up more in the record is that it sometimes just needs more time, i.e., it is not actually stuck? I was thinking that test_raster_algebra.py is just bigger, e.g., it executes more processes, so it has a greater likelihood of getting stuck. So, I was thinking that splitting it won't decrease the likelihood of hitting the limit. The test file is big (almost 900 lines, 40 test methods), but there is substantial

The 5 min limit is arbitrary and I decided on it likely taking into consideration the long-running tests such as test_raster_algebra.py (from the commit message: The value for timeout comes from the test for t.rast.algebra which takes 3.5 mins. (89fe85c)). The limit is actually 5 mins; I don't know why I said 4 mins, maybe a typo or the original value I tried. Can the limit be twice as long? Sure, but why? Before the introduction of the timeout, the tests would just get stuck and would get canceled after they hit the 6 hour limit of GitHub Actions. The 5 minute timeout just makes it fail faster.
Who knows. I guess I'll only find out once we add some logging output to this particular test. |
An attempt to fix this is in #2332. My guess is it's caused by a bug in the pygrass Messenger class, given all the multiprocessing Lock usage. If that's it, finding and fixing the bug would be quite difficult, so #2332 reimplements the class using grass.script g.message (so process calls). It will be slower (calling g.message vs direct C calls), but other than that I am not sure if there are any other downsides to it.
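For context, a rough sketch of the trade-off being discussed; this is not the actual pygrass or #2332 code, just an illustration of an in-process, Lock-guarded writer versus delegating each message to a g.message process call. The function names are made up and the g.message variant assumes a running GRASS session:

```python
# Illustration only: in-process writer guarded by a multiprocessing Lock
# versus running the g.message module through grass.script.
import sys
from multiprocessing import Lock

import grass.script as gs

_lock = Lock()

def message_in_process(text):
    # Fast direct write, but if the lock handling has a bug (e.g., a child
    # process inherits it in a held state), later callers can block forever.
    with _lock:
        sys.stderr.write(text + "\n")

def message_via_g_message(text):
    # Each message runs a short-lived g.message process, so there is no shared
    # lock to deadlock on, at the cost of one process call per message.
    gs.run_command("g.message", message=text)
```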
I just saw a pytest timeout. Not sure if it is related. I didn't include any timeout plugin yet, so it was the GitHub Actions 6 hour timeout. Both jobs, not much output.
Since `testsuite/test_raster_algebra.py` comes with many tests which overall may take more than 300s of runtime, this PR separates out the `test_temporal*()` tests into their own file. Thus, in the ideal case, the CI timeout is no longer reached. Partially addresses OSGeo#2185
Could it be that the Ubuntu CI machines are simply overbooked? Since, e.g., CentOS doesn't show these timeout problems...
Just too many jobs doesn't seem to be the issue because other things run okay; that's more a queuing issue. I think it is also safe to assume that the individual jobs are separated from each other sufficiently in the GitHub Actions environment.
There is actually no difference between Ubuntu and CentOS in terms of available runners. Both run in an Ubuntu virtual machine. The Ubuntu workflow runs directly there. CentOS uses Docker on top of an Ubuntu VM (
On 2022-05-17, several tests failed in the 20.04 tests on the main branch. In all or almost all other cases, it was always only one test which would time out, although theoretically, with the time limit applied per test, more than one can time out (because when one times out, the others still run and thus can also time out). For the first time, there is also a non-temporal test which timed out: r.patch (test_rpatch_artificial.py). A severely shortened, but still long, job log follows.
On 2022-05-20, six tests failed in the 20.04 tests on the main branch, including r.patch with a different mix of temporal tests than before. The r.patch test was without error output:
Another case where a test of a non-temporal module timed out: g.extension timed out together with t.rast.algebra on 2022-05-25:
Result of
r.patch had a race condition for CELL rasters which caused extremely long processing times on some machines, so we can assume the timeout was caused by that. The issue was fixed in #2410. The g.extension timeout can be caused by connection issues, although there is no evidence for that. This leaves us again with only temporal modules producing consistent timeouts. One way to support that conclusion would be to add more tests of all other functionality except temporal (there are a lot of temporal tests compared to the other tests; more other tests would increase the likelihood that non-temporal tests time out if all tests have the same probability of timing out). In the meantime, a less common timeout of two tests at the same time occurred:
Adding another failing one to the collection, just to be clear that any temporal test can time out:
I would almost close this issue and start over with a new report, but there is actually not much change as far as the main issue discussed here goes, i.e., temporal tests sometimes time out, likely getting stuck forever. A run in a PR from July 6 in this repo failed with:
Tests in the grass-addons repo still time out once in a while, being canceled by GitHub after 6 hours.
Recent timeouts on main (link, link):
When the tests pass, the test duration is 51 seconds for test_convert.py and 46 seconds for test_t_rast3d_extract.py, i.e., nowhere close to the 300s per-test limit. The artifact for the first failure is no longer available; stderr for test_t_rast3d_extract.py was just
A slightly different output for a test.t.vect.import.sh test failure from 2 weeks ago:
I just restarted more than five jobs across the main, 8.3, and 8.2 branches with various tests failing.
Observed on Oct 9: failing on main with a timeout in two different jobs:
Running for almost 6h (stuck) in grass-addons in two different jobs:
An unusual timeout:
Even more interestingly,
Temporal-related tests timing out also through pytest:
One of the specific failures in pytest, with a timeout while waiting for t.register (after a successful t.create):
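For reference, a hedged sketch of the kind of pytest test this refers to: create a space-time raster dataset with t.create and then register a map with t.register, which is the step that was seen hanging. The map name, dataset name, and session setup are illustrative, not the actual test code:

```python
# Illustrative pytest-style test, not the actual failing test: the hang was
# observed while waiting for t.register after a successful t.create.
import grass.script as gs

def test_register_map_into_strds():
    # Assumes a GRASS session is already active (e.g., provided by a fixture).
    gs.run_command("r.mapcalc", expression="a_1 = 1", overwrite=True)
    gs.run_command(
        "t.create",
        type="strds",
        temporaltype="absolute",
        output="test_strds",
        title="Test STRDS",
        description="STRDS for timeout illustration",
        overwrite=True,
    )
    # The reported timeouts happened while waiting for this call to finish.
    gs.run_command(
        "t.register",
        input="test_strds",
        maps="a_1",
        start="2001-01-01",
    )
```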
Describe the bug
In CI checks for a PR, I have noticed that a test for t.vect.import takes more than four minutes to complete.
This might be related to the case when the tests would run forever and then were canceled by GitHub Actions.
In the absence of better data, I'm assuming here that the test would run much longer and that it is the test which is getting stuck.
However, it is possible that the test would finish in the next half a minute and everything would be fine. We have also seen the log ending with different tests, not one specific test.
To Reproduce
This may be visible from time to time in the CI, but it is not clear how to reproduce it.
Expected behavior
The tests pass within the set timeout.
Perhaps the timeout should simply be doubled to account for some possible fluctuation in the CI, or the test really can get stuck, possibly due to an issue in t.vect.import or some other module used in the test.
Screenshots
Another one ends with:
System description
Additional context
Previously, tests which would get stuck would run forever until they were canceled by GitHub Actions. After 89fe85c (#2172), each test can run only 4 mins.