Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Some temporal tests can sometimes take more than 4 mins #2185

Open
wenzeslaus opened this issue Feb 9, 2022 · 35 comments
Open

[Bug] Some temporal tests can sometimes take more than 4 mins #2185

wenzeslaus opened this issue Feb 9, 2022 · 35 comments
Labels
bug Something isn't working help wanted Extra attention is needed temporal Related to temporal data processing
Milestone

Comments

@wenzeslaus
Copy link
Member

wenzeslaus commented Feb 9, 2022

Describe the bug

In CI checks for a PR, I have noticed that a test for t.vect.import takes more than four minutes to complete.

This might be related to the case when the tests would run forever and then were canceled by GitHub Actions.

In the lack of better data, I'm assuming here that the test would run much longer and that it is the the test which is getting stuck.

However, it is possible that the test would finish in next half a minute and everything would be fine. We have also seen the log ending with different tests, not one specific test.

To Reproduce

This may be visible time to time in the CI, but it is not clear how to reproduce it.

Expected behavior

The tests pass within the set timeout.

Perhaps the timeout should simply be double to account for some possible fluctuation in the CI or the test really can get stuck possibly due to an issue in t.vect.import or some other module used in the test.

Screenshots

Running ./temporal/t.vect.import/testsuite/test.t.vect.import.sh...
========================================================================
+ g.region s=0 n=80 w=0 e=120 b=0 t=50 res=10 res3=10 -p3
+ mkdir test
+ v.random --o -z output=soil_1 n=100 zmin=0 zmax=100 column=height seed=1
Default driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_1@__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378>...
Registering primitives...

+ v.random --o -z output=soil_2 n=100 zmin=0 zmax=100 column=height seed=2
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_2@__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378>...
Registering primitives...

+ v.random --o -z output=soil_3 n=100 zmin=0 zmax=100 column=height seed=3
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_3@__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378>...
Registering primitives...

+ g.tempfile pid=1 -d
+ n1=/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378/.tmp/fv-az183-463/1.0
+ cat
+ t.create --o type=stvds temporaltype=absolute output=soil_abs1 title=A test descr=A test
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378/tgis/sqlite.db
+ t.register -i type=vector input=soil_abs1 file=/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az183_463_32378/.tmp/fv-az183-463/1.0 start=2001-01-01 increment=1 months
Gathering map information...
0..100
Registering maps in the temporal database...
Registering maps in the space time dataset...
0..Updating space time dataset...
100
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar.bz2 compression=bzip2 directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar.gz compression=gzip directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar compression=no directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
========================================================================
FAILED ./temporal/t.vect.import/testsuite/test.t.vect.import.sh - Timeout >300.0s
...
Executed 234 test files in 0:55:14.506874.
From them 233 files (100%) were successful and 1 files (0%) failed.

Another one ends with:

v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=pack input=soil_abs1 output=stvds_export_pack.tar.bz2 compression=bzip2 directory=/tmp
Pack file </tmp/tmpmsytnjg1/soil_1.pack> created
Pack file </tmp/tmpmsytnjg1/soil_2.pack> created
Pack file </tmp/tmpmsytnjg1/soil_3.pack> created
+ t.vect.export format=pack input=soil_abs1 output=stvds_export_pack.tar.gz compression=gzip directory=/tmp
Pack file </tmp/tmpasqj23r6/soil_1.pack> created
Pack file </tmp/tmpasqj23r6/soil_2.pack> created
Pack file </tmp/tmpasqj23r6/soil_3.pack> created
+ t.vect.export format=pack input=soil_abs1 output=stvds_export_pack.tar compression=no directory=/tmp
Pack file </tmp/tmpy60f6pyy/soil_1.pack> created
Pack file </tmp/tmpy60f6pyy/soil_2.pack> created
Pack file </tmp/tmpy60f6pyy/soil_3.pack> created
========================================================================
FAILED ./temporal/t.vect.import/testsuite/test.t.vect.import.sh - Timeout >300.0s

System description

  • Operating System: ubuntu-20.04 VM in GitHub Actions
  • GRASS GIS version: feature branch in PR against main
GRASS 8.1.dev (2022)
PROJ: 6.3.1
GDAL/OGR: 3.0.4
GEOS: 3.8.0
SQLite: 3.31.1
Python: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0]

Additional context

Previously, tests which would get stuck would run forever until they were canceled by GitHub Actions. After 89fe85c (#2172), each test can run only 4 mins.

@wenzeslaus wenzeslaus added the bug Something isn't working label Feb 9, 2022
@wenzeslaus
Copy link
Member Author

wenzeslaus commented Feb 13, 2022

Got a timeout for a t.rast.what test:

Running ./temporal/t.rast.what/testsuite/test_what.py...
========================================================================
........========================================================================
FAILED ./temporal/t.rast.what/testsuite/test_what.py - Timeout >300.0s

Outputs from another timeouts:

Running ./temporal/t.rast.what/testsuite/test_what.py...
========================================================================
..............========================================================================
FAILED ./temporal/t.rast.what/testsuite/test_what.py - Timeout >300.0s
 Running ./temporal/t.rast.what/testsuite/test_what.py...
========================================================================
........========================================================================
FAILED ./temporal/t.rast.what/testsuite/test_what.py - Timeout >300.0s

...and 2 more similar.

...and 1 more from the full config test.

@wenzeslaus
Copy link
Member Author

wenzeslaus commented Feb 14, 2022

Timeout for a v.what.strds test:

Running ./scripts/v.what.strds/testsuite/test_what_strds.py...
========================================================================
Process has timed out in 300.0s and produced no error output.
========================================================================
FAILED ./scripts/v.what.strds/testsuite/test_what_strds.py - Timeout >300.0s

Output from another timeout:

Running ./scripts/v.what.strds/testsuite/test_what_strds.py...
========================================================================
.========================================================================
FAILED ./scripts/v.what.strds/testsuite/test_what_strds.py - Timeout >300.0s

@wenzeslaus wenzeslaus added the help wanted Extra attention is needed label Feb 14, 2022
@wenzeslaus
Copy link
Member Author

I guess it is now fair to say that multiple tests can get stuck. However, so far it was only tests of temporal modules/tools.

@veroandreo
Copy link
Contributor

I guess it is now fair to say that multiple tests can get stuck. However, so far it was only tests of temporal modules/tools.

I understand that tests are cancelled after 4 minutes now. Did they report errors before (the change to cancel them) to see if something is wrong in the tests themselves after eg semantic label changes? I remember we once observed many temporal tests failing too, I think it was even before semantic labels, was it?

@wenzeslaus
Copy link
Member Author

...Did they report errors before (the change to cancel them) to see if something is wrong in the tests themselves after eg semantic label changes?...

You are right. Something is wrong for some time already and it is not clear when it started. The errors looked pretty much the same before the timeout change. The output would be cut somewhere in the middle without any indication of what happened (you had to check the HTML report to see that). After the timeout change, it is just more explicit and the error output is visible in the text output directly in the CI and you can also see that the other tests are running.

We have never observed this outside of GitHub Actions, so it is possible it is specific just to that environment, but if it can happen anywhere, we want to know.

The number of observations we have for GitHub Actions is also much higher than anything else. If I've ever had that in daily tests I was running on my fatra machine between 2014 and 2019, I would probably just have restarted the tests or skipped a day without giving it a second thought.

If I remember correctly, it actually took some time before we have ever observed it on main/master branch. At first, it seemed it was only happening in PRs, maybe because there is more runs (commits) happening there than on the main branch.

@wenzeslaus
Copy link
Member Author

wenzeslaus commented Mar 4, 2022

Timeouts for a t.rast.algebra test:

Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py...
========================================================================
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az208_724_32381/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az208_724_32381/tgis/sqlite.db
Gathering map information...
0..100
Registering maps in the temporal database...
Registering maps in the space time dataset...
0..Updating space time dataset...
100
Gathering map information...
0..100
Registering maps in the temporal database...
Registering maps in the space time dataset...
0..Updating space time dataset...
100
Gathering map information...
0..100
Registering maps in the space time dataset...
0..Updating space time dataset...
100
Gathering map information...
0..100
Registering maps in the temporal database...
Registering maps in the space time dataset...
0..Updating space time dataset...
100
Gathering map information...
0..100
100
....................................========================================================================
FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s
Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py...
========================================================================
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az39_670_33056/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az39_670_33056/tgis/sqlite.db
0..100
0..100
0..100
0..100
0..100
0..100
0..100
0..100
0..100
100
.................................========================================================================
FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s

...and 1 more.

...and 4 more from the full config test.

...and 6 more from the min config test.

@wenzeslaus wenzeslaus changed the title [Bug] t.vect.import test can sometimes take more than 4 mins [Bug] Some temporal tests can sometimes take more than 4 mins Mar 4, 2022
@wenzeslaus
Copy link
Member Author

Please report here when you get a timeout for a test which is not yet reported.

Now, this is clearly applies to various temporal tests, but if it is not a temporal test, report it here to. That would broaden the scope of this issue.

@wenzeslaus
Copy link
Member Author

Timeout for a t.rast.export test:

Running ./temporal/t.rast.export/testsuite/test_rast_export.py...
========================================================================
.========================================================================
FAILED ./temporal/t.rast.export/testsuite/test_rast_export.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

Timeout for a t.rast.to.vect test:

Running ./temporal/t.rast.to.vect/testsuite/test_to_vect.py...
========================================================================
..========================================================================
FAILED ./temporal/t.rast.to.vect/testsuite/test_to_vect.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

wenzeslaus commented Mar 16, 2022

From the grass-addons repo (grass.guittest is not configured to apply any timeout there, so GitHub Action timeout applies):

...
Running ./temporal/t.rast.what.aggr/testsuite/test_whataggr.py...
Error: The operation was canceled.

At least 1 other timeout like this was observed.

@wenzeslaus
Copy link
Member Author

Timeout for a t.vect.extract test:

Running ./temporal/t.vect.extract/testsuite/test_vect_extract.py...
========================================================================
.========================================================================
FAILED ./temporal/t.vect.extract/testsuite/test_vect_extract.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

Timeout for a t.rast.contour test:

Running ./temporal/t.rast.contour/testsuite/test_convert.py...
========================================================================
...========================================================================
FAILED ./temporal/t.rast.contour/testsuite/test_convert.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

Temporal modules seem to get stuck at certain circumstances.

I was recording here all the events when I noticed tests failed due to timeout and it was always from a temporal module. Even considering the high number of temporal module tests comparing to other tests, I would say it is highly unlikely that the timeouts are due to the testing code or happening to other than temporal modules (29 recorded temporal timeouts versus no other timeouts).

@wenzeslaus wenzeslaus added the temporal Related to temporal data processing label Apr 13, 2022
@neteler
Copy link
Member

neteler commented Apr 13, 2022

Could we temporarily add in "main" some debug output to t.* test(s) to better understand where exactly they get stuck?

@wenzeslaus
Copy link
Member Author

The tests could run with debug on permanently (e.g., GRASS debugs and compiler settings if there is any use for it). I would prefer having another test job for that so that one does not have to search through tons of debug output to find basic errors. This would be additional matrix settings in the Ubuntu workflow similar to the minimal configuration I added recently to avoid the need to test locally each OpenMP PR without OpenMP on (ubuntu.yml, #2245).

More debug messages in the code, that's another issue. Maybe more standard debug messages can go there permanently if they are not a performance issue. If they would be really temporary than perhaps a patch to apply in CI before compilation would be a systematic way to add some arbitrary experimental code to tests (still perhaps just in that debug job).

Locally, you could try some repeated runs either directly on the machine or in a container. Looking at the tests for main, you are likely to get at least one timeout after 30 runs. (I restarted most of the failed jobs on main, so you won't see that data anymore. However, I'm not sure how random this is because it seemed to me that at one point there was not that many timeouts while now I had a feeling I struggle to get a clean run - this can be just incorrect impression or irrelevant random fluctuation.)

@neteler
Copy link
Member

neteler commented Apr 19, 2022

FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s

I tried this on my local Fedora box (a recent Intel i7 processor):

Ran 40 tests in 219.914s

Could the tests be split into two test scripts to not run into the timeout risk? Dunno if that makes sense and the CI support it...

@wenzeslaus
Copy link
Member Author

Are you thinking that the reason why we test_raster_algebra.py shows up more in the record is that it sometimes just needs more time, i.e., it is not actually stuck?

I was thinking that test_raster_algebra.py is just bigger, e.g., executes more processes, so it has a greater likelihood of getting stuck. So, I was thinking that splinting it won't decrease the likelihood of hitting the limit.

The test file is big (almost 900 lines, 40 test methods), but there is substantial setUpClass which, to keep things simple, would have to be copied unless some analysis is done to split it.

The 5 min limit is arbitrary and I decided that likely taking into consideration the long running tests such as test_raster_algebra.py (from the commit message: The value for timeout comes from the test for t.rast.algebra which takes 3.5 mins. (89fe85c)). The limit is actually 5 mins, I don't know why I said 4 mins, maybe a typo or the original value I tried. Can the limit be twice more? Sure, but why?

Before introduction of the timeout, the tests would just get stuck and would get canceled after they hit the 6 hour limit of GitHub Actions. The 5 minute timeout just makes it fail faster.

@neteler
Copy link
Member

neteler commented Apr 20, 2022

Are you thinking that the reason why we test_raster_algebra.py shows up more in the record is that it sometimes just needs more time, i.e., it is not actually stuck?

Who knows. I guess I'll only find out once we add some logging output to this particular test.

@petrasovaa
Copy link
Contributor

An attempt to fix this in #2332. My guess is it's caused by a bug in pygrass.Messanger class given all the multiprocessing Lock usage. If that's it, finding a bug and fixing it would be quite difficult, so #2332 is reimplementing the class using grass.script g.message (so process calls). It will be slower (calling g.message vs direct C calls), but other than that I am not sure if there are any other downsides to it.

@wenzeslaus
Copy link
Member Author

I just saw pytest timeout. Not sure if related. I didn't include any timeout plugin yet, so it was GitHub Actions 6 hour timeout. Both jobs, not much output.

============================= test session starts ==============================
platform linux -- Python 3.8.12, pytest-7.1.2, pluggy-1.0.0
rootdir: /home/runner/work/grass/grass, configfile: pyproject.toml
collected 64 items

Error: The operation was canceled.

neteler added a commit to neteler/grass that referenced this issue May 9, 2022
Since the `testsuite/test_raster_algebra.py` comes with many tests which overall may take more than 300s of runtime, this PR separates out the `test_temporal*()` tests into an own file. Thus, in the ideal case, a CI timeout is no longer reached in this case.

Partially addresses OSGeo#2185
@neteler
Copy link
Member

neteler commented May 9, 2022

Something is wrong for some time already and it is not clear when it started.

Could it be that the Ubuntu CI machines are simply overbooked? Since e.g. Centos doesn't show these timeout problems...

@wenzeslaus
Copy link
Member Author

Could it be that the Ubuntu CI machines are simply overbooked?

Just too many jobs doesn't seem to be the issue because other things run okay; that's more a queuing issue. I think it is also safe to assume that the individual jobs are separated from each other sufficiently in the GitHub Actions environment.

Since e.g. Centos doesn't show these timeout problems...

There is actually no difference between Ubuntu and CentOS in terms of available runners. Both run in a Ubuntu virtual machine. Ubuntu workflow runs directly there. CentOS uses Docker on top of an Ubuntu VM (runs-on: ubuntu-20.04, container: ${{ matrix.os }}, matrix.os==centos:7). The difference is that CentOS is not running the tests, only Ubuntu does.

@wenzeslaus
Copy link
Member Author

On 2022-05-17 on several tests failed in a 20.04 tests on the main branch. In all or almost all other cases, it was always only one which would time out, although theoretically with the time limit for one test, more than one can time out (because when one times out, the other still run and thus can also time out). For the first time, there is also non-temporal test which timed out: r.patch (test_rpatch_artificial.py). Severely shortened, but still long job log follows.

...
2022-05-17T13:27:57.1371307Z Running ./raster/r.patch/testsuite/test_rpatch_artificial.py...
2022-05-17T13:32:57.2547598Z ========================================================================
2022-05-17T13:32:57.2548340Z Process has timed out in 300.0s and produced no error output.
2022-05-17T13:32:57.2548927Z ========================================================================
2022-05-17T13:32:57.2549723Z FAILED ./raster/r.patch/testsuite/test_rpatch_artificial.py - Timeout >300.0s
...
2022-05-17T13:52:56.5302060Z Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py...
2022-05-17T13:57:56.5985236Z ========================================================================
2022-05-17T13:57:56.5985819Z Default TGIS driver / database set to:
...
2022-05-17T13:57:56.5991515Z 0..100
2022-05-17T13:57:56.5991760Z 100
2022-05-17T13:57:56.5992089Z ................................========================================================================
2022-05-17T13:57:56.5992991Z FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s
2022-05-17T13:57:56.5993541Z Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra_fails.py...
2022-05-17T13:58:08.2992130Z Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra_granularity.py...
2022-05-17T14:03:08.4304176Z ========================================================================
2022-05-17T14:03:08.4304869Z Default TGIS driver / database set to:
...
2022-05-17T14:03:08.4308014Z Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_granularity_fv_az380_640_32838/tgis/sqlite.db
2022-05-17T14:03:08.4308464Z 0..100
...
2022-05-17T14:03:08.4311980Z 0..100
2022-05-17T14:03:08.4312225Z 100
2022-05-17T14:03:08.4312553Z ========================================================================
2022-05-17T14:03:08.4313385Z FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra_granularity.py - Timeout >300.0s
...
2022-05-17T14:04:41.5752525Z Running ./temporal/t.rast.export/testsuite/test_rast_export.py...
2022-05-17T14:09:41.6608059Z ========================================================================
2022-05-17T14:09:41.6608917Z Process has timed out in 300.0s and produced no error output.
2022-05-17T14:09:41.6609367Z ========================================================================
2022-05-17T14:09:41.6610724Z FAILED ./temporal/t.rast.export/testsuite/test_rast_export.py - Timeout >300.0s
2022-05-17T14:09:41.6611358Z Running ./temporal/t.rast.extract/testsuite/test_t_rast_extract.py...
...
2022-05-17T14:16:11.9419530Z Running ./temporal/t.rast.to.vect/testsuite/test_to_vect.py...
2022-05-17T14:21:12.0531293Z ========================================================================
2022-05-17T14:21:12.0531916Z ....========================================================================
2022-05-17T14:21:12.0532666Z FAILED ./temporal/t.rast.to.vect/testsuite/test_to_vect.py - Timeout >300.0s
2022-05-17T14:21:12.0533117Z Running ./temporal/t.rast.univar/testsuite/test_t_rast_univar.py...
...
2022-05-17T14:25:47.2149331Z Running ./temporal/t.sample/testsuite/test.t.sample.sh...
2022-05-17T14:30:47.3363016Z ========================================================================
2022-05-17T14:30:47.3364086Z + g.region s=0 n=80 w=0 e=120 b=0 t=50 res=10 res3=10 -p3
2022-05-17T14:30:47.3364440Z + export GRASS_OVERWRITE=1
2022-05-17T14:30:47.3364853Z + r.mapcalc expr=prec_1 = rand(0, 550) -s
...
input=precip_abs0,precip_abs0,precip_abs0,precip_abs0 samtype=stvds sample=pnts_abs0
2022-05-17T14:30:47.3401025Z + t.sample method=equal input=precip_abs0,precip_abs0,precip_abs0,precip_abs0 samtype=stvds sample=pnts_abs0
2022-05-17T14:30:47.3401461Z ========================================================================
2022-05-17T14:30:47.3401956Z FAILED ./temporal/t.sample/testsuite/test.t.sample.sh - Timeout >300.0s
2022-05-17T14:30:47.3402377Z Running ./temporal/t.shift/testsuite/test_shift.py...
...
2022-05-17T14:36:34.4549309Z Running ./temporal/t.vect.import/testsuite/test.t.vect.import.sh...
2022-05-17T14:41:34.5778920Z ========================================================================
2022-05-17T14:41:34.5780094Z + g.region s=0 n=80 w=0 e=120 b=0 t=50 res=10 res3=10 -p3
2022-05-17T14:41:34.5780454Z + mkdir test
...
2022-05-17T14:41:34.5925506Z Overwrite space time vector dataset <soil_abs2@__temporal_t_vect_import_test.t.vect.import_fv_az380_640_32838> and unregister all maps.
2022-05-17T14:41:34.5925844Z 0..100
2022-05-17T14:41:34.5926014Z 0..100
2022-05-17T14:41:34.5926233Z ========================================================================
2022-05-17T14:41:34.5926678Z FAILED ./temporal/t.vect.import/testsuite/test.t.vect.import.sh - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

On 2022-05-20 six tests failed in a 20.04 tests on the main branch including r.patch with a different mix of temporal tests than before. The r.patch test was without error output:

Running ./raster/r.patch/testsuite/test_rpatch_artificial.py...
========================================================================
Process has timed out in 300.0s and produced no error output.
========================================================================
FAILED ./raster/r.patch/testsuite/test_rpatch_artificial.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

wenzeslaus commented May 25, 2022

Another case when a test of a non-temporal module timed out: g.extension timed out together with t.rast.algebra on 2022-05-25:

Running ./scripts/g.extension/testsuite/test_addons_modules.py...
========================================================================
..========================================================================
FAILED ./scripts/g.extension/testsuite/test_addons_modules.py - Timeout >300.0s
...
Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py...
========================================================================
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az82_653_32504/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_algebra_test_raster_algebra_fv_az82_653_32504/tgis/sqlite.db
0..100
0..100
0..100
0..100
0..100
0..100
0..100
0..100
0..100
100
........................................========================================================================
FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s

Result of tree --du -h for scripts/g.extension/test_addons_modules:

.
├── [ 27K]  data
│   ├── [ 11K]  modules.xml
│   ├── [ 12K]  sample_modules
│   │   ├── [5.0K]  r.plus.example
│   │   │   ├── [ 112]  Makefile
│   │   │   ├── [ 253]  r.plus.example.html
│   │   │   └── [ 709]  r.plus.example.py
│   │   ├── [ 802]  r.plus.example_sep.tar.gz
│   │   ├── [ 836]  r.plus.example.tar.gz
│   │   └── [1.4K]  r.plus.example.zip
│   └── [ 692]  toolboxes.xml
├── [ 35K]  gextension_test_install_path
│   ├── [ 26K]  docs
│   │   ├── [ 13K]  html
│   │   │   ├── [4.0K]  grassdocs.css
│   │   │   ├── [1.7K]  grass_logo.png
│   │   │   └── [3.2K]  r.plus.example.html
│   │   └── [9.5K]  man
│   │       └── [5.5K]  man1
│   │           └── [1.5K]  r.plus.example.1
│   ├── [ 102]  extensions.xml
│   ├── [ 112]  modules.xml
│   └── [4.7K]  scripts
│       └── [ 708]  r.plus.example
├── [1.1K]  index.html
├── [  74]  stderr.html
├── [   2]  stderr.txt
├── [  72]  stdout.html
├── [   0]  stdout.txt
└── [ 104]  test_keyvalue_result.txt

  68K used in 9 directories, 21 files

@wenzeslaus
Copy link
Member Author

r.patch had race condition for CELL rasters which caused extremely long processing times on some machines, so we can assume the timeout was caused by that. Issue fixed in #2410.

g.extension timeout can be caused by connection issues, although there is no evidence for that.

This leaves us again with only temporal modules producing consistent timeouts. One way to support that conclusion is to add more tests of all other functionality except for temporal (there is a lot of temporal tests comparing to the other tests; more other tests would increase the likelihood that the non-temporal tests will time out if all test have the same probability of timeout).

In the mean time, a less common timeout of two tests at the same time occurred:

Running ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py...
========================================================================
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
...
0..100
100
.....................................========================================================================
FAILED ./temporal/t.rast.algebra/testsuite/test_raster_algebra.py - Timeout >300.0s
...
Running ./temporal/t.rast.what/testsuite/test_what.py...
========================================================================
.========================================================================
FAILED ./temporal/t.rast.what/testsuite/test_what.py - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

Another failing one to the collection just to be clear that any temporal test can time out:

Running ./temporal/t.vect.list/testsuite/test.t.vect.list.layer.sh...
========================================================================
+ g.region s=0 n=80 w=0 e=120 b=0 t=50 res=10 res3=10 -p3
+ export GRASS_OVERWRITE=1
+ v.random -z seed=1 output=soil_orig n=20 zmin=0 zmax=100 column=sand
Default driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_orig@__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271>...
Registering primitives...

+ v.category input=soil_orig out=soils option=transfer layer=1,1
WARNING: Categories already exist in layer 1
Processing features...
Copying attribute table(s)...
Building topology for vector map <soils@__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271>...
Registering primitives...

Categories copied from layer 1 to layer 1
v.category complete. 20 features modified.
+ v.category input=soils out=soil_orig option=transfer layer=1,2
WARNING: Vector map <soil_orig> already exists and will be overwritten
Processing features...
Copying attribute table(s)...
Building topology for vector map <soil_orig@__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271>...
Registering primitives...
...
+ cat
+ cat
+ cat
+ cat
+ cat
+ t.create type=stvds temporaltype=absolute output=soils_abs1 title=A test with input files descr=A test with input files
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271/tgis/sqlite.db
+ t.create type=stvds temporaltype=absolute output=soils_abs2 title=A test with input files descr=A test with input files
+ t.create type=stvds temporaltype=absolute output=soils_abs3 title=A test with input files descr=A test with input files
+ t.create type=stvds temporaltype=absolute output=soils_abs4 title=A test with input files descr=A test with input files
+ t.create type=stvds temporaltype=absolute output=soils_abs5 title=A test with input files descr=A test with input files
+ t.register type=vector input=soils_abs1 file=/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_list_test.t.vect.list.layer_fv_az616_866_32271/.tmp/fv-az616-866/1.0 start=2001-01-01 increment=1 months
0..100
0..100
========================================================================
FAILED ./temporal/t.vect.list/testsuite/test.t.vect.list.layer.sh - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

I would almost close this issue and start over with a new report, but there is actually not much change as far as the main issue discussed here goes, i.e., temporal tests sometimes time out, likely get stuck forever.

A run in a PR from July 6 in this repo failed with:

 Running ./temporal/t.rast.aggregate/testsuite/test_aggregation_absolute_parallel.py...
========================================================================
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_aggregate_test_aggregation_absolute_parallel_fv_az455_615_32500/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_rast_aggregate_test_aggregation_absolute_parallel_fv_az455_615_32500/tgis/sqlite.db
.========================================================================
FAILED ./temporal/t.rast.aggregate/testsuite/test_aggregation_absolute_parallel.py - Timeout >300.0s

Tests in the grass-addons repo still time out once in a while being canceled by GitHub after 6h hours.

@wenzeslaus
Copy link
Member Author

Recent timeouts on main (link, link):

Running ./temporal/t.rast.contour/testsuite/test_convert.py...
========================================================================
..========================================================================
FAILED ./temporal/t.rast.contour/testsuite/test_convert.py - Timeout >300.0s
Running ./temporal/t.rast3d.extract/testsuite/test_t_rast3d_extract.py...
========================================================================
..========================================================================
FAILED ./temporal/t.rast3d.extract/testsuite/test_t_rast3d_extract.py - Timeout >300.0s

When the tests pass, the test duration is 51 seconds for test_convert.py and 46 seconds for test_t_rast3d_extract.py, i.e., nowhere close to the 300s per-test limit.

Artifact for the first failure is no longer available, stderr for test_t_rast3d_extract.py was just .. (which means "two tests passed" in the Python unittest output).

@wenzeslaus
Copy link
Member Author

A slightly different output for a test.t.vect.import.sh test failure from 2 weeks ago:

Running ./temporal/t.vect.import/testsuite/test.t.vect.import.sh...
========================================================================
+ g.region s=0 n=80 w=0 e=120 b=0 t=50 res=10 res3=10 -p3
+ mkdir test
+ v.random --o -z output=soil_1 n=100 zmin=0 zmax=100 column=height seed=1
Default driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/sqlite/sqlite.db
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_1@__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454>...
Registering primitives...

+ v.random --o -z output=soil_2 n=100 zmin=0 zmax=100 column=height seed=2
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_2@__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454>...
Registering primitives...

+ v.random --o -z output=soil_3 n=100 zmin=0 zmax=100 column=height seed=3
Generating points...
0..5..10..15..20..25..30..35..40..45..50..55..60..65..70..75..80..85..90..95..100
Building topology for vector map <soil_3@__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454>...
Registering primitives...

+ g.tempfile pid=1 -d
+ n1=/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454/.tmp/fv-az809-384/1.0
+ cat
+ t.create --o type=stvds temporaltype=absolute output=soil_abs1 title=A test descr=A test
Default TGIS driver / database set to:
driver: sqlite
database: $GISDBASE/$LOCATION_NAME/$MAPSET/tgis/sqlite.db
WARNING: Temporal database connection defined as:
/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454/tgis/sqlite.db
But database file does not exist.
Creating temporal database: /home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454/tgis/sqlite.db
+ t.register -i type=vector input=soil_abs1 file=/home/runner/nc_spm_full_v2alpha2/__temporal_t_vect_import_test.t.vect.import_fv_az809_384_32454/.tmp/fv-az809-384/1.0 start=2001-01-01 increment=1 months
0..100
0..100
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar.bz2 compression=bzip2 directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar.gz compression=gzip directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=GML input=soil_abs1 output=stvds_export_gml.tar compression=no directory=/tmp
WARNING: Vector map <soil_1> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_1> (GML format).
WARNING: Vector map <soil_2> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_2> (GML format).
WARNING: Vector map <soil_3> is 3D. Use format specific layer creation options (parameter 'lco') to export in 3D rather than 2D (default).
Exporting 100 features...
5..11..17..23..29..35..41..47..53..59..65..71..77..83..89..95..100
v.out.ogr complete. 100 features (Point type) written to <soil_3> (GML format).
+ t.vect.export format=pack input=soil_abs1 output=stvds_export_pack.tar.bz2 compression=bzip2 directory=/tmp
Pack file </tmp/tmp0r2y61ej/soil_1.pack> created
Pack file </tmp/tmp0r2y61ej/soil_2.pack> created
Pack file </tmp/tmp0r2y61ej/soil_3.pack> created
========================================================================
FAILED ./temporal/t.vect.import/testsuite/test.t.vect.import.sh - Timeout >300.0s

@wenzeslaus
Copy link
Member Author

I just restarted more than five jobs across main, 8.3 and 8.2 branches with various tests failing.

@wenzeslaus
Copy link
Member Author

Observed on Oct 9:

Failing on main with timeout in different two jobs:

temporal/t.rast.algebra/testsuite/test_raster_algebra_operators.py
temporal/t.vect.import/testsuite/test.t.vect.import.sh

Last running for almost 6h (stuck) in grass-addons in different two jobs:

temporal/t.rast.import.netcdf/testsuite/test_t_rast_import_netcdf.py
temporal/t.rast.what.aggr/testsuite/test_whataggr.py

@wenzeslaus
Copy link
Member Author

An unusual timeout:

Running ./raster/r.random.walk/testsuite/test_r_random_walk.py...
Error: The operation was canceled.

Even more interestingly, test_r_random_walk.py is not immediately followed by tests with temporal framework usage.

@wenzeslaus
Copy link
Member Author

Temporal-related tests timing out also through pytest:

=========================== short test summary info ============================
ERROR python/grass/jupyter/tests/timeseriesmap_test.py::test_collect_layers - Failed: Timeout >300.0s
ERROR python/grass/jupyter/tests/timeseriesmap_test.py::test_default_init - Failed: Timeout >300.0s
ERROR python/grass/jupyter/tests/timeseriesmap_test.py::test_render_layers[False] - Failed: Timeout >300.0s
ERROR python/grass/jupyter/tests/timeseriesmap_test.py::test_render_layers[True] - Failed: Timeout >300.0s
ERROR python/grass/jupyter/tests/timeseriesmap_test.py::test_save - Failed: Timeout >300.0s
============ 142 passed, 36 warnings, 5 errors in 363.09s (0:06:03) ============

@wenzeslaus
Copy link
Member Author

One of specific failures in pytest pointing with timeout when waiting for t.register (after a successful t.create):

 _________________________ ERROR at setup of test_json __________________________
[gw1] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

tmp_path_factory = TempPathFactory(_given_basetemp=PosixPath('/tmp/pytest-of-runner/pytest-0/popen-gw1'), _trace=<pluggy._tracing.TagTrac...c7e54f0>, _basetemp=PosixPath('/tmp/pytest-of-runner/pytest-0/popen-gw1'), _retention_count=3, _retention_policy='all')

    @pytest.fixture(scope="module")
    def space_time_raster_dataset(tmp_path_factory):
        """Start a session and create a raster time series
    
        Returns object with attributes about the dataset.
        """
        tmp_path = tmp_path_factory.mktemp("raster_time_series")
        location = "test"
        gs.core._create_location_xy(tmp_path, location)  # pylint: disable=protected-access
        with gs.setup.init(tmp_path / location):
            gs.run_command("g.region", s=0, n=80, w=0, e=120, b=0, t=50, res=10, res3=10)
            names = [f"precipitation_{i}" for i in range(1, 7)]
            max_values = [550, 450, 320, 510, 300, 650]
            for name, value in zip(names, max_values):
                gs.mapcalc(f"{name} = rand(0, {value})", seed=1)
            dataset_name = "precipitation"
            gs.run_command(
                "t.create",
                type="strds",
                temporaltype="absolute",
                output=dataset_name,
                title="Precipitation",
                description="Random series generated for tests",
            )
            dataset_file = tmp_path / "names.txt"
            dataset_file.write_text("\n".join(names))
>           gs.run_command(
                "t.register",
                type="raster",
                flags="i",
                input=dataset_name,
                file=dataset_file,
                start="2001-01-01",
                increment="1 month",
            )

temporal/t.rast.list/tests/conftest.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../install/grass84/etc/python/grass/script/core.py:486: in run_command
    returncode = ps.wait()
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/subprocess.py:1083: in wait
    return self._wait(timeout=timeout)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/subprocess.py:1822: in _wait
    (pid, sts) = self._try_wait(0)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <grass.script.core.Popen object at 0x7fd7b1184160>, wait_flags = 0

    def _try_wait(self, wait_flags):
        """All callers to this function MUST hold self._waitpid_lock."""
        try:
>           (pid, sts) = os.waitpid(self.pid, wait_flags)
E           Failed: Timeout >300.0s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed temporal Related to temporal data processing
Development

No branches or pull requests

4 participants