Proposed re-write of tools/testharness.py #326
base: main
Conversation
… that returned errors.
…on setup in YAML.
…ours (potentially missing some); Fix for examples/top_hat; Various minor fixes in tools/testharness.py, Makefile.in and .github/workflows/ubuntu.yml
Many thanks for this and apologies for the slow response - I'll do my best to look through it and comment by the end of the week!
@PYTHONPATH=python tools/testharness.py --length=special --clean >/dev/null
@PYTHONPATH=python tools/testharness.py --length=special --parallelism=parallel --clean >/dev/null
The 'special' tests might be worth removing at this point. The tag was introduced as a 'do not run, this is broken' option, and they've not been tested in a long time, so could probably go. If anyone really cares about them I think they'd have fixed them some time in the past decade :-)
With apologies for taking so long to get round to reviewing this (it's been an interesting week..) this looks superb - thank you so much for all your work on it!
As noted in my comments, the one thing I'd possibly wrap into this is dropping the 'special' length tests, which were always a hack to get round tests that were failing and which someone had a thought of fixing at some point. That was on the order of a decade ago, and since those tests are still languishing in special status and giving a false impression of test coverage, they should probably be dropped along with the 'special' length itself.
Also, as noted, I'm checking whether I've correctly interpreted the change from 'run tests on all successful builds' to 'fail quickly if any builds fail' and, if so, whether that's the preferred outcome.
I'm very much a blunt instrument python programmer so I'm not well placed to comment on the python itself (it's way better than anything I write!) but I found it very clear to read through and made complete sense of what it was doing, all of which seemed sensible and thorough.
The latest run at the time of this comment failed for what looks like Actions-internal reasons (one worker ran out of space; other workers look to have been killed or otherwise gone unresponsive), which I've triggered a re-run of to check.
Just as a sanity check - are there runs with intentional fails triggered to make sure they do indeed behave as expected in terms of catching make input / shell / python / variable errors?
Other than the 'special length' consideration and checking preferred behavior if builds fail this all looks great to me.
I'll defer to @stephankramer on all matters relating to the actual presentation of results on Actions, as I think this is far more the domain of someone actively developing as opposed to managing Fluidity :-)
tools/testharness.py
Outdated
# Define conditions the test must meet
length_condition = (
    (args.length == "any"
     and prob_length in ["special", "long", "vlong"])
As per note in Makefile, I think we should consider dropping specials at this point - they're a hack and make it look like we have test coverage in some areas where we actually don't.
tools/testharness.py
Outdated
elif set(args.length).difference(["short", "medium", "long", "vlong",
                                  "special"]):
    parser.error("""Specify length as either of short, medium, long, vlong, \
or special.""")
Comment as above regarding dropping special tag.
runs-on: ubuntu-latest
needs: build
if: always()
I have a half-memory that without this 'if' Actions won't run the testing if any of the builds fail, and my thought in including it had been that the testing of distros that did have a clean build was useful even if one or more of the other distro builds failed. The testing phase from the failed build should fail quickly due to the missing Docker container.
The other approach which also seems good to me was that if any of the builds fail, you're more concerned about that than subsequent test failures on any build, so you want the whole Actions run to exit quickly and free up workers for following runs fixing the build.
I don't have strong views either way :-)
.github/workflows/ubuntu.yml
Outdated
exclude:
  - release: focal
    test-xml: tides_in_the_Mediterranean_Sea.xml
  - release: focal
    test-xml: flow_past_sphere_Re1000.xml
  - release: focal
    test-xml: driven_cavity.xml
  - release: focal
    test-xml: backward_facing_step_3d.xml
That is so much neater 👍
@tmbgreaves Thank you very much for your review, and thank you for the explanations. I have cancelled the Actions workflow you restarted as it will not be useful in the current circumstances - sorry about the confusion. To understand why, please have a look at the latest update of the OP in this PR. I have kept updating it after getting results from Actions (potentially not best practice, but I did not want to flood the PR with lots of comments). I would actually be grateful to have your thoughts on what I am proposing there. There are currently no intentionally failing tests. I have had a play locally with all the various kinds of failures that I could think of, and it seemed to me everything was behaving properly. I could add some failures to illustrate the results, but it would potentially be preferable that a reviewer does it, as they could come up with failure ideas I could have overlooked. Additionally, some Actions workflows returned failures and thereby provide examples of the current strategy used to deal with them. Thank you again for your review.
Thanks @Patol75 - now gone back and reread the OP :-) Comments on thoughts there:
Thanks again :)
Additionally, I have noticed that, following the cancelled Actions workflow, the output archive has been overwritten. Fortunately, I had it downloaded. Please access it here.
…ag for now; Re-assign length tag for all tests in /tests and /examples, but keep them where they are for now; Change two test-directory names to make them match with their associated XML filename; Exclude /longtests from Actions for now until a similar PR is merged there; Fix failure in examples/lock_exchange; Miscellaneous fixes.
Thank you @Patol75 :-) One more thought on the back of your reassigning lengths - the 'longtests' repo was initially conceived to try and reduce the size of the Fluidity repo a bit by taking very large tests (which generally corresponded to very long tests) into a separate repo. However, there are now a reasonable number of 'long' tests which are easily small enough to be in the main repo. Rather than moving newly-long tests from main repo to longtests repo, might it be worthwhile leaving them where they are, and potentially even moving some of the smaller longtests back into the main repo? Thoughts very much welcomed on this :-)
Hi Tim,
I don't see any harm in moving some of the long tests back into the main repo, provided they remain tagged as "long" - it's always nice to be able to run the "short" and "medium" test suites on local systems.
Out of interest, which long tests are excessively large? Are there obvious reasons (for example meshes have been committed rather than just .geo files)? How many of these can be trimmed?
R
Agreed, @drhodrid - definitely keen to keep the 'long' tags to ensure short/medium runs are manageable. Here are the >1MB results from tests/ and a size-sorted list from longtests: https://pastebin.com/gmapxhMX
I'd propose at minimum moving all the <1MB tests from longtests/ to tests/, and possibly renaming the 'longtests' repo to 'extratests' or something like that, to avoid confusion?
@tmbgreaves @drhodrid Thank you for this additional information, I will keep/put all vshort, short and medium tests in
Regarding the length re-assigning I did yesterday, it triggered a few interesting events. First, as I did not use
To be more precise about the validation failures, the actual reason for failures is the attribute
…ness based on stderr generated in diamond_validation; Moved tests from longtests inside tests to assess the behaviour on Actions (vlong tests excluded from the workflow).
I think @drhodrid is potentially right regarding why some tests have a large directory size. Many of the tests in
To make things easier, I have pushed another branch called testharnessClean where I have only put essential changes made so far. This is much easier to look at than having to go through all the commits of this PR. Again, from my understanding, there are still some steps to undertake:
Depending on how many tests can be trimmed, the longtests repository might not be needed anymore. If it is still needed, then @tmbgreaves' idea of renaming it makes sense:
Additionally, there is still the question: is the fact that some XML fail in
On testharnessClean, I have copied all tests from longtests inside tests and correctly re-assigned all test lengths (at least, this time, I hope so!). This is currently running on Actions. If results are good, then the first bullet point above will be done.
I have made further progress in the testharnessClean branch. Excluding the now 35 vlong tests, the whole test suite takes about 6 hours to run on Actions. I have improved the way errors generated through Python tests are dealt with in testharness, and it looks robust now. Nonetheless, the short-tests job on bionic yielded an error in testharness, which did not occur in the focal and groovy jobs, and which I did not manage to reproduce locally. I guess I will have to trigger a specific Actions run with some print statements to figure out what is going on there. Other than that, many tests actually fail, as summarised in the Annotations panel of the workflow linked above. I think these are legitimate failures that need to be addressed. Regarding the bullet points I mentioned in the above message, length re-assigning is done, but might need further tuning when all tests actually pass. I believe no comments have been lost in the XML files this time, but I have removed the DOCTYPE entries as they were pointing to non-existing files and are likely unnecessary. Additionally, I have removed some committed MSH files and modified the associated Makefiles to generate meshes during test execution. The latter action came as a result of copying all longtests directories into tests and committing them, which by default excludes files with a .msh extension. Nonetheless, quite a few tests remain large and need attention. To know which ones, simply check out
Finally, I will probably not be able to figure out all the failures by myself, so help would be appreciated.
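For context on the error handling mentioned above, the sketch below shows one possible way a harness can capture errors raised by a test's Python blocks so that they end up in the XML report; it is an illustration under assumed names, not the actual testharness implementation:

```python
# Minimal illustration of capturing errors raised by a test's Python block so
# they can be attached to the XML report; this is not the actual testharness
# implementation, just one possible approach.
import io
import traceback
from contextlib import redirect_stderr, redirect_stdout


def run_python_block(code, variables=None):
    """Execute a test's Python code, returning (passed, stdout, stderr)."""
    out, err = io.StringIO(), io.StringIO()
    try:
        with redirect_stdout(out), redirect_stderr(err):
            exec(code, variables if variables is not None else {})
        return True, out.getvalue(), err.getvalue()
    except Exception:
        # Keep the full traceback so the failure is debuggable from the report
        return False, out.getvalue(), err.getvalue() + traceback.format_exc()
```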
An additional note on the current failures reported by Actions: most, if not all, of the tests that fail are some of the directories for which I removed the committed MSH files and included
Apologies for being silent on this one for so long, @Patol75 - chaos this end on multiple levels has pulled me away from Fluidity work. Currently aiming to refocus on Fluidity mid-September, assuming other work packages go as expected.
Small comment on this: I've experienced significantly different mesh outputs for the same instructions across gmsh "minor" version changes, e.g. 4.4 vs 4.8. In other words, gmsh doesn't really use semantic versioning. Maybe the .msh files were created using a slightly older gmsh version? Another reason against committing .msh files.
Some prelim. line-by-line comments - to be continued...looking really good so far!
tools/testharness.py
Outdated
xml_entry = TestCase(
    name=xml_file.stem, classname=prob_classname, status="Skipped")
xml_entry.add_skipped_info(message="Test suite conditions unmet.")
xml_parser.test_cases.append(xml_entry)
Hm, this is a bit wonky - you're modifying a global variable, which is unhelpfully called xml_parser, whereas it seems to be a TestSuite. So test_suite or junit_test_suite would be a better name? And test_case instead of xml_entry? (I realize this was bad in the old testharness already)
But more importantly, could you just return a list of skipped test cases? Or add it in the same dictionary but with a flag "skipped": True, and create the junit stuff all in one place at the end...
Agreed with the naming conventions; is it a problem that test_suite is defined in the outermost scope?
Regarding the skipped tests, they should already be included in the XML report. What did you have in mind with that list idea?
The problem with creating all TestCase entries for the JUnit report at the end is that different arguments are involved in the call depending on the test's final status, and so multiple calls to TestCase arranged in an if, elif, else block would still be required. I am not sure this is a better way to do it.
It's just bad code structure. As a general rule you should only use global variables (and even more so modify them) in exceptional cases. Otherwise your code will be very hard to understand/change later on. When I'm reading that routine I have no idea where test_suite is being defined. When I'm actually reading the bottom of the code, where test_suite is created and later on the completed test results are appended, I have no way of telling that the in-between call to filter_tests also modifies test_suite. So it becomes very hard to reason about the data flow - and for instance I might be tempted to move the creation of the test_suite to after the call to filter_tests.
So as a minimum you should have test_suite as an argument to filter_tests - to signify that this routine is doing something to that test_suite - and then tell us at the top what it is: namely "This routine filters tests that fulfil our chosen criteria and returns them in a dictionary. Information about skipped tests is stored in the junit TestSuite." Which brings us to the second part of what I suggested (but that's more of a suggestion, and if it doesn't work easily feel free to ignore it): if you think about your routine in this way, it becomes clear that it is doing multiple things to multiple data structures at the same time. Again as a general rule, you want to break up your code in such a way that each function has a single responsibility. This makes it much easier to modify the code later on. Even when you're not using classes, try to think about your routines in terms of the data structures they operate on, and ideally only modify one structure at a time.
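For what it's worth, here is a minimal sketch of the structure being suggested, assuming the junit_xml TestCase/TestSuite objects used in the snippet above; apart from filter_tests, the names and the shape of the inputs are illustrative rather than taken from the actual script:

```python
# Hedged sketch of the suggested structure; the input dict (Path -> length)
# and variable names are illustrative, not the actual testharness.py API.
from junit_xml import TestCase, TestSuite


def filter_tests(xml_files, wanted_lengths, test_suite):
    """Return a dict of tests meeting the chosen criteria.

    Tests that do not qualify are recorded as skipped TestCase entries in the
    supplied junit TestSuite, so the caller can see this side effect.
    """
    selected = {}
    for xml_file, length in xml_files.items():
        if length in wanted_lengths:
            selected[xml_file.stem] = xml_file
        else:
            test_case = TestCase(name=xml_file.stem, status="Skipped")
            test_case.add_skipped_info(message="Test suite conditions unmet.")
            test_suite.test_cases.append(test_case)
    return selected


# The data flow is now visible at the call site, e.g.:
# test_suite = TestSuite("testharness")
# tests = filter_tests(discovered_xml_files, {"short", "medium"}, test_suite)
```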
Given that Python allows it, I did not see an issue in doing it. However, it is true that after searching the Internet, I could not find anyone arguing that way, but I found a few discussions (e.g. here) where people also recommend passing the class instance to the function, so I am now doing it. Looking further through the code, I also treat the dictionary tests as a global in quite a few functions. You would recommend changing that too, right?
Regarding where the JUnit report is created/modified, I think it would be quite annoying to have to store information elsewhere when we can actually do it right away in the JUnit objects as we process the tests. It does not seem straightforward to gather all possible scenarios in one function: process_error looks too complex for that. If you have an idea about how to do it, I am happy to try to implement it, but for now, I have just expanded a few comments and docstrings as suggested.
Thank you for looking into these proposed changes to the testing script, @stephankramer. Unfortunately, the changes here are rather messy, and I had kept this PR open mainly for discussion. I had created another branch that would likely be a better candidate for a merge, even though it is now quite outdated. The main problem that was left in that other branch was to decide what to do with those tests that had their MSH file(s) committed; most of them were hosted in the longtests repository. I guess there are three options: (i) keep the committed MSH files and potentially rename longtests to largetests, given that it would be hosting all these tests that result in a large directory size; (ii) generate GEO files that can reproduce meshes described by the MSH files; (iii) drop tests that have MSH files committed. A combination of (i) and (ii) is probably what we should aim for, but happy to have your thoughts about it. Regarding
Yeah I agree that we should probably go for a combination of (i) and (ii) (and possibly (iii) :-))
This is great stuff. I'm not normally a fan of rewriting things from scratch, but sometimes it works out quite nicely.
One missing feature is the --just-test option, which I do use frequently: if you write a test that takes a while to run, you don't want to be rerunning it every time you're tweaking some of the python tests.
Thanks again for all the reviews; the code looks better now. Happy to get more feedback if you have time. Regarding the version of GMSH used in the CI, I have actually used the latest GMSH in the CMake branch for all tests, and it is true that a large number of tests reported failures. Some of them behaved as expected when Ideally, I think we would want to have only GEO files and accept that answers can change, whilst also having tests sufficiently robust so that answers would not change too much. The problem here is that it could mean quite a lot of work for these tests that currently fail with the newer versions of GMSH. I am saying could because a lot of those tests look very similar to me; there could be a common strategy to make them all more robust. My experience is that only very few tests currently lack a GEO file, and an additional very few fail to run on Actions because of resource limits (time and/or number of CPU cores). These would indeed be good candidates for the alternative repository, although some committed MSH files are not that big. Moreover, other tests have committed files other than MSH that are quite big, so they could also be legitimate candidates. Finally, regarding your comment about pushing large files, I think I have three branches where it most likely happened: testharness, testharnessClean and cmake_build_2022. Would deleting the branches be sufficient when changes are incorporated into main (or discarded)? |
With this PR, I would like to propose a re-write of Fluidity's test harness. The main ideas are to potentially speed up the execution of the test suites and provide a complete, well-formatted XML test report that includes everything that was re-directed to stdout and stderr during test executions, thereby easing debugging when necessary. I say potentially in regards to speeding up the execution because I am unsure about how GitHub Actions will allocate resources, and this could require some changes. In the proposed implementation, a target is given for the number of logical cores and tests will run concurrently without exceeding that target, except in the case of a test that requires more cores than the target for which oversubscribing is allowed (one oversubscribed test at a time).
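To make the scheduling idea concrete, here is a minimal sketch of that core-budget behaviour, assuming each test has been reduced to a (command, nprocs) pair; the names are illustrative and this is not the code in the PR:

```python
# Simplified sketch of the core-budget scheduling described above; names
# (tests, run_concurrently) are illustrative, not from tools/testharness.py.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from threading import Condition


def run_concurrently(tests, core_target):
    """Run (command, nprocs) pairs without exceeding core_target cores."""
    cores_in_use = 0
    budget = Condition()

    def run_one(command, nprocs):
        nonlocal cores_in_use
        # A test needing more cores than the target claims the whole budget,
        # so it runs alone (oversubscribed, one such test at a time).
        needed = min(nprocs, core_target)
        with budget:
            budget.wait_for(lambda: cores_in_use + needed <= core_target)
            cores_in_use += needed
        try:
            # stdout/stderr are captured so they can go into the XML report
            return subprocess.run(command, capture_output=True, text=True)
        finally:
            with budget:
                cores_in_use -= needed
                budget.notify_all()

    with ThreadPoolExecutor(max_workers=max(1, len(tests))) as pool:
        futures = [pool.submit(run_one, cmd, n) for cmd, n in tests]
    return [future.result() for future in futures]
```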
I have combined tools/testharness.py and python/fluidity/regressiontest.py into a single tools/testharness.py which handles the whole execution. I believe the file is clearly written, and I have added comments to improve readability - nevertheless, more could be needed.
I have modified a few tests which reported failures that are, I believe, mostly linked to a forgotten clean-up. In particular:
- MMS_D.flml, which I believe is not part of the test anymore
- max in some Python tests, which leads to failures in the proposed implementation
- plotfs_detec.py, which I believe is not necessary anymore following the new HDF5 implementation

I have modified Makefile.in and .github/workflows/ubuntu.yml to account for the changes made to tools/testharness.py - hopefully, nothing problematic there.
I attach here an example XML output produced running tools/testharness.py -n 8 -e exodusii -x tests_results.xml. Two test failures were reported, but I believe they are platform-dependent as they do not occur on Actions, though I have not investigated.

From Actions results:
- action="append" instead of nargs="*", to properly handle tags supplied through the Makefile.
defined on line 79. I have changed the way the length argument is dealt with, so they should all run now. Additionally,particle_stratified_stable_layer
is running because its length is "medium" which, I believe, is a mistake.The above should yield a green from Actions, except if something arises in longtests (untested on my side). However, I am not committing yet to avoid another Actions run.
Unfortunately, there was no speed up. @tmbgreaves any idea if there is a way to request a given amount of logical cores from Actions and use them in a given job? The short suite completes in less than an hour locally for me.
Having a further look at Actions, it seems to me that concurrency is only possible through parallel jobs, and not through parallel tasks within a job. If this is correct, then the only way to speed up test suites is to execute a single test at a time, as is done for longtests.
This actually might be possible (see here). Say, in tools/testharness.py, I add --github as a command-line argument and use it in the following way:
Then I have a JSON file at fluidity's root that contains all the tests that need to be run. In the YAML, we could have the following jobs: building, unittest, reading-JSON, test (both short and medium), longtest (both long and vlong). The idea is that in "reading-JSON", we use the outputs capability from Actions:
And in "test":
which automatically generates the matrix as is done manually for longtests at the moment. I would be happy to try this and see if that gives a speedup. However, I am very unfamiliar with Docker and I am unsure of how to proceed. @tmbgreaves do you think that sounds possible?
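A minimal sketch of what the proposed --github mode could look like on the Python side; the output file name and field layout are assumptions, with test-xml chosen only to mirror the matrix key used in the workflow excerpt above:

```python
# Sketch of the proposed --github mode: instead of running anything, dump the
# selected tests to a JSON file that a "reading-JSON" job can expose through
# its outputs and feed into fromJSON() to build the test matrix. The file
# name and field layout here are assumptions, not settled choices.
import json
from pathlib import Path


def write_github_matrix(tests, output="tests_to_run.json"):
    """Write the selected tests as a JSON object usable in a matrix strategy."""
    matrix = {"include": [{"test-xml": name} for name in sorted(tests)]}
    Path(output).write_text(json.dumps(matrix))
    return matrix
```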
As a proof of concept, the latest Actions workflow has successfully passed Building, Unittesting and Mediumtesting on Bionic, Focal and Groovy in just over two hours. In comparison, the same load took about five hours in the previous successful Actions workflow. However, it is not clear to me if the same benefit can be obtained for Shorttesting as the overhead before the actual test executes will be too big (for example, see manual.xml in the latest workflow). Additionally, there are over 300 short tests, and the maximum amount of jobs allowed for a given matrix is 256. Therefore, it is potentially better to just execute them sequentially through a single call to testharness (passing -n 0 should do the trick). For Longtesting, the strategy should stay the same, except that we avoid explicitly writing them one by one in the YAML. This being said, it looks like there are a few failures (e.g. top hat), so I will need to have a look. Finally, XML reports could potentially be fused to avoid having so many of them if the proposed strategy is to be kept, but I do not think it is a requirement.
Based on the latest workflow, I have compiled a textfile that lists all tests based on how long they took to run. For each test, I have included the folder where the test is located and the current length given to that test. Potentially, we would want to re-assign some tests, for example not to have any medium test faster than a short test (and all the other possibilities). Something to have in mind is that execution is fast on Actions: a given test takes three times longer to run on my desktop machine (for example, popbal_nonhomog2_2d_dg_p2: 6127 seconds on my machine, 2170 seconds on Actions). As a suggestion, we could have:

Additionally, lock_exchange_3d_parallel and restratification_after_oodc yield Exit code 137, which I have not looked at yet. The example lock_exchange is failing because of le_tools.py - I have fixed it and it should take about 2600 seconds to run. saltfinger2d_adaptive fails while meshing - I cannot reproduce the failure locally (gmsh 3.0.6 and 4.7.1). particle_stratified_stable_layer does not fail (see mistake mentioned above) - from a previous run, it completes in 11708.586653981 seconds (should be moved to vlong according to the above).

Finally, GitHub says we are running out of disk space as, I believe, the Actions artifacts accumulate. Older ones should be deleted. Also, on an unrelated note, regarding Ubuntu releases, do we plan on replacing groovy by hirsute?