
Proposed re-write of tools/testharness.py #326

Open · wants to merge 14 commits into main

Conversation

@Patol75 (Contributor) commented Jun 13, 2021:

With this PR, I would like to propose a re-write of Fluidity's test harness. The main ideas are to potentially speed up the execution of the test suites and to provide a complete, well-formatted XML test report that includes everything redirected to stdout and stderr during test execution, thereby easing debugging when necessary. I say "potentially" with regard to the speed-up because I am unsure how GitHub Actions will allocate resources, and this could require some changes. In the proposed implementation, a target is given for the number of logical cores, and tests run concurrently without exceeding that target, except for tests that require more cores than the target, for which oversubscription is allowed (one oversubscribed test at a time).
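For illustration only, here is a minimal sketch of that scheduling idea - not the implementation in this PR; the queue entries, function name and polling interval are all assumptions:

import subprocess
import time

def run_within_core_target(queue, target):
    # queue holds (cores, command) pairs; target is the logical-core budget.
    running = []            # (process, cores) pairs currently executing
    cores_in_use = 0
    oversubscribed = False  # True while a test larger than the target runs
    while queue or running:
        # Reap finished tests and release their cores.
        for proc, cores in running[:]:
            if proc.poll() is not None:
                running.remove((proc, cores))
                cores_in_use -= cores
                if cores > target:
                    oversubscribed = False
        # Start queued tests that fit in the remaining budget; allow a single
        # oversubscribed test (more cores than the target) at a time.
        for cores, command in queue[:]:
            fits = cores <= target - cores_in_use
            allow_oversub = cores > target and not oversubscribed
            if fits or allow_oversub:
                queue.remove((cores, command))
                running.append((subprocess.Popen(command), cores))
                cores_in_use += cores
                if cores > target:
                    oversubscribed = True
        time.sleep(1)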

I have combined tools/testharness.py and python/fluidity/regressiontest.py into a single tools/testharness.py which handles the whole execution. I believe the file is clearly written, and I have added comments to improve readability - nevertheless, more could be needed.

I have modified a few tests which reported failures that are, I believe, mostly linked to a forgotten clean-up.

  • tests/mms_les_second_order_p1dgp2 looks for a file called MMS_D.flml, which I believe is not part of the test anymore
  • tests/square-convection re-defines max in some Python tests, which leads to failures in the proposed implementation
  • tests/wetting_and_drying_balzano*_cg* call a Python script named plotfs_detec.py, which I believe is not necessary anymore following the new HDF5 implementation

I have modified Makefile.in and .github/workflows/ubuntu.yml to account for the changes made to tools/testharness.py - hopefully, nothing problematic there.

I attach here an example XML output produced by running tools/testharness.py -n 8 -e exodusii -x tests_results.xml. Two test failures were reported, but I believe they are platform-dependent as they do not occur on Actions, though I have not investigated.


From Actions results:

  • I have added the test name in all error messages so that it is clear right away from the Summary tab which test failed.
  • I have used action="append" instead of nargs="*" to properly handle tags supplied through the Makefile (see the short illustration after this list).
  • Longtests are failing because of the length_condition defined on line 79. I have changed the way the length argument is dealt with, so they should all run now. Additionally, particle_stratified_stable_layer is running because its length is "medium" which, I believe, is a mistake.
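For reference, a quick illustration of the argparse behaviour behind that change; the --tags option name here is an assumption rather than taken from the file:

import argparse

# action="append" collects one entry per occurrence of the flag, so a command
# line assembled by the Makefile can repeat "--tags <tag>" safely; nargs="*"
# instead makes a single flag greedily consume the words that follow it.
parser = argparse.ArgumentParser()
parser.add_argument("--tags", action="append")
print(parser.parse_args(["--tags", "flml", "--tags", "parallel"]).tags)
# -> ['flml', 'parallel']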

The above should yield a green run from Actions, unless something arises in longtests (untested on my side). However, I am not committing yet, to avoid triggering another Actions run.

Unfortunately, there was no speed up. @tmbgreaves any idea if there is a way to request a given amount of logical cores from Actions and use them in a given job? The short suite completes in less than an hour locally for me.


Having a further look at Actions, it seems to me that concurrency is only possible through parallel jobs, and not through parallel tasks within a job. If this is correct, then the only way to speed up test suites is to execute a single test at a time, as is done for longtests.


This actually might be possible (see here). Say, in tools/testharness.py, I add --github as a command-line argument and use it in the following way:

if args.just_list:
    print("*** Found tests that match the input criteria")
    for test_xml in sorted(tests.keys()):
        print(f"\t-> {test_xml.stem}")
    if args.github:
        import json
        with open("github_tests.json", "w") as fid:
            json.dump([test.name for test in tests.keys()], fid)

Then I have a JSON file at Fluidity's root that contains all the tests that need to be run. In the YAML, we could have the following jobs: building, unittest, reading-JSON, test (both short and medium), longtest (both long and vlong). The idea is that in "reading-JSON", we use the outputs capability from Actions:

outputs:
  matrix: ${{ steps.set-matrix.outputs.matrix }}
steps:
  - name: Set matrix for build
    id: set-matrix
    run: |
      bin/testharness --just-list --github
      echo "::set-output name=matrix::$( echo "$(cat github_tests.json)" )"

And in "test":

needs: reading-JSON
strategy:
  matrix: ${{fromJson(needs.reading-JSON.outputs.matrix)}}

which automatically generates the matrix as is done manually for longtests at the moment. I would be happy to try this and see if that gives a speedup. However, I am very unfamiliar with Docker and I am unsure of how to proceed. @tmbgreaves do you think that sounds possible?


As a proof of concept, the latest Actions workflow has successfully passed Building, Unittesting and Mediumtesting on Bionic, Focal and Groovy in just over two hours. In comparison, the same load took about five hours in the previous successful Actions workflow. However, it is not clear to me if the same benefit can be obtained for Shorttesting, as the overhead before the actual test executes will be too large (for example, see manual.xml in the latest workflow). Additionally, there are over 300 short tests, and the maximum number of jobs allowed for a given matrix is 256. Therefore, it is potentially better to just execute them sequentially through a single call to testharness (passing -n 0 should do the trick). For Longtesting, the strategy should stay the same, except that we avoid explicitly writing the tests one by one in the YAML. This being said, it looks like there are a few failures (e.g. top hat), so I will need to have a look. Finally, the XML reports could potentially be merged to avoid having so many of them if the proposed strategy is kept, but I do not think it is a requirement.


Based on the latest workflow, I have compiled a text file that lists all tests by how long they took to run. For each test, I have included the folder where the test is located and the length currently assigned to it. Potentially, we would want to re-assign some tests, for example so that no medium test is faster than a short test (and similarly for the other combinations). Something to keep in mind is that execution is fast on Actions: a given test takes about three times longer to run on my desktop machine (for example, popbal_nonhomog2_2d_dg_p2: 6127 seconds on my machine, 2170 seconds on Actions). As a suggestion, we could have the following bins (also sketched as a small helper after this list):

  • vlong -> more than 3 hours (10,800 seconds), about 30 tests; excluded from Actions
  • long -> more than 30 minutes (1,800 seconds), about 41 tests; only tested on focal, one test at a time as in the current longtest matrix
  • medium -> more than 3 minutes (180 seconds), about 96 tests; tested on bionic, focal and groovy, one test at a time as in the current mediumtest matrix (though, in that case, we can only have up to 85 tests otherwise the 256 jobs per matrix limit will be exceeded, or we remove one Ubuntu release)
  • short -> more than 10 seconds, about 156 tests; tested on bionic, focal and groovy, altogether as in the current shorttest matrix
  • vshort -> remaining tests, about 184 tests; tested on bionic, focal and groovy, altogether as in the current shorttest matrix
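For illustration, the thresholds proposed above written out as a helper (runtimes in seconds; the tag names are the ones used by testharness):

def assign_length(runtime):
    """Map a runtime in seconds to the proposed length tag."""
    if runtime > 3 * 3600:   # more than 3 hours
        return "vlong"
    if runtime > 30 * 60:    # more than 30 minutes
        return "long"
    if runtime > 3 * 60:     # more than 3 minutes
        return "medium"
    if runtime > 10:         # more than 10 seconds
        return "short"
    return "vshort"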

Additionally, lock_exchange_3d_parallel and restratification_after_oodc yield Exit code 137, which I have not looked at yet. The example lock_exchange is failing because of le_tools.py - I have fixed it and it should take about 2600 seconds to run. saltfinger2d_adaptive fails while meshing - I cannot reproduce the failure locally (gmsh 3.0.6 and 4.7.1). particle_stratified_stable_layer does not fail (see mistake mentioned above) - from a previous run, it completes in 11708.586653981 seconds (should be moved to vlong according to the above).

Finally, GitHub says we are running out of disk space as, I believe, the Actions artifacts accumulate. Older ones should be deleted. Also, on an unrelated note, regarding Ubuntu releases, do we plan on replacing groovy by hirsute?

@tmbgreaves (Contributor) commented:

Many thanks for this and apologies for the slow response - I'll do my best to look through it and comment by the end of the week!

@PYTHONPATH=python tools/testharness.py --length=special --clean >/dev/null
@PYTHONPATH=python tools/testharness.py --length=special --parallelism=parallel --clean >/dev/null
Review comment (Contributor):

The 'special' tests might be worth removing at this point. The tag was introduced as a 'do not run, this is broken' option, and they've not been tested in a long time, so could probably go. If anyone really cares about them I think they'd have fixed them some time in the past decade :-)

@tmbgreaves (Contributor) left a comment:

With apologies for taking so long to get round to reviewing this (it's been an interesting week..) this looks superb - thank you so much for all your work on it!

As noted in my comments, the one thing I'd possibly wrap into this is dropping the 'special' length tests, which were always a hack to get round tests that were failing and that someone had a thought of fixing at some point. That was on the order of a decade ago, I guess, and since those tests are still languishing in special status and giving a false impression of test coverage, I think they should be dropped along with the 'special' length itself.

Also, as noted, I'm checking whether I've correctly interpreted the change from 'run tests on all successful builds' to 'fail quickly if any builds fail' and, if so, whether that's the preferred outcome.

I'm very much a blunt instrument python programmer so I'm not well placed to comment on the python itself (it's way better than anything I write!) but I found it very clear to read through and made complete sense of what it was doing, all of which seemed sensible and thorough.

The latest run at the time of this comment failed for what looks like Actions-internal reasons (ran out of space on a worker, other workers look to have been killed or otherwise gone unresponsive) which I've triggered re-run on to check.

Just as a sanity check - are there runs with intentional fails triggered to make sure they do indeed behave as expected in terms of catching make input / shell / python / variable errors?

Other than the 'special length' consideration and checking preferred behavior if builds fail this all looks great to me.

I'll defer to @stephankramer on all matters relating to the actual presentation of results on Actions, as I think this is far more the domain of someone actively developing as opposed to managing Fluidity :-)

# Define conditions the test must meet
length_condition = (
    (args.length == "any"
     and prob_length in ["special", "long", "vlong"])
Review comment (Contributor):

As per note in Makefile, I think we should consider dropping specials at this point - they're a hack and make it look like we have test coverage in some areas where we actually don't.

Comment on lines 404 to 407
elif set(args.length).difference(["short", "medium", "long", "vlong",
                                  "special"]):
    parser.error("""Specify length as either of short, medium, long, vlong, \
or special.""")
Review comment (Contributor):

Comment as above regarding dropping special tag.

runs-on: ubuntu-latest
needs: build
if: always()
Review comment (Contributor):

I have a half-memory that without this 'if' Actions won't run the testing if any of the builds fail, and my thought in including it had been that testing the distros that did have a clean build was useful even if one or more of the other distro builds failed. The testing phase from the failed build should fail quickly due to the missing docker container.

The other approach which also seems good to me was that if any of the builds fail, you're more concerned about that than subsequent test failures on any build, so you want the whole Actions run to exit quickly and free up workers for following runs fixing the build.

I don't have strong views either way :-)

Comment on lines 187 to 195
exclude:
  - release: focal
    test-xml: tides_in_the_Mediterranean_Sea.xml
  - release: focal
    test-xml: flow_past_sphere_Re1000.xml
  - release: focal
    test-xml: driven_cavity.xml
  - release: focal
    test-xml: backward_facing_step_3d.xml
Review comment (Contributor):

That is so much neater 👍

@Patol75 (Contributor, Author) commented Jun 20, 2021:

@tmbgreaves Thank you very much for your review, and thank you for explaining what special tests actually are - I had no idea. On these particular tests, if they are to be removed, I will be happy to make the changes you highlighted. Thank you as well for pointing out the change of strategy regarding any failure in the building jobs. I feel like it is preferable to entirely fail if any build fails because, in any case, we would have to re-run everything after the build is fixed. But, indeed, to be discussed.

I have cancelled the Actions workflow you restarted as it will not be useful in the current circumstances - sorry about the confusion. To understand why, please have a look at the latest update of the OP in this PR. I have kept updating it after getting results from Actions (potentially not best practice, but I did not want to flood the PR with lots of comments). I would actually be grateful to have your thoughts on what I am proposing there.

There are currently no intentionally failing tests. I have had a play locally with all the various kinds of failures that I could think of, and it seemed to me everything was behaving properly. I could add some failures to illustrate the results, but it would potentially be preferable for a reviewer to do it, as they could come up with failure ideas I may have overlooked. Additionally, some Actions workflows returned failures and thereby provide examples of the current strategy used to deal with them.

Thank you again for your review.

@tmbgreaves (Contributor) commented:

Thanks @Patol75 - now gone back and reread the OP :-)

Comments on thoughts there:

  • I'm not aware of any way to request specific core counts from Actions
  • A general reassignment of test lengths with your definitions sounds great to me and I'd be very much in favour of that change, particularly if it led to being able to run all the suites through your new scheduler. Agreed the lengths are way off at the moment in many cases; that had been on my list to rework once all tests were passing, so I'd be very happy for you to include it in this PR :-)
  • particle_stratified_stable_layer being on the longtest exclusion list was indeed a mistake on my part
  • Thanks for note on saltfinger2d_adaptive - I'll try switching out the gmsh version at some point to see if the error goes away
  • I'm neutral on hirsute - groovy was included to test gcc10, rather than to track the latest release, as we'd had a general policy of only building for LTS releases up to that point. I think hirsute has buggy PETSc for our purposes (@stephankramer reported and had fixed a bug that hit in 14.6(?) and IIRC was set to be fixed in 16), so I'd been somewhat hoping to skip over to a version that had the bug fixed rather than go through the hoops of backporting fixes to the hirsute package. However, at some point in the not too distant future groovy support will end, presumably before 'J' is in early development, so if we want to keep testing gcc10 we might want to go to hirsute, or possibly track pre-release impish?
  • I've not entirely understood what's going on with disk space and artifacts - are they cumulative and collectively downloaded to a worker when a run starts? Suggestions for storage management if so would be gratefully received. I'd been assuming that with a public repo this wasn't something we had to worry about -- apparently I was being naive there!

@Patol75 (Contributor, Author) commented Jun 21, 2021:

Thanks again :)

  • I have looked through all the Actions documentation I could find and came to the same conclusion, which is why I have tried to go with matrix concurrency instead. The caveat is that each job has to go through docker pull, which is problematic if the test itself executes faster than that call. I still do not understand what the behaviour is when one process within a job calls mpirun. For example, in the latest workflow, short tests were running serially through a single call to testharness, and that took about twice as long as when, in the previous implementation, we had THREADS=2 or, in the proposed implementation, -n 8. Potentially, this is 2-way SMT and, as a result, only two tests can be handled at the same time. What I understand is that 20 jobs can run concurrently, there is a limit of 256 jobs per matrix, and one job cannot exceed 6 hours of real runtime, but in terms of actual logical processor cores, this is still confusing to me. PS: confirmed through this doc page - 2-core CPU. I am astonished 48-core jobs even start on such a CPU.
  • Re-assigning the lengths is quite trivial through Python, so I would be happy to do it once we settle on the time boundaries. After that, I guess it is just some moving around between /tests and /longtests, so it would require a PR in longtests as well? About running all suites, would we want to run vlong tests as well? What about those that exceed 6 hours?
  • particle_stratified_stable_layer: it is actually in the right folder, we just need its tag to change.
  • saltfinger2d_adaptive: Are we running on older gmsh versions?
  • I see. What I was thinking is that we could only support two Ubuntu releases. For example, at the moment, support 20.04 and 21.04, then when 22.04 is out, support 20.04 and 22.04, then when 23.04 is out, 22.04 and 23.04, etc... This way, there would be a yearly update. However, it is very true that it is probably not worth the effort if dependencies have bugs in them.
  • There is actually an artifact and log retention policy, with 500 MB available space from my understanding. See here. Our retention period is likely too long: we should decrease it to, say, somewhere between a couple of weeks and a month. I do not think I have administrative rights to do it, though.

Additionally, I have noticed that, following the cancelled Actions workflow, the output archive has been overwritten. Fortunately, I had it downloaded. Please access it here.

@tmbgreaves (Contributor) commented:

  • GMSH version: As far as I know it's pretty recent but I'll check on my next docker base container rebuild
  • VLong tests: these run through the longtests repo as they run on self-hosted hardware (to get round the 6h time limit on hosted hardware), and mostly from security considerations I'd be keen to avoid them being able to be run by potentially arbitrary code on PRs
  • Log retention I've set to 14 days now. Good catch, thank you!
  • I'm also impressed at how much we get away with running on 2-core nodes :-)

Commit: …ag for now; Re-assign length tag for all tests in /tests and /examples, but keep them where they are for now; Change two test-directory names to make them match their associated XML filename; Exclude /longtests from Actions for now until a similar PR is merged there; Fix failure in examples/lock_exchange; Miscellaneous fixes.
@tmbgreaves (Contributor) commented:

Thank you @Patol75 :-)

One more thought on the back of your reassigning lengths - the 'longtests' repo was initially conceived to try and reduce the size of the Fluidity repo a bit by taking very large tests (which generally corresponded to very long tests) into a separate repo. However, there are now a reasonable number of 'long' tests which are easily small enough to be in the main repo.

Rather than moving newly-long tests from main repo to longtests repo, might it be worthwhile leaving them where they are, and potentially even moving some of the smaller longtests back into the main repo?

Thoughts very much welcomed on this :-)

@drhodrid (Contributor) commented Jun 23, 2021 via email.

@tmbgreaves (Contributor) commented:

Agreed, @drhodrid - definitely keen to keep the 'long' tags to ensure short/medium runs are manageable.

Here are the >1MB results from tests/ and a size-sorted list from longtests: https://pastebin.com/gmapxhMX

I'd propose at minimum moving all the <1MB tests from longtests/ to tests/, and possibly renaming the 'longtests' repo to 'extratests' or something like that, to avoid confusion?

@Patol75 (Contributor, Author) commented Jun 24, 2021:

@tmbgreaves @drhodrid Thank you for this additional information, I will keep/put all vshort, short and medium tests in /tests, and do the same for small long tests - 1 MB seems like a reasonable threshold.
On second thoughts, what about putting all tests larger than 500 KB in the to-be-renamed longtests repo? Judging from the output of du -sh * | sort -h in /tests and /longtests, this corresponds to 74 tests, as opposed to 52 tests larger than 1 MB. I agree that would take away some shorter-length tests, but if this additional repository is there to reduce the size of the main repository, well, that would do it. It is worth noting that /tests/data is 34 MB, but I guess we have to live with that one.

Regarding the length re-assignment I did yesterday, it triggered a few interesting events. First, as I did not use lxml, quite a few comments got lost, which likely means I will have to revert the commit and do the re-assignment again, without losing the comments this time.

Then, following the commit, the Short tests jobs would not complete on Actions. The reason is that testharness was hanging in the pattern-matching strategy included in the process_error function. What happened is that diamond_validation reported failures following the length re-assignment. Failure processing in the proposed implementation involves parsing stdout using a regex. However, the stdout yielded by diamond_validation is lengthy - I will include a re-write of the XML for this test to limit the generated output while keeping the relevant information. It turned out that the regex I had designed was poorly equipped to deal with lengthy outputs, which is why testharness was hanging. I have, therefore, updated the regex to make it more efficient. It seems to have worked, as the current Actions workflow completed all three Short tests jobs and reported each time that diamond_validation failed (I will look into what causes the Annotations entries to display only diamond_validation). Looking at the output, validation failures are indeed reported for a number of XML files. However, what leaves me puzzled is that these tests for which XML validation fails actually succeed when executed through testharness. Is that the expected behaviour? Or is there a further problem in diamond_validation apart from the lengthy output?
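As a minimal sketch of how the re-assignment could be redone with lxml so that comments survive a read/write round trip (the element holding the length tag is an assumption and may differ from the actual test XML schema):

from lxml import etree

def set_length(xml_path, new_length):
    # lxml keeps XML comments when parsing and writing back, unlike the
    # standard xml.etree.ElementTree parser, which silently drops them.
    tree = etree.parse(str(xml_path))
    # Assumption: the length is stored in a <length> element; adjust the
    # lookup to whatever the test XML schema actually uses.
    element = tree.find("length")
    if element is not None:
        element.text = new_length
    tree.write(str(xml_path), encoding="utf-8", xml_declaration=True)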


To be more precise about the validation failures, the actual reason for failures is the attribute valid of the first item in the list returned by the read method of the Schema class in libspud/diamond/diamond/schema.py. The attribute is False for all failing XML files.

Commit: …ness based on stderr generated in diamond_validation; Moved tests from longtests inside tests to assess the behaviour on Actions (vlong tests excluded from the workflow).
@Patol75 (Contributor, Author) commented Jun 27, 2021:

I think @drhodrid is potentially right regarding why some tests have a large directory size. Many of the tests in longtests that I moved to tests for a test-run on Actions failed because some of their input files had been cleaned up. Additionally, python_validation reported quite a few failures for some of them too. From my understanding, all of these tests are tests we intend to keep in Fluidity and, therefore, they should be fixed.

To make things easier, I have pushed another branch called testharnessClean where I have only put essential changes made so far. This is much easier to look at than having to go through all the commits of this PR. Again, from my understanding, there are still some steps to undertake:

  • Re-assign test lengths for all tests (examples, tests, longtests) using lxml not to lose any comments.
  • Trim tests that are too large where possible.
  • Move tests between tests and longtests.
  • Fix tests that fail in python_validation.

Depending on how much the tests can be trimmed, the longtests repository might not be needed anymore. If it is still needed, then @tmbgreaves' idea of renaming it makes sense: extratests, largetests, or anything else that is appropriate.

Additionally, there is still the open question: is it expected behaviour that some XML files fail in diamond_validation but execute properly through testharness?


On testharnessClean, I have copied all tests from longtests inside tests and correctly re-assigned all test lengths (at least, this time, I hope so!). This is currently running on Actions. If results are good, then the first bullet point above will be done.

@Patol75 (Contributor, Author) commented Jun 28, 2021:

I have made further progress in the testharnessClean branch. Excluding the now 35 vlong tests, the whole test suite takes about 6 hours to run on Actions. I have improved the way errors generated through Python tests are dealt with in testharness, and it looks robust now. Nonetheless, the short-tests job on bionic yielded an error in testharness, which did not occur in the focal and groovy jobs, and which I did not manage to reproduce locally. I guess I will have to trigger a specific Actions run with some print statements to figure out what is going on there. Other than that, many tests actually fail, as summarised in the Annotations panel of the workflow linked above. I think these are legitimate failures that need to be addressed.

Regarding the bullet points I mentioned in the message above, the length re-assignment is done, but might need further tuning once all tests actually pass. I believe no comments have been lost in the XML files this time, but I have removed the DOCTYPE entries, as they were pointing to non-existing files and were likely unnecessary. Additionally, I have removed some committed MSH files and modified the associated Makefiles to generate the meshes during test execution. The latter action came as a result of copying all longtests directories into tests and committing them, which by default excludes files with a .msh extension. Nonetheless, quite a few tests remain large and need attention. To know which ones, simply check out testharnessClean, cd into tests and execute du -sh * | sort -h.

Finally, I will probably not be able to figure out all the failures by myself, so help would be appreciated.

@Patol75 (Contributor, Author) commented Jun 29, 2021:

An additional note on the current failures reported by Actions: most, if not all, of the tests that fail are in directories for which I removed the committed MSH files and included gmsh instructions in the Makefile. I have played around with mms_tracer_P1dg_cdg_diff_steady, for which the first element of ab_convergence_gal_stat must be between 1.8 and 2.2 for the first Pass test to yield success. When using the already committed MSH files, the value is 1.88. When generating the MSH files from the committed GEO files, the value is always lower than 1.8 (e.g. 1.75 or 1.68, depending on the GMSH version used). Visualising the MSH files, the only obvious difference I could spot is that all the committed files have a smaller number of nodes than the generated ones. To give another example, the committed MSH file of viscosity_2d_p0_adaptive_parallel has a 20x11 regular grid node structure, while generating the MSH file through GMSH yields a 21x11 regular grid node structure. Am I missing something here in how the MSH files should be generated?

@tmbgreaves (Contributor) commented:

Apologies for being silent on this one for so long, @Patol75 - chaos this end on multiple levels has pulled me away from Fluidity work. Currently aiming to refocus on Fluidity mid-September, assuming other work packages go as expected.

@adigitoleo commented:

An additional note on the current failures reported by Actions: most, if not all, the tests that fail are some of the directories for which I removed the commited MSH files and included gmsh instructions in the Makefile. [...]

Small comment on this: I've experienced significantly different mesh outputs for the same instructions across gmsh "minor" version changes, e.g. 4.4 vs 4.8. In other words, gmsh doesn't really use semantic versioning. Maybe the .msh files were created using a slightly older gmsh version? Another reason against committing .msh files.

@stephankramer (Contributor) left a comment:

Some preliminary line-by-line comments - to be continued... Looking really good so far!

xml_entry = TestCase(
    name=xml_file.stem, classname=prob_classname, status="Skipped")
xml_entry.add_skipped_info(message="Test suite conditions unmet.")
xml_parser.test_cases.append(xml_entry)
Review comment (Contributor):

Hm, this is a bit wonky - you're modifying a global variable, which is unhelpfully called xml_parser whereas it seems to be a TestSuite. So test_suite or junit_test_suite would be a better name? And test_case instead of xml_entry? (I realize this was bad in the old testharness already)
But more importantly could you just return a list of skipped test cases? Or add it in the same dictionary but with a flag "skipped": True and create the junit stuff all in one place at the end...

Reply (Contributor, Author):

Agreed with the naming conventions; is it a problem that test_suite is defined in the outermost scope?
Regarding the skipped tests, they should already be included in the XML report. What did you have in mind with that list idea?
The problem with creating all TestCase entries for the JUnit report at the end is that different arguments are involved in the call depending on the test's final status and so multiple calls to TestCase arranged in an if, elif, else block would still be required. I am not sure this is a better way to do it.

Reply (Contributor):

It's just bad code structure. As a general rule you should only use global variables (and even more so modify them) in exceptional cases. Otherwise your code will be very hard to understand/change later on. When I'm reading that routine I have no idea where test_suite is being defined. When I'm actually reading the bottom of the code where test_suite is created and later on the completed test results are appended, I have no way of telling that the in-between call to filter_tests also modifies test_suite. So it becomes very hard to reason about the data flow - and for instance I might be tempted to move the creation of the test_suite to after the call to filter_tests.
So as a minimum you should have test_suite as an argument to filter_tests - to signify that this routine is doing something to that test_suite - and then tell us at the top what it is: namely "This routine filters tests that fulfill our chosen criteria and returns them in a dictionary. Information about skipped tests is stored in the junit TestSuite". Which brings us to the second part of what I suggested (but that's more of a suggestion, and if it doesn't work easily feel free to ignore it) - if you think about your routine in this way, it becomes clear that it is doing multiple things to multiple data structures at the same time. Again as a general rule, you want to break up your code in such a way that each function has a single responsibility. This makes it much easier to modify the code later on. Even when you're not using classes, try to think about your routines in terms of the data structures they operate on, and ideally only modify one structure at a time.
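To illustrate the suggested structure only (the names and the matches predicate are made up here, not the actual testharness code), the routine could receive the TestSuite explicitly and return the filtered tests:

from junit_xml import TestCase

def filter_tests(all_tests, matches, test_suite):
    """Return the subset of all_tests for which matches(test) is True.

    Tests that do not match are recorded as skipped in the junit TestSuite
    passed in explicitly, so the caller can see this routine modifies it.
    """
    selected = {}
    for xml_file, test in all_tests.items():
        if matches(test):
            selected[xml_file] = test
        else:
            test_case = TestCase(name=xml_file.stem, status="Skipped")
            test_case.add_skipped_info(message="Test suite conditions unmet.")
            test_suite.test_cases.append(test_case)
    return selected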

Reply (Contributor, Author):

Given that Python allows it, I did not see an issue in doing it. However, it is true that after searching the Internet, I could not find anyone arguing that way, but I found a few discussions (e.g. here) where people also recommend passing the class instance to the function, so I am now doing it. Looking further through the code, I also treat the dictionary tests as a global in quite a few functions. You would recommend changing that too, right?

Regarding where the JUnit report is created/modified, I think it would be quite annoying to have to store information elsewhere when we can actually do it right away in the JUnit objects as we process the tests. It does not seem straightforward to gather all possible scenarios in one function: process_error looks too complex for that. If you have an idea about how to do it, I am happy to try to implement it, but for now, I have just expanded a few comments and docstrings as suggested.

@Patol75 (Contributor, Author) commented Jun 15, 2022:

Thank you for looking into these proposed changes to the testing script, @stephankramer.

Unfortunately, the changes here are rather messy, and I had kept this PR open mainly for discussion. I had created another branch that would likely be a better candidate for a merge, even though it is now quite outdated. The main problem left in that other branch was deciding what to do with the tests that have their MSH file(s) committed; most of them were hosted in the longtests repository. I guess there are three options: (i) keep the committed MSH files and potentially rename longtests to largetests, given that it would host all these tests that result in a large directory size; (ii) generate GEO files that can reproduce the meshes described by the MSH files; (iii) drop tests that have MSH files committed. A combination of (i) and (ii) is probably what we should aim for, but happy to have your thoughts about it.

Regarding testharness.py, given that it is basically the same file in both branches, I will address your reviewing comments and include some changes I made for the CMake branch here. It will be easy to move it around to any branch.

@stephankramer (Contributor) commented:

Yeah, I agree that we should probably go for a combination of (i) and (ii) (and possibly (iii) :-)).
This also ties into the fact that we need to upgrade to some newer version of gmsh (the current base images in which the tests are run all have gmsh 2.6 hard-coded). This is an old discussion and comes back to the question of how exactly a test should be reproducible/deterministic. On the one hand, you would like to see any unintended consequence of a commit, so by that school of thought the test should fail even if an answer changes just a little bit. This can only be achieved by having exactly the same mesh as input (i.e. committing them), as the meshing process in general is not reproducible. But then we also want to test the meshing step as part of the CI (in an integrated sense!), so there you would have tests fail only if the answer changes significantly. Trouble is, these tests can be hard to write. So as a sort of compromise we have tended to go for a mix of committed and non-committed meshes. But yes, the size of these meshes is another practical concern.
I think what you and @tmbgreaves suggested - to integrate most of the long tests back into the main repo, so that they can be scheduled in the most efficient way, and only keep some exceptional ones that are either very long or have large meshes; we could also run these less frequently (as we currently do with vlong?) and/or drop some of them - sounds like a good plan. One thing to keep in mind: if you do (accidentally) end up putting some very large files back in, even on a branch, and want to undo that later, make sure to rewrite history, otherwise everyone will still be downloading them when they clone.

@stephankramer (Contributor) left a comment:

This is great stuff. I'm not normally a fan of rewriting things from scratch, but sometimes it works out quite nicely.
One missing feature is the --just-test option, which I do use frequently: if you write a test that takes a while to run, you don't want to be rerunning it every time you're tweaking some of the Python tests.
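As a hypothetical sketch of how that option could be wired back in with argparse (the flag name follows the option mentioned above; nothing here is the existing implementation):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--just-test", action="store_true",
    help="Do not re-run the test commands; only re-evaluate the Python "
         "variable/pass/warn checks against output that already exists.")
args = parser.parse_args([])
# The main loop could then branch on args.just_test, e.g.:
#   if not args.just_test:
#       run_test_commands(...)      # hypothetical
#   evaluate_python_checks(...)     # hypothetical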

@Patol75 (Contributor, Author) commented Jun 21, 2022:

Thanks again for all the reviews; the code looks better now. Happy to get more feedback if you have time.

Regarding the version of GMSH used in the CI, I have actually used the latest GMSH in the CMake branch for all tests, and it is true that a large number of tests reported failures. Some of them behaved as expected when -format msh2 was additionally supplied to the GMSH command line, but only a few. At that point, from an outsider's perspective, it is hard to judge if the Python variable values obtained through the test are still appropriate, in which case the thresholds can be updated, or if they are not. In the latter situation, once again, it is not straightforward to determine if the test was not robust enough against mesh changes or if something failed silently in the building/execution process.

Ideally, I think we would want to have only GEO files and accept that answers can change, whilst also having tests sufficiently robust so that answers would not change too much. The problem here is that it could mean quite a lot of work for these tests that currently fail with the newer versions of GMSH. I am saying could because a lot of those tests look very similar to me; there could be a common strategy to make them all more robust.

My experience is that only very few tests currently lack a GEO file, and an additional very few fail to run on Actions because of resource limits (time and/or number of CPU cores). These would indeed be good candidates for the alternative repository, although some committed MSH files are not that big. Moreover, other tests have committed files other than MSH that are quite big, so they could also be legitimate candidates.

Finally, regarding your comment about pushing large files, I think I have three branches where it most likely happened: testharness, testharnessClean and cmake_build_2022. Would deleting the branches be sufficient when changes are incorporated into main (or discarded)?
