testing protocol #16
Note that the text output file has a time stamp, so md5 checks are problematic.
I think we would need a more advanced testing mechanism, which checks all the functions separately, the sets/gets, on different platforms (x64, x86, etc.). Something we should not overlook is the connection with EPANET-MSX, making sure that the latest EPANET version is compatible. Together with Marios Kyriakou (@Mariosmsk) we have developed such a code in Matlab which uses the DLL. You can run all the functions with it, and it also reads the binary output file and can parse the INP file. I think it would be easy to port this to Python at some point if needed. In any case, we'll try to improve it a bit in order to release it on GitHub, and we can improve it further as needed.
I think I've seen automated Python frameworks that do testing for you -- I wonder if we can integrate with something out-of-the-box. In general, I see the need for two levels of testing: unit tests and regression tests. The regression tests would arguably be simpler, and would just track any significant changes to hydraulic and quality solutions between software versions. This could be handled by your (@eldemet) hydraulic binary file parser, or by running the analysis through toolkit functions and tracking/comparing results in memory. Unit testing will require some more internal tweaks to accommodate the granularity one would expect. This should definitely be on the roadmap, but maybe not as high-urgency.
With regard to unit testing, it might be safer to do this from C (list of C testing frameworks) and test Python/Matlab bindings separately. Also, a nice goal would be automated testing using a continuous integration service (e.g. Travis), with a test suite status widget displayed on GH (by including it in
@willfurnass Excellent. Do you have any experience with Travis CI?
@samhatchett Encountered it for the first time recently. Not experienced, but happy to try setting it up once we've got a basic test suite.
Has there been any action on this? I have some code that provides a convenient API for reading the binary output file that may come in handy for automated testing.
Nope - not yet, to my knowledge. Can you make your code available here? As an initial "straw man" test, what if we just had a static binary output file generated from 2.0.12 ("known good output"), and then as the one and only test confirmed that the binary output from the dev version matches this standard? This would just be the first step in getting better testing. That's the way the kilogram used to work - there was a canonical kilogram that you would go to and weigh your "new kilogram" against, to make sure that what you have is still a kilogram.
@samhatchett how do you determine that the two output files "match"? First, output files would always have the same size even if the numbers are totally different. Second, I think looking at hashes also won't work, since we already saw some small differences between the versions (after 4 decimal places). I think we should compare the results "number by number". @eldemet what kind of comparison are you doing with your Matlab tool?
Hi @eladsal, we are doing a full comparison of all the elements and functions, comparing two versions of EPANET (e.g. 2.00.12 with any other). We can also read from the binaries, but we also check the run-time results. We plan to release an open-source tool for this on GitHub. We also examine multiple EPANET benchmarks, with different quality, rules, controls, etc. So at the moment it just returns a flag if the output is different, by doing a function-by-function comparison. Also note that in some cases results will be different. For instance, corrections in the tank diameters may affect quality, which should be acceptable, but we need to keep this in mind. We can compile the code as an executable (so that Matlab is not a requirement), or we can do this in Python (however, we do have this running in Matlab, so it's easier :-) ). I'll discuss this with Marios (@Mariosmsk) to see when we can have this software released. @eladsal, @samhatchett, if we release this tool, what would be a convenient use case? What would be a desired output? A report with the differences?
I was suggesting using Michael's binary parser to compare.
I haven’t looked at Michael’s binary format API but, for example, if it allows access to the node- and link-element value arrays at particular time steps, then I’d say we should do the following for deciding that two files are “acceptably similar”, where we’d specify the value of minNumberCorrectDigits to be, say, 5.
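The formula itself did not survive in the thread, but the check being described could be sketched in Python like this. The helper names and the use of -log10 of the absolute difference as the digit count are my assumptions for illustration; only minNumberCorrectDigits comes from the comment above.

```python
import math

def num_correct_digits(x_ref, x_test):
    """Approximate number of agreeing decimal digits between two values.

    Hypothetical helper: identical values agree to infinitely many
    digits; otherwise use -log10 of the absolute difference.
    """
    diff = abs(x_ref - x_test)
    if diff == 0.0:
        return float("inf")
    return -math.log10(diff)

def acceptably_similar(ref_values, test_values, min_number_correct_digits=5):
    """True if every paired value agrees to at least the requested digits."""
    return all(
        num_correct_digits(a, b) >= min_number_correct_digits
        for a, b in zip(ref_values, test_values)
    )
```

This would be applied per element type (node or link) at each reporting time step, flagging the file pair as different on the first failing value.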
Nice discussion. Lots of great ideas. @samhatchett by all means, I am happy to share my code. Right now it is written in C and compiles as a DLL. It has about a dozen functions and offers convenient methods for getting different "slices" of data out of the binary file. My thought is that it can remain a DLL, or we could eventually expand the toolkit functionality in this direction. Where do you think it belongs in the source tree?
Great, how about in the 'tools' directory in dev-2.1 - maybe a subdirectory in there.
It pains me to say it, but I think this is an important part of the 2.1 release -- what really is needed here? Assuming Michael's code works, we could probably very quickly set up a Travis CI job (like that suggested by @willfurnass) to automatically test the output of dev versions against a known-good results file (or a set of them). This would have the advantage of generating neat graphics to announce that the code passes all tests. Does that sound like overkill?
@samhatchett from my perspective, this has always been on the critical path to completing 2.1. I figured regression testing could be done by wrapping the output API in Python and using a framework to drive a battery of tests. I am interested in trying Travis CI. One question I have is how it handles testing on different platforms. It looks like it only allows testing on OS X and Linux as a beta feature. Since the primary platform for EPANET has historically been Windows, it makes sense to have the ability to test on Windows. When you ported EPANET to OS X, how extensively did you benchmark the code to determine what effect, if any, platform changes have on the computational results?
That’s a good question about Travis CI. Let’s face it: we don’t, to my knowledge, have a set of official correct answers for a valid set of test problems. And I’ve never heard of anyone talking about differences on Linux/OS X compared to Windows.

I’m assuming that we would need to start by identifying the network input files that the devcom and community think represent a relatively complete set of simulation options. I’ll also assume that the “correct” answers are represented by binary output files from running the published pre-2.1 EPA version, on Windows. These correct answers would need to be stored in the repo somewhere, and be the gold standard. The answer to Michael’s question will then be derived, for the first time, from a comparison of results produced by running the tests on whatever platform they’re running on.

I also don’t think it’s a bad thing to just use a simple Python application to process binary files against toolkit simulation results. That would certainly be simple. But I don’t know Travis CI. Maybe it brings a lot of other benefits, although I admit I’d be a little concerned about the “CI” part, mainly because I’m suspicious about imposing that overhead. But maybe it is really transparent.
My understanding is that Travis can be pretty easy, and in fact can do exactly what you describe. It just runs a configurable set of scripts whenever there's a push to the repo, and gives you some feedback on whether you broke something or not.
I am genuinely interested in trying Travis CI. I think there is a way we can have our cake and eat it too, if we write the testing code compatible with Travis CI and so that it can be run manually as well.
I have a semi-automated code that basically compares two binary output files and gives an output such as this:
I feel pretty strongly that all comparison results should be in terms of correct decimal digits. These will be scale/unit independent.
@jamesuber here it is using min( -log10( abs( X1 - X2 ) ) ):
@eladsal so what is your comparison code written in? What is most valuable is that you have worked through the algorithm for doing the comparison.
The code is in VB6 and runs on my Windows machine. If there is any use for it I can polish it a bit and make it public.
Initial commit of VB6 code to perform comparison of binary output files: ec04b7f. Would anyone want the compiled EXE (don't think it should go into GH)?
@samhatchett @michaeltryby The Python wrapper currently lives in pull request #52.
good discussion over on #54 - so related directly to the prototype testing protocol, once we have the script that actually does the comparison, we will need:
Do any volunteers wish to think about these issues and throw out a plan?
So what we are talking about here is network-based regression testing. All we need is a few networks to get the ball rolling, and then to design the tests in such a way that it's super easy to add new networks over time. It would be super convenient if, for example, all we had to do to include a new test was add a folder containing an input file, a binary file, and some metadata to a designated test directory, as opposed to having to modify code. Make sense? Lew accumulated a large collection of networks while developing EPANET. I'm sure they could be useful. One issue is the need to anonymize them before they could be publicly distributed as part of a test suite.
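A minimal sketch of that add-a-folder convention in Python. The specific file names (net.inp, expected.out, meta.json) are illustrative assumptions, not an agreed layout; the point is that discovering a new test requires no code change.

```python
import json
from pathlib import Path

def discover_tests(test_root):
    """Yield (inp_file, reference_out_file, metadata) per test folder.

    Assumed (hypothetical) layout: each subdirectory of test_root holds
    net.inp, expected.out, and an optional meta.json. Non-directories
    and incomplete folders are skipped.
    """
    for folder in sorted(Path(test_root).iterdir()):
        if not folder.is_dir():
            continue
        inp = folder / "net.inp"
        ref = folder / "expected.out"
        meta_file = folder / "meta.json"
        if inp.exists() and ref.exists():
            meta = json.loads(meta_file.read_text()) if meta_file.exists() else {}
            yield inp, ref, meta
```

A test runner would then loop over discover_tests(), run the command-line epanet on each input file, and compare the fresh binary output against the stored reference.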
✋ I will volunteer to set up the shell script for iterating/recursing through directories in order to pass paths into the testing tool.
So this turned out to be pretty cool. PR #58 will add a .sh script which recurses through a test-network directory (mocked up in the directory tree), runs the command-line epanet, and calls into a Python script to compare binary files. The actual epanet-running and comparison are all fake, but this is the structure. Travis output is here: https://travis-ci.org/OpenWaterAnalytics/EPANET/builds/97126305
I agree! This is pretty cool. 😄
"i have no brains" - code is poetry @sam cool 👍
@michaeltryby it would be great to use the networks in Lew's collection. Maybe we can start by looking at networks in the public domain, like the ones used in the battles and some other benchmark networks. These are well known and many people have worked with them. I'll try to make up a list and we can see if that is sufficient to get us started.
@michaeltryby and @jamesuber, related to the discussion about true/false testing (and others as well): ... what should the tolerances be? The "real" binary output file was generated on my Mac under EPANET 2.1, uploaded to GitHub, and then used in testing to compare against what should be the same binary file (which was generated by Travis after building the same version of EPANET). Interesting.
Yeah, so we are comparing Mac to Linux for the first time and right out of the gate there are differences. Platform fuzz. We could modify the script to accumulate statistics on the differences and then apply some set of criteria to determine whether they are acceptable. What statistics should we accumulate while we process the file?
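A few candidate statistics, sketched in Python. The choice of statistics and the function name are my suggestions, not anything the thread settled on: largest absolute difference and where it occurred, count of non-identical values, and the worst-case number of agreeing decimal digits.

```python
import math

def diff_statistics(ref_values, test_values):
    """Accumulate simple statistics over paired floating-point results.

    Returns the largest absolute difference, the index where it
    occurred, how many paired values differ at all, and the worst-case
    number of agreeing decimal digits (-log10 of the max difference).
    """
    max_abs_diff = 0.0
    max_index = -1
    mismatches = 0
    for i, (r, t) in enumerate(zip(ref_values, test_values)):
        d = abs(r - t)
        if d > 0.0:
            mismatches += 1
        if d > max_abs_diff:
            max_abs_diff = d
            max_index = i
    min_digits = float("inf") if max_abs_diff == 0.0 else -math.log10(max_abs_diff)
    return {
        "max_abs_diff": max_abs_diff,
        "max_index": max_index,
        "mismatches": mismatches,
        "min_correct_digits": min_digits,
    }
```

Reporting the index of the worst value is the useful part: it lets you trace a platform difference back to a specific node or link rather than just a pass/fail flag.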
I have compared @samhatchett's output (Mac) to my Windows DLL (both 2.1) and got no difference at all.
@eladsal That's really interesting and useful information. I'm pleasantly surprised there is no difference between Mac and Windows. Has anyone been working on the water quality routines? A difference of 0.25 seems significant ...
I think there was a change in water quality - stepwise calculations. However, I see no differences in node quality results. I think I found the problem; it's the place
Until we get more experience with what is "unexpected," I'd suggest that the comparison and test outputs be built with compiler optimizations completely disabled. It just removes another variable. Plus, if we find that differences are dependent on compiler optimization, or even purely on platform even when optimizations are enabled, then we should question the stability of our algorithms. That might mean we should look into adhering to certain coding practices, and examine more carefully the entire suite of compiler warnings (e.g. by turning on ANSI adherence, paying close attention to implicit type conversions, ...).

I still do think that basing the "true/false" designation on the number of correct digits is the way to go. I do not think that efficiency should be a consideration at this point, and I don't see how we have a chance of defining "the tolerance" if we are using input files with potentially different units, different water quality boundary conditions, etc. It's a hard enough problem without introducing those factors.

Sadly, people who have thought about this a lot more than me have pointed out that even calculating the correct number of digits needs to be done carefully. See section 3 of this frequently cited paper by McCullough: http://www.pages.drexel.edu/~bdm25/tas1.pdf. He points out that if the correct and test values are significantly different (by more than a factor of 2), then the formula I cited above won't give the number of correct digits. It's not hard, but you have to account for that. And I noted that McCullough recommends normalizing the absolute error by the test value (as long as the test value is not too close to zero - ugh), in which case you are measuring the number of correct significant digits.

Until this gets worked out, it seems that the process might also benefit from simplification of the test problems. And it's probably healthy anyway to lean towards a hierarchy of test problems ranging in level of "difficulty" (e.g. "Oh my God it failed that!" or "Oh yeah, that's really a touchy one"). So for example, the initial tests maybe should be a variety of steady-state hydraulic simulations of different complexity, in terms of size, pumps/valves, controls/rules. Admittedly this tests only a portion of the code, but that would be the point. From there it makes sense to move into EPS hydraulic tests, and finally into water quality.
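Those two corrections (grossly different values, and normalization away from zero) could be folded into the digit count like this. This is only a sketch: the near-zero cutoff value and the choice to normalize by the reference rather than the test value are my assumptions, not McCullough's exact prescription.

```python
import math

def log_relative_error(c, x):
    """Approximate correct significant digits between reference c and test x.

    Sketch of a log-relative-error measure: normalize the absolute
    error by the reference value unless it is near zero (cutoff is an
    arbitrary assumption), and report zero correct digits when the
    values disagree grossly instead of letting -log10 go negative.
    """
    if c == x:
        return float("inf")
    if abs(c) > 1e-12:
        rel = abs(x - c) / abs(c)   # relative error -> significant digits
    else:
        rel = abs(x - c)            # fall back to absolute error near zero
    if rel >= 1.0:                  # grossly different: no correct digits
        return 0.0
    return -math.log10(rel)
```

Clamping at zero handles the factor-of-2 caveat: a test value several times the reference simply scores zero correct digits rather than producing a misleading negative count.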
Testing fit for purpose. @jamesuber I agree with your suggestions, but I don't believe that Travis-CI is the right place to exercise them. The CI regression test is analogous to a build test, in that it keeps you from checking in code with obvious defects. It's a safety net that helps catch stupid mistakes. I wouldn't want to use it to port to a different compiler/platform. Think of how many unknowns we could eliminate by running more rigorous tests locally as opposed to in the cloud. Travis-CI runs on a Linux server somewhere. We are getting a glimpse that, as expected, there are cross-platform issues we need to consider. Does that mean we need to commit to a porting exercise at this juncture? I suggest we use Travis-CI to help narrow in on the robust Windows benchmark that we have, and tackle porting issues in a later release.
Well, that’s certainly all OK with me - heck, I’m just an observer at present anyway. It just seemed to me that the entire thread was not about this at all, but rather focused on, as @samhatchett put it, "what should the tolerances be?" I mean, really, if we’re obviously going to be doing something different for the “real” testing, then the Travis-CI stuff seems a little like overkill to me, at present. But I guess I’m really missing the point of the discussion.
Due to changes in the order in which initsegs was called, the link's average quality value for the initial time step was wrong. This was found in issue #16.
@jamesuber clearly, when I think testing and you think testing, we are thinking two different things. What I've got in mind right now is part of what the dev team at Haestad Methods was doing. Sounds to me like what you have in mind is the other part of what they were doing. I don't think of one as "real" and the other as "overkill." I think EPANET can benefit from both, and it's simply a matter of which one makes sense to do first using Travis-CI.
This bug, fixed in 1c06908, happened due to changes in the order
I think we're in pretty good shape here. A few rough edges, but it's serviceable. Closing so that the 2.1 milestone can be met.
Need one. Should do a comparison of hydraulic and quality results with the canonical version. Probably need a binary parser to do this.