
testing protocol #16

Closed
samhatchett opened this issue Jun 12, 2015 · 58 comments

@samhatchett
Member

need one. should compare hydraulic and quality results against the canonical version. probably need a binary output parser to do this.

@samhatchett samhatchett added this to the v2.0.13 milestone Jun 12, 2015
@eladsal
Member

eladsal commented Jun 12, 2015

note that the text output file has a timestamp, so MD5 checks are problematic.

@eldemet
Contributor

eldemet commented Jun 13, 2015

I think we would need a more advanced testing mechanism, one which checks all the functions separately (the sets/gets) on different platforms (x64, x86, etc.). Also, something we should not overlook is compatibility with EPANET-MSX, making sure that the latest EPANET version remains compatible.

Together with Marios Kyriakou (@Mariosmsk) we have developed such a code in Matlab which uses the DLL. You can run all the functions with it, and it also reads the binary output file and can parse the INP file. I think it would be easy to port this to Python at some point if needed. In any case, we'll try to improve it a bit in order to release it on GitHub, and we can improve it further as needed.

@samhatchett
Member Author

i think i've seen automated python frameworks that do testing for you -- i wonder if we can integrate with something out-of-the-box.

in general, i see the need for two levels of testing: unit tests and regression tests. the regression tests would arguably be simpler, and would just track any significant changes to hydraulic and quality solutions between software versions - this could be handled by your ( @eldemet ) hydraulic binary file parser, or by running the analysis through toolkit functions and tracking/comparing results in memory.

unit testing will require some more internal tweaks to accommodate the granularity one would expect. this should definitely be on the roadmap, but maybe not as high-urgency.
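To make the "comparing results in memory" idea concrete, here is a minimal Python sketch that drives a hydraulic run through the toolkit with ctypes and collects node pressures for comparison. The library path, parameter codes, and calling convention are assumptions based on the published EPANET 2 toolkit documentation, not on anything specific in this repo.

```python
# Sketch only: run a full hydraulic simulation through the toolkit and collect
# node pressures in memory, so two builds can be compared without touching the
# binary output file. Library name, parameter codes, and calling convention
# are assumptions from the published EPANET 2 toolkit docs (on Windows the
# DLL may need ctypes.WinDLL instead of CDLL).
import ctypes

EN_NODECOUNT = 0   # ENgetcount code: number of nodes
EN_PRESSURE = 11   # ENgetnodevalue code: node pressure

def run_and_collect(libpath, inp_file):
    en = ctypes.CDLL(libpath)
    err = en.ENopen(inp_file.encode(), b"report.rpt", b"")
    if err > 0:
        raise RuntimeError("ENopen failed with code %d" % err)
    en.ENsolveH()                     # complete hydraulic run
    count = ctypes.c_int()
    en.ENgetcount(EN_NODECOUNT, ctypes.byref(count))
    value = ctypes.c_float()
    pressures = []
    for index in range(1, count.value + 1):
        en.ENgetnodevalue(index, EN_PRESSURE, ctypes.byref(value))
        pressures.append(value.value)
    en.ENclose()
    return pressures
```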

@samhatchett samhatchett modified the milestones: v2.0.13, v2.1 Sep 14, 2015
@eldemet eldemet self-assigned this Sep 14, 2015
@willfurnass

With regards to unit testing, it might be safer to do this from C (list of C testing frameworks) and test the Python/Matlab bindings separately.

Also, a nice goal would be automated testing using a continuous integration service (e.g. Travis CI), with a test suite status widget displayed on GH (by including it in README.md).

@samhatchett
Member Author

@willfurnass excellent. do you have any experience with Travis CI?

@willfurnass

@samhatchett Encountered it for the first time recently. Not experienced but happy to try setting it up once we've got a basic test suite.

@michaeltryby

Has there been any action on this? I have some code that provides a convenient API for reading the binary output file that may come in handy for automated testing.

@samhatchett
Member Author

nope - not yet to my knowledge. can you make your code available here?

As an initial "straw man" test, what if we just had a static binary output file generated from 2.0.12 ("known good output"), and then, as the one and only test, confirmed that the binary output from the dev version matches this standard? This would just be the first step toward better testing.

That's the way the kilogram used to work - there was a canonical kilogram that you would weigh your "new kilogram" against to make sure that what you have is still a kilogram.
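A hedged sketch of that straw man in Python follows; the executable name, argument order, and file names are placeholders, not the actual repo layout.

```python
# Straw-man regression test: run the dev build on a network and require its
# binary output to match a stored "known good" file produced by 2.0.12,
# byte for byte. The executable name and argument order are assumptions.
import filecmp
import subprocess

def straw_man_test(inp="Net1.inp", reference="known_good/Net1.out"):
    out = "Net1.out"
    subprocess.run(["./epanet2", inp, "Net1.rpt", out], check=True)
    return filecmp.cmp(out, reference, shallow=False)

if __name__ == "__main__":
    print("PASS" if straw_man_test() else "FAIL")
```

As the discussion below points out, even tiny numeric differences between versions break a byte-for-byte check, so this would only be a first step.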

@eladsal
Member

eladsal commented Dec 2, 2015

@samhatchett how do you determine whether the two output files "match"?

First, the output files will always have the same size even if the numbers are totally different. Second, I think comparing hashes won't work either, since we have already seen some small differences between the versions (after 4 decimal places).

I think we should compare the results "number by number". @eldemet what kind of comparison are you doing with your Matlab tool?

@eldemet
Contributor

eldemet commented Dec 2, 2015

Hi @eladsal, we are doing a full comparison of all the elements and functions, comparing two versions of EPANET (e.g. 2.00.12 with any other). We can also read from the binaries, but we also check the run-time results. We plan to release an open-source tool for this on GitHub. We also examine multiple EPANET benchmarks, with different quality, rules, controls etc.

So at the moment it just returns a flag if the output is different, by doing a function-by-function comparison.

Also note that, in some cases, results will be different. For instance, corrections in the tank diameters may affect quality, which should be acceptable, but we need to keep this in mind.

We can compile the code as an executable (so that Matlab is not a requirement), or we can do this in Python (however, we do have this running in Matlab, so it's easier :-) ).

I'll discuss this with Marios (@Mariosmsk) to see when we can have this software released.

@eladsal, @samhatchett, if we release this tool, what would be a convenient use case? What would be a desired output? A report with the differences?

@samhatchett
Member Author

I was suggesting using Michael's binary parser to compare.


@jamesuber
Member

I haven't looked at Michael's binary format API but, for example, if it allows access to the node- and link-element value arrays at particular time steps, then I'd say we should do the following to declare two files "acceptably similar":

  • check that the times are identical in the two candidates, for every step
  • if X1 and X2 are the two solution arrays, compute min( -log10( abs( X1 - X2 ) ) ) and require it to be >= minNumberCorrectDigits

Where we'd specify the value of minNumberCorrectDigits to be, say, 5?
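A minimal sketch of that check, assuming the two solutions have already been read into NumPy arrays (the binary parser is not shown); exact matches are capped at 100 "digits" as an arbitrary sentinel so -log10(0) does not blow up.

```python
# Minimal sketch of the check above: X1 and X2 are NumPy arrays holding the
# two solutions for one time step (reading them out of the binary files is
# not shown).
import numpy as np

def min_correct_digits(x1, x2):
    diff = np.abs(x1 - x2)
    with np.errstate(divide="ignore"):
        digits = -np.log10(diff)
    # exact matches would give inf; cap them at 100 as a sentinel
    return float(np.min(np.where(diff == 0.0, 100.0, digits)))

def acceptably_similar(times1, times2, x1, x2, min_number_correct_digits=5):
    # 1. reporting times must be identical at every step
    if not np.array_equal(times1, times2):
        return False
    # 2. every value must agree to at least min_number_correct_digits digits
    return min_correct_digits(x1, x2) >= min_number_correct_digits
```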


@michaeltryby

Nice discussion. Lots of great ideas. @samhatchett by all means I am happy to share my code. Right now it is written in C and compiles as a DLL. It has about a dozen functions and offers convenient methods for getting different "slices" of data out of the binary file. My thought is that it can remain a DLL, or we could eventually expand the toolkit functionality in this direction. Where do you think it belongs in the source tree?

@samhatchett
Member Author

Great, how about in the 'tools' directory in dev-2.1 - maybe a subdirectory in there.

@samhatchett
Member Author

it pains me to say it, but I think this is an important part of the 2.1 release -- what really is needed here? assuming Michael's code works, we could probably very quickly set up a Travis CI job (as suggested by @willfurnass) to automatically test the output of dev versions against a known-good results file (or set of them). This would have the advantage of generating neat graphics to announce that the code passes all tests. does that sound like overkill?

@michaeltryby

@samhatchett from my perspective, this has always been on the critical path to completing 2.1. I figured regression testing could be done by wrapping the output API in Python and using a framework to drive a battery of tests.

I am interested in trying Travis CI. One question I have is how it handles testing on different platforms. It looks like it only allows testing on OS X and Linux, as a beta feature. Since the primary platform for EPANET has historically been Windows, it makes sense to have the ability to test on Windows.

When you ported EPANET to OS X, how extensively did you benchmark the code to determine what effect, if any, platform changes have on the computational results?

@jamesuber
Member

That's a good question about Travis CI. Let's face it, we don't, to my knowledge, have a set of official correct answers for a valid set of test problems. And so I've never heard of anyone talking about differences on Linux/OS X compared to Windows.

I'm assuming that we would need to start by identifying the network input files that the devcom and community think represent a relatively complete set of simulation options. I'll also assume that the "correct" answers are represented by binary output files from running the published pre-2.1 EPA version, on Windows. These correct answers would need to be stored in the repo somewhere and be the gold standard. The answer to Michael's question will then be derived, for the first time, from a comparison of results produced by running the tests on whatever platform they're running on.

I also don't think it's a bad thing to just use a simple Python application to process binary files against toolkit simulation results. That would certainly be simple. But I don't know Travis CI. Maybe it brings a lot of other benefits, although I admit I'd be a little concerned about the "CI" part, mainly because I'm suspicious about imposing that overhead. But maybe it is really transparent.


@samhatchett
Member Author

my understanding is that Travis can be pretty easy, and in fact can do exactly what you describe. It just runs a configurable set of scripts whenever there's a push to the repo, and gives you some feedback on whether you broke something or not.

@michaeltryby

I am genuinely interested in trying Travis CI. I think there is a way we can have our cake and eat it too, if we write the testing code so that it is compatible with Travis CI and can also be run manually.

@eladsal
Member

eladsal commented Dec 4, 2015

I have semi-automated code that basically compares two binary output files and gives output such as this:
Links Flow max diff = 0.003601074000000000
Links Headloss max diff = 0.001220703000000000
Links Quality max diff = 0.000000000000000000
Links Velocity max diff = 0.000010627320000000
Nodes Demand max diff = 0.003601074000000000
Nodes Head max diff = 0.000061035160000000
Nodes Pressure max diff = 0.000030517580000000
Nodes Quality max diff = 0.000000000000000000

@jamesuber
Member

I feel pretty strongly that all comparison results should be in terms of correct decimal digits. These will be scale/unit independent.


@eladsal
Member

eladsal commented Dec 4, 2015

@jamesuber here it is using min( -log10( abs( X1 - X2 ) ) ):
(a value of two means the difference first appears around the third decimal place, and 100 means no difference at all)
Links Flow max diff = 2
Links Headloss max diff = 3
Links Quality max diff = 100
Links Velocity max diff = 5
Nodes Demand max diff = 2
Nodes Head max diff = 4
Nodes Pressure max diff = 5
Nodes Quality max diff = 100

@michaeltryby

@eladsal so what is your comparison code written in? What is most valuable is that you have worked through the algorithm of doing the comparison.

@eladsal
Member

eladsal commented Dec 4, 2015

The code is in VB6 and runs on my Windows machine. If there is any use for it I can polish it a bit and make it public.

@eladsal
Member

eladsal commented Dec 10, 2015

Initial commit of VB6 code to perform comparison of binary output files ec04b7f.

Would anyone want the compiled EXE (I don't think it should go into GH)?

@bemcdonnell
Member

@samhatchett @michaeltryby The python wrapper currently lives in pull request #52.

@samhatchett
Member Author

good discussion over on #54 - so, related directly to the prototype testing protocol, once we have the script that actually does the comparison, we will need:

  1. another .sh script to iterate through several networks and create bin files (see the sketch after this list)
  2. a set of network .inp files from which to generate results
  3. a set of "official" bin files for comparison

do any volunteers wish to think about these issues and throw out a plan?
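For what it's worth, here is a rough Python equivalent of item 1 (a shell version would look much the same); every path, executable name, and the compare_binaries.py entry point are placeholders.

```python
# Rough Python equivalent of item 1: walk a directory of test networks, run
# the command-line engine on each, and hand the new binary file plus its
# "official" counterpart to a comparison script.
import pathlib
import subprocess

NETWORKS_DIR = pathlib.Path("tests/networks")
REFERENCE_DIR = pathlib.Path("tests/reference")

def run_all():
    failures = []
    for inp in sorted(NETWORKS_DIR.glob("*.inp")):
        out = inp.with_suffix(".out")
        rpt = inp.with_suffix(".rpt")
        subprocess.run(["./epanet2", str(inp), str(rpt), str(out)], check=True)
        ref = REFERENCE_DIR / out.name
        result = subprocess.run(["python", "compare_binaries.py", str(out), str(ref)])
        if result.returncode != 0:
            failures.append(inp.name)
    return failures

if __name__ == "__main__":
    raise SystemExit(1 if run_all() else 0)
```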

@michaeltryby

So what we are talking about here is network-based regression testing. All we need is a few networks to get the ball rolling, and then to design the tests in such a way that it's super easy to add new networks over time.

It would be super convenient if, for example, all we had to do to include a new test was add a folder containing an input file, a binary file, and some metadata to a designated test directory, as opposed to having to modify code. Make sense?

Lew accumulated a large collection of networks while developing EPANET. I'm sure they could be useful. One issue is the need to anonymize them before they could be publicly distributed as part of a test suite.

@jamesuber
Member

> One issue is the need to anonymize them before they could be publicly distributed as part of a test suite.

https://github.com/OpenWaterAnalytics/epanet-tools?files=1

@samhatchett
Member Author

✋ i will volunteer to set up the shell script for iterating / recursing through directories in order to pass paths into the testing tool

@samhatchett
Member Author

so this turned out to be pretty cool. PR #58 will add a .sh script which recurses through a test-network directory (mocked up in the directory tree), runs the command-line epanet, and calls into a python script to compare binary files. the actual epanet-running and comparison are all fake, but this is the structure.

Travis output is here: https://travis-ci.org/OpenWaterAnalytics/EPANET/builds/97126305

@michaeltryby

I agree! This is pretty cool. 😄

@eladsal
Member

eladsal commented Dec 16, 2015

"i have no brains" - code is poetry

@sam cool 👍

@eladsal
Member

eladsal commented Dec 16, 2015

@michaeltryby it's great to use the networks in Lew's collection. Maybe we can start by looking at networks in the public domain, like the ones used in the battles and some other benchmark networks. These are well known and many people have worked with them. I'll try to make up a list and we can see if that is sufficient to get us started.

@samhatchett
Member Author

@michaeltryby and @jamesuber (and others as well) - related to the discussion about true/false testing:
take a look at the results here: https://travis-ci.org/OpenWaterAnalytics/EPANET/builds/97515364

... what should the tolerances be? the "real" binary output file was generated on my mac under epanet 2.1, uploaded to github, and then used in testing to compare against what should be the same binary file (which was generated by Travis after building the same version of epanet). interesting.

@michaeltryby

Yeah, so we are comparing Mac to Linux for the first time and right out of the gate there are differences. Platform fuzz.

We could modify the script to accumulate statistics on the differences and then apply some set of criteria to determine whether they are acceptable.

What statistics should we accumulate while we process the file?
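One possibility, sketched in Python: keep a small running accumulator per reported quantity while streaming through the paired files. The binary parser is omitted; values1/values2 are assumed to be parallel NumPy arrays for one quantity at one reporting step.

```python
# Running difference statistics, accumulated per reported quantity while
# streaming paired time steps out of two binary files.
import numpy as np

class DiffStats:
    def __init__(self):
        self.count = 0
        self.sum_abs = 0.0
        self.max_abs = 0.0
        self.max_rel = 0.0

    def update(self, values1, values2):
        diff = np.abs(values1 - values2)
        denom = np.maximum(np.abs(values2), 1e-6)  # guard against near-zero values
        self.count += diff.size
        self.sum_abs += float(diff.sum())
        self.max_abs = max(self.max_abs, float(diff.max()))
        self.max_rel = max(self.max_rel, float((diff / denom).max()))

    @property
    def mean_abs(self):
        return self.sum_abs / self.count if self.count else 0.0
```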

@eladsal
Member

eladsal commented Dec 17, 2015

I have compared @samhatchett 's output (Mac) to my Windows DLL (both 2.1) and got no difference at all.
I do get differences compared to version 2.0.12:
Links Flow max diff = .00003051758
Links Quality max diff = 0.2500000000

@michaeltryby

@eladsal That's really interesting and useful information. I'm pleasantly surprised there is no difference between Mac and Windows.

Has anyone been working on the water quality routines? A difference of 0.25 seems significant ...

@eladsal
Member

eladsal commented Dec 17, 2015

I think there was a change in water quality - stepwise calculations. However, I see no differences in node quality results.
My findings so far: the difference is only in the first time step for link 110, which connects junction 12 and tank 2. As far as I can see it's related to the initial water quality assigned to the pipe. The initial quality for the junction is 0.5 mg/L and for the tank it's 1 mg/L. Version 2.0.12 reports 0.75 mg/L while 2.1 reports 1 mg/L. As far as I can see the correct value should be 0.75 mg/L, since the link quality is defined as the average of the link's segments' quality (or the average of the nodes' quality if no segments are in the link).

I think I found the problem: it's the place initsegs is called. It was moved from gethyd to initqual. I have to dig in a bit more on this.

@jamesuber
Member

Until we get more experience with what is "unexpected," I'd suggest that the comparison and test outputs be built with compiler optimizations completely disabled. It just removes another variable. Plus, if we find that differences depend on compiler optimization, or even purely on platform when optimizations are enabled, then we should question the stability of our algorithms. That might mean we should look into adhering to certain coding practices, and examine more carefully the entire suite of compiler warnings (e.g. by turning on ANSI adherence, paying close attention to implicit type conversions, ...).

I still do think that basing the "true/false" designation on the number of correct digits is the way to go. I do not think that efficiency should be a consideration at this point, and I don't see how we have a chance of defining "the tolerance" if we are using input files with potentially different units, different water quality boundary conditions, etc. It's a hard enough problem, without introducing those factors.

Sadly, people who have thought about this a lot more than me have pointed out that even calculating the correct number of digits needs to be done carefully. See section 3 of this frequently cited paper by McCullough: http://www.pages.drexel.edu/~bdm25/tas1.pdf. He points out that if the correct and test values are significantly different (more than a factor of 2), then the formula I cited above won't give the number of correct digits. It's not hard, but you have to account for that.

And, I noted that McCullough recommends normalizing the absolute error by the test value (as long as the test value is not too close to zero - ugh), in which case you are measuring the number of correct significant digits.
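A sketch of that log-relative-error idea in Python, with the two caveats folded in (report zero correct digits when the relative error exceeds 1, and fall back to absolute error near zero); the cutoff constants are illustrative assumptions, not values taken from the paper.

```python
# Number of correct significant digits in `test` relative to `correct`, in the
# spirit of McCullough's log relative error.
import math

def log_relative_error(test, correct, max_digits=15, zero_tol=1e-12):
    if abs(correct) < zero_tol:
        err = abs(test)                    # near-zero reference: absolute error
    else:
        err = abs(test - correct) / abs(correct)
        if err > 1.0:                      # values disagree badly: no correct digits
            return 0.0
    if err == 0.0:
        return float(max_digits)           # exact agreement
    return min(float(max_digits), -math.log10(err))
```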

Until this gets worked out, it seems that the process might also benefit from simplification in the test problems. And it's probably healthy anyway to lean towards a hierarchy of test problems ranging in level of "difficulty" (e.g. "Oh my God it failed that!" or "Oh yeah, that's really a touchy one"). So for example, the initial tests maybe should be a variety of steady state hydraulic simulations of different complexity, in terms of size, pumps/valves, controls/rules. Admittedly this tests only a portion of the code, but that would be the point. From there it makes sense to move into EPS hydraulic tests, and finally into water quality.

@michaeltryby

Testing fit for purpose. @jamesuber I agree with your suggestions, but I don't believe that Travis-CI is the right place to exercise them.

The CI regression test is analogous to a build test in that it keeps you from checking in code with obvious defects. It's a safety net that helps catch stupid mistakes. I wouldn't want to use it to port to a different compiler/platform. Think of how many unknowns we could eliminate by running more rigorous tests locally as opposed to in the cloud.

Travis-CI runs on a Linux server somewhere. We are getting a glimpse that, as expected, there are cross-platform issues we need to consider. Does that mean we need to commit to a porting exercise at this juncture?

I suggest we use Travis-CI to help narrow in on the robust Windows benchmark that we have and tackle porting issues in a later release.

@jamesuber
Member

Well that’s certainly all OK with me - heck I’m just an observer at present anyway.

It just seemed to me that the entire thread was not about this at all, but rather focused on, as @samhatchett put it, "what should the tolerances be?" I mean, really, if we're obviously going to be doing something different for the "real" testing, then the Travis-CI stuff seems a little like overkill to me at present. But I guess I'm really missing the point of the discussion.


eladsal added a commit that referenced this issue Dec 18, 2015
Due to changes in the order initsegs was called, the link's average quality value for the initial time step was wrong. This was found in issue #16
@michaeltryby

@jamesuber clearly, when I think testing and you think testing, we are thinking two different things. What I've got in mind right now is part of what the dev team at Haestad Methods was doing. Sounds to me like what you have in mind is the other part of what they were doing. I don't think of one as "real" and the other as "overkill." I think EPANET can benefit from both, and it's simply a matter of which one makes sense to do first using Travis-CI.

@eladsal
Member

eladsal commented Dec 18, 2015

This bug, fixed in 1c06908, happened due to changes in the order initsegs was called. The link's average quality value, only for the initial time step, was wrong. This average value is not used in any calculations in the engine but is only reported to the user. The water quality model uses segments of water with different quality in the pipes.

@samhatchett
Member Author

think we're in pretty good shape here. a few rough edges, but it's serviceable. closing so that the 2.1 milestone can be met.
