Create the set of criteria (a test plan) for marking builds 'good' #186

Open
smlambert opened this Issue Dec 15, 2017 · 11 comments

Comments

7 participants
@smlambert
Contributor

smlambert commented Dec 15, 2017

This issue is to discuss and decide what criteria we should use to mark an AdoptOpenJDK binary "good". To kick off the discussion, I propose the following goals:

For release builds, all tests at our disposal should pass, where "all" includes:

  • openjdk regression tests
  • functional tests
  • system tests
  • jck tests (including the manual/interactive tests)
  • optionally, external tests (third-party application tests)
  • optionally performance benchmarks

For nightly builds, a subset of all tests should be run and pass, where we explicitly state what tests are in the subset, and as more machines are available, we keep adding to the subset (to be as close to the entire list of tests that get run against release builds as we can, given the set of resources we have), starting off with:

  • openjdk regression tests
  • functional tests
  • system tests
  • optionally a subset of the jck tests (not including the manual/interactive tests)

For pull request builds, a small subset of tests should be run and pass. Ideally, this set is dynamic and selected to best test the change in the PR, but as a starting point, this set would be a short list that represents a sample from the broad spectrum of full tests we have.

For background, here is a brief presentation on testing at AdoptOpenJDK: https://youtu.be/R3rdLIC089k

@karianna karianna added this to In Progress in openjdk-test Dec 15, 2017

@tellison

Member

tellison commented Dec 18, 2017

A good discussion Shelley. Let me try to describe how I might define those criteria you have suggested.

Constraints
In an ideal world, we would be running all the tests all the time, however, there are a number of factors that impact on the amount of testing we can perform, and these will guide the criteria we apply around the different quality assurances we assign to binaries.

The factors that impact our test plans include: the length of time available to run the tests, the number and types of machines available to run tests, and the number of tests we have at our disposal. Any more constraints?

Quality Levels
As you note, externally we have identified three groups of binaries that are consumed by users of AdoptOpenJDK: releases, which are our highest-quality binaries designed for production use; nightlies, which carry a reasonable level of confidence that they work as expected and, in particular, that no serious regressions have been introduced since the last good nightly build; and pull request builds, which assure an individual developer that their change will not cause a nightly or release build to be marked broken.

In theory we can introduce more quality levels, such as builds that pass a longer set of sanity checks (called 1 hr sanity?), those that pass all our automated release tests on one fast platform, but have not had any manual testing (called JCK candidates?), and binaries that have the assurance of a modest level of testing on at least every different CPU / OS type we distribute (called platform coverage?). However, there is a risk that introducing too many binary quality markers causes confusion -- but worth thinking about what is going to be useful to our target users.

Practical Issues
For now, let's just think about the three levels you mention above. We can think about the testing scenarios and their constraints, and give each a weighting of High, Medium, or Low. For example, the importance of running quickly is of higher importance to a PR build than it is to a full release build.

So, as a starting point, here's my test playlist criteria:

  • Release = time (L), platform coverage (H), code/function coverage (H).

Release builds are our gold standard of binaries. These have undergone the best testing we can perform, including functional tests, running applications, and checking the performance is acceptable. The tests represent real-world usage of the binaries.

Ideally, the test framework would accept containers configured with third-party test suites and a standard way of invoking the image with a candidate binary. If each of the third-party tests (e.g. Tomcat tests, Eclipse tests, Lucene tests, Scala tests, etc) report success with the candidate then we can be assured that it is capable of running real applications. Application owners can ask for their tests to be part of our release testing by following the container rules.

At some point, the release builds are pushed into the JCK pipeline, which requires at least some level of manual testing before a build is flagged as a JCK-compatible release.

  • Nightly = time (H), platform coverage (H), code/function coverage (M).

Nightly builds that are marked as good give developers and users confidence that the changes introduced since the last good nightly build have not caused any significant regression on any platform that we cover. The builds are limited by the number of tests we can run within the 24hr period between nightlies being produced, so they are predominantly limited by time and build farm capacity. These should be as close to release quality as possible.

The tests run nightly will be those that give best bug finding value (based on historical evidence), are fully automated, and cover a broad spectrum of platform and functional assurances. In some cases, users may pick a nightly build to get a "hot" patch that is not yet available in a release -- because releases are expensive and infrequent, and PR builds have significantly lower quality assurances.

Nightlies are missing the long "burn-in" tests and heavy performance runs, and do not have the full suite of application and JCK tests applied because we have to complete all the tests in a reasonable time. They are called "nightlies" to encourage the feedback to the community within 24hrs.

  • PR build = time (H), platform coverage (M), code/function coverage (L)

PR builds are highly time sensitive. Ideally, the PR build is very quick so the developer gets immediate feedback before they move on to their next task. For example, by using an incremental build rather than a full clean and rebuild, the developer should know within 5 mins if the code has compiled and passed basic sanity checks on one platform. Within an hour the developer should know that the change has passed PR sanity checks on all platforms, and is permitted to go into the nightly testing regime.

Ideally the test framework will target the most appropriate tests to run within the given time/machine budget, e.g. by figuring out what areas of the build are impacted by the change, such as the module that was modified, and selecting appropriate PR tests for that functional area and its dependencies.
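
As a rough illustration of that kind of targeting, here is a minimal Python sketch. The module names, the dependency map, and the test-group names are all hypothetical placeholders, not anything that exists in the AdoptOpenJDK test framework; the idea is only to collect the changed modules plus their transitive dependencies and look up the PR tests registered for each.

```python
# Hypothetical data: which modules each module depends on,
# and which PR test groups cover each module.
DEPENDS_ON = {
    "net": ["io"],
    "io": ["base"],
    "base": [],
}
PR_TESTS_FOR = {
    "net": ["net_sanity"],
    "io": ["io_sanity"],
    "base": ["base_sanity"],
}


def select_pr_tests(changed_modules):
    """Return the PR test groups for the changed modules and their dependencies."""
    seen = set()
    stack = list(changed_modules)
    while stack:
        module = stack.pop()
        if module in seen:
            continue
        seen.add(module)
        # Follow dependencies transitively.
        stack.extend(DEPENDS_ON.get(module, []))
    return sorted(test for m in seen for test in PR_TESTS_FOR.get(m, []))
```

With this made-up data, a change in `net` selects the tests for `net`, `io`, and `base`, while a change in `base` selects only the `base` tests.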

--
p.s. I wrote the above before watching the YouTube video -- apologies for repeating stuff you already said, but luckily it seems we are in sync ;-)

@karianna

Member

karianna commented Dec 20, 2017

I think @ShelleyLambert's suggested levels broadly make sense and, as @tellison mentions, the nightly builds will have to be able to complete in a reasonable timeframe (maybe that's an hour). All seems sound to me.

@lumpfish

lumpfish commented Dec 20, 2017

Do we have the data yet to produce a table with execution times for the various test suites (with further breakdown into candidates for subsets of the test suites)?
If release testing = run everything, then it takes as long as it takes, but for the nightly and PR testing we need such data to tune the tests to meet the time constraints.

@judovana

judovana commented Dec 20, 2017

Those times vary enormously from one hardware and setup to another. If you, e.g., run the whole test suite in a ramdisk, you can cut the time in half. Also, the time for jtregs or the TCK is simply divided by the number of cores the machine has, up to the moment X starts to fail.
Also, whether you are testing HotSpot, J9, Zero, Shenandoah, fastdebug, or a similarly specialized JDK affects these numbers a great deal.
Also, I'm testing test suites (not benchmarks) on fastdebug builds (if it can fail randomly, it is more likely to fail there :)
Note that you cannot run benchmarks on virtual machines. If you do, the results have nothing to do with benchmarking; they just say "JVM endured that run". Well, that may say enough.
In my experience, heavily averaged values are:
specjbb: about 1-3 hours
specjvm: about 1-3 hours
dacapo: about 10-20 minutes
radargun1 with 3 slaves: 15 minutes-1 hour
radargun3 with one huge slave: around 20 minutes
jcstress: 2-6 hours
jtregs: 6-12 hours
jck: 6-24 hours
lucene: 10-20 minutes
wycheproof: 5-20 minutes
install tests: about 5 minutes
various analyses of debuginfo symbols, API compatibility, class files debug info: about 30 minutes
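
To show how numbers like these could feed the nightly/PR tuning asked about above, here is a small Python sketch. It uses the upper bounds from some of the ranges in this list as worst-case costs; the greedy strategy and the priority ordering passed in are assumptions made for illustration, not an agreed selection policy.

```python
# Worst-case durations in minutes, taken from the upper bounds listed above.
WORST_CASE_MINUTES = {
    "install tests": 5,
    "dacapo": 20,
    "lucene": 20,
    "wycheproof": 20,
    "specjvm": 180,
    "jcstress": 360,
    "jtregs": 720,
    "jck": 1440,
}


def pick_suites(budget_minutes, priority):
    """Greedily take suites in priority order while they still fit the budget."""
    chosen = []
    used = 0
    for name in priority:
        cost = WORST_CASE_MINUTES[name]
        if used + cost <= budget_minutes:
            chosen.append(name)
            used += cost
    return chosen, used
```

For example, `pick_suites(60, ["install tests", "lucene", "dacapo", "wycheproof", "jtregs"])` keeps the first three suites (45 minutes in total) and skips the rest, since adding wycheproof would exceed the hour.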

@judovana

judovana commented Dec 20, 2017

I believe that the set of tests should be named for each project x variant.
Developer runs should not be run at all. A developer should be confident running jtregs or more locally (by deploying this infrastructure?). Once he pushes, his changeset will be taken into consideration via the nightly build anyway.

@judovana

judovana commented Dec 20, 2017

As for release x release candidates, once Oracle stops tagging based on their good feeling/internal testing, the following could be applied:

  • every nightly build passing all tests is a release candidate
  • every tagged build, based on developers' opinions/votes/whatever, is a release candidate
  • if such a release candidate passes all tests, it can become a release itself
  • the release should be tagged one more time, to mark it.

This actually means that each release was tested at least two times before becoming public. That is good. I do similarly at RH - both sources and RPMs must pass everything to be considered release candidates.

And only after that can such a build be published on the front page.
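
The promotion rules above could be written down roughly as follows. This is only a sketch in Python; the `Build` record and its field names are hypothetical and exist just to make the two-stage check explicit.

```python
from dataclasses import dataclass


@dataclass
class Build:
    # All field names are hypothetical, chosen only for this sketch.
    nightly: bool = False
    tagged: bool = False
    nightly_tests_passed: bool = False    # result of the nightly test run
    candidate_tests_passed: bool = False  # result of the run done on the candidate


def is_release_candidate(build):
    """A passing nightly build or any tagged build is a release candidate."""
    return build.tagged or (build.nightly and build.nightly_tests_passed)


def can_be_released(build):
    """A release candidate that passes all tests again can become a release."""
    return is_release_candidate(build) and build.candidate_tests_passed
```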

@sxa555

Member

sxa555 commented Dec 20, 2017

We're looking at approximately 27 hours (single-threaded) for a JCK8 HotSpot run on the hardware we've got running all the non-manual tests

@tellison

Member

tellison commented Dec 21, 2017

@sxa555 my understanding is that we are only obliged to run the full JCK on a major release. Security updates and bug fix releases etc do not require a full re-run; though we may want to include as much of the JCK as we can (time) afford to ensure any regressions are caught early.

So I think we need to separate out the "full JCK recorded test run with interactive tests etc." as a special case that may require some out of band intervention for the regular release pipeline testing.

@jaysk1

jaysk1 commented Dec 26, 2017

JCK8 has 3 test suites available as executable JARs - Runtime, Devtools and Compiler.
These are further categorized into 7 test types:
Compiler_API, Compiler_LANG, Devtools, Runtime_API, Runtime_VM, Runtime_LANG and Runtime_XML_Schema

  • The manual/interactive tests are present in 4 sub-packages - java_awt, javax_swing, java_io and javax_sound of the JCK-Runtime_API test suite.

Approximate execution time of these test suites on a single (good) machine:
Automated Runtime - 14hrs
Automated Compiler - 10hrs
Automated Devtools - 3hrs
Interactive Runtime_API - 3.5hrs - 5hrs (depends on the individual doing it; first-timers may take more than 5)

These can decrease further when the tests are run on the harness in multi-JVM group execution mode.
However, the optimal mode for executing tests on the JCK harness is multi-JVM mode, and all the above estimates are for that mode and for the JCK8 test package... JCK9 had 10 more sub-packages than the previous version - I haven't checked whether the new packages' execution time is substantial.
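
For a quick sanity check of the totals, the estimates above can be added up. This Python sketch uses only the figures quoted in this comment; the function name is illustrative.

```python
# Hours per automated JCK8 suite, as estimated above.
AUTOMATED_HOURS = {"Runtime": 14, "Compiler": 10, "Devtools": 3}
# Low and high estimate for the interactive Runtime_API tests.
INTERACTIVE_HOURS = (3.5, 5.0)


def total_hours(include_interactive=False):
    """Return (low, high) total hours when everything runs back to back."""
    automated = sum(AUTOMATED_HOURS.values())
    if not include_interactive:
        return (automated, automated)
    low, high = INTERACTIVE_HOURS
    return (automated + low, automated + high)
```

The automated suites alone come to 27 hours, which is in the same ballpark as the roughly 27 hours @sxa555 reported earlier in the thread; adding the interactive tests gives somewhere between 30.5 and 32 hours.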


  • Most of the features of nightly Java builds can be tested with JCK-Runtime_API and JCK-Runtime_VM alone. [Runtime_API & VM can be done in under 8hrs]
    The JCK team used to run these ever so often, as most bugs would be exposed by these tests.

For JCK, the plan can be:

release builds-
Complete JCK (automated, including manuals) can be done once per platform in between the cycles.
On the final candidate build: complete JCK (automated, excluding manuals)

nightly builds -
Every Night - JCK Runtime_API & VM
Fortnightly - Complete JCK (Automated, excluding Manuals)

Pull Request builds -
Any sub-packages within the 3 test suites where the developer's code fix is present.
(eg: java_awt, java_net, java_util, jvmti, java_rmi, signaturetest, CLSS, LMBD, java2schema etc )

FYI: Manual/Interactive tests can be done only once per platform per cycle.
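
One way to encode this plan so that automation can pick the JCK scope per build type is sketched below in Python. The function name and the `night_number` counter used to decide which night is the fortnightly one are assumptions of this sketch; manual tests are left out because they are scheduled by people, not by automation.

```python
# The 7 automated test types and the nightly subset named above.
ALL_AUTOMATED = [
    "Compiler_API", "Compiler_LANG", "Devtools",
    "Runtime_API", "Runtime_VM", "Runtime_LANG", "Runtime_XML_Schema",
]
NIGHTLY_SUBSET = ["Runtime_API", "Runtime_VM"]


def jck_scope(build_type, night_number=0, changed_subpackages=()):
    """Return the automated JCK targets to run for a given build type."""
    if build_type == "release":
        return list(ALL_AUTOMATED)
    if build_type == "nightly":
        # Every 14th night (hypothetical counter) runs the complete automated set.
        if night_number % 14 == 0:
            return list(ALL_AUTOMATED)
        return list(NIGHTLY_SUBSET)
    if build_type == "pr":
        # Only the sub-packages touched by the developer's change.
        return list(changed_subpackages)
    raise ValueError(f"unknown build type: {build_type}")
```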

@jaysk1

jaysk1 commented Jan 5, 2018

A few more points, after listening to "AdoptOpenJDK - Hangout 04/01/2018"
(which might already be known, don't mind me reiterating)

  • To run manual tests on platforms other than Windows and z/OS (headless machine, need not be done) - we could have one or two machines (preferably 1 AIX and 1 Linux) with a display, and export to these displays from the test machine while running interactive tests.

  • For the audit, we'll need to store the results (.jtr files produced by the test harness) of the release build only (both automated and interactive). Be aware: every time you restart a failed job, the test harness will rewrite the results.

    • Even if a single package is re-run, the complete previous set of results will be lost from the work directory, unless "-overwrite" is removed from the options passed to the JCK test harness.
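
One defensive option is to copy the .jtr files out of the work directory before any job is restarted. The Python sketch below shows the idea; the directory layout, the `label` argument, and the function name are assumptions for illustration and are not part of the JCK harness. The archive directory is assumed to live outside the work directory.

```python
import shutil
from pathlib import Path


def archive_jtr_results(work_dir, archive_root, label):
    """Copy every .jtr file under work_dir into archive_root/label.

    Relative paths are preserved so results from different packages
    do not collide. Returns the list of copied files.
    """
    work_dir = Path(work_dir)
    dest_root = Path(archive_root) / label
    copied = []
    for jtr in sorted(work_dir.rglob("*.jtr")):
        target = dest_root / jtr.relative_to(work_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(jtr, target)  # copy2 also keeps file timestamps
        copied.append(target)
    return copied
```

Calling this with a fresh `label` per run (for example a build number) before restarting a failed job keeps the earlier results available for the audit.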
@smlambert

Contributor Author

smlambert commented Mar 6, 2018

Updating this discussion with a few more pieces to this puzzle:

  1. We are adding a way to divide and run JCKs in parallel, if there are multiple machines to run on (#291). Note that currently we are using @sxa's freestyle scripts for JCK runs (which are working fine for now, but are launched manually), but as we want to tie into the main pipeline, trigger tests from successful completion of compile, and run in parallel, we can shift to this... we are verifying this approach internally with JCK materials we have under our agreement.

  2. We already have a way to tag certain JCK subgroups/tests as 'sanity' which would be the nightly set to run on nightly builds, while full set reserved for release builds. I have yet to understand how my automation will know the difference between a nightly build and a release build, but once we do, we can trigger 'sanity' or 'sanity'+'extended'=full set of JCKs depending. Current plan is same as @jaysk1 suggests, sanity == nightly == Runtime API + VM.

  3. I have added the average test execution times to Sheet 2/Column H of this AdoptTests spreadsheet. I will keep updating that sheet, as we enable more parallelism, with plan to reduce our execution times.
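
To illustrate what dividing JCK runs across machines might look like, here is a small Python sketch of a longest-first greedy split. It is not a description of what #291 does; the function name is hypothetical, and the durations in the usage note are the automated-suite estimates from @jaysk1's comment earlier in the thread.

```python
import heapq


def split_across_machines(durations, machines):
    """Assign jobs to machines, longest job first, always to the least-loaded machine.

    Returns a list of (total_hours, machine_index, job_names), ordered by machine index.
    """
    # Each heap entry is (current load, machine index, jobs assigned so far).
    heap = [(0, i, []) for i in range(machines)]
    heapq.heapify(heap)
    for name, hours in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, index, jobs = heapq.heappop(heap)
        jobs.append(name)
        heapq.heappush(heap, (load + hours, index, jobs))
    return sorted(heap, key=lambda entry: entry[1])
```

With `{"Runtime": 14, "Compiler": 10, "Devtools": 3}` and two machines, Runtime gets a machine to itself (14 hours) while Compiler and Devtools share the other (13 hours), so the wall-clock time drops from 27 hours to about 14 under these estimates.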
