remove global shared test cluster #8854
Conversation
+1 let's remove this beast
* <li>{@link Scope#TEST} - uses a new cluster for each individual test method.</li>
* <li>{@link Scope#SUITE} - uses a cluster shared across all test method in the same suite</li>
* </ul>
* <p/>
* The most common test scope it {@link Scope#GLOBAL} which shares a cluster per JVM. This cluster is only set-up once
it -> is (existing typo)
LGTM!
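The scope semantics quoted in the diff above could be sketched as a standalone enum. This is illustrative only: the `ScopeSketch` class is a hypothetical name, not the actual test framework class, and only the three scope names come from the snippet.

```java
// Illustrative sketch of the cluster scopes described in the Javadoc diff above.
// ScopeSketch is a hypothetical class; only the Scope names come from the snippet.
public class ScopeSketch {
    public enum Scope {
        GLOBAL, // one cluster shared across the whole JVM (what this PR removes)
        SUITE,  // one cluster shared across all test methods in the same suite
        TEST    // a fresh cluster for each individual test method
    }

    public static void main(String[] args) {
        // list the available scopes in declaration order
        for (Scope s : Scope.values()) {
            System.out.println(s);
        }
    }
}
```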
//ports can be reused as suite or test clusters are never run concurrently
//we don't reuse the same port immediately though but leave some time to make sure ports are freed
//reserve 0 to global cluster, prevent conflicts between jvms by never going above 9
return 1 + portCounter.incrementAndGet() % 9;
we can remove the `1 + `, now that 0 is not reserved for the global scope. This should be `portCounter.incrementAndGet() % 10;`
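The suggested change could be sketched as follows. This is a minimal standalone sketch: `PortRotation` and `nextPortOffset` are hypothetical names for illustration, not the actual test framework code; only the `AtomicInteger` counter and the `% 10` rotation come from the suggestion.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the reviewer's suggestion: with no slot 0 reserved for a global
// cluster, the counter can rotate through all ten port offsets 0..9.
public class PortRotation {
    // start at -1 so the first incrementAndGet() yields offset 0
    private static final AtomicInteger portCounter = new AtomicInteger(-1);

    public static int nextPortOffset() {
        // cycles 0, 1, ..., 9, 0, 1, ... — safe because suite/test clusters
        // never run concurrently, per the comment in the diff above
        return portCounter.incrementAndGet() % 10;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 12; i++) {
            System.out.print(nextPortOffset() + " ");
        }
        // prints: 0 1 2 3 4 5 6 7 8 9 0 1
    }
}
```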
I think this requires some more work... the global cluster can also be an external one, and REST tests can run against this external cluster. We use this before releasing, to run REST tests against the distribution that we are going to ship. We might want to move that logic to SUITE clusters, but this needs to be done before merging, otherwise the release process breaks.
One more thing, how did you run your experiments? I tend to agree that the global cluster doesn't buy us much in local mode (default), but when running tests with network, I'd expect rebuilding the cluster for each suite would have a cost. Any measurements around this?
Running tests with the network is useless. It just means test failures. We should not even support it. I'm certainly not going to waste 1 millisecond of my time on it.
I'm closing this: I don't want to fight for it, and I don't want to be blamed for test failures in flaky tests. I will just say for the record: having randomized tests that do not reproduce is useless.
I think removing the global cluster is a good idea, since repeatability of tests is super important. We should go ahead with this, make the external cluster work too, and understand what the implications are for tests that use network as long as we have them. If tests with network are a bad idea, let's discuss on a separate issue why we support them and why we shouldn't.
This was previously attempted in #8854. I revived that branch and did some performance testing, as was suggested in the comments there. I fixed all the errors, mostly just the rest tests, which needed to have http enabled in the node settings (the global cluster previously had this always enabled). I also addressed the comments from that issue.

My performance tests involved running the entire test suite on my desktop, which has 6 cores and 16GB of RAM; nothing else was running on the box at the time. I ran each set of settings 3 times and took the average time.

| mode    | master | patch | diff |
| ------- | ------ | ----- | ---- |
| local   | 409s   | 417s  | +2%  |
| network | 368s   | 380s  | +3%  |

This increase in average time is clearly worthwhile to pay to achieve isolation of tests. One caveat: the way I fixed the rest tests still uses one cluster for the entire suite, so all the rest tests can still potentially affect each other, but this is an issue for another day.

There were some oddities that I noticed while running these tests that I would like to point out, as they probably deserve some investigation (but are orthogonal to this PR):

* The total test run times are highly variable (more than a minute between the min and max)
* Running in network mode is on average actually *faster* than local mode. How is this possible!?

closes #9781
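The diff column in the table above can be verified with a quick calculation. This is a throwaway sketch; `OverheadCheck` and `pctIncrease` are hypothetical names, and the four timings are the ones quoted in the comment.

```java
// Sketch verifying the overhead percentages quoted in the table above,
// using the timings from the comment (409s/417s local, 368s/380s network).
public class OverheadCheck {
    // percentage increase from base to patched, rounded to the nearest whole percent
    public static long pctIncrease(double base, double patched) {
        return Math.round((patched - base) / base * 100.0);
    }

    public static void main(String[] args) {
        System.out.println(pctIncrease(409, 417)); // prints 2
        System.out.println(pctIncrease(368, 380)); // prints 3
    }
}
```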
In my experiments, this doesn't really buy much, but it causes reproducibility grief and prevents things like leak detection from working (I would like to do more, e.g. with the filesystem).
master:
[INFO] Execution time total: 5 minutes 51 seconds
[INFO] Tests summary: 672 suites, 4412 tests, 478 errors, 51 ignored (49 assumptions)
patch:
[INFO] Execution time total: 6 minutes 40 seconds
[INFO] Tests summary: 672 suites, 4412 tests, 479 errors, 51 ignored (49 assumptions)
So the savings are really not much in the scheme of things: I think reproducibility is better.