remove global shared test cluster #8854
Conversation
+1 let's remove this beast
* <li>{@link Scope#TEST} - uses a new cluster for each individual test method.</li>
* <li>{@link Scope#SUITE} - uses a cluster shared across all test method in the same suite</li>
* </ul>
* <p/>
* The most common test scope it {@link Scope#GLOBAL} which shares a cluster per JVM. This cluster is only set-up once
it -> is (existing typo)
LGTM!
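The scope semantics quoted in the diff above could be sketched as a standalone enum. This is illustrative only: the `ScopeSketch` class is a hypothetical name, not the actual test framework class, and only the three scope names come from the snippet.

```java
// Illustrative sketch of the cluster scopes described in the Javadoc diff above.
// ScopeSketch is a hypothetical class; only the Scope names come from the snippet.
public class ScopeSketch {
    public enum Scope {
        GLOBAL, // one cluster shared across the whole JVM (what this PR removes)
        SUITE,  // one cluster shared across all test methods in the same suite
        TEST    // a fresh cluster for each individual test method
    }

    public static void main(String[] args) {
        // list the available scopes in declaration order
        for (Scope s : Scope.values()) {
            System.out.println(s);
        }
    }
}
```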
//ports can be reused as suite or test clusters are never run concurrently
//we don't reuse the same port immediately though but leave some time to make sure ports are freed
//reserve 0 to global cluster, prevent conflicts between jvms by never going above 9
return 1 + portCounter.incrementAndGet() % 9;
we can remove the `1 + `, now that 0 is not reserved for the global scope. This should be `portCounter.incrementAndGet() % 10;`
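The suggested change could be sketched as follows. This is a minimal standalone sketch: `PortRotation` and `nextPortOffset` are hypothetical names for illustration, not the actual test framework code; only the `AtomicInteger` counter and the `% 10` rotation come from the suggestion.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the reviewer's suggestion: with no slot 0 reserved for a global
// cluster, the counter can rotate through all ten port offsets 0..9.
public class PortRotation {
    // start at -1 so the first incrementAndGet() yields offset 0
    private static final AtomicInteger portCounter = new AtomicInteger(-1);

    public static int nextPortOffset() {
        // cycles 0, 1, ..., 9, 0, 1, ... — safe because suite/test clusters
        // never run concurrently, per the comment in the diff above
        return portCounter.incrementAndGet() % 10;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 12; i++) {
            System.out.print(nextPortOffset() + " ");
        }
        // prints: 0 1 2 3 4 5 6 7 8 9 0 1
    }
}
```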
I think this requires some more work... the global cluster can also be an external one, and REST tests can run against this external cluster. We use this before releasing, to run REST tests against the distribution that we are going to ship. We might want to move that logic to SUITE clusters, but this needs to be done before merging, otherwise the release process breaks.
One more thing, how did you run your experiments? I tend to agree that the global cluster doesn't buy us much in local mode (default), but when running tests with network, I'd expect rebuilding the cluster for each suite would have a cost. Any measurements around this?
Running tests with the network is useless. It just means test failures. We should not even support it. I'm certainly not going to waste 1 millisecond of my time on it.
I'm closing this: I don't want to fight for it, and I don't want to be blamed for test failures in flaky tests. I will just say for the record: having randomized tests that do not reproduce is useless.
I think removing the global cluster is a good idea, since repeatability of tests is super important. We should go ahead with this, make the external cluster work too, and understand what the implications are for tests that use network as long as we have them. If tests with network are a bad idea, let's discuss on a separate issue why we support them and why we shouldn't.
This was previously attempted in #8854. I revived that branch and did some performance testing, as was suggested in the comments there. I fixed all the errors, mostly just the rest tests, which needed to have http enabled in the node settings (the global cluster previously had this always enabled). I also addressed the comments from that issue.

My performance tests involved running the entire test suite on my desktop, which has 6 cores and 16GB of RAM; nothing else was running on the box at the time. I ran each set of settings 3 times and took the average time.

| mode    | master | patch | diff |
| ------- | ------ | ----- | ---- |
| local   | 409s   | 417s  | +2%  |
| network | 368s   | 380s  | +3%  |

This increase in average time is clearly worthwhile to pay to achieve isolation of tests. One caveat: the way I fixed the rest tests still uses one cluster for the entire suite, so all the rest tests can still potentially affect each other, but this is an issue for another day.

There were some oddities that I noticed while running these tests that I would like to point out, as they probably deserve some investigation (but are orthogonal to this PR):

* The total test run times are highly variable (more than a minute between the min and max)
* Running in network mode is on average actually *faster* than local mode. How is this possible!?

closes #9781
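The diff column in the table above can be verified with a quick calculation. This is a throwaway sketch; `OverheadCheck` and `pctIncrease` are hypothetical names, and the four timings are the ones quoted in the comment.

```java
// Sketch verifying the overhead percentages quoted in the table above,
// using the timings from the comment (409s/417s local, 368s/380s network).
public class OverheadCheck {
    // percentage increase from base to patched, rounded to the nearest whole percent
    public static long pctIncrease(double base, double patched) {
        return Math.round((patched - base) / base * 100.0);
    }

    public static void main(String[] args) {
        System.out.println(pctIncrease(409, 417)); // prints 2
        System.out.println(pctIncrease(368, 380)); // prints 3
    }
}
```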
In my experiments, this doesn't really buy much, but it causes reproducibility grief and prevents things like leak detection from working (I would like to do more, e.g. with the filesystem).
master:
[INFO] Execution time total: 5 minutes 51 seconds
[INFO] Tests summary: 672 suites, 4412 tests, 478 errors, 51 ignored (49 assumptions)
patch:
[INFO] Execution time total: 6 minutes 40 seconds
[INFO] Tests summary: 672 suites, 4412 tests, 479 errors, 51 ignored (49 assumptions)
So the savings are really not much in the scheme of things: I think reproducibility is better.