Simplify BalanceUnbalancedClusterTest #12629

Closed
rjernst opened this issue Aug 4, 2015 · 8 comments · Fixed by #88794
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team team-discuss >test Issues or PRs that are addressing/adding tests

Comments

@rjernst
Member

rjernst commented Aug 4, 2015

This takes 10 seconds or more, while other allocation tests are almost instantaneous. Can we simplify this? It looks like it tries to do a basic allocation (5 shards, 1 replica) of a new index when a ton of indexes already exist on just 4 nodes. Perhaps we could test similar circumstances without thousands of shards? Alternatively, we could just make this an integration test (leave the impl, but rename to IT). It doesn't really seem like a unit test as it is now.

Also, as a side note, this test is the only user of CatAllocationTestCase. Perhaps we can also eliminate this abstraction and just test directly (eliminating the zipped shard state)? @s1monw do you have any thoughts here?

@rjernst rjernst added the >test Issues or PRs that are addressing/adding tests label Aug 4, 2015
@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
@DaveCTurner
Contributor

I looked into making this test smaller without making it less useful.

This test was introduced in #9149 as a fix for #9023. I'll admit I don't completely follow what happened in #9149: the description indicates there was an unexpected sign inversion, but the change looks bigger than that. The key part seems to be the following:

```diff
diff --git a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
index 5484498237..7c75b2a061 100644
--- a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
+++ b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
@@ -393,8 +393,7 @@ public class BalancedShardsAllocator extends AbstractComponent implements Shards
                             final ModelNode maxNode = modelNodes[highIdx];
                             advance_range:
                             if (maxNode.numShards(index) > 0) {
-                                float delta = weights[highIdx] - weights[lowIdx];
-                                delta = lessThan(delta, threshold) ? delta : sorter.weight(Operation.THRESHOLD_CHECK, maxNode) - sorter.weight(Operation.THRESHOLD_CHECK, minNode);
+                                float delta = absDelta(weights[lowIdx], weights[highIdx]);
                                 if (lessThan(delta, threshold)) {
                                     if (lowIdx > 0 && highIdx-1 > 0 // is there a chance for a higher delta?
                                         && (weights[highIdx-1] - weights[0] > threshold) // check if we need to break at all
```

My best guess is that the bug was that sorter.weight(Operation.THRESHOLD_CHECK, maxNode) was less than sorter.weight(Operation.THRESHOLD_CHECK, minNode). However, this condition is long-gone, and this code is now quite different from how it was then, so I cannot see a way to make this test fail meaningfully.
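To illustrate the kind of sign issue this guess implies, here is a minimal hypothetical sketch (not the actual `BalancedShardsAllocator` code): a signed weight delta can come out negative when the two weights are effectively swapped, and a negative delta is always "less than" any positive threshold, so a genuinely unbalanced pair could slip past the check. Taking the absolute delta, as the `absDelta` introduced by the patch does, removes the sign dependence.

```java
// Hypothetical sketch of the sign problem, assuming absDelta(lower, higher)
// simply takes the magnitude of the difference (as its name suggests).
public class DeltaSketch {
    static float absDelta(float lower, float higher) {
        return Math.abs(higher - lower);
    }

    public static void main(String[] args) {
        float threshold = 0.1f;
        // Two node weights that differ by 0.2 -- more than the threshold,
        // so this pair should be rebalanced.
        float low = 1.1f, high = 0.9f;

        float signed = high - low;            // -0.2: negative, so always below the threshold
        float unsigned = absDelta(low, high); //  0.2: correctly above the threshold

        System.out.println((signed < threshold) + " " + (unsigned < threshold));
    }
}
```

With the signed delta the imbalance is silently accepted; with the absolute delta the threshold check behaves the same regardless of which node happens to carry the larger weight.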

Additionally, the only real failure of this test that I can find is a PR build (build-1463180427010/t/20161125152515-B0401F02) in November 2016 (in which lots of other allocation tests also failed).

@s1monw can you suggest a way forward here? Perhaps one of the following:

  • Find a way to make it fail meaningfully again (so that I can work on finding a smaller input that triggers the same failure).
  • Move this to a test suite more appropriate for slower tests.
  • Remove it entirely.

@DaveCTurner DaveCTurner added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Jul 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

I looked at this again recently. This test now runs in about 1 second on my laptop (after warming up), and CI reports it taking about 3 seconds. That's still a bit much, but nothing like as bad as the 10 seconds reported above.

I think we could perhaps replace it with a more generic test that builds up a large-ish cluster with a large-ish number of indices and asserts that the cluster is appropriately balanced at the end. I say this tentatively because it'll be a bit delicate to get an assertion that's usefully strong without being false.
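The shape of that generic assertion might look roughly like this (a hypothetical sketch, not actual Elasticsearch test code; the helper name and signature are invented for illustration): after allocation, check that the spread between the most- and least-loaded nodes stays within some tolerance. The delicate part is exactly the tolerance: too tight and the assertion fails on legitimate allocations, too loose and it verifies nothing.

```java
// Hypothetical balance assertion: shardsPerNode[i] is the number of shards
// the allocator ended up placing on node i.
public class BalanceCheck {
    static boolean isBalanced(int[] shardsPerNode, int tolerance) {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (int n : shardsPerNode) {
            max = Math.max(max, n);
            min = Math.min(min, n);
        }
        // "Appropriately balanced": no node carries more than `tolerance`
        // shards beyond the least-loaded node.
        return max - min <= tolerance;
    }
}
```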

@DaveCTurner
Contributor

We discussed this in today's team meeting and decided to close this issue. The test is acceptably fast these days and a number of us have all thought about how to generalise or improve this test without any success.

@DaveCTurner
Contributor

Reopening this because over the last 3 years this test has slowed down enormously.

@DaveCTurner DaveCTurner reopened this Jul 25, 2022
@DaveCTurner DaveCTurner removed the help wanted adoptme label Jul 25, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jul 25, 2022
@idegtiarenko idegtiarenko self-assigned this Jul 25, 2022
@idegtiarenko
Contributor

The test is building a routing table with 4330 shards in it.

I executed the test once and captured a flame graph (image attachment in the original issue).

It looks like the slow part is org.elasticsearch.cluster.routing.RoutingNode#invariant. This method is only called from assert statements, so it should not slow down production clusters.

I will check whether anything in it could be simplified or sped up.

Average test duration with assertions enabled: ~60 seconds.
Average test duration with assertions disabled: ~400 ms.
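The assert-only invariant pattern described here might be sketched as follows (a hypothetical illustration, not the actual `RoutingNode` code; the class and fields are invented). The expensive consistency check is only reachable through an `assert`, so it runs on every mutation when the JVM's `-ea` flag is set, as test runners typically do, and costs nothing in production where assertions are disabled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an assert-only invariant on routing-node state.
public class NodeSketch {
    private final Map<String, Integer> shardsByIndex = new HashMap<>();
    private int totalShards = 0;

    void addShard(String index) {
        shardsByIndex.merge(index, 1, Integer::sum);
        totalShards++;
        assert invariant(); // elided from the hot path when assertions are disabled
    }

    // O(n) walk over internal state -- too slow to run unconditionally,
    // which is why it only appears behind an assert.
    boolean invariant() {
        int sum = shardsByIndex.values().stream().mapToInt(Integer::intValue).sum();
        return sum == totalShards;
    }
}
```

This matches the ~60s vs ~400ms timings above: with thousands of shards, re-running an O(n) invariant after every mutation dominates the test's runtime even though it never runs in production.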

@rjernst
Member Author

rjernst commented Jul 26, 2022

IIRC the shard state the test builds came from a real production cluster, which tripped an edge case in allocation. When I originally created this issue, my desire was to understand the edge case so that it could be reproduced without such a massive number of shards. @DaveCTurner do you understand the original issue well enough to make a synthetic replacement for this test?
