Simplify BalanceUnbalancedClusterTest #12629

Closed
rjernst opened this issue Aug 4, 2015 · 8 comments · Fixed by #88794
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team team-discuss >test Issues or PRs that are addressing/adding tests

Comments

@rjernst
Member

rjernst commented Aug 4, 2015

This takes 10 seconds or more, while other allocation tests are almost instantaneous. Can we simplify this? It looks like it tries to do a basic allocation (5 shards, 1 replica) of a new index when a ton of indexes already exist on just 4 nodes. Perhaps we could test similar circumstances without thousands of shards? Alternatively, we could just make this an integration test (leave the impl, but rename to IT). It doesn't really seem like a unit test as it is now.

Also, as a side note, this test is the only user of CatAllocationTestCase. Perhaps we can also eliminate this abstraction and just test directly (eliminating the zipped shard state)? @s1monw do you have any thoughts here?

@rjernst rjernst added the >test Issues or PRs that are addressing/adding tests label Aug 4, 2015
@lcawl lcawl added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 14, 2018
@DaveCTurner
Contributor

I looked into making this test smaller without making it less useful.

This test was introduced in #9149 as a fix for #9023. I'll admit I don't completely follow what happened in #9149: the description indicates there was an unexpected sign inversion, but the change looks bigger than that. The key part seems to be the following:

```diff
diff --git a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
index 5484498237..7c75b2a061 100644
--- a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
+++ b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
@@ -393,8 +393,7 @@ public class BalancedShardsAllocator extends AbstractComponent implements Shards
                             final ModelNode maxNode = modelNodes[highIdx];
                             advance_range:
                             if (maxNode.numShards(index) > 0) {
-                                float delta = weights[highIdx] - weights[lowIdx];
-                                delta = lessThan(delta, threshold) ? delta : sorter.weight(Operation.THRESHOLD_CHECK, maxNode) - sorter.weight(Operation.THRESHOLD_CHECK, minNode);
+                                float delta = absDelta(weights[lowIdx], weights[highIdx]);
                                 if (lessThan(delta, threshold)) {
                                     if (lowIdx > 0 && highIdx-1 > 0 // is there a chance for a higher delta?
                                         && (weights[highIdx-1] - weights[0] > threshold) // check if we need to break at all
```

My best guess is that the bug was that sorter.weight(Operation.THRESHOLD_CHECK, maxNode) was less than sorter.weight(Operation.THRESHOLD_CHECK, minNode). However, this condition is long-gone, and this code is now quite different from how it was then, so I cannot see a way to make this test fail meaningfully.
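To illustrate the kind of sign issue this guess implies, here is a minimal hypothetical sketch (not the actual `BalancedShardsAllocator` code): a signed weight delta can come out negative when the two weights are effectively swapped, and a negative delta is always "less than" any positive threshold, so a genuinely unbalanced pair could slip past the check. Taking the absolute delta, as the `absDelta` introduced by the patch does, removes the sign dependence.

```java
// Hypothetical sketch of the sign problem, assuming absDelta(lower, higher)
// simply takes the magnitude of the difference (as its name suggests).
public class DeltaSketch {
    static float absDelta(float lower, float higher) {
        return Math.abs(higher - lower);
    }

    public static void main(String[] args) {
        float threshold = 0.1f;
        // Two node weights that differ by 0.2 -- more than the threshold,
        // so this pair should be rebalanced.
        float low = 1.1f, high = 0.9f;

        float signed = high - low;            // -0.2: negative, so always below the threshold
        float unsigned = absDelta(low, high); //  0.2: correctly above the threshold

        System.out.println((signed < threshold) + " " + (unsigned < threshold));
    }
}
```

With the signed delta the imbalance is silently accepted; with the absolute delta the threshold check behaves the same regardless of which node happens to carry the larger weight.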

Additionally, the only real failure of this test that I can find is a PR build (build-1463180427010/t/20161125152515-B0401F02) in November 2016 (in which lots of other allocation tests also failed).

@s1monw can you suggest a way forward here? Perhaps one of the following:

  • Find a way to make it fail meaningfully again (so that I can work on finding a smaller input that triggers the same failure).
  • Move this to a test suite more appropriate for slower tests.
  • Remove it entirely.

@DaveCTurner DaveCTurner added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Jul 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

I looked at this again recently. This test now runs in about 1 second on my laptop (after warming up), and CI reports it taking about 3 seconds. That's still a bit much, but nothing like as bad as the 10 seconds reported above.

I think we could perhaps replace it with a more generic test that builds up a large-ish cluster with a large-ish number of indices and asserts that the cluster is appropriately balanced at the end. I say this tentatively because it'll be a bit delicate to get an assertion that's usefully strong without being false.
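The shape of that generic assertion might look roughly like this (a hypothetical sketch, not actual Elasticsearch test code; the helper name and signature are invented for illustration): after allocation, check that the spread between the most- and least-loaded nodes stays within some tolerance. The delicate part is exactly the tolerance: too tight and the assertion fails on legitimate allocations, too loose and it verifies nothing.

```java
// Hypothetical balance assertion: shardsPerNode[i] is the number of shards
// the allocator ended up placing on node i.
public class BalanceCheck {
    static boolean isBalanced(int[] shardsPerNode, int tolerance) {
        int max = Integer.MIN_VALUE;
        int min = Integer.MAX_VALUE;
        for (int n : shardsPerNode) {
            max = Math.max(max, n);
            min = Math.min(min, n);
        }
        // "Appropriately balanced": no node carries more than `tolerance`
        // shards beyond the least-loaded node.
        return max - min <= tolerance;
    }
}
```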

@DaveCTurner
Contributor

We discussed this in today's team meeting and decided to close this issue. The test is acceptably fast these days and a number of us have all thought about how to generalise or improve this test without any success.

@DaveCTurner
Contributor

Reopening this because over the last 3 years this test has slowed down enormously.

@DaveCTurner DaveCTurner reopened this Jul 25, 2022
@DaveCTurner DaveCTurner removed the help wanted adoptme label Jul 25, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Jul 25, 2022
@idegtiarenko idegtiarenko self-assigned this Jul 25, 2022
@idegtiarenko
Contributor

The test is building a routing table with 4330 shards in it.

I executed the test once and captured a flame graph (image attachment in the original issue).

It looks like the slow part is org.elasticsearch.cluster.routing.RoutingNode#invariant. This method is only called from assert statements, so it should not slow down production clusters.

I will check whether anything in it could be simplified or sped up.

Average test duration with assertions enabled: ~60 seconds.
Average test duration with assertions disabled: ~400 ms.
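The assert-only invariant pattern described here might be sketched as follows (a hypothetical illustration, not the actual `RoutingNode` code; the class and fields are invented). The expensive consistency check is only reachable through an `assert`, so it runs on every mutation when the JVM's `-ea` flag is set, as test runners typically do, and costs nothing in production where assertions are disabled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an assert-only invariant on routing-node state.
public class NodeSketch {
    private final Map<String, Integer> shardsByIndex = new HashMap<>();
    private int totalShards = 0;

    void addShard(String index) {
        shardsByIndex.merge(index, 1, Integer::sum);
        totalShards++;
        assert invariant(); // elided from the hot path when assertions are disabled
    }

    // O(n) walk over internal state -- too slow to run unconditionally,
    // which is why it only appears behind an assert.
    boolean invariant() {
        int sum = shardsByIndex.values().stream().mapToInt(Integer::intValue).sum();
        return sum == totalShards;
    }
}
```

This matches the ~60s vs ~400ms timings above: with thousands of shards, re-running an O(n) invariant after every mutation dominates the test's runtime even though it never runs in production.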

@rjernst
Member Author

rjernst commented Jul 26, 2022

IIRC the shard state the test builds came from a real production cluster, which tripped an edge case in allocation. When I originally created this issue, my desire was to understand the edge case so that it could be reproduced without such a massive number of shards. @DaveCTurner do you understand the original issue well enough to make a synthetic replacement for this test?
