Simplify BalanceUnbalancedClusterTest #12629
I looked into making this test smaller without making it less useful. This test was introduced in #9149 as a fix for #9023. I'll admit I don't completely follow what happened in #9149: the description indicates there was an unexpected sign inversion, but it looks like a bigger change. The key seems to be the following:

diff --git a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
index 5484498237..7c75b2a061 100644
--- a/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
+++ b/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
@@ -393,8 +393,7 @@ public class BalancedShardsAllocator extends AbstractComponent implements Shards
                 final ModelNode maxNode = modelNodes[highIdx];
                 advance_range:
                 if (maxNode.numShards(index) > 0) {
-                    float delta = weights[highIdx] - weights[lowIdx];
-                    delta = lessThan(delta, threshold) ? delta : sorter.weight(Operation.THRESHOLD_CHECK, maxNode) - sorter.weight(Operation.THRESHOLD_CHECK, minNode);
+                    float delta = absDelta(weights[lowIdx], weights[highIdx]);
                     if (lessThan(delta, threshold)) {
                         if (lowIdx > 0 && highIdx-1 > 0 // is there a chance for a higher delta?
                             && (weights[highIdx-1] - weights[0] > threshold) // check if we need to break at all

My best guess is that the bug was that

Additionally, the only real failure of this test that I can find is a PR build (

@s1monw can you suggest a way forward here? Perhaps one of the following:
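For anyone else trying to follow the diff above, here is a minimal sketch of why the sign of the delta matters to a threshold check. It is not the allocator's code and not a reproduction of #9023: `DeltaSketch`, `signedDelta`, and the simplified bodies of `lessThan` and `absDelta` are assumptions made for illustration, and the weights are made-up numbers.

```java
public final class DeltaSketch {

    // Assumed, simplified stand-in for the allocator's threshold comparison.
    public static boolean lessThan(float delta, float threshold) {
        return delta < threshold;
    }

    // Plain signed difference: the result depends on the operand order.
    public static float signedDelta(float a, float b) {
        return a - b;
    }

    // Assumed behaviour of absDelta: the magnitude of the difference,
    // independent of operand order.
    public static float absDelta(float lower, float higher) {
        return Math.abs(higher - lower);
    }

    public static void main(String[] args) {
        float threshold = 1.0f;
        float low = -2.5f;  // made-up weight of the least loaded node
        float high = 3.0f;  // made-up weight of the most loaded node

        // If the subtraction ends up inverted, the delta is negative and
        // passes the check, so the nodes look balanced when they are not.
        float inverted = signedDelta(low, high);
        System.out.println("inverted delta below threshold: " + lessThan(inverted, threshold));

        // The absolute delta exceeds the threshold regardless of order.
        float abs = absDelta(low, high);
        System.out.println("absolute delta below threshold: " + lessThan(abs, threshold));
    }
}
```

With these numbers the inverted delta is -5.5 and the absolute delta is 5.5, so only the absolute form reports the imbalance. Whether that is exactly what happened in #9023 is still a guess.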
Pinging @elastic/es-distributed
I looked at this again recently. This test now runs in ~1 second on my laptop (after warming up) and CI reports it taking ~3 seconds. That's still a bit much, but nothing like as bad as the 10 seconds reported above. I think we could perhaps replace it with a more generic test that builds up a large-ish cluster with a large-ish number of indices and asserts that the cluster is appropriately balanced at the end. I say this tentatively because it'll be a bit delicate to get an assertion that's usefully strong without being false.
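A rough sketch of the shape such an assertion could take, under heavy assumptions: `BalanceAssertionSketch`, `allocateRoundRobin`, and `isBalanced` are hypothetical names, the round-robin placement merely stands in for the real allocator, and "balanced" is taken to mean that per-node shard counts differ by at most a tolerance. A real test would drive the actual allocation service and would probably need a weaker or more nuanced criterion.

```java
import java.util.Arrays;

public final class BalanceAssertionSketch {

    // Hypothetical stand-in for the allocator: places shard copies on
    // nodes round-robin and returns the shard count per node.
    public static int[] allocateRoundRobin(int nodes, int indices, int copiesPerIndex) {
        int[] shardsPerNode = new int[nodes];
        int next = 0;
        for (int i = 0; i < indices; i++) {
            for (int c = 0; c < copiesPerIndex; c++) {
                shardsPerNode[next]++;
                next = (next + 1) % nodes;
            }
        }
        return shardsPerNode;
    }

    // Assumed balance criterion: the busiest and the idlest node differ
    // by at most `tolerance` shards.
    public static boolean isBalanced(int[] shardsPerNode, int tolerance) {
        int min = Arrays.stream(shardsPerNode).min().orElse(0);
        int max = Arrays.stream(shardsPerNode).max().orElse(0);
        return max - min <= tolerance;
    }

    public static void main(String[] args) {
        // 4 nodes, 1000 indices, 5 primaries plus 1 replica each = 10 copies per index.
        int[] counts = allocateRoundRobin(4, 1000, 10);
        System.out.println(Arrays.toString(counts) + " balanced: " + isBalanced(counts, 1));
    }
}
```

The delicate part is choosing the tolerance: too tight and the assertion fails on legitimate allocations, too loose and it no longer catches the kind of imbalance this test was written for.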
We discussed this in today's team meeting and decided to close this issue. The test is acceptably fast these days, and a number of us have thought about how to generalise or improve it without any success.
Reopening this because over the last 3 years this test has slowed down enormously.
Pinging @elastic/es-distributed (Team:Distributed)
IIRC the shard state the test builds is from a real production cluster, which tripped an edge case in allocation. When I originally created this issue, my desire was to understand the edge case, so that it could be reproduced without such a massive number of shards. @DaveCTurner do you understand the original issue well enough to make a synthetic replacement for this test?
This takes 10 seconds or more, while other allocation tests are almost instantaneous. Can we simplify this? It looks like it tries to do a basic allocation (5 shards, 1 replica) of a new index when a ton of indexes already exist on just 4 nodes. Perhaps we could test similar circumstances without thousands of shards? Alternatively, we could just make this an integration test (leave the impl, but rename to IT). It doesn't really seem like a unit test as it is now.
Also, as a side note, this test is the only user of CatAllocationTestCase. Perhaps we can also eliminate this abstraction and just test directly (eliminating the zipped shard state)? @s1monw do you have any thoughts here?