Failing to upgrade ES from 7.17.10 to 8.9.1: incorrect validation of custom write thread pool size #101206
Labels: >bug, :Distributed/Allocation, :Distributed/Cluster Coordination, Team:Distributed
Elasticsearch Version
7.17.10
Installed Plugins
No response
Java Version
openjdk 20.0.1 2023-04-18
OS Version
Linux es-es-data-1-0 5.4.231-137.341.amzn2.aarch64 #1 SMP Tue Feb 14 21:50:56 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
Problem Description
Background
We found an issue during the upgrade from ES v7.17.10 to v8.9.1.
Initially, the issue seemed to be due to an eck-operator validation defect, but after discussing it with the eck-operator team, they indicated that it is an ES problem (elastic/cloud-on-k8s#7173).
Issue
We cannot set a custom write thread pool size for our data nodes.
Master nodes config:
Data nodes config:
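For illustration, a minimal `elasticsearch.yml` sketch of the setting we are trying to apply to the data nodes (the node role line is an assumption; only the thread pool value is taken from this report):

```yaml
# Data nodes (13 allocated CPU cores): the only custom setting in question
# is the write thread pool size, applied to data nodes only.
node.roles: ["data"]
thread_pool.write.size: 6
```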
The issue only happens when we try to apply
`thread_pool.write.size=6`
for our data nodes. As per the ES documentation, the write thread pool is fixed, with a default size of `# of allocated processors` and a maximum size of `1 + # of allocated processors`. That means that in our case, since we are setting the pool size to 6 only for the data nodes, and the data nodes have 13 CPU cores allocated, it should work.
However, it looks like ES performs this validation against the node with the fewest allocated CPUs in the cluster, regardless of node role, which is not correct.
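The documented limit can be sketched as follows (an illustration of the documented formula, not Elasticsearch source code; the function names are ours):

```python
# Documented rule: the fixed "write" thread pool accepts sizes up to
# 1 + the number of allocated processors on the node.
def max_write_pool_size(allocated_processors: int) -> int:
    return 1 + allocated_processors

def is_valid_write_pool_size(requested: int, allocated_processors: int) -> bool:
    return 1 <= requested <= max_write_pool_size(allocated_processors)

# Data nodes in this cluster have 13 allocated CPUs, so size 6 should pass:
print(is_valid_write_pool_size(6, 13))  # True (limit is 14)

# A 4-CPU master node would only accept sizes up to 5:
print(is_valid_write_pool_size(6, 4))   # False (limit is 5)
```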
We did some tests and, for example,
`thread_pool.write.size=5`
already works. Increasing the CPUs for the master nodes from 4 to 5 also makes `thread_pool.write.size=6`
work. So, to summarize, it seems that the node role should be taken into consideration for this validation. Can you please take a look at that?
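The tests above are consistent with the following hypothetical picture of the validation (the CPU counts are taken from this report; the limit derivation is our inference from the observed behavior, not confirmed against Elasticsearch source):

```python
# Allocated processors per node role, as described in this report.
nodes = {"master": 4, "data": 13}

requested = 6

# Observed behavior: the limit appears to come from the node with the
# fewest allocated CPUs in the cluster, regardless of role.
cluster_wide_limit = 1 + min(nodes.values())  # 1 + 4 = 5, so 6 is rejected

# Expected behavior: the setting targets data nodes only, so their
# processor count should determine the limit.
role_aware_limit = 1 + nodes["data"]          # 1 + 13 = 14, so 6 is accepted

print(requested <= cluster_wide_limit)  # False -> upgrade fails
print(requested <= role_aware_limit)    # True  -> should be allowed
```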
Steps to Reproduce
Upgrade a cluster from 7.17.10 to 8.9.1 with
`thread_pool.write.size=6`
set for data nodes only.
Logs (if relevant)
No response