
Failing to upgrade ES from 7.17.10 to 8.9.1: incorrect validation of custom write thread pool size #101206

Open
gustavosci opened this issue Oct 23, 2023 · 1 comment
Labels
>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team

Comments

@gustavosci

Elasticsearch Version

7.17.10

Installed Plugins

No response

Java Version

openjdk 20.0.1 2023-04-18

OS Version

Linux es-es-data-1-0 5.4.231-137.341.amzn2.aarch64 #1 SMP Tue Feb 14 21:50:56 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Problem Description

Background
We found an issue during the upgrade from ES v7.17.10 to v8.9.1.
Initially, it seemed to be caused by an eck-operator validation defect, but after discussing it with the eck-operator team, they indicated it is an ES problem (elastic/cloud-on-k8s#7173).

Issue
We cannot set a custom thread pool write size for our data nodes.
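For illustration, this is the kind of setting we are applying to the data nodes (an illustrative elasticsearch.yml fragment; the exact layout of our config may differ):

```yaml
# elasticsearch.yml on data nodes only (illustrative fragment)
node.roles: [ data ]
thread_pool.write.size: 6
```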

Master nodes config:

          resources:
            limits:
              cpu: "4"
              memory: 16Gi
            requests:
              cpu: "4"
              memory: 16Gi

Data nodes config:

          resources:
            limits:
              cpu: "13"
              memory: 57Gi
            requests:
              cpu: "13"
              memory: 57Gi

The issue only happens when we try to apply thread_pool.write.size=6 for our data nodes.
As per the ES documentation, the write thread pool is fixed with a size of # of allocated processors, and the maximum size for this pool is 1 + # of allocated processors. This means that, in our case, setting the pool size to 6 only for data nodes should work, since the data nodes have 13 CPU cores allocated.
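The documented per-node bound can be sketched as follows (an illustrative Python sketch of the rule from the docs, not Elasticsearch's actual validation code; function names are ours):

```python
# Documented bound for the fixed "write" thread pool: default size is the
# number of allocated processors, maximum is 1 + allocated processors.

def max_write_pool_size(allocated_processors: int) -> int:
    """Upper bound for thread_pool.write.size on a single node."""
    return 1 + allocated_processors

def is_valid_write_size(requested: int, allocated_processors: int) -> bool:
    """True if the requested pool size is within the documented per-node bound."""
    return 1 <= requested <= max_write_pool_size(allocated_processors)

# A data node with 13 allocated CPUs should accept size 6 (max is 14):
print(is_valid_write_size(6, 13))  # True
# A master node with 4 allocated CPUs would reject 6 (max is 5):
print(is_valid_write_size(6, 4))   # False
```

By this rule, validating size 6 against the data nodes alone should succeed.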

However, ES appears to validate against the node with the fewest CPUs allocated in the cluster, regardless of node type, which is not correct.
We ran some tests: for example, thread_pool.write.size=5 already works, and increasing the master nodes' CPUs from 4 to 5 also makes thread_pool.write.size=6 work.

So, to summarize, it seems the node type should be taken into consideration for this validation. Can you please take a look at this?

Steps to Reproduce

  • Set up a cluster with 3 master nodes and 3 data nodes
    • ES version v7.17.10.
    • Each master node has 4 CPUs
    • Each data node has 13 CPUs
    • Set thread_pool.write.size=6 for data nodes only.
  • Try to upgrade the cluster to v8.9.1 using eck-operator

Logs (if relevant)

No response

@gustavosci gustavosci added >bug needs:triage Requires assignment of a team area label labels Oct 23, 2023
@JVerwolf JVerwolf added :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed needs:triage Requires assignment of a team area label labels Oct 27, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Oct 27, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)
