Only send diff of cluster state instead of full cluster state #6295

Closed
bluelu opened this issue May 23, 2014 · 14 comments

bluelu commented May 23, 2014

If you have many nodes and many indices, even a small cluster state will trigger 500 MB of data to be sent from the master to all nodes. A few small rebalancing operations will kill the cluster.

This is a big issue.

It would be good if only the diff could be sent and then merged on the receiving node. If a node isn't at the previous version, fall back to the current behaviour and send the full state.
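
For illustration, a rough sketch of that idea (hypothetical names, not Elasticsearch code): the master remembers the last version each node acknowledged, sends a diff when the node is on the previous version, and falls back to the full state otherwise.

```python
# Hypothetical sketch of diff-based publishing with a full-state fallback.
# None of these names are Elasticsearch APIs.

class ClusterStatePublisher:
    def __init__(self, transport):
        self.transport = transport    # whatever sends a payload to a node
        self.acked = {}               # node_id -> last cluster state version it acked

    def publish(self, prev_state, new_state, node_ids):
        for node in node_ids:
            if self.acked.get(node) == prev_state.version:
                # Node has the previous version: a diff is enough.
                payload = ("diff", prev_state.version, make_diff(prev_state, new_state))
            else:
                # Node missed an update or just joined: send the full state.
                payload = ("full", new_state.version, new_state)
            self.transport.send(node, payload)
            self.acked[node] = new_state.version  # assume the node acks, for brevity

def make_diff(old, new):
    # Placeholder: any representation that lets a node holding `old`
    # reconstruct `new` (per-section deltas, a text patch, ...).
    raise NotImplementedError
```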

kimchy (Member) commented Jul 4, 2014

@bluelu which version are you using? I haven't seen a compressed cluster state that is 500 MB; we compress the cluster state in our infrastructure when we publish it, among other things.

Changing to send deltas is an option of course, but it is a really big change, since we rely on the full cluster state being continuously published.

I have helped several users with large cluster states and improved things internally. The size was not the issue; it was things like inefficient processing of a large cluster state.

bluelu (Author) commented Jul 4, 2014

We are currently using elasticsearch-1.0.2.

{
  "cluster_name" : "abc",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 505,
  "number_of_data_nodes" : 489,
  "active_primary_shards" : 8667,
  "active_shards" : 17334,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
The uncompressed state is 2 MB (curl -XGET 'http://localhost:9200/_cluster/state') and compressed about 700 KB.

We have more than 500 nodes in the cluster, so each update (e.g. when a shard gets rebalanced) causes the state to be sent to all nodes, which comes to close to 500 MB in total. This takes a few seconds to complete and takes up all the bandwidth.
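
For reference, the back-of-the-envelope arithmetic behind those numbers, using the sizes and node count quoted above:

```python
# Rough fan-out per cluster state update, using the figures quoted above.
nodes = 505 - 1          # the master publishes to every other node
uncompressed_mb = 2.0    # size of the _cluster/state JSON response
compressed_mb = 0.7      # roughly 700 KB after compression

print(f"uncompressed fan-out: ~{nodes * uncompressed_mb:.0f} MB per update")  # ~1008 MB
print(f"compressed fan-out:   ~{nodes * compressed_mb:.0f} MB per update")    # ~353 MB
```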

I can also send you the complete cluster state in private if that helps.

bluelu (Author) commented Sep 22, 2014

@kimchy

Do you have any update on whether this will be integrated or not?

We also plan on using query warmers in the future, and those updates will then also need to be dispatched to all the nodes.

Creating a text diff of the state (with a diff algorithm) and then applying it on the non-master nodes, with a fallback to the full cluster state if the previous and current versions differ, might not be the cleanest solution, but it certainly shouldn't be all too difficult, as there is already code where the master waits for the nodes to confirm the updated cluster state.
I agree that a cleaner solution than the current full updates requires more work, though.

This would dramatically reduce traffic and potentially also the heap usage on the master node in larger clusters.
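
A minimal illustration of that text-diff approach, using Python's difflib on a JSON rendering of two toy state versions (the real cluster state is a binary structure, so this is only a sketch of the idea, not how Elasticsearch would implement it):

```python
import difflib
import json

def state_patch(old_state, new_state, old_version, new_version):
    """Unified diff between two JSON-rendered cluster states."""
    old_lines = json.dumps(old_state, indent=1, sort_keys=True).splitlines(keepends=True)
    new_lines = json.dumps(new_state, indent=1, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=str(old_version),
                                        tofile=str(new_version)))

old = {"routing": {"index_a": ["node1", "node2"], "index_b": ["node3", "node4"]}}
new = {"routing": {"index_a": ["node1", "node5"], "index_b": ["node3", "node4"]}}

# Only the changed lines would go over the wire; a node that is not on the
# old version would instead fall back to receiving the full state.
print(state_patch(old, new, 991, 992))
```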

kimchy (Member) commented Sep 22, 2014

@bluelu no work has happened on this yet. A diff is one of the options, though generating the diff is one of the tricky parts here (we could do it more easily when we work on the object model of the cluster state, by the way). Indeed, we could send a diff, and if the receiving node can't apply it due to changes, it can then request the full cluster state to be sent (with careful checks not to create a storm of full updates that are not needed).
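
One hypothetical shape those careful checks could take: a node that fails to apply a diff rate-limits how often it falls back to requesting the full state, so a burst of unappliable diffs cannot turn into a storm of full-state transfers. A sketch only, not Elasticsearch code:

```python
import time

class FullStateFallbackGuard:
    """Hypothetical receiver-side guard: allow a full-cluster-state request
    at most once per interval, even if several incoming diffs fail to apply
    in a row."""

    def __init__(self, min_interval_seconds=30.0):
        self.min_interval = min_interval_seconds
        self.last_request = float("-inf")

    def may_request_full_state(self):
        now = time.monotonic()
        if now - self.last_request >= self.min_interval:
            self.last_request = now
            return True
        return False  # wait; the next full publish will resynchronize us anyway
```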

It would be interesting to see how things work in more recent versions. Regarding your cluster state size, note that we do two things when publishing the cluster state: we serialize it using our internal serialization mechanism, so comparing it to the JSON representation is not a real comparison, as it's considerably smaller, and then we compress it. Also, recent versions have better logic for applying cluster states on the receiving nodes.
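
To get a rough feel for what a single publish costs on the wire, the REST JSON can be compressed locally; since the internal binary serialization is smaller than JSON, this gives an upper bound (host and port assumed to be the usual localhost:9200):

```python
import gzip
import urllib.request

# Fetch the full cluster state over the REST API and compress it locally.
# The JSON form is larger than the internal serialization, so these numbers
# are an upper bound on what a single publish sends to one node.
with urllib.request.urlopen("http://localhost:9200/_cluster/state") as resp:
    raw = resp.read()

print(f"json size:       {len(raw) / 1024:.0f} KB")
print(f"gzip-compressed: {len(gzip.compress(raw)) / 1024:.0f} KB")
```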

Also, the upcoming 1.4, with the new zen discovery work that is slated for it, has much better support for clusters with many nodes. This will help as well.

The diff is something that would be interesting to explore in future versions; I am mainly trying to assess the urgency of it here, as in: is it really a problem in your use case?

bluelu (Author) commented Sep 23, 2014

@kimchy

We throttled the bandwidth on our master node so that it doesn't take all the bandwidth on the switch. It's not optimal, but it seems to work for now. Before, we had spikes of 1 gigabit over multiple seconds blocking all other connections on that switch (one stream to each node). Now it just takes a little longer with less bandwidth.

Apart from the slow restart (which we mitigated by not starting all nodes at once; we will check the behaviour on 1.3), it's the last of the bigger issues we have seen so far on larger clusters.

For us it's a problem, but it "works" with our workaround. Still, we would love to see this in >1.5 if possible.

We will upgrade to the newest 1.3.* in 2 weeks and then to 1.4.* (when it's stable for sure) and report back then.

kimchy (Member) commented Sep 23, 2014

@bluelu thanks for the feedback! I have another question: when you saw the 1 Gbit saturation, was that when the cluster was forming (since there are a lot of cluster state updates happening then, which we reduced significantly in the upcoming 1.4)? If you issue a simple reroute after the cluster has formed, what do you see then?

bluelu (Author) commented Sep 23, 2014

@kimchy
I described this in another issue: #6372

The cluster forms without any issues, and we didn't check the traffic then. It doesn't take much time at all until all nodes are added.

But it then takes several more minutes until the cluster state jumps from 0 unassigned shards to, let's say, 5000 shards. After the first update it gets faster; the more accurate the unassigned shards number is, the faster it seems to update. It's a little scary the first time, as one might expect that all data is lost ;-).
If we add the non-SSD servers, it takes even longer (up to 10-15 minutes until the counter goes up); this might be related to slower storage, more indexed shards, or simply more nodes. We will try to test this with the new version to identify the exact cause.
During that phase we observed no traffic spikes (I guess because the state is just smaller, as no shards are allocated yet).

The traffic spikes occur when shards are being rebalanced between nodes, when we update a mapping, or when we do some other operation which requires the state to be sent.

We actually found the bandwidth issue because we had custom rebalancing code which shuffled indices back and forth between 2 nodes, and we wondered why everything felt so slow. That should be equivalent to a reroute command?

Here are more details about the slow startup (master and non-master logs):
https://gist.github.com/bluelu/3a9a9f7b629c3adb7e89
A colleague of mine added more information in that issue.

bluelu (Author) commented Dec 6, 2014

FYI, we upgraded to 1.4.1 today.

We are still seeing a lot of load on the network interface when the master node sends the cluster state at startup time (about one update every minute).

I also observed that it takes some time for the cluster state to reach the nodes in that case (about 20 seconds...). I don't know if this will cause an issue in the future; hopefully not.

Node:
[2014-12-06 15:28:29,273][DEBUG][cluster.service ] [node] set local cluster state to version 991 {elasticsearch[I80N5][clusterService#updateTask][T#1]}
[2014-12-06 15:29:11,667][DEBUG][cluster.service ] [node] set local cluster state to version 992 {elasticsearch[I80N5][clusterService#updateTask][T#1]}

Master node:
[2014-12-06 15:28:12,758][DEBUG][cluster.service ] [master] set local cluster state to version 991 {elasticsearch[I51N16][clusterService#updateTask][T#1]}
[2014-12-06 15:28:54,976][DEBUG][cluster.service ] [master] set local cluster state to version 992 {elasticsearch[I51N16][clusterService#updateTask][T#1]}

clintongormley removed the "help wanted" and "adoptme" labels Dec 30, 2014
imotov added a commit to imotov/elasticsearch that referenced this issue Jan 9, 2015
imotov added a commit to imotov/elasticsearch that referenced this issue Feb 24, 2015
Refactor how settings filters are handled. Instead of specifying settings filters as a filtering class, settings filters are now specified as a list of settings that need to be filtered out. Regex syntax is supported. This is a breaking change and will require a small change in plugins that use settingsFilters. This change is needed in order to simplify the cluster state diff implementation.

Contributes to elastic#6295
clintongormley mentioned this issue Mar 3, 2015
@ywelsch-t

Hey @imotov, I'm working with @bluelu.

A more immediate patch to reduce the cluster state size, which we plan to apply, looks as follows: since we have many indices with the same mapping, deduplicating the index mappings reduces the cluster state to a third of its size.
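
A sketch of that deduplication idea (hypothetical, not the actual patch): identical mapping documents are stored once, keyed by a content hash, and each index only carries the hash.

```python
import hashlib
import json

def dedupe_mappings(index_mappings):
    """index_mappings: index name -> mapping dict.
    Returns (refs, shared): refs maps index name -> mapping hash,
    shared maps hash -> the single stored copy of that mapping.
    Hypothetical sketch, not the patch that was actually applied."""
    shared, refs = {}, {}
    for name, mapping in index_mappings.items():
        key = hashlib.sha1(
            json.dumps(mapping, sort_keys=True).encode("utf-8")
        ).hexdigest()
        shared.setdefault(key, mapping)
        refs[name] = key
    return refs, shared

# With many indices sharing one mapping, the mapping body is kept once
# instead of once per index, which is how a roughly 3x reduction can arise.
```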

imotov added a commit to imotov/elasticsearch that referenced this issue Mar 23, 2015
First iteration of cluster state diffs that adds support for diffs to the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 27, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov closed this as completed in d746e14 Apr 27, 2015
s1monw (Contributor) commented Apr 27, 2015

We had a boatload of failures related to this, so I branched off feature/cluster_state_diffs and reverted the commit from master.

s1monw reopened this Apr 27, 2015
imotov added a commit that referenced this issue Apr 28, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes #6295
s1monw pushed a commit to s1monw/elasticsearch that referenced this issue Apr 29, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov closed this as completed in 478c253 Apr 29, 2015
@tmc24win

I was wondering if there are plans to further address cluster state size in an upcoming 2.x release. We currently need to have thousands of small indices to best support our data flow. Right now we have almost 3000 indices, and the GET request for the cluster state takes around 40 seconds, which effectively makes most admin tools (HQ, Head) non-functional.

imotov (Contributor) commented Mar 14, 2016

@DrGonzo424 this change (which was merged into 2.0.0) only improves node-to-node communication, which works over the transport protocol. HQ and Head use the REST API and will not be able to take advantage of cluster state diffs.

@tmc24win

Thanks for the quick reply. It is great to know that node-to-node communication will benefit from cluster state diffs, as we are currently on 2.2. Do you know if any of the admin tools do a better job of paging or caching the cluster state, to improve usability? Right now it makes me a bit uncomfortable that our administrative tools are not scaling with our cluster. Perhaps there are better admin tools out there.

imotov (Contributor) commented Mar 15, 2016

@DrGonzo424 I think https://discuss.elastic.co/ would be a much better place to ask this question. My favorite admin tool is the _cat API; it works fast even on large cluster states and does everything I need, but it might be too minimalistic for your purposes.
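
For example, a few _cat endpoints that stay compact even with thousands of indices (queried via Python's urllib purely for illustration; curl works just as well):

```python
import urllib.request

# The ?v flag adds a header row; h= picks columns on _cat/indices.
# Host and port are assumed to be the usual localhost:9200.
endpoints = [
    "/_cat/health?v",
    "/_cat/nodes?v",
    "/_cat/indices?v&h=index,docs.count,store.size",
]
for path in endpoints:
    with urllib.request.urlopen("http://localhost:9200" + path) as resp:
        print(resp.read().decode("utf-8"))
```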
