Only send diff of cluster state instead of full cluster state #6295

Closed
bluelu opened this issue May 23, 2014 · 14 comments

bluelu commented May 23, 2014

If you have many nodes and many indices, even a small cluster state will trigger 500 MB of data to be sent from the master to all nodes. A few small rebalancing operations will kill the cluster.

This is a big issue.

It would be good if only the diff could be sent and then merged on the receiving node. If a node isn't at the previous version, fall back to the current behaviour and send the full state.
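
For illustration, a rough sketch of that idea (hypothetical names, not Elasticsearch code): the master remembers the last version each node acknowledged, sends a diff when the node is on the previous version, and falls back to the full state otherwise.

```python
# Hypothetical sketch of diff-based publishing with a full-state fallback.
# None of these names are Elasticsearch APIs.

class ClusterStatePublisher:
    def __init__(self, transport):
        self.transport = transport    # whatever sends a payload to a node
        self.acked = {}               # node_id -> last cluster state version it acked

    def publish(self, prev_state, new_state, node_ids):
        for node in node_ids:
            if self.acked.get(node) == prev_state.version:
                # Node has the previous version: a diff is enough.
                payload = ("diff", prev_state.version, make_diff(prev_state, new_state))
            else:
                # Node missed an update or just joined: send the full state.
                payload = ("full", new_state.version, new_state)
            self.transport.send(node, payload)
            self.acked[node] = new_state.version  # assume the node acks, for brevity

def make_diff(old, new):
    # Placeholder: any representation that lets a node holding `old`
    # reconstruct `new` (per-section deltas, a text patch, ...).
    raise NotImplementedError
```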

kimchy (Member) commented Jul 4, 2014

@bluelu which version are you using? I haven't seen a compressed cluster state that is 500 MB; we compress the cluster state in our infrastructure when we publish it, among other things.

Changing to send deltas is an option of course, but it is a really big change, since we rely on the full cluster state being continuously published.

I have helped several users with large cluster states and improved things internally. The size was not the issue; it was things like inefficient processing of a large cluster state.

bluelu (Author) commented Jul 4, 2014

We are currently using elasticsearch-1.0.2.

{
  "cluster_name" : "abc",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 505,
  "number_of_data_nodes" : 489,
  "active_primary_shards" : 8667,
  "active_shards" : 17334,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
The uncompressed state is 2 MB (curl -XGET 'http://localhost:9200/_cluster/state') and compressed about 700 KB.

We have more than 500 nodes in the cluster, so each update (e.g. when a shard gets rebalanced) causes the state to be sent to all nodes, which comes to close to 500 MB in total. This takes a few seconds to complete and takes up all the bandwidth.
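
For reference, the back-of-the-envelope arithmetic behind those numbers, using the sizes and node count quoted above:

```python
# Rough fan-out per cluster state update, using the figures quoted above.
nodes = 505 - 1          # the master publishes to every other node
uncompressed_mb = 2.0    # size of the _cluster/state JSON response
compressed_mb = 0.7      # roughly 700 KB after compression

print(f"uncompressed fan-out: ~{nodes * uncompressed_mb:.0f} MB per update")  # ~1008 MB
print(f"compressed fan-out:   ~{nodes * compressed_mb:.0f} MB per update")    # ~353 MB
```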

I can also send you the complete cluster state in private if that helps.

bluelu (Author) commented Sep 22, 2014

@kimchy

Do you have any update on whether this will be integrated or not?

We also plan on using query warmers in the future, and those updates will then also need to be dispatched to all the nodes.

Creating a text diff of the state (with a diff algorithm) and then applying it on the non-master nodes, with a fallback to the full cluster state if the previous and current versions differ, might not be the cleanest solution, but it certainly shouldn't be all too difficult, as there is already code where the master waits for the nodes to confirm the updated cluster state.
I agree that a cleaner solution than the current full updates requires more work, though.

This would dramatically reduce traffic and potentially also the heap usage on the master node in larger clusters.
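
A minimal illustration of that text-diff approach, using Python's difflib on a JSON rendering of two toy state versions (the real cluster state is a binary structure, so this is only a sketch of the idea, not how Elasticsearch would implement it):

```python
import difflib
import json

def state_patch(old_state, new_state, old_version, new_version):
    """Unified diff between two JSON-rendered cluster states."""
    old_lines = json.dumps(old_state, indent=1, sort_keys=True).splitlines(keepends=True)
    new_lines = json.dumps(new_state, indent=1, sort_keys=True).splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile=str(old_version),
                                        tofile=str(new_version)))

old = {"routing": {"index_a": ["node1", "node2"], "index_b": ["node3", "node4"]}}
new = {"routing": {"index_a": ["node1", "node5"], "index_b": ["node3", "node4"]}}

# Only the changed lines would go over the wire; a node that is not on the
# old version would instead fall back to receiving the full state.
print(state_patch(old, new, 991, 992))
```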

kimchy (Member) commented Sep 22, 2014

@bluelu no work has happened on this yet. A diff is one of the options, though generating the diff is one of the tricky parts here (we could do it more easily when we work on the object model of the cluster state, by the way). Indeed, we could send a diff, and if the receiving node can't apply it due to changes, it can then request the full cluster state to be sent (with careful checks not to create a storm of full updates that are not needed).
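
One hypothetical shape those careful checks could take: a node that fails to apply a diff rate-limits how often it falls back to requesting the full state, so a burst of unappliable diffs cannot turn into a storm of full-state transfers. A sketch only, not Elasticsearch code:

```python
import time

class FullStateFallbackGuard:
    """Hypothetical receiver-side guard: allow a full-cluster-state request
    at most once per interval, even if several incoming diffs fail to apply
    in a row."""

    def __init__(self, min_interval_seconds=30.0):
        self.min_interval = min_interval_seconds
        self.last_request = float("-inf")

    def may_request_full_state(self):
        now = time.monotonic()
        if now - self.last_request >= self.min_interval:
            self.last_request = now
            return True
        return False  # wait; the next full publish will resynchronize us anyway
```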

It would be interesting to see how things work in more recent versions. Regarding your cluster state size, note that we do two things when publishing the cluster state: we serialize it using our internal serialization mechanism, so comparing it to the JSON representation is not a real comparison, as it's considerably smaller, and then we compress it. Also, recent versions have better logic for applying cluster states on the receiving nodes.
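
To get a rough feel for what a single publish costs on the wire, the REST JSON can be compressed locally; since the internal binary serialization is smaller than JSON, this gives an upper bound (host and port assumed to be the usual localhost:9200):

```python
import gzip
import urllib.request

# Fetch the full cluster state over the REST API and compress it locally.
# The JSON form is larger than the internal serialization, so these numbers
# are an upper bound on what a single publish sends to one node.
with urllib.request.urlopen("http://localhost:9200/_cluster/state") as resp:
    raw = resp.read()

print(f"json size:       {len(raw) / 1024:.0f} KB")
print(f"gzip-compressed: {len(gzip.compress(raw)) / 1024:.0f} KB")
```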

Also, the upcoming 1.4, with the new zen discovery work that is slated for it, has much better support for clusters with many nodes. This will help as well.

The diff is something that would be interesting to explore in future versions; I am mainly trying to assess the urgency of it here, as in: is it really a problem in your use case?

bluelu (Author) commented Sep 23, 2014

@kimchy

We throttled the bandwidth on our master node so that it doesn't take all the bandwidth on the switch. It's not optimal, but it seems to work for now. Before, we had spikes of 1 gigabit over multiple seconds blocking all other connections on that switch (one stream to each node). Now it just takes a little longer with less bandwidth.

Apart from the slow restart (which we mitigated by not starting all nodes at once; we will check the behaviour on 1.3), it's the last of the bigger issues we have seen so far on larger clusters.

For us it's a problem, but it "works" with our workaround. Still, we would love to see this in >1.5 if possible.

We will upgrade to the newest 1.3.* in 2 weeks and then to 1.4.* (when it's stable for sure) and report back then.

kimchy (Member) commented Sep 23, 2014

@bluelu thanks for the feedback! I have another question: when you saw the 1 Gbit saturation, was that when the cluster was forming (since there are a lot of cluster state updates happening then, which we reduced significantly in the upcoming 1.4)? If you issue a simple reroute after the cluster has formed, what do you see then?

bluelu (Author) commented Sep 23, 2014

@kimchy
I described this in another issue: #6372

The cluster forms without any issues, and we didn't check the traffic then. It doesn't take much time at all until all nodes are added.

But it then takes several more minutes until the cluster state jumps from 0 unassigned shards to, let's say, 5000 shards. After the first update it gets faster; the more accurate the unassigned shards number is, the faster it seems to update. It's a little scary the first time, as one might expect that all data is lost ;-).
If we add the non-SSD servers, it takes even longer (up to 10-15 minutes until the counter goes up); this might be related to slower storage, more indexed shards, or simply more nodes. We will try to test this with the new version to identify the exact cause.
During that phase we observed no traffic spikes (I guess because the state is just smaller, as no shards are allocated yet).

The traffic spikes occur when shards are being rebalanced between nodes, when we update a mapping, or when we do some other operation which requires the state to be sent.

We actually found the bandwidth issue because we had custom rebalancing code which shuffled indices back and forth between 2 nodes, and we wondered why everything felt so slow. That should be equivalent to a reroute command?

Here are more details about the slow startup (master and non-master logs):
https://gist.github.com/bluelu/3a9a9f7b629c3adb7e89
A colleague of mine added more information in that issue.

bluelu (Author) commented Dec 6, 2014

FYI, we upgraded to 1.4.1 today.

We are still seeing a lot of load on the network interface when the master node sends the cluster state at startup time (about one update every minute).

I also observed that it takes some time for the cluster state to reach the nodes in that case (about 20 seconds...). I don't know if this will cause an issue in the future; hopefully not.

Node:
[2014-12-06 15:28:29,273][DEBUG][cluster.service ] [node] set local cluster state to version 991 {elasticsearch[I80N5][clusterService#updateTask][T#1]}
[2014-12-06 15:29:11,667][DEBUG][cluster.service ] [node] set local cluster state to version 992 {elasticsearch[I80N5][clusterService#updateTask][T#1]}

Master node:
[2014-12-06 15:28:12,758][DEBUG][cluster.service ] [master] set local cluster state to version 991 {elasticsearch[I51N16][clusterService#updateTask][T#1]}
[2014-12-06 15:28:54,976][DEBUG][cluster.service ] [master] set local cluster state to version 992 {elasticsearch[I51N16][clusterService#updateTask][T#1]}

clintongormley removed the "help wanted" and "adoptme" labels Dec 30, 2014
imotov added a commit to imotov/elasticsearch that referenced this issue Jan 9, 2015
imotov added a commit to imotov/elasticsearch that referenced this issue Feb 24, 2015
Refactor how settings filters are handled. Instead of specifying settings filters as a filtering class, settings filters are now specified as a list of settings that need to be filtered out. Regex syntax is supported. This is a breaking change and will require a small change in plugins that use settingsFilters. This change is needed in order to simplify the cluster state diff implementation.

Contributes to elastic#6295
clintongormley mentioned this issue Mar 3, 2015
@ywelsch-t

Hey @imotov, I'm working with @bluelu.

A more immediate patch to reduce the cluster state size, which we plan to apply, looks as follows: since we have many indices with the same mapping, deduplicating the index mappings reduces the cluster state to a third of its size.
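
A sketch of that deduplication idea (hypothetical, not the actual patch): identical mapping documents are stored once, keyed by a content hash, and each index only carries the hash.

```python
import hashlib
import json

def dedupe_mappings(index_mappings):
    """index_mappings: index name -> mapping dict.
    Returns (refs, shared): refs maps index name -> mapping hash,
    shared maps hash -> the single stored copy of that mapping.
    Hypothetical sketch, not the patch that was actually applied."""
    shared, refs = {}, {}
    for name, mapping in index_mappings.items():
        key = hashlib.sha1(
            json.dumps(mapping, sort_keys=True).encode("utf-8")
        ).hexdigest()
        shared.setdefault(key, mapping)
        refs[name] = key
    return refs, shared

# With many indices sharing one mapping, the mapping body is kept once
# instead of once per index, which is how a roughly 3x reduction can arise.
```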

imotov added a commit to imotov/elasticsearch that referenced this issue Mar 23, 2015
First iteration of cluster state diffs that adds support for diffs to the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 27, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov closed this as completed in d746e14 Apr 27, 2015
s1monw (Contributor) commented Apr 27, 2015

We had a boatload of failures related to this, so I branched off feature/cluster_state_diffs and reverted the commit from master.

s1monw reopened this Apr 27, 2015
imotov added a commit that referenced this issue Apr 28, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes #6295
s1monw pushed a commit to s1monw/elasticsearch that referenced this issue Apr 29, 2015
Adds support for calculating and sending diffs instead of full cluster state of the most frequently changing elements - cluster state, meta data and routing table.

Closes elastic#6295
imotov closed this as completed in 478c253 Apr 29, 2015
@tmc24win

I was wondering if there are plans to further address cluster state size in an upcoming 2.x release. We currently need to have thousands of small indices to best support our data flow. Right now we have almost 3000 indices, and the GET request for the cluster state takes around 40 seconds, which effectively makes most admin tools (HQ, Head) non-functional.

imotov (Contributor) commented Mar 14, 2016

@DrGonzo424 this change (which was merged into 2.0.0) only improves node-to-node communication, which works over the transport protocol. HQ and Head use the REST API and will not be able to take advantage of cluster state diffs.

@tmc24win

Thanks for the quick reply. It is great to know that node-to-node communication will benefit from cluster state diffs, as we are currently on 2.2. Do you know if any of the admin tools do a better job of paging or caching the cluster state, to improve usability? Right now it makes me a bit uncomfortable that our administrative tools are not scaling with our cluster. Perhaps there are better admin tools out there.

imotov (Contributor) commented Mar 15, 2016

@DrGonzo424 I think https://discuss.elastic.co/ would be a much better place to ask this question. My favorite admin tool is the _cat API; it works fast even on large cluster states and does everything I need, but it might be too minimalistic for your purposes.
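
For example, a few _cat endpoints that stay compact even with thousands of indices (queried via Python's urllib purely for illustration; curl works just as well):

```python
import urllib.request

# The ?v flag adds a header row; h= picks columns on _cat/indices.
# Host and port are assumed to be the usual localhost:9200.
endpoints = [
    "/_cat/health?v",
    "/_cat/nodes?v",
    "/_cat/indices?v&h=index,docs.count,store.size",
]
for path in endpoints:
    with urllib.request.urlopen("http://localhost:9200" + path) as resp:
        print(resp.read().decode("utf-8"))
```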
