
Internal: Upgrade caused shard data to stay on nodes #7386

Closed
nik9000 opened this issue Aug 21, 2014 · 38 comments

@nik9000
Member

nik9000 commented Aug 21, 2014

Upgrade caused shard data to stay on nodes even after it isn't useful any more.

This comes from https://groups.google.com/forum/#!topic/elasticsearch/Mn1N0xmjsL8

What I did:
Started upgrading from Elasticsearch 1.2.1 to Elasticsearch 1.3.2. For each of the 6 nodes I updated (see the sketch after this list):

  • Set allocation to primaries only
  • Sync new plugins into place
  • Update deb package
  • Restart Elasticsearch
  • Wait for Elasticsearch to respond on the local host
  • Set allocation to all
  • Wait for Elasticsearch to report GREEN
  • Sleep for half an hour so the cluster can rebalance itself a bit
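
The actual script isn't posted in this issue, so here is a rough, hedged sketch of how the per-node loop above could be automated against the cluster's REST API. The localhost:9200 endpoint, the use of the cluster.routing.allocation.enable setting, and the wait/sleep details are assumptions, not the author's script; the plugin/package/restart steps are only marked as comments.

    // Hedged sketch of the per-node upgrade loop described above; endpoint, setting name,
    // and timings are assumptions, not the author's actual script.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RollingUpgradeSketch {
        static final String ES = "http://localhost:9200";

        public static void main(String[] args) throws Exception {
            // 1. Set allocation to primaries only.
            putTransientSetting("cluster.routing.allocation.enable", "primaries");
            // 2-4. Sync plugins, update the deb package, restart Elasticsearch (done outside this sketch).
            // 5. Wait for Elasticsearch to respond on the local host.
            waitForHttp200(ES + "/");
            // 6. Set allocation back to all.
            putTransientSetting("cluster.routing.allocation.enable", "all");
            // 7. Wait for the cluster to report GREEN.
            waitForHttp200(ES + "/_cluster/health?wait_for_status=green&timeout=10m");
            // 8. Sleep for half an hour so the cluster can rebalance itself a bit.
            Thread.sleep(30L * 60L * 1000L);
        }

        static void putTransientSetting(String key, String value) throws Exception {
            String body = "{\"transient\":{\"" + key + "\":\"" + value + "\"}}";
            HttpURLConnection c = (HttpURLConnection) new URL(ES + "/_cluster/settings").openConnection();
            c.setRequestMethod("PUT");
            c.setDoOutput(true);
            try (OutputStream out = c.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            if (c.getResponseCode() >= 300) {
                throw new IllegalStateException("settings update failed: HTTP " + c.getResponseCode());
            }
        }

        static void waitForHttp200(String url) throws Exception {
            while (true) {
                try {
                    HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
                    if (c.getResponseCode() == 200) {
                        return;
                    }
                } catch (Exception e) {
                    // Node still restarting or unreachable; retry below.
                }
                Thread.sleep(5000L);
            }
        }
    }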

What happened:
The new version of Elasticsearch came up but didn't remove all the shard data it could no longer use. This picture from Whatson shows the problem pretty well:
https://wikitech.wikimedia.org/wiki/File:Whatson_out_of_disk.png
The nodes on the left were upgraded; blue is disk usage by Elasticsearch and brown is "other" disk usage.

When I dig around on the filesystem, all the space usage is in the shard storage directory (/var/lib/elasticsearch/production-search-eqiad/nodes/0/indices). But when I compare the list of open files to the list of files on the filesystem with this, I see that whole directories are just sitting around, unused. Hitting /_cat/shards/<directory_name> corroborates that the shard in the directory isn't on the node. Oddly, if we keep poking around we find open files in directories representing shards that we don't expect to be on the node either....
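
The comparison script referred to above as "with this" isn't linked in this copy of the issue, so here is a hedged stand-in sketch of the idea: list the shard directories that exist on disk and flag the ones /_cat/shards doesn't report as allocated to this node. The data path, the h=index,shard,node columns, and node names without spaces are all assumptions.

    // Hedged sketch: flag shard directories on disk that the cluster doesn't think belong
    // to this node. Paths, _cat columns, and space-free node names are assumptions.
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.HashSet;
    import java.util.Set;

    public class OrphanShardDirs {
        public static void main(String[] args) throws Exception {
            String nodeName = args[0]; // the node name as shown by /_cat/nodes
            File indicesDir = new File("/var/lib/elasticsearch/production-search-eqiad/nodes/0/indices");

            // Shards the cluster thinks this node holds, keyed as "index/shard".
            Set<String> expected = new HashSet<String>();
            URL url = new URL("http://localhost:9200/_cat/shards?h=index,shard,node");
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.trim().split("\\s+");
                if (cols.length >= 3 && cols[2].equals(nodeName)) {
                    expected.add(cols[0] + "/" + cols[1]);
                }
            }
            in.close();

            // Shard directories actually present on disk (in 1.x the directory names are the
            // index name and the shard number).
            File[] indexDirs = indicesDir.listFiles();
            if (indexDirs == null) {
                throw new IllegalStateException("not a directory: " + indicesDir);
            }
            for (File index : indexDirs) {
                File[] shardDirs = index.listFiles();
                if (shardDirs == null) {
                    continue;
                }
                for (File shard : shardDirs) {
                    if (!shard.getName().matches("\\d+")) {
                        continue; // skip _state and anything that isn't a shard directory
                    }
                    String key = index.getName() + "/" + shard.getName();
                    if (!expected.contains(key)) {
                        System.out.println("on disk but not allocated here: " + key);
                    }
                }
            }
        }
    }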

What we're doing now:
We're going to try restarting the upgrade and blasting the data directory on the node as we upgrade it.

Reproduction steps:
No idea. And I'm a bit afraid to keep pushing things on our cluster with it in the state that it is in.

@s1monw
Contributor

s1monw commented Aug 21, 2014

Could this be related to #6692? Did you upgrade all nodes to 1.3, or do you still have nodes < 1.3.0 in the cluster?

@nik9000
Member Author

nik9000 commented Aug 21, 2014

Only about 1/3 of the nodes were upgraded before we got warnings about disk space.

@s1monw
Contributor

s1monw commented Aug 21, 2014

I guess it's not freeing the space unless an upgraded node holds a copy of the shard. That is new in 1.3 and I'm still trying to remember what the background was. Can you check if that assumption is true: are the shards that are not deleted allocated on old nodes?

@nik9000
Member Author

nik9000 commented Aug 21, 2014

Well, this is almost certainly the cause:

            // If all nodes have been upgraded to >= 1.3.0 at some point we get back here and have the chance to
            // run this api. (when cluster state is then updated)
            if (node.getVersion().before(Version.V_1_3_0)) {
                logger.debug("Skip deleting deleting shard instance [{}], a node holding a shard instance is < 1.3.0", shardRouting);
                return false;
            }

1.3 won't delete stuff from the disks until the whole cluster is 1.3. That's ugly. I run with disks 50% full and the upgrade process almost filled them just with shuffling.

Side note: if the shards are still in the routing table it'd be nice to see them. Right now they seem to be invisible to the _cat API.

@s1monw
Contributor

s1monw commented Aug 21, 2014

@nik9000 this was a temporary thing to add extra safety. The amount of leftover data will get lower the more nodes you upgrade. I agree we could expose some more info here if stuff is still on disk.

@nik9000
Member Author

nik9000 commented Aug 21, 2014

This gave me quite a scare! I was running this upgrade overnight with a script that slept between nodes to keep the cluster balanced. It woke me up with 99% disk utilization on one of the nodes. I'll keep pushing the upgrade through carefully.

@nik9000 nik9000 closed this as completed Aug 21, 2014
@nik9000
Member Author

nik9000 commented Aug 21, 2014

For posterity: if you nuke the contents of your node's disk after stopping Elasticsearch 1.2 but before starting Elasticsearch 1.3, then you won't end up with too much data that can't be cleared. The more nodes you upgrade the more shards you'll be able to delete anyway - like @s1monw said.

@s1monw
Contributor

s1monw commented Aug 21, 2014

Just to clarify a bit more: we added some safety in 1.3 that required a new API, and we can only call this API if we know the shard is allocated on another 1.3-or-newer node. That is why we keep the data around longer. Thanks for opening this, Nik!

@nik9000
Member Author

nik9000 commented Aug 22, 2014

So far we haven't seen any cleanup of old shards and we've just restarted the last node to pick up 1.3.2.
(screenshot: whatson_not_yet_cleaning)
Deleting the contents of the node slowed down the upgrade but allowed us to continue the process without space being taken up by indexes we couldn't remove.

@martijnvg
Member

Unused shard copies only get deleted if all of a shard's active copies can be verified. Maybe the shards to be cleaned up had copies on this not-yet-upgraded node?

Unused shard copies should get cleaned up now; if that isn't the case then that is bad.

If you enable trace logging for the indices.store category then we can get a peek into ES's decision making.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

@martijnvg - I'll see what happens once the whole cluster goes green after the last upgrade - that'll be in under an hour.

Did we do anything to allow changing log levels on the fly? I remember seeing something about it but #6416 is still open.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

And by we I mean you, I guess :)

@martijnvg
Member

:) Well, this has been in for a while: #2517, which allows changing the log settings via the cluster update settings API.
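
For reference, a hedged sketch of using that API to turn on the indices.store trace logging asked for above. The transient "logger.indices.store" key reflects my understanding of the dynamic logger settings; treat it as an assumption and adjust if your version expects something different. Setting it back to a quieter level afterwards keeps the logs manageable.

    // Hedged sketch: bump the indices.store logger to TRACE through the cluster update
    // settings API. The setting key is an assumption based on the dynamic "logger.*" settings.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class EnableIndicesStoreTrace {
        public static void main(String[] args) throws Exception {
            String body = "{\"transient\":{\"logger.indices.store\":\"TRACE\"}}";
            HttpURLConnection c = (HttpURLConnection) new URL("http://localhost:9200/_cluster/settings").openConnection();
            c.setRequestMethod("PUT");
            c.setDoOutput(true);
            try (OutputStream out = c.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("cluster settings update returned HTTP " + c.getResponseCode());
        }
    }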

@nik9000
Member Author

nik9000 commented Aug 22, 2014

OK! Here is something: https://gist.github.com/nik9000/89013550ec78da5808e4

@nik9000
Member Author

nik9000 commented Aug 22, 2014

That is getting spit out constantly.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

Looks like it is on every node as well.

@nik9000 nik9000 reopened this Aug 22, 2014
@nik9000
Member Author

nik9000 commented Aug 22, 2014

The cluster is now green and lots of old data is still sitting around.

@bleskes
Contributor

bleskes commented Aug 22, 2014

@nik9000 this is very odd. The line points at a null clusterName. Are all the nodes continuously logging this? Can I ask you to enable debug logging for the root logger and share the log? I hope to get more context on when this can happen.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

I see that the cluster name is something that was introduced in 1.1.1. Maybe a coincidence - but I haven't performed a full cluster restart since upgrading to 1.1.0.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

Let me see about that debug logging - seems like that'll be a ton of data. Also - looks like this is the only thing that doesn't check if the cluster name is non-null. Probably just a coincidence, because it's supposed to be non-null since 1.1.1, I guess...

@bleskes
Contributor

bleskes commented Aug 22, 2014

@nik9000 I'm not sure I follow what you mean by

looks like this is the only thing that doesn't check if the cluster name is non null.

I was referring to this line: https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/indices/store/IndicesStore.java#L418

@nik9000
Member Author

nik9000 commented Aug 22, 2014

@bleskes - sorry, yeah. I was looking at other code that looks at the cluster name and it's pretty careful about the cluster name potentially being null. Like
https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/cluster/ClusterState.java#L577 and https://github.com/elasticsearch/elasticsearch/blob/v1.3.2/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java#L551 .

I guess what I'm saying is that if the cluster state somehow never picked up the name, this looks like the only thing that would break.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

Tried setting logger to debug and didn't get anything super interesting. Here is some of it: https://gist.github.com/nik9000/b9c40805abb4bcbb5b61

@bleskes
Contributor

bleskes commented Aug 22, 2014

Thx Nik. I have a theory. Indeed, the cluster name as part of the cluster state was introduced in 1.1.1. When a node of version >= 1.1.1 reads the cluster state from an older node, that field will be populated with null. During the upgrade from 1.1.0 this happened, and the cluster state in memory has its name set to null. Since you never restarted the complete cluster since then, all the nodes have kept communicating it and keeping it alive. This trips this new code. A full cluster restart should fix it, but that's obviously totally undesirable. I'm still trying to come up with a potential workaround...
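
To make the mechanism concrete, here is a minimal, self-contained sketch of the fallback idea behind the eventual fix (illustrative only - this is not the actual ClusterState code): if the state read off the wire carries no cluster name, substitute the name the local node was started with rather than keeping null around.

    // Illustrative sketch only, not Elasticsearch code: fall back to the locally configured
    // cluster name when an older master sends a state without one.
    public class ClusterNameFallbackSketch {
        static String resolveClusterName(String nameFromWire, String localClusterName) {
            // A pre-1.1.1 master never serializes the name, so the wire value can be null.
            return nameFromWire != null ? nameFromWire : localClusterName;
        }

        public static void main(String[] args) {
            System.out.println(resolveClusterName(null, "production-search-eqiad"));
        }
    }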

@bleskes
Contributor

bleskes commented Aug 22, 2014

@nik9000 do you use dedicated master nodes? It doesn't look like it from the logs, but I want to double check.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

@bleskes no dedicated master nodes.

@nik9000
Member Author

nik9000 commented Aug 22, 2014

@bleskes that's what I was thinking - I was digging through places where the cluster state is built from name and they are pretty rare. Still, it'd take me some time to validate that they never get saved.

bleskes added a commit to bleskes/elasticsearch that referenced this issue Aug 22, 2014
…ter state who misses it

ClusterState has had a reference to the cluster name since version 1.1.0 (df7474b). However, if the state was sent from a master of an older version, this name can be set to null. This is unexpected and can cause bugs. The bad part is that it will never correct itself until a full cluster restart, where the cluster state is rebuilt using the code of the latest version.

 This commit changes the default to the node's cluster name.

 Relates to elastic#7386
@nik9000
Member Author

nik9000 commented Aug 22, 2014

More for posterity: this broke for me because I started the cluster on 1.1.0 and hadn't done a full restart since - only rolling restarts. If you are in that boat, do not upgrade to 1.3 until 1.3.3 is released.

bleskes added three commits that referenced this issue Aug 27, 2014, all with the same message as the commit above (Relates to #7386; Closes #7414).
@bleskes
Contributor

bleskes commented Aug 27, 2014

I'm going to close this as it is fixed by my change in #7414.

@bleskes bleskes closed this as completed Aug 27, 2014
@nik9000
Member Author

nik9000 commented Aug 27, 2014

Thanks!

@clintongormley clintongormley changed the title Upgrade caused shard data to stay on nodes Internal: Upgrade caused shard data to stay on nodes Sep 8, 2014
bleskes added a commit that referenced this issue Sep 8, 2014, with the same message as the commit above (Relates to #7386; Closes #7414).
@ajhalani

Ran into the same issue when upgrading from v1.2.2 to v1.3.2. Could you please help by answering:

  • Besides error traces/wasted disk space, does this actually cause search/indexing failures?
  • Until v1.3.3 is released, what is the fix? Will a full cluster restart fix this?

@nik9000
Member Author

nik9000 commented Sep 21, 2014

The error I had was caused by me not doing a full restart since 1.0.1 or so. A full cluster restart will fix it - like turning the whole thing off and then back on again.

You can also fix it by applying the patch for this issue directly to 1.3.2, building a release, and making that the master node, even if it only stays the master for a minute. That is a bit involved if you aren't used to building Elasticsearch, though.

@ajhalani

Thanks Nik. Yes, we have been doing rolling upgrades since v1.0.x, and the issue exploded with the last upgrade from v1.2.2.

Really curious what the impact of leaving this as-is on v1.3.2 would be. So far I only see error traces, but no search/index/alert failures.

Also, I am not sure how we can make an upgraded node the master - is there an option for that?

---- Edit 8:49 PM GMT ----
Did a full cluster upgrade; things are back online and green. I don't see the error traces at the moment.

@nik9000
Member Author

nik9000 commented Sep 21, 2014

On Sep 21, 2014 11:37 AM, "ajhalani" notifications@github.com wrote:

> Thanks Nik. Yes, we have been doing rolling upgrades since v1.0.x, and the issue exploded with the last upgrade from v1.2.2.

Yup, that sounds like this issue then.

> Really curious what the impact of leaving this as-is on v1.3.2 would be. So far I only see error traces, but no search/index/alert failures.

The errors at trace log level you can ignore. The trouble will be that the disks will get full. You can delete the files that Elasticsearch isn't using yourself, and it is safe so long as you were right that it wasn't using them. You have to be careful, though.

> Also, I am not sure how we can make an upgraded node the master - is there an option for that?

That isn't super easy. I can't explain on mobile so it'll have to wait until Monday. I did it because I'm familiar with the source code. If a full restart isn't too much of a problem for you, I'd suggest it. If not, ping here and I can explain on Monday when I'm more awake.

@ajhalani

Yeah, don't worry about explaining how to make a node master if it's not a straightforward option. As I said in a later edit, I did a full cluster restart and the issue went away. Thanks again!

@nik9000
Member Author

nik9000 commented Sep 22, 2014

Cool! I'm glad it worked for you.

@bleskes I've seen a few people with this issue over the past month - maybe 4. I wonder if it is worth thinking of cutting a 1.3.3 soonish to pick this up?

@kimchy
Member

kimchy commented Sep 22, 2014

@nik9000 yeah, we should release 1.3.3 as soon as possible. We were waiting on Lucene 4.9.1, which was released, and we pushed it in yesterday. I am still waiting for a review of #7811 and a discussion about whether it makes sense to get it into 1.3.3 as well.

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015, with the same message as the commit above (Relates to elastic#7386; Closes elastic#7414).
@jmwilkinson

In case anyone else comes across this: I've encountered the exact same behavior following the rolling upgrade docs here, going from 5.4 to 5.6.
