
Elasticsearch continues to attempt writes to an index that has shards on disks that are full #19584

Closed
evanvolgas opened this issue Jul 25, 2016 · 5 comments


@evanvolgas

evanvolgas commented Jul 25, 2016

Using

  1. Elasticsearch: 2.3.1
  2. JVM version: 1.8.0_60-b27
  3. OS version: Centos 7

With

.
.
.
node:
  data: true
  master: false
  name: esd02-data4
  tier: hot
os:
  available_processors: 8
path:
  data:
      - /storage/sdb1
      - /storage/sdx1
.
.
.

in elasticsearch.yml, we have observed twice now that one of the disks (say, /storage/sdx1) fills up to 100% while the other, /storage/sdb1, still has space. The error logs fill up with errors like:

[logstash-syslog-2016.07.21][[logstash-syslog-2016.07.21][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog.ckp -> /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog-7395039383191700052.tlog: No space left on device];
Caused by: [logstash-syslog-2016.07.21][[logstash-syslog-2016.07.21][1]] EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog.ckp -> /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog-7395039383191700052.tlog: No space left on device];
Caused by: java.nio.file.FileSystemException: /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog.ckp -> /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/translog/translog-7395039383191700052.tlog: No space left on device
java.nio.file.FileSystemException: /storage/sdb1/stage/nodes/0/indices/logstash-syslog-2016.07.21/_state/state-13787.st -> /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/_state/state-13787.st.tmp: No space left on device

It's pretty obvious what's happening: /storage/sdx1 owns a shard of logstash-syslog-2016.07.21, and ES is refusing to "break" the shard across multiple disk paths...

but it's also failing to recover from this on its own, and it doesn't stop sending data to the shard whose disk is full. I would expect either (A) that in the event of a full disk, ES would nominate the other shards to accept 100% of the writes to the index, or (B) that ES would loosen its requirement that shards are not allowed to span multiple data paths.
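For reference, the imbalance between the two data paths can be confirmed with something like the following sketch (node name taken from the config above, ES assumed to be listening on localhost:9200):

    # Per-mount usage as seen by the OS
    df -h /storage/sdb1 /storage/sdx1

    # Per-data-path usage as reported by Elasticsearch;
    # fs.data[] contains one entry per path.data entry
    curl -s 'localhost:9200/_nodes/esd02-data4/stats/fs?pretty'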

@bleskes
Contributor

bleskes commented Jul 25, 2016

answering in reverse order:

B) We went to great pains to make it exactly so, in order to limit the scope of such a failure (and of a disk loss) to a few shards on the node rather than all of them.

A is our chosen strategy, assuming that by "other shards" you mean that the shard on the full disk will be failed and the other shards will carry the load. Note that in ES all active shard copies index all operations, and that each document is mapped to a single set of shard copies (this cannot be changed).

It may, however, take some time for the master to process the shard failure, remove the shard, and publish the change to the node in question, which will in turn clean it up.

From your exception it seems you are talking about multiple attempts to assign a primary to that node, and it seems you have no other valid copy for ES to use. Is that the case? If so, this is a duplicate of #19446.
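To illustrate the "single set of shard copies" point: documents are routed with the documented formula below, and the location and state of each copy of the affected index can be checked via the cat shards API (index name taken from the logs above; localhost:9200 assumed):

    # Routing: each document goes to exactly one shard number, and every active copy
    # (primary + replicas) of that shard indexes it. By default _routing is the document _id.
    #   shard_num = hash(_routing) % number_of_primary_shards

    # Where each copy of the affected index lives and what state it is in:
    curl -s 'localhost:9200/_cat/shards/logstash-syslog-2016.07.21?v'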

@evanvolgas
Author

evanvolgas commented Jul 25, 2016

Ah, interesting. So this is a staging cluster where we run without replicas. In retrospect I definitely should have mentioned that. My apologies.

If I follow you correctly, the failover mechanics you are describing require that a promotable replica exists... which I could see leading to some bizarre edge cases, like a node failure on node "Foo" taking down replica copies of indices "A" and "B". The cluster might rightfully attempt to reallocate replica copies of "A" and "B" to nodes "Bar" and "Baz". Then, if node "Bar" held the only primary copy of a shard of index "A" (the replica having been lost when node "Foo" died, and currently in the process of being streamed from "Bar" to "Baz") and one of Bar's write paths filled up, the cluster might enter the sort of deadlock state I observed, where it was stuck for hours trying to write data to a location that doesn't have any space available on one of its write paths...

which is an edge case that might best be subtitled "If many bad things happen at the same time, Elastic might enter a state from which it can't recover on its own."

The behavior you are describing makes a lot of sense, and the scenarios in which I can see that strategy being inadequate are esoteric. This may very well be such an edge case that it's not worth worrying about.

As far as #19446, that looks related to me but not a duplicate of this ticket. But I may be overlooking something.

@bleskes
Contributor

bleskes commented Jul 26, 2016

If many bad things happen at the same time, Elastic might enter a state from which it can't recover on its own

For what it's worth, if many bad things happen, Elasticsearch may very well end up not being able to recover. It's about documenting those scenarios; for example, if you lose all copies of a shard, data will be lost.

in which I can see that strategy being inadequate are esoteric.

What are they?

As far as #19446, that looks related to me but not a duplicate of this ticket. But I may be overlooking something.

As far as I can see, ES does the only thing it can do, which is to try to salvage the only copy of the primary it has, in the hope that some space has been freed (by a user or by a completed merge). The problem is that it never gives up and waits for human intervention, which is what #19446 is about.

I'm going to close this for now as a duplicate. If it turns out to be something else, we can reopen.

@bleskes bleskes closed this as completed Jul 26, 2016
@evanvolgas
Author

in which I can see that strategy being inadequate are esoteric.

The scenarios I can think of basically all boil down to a node failing and a disk containing a primary shard filling up while ES tries to replicate the shards that were lost in that failure. And with 2 replica copies instead of 1, the odds of these kinds of failures happening seem extremely low to me.

As an example of what I have in mind, imagine a cluster with one replica of all its indices, and a server dies. It seems possible to me that recovery could fill up one of several disks on a node and, assuming no watermarks are exceeded overall (it's worth noting that _cat/allocation was reporting 60% used space on the node that entered a coma), you could have the cluster lock up and freeze. Am I mistaken that this could happen? Suppose you have disks A, B, C, and D. Suppose A is 90% full and the others are 50% full. Is there anything preventing recovery from placing a shard on disk A?
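A back-of-the-envelope version of the numbers in that scenario (equal-sized disks assumed; the watermark values are the stock defaults):

    Disk A: 90% used; disks B, C, D: 50% used            (equal sizes assumed)
    Node-level usage as reported by _cat/allocation = (90 + 50 + 50 + 50) / 4 = 60%
    cluster.routing.allocation.disk.watermark.low  (default) = 85%
    cluster.routing.allocation.disk.watermark.high (default) = 90%
    So the node-level figure sits comfortably below both watermarks even though
    one individual path is nearly full.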

Inre "For what it's worth, if many bad things happen Elasticsearch may very well end up not being able to recover. It's about documenting those scenarios - for example - if you loose all copies of a shard, data will be lost."

I agree. But if you lose all copies of a shard, you should expect to lose data. If you have 6 write paths with TBs of free space, node-level disk allocation significantly under the low and high watermarks, and a single disk that fills up, I would argue that ES freezing for hours on end until someone clears it up manually is a bit strange to see and probably not what most people would expect. But I could be wrong.
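For context, the node-level figures mentioned above come from the cat allocation API, which prints one aggregated line per node rather than one per data path, and any watermark overrides show up in the cluster settings (localhost:9200 assumed):

    # One line per node; the disk columns are node totals, not per-path values
    curl -s 'localhost:9200/_cat/allocation?v'

    # Shows transient/persistent overrides of cluster.routing.allocation.disk.watermark.*
    # (if nothing is listed there, the defaults are in effect)
    curl -s 'localhost:9200/_cluster/settings?pretty'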

@bleskes
Contributor

bleskes commented Jul 26, 2016

I'm not 100% sure I understand what you mean exactly, but I think what you are asking is whether ES will transfer shard data from one path to another on the same node (remember, each shard is always completely on one path) if one path is getting full and another still has space. Sadly, this is not yet the case; there are existing issues tracking this, for example #16763.
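To make the "each shard is always completely on one path" invariant concrete, the on-disk layout implied by the logs above looks roughly like this (shard 1 on /storage/sdx1 is taken from the exception; the shard shown on /storage/sdb1 is illustrative):

    /storage/sdb1/stage/nodes/0/indices/logstash-syslog-2016.07.21/0/   <- one shard: index files, translog and shard state all under this single path
    /storage/sdx1/stage/nodes/0/indices/logstash-syslog-2016.07.21/1/   <- shard 1: entirely on the full disk, so every write to it fails

A shard directory is never split across path.data entries, and (as noted above) it is not yet moved between them when one path fills up; #16763 tracks that.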
