
failed to rollback writer on close #18508

Closed
makeyang opened this issue May 22, 2016 · 15 comments

Comments

@makeyang (Contributor)

Elasticsearch version: 2.1.2
JVM version: jdk1.8.0_60
OS version: CentOS release 6.6 (Final)

Description of the problem including expected versus actual behavior:

We have been running a cluster for a long time. After a rolling update from 2.1.0 to 2.1.2, one node keeps throwing the exception below:
[2016-05-22 22:34:58,678][WARN ][index.engine ] [d_172.20.122.110:9204] [op_logs_2016-05-26][7] failed to read latest segment infos on flush
java.nio.file.FileSystemException: /data6/elasticsearch/jiesi-59/jiesi-59/nodes/0/indices/op_logs_2016-05-26/7/index: Input/output error
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:179)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:191)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:127)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:567)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:563)
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:146)
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:783)
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:732)
at org.elasticsearch.index.engine.Engine.flushAndClose(Engine.java:1128)
at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:829)
at org.elasticsearch.index.IndexService.closeShardInjector(IndexService.java:443)
at org.elasticsearch.index.IndexService.removeShard(IndexService.java:416)
at org.elasticsearch.index.IndexService.close(IndexService.java:253)
at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:413)
at org.elasticsearch.indices.IndicesService.access$000(IndicesService.java:108)
at org.elasticsearch.indices.IndicesService$1.run(IndicesService.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@makeyang (Contributor, Author)

One more thing to add: I found the following in dmesg:
XFS (sdh1): xfs_log_force: error 5 returned.
(the same line repeated 15 times)

Is this caused by a broken filesystem or a broken disk?
Does Elasticsearch have the ability to detect this and let the cluster recover from it?

@jasontedor (Member) commented May 22, 2016

java.nio.file.FileSystemException: /data6/elasticsearch/jiesi-59/jiesi-59/nodes/0/indices/op_logs_2016-05-26/7/index: Input/output error

This is almost always indicative of a filesystem error (usually hardware), not an application error.
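
A quick way to confirm this kind of problem (a sketch, assuming smartmontools is installed; the sdh device name is taken from the dmesg output above) is to check the kernel log and the drive's SMART health:

dmesg | grep -iE 'sdh|xfs|i/o error'   # look for further I/O errors from the kernel
smartctl -H /dev/sdh                   # overall SMART health verdict
smartctl -a /dev/sdh                   # full SMART attributes (reallocated sectors, etc.)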

@makeyang (Contributor, Author)

OK, got it.
So I guess the last question is: can ES detect this and keep the cluster healthy when this condition happens?

@jasontedor (Member)

XFS (sdh1): xfs_log_force: error 5 returned.

You definitely have a hardware issue. The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
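
For illustration, one way to drain the node before taking it out is shard allocation filtering via the cluster settings API (a sketch, not from this thread; the IP comes from the node name in the log above, and localhost:9200 is an assumption):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "172.20.122.110"
  }
}'

Once all shards have relocated off the node (watch _cat/shards), it can be shut down safely.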

Since this is a hardware issue and not an Elasticsearch issue, I'll close this issue but let me know if you think otherwise.

@makeyang (Contributor, Author)

@jasontedor pay attention to the very first log I post, actually it is one of many disks in that node have issue, so the engine should not fail completely but exclude allocation shard on that one, right?

@jasontedor (Member)

Engines are per-shard, and there is no allocation decider for excluding a disk that experienced an exception. If you want to exclude that disk, you have to remove it from path.data. But I wouldn't trust that node even after removing the failing disk until you have run checks (SMART status, etc.) on all the disks in that node, and even a memory test.
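
As a sketch of the path.data change (hypothetical paths; only /data6 appears in the log above), the node's elasticsearch.yml might go from:

path.data: /data1/elasticsearch,/data2/elasticsearch,/data6/elasticsearch

to:

path.data: /data1/elasticsearch,/data2/elasticsearch

Shards that lived on the removed path would then have to recover from replicas elsewhere in the cluster.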

Relates #18279

@makeyang (Contributor, Author) commented May 22, 2016

I get what you are saying about the engine.
The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
Is this the current behavior or planned future behavior?
I bet it is future behavior, right? Otherwise the ES server shouldn't keep throwing exceptions.

@jasontedor (Member)

I mean that you should take the node out of the cluster.

@makeyang (Contributor, Author) commented May 22, 2016

Sure, that has been done.
My question is: can ES handle this situation by itself, or must it be recovered manually?

@jasontedor (Member)

Removing a node from a cluster is a course of action that should only be done by an operator that is aware of end-user SLAs, maintenance windows, etc.

@makeyang (Contributor, Author)

@jasontedor I am sorry, I didn't make myself clear. What I am concerned about is this:
The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
So can ES currently do this or not?

@s1monw (Contributor) commented May 23, 2016

The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
So can ES currently do this or not?

It currently won't take the node out of the cluster.

@wongder commented Mar 14, 2017

Hey guys, can you please point me to some info or links on Elasticsearch rollback, to undo the last updates? Maybe this can be done via _version? Thank you.

@jasontedor (Member)

can you please point me to some info or links on Elasticsearch rollback, to undo the last updates? Maybe this can be done via _version?

This is not possible. If you have additional questions, please ask them on the forum.
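
For context, _version exists for optimistic concurrency control on writes, not for history: Elasticsearch does not retain earlier versions of a document, so there is nothing to roll back to. A minimal sketch of what _version does do (hypothetical index and document names):

curl -XPUT 'http://localhost:9200/myindex/mytype/1?version=3' -d '{"field": "new value"}'

The write succeeds only if the document's current version is exactly 3; otherwise it fails with a version_conflict_engine_exception.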

@wongder commented Mar 15, 2017 via email
