
failed to rollback writer on close #18508

Closed
makeyang opened this issue May 22, 2016 · 15 comments

Comments

@makeyang (Contributor)

Elasticsearch version: 2.1.2
JVM version: jdk1.8.0_60
OS version: CentOS release 6.6 (Final)

Description of the problem including expected versus actual behavior:

We have been running a cluster for a long time. After a rolling update from 2.1.0 to 2.1.2, one node keeps throwing the exception below:
[2016-05-22 22:34:58,678][WARN ][index.engine ] [d_172.20.122.110:9204] [op_logs_2016-05-26][7] failed to read latest segment infos on flush
java.nio.file.FileSystemException: /data6/elasticsearch/jiesi-59/jiesi-59/nodes/0/indices/op_logs_2016-05-26/7/index: Input/output error
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427)
at java.nio.file.Files.newDirectoryStream(Files.java:457)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:179)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:191)
at org.elasticsearch.index.store.FsDirectoryService$1.listAll(FsDirectoryService.java:127)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:567)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:563)
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:146)
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:783)
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:732)
at org.elasticsearch.index.engine.Engine.flushAndClose(Engine.java:1128)
at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:829)
at org.elasticsearch.index.IndexService.closeShardInjector(IndexService.java:443)
at org.elasticsearch.index.IndexService.removeShard(IndexService.java:416)
at org.elasticsearch.index.IndexService.close(IndexService.java:253)
at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:413)
at org.elasticsearch.indices.IndicesService.access$000(IndicesService.java:108)
at org.elasticsearch.indices.IndicesService$1.run(IndicesService.java:174)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

@makeyang (Contributor, Author)

One more thing to add: I found the following in dmesg:
XFS (sdh1): xfs_log_force: error 5 returned.
(the same line repeated 15 times)

Is this caused by a broken filesystem or a broken disk?
Does Elasticsearch have the ability to detect this and let the cluster recover from it?

@jasontedor (Member) commented May 22, 2016

java.nio.file.FileSystemException: /data6/elasticsearch/jiesi-59/jiesi-59/nodes/0/indices/op_logs_2016-05-26/7/index: Input/output error

This is almost always indicative of a filesystem error (usually hardware), not an application error.
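
A quick way to confirm this kind of problem (a sketch, assuming smartmontools is installed; the sdh device name is taken from the dmesg output above) is to check the kernel log and the drive's SMART health:

dmesg | grep -iE 'sdh|xfs|i/o error'   # look for further I/O errors from the kernel
smartctl -H /dev/sdh                   # overall SMART health verdict
smartctl -a /dev/sdh                   # full SMART attributes (reallocated sectors, etc.)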

@makeyang (Contributor, Author)

OK, got it.
So I guess the last question is: can ES detect this and keep the cluster healthy when this condition happens?

@jasontedor (Member)

XFS (sdh1): xfs_log_force: error 5 returned.

You definitely have a hardware issue. The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
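
For illustration, one way to drain the node before taking it out is shard allocation filtering via the cluster settings API (a sketch, not from this thread; the IP comes from the node name in the log above, and localhost:9200 is an assumption):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "172.20.122.110"
  }
}'

Once all shards have relocated off the node (watch _cat/shards), it can be shut down safely.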

Since this is a hardware issue and not an Elasticsearch issue, I'll close this issue but let me know if you think otherwise.

@makeyang (Contributor, Author)

@jasontedor pay attention to the very first log I post, actually it is one of many disks in that node have issue, so the engine should not fail completely but exclude allocation shard on that one, right?

@jasontedor (Member)

Engines are per-shard, and there is no allocation decider for excluding a disk that experienced an exception. If you want to exclude that disk, you have to remove it from path.data. But I wouldn't trust that node even after removing the failing disk until you have run checks (SMART status, etc.) on all the disks in that node, and even a memory test.
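
As a sketch of the path.data change (hypothetical paths; only /data6 appears in the log above), the node's elasticsearch.yml might go from:

path.data: /data1/elasticsearch,/data2/elasticsearch,/data6/elasticsearch

to:

path.data: /data1/elasticsearch,/data2/elasticsearch

Shards that lived on the removed path would then have to recover from replicas elsewhere in the cluster.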

Relates #18279

@makeyang (Contributor, Author) commented May 22, 2016

I get what you are saying about the engine.
The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
Is this the current behavior or planned future behavior?
I bet it is future behavior, right? Otherwise the ES server shouldn't keep throwing exceptions.

@jasontedor (Member)

I mean that you should take the node out of the cluster.

@makeyang (Contributor, Author) commented May 22, 2016

Sure, that has been done.
My question is: can ES handle this situation by itself, or must it be recovered manually?

@jasontedor (Member)

Removing a node from a cluster is a course of action that should only be done by an operator that is aware of end-user SLAs, maintenance windows, etc.

@makeyang (Contributor, Author)

@jasontedor I am sorry, I didn't make myself clear. What I am concerned about is this:
The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
So can ES currently do this or not?

@s1monw (Contributor) commented May 23, 2016

The engine should fail itself and that should trigger a recovery but the best course of action here is to completely take that node out of the cluster.
So can ES currently do this or not?

It currently won't take the node out of the cluster.

@wongder commented Mar 14, 2017

Hey guys, can you please point me to some info or links on Elasticsearch rollback, to undo the last updates? Maybe this can be done via _version? Thank you.

@jasontedor (Member)

can you please point me to some info or links on Elasticsearch rollback, to undo the last updates? Maybe this can be done via _version?

This is not possible. If you have additional questions, please ask them on the forum.
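
For context, _version exists for optimistic concurrency control on writes, not for history: Elasticsearch does not retain earlier versions of a document, so there is nothing to roll back to. A minimal sketch of what _version does do (hypothetical index and document names):

curl -XPUT 'http://localhost:9200/myindex/mytype/1?version=3' -d '{"field": "new value"}'

The write succeeds only if the document's current version is exactly 3; otherwise it fails with a version_conflict_engine_exception.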

@wongder commented Mar 15, 2017 via email
