Elasticsearch translog corruption if process is killed #9699
Comments
+1, happened on Windows too. Any way to mitigate this?
+1 on Mac OS X here. Had to hard reset my MacBook and now I'm getting translog corruption issues. It pegged two CPU cores at about 100% for a ~3.3 GB index. No other indexes.
We are seeing something similar after running into "no more space available" on local disk. It's not clear if this has been fixed or is in flight. We are running ES version 1.4.0 with Lucene version 4.10.2. Any info is appreciated. Thanks.
We used to handle truncated translogs in a better manner (assuming that the node was killed halfway through writing an operation and discarding the last operation). This brings back that behavior by catching an `EOFException` during the stream reading and throwing a `TruncatedTranslogException` which can be safely ignored in `IndexShardGateway`. Fixes elastic#9699
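For context, the shape of that fix is roughly the following. This is a minimal sketch, assuming a read loop like the one in `ChecksummedTranslogStream.read`; apart from `EOFException` and the `TruncatedTranslogException` name taken from the commit message above, all names here are illustrative, not copied from the actual patch.

```java
import java.io.EOFException;
import java.io.IOException;

// Sketch only: a truncated final operation is reported as a distinct,
// ignorable exception type instead of a hard corruption.
class TruncatedTranslogException extends IOException {
    TruncatedTranslogException(String message, Throwable cause) {
        super(message, cause);
    }
}

class TranslogReplaySketch {
    /** Reads one operation per call; returns null at a clean end of stream. */
    interface OperationReader {
        Object readOperation() throws IOException;
    }

    /** Replays operations, converting a mid-operation EOF into a truncation. */
    static int replay(OperationReader reader) throws IOException {
        int applied = 0;
        while (true) {
            Object op;
            try {
                op = reader.readOperation();
            } catch (EOFException e) {
                // The node likely died while writing the last operation, so
                // surface a truncation the caller can safely ignore rather
                // than failing shard recovery outright.
                throw new TruncatedTranslogException(
                        "translog truncated after " + applied + " operations", e);
            }
            if (op == null) {
                return applied; // clean end of translog
            }
            applied++; // the real code would apply op to the shard here
        }
    }
}
```

On the recovery side, the caller (`IndexShardGateway` in the real code) catches `TruncatedTranslogException` and treats it as the end of the log, so only the partially written final operation is dropped.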
This happened for me with 1.4.1; will this be fixed in 1.4.x?
@choweiyuan yes, I am planning on backporting this to the 1.4 branch
Thanks @dakrone !
Thanks @dakrone for fixing this promptly. Looking forward to the next 1.4.x release with this fix.
We upgraded to 1.4.5 because we observed lots of "can't index" and "queue full" errors in the LMA collector logs. According to [1], the transaction log can be corrupted if the process is killed abruptly, so upgrading to 1.4.5 should fix our issue. [1] elastic/elasticsearch#9699 Change-Id: I3a48467fd06e155b216b7088c0fdacb2020bc7f8
Delete the .recovery file inside the translog folder, e.g. /es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/
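If you go this route, here is a minimal sketch of that workaround in Java, assuming the 1.x on-disk layout shown above; the class name and argument handling are mine, not anything from Elasticsearch. Stop the node and back up the data directory first, since deleting recovery state is destructive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DeleteRecoveryFiles {
    public static void main(String[] args) throws IOException {
        // e.g. /es/elasticsearch-1.7.1/data/<cluster>/nodes/0/indices/<index>/2/translog
        Path translogDir = Paths.get(args[0]);
        try (Stream<Path> files = Files.list(translogDir)) {
            files.filter(p -> p.getFileName().toString().endsWith(".recovery"))
                 .forEach(p -> {
                     try {
                         Files.delete(p);
                         System.out.println("deleted " + p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```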
Can you guys fix this? I am still seeing it in 1.7.3, and it's really not nice.
@ambodi the translog underwent a huge rewrite in 2.x and it can't be backported. time to upgrade |
@clintongormley oh man! Okay, thanks! 👍 👍
We found that the Elasticsearch transaction log occasionally gets corrupted if the process is killed abruptly for any reason, including a power shutdown, while indexing is going on. Since the default translog flush interval is 5 seconds, we expect to lose the last 5 seconds of transactions in this case, not to get translog corruption. This can be easily reproduced by repeatedly invoking kill -9 on the elasticsearch process and restarting it while continuously sending index requests. I have attached 2 Linux shell scripts to kill and restart ES.
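The attached scripts aren't reproduced here, but the indexing half of the reproduction can be a loop as simple as the following sketch (the host, index name, and document body are assumptions); run it while a separate script repeatedly kill -9s and restarts the Elasticsearch process:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IndexLoop {
    public static void main(String[] args) throws Exception {
        // POST to /<index>/<type>/ so Elasticsearch auto-generates document IDs.
        URL url = new URL("http://localhost:9200/curr_1024/doc/");
        long i = 0;
        while (true) {
            byte[] body = ("{\"n\":" + (i++) + "}").getBytes(StandardCharsets.UTF_8);
            try {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type", "application/json");
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(body);
                }
                conn.getResponseCode(); // drain the response
                conn.disconnect();
            } catch (Exception e) {
                // failures are expected while the node is down; retry shortly
                Thread.sleep(100);
            }
        }
    }
}
```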
Extract from log:

```
[2015-02-14 00:00:53,965][WARN ][indices.cluster ] [r1s8-1.dg.com] [curr_1024][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [curr_1024][3] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
    ... 4 more
Caused by: java.io.EOFException
    at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
    at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readByte(BufferedChecksumStreamInput.java:48)
    at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:116)
    at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:156)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:371)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
[2015-02-14 00:00:53,979][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] sending failed shard for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2015-02-14 00:00:53,980][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] received shard failed for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
```