Elasticsearch translog corruption if process is killed #9699

Closed
athanikkal opened this issue Feb 14, 2015 · 12 comments · Fixed by #9797
@athanikkal

We found that the Elasticsearch transaction log occasionally gets corrupted if the process is killed abruptly for any reason, including a power shutdown, while indexing is going on. As the default translog flush interval is 5 seconds, we expect to lose up to the last 5 seconds of transactions in this case, not translog corruption. This can be easily reproduced by repeatedly invoking kill -9 on the Elasticsearch process and restarting it while continuously sending index requests. I have attached two Linux shell scripts to kill and restart ES.

Extract from log:

[2015-02-14 00:00:53,965][WARN ][indices.cluster ] [r1s8-1.dg.com] [curr_1024][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [curr_1024][3] failed to recover shard
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
... 4 more
Caused by: java.io.EOFException
at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readByte(BufferedChecksumStreamInput.java:48)
at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:116)
at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:156)
at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:371)
at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
... 5 more
[2015-02-14 00:00:53,979][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] sending failed shard for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2015-02-14 00:00:53,980][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] received shard failed for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]

/********************************************************/
killes.sh script
/**********************************************************/
#!/bin/sh
# This script kills the ES process periodically.

if [ -z "$1" ]
then
    echo "Usage: $0 <sleepinterval>"
    exit 1
fi

while :
do
    pid=`ps aux | grep elastic | grep -v "grep" | awk '{print $2}'`
    if [ -n "${pid}" ]
    then
        echo "killing es process ${pid}"
        kill -9 ${pid}
    fi
    sleep "$1"
done

/*******************************************************************/
restart_es shell script
/*******************************************************************/

#!/bin/bash
#This script is used to restart ES on failure

ES_ROOT=/var/ES
export ES_HEAP_SIZE=12g  # real size, for scale and performance testing
ES_HOME=/opt/elasticsearch
export ES_DIRECT_SIZE=1024m
export ES_JAVA_OPTS="-XX:HeapDumpPath=$ES_ROOT/cores/esheapdump.hprof"

while true; do
    #start ES server in foreground
    mkdir -p $ES_ROOT/log/
    ulimit -n 65534; $ES_HOME/bin/elasticsearch -Des.node.name="`hostname`" 2>&1 | logger -t es_mon
    es_exit=${PIPESTATUS[0]}  # exit code of elasticsearch itself, not of logger
    echo "ES server exited with code ${es_exit}, waiting to restart..." | logger -t dg_es_mon
    sleep 2  #pause before restart so ES processes can be cleaned up
done
@mkliu

mkliu commented Feb 19, 2015

+1, this happened on Windows too. Any way to mitigate this?

@dakrone dakrone self-assigned this Feb 19, 2015
@jamilbk

jamilbk commented Feb 20, 2015

+1 on Mac OS X here. I had to hard-reset my MacBook and am now getting translog corruption issues. It pegged two CPU cores at about 100% for a ~3.3 GB index. No other indexes.

@marksheinbaum

We are seeing something similar after running into "no more space available" on the local disk. It is not clear if this has been fixed or is in flight. We are running ES version 1.4.0 with Lucene version 4.10.2. Any info is appreciated. Thanks.

dakrone added a commit to dakrone/elasticsearch that referenced this issue Mar 3, 2015
We used to handle truncated translogs in a better manner (assuming that
the node was killed halfway through writing an operation and discarding
the last operation). This brings back that behavior by catching an
`EOFException` during the stream reading and throwing a
`TruncatedTranslogException` which can be safely ignored in
`IndexShardGateway`.

Fixes elastic#9699
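
The fix described in the commit above amounts to distinguishing a torn tail (the node was killed while the last operation was being appended) from genuine corruption. The following is a minimal, self-contained Java sketch of that idea using plain JDK streams; the record layout (length prefix, payload, CRC32 trailer) and every class and method name apart from TranslogCorruptedException and TruncatedTranslogException are illustrative assumptions, not the actual translog format or the code from #9797.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical stand-ins for the Elasticsearch exception types named in the commit message.
class TranslogCorruptedException extends IOException {
    TranslogCorruptedException(String msg) { super(msg); }
}

class TruncatedTranslogException extends IOException {
    TruncatedTranslogException(String msg, Throwable cause) { super(msg, cause); }
}

public class TranslogReplaySketch {

    // Reads one length-prefixed, CRC32-checksummed record (illustrative layout, not the real translog format).
    static byte[] readRecord(DataInputStream in) throws IOException {
        try {
            int size = in.readInt();
            if (size < 0) {
                throw new TranslogCorruptedException("negative operation size: " + size);
            }
            byte[] payload = new byte[size];
            in.readFully(payload);
            long expectedChecksum = in.readLong();
            CRC32 crc = new CRC32();
            crc.update(payload);
            if (crc.getValue() != expectedChecksum) {
                // A bad checksum on a fully present record is genuine corruption.
                throw new TranslogCorruptedException("checksum mismatch");
            }
            return payload;
        } catch (EOFException e) {
            // The stream ended mid-record: the process was killed while the last
            // operation was being appended. Signal "truncated", not "corrupted".
            throw new TruncatedTranslogException("translog truncated mid-operation", e);
        }
    }

    // Replays records until the end of the log, tolerating a torn tail.
    static List<byte[]> replay(InputStream raw) throws IOException {
        List<byte[]> ops = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(raw)) {
            while (true) {
                ops.add(readRecord(in));
            }
        } catch (TruncatedTranslogException e) {
            // Expected after an abrupt kill: drop the half-written last operation and
            // recover with everything read so far. (A clean end of file also surfaces
            // here, as an EOF on the next length prefix.)
        }
        return ops;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny log: one complete record followed by a torn tail.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] op = "index {\"field\":\"value\"}".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(op);
        out.writeInt(op.length);
        out.write(op);
        out.writeLong(crc.getValue());
        out.writeInt(9999); // length prefix of an operation whose body was never written
        List<byte[]> recovered = replay(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println("recovered " + recovered.size() + " operation(s)"); // prints: recovered 1 operation(s)
    }
}

The point is that an EOFException raised mid-record is rethrown as a truncation signal that the replay loop deliberately swallows, while a checksum mismatch on a fully present record still propagates and fails recovery.
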
@choweiyuan

This happened for me with 1.4.1; will this be fixed in 1.4.x?

dakrone added a commit that referenced this issue Mar 3, 2015
We used to handle truncated translogs in a better manner (assuming that
the node was killed halfway through writing an operation and discarding
the last operation). This brings back that behavior by catching an
`EOFException` during the stream reading and throwing a
`TruncatedTranslogException` which can be safely ignored in
`IndexShardGateway`.

Fixes #9699
@dakrone
Member

dakrone commented Mar 3, 2015

@choweiyuan Yes, I am planning on backporting this to the 1.4 branch.

@choweiyuan

Thanks @dakrone !

dakrone added a commit that referenced this issue Mar 3, 2015
We used to handle truncated translogs in a better manner (assuming that
the node was killed halfway through writing an operation and discarding
the last operation). This brings back that behavior by catching an
`EOFException` during the stream reading and throwing a
`TruncatedTranslogException` which can be safely ignored in
`IndexShardGateway`.

Fixes #9699
@athanikkal
Author

Thanks @dakrone for fixing this promptly. Looking forward to the next 1.4.x release with this fix.

openstack-gerrit pushed a commit to openstack-archive/fuel-plugin-elasticsearch-kibana that referenced this issue Jul 10, 2015
We upgrade to 1.4.5 because we observed lots of "can't index" and "queue full" errors in the LMA collector logs. According to [1], the transaction log can be corrupted if the process is killed abruptly. So upgrading to 1.4.5 should fix our issue.

[1] elastic/elasticsearch#9699


Change-Id: I3a48467fd06e155b216b7088c0fdacb2020bc7f8
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
We used to handle truncated translogs in a better manner (assuming that
the node was killed halfway through writing an operation and discarding
the last operation). This brings back that behavior by catching an
`EOFException` during the stream reading and throwing a
`TruncatedTranslogException` which can be safely ignored in
`IndexShardGateway`.

Fixes elastic#9699
@balaji006

Delete the .recovery file inside the translog folder.

E.g.: /es/elasticsearch-1.7.1/data/[elasticsearch_clustername]/nodes/0/indices/[indexname]/2/translog/

@marksheinbaum

stop

@amir-rahnama

Can you guys fix this? I am still seeing it in 1.7.3, and it's really not nice.

@clintongormley

@ambodi The translog underwent a huge rewrite in 2.x and it can't be backported. Time to upgrade.

@amir-rahnama

@clintongormley oh man! Okay, thanks! 👍 👍
