Elasticsearch translog corruption if process is killed #9699

We found that the Elasticsearch transaction log (translog) is occasionally corrupted if the process is killed abruptly for any reason, including a power loss, while indexing is in progress. Since the default translog flush interval is 5 seconds, we expect to lose the last 5 seconds of transactions in this case, not to end up with a corrupted translog. This can be easily reproduced by repeatedly invoking `kill -9` on the Elasticsearch process and restarting it while sending index requests continuously. I have attached 2 Linux shell scripts to kill and restart ES.
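
The "sending index requests continuously" part of the reproduction could look roughly like the sketch below. It is only an illustration, not the client used for this report: the URL `http://localhost:9200`, the type name `doc`, the document body, and the function names are all assumptions (the index name `curr_1024` is taken from the log extract).

```shell
#!/bin/sh
# Hypothetical load generator for the reproduction (all defaults are assumptions).
ES_URL="${ES_URL:-http://localhost:9200}"
ES_INDEX="${ES_INDEX:-curr_1024}"   # index name seen in the log extract
ES_TYPE="${ES_TYPE:-doc}"           # assumed type name

# Build a small JSON document for sequence number $1.
make_doc() {
    printf '{"seq": %d, "msg": "translog repro"}' "$1"
}

# Index one document; returns non-zero if ES is unreachable or times out.
index_doc() {
    curl -s -o /dev/null --max-time 5 -XPOST \
        -H 'Content-Type: application/json' \
        "$ES_URL/$ES_INDEX/$ES_TYPE" -d "$(make_doc "$1")"
}

# Send $1 documents, pausing briefly whenever a request fails
# (expected while ES is being killed and restarted).
# Prints how many requests completed without a curl error.
send_docs() {
    n=0
    ok=0
    while [ "$n" -lt "$1" ]; do
        if index_doc "$n"; then
            ok=$((ok + 1))
        else
            sleep 1
        fi
        n=$((n + 1))
    done
    echo "$ok"
}
```

To keep the load running while the kill and restart scripts do their work, something like `while :; do send_docs 1000; done` would be enough.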
#### Extract from log:

[2015-02-14 00:00:53,965][WARN ][indices.cluster          ] [r1s8-1.dg.com] [curr_1024][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [curr_1024][3] failed to recover shard
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
        ... 4 more
Caused by: java.io.EOFException
        at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
        at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readByte(BufferedChecksumStreamInput.java:48)
        at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:116)
        at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:156)
        at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:371)
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
        ... 5 more
[2015-02-14 00:00:53,979][WARN ][cluster.action.shard     ] [r1s8-1.dg.com] [curr_1024][3] sending failed shard for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2015-02-14 00:00:53,980][WARN ][cluster.action.shard     ] [r1s8-1.dg.com] [curr_1024][3] received shard failed for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
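
When running the kill/restart loop for a long time, it helps to check automatically whether a restart hit this failure. A minimal sketch, assuming the log lines have the same shape as the extract above; the function name `corrupted_shards` is made up for this example:

```shell
#!/bin/sh
# Read ES log lines on stdin and print each [index][shard] that was
# reported as failed with a TranslogCorruptedException, once per shard.
corrupted_shards() {
    sed -n '/TranslogCorruptedException/s/^.*] \(\[[^]]*]\[[0-9]*]\) sending failed shard.*$/\1/p' |
        sort -u
}
```

Usage would be along the lines of `corrupted_shards < /path/to/es.log`, where the path is a placeholder for wherever the ES output ends up (the restart script below pipes it to `logger`).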

```
# ------------------------------------------------------------
# killes.sh script
# ------------------------------------------------------------
#!/bin/sh
# This script kills ES periodically.

if [ -z "$1" ]
then
    echo "Usage: $0 <sleepinterval>"
    exit 1
fi

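while :
do
    # PIDs of all processes whose command line mentions "elastic"
    pid=$(ps aux | grep elastic | grep -v "grep" | awk '{print $2}')
    if [ -n "${pid}" ]
    then
            echo "killing es process $pid"
            # left unquoted on purpose: ${pid} may hold several PIDs
            kill -9 ${pid}
    fi
    sleep "$1"
done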

# ------------------------------------------------------------
# restart_es shell script
# ------------------------------------------------------------

#!/bin/bash
# This script restarts ES whenever it exits.

ES_ROOT=/var/ES
export ES_HEAP_SIZE=12g  # real size, for scale and performance testing
ES_HOME=/opt/elasticsearch
export ES_DIRECT_SIZE=1024m
export ES_JAVA_OPTS="-XX:HeapDumpPath=$ES_ROOT/cores/esheapdump.hprof"

while true; do
    #start ES server in foreground
    mkdir -p $ES_ROOT/log/
    ulimit -n 65534;$ES_HOME/bin/elasticsearch -Des.node.name="`hostname`"  2>&1 | logger -t es_mon
    es_status=${PIPESTATUS[0]}  # exit status of elasticsearch; $? would be logger's
    echo "ES server exited with code $es_status, waiting to restart..." | logger -t dg_es_mon
    sleep 2  #pause before restart so ES processes can be cleaned up
done
```

