
Elasticsearch translog corruption if process is killed #9699

Closed
@athanikkal

Description

We found that the Elasticsearch transaction log (translog) is occasionally corrupted when the process is killed abruptly for any reason, including power loss, while indexing is in progress. Since the default translog flush interval is 5 seconds, we expect to lose at most the last 5 seconds of transactions in this case, not to end up with a corrupted translog. This is easy to reproduce by repeatedly invoking kill -9 on the Elasticsearch process and restarting it while sending index requests continuously. I have attached two Linux shell scripts that kill and restart ES.
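The two attached scripts cover the kill and restart side only. For the "sending index requests continuously" part, something along the following lines could be used. This is a minimal sketch, not the load generator used in the original test: the URL, the `doc` type name, the document shape, and the helper names `make_doc` and `index_forever` are all assumptions for illustration. The index name is taken from the log extract.

```shell
#!/bin/sh
# Hypothetical load generator (not part of the original report).
# ES_URL, the "doc" type and the document fields are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-curr_1024}"

# Build a small JSON document for sequence number $1
make_doc() {
    printf '{"seq": %d, "msg": "translog test"}' "$1"
}

# Send documents until interrupted. While ES is down (just killed),
# curl fails, so back off for a second and try again.
index_forever() {
    n=0
    while :; do
        n=$((n + 1))
        make_doc "$n" \
            | curl -s -o /dev/null -XPOST "$ES_URL/$INDEX/doc" --data-binary @- \
            || sleep 1
    done
}

# Run the loop only when asked, e.g.: RUN_LOAD=1 sh index_load.sh
if [ "${RUN_LOAD:-0}" = "1" ]; then
    index_forever
fi
```

Running it with `RUN_LOAD=1` in one terminal while the kill and restart scripts run in others should keep the translog busy at the moment of each kill.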

Extract from log:

[2015-02-14 00:00:53,965][WARN ][indices.cluster ] [r1s8-1.dg.com] [curr_1024][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [curr_1024][3] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
    ... 4 more
Caused by: java.io.EOFException
    at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
    at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readByte(BufferedChecksumStreamInput.java:48)
    at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:116)
    at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:156)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:371)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
[2015-02-14 00:00:53,979][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] sending failed shard for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2015-02-14 00:00:53,980][WARN ][cluster.action.shard ] [r1s8-1.dg.com] [curr_1024][3] received shard failed for [curr_1024][3], node[9Ip4LaEbTnqVa8xOtVocHg], [P], s[INITIALIZING], indexUUID [F1efdIdUQwONYbyYMqloEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[curr_1024][3] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
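When running the kill/restart cycle for a long time, it helps to detect automatically when the corruption has occurred. The following is a rough sketch that extracts the affected `[index][shard]` pairs from a log file containing lines like the ones above. The function name `corrupted_shards` is hypothetical, and the log file path is whatever your setup writes to.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original report): print the
# [index][shard] pairs that failed recovery with a
# TranslogCorruptedException, one per line, without duplicates.
corrupted_shards() {
    # Only the "sending failed shard" lines are used, so a failure that is
    # also logged as "received shard failed" is not counted twice.
    grep 'TranslogCorruptedException' "$1" \
        | grep 'sending failed shard for' \
        | sed 's/.*sending failed shard for \(\[[^]]*]\[[0-9]*]\).*/\1/' \
        | sort -u
}
```

For example, `corrupted_shards /path/to/es.log` could be called after each restart, stopping the test as soon as it prints anything so the damaged translog files can be preserved for inspection.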

/********************************************************/
killes.sh script
/**********************************************************/
#!/bin/sh
#this script will kill ES periodically

if [ -z "$1" ]
then
    echo "Usage: $0 <sleepinterval>"
    exit 1
fi

while :
do
    # pgrep -f matches against the full command line and never matches itself
    pid=$(pgrep -f elastic)
    if [ -n "${pid}" ]
    then
            echo "killing es process $pid"
            # left unquoted on purpose: pid may hold several process ids
            kill -9 ${pid}
    fi
    sleep "$1"
done

/*******************************************************************/
restart_es shell script
/*******************************************************************/

#!/bin/bash
#This script is used to restart ES on failure

ES_ROOT=/var/ES
export ES_HEAP_SIZE=12g  # real size, for scale and performance testing
ES_HOME=/opt/elasticsearch
export ES_DIRECT_SIZE=1024m
export ES_JAVA_OPTS="-XX:HeapDumpPath=$ES_ROOT/cores/esheapdump.hprof"

mkdir -p "$ES_ROOT/log/"
ulimit -n 65534

while true; do
    # start the ES server in the foreground and send its output to syslog
    "$ES_HOME/bin/elasticsearch" -Des.node.name="$(hostname)" 2>&1 | logger -t es_mon
    # PIPESTATUS[0] is the exit code of ES itself; $? would be logger's
    es_status=${PIPESTATUS[0]}
    echo "ES server exited with code $es_status, waiting to restart..." | logger -t dg_es_mon
    sleep 2  # pause before restart so ES processes can be cleaned up
done
