
Broken translog on most indexes like NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp #16495

Closed
k-vladyslav opened this issue Feb 8, 2016 · 13 comments


@k-vladyslav

Act 1 - Preface

How I ran into this issue

  • I was running Elasticsearch 2.0.0 + Kibana 4 + Logstash as an ELK stack.
  • Everything was fine until 05 Feb, when I logged into the server over SSH to check something related to a project task.
  • I spotted that Elasticsearch was using 3-4 cores of the server at up to 100% each.
  • I shut down Elasticsearch and reviewed its logs.
  • The logs said:
[2016-02-05 17:56:05,596][WARN ][indices.cluster          ] [dev-node] [[logstash-2015.11.20][1]] marking and sending shard failed due to [failed recovery]
[logstash-2015.11.20][[logstash-2015.11.20][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2015.11.20/1/translog/translog-42.ckp: Too many open files];
  • I asked the admin to raise the open-files limit (a sketch of what that typically involves is after this list). Then, after Elasticsearch was started again, I began to see a different error in its logs:
[2016-02-08 12:00:30,905][WARN ][cluster.action.shard     ] [dev-node] [logstash-2016.01.04][2] received shard failed for [logstash-2016.01.04][2], node[Qg_JBfpNRfa3iApw8BDVsA], [P], v[15], s[INITIALIZING], a[id=mCviKgbnTomx3HaqKYFZ5w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-02-08T12:00:29.932Z]], indexUUID [50id1z4pS8SsVmFe7kdsvA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp]; ]
  • I updated Elasticsearch to the latest version, 2.2.0; nothing changed.
  • I assume I terminated Elasticsearch with SIGKILL at some point.
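
For reference, a minimal sketch of what raising that limit usually involves (the exact numbers and file locations here are my assumption, not necessarily what the admin did):

# check the limit of the running Elasticsearch process
cat /proc/$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)/limits | grep 'Max open files'

# make it persistent for the elasticsearch user, e.g. in /etc/security/limits.conf:
#   elasticsearch soft nofile 65536
#   elasticsearch hard nofile 65536

# with the deb/rpm packages, MAX_OPEN_FILES can instead be set in
# /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch), then restart the service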

Elasticsearch is functioning and Kibana works, but one (or more) of my shards can't be initialized, and Elasticsearch tries to start them in an infinite loop. Because of that, the server is negatively affected:

  • Some threads of the Elasticsearch instance consume from 300% to 450% CPU (I'm on an 8-core server).
  • Elasticsearch writes log messages at a very high frequency, which grows /var/log/elasticsearch/dev-cluster.log by roughly 500 MB per hour.

So I'd say it is not working as it should.

Act 2 - Pathetic attempts to fix it on my own

I've read related bug reports and articles, such as
#14989, #15021, #9699

and have tried several things:

  • Removed all .ckp files from the data folder with
    find . -type f -name '*.ckp' -delete
    didn't help
  • Removed all .tlog files from the data folder with
    find . -type f -name '*.tlog' -delete
    didn't help
  • Changed the Elasticsearch config so it doesn't create replicas or multiple shards, by updating config/elasticsearch.yml:
index.number_of_shards: 1
index.number_of_replicas: 0

didn't help

I also read articles about _cluster/reroute, such as
https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html
but that didn't help either.
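
For completeness, the kind of request those articles describe is roughly this (the shard number and node name are just placeholders for my setup, and allow_primary can throw away data on that shard):

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "logstash-2016.01.04",
        "shard": 2,
        "node": "dev-node",
        "allow_primary": true
      }
    }
  ]
}'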

http://www.jillesvangurp.com/2015/02/18/elasticsearch-failed-shard-recovery/
I didn't try the "org.apache.lucene.index.CheckIndex" approach described there because there might be another workaround.
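
For reference, that approach would look roughly like this for one of my shards (the lucene-core jar version and the -exorcise flag depend on the bundled Lucene version, and it permanently drops unreadable segments, which is why I held off):

java -cp /usr/share/elasticsearch/lib/lucene-core-5.4.1.jar \
  org.apache.lucene.index.CheckIndex \
  /usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/index \
  -exorcise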

Examples of the logs I'm getting

Normal start

[2016-02-08 11:46:28,635][WARN ][bootstrap                ] unable to load JNA native support library, native methods will be disabled.
java.lang.UnsatisfiedLinkError: /tmp/jna--1666338091/jna407580579145636854.tmp: /tmp/jna--1666338091/jna407580579145636854.tmp: failed to map segment from shared object: Operation not permitted
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1938)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1821)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:761)
    at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:736)
    at com.sun.jna.Native.<clinit>(Native.java:131)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.elasticsearch.bootstrap.Natives.<clinit>(Natives.java:45)
    at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:89)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:144)
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:285)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
[2016-02-08 11:46:28,637][WARN ][bootstrap                ] cannot check if running as root because JNA is not available
[2016-02-08 11:46:28,637][WARN ][bootstrap                ] cannot install syscall filters because JNA is not available
[2016-02-08 11:46:28,637][WARN ][bootstrap                ] cannot register console handler because JNA is not available
[2016-02-08 11:46:28,768][INFO ][node                     ] [dev-node] version[2.2.0], pid[22950], build[8ff36d1/2016-01-27T13:32:39Z]
[2016-02-08 11:46:28,768][INFO ][node                     ] [dev-node] initializing ...
[2016-02-08 11:46:29,165][INFO ][plugins                  ] [dev-node] modules [lang-expression, lang-groovy], plugins [], sites []
[2016-02-08 11:46:29,182][INFO ][env                      ] [dev-node] using [1] data paths, mounts [[/usr (/dev/md122)]], net usable_space [167.4gb], net total_space [196.6gb], spins? [possibly], types [ext4]
[2016-02-08 11:46:29,182][INFO ][env                      ] [dev-node] heap size [989.8mb], compressed ordinary object pointers [true]
[2016-02-08 11:46:30,905][INFO ][node                     ] [dev-node] initialized
[2016-02-08 11:46:30,905][INFO ][node                     ] [dev-node] starting ...
[2016-02-08 11:46:30,975][INFO ][transport                ] [dev-node] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2016-02-08 11:46:30,983][INFO ][discovery                ] [dev-node] dev-cluster/ARHJZZXRQnOl36kgJQxyUw
[2016-02-08 11:46:34,010][INFO ][cluster.service          ] [dev-node] new_master {dev-node}{ARHJZZXRQnOl36kgJQxyUw}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-02-08 11:46:34,020][INFO ][http                     ] [dev-node] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-02-08 11:46:34,021][INFO ][node                     ] [dev-node] started
[2016-02-08 11:46:37,463][INFO ][cluster.routing.allocation] [dev-node] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.kibana][0]] ...]).
[2016-02-08 11:47:31,731][INFO ][cluster.metadata         ] [dev-node] [logstash-2016.02.08] update_mapping [logs]
[2016-02-08 11:47:35,048][INFO ][cluster.metadata         ] [dev-node] [logstash-2016.02.08] update_mapping [logs]

Failed start

[2016-02-08 12:00:24,717][WARN ][bootstrap                ] unable to load JNA native support library, native methods will be disabled.
java.lang.UnsatisfiedLinkError: /tmp/jna--1666338091/jna7690950639376952187.tmp: /tmp/jna--1666338091/jna7690950639376952187.tmp: failed to map segment from shared object: Operation not permitted
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1938)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1821)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath(Native.java:761)
    at com.sun.jna.Native.loadNativeDispatchLibrary(Native.java:736)
    at com.sun.jna.Native.<clinit>(Native.java:131)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.elasticsearch.bootstrap.Natives.<clinit>(Natives.java:45)
    at org.elasticsearch.bootstrap.Bootstrap.initializeNatives(Bootstrap.java:89)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:144)
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:285)
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
[2016-02-08 12:00:24,719][WARN ][bootstrap                ] cannot check if running as root because JNA is not available
[2016-02-08 12:00:24,719][WARN ][bootstrap                ] cannot install syscall filters because JNA is not available
[2016-02-08 12:00:24,720][WARN ][bootstrap                ] cannot register console handler because JNA is not available
[2016-02-08 12:00:24,853][INFO ][node                     ] [dev-node] version[2.2.0], pid[24746], build[8ff36d1/2016-01-27T13:32:39Z]
[2016-02-08 12:00:24,853][INFO ][node                     ] [dev-node] initializing ...
[2016-02-08 12:00:25,266][INFO ][plugins                  ] [dev-node] modules [lang-expression, lang-groovy], plugins [], sites []
[2016-02-08 12:00:25,283][INFO ][env                      ] [dev-node] using [1] data paths, mounts [[/usr (/dev/md122)]], net usable_space [167.4gb], net total_space [196.6gb], spins? [possibly], types [ext4]
[2016-02-08 12:00:25,283][INFO ][env                      ] [dev-node] heap size [989.8mb], compressed ordinary object pointers [true]
[2016-02-08 12:00:26,787][INFO ][node                     ] [dev-node] initialized
[2016-02-08 12:00:26,787][INFO ][node                     ] [dev-node] starting ...
[2016-02-08 12:00:26,866][INFO ][transport                ] [dev-node] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2016-02-08 12:00:26,875][INFO ][discovery                ] [dev-node] dev-cluster/Qg_JBfpNRfa3iApw8BDVsA
[2016-02-08 12:00:29,905][INFO ][cluster.service          ] [dev-node] new_master {dev-node}{Qg_JBfpNRfa3iApw8BDVsA}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-02-08 12:00:29,918][INFO ][http                     ] [dev-node] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-02-08 12:00:29,918][INFO ][node                     ] [dev-node] started
[2016-02-08 12:00:30,903][WARN ][indices.cluster          ] [dev-node] [[logstash-2016.01.04][2]] marking and sending shard failed due to [failed recovery]
[logstash-2016.01.04][[logstash-2016.01.04][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:254)
    at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
    at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [logstash-2016.01.04][[logstash-2016.01.04][2]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:156)
    at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
    at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1450)
    at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1434)
    at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:925)
    at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:897)
    at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
    ... 5 more
Caused by: java.nio.file.NoSuchFileException: /usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel(Files.java:361)
    at java.nio.file.Files.newByteChannel(Files.java:407)
    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
    at java.nio.file.Files.newInputStream(Files.java:152)
    at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:82)
    at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:330)
    at org.elasticsearch.index.translog.Translog.<init>(Translog.java:179)
    at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
    at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:152)
    ... 11 more
[2016-02-08 12:00:30,905][WARN ][cluster.action.shard     ] [dev-node] [logstash-2016.01.04][2] received shard failed for [logstash-2016.01.04][2], node[Qg_JBfpNRfa3iApw8BDVsA], [P], v[15], s[INITIALIZING], a[id=mCviKgbnTomx3HaqKYFZ5w], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-02-08T12:00:29.932Z]], indexUUID [50id1z4pS8SsVmFe7kdsvA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp]; ]
[logstash-2016.01.04][[logstash-2016.01.04][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];

...

[2016-02-08 12:00:31,608][WARN ][indices.cluster          ] [dev-node] [[logstash-2016.01.04][2]] marking and sending shard failed due to [failed recovery]
[logstash-2016.01.04][[logstash-2016.01.04][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];
...

[2016-02-08 12:00:31,609][WARN ][cluster.action.shard     ] [dev-node] [logstash-2016.01.04][2] received shard failed for [logstash-2016.01.04][2], node[Qg_JBfpNRfa3iApw8BDVsA], [P], v[15], s[INITIALIZING], a[id=X3x681vYTHK_ue0wNCimGg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-02-08T12:00:30.906Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp]; ]], indexUUID [50id1z4pS8SsVmFe7kdsvA], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp]; ]
[logstash-2016.01.04][[logstash-2016.01.04][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];
...

[2016-02-08 12:00:31,613][WARN ][cluster.action.shard     ] [dev-node] [logstash-2016.01.04][2] received shard failed for [logstash-2016.01.04][2], node[Qg_JBfpNRfa3iApw8BDVsA], [P], v[15], s[INITIALIZING], a[id=X3x681vYTHK_ue0wNCimGg], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-02-08T12:00:30.906Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp]; ]], indexUUID [50id1z4pS8SsVmFe7kdsvA], message [master {dev-node}{Qg_JBfpNRfa3iApw8BDVsA}{127.0.0.1}{127.0.0.1:9300} marked shard as initializing, but shard is marked as failed, resend shard failure], failure [Unknown]
[2016-02-08 12:00:32,284][WARN ][indices.cluster          ] [dev-node] [[logstash-2016.01.04][2]] marking and sending shard failed due to [failed recovery]
[logstash-2016.01.04][[logstash-2016.01.04][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp];

GET /_cat/shards outputs:

logstash-2016.01.04 3 p STARTED    11634  7.9mb 127.0.0.1 dev-node 
logstash-2016.01.04 1 p STARTED    11894    8mb 127.0.0.1 dev-node 
logstash-2016.01.04 2 p UNASSIGNED                                 
logstash-2016.01.04 4 p STARTED    11859  7.9mb 127.0.0.1 dev-node 
logstash-2016.01.04 0 p STARTED    11779  7.9mb 127.0.0.1 dev-node 
logstash-2016.01.02 3 p STARTED    10617  6.1mb 127.0.0.1 dev-node 
logstash-2016.01.02 1 p STARTED    10593  6.1mb 127.0.0.1 dev-node 
logstash-2016.01.02 2 p STARTED    10797  6.2mb 127.0.0.1 dev-node 
logstash-2016.01.02 4 p STARTED    10509    6mb 127.0.0.1 dev-node 
logstash-2016.01.02 0 p STARTED    10716  6.2mb 127.0.0.1 dev-node 
logstash-2016.01.03 3 p STARTED    11204  6.4mb 127.0.0.1 dev-node 
logstash-2016.01.03 2 p STARTED    11237  6.4mb 127.0.0.1 dev-node 
logstash-2016.01.03 1 p STARTED    11417  6.6mb 127.0.0.1 dev-node 
logstash-2016.01.03 4 p STARTED    11260  6.6mb 127.0.0.1 dev-node 
logstash-2016.01.03 0 p STARTED    11218  6.4mb 127.0.0.1 dev-node 
.kibana             0 p STARTED        6 86.8kb 127.0.0.1 dev-node 
logstash-2016.02.08 0 p STARTED     2769  1.8mb 127.0.0.1 dev-node 
logstash-2016.01.01 3 p STARTED    10450  6.4mb 127.0.0.1 dev-node 
logstash-2016.01.01 2 p STARTED    10201  6.3mb 127.0.0.1 dev-node 
logstash-2016.01.01 1 p STARTED    10474  6.5mb 127.0.0.1 dev-node 
logstash-2016.01.01 4 p STARTED    10436  6.5mb 127.0.0.1 dev-node 
logstash-2016.01.01 0 p STARTED    10585  6.5mb 127.0.0.1 dev-node 

Other failed indexes were temporarily moved out of the data folder to a backup folder.

Act 3: What's next?

This issue is reproduced across most of the logstash indexes I have. Some have these errors, some don't, so it's not a single-index issue.

How I suggest solving this issue:
I need some API like

curl -XPOST localhost:9200/stop_infinite_recovery -d '{
 "stop_that_creepy_recovery_and_ignore_all_errors": true
}'

that would stop Elasticsearch's endless reports about broken/missing translog files, ignore all previous translog errors, and run without errors, even if that requires losing/dropping/deleting some data, but not all my ELK indexes.
Otherwise I can't use my logs for the last 1.5 months, because I'm getting errors about almost every logstash index.
I already deleted all logs older than 1.5 months while trying to solve this issue, but that didn't help either.

Worst case scenario

I've already tried running Elasticsearch with a clean data directory, and then the ELK stack runs as usual: no CPU overhead, no tons of logs, everything clean and smooth.
I can drop my existing ELK logs this time. BUT! I won't be able to do so every time I get translog errors or something like that.

So, any advice on how to force Elasticsearch to ignore these damn translog errors? :)

@k-vladyslav k-vladyslav changed the title Broken translog: NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp Broken translog on most indexes like NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp Feb 8, 2016
@k-vladyslav
Author

@s1monw @jasontedor please take a look at this question if you have any time.

@s1monw
Contributor

s1monw commented Feb 8, 2016

There are tons of bugs fixed along those lines. I don't think upgrading will heal what the bug did to you. The only way to help you would have been to get directory listings of your translog files. But apparently you deleted them all?

I can drop my existing ELK logs this time. BUT! I won't be able to do so every time I get translog errors or something like that.

There are no known translog issues in the later 2.x versions, so I am pretty confident you will be OK.

Can you tell me why you did this:

Removed all .ckp files from the data folder with find . -type f -name '*.ckp' -delete (didn't help)
Removed all .tlog files from the data folder with find . -type f -name '*.tlog' -delete (didn't help)

And also, why didn't you open an issue before deleting stuff on your filesystem? :)

@s1monw
Contributor

s1monw commented Feb 8, 2016

By the way, the solution to your problem would have been simple if you still had your tlog files. You ran into the same issue reported here: https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336/6 and fixed in #15788.

@s1monw s1monw closed this as completed Feb 8, 2016
@k-vladyslav
Author

@s1monw thank you for the response. No worries, I have backups :) Now I'll answer your questions.

I didn't delete the original problematic indexes, except ELK indexes older than 1.5 months (I don't need them). The other indexes are backed up in a separate folder and copied back to the ./data folder, so I can experiment as much as I need.
I only deleted the old /var/log/elasticsearch/dev-cluster.log file because it was about 2 GB in size, so I can't reconstruct what caused this mess.

Here is the list of ckp files for a sample problematic index:

$ find . -type f -name '*.ckp' | grep 2016.01.04
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-225.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-228.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-222.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-224.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-223.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-227.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-229.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-226.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-225.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-228.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-222.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-224.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-223.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-227.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-226.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-221.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-230.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-225.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-228.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-224.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-231.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-227.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-229.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-226.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-225.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-222.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-224.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-223.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-220.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-225.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-222.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-224.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-223.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-227.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-226.ckp
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-221.ckp
./dev-cluster/nodes/1/indices/logstash-2016.01.04/0/translog/translog.ckp
./dev-cluster/nodes/1/indices/logstash-2016.01.04/1/translog/translog.ckp
./dev-cluster/nodes/1/indices/logstash-2016.01.04/4/translog/translog.ckp
./dev-cluster/nodes/1/indices/logstash-2016.01.04/2/translog/translog.ckp
./dev-cluster/nodes/1/indices/logstash-2016.01.04/3/translog/translog.ckp

Here is the list of tlog files:

$ find . -type f -name '*.tlog' | grep 2016.01.04
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-230.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-227.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-229.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/0/translog/translog-228.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-227.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-229.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-226.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/1/translog/translog-228.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-230.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-231.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-229.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/4/translog/translog-232.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-227.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-225.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-227.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-226.tlog
./dev-cluster/nodes/0/indices/logstash-2016.01.04/3/translog/translog-228.tlog
./dev-cluster/nodes/1/indices/logstash-2016.01.04/0/translog/translog-1.tlog
./dev-cluster/nodes/1/indices/logstash-2016.01.04/1/translog/translog-1.tlog
./dev-cluster/nodes/1/indices/logstash-2016.01.04/4/translog/translog-1.tlog
./dev-cluster/nodes/1/indices/logstash-2016.01.04/2/translog/translog-1.tlog
./dev-cluster/nodes/1/indices/logstash-2016.01.04/3/translog/translog-1.tlog

can you tell me why you did this

Those were possible solutions for older versions that I found while researching.

You ran into the same issue reported here: https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336/6 and fixed here #15788

So I need to copy each .ckp file by hand in each daily index (38 in my case) on each shard that may fail (5 in my case)? Isn't there an automatic function for that in Elasticsearch?
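
If I understand the manual workaround correctly, it would amount to something like this per broken shard (this is my guess from the linked thread, and the generation number has to match the missing file named in the error):

cd /usr/share/elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog
# recreate the missing per-generation checkpoint from the current checkpoint
cp translog.ckp translog-226.ckp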

Also, #15788 was merged on Jan 6, and I set up ES 2.0.0 several months before that, so that bug could probably have occurred on 2.0.0, yes? Can't 2.2.x detect such problems left over from previous versions and resolve them automatically?

@k-vladyslav
Author

@s1monw also, the solution suggested at https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336/6 won't work because I have almost 40 indexes (it could be 400 if I kept all ELK data), and translog-226.ckp is not the only missing file, even within a single index.
So only an automatic solution will help here.

@s1monw
Contributor

s1monw commented Feb 8, 2016

There is no automatic solution for this, but maybe I can come up with a tool that can help you?

@k-vladyslav
Author

@s1monw okay :( I just thought I had missed something in the Elasticsearch API / CLI docs that could solve such problems automatically.
Anyway, I can neglect the old logs for now; all critical events gathered from ELK were moved into tasks several days before this incident, so it's OK.
If this is reproduced in 2.2.x or higher, I'll report again with more details, if you don't mind.
Thank you for your time.

@s1monw
Contributor

s1monw commented Feb 8, 2016

@k-vladyslav just to double-check, can I get a copy of ./dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog.ckp?

@k-vladyslav
Author

@s1monw it was sent to your GitHub profile email.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Jun 22, 2016
…ranslog

There is simply a coding bug that only happens if translog views are closed
after the translog itself is closed. This can happen, for instance, if we hit
a disk-full exception and try to repeatedly recover the translog. This will
cause a translog-N.ckp file to be deleted, since the wrong generation is used
to generate the path to delete. This seems like a copy/paste problem.

This bug doesn't affect 5.0

Relates to elastic#16495
s1monw added a commit that referenced this issue Jun 23, 2016
…ranslog (#19035)

There is simply a coding bug that only happens if translog views are closed
after the translog itself is closed. This can happen, for instance, if we hit
a disk-full exception and try to repeatedly recover the translog. This will
cause a translog-N.ckp file to be deleted, since the wrong generation is used
to generate the path to delete. This seems like a copy/paste problem.

This bug doesn't affect 5.0

Relates to #16495
s1monw added a commit that referenced this issue Jun 23, 2016
…ranslog (#19035)

There is simply a coding bug that only happens if translog views are closed
after the translog itself is closed. This can happen, for instance, if we hit
a disk-full exception and try to repeatedly recover the translog. This will
cause a translog-N.ckp file to be deleted, since the wrong generation is used
to generate the path to delete. This seems like a copy/paste problem.

This bug doesn't affect 5.0

Relates to #16495
@PasiKoistinen

PasiKoistinen commented Oct 15, 2016

Hi s1monw,
I'm running ES 2.2.2 on Ubuntu 14 stable. We've hit this or a related problem a couple of times now. I think it might have something to do with low disk space situations, but I'm not sure. I've seen this happen twice now, so I can say it is indeed a class A nuisance. Actually, this makes me wonder whether ES is a stable data storage option at all.

Can you please suggest actions that would help solve this issue?

This is what we're seeing in the logs:
Caused by: java.nio.file.NoSuchFileException: /var/lib/elasticsearch/elasticsearch/nodes/0/indices/lindex/0/translog/translog-148.ckp

That directory at /var/lib/elasticsearch/elasticsearch/nodes/0/indices/lindex/0/translog/
contains a couple of translog files:
translog-146.ckp translog-147.ckp translog-148.tlog translog-149.tlog translog.ckp

As you can see, translog-148.ckp is indeed missing; hence ES is unable to finish initializing the shard because of this missing transaction log file. I find this a critical bug, as it can cause denial of service for an otherwise totally working index.

I'm asking for two kinds of help here:

  1. Can you please advise how to bypass this error in this acute situation? I'd hate to do a backup restore every time this happens. I'm looking for a deterministic way to correct this now and next time.
  2. ES should have a startup option, or a tool/feature, that lets you start it so that it cleanly wipes out all translogs. After this it would return to the last known good state. Maybe something like --purge-translogs.

Thanks in advance,
P

@s1monw
Contributor

s1monw commented Oct 15, 2016

Can you please advise how to bypass this error in this acute situation? I'd hate to do a backup restore every time this happens. I'm looking for a deterministic way to correct this now and next time.

Upgrade to the latest version so this shouldn't happen anymore. We fixed a bunch of bugs related to this. In 5.0, which will go GA soon, we have a command-line tool.

For now you have to restore your index. Again, please upgrade to the latest 2.4 version ASAP.
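
The command-line tool will let you drop a broken translog for a shard. A rough sketch of the usage (run it with the node shut down; the exact name, flags, and data path layout may still change before GA, and it discards any operations that only exist in the translog):

bin/elasticsearch-translog truncate -d /path/to/data/nodes/0/indices/<index>/<shard>/translog/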

@igormasternoy

Currently I have version 2.4.0, and ES can still enter an endless loop. I store ES data on a separately mounted SSD. Sometimes the SSDs get dropped, and after I mount them again ES enters a recovery loop. It tries to load a transaction log file that is empty at that moment, and as a result I get an EOFException.

@rpedela

rpedela commented Jan 14, 2017

I have a very similar setup to @igormasternoy on my dev machine and get a similar error. In my case, my laptop battery died and the computer shut down automatically. When I booted, the translog was corrupted. I am using ES 2.3.3 and will be updating to the latest 2.4. Unfortunately I am not able to upgrade to 5.x yet because one of the 3rd-party plugins I use hasn't been updated yet. Here is an excerpt from the log:

[2017-01-14 11:15:28,824][WARN ][indices.cluster          ] [Stryfe] [[filing_docs_1477420655582][2]] marking and sending shard failed due to [failed recovery]
[filing_docs_1477420655582][[filing_docs_1477420655582][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
        at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
        at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: [filing_docs_1477420655582][[filing_docs_1477420655582][2]] EngineCreationFailureException[failed to create engine]; nested: EOFException;
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:155)
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
        at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1509)
        at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1493)
        at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:966)
        at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:938)
        at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
        ... 5 more
Caused by: java.io.EOFException
        at org.elasticsearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:43)
        at org.elasticsearch.index.translog.TranslogReader.open(TranslogReader.java:199)
        at org.elasticsearch.index.translog.Translog.openReader(Translog.java:377)
        at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:334)
        at org.elasticsearch.index.translog.Translog.<init>(Translog.java:179)
        at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:151)
        ... 11 more
