
NoSuchFileException: /opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161011/3/translog/translog.ckp #20854

Closed
lizhecao opened this issue Oct 11, 2016 · 16 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. feedback_needed

Comments

@lizhecao

I have hit the same issue as #16495 (Broken translog on most indexes, like NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp), but I don't understand how to solve it without upgrading. What does "copy and paste the ckp file" mean? Can anyone show me what to do in detail?

@s1monw
Contributor

s1monw commented Oct 11, 2016

What version are you running, and what led to this failure? Do you have logs you can provide?

@lizhecao
Author

lizhecao commented Oct 11, 2016

Version: 2.2.0
Reason: I suppose the file descriptor limit was too low on my CentOS machine, so ES reported "too many open files" errors. I raised the file descriptor limit and restarted ES, and then it reported the errors below:

[FH-CND-SS] [prs_sysinfo_20161001][0]: allocating [[prs_sysinfo_20161001][0], node[null], [P], v[18], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-11T09:48:42.522Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161001/0/translog/translog-8.ckp]; ]]] to [{FH-CND-SS}{Rmf1-0FvRjSse3EzZW3mXQ}{211.138.22.118}{211.138.22.118:9300}] on primary allocation
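
For reference, a common way to raise the open-file limit on CentOS is via /etc/security/limits.conf. A minimal sketch, assuming Elasticsearch runs as a user named elasticsearch (the actual user is not stated in this thread):

# /etc/security/limits.conf
# Raise soft and hard open-file limits for the Elasticsearch user (user name assumed)
elasticsearch  soft  nofile  65536
elasticsearch  hard  nofile  65536

After restarting the node, the limit it actually picked up can be checked with curl 'localhost:9200/_nodes/process?pretty' (look at max_file_descriptors).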

@s1monw
Contributor

s1monw commented Oct 11, 2016

I think we fixed this in 2.3 or 2.3.1 - can you upgrade to the latest and see if the index recovers?

@lizhecao
Author

OK, I will try.
Thank you!

@lizhecao
Author

lizhecao commented Oct 13, 2016

@s1monw I have upgraded to 2.4.0, but a new problem has come up.

[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][5]: throttling allocation [[content_flow_log_20160830][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][4] found 1 allocations of [content_flow_log_20160830][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][4]: throttling allocation [[content_flow_log_20160830][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][3] found 1 allocations of [content_flow_log_20160830][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][3]: throttling allocation [[content_flow_log_20160830][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][2] found 1 allocations of [content_flow_log_20160830][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][2]: throttling allocation [[content_flow_log_20160830][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][0] found 1 allocations of [content_flow_log_20160830][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][0]: throttling allocation [[content_flow_log_20160830][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160830][1] found 1 allocations of [peer_flow_log_20160830][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.343Z]], highest version: [22]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160830][1]: throttling allocation [[peer_flow_log_20160830][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.343Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation

I have read https://discuss.elastic.co/t/risk-associated-with-action-write-consistency-and-index-recovery-initial-shards-for-cluster-recovery-with-a-single-node/50211 and set "index.recovery.initial_shards": 1, but it didn't help.
Following https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html I also tried rerouting my indices to my single node, but that resulted in my data being lost. I need help.
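
For context, the reroute approach in that article relies on the 2.x allocate command with "allow_primary": true, which allocates a fresh, empty primary and therefore discards whatever data was on disk for that shard; that would explain the data loss. A sketch of the command, with index and node names taken from the logs above:

curl -XPOST localhost:9200/_cluster/reroute -d '{
    "commands" : [ {
        "allocate" : {
            "index" : "content_flow_log_20160830",
            "shard" : 0,
            "node" : "FH-CND-SS",
            "allow_primary" : true
        }
    } ]
}'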

@s1monw
Contributor

s1monw commented Oct 13, 2016

> @s1monw I have upgraded to 2.4.0, but a new problem has come up.

What is the problem? These shards are unassigned but should be assigned at some point. How many unassigned shards do you have? Do they initialize?

@lizhecao
Author

lizhecao commented Oct 14, 2016

> What is the problem?

The problem is that many shards are unassigned. My cluster has only one node.

> These shards are unassigned but should be assigned at some point.

Yes, these are primary shards that should be assigned to my node.
The health of the cluster is red, as shown below:

{
  "cluster_name": "stsc_p2p",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 203,
  "active_shards": 203,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 495,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 275,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 10650,
  "active_shards_percent_as_number": 28.917378917378915
}

> How many unassigned shards do you have? Do they initialize?

I have 495 unassigned shards. How can I see whether they are initializing?
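
One way to check, not mentioned in the thread, is the cat shards API, which lists the state (STARTED, INITIALIZING, UNASSIGNED) of every shard:

curl 'localhost:9200/_cat/shards?v' | grep -v STARTED

Filtering out STARTED leaves only the shards that are still initializing or unassigned.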

@s1monw
Contributor

s1monw commented Oct 14, 2016

They should initialize one after another. You have 4 initializing, which is the default value for cluster.routing.allocation.node_concurrent_recoveries. If you want to bump this up, update it via the cluster settings update API:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.node_concurrent_recoveries" : 10
    }
}'
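
(Note that a transient setting is cleared by a full cluster restart; a persistent setting would survive one, which may matter on a single-node cluster that gets restarted.)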

@lizhecao
Author

lizhecao commented Oct 14, 2016

I tried it, but it didn't help.
After I set "cluster.routing.allocation.node_concurrent_recoveries": 10, the health of the cluster is still:

{
  "cluster_name": "stsc_p2p",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 205,
  "active_shards": 205,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 493,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 175846,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 2273250,
  "active_shards_percent_as_number": 29.2022792022792
}

And there is persistent logging like the following:

[2016-10-14 17:29:22,990][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][4] found 1 allocations of [content_flow_log_20160914][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,990][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][4]: throttling allocation [[content_flow_log_20160914][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][0] found 1 allocations of [content_flow_log_20160914][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][0]: throttling allocation [[content_flow_log_20160914][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][5] found 1 allocations of [content_flow_log_20160914][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][5]: throttling allocation [[content_flow_log_20160914][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][1] found 1 allocations of [content_flow_log_20160914][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][1]: throttling allocation [[content_flow_log_20160914][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160914][3] found 1 allocations of [peer_flow_log_20160914][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.639Z]], highest version: [20]

@s1monw
Contributor

s1monw commented Oct 14, 2016

Do you see any exceptions in the log files?

@lizhecao
Author

I have emailed the log to you; please help me see what's wrong.
I have no idea how to recover the shards.

@clintongormley

@lizhecao it looks like your shards are recovering, just slowly. There are no exceptions in what you showed above.

@clintongormley added the :Distributed/Recovery label Oct 17, 2016
@lizhecao
Author

How long will it take for a shard to recover? I have waited a long time.
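
One way to see how far along each shard actually is, not suggested in the thread, is the cat recovery API:

curl 'localhost:9200/_cat/recovery?v'

In 2.x this reports, per shard, the recovery type and stage plus the percentage of files and bytes recovered so far.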

@lizhecao
Author

lizhecao commented Oct 19, 2016

Are there any ways to speed it up? @clintongormley @s1monw
It has recovered only 10 shards in 1.5 days, and 400 shards are still unassigned.
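
One general recovery knob, not tried in this thread, is indices.recovery.max_bytes_per_sec (default 40mb in 2.x). It mainly throttles peer recoveries, so it may not help a single node replaying local translogs, but it is dynamically updatable:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "100mb"
    }
}'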

@colings86
Contributor

@lizhecao are you still seeing this issue? If so, please provide details and reopen the issue.

@lizhecao
Author

lizhecao commented Apr 8, 2017

@colings86 Thanks for the help, but I can't provide details now because that environment no longer exists. My solution was to copy translog.ckp in place of the missing ckp file.
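
For the record, that workaround (also discussed in #16495) amounts to duplicating the live checkpoint file under the name of the missing per-generation checkpoint. A sketch using the path from this issue; stop the node and back up the translog directory first, since this can silently lose translog operations:

cd /opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161001/0/translog
cp translog.ckp translog-8.ckp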
