
NoSuchFileException: /opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161011/3/translog/translog.ckp #20854

Closed
lizhecao opened this issue Oct 11, 2016 · 16 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. feedback_needed

Comments

@lizhecao

I have hit the same issue as #16495 (Broken translog on most indexes, like NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp), but I don't understand how to solve it without upgrading. What does "copy and paste the ckp file" mean? Can anyone show me what to do in detail?

@s1monw
Contributor

s1monw commented Oct 11, 2016

What version are you running, and what led to this failure? Do you have logs you can provide?

@lizhecao
Author

lizhecao commented Oct 11, 2016

Version: 2.2.0
Reason: I suppose the file descriptor limit was too low on my CentOS machine, so ES reported "too many open files" errors. I raised the file descriptor limit and restarted ES, and then it reported the errors below:

[FH-CND-SS] [prs_sysinfo_20161001][0]: allocating [[prs_sysinfo_20161001][0], node[null], [P], v[18], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-11T09:48:42.522Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161001/0/translog/translog-8.ckp]; ]]] to [{FH-CND-SS}{Rmf1-0FvRjSse3EzZW3mXQ}{211.138.22.118}{211.138.22.118:9300}] on primary allocation
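
For reference, a common way to raise the open-file limit on CentOS is via /etc/security/limits.conf. A minimal sketch, assuming Elasticsearch runs as a user named elasticsearch (the actual user is not stated in this thread):

# /etc/security/limits.conf
# Raise soft and hard open-file limits for the Elasticsearch user (user name assumed)
elasticsearch  soft  nofile  65536
elasticsearch  hard  nofile  65536

After restarting the node, the limit it actually picked up can be checked with curl 'localhost:9200/_nodes/process?pretty' (look at max_file_descriptors).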

@s1monw
Contributor

s1monw commented Oct 11, 2016

I think we fixed this in 2.3 or 2.3.1 - can you upgrade to the latest and see if the index recovers?

@lizhecao
Author

OK, I will try.
Thank you!

@lizhecao
Author

lizhecao commented Oct 13, 2016

@s1monw I have upgraded to 2.4.0, but a new problem has come up.

[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][5]: throttling allocation [[content_flow_log_20160830][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][4] found 1 allocations of [content_flow_log_20160830][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,093][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][4]: throttling allocation [[content_flow_log_20160830][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][3] found 1 allocations of [content_flow_log_20160830][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][3]: throttling allocation [[content_flow_log_20160830][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][2] found 1 allocations of [content_flow_log_20160830][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][2]: throttling allocation [[content_flow_log_20160830][2], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][0] found 1 allocations of [content_flow_log_20160830][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]], highest version: [18]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160830][0]: throttling allocation [[content_flow_log_20160830][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.344Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160830][1] found 1 allocations of [peer_flow_log_20160830][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.343Z]], highest version: [22]
[2016-10-13 19:01:21,094][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160830][1]: throttling allocation [[peer_flow_log_20160830][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-13T10:59:15.343Z]]] to [[{FH-CND-SS}{d9cuWsBuSJCdsG1LuoKlsg}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation

I have read https://discuss.elastic.co/t/risk-associated-with-action-write-consistency-and-index-recovery-initial-shards-for-cluster-recovery-with-a-single-node/50211 and set "index.recovery.initial_shards": 1, but it didn't help.
Following https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html I also tried rerouting my indices to my single node, but that resulted in my data being lost. I need help.
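
For context, the reroute approach in that article relies on the 2.x allocate command with "allow_primary": true, which allocates a fresh, empty primary and therefore discards whatever data was on disk for that shard; that would explain the data loss. A sketch of the command, with index and node names taken from the logs above:

curl -XPOST localhost:9200/_cluster/reroute -d '{
    "commands" : [ {
        "allocate" : {
            "index" : "content_flow_log_20160830",
            "shard" : 0,
            "node" : "FH-CND-SS",
            "allow_primary" : true
        }
    } ]
}'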

@s1monw
Contributor

s1monw commented Oct 13, 2016

> @s1monw I have upgraded to 2.4.0, but a new problem has come up.

What is the problem? These shards are unassigned but should be assigned at some point. How many unassigned shards do you have? Do they initialize?

@lizhecao
Author

lizhecao commented Oct 14, 2016

> What is the problem?

The problem is that many shards are unassigned. My cluster has only one node.

> These shards are unassigned but should be assigned at some point.

Yes, these are primary shards that should be assigned to my node.
The health of the cluster is red, as shown below:

{
  "cluster_name": "stsc_p2p",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 203,
  "active_shards": 203,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 495,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 275,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 10650,
  "active_shards_percent_as_number": 28.917378917378915
}

> How many unassigned shards do you have? Do they initialize?

I have 495 unassigned shards. How can I see whether they are initializing?
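
One way to check, not mentioned in the thread, is the cat shards API, which lists the state (STARTED, INITIALIZING, UNASSIGNED) of every shard:

curl 'localhost:9200/_cat/shards?v' | grep -v STARTED

Filtering out STARTED leaves only the shards that are still initializing or unassigned.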

@s1monw
Contributor

s1monw commented Oct 14, 2016

They should initialize one after another. You have 4 initializing, which is the default value for cluster.routing.allocation.node_concurrent_recoveries. If you want to bump this up, update it via the cluster settings update API:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.node_concurrent_recoveries" : 10
    }
}'
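
(Note that a transient setting is cleared by a full cluster restart; a persistent setting would survive one, which may matter on a single-node cluster that gets restarted.)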

@lizhecao
Author

lizhecao commented Oct 14, 2016

I tried it, but it didn't help.
After I set "cluster.routing.allocation.node_concurrent_recoveries": 10, the health of the cluster is still:

{
  "cluster_name": "stsc_p2p",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 205,
  "active_shards": 205,
  "relocating_shards": 0,
  "initializing_shards": 4,
  "unassigned_shards": 493,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 175846,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 2273250,
  "active_shards_percent_as_number": 29.2022792022792
}

And there is persistent logging like the following:

[2016-10-14 17:29:22,990][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][4] found 1 allocations of [content_flow_log_20160914][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,990][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][4]: throttling allocation [[content_flow_log_20160914][4], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][0] found 1 allocations of [content_flow_log_20160914][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][0]: throttling allocation [[content_flow_log_20160914][0], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][5] found 1 allocations of [content_flow_log_20160914][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][5]: throttling allocation [[content_flow_log_20160914][5], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][1] found 1 allocations of [content_flow_log_20160914][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]], highest version: [20]
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [content_flow_log_20160914][1]: throttling allocation [[content_flow_log_20160914][1], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.645Z]]] to [[{FH-CND-SS}{_uzMYD-4RFm8T56t8cbuoA}{211.138.22.118}{211.138.22.118:9300}]] on primary allocation
[2016-10-14 17:29:22,991][DEBUG][gateway                  ] [FH-CND-SS] [peer_flow_log_20160914][3] found 1 allocations of [peer_flow_log_20160914][3], node[null], [P], v[0], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-10-14T08:27:01.639Z]], highest version: [20]

@s1monw
Contributor

s1monw commented Oct 14, 2016

Do you see any exceptions in the log files?

@lizhecao
Author

I have emailed the log to you; please help me see what's wrong.
I have no idea how to recover the shards.

@clintongormley

@lizhecao it looks like your shards are recovering, just slowly. There are no exceptions in what you showed above.

@clintongormley added the :Distributed/Recovery label Oct 17, 2016
@lizhecao
Author

How long will it take for a shard to recover? I have waited a long time.
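
One way to see how far along each shard actually is, not suggested in the thread, is the cat recovery API:

curl 'localhost:9200/_cat/recovery?v'

In 2.x this reports, per shard, the recovery type and stage plus the percentage of files and bytes recovered so far.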

@lizhecao
Author

lizhecao commented Oct 19, 2016

Are there any ways to speed it up? @clintongormley @s1monw
It has recovered only 10 shards in 1.5 days, and 400 shards are still unassigned.
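
One general recovery knob, not tried in this thread, is indices.recovery.max_bytes_per_sec (default 40mb in 2.x). It mainly throttles peer recoveries, so it may not help a single node replaying local translogs, but it is dynamically updatable:

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "100mb"
    }
}'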

@colings86
Contributor

@lizhecao are you still seeing this issue? If so, please provide details and reopen the issue.

@lizhecao
Author

lizhecao commented Apr 8, 2017

@colings86 Thanks for the help, but I can't provide details now because that environment no longer exists. My solution was to copy translog.ckp in place of the missing ckp file.
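
For the record, that workaround (also discussed in #16495) amounts to duplicating the live checkpoint file under the name of the missing per-generation checkpoint. A sketch using the path from this issue; stop the node and back up the translog directory first, since this can silently lose translog operations:

cd /opt/fonsview/3RD/elasticsearch/data/stsc_p2p/nodes/0/indices/prs_sysinfo_20161001/0/translog
cp translog.ckp translog-8.ckp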
