Shard stuck in STARTED state when taking snapshot. #17550

Closed
joshreback opened this issue Apr 5, 2016 · 2 comments

@joshreback

Elasticsearch version:
1.6.0

JVM version:
1.8.0_45-internal

Description of the problem including expected versus actual behavior:
When taking a snapshot (using an s3 repository), one of my shards is being snapshotted extremely slowly compared to the other shards in the index (and they are of comparable size). I've tried deleting the snapshot and trying again, but the problem persists.

Provide logs (if relevant):
Here is the output of GET /_snapshot/snapshot_prod/snapshot_16/_status (note shard 2 in "prod"):

{
   "snapshots": [
      {
         "snapshot": "snapshot_16",
         "repository": "snapshot_prod",
         "state": "STARTED",
         "shards_stats": {
            "initializing": 0,
            "started": 1,
            "finalizing": 0,
            "done": 15,
            "failed": 0,
            "total": 16
         },
         "stats": {
            "number_of_files": 437,
            "processed_files": 366,
            "total_size_in_bytes": 89068328004,
            "processed_size_in_bytes": 65917540329,
            "start_time_in_millis": 1459881018320,
            "time_in_millis": 682146
         },
         "indices": {
            "qa": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 5,
                  "failed": 0,
                  "total": 5
               },
               "stats": {
                  "number_of_files": 8,
                  "processed_files": 8,
                  "total_size_in_bytes": 13196,
                  "processed_size_in_bytes": 13196,
                  "start_time_in_millis": 1459881018320,
                  "time_in_millis": 808
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018320,
                        "time_in_millis": 474
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 4,
                        "processed_files": 4,
                        "total_size_in_bytes": 12880,
                        "processed_size_in_bytes": 12880,
                        "start_time_in_millis": 1459881018324,
                        "time_in_millis": 489
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018325,
                        "time_in_millis": 458
                     }
                  },
                  "3": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018320,
                        "time_in_millis": 448
                     }
                  },
                  "4": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 806
                     }
                  }
               }
            },
            "prod": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 1,
                  "finalizing": 0,
                  "done": 2,
                  "failed": 0,
                  "total": 3
               },
               "stats": {
                  "number_of_files": 421,
                  "processed_files": 350,
                  "total_size_in_bytes": 89068309560,
                  "processed_size_in_bytes": 65917521885,
                  "start_time_in_millis": 1459881018328,
                  "time_in_millis": 682138
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 71,
                        "processed_files": 71,
                        "total_size_in_bytes": 6022934925,
                        "processed_size_in_bytes": 6022934925,
                        "start_time_in_millis": 1459881018328,
                        "time_in_millis": 682138
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 123,
                        "processed_files": 123,
                        "total_size_in_bytes": 4443650351,
                        "processed_size_in_bytes": 4443650351,
                        "start_time_in_millis": 1459881018328,
                        "time_in_millis": 494598
                     }
                  },
                  "2": {
                     "stage": "STARTED",
                     "stats": {
                        "number_of_files": 227,
                        "processed_files": 156,
                        "total_size_in_bytes": 78601724284,
                        "processed_size_in_bytes": 55450936609,
                        "start_time_in_millis": 1459881018332,
                        "time_in_millis": 0
                     },
                     "node": "ItiaPMQhQiOVl3IMqabfeQ"
                  }
               }
            },
            "library": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 3,
                  "failed": 0,
                  "total": 3
               },
               "stats": {
                  "number_of_files": 3,
                  "processed_files": 3,
                  "total_size_in_bytes": 4853,
                  "processed_size_in_bytes": 4853,
                  "start_time_in_millis": 1459881018323,
                  "time_in_millis": 1474
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1635,
                        "processed_size_in_bytes": 1635,
                        "start_time_in_millis": 1459881018983,
                        "time_in_millis": 814
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1739,
                        "processed_size_in_bytes": 1739,
                        "start_time_in_millis": 1459881018323,
                        "time_in_millis": 818
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1479,
                        "processed_size_in_bytes": 1479,
                        "start_time_in_millis": 1459881018327,
                        "time_in_millis": 670
                     }
                  }
               }
            },
            "edmarket_dev": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 5,
                  "failed": 0,
                  "total": 5
               },
               "stats": {
                  "number_of_files": 5,
                  "processed_files": 5,
                  "total_size_in_bytes": 395,
                  "processed_size_in_bytes": 395,
                  "start_time_in_millis": 1459881018322,
                  "time_in_millis": 911
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 721
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 911
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 697
                     }
                  },
                  "3": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 744
                     }
                  },
                  "4": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 655
                     }
                  }
               }
            }
         }
      }
   ]
}

It continues to snapshot more bytes each time I re-run the _status command, but the number changes extremely slowly, so I want to make sure nothing is wrong with this node. Happy to provide more info and log output as necessary.

Thanks!
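
One way to put a number on "extremely slowly" is to compare processed_size_in_bytes across two _status polls. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and the second poll's byte count is an invented illustrative value, while the first poll and the shard total are taken from the shard 2 stats above.

```python
def snapshot_rate(prev_bytes, prev_time_s, curr_bytes, curr_time_s):
    """Return bytes per second processed between two _status polls."""
    elapsed = curr_time_s - prev_time_s
    if elapsed <= 0:
        raise ValueError("polls must be given in increasing time order")
    return (curr_bytes - prev_bytes) / elapsed


def eta_seconds(total_bytes, curr_bytes, rate):
    """Estimate remaining seconds at the observed rate (inf if stalled)."""
    if rate <= 0:
        return float("inf")
    return (total_bytes - curr_bytes) / rate


# Values from the issue: shard 2 of "prod".
TOTAL_BYTES = 78601724284
FIRST_POLL_BYTES = 55450936609

# ASSUMPTION: a second poll taken 60 seconds later; this number is made up
# purely to illustrate the arithmetic.
SECOND_POLL_BYTES = FIRST_POLL_BYTES + 600_000_000

rate = snapshot_rate(FIRST_POLL_BYTES, 0, SECOND_POLL_BYTES, 60)
eta = eta_seconds(TOTAL_BYTES, SECOND_POLL_BYTES, rate)
print(f"rate: {rate / 1e6:.1f} MB/s, estimated time left: {eta / 60:.1f} min")
```

A rate that is steady but low points at throttling or bandwidth, whereas a rate of zero across several polls would suggest the shard is actually stuck.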

@imotov
Contributor

imotov commented Apr 5, 2016

@joshreback for whatever reason (custom routing, perhaps?) this shard is more than 10 times larger than the other two shards: shard 0 is 6g, shard 1 is 4g, and shard 2 is 78g. It takes time to snapshot 78g. If your node is in AWS and has a good network connection, you can try changing the throttling, but be careful not to overload the nodes in your cluster.
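
The size imbalance can be read straight from the _status response. Below is a rough sketch of how one might summarize it per shard; the helper name `shard_summary` and the trimmed `STATUS` dict are illustrative assumptions, with the byte counts copied from the "prod" index in the output above.

```python
def shard_summary(status, index):
    """Summarize size and progress for each shard of one index.

    `status` is the parsed JSON body of a snapshot _status response.
    """
    shards = status["snapshots"][0]["indices"][index]["shards"]
    rows = []
    # Shard ids are JSON object keys (strings), so sort them numerically.
    for shard_id, info in sorted(shards.items(), key=lambda kv: int(kv[0])):
        stats = info["stats"]
        total = stats["total_size_in_bytes"]
        done = stats["processed_size_in_bytes"]
        rows.append({
            "shard": shard_id,
            "stage": info["stage"],
            "total_gb": total / 1e9,
            "pct_done": 100.0 * done / total if total else 100.0,
        })
    return rows


# Trimmed copy of the response from the issue: only the fields used here.
STATUS = {"snapshots": [{"indices": {"prod": {"shards": {
    "0": {"stage": "DONE", "stats": {
        "total_size_in_bytes": 6022934925,
        "processed_size_in_bytes": 6022934925}},
    "1": {"stage": "DONE", "stats": {
        "total_size_in_bytes": 4443650351,
        "processed_size_in_bytes": 4443650351}},
    "2": {"stage": "STARTED", "stats": {
        "total_size_in_bytes": 78601724284,
        "processed_size_in_bytes": 55450936609}},
}}}}]}

rows = shard_summary(STATUS, "prod")
for row in rows:
    print(f"shard {row['shard']}: {row['stage']:8s} "
          f"{row['total_gb']:6.1f} GB, {row['pct_done']:5.1f}% done")
```

Run against a live response, a summary like this makes it easy to tell a shard that is simply large from one that has stopped making progress.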

@clintongormley

Sounds like this issue can be closed
