Shard stuck in STARTED state when taking snapshot. #17550

joshreback · 2016-04-05T20:45:33Z

Elasticsearch version:
1.6.0

JVM version:
1.8.0_45-internal

Description of the problem including expected versus actual behavior:
When taking a snapshot (using an s3 repository), one of my shards is being snapshotted extremely slowly compared to the other shards in the index (and they are of comparable size). I've tried deleting the snapshot and trying again, but the problem persists.

Provide logs (if relevant):
Here is the output of GET /_snapshot/snapshot_prod/snapshot_16/_status (note shard 2 in "prod"):

{
   "snapshots": [
      {
         "snapshot": "snapshot_16",
         "repository": "snapshot_prod",
         "state": "STARTED",
         "shards_stats": {
            "initializing": 0,
            "started": 1,
            "finalizing": 0,
            "done": 15,
            "failed": 0,
            "total": 16
         },
         "stats": {
            "number_of_files": 437,
            "processed_files": 366,
            "total_size_in_bytes": 89068328004,
            "processed_size_in_bytes": 65917540329,
            "start_time_in_millis": 1459881018320,
            "time_in_millis": 682146
         },
         "indices": {
            "qa": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 5,
                  "failed": 0,
                  "total": 5
               },
               "stats": {
                  "number_of_files": 8,
                  "processed_files": 8,
                  "total_size_in_bytes": 13196,
                  "processed_size_in_bytes": 13196,
                  "start_time_in_millis": 1459881018320,
                  "time_in_millis": 808
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018320,
                        "time_in_millis": 474
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 4,
                        "processed_files": 4,
                        "total_size_in_bytes": 12880,
                        "processed_size_in_bytes": 12880,
                        "start_time_in_millis": 1459881018324,
                        "time_in_millis": 489
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018325,
                        "time_in_millis": 458
                     }
                  },
                  "3": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018320,
                        "time_in_millis": 448
                     }
                  },
                  "4": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 806
                     }
                  }
               }
            },
            "prod": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 1,
                  "finalizing": 0,
                  "done": 2,
                  "failed": 0,
                  "total": 3
               },
               "stats": {
                  "number_of_files": 421,
                  "processed_files": 350,
                  "total_size_in_bytes": 89068309560,
                  "processed_size_in_bytes": 65917521885,
                  "start_time_in_millis": 1459881018328,
                  "time_in_millis": 682138
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 71,
                        "processed_files": 71,
                        "total_size_in_bytes": 6022934925,
                        "processed_size_in_bytes": 6022934925,
                        "start_time_in_millis": 1459881018328,
                        "time_in_millis": 682138
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 123,
                        "processed_files": 123,
                        "total_size_in_bytes": 4443650351,
                        "processed_size_in_bytes": 4443650351,
                        "start_time_in_millis": 1459881018328,
                        "time_in_millis": 494598
                     }
                  },
                  "2": {
                     "stage": "STARTED",
                     "stats": {
                        "number_of_files": 227,
                        "processed_files": 156,
                        "total_size_in_bytes": 78601724284,
                        "processed_size_in_bytes": 55450936609,
                        "start_time_in_millis": 1459881018332,
                        "time_in_millis": 0
                     },
                     "node": "ItiaPMQhQiOVl3IMqabfeQ"
                  }
               }
            },
            "library": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 3,
                  "failed": 0,
                  "total": 3
               },
               "stats": {
                  "number_of_files": 3,
                  "processed_files": 3,
                  "total_size_in_bytes": 4853,
                  "processed_size_in_bytes": 4853,
                  "start_time_in_millis": 1459881018323,
                  "time_in_millis": 1474
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1635,
                        "processed_size_in_bytes": 1635,
                        "start_time_in_millis": 1459881018983,
                        "time_in_millis": 814
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1739,
                        "processed_size_in_bytes": 1739,
                        "start_time_in_millis": 1459881018323,
                        "time_in_millis": 818
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 1479,
                        "processed_size_in_bytes": 1479,
                        "start_time_in_millis": 1459881018327,
                        "time_in_millis": 670
                     }
                  }
               }
            },
            "edmarket_dev": {
               "shards_stats": {
                  "initializing": 0,
                  "started": 0,
                  "finalizing": 0,
                  "done": 5,
                  "failed": 0,
                  "total": 5
               },
               "stats": {
                  "number_of_files": 5,
                  "processed_files": 5,
                  "total_size_in_bytes": 395,
                  "processed_size_in_bytes": 395,
                  "start_time_in_millis": 1459881018322,
                  "time_in_millis": 911
               },
               "shards": {
                  "0": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 721
                     }
                  },
                  "1": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 911
                     }
                  },
                  "2": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 697
                     }
                  },
                  "3": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 744
                     }
                  },
                  "4": {
                     "stage": "DONE",
                     "stats": {
                        "number_of_files": 1,
                        "processed_files": 1,
                        "total_size_in_bytes": 79,
                        "processed_size_in_bytes": 79,
                        "start_time_in_millis": 1459881018322,
                        "time_in_millis": 655
                     }
                  }
               }
            }
         }
      }
   ]
}

It continues to snapshot more bytes each time I re-run the _status command, but the number changes extremely slowly, so I want to make sure nothing is wrong with this node. Happy to provide more info and log output as necessary.

Thanks!
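
For readers who want to put a number on "extremely slowly", here is a minimal sketch that turns the per-shard "stats" object from the _status response into a progress fraction and, given two polls, a rough time remaining. The function names and the two-poll approach are illustrative assumptions, not part of Elasticsearch; only the field names total_size_in_bytes and processed_size_in_bytes come from the output above.

```python
def shard_progress(stats):
    """Return (fraction_done, bytes_remaining) for one shard's "stats" object."""
    total = stats["total_size_in_bytes"]
    done = stats["processed_size_in_bytes"]
    fraction = done / total if total else 1.0
    return fraction, total - done


def eta_seconds(earlier, later, seconds_between):
    """Estimate seconds left from two polls of the same shard.

    Returns None when no progress was observed between the polls,
    since no rate can be derived in that case.
    """
    delta = later["processed_size_in_bytes"] - earlier["processed_size_in_bytes"]
    if delta <= 0 or seconds_between <= 0:
        return None
    rate = delta / seconds_between  # bytes per second
    remaining = later["total_size_in_bytes"] - later["processed_size_in_bytes"]
    return remaining / rate


# Shard 2 of "prod", copied from the status output in this issue.
shard2 = {
    "total_size_in_bytes": 78601724284,
    "processed_size_in_bytes": 55450936609,
}
fraction, remaining = shard_progress(shard2)
print(f"shard 2: {fraction:.1%} done, {remaining} bytes remaining")
```

Calling eta_seconds with two saved copies of the shard's "stats" and the seconds between the polls gives an estimate that is only as steady as the upload rate, so treat it as a rough guide.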


imotov · 2016-04-05T22:46:44Z

@joshreback for whatever reason (custom routing, perhaps?) this shard is more than 10 times larger than the other 2 shards. Shard 0 is 6g, shard 1 is 4g and shard 2 is 78g. It takes time to snapshot 78g. If your node is in AWS and has a good network connection you can try changing throttling, but be careful not to overload the nodes in your cluster.
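
The size comparison can be checked directly against the status output. The sketch below is a hypothetical helper (skew_report is not an Elasticsearch API) that takes the total_size_in_bytes of each "prod" shard as reported above and computes how much larger the biggest shard is than the others.

```python
# total_size_in_bytes for each shard of "prod", copied from the status output.
prod_shard_bytes = {0: 6022934925, 1: 4443650351, 2: 78601724284}


def skew_report(shard_bytes):
    """Summarize how unevenly an index's data is spread across its shards."""
    total = sum(shard_bytes.values())
    largest_id = max(shard_bytes, key=shard_bytes.get)
    largest = shard_bytes[largest_id]
    # Ratio of the largest shard to every other non-empty shard.
    ratios = {
        sid: largest / size
        for sid, size in shard_bytes.items()
        if sid != largest_id and size > 0
    }
    return {
        "largest_shard": largest_id,
        "share_of_index": largest / total,
        "ratio_to_others": ratios,
    }


report = skew_report(prod_shard_bytes)
print(f"largest shard: {report['largest_shard']}")
print(f"share of index: {report['share_of_index']:.1%}")
for sid, ratio in sorted(report["ratio_to_others"].items()):
    print(f"  {ratio:.1f}x the size of shard {sid}")
```

With these figures shard 2 holds most of the index and is more than ten times the size of either other shard, which is consistent with it finishing long after them. The throttling mentioned above is, to my knowledge, the repository-level max_snapshot_bytes_per_sec setting; confirm the name and default in the documentation for your version before changing it.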

clintongormley · 2016-04-06T11:42:45Z

Sounds like this issue can be closed

clintongormley closed this as completed Apr 6, 2016
