
failed snapshot _status returns 500 #23716

Closed
jpcarey opened this issue Mar 23, 2017 · 9 comments · Fixed by #23833
Labels: >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

@jpcarey (Contributor) commented Mar 23, 2017

When a snapshot fails, `GET _snapshot/<repo>/<snapshot>/_status` returns a 500 error. It seems the only way to fetch the actual "FAILED" state is by listing `GET _snapshot/<repo>/_all`. To me, the 500 exception returned when calling `_status` seems wrong.

Elasticsearch version: 5.2.2

Plugins installed: x-pack

```
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "compress": true,
    "location": "repo_test"
  }
}

PUT test1

PUT /_snapshot/my_backup/snapshot_1
{
  "indices": "test1",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET _snapshot/my_backup/snapshot_1/_status
```

Response (truncated):

```json
{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "repository": "my_backup",
      "uuid": "8KxZ0zSlQFyh77dqvxc3Mw",
      "state": "SUCCESS",

}]}
```

Make a "bad" index:

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

PUT test2

PUT /_snapshot/my_backup/snapshot_2
{
  "indices": "test1,test2",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET _snapshot/my_backup/snapshot_2/_status
```

Response:

```json
{
  "error": {
    "root_cause": [
      {
        "type": "index_shard_restore_failed_exception",
        "reason": "failed to read shard snapshot file",
        "index_uuid": "_f7dq3AMSEejQMZF4sbqYA",
        "shard": "0",
        "index": "test1"
      }
    ],
    "type": "index_shard_restore_failed_exception",
    "reason": "failed to read shard snapshot file",
    "index_uuid": "_f7dq3AMSEejQMZF4sbqYA",
    "shard": "0",
    "index": "test1",
    "caused_by": {
      "type": "no_such_file_exception",
      "reason": "/Users/jared/tmp/repo_test/indices/5H7x7fA-QsK7xqs6MdO0Bw/0/snap-2XWQ_Sd4QMCdSo1wU4VkoA.dat"
    }
  },
  "status": 500
}
```

```
GET /_snapshot/my_backup/_all?filter_path=*.snapshot,*.state
```

Response:

```json
{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "state": "SUCCESS"
    },
    {
      "snapshot": "snapshot_2",
      "state": "FAILED"
    }
  ]
}
```
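The fallback shown above (fall back to the repository listing when `_status` returns a 500) can be sketched as a small client-side helper. This is a hypothetical illustration, not part of Elasticsearch; `fetch` stands in for whatever HTTP client you use and is assumed to return a `(status_code, body_text)` pair.

```python
import json

def snapshot_state(fetch, repo, snapshot):
    """Return a snapshot's state, working around the 500 from _status.

    `fetch(path)` is a hypothetical HTTP helper returning
    (status_code, body_text) for a GET against the cluster.
    """
    status_code, body = fetch(f"_snapshot/{repo}/{snapshot}/_status")
    if status_code == 200:
        return json.loads(body)["snapshots"][0]["state"]
    # _status blew up (the 500 in this issue), but listing the
    # repository still reports the snapshot as FAILED.
    status_code, body = fetch(f"_snapshot/{repo}/_all")
    if status_code != 200:
        return None
    for snap in json.loads(body)["snapshots"]:
        if snap["snapshot"] == snapshot:
            return snap["state"]
    return None
```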
@javanna added the :Distributed/Snapshot/Restore label Mar 24, 2017
@javanna (Member) commented Mar 24, 2017

This sounds like a legit request to me. @imotov, what do you think?

@abeyad commented Mar 24, 2017

I agree, the _status endpoint for a failed snapshot should return information about the failure in a standard response, not a 500.

@javanna (Member) commented Mar 24, 2017

Thanks @abeyad! I will mark it adoptme then.

@imotov (Contributor) commented Mar 24, 2017

@abeyad that feels like a bug, not an enhancement. What do you think?

@abeyad commented Mar 24, 2017

@imotov agreed, I'll change the label.

@abeyad self-assigned this Mar 24, 2017
@javanna (Member) commented Mar 24, 2017

++ thanks for taking it @abeyad

@abeyad commented Mar 30, 2017

@jpcarey the steps you outlined above do not reproduce for me on 5.2.2. Instead, for

```
curl -XGET "localhost:9200/_snapshot/fs_repo/snap1"
```

I get:

```json
{
  "snapshots" : [
    {
      "snapshot" : "snap1",
      "uuid" : "iTxr6rgSQMqjGOEOtk1C3g",
      "version_id" : 5020299,
      "version" : "5.2.2",
      "indices" : [
        "idx2"
      ],
      "state" : "FAILED",
      "reason" : "Indices don't have primary shards [idx2]",
      "start_time" : "2017-03-30T17:25:56.191Z",
      "start_time_in_millis" : 1490894756191,
      "end_time" : "2017-03-30T17:25:56.199Z",
      "end_time_in_millis" : 1490894756199,
      "duration_in_millis" : 8,
      "failures" : [
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 3,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 2,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 4,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 0,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 1,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        }
      ],
      "shards" : {
        "total" : 5,
        "failed" : 5,
        "successful" : 0
      }
    }
  ]
}
```

For getting the status:

```
curl -XGET "localhost:9200/_snapshot/fs_repo/snap1/_status"
```

I get:

```json
{
  "snapshots" : [
    {
      "snapshot" : "snap1",
      "repository" : "fs_repo",
      "uuid" : "iTxr6rgSQMqjGOEOtk1C3g",
      "state" : "FAILED",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 0,
        "failed" : 5,
        "total" : 5
      },
      "stats" : {
        "number_of_files" : 0,
        "processed_files" : 0,
        "total_size_in_bytes" : 0,
        "processed_size_in_bytes" : 0,
        "start_time_in_millis" : 0,
        "time_in_millis" : 0
      },
      "indices" : {
        "idx2" : {
          "shards_stats" : {
            "initializing" : 0,
            "started" : 0,
            "finalizing" : 0,
            "done" : 0,
            "failed" : 5,
            "total" : 5
          },
          "stats" : {
            "number_of_files" : 0,
            "processed_files" : 0,
            "total_size_in_bytes" : 0,
            "processed_size_in_bytes" : 0,
            "start_time_in_millis" : 0,
            "time_in_millis" : 0
          },
          "shards" : {
            "0" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "1" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "2" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "3" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "4" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            }
          }
        }
      }
    }
  ]
}
```
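When `_status` does return a body like the one above, the per-shard failure reasons can be extracted programmatically. A minimal sketch, assuming the response shape shown above (this helper is illustrative, not an Elasticsearch client API):

```python
import json

def failed_shards(status_body):
    """Collect (index, shard_id, reason) for every FAILURE-stage shard
    in a GET _snapshot/<repo>/<snapshot>/_status response body."""
    failures = []
    for snap in json.loads(status_body)["snapshots"]:
        for index, info in snap.get("indices", {}).items():
            for shard_id, shard in info.get("shards", {}).items():
                if shard.get("stage") == "FAILURE":
                    failures.append((index, shard_id, shard.get("reason")))
    return failures
```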

@jpcarey (Contributor, Author) commented Mar 30, 2017

@abeyad I re-ran the steps I provided (without x-pack) and still get the error on 5.2.2 (fresh untar). Reading the error, it is complaining about index test1, which is odd. I went back and made sure to add documents to the index, in case it was an issue with an empty index; same results.

```
macOS Sierra 10.12.3 (16D32)
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
```

```
curl 'localhost:9200/_snapshot/my_backup/snapshot_2/_status?pretty'
```

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_shard_restore_failed_exception",
        "reason" : "failed to read shard snapshot file",
        "index_uuid" : "RnhkQinqT4yYodBnq4fARQ",
        "shard" : "0",
        "index" : "test1"
      }
    ],
    "type" : "index_shard_restore_failed_exception",
    "reason" : "failed to read shard snapshot file",
    "index_uuid" : "RnhkQinqT4yYodBnq4fARQ",
    "shard" : "0",
    "index" : "test1",
    "caused_by" : {
      "type" : "no_such_file_exception",
      "reason" : "/Users/jared/tmp/repo_test/indices/uRZ1_CzRQ-eL3LyKwSvHcA/0/snap-ndxheQU0QgixJnHsLBmXJg.dat"
    }
  },
  "status" : 500
}
```

@abeyad commented Mar 30, 2017

@jpcarey I reproduced the problem. If the snapshot contains only "bad" indices, getting its status works fine; if it contains a mix of good and bad indices, I get the same error you did.

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Mar 30, 2017
If a snapshot is taken on multiple indices, and some of them are "good"
indices that don't contain any corruption or failures, and some of them
are "bad" indices that contain missing shards or corrupted shards, and
if the snapshot request is set to partial=false (meaning don't take a
snapshot if there are any failures), then the good indices will not be
snapshotted either.  Previously, when getting the status of such a
snapshot, a 500 error would be thrown, because the snap-*.dat blob for
the shards in the good index could not be found.

This commit fixes the problem by reporting shards of good indices as
failed due to a failed snapshot, instead of throwing the
NoSuchFileException.

Closes elastic#23716
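The gist of the fix described in the commit message can be sketched schematically (this is illustrative Python, not the actual Java change): when the snapshot as a whole is FAILED, a missing shard blob is reported as a per-shard failure instead of letting the NoSuchFileException escape as a 500. `read_shard_blob` is a hypothetical loader for a shard's snap-*.dat blob.

```python
def shard_status(snapshot_state, read_shard_blob, shard):
    """Schematic sketch of the fix (hypothetical names throughout)."""
    try:
        return {"stage": "DONE", "stats": read_shard_blob(shard)}
    except FileNotFoundError:
        if snapshot_state == "FAILED":
            # The blob was never written because the snapshot failed;
            # report a shard failure rather than a server error.
            return {"stage": "FAILURE", "reason": "snapshot failed"}
        raise  # genuinely unexpected: let it propagate
```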
abeyad pushed a commit that referenced this issue Apr 7, 2017
abeyad pushed a commit that referenced this issue Apr 7, 2017