
failed snapshot _status returns 500 #23716

Closed
jpcarey opened this issue Mar 23, 2017 · 9 comments · Fixed by #23833
Labels: >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

@jpcarey (Contributor) commented Mar 23, 2017

When a snapshot fails, `GET _snapshot/<repo>/<snapshot>/_status` returns a 500 error. It seems the only way to fetch the actual "FAILED" state is by listing `GET _snapshot/<repo>/_all`. To me, the 500 exception returned when calling `_status` seems wrong.

Elasticsearch version: 5.2.2

Plugins installed: x-pack

```
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "compress": true,
    "location": "repo_test"
  }
}

PUT test1

PUT /_snapshot/my_backup/snapshot_1
{
  "indices": "test1",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET _snapshot/my_backup/snapshot_1/_status
```

Response (truncated):

```json
{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "repository": "my_backup",
      "uuid": "8KxZ0zSlQFyh77dqvxc3Mw",
      "state": "SUCCESS",

}]}
```

Make a "bad" index:

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

PUT test2

PUT /_snapshot/my_backup/snapshot_2
{
  "indices": "test1,test2",
  "ignore_unavailable": true,
  "include_global_state": false
}

GET _snapshot/my_backup/snapshot_2/_status
```

Response:

```json
{
  "error": {
    "root_cause": [
      {
        "type": "index_shard_restore_failed_exception",
        "reason": "failed to read shard snapshot file",
        "index_uuid": "_f7dq3AMSEejQMZF4sbqYA",
        "shard": "0",
        "index": "test1"
      }
    ],
    "type": "index_shard_restore_failed_exception",
    "reason": "failed to read shard snapshot file",
    "index_uuid": "_f7dq3AMSEejQMZF4sbqYA",
    "shard": "0",
    "index": "test1",
    "caused_by": {
      "type": "no_such_file_exception",
      "reason": "/Users/jared/tmp/repo_test/indices/5H7x7fA-QsK7xqs6MdO0Bw/0/snap-2XWQ_Sd4QMCdSo1wU4VkoA.dat"
    }
  },
  "status": 500
}
```

```
GET /_snapshot/my_backup/_all?filter_path=*.snapshot,*.state
```

Response:

```json
{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "state": "SUCCESS"
    },
    {
      "snapshot": "snapshot_2",
      "state": "FAILED"
    }
  ]
}
```
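The fallback shown above (fall back to the repository listing when `_status` returns a 500) can be sketched as a small client-side helper. This is a hypothetical illustration, not part of Elasticsearch; `fetch` stands in for whatever HTTP client you use and is assumed to return a `(status_code, body_text)` pair.

```python
import json

def snapshot_state(fetch, repo, snapshot):
    """Return a snapshot's state, working around the 500 from _status.

    `fetch(path)` is a hypothetical HTTP helper returning
    (status_code, body_text) for a GET against the cluster.
    """
    status_code, body = fetch(f"_snapshot/{repo}/{snapshot}/_status")
    if status_code == 200:
        return json.loads(body)["snapshots"][0]["state"]
    # _status blew up (the 500 in this issue), but listing the
    # repository still reports the snapshot as FAILED.
    status_code, body = fetch(f"_snapshot/{repo}/_all")
    if status_code != 200:
        return None
    for snap in json.loads(body)["snapshots"]:
        if snap["snapshot"] == snapshot:
            return snap["state"]
    return None
```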
@javanna added the :Distributed/Snapshot/Restore label Mar 24, 2017
@javanna (Member) commented Mar 24, 2017

This sounds like a legit request to me. @imotov, what do you think?

@abeyad commented Mar 24, 2017

I agree, the _status endpoint for a failed snapshot should return information about the failure in a standard response, not a 500.

@javanna (Member) commented Mar 24, 2017

Thanks @abeyad! I will mark it adoptme then.

@imotov (Contributor) commented Mar 24, 2017

@abeyad that feels like a bug, not an enhancement. What do you think?

@abeyad commented Mar 24, 2017

@imotov agreed, I'll change the label.

@abeyad self-assigned this Mar 24, 2017
@javanna (Member) commented Mar 24, 2017

++ thanks for taking it @abeyad

@abeyad commented Mar 30, 2017

@jpcarey the steps you outlined above do not reproduce for me on 5.2.2. Instead, for

```
curl -XGET "localhost:9200/_snapshot/fs_repo/snap1"
```

I get:

```json
{
  "snapshots" : [
    {
      "snapshot" : "snap1",
      "uuid" : "iTxr6rgSQMqjGOEOtk1C3g",
      "version_id" : 5020299,
      "version" : "5.2.2",
      "indices" : [
        "idx2"
      ],
      "state" : "FAILED",
      "reason" : "Indices don't have primary shards [idx2]",
      "start_time" : "2017-03-30T17:25:56.191Z",
      "start_time_in_millis" : 1490894756191,
      "end_time" : "2017-03-30T17:25:56.199Z",
      "end_time_in_millis" : 1490894756199,
      "duration_in_millis" : 8,
      "failures" : [
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 3,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 2,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 4,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 0,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "idx2",
          "index_uuid" : "idx2",
          "shard_id" : 1,
          "reason" : "primary shard is not allocated",
          "status" : "INTERNAL_SERVER_ERROR"
        }
      ],
      "shards" : {
        "total" : 5,
        "failed" : 5,
        "successful" : 0
      }
    }
  ]
}
```

For getting the status:

```
curl -XGET "localhost:9200/_snapshot/fs_repo/snap1/_status"
```

I get:

```json
{
  "snapshots" : [
    {
      "snapshot" : "snap1",
      "repository" : "fs_repo",
      "uuid" : "iTxr6rgSQMqjGOEOtk1C3g",
      "state" : "FAILED",
      "shards_stats" : {
        "initializing" : 0,
        "started" : 0,
        "finalizing" : 0,
        "done" : 0,
        "failed" : 5,
        "total" : 5
      },
      "stats" : {
        "number_of_files" : 0,
        "processed_files" : 0,
        "total_size_in_bytes" : 0,
        "processed_size_in_bytes" : 0,
        "start_time_in_millis" : 0,
        "time_in_millis" : 0
      },
      "indices" : {
        "idx2" : {
          "shards_stats" : {
            "initializing" : 0,
            "started" : 0,
            "finalizing" : 0,
            "done" : 0,
            "failed" : 5,
            "total" : 5
          },
          "stats" : {
            "number_of_files" : 0,
            "processed_files" : 0,
            "total_size_in_bytes" : 0,
            "processed_size_in_bytes" : 0,
            "start_time_in_millis" : 0,
            "time_in_millis" : 0
          },
          "shards" : {
            "0" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "1" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "2" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "3" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            },
            "4" : {
              "stage" : "FAILURE",
              "stats" : {
                "number_of_files" : 0,
                "processed_files" : 0,
                "total_size_in_bytes" : 0,
                "processed_size_in_bytes" : 0,
                "start_time_in_millis" : 0,
                "time_in_millis" : 0
              },
              "reason" : "primary shard is not allocated"
            }
          }
        }
      }
    }
  ]
}
```
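When `_status` does return a body like the one above, the per-shard failure reasons can be extracted programmatically. A minimal sketch, assuming the response shape shown above (this helper is illustrative, not an Elasticsearch client API):

```python
import json

def failed_shards(status_body):
    """Collect (index, shard_id, reason) for every FAILURE-stage shard
    in a GET _snapshot/<repo>/<snapshot>/_status response body."""
    failures = []
    for snap in json.loads(status_body)["snapshots"]:
        for index, info in snap.get("indices", {}).items():
            for shard_id, shard in info.get("shards", {}).items():
                if shard.get("stage") == "FAILURE":
                    failures.append((index, shard_id, shard.get("reason")))
    return failures
```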

@jpcarey (Contributor, Author) commented Mar 30, 2017

@abeyad I re-ran the steps I provided (without x-pack) and still get the error on 5.2.2 (fresh untar). Reading the error, it is complaining about index test1, which is odd. I went back and made sure to add documents to the index, in case it was an issue with an empty index; same results.

```
macOS Sierra 10.12.3 (16D32)
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
```

```
curl 'localhost:9200/_snapshot/my_backup/snapshot_2/_status?pretty'
```

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "index_shard_restore_failed_exception",
        "reason" : "failed to read shard snapshot file",
        "index_uuid" : "RnhkQinqT4yYodBnq4fARQ",
        "shard" : "0",
        "index" : "test1"
      }
    ],
    "type" : "index_shard_restore_failed_exception",
    "reason" : "failed to read shard snapshot file",
    "index_uuid" : "RnhkQinqT4yYodBnq4fARQ",
    "shard" : "0",
    "index" : "test1",
    "caused_by" : {
      "type" : "no_such_file_exception",
      "reason" : "/Users/jared/tmp/repo_test/indices/uRZ1_CzRQ-eL3LyKwSvHcA/0/snap-ndxheQU0QgixJnHsLBmXJg.dat"
    }
  },
  "status" : 500
}
```

@abeyad commented Mar 30, 2017

@jpcarey I reproduced the problem. If the snapshot contains only "bad" indices, getting its status works fine; if it contains a mix of good and bad indices, I get the same error you did.

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Mar 30, 2017
If a snapshot is taken on multiple indices, and some of them are "good"
indices that don't contain any corruption or failures, and some of them
are "bad" indices that contain missing shards or corrupted shards, and
if the snapshot request is set to partial=false (meaning don't take a
snapshot if there are any failures), then the good indices will not be
snapshotted either.  Previously, when getting the status of such a
snapshot, a 500 error would be thrown, because the snap-*.dat blob for
the shards in the good index could not be found.

This commit fixes the problem by reporting shards of good indices as
failed due to a failed snapshot, instead of throwing the
NoSuchFileException.

Closes elastic#23716
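The gist of the fix described in the commit message can be sketched schematically (this is illustrative Python, not the actual Java change): when the snapshot as a whole is FAILED, a missing shard blob is reported as a per-shard failure instead of letting the NoSuchFileException escape as a 500. `read_shard_blob` is a hypothetical loader for a shard's snap-*.dat blob.

```python
def shard_status(snapshot_state, read_shard_blob, shard):
    """Schematic sketch of the fix (hypothetical names throughout)."""
    try:
        return {"stage": "DONE", "stats": read_shard_blob(shard)}
    except FileNotFoundError:
        if snapshot_state == "FAILED":
            # The blob was never written because the snapshot failed;
            # report a shard failure rather than a server error.
            return {"stage": "FAILURE", "reason": "snapshot failed"}
        raise  # genuinely unexpected: let it propagate
```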
abeyad pushed a commit that referenced this issue Apr 7, 2017
abeyad pushed a commit that referenced this issue Apr 7, 2017