Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an API to locate unrecovered shards and their state #11545

Closed
wants to merge 6 commits into from

Conversation

areek
Copy link
Contributor

@areek areek commented Jun 8, 2015

This API provides store information for shard copies of indices.
Store information reports on which nodes shard copies exist, the shard
copy version, indicating how recent they are, and any exceptions
encountered while opening the shard index or from earlier engine failure.

By default, only lists store information for shards that have at least one
unallocated copy. When the cluster health status is yellow, this will list
store information for shards that have at least one unassigned replica.
When the cluster health status is red, this will list store information
for shards, which has unassigned primaries.

Endpoints include shard stores information for a specific index, several
indices, or all:

curl -XGET 'http://localhost:9200/test/_shard_stores'
curl -XGET 'http://localhost:9200/test1,test2/_shard_stores'
curl -XGET 'http://localhost:9200/_shard_stores'

The scope of shards to list store information can be changed through
status param. Defaults to 'yellow' and 'red'. 'yellow' lists store information of
shards with at least one unassigned replica and 'red' for shards with unassigned
primary shard.
Use 'green' to list store information for shards with all assigned copies.

curl -XGET 'http://localhost:9200/_shard_stores?status=green'

Response:

The shard stores information is grouped by indices and shard ids.

{
    ...
   "0": { <1>
        "stores": [ <2>
            {
                "sPa3OgxLSYGvQ4oPs-Tajw": { <3>
                    "name": "node_t0",
                    "transport_address": "local[1]",
                    "attributes": {
                        "enable_custom_paths": "true",
                        "mode": "local"
                    }
                },
                "version": 4, <4>
                "allocation" : "primary" | "replica" | "unused", <6>
                "store_exception": ... <5>
            },
            ...
        ]
   },
    ...
}

<1> The key is the corresponding shard id for the store information
<2> A list of store information for all copies of the shard
<3> The node information that hosts a copy of the store, the key
is the unique node id.
<4> The version of the store copy
<5> The status of the store copy, whether it is used as a
primary, replica or not used at all
<6> Any exception encountered while opening the shard index or
from earlier engine failure

closes #10952

@areek
Copy link
Contributor Author

areek commented Jun 8, 2015

@s1monw This is still a WIP in terms of documentation and testing, would appreciate a review.

this.metaData = metaData;
this.listener = listener;
this.expectedOps = expectedOps;
this.opsCount = new AtomicInteger(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a class called CountDown.java for this - check it out it might make things simpler her

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer, switched to using CountDown

@s1monw
Copy link
Contributor

s1monw commented Jun 9, 2015

wow @areek this looks pretty awesome. I left some comments

@areek
Copy link
Contributor Author

areek commented Jun 10, 2015

Thanks for the review @s1monw! Addressed all the comments.

Was wondering if there are any tests that I can look at to get the cluster to have a bunch of unassigned nodes (currently just stoping random nodes)?
I see CorruptedFileTest#corruptRandomPrimaryFile, was wondering if there is a easier way to test out the corruption exception for the response?

Set<String> requestedIndices = new HashSet<>();
requestedIndices.addAll(Arrays.asList(request.indices()));
List<ShardId> shardIdsToFetch = new ArrayList<>();
for (MutableShardRouting shard : Iterables.concat(routingNodes.unassigned(), routingNodes.ignoredUnassigned())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should be able to use state.routingTable().shardsWithState(ShardRoutingState.UNASSIGNED);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to using state.routingTable().shardsWithState(ShardRoutingState.UNASSIGNED)

@bleskes
Copy link
Contributor

bleskes commented Jun 10, 2015

@areek I think this as awesome api. Left some comments here and there. I think we need to beef up the tests to check for the actual content for the shard responses (check it finds stuff and check that it detects corruption etc.) . I'll respond to the naming part on the ticket..

indexShardsBuilder.put(res.shardId.id(), shardStatuses);
shardsResponseBuilder.put(res.shardId.getIndex(), indexShardsBuilder.build());
for (FailedNodeException failure : res.failures) {
failureBuilder.add(failure);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we loose the information about which shard has failed. Should we wrap it in DefaultShardOperationFailedException ?

@bleskes
Copy link
Contributor

bleskes commented Jun 11, 2015

@areek change looks good. Did you see my comment about beefing up the testing?

@areek areek force-pushed the enhancement/10952 branch 2 times, most recently from 3848ce1 to 9f8d224 Compare June 23, 2015 01:56
@areek
Copy link
Contributor Author

areek commented Jun 23, 2015

@bleskes @clintongormley, I have updated the description with the new API, thoughts? It turned out to be a bit different from what we have discussed before, in terms of default behaviour. It would be good to have this reviewed.

/**
* Status used to choose shards to get store information on
*/
public enum Status {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this? Can't we use ClusterHealthStatus ? now that we have the EnumSet , we don't need ALL anymore...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, we now use ClusterHealthStatus

@bleskes
Copy link
Contributor

bleskes commented Jul 14, 2015

I left some final minor comments. I think we are getting close!

@areek
Copy link
Contributor Author

areek commented Jul 14, 2015

@bleskes Thanks for the review, addressed all your comments

@@ -0,0 +1,62 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left over IndicesShard_S_ ..

@bleskes
Copy link
Contributor

bleskes commented Jul 15, 2015

Left some final suggestions. Thx @areek

@areek
Copy link
Contributor Author

areek commented Jul 15, 2015

@bleskes Thanks for the review, updated the PR addressing all your comments.

@bleskes
Copy link
Contributor

bleskes commented Jul 16, 2015

LGTM. Left some very minor comment. Thx for all the hard word 👍

@areek
Copy link
Contributor Author

areek commented Jul 16, 2015

merged to master 7a21d84

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API for locating unrecovered shard copies and their state
4 participants