
Fail replica shards locally upon failures #5847

Closed
wants to merge 2 commits

Conversation

@bleskes (Contributor) commented Apr 17, 2014

When a replication operation (index/delete/update) fails to be executed properly, we fail the replica and allow the master to allocate a new copy of it. At the moment, the node hosting the primary shard is responsible for notifying the master of a failed replica. However, if the replica shard is initializing (POST_RECOVERY state), we have a race condition between the failed shard message and moving the shard into the STARTED state. If the latter happens first, the master will fail to resolve the failed shard message.

This PR builds on #5800 and fails the engine of the replica shard if a replication operation fails. This protects us against the above, as the shard will reject the STARTED command from the master. It also makes us more resilient to other race conditions in this area.

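For illustration only, here is a minimal, self-contained sketch of the idea using simplified stand-in types (ShardEngine, ReplicaOperationHandler) rather than the actual Elasticsearch classes touched by this PR; the real diff hunk under review follows below.

    interface ShardEngine {
        // Failing the engine closes the shard locally, so a later STARTED command
        // from the master is rejected instead of racing the failed-shard message
        // sent by the node hosting the primary.
        void fail(String reason, Throwable cause);
    }

    final class ReplicaOperationHandler {
        private final ShardEngine engine;
        private final String transportAction;

        ReplicaOperationHandler(ShardEngine engine, String transportAction) {
            this.engine = engine;
            this.transportAction = transportAction;
        }

        // Exceptions that do not indicate a broken shard (e.g. the shard is already
        // closed) should not fail the replica; this check is a placeholder for the
        // sketch, not the real ignoreReplicaException logic.
        private boolean ignoreReplicaException(Throwable t) {
            return t instanceof IllegalStateException;
        }

        void failReplicaIfNeeded(String index, int shardId, Throwable t) {
            if (!ignoreReplicaException(t)) {
                engine.fail("failed to perform [" + transportAction + "] on replica ["
                        + index + "][" + shardId + "]", t);
            }
        }
    }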

private void failReplicaIfNeeded(String index, int shardId, Throwable t) {
    if (!ignoreReplicaException(t)) {
        logger.warn("Failed to perform " + transportAction + " on replica [" + index + "][" + shardId + "]. failing shard.", t);
Review comment from a Member:

we end up double logging warnings, no? The first here, and the second when failing the engine. I think it's enough to log a warning when failing the engine later.

Reply from a Contributor:

I tend to agree but I think we should log that we executed this as debug?

@s1monw (Contributor) commented Apr 17, 2014

one small comment but otherwise LGTM

@bleskes (Contributor, Author) commented Apr 18, 2014

I pushed another commit with the log message removed. I adapted the reason (which is logged by the shard failure) to include the information that was missing. I decided in the end not to add debug logging, as there is no logic and hardly any code between here and where we log it. If anyone feels strongly about it, I'll happily add it.
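As a purely hypothetical illustration (the exact wording of the committed reason string is not shown in this thread), folding that information into the failure reason could look like:

    // Hypothetical sketch only: the context that previously went into a separate
    // logger.warn(...) call is carried in the reason string, which the
    // shard-failure path logs exactly once.
    final class ReplicaFailureReason {
        static String of(String transportAction, String index, int shardId, Throwable t) {
            return "failed to perform [" + transportAction + "] on replica ["
                    + index + "][" + shardId + "], message [" + t.getMessage() + "]";
        }
    }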

@s1monw (Contributor) commented Apr 18, 2014

LGTM

bleskes closed this in 12bbe28 on Apr 18, 2014
bleskes added a commit that referenced this pull request Apr 18, 2014

Closes #5847
@bleskes (Contributor, Author) commented Apr 18, 2014

thx @s1monw @kimchy. pushed.

bleskes deleted the enhance/local_fail_shard branch on April 18, 2014 17:01
clintongormley added the :Distributed/Recovery label on Jun 7, 2015
Labels
:Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.), >enhancement, v1.2.0, v2.0.0-beta1

4 participants