Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a timeout to local mapping change check #9575

Closed
wants to merge 2 commits into from

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Feb 5, 2015

After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to indices.recovery.internal_action_timeout) and upgrade the task urgency to IMMEDIATE. If the check times out , we fail the recovery.

Note that is PR is against 1.4 .

After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long.  This commit adds a time (equal to  `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`
@bleskes bleskes added v1.4.3 v1.5.0 v2.0.0-beta1 >regression resiliency :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. review and removed >regression labels Feb 5, 2015
@s1monw
Copy link
Contributor

s1monw commented Feb 5, 2015

I think the timeout is ok to add but returning as if everything is ok is wrong. IMO we need to fail this recovery altogether until we can run this check successfully.

@bleskes
Copy link
Contributor Author

bleskes commented Feb 5, 2015

@s1monw @kimchy pushed an update

@s1monw
Copy link
Contributor

s1monw commented Feb 6, 2015

LGTM thanks boaz

bleskes added a commit that referenced this pull request Feb 6, 2015
After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long.  This commit adds a timeout (equal to  `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery.

Closes #9575
bleskes added a commit that referenced this pull request Feb 6, 2015
After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery.

Closes #9575
@bleskes bleskes closed this in 2302222 Feb 6, 2015
@bleskes bleskes deleted the recovery_mapping_check branch February 6, 2015 09:09
bleskes added a commit that referenced this pull request Feb 9, 2015
After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery.

Closes #9575
@s1monw s1monw added v1.3.8 and removed review labels Feb 10, 2015
@clintongormley clintongormley changed the title Recovery: add a timeout to local mapping change check Add a timeout to local mapping change check Jun 7, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long.  This commit adds a timeout (equal to  `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery.

Closes elastic#9575
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to `indices.recovery.internal_action_timeout`) and upgrade the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery.

Closes elastic#9575
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement resiliency v1.3.8 v1.4.3 v1.5.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants