
Cleanup TransportReplicationAction #12395

Closed

Conversation

martijnvg
Member

  • Split the actual primary operation from the primary phase into a dedicated AsyncPrimaryAction class (similar to AsyncReplicaAction); a rough sketch of this split follows this list.
  • Removed the threaded_operation option from replication based transport actions.
  • Let the threading be handled by the transport service and drop the forking of new threads from the transport replication action.
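
An illustrative-only sketch of the intended structure, using made-up names (ShardOperation and this simplified AsyncPrimaryAction are not the real Elasticsearch classes): the phase code decides where and when to run, while the dedicated async action only performs the shard-level operation on whatever thread the transport layer already chose, so it never forks a thread itself.

```java
import java.util.function.Consumer;

// Hypothetical, simplified types for illustration only; not the real Elasticsearch code.
interface ShardOperation<Request, Response> {
    Response perform(Request request) throws Exception;
}

final class AsyncPrimaryAction<Request, Response> {

    private final ShardOperation<Request, Response> primaryOperation;

    AsyncPrimaryAction(ShardOperation<Request, Response> primaryOperation) {
        this.primaryOperation = primaryOperation;
    }

    // Runs on the calling thread: the transport service picks the executor,
    // so this class never spawns a new thread of its own.
    void run(Request request, Consumer<Response> onResponse, Consumer<Exception> onFailure) {
        try {
            onResponse.accept(primaryOperation.perform(request));
        } catch (Exception e) {
            onFailure.accept(e);
        }
    }
}
```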

@martijnvg martijnvg force-pushed the cleanup_transport_replication_action branch 8 times, most recently from 1a0c33f to 680e9da Compare July 23, 2015 09:12

});
} else {
if (replicaRequest.operationThreaded()) {
Contributor


so happy that you remove this part!

@martijnvg
Member Author

@brwe I've fixed the test and made sure that the newly added ReplicationRequest#internalShardRouting is always used in the test.

@martijnvg martijnvg force-pushed the cleanup_transport_replication_action branch from 7412f68 to 5a96d4d Compare August 4, 2015 15:08
@martijnvg
Member Author

I rebased this PR and moved the retry primary handling entirely to the coordinating node.

Also, I included @brwe's test from #12574, which passes with this PR. The test simulates a situation where two nodes endlessly redirect the same index request to each other, caused by the sequence of events described in #12573.

The test is included because this PR changes how primary write requests are redirected. The coordinating node now remains in charge of the entire write operation, even if the primary shard is on a different node than the coordinating node, whereas before the node holding the primary shard was always in charge of a write request. In the situation described in #12573, the old behaviour could lead to nodes endlessly redirecting the same write request to each other until the cluster state caught up or something bad happened. With this PR, in the #12573 situation the write request is retried when a new cluster state arrives or the write timeout is reached.
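
A minimal, illustrative-only sketch of that retry behaviour, with hypothetical types and method names (CoordinatingNodeWriteRouter, resolvePrimaryNode, onNextClusterState and so on are not the real Elasticsearch API): the coordinating node keeps ownership of the write, and when no usable primary can be resolved it waits for the next observed cluster state instead of bouncing the request between nodes, failing once the timeout is reached.

```java
import java.util.function.Consumer;

// Hypothetical sketch; not the real Elasticsearch implementation.
abstract class CoordinatingNodeWriteRouter<Request> {

    /** Resolve the node id holding the active primary in the given cluster state version, or null if unknown. */
    protected abstract String resolvePrimaryNode(long clusterStateVersion, Request request);

    /** Forward the request to the node holding the primary shard. */
    protected abstract void sendToPrimary(String nodeId, Request request);

    /** Register a one-shot listener that fires with the version of the next observed cluster state. */
    protected abstract void onNextClusterState(Consumer<Long> listener);

    /** Fail the request back to the client. */
    protected abstract void fail(Request request, String reason);

    public void route(Request request, long clusterStateVersion, long deadlineNanos) {
        String primaryNode = resolvePrimaryNode(clusterStateVersion, request);
        if (primaryNode != null) {
            sendToPrimary(primaryNode, request);
            return;
        }
        if (System.nanoTime() >= deadlineNanos) {
            fail(request, "timed out waiting for an active primary shard");
            return;
        }
        // Retry only once a newer cluster state has been observed on the coordinating
        // node, so two nodes can never endlessly redirect the same request to each other.
        onNextClusterState(newVersion -> route(request, newVersion, deadlineNanos));
    }
}
```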

try {
channel.sendResponse(e);
} catch (Throwable t) {
logger.warn("failed to send response for get", t);
Contributor


for write request?

Member Author


yes :)

@brwe
Contributor

brwe commented Aug 5, 2015

Overall I really like this split of primary operation and primary phase. I think this might make it much easier to understand what is happening. I left some comments and questions around testing and where we should retry though. Hope they are not too confusing...

@martijnvg martijnvg force-pushed the cleanup_transport_replication_action branch from 5a96d4d to dc6c053 Compare August 10, 2015 16:22
@martijnvg martijnvg force-pushed the cleanup_transport_replication_action branch 3 times, most recently from 9c0d06a to 12db140 Compare August 19, 2015 07:19
…cated AsyncPrimaryAction class (similar to AsyncReplicaAction)

Removed threaded_operation option from replication based transport actions.
Let the threading be handled by the transport service and drop the forking of new threads from the transport replication action.
…ccurs but retry on the node holding the primary shard
…ally then use the transport service instead of the threadpool for forking a new thread. This way there is one place where new threads are being spawned.
…uest can get stuck in an endless redirect loop between nodes due to slow cluster state processing.
@martijnvg martijnvg force-pushed the cleanup_transport_replication_action branch from 3a9e901 to 3f64076 Compare August 25, 2015 11:28
@martijnvg
Member Author

@bleskes I brought back the chasing of the primary shard to the PrimaryPhase inner class and delegate to the primary action if the primary shard is local.

final ShardRouting primary = request.internalShardRouting;
// Although this gets executed locally, this is more of an assertion; but if we change the primary action
// to be performed remotely, it is important to perform this check before executing the action:
if (clusterService.localNode().id().equals(primary.currentNodeId()) == false) {
Contributor


do we need to do this? if we fail to find the shard in the indicesService, we'll throw an exception anyhow?

Member Author


True, it kind of made sense with how the PR used to work; I left it because IMO it would be a nice check, but let's just remove it.

@martijnvg
Member Author

The tricky part about this PR is that the cluster state is observed twice: once in the primary phase and once in the async primary action. In cases where we retry, we might miss an update to the cluster state. This would be more likely to occur if the primary action were executed remotely, which was the plan for a follow-up issue, and that can be dangerous. It is fixable if we remember on what cluster state version we decided to execute the primary operation, but that would make this change bigger than planned, so we should maybe take a break from this change and reconsider it post 2.0.
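
A tiny, illustrative-only sketch of that "remember the version" idea, with made-up names (nothing here is the actual Elasticsearch code): the cluster state version the routing decision was based on travels along with the operation, so the side performing the primary operation can detect that it has since applied a newer state and re-resolve instead of acting on stale information.

```java
// Hypothetical sketch; not the real Elasticsearch code.
final class PrimaryExecutionGuard {

    enum Decision { EXECUTE, RE_RESOLVE_ON_NEWER_STATE }

    /**
     * @param versionUsedForRouting cluster state version the routing decision was based on
     * @param locallyAppliedVersion cluster state version currently applied on the executing side
     */
    Decision check(long versionUsedForRouting, long locallyAppliedVersion) {
        // A newer locally applied state means the primary may have moved or failed
        // since the decision was made, so re-resolve instead of executing blindly.
        if (locallyAppliedVersion > versionUsedForRouting) {
            return Decision.RE_RESOLVE_ON_NEWER_STATE;
        }
        return Decision.EXECUTE;
    }
}
```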
