Stop re-trying if session renew fails with 404 #5
@jaapterwoerds We're currently studying these edge cases. Nevertheless, after the configured number of retries, the system stops, right? As you mention, a 404 Not Found error is pointing out that the session no longer exists.
The system stops, that's correct. I would suggest stopping the retry after getting the 404 and logging a more specific message.
It makes sense. OTOH, we need to check what the behaviour of the underlying Akka cluster would be in that case, especially in the face of a quick restart of the node.
Indeed, retrying the renewal on a 404 does not make much sense. Could we perhaps fail gracefully by registering a new session, though?
@jaapterwoerds are you sure the TTL really did expire? @juanjovazquez mentioned before that something like this can happen when the consul node associated with the session goes down, which sounds plausible (though I haven't been able to reproduce it in isolation, yet)
This seemed to recently successfully recover:
This was with constructr version 0.9.1-Wehkamp-0.6, which is 0.9.1 with some extra logging: hseeberger/constructr@v0.9.1...wehkamp:v0.9.1-Wehkamp-0.6
@raboof I checked the logs for the service I am talking about. In two cases the refresh failed with a 404 while the last successful refresh was 30 seconds earlier. The service was running with constructr version 0.11.1 and default settings (ttlFactor 2.0):

March 17th 2016, 16:40:49.164 Terminating the system, because constructr-machine has terminated!

I also see some consul logging around that time, but I am not sure whether I can relate it to this issue:

March 17th 2016, 16:40:53.000 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state
In any case, it seems we could improve the resiliency of constructr if we create a new session when the session refresh fails with a 404 and there are still retries left. |
What you propose means virtually coming back to the AddingSelf state. What I wonder now is why you are getting the 404 response in the first place? The session TTL should be higher than the lapse between retries. In that case, is the key leaving the registry?
We are still investigating why exactly we get the 404 response from consul. There is no definitive conclusion yet, I am afraid. I suspect it has to do with the serfHealth check failures that are logged by consul. To come back to the original issue, I would suggest making a change that distinguishes between permanent failures (e.g. the 404 response) and transient failures (which can be retried). This distinction needs to be made in ConstructR itself (I will create an issue for this). The consul coordinator implementation can then signal a permanent failure when it encounters a 404 on session renewal. In this proposal, both permanent failures and transient failures which have exhausted the number of retries will result in the termination of the actor system. The debate on whether to change the behaviour on permanent failures, for instance to go back to state AddingSelf, is then a separate matter.
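As a rough illustration of the proposed distinction, here is a self-contained Scala sketch; the CoordinationFailure hierarchy and the retry helper are hypothetical names, not part of ConstructR:

```scala
import scala.annotation.tailrec

// Hypothetical failure classification: only transient failures are retried.
sealed trait CoordinationFailure
case object Transient extends CoordinationFailure // e.g. a timeout; retrying may help
case object Permanent extends CoordinationFailure // e.g. 404 on session renew; retrying cannot help

// Retry transient failures up to `attemptsLeft` times, but give up
// immediately on the first permanent failure.
@tailrec
def retry[A](attemptsLeft: Int)(op: () => Either[CoordinationFailure, A]): Either[CoordinationFailure, A] =
  op() match {
    case Left(Transient) if attemptsLeft > 1 => retry(attemptsLeft - 1)(op)
    case other                               => other
  }
```

With this shape, a 404 mapped to Permanent stops the loop after a single attempt, while timeouts mapped to Transient still consume the configured number of retries.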
@jaapterwoerds @raboof After checking this in the ConstructR project, may I close this issue? Thanks.
We have also experienced the exact same issue. When I checked the list of active sessions in consul, the incriminated session was no longer there. The TTL is set to 60s and the refresh interval left to the default value, 30s.

03:29:01.806 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.809 [system-akka.actor.default-dispatcher-15] INFO a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
09:16:31.116 [system-akka.actor.default-dispatcher-2] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:34.168 [system-akka.actor.default-dispatcher-3] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.226 [system-akka.actor.default-dispatcher-4] WARN d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.231 [system-akka.actor.default-dispatcher-18] ERROR d.h.c.machine.ConstructrMachine - Number of retries exhausted in Refreshing!
09:16:37.234 [system-akka.actor.default-dispatcher-4] ERROR d.h.constructr.akka.Constructr - Terminating the system, because constructr-machine has terminated!
09:16:37.241 [system-akka.actor.default-dispatcher-2] INFO a.r.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
09:16:37.242 [system-akka.actor.default-dispatcher-2] INFO a.r.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
It might be related to hashicorp/consul#1033; our service is running in a docker container, and by digging into the consul agent logs I found that the agent at one point was not able to contact the service for the health check. That might have triggered consul to mark the node as down. According to the consul documentation (https://www.consul.io/docs/internals/sessions.html), the session attached to the node is invalidated when any of its health checks go to the critical state.
@gerson24 I have got hseeberger/constructr#120 merged in. At least now, in the event of a 404, the node will gracefully leave the cluster. Any suggestion on how to handle the 404 issue?
Since it appears that this issue continues, I'll reopen it to better think about how to handle it properly. @berardino thanks for the feedback. |
@berardino We've discussed this internally and reached the conclusion that ConstructR should not be responsible for fixing a Consul-related issue. IMHO ConstructR is already a resilient service, as it lets you set a certain number of retries. After those retries, and having left the Akka cluster (thanks to your PR on ConstructR), the best way to go would be to let the node stop working and let the cluster scheduler, or whatever machinery you have in place, deal with the issue, for instance by restarting the node or alerting about the problem. Does that make sense?
Yeah, I also think that this should be solved within the boundaries of Consul. I made some changes to ConsulCoordination.refresh so that, in the event of a NotFound response, instead of continuing to retry the refresh, if the current node is still an Up member of the cluster, it will just invoke addSelf. That, however, failed with:

Failure in AddingSelf, going to RetryScheduled/AddingSelf: java.util.NoSuchElementException: Future.filter predicate is not satisfied

So basically ConsulCoordination is not even able to create a new session. After adding more logging I found that the createSession method is failing with:
Sometimes, after a couple of retries, this error goes away (the consul agent serf check reported an OK status) and ConsulCoordination is able to recover by creating a new session. So what I think is going on here is that the consul agent randomly reports a false positive on the serf check. The Consul server believes that the service is down and therefore invalidates the sessions associated with the node. So, if you agree, I can open a PR with the changes I have implemented that make ConsulCoordination a little more resilient to faulty consul behaviour.
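A minimal sketch of that fallback, in plain Scala, with an in-memory stand-in for Consul's session store (SessionStore and the NotFound exception are illustrative names, not the real ConsulCoordination API):

```scala
import scala.collection.mutable

// Thrown when renewing a session that Consul no longer knows about (the 404 case).
final case class NotFound(sessionId: String)
    extends RuntimeException(s"session $sessionId not found")

// In-memory stand-in for Consul's session endpoints.
class SessionStore {
  private val sessions = mutable.Set.empty[String]
  private var counter  = 0
  def create(): String = { counter += 1; val id = s"session-$counter"; sessions += id; id }
  def renew(id: String): Unit = if (!sessions.contains(id)) throw NotFound(id)
  def invalidate(id: String): Unit = sessions -= id // what Consul does on a critical serfHealth check
}

// On a failed renew (404), register a fresh session instead of retrying the dead one.
def refresh(store: SessionStore, sessionId: String): String =
  try { store.renew(sessionId); sessionId }
  catch { case NotFound(_) => store.create() }
```

The point of the sketch is only the control flow: a renew of an invalidated session cannot succeed, so the recovery path creates a replacement session rather than retrying.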
The consul agent can report a false negative serfHealth status. When that happens, the consul server removes the faulty member. This has two consequences: 1 - all the sessions associated with the node are deleted; 2 - the creation of a new session might still fail with 500 Internal Server Error, Check 'serfHealth' is in critical state. This commit is an attempt to be more resilient in these cases and keep trying to create a new session. If it succeeds, the system will keep working as if nothing happened; otherwise, after the maximum number of retries, the constructr machine will terminate the akka system. The default ConstructrMachine configuration doesn't give consul enough time to recover under these circumstances. I suggest changing the configuration for production environments to something like:

coordination-timeout = 10 seconds
nr-of-retries = 10
refresh-interval = 60 seconds
retry-delay = 10 seconds
ttl-factor = 5.0
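Applied in an application.conf, those suggested values would look roughly like the following HOCON fragment (assuming they live under the top-level constructr section; check the reference.conf of your ConstructR version for the exact key names):

```hocon
constructr {
  coordination-timeout = 10 seconds
  nr-of-retries        = 10
  refresh-interval     = 60 seconds
  retry-delay          = 10 seconds
  ttl-factor           = 5.0
}
```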
@berardino, thanks for your PR and comments. However, I don't agree with your proposal of coming back to the AddingSelf state. In any case, your PR changes the basic rules about the possible transitions in ConstructR, so I think you should discuss it with @hseeberger in ConstructR first. This project is meant to provide an alternative to etcd for coordination, and it shouldn't change the basic rules set by ConstructR itself.
@juanjovazquez well, to be precise, ConstructR does not go back to the AddingSelf state. In fact, it is just the consul refresh method that uses the addSelf method to create a new session if it is unable to refresh the existing one. If you want to decouple the two concepts, you might want to consider:

override def addSelf[A: NodeSerialization](self: A, ttl: FiniteDuration) =
  createIfNotExist(self, ttl)

def createIfNotExist[A: NodeSerialization](self: A, ttl: FiniteDuration) = {
  // actual addSelf implementation
}

Then I can change my PR (https://github.com/Tecsisa/constructr-consul/pull/27/files#diff-cfcdf993fe3f04a5f708a6d43c4ce068R160) to invoke createIfNotExist instead of addSelf. We have already deployed the changes and the system has been running just fine. As you can see from the attached logs, this issue can happen very often, and I'd rather try to recover from it than let the system go down and deal with it at the cluster-management level. We are still investigating why this keeps happening (it might be related to docker and the serfCheck reporting a false critical state). We are also considering an alternative solution that uses the consul discovery service to bootstrap the cluster, instead of having ConstructR that can potentially kill our services at any time. But for the time being I think my PR is a good enough solution that mitigates the issue, until we find a better one.
Create a new session on 404 when refreshing (closes #5)
We have seen some cases in which refreshing fails because the session TTL has expired. The application was running with default settings, but somehow the refresh was executed after the TTL on the session had expired (why this can happen is another story altogether). In that particular case you get a 404 when trying to renew the session. The current behaviour is to retry the refresh anyway, although it can never succeed because the session has expired.
Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5.
I would suggest changing the behaviour to stop retrying in this particular case.
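As an illustration, the suggested special-casing could look like this plain-Scala sketch (the isRetryable helper is hypothetical; the real coordination layer inspects the HTTP response from Consul):

```scala
// Classify a failed session-renew response: a 404 means the session is gone,
// so retrying the same renew can never succeed; 5xx errors may be transient.
def isRetryable(statusCode: Int): Boolean = statusCode match {
  case 404           => false // session expired or invalidated; stop retrying
  case c if c >= 500 => true  // server-side error; worth another attempt
  case _             => false
}
```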