
Stop re-trying if session renew fails with 404 #5

Closed
jaapterwoerds opened this issue Mar 18, 2016 · 19 comments
@jaapterwoerds

We have seen some cases in which refreshing fails because the session TTL has expired. The application was running with default settings, but somehow the refresh was executed after the TTL on the session had expired (why this can happen is another story altogether). In that particular case you get a 404 when trying to renew the session. The current behaviour is to retry the refresh anyway, even though it can never succeed because the session has expired.

Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5.

I would suggest changing the behaviour to stop retrying in this particular case.

@juanjovazquez

@jaapterwoerds We're currently studying these edge cases. Nevertheless, after the configured number of retries, the system stops, right? As you mention, a 404 Not Found error indicates that the session is no longer available, so we could stop retrying in that case.

@jaapterwoerds
Author

The system stops, that's correct. I would suggest stopping the retries after getting the 404 and logging a more specific message.

@juanjovazquez

It makes sense. OTOH, we need to check what the behaviour of the underlying Akka cluster would be in that case, especially in the face of a quick restart of the node.

@raboof
Contributor

raboof commented Mar 20, 2016

Indeed, retrying the renewal on a 404 does not make much sense.

Could we perhaps fail gracefully by registering a new session, though?

@raboof
Contributor

raboof commented Mar 20, 2016

@jaapterwoerds are you sure the TTL really did expire? @juanjovazquez mentioned before that something like this can happen when the consul node associated with the session goes down, which sounds plausible (though I haven't been able to reproduce it in isolation yet).

@raboof
Contributor

raboof commented Mar 21, 2016

This recently seemed to recover successfully:

March 18th 2016, 22:15:52.782   DEBUG   31318   Successfully added self, going to RefreshScheduled
March 18th 2016, 22:15:52.782   DEBUG   31318   Transitioning to RefreshScheduled
March 18th 2016, 22:15:52.761   INFO    31318   Created session e8152e05-d0b5-b1bf-96d1-8ba463f41eaf with ttl 1 minute
March 18th 2016, 22:15:52.740   WARN    31318   unhandled event MemberUp(Member(address = akka.tcp://XXX@YYY:31826, status = Up)) in state AddingSelf
March 18th 2016, 22:15:52.739   DEBUG   31318   Transitioning out of Joining
March 18th 2016, 22:15:52.739   DEBUG   31318   Transitioning to AddingSelf
March 18th 2016, 22:15:52.738   DEBUG   31318   Received nodes Vector(akka.tcp://XXX@ZZZ:31796, akka.tcp://XXX@YYY:31826), going to Joining 
March 18th 2016, 22:15:52.738   DEBUG   31318   Transitioning to Joining
March 18th 2016, 22:15:52.738   INFO    31318   Cluster Node [akka.tcp://XXX@WWW:31318] - Trying to join seed nodes [akka.tcp://XXX@ZZZ:31796, akka.tcp://XXX@YYY:31826] when already part of a cluster, ignoring
March 18th 2016, 22:15:52.731   DEBUG   31318   Waited for 3000 milliseconds, going to GettingNodes
March 18th 2016, 22:15:52.731   DEBUG   31318   Transitioning to GettingNodes
March 18th 2016, 22:15:49.718   WARN    31318   Failure in Refreshing, going to Retrying: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/5926655c-cd31-bf5b-7e21-d5d58b6e7940
March 18th 2016, 22:15:49.718   DEBUG   31318   Transitioning from Refreshing to Retrying
March 18th 2016, 22:15:49.717   WARN    31318   Failed to refresh session 5926655c-cd31-bf5b-7e21-d5d58b6e7940: status code 404 Not Found, entity Session id '5926655c-cd31-bf5b-7e21-d5d58b6e7940' not found
March 18th 2016, 22:15:49.710   DEBUG   31318   Waited for 30000 milliseconds, going to Refreshing
March 18th 2016, 22:15:49.710   DEBUG   31318   Transitioning to Refreshing

This was with constructr version 0.9.1-Wehkamp-0.6, which is 0.9.1 with some extra logging: hseeberger/constructr@v0.9.1...wehkamp:v0.9.1-Wehkamp-0.6

@jaapterwoerds
Author

@raboof I checked the logs for the service I am talking about. In two cases the refreshing failed with a 404 while the last successful refresh was 30 seconds earlier. The service was running with constructr version 0.11.1 and default settings (ttlFactor 2.0):

March 17th 2016, 16:40:49.164 Terminating the system, because constructr-machine has terminated!
March 17th 2016, 16:40:49.164 Number of retries exhausted in Refreshing!
March 17th 2016, 16:40:49.163 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:49.156 Waited for 3000 milliseconds, going to Refreshing
March 17th 2016, 16:40:49.156 Transitioning to Refreshing
March 17th 2016, 16:40:46.144 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:46.144 Transitioning from Refreshing to RetryScheduled
March 17th 2016, 16:40:46.136 Waited for 3000 milliseconds, going to Refreshing
March 17th 2016, 16:40:46.136 Transitioning to Refreshing
March 17th 2016, 16:40:43.123 Transitioning from Refreshing to RetryScheduled
March 17th 2016, 16:40:43.123 Failure in Refreshing, going to RetryScheduled/Refreshing: de.heikoseeberger.constructr.coordination.Coordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/da36395f-b7a8-198d-07ba-920236d1c4a5
March 17th 2016, 16:40:43.115 Waited for 30000 milliseconds, going to Refreshing
March 17th 2016, 16:40:43.115 Transitioning to Refreshing
March 17th 2016, 16:40:13.104 Transitioning to RefreshScheduled
March 17th 2016, 16:40:13.103 Successfully refreshed, going to RefreshScheduled

I also see some consul logging around that time, but I am not sure whether I can relate it to this issue:

March 17th 2016, 16:40:53.000 [ERR] http: Request PUT /v1/session/create, error: rpc error: rpc error: Check 'serfHealth' is in critical state

@jaapterwoerds
Author

In any case, it seems we could improve the resiliency of constructr by creating a new session when the session refresh fails with a 404 and there are still retries left.

@juanjovazquez

What you propose effectively means going back to the AddingSelf stage in terms of the cluster machine, as the 404 response indicates that the key is no longer available in the KV registry. When the session expires, the key disappears. But if the process goes back to AddingSelf, how can we be sure that the node is still part of the Akka cluster? This leads us to dangerous and problematic coordination issues. ;)

What I wonder now is why you are getting the 404 response in the first place. The session TTL should be higher than the lapse between retries. In that case, why is the key leaving the registry?

@jaapterwoerds
Author

We are still investigating why exactly we get the 404 response from consul. There is no definitive conclusion yet, I am afraid. I suspect it has to do with the serfHealth check failures that are logged by consul.

To come back to the original issue, I would suggest making a change that distinguishes between permanent failures (e.g. the 404 response) and transient failures (which can be retried). This distinction needs to be made in ConstructR itself (I will create an issue for this). The consul coordinator implementation can then signal a permanent failure when it encounters a 404 on session renewal. In this proposal, both permanent failures and transient failures that have exhausted the number of retries will result in the termination of the actor system.

The debate on whether to change the behaviour on permanent failures, for instance going back to the AddingSelf state, is then a separate matter.
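A minimal sketch of the proposed permanent/transient distinction, assuming hypothetical names (`CoordinationFailure`, `classify`, `shouldRetry` are illustrations only, not ConstructR's actual API):

```scala
// Hypothetical sketch of distinguishing permanent from transient
// coordination failures. None of these names exist in ConstructR;
// they only illustrate the proposed behaviour.
sealed trait CoordinationFailure
case object PermanentFailure extends CoordinationFailure  // e.g. 404 on session renew
case object TransientFailure extends CoordinationFailure  // e.g. timeout, 5xx

def classify(statusCode: Int): CoordinationFailure =
  if (statusCode == 404) PermanentFailure // the session is gone; retrying can never succeed
  else TransientFailure                   // a later attempt may succeed

def shouldRetry(failure: CoordinationFailure, retriesLeft: Int): Boolean =
  failure match {
    case PermanentFailure => false            // fail fast and log a specific message
    case TransientFailure => retriesLeft > 0  // retry until the budget is exhausted
  }
```

Under this sketch, a 404 on renewal terminates the machine immediately with a specific log message, while timeouts and 5xx responses still consume the normal retry budget.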

@juanjovazquez

@jaapterwoerds @raboof After checking this in the ConstructR project, may I close this issue? Thanks.

@berardino
Contributor

We have also experienced the exact same issue. When I checked the list of active sessions in consul, the incriminated session was no longer there. The TTL is set to 60s and the refresh interval is left at the default value, 30s.
We don't know yet why the session disappeared, but as a consequence the node was shut down.
Any further suggestions/ideas on what the root cause might be? We are still investigating.

03:29:01.806 [system-akka.actor.default-dispatcher-15] INFO  a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO  a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.808 [system-akka.actor.default-dispatcher-15] INFO  a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
03:29:01.809 [system-akka.actor.default-dispatcher-15] INFO  a.c.s.ClusterSingletonManager - ClusterSingletonManager state change [Start -> Younger]
09:16:31.116 [system-akka.actor.default-dispatcher-2] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:34.168 [system-akka.actor.default-dispatcher-3] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.226 [system-akka.actor.default-dispatcher-4] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.consul.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 404 Not Found for URI /v1/session/renew/521ecd64-63a3-a45c-7268-9a26df738cb2
09:16:37.231 [system-akka.actor.default-dispatcher-18] ERROR d.h.c.machine.ConstructrMachine - Number of retries exhausted in Refreshing!
09:16:37.234 [system-akka.actor.default-dispatcher-4] ERROR d.h.constructr.akka.Constructr - Terminating the system, because constructr-machine has terminated!
09:16:37.241 [system-akka.actor.default-dispatcher-2] INFO  a.r.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
09:16:37.241 [system-akka.actor.default-dispatcher-2] INFO  a.r.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.

@berardino
Contributor

It might be related to hashicorp/consul#1033. Our service is running in a docker container, and by digging into the consul agent logs I have found that at one point the agent was not able to contact the service for the health check. That might have triggered consul to mark the node as down. According to the consul documentation (https://www.consul.io/docs/internals/sessions.html), the session attached to the node is invalidated when any of the health checks go to the critical state.

@berardino
Contributor

@gerson24 I have got hseeberger/constructr#120 merged in. At least now, in the event of a 404, the node will gracefully leave the cluster.
This still does not fix this issue, but I think it's a little better.

Any suggestions on how to handle the 404 issue?

@juanjovazquez

Since it appears that this issue persists, I'll reopen it to think more carefully about how to handle it properly. @berardino thanks for the feedback.

@juanjovazquez juanjovazquez reopened this Aug 31, 2016
@juanjovazquez

@berardino We've discussed this internally and reached the conclusion that ConstructR should not be responsible for fixing a Consul-related issue. IMHO ConstructR is already a resilient service, as it enables you to set a certain number of retries. After those retries, and having left the Akka cluster (thanks to your PR on ConstructR), the better way to go would be to let the node stop working and let the cluster scheduler, or whatever machinery you have in place, deal with the issue, for instance by restarting the node or alerting about the problem. Does it make sense?

@berardino
Contributor

Yeah, I also think that this should be solved within the boundaries of Consul.

I made some changes to ConsulCoordination.refresh so that, in the event of a NotFound response, instead of continuing to retry the refresh, if the current node is still an Up member of the cluster, it will just invoke addSelf.
This approach seems to be working a little better, but it still fails sometimes due to:

Failure in AddingSelf, going to RetryScheduled/AddingSelf: java.util.NoSuchElementException: Future.filter predicate is not satisfied

So basically ConsulCoordination is not even able to create a new session. After adding more logging, I found that the createSession method is failing with:

500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,Check 'serfHealth' is in critical state)

Sometimes after a couple of retries this error goes away (the consul agent serf check reports an OK status) and ConsulCoordination is able to recover by creating a new session.

So what I think is going on here is that the consul agent randomly reports a false positive serf check. The Consul server believes that the service is down and therefore invalidates the sessions associated with the node.

So if you agree, I can open a PR with the changes I have implemented that make ConsulCoordination a little more resilient to faulty consul behaviours.
As for the faulty serf check, I think it might be related to the service running in docker (at least in our case). I will keep investigating to see whether that can be solved too.
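The fallback described above could be sketched roughly as follows (a sketch only, under assumed names: `refreshOrRecreate`, `renew`, and `create` are illustrative stand-ins for the ConsulCoordination internals, which differ in the actual PR):

```scala
// Hypothetical sketch of "create a new session when renewal returns 404".
// renew/create stand in for ConsulCoordination's real session calls.
import scala.concurrent.{ExecutionContext, Future}

final case class UnexpectedStatusCode(code: Int)
    extends RuntimeException(s"Unexpected status code $code")

def refreshOrRecreate(
    sessionId: String,
    renew: String => Future[Unit],  // renews an existing session, fails on non-2xx
    create: () => Future[String]    // creates a brand-new session, returns its id
)(implicit ec: ExecutionContext): Future[String] =
  renew(sessionId).map(_ => sessionId).recoverWith {
    // The session no longer exists, so retrying the renewal can never
    // succeed; fall back to creating a fresh session instead.
    case UnexpectedStatusCode(404) => create()
  }
```

Note that, as described above, session creation itself can still fail with a 500 while serfHealth is critical, so this fallback has to remain subject to the normal retry budget.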

berardino pushed a commit to berardino/constructr-consul that referenced this issue Sep 3, 2016
The consul agent can report a false negative serfHealth status.
When that happens, the consul server removes the faulty member.
This has two consequences:
1 - all the sessions associated with the node are deleted
2 - the creation of a new session might still fail with
500, Internal Server Error, Check 'serfHealth' is in critical state

This commit is an attempt to be more resilient in these cases and keep
trying to create a new session. If it succeeds, the system will
keep working as if nothing happened; otherwise, after the max number of
retries, the constructr machine will terminate the akka system.

The default constructr machine configuration doesn't give consul enough
time to recover under these circumstances. I suggest changing
the configuration for production environments to something like:

  coordination-timeout = 10 seconds
  nr-of-retries        = 10
  refresh-interval     = 60 seconds
  retry-delay          = 10 seconds
  ttl-factor           = 5.0
@juanjovazquez

@berardino, thanks for your PR and comments. However, I don't agree with your proposal of going back to the AddingSelf stage. As I pointed out previously, if the machine goes back to AddingSelf, how can we be sure that the node is still a member of the Akka cluster? Take into account that the joining phase precedes the AddingSelf phase. IMHO, the extra level of resiliency that you're looking for would be better reached by means of the cluster scheduler.

In any case, your PR changes the basic rules about the possible transitions in ConstructR, so I think you should discuss it with @hseeberger in ConstructR first. This project is about giving an alternative to etcd for coordination, and it shouldn't change the basic rules set by ConstructR itself.

@berardino
Contributor

berardino commented Sep 11, 2016

@juanjovazquez well, to be precise, ConstructR does not go back to the AddingSelf state. In fact, it is just the consul refresh method that uses the addSelf method to create a new session if it is unable to refresh the existing one. If you want to decouple the two concepts, you might want to consider extracting the current addSelf implementation into a createIfNotExist method and having:

override def addSelf[A: NodeSerialization](self: A, ttl: FiniteDuration) =
  createIfNotExist(self, ttl)

def createIfNotExist[A: NodeSerialization](self: A, ttl: FiniteDuration) = {
  // actual addSelf implementation
}

Then I can change my PR https://github.com/Tecsisa/constructr-consul/pull/27/files#diff-cfcdf993fe3f04a5f708a6d43c4ce068R160 to invoke createIfNotExist instead of addSelf.
This makes it clear that we are not adding self, but just trying to be resilient and create a session if we are unable to refresh the existing one.

We have already deployed the changes and the system has been running just fine. As you can see from the attached logs, this issue can happen very often, and I'd rather try to recover from it than let the system go down and deal with it at the cluster management level.

We are still investigating why this keeps happening (it might be related to docker and the serfCheck reporting a false critical state). We are also considering implementing an alternative solution that uses the consul discovery service to bootstrap the cluster instead of having ConstructR, which can potentially kill our services at any time.

But for the time being I think my PR is a good enough solution that mitigates the issue, until we find a better one.

00:08:48.229 [service-akka.actor.default-dispatcher-16] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
02:09:03.526 [service-akka.actor.default-dispatcher-2] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
10:07:34.102 [service-akka.actor.default-dispatcher-18] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/c27f8702-804b-e573-4881-fdc2790d2e90
10:08:44.215 [service-akka.actor.default-dispatcher-4] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session c27f8702-804b-e573-4881-fdc2790d2e90 not found. Creating a new one
12:08:49.470 [service-akka.actor.default-dispatcher-6] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
12:08:49.532 [service-akka.actor.default-dispatcher-4] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,Check 'serfHealth' is in critical state)
12:08:49.533 [service-akka.actor.default-dispatcher-4] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
12:08:59.569 [service-akka.actor.default-dispatcher-18] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
12:08:59.636 [service-akka.actor.default-dispatcher-2] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
12:09:09.679 [service-akka.actor.default-dispatcher-17] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 3eee3ea2-56fb-98c0-4407-3ea8abe5df8c not found. Creating a new one
15:42:20.588 [service-akka.actor.default-dispatcher-4] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
15:42:20.728 [service-akka.actor.default-dispatcher-6] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,rpc error: Check 'serfHealth' is in critical state)
15:42:20.728 [service-akka.actor.default-dispatcher-6] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
15:42:30.779 [service-akka.actor.default-dispatcher-5] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
15:42:30.821 [service-akka.actor.default-dispatcher-5] ERROR akka.actor.ActorSystemImpl - Unable to create a new session 500 Internal Server Error HttpEntity.Strict(text/plain; charset=UTF-8,rpc error: Check 'serfHealth' is in critical state)
15:42:30.822 [service-akka.actor.default-dispatcher-5] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: java.util.NoSuchElementException: Future.filter predicate is not satisfied
15:42:40.857 [service-akka.actor.default-dispatcher-4] WARN  akka.actor.ActorSystemImpl - Unable to refresh, session 541cf141-a33a-a810-44e3-22d882566642 not found. Creating a new one
16:08:42.956 [service-akka.actor.default-dispatcher-2] WARN  d.h.c.machine.ConstructrMachine - Failure in Refreshing, going to RetryScheduled/Refreshing: com.tecsisa.constructr.coordination.ConsulCoordination$UnexpectedStatusCode: Unexpected status code 500 Internal Server Error for URI /v1/session/renew/745450d2-b270-44f6-366c-1e1cd66d2079

juanjovazquez pushed a commit that referenced this issue Sep 17, 2016
Create a new session on 404 when refreshing (closes #5)