Create a new session on 404 when refreshing (closes #5) #27
@juanjovazquez
These are the changes that I have implemented so far (including the configuration changes).
This setup seems to be holding up much better. I have had a three-node cluster running on AWS
for a day without experiencing any downtime.
The Consul agent can report a false negative serfHealth status.
When that happens, the Consul server removes the member it believes to be faulty.
This has two consequences:
1 - all the sessions associated with the node are deleted
2 - the creation of a new session might still fail with
500,Internal Server Error, Check 'serfHealth' is in critical state
This commit is an attempt to be more resilient in these cases and keep
trying to create a new session. If it succeeds, the system will
keep working as if nothing happened; otherwise, after the maximum number of retries,
the ConstructR machine will terminate the Akka system.
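
The retry behaviour described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code in this PR: `create_session`, `SessionCreationFailed`, and `create_session_with_retry` are hypothetical names, and it assumes that "nr-of-retries" means one initial attempt plus that many retries.

```python
import time
from typing import Callable, Optional


class SessionCreationFailed(Exception):
    """Hypothetical error raised when the coordination service rejects the
    request, e.g. with "500, Check 'serfHealth' is in critical state"."""


def create_session_with_retry(
    create_session: Callable[[], str],
    nr_of_retries: int,
    retry_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try to create a session, retrying on failure.

    Returns the new session id, or None once all attempts are exhausted.
    In the scenario described here, None is the point at which the caller
    would shut the system down.
    """
    # Assumption: total attempts = 1 initial try + nr_of_retries retries.
    for attempt in range(nr_of_retries + 1):
        try:
            return create_session()
        except SessionCreationFailed:
            # Wait before the next attempt, but not after the last one.
            if attempt < nr_of_retries:
                sleep(retry_delay)
    return None
```

The `sleep` parameter is injectable only so the sketch can be exercised without real delays. A caller would treat a `None` result as the signal to terminate.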
The default ConstructR machine configuration doesn't give Consul enough
time to recover under these circumstances. I suggest changing
the configuration for production environments to something like:
coordination-timeout = 10 seconds
nr-of-retries = 10
refresh-interval = 60 seconds
retry-delay = 10 seconds
ttl-factor = 5.0
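
As a rough sanity check on these values, the sketch below computes the worst-case time spent retrying and the smallest ttl-factor that covers it. The constraint used, ttl-factor >= 1 + (coordination-timeout * (1 + nr-of-retries) + retry-delay * nr-of-retries) / refresh-interval, is an assumption based on my recollection of the comment in ConstructR's reference configuration and should be checked against the version in use. The function names are hypothetical.

```python
def retry_budget_seconds(coordination_timeout: float,
                         nr_of_retries: int,
                         retry_delay: float) -> float:
    """Worst-case seconds spent on one operation: every attempt times out
    and every retry waits the full delay (assumed model)."""
    return coordination_timeout * (1 + nr_of_retries) + retry_delay * nr_of_retries


def min_ttl_factor(coordination_timeout: float,
                   nr_of_retries: int,
                   retry_delay: float,
                   refresh_interval: float) -> float:
    """Smallest ttl-factor satisfying the assumed constraint."""
    budget = retry_budget_seconds(coordination_timeout, nr_of_retries, retry_delay)
    return 1 + budget / refresh_interval


# Values suggested above, in seconds.
budget = retry_budget_seconds(10, 10, 10)   # 210
minimum = min_ttl_factor(10, 10, 10, 60)    # 4.5
ttl = 60 * 5.0                              # 300.0, assuming TTL = refresh-interval * ttl-factor
```

Under these assumptions the suggested ttl-factor of 5.0 sits just above the computed minimum of 4.5, so a session should outlive a full round of failed retries.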