Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted #2212

sanderegg · 2021-03-17T12:49:10Z

Use-case:

someone is connected to osparc, working with studies
the osparc-platform is re-deployed
the user of the osparc platform should be able to continue working seamlessly, maybe with a small acceptable glitch, but should definitely not end up receiving HTTP 500 responses

references:

The e2e test of isolve-mpi failed with the webserver returning a 500 when listing projects. The logs show that the webserver was restarting at that moment.

Docker reference:

Docker swarm starts a new service (the replacing webserver) and waits until it is healthy. Once it is healthy, swarm stops the replaced service, waits 10 seconds, and then kills it if it is still around.
- restarting webserver works
- restarting database
- restarting other subsystems
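
The stop-then-kill window described above means a replaced service has a short time to finish requests that are already in flight. The following is a minimal sketch of how a service could drain those requests within that window. `InflightTracker`, `run`, `drain` and `demo` are hypothetical names for illustration, not part of the osparc code base, and the wiring to the actual termination signal is left out.

```python
import asyncio


class InflightTracker:
    """Counts in-flight requests so that shutdown can wait for them to finish.

    Hypothetical helper: create it inside a running event loop.
    """

    def __init__(self) -> None:
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no requests yet, so the tracker starts idle
        self.accepting = True

    async def run(self, handler):
        """Run one request handler while tracking it as in flight."""
        if not self.accepting:
            # A real server would translate this into a 503 response.
            raise RuntimeError("service is shutting down")
        self._count += 1
        self._idle.clear()
        try:
            return await handler()
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()

    async def drain(self, timeout: float) -> bool:
        """Stop accepting requests and wait up to `timeout` seconds.

        Returns True if all in-flight requests finished in time.
        """
        self.accepting = False
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo():
    tracker = InflightTracker()

    async def slow_request():
        await asyncio.sleep(0.05)
        return "done"

    task = asyncio.create_task(tracker.run(slow_request))
    await asyncio.sleep(0)  # let the request start before draining
    drained = await tracker.drain(timeout=1.0)
    return drained, await task
```

In a real service, `drain` would be called from the handler of the termination signal, with a timeout shorter than the 10 seconds mentioned above so the process exits before it is killed.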

pcrespov · 2021-03-17T15:32:25Z

Related to #2140

New version is deployed.
The deployer updates the stack
The user gets interrupted because the service is down and sometimes gets a gateway error
- Instead, the old service should not be shut down until the new stack is in place

Possible cause:

The swarm is already configured to have zero downtime per service (i.e. a given service gets turned off ONLY when the new one has started). The problem might be that even if the individual services are ready, the state between services is not. For example, the new webserver is updated correctly but the Traefik proxy has not yet detected it. That would cause a gateway error on a front-end request.

Ideas to solve this problem

Incorporate more conditions into the "healthcheck" validation function (e.g. Traefik has discovered all backend services)
Possibly deploy the entire new stack separately first, validate it against a set of rules (e.g. all services healthy, all services connected, Traefik routings ready) and only then switch over.
Notify the frontend that a new version was deployed (see "Front-end notifies of a new version available" #2128)
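
The first two ideas both amount to evaluating a set of named conditions before traffic is switched. A minimal sketch of such a gate is shown here. `ReadinessReport` and `evaluate_readiness` are hypothetical names, and the individual checks (for example, whether the proxy has discovered all backends) are assumed to be supplied by the caller as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReadinessReport:
    ready: bool
    failed: List[str]  # names of the checks that did not pass


def evaluate_readiness(checks: Dict[str, Callable[[], bool]]) -> ReadinessReport:
    """Run every named check and report which ones failed.

    A check that raises is treated as failed, so a broken probe
    cannot accidentally mark the stack as ready.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return ReadinessReport(ready=not failed, failed=failed)


# Usage with placeholder checks (assumptions, not real osparc probes):
report = evaluate_readiness(
    {
        "all_services_healthy": lambda: True,
        "proxy_routes_ready": lambda: False,
    }
)
# report.ready is False and report.failed names the failing check
```

Collecting the failed names rather than stopping at the first failure makes it easier to see which inter-service condition is blocking the switch.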

sanderegg · 2021-05-31T09:05:18Z

Testing like so:

running a lot of sleepers
-> restarting the webserver seems to be ok
-> restarting the database restarts some of the computational services, and seems to cause issues where some state is lost.
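
One common way to make a long-running task survive a database restart is to reconnect with exponential backoff instead of failing on the first lost connection. The sketch below illustrates that pattern only. `run_with_reconnect`, `connect` and `listen` are hypothetical names, and it assumes that a lost connection surfaces as `ConnectionError`, which depends on the database driver actually used.

```python
import asyncio


async def run_with_reconnect(
    connect,
    listen,
    *,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep=asyncio.sleep,
):
    """Connect and listen, retrying with exponential backoff on failure.

    `connect` and `listen` are caller-supplied coroutine functions.
    Gives up and re-raises after `max_attempts` failures in total.
    """
    attempt = 0
    while True:
        try:
            conn = await connect()
            return await listen(conn)
        except ConnectionError:
            attempt += 1
            if attempt >= max_attempts:
                raise
            # Delay doubles on each failure: base, 2*base, 4*base, ... capped.
            await sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without waiting.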

sanderegg added the a:frontend, a:infra+ops and a:webserver labels Mar 17, 2021

sanderegg assigned sanderegg, GitHK, pcrespov and odeimaiz Mar 17, 2021

sanderegg changed the title ~~Zero downtime: Connected users should be able to continue working with osparc when the webserver is restarted~~ Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted Mar 17, 2021

This was referenced Mar 17, 2021

zero downtime for upgrades #2140

Closed

Zero time restart #2214

Merged

WIP: errors middleware captures CancelledError in handlers #2215

Closed

sanderegg mentioned this issue Apr 9, 2021

Webserver: Database listening task does not reconnect when postgres restarts #2246

Closed

sanderegg mentioned this issue Oct 5, 2022

M1-12 Maintenance and DevOps ITISFoundation/osparc-issues#675

Closed
