Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted #2212

Open
1 of 3 tasks
sanderegg opened this issue Mar 17, 2021 · 2 comments
Assignees
Labels
a:frontend issue affecting the front-end (area group) a:infra+ops maintenance of infrastructure or operations (discussed in retro) a:webserver issue related to the webserver service

Comments

sanderegg commented Mar 17, 2021

Use-case:

  1. Someone is connected to osparc, working with studies.
  2. The osparc platform is re-deployed.
  3. The user should be able to continue working seamlessly, perhaps with a small acceptable glitch, but should definitely not end up receiving HTTP 500 responses.

references:

graylog entries related to failed e2e

The e2e test of isolve-mpi failed because the webserver returned a 500 when listing projects. The logs show that the webserver was restarting at that moment.

Docker reference:

  • Docker swarm starts a new service (the replacing webserver) and waits until it is healthy. Once it is healthy, swarm stops the replaced service, waits 10 seconds, and then kills it if it is still around.

  • restarting webserver works

  • restarting database

  • restarting other subsystems
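
The swarm sequence described under "Docker reference" leaves the replaced service a 10-second window between the stop request and the kill. Below is a minimal sketch, not the osparc implementation, of how a service could use that window to finish in-flight requests instead of dropping them. The class `DrainingServer` and its method names are hypothetical and only illustrate the idea.

```python
import threading


class DrainingServer:
    """Tracks in-flight requests so a shutdown can wait for them to finish.

    Hypothetical helper for illustration; not part of osparc.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # The condition shares the lock, so all state is guarded by one mutex.
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._accepting = True

    def try_begin_request(self) -> bool:
        """Register a request; return False once shutdown has started."""
        with self._lock:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        """Mark one request as finished and wake up a waiting drain()."""
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def drain(self, grace_seconds: float) -> bool:
        """Stop accepting work, then wait for in-flight requests.

        Returns True if all requests finished within grace_seconds,
        False if the deadline passed with requests still running.
        """
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(
                lambda: self._in_flight == 0, timeout=grace_seconds
            )
```

In a real service, `drain` would presumably be called from the handler of the stop signal, with `grace_seconds` kept somewhat below the 10 seconds mentioned above so that the process exits before it gets killed.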

sanderegg added the a:frontend, a:infra+ops and a:webserver labels on Mar 17, 2021
sanderegg changed the title from "Zero downtime: Connected users should be able to continue working with osparc when the webserver is restarted" to "Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted" on Mar 17, 2021
pcrespov commented Mar 17, 2021

Related to #2140

  • New version is deployed.
  • The deployer updates the stack
  • The user gets interrupted because the service is down and sometimes gets a gateway error
    • Instead, the old service should not be shut down until the new stack is in place

Possible cause:

The swarm is already configured to have zero downtime per service (i.e. a given service gets turned off ONLY when the new one is started). The problem might be that even if each service is ready, the state between services is not. For example, the new webserver has been updated correctly but the Traefik proxy has not yet detected it. That would cause a bad gateway error on a front-end request.
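
To make the gap between "service is healthy" and "proxy routes to it" concrete, here is a minimal sketch of polling a route through the proxy until it answers without a 5xx status. Both function names are hypothetical, and any URL passed to `route_is_served` would be an assumption about the deployment, not something taken from osparc.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float, interval: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def route_is_served(url: str) -> bool:
    """True if a request sent through the proxy gets a non-5xx answer."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still proves the proxy reached a backend; 5xx does not.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, etc.
        return False
```

A deployer could, for instance, run `wait_until(lambda: route_is_served(url), timeout=60)` against the public entry point before letting the old service go.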

Ideas to solve this problem

  • Incorporate more conditions into the "healthcheck" validation function (e.g. Traefik has discovered all backend services)
  • Possibly deploy the entire stack separately first, validate it against a set of rules (e.g. all services healthy, all services connected, Traefik routings ready), and only then switch.
  • Notify the frontend that a new version was deployed (see "Front-end notifies of a new version available", #2128)
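
The first two ideas both come down to evaluating a set of named rules and only proceeding when all of them pass. A minimal sketch of such a gate follows, assuming each rule can be expressed as a zero-argument function returning a boolean. The function names and the rule names in the usage note are hypothetical.

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]


def failed_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check and return the names of those that did not pass.

    A check that raises is treated as failed, so one broken probe
    cannot crash the whole validation.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def ready_to_switch(checks: Dict[str, Check]) -> bool:
    """True only if every rule passes."""
    return not failed_checks(checks)
```

The rules from the list above could then be registered under names such as `"services_healthy"`, `"services_connected"` and `"proxy_routes_ready"`, and the list returned by `failed_checks` gives the deployer something to log when the switch is refused.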

@sanderegg

Testing like so:

  • running a lot of sleepers
    -> restarting the webserver seems to be ok
    -> restarting the database restarts some of the computational services, and seems to cause issues, with some state being lost.
