This issue was moved to a discussion.

How to do a high-availability rolling update? #4270

Closed
wolfchimneyrock opened this issue Jan 29, 2024 · 4 comments
Comments

@wolfchimneyrock
Contributor

wolfchimneyrock commented Jan 29, 2024

Description

Registry Version: 2.5.8
Persistence type: sql

For our in-progress high-availability Apicurio deployment, I am trying to understand how we can do rolling upgrades without causing an outage window.

Environment

We are running Apicurio Registry on a clustered set of VMs, each VM connecting to a shared distributed PostgreSQL db.

When upgrades occur, the new software is installed and started on one VM at a time. If the process fails, a rollback is usually attempted automatically before going on to the next VM.

This typically lets us upgrade software without impacting availability during the upgrade.

With Apicurio Registry, it looks like database schema changes can occur between releases, and these are applied automatically when the software is started. There also seems to be a strict requirement that a given software version works with only one database version.

Because of this, when the first VM gets its upgraded software it will upgrade the database, and the other VMs will continue to serve client requests, running the old software against the new database. I don't see any indication that this will always work, as updates sometimes remove tables or columns, etc.

If there is an issue with the new software, we can't roll back to the prior version, since the database has been upgraded and the old software doesn't work with the new database.

Do you have any insight on how we can achieve rolling upgrades?

One idea is to remove the strict software-to-db version requirement and instead give the software a minimum db version requirement (even a window of two allowed db versions would help), since you could then break db updates into two steps:

  1. add new features, deprecate old ones
  2. remove old features

and then have a guarantee that any software version can work with two different database versions.
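
To make the two-version window concrete, here is a minimal sketch of the compatibility check it implies. The `SchemaWindow` class, the `safe_to_migrate` helper, and the version numbers are all hypothetical illustrations and not part of Apicurio Registry, which (as described above) currently ties each release to a single database version.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SchemaWindow:
    """Hypothetical: range of db schema versions one software release can run against."""
    min_db_version: int
    max_db_version: int

    def supports(self, db_version: int) -> bool:
        return self.min_db_version <= db_version <= self.max_db_version


def safe_to_migrate(target_db_version: int, running: Sequence[SchemaWindow]) -> bool:
    """A migration is safe during a rolling update only if every release that is
    still serving traffic can also run against the target schema version."""
    return all(w.supports(target_db_version) for w in running)


# Hypothetical version numbers: db 11 = "add new, deprecate old", db 12 = "remove old".
old_release = SchemaWindow(min_db_version=10, max_db_version=11)
new_release = SchemaWindow(min_db_version=11, max_db_version=12)

print(safe_to_migrate(11, [old_release, new_release]))  # step 1 while both releases run
print(safe_to_migrate(12, [old_release, new_release]))  # step 2 while old release still runs
print(safe_to_migrate(12, [new_release]))               # step 2 after old release is retired
```

Under these assumptions, step 1 can be applied while old and new VMs run side by side, and step 2 only once the last old VM has been upgraded.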

@wolfchimneyrock wolfchimneyrock added the Bug Something isn't working label Jan 29, 2024
@apicurio-bot

apicurio-bot bot commented Jan 29, 2024

Thank you for reporting an issue!

Pinging @jsenko to respond or triage.

@carlesarnal
Member

This is a very interesting one.

Removing the constraint would not work in most scenarios, since a database change usually comes with code changes. One interesting idea: just as we provide upgrade scripts for this kind of situation, we could provide downgrade scripts as well, which would be in charge of returning the database to a version compatible with the server.

Another interesting option would be to enable a read-only mode for the maintenance window, point the running VMs at a read replica, upgrade the main database, and then, if everything is OK, upgrade the other VMs and the replica (another classic in this kind of situation). We've been working recently on a read-only mode that would help with this.

@wolfchimneyrock
Contributor Author

We have already implemented an auth proxy in front of two Apicurio instances (one read-write and one read-only, since our db read throughput is much higher with a read-only connection), which we can use to dynamically enable or disable the write APIs.
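
As a rough illustration of that kind of switch, the sketch below gates requests by HTTP method. `WriteGate` and `READ_METHODS` are hypothetical names, and this is not how our proxy or Apicurio's own read-only mode is actually implemented; it only shows the decision the proxy has to make per request.

```python
# Methods treated as reads in this sketch (an assumption, not a Registry rule).
READ_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class WriteGate:
    """Hypothetical proxy-side switch that rejects write requests while read-only."""

    def __init__(self) -> None:
        self.read_only = False

    def allows(self, method: str) -> bool:
        # Everything passes in normal mode; only read methods pass in read-only mode.
        return not self.read_only or method.upper() in READ_METHODS


gate = WriteGate()
gate.read_only = True          # start of the maintenance window
print(gate.allows("GET"))      # reads are still forwarded
print(gate.allows("POST"))     # writes are rejected
gate.read_only = False         # end of the maintenance window
print(gate.allows("POST"))     # writes are forwarded again
```

A method-only rule is deliberately crude: any read-style call that is issued as a POST would be rejected too, so a real proxy would probably need a per-path allow-list on top of this.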

This is our current idea for migration: blue and green instances, each with its own db. Suppose you start with blue active on 2.5.x and you want to upgrade to 2.5.y:

  1. While blue is serving requests, clear the green db and install version 2.5.y. Verify that the installation works.
  2. Enable read-only API access.
  3. Export the blue db to a .zip file. If this step fails, we can back out.
  4. Import the blue db export into green. We can still back out.
  5. Route some % of incoming traffic to green as a smoke test. We can still back out.
  6. Route all traffic to green.
  7. Re-enable the write API.

Now green is active, and we can keep blue on standby for some time in case there is an issue with green.
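
The "we can still back out" property of the steps above can be sketched as a small runner that undoes completed steps in reverse order when one fails. Everything here is hypothetical: `run_with_backout`, the step names, and the stubbed actions stand in for the real proxy, export, import, and routing operations, none of which are shown.

```python
from typing import Callable, Dict, List, Tuple

# (name, action, undo): all three are placeholders for real operations.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_backout(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on the first failure, undo completed steps in reverse."""
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, action, undo in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            for done_name, done_undo in reversed(done):
                done_undo()
                log.append(f"undone: {done_name}")
            return False, log
        done.append((name, undo))
        log.append(f"done: {name}")
    return True, log


def noop() -> None:
    pass


def failing_import() -> None:
    # Simulated failure of step 4, for demonstration only.
    raise RuntimeError("import rejected")


state: Dict[str, bool] = {"read_only": False}

steps: List[Step] = [
    ("enable read-only", lambda: state.update(read_only=True),
     lambda: state.update(read_only=False)),
    ("export blue", noop, noop),
    ("import into green", failing_import, noop),
    ("route all traffic to green", noop, noop),
]

ok, log = run_with_backout(steps)
print(ok, state["read_only"])
for line in log:
    print(line)
```

In this simulated run the import fails, so the export is "undone" (a no-op), read-only mode is switched off again, and blue keeps serving read-write traffic as before.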

@carlesarnal
Member

Yes, this kind of approach makes a lot of sense and is similar to what I would have expected. The steps I described were obviously aimed at a managed database instance, not a self-managed one. I'll convert this to a discussion and select your comment as the answer. Thanks!

@Apicurio Apicurio locked and limited conversation to collaborators Apr 9, 2024
@carlesarnal carlesarnal converted this issue into discussion #4539 Apr 9, 2024
