How to do a high-availability rolling update? #4270

wolfchimneyrock · 2024-01-29T15:08:04Z

Description

Registry
Version: 2.5.8
Persistence type: sql

For our in-progress high-availability Apicurio deployment, I am trying to understand how we can do rolling upgrades without causing an outage window.

Environment

We are running Apicurio Registry on a clustered set of VMs, each connecting to a shared distributed PostgreSQL database.

When upgrades occur, the new software is installed and started on one VM at a time. If the process fails, a rollback is usually attempted automatically before moving on to the next VM.

This typically lets us upgrade software without impacting availability during the upgrade.

With Apicurio Registry, it looks like database schema changes can occur between releases and are applied automatically when the software starts. There is also a strict requirement that a given software version works with only one database version.

Because of this, when the first VM gets its upgraded software it will upgrade the database, while the other VMs continue to serve client requests, running the old software against the new database. I don't see any indication that this will always work, since updates sometimes remove tables or columns.

If there is an issue with the new software, we can't roll back to the prior version, since the database has already been upgraded and the old software doesn't work with the new database.

Do you have any insight on how we can achieve rolling upgrades?

One idea is to relax the strict software-to-database version requirement so that the software instead has a minimum database version requirement (even an allowed window of two database versions would help). You could then break database updates into two steps:

1. add new features, deprecate old ones
2. remove old features

and then have a guarantee that any software version can work with two different database versions.
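To make the two-version window concrete, here is a minimal sketch of the check I have in mind. The `SchemaWindow` class, the `can_roll` function, and the integer version numbers are all hypothetical and only illustrate the idea; nothing like this exists in Apicurio Registry today as far as I know.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaWindow:
    """Range of database schema versions one software release can run against (hypothetical)."""
    min_db_version: int
    max_db_version: int

    def supports(self, db_version: int) -> bool:
        return self.min_db_version <= db_version <= self.max_db_version


def can_roll(old: SchemaWindow, new: SchemaWindow, db_before: int, db_after: int) -> bool:
    """A rolling upgrade is safe only if the old software keeps working once the
    schema has been migrated (mixed cluster) and the new software works on the
    migrated schema."""
    return old.supports(db_before) and old.supports(db_after) and new.supports(db_after)


# With a window of two schema versions, the "add, then remove" split allows a roll.
windowed = can_roll(SchemaWindow(10, 11), SchemaWindow(11, 12), db_before=10, db_after=11)
# With today's strict one-to-one requirement, the same upgrade cannot be rolled.
strict = can_roll(SchemaWindow(10, 10), SchemaWindow(11, 11), db_before=10, db_after=11)
```

With the window, the old VMs can keep serving requests after the first VM has migrated the schema, and a rollback of that first VM is still possible because the old software accepts the new schema version.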

apicurio-bot · 2024-01-29T15:08:07Z

Thank you for reporting an issue!

Pinging @jsenko to respond or triage.

carlesarnal · 2024-03-13T09:09:16Z

This is a very interesting one.

Removing the constraint would not work in most scenarios, since a database change usually comes with code changes. An interesting idea would be to provide downgrade scripts as well, just as we provide upgrade scripts; those scripts would be in charge of returning the database to a version compatible with the server.

Another interesting option would be to enable a read-only mode for the maintenance window, point the running VMs at a read replica, upgrade the main database, and then, if everything is OK, upgrade the other VMs and the replica (another classic in this kind of situation). We've been working recently on a read-only mode that would help with this.
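As a rough illustration of paired upgrade and downgrade scripts, the sketch below keeps an `up` and a `down` statement per schema version and walks between versions in either direction. It uses an in-memory SQLite database and made-up table names purely so that it runs on its own; the `Migration` type, the `MIGRATIONS` list, and the use of `PRAGMA user_version` are assumptions for the example, not how the registry stores or applies its scripts.

```python
import sqlite3
from typing import NamedTuple


class Migration(NamedTuple):
    version: int  # schema version reached after running `up`
    up: str       # hypothetical upgrade script
    down: str     # hypothetical downgrade script that undoes `up`


# Made-up schema, for illustration only.
MIGRATIONS = [
    Migration(1, "CREATE TABLE artifacts (id INTEGER PRIMARY KEY, content TEXT)",
              "DROP TABLE artifacts"),
    Migration(2, "CREATE TABLE labels (artifact_id INTEGER, label TEXT)",
              "DROP TABLE labels"),
]


def current_version(conn: sqlite3.Connection) -> int:
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection, target: int) -> int:
    """Move the schema up or down, one version at a time, until it reaches `target`."""
    version = current_version(conn)
    while version < target:
        step = MIGRATIONS[version]          # the migration that reaches version + 1
        conn.execute(step.up)
        version = step.version
        conn.execute(f"PRAGMA user_version = {int(version)}")
    while version > target:
        step = MIGRATIONS[version - 1]      # the migration that produced `version`
        conn.execute(step.down)
        version = step.version - 1
        conn.execute(f"PRAGMA user_version = {int(version)}")
    conn.commit()
    return version


def table_names(conn: sqlite3.Connection) -> set:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}
```

A real downgrade script would also have to decide what to do with data written in the newer format, which this sketch ignores by simply dropping the table.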

wolfchimneyrock · 2024-03-15T14:48:01Z

We have already implemented an auth proxy in front of two Apicurio instances (one read-write and one read-only, since our database read throughput is much higher over a read-only connection), which we can use to dynamically enable or disable the write APIs.
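The routing decision in such a proxy can be sketched roughly as follows. This is a simplified illustration, not our actual proxy: the `WriteGate` class and the backend names `"ro"` and `"rw"` are hypothetical, and it assumes that the HTTP method alone tells reads from writes, which may not hold for every registry endpoint.

```python
from typing import Optional

# Assumption: these methods never modify registry state.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class WriteGate:
    """Decides which backend instance a request goes to (hypothetical sketch)."""

    def __init__(self) -> None:
        self.read_only = False  # flipped at runtime during maintenance

    def route(self, method: str) -> Optional[str]:
        """Return "ro" or "rw", or None when the request must be rejected."""
        if method.upper() in SAFE_METHODS:
            return "ro"
        if self.read_only:
            return None  # the proxy would answer with an error instead of forwarding
        return "rw"
```

Setting `read_only = True` keeps reads flowing to the read-only instance while every write is rejected at the proxy, so the database content stays frozen for the duration of the maintenance.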

This is our current idea for migration, using blue and green instances (each with its own database). Suppose you start with blue active on 2.5.x and want to upgrade to 2.5.y:

1. While blue is serving requests, clear the green database and install version 2.5.y. Verify that the installation works.
2. Enable read-only API access.
3. Export the blue database to a .zip file. If this step fails, we can back out.
4. Import the blue database into green. We can still back out.
5. Route some percentage of incoming traffic to green as a smoke test. We can still back out.
6. Route all traffic to green.
7. Re-enable the write API.

Now green is active, and we can keep blue on standby for some time in case there is an issue with green.
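The back-out behaviour of the cutover can be sketched as a list of steps, each paired with an undo action. Everything here is hypothetical scaffolding: `run_with_backout`, `blue_green_steps`, and the `state` dictionary stand in for the real proxy and export/import calls, and the `fail_at` parameter exists only to simulate a failing step.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, undo)


def run_with_backout(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, action, undo in steps:
        try:
            action()
        except Exception:
            log.append(f"failed: {name}")
            for prev_name, prev_undo in reversed(done):
                prev_undo()
                log.append(f"undone: {prev_name}")
            return log
        done.append((name, undo))
        log.append(f"done: {name}")
    return log


def blue_green_steps(state: Dict[str, object], fail_at: Optional[str] = None) -> List[Step]:
    """Build the cutover steps against a toy `state` dict (stand-in for proxy settings)."""
    def step(name: str, apply: Callable[[], None], revert: Callable[[], None]) -> Step:
        def action() -> None:
            if name == fail_at:  # simulated failure, for illustration only
                raise RuntimeError(f"{name} failed")
            apply()
        return (name, action, revert)

    def noop() -> None:
        pass

    return [
        step("enable read-only", lambda: state.update(writes=False),
             lambda: state.update(writes=True)),
        step("export blue", noop, noop),
        step("import into green", noop, noop),
        step("smoke test on green", lambda: state.update(green_pct=10),
             lambda: state.update(green_pct=0)),
        step("route all traffic to green", lambda: state.update(green_pct=100),
             lambda: state.update(green_pct=0)),
        step("re-enable writes", lambda: state.update(writes=True), noop),
    ]
```

If, for example, the import into green fails, the undo actions run in reverse and leave blue serving all traffic with writes enabled again, which is the back-out property we want from every step before the final switch.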

carlesarnal · 2024-04-09T11:48:12Z

Yes, this kind of approach makes a lot of sense and is similar to what I would have expected. The steps I described were aimed at a managed database instance, not a self-managed one. I'll convert this to a discussion and select your comment as the answer. Thanks!

wolfchimneyrock added the Bug label Jan 29, 2024

apicurio-bot bot added area/compatibility area/rest-api area/storage labels Jan 29, 2024

Apicurio locked and limited conversation to collaborators Apr 9, 2024

carlesarnal converted this issue into discussion #4539 Apr 9, 2024

This issue was moved to a discussion.
