Handling database migrations when the CLI upgrades on self-hosted Observal #965
Apoorvgarg-creator
started this conversation in
Ideas
Handling database migrations when the CLI upgrades on self-hosted Observal
Why this matters
Observal is self-hosted. That means every operator runs their own Postgres and ClickHouse, on whatever box they chose, with whatever backup story they have (or don't). The CLI, on the other hand, lives on dozens of laptops and CI runners and gets upgraded casually: pip install -U observal-cli, brew upgrade observal, or a fresh curl | bash.
So we have a fan-out problem. One CLI release can hit a hundred different self-hosted servers, each on a different schema version, each owned by a different person with a different appetite for risk. If we get this wrong, the failure modes are not subtle: a CLI command silently writes to a column that no longer exists, an Alembic migration runs at 3am because someone pulled an image, a ClickHouse ALTER blocks a query that was holding a lock.
This discussion is about what we should commit to as a project so that "I just upgraded the CLI" never turns into "and now my database is in a weird state."
The shape of the system today
flowchart LR
  subgraph User["Operator laptop / CI"]
    CLI["observal CLI<br/>(pip / brew / curl)"]
  end
  subgraph SelfHost["Self-hosted server (Docker)"]
    API["observal-api<br/>(FastAPI)"]
    Init["observal-init<br/>(entrypoint.sh)"]
    PG[("Postgres<br/>Alembic-managed")]
    CH[("ClickHouse<br/>DDL bootstrap")]
    Init -->|alembic upgrade head| PG
    Init -->|CREATE TABLE IF NOT EXISTS| CH
    API --> PG
    API --> CH
  end
  CLI -->|HTTP / JWT| API
  CLI -.->|GET /api/v1/config/version<br/>reads min_cli_version| API
A few things to notice:
- entrypoint.sh is what runs alembic upgrade head. So a CLI upgrade only matters once the operator does docker compose pull && docker compose up -d.
- client.check_version_compatibility() reads min_cli_version from /api/v1/config/version and warns if the local CLI is older. We do not warn in the other direction (CLI newer than server). We do not block. We do not say "go run X to upgrade your server."
What could actually go wrong
I want to enumerate the failure modes concretely before proposing fixes, because the right answer depends on which ones we care about.
min_cli_version. Good.
Scenarios 3 and 4 are the scary ones because they happen on the server side but are triggered by an upgrade workflow that operators feel is harmless. Scenarios 1, 5, 6, and 7 are about visibility: the system knows something is wrong but does not tell anyone.
End to end: what I think we should do
I want to lay out a layered defense, not pick one approach. Each layer is cheap on its own and they compound.
Layer 1: Bidirectional version handshake (cheap, big win)
Today the server tells the CLI "I require min_cli_version X." We should add the reverse: the CLI knows what server versions it works against, and warns if the server is older.
sequenceDiagram
  participant CLI
  participant API
  CLI->>API: GET /api/v1/config/version
  API-->>CLI: { server_version, schema_revision, min_cli_version }
  CLI->>CLI: Compare against bundled compat matrix
  alt CLI newer than server supports
    CLI->>CLI: Print "server is on X, CLI expects >= Y. Run: observal server upgrade"
  else server newer than CLI supports
    CLI->>CLI: Print "CLI is on X, server requires >= Y. Run: pip install -U observal-cli"
  else compatible
    CLI->>CLI: Proceed silently
  end
The CLI release artifact already knows its own version. We just need to ship a tiny compat.json inside the wheel and have the server expose its current schema_revision (the Alembic revision string, easy to read at startup). That gives us:
{
  "cli_version": "0.9.0",
  "min_server_version": "0.8.0",
  "min_schema_revision": "0003_add_insight_meta_cache"
}
Example warning text we would surface:
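A minimal sketch of the comparison step in the diagram, assuming versions are plain dotted integers and that the bundled data has the same keys as the compat.json example above. The function names (parse_version, handshake_warning) are hypothetical, not existing Observal code. The min_schema_revision check is left out on purpose: Alembic revision strings cannot be ordered without the migration graph.

```python
from typing import Optional

# Hypothetical bundled compat data, mirroring the compat.json example.
COMPAT = {
    "cli_version": "0.9.0",
    "min_server_version": "0.8.0",
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.9.0' into (0, 9, 0). Assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def handshake_warning(server_info: dict, compat: dict = COMPAT) -> Optional[str]:
    """Return a warning string, or None when both sides are compatible.

    server_info is assumed to be the JSON body of GET /api/v1/config/version.
    """
    cli = compat["cli_version"]
    server = server_info["server_version"]
    min_server = compat["min_server_version"]
    min_cli = server_info["min_cli_version"]

    # CLI is newer than the server supports: point at the server upgrade.
    if parse_version(server) < parse_version(min_server):
        return (
            f"server is on {server}, CLI expects >= {min_server}. "
            "Run: observal server upgrade"
        )
    # Server is newer than the CLI supports: point at the CLI upgrade.
    if parse_version(cli) < parse_version(min_cli):
        return (
            f"CLI is on {cli}, server requires >= {min_cli}. "
            "Run: pip install -U observal-cli"
        )
    return None
```

Comparing integer tuples instead of raw strings avoids the classic mistake of treating "0.10.0" as older than "0.9.0".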
Layer 2: observal server status and observal server upgrade
Operators should not have to remember where they put their docker compose file. We can add CLI subcommands that introspect and remediate. These are not destructive; they are guided remote ops.
observal server status would surface:
observal server upgrade is a thin orchestrator. We do not want to do anything magic here; we want to give operators a one-command path with a confirmation gate.
flowchart TD
  A[observal server upgrade] --> B{Server reachable?}
  B -- no --> Z1[Print SSH instructions, exit]
  B -- yes --> C[Fetch server_version, schema_revision, pending migrations]
  C --> D{Pending migrations?}
  D -- no --> Z2[Print 'already up to date']
  D -- yes --> E[Show migration plan + confirmation prompt]
  E --> F{User confirms?}
  F -- no --> Z3[Exit]
  F -- yes --> G[Trigger server-side upgrade endpoint or print exact docker compose commands]
  G --> H[Poll /api/v1/config/version until schema_revision matches]
  H --> I[Print success]
The honest answer is that for most self-hosted deploys, the CLI cannot SSH into the server and run docker. So the v1 implementation of observal server upgrade is "print the exact commands and a confirmation, then poll for completion." A v2 could integrate with a server-side endpoint that runs the upgrade in-place (gated behind admin auth).
Layer 3: Pre-migration safety net on the server
This is the layer that protects against scenarios 3 and 4. Before alembic upgrade head runs, the entrypoint should:
- Snapshot the schema: pg_dump --schema-only is fast and lets us reconstruct the DDL state on failure.
- Optionally take a full backup with pg_dump -Fc to a path on the data volume. Off by default because it is heavy on big DBs, but flag-controlled.
- On failure, write a MIGRATION_FAILED marker file and refuse to start the API. Operators will notice.
Concretely the entrypoint becomes:
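The real entrypoint is a shell script (entrypoint.sh); what follows is only a Python sketch of the same control flow, under assumptions. The function name, the file names inside backup_dir, and the injectable run callable are all hypothetical. The pg_dump flags (--schema-only, -Fc, --file, --dbname) and alembic upgrade head are the standard ones.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def migrate_with_safety_net(
    backup_dir: Path,
    database_url: str,
    full_backup: bool = False,  # off by default: heavy on big DBs
    run: Runner = shell_runner,
) -> bool:
    """Snapshot, optionally back up, then migrate. Returns False on failure."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    marker = backup_dir / "MIGRATION_FAILED"

    # Hypothetical file names; only the ordering of the steps matters here.
    schema_file = backup_dir / "pre_migration_schema.sql"
    dump_file = backup_dir / "pre_migration_full.dump"

    steps = [
        ("schema snapshot",
         ["pg_dump", "--schema-only", f"--file={schema_file}", f"--dbname={database_url}"]),
    ]
    if full_backup:
        steps.append(
            ("full backup",
             ["pg_dump", "-Fc", f"--file={dump_file}", f"--dbname={database_url}"])
        )
    steps.append(("migration", ["alembic", "upgrade", "head"]))

    for name, cmd in steps:
        if run(cmd) != 0:
            # Leave a marker so the failure is visible; the caller should
            # refuse to start the API when this function returns False.
            marker.write_text(f"{name} failed\n")
            return False
    return True
```

The caller would exit non-zero when this returns False, so the API container never starts on a half-migrated schema. Passing run as a parameter keeps the sketch testable without a live Postgres.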
Layer 4: Schema revision in every response header
This is the cheapest observability win. Every API response includes:
The CLI can passively notice drift on every call, not just at login. This catches scenario 7 (silent drift): the operator did an upgrade six months ago, the CLI just made an API call, and we can softly remind them on the next major-version mismatch.
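A rough sketch of the passive check on the CLI side. The header name X-Observal-Schema-Revision and the drift_notice function are assumptions made for illustration; the proposal does not fix either. The check is a plain inequality, since deciding which revision is newer would need the migration graph.

```python
from typing import Mapping, Optional

# Hypothetical header name, chosen only for this sketch.
SCHEMA_HEADER = "X-Observal-Schema-Revision"


def drift_notice(headers: Mapping[str, str], expected_revision: str) -> Optional[str]:
    """Return a soft reminder when the server's revision differs, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    seen = normalized.get(SCHEMA_HEADER.lower())

    # Older servers will not send the header at all; stay silent for them.
    if seen is None or seen == expected_revision:
        return None
    return f"server schema is at {seen}, this CLI was built against {expected_revision}"
```

Because the function only inspects headers the CLI already received, it adds no extra request to any command.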
Layer 5: Doctor extension
observal doctor already exists. We should extend it to print a single "Server health" section:
This makes scenario 7 (silent drift) visible without the operator having to remember to check.
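One possible shape for that "Server health" section, as a sketch only. The layout and the render_server_health name are hypothetical; the fields are the ones the version endpoint is proposed to return (server_version, schema_revision, min_cli_version).

```python
def render_server_health(info: dict, cli_version: str) -> str:
    """Format a 'Server health' block from the version endpoint payload."""
    rows = [
        ("Server version", info.get("server_version", "unknown")),
        ("Schema revision", info.get("schema_revision", "unknown")),
        ("Min CLI version", info.get("min_cli_version", "unknown")),
        ("Local CLI version", cli_version),
    ]
    # Pad labels to a common width so the values line up in a terminal.
    width = max(len(label) for label, _ in rows)
    lines = ["Server health"]
    lines += [f"  {label.ljust(width)}  {value}" for label, value in rows]
    return "\n".join(lines)
```

Missing fields render as "unknown" rather than raising, so the command still works against older servers that do not report schema_revision.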
Layer 6: Documented support window
We should commit publicly to a policy. Suggestion:
This is the part that does not need code, just a paragraph in the docs and consistent enforcement of min_cli_version in releases.
Putting the layers together
flowchart TB
  subgraph Detection["Detection layer (cheap, always on)"]
    L1[Layer 1: Version handshake]
    L4[Layer 4: Schema header on every response]
    L5[Layer 5: Doctor check]
  end
  subgraph Action["Action layer (operator-driven)"]
    L2[Layer 2: observal server status / upgrade]
  end
  subgraph Safety["Safety layer (server-side, invisible)"]
    L3[Layer 3: Pre-migration backup + transactional Alembic]
  end
  subgraph Policy["Policy layer (docs, not code)"]
    L6[Layer 6: Support window N to N-1]
  end
  Detection --> Action
  Action --> Safety
  Safety --> Policy
Concrete example: a user's experience after this lands
Before:
After:
And:
What I am asking for
Three questions for the community:
- Should observal server upgrade ever do anything more than print instructions? A server-side admin endpoint that triggers an in-place upgrade is convenient but expands the trust surface.
I will update this discussion with a concrete RFC once we have rough consensus on the shape.