Handling database migrations when the CLI upgrades on self-hosted Observal #965

Apoorvgarg-creator · 2026-05-13T17:00:23Z

Apoorvgarg-creator
May 13, 2026
Maintainer

Handling database migrations when the CLI upgrades on self-hosted Observal

Why this matters

Observal is self-hosted. That means every operator runs their own Postgres and ClickHouse, on whatever box they chose, with whatever backup story they have (or don't). The CLI, on the other hand, lives on dozens of laptops and CI runners and gets upgraded casually: pip install -U observal-cli, brew upgrade observal, or a fresh curl | bash.

So we have a fan-out problem. One CLI release can hit a hundred different self-hosted servers, each on a different schema version, each owned by a different person with a different appetite for risk. If we get this wrong, the failure modes are not subtle: a CLI command silently makes the server write to a column that no longer exists, an Alembic migration runs at 3am because someone pulled an image, a ClickHouse ALTER stalls behind a query that was holding a lock.

This discussion is about what we should commit to as a project so that "I just upgraded the CLI" never turns into "and now my database is in a weird state."

The shape of the system today

flowchart LR
    subgraph User["Operator laptop / CI"]
        CLI["observal CLI<br/>(pip / brew / curl)"]
    end

    subgraph SelfHost["Self-hosted server (Docker)"]
        API["observal-api<br/>(FastAPI)"]
        Init["observal-init<br/>(entrypoint.sh)"]
        PG[("Postgres<br/>Alembic-managed")]
        CH[("ClickHouse<br/>DDL bootstrap")]
        Init -->|alembic upgrade head| PG
        Init -->|CREATE TABLE IF NOT EXISTS| CH
        API --> PG
        API --> CH
    end

    CLI -->|HTTP / JWT| API
    CLI -.->|GET /api/v1/config/version<br/>reads min_cli_version| API

A few things to notice:

- The CLI never touches the database directly. It only speaks HTTP to the API. So strictly speaking, "the CLI migrates the database" is the wrong framing. The server migrates the database. The CLI's job is to make sure the server it is talking to is on a schema it can speak to.
- Migrations run at container start, not on demand. `entrypoint.sh` is what runs `alembic upgrade head`. So a CLI upgrade only matters once the operator does `docker compose pull && docker compose up -d`.
- We already have half of a handshake. `client.check_version_compatibility()` reads `min_cli_version` from `/api/v1/config/version` and warns if the local CLI is older. We do not warn in the other direction (CLI newer than server). We do not block. We do not say "go run X to upgrade your server."

What could actually go wrong

I want to enumerate the failure modes concretely before proposing fixes, because the right answer depends on which ones we care about.

| # | Scenario | What breaks today | Severity |
|---|----------|-------------------|----------|
| 1 | User upgrades CLI, forgets to upgrade server | CLI calls newer endpoint, gets 404, cryptic error | Medium |
| 2 | User upgrades server, forgets to upgrade CLI | Server warns via `min_cli_version`. Good. | Low |
| 3 | User upgrades server, Alembic migration fails mid-run | DB left in partially-migrated state, no automatic rollback | High |
| 4 | User upgrades server with no backup, schema change is destructive | Data loss, no recovery | Critical |
| 5 | User runs CLI command that assumes a column the server is missing | Server returns 500, no helpful guidance | Medium |
| 6 | User upgrades CLI but stays on a server that is two majors behind | Subtle behavior drift, hard to diagnose | Medium |
| 7 | User has no idea their server is out of date | Silent drift, accumulates over months | Medium |

Scenarios 3 and 4 are the scary ones because they happen on the server side but are triggered by an upgrade workflow that operators feel is harmless. Scenarios 1, 5, 6, and 7 are about visibility: the system has enough information to know something is wrong but does not tell anyone.

End to end: what I think we should do

I want to lay out a layered defense, not pick one approach. Each layer is cheap on its own and they compound.

Layer 1: Bidirectional version handshake (cheap, big win)

Today the server tells the CLI "I require min_cli_version X." We should add the reverse: the CLI knows what server versions it works against, and warns if the server is older.

sequenceDiagram
    participant CLI
    participant API
    CLI->>API: GET /api/v1/config/version
    API-->>CLI: { server_version, schema_revision, min_cli_version }
    CLI->>CLI: Compare against bundled compat matrix
    alt CLI newer than server supports
        CLI->>CLI: Print "server is on X, CLI expects >= Y. Run: observal server upgrade"
    else server newer than CLI supports
        CLI->>CLI: Print "CLI is on X, server requires >= Y. Run: pip install -U observal-cli"
    else compatible
        CLI->>CLI: Proceed silently
    end

The CLI release artifact already knows its own version. We just need to ship a tiny compat.json inside the wheel and have the server expose its current schema_revision (the Alembic revision string, easy to read at startup). That gives us:

{
  "cli_version": "0.9.0",
  "min_server_version": "0.8.0",
  "min_schema_revision": "0003_add_insight_meta_cache"
}

Example warning text we would surface:

WARN: Your server is on schema 0002_models_by_ide.
      This CLI (0.9.0) expects at least 0003_add_insight_meta_cache.
      Some commands (observal scan, observal pull) may fail.

      To upgrade your server:
        ssh user@your-server
        cd /opt/observal
        docker compose pull && docker compose up -d
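
As a rough illustration of the comparison step, here is a minimal Python sketch. The function names (`parse_version`, `revision_number`, `check_handshake`) are hypothetical and not part of the current CLI. It also assumes that schema revisions sort by their numeric prefix (`0002_...` before `0003_...`), which holds for the revision names quoted here but is not true of Alembic revisions in general.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.9.0' into (0, 9, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def revision_number(revision: str) -> int:
    """Assumption: revisions look like '0003_add_insight_meta_cache'."""
    return int(revision.split("_", 1)[0])


def check_handshake(compat: dict, server: dict) -> list[str]:
    """Return human-readable warnings; an empty list means compatible."""
    warnings = []
    if parse_version(server["server_version"]) < parse_version(compat["min_server_version"]):
        warnings.append(
            f"Server is on {server['server_version']}, CLI expects >= "
            f"{compat['min_server_version']}. Run: observal server upgrade"
        )
    if revision_number(server["schema_revision"]) < revision_number(compat["min_schema_revision"]):
        warnings.append(
            f"Server is on schema {server['schema_revision']}, CLI expects "
            f"at least {compat['min_schema_revision']}."
        )
    if parse_version(compat["cli_version"]) < parse_version(server["min_cli_version"]):
        warnings.append(
            f"CLI is on {compat['cli_version']}, server requires >= "
            f"{server['min_cli_version']}. Run: pip install -U observal-cli"
        )
    return warnings


# compat mirrors the compat.json example; server mirrors the proposed
# /api/v1/config/version payload. The values are illustrative only.
compat = {
    "cli_version": "0.9.0",
    "min_server_version": "0.8.0",
    "min_schema_revision": "0003_add_insight_meta_cache",
}
server = {
    "server_version": "0.8.2",
    "schema_revision": "0002_models_by_ide",
    "min_cli_version": "0.8.0",
}
for line in check_handshake(compat, server):
    print("WARN:", line)
```

Returning a list of warnings rather than raising keeps the check advisory, which matches the "warn, do not block" behavior described above.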

Layer 2: `observal server status` and `observal server upgrade`

Operators should not have to remember where they put their docker compose file. We can add CLI subcommands that introspect and remediate. These are not destructive; they are guided remote ops.

observal server status would surface:

$ observal server status
Server URL:        https://observal.example.com
Server version:    0.8.2
Schema revision:   0003_add_insight_meta_cache  (alembic head: 0004_add_avatar_url)
ClickHouse:        reachable, schema version unknown (DDL bootstrap)
Pending migrations: 1
   0004_add_avatar_url   add avatar_url column to users

Suggested action: docker compose pull && docker compose up -d
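
One way the "Pending migrations" section could be derived, as a sketch: assume the server reports its applied revision together with an ordered list of every revision it ships. The `known_revisions` list and the `pending_migrations` helper below are hypothetical; nothing in the current API exposes them.

```python
from typing import List, Optional


def pending_migrations(current: Optional[str], known_revisions: List[str]) -> List[str]:
    """Return the revisions that come after `current`, in apply order.

    `known_revisions` is assumed to be ordered oldest to newest.
    `current=None` models a database with no migrations applied.
    """
    if current is None:
        return list(known_revisions)
    if current not in known_revisions:
        raise ValueError(f"unknown schema revision: {current}")
    index = known_revisions.index(current)
    return known_revisions[index + 1:]


# Revision names taken from the examples in this discussion.
revisions = [
    "0002_models_by_ide",
    "0003_add_insight_meta_cache",
    "0004_add_avatar_url",
]
pending = pending_migrations("0003_add_insight_meta_cache", revisions)
print(f"Pending migrations: {len(pending)}")
for revision in pending:
    print(f"   {revision}")
```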

observal server upgrade is a thin orchestrator. We do not want to do anything magic here; we want to give operators a one-command path with a confirmation gate.

flowchart TD
    A[observal server upgrade] --> B{Server reachable?}
    B -- no --> Z1[Print SSH instructions, exit]
    B -- yes --> C[Fetch server_version, schema_revision, pending migrations]
    C --> D{Pending migrations?}
    D -- no --> Z2[Print 'already up to date']
    D -- yes --> E[Show migration plan + confirmation prompt]
    E --> F{User confirms?}
    F -- no --> Z3[Exit]
    F -- yes --> G[Trigger server-side upgrade endpoint or print exact docker compose commands]
    G --> H[Poll /api/v1/config/version until schema_revision matches]
    H --> I[Print success]

The honest answer is that for most self-hosted deploys, the CLI cannot SSH into the server and run docker. So the v1 implementation of observal server upgrade is "print the exact commands and a confirmation, then poll for completion." A v2 could integrate with a server-side endpoint that runs the upgrade in-place (gated behind admin auth).
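
The "poll for completion" step of that v1 could look roughly like the sketch below. `fetch_revision` is a hypothetical callable standing in for whatever performs `GET /api/v1/config/version` and returns the `schema_revision` field; the timeout and interval defaults are placeholders, not agreed values.

```python
import time
from typing import Callable


def wait_for_revision(
    fetch_revision: Callable[[], str],
    target: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the server reports `target`; return False on timeout.

    Network errors raised as OSError subclasses are expected while the
    container restarts, so they count as "not ready yet", not as failure.
    """
    deadline = clock() + timeout
    while True:
        try:
            if fetch_revision() == target:
                return True
        except OSError:
            pass  # server is probably restarting; keep polling
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated server: one failed connection, then the old revision, then the new one.
responses = iter([
    ConnectionError("restarting"),
    "0003_add_insight_meta_cache",
    "0004_add_avatar_url",
])


def fake_fetch() -> str:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


ok = wait_for_revision(fake_fetch, "0004_add_avatar_url", interval=0.0, sleep=lambda _: None)
print("upgraded" if ok else "timed out")
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.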

Layer 3: Pre-migration safety net on the server

This is the layer that protects against scenarios 3 and 4. Before `alembic upgrade head` runs, the entrypoint should:

1. Dump a schema-only backup. `pg_dump --schema-only` is fast and lets us reconstruct the DDL state on failure.
2. Take a full snapshot if the operator opts in: `pg_dump -Fc` to a path on the data volume. This is off by default because it is heavy on big DBs, but it is flag-controlled.
3. Run migrations inside an explicit transaction with `statement_timeout` set, so a runaway migration is cancelled instead of holding a table lock for an hour.
4. On failure, do not exit silently. Write a `MIGRATION_FAILED` marker file and refuse to start the API. Operators will notice.

Concretely the entrypoint becomes:

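# Sketch for entrypoint.sh. pg_dump reads its connection settings from the
# standard libpq variables (PGHOST, PGUSER, PGDATABASE, PGPASSWORD).
set -eu
BACKUP_DIR=/data/backups
# The first token printed by each command is the revision id. This assumes
# a single Alembic head and that Alembic logs to stderr.
CURRENT=$(alembic current 2>/dev/null | awk 'NR==1 {print $1}')
REVISION=$(alembic heads 2>/dev/null | awk 'NR==1 {print $1}')

if [ "$CURRENT" != "$REVISION" ]; then
  mkdir -p "$BACKUP_DIR"
  pg_dump --schema-only > "$BACKUP_DIR/pre-$REVISION.sql"
  if [ "${OBSERVAL_BACKUP_BEFORE_MIGRATE:-0}" = "1" ]; then
    pg_dump -Fc > "$BACKUP_DIR/pre-$REVISION.dump"
  fi
  if ! alembic upgrade head; then
    touch /data/MIGRATION_FAILED
    echo "Migration failed. See $BACKUP_DIR/pre-$REVISION.sql for the pre-migration schema." >&2
    exit 1
  fi
fi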
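
To make "refuse to start the API" hold even if something other than the entrypoint launches the process, the API could also check for the marker at startup. A minimal sketch, assuming the `/data/MIGRATION_FAILED` path used above; `ensure_no_failed_migration` is a hypothetical helper, not existing Observal code.

```python
import sys
import tempfile
from pathlib import Path

MARKER = Path("/data/MIGRATION_FAILED")


def ensure_no_failed_migration(marker: Path = MARKER) -> None:
    """Abort startup if a previous migration run left its failure marker."""
    if marker.exists():
        sys.exit(
            f"Refusing to start: {marker} exists. A previous migration failed. "
            "Restore or fix the schema, remove the marker, then restart."
        )


# Demonstration against a temporary directory instead of /data.
with tempfile.TemporaryDirectory() as tmp:
    demo_marker = Path(tmp) / "MIGRATION_FAILED"
    ensure_no_failed_migration(demo_marker)  # no marker: returns normally
    demo_marker.touch()
    try:
        ensure_no_failed_migration(demo_marker)
    except SystemExit as exc:
        print(exc)
```

Requiring the operator to remove the marker by hand is deliberate in this sketch: it forces someone to look at the failure before the API serves traffic again.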

Layer 4: Schema revision in every response header

This is the cheapest observability win. Every API response would include:

X-Observal-Server-Version: 0.8.2
X-Observal-Schema-Revision: 0003_add_insight_meta_cache

The CLI can then passively notice drift on every call, not just at login. This catches scenario 7 (silent drift): even if the operator last upgraded six months ago, any routine API call lets the CLI spot the gap and softly remind them once it reaches a major-version mismatch.
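
On the CLI side, the passive check might look like the following sketch. `DriftMonitor` is a hypothetical class, and the CLI version `1.0.0` in the usage lines is made up purely to trigger a major-version mismatch. A plain `dict` is used for headers here; real HTTP client header mappings are usually case-insensitive.

```python
from typing import Mapping, Optional


def major(version: str) -> int:
    """Extract the major component from a dotted version such as '0.8.2'."""
    return int(version.split(".")[0])


class DriftMonitor:
    """Inspect response headers and produce at most one reminder per process."""

    def __init__(self, cli_version: str) -> None:
        self.cli_version = cli_version
        self.warned = False

    def observe(self, headers: Mapping[str, str]) -> Optional[str]:
        """Return a reminder on the first major-version mismatch, else None."""
        server_version = headers.get("X-Observal-Server-Version")
        if server_version is None or self.warned:
            return None
        if major(server_version) != major(self.cli_version):
            self.warned = True
            return (
                f"Server is on {server_version}, CLI is on {self.cli_version}. "
                "Consider running: observal server upgrade"
            )
        return None


monitor = DriftMonitor(cli_version="1.0.0")  # hypothetical future CLI version
response_headers = {
    "X-Observal-Server-Version": "0.8.2",
    "X-Observal-Schema-Revision": "0003_add_insight_meta_cache",
}
first = monitor.observe(response_headers)
second = monitor.observe(response_headers)
print(first)
print(second)  # None: the reminder is shown only once
```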

Layer 5: Doctor extension

observal doctor already exists. We should extend it to print a single "Server health" section:

$ observal doctor
...
Server health
  URL:              https://observal.example.com    ok
  Version:          0.8.2                            ok
  Schema:           0003_add_insight_meta_cache      WARN (1 migration pending)
  ClickHouse:       reachable                        ok
  Postgres:         reachable                        ok
  Last upgrade:     38 days ago                      info
...

This makes scenario 7 (silent drift) visible without the operator having to remember to check.

Layer 6: Documented support window

We should commit publicly to a policy. Suggestion:

The CLI supports the current minor server release and the previous one. Server N requires CLI >= N-1. If you have skipped more than one minor release, run observal server upgrade first, then re-run your command.

This is the part that does not need code, just a paragraph in the docs and consistent enforcement of min_cli_version in releases.
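
Although the policy itself is prose, it is simple enough to state as a check, which might help keep releases honest. A sketch under the assumption that versions are dotted `major.minor.patch` strings and that a major-version difference always falls outside the window; `within_support_window` is a hypothetical name.

```python
def minor_key(version: str) -> tuple[int, int]:
    """Reduce '0.9.3' to (0, 9); the patch level is ignored by the policy."""
    major, minor, *_ = (int(part) for part in version.split("."))
    return major, minor


def within_support_window(cli_version: str, server_version: str, window: int = 1) -> bool:
    """True if CLI and server are at most `window` minor releases apart."""
    cli_major, cli_minor = minor_key(cli_version)
    server_major, server_minor = minor_key(server_version)
    if cli_major != server_major:
        return False  # assumption: majors never fall inside the window
    return abs(cli_minor - server_minor) <= window


print(within_support_window("0.9.0", "0.8.2"))            # one minor apart
print(within_support_window("0.9.0", "0.7.0"))            # two minors apart
print(within_support_window("0.9.0", "0.7.0", window=2))  # friendlier N to N-2 window
```

The `window` parameter makes the N to N-1 versus N to N-2 trade-off a one-line change.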

Putting the layers together

flowchart TB
    subgraph Detection["Detection layer (cheap, always on)"]
        L1[Layer 1: Version handshake]
        L4[Layer 4: Schema header on every response]
        L5[Layer 5: Doctor check]
    end

    subgraph Action["Action layer (operator-driven)"]
        L2[Layer 2: observal server status / upgrade]
    end

    subgraph Safety["Safety layer (server-side, invisible)"]
        L3[Layer 3: Pre-migration backup + transactional Alembic]
    end

    subgraph Policy["Policy layer (docs, not code)"]
        L6[Layer 6: Support window N to N-1]
    end

    Detection --> Action
    Action --> Safety
    Safety --> Policy

Concrete example: a user's experience after this lands

Before:

$ observal scan
Error: 500 Internal Server Error

After:

$ observal scan
WARN: Your server is on schema 0002_models_by_ide.
      This CLI (0.9.0) expects at least 0003_add_insight_meta_cache.
      Some commands may fail.

To bring your server up to date:
  observal server upgrade

Proceeding anyway. If this command fails, please run the upgrade above.

And:

$ observal server upgrade
Server URL:        https://observal.example.com
Server version:    0.8.0 -> 0.9.0 (latest)
Pending migrations:
  0003_add_insight_meta_cache   add insight_meta_cache table
  0004_add_avatar_url            add avatar_url column to users

A schema-only backup will be taken to /data/backups/pre-0004_add_avatar_url.sql
on the server before migrations run.

Continue? [y/N]

What I am asking for

Three questions for the community:

1. Is layer 3 (pre-migration backup) something we should turn on by default, or behind a flag? Default-on is safer but adds latency to every container restart.
2. Should `observal server upgrade` ever do anything more than print instructions? A server-side admin endpoint that triggers an in-place upgrade is convenient but expands the trust surface.
3. What is the right support window? N to N-1 is strict but predictable. N to N-2 is friendlier but means more compat code.

I will update this discussion with a concrete RFC once we have rough consensus on the shape.
