Making /api/v1 a stable contract: deprecation lifecycle, schema snapshots, and a support window #970
Apoorvgarg-creator
started this conversation in
Ideas
Replies: 2 comments 1 reply
Agree with all, maybe a little less on some fronts? Too much backwards compat can mess things up.
LGTM, agree with most of them
Making /api/v1 a stable contract: deprecation lifecycle, schema snapshots, and a support window
This discussion proposes an API stability policy for Observal. It defines what counts as a breaking change, what contributors would be asked to do before opening a PR that touches an API surface, what reviewers would be asked to check before approving one, and a phased engineering plan to back the discipline up with tooling.
Nothing in this document is final. Where this document takes a position, it is a starting position. If you disagree with any of the rules, the lifecycle, the support window, or the phasing, please push back in the comments. The goal is to land a policy the community has shaped, not one handed down.
The same rules would apply to:
- `/api/v1/...`
- `/api/v1/graphql`
- `/api/v1/config/version` handshake endpoint (treated as load-bearing)

A companion document, discussion #965, covers schema migrations and CLI/server upgrade safety. The two policies are siblings.
Why this matters
Observal is self-hosted. One CLI release fans out across many operators, on many server versions, with no central control. The cost of a silent breaking change is therefore disproportionately high: a single removed field can break many CLIs in the field, and there is no way to push a hotfix to them.
The proposal rests on three premises. These are starting positions; if any of them feel wrong, that is exactly the kind of feedback worth raising in the comments.
`/api/v1`, every removal or rename is treated as breaking. Additive changes are safe. Anything else is not.
What counts as a breaking change
If you are unsure whether your change is breaking, assume it is and read this section.
These are breaking:
- (`int` to `string`, `nullable` to `not-null`)
- (`str` to `Literal["a", "b"]`)
- (`200` to `201`, `404` to `200`)

These are additive (safe):
These are gray-zone (require explicit reviewer sign-off):
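As a rough illustration of the breaking-versus-additive distinction in this section, here is a minimal sketch that compares two versions of a response shape. Everything in it is hypothetical: the `Schema` type (a plain field-name to type-name map), the `classify_change` function, and the assumption that a newly added field counts as additive. A rename shows up as a removal plus an addition, so it lands in the breaking bucket, which matches the rule that every removal or rename is treated as breaking.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified response shape: field name -> type name.
Schema = Dict[str, str]


def classify_change(old: Schema, new: Schema) -> Tuple[str, List[str]]:
    """Return ("breaking" | "additive" | "unchanged", reasons)."""
    reasons: List[str] = []
    for name, old_type in old.items():
        if name not in new:
            # Removals (and renames, which look like removals) are breaking.
            reasons.append(f"removed field: {name}")
        elif new[name] != old_type:
            # Type changes such as int -> string are breaking.
            reasons.append(f"type change on {name}: {old_type} -> {new[name]}")
    if reasons:
        return "breaking", reasons

    # Assumption: a field that only exists in the new shape is additive.
    added = sorted(set(new) - set(old))
    if added:
        return "additive", [f"added field: {name}" for name in added]
    return "unchanged", []
```

A check like this cannot decide gray-zone cases; those would still need a reviewer.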
What this would ask of contributors
Before opening a PR that touches any API surface listed above, the proposal would ask you to:
Run the breaking-change checklist in the section above. If your change is breaking, you must follow the deprecation lifecycle, not remove directly.
For a removal or rename: keep the old field in place, mark it deprecated, schedule removal for at least one minor release later. The PR that deprecates and the PR that removes must be separate, in separate releases.
Bump `MIN_CLI_VERSION` in `observal-server/config.py` if your change removes or renames a field the CLI consumes. This is the floor below which the server refuses old CLIs.
Update the OpenAPI snapshot (once Phase 3 ships). The CI diff is your contract-change receipt.
Add a CHANGELOG entry under the `API contract` subsection. Format: `[deprecated]` or `[removed]` or `[added]`, then the endpoint and field, then the removal target version.
Cross-link discussion #965 if your change also touches a database schema. Schema and API changes often move together; they need to be reviewed together.
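To make the CHANGELOG format concrete, here is a minimal sketch of a helper that assembles an entry from the three parts named above. The helper name `changelog_entry`, the exact punctuation of the line, and the version number in the usage example are all assumptions for illustration; the policy only fixes the order (tag, then endpoint and field, then removal target).

```python
from typing import Optional

# The three tags named in the proposed format.
ALLOWED_TAGS = ("deprecated", "removed", "added")


def changelog_entry(
    tag: str,
    endpoint: str,
    field: str,
    removal_target: Optional[str] = None,
) -> str:
    """Build one line for the `API contract` CHANGELOG subsection.

    Hypothetical layout: "[tag] endpoint: field (removal target: X)".
    """
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown tag {tag!r}; expected one of {ALLOWED_TAGS}")
    entry = f"[{tag}] {endpoint}: {field}"
    if removal_target is not None:
        # Only deprecations normally carry a removal target version.
        entry += f" (removal target: {removal_target})"
    return entry


# Usage with a made-up version number:
print(changelog_entry("deprecated", "/api/v1/agents", "agent_id", "0.9.0"))
```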
What this would ask of reviewers
When reviewing a PR that touches an API surface, the proposal would ask you to explicitly check (and say so in the review):
- (`agent_id` to `agentId`), type changes from auto-inferred Pydantic models, GraphQL field removals that look like file deletions.
- `MIN_CLI_VERSION` is bumped if a CLI-consumed field is being removed in this PR.
- `@deprecated` decorator (REST) or `deprecation_reason` (GraphQL) is wired so headers actually emit. A deprecation without the header is a deprecation in name only.

A PR that touches an API surface must have a review comment that says, in some form, "I checked the breaking-change list and CHANGELOG entry, this is [additive / deprecation / removal-of-previously-deprecated]." If that sentence is missing, the review is incomplete.
Proposed plan
Four phases. Each phase would ship independently. The phasing is a suggestion; alternative orderings are welcome in the comments.
flowchart TB
    P1[Phase 1: Process + AGENTS.md rules] --> P2[Phase 2: Deprecation lifecycle in code]
    P2 --> P3[Phase 3: CI enforcement]
    P3 --> P4[Phase 4: v2 preparation]

Phase 1: Process and rules in writing
Status: this discussion + the AGENTS.md section landed on the same commit.
`MIN_CLI_VERSION` if needed, and add a CHANGELOG entry?"
Phase 2: Deprecation lifecycle in code
Goal: a removal stops being a cliff and becomes a ramp.
Bidirectional version handshake. Extend `/api/v1/config/version` (additively) with `schema_revision`. Ship a `compat.json` inside the CLI wheel declaring `min_server_version` and `min_schema_revision`. The CLI warns when the server is too old, not only when the CLI is too old.
Standards-compliant deprecation headers (RFC 8594). Every response that includes a deprecated field carries `Deprecation`, `Sunset`, and `Link` headers.
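A minimal sketch of how the three headers might be assembled, using only the standard library. The function name `deprecation_headers` and the documentation URL are hypothetical. The `Sunset` value is an HTTP-date as RFC 8594 specifies; the `Deprecation` value here assumes the `@<unix-timestamp>` form from RFC 9745, and the `Link` header assumes the `deprecation` link relation. Treat the exact values as a starting point to check against whichever spec revision the project settles on.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def deprecation_headers(
    deprecated_at: datetime,
    sunset_at: datetime,
    docs_url: str,
) -> Dict[str, str]:
    """Build Deprecation/Sunset/Link headers for one deprecated field.

    Both datetimes must be timezone-aware UTC values.
    """
    return {
        # Assumed format: "@" followed by a Unix timestamp (RFC 9745).
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # HTTP-date, e.g. "Thu, 01 Jan 2026 00:00:00 GMT" (RFC 8594).
        "Sunset": format_datetime(sunset_at, usegmt=True),
        # Points clients at human-readable migration notes.
        "Link": f'<{docs_url}>; rel="deprecation"; type="text/html"',
    }


# Usage with made-up dates and a placeholder URL:
headers = deprecation_headers(
    deprecated_at=datetime(2025, 7, 1, tzinfo=timezone.utc),
    sunset_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
    docs_url="https://example.com/deprecations",
)
```

In a FastAPI handler these could be merged into the response headers whenever a registered deprecated field is present in the payload.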
Per-field decorator for REST. A `@deprecated_field(name, since, removed_in, replacement=None)` decorator on FastAPI response models. The decorator emits the field as usual and registers it for header emission.
GraphQL field deprecation. Use Strawberry's `deprecation_reason` consistently. Same lifecycle as REST.
Soft-warning flow:
sequenceDiagram
    participant Dev as Maintainer
    participant Server
    participant CLI as Old CLI in the field
    Note over Dev,Server: Release N: deprecate
    Dev->>Server: Mark field deprecated, removed_in=N+2
    CLI->>Server: GET /api/v1/agents
    Server-->>CLI: Field present + Deprecation/Sunset/Link headers
    CLI->>CLI: Surface as observal doctor warning
    Note over Dev,Server: Release N+2: remove
    Dev->>Server: Delete field, bump MIN_CLI_VERSION
    CLI->>CLI: Operator has already been warned twice

Phase 3: CI enforcement
Goal: the test suite catches breaks before review even starts.
- `tests/contracts/openapi.snapshot.json`. CI fails any PR that changes the shape unless the snapshot is updated in the same PR. Snapshot updates surface the contract change in review.
- `tests/contracts/schema.graphql`. Same idea.
- `MIN_CLI_VERSION` enforcement test. If the OpenAPI diff is non-additive, the test fails unless `MIN_CLI_VERSION` was bumped in the same PR.

flowchart LR
    PR[PR opened] --> O[OpenAPI snapshot diff]
    O -- additive --> Pass1[Pass]
    O -- breaking --> M{MIN_CLI_VERSION bumped + CHANGELOG entry?}
    M -- yes --> Pass2[Pass]
    M -- no --> Fail[Block merge]

Phase 4: v2 preparation
We will not pre-build v2. We will define the trigger and the procedure now so when v2 is needed, it is a known operation.
v2 trigger criteria (any one of):
v2 procedure when triggered:
- `/api/v2/...` alongside `/api/v1/...`. Both serve traffic.
- `Accept-Version: 2`, falls back to v1 on 406.
- `Sunset` headers on v1 with a minimum 6-month horizon.
- `observal doctor` surfaces the v1 sunset to operators as a recurring warning.

Starting positions on the open questions
A few items came up during the audit that need a working answer. The table below records a starting position for each, with the reasoning, so the discussion has something concrete to react to. None of these are settled. If you have a stronger argument for a different answer, please raise it in the comments.
- `MIN_CLI_VERSION` be per-endpoint or global?
- (`/v1/traces`, `/v1/logs`, `/v1/metrics`)?

Where the rules live
- `AGENTS.md` section "Schema and CLI/server upgrade safety": the codified rules that human and AI reviewers apply.
- `CHANGELOG.md` `API contract` subsection (going forward): the audit trail of what we deprecated, added, and removed, per release.

Feedback we are specifically looking for
This is where your input is most useful:
If the community lands on this
Three small things contributors and reviewers could start doing immediately, even before Phases 2 and 3 ship:
- `observal-server/api/`, `observal-server/schemas/`, or any GraphQL resolver.
- `main`, open an issue with an `api-contract` label so we can track it.

The discipline is the deliverable. The tooling in Phases 2 and 3 makes the discipline durable. The community shaping this policy is what makes it stick.