feat(gateway): add reconciler lease for HA multi-replica deployments#1577

Draft
derekwaynecarr wants to merge 1 commit into
NVIDIA:mainfrom
derekwaynecarr:feat/reconciler-lease

@derekwaynecarr
Collaborator

Summary

Introduce a database-backed reconciler lease so that only one gateway replica runs the watch and reconcile loops in Postgres-backed HA deployments. SQLite (single-replica) deployments skip the lease and run unconditionally as before.
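A minimal sketch of the gating rule described above, assuming the backend is known at startup. The names `Backend`, `lease_required`, and `may_run_loops` are hypothetical and are not taken from this PR:

```python
from enum import Enum


class Backend(Enum):
    """Hypothetical storage backends; only the distinction matters here."""
    SQLITE = "sqlite"
    POSTGRES = "postgres"


def lease_required(backend: Backend) -> bool:
    # Assumption: only Postgres-backed deployments can have several replicas,
    # so only they need to coordinate through the lease.
    return backend is Backend.POSTGRES


def may_run_loops(backend: Backend, holds_lease: bool) -> bool:
    # SQLite (single replica) runs unconditionally; Postgres replicas run
    # the watch and reconcile loops only while they hold the lease.
    return holds_lease or not lease_required(backend)
```

With this rule, a SQLite deployment never consults the lease, while a Postgres replica that loses the lease is expected to stop its loops.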

The lease is a lightweight JSON record in the objects table using CAS for cross-replica safety. A lease coordinator on each replica attempts acquisition, runs renewal while holding, and releases on shutdown for fast failover. Watch and reconcile loops now accept a cancellation channel for cooperative shutdown.
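The acquire, renew, and release cycle could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `InMemoryObjects` is a hypothetical stand-in for the objects table, the version-based `compare_and_swap` is one possible CAS scheme, the lease name and JSON fields are invented, and a `threading.Event` plays the role of the cancellation channel.

```python
import json
import threading
import time

LEASE_NAME = "reconciler-lease"  # hypothetical record name


class InMemoryObjects:
    """Hypothetical stand-in for the objects table: name -> (version, JSON)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            return self._rows.get(name, (0, None))

    def compare_and_swap(self, name, expected_version, payload):
        # Write only if nobody else has written since we read the row.
        with self._lock:
            version, _ = self._rows.get(name, (0, None))
            if version != expected_version:
                return False
            self._rows[name] = (version + 1, payload)
            return True


class LeaseCoordinator:
    """One per replica. Clock skew between replicas is ignored in this sketch."""

    def __init__(self, store, holder, ttl=15.0, clock=time.time):
        self.store, self.holder, self.ttl, self.clock = store, holder, ttl, clock

    def try_acquire_or_renew(self):
        version, payload = self.store.get(LEASE_NAME)
        now = self.clock()
        if payload is not None:
            lease = json.loads(payload)
            held_by_other = lease["holder"] not in ("", self.holder)
            if held_by_other and lease["expires_at"] > now:
                return False  # someone else holds an unexpired lease
        new = json.dumps({"holder": self.holder, "expires_at": now + self.ttl})
        return self.store.compare_and_swap(LEASE_NAME, version, new)

    def release(self):
        # Clearing the holder on shutdown lets another replica take over
        # without waiting for the lease to expire.
        version, payload = self.store.get(LEASE_NAME)
        if payload is None or json.loads(payload)["holder"] != self.holder:
            return False
        cleared = json.dumps({"holder": "", "expires_at": 0.0})
        return self.store.compare_and_swap(LEASE_NAME, version, cleared)


def reconcile_loop(cancel, work, interval=0.01):
    """Run work() until cancel is set; returns the number of passes made."""
    passes = 0
    while not cancel.is_set():
        work()
        passes += 1
        cancel.wait(interval)  # sleeps, but wakes immediately on cancellation
    return passes
```

In this sketch a replica would call `try_acquire_or_renew()` periodically, set the cancel event if a renewal fails, and call `release()` during shutdown. A real Postgres-backed version would presumably perform the compare-and-swap as a conditional `UPDATE` and take the time from the database rather than the local clock.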

Related Issue

Closes #1429

Changes

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • [x] Follows Conventional Commits
  • [x] Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Implements NVIDIA#1429

Signed-off-by: Derek Carr <decarr@redhat.com>
@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 26, 2026 21:25
@copy-pr-bot

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@derekwaynecarr
Collaborator Author

This is WIP

@derekwaynecarr derekwaynecarr marked this pull request as draft May 26, 2026 21:26
@derekwaynecarr
Collaborator Author

Need to figure out what we want to do in CI for HA setups.

Development

Successfully merging this pull request may close these issues.

feat(gateway): reconciler lease for HA multi-replica deployments
