fix(backend): clean stuck manifest backlog#2257

Open
riderx wants to merge 7 commits into main from codex/manifest-cleanup-health

riderx (Member) commented May 13, 2026

Summary (AI generated)

  • Harden deleted bundle cleanup so missing app_versions_meta no longer blocks manifest row cleanup.
  • Add a bounded delete_old_deleted_versions() migration so hard-delete cascades run in 500-version batches.
  • Add admin:cleanup-stuck-manifest-backlog to audit and clear old stuck manifest rows, then vacuum analyze public.manifest.
  • Add unit and pgTAP coverage for metadata-missing cleanup, counter resets, bounded hard delete, and manifest cascade.

Motivation (AI generated)

public.manifest can grow when old soft-deleted bundle cleanup is blocked or backlogged. The database should keep the same retention policy while providing a bounded operational path to remove old stuck rows safely.

Business Impact (AI generated)

This reduces production database storage pressure and keeps bundle cleanup predictable without changing customer-facing bundle retention behavior. It also gives operators an auditable script to clean existing backlog after deployment.

Test Plan (AI generated)

  • bun lint:backend
  • bunx eslint --no-ignore scripts/cleanup_stuck_manifest_backlog.ts
  • bunx vitest run tests/on-version-update-cleanup.unit.test.ts
  • PGSSLMODE=disable bunx supabase test db supabase/tests/00-supabase_test_helpers.sql supabase/tests/16_test_retention.sql --db-url postgresql://postgres:postgres@127.0.0.1:54562/postgres
  • bun test:backend
  • bun run admin:cleanup-stuck-manifest-backlog --help
  • Dry-run script audit against local Supabase DB

Generated with AI

Summary by CodeRabbit

  • New Features

    • Added an admin CLI to audit and optionally perform bounded cleanup of stuck/old manifest backlog.
    • Added a DB migration introducing a safe, bounded hard-delete helper for old soft-deleted versions with tightened permissions.
  • Bug Fixes

    • Improved orphan-detection for manifest assets and changed metadata-update failures to be logged rather than aborting deletes.
    • Ensures manifest counters are reset when manifest rows are removed.
  • Tests

    • Expanded tests to cover bounded-batch deletion and manifest cleanup behavior.
  • Documentation

    • Clarified inline comment about file-read behavior and error handling.

coderabbitai Bot (Contributor) commented May 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a bounded PL/pgSQL hard-delete function for old soft-deleted app_versions, an admin Bun/TypeScript script to audit and iteratively call that function, trigger adjustments to avoid throwing on meta-update errors, and SQL/unit tests validating bounded batching and manifest cascade.

Changes

Manifest Backlog Cleanup System

| Layer | File(s) | Summary |
|---|---|---|
| Database cleanup function and tests | supabase/migrations/20260513003104_manifest_cleanup_health.sql, supabase/tests/16_test_retention.sql | delete_old_deleted_versions() hard-deletes up to 500 soft-deleted app_versions older than 3 months (skips versions with channels), sets ownership/permissions, and is covered by SQL tests that verify bounded-batch behavior and manifest cascade deletions. |
| Admin CLI cleanup script | package.json, scripts/cleanup_stuck_manifest_backlog.ts | Adds admin:cleanup-stuck-manifest-backlog npm script and a Bun/TypeScript admin script that resolves DB URL from flags/env, prints pre-cleanup audits (table size, vacuum stats, bucket counts), optionally runs iterative delete_old_deleted_versions() batches with configurable pause and early-stop on stalled progress, optionally VACUUMs public.manifest, and prints post-cleanup audits. |
| Trigger cleanup integration and tests | supabase/functions/_backend/triggers/on_version_update.ts, tests/on-version-update-cleanup.unit.test.ts, supabase/functions/_backend/files/files.ts | deleteIt/trigger logic now includes s3_path when checking manifest references; stats/meta updates run only when metadata exists and meta-update failures are logged (not thrown). Unit tests expand Supabase/Postgres mocks to simulate manifest row deletions, verify deleteObject calls for bundle paths even when meta is missing, and assert SQL counter reset/decrement calls. A comment in files.ts was clarified about avoiding DB/R2 checks on file reads. |
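
The iterative apply path summarized above (bounded batches, a configurable pause, early stop on stalled progress) can be pictured with a minimal sketch. It is not the PR's script: `runBoundedCleanup`, `maxBatches`, `pauseMs`, `maxStalledBatches`, and the `runBatch` callback are all hypothetical names, and `runBatch` is assumed to invoke `delete_old_deleted_versions()` once and resolve to the number of eligible versions still remaining.

```typescript
// Hypothetical sketch: none of these names come from scripts/cleanup_stuck_manifest_backlog.ts.
type BatchRunner = () => Promise<number> // resolves to the remaining eligible count

interface CleanupOptions {
  maxBatches: number
  pauseMs: number
  maxStalledBatches: number
  sleep?: (ms: number) => Promise<void> // injectable so tests need no real delay
}

interface CleanupResult {
  batchesRun: number
  remaining: number
  stoppedBecause: 'done' | 'stalled' | 'max-batches'
}

const defaultSleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

async function runBoundedCleanup(
  runBatch: BatchRunner,
  initialRemaining: number,
  options: CleanupOptions,
): Promise<CleanupResult> {
  const sleep = options.sleep ?? defaultSleep
  let remaining = initialRemaining
  let stalled = 0
  let batchesRun = 0

  while (remaining > 0) {
    // Hard cap so a single invocation can never loop unbounded.
    if (batchesRun >= options.maxBatches)
      return { batchesRun, remaining, stoppedBecause: 'max-batches' }

    const next = await runBatch()
    batchesRun += 1

    // A batch that does not lower the remaining count counts as stalled.
    stalled = next >= remaining ? stalled + 1 : 0
    remaining = next
    if (stalled >= options.maxStalledBatches)
      return { batchesRun, remaining, stoppedBecause: 'stalled' }

    // Pause between batches to limit pressure on the database.
    if (remaining > 0)
      await sleep(options.pauseMs)
  }
  return { batchesRun, remaining, stoppedBecause: 'done' }
}
```

Returning the stop reason, rather than only a count, lets a caller distinguish a drained backlog from one that stopped making progress, which matters when versions are skipped (for example, those still attached to channels).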

Sequence Diagram(s)

sequenceDiagram
  participant Script as Admin Script
  participant DB as Postgres
  participant Fn as delete_old_deleted_versions
  participant Manifest as public.manifest
  Script->>DB: pre-cleanup audit queries (size, vacuum, bucket counts)
  Script->>Fn: CALL public.delete_old_deleted_versions()
  Fn->>DB: DELETE bounded rows -> cascade deletes Manifest rows
  DB-->>Script: remaining eligible count
  Script->>DB: optional VACUUM (ANALYZE) public.manifest
  Script->>DB: post-cleanup audit queries

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hop through rows both old and gray,
Bounded batches whisk the backlog away.
Manifests fall and counters mend,
Tests confirm the careful end.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: a new bounded cleanup mechanism for stuck manifest backlog rows in the database. |
| Description check | ✅ Passed | The PR description covers summary, motivation, business impact, and a detailed test plan with checkboxes, fully complying with the required template structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codspeed-hq Bot (Contributor) commented May 13, 2026

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks [1]


Comparing codex/manifest-cleanup-health (4a3669f) with main (728e86b)


Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@riderx riderx marked this pull request as ready for review May 13, 2026 02:05
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/on-version-update-cleanup.unit.test.ts (1)

164-208: ⚡ Quick win

Use it.concurrent() for new test cases in this file to enable parallel test execution.

The two test cases at lines 164 and 187 should be converted to concurrent form per the coding guidelines for test files.

♻️ Proposed change
-  it('still deletes manifest rows when version metadata is missing', async () => {
+  it.concurrent('still deletes manifest rows when version metadata is missing', async () => {
@@
-  it('resets manifest counters after deleting manifest entries', async () => {
+  it.concurrent('resets manifest counters after deleting manifest entries', async () => {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/on-version-update-cleanup.unit.test.ts` around lines 164 - 208, Convert
the two synchronous tests that call deleteIt (the ones referencing
manifestEntries with ids 456 and 789) to concurrent tests by replacing it(...)
with it.concurrent(...); specifically update the test declarations that use the
descriptions 'still deletes manifest rows when version metadata is missing' and
'resets manifest counters after deleting manifest entries' so they call
it.concurrent and keep the same test body and uses of deleteIt, createContext,
createVersion, manifestEntries, and expectations intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/cleanup_stuck_manifest_backlog.ts`:
- Around line 204-208: The DB client config currently disables TLS certificate
validation by setting ssl: shouldUseSsl(databaseUrl) ? { rejectUnauthorized:
false } : undefined; update this so certificate validation is enabled: either
remove the rejectUnauthorized override or set it to true when
shouldUseSsl(databaseUrl) is true. Modify the Client initialization (the Client
constructor call that uses application_name, connectionString, ssl and
databaseUrl) so that ssl is either omitted when not needed or set to {
rejectUnauthorized: true } when shouldUseSsl(databaseUrl) returns true.

---

Nitpick comments:
In `@tests/on-version-update-cleanup.unit.test.ts`:
- Around line 164-208: Convert the two synchronous tests that call deleteIt (the
ones referencing manifestEntries with ids 456 and 789) to concurrent tests by
replacing it(...) with it.concurrent(...); specifically update the test
declarations that use the descriptions 'still deletes manifest rows when version
metadata is missing' and 'resets manifest counters after deleting manifest
entries' so they call it.concurrent and keep the same test body and uses of
deleteIt, createContext, createVersion, manifestEntries, and expectations
intact.
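
The inline TLS finding above can be illustrated with a small sketch of how the `ssl` option might be derived with certificate validation left on. The real `shouldUseSsl()` is not shown in this thread, so the version here is a hypothetical stand-in that treats local hosts as non-TLS; `buildSslConfig` and `SslConfig` are likewise illustrative names.

```typescript
// Hypothetical stand-in: the script's actual shouldUseSsl() is not shown in this thread.
function shouldUseSsl(databaseUrl: string): boolean {
  const host = new URL(databaseUrl).hostname
  return host !== 'localhost' && host !== '127.0.0.1'
}

interface SslConfig {
  rejectUnauthorized: boolean
}

// Builds the value for the client's `ssl` option:
// undefined for local connections, certificate-validating TLS otherwise.
function buildSslConfig(databaseUrl: string): SslConfig | undefined {
  return shouldUseSsl(databaseUrl) ? { rejectUnauthorized: true } : undefined
}
```

The returned value would be passed as `ssl` in the Client constructor call the review refers to. A database that presents a certificate signed by a private CA would additionally need that CA supplied to the client, which this sketch does not cover.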

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c2008313-2130-4d55-812f-56f97cc2f7f9

📥 Commits

Reviewing files that changed from the base of the PR and between 2c16491 and ed1a77d.

📒 Files selected for processing (6)
  • package.json
  • scripts/cleanup_stuck_manifest_backlog.ts
  • supabase/functions/_backend/triggers/on_version_update.ts
  • supabase/migrations/20260513003104_manifest_cleanup_health.sql
  • supabase/tests/16_test_retention.sql
  • tests/on-version-update-cleanup.unit.test.ts

Comment thread scripts/cleanup_stuck_manifest_backlog.ts
coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
supabase/tests/16_test_retention.sql (1)

296-305: 💤 Low value

Document the batch size dependency in the test assertion.

The test expects exactly 5 remaining versions after delete_old_deleted_versions() runs, which depends on a batch size of 500 defined in the migration. While the batch size is documented in the migration's function comment, the test file itself has no explanation for this hard-coded expectation. Add a comment above the assertion clarifying the relationship: "Test expects 505 initial versions minus 500-version batch limit = 5 remaining".
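
The arithmetic behind that expectation can be spelled out in a short sketch. `BATCH_LIMIT` mirrors the 500-version limit described in this review; `remainingAfterOneBatch` is a hypothetical helper used only for illustration and does not exist in the test file.

```typescript
// Batch limit as described in this review of the migration; treated as an assumption here.
const BATCH_LIMIT = 500

// Number of eligible versions left after a single bounded delete pass.
function remainingAfterOneBatch(eligible: number, batchLimit: number = BATCH_LIMIT): number {
  return Math.max(0, eligible - batchLimit)
}

console.log(remainingAfterOneBatch(505)) // 505 - 500 = 5
```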

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@supabase/tests/16_test_retention.sql` around lines 296 - 305, Add a short
explanatory comment above the assertion that documents the batch-size
dependency: note that the test starts with 505 initial versions and the
migration's batch limit (used by delete_old_deleted_versions) is 500, so the
expectation is 505 - 500 = 5 remaining versions; reference the relevant symbols
in the comment (delete_old_deleted_versions, remaining_batch_versions,
retention_batch_targets) so future readers understand why the hard-coded 5 is
used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8099ccf7-b74a-48b3-b442-32cf418461f1

📥 Commits

Reviewing files that changed from the base of the PR and between c7f2de0 and c9f320a.

📒 Files selected for processing (3)
  • supabase/functions/_backend/triggers/on_version_update.ts
  • supabase/tests/16_test_retention.sql
  • tests/on-version-update-cleanup.unit.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/on-version-update-cleanup.unit.test.ts


@digzrow-coder

There is still an object-lifecycle gap here: admin:cleanup-stuck-manifest-backlog --apply hard-deletes old app_versions by calling public.delete_old_deleted_versions(), and that removes manifest rows by FK cascade without deleting the corresponding R2 manifest objects.

That is fine only if the soft-delete trigger has already removed every manifest object. But this PR is explicitly for a stuck/backlogged cleanup path, and deleteIt() can leave manifest rows behind when prior cleanup was blocked. If any of those rows still point at live R2 keys, the admin script removes the last DB reference to those keys and future runs cannot discover them for object cleanup. The script's audit shows counts by manifest row/version, but it never lists or deletes the manifest.s3_path objects before the cascade.

I would either make the backlog script collect candidate manifest.s3_path values and delete those R2 objects before/while deleting the DB rows, or clearly split DB-only cleanup from storage cleanup and keep enough retry state for failed object deletes. A useful regression would seed an old deleted version with a manifest row, run the cleanup helper, and assert the storage-delete path is invoked or the manifest row remains available for a later object-cleanup pass.
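
The first option suggested above (delete the R2 objects before the DB rows and keep retry state for failures) might look roughly like the following sketch. Every name in it is hypothetical: `ManifestRow`, `CleanupDeps`, `deleteObject`, `deleteManifestRows`, and `cleanObjectsThenRows` are illustrative stand-ins, not helpers from the PR.

```typescript
// Hypothetical interfaces: the PR's script exposes no such helpers.
interface ManifestRow {
  id: number
  s3_path: string
}

interface CleanupDeps {
  deleteObject: (path: string) => Promise<void> // removes one storage object
  deleteManifestRows: (ids: number[]) => Promise<void> // removes the DB rows
}

interface CleanupOutcome {
  deletedIds: number[] // rows whose object delete succeeded
  retryPaths: string[] // paths whose object delete failed; their rows are kept
}

async function cleanObjectsThenRows(rows: ManifestRow[], deps: CleanupDeps): Promise<CleanupOutcome> {
  const deletedIds: number[] = []
  const retryPaths: string[] = []

  for (const row of rows) {
    try {
      await deps.deleteObject(row.s3_path)
      deletedIds.push(row.id)
    }
    catch {
      // Keep the row so a later pass can still discover this path.
      retryPaths.push(row.s3_path)
    }
  }

  // Only rows with a confirmed object delete lose their DB reference.
  if (deletedIds.length > 0)
    await deps.deleteManifestRows(deletedIds)

  return { deletedIds, retryPaths }
}
```

The sketch deletes manifest rows directly rather than through the FK cascade, and it ignores the case where several manifest rows share one `s3_path`; a real implementation would have to handle both.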

KCDaemon left a comment

Rechecked latest head (4a3669f). This still has an object-lifecycle gap in the apply path.

admin:cleanup-stuck-manifest-backlog --apply calls public.delete_old_deleted_versions(), and that hard-deletes old app_versions; the related manifest rows are then removed by FK cascade. That clears the database backlog, but it does not call the R2/delete path for those manifest objects. After the cascade, the s3_path rows are gone, so there is no durable source left for a later storage cleanup retry or for the deleted-bundle read guard to identify those paths.

The new unit tests cover the normal deleteManifest() function path, but the backlog cleanup script bypasses that path and only calls SQL. To make this safe, the apply job needs to delete/enqueue the manifest objects before removing the rows, or persist a tombstone/path table until storage/cache cleanup is complete. Please add a regression for the actual cleanup script/function path, not just the regular version-delete path. The PR is also merge-conflicted (DIRTY), so it should stay blocked.
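
The tombstone alternative mentioned above can be sketched with an in-memory stand-in. In a real system this would presumably be a table; `PathTombstones`, `record`, and `drain` are hypothetical names, and the `deleteObject` callback is assumed to remove one storage object.

```typescript
// Hypothetical sketch: an in-memory stand-in for a persistent tombstone/path table.
class PathTombstones {
  private readonly pending = new Set<string>()

  // Call before the hard delete so the paths survive the FK cascade.
  record(paths: string[]): void {
    for (const path of paths)
      this.pending.add(path)
  }

  // Attempts every pending delete; a path is forgotten only after its delete succeeds.
  async drain(deleteObject: (path: string) => Promise<void>): Promise<number> {
    let removed = 0
    for (const path of Array.from(this.pending)) {
      try {
        await deleteObject(path)
        this.pending.delete(path)
        removed += 1
      }
      catch {
        // Keep the tombstone for the next pass.
      }
    }
    return removed
  }

  get size(): number {
    return this.pending.size
  }
}
```

Because the paths are recorded before the rows disappear, a failed storage delete leaves something durable to retry against, which is the gap this comment describes.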
