Skip to content

Fix concurrent sitemap burst and archive retry storm#323

Merged
simonsmallchua merged 8 commits into
mainfrom
fix/sitemap-concurrency-archive-skipping
Apr 11, 2026
Merged

Fix concurrent sitemap burst and archive retry storm#323
simonsmallchua merged 8 commits into
mainfrom
fix/sitemap-concurrency-archive-skipping

Conversation

@simonsmallchua
Copy link
Copy Markdown
Contributor

@simonsmallchua simonsmallchua commented Apr 11, 2026

Summary

  • Global sitemap insertion semaphore (GNH_SITEMAP_CONCURRENCY = 3): at most 3 jobs may insert sitemap batches concurrently. Previously, N simultaneous job creations launched N independent goroutines each writing 100-URL UNNEST batches every 200ms, which combined into a write burst that pushed DB EMA above 60ms and shed the concurrency limit to the 10-slot floor — leaving 80+ jobs starved.
  • Permanent archive skipping: when both hot storage (Supabase) and cold storage (R2) return a 404, the task is now marked with html_archived_at = NOW() and excluded from future sweeps. Previously it was left with html_archived_at = NULL, causing the archive scheduler to retry the same unrecoverable tasks on every sweep.

Test plan

  • Build passes: go build ./...
  • Tests pass: go test ./internal/jobs/... ./internal/archive/...
  • Deploy and load 5+ large sitemap jobs simultaneously — confirm DB EMA stays below 60ms and concurrency limit does not shed to floor
  • Confirm "Cold-storage fallback also failed" errors cease for jobs with mass 404s; replaced by "marking archive as skipped" warnings that stop recurring

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Archive sweep now permanently skips tasks when content is missing from both storage locations, preventing repeated re-queuing.
  • Chores

    • Sitemap processing now honours a configurable concurrency limit to reduce write bursts and database pressure; deployment sets the concurrency to 5 and acquisition timeouts avoid stalled processing.
  • Documentation

    • Changelog updated to record the above fixes and sitemap concurrency configuration.

@supabase
Copy link
Copy Markdown

supabase Bot commented Apr 11, 2026

Updates to Preview Branch (fix/sitemap-concurrency-archive-skipping) ↗︎

Deployments Status Updated
Database Sat, 11 Apr 2026 07:18:19 UTC
Services Sat, 11 Apr 2026 07:18:19 UTC
APIs Sat, 11 Apr 2026 07:18:19 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Sat, 11 Apr 2026 07:18:20 UTC
Migrations Sat, 11 Apr 2026 07:18:22 UTC
Seeding Sat, 11 Apr 2026 07:18:24 UTC
Edge Functions Sat, 11 Apr 2026 07:18:24 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • ✅ Review completed - (🔄 Check again to review again)
📝 Walkthrough

Walkthrough

Adds a global sitemap insertion semaphore (configurable via GNH_SITEMAP_CONCURRENCY, default 3) to limit concurrent sitemap DB insertions and acquisition timeout handling; and introduces logic to permanently skip archive tasks when both hot and cold storage return definitive 404s, wiring a new MarkSkipped lifecycle into sources and DB.

Changes

Cohort / File(s) Summary
Changelog & Deploy config
CHANGELOG.md, fly.toml
Changed ## [Unreleased] to ## [Unreleased:patch] with a ### Fixed section; added two Fixed entries. Added GNH_SITEMAP_CONCURRENCY = "5" to fly.toml env.
Archive interfaces
internal/archive/archive.go, internal/jobs/interfaces.go
Added MarkSkipped(ctx context.Context, candidate ArchiveCandidate) error to ArchiveSource and MarkArchiveSkipped(ctx context.Context, taskID string) error to DbQueueInterface.
Archive scheduler logic
internal/archive/scheduler.go
Added isPermanent404(err error) bool; when both hot and cold downloads are classified as permanent 404s, call src.MarkSkipped(ctx, c) instead of leaving task unarchived and re-queuing.
Archive implementation & DB
internal/archive/task_html.go, internal/db/queue.go
TaskHTMLSource implements MarkSkipped calling DB; added DbQueue.MarkArchiveSkipped which conditionally sets html_archived_at = NOW() and clears storage fields for the task.
Job manager (sitemap)
internal/jobs/manager.go
Added sitemapSem semaphore initialised from GNH_SITEMAP_CONCURRENCY; sitemap goroutine now acquires semaphore with a 60‑minute timeout before starting (existing 30‑minute processing timeout applies after acquisition).
Tests
internal/jobs/worker_process_test.go
Extended test MockDbQueue with MarkArchiveSkipped(ctx, string) error stub.

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as Archive Scheduler
    participant Hot as Hot Storage
    participant Cold as Cold Storage
    participant Classifier as isPermanent404
    participant Source as ArchiveSource (TaskHTMLSource)
    participant DbQueue as DbQueue
    participant DB as Database

    Scheduler->>Hot: download(candidate)
    Hot-->>Scheduler: error (404 / other)
    Scheduler->>Cold: download(candidate fallback)
    Cold-->>Scheduler: error (404 / other)
    Scheduler->>Classifier: classify both errors
    Classifier-->>Scheduler: both permanent 404? (yes)
    Scheduler->>Source: MarkSkipped(ctx, candidate)
    Source->>DbQueue: MarkArchiveSkipped(ctx, taskID)
    DbQueue->>DB: UPDATE tasks SET html_archived_at = NOW(), clear storage WHERE id=taskID AND html_archived_at IS NULL
    DB-->>DbQueue: success
    DbQueue-->>Source: nil
    Source-->>Scheduler: nil
Loading

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly and clearly summarizes the two primary changes: fixing concurrent sitemap bursts through semaphore limiting and preventing archive retry storms through permanent skipping.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 11, 2026

Release Versions

App patch: v0.32.0v0.32.1

Changelog

Fixed

  • Global sitemap insertion semaphore (GNH_SITEMAP_CONCURRENCY, default 3)
    limits how many jobs may insert sitemap batches concurrently; previously N
    simultaneous job creations launched N independent goroutines each writing
    100-URL UNNEST batches every 200ms, causing a combined write burst that pushed
    DB EMA well above the 60ms high-water mark and shed concurrency to the 10-slot
    floor
  • Archive sweeps now permanently skip tasks where both hot storage and cold
    storage return a 404; previously the task was left with
    html_archived_at = NULL and re-queued on every sweep, consuming cold storage
    quota and archive-worker capacity with no chance of success

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 11, 2026

Codecov Report

❌ Patch coverage is 0% with 74 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal/jobs/manager.go 0.00% 41 Missing ⚠️
internal/archive/scheduler.go 0.00% 18 Missing ⚠️
internal/db/queue.go 0.00% 13 Missing ⚠️
internal/archive/task_html.go 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CHANGELOG.md`:
- Around line 39-42: Update the release note sentence that currently says
"burning S3 quota" to neutral backend-agnostic wording (e.g., "burning cold
storage/object storage quota" or "consuming cold storage quota") so it doesn't
single out S3; modify the sentence in the archive sweep paragraph that mentions
hot storage (Supabase) and cold storage (R2) to replace the "S3" token with the
chosen neutral phrase.

In `@internal/archive/scheduler.go`:
- Around line 201-208: The code only checks coldErr before calling
src.MarkSkipped; change it so MarkSkipped is called only when both the hot read
error and coldErr are permanent 404s. Concretely, locate the branch using
isPermanent404(coldErr) and replace the condition with a conjunction that also
checks the hot-read error (e.g. isPermanent404(hotErr) &&
isPermanent404(coldErr) or isPermanent404(err) && isPermanent404(coldErr)
depending on the hot-read error variable name), leaving the existing
lg.Warn/lg.Error and src.MarkSkipped call intact.
- Around line 172-183: The isPermanent404 function currently misses Supabase and
typed S3/R2 errors; update isPermanent404(err error) to first check for typed S3
errors using errors.As against *types.NotFound and *types.NoSuchKey, then check
the error string for the exact Supabase pattern "download failed with status
404" (instead of "StatusCode: 404"), and retain the generic "not_found" check
for other providers; reference the isPermanent404 function and use errors.As to
detect *types.NotFound/*types.NoSuchKey and strings.Contains(err.Error(),
"download failed with status 404") as described.

In `@internal/db/queue.go`:
- Around line 1086-1093: MarkArchiveSkipped currently only sets html_archived_at
and leaves hot-storage pointers set, which prevents MarkFullyArchivedJobs from
transitioning tasks to archived; modify the UPDATE in MarkArchiveSkipped to also
NULL out the hot-storage pointers (e.g., set html_storage_path = NULL and any
related field such as html_storage_provider or html_storage_key if present) in
the same UPDATE WHERE id = $1 AND html_archived_at IS NULL so the row is fully
treated as archived by MarkFullyArchivedJobs.

In `@internal/jobs/manager.go`:
- Around line 350-364: The timeout context (backgroundCtx) is created before
waiting on jm.sitemapSem, causing long semaphore waits to consume the sitemap
budget; change the logic to block on acquiring jm.sitemapSem first (detecting if
the wait itself times out and marking the job as failed/recording the timeout),
and only after a successful acquire create a context.WithTimeout
(30*time.Minute) and call jm.processSitemap using that new context; ensure you
still defer releasing jm.sitemapSem and canceling the new timeout context and
reference jm.sitemapSem, processSitemap, cancel, job.ID and normalisedDomain to
locate the affected code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 709810c3-36b5-414a-9dc1-10d19f15556a

📥 Commits

Reviewing files that changed from the base of the PR and between b5cb5c8 and bc5d37a.

📒 Files selected for processing (9)
  • CHANGELOG.md
  • fly.toml
  • internal/archive/archive.go
  • internal/archive/scheduler.go
  • internal/archive/task_html.go
  • internal/db/queue.go
  • internal/jobs/interfaces.go
  • internal/jobs/manager.go
  • internal/jobs/worker_process_test.go

Comment thread CHANGELOG.md Outdated
Comment thread internal/archive/scheduler.go
Comment thread internal/archive/scheduler.go Outdated
Comment thread internal/db/queue.go
Comment thread internal/jobs/manager.go Outdated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/jobs/manager.go (1)

361-364: ⚠️ Potential issue | 🟠 Major

Semaphore-timeout failure path uses a cancelled context and can leave jobs stalled.

At Line 363, jm.updateJobWithError(semCtx, ...) is called after semCtx.Done(), so the DB write will usually fail immediately. Also, only error_message is updated, which can leave the job pending with no tasks. This is the same issue pattern raised earlier and still appears unresolved.

Proposed fix
 		case <-semCtx.Done():
 			log.Error().Str("job_id", jobID).Msg("Timed out waiting for sitemap semaphore slot")
-			jm.updateJobWithError(semCtx, jobID, "timed out waiting for sitemap processing slot")
+			failCtx, failCancel := context.WithTimeout(context.Background(), 15*time.Second)
+			defer failCancel()
+			if err := jm.dbQueue.Execute(failCtx, func(tx *sql.Tx) error {
+				_, execErr := tx.ExecContext(failCtx, `
+					UPDATE jobs
+					SET status = $1,
+					    error_message = $2,
+					    completed_at = $3
+					WHERE id = $4 AND status IN ($5, $6)
+				`, JobStatusFailed, "timed out waiting for sitemap processing slot", time.Now().UTC(), jobID, JobStatusPending, JobStatusInitialising)
+				return execErr
+			}); err != nil {
+				log.Error().Err(err).Str("job_id", jobID).Msg("Failed to mark job as failed after sitemap semaphore timeout")
+			}
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/jobs/manager.go` around lines 361 - 364, The semaphore-timeout
branch calls jm.updateJobWithError with semCtx (already cancelled) and only sets
error_message, which will typically fail and leave the job pending; change the
handler to use a non-cancelled context (e.g., context.Background() or the
request/manager root ctx) when calling jm.updateJobWithError(jobID, ...), and
ensure updateJobWithError (or the update path invoked) sets the job status to a
terminal state (failed) and marks/creates any necessary task records so the job
cannot remain pending with no tasks; locate the timeout branch using semCtx,
log.Error(..., jobID) and jm.updateJobWithError and swap the context and expand
the update payload to include final job state and task cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/jobs/manager.go`:
- Around line 361-364: The semaphore-timeout branch calls jm.updateJobWithError
with semCtx (already cancelled) and only sets error_message, which will
typically fail and leave the job pending; change the handler to use a
non-cancelled context (e.g., context.Background() or the request/manager root
ctx) when calling jm.updateJobWithError(jobID, ...), and ensure
updateJobWithError (or the update path invoked) sets the job status to a
terminal state (failed) and marks/creates any necessary task records so the job
cannot remain pending with no tasks; locate the timeout branch using semCtx,
log.Error(..., jobID) and jm.updateJobWithError and swap the context and expand
the update payload to include final job state and task cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9a762cf2-2277-4258-b0f7-4c66ae8a96a2

📥 Commits

Reviewing files that changed from the base of the PR and between bc5d37a and 2e5830f.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • internal/archive/scheduler.go
  • internal/db/queue.go
  • internal/jobs/manager.go

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/jobs/manager.go`:
- Around line 362-364: The timeout branch only calls jm.updateJobWithError which
sets error_message but doesn't mark the job terminal; change the semaphore wait
timeout handling to also mark the job as failed and set completed_at.
Specifically, after logging the timeout, call the job-failure update that sets
status="failed" and completed_at=time.Now() (or extend jm.updateJobWithError to
accept/update status and completed_at), e.g., invoke the jm method that marks a
job failed (or update the job record for jobID with status failed and
completed_at) using the same context.Background(), so timed-out sitemap waits
end up in a terminal failed state instead of lingering.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6a065113-15fa-473b-af13-5ea4e4d6b9c0

📥 Commits

Reviewing files that changed from the base of the PR and between 2e5830f and 243ccf8.

📒 Files selected for processing (1)
  • internal/jobs/manager.go

Comment thread internal/jobs/manager.go
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
internal/jobs/manager.go (1)

363-372: ⚠️ Potential issue | 🟡 Minor

Add status guard to prevent overwriting a cancelled job.

The UPDATE lacks a status constraint, so if CancelJob runs while this goroutine is waiting on the semaphore and marks the job as cancelled, this timeout handler could subsequently overwrite it to failed. The codebase already uses this pattern elsewhere (e.g., line 1131).

🛠️ Proposed fix
 			if updateErr := jm.dbQueue.Execute(context.Background(), func(tx *sql.Tx) error {
 				_, err := tx.ExecContext(context.Background(), `
 					UPDATE jobs
 					SET status = $1, error_message = $2, completed_at = $3
-					WHERE id = $4
-				`, JobStatusFailed, "timed out waiting for sitemap processing slot", time.Now().UTC(), jobID)
+					WHERE id = $4 AND status IN ($5, $6)
+				`, JobStatusFailed, "Job timed out: waiting for sitemap processing slot", time.Now().UTC(), jobID, JobStatusPending, JobStatusInitialising)
 				return err
 			}); updateErr != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/jobs/manager.go` around lines 363 - 372, The UPDATE in
jm.dbQueue.Execute can overwrite a job that was cancelled while waiting; change
the UPDATE to include a status guard so it does not modify cancelled jobs (e.g.
add "AND status != $1" or "AND status <> $1") and pass JobStatusCancelled as an
extra parameter, updating the parameter order used in tx.ExecContext
accordingly; keep this change localized in the jm.dbQueue.Execute block that
sets JobStatusFailed with the "timed out waiting for sitemap processing slot"
message so cancelled jobs are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/jobs/manager.go`:
- Around line 363-372: The UPDATE in jm.dbQueue.Execute can overwrite a job that
was cancelled while waiting; change the UPDATE to include a status guard so it
does not modify cancelled jobs (e.g. add "AND status != $1" or "AND status <>
$1") and pass JobStatusCancelled as an extra parameter, updating the parameter
order used in tx.ExecContext accordingly; keep this change localized in the
jm.dbQueue.Execute block that sets JobStatusFailed with the "timed out waiting
for sitemap processing slot" message so cancelled jobs are preserved.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb9c07bb-11ec-452c-a25c-13cc26a9e992

📥 Commits

Reviewing files that changed from the base of the PR and between c3a4d71 and efce2a5.

📒 Files selected for processing (1)
  • internal/jobs/manager.go

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-323.fly.dev
Dashboard: https://hover-pr-323.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit 59e7395 into main Apr 11, 2026
11 checks passed
@simonsmallchua simonsmallchua deleted the fix/sitemap-concurrency-archive-skipping branch April 11, 2026 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant