Fix concurrent sitemap burst and archive retry storm#323
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Updates to Preview Branch (fix/sitemap-concurrency-archive-skipping) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a global sitemap insertion semaphore (configurable via GNH_SITEMAP_CONCURRENCY, default 3) to limit concurrent sitemap DB insertions and acquisition timeout handling; and introduces logic to permanently skip archive tasks when both hot and cold storage return definitive 404s, wiring a new MarkSkipped lifecycle into sources and DB. Changes
Sequence Diagram(s)sequenceDiagram
participant Scheduler as Archive Scheduler
participant Hot as Hot Storage
participant Cold as Cold Storage
participant Classifier as isPermanent404
participant Source as ArchiveSource (TaskHTMLSource)
participant DbQueue as DbQueue
participant DB as Database
Scheduler->>Hot: download(candidate)
Hot-->>Scheduler: error (404 / other)
Scheduler->>Cold: download(candidate fallback)
Cold-->>Scheduler: error (404 / other)
Scheduler->>Classifier: classify both errors
Classifier-->>Scheduler: both permanent 404? (yes)
Scheduler->>Source: MarkSkipped(ctx, candidate)
Source->>DbQueue: MarkArchiveSkipped(ctx, taskID)
DbQueue->>DB: UPDATE tasks SET html_archived_at = NOW(), clear storage WHERE id=taskID AND html_archived_at IS NULL
DB-->>DbQueue: success
DbQueue-->>Source: nil
Source-->>Scheduler: nil
Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Release VersionsApp patch: ChangelogFixed
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 39-42: Update the release note sentence that currently says
"burning S3 quota" to neutral backend-agnostic wording (e.g., "burning cold
storage/object storage quota" or "consuming cold storage quota") so it doesn't
single out S3; modify the sentence in the archive sweep paragraph that mentions
hot storage (Supabase) and cold storage (R2) to replace the "S3" token with the
chosen neutral phrase.
In `@internal/archive/scheduler.go`:
- Around line 201-208: The code only checks coldErr before calling
src.MarkSkipped; change it so MarkSkipped is called only when both the hot read
error and coldErr are permanent 404s. Concretely, locate the branch using
isPermanent404(coldErr) and replace the condition with a conjunction that also
checks the hot-read error (e.g. isPermanent404(hotErr) &&
isPermanent404(coldErr) or isPermanent404(err) && isPermanent404(coldErr)
depending on the hot-read error variable name), leaving the existing
lg.Warn/lg.Error and src.MarkSkipped call intact.
- Around line 172-183: The isPermanent404 function currently misses Supabase and
typed S3/R2 errors; update isPermanent404(err error) to first check for typed S3
errors using errors.As against *types.NotFound and *types.NoSuchKey, then check
the error string for the exact Supabase pattern "download failed with status
404" (instead of "StatusCode: 404"), and retain the generic "not_found" check
for other providers; reference the isPermanent404 function and use errors.As to
detect *types.NotFound/*types.NoSuchKey and strings.Contains(err.Error(),
"download failed with status 404") as described.
In `@internal/db/queue.go`:
- Around line 1086-1093: MarkArchiveSkipped currently only sets html_archived_at
and leaves hot-storage pointers set, which prevents MarkFullyArchivedJobs from
transitioning tasks to archived; modify the UPDATE in MarkArchiveSkipped to also
NULL out the hot-storage pointers (e.g., set html_storage_path = NULL and any
related field such as html_storage_provider or html_storage_key if present) in
the same UPDATE WHERE id = $1 AND html_archived_at IS NULL so the row is fully
treated as archived by MarkFullyArchivedJobs.
In `@internal/jobs/manager.go`:
- Around line 350-364: The timeout context (backgroundCtx) is created before
waiting on jm.sitemapSem, causing long semaphore waits to consume the sitemap
budget; change the logic to block on acquiring jm.sitemapSem first (detecting if
the wait itself times out and marking the job as failed/recording the timeout),
and only after a successful acquire create a context.WithTimeout
(30*time.Minute) and call jm.processSitemap using that new context; ensure you
still defer releasing jm.sitemapSem and canceling the new timeout context and
reference jm.sitemapSem, processSitemap, cancel, job.ID and normalisedDomain to
locate the affected code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 709810c3-36b5-414a-9dc1-10d19f15556a
📒 Files selected for processing (9)
CHANGELOG.mdfly.tomlinternal/archive/archive.gointernal/archive/scheduler.gointernal/archive/task_html.gointernal/db/queue.gointernal/jobs/interfaces.gointernal/jobs/manager.gointernal/jobs/worker_process_test.go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/jobs/manager.go (1)
361-364:⚠️ Potential issue | 🟠 MajorSemaphore-timeout failure path uses a cancelled context and can leave jobs stalled.
At Line 363,
jm.updateJobWithError(semCtx, ...)is called aftersemCtx.Done(), so the DB write will usually fail immediately. Also, onlyerror_messageis updated, which can leave the jobpendingwith no tasks. This is the same issue pattern raised earlier and still appears unresolved.Proposed fix
case <-semCtx.Done(): log.Error().Str("job_id", jobID).Msg("Timed out waiting for sitemap semaphore slot") - jm.updateJobWithError(semCtx, jobID, "timed out waiting for sitemap processing slot") + failCtx, failCancel := context.WithTimeout(context.Background(), 15*time.Second) + defer failCancel() + if err := jm.dbQueue.Execute(failCtx, func(tx *sql.Tx) error { + _, execErr := tx.ExecContext(failCtx, ` + UPDATE jobs + SET status = $1, + error_message = $2, + completed_at = $3 + WHERE id = $4 AND status IN ($5, $6) + `, JobStatusFailed, "timed out waiting for sitemap processing slot", time.Now().UTC(), jobID, JobStatusPending, JobStatusInitialising) + return execErr + }); err != nil { + log.Error().Err(err).Str("job_id", jobID).Msg("Failed to mark job as failed after sitemap semaphore timeout") + } return🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/jobs/manager.go` around lines 361 - 364, The semaphore-timeout branch calls jm.updateJobWithError with semCtx (already cancelled) and only sets error_message, which will typically fail and leave the job pending; change the handler to use a non-cancelled context (e.g., context.Background() or the request/manager root ctx) when calling jm.updateJobWithError(jobID, ...), and ensure updateJobWithError (or the update path invoked) sets the job status to a terminal state (failed) and marks/creates any necessary task records so the job cannot remain pending with no tasks; locate the timeout branch using semCtx, log.Error(..., jobID) and jm.updateJobWithError and swap the context and expand the update payload to include final job state and task cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/jobs/manager.go`:
- Around line 361-364: The semaphore-timeout branch calls jm.updateJobWithError
with semCtx (already cancelled) and only sets error_message, which will
typically fail and leave the job pending; change the handler to use a
non-cancelled context (e.g., context.Background() or the request/manager root
ctx) when calling jm.updateJobWithError(jobID, ...), and ensure
updateJobWithError (or the update path invoked) sets the job status to a
terminal state (failed) and marks/creates any necessary task records so the job
cannot remain pending with no tasks; locate the timeout branch using semCtx,
log.Error(..., jobID) and jm.updateJobWithError and swap the context and expand
the update payload to include final job state and task cleanup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9a762cf2-2277-4258-b0f7-4c66ae8a96a2
📒 Files selected for processing (4)
CHANGELOG.mdinternal/archive/scheduler.gointernal/db/queue.gointernal/jobs/manager.go
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/jobs/manager.go`:
- Around line 362-364: The timeout branch only calls jm.updateJobWithError which
sets error_message but doesn't mark the job terminal; change the semaphore wait
timeout handling to also mark the job as failed and set completed_at.
Specifically, after logging the timeout, call the job-failure update that sets
status="failed" and completed_at=time.Now() (or extend jm.updateJobWithError to
accept/update status and completed_at), e.g., invoke the jm method that marks a
job failed (or update the job record for jobID with status failed and
completed_at) using the same context.Background(), so timed-out sitemap waits
end up in a terminal failed state instead of lingering.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6a065113-15fa-473b-af13-5ea4e4d6b9c0
📒 Files selected for processing (1)
internal/jobs/manager.go
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/jobs/manager.go (1)
363-372:⚠️ Potential issue | 🟡 MinorAdd status guard to prevent overwriting a cancelled job.
The UPDATE lacks a status constraint, so if
CancelJobruns while this goroutine is waiting on the semaphore and marks the job ascancelled, this timeout handler could subsequently overwrite it tofailed. The codebase already uses this pattern elsewhere (e.g., line 1131).🛠️ Proposed fix
if updateErr := jm.dbQueue.Execute(context.Background(), func(tx *sql.Tx) error { _, err := tx.ExecContext(context.Background(), ` UPDATE jobs SET status = $1, error_message = $2, completed_at = $3 - WHERE id = $4 - `, JobStatusFailed, "timed out waiting for sitemap processing slot", time.Now().UTC(), jobID) + WHERE id = $4 AND status IN ($5, $6) + `, JobStatusFailed, "Job timed out: waiting for sitemap processing slot", time.Now().UTC(), jobID, JobStatusPending, JobStatusInitialising) return err }); updateErr != nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/jobs/manager.go` around lines 363 - 372, The UPDATE in jm.dbQueue.Execute can overwrite a job that was cancelled while waiting; change the UPDATE to include a status guard so it does not modify cancelled jobs (e.g. add "AND status != $1" or "AND status <> $1") and pass JobStatusCancelled as an extra parameter, updating the parameter order used in tx.ExecContext accordingly; keep this change localized in the jm.dbQueue.Execute block that sets JobStatusFailed with the "timed out waiting for sitemap processing slot" message so cancelled jobs are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/jobs/manager.go`:
- Around line 363-372: The UPDATE in jm.dbQueue.Execute can overwrite a job that
was cancelled while waiting; change the UPDATE to include a status guard so it
does not modify cancelled jobs (e.g. add "AND status != $1" or "AND status <>
$1") and pass JobStatusCancelled as an extra parameter, updating the parameter
order used in tx.ExecContext accordingly; keep this change localized in the
jm.dbQueue.Execute block that sets JobStatusFailed with the "timed out waiting
for sitemap processing slot" message so cancelled jobs are preserved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb9c07bb-11ec-452c-a25c-13cc26a9e992
📒 Files selected for processing (1)
internal/jobs/manager.go
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-323.fly.dev |
Summary
GNH_SITEMAP_CONCURRENCY = 3): at most 3 jobs may insert sitemap batches concurrently. Previously, N simultaneous job creations launched N independent goroutines each writing 100-URL UNNEST batches every 200ms, which combined into a write burst that pushed DB EMA above 60ms and shed the concurrency limit to the 10-slot floor — leaving 80+ jobs starved.html_archived_at = NOW()and excluded from future sweeps. Previously it was left withhtml_archived_at = NULL, causing the archive scheduler to retry the same unrecoverable tasks on every sweep.Test plan
go build ./...go test ./internal/jobs/... ./internal/archive/...🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Chores
Documentation