Add archive and activity indexes on tasks#312
Conversation
|
Updates to Preview Branch (optimise/db-archive-and-stuck-job-indexes) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds three partial indexes on public.tasks, consolidates per-row INSERT triggers into a statement-level batched updater, moves page-analytics traffic-score updates out of the job-row lock into a conditional post-commit transaction, and replaces single-task promotion with a batch promotion function and callers. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller
participant Queue as DbQueue
participant DB as Database
participant Log as Logger
Caller->>Queue: EnqueueURLs(request)
Queue->>DB: BEGIN (acquire job lock)
Queue->>DB: INSERT/UPDATE tasks, update job counters (use total_tasks - skipped_tasks)
DB-->>Queue: COMMIT (release job lock)
Queue->>Queue: evaluate cfg, sourceType, page priorities
alt cfg valid and gating satisfied
Queue->>DB: BEGIN post-commit transaction
Queue->>DB: UPDATE tasks FROM page_analytics (apply traffic scores)
DB-->>Queue: COMMIT / rows affected
else gating not met
Queue->>Log: skip traffic-score update
end
alt update error
Queue->>Log: log non-fatal warning
end
Queue->>DB: call promote_waiting_tasks_for_job(job_id, freed_slots)
DB-->>Queue: return promoted_count
Queue-->>Caller: return success
Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Release VersionsApp patch: ChangelogPerformance
|
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/db/queue.go`:
- Around line 1547-1578: The UPDATE currently scopes by job_id only and can
affect unrelated tasks; instead build an eligiblePageIDs slice from the same
filtered pages used for enqueueing (the variable pages), and when sourceType ==
"link" include only pages with Priority >= minPriorityForTrafficScore; then call
q.Execute (the same block around tx.ExecContext) but pass eligiblePageIDs as an
additional parameter and change the SQL WHERE to restrict to those page IDs
(e.g. AND p.id = ANY($4)) so only pages from this enqueue batch get their
priority_score updated; keep using cfg.orgID, cfg.domainID and jobID as before
and only run the update when eligiblePageIDs is non-empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e3bdd8db-c358-4889-9e94-e78233af6ec7
📒 Files selected for processing (1)
internal/db/queue.go
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/db/queue.go (1)
1549-1573:⚠️ Potential issue | 🟠 MajorScope the post-commit traffic-score UPDATE to this enqueue batch.
Once any page in
pagescrosses the threshold, this query re-scores every pending/waiting task forjob_id, including older deep-link tasks and even no-op enqueues. That leaks the new gate across batches and can reorder unrelated work.Suggested fix
- if cfg.orgID.Valid && cfg.domainID.Valid { + if cfg.orgID.Valid && cfg.domainID.Valid && len(eligiblePageIDs) > 0 { eligible := sourceType != "link" if !eligible { for _, p := range pages { if p.Priority >= minPriorityForTrafficScore { eligible = true break } } } if eligible { if err2 := q.Execute(ctx, func(tx *sql.Tx) error { _, err := tx.ExecContext(ctx, ` UPDATE tasks t SET priority_score = GREATEST(t.priority_score, COALESCE(pa.traffic_score, 0)) FROM pages p JOIN page_analytics pa ON pa.organisation_id = $1 AND pa.domain_id = $2 AND pa.path = p.path WHERE t.page_id = p.id AND t.job_id = $3 + AND t.page_id = ANY($4) AND t.status IN ('pending', 'waiting') AND COALESCE(pa.traffic_score, 0) > t.priority_score - `, cfg.orgID.String, cfg.domainID.Int64, jobID) + `, cfg.orgID.String, cfg.domainID.Int64, jobID, pq.Array(eligiblePageIDs)) return err }); err2 != nil {Build
eligiblePageIDsfrom the pages that actually survived enqueueing, and forsourceType == "link"keep only those that meet the threshold.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/db/queue.go` around lines 1549 - 1573, The UPDATE that writes COALESCE(pa.traffic_score,0) to tasks currently runs for the whole job when any page crosses the threshold; instead, compute eligiblePageIDs from the pages that actually survived enqueueing (use the same post-enqueue filter used elsewhere), and if sourceType == "link" further restrict eligiblePageIDs to those with Priority >= minPriorityForTrafficScore, then call q.Execute/q.Tx to run the UPDATE using WHERE t.job_id = jobID AND t.page_id = ANY($4) (or an IN list) and only when eligiblePageIDs is non-empty so you only re-score tasks belonging to the newly enqueued pages; reference cfg.orgID, cfg.domainID, sourceType, pages, minPriorityForTrafficScore, q.Execute, and jobID when locating the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@supabase/migrations/20260409025105_merge_insert_trigger_to_statement_level.sql`:
- Around line 51-56: Remove INSERT from the row-level trigger by adding a
migration that drops and recreates trg_update_job_queue_counters so it only
fires on UPDATE or DELETE; specifically, DROP TRIGGER IF EXISTS
trg_update_job_queue_counters ON public.tasks and then CREATE TRIGGER
trg_update_job_queue_counters AFTER UPDATE OR DELETE ON public.tasks FOR EACH
ROW EXECUTE FUNCTION update_job_queue_counters(); this ensures
update_job_queue_counters() is not invoked per-row on INSERT and completes the
switch to statement-level inserts handled by
update_job_counters_insert_batch()/trg_job_counters_insert.
---
Duplicate comments:
In `@internal/db/queue.go`:
- Around line 1549-1573: The UPDATE that writes COALESCE(pa.traffic_score,0) to
tasks currently runs for the whole job when any page crosses the threshold;
instead, compute eligiblePageIDs from the pages that actually survived
enqueueing (use the same post-enqueue filter used elsewhere), and if sourceType
== "link" further restrict eligiblePageIDs to those with Priority >=
minPriorityForTrafficScore, then call q.Execute/q.Tx to run the UPDATE using
WHERE t.job_id = jobID AND t.page_id = ANY($4) (or an IN list) and only when
eligiblePageIDs is non-empty so you only re-score tasks belonging to the newly
enqueued pages; reference cfg.orgID, cfg.domainID, sourceType, pages,
minPriorityForTrafficScore, q.Execute, and jobID when locating the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 20a33fa1-ec1b-49ab-b2e9-5938b6173961
📒 Files selected for processing (3)
CHANGELOG.mdinternal/db/queue.gosupabase/migrations/20260409025105_merge_insert_trigger_to_statement_level.sql
| DROP TRIGGER IF EXISTS trg_job_counters_insert ON public.tasks; | ||
| CREATE TRIGGER trg_job_counters_insert | ||
| AFTER INSERT ON public.tasks | ||
| REFERENCING NEW TABLE AS new_tasks | ||
| FOR EACH STATEMENT | ||
| EXECUTE FUNCTION update_job_counters_insert_batch(); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C3 'trg_job_queue_counters|EXECUTE FUNCTION update_job_queue_counters|CREATE TRIGGER' supabase/migrationsRepository: Harvey-AU/hover
Length of output: 15048
🏁 Script executed:
cat -n supabase/migrations/20260409025105_merge_insert_trigger_to_statement_level.sqlRepository: Harvey-AU/hover
Length of output: 9945
Finish the trigger split by removing INSERT from the existing row-level trigger.
update_job_queue_counters() now returns early on INSERT, but its trigger (trg_update_job_queue_counters) still listens to INSERT events. This means every task insert still fires a row-level trigger invocation that immediately returns, creating avoidable hot-path overhead and leaving the statement-level optimisation incomplete.
Add a migration to drop and recreate trg_update_job_queue_counters with AFTER UPDATE OR DELETE only (removing INSERT), so batched inserts no longer trigger the row-level function unnecessarily.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@supabase/migrations/20260409025105_merge_insert_trigger_to_statement_level.sql`
around lines 51 - 56, Remove INSERT from the row-level trigger by adding a
migration that drops and recreates trg_update_job_queue_counters so it only
fires on UPDATE or DELETE; specifically, DROP TRIGGER IF EXISTS
trg_update_job_queue_counters ON public.tasks and then CREATE TRIGGER
trg_update_job_queue_counters AFTER UPDATE OR DELETE ON public.tasks FOR EACH
ROW EXECUTE FUNCTION update_job_queue_counters(); this ensures
update_job_queue_counters() is not invoked per-row on INSERT and completes the
switch to statement-level inserts handled by
update_job_counters_insert_batch()/trg_job_counters_insert.
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
internal/db/queue.go (1)
1543-1579:⚠️ Potential issue | 🟠 MajorScope the post-commit traffic-score update to the current eligible pages.
Once a single eligible link page is present, this query still updates every
pending/waitingtask for the job. That re-scores older deep-link tasks that were meant to stay on structural ordering only, and even a no-op enqueue can rewrite unrelated rows. Restrict theUPDATEto page IDs from this enqueue batch that actually met the eligibility rule.Patch idea
+ eligiblePageIDs := make([]int, 0, len(pageIDs)) + for i, pageID := range pageIDs { + if sourceType != "link" || priorities[i] >= minPriorityForTrafficScore { + eligiblePageIDs = append(eligiblePageIDs, pageID) + } + } - eligible := sourceType != "link" - if !eligible { - for _, p := range pages { - if p.Priority >= minPriorityForTrafficScore { - eligible = true - break - } - } - } - if eligible { + if len(eligiblePageIDs) > 0 { if err2 := q.Execute(ctx, func(tx *sql.Tx) error { _, err := tx.ExecContext(ctx, ` UPDATE tasks t SET priority_score = GREATEST(t.priority_score, COALESCE(pa.traffic_score, 0)) FROM pages p JOIN page_analytics pa ON pa.organisation_id = $1 AND pa.domain_id = $2 AND pa.path = p.path WHERE t.page_id = p.id AND t.job_id = $3 + AND t.page_id = ANY($4) AND t.status IN ('pending', 'waiting') AND COALESCE(pa.traffic_score, 0) > t.priority_score - `, cfg.orgID.String, cfg.domainID.Int64, jobID) + `, cfg.orgID.String, cfg.domainID.Int64, jobID, pq.Array(eligiblePageIDs)) return err }); err2 != nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/db/queue.go` around lines 1543 - 1579, The UPDATE currently applies traffic scores to all pending/waiting tasks for the job; collect the IDs of pages from this enqueue batch that met the eligibility check (i.e., for sourceType "link" only include pages with p.Priority >= minPriorityForTrafficScore, otherwise include all page IDs), and pass that slice into the post-commit q.Execute call (use jobID, cfg.orgID/cfg.domainID as before) and restrict the UPDATE to only those page IDs (e.g., add AND t.page_id = ANY($4) or JOIN on pages p WHERE p.id = ANY($4)), so only tasks for the current eligible pages are updated. Ensure the new parameter uses the same sql.ExecContext invocation that contains the UPDATE and adjust arguments accordingly in the q.Execute closure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@internal/db/queue.go`:
- Around line 1543-1579: The UPDATE currently applies traffic scores to all
pending/waiting tasks for the job; collect the IDs of pages from this enqueue
batch that met the eligibility check (i.e., for sourceType "link" only include
pages with p.Priority >= minPriorityForTrafficScore, otherwise include all page
IDs), and pass that slice into the post-commit q.Execute call (use jobID,
cfg.orgID/cfg.domainID as before) and restrict the UPDATE to only those page IDs
(e.g., add AND t.page_id = ANY($4) or JOIN on pages p WHERE p.id = ANY($4)), so
only tasks for the current eligible pages are updated. Ensure the new parameter
uses the same sql.ExecContext invocation that contains the UPDATE and adjust
arguments accordingly in the q.Execute closure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 02690264-2135-46fc-af27-e25d9e3fdf7b
📒 Files selected for processing (4)
CHANGELOG.mdinternal/db/batch.gointernal/db/queue.gosupabase/migrations/20260409064200_promote_waiting_tasks_batch.sql
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-312.fly.dev |
…job-indexes Add archive and activity indexes on tasks
Summary
Reduces Supabase CPU load identified in load testing (98% at 65 concurrent jobs) via three indexes and a lock-hold reduction in `EnqueueURLs`.
1. Three indexes on `tasks` (migration)
`idx_tasks_job_has_html_storage` (partial, `WHERE html_storage_path IS NOT NULL`)
`idx_tasks_job_html_archived` (partial, `WHERE html_archived_at IS NOT NULL`)
`idx_tasks_job_activity_times` (compound `job_id, started_at DESC, completed_at DESC`)
2. Traffic score update moved outside the job-row lock (`EnqueueURLs`)
Problem: The `FOR UPDATE OF j` lock in `EnqueueURLs` was held while running a `page_analytics` JOIN to apply traffic scores. With 10 concurrent workers per job all calling `EnqueueURLs`, they queue behind each other waiting for this lock — the 1,228ms average is almost entirely lock wait time, not query time.
Fix: Traffic score `UPDATE` now runs in a separate `Execute` call after the first transaction commits and the job lock is released. Other workers for the same job can proceed immediately after the INSERT.
Added threshold: For `sourceType == "link"` (link-discovered pages), traffic scores are only applied if the source page priority is ≥ 0.729 (≈ 0.9³, meaning ~3 link-hops from the homepage). Pages deeper than this are too numerous and their relative ordering matters less to crawl efficiency. Sitemap and other source types always qualify.
What was skipped and why
Test plan
Summary by CodeRabbit
Performance
Chores