Skip to content

Harden WAF pre-flight + breaker edge cases#368

Merged
simonsmallchua merged 8 commits into
mainfrom
work/waf-followups-a
Apr 29, 2026
Merged

Harden WAF pre-flight + breaker edge cases#368
simonsmallchua merged 8 commits into
mainfrom
work/waf-followups-a

Conversation

@simonsmallchua
Copy link
Copy Markdown
Contributor

@simonsmallchua simonsmallchua commented Apr 28, 2026

Summary

All five CodeRabbit follow-ups to #367 (WAF detection, row 1 of #365), in one
PR per request. Each addresses a finding that would only bite under a specific
failure mode; none change the contract working in production today.

Three small ones (commit 1)

  • Pre-flight fail-fallback (manager.go runWAFPreflight): if BlockJob's DB write errs, transition the job to failed via a status-guarded write rather than returning to the caller with the job still in pending and no tasks.
  • JobStatusBlocked terminal cleanup (manager.go UpdateJobStatus): adds JobStatusBlocked to the case-arm that drops processedPages and milestone state, matching the other terminal statuses.
  • Case-insensitive probe scheme (probe.go normaliseProbeTarget): HTTPS://example.com no longer double-prefixes to https://HTTPS://....

Two meatier ones (commit 2)

  • BlockJob CAS guard (manager.go BlockJob): the terminal UPDATE jobs now restricts to pre-terminal statuses and rolls the whole tx back if zero rows match. Stops a stale GetJob pre-read from overwriting a freshly-completed/failed/cancelled job and stamping domains.waf_blocked off a verdict that didn't land. Race-lost surfaces as nil success, not a red error.
  • Circuit breaker async dispatch + re-arm (waf_circuit_breaker.go MaybeTripFromOutcome): BlockJob now runs in a detached goroutine with a 30 s timeout so the stream worker hot path can't stall on terminal-state DB lock contention. On BlockJob failure the breaker re-arms for the job — a transient DB blip no longer permanently disables it.

Tests added

  • TestNormaliseProbeTarget — table-driven, covers mixed-case schemes.
  • TestFailJobWithMessage_StatusGuard — sqlmock, asserts the fallback SQL has the four-status WHERE guard.
  • TestBlockJob_RaceLostReturnsNil — sqlmock, asserts the CAS miss rolls the tx back and surfaces as nil success (no domain stamp, no outbox delete).
  • TestBlockJob_LockOrder — updated to assert the new SQL signature including the CAS clause.
  • TestMaybeTripFromOutcome_AsyncDispatch — holds BlockJob inside a stub via a channel; verifies the caller returns promptly.
  • TestMaybeTripFromOutcome_RearmAfterFailure — BlockJob returns error; subsequent Observe for the same job must trip again.
  • TestMaybeTripFromOutcome_NoRearmOnSuccess — BlockJob returns nil; subsequent Observe must NOT trip again (single-fire preserved).

Test plan

  • go test ./... green (verified locally).
  • Existing target.com.au / woolworths.com.au blocked behaviour unchanged.
  • Existing amazon.com healthy behaviour unchanged.

Summary by CodeRabbit

  • Bug Fixes

    • Pre-flight WAF failures now reliably mark affected jobs failed with clear errors.
    • Race conditions fixed so terminal job updates are not overwritten.
    • Blocked-job transitions perform the same in-process cleanup as other terminal states.
    • Enqueueing and sitemap discovery stop promptly when a job becomes terminal.
  • Behavior Changes

    • WAF probe scheme detection is case-insensitive to prevent malformed targets.
    • Akamai Bot Manager cookies (_abck, bm_sz) are now recognized as Akamai blocking signals.
    • WAF mid-job circuit breaker trips sooner (2→) and dispatches blocking actions asynchronously with a 30s timeout; failures re-arm for retry.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces Unreleased changelog entry and implements WAF and job-lifecycle reliability fixes: case-insensitive probe scheme detection, Akamai cookie detection, BlockJob CAS guard + fail-fallback, async BlockJob dispatch with re-arm on failure, enqueue terminal short-circuit, and sitemap mid-batch terminal checks.

Changes

Cohort / File(s) Summary
Changelog
CHANGELOG.md
Populate Unreleased with detailed Fixes: WAF and job-lifecycle reliability changes.
WAF detection & probe normalization
internal/crawler/probe.go, internal/crawler/probe_test.go, internal/crawler/waf.go, internal/crawler/waf_test.go
Make scheme detection case-insensitive; add unit test for normalization. Extend Akamai detection to include _abck and bm_sz cookies (lowercased check) and add corresponding tests.
BlockJob & JobManager race-safety
internal/jobs/manager.go, internal/jobs/block_job_test.go
Add CAS-style UPDATE ... WHERE status IN (...) for BlockJob, return sentinel on 0 rows (race lost) treated as success by callers; add fallback path to mark pre-flight failures as failed via failJobWithMessage; ensure JobStatusBlocked clears in-memory tracking. Tests updated/added for CAS and race scenarios.
failJobWithMessage unit test
internal/jobs/fail_job_message_test.go
Add test ensuring guarded status update to failed with message only applies when current status is in allowed pre-terminal set.
WAFCircuitBreaker async dispatch & rearm
internal/jobs/waf_circuit_breaker.go, internal/jobs/waf_circuit_breaker_dispatch_test.go
Lower default trip threshold (3→2); dispatch BlockJob asynchronously with 30s detached timeout; add Rearm(jobID) to clear tripped flag on dispatch failure; add tests for async dispatch, rearm-on-failure, and no-rearm-on-success.
Enqueue terminal guard & API
internal/db/queue.go, internal/db/queue_test.go
Read j.status under FOR UPDATE OF j and short-circuit enqueue when job is terminal; add exported IsTerminalJobStatus and tests.
Sitemap terminal guard tests
internal/jobs/sitemap_terminal_guard_test.go
Add tests for isJobInTerminalStatus covering terminal, non-terminal, and query-error flows to stop sitemap parsing mid-discovery when job becomes terminal.

Sequence Diagram(s)

sequenceDiagram
  participant WAF as "WAF Observer"
  participant Breaker as "WAFCircuitBreaker"
  participant Dispatch as "Dispatch Goroutine\n(30s timeout)"
  participant JM as "JobManager.BlockJob"
  participant DB as "Database"

  WAF->>Breaker: observe blocked outcome for jobID
  Breaker->>Breaker: mark jobID tripped
  Breaker->>Dispatch: spawn detached dispatch (returns immediately)
  Dispatch->>JM: call BlockJob(jobID) (detached ctx, 30s)
  JM->>DB: UPDATE jobs ... WHERE id=? AND status IN (pre-terminal)
  DB-->>JM: RowsAffected (1 => success / 0 => race lost)
  alt success
    JM-->>Dispatch: success
    Dispatch-->>Breaker: no rearm
  else failure or timeout
    Dispatch-->>Breaker: report failure
    Breaker->>Breaker: Rearm(jobID) (clear tripped flag)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 A hop, a sniff, a lowercase scheme,
Cookies caught where blockers gleam.
Jobs that race now gently fall,
Breakers shout and then re-call.
Carrots clapped — the crawl runs clean.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main intent: hardening edge cases in WAF pre-flight and circuit breaker logic. It directly matches the PR's core objectives—handling failures, CAS guards, async dispatch, and re-arming.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch work/waf-followups-a

Comment @coderabbitai help to get the list of available commands and usage tips.

@supabase
Copy link
Copy Markdown

supabase Bot commented Apr 28, 2026

Updates to Preview Branch (work/waf-followups-a) ↗︎

Deployments Status Updated
Database Tue, 28 Apr 2026 23:14:48 UTC
Services Tue, 28 Apr 2026 23:14:48 UTC
APIs Tue, 28 Apr 2026 23:14:48 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Tue, 28 Apr 2026 23:14:50 UTC
Migrations Tue, 28 Apr 2026 23:14:52 UTC
Seeding Tue, 28 Apr 2026 23:14:53 UTC
Edge Functions Tue, 28 Apr 2026 23:14:53 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 28, 2026

Release Versions

App patch: v0.33.13v0.33.14

Changelog

Fixed

  • WAF pre-flight no longer strands jobs in pending if BlockJob's DB write
    fails — a fallback transition writes failed with an explanatory message so
    the job always reaches a terminal state.
  • Customer-facing jobs.error_message for the WAF fallback path is now a stable
    "WAF detected but block transition failed" string. The raw underlying error
    (which could include DB driver text like
    pq: SSL is not enabled on the server) is still logged via the structured ops
    logger with vendor/reason/domain context, but no longer leaks into the
    customer-visible field.
  • JobStatusBlocked now triggers the same per-job in-process state cleanup
    (processedPages, milestones) as the other terminal statuses; long-running
    workers no longer leak map entries for blocked jobs.
  • WAF probe scheme detection is now case-insensitive — HTTPS://example.com no
    longer double-prefixes to https://HTTPS://... and silently skips the
    verdict.
  • BlockJob now CAS-guards the terminal UPDATE jobs against a stale pre-read
    status, so a freshly-completed/failed/cancelled job from a concurrent worker
    is no longer overwritten with blocked (and the domain row no longer stamped
    off a verdict that didn't actually land for that run). A lost race rolls the
    whole transaction back and surfaces as nil success.
  • The WAF mid-job circuit breaker now dispatches BlockJob in a detached
    goroutine with a 30 s timeout, so the stream worker hot path can't stall on
    terminal-state DB lock contention. On BlockJob failure the breaker re-arms
    for the job, so a transient DB blip no longer permanently disables it.
  • EnqueueURLs now short-circuits under its existing FOR UPDATE OF j row lock
    when the target job is in a terminal status (blocked, cancelled, failed,
    completed, archived). Without this, sitemap discovery and link extraction kept
    inserting orphan tasks for jobs that had already transitioned terminal
    mid-flight — kmart.com.au-class jobs were accreting 32k+ pending rows after
    the circuit breaker had already fired. The sitemap-discovery loop additionally
    reads job status between batches as a cheap pre-flight, so terminal jobs stop
    parsing remaining batches instead of round-tripping to the DB just to be
    rejected.

Changed

  • WAF detector now recognises Akamai Bot Manager _abck and bm_sz cookies on
    blocking status codes (403/202) as Akamai signals. Catches BM-fronted sites
    that don't emit Server: AkamaiGHost or akaalb_* cookies (e.g.
    kmart.com.au) and gives the mid-job circuit breaker vendor=akamai
    attribution instead of falling through to generic. Cookies on a 200 response
    are explicitly NOT treated as a block — many sites run BM in monitor mode
    without ever blocking.
  • WAF mid-job circuit breaker default threshold lowered from 3 → 2 consecutive
    WAF responses. Trips ~33% earlier, capping orphan-task accumulation when a
    large sitemap is mid-discovery. Override via
    GNH_WAF_CIRCUIT_BREAKER_THRESHOLD.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 24 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal/jobs/manager.go 67.34% 15 Missing and 1 partial ⚠️
internal/db/queue.go 45.45% 6 Missing ⚠️
internal/jobs/waf_circuit_breaker.go 90.90% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/jobs/manager.go`:
- Line 588: The job failure message currently includes raw internal error text
by concatenating err.Error() into the call to jm.failJobWithMessage; replace
that concatenation with a stable, customer-safe message (e.g., "WAF detected but
block transition failed") and log the detailed err separately to the service
logger. Locate the failing call to jm.failJobWithMessage (using job.ID) and
update it to pass only the generic message, then add a nearby log entry (using
the existing logger) that records err and contextual info for debugging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 067a8b31-595d-4c1e-8dfd-d80dbba86946

📥 Commits

Reviewing files that changed from the base of the PR and between 4e98d97 and 3cb8948.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • internal/crawler/probe.go
  • internal/crawler/probe_test.go
  • internal/jobs/fail_job_message_test.go
  • internal/jobs/manager.go

Comment thread internal/jobs/manager.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

@simonsmallchua simonsmallchua changed the title Harden WAF pre-flight edge cases Harden WAF pre-flight + breaker edge cases Apr 28, 2026
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/jobs/waf_circuit_breaker.go`:
- Around line 121-128: Rearm currently only clears the tripped flag so a job
must accumulate threshold more failures before retrying; change Rearm (and the
other similar rearm occurrence) to also seed the per-job counter to threshold-1
and set the last vendor so the next blocked outcome will immediately retrip:
locate WAFCircuitBreaker.Rearm and the duplicate rearm block, acquire the mutex
as now, unset tripped[jobID], then set counts[jobID] = b.threshold - 1 and
lastVendor[jobID] = <previous vendor value stored on trip> (or preserve existing
lastVendor if already set) so Observe() logic will trigger a retrip on the next
blocked event. Ensure you handle nil/empty jobID checks exactly as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7fc90932-6ad5-4227-b716-42b3f36a5ec2

📥 Commits

Reviewing files that changed from the base of the PR and between a6799b9 and cdd3782.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • internal/crawler/waf.go
  • internal/crawler/waf_test.go
  • internal/jobs/waf_circuit_breaker.go

Comment thread internal/jobs/waf_circuit_breaker.go Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Docstrings generation - SUCCESS
Generated docstrings and committed to branch work/waf-followups-a (commit: 50e6ab33be6c038da9b6398d6b4dfea68094395e)

coderabbitai Bot and others added 2 commits April 28, 2026 22:15
Docstrings generation was requested by @simonsmallchua.

The following files were modified:

* `internal/crawler/probe.go`
* `internal/crawler/waf.go`
* `internal/db/queue.go`

These files were ignored:
* `internal/crawler/probe_test.go`
* `internal/crawler/waf_test.go`
* `internal/db/queue_test.go`
* `internal/jobs/block_job_test.go`
* `internal/jobs/fail_job_message_test.go`
* `internal/jobs/sitemap_terminal_guard_test.go`
* `internal/jobs/waf_circuit_breaker_dispatch_test.go`

These file types are not supported:
* `CHANGELOG.md`
@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-368.fly.dev
Dashboard: https://hover-pr-368.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit 7ac7898 into main Apr 29, 2026
19 checks passed
@simonsmallchua simonsmallchua deleted the work/waf-followups-a branch April 29, 2026 00:29
@coderabbitai coderabbitai Bot mentioned this pull request May 11, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant