
Retry Redis ping on startup #386

Open
simonsmallchua wants to merge 3 commits into main from work/gallant-elbakyan-64b835

@simonsmallchua
Contributor

@simonsmallchua simonsmallchua commented May 11, 2026

Summary

  • Add (*broker.Client).PingWithRetry(ctx, total, perAttempt) with capped exponential backoff and per-attempt timeout.
  • Swap the three Ping call sites in cmd/app, cmd/worker, cmd/analysis to use it (30s budget, 3s per attempt).
  • Add unit coverage for the retry loop (immediate success, transient errors, budget exhaustion, context cancellation).

Why

Every PR preview spin-up generated a small burst of Sentry errors on staging: *errors.errorString: EOF, reported as "failed to ping Redis", from each of the three binaries. Triaged in Sentry:

  • HOVER-JX: hover-worker-pr-*, 169 occurrences since 2026-04-19.
  • HOVER-MD: hover-analysis-pr-*, 158 occurrences since 2026-04-28.
  • HOVER-JZ: hover-pr-*, 153 occurrences since 2026-04-20.

Review apps provision a fresh per-PR Upstash-on-Fly Redis and pass the URL as a secret. The Fly machine boots and calls client.Ping(context.Background()) immediately. During the Upstash cold-start window TCP connects but the server closes the connection with EOF before answering PING. The client's built-in MaxRetries: 3 burns through in milliseconds inside the same dead window, the binary Fatals, Fly restarts the machine, and the next boot succeeds — hence the burst-per-deploy pattern with zero prod impact (production Redis is warm).

The fix lets the binary ride out the cold-start window instead of crashing. On a healthy Redis the first ping succeeds and the helper returns immediately, so there's no production latency regression. On genuine misconfiguration the helper still exhausts its budget and Fatals — Sentry still gets one signal instead of three back-to-back.
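
As a rough illustration of the behaviour described above, here is a minimal, self-contained sketch of a bounded retry loop with a per-attempt timeout and capped exponential backoff. It is not the code from this PR: the `pingFunc` type, the `baseDelay` and `maxDelay` parameters, and the `simulateColdStart` helper are assumptions made for the example, whereas the real `PingWithRetry` takes only `ctx`, `total`, and `perAttempt`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pingFunc is a hypothetical stand-in for the real client's Ping call.
type pingFunc func(ctx context.Context) error

// pingWithRetry calls ping until it succeeds, ctx is cancelled, or the total
// budget is spent. baseDelay and maxDelay are illustrative knobs, not values
// taken from the PR.
func pingWithRetry(ctx context.Context, total, perAttempt, baseDelay, maxDelay time.Duration, ping pingFunc) error {
	deadline := time.Now().Add(total)
	delay := baseDelay
	var lastErr error

	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			if lastErr == nil {
				lastErr = context.DeadlineExceeded
			}
			return fmt.Errorf("redis ping retry budget exhausted: %w", lastErr)
		}

		// Never let a single attempt outlive the overall budget.
		timeout := perAttempt
		if remaining < timeout {
			timeout = remaining
		}
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		err := ping(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		lastErr = err

		// Sleep for the backoff interval, waking early if the caller cancels.
		// The sleep is clamped so it cannot run past the deadline.
		wait := delay
		if left := time.Until(deadline); left < wait {
			wait = left
		}
		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}

		// Double the delay for the next round, up to the cap.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// simulateColdStart runs the loop against a stub that fails `failures` times
// with an EOF-like error before succeeding, and reports how many pings ran.
func simulateColdStart(failures int) (int, error) {
	calls := 0
	ping := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("EOF")
		}
		return nil
	}
	err := pingWithRetry(context.Background(), 2*time.Second, 200*time.Millisecond,
		10*time.Millisecond, 100*time.Millisecond, ping)
	return calls, err
}

func main() {
	calls, err := simulateColdStart(2)
	fmt.Println("calls:", calls, "err:", err)
}
```

In this sketch a healthy server costs exactly one ping and no sleep, which matches the claim that the warm production path sees no added latency.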

Fixes HOVER-JX HOVER-MD HOVER-JZ

Test plan

  • gofmt, goimports, go vet ./internal/broker/... ./cmd/app ./cmd/worker ./cmd/analysis
  • go test ./internal/broker/ — full package, including new TestPingWithRetry
  • go build ./cmd/app ./cmd/worker ./cmd/analysis
  • PR review-apps deploy: confirm no EOF events on hover-pr-<N>, hover-worker-pr-<N>, hover-analysis-pr-<N> during boot and that connected to Redis appears in all three startup logs

Summary by CodeRabbit

  • New Features

    • Improved Redis startup resilience: services now retry connecting for a bounded window to tolerate transient connection issues.
  • Bug Fixes

    • Prevents premature shutdown on transient Redis failures while still failing on persistent misconfiguration.
  • Tests

    • Added tests covering retry behavior, failure handling, and cancellation scenarios.
  • Documentation

    • Changelog updated to reflect the startup retry behavior.

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a030b9cd-a9d1-46a6-af17-f470c6f8d567

📥 Commits

Reviewing files that changed from the base of the PR and between 4338f2e and b79c830.

📒 Files selected for processing (2)
  • internal/broker/redis.go
  • internal/broker/redis_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/broker/redis.go

📝 Walkthrough

Startup Redis checks now use a bounded retry loop. A new Client.PingWithRetry(ctx, total, perAttempt) retries PING with per-attempt timeouts and capped exponential backoff; tests and three command entry points (analysis, app, worker) were updated and the changelog documents the fix.

Changes

Redis Retry-Based Health Checking

| Layer | File(s) | Summary |
|---|---|---|
| Redis Retry Health Check Interface | internal/broker/redis.go | Introduces exported PingWithRetry(ctx, total, perAttempt) with docstring describing retry semantics. |
| Retry Logic Implementation | internal/broker/redis.go | Implements the retry loop: absolute deadline, per-attempt timeouts, repeated PING attempts, capped exponential backoff, early exit on context cancellation, and final error return on budget exhaustion. |
| Retry Behavior Tests | internal/broker/redis_test.go | Adds TestPingWithRetry with subtests for immediate success, eventual success after transient failures, retry budget exhaustion, context cancellation, and a regression guard for per-attempt timeout clamping; imports updated. |
| Startup Health Check Integration | cmd/analysis/main.go, cmd/app/main.go, cmd/worker/main.go | Replaces single-shot Ping calls with PingWithRetry(context.Background(), 30*time.Second, 3*time.Second) at startup, preserving fatal-on-error behavior and success logging. |
| Changelog | CHANGELOG.md | Adds a ### Fixed entry under Unreleased describing the bounded-retry behavior for Redis PING at startup. |

Sequence Diagram(s)

sequenceDiagram
  participant Entrypoint
  participant BrokerClient
  participant Redis
  Entrypoint->>BrokerClient: PingWithRetry(ctx, total, perAttempt)
  BrokerClient->>Redis: PING (per-attempt timeout)
  Redis-->>BrokerClient: PONG or error
  alt PONG
    BrokerClient-->>Entrypoint: success
  else error and time remaining
    BrokerClient->>BrokerClient: sleep (exponential backoff, capped)
    BrokerClient->>Redis: PING (next attempt)
  else deadline exceeded or ctx canceled
    BrokerClient-->>Entrypoint: return last error or ctx.Err()
  end

🎯 3 (Moderate) | ⏱️ ~25 minutes

"I nibble on retries, patient and spry,
30 seconds of hope before I sigh,
Backoff like hops, small and then wide,
Redis wakes up — I smile with pride." 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Retry Redis ping on startup' directly and clearly summarizes the main change: replacing single Ping calls with a retry-capable PingWithRetry method across three startup entry points. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@supabase

supabase Bot commented May 11, 2026

Updates to Preview Branch (work/gallant-elbakyan-64b835) ↗︎

| Deployments | Status | Updated |
|---|---|---|
| Database | | Mon, 11 May 2026 22:54:00 UTC |
| Services | | Mon, 11 May 2026 22:54:00 UTC |
| APIs | | Mon, 11 May 2026 22:54:00 UTC |

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

| Tasks | Status | Updated |
|---|---|---|
| Configurations | | Mon, 11 May 2026 22:54:02 UTC |
| Migrations | | Mon, 11 May 2026 22:54:04 UTC |
| Seeding | | Mon, 11 May 2026 22:54:05 UTC |
| Edge Functions | | Mon, 11 May 2026 22:54:06 UTC |

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@github-actions
Contributor

github-actions Bot commented May 11, 2026

Release Versions

App patch: v0.34.10 → v0.34.11

Changelog

Fixed

  • App, worker, and analysis binaries no longer Fatal on the first Redis PING
    failure at startup. The ping is now wrapped in a bounded retry loop (30 s
    total, 3 s per attempt, capped exponential backoff) so the binary rides out
    the Upstash-on-Fly cold-start window that briefly closes connections with EOF
    on freshly-provisioned review apps. Production behaviour is unchanged: a
    healthy Redis still succeeds on the first attempt, and persistent
    misconfiguration still Fatals once the 30 s budget is exhausted. Resolves
    the recurring EOF burst on every
    PR preview deploy (Sentry: HOVER-JX, HOVER-MD, HOVER-JZ).
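
To make the "30 s total, 3 s per attempt, capped exponential backoff" figures concrete, the sketch below computes a capped backoff schedule and a worst-case attempt count. The 250 ms base delay and 2 s cap are assumed values chosen purely for illustration, since the changelog entry does not state the actual ones, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the first n sleep intervals of a capped exponential
// backoff: base, 2*base, 4*base, ... never exceeding maxDelay.
func backoffSchedule(base, maxDelay time.Duration, n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		if d > maxDelay {
			d = maxDelay
		}
		out = append(out, d)
		d *= 2
	}
	return out
}

// worstCaseAttempts counts how many attempts start before the total budget is
// spent when every attempt consumes its full per-attempt timeout and is then
// followed by a backoff sleep.
func worstCaseAttempts(total, perAttempt, base, maxDelay time.Duration) int {
	var elapsed time.Duration
	delay := base
	attempts := 0
	for elapsed < total {
		attempts++
		elapsed += perAttempt + delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return attempts
}

func main() {
	// Assumed base (250 ms) and cap (2 s); only the 30 s and 3 s figures
	// come from the changelog entry.
	fmt.Println(backoffSchedule(250*time.Millisecond, 2*time.Second, 6))
	fmt.Println(worstCaseAttempts(30*time.Second, 3*time.Second,
		250*time.Millisecond, 2*time.Second))
}
```

Under these assumed values the model fits 7 attempts into the 30 s budget when every attempt runs to its full 3 s timeout. A different base or cap would change that count.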

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/broker/redis_test.go (1)

15-56: ⚡ Quick win

Add a regression case for perAttempt > total budget handling.

Nice coverage overall. Please add one subtest that asserts retry returns within the total budget when perAttempt is larger, so budget semantics stay protected.

🧪 Suggested test shape
 func TestPingWithRetry(t *testing.T) {
+	t.Run("does not exceed total budget when per-attempt timeout is larger", func(t *testing.T) {
+		start := time.Now()
+		err := pingWithRetry(context.Background(), 80*time.Millisecond, time.Second,
+			func(ctx context.Context) error {
+				<-ctx.Done()
+				return ctx.Err()
+			})
+		require.Error(t, err)
+		assert.LessOrEqual(t, time.Since(start), 200*time.Millisecond)
+	})
+
 	t.Run("immediate success", func(t *testing.T) {
 		var calls int
 		err := pingWithRetry(context.Background(), time.Second, 100*time.Millisecond,
 			func(context.Context) error { calls++; return nil })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/broker/redis_test.go` around lines 15 - 56, Add a new subtest inside
TestPingWithRetry that calls pingWithRetry with total shorter than perAttempt
(e.g., total=100ms, perAttempt=200ms) using a stub ping function that
immediately returns a sentinel error; capture time before/after the call and
assert that the call returns the expected error and that elapsed time is <=
total (plus a tiny tolerance), ensuring pingWithRetry respects the overall
budget when perAttempt > total. Reference the pingWithRetry helper and add the
subtest under TestPingWithRetry.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/broker/redis.go`:
- Around line 90-123: In pingWithRetry the per-attempt context always uses
perAttempt which can let the final try exceed the overall deadline; compute the
remaining total budget as time.Until(deadline) and clamp the attempt timeout to
min(perAttempt, remaining) before calling context.WithTimeout. If the remaining
budget is <= 0 return the lastErr (or ctx.Err() if set) instead of starting a
timed attempt; replace the direct context.WithTimeout(ctx, perAttempt) call in
pingWithRetry with this clamped timeout logic to enforce the total budget.
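
The clamping rule requested in the inline comment can be isolated as a small pure function. The sketch below shows one possible way to express it; `clampAttemptTimeout` is a hypothetical name, not a function from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// clampAttemptTimeout returns the timeout for the next attempt, limited to
// whatever is left of the total budget. The boolean is false when the budget
// is already spent, in which case no further attempt should be started.
func clampAttemptTimeout(remaining, perAttempt time.Duration) (time.Duration, bool) {
	if remaining <= 0 {
		return 0, false
	}
	if perAttempt < remaining {
		return perAttempt, true
	}
	return remaining, true
}

func main() {
	// perAttempt (1 s) is larger than the total budget (80 ms), the case the
	// review comment is concerned about.
	deadline := time.Now().Add(80 * time.Millisecond)
	timeout, ok := clampAttemptTimeout(time.Until(deadline), time.Second)
	fmt.Println(timeout <= 80*time.Millisecond, ok)
}
```

A caller would pass the returned duration to `context.WithTimeout` in place of the raw `perAttempt`, and return the last error (or `ctx.Err()`) when the boolean is false.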

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 27c218b9-9459-4bc2-bbd1-0d4a1bc92a30

📥 Commits

Reviewing files that changed from the base of the PR and between 41af55a and df771e6.

📒 Files selected for processing (5)
  • cmd/analysis/main.go
  • cmd/app/main.go
  • cmd/worker/main.go
  • internal/broker/redis.go
  • internal/broker/redis_test.go

Comment thread internal/broker/redis.go
@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 74.35897% with 10 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/broker/redis.go | 80.55% | 6 Missing and 1 partial ⚠️ |
| cmd/analysis/main.go | 0.00% | 1 Missing ⚠️ |
| cmd/app/main.go | 0.00% | 1 Missing ⚠️ |
| cmd/worker/main.go | 0.00% | 1 Missing ⚠️ |

@github-actions
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-386.fly.dev
Dashboard: https://hover-pr-386.fly.dev/dashboard
