Merge main into feature/redis-broker-v2#335
Conversation
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThis PR migrates the application's logging infrastructure from Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Log as slog.Default()
participant Fanout as fanoutHandler
participant Text as Text Handler
participant Sentry as Sentry Handler
participant SentryAPI as Sentry Backend
App->>Log: logger.Error("msg", "error", err, ...)
Log->>Fanout: Handle(record)
par Text Output
Fanout->>Text: Handle(record)
Text->>Text: Format to stdout
and Sentry Capture
Fanout->>Sentry: Handle(record)
Note over Sentry: Check level >= error<br/>Normalize message (UUIDs, numbers)<br/>Check no_capture attribute
alt no_capture present
Sentry->>Sentry: Drop event
else capture enabled
Sentry->>Sentry: Build Sentry event<br/>component tag, fingerprint,<br/>exception from error field
Sentry->>SentryAPI: Send event
SentryAPI->>Sentry: Event accepted
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes This PR involves substantial scope—a comprehensive logging infrastructure migration across dozens of files—but the changes are largely homogeneous (repetitive application of the same refactoring pattern). The new Possibly related PRs
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
|
Updates to Preview Branch (work/funny-bohr-6c8c3e) ↗︎
Tasks are run on every commit but only new migration files are pushed.
View logs for this Workflow Run ↗︎. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/api/handlers.go (1)
627-635:⚠️ Potential issue | 🟠 MajorAvoid Sentry-flooding
/config.jsmisconfigurations.
apiLog.Error(...)is auto-captured, and/config.jsis requested on every page load. One missing env var will now produce an event per request until the deploy is fixed. Please validate this once at startup, or log this handler failure without capture.Minimal mitigation in this handler
- apiLog.Error("supabase config missing", "error", err) + apiLog.Warn("Supabase config unavailable for config.js", "error", err)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/handlers.go` around lines 627 - 635, Call buildConfigSnippet once at startup (init/bootstrap) and cache its result so the /config.js handler reads the cached snippet instead of calling buildConfigSnippet per request; if the startup call fails, fail fast or record a single non-capturing error there. In the handler (the block currently calling buildConfigSnippet and apiLog.Error), remove the apiLog.Error call that triggers automatic capture and instead use the cached value and, on missing config, log with a non-capturing mechanism (e.g., standard log.Printf or a warning-level logger) and return the same HTTP 500 message. Ensure references: buildConfigSnippet, the /config.js handler block, and apiLog.Error so you locate and replace the request-time error capture with startup validation and non-capturing request-time logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/skills/structured-logging/SKILL.md:
- Around line 31-46: The component taxonomy references deleted files—remove or
update the `worker` row (currently pointing to `internal/jobs/worker.go`) and
the `jobs` row that lists `domain_limiter.go` so contributors aren’t led to dead
paths; replace them with the current locations (or consolidate under the `jobs`
component to point to existing files in `internal/jobs/`), and ensure any
mention of `domain_limiter.go` is removed or replaced with the new
symbol/location for the domain-limiting logic.
In `@alloy.river`:
- Around line 1-6: The scrape target in prometheus.scrape "hover" currently uses
identical addresses so add a unique label on the target before forwarding:
update the targets block for prometheus.scrape "hover" to include a
distinguishing label (for example an env/app/instance label) for each
machine/app, so that when forward_to =
[prometheus.remote_write.grafana_cloud.receiver] the remote write stream
contains the unique label; modify the target definition inside prometheus.scrape
"hover" (the targets array) to inject that label so the remote write receiver
sees non-colliding series.
In `@CHANGELOG.md`:
- Around line 89-95: The release note is inaccurate about removals—while zerolog
is gone, svc-level Sentry SDK imports remain in cmd/app/main.go and
cmd/worker/main.go for initialization; update the CHANGELOG entry to tighten
wording to state that zerolog has been removed and that only SDK initialization
still imports sentry-go in the main entrypoints (e.g., cmd/app/main.go and
cmd/worker/main.go), or otherwise rephrase to avoid claiming “no direct ...
sentry imports remain.”
In `@cmd/app/main.go`:
- Around line 699-706: Replace the hard-coded development message so it’s
environment-neutral and optionally include the current env; specifically, change
the call to startupLog.Info("Hover Development Server Ready!") to a neutral
message like startupLog.Info("Hover Server Ready") or startupLog.Info("Hover
Server Ready", "env", config.Env); update references around baseURL and other
startupLog.Info calls to keep consistent logging format.
In `@cmd/worker/main.go`:
- Around line 76-77: Remove the explicit sentry.CaptureException calls and rely
on workerLog.Fatal to emit the error via the slog→Sentry fanout; specifically
delete the sentry.CaptureException(err) calls that appear immediately before
workerLog.Fatal("failed to connect to PostgreSQL", "error", err) and the similar
block around the Redis/startup failure (lines referenced in the review), leaving
only the workerLog.Fatal calls; also remove the Sentry package import if it
becomes unused after these deletions.
In `@docs/diagnostics/SENTRY_AUDIT.md`:
- Around line 24-35: Regenerate the SENTRY_AUDIT.md from the current branch
state and update the table so it no longer shows stale findings (remove entries
for symbols no longer present such as internal/jobs/worker.go and ensure
cmd/app/main.go’s actual BeforeSend status is reflected); run the
audit-generation tool or script you used originally to produce the table, verify
entries for startHealthMonitoring, Flush, flushIndividualUpdates, GetJobStats,
and ProcessJob (or their replacements) match the live code, update rows covering
the 95-108 range accordingly, and commit the regenerated SENTRY_AUDIT.md.
In `@internal/api/admin.go`:
- Around line 51-54: Introduce a small helper function
organisationIDOrNone(orgID *string) string that returns "none" when orgID is nil
and the dereferenced string otherwise, then replace the duplicated conversion
logic that sets orgIDStr and orgIDStr2 in admin.go with calls to
organisationIDOrNone(user.OrganisationID) (and the corresponding other variable)
so both places use the helper and remove the repeated if-nil branches.
In `@internal/api/auth_google.go`:
- Around line 328-330: The code is logging the raw pending GA session ID
(sessionID) which is a bearer key; remove it from the logger.Info call in the
block that calls storePendingGASession(session) and only log non-sensitive
metadata (e.g., "account_count" and "property_count"). If you still need a
correlate-able value for debugging, replace session_id with a non-reversible
fingerprint (e.g., a hash or masked prefix) computed from sessionID and log that
instead, but do not emit the raw sessionID; update references to
storePendingGASession(session) and the logger.Info invocation accordingly.
In `@internal/crawler/tls_aia.go`:
- Around line 184-187: The code calls parsed.Host even when url.Parse(rawURL)
may return a nil parsed and non-nil err; update the logic in the TLS AIA
handling so you first check if err != nil and handle/log the parse error
(include rawURL and err) and return nil before referencing parsed; then perform
the scheme check on the non-nil parsed (ensure parsed.Scheme is only accessed
when parsed != nil) and log the rejected non-HTTP(S) URL using parsed.Host only
after successful parse.
In `@internal/db/pages.go`:
- Line 73: The log call dbLog.Warn("Skipping invalid URL", "error", err, "url",
u) writes the raw untrusted URL value (u) to logs; instead sanitize or omit
sensitive parts before logging: parse u with the URL parser, remove or mask
RawQuery and Fragment (or only log hostname and path), then pass the sanitized
string to dbLog.Warn (e.g., replace the "url" field value with the
sanitized/masked URL or a boolean flag). Update the call site that constructs
this warning so it no longer logs the raw u variable but the safely redacted
value.
In `@internal/db/reset_migrations.go`:
- Around line 16-18: Replace the banner log calls that use dbLog.Warn with
dbLog.Info because these are normal lifecycle progress messages; specifically
locate the calls emitting "=== DATABASE RESET STARTED ===", "Dropping
job-related tables, clearing migrations, and rebuilding schema", and "Users and
organisations will be preserved" (and the similar set at the later block around
the other banner lines) and change them from dbLog.Warn(...) to dbLog.Info(...);
keep the message text identical and only adjust the log level so dashboards
won't be skewed by benign warnings.
In `@internal/db/webflow_connections.go`:
- Line 201: The db log at the dbLog.Error call in webflow_connections.go uses
the key "org_id"; change that key to the standardized "organisation_id" so it
matches other DB logs (update the call in the error path that logs connectionID
and organisationID inside the function handling deletion of webflow connections,
e.g., the dbLog.Error invocation referencing connectionID and organisationID).
Ensure any other nearby log statements in the same file use the same
"organisation_id" key for consistency.
In `@internal/jobs/stream_worker.go`:
- Around line 418-420: The refreshActiveJobs loop currently calls jobsLog.Error
on failure which will auto-capture to Sentry every 10s; change the call in
refreshActiveJobs to jobsLog.Warn (or use the logging API flag to suppress
capture) so transient control DB outages do not flood Sentry. Locate the error
handling in refreshActiveJobs where jobsLog.Error("failed to refresh active
jobs", "error", err) is invoked and replace it with a non-capturing warning
(e.g., jobsLog.Warn with the same message and error) or call the logger method
that explicitly disables Sentry capture.
In `@internal/notifications/listener.go`:
- Around line 136-139: The warning branch in NewListener currently starts the
polling goroutine unconditionally, which will dereference a nil service and
panic; modify the logic so startPolling is only invoked when service is non-nil
(or replace the call with a safe no-op) — update the branch that checks if
listener == nil and the similar block around lines 198-200 so they do not call
startPolling(ctx, service) when service == nil, referencing the NewListener
function, the listener variable, the service variable, and the startPolling
function to locate the code to change.
In `@scripts/aggregate_logs.py`:
- Around line 5-8: The documentation header under the Outputs block in
scripts/aggregate_logs.py is stale because the script now always writes
components_per_minute.csv; update the Outputs list to include a line describing
components_per_minute.csv (e.g., "components_per_minute.csv — timestamp |
component: count | ...") alongside the existing time_series.csv,
events_per_minute.csv and summary.md entries so the built-in usage docs match
the actual outputs.
In `@scripts/process_logs.py`:
- Around line 38-52: The function _normalise_full_timestamp currently drops the
source timezone by using dt.strftime("%Y-%m-%dT%H:%M:%S"); change it to emit the
timezone offset as well by returning dt.isoformat(timespec="seconds") (the
cleaned variable already converts "Z" to "+00:00" so isoformat will preserve the
offset), leaving the rest of the parsing and the ValueError fallback unchanged.
---
Outside diff comments:
In `@internal/api/handlers.go`:
- Around line 627-635: Call buildConfigSnippet once at startup (init/bootstrap)
and cache its result so the /config.js handler reads the cached snippet instead
of calling buildConfigSnippet per request; if the startup call fails, fail fast
or record a single non-capturing error there. In the handler (the block
currently calling buildConfigSnippet and apiLog.Error), remove the apiLog.Error
call that triggers automatic capture and instead use the cached value and, on
missing config, log with a non-capturing mechanism (e.g., standard log.Printf or
a warning-level logger) and return the same HTTP 500 message. Ensure references:
buildConfigSnippet, the /config.js handler block, and apiLog.Error so you locate
and replace the request-time error capture with startup validation and
non-capturing request-time logging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5b871799-64ad-4b98-a92a-bc1a23e1f972
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (82)
.claude/skills/structured-logging/SKILL.md.github/workflows/fly-deploy.yml.github/workflows/review-apps.yml.gitignore.husky/pre-commitCHANGELOG.mdDockerfilealloy.riverarchive_key_migratecmd/app/main.gocmd/archive_key_migrate/main.gocmd/worker/main.godocs/architecture/CONFIG_REFERENCE.mddocs/diagnostics/LOGGING_AUDIT.mddocs/diagnostics/SENTRY_AUDIT.mddocs/operations/ENV_VARS.mdfly.tomlgo.modinternal/api/admin.gointernal/api/auth.gointernal/api/auth_google.gointernal/api/auth_webflow.gointernal/api/domains.gointernal/api/errors.gointernal/api/google_data_api.gointernal/api/handlers.gointernal/api/jobs.gointernal/api/logging.gointernal/api/middleware.gointernal/api/notifications.gointernal/api/organisations.gointernal/api/response.gointernal/api/schedulers.gointernal/api/share_links.gointernal/api/slack.gointernal/api/webflow_sites.gointernal/archive/scheduler.gointernal/auth/middleware.gointernal/broker/consumer.gointernal/broker/counters.gointernal/broker/counters_test.gointernal/broker/dispatcher.gointernal/broker/pacer.gointernal/broker/pacer_test.gointernal/broker/redis.gointernal/broker/scheduler.gointernal/broker/scheduler_test.gointernal/crawler/config.gointernal/crawler/config_test.gointernal/crawler/crawler.gointernal/crawler/robots.gointernal/crawler/sitemap.gointernal/crawler/tls_aia.gointernal/db/batch.gointernal/db/dashboard.gointernal/db/db.gointernal/db/google_connections.gointernal/db/notifications.gointernal/db/pages.gointernal/db/pressure.gointernal/db/queue.gointernal/db/reset_migrations.gointernal/db/retry.gointernal/db/schedulers.gointernal/db/slack.gointernal/db/users.gointernal/db/webflow_connections.gointernal/db/webflow_site_settings.gointernal/jobs/executor.gointernal/jobs/link_discovery.gointernal/jobs/manager.gointernal/jobs/stream_worker.gointernal/logging/logging.gointernal/logging/logging_test.gointernal/notifications/listener.gointernal/notifications/slack.gointernal/techdetect/detector.gointernal/util/url.goscripts/aggregate_logs.pyscripts/lint-go.jsscripts/process_logs.pyscripts/start.sh
💤 Files with no reviewable changes (1)
- .gitignore
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
internal/db/reset_migrations.go (1)
91-103:⚠️ Potential issue | 🟡 MinorHandle row-iteration errors after
rows.Next()loops.These loops ignore
rows.Err()and silently skip some scan failures. That can mask partial reset failures and leave stale DB objects/connections.Suggested pattern to apply to each loop
for viewRows.Next() { var v string - if err := viewRows.Scan(&v); err == nil { - views = append(views, v) - } + if err := viewRows.Scan(&v); err != nil { + _ = viewRows.Close() + return fmt.Errorf("failed to scan view row: %w", err) + } + views = append(views, v) } + if err := viewRows.Err(); err != nil { + _ = viewRows.Close() + return fmt.Errorf("failed while iterating views: %w", err) + }Also applies to: 118-130, 145-157, 352-360, 379-387
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/db/reset_migrations.go` around lines 91 - 103, The row-iteration loops (e.g., viewRows with viewRows.Scan) currently ignore rows.Err() and potential Scan failures; after each for viewRows.Next() { ... } loop call and check err := viewRows.Err() and handle it (log via dbLog.Error/dbLog.Warn and return the error or abort the reset) rather than silently continuing, also ensure any scan failure inside the loop is propagated instead of skipped; apply the same fix to the other row loops (e.g., seqRows, tableRows, typeRows or similarly named row variables) and ensure rows.Close() is checked (or use defer rows.Close()) so Close errors are not ignored.scripts/process_logs.py (1)
90-130:⚠️ Potential issue | 🟠 MajorStream the flat CSV instead of buffering every row.
Lines 90-130 keep a full copy of every parsed record in
flat_rowsbefore writing it out. On large log exports that doubles memory usage for no benefit and can turn this into an OOM-prone path. Open the CSV once and write each row inside the read loop.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/process_logs.py` around lines 90 - 130, The code currently accumulates every row in flat_rows before writing, which uses unbounded memory; instead, if flat_csv_path is not None open the CSV file and create a csv.DictWriter and write the header before entering the _iter_records loop, then inside the loop (where flat_rows.append is called) build the same dict using _normalise_full_timestamp(record), level, component, message and extras (filtered by _CORE_FIELDS and json.dumps) and call writer.writerow(...) immediately rather than appending to flat_rows; remove the flat_rows list and the post-loop writer.writerows block so the file is streamed while iterating through _iter_records.internal/api/auth_google.go (1)
328-337:⚠️ Potential issue | 🟠 MajorDo not ship the pending GA session as a URL query token.
Line 336 puts
sessionIDintoga_sessionon the redirect URL. That ID is the bearer key forpendingGASessions, so it can leak via browser history, copied links, and referrers even though you've stopped logging it. Bind the pending session to the authenticated user/org server-side and move the token into an HttpOnly cookie or another non-URL channel.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/auth_google.go` around lines 328 - 337, Currently the code stores the pending GA session via storePendingGASession(session) and places the returned sessionID into the redirect URL using buildSettingsURL("analytics", params, "google-analytics"); instead of exposing sessionID in the query string, bind the pending session server-side to the authenticated user/org (e.g. associate the pending session ID with the current user/org record in the pendingGASessions store) and return the redirect without ga_session, or set the sessionID in a Secure, HttpOnly, SameSite cookie (set by the handler that calls storePendingGASession) so the client can continue the flow without the token appearing in URLs or logs; ensure cookie attributes (HttpOnly, Secure, appropriate Path/Expiry) are applied and remove any code that adds ga_session to the URL in the redirect logic.
♻️ Duplicate comments (2)
internal/db/pages.go (1)
73-73:⚠️ Potential issue | 🟠 MajorAvoid logging raw URL content via
errat Line 73.Even after removing the
"url"field, the loggederrcan still contain the original URL (including query params/tokens) from parse/normalisation failures. Log a sanitised reason instead of the raw error text.🔧 Proposed fix
- dbLog.Warn("Skipping invalid URL", "error", err, "domain", domain) + dbLog.Warn("Skipping invalid URL", "domain", domain, "reason", "URL normalisation failed")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/db/pages.go` at line 73, The dbLog.Warn call in pages.go currently logs the raw err which may contain the original URL and sensitive query params; change the warning to log a sanitized reason instead of the raw error text by replacing the err argument in dbLog.Warn("Skipping invalid URL", ...) with a concise, non-sensitive message (e.g., "parse/normalisation failed" or err type only) and/or call a small helper (e.g., sanitizeError or buildSafeErrorMessage) that returns a redacted string; keep the existing "domain" and other context but never include err.Error() or the raw URL in the log entry.internal/jobs/stream_worker.go (1)
170-173:⚠️ Potential issue | 🟠 MajorAvoid auto-capturing stream read failures in the hot loop.
Line 172 now emits an
Erroron every failedReadNonBlocking. With the new logger semantics, that becomes one Sentry event per worker × job iteration while Redis is degraded, which will drown out the signal from real incidents very quickly. Please downgrade this path toWarn, or suppress capture explicitly, because the loop is already retrying.Possible fix
- logger.Error("stream read error", "error", err, "job_id", jobID) + logger.Warn("stream read error", "error", err, "job_id", jobID)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/jobs/stream_worker.go` around lines 170 - 173, The hot loop currently logs every ReadNonBlocking failure at Error level (logger.Error) for swp.consumer.ReadNonBlocking with jobID, which creates noisy Sentry events; change that call to a non-capturing/Warn-level log (e.g., logger.Warn) or explicitly disable error capture for that log so retries don't generate Sentry events—update the logging call in stream_worker.go where swp.consumer.ReadNonBlocking is handled to use Warn (or pass a flag/field to suppress capture) while keeping jobID and the error message for local debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/app/main.go`:
- Around line 697-708: The readiness logs are emitted before the HTTP listener
is bound; change the flow to create a net.Listener first, then log readiness and
URLs only after Listen succeeds, and finally call server.Serve(listener) instead
of server.ListenAndServe(). Specifically, replace the current pre-Listen
startupLog.Info calls around server.ListenAndServe() with code that calls
net.Listen("tcp", ":"+config.Port), checks for errors, then on success compute
baseURL and call startupLog.Info("Starting server"...), startupLog.Info("Hover
server ready"...), startupLog.Info("Open Homepage"...), startupLog.Info("Open
Dashboard"...), startupLog.Info("Health Check"...), and the Supabase Studio log
under config.Env == "development", and then pass that listener into
server.Serve(listener) (referencing server.ListenAndServe and server.Serve and
the startupLog.Info and config.Env identifiers).
- Around line 59-63: In startJobScheduler, replace the startupLog.Error calls in
the retry branches (the error handling after calls like
pgDB.GetSchedulersReadyToRun and the other similar DB/service calls in the loop)
with startupLog.Warn (or otherwise use a non-error/no-capture logging path) so
transient/retryable dependency failures don’t generate Error-level Sentry
alerts; update each occurrence referenced (the error branches in
startJobScheduler at the shown ranges) to log a warning with the error attached
instead of an error-level log.
In `@docs/diagnostics/SENTRY_AUDIT.md`:
- Around line 44-65: The audit table still references removed legacy DB queue
helpers (executeOnce, executeOnceWithContext, ExecuteMaintenance,
CleanupStuckJobs) in internal/db/queue.go; update
docs/diagnostics/SENTRY_AUDIT.md by removing or replacing those rows (and the
duplicate references at lines 84–86) with the current live symbols or file
locations that replaced the legacy helpers, and regenerate the table entries so
the audit points to existing functions (search the repo for the new
queue-related functions to reference instead of the deleted symbols).
In `@internal/api/handlers.go`:
- Around line 979-983: The DB lookup errors for webhook auth resolution (calls
to h.DB.GetPlatformOrgMapping and the other platform/webhook lookup around the
second block) are being treated identically and returned as a 404; change the
error handling to distinguish a "not found" result from other DB failures: if
the error indicates no rows (e.g., sql.ErrNoRows or your project's NotFound
sentinel) then call NotFound(w, r, ...) as before, otherwise log the full error
(use logger.Error or logger.Warn with the error) and return an internal server
error response (500) so real backend issues are surfaced and webhook retries are
not masked. Apply this change to the GetPlatformOrgMapping handling and the
other lookup block referenced near the second call so only true
missing-resources produce 404s.
In `@internal/db/reset_migrations.go`:
- Around line 207-214: The schema verification currently only logs warnings and
still marks Step 4 complete; modify ResetSchema() so that any schema
verification failure causes the function to return an error instead of
continuing. In the verification loop (where dbLog.Warn("Schema verification
warning", "table", table, "exists", exists) is called) set a failure flag (e.g.,
verificationFailed := true) or immediately call dbLog.Error with context and
return a descriptive error from ResetSchema(); ensure you do not log "Step 4/4
completed" when verificationFailed is true. Update the final block that logs
time.Since(verifyStart) to check the flag and either return the error or log
completion only on success.
In `@scripts/aggregate_logs.py`:
- Around line 69-70: The code currently keys incremental state by filename (see
processed_files and saved.get("processed_files", [])), which causes stale skips
when a file is rewritten; change processed_files to store a fingerprint per file
(e.g., dict mapping basename -> fingerprint) instead of a plain list, and
persist that in the .aggregate_data.json save/load logic. Update the check that
treats a file as already-processed to compare the stored fingerprint against the
current fingerprint (use os.path.getmtime/size or a content hash), and when a
file changes update its fingerprint and reprocess; apply the same change to the
other occurrences referenced around lines 77-85 and 277-304 so all load/save and
membership checks use the fingerprinted mapping rather than filenames alone.
---
Outside diff comments:
In `@internal/api/auth_google.go`:
- Around line 328-337: Currently the code stores the pending GA session via
storePendingGASession(session) and places the returned sessionID into the
redirect URL using buildSettingsURL("analytics", params, "google-analytics");
instead of exposing sessionID in the query string, bind the pending session
server-side to the authenticated user/org (e.g. associate the pending session ID
with the current user/org record in the pendingGASessions store) and return the
redirect without ga_session, or set the sessionID in a Secure, HttpOnly,
SameSite cookie (set by the handler that calls storePendingGASession) so the
client can continue the flow without the token appearing in URLs or logs; ensure
cookie attributes (HttpOnly, Secure, appropriate Path/Expiry) are applied and
remove any code that adds ga_session to the URL in the redirect logic.
In `@internal/db/reset_migrations.go`:
- Around line 91-103: The row-iteration loops (e.g., viewRows with
viewRows.Scan) currently ignore rows.Err() and potential Scan failures; after
each for viewRows.Next() { ... } loop call and check err := viewRows.Err() and
handle it (log via dbLog.Error/dbLog.Warn and return the error or abort the
reset) rather than silently continuing, also ensure any scan failure inside the
loop is propagated instead of skipped; apply the same fix to the other row loops
(e.g., seqRows, tableRows, typeRows or similarly named row variables) and ensure
rows.Close() is checked (or use defer rows.Close()) so Close errors are not
ignored.
In `@scripts/process_logs.py`:
- Around line 90-130: The code currently accumulates every row in flat_rows
before writing, which uses unbounded memory; instead, if flat_csv_path is not
None open the CSV file and create a csv.DictWriter and write the header before
entering the _iter_records loop, then inside the loop (where flat_rows.append is
called) build the same dict using _normalise_full_timestamp(record), level,
component, message and extras (filtered by _CORE_FIELDS and json.dumps) and call
writer.writerow(...) immediately rather than appending to flat_rows; remove the
flat_rows list and the post-loop writer.writerows block so the file is streamed
while iterating through _iter_records.
---
Duplicate comments:
In `@internal/db/pages.go`:
- Line 73: The dbLog.Warn call in pages.go currently logs the raw err which may
contain the original URL and sensitive query params; change the warning to log a
sanitized reason instead of the raw error text by replacing the err argument in
dbLog.Warn("Skipping invalid URL", ...) with a concise, non-sensitive message
(e.g., "parse/normalisation failed" or err type only) and/or call a small helper
(e.g., sanitizeError or buildSafeErrorMessage) that returns a redacted string;
keep the existing "domain" and other context but never include err.Error() or
the raw URL in the log entry.
In `@internal/jobs/stream_worker.go`:
- Around line 170-173: The hot loop currently logs every ReadNonBlocking failure
at Error level (logger.Error) for swp.consumer.ReadNonBlocking with jobID, which
creates noisy Sentry events; change that call to a non-capturing/Warn-level log
(e.g., logger.Warn) or explicitly disable error capture for that log so retries
don't generate Sentry events—update the logging call in stream_worker.go where
swp.consumer.ReadNonBlocking is handled to use Warn (or pass a flag/field to
suppress capture) while keeping jobID and the error message for local debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 377c8fe7-f06d-479e-aa6a-3b4e9198ecaf
📒 Files selected for processing (17)
.claude/skills/structured-logging/SKILL.mdCHANGELOG.mdalloy.rivercmd/app/main.gocmd/worker/main.godocs/diagnostics/SENTRY_AUDIT.mdinternal/api/admin.gointernal/api/auth_google.gointernal/api/handlers.gointernal/crawler/tls_aia.gointernal/db/pages.gointernal/db/reset_migrations.gointernal/db/webflow_connections.gointernal/jobs/stream_worker.gointernal/notifications/listener.goscripts/aggregate_logs.pyscripts/process_logs.py
| | # | File | Line | Function | Error Source | | ||
| | --- | ----------------------------- | ---- | ------------------------ | ----------------------------- | | ||
| | 1 | `cmd/app/main.go` | 183 | `startHealthMonitoring` | DB job update query | | ||
| | 2 | `cmd/app/main.go` | 508 | `main` (goroutine) | Metrics HTTP server | | ||
| | 3 | `cmd/app/main.go` | 533 | `main` | PostgreSQL close | | ||
| | 4 | `cmd/app/main.go` | 558 | `main` | Queue DB close | | ||
| | 5 | `cmd/app/main.go` | 767 | `main` | Server.Shutdown | | ||
| | 6 | `cmd/app/main.go` | 818 | `main` | Final server error | | ||
| | 7 | `internal/auth/middleware.go` | 103 | `AuthMiddleware` | Token signature invalid | | ||
| | 8 | `internal/auth/middleware.go` | 108 | `AuthMiddleware` | JWKS/keyfunc config | | ||
| | 9 | `internal/api/auth.go` | 132 | `SignupUser` | CreateUser DB failure | | ||
| | 10 | `internal/api/auth.go` | 216 | `GetCurrentUser` | GetOrCreateUser DB failure | | ||
| | 11 | `internal/db/batch.go` | 252 | `Flush` | Poison pill detection | | ||
| | 12 | `internal/db/batch.go` | 349 | `Flush` | DB unavailable on shutdown | | ||
| | 13 | `internal/db/batch.go` | 371 | `Flush` | Shutdown data errors | | ||
| | 14 | `internal/db/batch.go` | 604 | `flushIndividualUpdates` | Poison pill individual update | | ||
| | 15 | `internal/db/dashboard.go` | 76 | `GetJobStats` | Dashboard stats query | | ||
| | 16 | `internal/db/queue.go` | 489 | `executeOnce` | Begin transaction | | ||
| | 17 | `internal/db/queue.go` | 521 | `executeOnce` | Commit transaction | | ||
| | 18 | `internal/db/queue.go` | 553 | `executeOnceWithContext` | Begin transaction | | ||
| | 19 | `internal/db/queue.go` | 585 | `executeOnceWithContext` | Commit transaction | | ||
| | 20 | `internal/db/queue.go` | 762 | `ExecuteMaintenance` | Maintenance begin | |
There was a problem hiding this comment.
Regenerate the internal/db/queue.go audit rows.
The PR objectives say this merge removed the legacy DB queue helpers, but the audit still points at executeOnce, executeOnceWithContext, ExecuteMaintenance, and CleanupStuckJobs under internal/db/queue.go. That leaves the document directing the next clean-up pass at dead symbols.
Also applies to: 84-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/diagnostics/SENTRY_AUDIT.md` around lines 44 - 65, The audit table still
references removed legacy DB queue helpers (executeOnce, executeOnceWithContext,
ExecuteMaintenance, CleanupStuckJobs) in internal/db/queue.go; update
docs/diagnostics/SENTRY_AUDIT.md by removing or replacing those rows (and the
duplicate references at lines 84–86) with the current live symbols or file
locations that replaced the legacy helpers, and regenerate the table entries so
the audit points to existing functions (search the repo for the new
queue-related functions to reference instead of the deleted symbols).
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/app/main.go (1)
554-597:⚠️ Potential issue | 🟠 MajorDon't accept jobs when Redis dispatch is unavailable.
This branch has no local worker fallback, but both failure paths here still let the app continue:
REDIS_URLmissing only warns, andScheduleBatchfailure only logs. That leaves tasks committed in Postgres with no consumer, so job creation can succeed while execution is impossible. Please fail fast when Redis dispatch is required, or make enqueueing propagate an error / durable retry so the job is not considered created until dispatch succeeds.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/app/main.go` around lines 554 - 597, The code currently allows job creation to succeed when Redis dispatch is unavailable (redisCfg.URL empty or scheduler.ScheduleBatch fails); change this to fail fast and/or propagate enqueue errors: if redisCfg.URL == "" replace startupLog.Warn with a startupLog.Fatal (or otherwise exit) so the process doesn't accept jobs without a dispatcher, and change jobsManager.OnTasksEnqueued (currently assigned as func(ctx context.Context, jobID string, entries []jobs.TaskScheduleEntry)) to return an error (e.g., func(... ) error) and update all callers so that scheduler.ScheduleBatch errors are returned up the stack and cause job creation to fail (or trigger durable retries) instead of only logging; ensure scheduler.ScheduleBatch remains used but its errors are propagated from OnTasksEnqueued to the job creation flow so failures block commit.
♻️ Duplicate comments (1)
docs/diagnostics/SENTRY_AUDIT.md (1)
3-8:⚠️ Potential issue | 🟡 MinorRegenerate this audit from the current branch state.
The note on Lines 3-8 says the symbols and files are accurate, but the tables still are not:
cmd/app/main.gono longer uses the bareCaptureExceptioncall-sites listed here, and its stuck-job/task reporting now goes throughstartupLogwith static messages pluslogging.BeforeSend. The audit also still points at removedinternal/db/queue.gosymbols likeexecuteOnceWithContext,ExecuteMaintenance, andCleanupStuckJobs. As written, this document will send the next clean-up pass back to stale or dead code.Also applies to: 29-63, 82-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/diagnostics/SENTRY_AUDIT.md` around lines 3 - 8, The audit is stale: regenerate the SENTRY_AUDIT.md from current branch state and update all symbol/file references—remove obsolete internal/db/queue.go entries (executeOnceWithContext, ExecuteMaintenance, CleanupStuckJobs), stop listing bare CaptureException call-sites for cmd/app/main.go (it now uses startupLog and logging.BeforeSend), and replace removed worker references with the current worker lifecycle locations (cmd/worker/main.go and internal/jobs/stream_worker.go) including the slog→Sentry fanout behavior; run the repo-wide symbol scan and refresh the tables so the document points only to existing functions and the correct logging hooks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/app/main.go`:
- Around line 474-477: Change the metrics goroutine to bind the socket before
logging readiness: replace the direct call to metricsSrv.ListenAndServe() with
net.Listen("tcp", config.MetricsAddr) to obtain a listener, log readiness via
startupLog.Info("Metrics server listening", "addr", config.MetricsAddr) only
after the bind succeeds, and then call metricsSrv.Serve(listener); keep the
existing error check and startupLog.Error("Metrics server failed", "error", err)
but treat listener bind errors separately so you don't emit a false ready
message for metricsSrv, referencing metricsSrv, config.MetricsAddr,
startupLog.Info, startupLog.Error, ListenAndServe and Serve in your changes.
In `@internal/api/handlers.go`:
- Around line 949-950: The request-scoped logger is inheriting r.URL.Path which
on routes like /v1/webhooks/webflow/{token} contains the raw secret; before
calling loggerWithRequest(r) or before logging (e.g., logger.Warn in the Webflow
handler and subsequent calls shown), sanitize the path by redacting the token
segment (replace that path segment with "<redacted>" or remove it) and use that
sanitized path when attaching the "path" field or by cloning the request with a
sanitized URL.Path; update all occurrences (the logger.Warn/BadRequest calls
around the Webflow handler and the other ranges noted) to log the sanitized path
instead of the raw r.URL.Path.
In `@internal/db/users.go`:
- Around line 19-21: GetOrCreateUser currently treats any error from GetUser as
“user missing” and calls CreateUser; change it so only ErrUserNotFound triggers
creation. Specifically, in GetOrCreateUser check the error returned by GetUser
with errors.Is(err, ErrUserNotFound) and call CreateUser only in that case; for
any other non-nil error return/propagate the original error instead of creating
a new user. Ensure you reference ErrUserNotFound, GetUser, and CreateUser in the
updated logic.
- Around line 432-436: The organisation lookup failure is being
swallowed—replace the dbLog.Warn + "return existingUser, nil, nil" behavior so
that the function logs the failure as an error (use dbLog.Error and include the
error and *existingUser.OrganisationID) and returns the lookup error to callers
(return existingUser, nil, err or a wrapped error) instead of nil error; locate
the handling around GetOrganisation (the block that currently calls dbLog.Warn
and returns existingUser, nil, nil) and change it to log the error at error
level with context and propagate the error upward.
In `@scripts/aggregate_logs.py`:
- Line 263: The current write uses event[:80] which truncates by code points not
visual display width; replace this by truncating based on display width (e.g.,
use the wcwidth library) before escaping '|' so the table column aligns
visually. Add a small helper (referencing event and the f.write call) that walks
characters, accumulating wcwidth.wcwidth() until the target width (80) would be
exceeded, append a truncation marker if needed, then escape '|' and use that
result in the existing f.write call. If you prefer not to depend on wcwidth, add
a clear comment near event[:80] and keep the code-point truncation but document
it as not being a visual-width trim.
- Around line 69-77: The current incremental logic only tracks processed_files
as a fingerprint map and thus cannot remove prior contributions when a file
changes; update the code to persist per-file contributions and subtract them on
reprocess: extend processed_files entries to store both the fingerprint and the
file's contribution (e.g., a map of event keys to counts and per-minute
buckets), on load handle the legacy non-dict case by coercing to an empty map,
and in the reprocessing path (where you compare the new fingerprint against
processed_files[basename]) if the fingerprint differs first subtract the stored
per-file contribution from totals and by_minute before adding the newly parsed
data, then overwrite the processed_files entry with the new fingerprint and
contribution; ensure functions/variables mentioned—processed_files,
raw_processed, by_minute, totals—are updated accordingly so stale data is
removed when files are regenerated.
---
Outside diff comments:
In `@cmd/app/main.go`:
- Around line 554-597: The code currently allows job creation to succeed when
Redis dispatch is unavailable (redisCfg.URL empty or scheduler.ScheduleBatch
fails); change this to fail fast and/or propagate enqueue errors: if
redisCfg.URL == "" replace startupLog.Warn with a startupLog.Fatal (or otherwise
exit) so the process doesn't accept jobs without a dispatcher, and change
jobsManager.OnTasksEnqueued (currently assigned as func(ctx context.Context,
jobID string, entries []jobs.TaskScheduleEntry)) to return an error (e.g.,
func(... ) error) and update all callers so that scheduler.ScheduleBatch errors
are returned up the stack and cause job creation to fail (or trigger durable
retries) instead of only logging; ensure scheduler.ScheduleBatch remains used
but its errors are propagated from OnTasksEnqueued to the job creation flow so
failures block commit.
---
Duplicate comments:
In `@docs/diagnostics/SENTRY_AUDIT.md`:
- Around line 3-8: The audit is stale: regenerate the SENTRY_AUDIT.md from
current branch state and update all symbol/file references—remove obsolete
internal/db/queue.go entries (executeOnceWithContext, ExecuteMaintenance,
CleanupStuckJobs), stop listing bare CaptureException call-sites for
cmd/app/main.go (it now uses startupLog and logging.BeforeSend), and replace
removed worker references with the current worker lifecycle locations
(cmd/worker/main.go and internal/jobs/stream_worker.go) including the
slog→Sentry fanout behavior; run the repo-wide symbol scan and refresh the
tables so the document points only to existing functions and the correct logging
hooks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0ce01afc-57a9-4a5e-9805-c555ac37cdde
📒 Files selected for processing (6)
cmd/app/main.godocs/diagnostics/SENTRY_AUDIT.mdinternal/api/handlers.gointernal/db/reset_migrations.gointernal/db/users.goscripts/aggregate_logs.py
| # processed_files is a {basename: "mtime:size"} fingerprint map. | ||
| # Older runs stored a plain list; coerce that into an empty map so | ||
| # rewritten files are reprocessed rather than silently skipped. | ||
| raw_processed = saved.get("processed_files", {}) | ||
| if isinstance(raw_processed, dict): | ||
| processed_files = dict(raw_processed) | ||
| else: | ||
| processed_files = {} | ||
| return by_minute, totals, processed_files |
There was a problem hiding this comment.
Changed files cause double-counting in incremental mode.
The fingerprint mechanism correctly detects when a file has been modified (comparing mtime:size), but when reprocessing, the old contribution from that file is not subtracted before adding the new one. This leads to inflated aggregates.
Example: If foo.json is processed with {event_A: 10}, then regenerated with {event_A: 5}, the result is {event_A: 15} instead of {event_A: 5}.
To fix properly, either:
- Store per-file contributions so old data can be removed before adding new, or
- Document that
--fullis required when input files may be regenerated
This addresses the detection part of the past review comment but not the stale data aspect.
Also applies to: 295-299, 312-313
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 77-77: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/aggregate_logs.py` around lines 69 - 77, The current incremental
logic only tracks processed_files as a fingerprint map and thus cannot remove
prior contributions when a file changes; update the code to persist per-file
contributions and subtract them on reprocess: extend processed_files entries to
store both the fingerprint and the file's contribution (e.g., a map of event
keys to counts and per-minute buckets), on load handle the legacy non-dict case
by coercing to an empty map, and in the reprocessing path (where you compare the
new fingerprint against processed_files[basename]) if the fingerprint differs
first subtract the stored per-file contribution from totals and by_minute before
adding the newly parsed data, then overwrite the processed_files entry with the
new fingerprint and contribution; ensure functions/variables
mentioned—processed_files, raw_processed, by_minute, totals—are updated
accordingly so stale data is removed when files are regenerated.
| f.write("| Count | Event |\n") | ||
| f.write("|-------|-------|\n") | ||
| for event, count in sorted(event_totals.items(), key=lambda x: -x[1])[:30]: | ||
| f.write(f"| {count:,} | {event[:80].replace('|', '\\|')} |\n") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
String slicing may split multi-byte UTF-8 characters.
event[:80] slices by code points in Python 3, so this is safe for Unicode. However, if you intend to limit display width (e.g., for terminal alignment), code points don't equal visual width for CJK or emoji characters.
If display width matters, consider using a library like wcwidth or simply note that this is a code-point limit, not a visual-width limit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/aggregate_logs.py` at line 263, The current write uses event[:80]
which truncates by code points not visual display width; replace this by
truncating based on display width (e.g., use the wcwidth library) before
escaping '|' so the table column aligns visually. Add a small helper
(referencing event and the f.write call) that walks characters, accumulating
wcwidth.wcwidth() until the target width (80) would be exceeded, append a
truncation marker if needed, then escape '|' and use that result in the existing
f.write call. If you prefer not to depend on wcwidth, add a clear comment near
event[:80] and keep the code-point truncation but document it as not being a
visual-width trim.
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/db/users.go (1)
469-477:⚠️ Potential issue | 🟠 MajorDo not treat an organisation lookup failure as “organisation missing”.
Any
GetOrganisationByNameerror other thannilcurrently falls through to the create path. If that lookup fails transiently, this code will try to create a fresh organisation instead of surfacing the read failure, which can duplicate data. Mirror the newErrUserNotFoundpattern here and only create on an explicit not-found result.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/db/users.go` around lines 469 - 477, The current logic assumes any non-nil error from db.GetOrganisationByName means "not found" and proceeds to create an organisation; change this to mirror the ErrUserNotFound pattern: after calling db.GetOrganisationByName (used when isBusinessEmail(email) is true), handle three cases—if err == nil use existing org; else if errors.Is(err, ErrOrganisationNotFound) (or the project equivalent) proceed to create a new organisation; otherwise return or propagate the read error (do not fall through to create). Update the branch around db.GetOrganisationByName to explicitly check for the not-found sentinel and to surface other errors instead of creating a duplicate organisation.cmd/app/main.go (1)
577-600:⚠️ Potential issue | 🟠 MajorDon't silently drop Redis dispatch failures when tasks are already persisted to Postgres.
The
OnTasksEnqueuedcallback only logsScheduleBatcherrors, allowing job creation to succeed with tasks written to Postgres but never published to Redis. Because the worker service retrieves tasks exclusively from Redis (viaDueItems), these stranded tasks will never execute. No replay or outbox mechanism exists to recover them.Either make the callback retryable or propagate the failure back into the job creation path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/app/main.go` around lines 577 - 600, The OnTasksEnqueued handler currently swallows scheduler.ScheduleBatch failures (jobsManager.OnTasksEnqueued -> scheduler.ScheduleBatch) which leaves tasks persisted in Postgres but never published to Redis; change the handler to make ScheduleBatch retry with backoff (context-aware exponential backoff + jitter using ctx.Done()) and, if retries exhaust, propagate the failure back to the job creation flow instead of only logging: after final failure call the job-facing failure path (e.g., return or invoke a job-failure/rollback method on jobsManager or signal the caller so job creation returns an error) and include startupLog.Error("failed to schedule tasks into Redis", "error", err, "job_id", jobID, "count", len(schedEntries)) with the actual error. Ensure you use jobsManager.OnTasksEnqueued, scheduler.ScheduleBatch and startupLog.Error as the points to change.
♻️ Duplicate comments (2)
internal/api/handlers.go (1)
937-949:⚠️ Potential issue | 🟠 MajorSanitise the request passed to shared response helpers as well.
This only redacts the local
logger. The handler still passes the originalrintoBadRequest,NotFound,InternalError, andWriteSuccess. If any of those helpers derive their own request logger fromr.URL.Path, the webhook token or workspace ID will still reach logs or Sentry. Please either clonerwith a redactedURL.Pathbefore any helper call, or plumb the redacted path through those helpers too.Run this to confirm whether the shared API helpers rebuild log context from the raw request path:
#!/bin/bash rg -n -C4 'func (BadRequest|NotFound|InternalError|WriteSuccess)\(' internal/api rg -n -C4 'loggerWithRequest\(r\)|r\.URL\.Path' internal/api🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/handlers.go` around lines 937 - 949, The request's URL.Path is still passed unredacted into shared helpers (BadRequest, NotFound, InternalError, WriteSuccess) so sensitive tokens can leak; create a redacted copy of the request (e.g., clone r via r.Clone(r.Context()) or shallow-copy and replace URL.Path) and set its URL.Path to "/v1/webhooks/webflow/[redacted]" or "/v1/webhooks/webflow/workspaces/[redacted]" to match the loggerWithRequestPath logic, then pass that redacted request into the helper calls (or alternatively modify those helpers to accept and use the already-created request-scoped logger from loggerWithRequest/loggerWithRequestPath) so no code derives log context from the original r.URL.Path.scripts/aggregate_logs.py (1)
78-85:⚠️ Potential issue | 🟠 MajorChanged inputs still inflate the aggregate.
processed_filesonly stores the latest fingerprint, so when a summary is regenerated you merge the new counts without removing the old contribution first. A file changing from{event_A: 10}to{event_A: 5}leaves15inby_minute/warn_error_counts, which then skews every derived CSV and the summary. Persist per-file contributions and subtract them before re-merging, or fall back to a full rebuild whenever a fingerprint changes.Also applies to: 105-120, 304-323
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/aggregate_logs.py` around lines 78 - 85, The aggregate inflation bug occurs because processed_files only stores the latest fingerprint and you re-merge counts without removing a previous contribution; update the logic around raw_processed/processed_files to either (a) persist per-file contribution vectors (e.g., store per-basename contribution objects) and on detecting a fingerprint change subtract the stored per-file contributions from by_minute and warn_error_counts before adding the regenerated counts, or (b) simpler: detect any fingerprint mismatch for a basename and trigger a full rebuild of the aggregated structures instead of incremental merge; change the code that reads saved.get("processed_files", {}) and the merge paths that update by_minute and warn_error_counts to implement one of these two approaches and ensure processed_files stores whatever per-file contribution you need to subtract later.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/db/users.go`:
- Around line 103-108: GetOrCreateUser now returns ErrUserNotFound only for
genuine "not found" and surfaces other DB failures; update the notification
handlers that call GetOrCreateUser so they distinguish errors: when
errors.Is(err, ErrUserNotFound) return the existing Unauthorised("User not
found") response, but for any other error return a 5xx/internal error (or
propagate the error) instead of collapsing everything into Unauthorised. Locate
the calls to GetOrCreateUser in the notification handlers and change the error
handling to check errors.Is(err, ErrUserNotFound) first, then handle
non-ErrUserNotFound errors as server/database failures.
In `@scripts/aggregate_logs.py`:
- Around line 296-313: The current early return when incremental is True and
new_files is empty prevents rewriting derived outputs (events_per_minute.csv,
components_per_minute.csv, summary.md) even though load_data() has already
restored the aggregates; remove the early return and instead fall through to the
existing output-writing logic so that by_minute, totals, and processed_map (from
load_data) are used to regenerate the CSVs and summary; in practice modify the
block around new_files/if not new_files to skip processing new files but still
call the same output writers (the code that writes events_per_minute.csv,
components_per_minute.csv, and summary.md) and then return True, keeping use of
load_data, new_files, processed_map, and _file_fingerprint.
- Around line 271-282: Precompute the escaped event string before using it in
f-strings to avoid backslash-in-expression syntax issues: in the loop that
iterates over event_totals (symbol: event_totals) and in the warn_and_error loop
(symbols: warn_and_error, totals["warn_error_counts"]) compute something like
escaped_event = event[:80].replace('|', '\\|') and then use f.write(f"|
{count:,} | {escaped_event} |\n") instead of embedding .replace(...) directly
inside the f-string.
---
Outside diff comments:
In `@cmd/app/main.go`:
- Around line 577-600: The OnTasksEnqueued handler currently swallows
scheduler.ScheduleBatch failures (jobsManager.OnTasksEnqueued ->
scheduler.ScheduleBatch) which leaves tasks persisted in Postgres but never
published to Redis; change the handler to make ScheduleBatch retry with backoff
(context-aware exponential backoff + jitter using ctx.Done()) and, if retries
exhaust, propagate the failure back to the job creation flow instead of only
logging: after final failure call the job-facing failure path (e.g., return or
invoke a job-failure/rollback method on jobsManager or signal the caller so job
creation returns an error) and include startupLog.Error("failed to schedule
tasks into Redis", "error", err, "job_id", jobID, "count", len(schedEntries))
with the actual error. Ensure you use jobsManager.OnTasksEnqueued,
scheduler.ScheduleBatch and startupLog.Error as the points to change.
In `@internal/db/users.go`:
- Around line 469-477: The current logic assumes any non-nil error from
db.GetOrganisationByName means "not found" and proceeds to create an
organisation; change this to mirror the ErrUserNotFound pattern: after calling
db.GetOrganisationByName (used when isBusinessEmail(email) is true), handle
three cases—if err == nil use existing org; else if errors.Is(err,
ErrOrganisationNotFound) (or the project equivalent) proceed to create a new
organisation; otherwise return or propagate the read error (do not fall through
to create). Update the branch around db.GetOrganisationByName to explicitly
check for the not-found sentinel and to surface other errors instead of creating
a duplicate organisation.
---
Duplicate comments:
In `@internal/api/handlers.go`:
- Around line 937-949: The request's URL.Path is still passed unredacted into
shared helpers (BadRequest, NotFound, InternalError, WriteSuccess) so sensitive
tokens can leak; create a redacted copy of the request (e.g., clone r via
r.Clone(r.Context()) or shallow-copy and replace URL.Path) and set its URL.Path
to "/v1/webhooks/webflow/[redacted]" or
"/v1/webhooks/webflow/workspaces/[redacted]" to match the loggerWithRequestPath
logic, then pass that redacted request into the helper calls (or alternatively
modify those helpers to accept and use the already-created request-scoped logger
from loggerWithRequest/loggerWithRequestPath) so no code derives log context
from the original r.URL.Path.
In `@scripts/aggregate_logs.py`:
- Around line 78-85: The aggregate inflation bug occurs because processed_files
only stores the latest fingerprint and you re-merge counts without removing a
previous contribution; update the logic around raw_processed/processed_files to
either (a) persist per-file contribution vectors (e.g., store per-basename
contribution objects) and on detecting a fingerprint change subtract the stored
per-file contributions from by_minute and warn_error_counts before adding the
regenerated counts, or (b) simpler: detect any fingerprint mismatch for a
basename and trigger a full rebuild of the aggregated structures instead of
incremental merge; change the code that reads saved.get("processed_files", {})
and the merge paths that update by_minute and warn_error_counts to implement one
of these two approaches and ensure processed_files stores whatever per-file
contribution you need to subtract later.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 38d8ee16-8df4-47dc-ad9d-a199237f3905
📒 Files selected for processing (5)
cmd/app/main.gointernal/api/handlers.gointernal/api/logging.gointernal/db/users.goscripts/aggregate_logs.py
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/api/notifications.go`:
- Around line 90-100: Create a single helper (e.g., resolveCurrentUser or
loadCurrentUser) that encapsulates the repeated GetOrCreateUser call and the
ErrUserNotFound/5xx branching so all three handlers use the same behaviour and
logging: call h.DB.GetOrCreateUser, if errors.Is(err, db.ErrUserNotFound) then
logger.Warn("User not found", "error", err) and call Unauthorised(w, r, "User
not found"); on other errors log via logger.Error("Failed to load user for
notifications", "error", err) and call InternalError(w, r, fmt.Errorf("failed to
load user: %w", err)); return the resolved user on success so the existing
handlers can replace their duplicated blocks with a single call to this helper.
In `@scripts/aggregate_logs.py`:
- Around line 179-181: Narrow the broad except in the log-processing function by
replacing "except Exception as e:" with a specific tuple of exceptions (e.g.,
OSError, json.JSONDecodeError, KeyError, TypeError) to avoid masking bugs;
follow the same pattern used in load_data() so the error handling in the
function that prints "Error processing {json_file}: {e}" only catches expected
IO/parsing/key errors and lets unexpected exceptions surface.
- Around line 16-23: Clarify the docstring in scripts/aggregate_logs.py to
explicitly state the double-counting limitation: note that incremental mode
fingerprints inputs by mtime+size and, if a file's fingerprint changes and it is
re-ingested, the old contribution is not subtracted from aggregates (so
regenerated files can inflate totals); advise callers to run with --full to
force a clean reprocess when upstream producers may rewrite files in place, and
add a short example or sentence mentioning this specific caveat next to the
existing fingerprinting explanation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4376cf98-6845-4a0c-a7bb-9fbf2bca036a
📒 Files selected for processing (2)
internal/api/notifications.goscripts/aggregate_logs.py
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/api/notifications.go (1)
183-192: 🧹 Nitpick | 🔵 TrivialConsider using a sentinel error instead of string matching.
The
strings.Contains(err.Error(), "not found")check on line 185 is fragile—it relies on the error message text which could change. This pattern exists elsewhere in the codebase too.A more robust approach would be to define a sentinel error (e.g.,
db.ErrNotificationNotFound) in the DB layer and check witherrors.Is(), consistent with howErrUserNotFoundis handled.This is pre-existing code, so not blocking this PR, but worth addressing in a follow-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/api/notifications.go` around lines 183 - 192, Replace the fragile string match on err.Error() in the MarkNotificationRead call with a sentinel error check: add a package-level sentinel (e.g., db.ErrNotificationNotFound) in the DB layer and have MarkNotificationRead return that sentinel when the row is absent; then change the handler to use errors.Is(err, db.ErrNotificationNotFound) (similar to existing ErrUserNotFound usage) and call NotFound(w, r, ...) on that condition; keep all other error logging/handling the same.
♻️ Duplicate comments (1)
scripts/aggregate_logs.py (1)
319-337:⚠️ Potential issue | 🟠 MajorIncremental reprocessing still double-counts changed files.
When a fingerprint changes, the old contribution for that basename is still present in
by_minuteandtotals;process_json_file()only adds the new snapshot on top. A regenerated summary therefore inflates counts instead of replacing them. Persist per-file contributions and subtract them before re-merge, or force--fullwhenever an already-seen file changes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/aggregate_logs.py` around lines 319 - 337, The incremental path double-counts because when a file's fingerprint changes you currently only add the new snapshot via process_json_file() without removing the old contribution; update the logic so processed_map stores per-file contribution summaries (e.g., per-file minute buckets and totals) keyed by filename instead of only a fingerprint, and when you detect processed_map.get(f.name) != _file_fingerprint(f) subtract the stored per-file contribution from by_minute and totals before calling process_json_file() (or replace the stored contribution entirely after processing); use the existing symbols processed_map, _file_fingerprint, new_files, process_json_file, by_minute and totals to locate where to persist, subtract, and update per-file aggregates so regenerated summaries replace rather than accumulate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@internal/api/notifications.go`:
- Around line 183-192: Replace the fragile string match on err.Error() in the
MarkNotificationRead call with a sentinel error check: add a package-level
sentinel (e.g., db.ErrNotificationNotFound) in the DB layer and have
MarkNotificationRead return that sentinel when the row is absent; then change
the handler to use errors.Is(err, db.ErrNotificationNotFound) (similar to
existing ErrUserNotFound usage) and call NotFound(w, r, ...) on that condition;
keep all other error logging/handling the same.
---
Duplicate comments:
In `@scripts/aggregate_logs.py`:
- Around line 319-337: The incremental path double-counts because when a file's
fingerprint changes you currently only add the new snapshot via
process_json_file() without removing the old contribution; update the logic so
processed_map stores per-file contribution summaries (e.g., per-file minute
buckets and totals) keyed by filename instead of only a fingerprint, and when
you detect processed_map.get(f.name) != _file_fingerprint(f) subtract the stored
per-file contribution from by_minute and totals before calling
process_json_file() (or replace the stored contribution entirely after
processing); use the existing symbols processed_map, _file_fingerprint,
new_files, process_json_file, by_minute and totals to locate where to persist,
subtract, and update per-file aggregates so regenerated summaries replace rather
than accumulate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 941b173c-3a56-4a36-b9fe-d302c1b27c35
📒 Files selected for processing (2)
internal/api/notifications.goscripts/aggregate_logs.py
|
🐝 Review App Deployed Homepage: https://hover-pr-335.fly.dev |
Summary
Brings
origin/main(37 commits ahead, notably the zerolog → slog migration) into the Stage 3 Redis broker branch so PR #330 can resume from a fresh base.Conflict resolution
worker.go,domain_limiter.go,test_jobs/main.go— main's changes were logger-only; the files remain removed under the Redis dispatch architecture.go.mod,Dockerfile,fly.toml,cmd/app/main.go,internal/jobs/manager.go,internal/db/queue.go— resolved in favour of the Redis broker wiring (keptOnTasksEnqueuedcallback, dropped legacy worker-pool env vars + DB queue helpers with no remaining callers).zerolog → slog port
Converted 13 files to the project's existing
logging.Component("...")pattern (no injected loggers, matchesinternal/jobs/manager.go):internal/broker/{redis,scheduler,counters,pacer,dispatcher,consumer}.go+ tests (miniredis tests no longer needzerolog.Nop())internal/jobs/{stream_worker,executor,link_discovery,manager}.gocmd/worker/main.go(addedlogging.Setup+sentry.InitwithBeforeSend: logging.BeforeSend)cmd/app/main.go(startup log ported; removed unusedgetEnvInt)go.moddropsrs/zerologentirely; keepsredis/go-redis/v9 v9.18.0.Inherited from main
archive_key_migrate— a 37 MB Mach-O binary accidentally committed inbaca9f6on main. Left as-is; should be addressed on main.Verification
go build ./...cleango vet ./...cleango test ./...all packages pass (broker, jobs, db, api, logging, …)gofmt -l ./goimports -l cmd internalcleanprettier --checkcleanSummary by CodeRabbit
New Features
Refactoring
Chores