fix(health-checks): drop deleted teams in HealthIssue.bulk_upsert#60552
Conversation
Health check workflows snapshot team IDs at workflow start and dispatch them to per-batch activities. When a team is deleted between the snapshot and `HealthIssue.bulk_upsert`, the deferred FK on `posthog_healthissue.team_id` fires at COMMIT and rolls back the entire batch — not just the orphan row — causing repeated activity failures and lost observability for every other team in the batch. Filter the incoming rows against the live `posthog_team` set inside the upsert transaction, dropping any team_id that no longer exists and emitting a structured warning with the dropped count. Apply the same filter in `resolve_stale_issues_with_deltas` for consistency so we don't waste work (or, after ID reuse, emit misleading `resolved` alerts) on teams that have been cascade-cleaned. Surfaced via PostHog inbox report 019e0b7a-9c6e-7812-89f6-ab2bc756d935. Generated-By: PostHog Code Task-Id: 4b10d906-f9dd-4833-91f2-66237863b4a7
|
| logger.warning( | ||
| "health_issue_dropping_deleted_teams", | ||
| kind=kind, | ||
| event=log_event, | ||
| dropped_count=len(missing), | ||
| dropped_team_ids=sorted(missing), | ||
| ) |
There was a problem hiding this comment.
event is a reserved keyword argument in structlog — it is the first positional parameter of every log method (def warning(self, event: str, **kw: Any)). Passing event=log_event alongside the positional event string causes TypeError: warning() got multiple values for argument 'event'. This TypeError propagates out of _filter_existing_team_ids, rolls back the transaction.atomic() in bulk_upsert, and causes the activity to fail — reproducing the same batch-level failure the PR set out to fix, just with a different exception. The field should use a name that doesn't clash with structlog's reserved event parameter.
| logger.warning( | |
| "health_issue_dropping_deleted_teams", | |
| kind=kind, | |
| event=log_event, | |
| dropped_count=len(missing), | |
| dropped_team_ids=sorted(missing), | |
| ) | |
| logger.warning( | |
| "health_issue_dropping_deleted_teams", | |
| kind=kind, | |
| trigger=log_event, | |
| dropped_count=len(missing), | |
| dropped_team_ids=sorted(missing), | |
| ) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/models/health_issue.py
Line: 34-40
Comment:
`event` is a reserved keyword argument in structlog — it is the first positional parameter of every log method (`def warning(self, event: str, **kw: Any)`). Passing `event=log_event` alongside the positional event string causes `TypeError: warning() got multiple values for argument 'event'`. This TypeError propagates out of `_filter_existing_team_ids`, rolls back the `transaction.atomic()` in `bulk_upsert`, and causes the activity to fail — reproducing the same batch-level failure the PR set out to fix, just with a different exception. The field should use a name that doesn't clash with structlog's reserved `event` parameter.
```suggestion
logger.warning(
"health_issue_dropping_deleted_teams",
kind=kind,
trigger=log_event,
dropped_count=len(missing),
dropped_team_ids=sorted(missing),
)
```
How can I resolve this? If you propose a fix, please make it concise.|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
|
Size Change: 0 B Total Size: 80.9 MB ℹ️ View Unchanged
|
…0552) Co-authored-by: posthog[bot] <206114724+posthog[bot]@users.noreply.github.com> Co-authored-by: Rafa Audibert <rafael@posthog.com> Co-authored-by: Rafael Audibert <32079912+rafaeelaudibert@users.noreply.github.com> Co-authored-by: tests-posthog[bot] <250237707+tests-posthog[bot]@users.noreply.github.com>
Problem
The internal health-check Temporal workflow snapshots all team IDs at workflow start and dispatches them to per-batch activities. When a team is deleted between the snapshot and
HealthIssue.bulk_upsert, the deferred FK onposthog_healthissue.team_idfires at COMMIT and rolls back the entire batch — not just the orphan row — so every other team's freshly detected issues are lost and the activity surfaces as aForeignKeyViolationonposthog_healthissue_team_id_484cd7e3_fk_posthog_team_id. The same fingerprint family has recurred on four distinct team IDs over a 19-day window in error tracking.Surfaced via PostHog inbox report
019e0b7a-9c6e-7812-89f6-ab2bc756d935.Changes
HealthIssue.bulk_upsert(posthog/models/health_issue.py): inside the upsert transaction, filter the incoming(team_id, unique_hash)rows againstTeam.objects.filter(id__in=...)and drop any team_id that no longer exists. Emits a structuredhealth_issue_dropping_deleted_teamswarning with the dropped count so the occurrence is observable. If every team in the batch is gone, return early instead of opening a no-op write.resolve_stale_issues_with_deltas(posthog/temporal/health_checks/db.py): apply the same live-team filter before callingbulk_resolve. Teams that have been cascade-cleaned by Postgres no longer need work, and we avoid emitting a misleadingresolvedalert if a team ID is ever reused._filter_existing_team_idsinposthog/models/health_issue.pykeeps the two call sites consistent and the warning shape uniform.This keeps the activity idempotent across mid-workflow team deletions without swallowing genuine FK errors — anything other than "team no longer exists" still propagates and surfaces in error tracking.
How did you test this code?
I'm an agent. I added automated tests and did not perform manual testing:
posthog/models/test/test_health_issue.py::TestHealthIssue::test_bulk_upsert_skips_deleted_team_and_persists_remaining— a real team's issue and a non-existent team are passed together; the live team's issue is persisted and the orphan is dropped.posthog/models/test/test_health_issue.py::TestHealthIssue::test_bulk_upsert_all_teams_missing_returns_empty— when every incoming team is gone, returns[]without writing.posthog/temporal/health_checks/tests/test_db.py— new file coveringresolve_stale_issues_with_deltaswith a deleted team_id in the healthy set, plus an end-to-end upsert+resolve mixing a real team and a deleted one.The Docker- and ClickHouse-backed services needed to run the Django test suite were not available in this environment, so the new tests were validated for imports/lint/syntax only and will be exercised by CI.
Automatic notifications
🤖 Agent context
019e0b7a-9c6e-7812-89f6-ab2bc756d935.HealthIssue.bulk_upsertas the failing site and proposed filtering incoming rows againstposthog_teambeforebulk_createto avoid the deferred FK rollback. The fix follows that approach: the filter runs inside the sametransaction.atomic()block as the writes (narrowing the race window without holding a row-level lock onposthog_team), and a structured warning is emitted so the drop count is visible in logs.bulk_create(ignore_conflicts=True)— rejected: it does not apply to FK violations. Considered per-row savepoints — rejected: defeats the bulk path and adds latency. Pre-filtering is the minimal, lowest-risk fix.from posthog.models.team import Teaminposthog/models/health_issue.pywould invert load order inposthog/models/__init__.py(HealthIssue is imported before Team there); the import is kept inline inside the helper, with a comment explaining why, matching the existing pattern inposthog/models/utils.py.