feat(data-stack): add shared duckling backfill concurrency tag#61027
Conversation
Each duckling backfill connection spins up a duckgres worker, and the per-org worker pool is capped and shared with product queries. Events and persons backfills previously carried only their own independent concurrency tags, so they could not be governed by a single combined limit. Add a shared duckling_backfill_concurrency=duckling_v1 tag to both assets and jobs; the combined cap is enforced on this key via the Dagster Cloud deployment settings in the charts repo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Hey @fuziontech! 👋 It looks like your git author email on this PR isn't your
You can fix it for this repo with: git config user.email "you@posthog.com"Or set it globally with |
There was a problem hiding this comment.
Pull request overview
Adds a shared Dagster concurrency tag (duckling_backfill_concurrency=duckling_v1) to both the events and persons duckling backfill assets/jobs so a single combined cap (configured separately in the charts repo) can limit their total concurrent runs against the shared duckgres worker pool.
Changes:
- Introduce
DUCKLING_BACKFILL_CONCURRENCY_TAGand merge it intoEVENTS_CONCURRENCY_TAGandPERSONS_CONCURRENCY_TAG. - Add a regression test ensuring both backfills carry the shared key/value.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| posthog/dags/events_backfill_to_duckling.py | Defines the shared concurrency tag and spreads it into both per-product tag dicts. |
| posthog/dags/test_events_backfill_to_duckling.py | Adds a test asserting both backfill tag dicts contain the shared key/value. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
posthog/dags/test_events_backfill_to_duckling.py:913-920
The test checks both backfill types in a single assertion block. Prefer a parameterized test so that when a future edit drops the shared tag from one backfill, the failure message immediately names which one (events vs persons) instead of failing on a dict lookup with no clear owner.
```suggestion
class TestDucklingConcurrencyTags:
# The combined events+persons cap is enforced on the shared key in the charts
# Dagster deployment settings. If either backfill drops the shared tag, the
# cap silently stops applying to it — guard against that here.
@parameterized.expand([
("events", EVENTS_CONCURRENCY_TAG),
("persons", PERSONS_CONCURRENCY_TAG),
])
def test_backfill_has_shared_concurrency_key(self, _name, tag_dict):
((shared_key, shared_value),) = DUCKLING_BACKFILL_CONCURRENCY_TAG.items()
assert tag_dict[shared_key] == shared_value
```
Reviews (1): Last reviewed commit: "feat(data-stack): add shared duckling ba..." | Re-trigger Greptile |
Split the events/persons assertion into a parameterized test so a dropped shared tag names the offending backfill in the failure instead of an anonymous dict lookup. Addresses review feedback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add Concurrency Control section and connection timeout troubleshooting entry to README_DUCKLINGS.md, reflecting the shared concurrency tag mechanism introduced in PR #61027.
Problem
Each duckling backfill op opens a duckgres connection, and duckgres spins up one worker per connection. The per-org worker pool is capped (
maxWorkersin the duckgres chart — 50 in prod-us) and shared with product HogQL queries and other jobs. With no combined limit, the events and persons backfills can fan out across many partitions at once and exhaust the pool, producingpsycopg.ConnectionTimeouton connect.The two backfills each carried only their own concurrency tag (
duckling_events_v1,duckling_persons_v1), so they could only ever be limited independently — there was no single key to enforce a combined cap across both.Changes
Add a shared tag
duckling_backfill_concurrency=duckling_v1to both the events and persons assets and jobs (via the existingEVENTS_CONCURRENCY_TAG/PERSONS_CONCURRENCY_TAGdicts). The per-product keys are kept for optional finer-grained limits later.The limit value is not set here — it's a Dagster Cloud deployment setting applied from the charts repo (companion PR). This change only establishes the shared key the limit targets.
How did you test this code?
I'm an agent; automated tests only.
Added
TestDucklingConcurrencyTags::test_events_and_persons_share_the_combined_concurrency_key— asserts both backfills carry the shared key/value, so a future edit can't silently drop it and detach a backfill from the combined cap. Full file:101 passed.ruffclean.🤖 Agent context
Authored by Claude Code (Opus 4.8). Part of clamping duckling backfill concurrency below the shared duckgres worker pool. A shared tag is required because Dagster
tag_concurrency_limitskey on run tags — two independent per-product keys can only yield two independent limits, not one combined cap. The deployment-settings gitops that sets the limit (combined cap of 5) lives in the charts repo.🤖 Generated with Claude Code