feat(data-stack): add shared duckling backfill concurrency tag #61027

Merged
fuziontech merged 2 commits into master from fix/duckling-shared-concurrency-tag on Jun 2, 2026

@fuziontech
Member

@fuziontech fuziontech commented Jun 1, 2026

Problem

Each duckling backfill op opens a duckgres connection, and duckgres spins up one worker per connection. The per-org worker pool is capped (maxWorkers in the duckgres chart — 50 in prod-us) and shared with product HogQL queries and other jobs. With no combined limit, the events and persons backfills can fan out across many partitions at once and exhaust the pool, producing psycopg.ConnectionTimeout on connect.

The two backfills each carried only their own concurrency tag (duckling_events_v1, duckling_persons_v1), so they could only ever be limited independently — there was no single key to enforce a combined cap across both.

Changes

Add a shared tag duckling_backfill_concurrency=duckling_v1 to both the events and persons assets and jobs (via the existing EVENTS_CONCURRENCY_TAG / PERSONS_CONCURRENCY_TAG dicts). The per-product keys are kept for optional finer-grained limits later.
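A minimal sketch of what the tag dicts might look like after this change. The shared key/value and the three constant names come from this PR; the per-product key names are hypothetical placeholders, and treating `duckling_events_v1` / `duckling_persons_v1` as tag values rather than keys is an assumption.

```python
# Shared key/value taken from the PR description.
DUCKLING_BACKFILL_CONCURRENCY_TAG = {"duckling_backfill_concurrency": "duckling_v1"}

# Per-product key names below are hypothetical; only the constant names and
# the duckling_*_v1 identifiers appear in the PR text.
EVENTS_CONCURRENCY_TAG = {
    "duckling_events_concurrency": "duckling_events_v1",  # hypothetical key
    **DUCKLING_BACKFILL_CONCURRENCY_TAG,  # shared key merged in
}
PERSONS_CONCURRENCY_TAG = {
    "duckling_persons_concurrency": "duckling_persons_v1",  # hypothetical key
    **DUCKLING_BACKFILL_CONCURRENCY_TAG,  # shared key merged in
}
```

Merging the shared dict into both per-product dicts keeps a single source of truth for the shared key, so every asset or job that already uses one of the per-product dicts picks up the new tag without further edits.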

The limit value is not set here — it's a Dagster Cloud deployment setting applied from the charts repo (companion PR). This change only establishes the shared key the limit targets.

Companion charts PR (gitops for the deployment settings): link to be added

How did you test this code?

I'm an agent; automated tests only.

Added TestDucklingConcurrencyTags::test_events_and_persons_share_the_combined_concurrency_key — asserts both backfills carry the shared key/value, so a future edit can't silently drop it and detach a backfill from the combined cap. Full file: 101 passed. ruff clean.

🤖 Agent context

Authored by Claude Code (Opus 4.8). Part of clamping duckling backfill concurrency below the shared duckgres worker pool. A shared tag is required because Dagster tag_concurrency_limits key on run tags — two independent per-product keys can only yield two independent limits, not one combined cap. The deployment-settings gitops that sets the limit (combined cap of 5) lives in the charts repo.
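To illustrate why a shared key is needed, here is a simplified model of a tag-based run limit. It is not Dagster's implementation: `can_launch`, the rule dict shape, and the per-product key names are assumptions made for this sketch. Only the shared key/value and the combined cap of 5 come from this PR; the per-product limit of 5 is an arbitrary illustrative number.

```python
def can_launch(new_run_tags, in_progress_runs, limits):
    """Return True if a run with new_run_tags fits under every matching rule.

    Simplified model: a rule matches a run when the run carries the rule's
    key (and value, if the rule specifies one).
    """
    def matches(tags, rule):
        key, value = rule["key"], rule.get("value")
        return key in tags and (value is None or tags[key] == value)

    for rule in limits:
        if not matches(new_run_tags, rule):
            continue  # this rule does not apply to the new run
        active = sum(1 for tags in in_progress_runs if matches(tags, rule))
        if active >= rule["limit"]:
            return False
    return True


SHARED = {"duckling_backfill_concurrency": "duckling_v1"}
events = {"duckling_events_concurrency": "duckling_events_v1", **SHARED}  # hypothetical key
persons = {"duckling_persons_concurrency": "duckling_persons_v1", **SHARED}  # hypothetical key

# One rule on the shared key: events and persons count against the same cap.
shared_limit = [{"key": "duckling_backfill_concurrency", "value": "duckling_v1", "limit": 5}]
blocked = not can_launch(events, [events] * 3 + [persons] * 2, shared_limit)

# Two per-product rules: each backfill is capped on its own, so the total
# number of concurrent runs can exceed 5.
independent_limits = [
    {"key": "duckling_events_concurrency", "limit": 5},
    {"key": "duckling_persons_concurrency", "limit": 5},
]
allowed = can_launch(persons, [events] * 5 + [persons] * 4, independent_limits)
```

With the shared rule, the sixth run is held back regardless of which backfill it belongs to; with only per-product rules, nine runs are already in flight and a tenth is still admitted.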

🤖 Generated with Claude Code

Each duckling backfill connection spins up a duckgres worker, and the per-org
worker pool is capped and shared with product queries. Events and persons
backfills previously carried only their own independent concurrency tags, so
they could not be governed by a single combined limit. Add a shared
duckling_backfill_concurrency=duckling_v1 tag to both assets and jobs; the
combined cap is enforced on this key via the Dagster Cloud deployment settings
in the charts repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 21:49
@github-actions
Contributor

github-actions Bot commented Jun 1, 2026

Hey @fuziontech! 👋

It looks like your git author email on this PR isn't your @posthog.com address (fuziontech@gmail.com). Since you're on the PostHog team, it's worth pointing your local git author email at your @posthog.com address. Why it matters:

  • Consistent work identity in git history — internal tooling that attributes commits to team members keys off your @posthog.com address.
  • Keeps team contributions easy to tell apart from external community ones when scanning history.

You can fix it for this repo with:

git config user.email "you@posthog.com"

Or set it globally with git config --global user.email "you@posthog.com". No need to redo this PR — just a nudge for next time. 🙂

Contributor

Copilot AI left a comment

Pull request overview

Adds a shared Dagster concurrency tag (duckling_backfill_concurrency=duckling_v1) to both the events and persons duckling backfill assets/jobs so a single combined cap (configured separately in the charts repo) can limit their total concurrent runs against the shared duckgres worker pool.

Changes:

  • Introduce DUCKLING_BACKFILL_CONCURRENCY_TAG and merge it into EVENTS_CONCURRENCY_TAG and PERSONS_CONCURRENCY_TAG.
  • Add a regression test ensuring both backfills carry the shared key/value.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| posthog/dags/events_backfill_to_duckling.py | Defines the shared concurrency tag and spreads it into both per-product tag dicts. |
| posthog/dags/test_events_backfill_to_duckling.py | Adds a test asserting both backfill tag dicts contain the shared key/value. |

@greptile-apps
Contributor

greptile-apps Bot commented Jun 1, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
posthog/dags/test_events_backfill_to_duckling.py:913-920
The test checks both backfill types in a single assertion block. Prefer a parameterized test so that when a future edit drops the shared tag from one backfill, the failure message immediately names which one (events vs persons) instead of failing on a dict lookup with no clear owner.

```suggestion
class TestDucklingConcurrencyTags:
    # The combined events+persons cap is enforced on the shared key in the charts
    # Dagster deployment settings. If either backfill drops the shared tag, the
    # cap silently stops applying to it — guard against that here.
    @parameterized.expand([
        ("events", EVENTS_CONCURRENCY_TAG),
        ("persons", PERSONS_CONCURRENCY_TAG),
    ])
    def test_backfill_has_shared_concurrency_key(self, _name, tag_dict):
        ((shared_key, shared_value),) = DUCKLING_BACKFILL_CONCURRENCY_TAG.items()
        assert tag_dict[shared_key] == shared_value
```

Reviews (1): Last reviewed commit: "feat(data-stack): add shared duckling ba..." | Re-trigger Greptile

Comment thread posthog/dags/test_events_backfill_to_duckling.py Outdated
@fuziontech fuziontech requested a review from a team June 1, 2026 22:04
Split the events/persons assertion into a parameterized test so a dropped shared
tag names the offending backfill in the failure instead of an anonymous dict
lookup. Addresses review feedback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fuziontech fuziontech merged commit d5be162 into master Jun 2, 2026
336 of 341 checks passed
@fuziontech fuziontech deleted the fix/duckling-shared-concurrency-tag branch June 2, 2026 03:05
inkeep Bot added a commit that referenced this pull request Jun 2, 2026
Add Concurrency Control section and connection timeout troubleshooting
entry to README_DUCKLINGS.md, reflecting the shared concurrency tag
mechanism introduced in PR #61027.
@deployment-status-posthog

deployment-status-posthog Bot commented Jun 2, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-02 03:26 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-02 10:18 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-02 10:20 UTC | Run |
