fix(cdp): bypass bot filter for non-browser $lib events #60557

Merged
dmarchuk merged 5 commits into master from dmarchuk/bot-filter-bypass-non-web-lib
Jun 1, 2026

@dmarchuk dmarchuk commented May 28, 2026

Problem

The Filter Bot Events transformation (template-bot-detection) drops events from any source whose user agent matches PostHog's curated bot UA list (nodejs/src/cdp/hog-transformations/bots/bots.ts) or whose IP matches the curated datacenter list (nodejs/assets/bot-ips.txt).

This silently kills server-side captures:

  • $ai_generation and other LLM observability events from posthog-python and posthog-node (their UAs, python-httpx and axios/*, both substring-match the curated bot UA list).
  • $identify / custom events from a Python/Node backend.
  • Source webhook captures (posthog-webhook).
  • Mobile SDK traffic (posthog-ios, posthog-android) whose HTTP-client UAs can substring-match entries on the curated list.

These all look like "bots" by UA, but the curated lists are tuned for browser traffic. A backend HTTP client is a server, not a crawler.

Surfaced via a support ticket where a customer's $ai_generation events never reached LLM analytics, despite running the transformation in default config (no custom patterns).

Changes

Gate the two curated-list calls (isKnownBotUserAgent and isKnownBotIp) on is_browser_traffic, defined as $lib in {web, js} or $lib missing. Customer-configured customBotPatterns and customIpPrefixes continue to apply for every event regardless of $lib (the customer's explicit configuration is preserved).

No new inputs, no schema changes.
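
A minimal sketch of the gate described above, written in TypeScript for illustration. The real template is not reproduced here: the event and input shapes (`SketchEvent`, `SketchInputs`), the tiny stand-in lists, the substring/prefix matching, and `shouldDropEvent` itself are all assumptions. Only the names `isKnownBotUserAgent`, `isKnownBotIp`, `customBotPatterns`, `customIpPrefixes` and the `$lib` rule come from this PR.

```typescript
// Hypothetical, simplified shapes; the real template reads event properties instead.
interface SketchEvent {
    lib?: string // stands in for the $lib property
    userAgent?: string
    ip?: string
}

interface SketchInputs {
    customBotPatterns: string[]
    customIpPrefixes: string[]
}

// Tiny stand-ins for the curated lists (assumed entries, not the real data).
const CURATED_BOT_UA_SUBSTRINGS = ['googlebot', 'python-httpx', 'axios']
const CURATED_BOT_IP_PREFIXES = ['203.0.113.']

function isKnownBotUserAgent(ua: string): boolean {
    const lower = ua.toLowerCase()
    return CURATED_BOT_UA_SUBSTRINGS.some((s) => lower.indexOf(s) !== -1)
}

function isKnownBotIp(ip: string): boolean {
    return CURATED_BOT_IP_PREFIXES.some((p) => ip.indexOf(p) === 0)
}

// $lib in {web, js} or $lib missing counts as browser traffic.
function isBrowserTraffic(lib: string | undefined): boolean {
    return lib === undefined || lib === 'web' || lib === 'js'
}

function shouldDropEvent(event: SketchEvent, inputs: SketchInputs): boolean {
    const ua = (event.userAgent ?? '').toLowerCase()
    const ip = event.ip ?? ''

    // Customer-configured rules apply to every event, regardless of $lib.
    if (inputs.customBotPatterns.some((p) => p !== '' && ua.indexOf(p.toLowerCase()) !== -1)) {
        return true
    }
    if (inputs.customIpPrefixes.some((p) => p !== '' && ip.indexOf(p) === 0)) {
        return true
    }

    // Curated lists are tuned for browser traffic, so they are skipped otherwise.
    if (!isBrowserTraffic(event.lib)) {
        return false
    }
    return isKnownBotUserAgent(ua) || isKnownBotIp(ip)
}
```

In this sketch, an event with `lib: 'posthog-python'` and a `python-httpx` user agent is kept under default inputs, while the same user agent is still dropped if the customer lists `python-httpx` in `customBotPatterns`.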

How did you test this code?

I tested this manually by sending events with the code before and after the change and checking whether they landed.

Added jest cases to bot-detection.template.test.ts:

  • skip curated UA list: $lib in {posthog-python, posthog-node, posthog-webhook, posthog-ios} with bot UA -> event kept
  • skip curated IP list: server-side $lib with bot IP -> event kept
  • still apply customBotPatterns even for non-browser $lib
  • still apply customIpPrefixes even for non-browser $lib
  • still filter known bot UA when $lib in {web, js} or missing -> event dropped

All 24 tests in the file pass locally via hogli test.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

No docs in scope.

🤖 Agent context

Authored by Claude Code (Opus 4.7) on Daniel's machine. Triage started from a support ticket where $ai_generation events never appeared in LLM analytics for a customer; Eli Reisman traced the drop to their bot-detection transformation; Radu Raicea asked me to follow up.

Three approaches considered:

  1. Hardcoded event.event LIKE '$ai_%' carve-out: solves the reported case but leaves the same silent-drop bug for server-side identify / custom events / webhook captures.
  2. Skip hog transformations entirely on the AI subpipeline (nodejs/src/ingestion/ai/pipelines/ai-event-subpipeline.ts): too broad, kills legit PII-scrubbing and property normalization for AI events.
  3. (chosen) $lib-based gate inside the template: matches the curated lists' actual scope (browser-tuned heuristics), covers $ai_* via posthog-python / posthog-node and also fixes server-side identify / custom / webhook traffic and mobile traffic.

Verified $lib conventions before relying on them: posthog-js -> 'web', posthog-js-lite -> 'js', every server SDK -> 'posthog-<lang>', mobile SDKs -> 'posthog-ios' / 'posthog-android' / 'posthog-react-native' / 'posthog-flutter'. No ingestion-side normalization of $lib. posthog-langchain is not a $lib value (langchain events ride posthog-python with $ai_framework='langchain').
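
The conventions above can be written down as a small lookup, sketched here in TypeScript. The table only restates the SDK-to-`$lib` values listed in this PR; `LIB_BY_SDK` and `curatedListsApply` are hypothetical names, and the exact string comparison is an assumption about how the gate compares values.

```typescript
// SDK -> $lib value, as listed in the paragraph above (server SDKs limited to
// the two named in this PR).
const LIB_BY_SDK: { [sdk: string]: string } = {
    'posthog-js': 'web',
    'posthog-js-lite': 'js',
    'posthog-python': 'posthog-python',
    'posthog-node': 'posthog-node',
    'posthog-ios': 'posthog-ios',
    'posthog-android': 'posthog-android',
    'posthog-react-native': 'posthog-react-native',
    'posthog-flutter': 'posthog-flutter',
}

// Hypothetical helper: the curated lists run only for browser $lib values
// or when $lib is missing. Comparison is exact, since $lib is not normalized.
function curatedListsApply(lib: string | undefined): boolean {
    return lib === undefined || lib === 'web' || lib === 'js'
}

// SDKs whose events are still checked against the curated lists.
const stillFiltered = Object.keys(LIB_BY_SDK).filter((sdk) => curatedListsApply(LIB_BY_SDK[sdk]))
console.log(stillFiltered) // ['posthog-js', 'posthog-js-lite']
```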

Scope decision: the gate covers only the curated lists. Customer-configured customBotPatterns and customIpPrefixes still apply for every event. Initial draft skipped the entire function for non-browser $lib, which would have silently changed customer config behaviour - narrowed after a second pass.

Missing $lib is treated as browser (curated lists still run) - that's the conservative choice for raw HTTP clients hitting /capture/ without identifying themselves. Forging $lib=posthog-python to bypass the filter is theoretically possible but the existing filter is trivially bypassed by changing the UA string; the template targets crawler/datacenter noise, not adversarial abuse.

@dmarchuk dmarchuk marked this pull request as ready for review June 1, 2026 09:06
@dmarchuk dmarchuk requested a review from a team June 1, 2026 09:07
greptile-apps Bot commented Jun 1, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nodejs/src/cdp/templates/_transformations/bot-detection/bot-detection.template.test.ts:331-345
**Duplicate test scenario**

This test is functionally identical to the existing `'should filter out known bot user agent'` test on line 40: same Googlebot UA, no `$lib`, same `DEFAULT_INPUTS`, same expectation of `execResult` being falsy. Both implicitly exercise the "missing `$lib` -> `is_browser_traffic = true`" path. The conservative-fallback intent is already captured by the in-code comment on `is_browser_traffic`, so this test adds no new coverage and could be removed or, preferably, folded into the `it.each` above by adding an `undefined` (or absent) `$lib` case alongside `'web'` and `'js'`.

Reviews (1): Last reviewed commit: "fix(cdp): narrow bot-filter bypass to cu..." | Re-trigger Greptile

@dmarchuk dmarchuk merged commit 113690a into master Jun 1, 2026
150 of 152 checks passed
@dmarchuk dmarchuk deleted the dmarchuk/bot-filter-bypass-non-web-lib branch June 1, 2026 11:46
deployment-status-posthog Bot commented Jun 1, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-01 12:38 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-01 12:52 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-01 12:55 UTC | Run |
