Remove ^Amazon CloudFront crawler signature #602

Merged
JayBizzle merged 3 commits into master from remove-amazon-cloudfront on May 10, 2026

@JayBizzle
Owner

Summary

  • Removes '^Amazon CloudFront' from src/Fixtures/Crawlers.php
  • Regenerates raw/Crawlers.{txt,json} via php export.php
  • Removes the Amazon CloudFront fixture from tests/data/user_agent/crawlers.txt
  • Adds Amazon CloudFront to tests/data/user_agent/devices.txt as a regression guard

Why

Fixes #594.

CloudFront is a reverse proxy / CDN. The Amazon CloudFront UA appears at the origin almost exclusively when CloudFront is fetching on behalf of a real end user (cache miss / origin shield / cache fill). For any site hosted behind CloudFront — which is a huge fraction of AWS deployments — classifying it as a crawler:

  • Silently undercounts real users in analytics
  • Can break auth flows or rate-limit logic that excludes "bots"
  • Affects the deployment pattern recommended by AWS itself
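
To make the effect concrete, here is a minimal sketch of how a `^Amazon CloudFront` signature changes the classification of an origin fetch. The helper `matchesAnySignature()` is hypothetical and is not CrawlerDetect's actual implementation; the case-insensitive flag and the three-entry signature list are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper (not the library's real code): returns true when
// any signature regex matches the given user agent.
function matchesAnySignature(string $userAgent, array $signatures): bool
{
    foreach ($signatures as $signature) {
        // '#' is used as the regex delimiter; 'i' (case-insensitive) is an assumption.
        if (preg_match('#' . $signature . '#i', $userAgent) === 1) {
            return true;
        }
    }
    return false;
}

// Signature list before this PR (reduced to the Amazon entries for brevity).
$before = [
    '^Amazon CloudFront',
    '^Amazon Simple Notification Service Agent$',
    '^Amazon-Route53-Health-Check-Service',
];

// Signature list after this PR: the CloudFront entry is gone.
$after = array_values(array_diff($before, ['^Amazon CloudFront']));

// User agent seen at the origin when CloudFront fetches for a real visitor.
$originFetch = 'Amazon CloudFront';

var_dump(matchesAnySignature($originFetch, $before)); // bool(true): treated as a crawler
var_dump(matchesAnySignature($originFetch, $after));  // bool(false): treated as a normal request
```

With the old list, any analytics or rate-limit code that skips "bots" would skip this request even though a real visitor triggered it.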

The signature has flip-flopped: it was added in 2020, removed in 2020, and re-added in 2023 (see #392, #410, #504).

The 2020 removal's reasoning still applies. The 2023 re-addition was anecdotal and didn't analyze whether the traffic was actually crawler-like or simply the reporter's own CloudFront distribution proxying real visitors.

Other CDNs (Fastly, Akamai, Bunny) aren't listed by their generic proxy UAs either, so this is also an inconsistency. Cloudflare-AlwaysOnline stays — that's a specific stale-cache-serving feature, not a generic proxy UA.

The two unambiguously-bot Amazon entries remain:

  • ^Amazon Simple Notification Service Agent$ (SNS HTTP deliveries)
  • ^Amazon-Route53-Health-Check-Service (Route53 health probes)
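
As a rough illustration of how these two remaining patterns behave, the sketch below runs them with `preg_match`. The sample user-agent strings, in particular the parenthesised Route53 suffix and the `/2.0` variant, are made-up examples and not captured traffic.

```php
<?php
// The two remaining Amazon signatures, wrapped in '#' delimiters.
$sns     = '#^Amazon Simple Notification Service Agent$#';
$route53 = '#^Amazon-Route53-Health-Check-Service#';

// '^...$' anchors both ends, so only the exact SNS agent string matches.
var_dump(preg_match($sns, 'Amazon Simple Notification Service Agent') === 1);     // bool(true)
var_dump(preg_match($sns, 'Amazon Simple Notification Service Agent/2.0') === 1); // bool(false), hypothetical variant

// Only '^' is anchored, so any suffix after the service name still matches.
$probe = 'Amazon-Route53-Health-Check-Service (ref example-id; report example)';  // illustrative suffix
var_dump(preg_match($route53, $probe) === 1);                                     // bool(true)

// Neither pattern matches the generic CloudFront proxy user agent.
var_dump(preg_match($sns, 'Amazon CloudFront') === 1);     // bool(false)
var_dump(preg_match($route53, 'Amazon CloudFront') === 1); // bool(false)
```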

The regression guard in devices.txt will fail the test suite if anyone re-adds the signature in the future without first removing the guard, prompting a fresh design discussion rather than another silent flip.
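
The guard idea can be sketched roughly as follows. This is not the project's test code: `findMisclassified()` is a hypothetical helper, the in-memory array stands in for lines read from devices.txt, and the case-insensitive flag is an assumption.

```php
<?php
// Hypothetical guard: returns every "known non-crawler" user agent that
// at least one crawler signature would match.
function findMisclassified(array $nonCrawlerAgents, array $signatures): array
{
    $offenders = [];
    foreach ($nonCrawlerAgents as $agent) {
        foreach ($signatures as $signature) {
            if (preg_match('#' . $signature . '#i', $agent) === 1) {
                $offenders[] = $agent;
                break; // one matching signature is enough to flag this agent
            }
        }
    }
    return $offenders;
}

// Stands in for the fixture line added to devices.txt.
$nonCrawlerAgents = ['Amazon CloudFront'];

// Signatures after this PR (Amazon entries only, for brevity).
$current = [
    '^Amazon Simple Notification Service Agent$',
    '^Amazon-Route53-Health-Check-Service',
];

// What the list would look like if someone re-added the signature.
$withReAdd = array_merge($current, ['^Amazon CloudFront']);

var_dump(findMisclassified($nonCrawlerAgents, $current));   // empty array: guard passes
var_dump(findMisclassified($nonCrawlerAgents, $withReAdd)); // contains 'Amazon CloudFront': guard fails
```

A test that asserts the returned array is empty would fail as soon as the signature comes back, which is the behaviour the guard is meant to provide.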

Test plan

  • php export.php regenerates raw/Crawlers.{txt,json} cleanly
  • vendor/bin/phpunit — all 18 tests pass (2,202,834 assertions)
  • No remaining CloudFront references anywhere in src/Fixtures/, raw/, or tests/data/user_agent/crawlers.txt

🤖 Generated with Claude Code

JayBizzle and others added 3 commits May 10, 2026 14:30
CloudFront's dominant traffic pattern at the origin is reverse-proxy
fetches on behalf of real end users (cache miss / origin shield), not
crawler activity. Classifying it as a crawler causes silent
analytics/auth/rate-limit bugs for any site hosted behind CloudFront.

The signature has flip-flopped (added 2020, removed 2020, re-added
2023 with anecdotal "getting tons of this lately" justification — see
#392, #410, #504). The 2020 removal had clear technical reasoning that
still applies. Removing again, with a regression guard in devices.txt
to lock in the decision.

`^Amazon Simple Notification Service Agent$` and
`^Amazon-Route53-Health-Check-Service` remain in the list — those are
unambiguously bots.

Fixes #594.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The fixture line in devices.txt was easy to delete by accident in a
future change. Replacing it with a named test that explicitly documents
the decision and the issue history, so a future re-add PR will fail
with a clearly-named assertion that prompts a deliberate choice instead
of a silent revert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JayBizzle merged commit 2eacd74 into master on May 10, 2026
10 checks passed
JayBizzle deleted the remove-amazon-cloudfront branch on May 10, 2026 13:53
Development

Successfully merging this pull request may close these issues.

Remove Amazon CloudFront from crawler list ?
