Fix COT (Cotswold) scraper #334
Merged
The Cotswold ModGov endpoint (meetings.cotswold.gov.uk) times out when reached from Lambda's IP range. The endpoint responds correctly from other IPs, indicating a WAF or IP-based rate-limit is blocking Lambda. Adding http_lib = "playwright" switches to headless Chromium, which bypasses this filtering.
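As a rough illustration of the change described above, the sketch below shows one way a class-level `http_lib` attribute could select a fetch backend. It is not the project's actual code: `BaseScraper`, `CotswoldScraper`, `FETCHERS`, `pick_fetcher`, and both fetch helpers are hypothetical names, and `urllib` merely stands in for the default non-browser client. Only `http_lib = "playwright"` comes from this PR.

```python
"""Sketch: choosing a fetch backend from a class-level http_lib attribute."""
from urllib.request import urlopen

TIMEOUT_SECONDS = 30  # mirrors the 30-second timeout described in this PR


def fetch_with_urllib(url: str) -> str:
    """Plain HTTP fetch; a stand-in for the default (non-browser) client."""
    with urlopen(url, timeout=TIMEOUT_SECONDS) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_playwright(url: str) -> str:
    """Fetch through headless Chromium so any JS challenge runs in a real browser."""
    # Imported lazily so this module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Playwright timeouts are given in milliseconds.
            response = page.goto(url, timeout=TIMEOUT_SECONDS * 1000)
            if response is None:
                raise RuntimeError(f"no response for {url}")
            return response.text()
        finally:
            browser.close()


# Hypothetical registry mapping http_lib values to fetch functions.
FETCHERS = {"urllib": fetch_with_urllib, "playwright": fetch_with_playwright}


class BaseScraper:  # hypothetical base class, not the project's real one
    http_lib = "urllib"

    def pick_fetcher(self):
        """Return the fetch function named by the class-level http_lib."""
        return FETCHERS[self.http_lib]

    def fetch(self, url: str) -> str:
        return self.pick_fetcher()(url)


class CotswoldScraper(BaseScraper):  # hypothetical name
    http_lib = "playwright"  # the one-line switch this PR describes
```

With this shape, a per-council override stays a single attribute, and the heavier browser dependency is only loaded by scrapers that opt in.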
Re-scrape after 82353aeAdded
Verified by running playwright directly with Generated by Claude Code |
What broke
The Cotswold ModGov endpoint (meetings.cotswold.gov.uk) times out after 30 seconds when accessed from Lambda's IP range. The endpoint responds correctly (HTTP 200, valid XML) from other IPs, indicating an application-layer block (WAF or CDN) that targets Lambda's egress IP range. Without a real browser fingerprint, Lambda's wreq client appears to receive either a JS challenge or a slow-drain response from the WAF, which results in the 30-second timeout.
What was fixed
Added http_lib = "playwright" to the Scraper class: headless Chromium bypasses application-layer blocks by presenting a genuine browser TLS fingerprint and executing any JS challenges natively.
Scrape results
Generated by Claude Code