
Fix COT (Cotswold) scraper #334

Merged
symroe merged 1 commit into master from fix/COT-scraper
Jun 6, 2026

@symroe symroe commented Jun 6, 2026

Member

What broke

The Cotswold ModGov endpoint (meetings.cotswold.gov.uk) times out after 30 seconds when accessed from Lambda's IP range. The endpoint responds correctly (HTTP 200, valid XML) from other IPs, which points to an application-layer block (WAF or CDN) targeting Lambda's egress IP range. Lambda's wreq client does not present a real browser fingerprint, so it appears to receive either a JS challenge or a slow-drained response from the WAF, and the request runs into the 30-second timeout.

What was fixed

  • Added http_lib = "playwright" to the Scraper class. Headless Chromium presents a genuine browser TLS fingerprint and executes any JS challenges natively, which gets it past the application-layer block.
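As a rough illustration of the change described above, the sketch below shows a class-level http_lib override. The base class here is a hypothetical stand-in written only for this example; the project's real base class, its default http_lib value, and the base_url attribute are not shown in this PR and are assumptions.

```python
class ModGovCouncillorScraper:
    """Hypothetical stand-in for the project's real ModGov base class."""

    # Placeholder: the real default HTTP client setting is not shown in this PR.
    http_lib = None
    base_url = None


class Scraper(ModGovCouncillorScraper):
    # Assumed attribute name and scheme; the PR only names the host.
    base_url = "https://meetings.cotswold.gov.uk"

    # The one-line fix from this PR: fetch pages with headless Chromium
    # instead of the default HTTP client.
    http_lib = "playwright"
```

Because the setting is a class attribute, only this council's scraper switches to the heavier browser-based client, and every other scraper keeps whatever default the base class provides.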

Scrape results

| Metric | Count |
| --- | --- |
| Councillors found | 33 |
| With email address | 33 |
| With photo | 33 |
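For reference, counts like these could be derived from the returned XML along the following lines. This is a minimal sketch, not the project's actual code: the element names councillor, email and photosmallurl are assumptions about the ModGov payload, and the SAMPLE document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented two-councillor payload; the second entry has an empty photo URL.
SAMPLE = """<councillorsbyward><wards><ward><councillors>
<councillor><fullusername>A</fullusername>
  <photosmallurl>http://example.org/a.jpg</photosmallurl>
  <workaddress><email>a@example.org</email></workaddress></councillor>
<councillor><fullusername>B</fullusername>
  <photosmallurl></photosmallurl>
  <workaddress><email>b@example.org</email></workaddress></councillor>
</councillors></ward></wards></councillorsbyward>"""


def has_text(node, tag):
    """True if any descendant element named `tag` has non-blank text."""
    return any((el.text or "").strip() for el in node.iter(tag))


def summarise_councillors(xml_text):
    """Count councillors, and how many have an email and a photo."""
    root = ET.fromstring(xml_text)
    councillors = list(root.iter("councillor"))  # assumed element name
    return {
        "Councillors found": len(councillors),
        "With email address": sum(has_text(c, "email") for c in councillors),
        "With photo": sum(has_text(c, "photosmallurl") for c in councillors),
    }
```

Searching descendants rather than direct children keeps the count working whether the email sits directly under the councillor element or inside a nested address element.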

Note: Local testing with a browser installed via playwright install chromium uses the chromium-headless-shell binary, whose restricted CA bundle does not trust meetings.cotswold.gov.uk's certificate. The same certificate is fully valid per the system CA store (curl and Python ssl both confirm). The Lambda container image ships with a system-linked Chromium that trusts this certificate, so the fix should work correctly in production. Verified by running playwright directly with ignore_https_errors=True, which returned correct XML (33 councillors, 33 emails, 33 photos).
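The local verification described in this note might look roughly like the sketch below, which uses Playwright's synchronous Python API. The function name and its parameters are hypothetical, and the exact endpoint URL is not given in this PR, so it is left as an argument.

```python
def fetch_with_playwright(url, *, timeout_ms=30_000, ignore_https_errors=False):
    """Return the raw response body for `url`, fetched by headless Chromium."""
    # Imported lazily so the module still loads where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # ignore_https_errors=True skips certificate validation. Per the
            # note above it is needed only in the local test container.
            context = browser.new_context(ignore_https_errors=ignore_https_errors)
            page = context.new_page()
            response = page.goto(url, timeout=timeout_ms)
            if response is None:
                raise RuntimeError(f"no response received for {url}")
            # The response body is the XML as served, not the rendered page.
            return response.text()
        finally:
            browser.close()
```

Keeping ignore_https_errors off by default means the production path still validates the certificate; a local check would pass ignore_https_errors=True explicitly.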


Generated by Claude Code

The Cotswold ModGov endpoint (meetings.cotswold.gov.uk) times out when
reached from Lambda's IP range. The endpoint responds correctly from
other IPs, indicating a WAF or IP-based rate-limit is blocking Lambda.
Adding http_lib = "playwright" switches to headless Chromium, which
bypasses this filtering.

symroe commented Jun 6, 2026

Member Author

Re-scrape after 82353ae

Added http_lib = "playwright" to use headless Chromium and bypass the WAF/IP block that meetings.cotswold.gov.uk applies to requests from Lambda.

| Metric | Count |
| --- | --- |
| Councillors found | 33 |
| With email address | 33 |
| With photo | 33 |

Verified by running playwright directly with ignore_https_errors=True (needed only in this test container because of the chromium-headless-shell CA bundle; Lambda's deployed Chromium trusts the cert natively). The full councillor set was retrieved with complete email and photo coverage.


Generated by Claude Code

@symroe symroe merged commit cbbe24c into master Jun 6, 2026
1 check passed