Fix empty API results: switch CIE source to papersdaddy with download proxy by ttttonyhe · Pull Request #6 · Snapaper/snapaper-nodejs

ttttonyhe · 2026-05-23T07:04:09Z

Summary

Root cause: pastpapers.papacambridge.com and pastpapers.co moved behind Cloudflare's Managed Challenge — the legacy crawler library couldn't solve the JS challenge, returned empty arrays, and those got cached in Redis for a month. The historical fallback sources (gceguide.cc/.xyz/.com) are dead or domain-parked.
Fix: Repoint ppca papers/years/cates to papersdaddy.com (active, 2002–2026, no anti-bot wall) via a new wrapper. Paper download URLs route through a server-side /api/download/* proxy that resolves papersdaddy's short-lived tokenized URL, strips the per-page "PapersDaddy" and "Downloaded for free from www.papersdaddy.com" watermarks via pdf-lib, and serves the cleaned PDF with proper Content-Disposition, Range, and CORS exposure.
Hardening: apicache no longer caches non-200 responses (so a transient upstream failure can't poison the 1-month TTL again) and skips the /api/download path entirely (no multi-MB binaries in Redis). Redis connection errors no longer crash the process. trust proxy is on so paper URLs are correctly https:// behind Fly's edge.
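
The caching rule described under Hardening can be sketched as a single predicate. This is a minimal illustration, not the code from app.js: the name `shouldCache` and the shape of the `req`/`res` objects are assumptions. A function of this shape could be adapted as the toggle argument of apicache's middleware, though the PR's actual wiring is not shown here.

```javascript
// Hypothetical helper (not from the PR): decides whether a response
// is eligible to be stored in the cache.
const DOWNLOAD_PREFIX = '/api/download';

function shouldCache(req, res) {
  // Never cache the download proxy: its responses are multi-MB binaries.
  const isDownload =
    req.path === DOWNLOAD_PREFIX || req.path.startsWith(DOWNLOAD_PREFIX + '/');
  if (isDownload) return false;

  // Only successful responses may be cached, so a transient upstream
  // failure (an error status) is never stored for the full TTL.
  return res.statusCode === 200;
}
```

With this rule, a 502 from the upstream is recomputed on the next request instead of being served from Redis for a month.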

Production verification (https://node.snapaper.com)

/api/years/ppca/as-and-a-level/mathematics-(9709) → 60 sessions, newest 2026-Oct-Nov ✓
/api/papers/ppca/as-and-a-level/mathematics-(9709)/2025-May-Jun → 50 papers, URLs are https://node.snapaper.com/api/download/... ✓
/api/cates/ppca/as-and-a-level → 120 subjects ✓
GET /api/download/.../9709_s25_qp_11.pdf → 200 application/pdf, %PDF header, pdftotext | grep PapersDaddy = 0 ✓
Range: bytes=0-99 → 206 Partial Content, exactly 100 bytes, Content-Range: bytes 0-99/411664 ✓
?download=1 → Content-Disposition: attachment ✓
The originally failing URLs from the issue (/api/papers/ppca/as-and-a-level/economics-(9708)/2001-Nov and /api/years/ppca/as-and-a-level/mathematics-(9709)) now behave as expected: the latter returns full data, and the former returns 502 (upstream 404) because papersdaddy's Economics data starts at 2002. That is a limit of the upstream data set, not a bug.
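
The Range check above (bytes=0-99 against a 411664-byte file) follows standard single-range semantics. The sketch below shows one way to compute the slice and the Content-Range value; `resolveRange` is a hypothetical name, and routes/download.js may handle this differently.

```javascript
// Hypothetical helper (not from the PR): resolves a single-range
// "Range: bytes=..." header against a buffer of known size.
// Returns null when the header is missing, malformed, or unsatisfiable.
function resolveRange(header, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header || '');
  if (!match || (match[1] === '' && match[2] === '')) return null;

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffixLength = Number(match[2]);
    if (suffixLength === 0) return null;
    start = Math.max(size - suffixLength, 0);
    end = size - 1;
  } else {
    start = Number(match[1]);
    // Open-ended form "bytes=N-" runs to the end; clamp oversized ends.
    end = match[2] === '' ? size - 1 : Math.min(Number(match[2]), size - 1);
  }

  if (start > end || start >= size) return null;
  return {
    start,
    end,
    length: end - start + 1,
    contentRange: `bytes ${start}-${end}/${size}`,
  };
}
```

A non-null result maps to a 206 response carrying `contentRange` and exactly `length` bytes. What the proxy does for a null result (full 200 or 416) is not stated in the PR.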

Test plan

Verify /api/cates/ppca/{as-and-a-level,igcse,o-level} return their expected subject lists in the production environment
Verify /api/years/ppca/... returns newest-first sessions including the just-released 2026 entries
Verify /api/papers/ppca/.../{year} returns paper objects with https:// URLs pointing at our /api/download proxy
Verify /api/download/... serves a watermark-free PDF (cross-checked with pdftotext)
Verify Range requests return 206 with correct Content-Range
Verify ?download=1 switches Content-Disposition to attachment
Verify CORS preflight exposes Content-Disposition, Content-Length, Content-Range, Accept-Ranges
Verify HEAD requests return headers without bodies
Smoke-test from the actual snapaper.com frontend (browser flow: list → year picker → paper viewer/download)
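
For the ?download=1 item in the plan above, the header switch can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, and the default disposition without the query flag is assumed to be inline, which the PR does not state explicitly.

```javascript
// Hypothetical helper (not from the PR): builds a Content-Disposition
// value from the original filename and the parsed query string.
function contentDisposition(filename, query) {
  // ?download=1 forces a save dialog; otherwise the browser may render
  // the PDF in place (assumed default).
  const type = query && query.download === '1' ? 'attachment' : 'inline';

  // Replace characters that would break the quoted filename parameter.
  const safeName = String(filename).replace(/["\\\r\n]/g, '_');
  return `${type}; filename="${safeName}"`;
}
```

Because Content-Disposition is in the exposed CORS header list, a cross-origin fetch() from the frontend can read this value.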

Notes for reviewers

The pre-deploy cache had ~1200 stale empty entries — flushed via a one-shot Node script connecting to Upstash from inside the Fly machine. The new statusCodes: { include: [200] } apicache config prevents this from recurring.
Old paper download URLs that pointed directly at gceguide/papacambridge are now replaced with self-hosted proxy URLs (/api/download/cambridge/...). The frontend treats them as plain <a href> targets so no frontend change is needed.
The original filename (9709_s25_qp_11.pdf) is preserved end-to-end so students can still recognize the standard CAIE naming.
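
To show why preserving the filename matters, here is a small sketch that decodes a name such as 9709_s25_qp_11.pdf into its parts. The parser is illustrative and not part of the PR. The `s` to May-Jun and `w` to Oct-Nov mappings match the session labels seen in the API output above; the `m` to Feb-Mar mapping, its label format, and the 20xx century are assumptions.

```javascript
// Hypothetical parser (not from the PR) for CAIE-style paper filenames.
// The 'm' entry and its label are assumptions; 's' and 'w' match the
// session labels used by the API responses in this PR.
const SESSION_LABELS = { m: 'Feb-Mar', s: 'May-Jun', w: 'Oct-Nov' };

function parsePaperName(name) {
  const match = /^(\d{4})_([msw])(\d{2})_([a-z]{2})(?:_(\w+))?\.pdf$/.exec(name);
  if (!match) return null;

  const [, syllabus, season, yy, type, component] = match;
  return {
    syllabus,                                       // e.g. "9709"
    session: `20${yy}-${SESSION_LABELS[season]}`,   // assumes years 2000-2099
    type,                                           // e.g. "qp" (question paper)
    component: component || null,                   // e.g. "11", absent on some files
  };
}
```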

🤖 Generated with Claude Code

The previous upstreams (pastpapers.papacambridge.com, pastpapers.co) moved behind Cloudflare Managed Challenge. The legacy crawler couldn't solve the JS challenge, returned empty arrays, and those empties were cached in Redis for a month. Replace them with papersdaddy.com (active, has 2002-2026, no anti-bot wall), route paper downloads through a server proxy that strips the per-page watermark, and harden the cache so a transient upstream failure can't poison it again.

- routes/papacambridge_com.js, routes/years.js, routes/cates.js: ppca papers/years/cates now go to papersdaddy via the new wrapper
- utils/papersdaddy_wrapper.js: fetchYears/fetchPapers/fetchCates + resolveDownload (extracts the short-lived tokenized PDF URL from papersdaddy's viewer page)
- routes/download.js: buffers the upstream PDF, runs stripWatermark, serves with proper Content-Disposition, ?download=1 for forced save, Range request support, and HEAD; non-PDF assets still pipe-stream
- utils/watermark_stripper.js: empties the standalone "PapersDaddy" and "Downloaded for free from www.papersdaddy.com" content streams via pdf-lib (matches the 22-byte hex sequences of the watermark text)
- app.js: apicache now skips /api/download and only stores 200s, so multi-MB binaries and error responses can't end up in Redis
- utils/redis_wrapper.js: Redis connection errors no longer crash the process; caching just disables instead
- config/cors.config.js: expose Content-Disposition, Content-Length, Content-Range, Accept-Ranges so cross-origin fetch() can read them

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
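
The watermark matching mentioned for utils/watermark_stripper.js can be illustrated without pdf-lib. The sketch below rests on one reading of the commit message, which is an assumption: that the watermark text appears in a content stream as a PDF hex string literal. Hex-encoding the 11-character "PapersDaddy" yields 22 hex digits, which may be the sequence the message refers to. The function names are hypothetical, and the real stripper may match differently.

```javascript
// Watermark strings quoted in the PR description.
const WATERMARKS = [
  'PapersDaddy',
  'Downloaded for free from www.papersdaddy.com',
];

// Encode text the way a PDF hex string literal would carry it
// (assumes single-byte character codes; an assumption about the PDFs).
function toPdfHex(text) {
  return Buffer.from(text, 'latin1').toString('hex').toUpperCase();
}

// Hypothetical check (not from the PR): true when a decoded content
// stream draws one of the watermark strings as a <hex> literal.
function isWatermarkStream(streamText) {
  // Whitespace inside PDF hex strings is insignificant, so drop it
  // before comparing; hex digits may be upper or lower case.
  const normalized = streamText.replace(/\s+/g, '').toUpperCase();
  return WATERMARKS.some((mark) => normalized.includes(`<${toPdfHex(mark)}>`));
}
```

A stream flagged this way would then be emptied, leaving the rest of the page content untouched.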

Fly terminates TLS and forwards as plain HTTP to the app, so req.protocol was "http" and the /api/papers response built http://node.snapaper.com/api/download/... URLs. The frontend runs on https://snapaper.com, so clicking those links triggers mixed-content blocks in modern browsers. With trust proxy on, Express honors X-Forwarded-Proto and req.protocol returns "https".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
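
The effect of trust proxy on generated URLs can be modeled in a few lines. This is a simplified stand-in for what Express does internally, not the PR's code: `resolveProtocol` and `buildDownloadUrl` are hypothetical names, the request object is a plain stub, and the example path is made up. In the app itself the setting is enabled through `app.set('trust proxy', ...)`; the exact value used is not shown in the PR.

```javascript
// Hypothetical model of protocol detection behind a TLS-terminating proxy.
function resolveProtocol(req, trustProxy) {
  // What the app sees on its own socket: plain HTTP behind the edge.
  const direct = req.encrypted ? 'https' : 'http';
  if (!trustProxy) return direct;

  // When the proxy is trusted, prefer the protocol the client used.
  // The header may hold a comma-separated list; the first entry wins.
  const forwarded = req.headers['x-forwarded-proto'] || direct;
  return forwarded.split(',')[0].trim();
}

// Builds an absolute URL for a proxy path (path value is illustrative).
function buildDownloadUrl(req, trustProxy, path) {
  return `${resolveProtocol(req, trustProxy)}://${req.headers.host}${path}`;
}
```

With the stubbed request below, the untrusted case yields an http:// link (the mixed-content bug) and the trusted case yields https://.

```javascript
const stubReq = {
  encrypted: false,
  headers: { host: 'node.snapaper.com', 'x-forwarded-proto': 'https' },
};
buildDownloadUrl(stubReq, false, '/api/download/example.pdf');
buildDownloadUrl(stubReq, true, '/api/download/example.pdf');
```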

ttttonyhe and others added 2 commits May 23, 2026 02:50

Copilot AI review requested due to automatic review settings May 23, 2026 07:04

Copilot started reviewing on behalf of ttttonyhe May 23, 2026 07:04

ttttonyhe merged commit 0f3c758 into master May 23, 2026
1 check failed

ttttonyhe review requested due to automatic review settings May 23, 2026 07:24

ttttonyhe merged 2 commits into master from fix-empty-api-results