Fix empty API results: switch CIE source to papersdaddy with download proxy #6

Merged
ttttonyhe merged 2 commits into master from fix-empty-api-results
May 23, 2026
@ttttonyhe
Member

Summary

  • Root cause: pastpapers.papacambridge.com and pastpapers.co moved behind Cloudflare's Managed Challenge — the legacy crawler library couldn't solve the JS challenge, returned empty arrays, and those got cached in Redis for a month. The historical fallback sources (gceguide.cc/.xyz/.com) are dead or domain-parked.
  • Fix: Repoint ppca papers/years/cates to papersdaddy.com (active, 2002–2026, no anti-bot wall) via a new wrapper. Paper download URLs route through a server-side /api/download/* proxy that resolves papersdaddy's short-lived tokenized URL, strips the per-page "PapersDaddy" and "Downloaded for free from www.papersdaddy.com" watermarks via pdf-lib, and serves the cleaned PDF with proper Content-Disposition, Range support, and CORS header exposure.
  • Hardening: apicache no longer caches non-200 responses (so a transient upstream failure can't poison the 1-month TTL again) and skips the /api/download path entirely (no multi-MB binaries in Redis). Redis connection errors no longer crash the process. trust proxy is on so paper URLs are correctly https:// behind Fly's edge.
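
The Redis part of the hardening bullet relies on a general Node.js behaviour: an EventEmitter that emits 'error' with no listener attached throws, which takes the process down. A minimal sketch of the "disable caching instead of crashing" idea follows. `guardCacheClient` and the `enabled` flag are hypothetical names, not the contents of utils/redis_wrapper.js, and the 'ready' event is an assumption about the client in use.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: wraps any EventEmitter-style client (such as a Redis
// client) so that connection errors disable caching instead of crashing.
function guardCacheClient(client, log = console.warn) {
  const state = { enabled: true };
  // Without an 'error' listener, Node throws when an emitter emits 'error'.
  client.on('error', (err) => {
    state.enabled = false;
    log(`cache disabled: ${err.message}`);
  });
  // Assumption: the client emits 'ready' once it is usable again.
  client.on('ready', () => {
    state.enabled = true;
  });
  return state;
}

// Usage with a stand-in client; a real Redis client would be passed instead.
const fakeClient = new EventEmitter();
const cacheState = guardCacheClient(fakeClient, () => {});
fakeClient.emit('error', new Error('ECONNREFUSED'));
// cacheState.enabled is now false and the process is still running.
```

Route handlers would then consult `cacheState.enabled` before reading from or writing to the cache, so an outage degrades to uncached responses.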

Production verification (https://node.snapaper.com)

  • /api/years/ppca/as-and-a-level/mathematics-(9709) → 60 sessions, newest 2026-Oct-Nov
  • /api/papers/ppca/as-and-a-level/mathematics-(9709)/2025-May-Jun → 50 papers, URLs are https://node.snapaper.com/api/download/...
  • /api/cates/ppca/as-and-a-level → 120 subjects ✓
  • GET /api/download/.../9709_s25_qp_11.pdf → 200 application/pdf, %PDF header, pdftotext | grep PapersDaddy = 0 ✓
  • Range: bytes=0-99 → 206 Partial Content, exactly 100 bytes, Content-Range: bytes 0-99/411664
  • ?download=1 → Content-Disposition: attachment
  • The originally failing URLs from the issue (/api/papers/ppca/as-and-a-level/economics-(9708)/2001-Nov and /api/years/ppca/as-and-a-level/mathematics-(9709)) no longer return empty arrays: the latter returns full data, and the former returns 502 (upstream 404) because papersdaddy's Economics data starts at 2002 (an expected data gap, not a bug)
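
The Range result above (206, 100 bytes, Content-Range: bytes 0-99/411664) can be reproduced with a small single-range parser over the buffered PDF. This is a sketch under assumptions: `parseByteRange` is a hypothetical helper, not the code in routes/download.js, and it deliberately ignores multi-range requests.

```javascript
// Hypothetical helper: parses a single-range "bytes=start-end" header against
// a buffered body of `size` bytes. Returns null when the header is absent,
// malformed, multi-range, or unsatisfiable.
function parseByteRange(header, size) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(header || '');
  if (!match || (match[1] === '' && match[2] === '')) return null;

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form "bytes=-N": the last N bytes of the body.
    const suffix = Number(match[2]);
    if (suffix === 0) return null;
    start = Math.max(size - suffix, 0);
    end = size - 1;
  } else {
    start = Number(match[1]);
    // An open-ended or oversized end is clamped to the last byte.
    end = match[2] === '' ? size - 1 : Math.min(Number(match[2]), size - 1);
  }
  if (start > end || start >= size) return null;

  return {
    start,
    end,
    length: end - start + 1,
    contentRange: `bytes ${start}-${end}/${size}`,
  };
}

const range = parseByteRange('bytes=0-99', 411664);
// range.length and range.contentRange feed Content-Length and Content-Range.
```

A real handler would also need to tell a missing header (serve 200 with the full body) apart from an unsatisfiable one (416); the sketch collapses both to null for brevity.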

Test plan

  • Verify /api/cates/ppca/{as-and-a-level,igcse,o-level} return their expected subject lists in the production environment
  • Verify /api/years/ppca/... returns newest-first sessions including the just-released 2026 entries
  • Verify /api/papers/ppca/.../{year} returns paper objects with https:// URLs pointing at our /api/download proxy
  • Verify /api/download/... serves a watermark-free PDF (cross-checked with pdftotext)
  • Verify Range requests return 206 with correct Content-Range
  • Verify ?download=1 switches Content-Disposition to attachment
  • Verify CORS preflight exposes Content-Disposition, Content-Length, Content-Range, Accept-Ranges
  • Verify HEAD requests return headers without bodies
  • Smoke-test from the actual snapaper.com frontend (browser flow: list → year picker → paper viewer/download)
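
For the ?download=1 item in the test plan, the header switch can be sketched as below. `buildContentDisposition` is a hypothetical helper, and using `inline` as the default disposition is an assumption; the PR only states that ?download=1 produces `attachment`.

```javascript
// Hypothetical helper: builds the Content-Disposition value for a proxied file.
// `forceDownload` would come from the ?download=1 query parameter.
function buildContentDisposition(filename, forceDownload) {
  // Assumption: the non-forced case uses "inline" so browsers open the viewer.
  const type = forceDownload ? 'attachment' : 'inline';
  // Drop characters that would break the quoted-string form.
  const safeName = filename.replace(/["\\\r\n]/g, '');
  // filename* carries a percent-encoded UTF-8 name (RFC 6266 / RFC 5987);
  // encodeURIComponent leaves ' ( ) * unescaped, so escape those as well.
  const encoded = encodeURIComponent(safeName).replace(
    /['()*]/g,
    (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase()
  );
  return `${type}; filename="${safeName}"; filename*=UTF-8''${encoded}`;
}

const inlineHeader = buildContentDisposition('9709_s25_qp_11.pdf', false);
const attachmentHeader = buildContentDisposition('9709_s25_qp_11.pdf', true);
```

Keeping the original filename in both forms means a saved file retains the CAIE naming regardless of how it was opened.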

Notes for reviewers

  • The pre-deploy cache had ~1200 stale empty entries — flushed via a one-shot Node script connecting to Upstash from inside the Fly machine. The new statusCodes: { include: [200] } apicache config prevents this from recurring.
  • Old paper download URLs that pointed directly at gceguide/papacambridge are now replaced with self-hosted proxy URLs (/api/download/cambridge/...). The frontend treats them as plain <a href> targets so no frontend change is needed.
  • The original filename (9709_s25_qp_11.pdf) is preserved end-to-end so students can still recognize the standard CAIE naming.
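
The two caching rules referenced in these notes (store only 200s, never store the download proxy) reduce to one predicate. The sketch below is illustrative: `shouldCache` is a hypothetical name, and the wiring shown in the comments is one way apicache could be configured, not necessarily what app.js does.

```javascript
// Hypothetical predicate mirroring the two caching rules in this PR:
// never cache the download proxy, and only cache 200 responses.
function shouldCache(req, res) {
  if (req.path.startsWith('/api/download')) return false;
  return res.statusCode === 200;
}

// With apicache, a predicate like this could be passed as the optional
// toggle argument of the middleware, next to the statusCodes option:
//   apicache.options({ statusCodes: { include: [200] } });
//   app.use(apicache.middleware('1 month', shouldCache));
```

With either mechanism, a transient 502 from the upstream is served once and then retried on the next request instead of being pinned for the full TTL.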

🤖 Generated with Claude Code

ttttonyhe and others added 2 commits May 23, 2026 02:50
The previous upstreams (pastpapers.papacambridge.com, pastpapers.co)
moved behind Cloudflare Managed Challenge — the legacy crawler couldn't
solve the JS challenge, returned empty arrays, and those empties were
cached in Redis for a month. Replace them with papersdaddy.com (active,
has 2002-2026, no anti-bot wall), route paper downloads through a server
proxy that strips the per-page watermark, and harden the cache so a
transient upstream failure can't poison it again.

- routes/papacambridge_com.js, routes/years.js, routes/cates.js — ppca
  papers/years/cates now go to papersdaddy via the new wrapper
- utils/papersdaddy_wrapper.js — fetchYears/fetchPapers/fetchCates +
  resolveDownload (extracts the short-lived tokenized PDF URL from
  papersdaddy's viewer page)
- routes/download.js — buffers the upstream PDF, runs stripWatermark,
  serves with proper Content-Disposition, ?download=1 for forced save,
  Range request support, and HEAD; non-PDF assets still pipe-stream
- utils/watermark_stripper.js — empties the standalone "PapersDaddy" and
  "Downloaded for free from www.papersdaddy.com" content streams via
  pdf-lib (matches the 22-byte hex sequences of the watermark text)
- app.js — apicache now skips /api/download and only stores 200s, so
  multi-MB binaries and error responses can't end up in Redis
- utils/redis_wrapper.js — Redis connection errors no longer crash the
  process; caching just disables instead
- config/cors.config.js — expose Content-Disposition, Content-Length,
  Content-Range, Accept-Ranges so cross-origin fetch() can read them

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
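
One possible reading of the "22-byte hex sequences" note for utils/watermark_stripper.js: the 11-character string "PapersDaddy" becomes 22 hex digits when written as a PDF hex string. The sketch below only shows that matching step on already-decoded content stream text. The function names are hypothetical, the single-byte encoding is an assumption about how the watermark is embedded, and the actual pdf-lib traversal and stream rewriting are not shown.

```javascript
// Watermark strings named in the commit message.
const WATERMARK_TEXTS = [
  'PapersDaddy',
  'Downloaded for free from www.papersdaddy.com',
];

// Hex-encode text as a PDF hex string literal (<...>) would carry it,
// assuming one byte per character.
function toPdfHex(text) {
  return Buffer.from(text, 'latin1').toString('hex').toUpperCase();
}

// Hypothetical check: does a decoded content stream contain a watermark as a
// hex string? Upper-casing both sides makes the hex comparison case-insensitive.
function looksLikeWatermarkStream(streamText) {
  const haystack = streamText.toUpperCase();
  return WATERMARK_TEXTS.some((text) => haystack.includes(toPdfHex(text)));
}

// Illustrative content stream that draws the watermark with a hex string.
const sampleStream = `BT /F1 10 Tf <${toPdfHex('PapersDaddy')}> Tj ET`;
const isWatermark = looksLikeWatermarkStream(sampleStream);
```

A stream flagged this way would then be emptied while ordinary page content is left untouched, which matches the "standalone content streams" wording above.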
Fly terminates TLS and forwards as plain HTTP to the app, so req.protocol
was "http" and the /api/papers response built http://node.snapaper.com/api/download/...
URLs. The frontend runs on https://snapaper.com, so clicking those links
triggers mixed-content blocks in modern browsers. With trust proxy on,
Express honors X-Forwarded-Proto and req.protocol returns "https".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
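
The effect of the trust proxy change described in this commit can be modelled in a few lines. This is a simplified illustration, not Express's source: `resolveProtocol` and `buildDownloadUrl` are hypothetical names, and the real setting is a one-line `app.set('trust proxy', ...)` call.

```javascript
// Simplified model of protocol resolution behind a TLS-terminating proxy.
function resolveProtocol({ socketEncrypted, forwardedProto, trustProxy }) {
  const direct = socketEncrypted ? 'https' : 'http';
  // Without trust proxy, the X-Forwarded-Proto header is ignored.
  if (!trustProxy || !forwardedProto) return direct;
  // The header may hold a comma-separated chain; use the first entry.
  return forwardedProto.split(',')[0].trim();
}

// Hypothetical URL builder for the download links returned by /api/papers.
function buildDownloadUrl(protocol, host, path) {
  return `${protocol}://${host}/api/download/${path}`;
}

// Fly forwards plain HTTP to the app and sets X-Forwarded-Proto: https.
const protocolWithoutTrust = resolveProtocol({
  socketEncrypted: false,
  forwardedProto: 'https',
  trustProxy: false,
});
const protocolWithTrust = resolveProtocol({
  socketEncrypted: false,
  forwardedProto: 'https',
  trustProxy: true,
});
```

Only the second case yields links that an https:// page can load without mixed-content blocking.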
Copilot AI review requested due to automatic review settings May 23, 2026 07:04
@ttttonyhe ttttonyhe merged commit 0f3c758 into master May 23, 2026
1 check failed
@ttttonyhe ttttonyhe review requested due to automatic review settings May 23, 2026 07:24