feat: SPA rendering support via Playwright fallback #7
JDRanpariya wants to merge 4 commits into Dhravya:main
Detects JavaScript-rendered SPAs and uses headless Chromium to extract content that fetch() alone cannot reach. Adds route discovery from JS bundles, hash-based routing support, UA rotation, and robots.txt validation. Fixes empty output on sites like paperclip.gxl.ai/docs (0→51KB).
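A minimal sketch of how an SPA shell check could work, assuming a heuristic based on an empty mount element and very little visible text. The names `looksLikeSpaShell`, `visibleText`, and `MIN_VISIBLE_TEXT_LENGTH`, the list of mount ids, and the threshold are illustrative assumptions, not the logic in this PR's detect.ts.

```typescript
// Hypothetical threshold: pages with less visible text than this are treated as shells.
const MIN_VISIBLE_TEXT_LENGTH = 200;

/** Remove scripts, styles, and tags, leaving roughly the text a reader would see. */
function visibleText(html: string): string {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, " ")
    .replace(/<style\b[\s\S]*?<\/style>/gi, " ")
    .replace(/<noscript\b[\s\S]*?<\/noscript>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

/**
 * Hypothetical detector: returns true when statically fetched HTML looks like
 * an empty JavaScript shell that needs a browser to render.
 */
function looksLikeSpaShell(html: string): boolean {
  // Common mount points such as <div id="root"></div> with nothing inside.
  const hasEmptyMount =
    /<div[^>]*\bid=["'](?:root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);
  // Without any script tag, a browser would not add content either.
  const hasScripts = /<script\b/i.test(html);
  const text = visibleText(html);
  return hasScripts && (hasEmptyMount || text.length < MIN_VISIBLE_TEXT_LENGTH);
}
```

A caller would run this on the static fetch() result and only launch the headless browser when it returns true, so ordinary server-rendered pages keep the fast path.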
- Prevent path traversal via hash fragments (validate output stays under outDir)
- Surface clear error when Chromium is not installed
- Fix TypeScript strict null check in detect.ts
- Limit worker count to 4 in browser mode to prevent memory exhaustion
addressed the fixes in 1a243a7
Reviewed the fixes in 1a243a7 -- the 4 original issues were addressed, but 1 fix is incomplete and 2 issues remain. Found 3 issues (2 high, 1 medium).
Testing Results
Ran 52 unit tests and 2 E2E integration tests against the current PR HEAD + fix commit.
Bug 1 -- Operator precedence error in

| Test file | Tests | Status |
|---|---|---|
| test_detect.ts -- SPA shell detection | 13 | PASS |
| test_ua.ts -- User-Agent rotation | 5 | PASS |
| test_routes.ts -- JS bundle route extraction | 9 | PASS |
| test_write.ts -- hash-URL file paths | 9 | PASS |
| test_path_traversal.ts -- path traversal guard | 2 | PASS |
| test_discover_hash.ts -- hash link extraction | 9 | PASS |
| test_renderer.ts -- Playwright renderer | 5 | PASS |
E2E Integration Tests (2/2 pass)
E2E 1 -- JS bundle route discovery + Playwright rendering: Static fetch of a React SPA returned empty <div id="root"></div>. After running webpull, Playwright rendered the page and extracted 3 pages with full content that only exists after JS execution. Browser rendering confirmed working.
E2E 2 -- Worker browser fallback: Verified that individual workers correctly fall back to Playwright when they detect an SPA shell, producing content that only exists after JS execution.
- Fix hash link extraction logic (remove confusing double-branch)
- Convert string concatenation to template literals
- Fix import ordering per biome rules
- Fix noAssignInExpressions in routes.ts
- Fix formatting in detect.ts, renderer.ts, ua.ts
- Use relative() path check for robust traversal prevention
- Update README with SPA support and Playwright requirement
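The relative() path check mentioned above could look roughly like the sketch below. The function name `isInsideOutDir` and its signature are hypothetical; the thread does not show the PR's actual code. The idea is that a plain string-prefix comparison can be fooled by sibling directories such as `out-evil` next to `out`, while comparing the relative path from the output directory to the target is not.

```typescript
import { isAbsolute, relative, resolve, sep } from "node:path";

/**
 * Hypothetical guard: returns true only when `candidate`, resolved against
 * `outDir`, names a path strictly inside `outDir`.
 */
function isInsideOutDir(outDir: string, candidate: string): boolean {
  const root = resolve(outDir);
  const target = resolve(root, candidate);
  const rel = relative(root, target);

  // An empty result means the target is the output directory itself, not a file in it.
  if (rel === "") return false;
  // A leading ".." segment means the target climbed out of the output directory.
  if (rel === ".." || rel.startsWith(`..${sep}`)) return false;
  // On Windows, a target on another drive yields an absolute path here.
  return !isAbsolute(rel);
}
```

A writer would call this just before creating the file and skip (or report) any URL whose derived path fails the check. A directory literally named `..foo` still passes, because only a whole `..` segment is rejected.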
addressed the fixes
Reviewed commit b54d4e2 -- the 3 previously flagged issues are all properly fixed and tsc --noEmit passes. Two minor items remain (1 medium, 1 low).
- Fix hash length check from >2 to >1 (include root "#/" route)
- Add fullUrl before slicing to prevent exceeding --max by one
- Apply same threshold fix in index.ts
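The two fixes above can be illustrated with a small sketch. The helpers `isHashRoute` and `collectHashRoutes` are hypothetical names, and the extra requirement that a route start with `#/` is an assumption made here to skip in-page anchors; the PR's real extraction code is not shown in this thread.

```typescript
/**
 * Hypothetical check: a hash counts as a route when it has more than one
 * character and the character after "#" is "/". Using > 1 rather than > 2
 * keeps the root route "#/" (length 2) while still rejecting a bare "#".
 */
function isHashRoute(hash: string): boolean {
  return hash.length > 1 && hash[1] === "/";
}

/**
 * Hypothetical collector: seeds the list with the start URL before applying
 * the limit, so the total number of pages never exceeds `max`.
 */
function collectHashRoutes(fullUrl: string, hrefs: string[], max: number): string[] {
  const base = fullUrl.split("#")[0];
  // A Set preserves insertion order and drops duplicates.
  const urls = new Set<string>([fullUrl]);
  for (const href of hrefs) {
    if (isHashRoute(href)) urls.add(base + href);
  }
  // Slice after the start URL is already in the list, not before.
  return Array.from(urls).slice(0, max);
}
```

If the start URL were appended after slicing the discovered routes to `max`, the result would hold `max + 1` entries, which is the off-by-one the commit message describes.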
addressed the fixes
Reviewed commit 6e8f1e7 -- all previously flagged issues are now resolved. No new issues found. Summary of fixes across all rounds:
LGTM.
Summary
Adds SPA rendering support, using Playwright as a fallback when a plain fetch() returns an empty JavaScript shell. #6
Problem
webpull produces empty output on any site that renders its content client-side (React, Vue, or Svelte SPAs).
For example, a static fetch of a site such as paperclip.gxl.ai/docs returned 0 bytes of content.
Solution
Now works with the following site:
Notes