Skip to content

feat: SPA rendering support via Playwright fallback#7

Open
JDRanpariya wants to merge 4 commits intoDhravya:mainfrom
JDRanpariya:feat/spa-rendering
Open

feat: SPA rendering support via Playwright fallback#7
JDRanpariya wants to merge 4 commits intoDhravya:mainfrom
JDRanpariya:feat/spa-rendering

Conversation

@JDRanpariya
Copy link
Copy Markdown

@JDRanpariya JDRanpariya commented May 2, 2026

Summary

Adds SPA rendering support using Playwright as a fallback when it returns an empty JavaScript shell. #6

Problem

produces empty output on any site that renders content client-side (React, Vue, Svelte SPAs).
For example, returned 0 bytes of content.

Solution

  1. Detect SPA shells, identify empty HTML with root div + JS
  2. Discover pages via JS bundle route parsing, headless browser nav extraction, or hash-routing detection
  3. Render content with Playwright when the worker detects an SPA shell

works with following site now

paperclip.gxl.ai
vitejs.dev
docs.logseq.com
linear.app
pglite.dev

Notes

  • Browser only launches when SPA is detected
  • Defuddle timeout increased from 300ms → 5s for complex rendered pages
  • Adds playwright as runtime dependency
  • logseq using obsidian like graph based docs, might not work properly in all cases but that has to be another PR

Detects JavaScript-rendered SPAs and uses headless Chromium to extract
content that fetch() alone cannot reach. Adds route discovery from JS
bundles, hash-based routing support, UA rotation, and robots.txt
validation.

Fixes empty output on sites like paperclip.gxl.ai/docs (0→51KB).
Copy link
Copy Markdown

@vorflux vorflux Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed -- found 4 issues (2 high, 2 medium).


Review with Vorflux

Comment thread src/write.ts
Comment thread src/renderer.ts Outdated
Comment thread src/detect.ts Outdated
Comment thread src/worker.ts
- Prevent path traversal via hash fragments (validate output stays under outDir)
- Surface clear error when Chromium is not installed
- Fix TypeScript strict null check in detect.ts
- Limit worker count to 4 in browser mode to prevent memory exhaustion
@JDRanpariya
Copy link
Copy Markdown
Author

addressed the fixes in 1a243a7

Copy link
Copy Markdown

@vorflux vorflux Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the fixes in 1a243a7 -- the 4 original issues were addressed, but 1 fix is incomplete and 2 issues remain. Found 3 issues (2 high, 1 medium).


Review with Vorflux

Comment thread src/write.ts Outdated
Comment thread src/index.ts Outdated
Comment thread src/discover.ts Outdated
@vorflux
Copy link
Copy Markdown

vorflux Bot commented May 2, 2026

Testing Results

Ran 52 unit tests and 2 E2E integration tests against the current PR HEAD + fix commit 1a243a7. Three additional bugs were found during testing:

Bug 1 -- Operator precedence error in discover.ts (line 194)

Already flagged in the inline review above. The !m[1].startsWith("#/") === false condition is equivalent to m[1].startsWith("#/") due to ! binding before ===. Both branches match the same links; non-#/ hash routes (e.g. #!/guide) are silently dropped.

Bug 2 -- Off-by-one in hash-URL path threshold (write.ts line 11, index.ts line 136)

Both files check parsed.hash.length > 2, but "#/".length === 2, so the hash-router root URL (e.g. https://example.com/app#/) silently fails the check and falls through to the pathname branch, writing app.md instead of the correct index.md. All deeper hash routes (#/docs/intro, length >= 7) are unaffected.

Fix: Change > 2 to > 1 in both files. Bare # normalizes to "" (length 0) in the URL API, so > 1 correctly skips it while capturing #/ and longer paths.

Bug 3 -- TypeScript compile error (index.ts line 110)

pool was changed to WorkerPool | null but pool.pullAll(...) was not updated. bunx tsc --noEmit fails with TS18047: 'pool' is possibly 'null'. Fix with a non-null assertion (pool!.pullAll(...)) or a local variable pattern.


Unit Test Summary (52/52 pass after fixes)

Test file Tests Status
test_detect.ts -- SPA shell detection 13 PASS
test_ua.ts -- User-Agent rotation 5 PASS
test_routes.ts -- JS bundle route extraction 9 PASS
test_write.ts -- hash-URL file paths 9 PASS
test_path_traversal.ts -- path traversal guard 2 PASS
test_discover_hash.ts -- hash link extraction 9 PASS
test_renderer.ts -- Playwright renderer 5 PASS

E2E Integration Tests (2/2 pass)

E2E 1 -- JS bundle route discovery + Playwright rendering: Static fetch of a React SPA returned empty <div id="root"></div>. After running webpull, Playwright rendered the page and extracted 3 pages with full content that only exists after JS execution. Browser rendering confirmed working.

E2E 2 -- Worker browser fallback: Verified that individual workers correctly fall back to Playwright when they detect an SPA shell, producing content that only exists after JS execution.

- Fix hash link extraction logic (remove confusing double-branch)
- Convert string concatenation to template literals
- Fix import ordering per biome rules
- Fix noAssignInExpressions in routes.ts
- Fix formatting in detect.ts, renderer.ts, ua.ts
- Use relative() path check for robust traversal prevention
- Update README with SPA support and Playwright requirement
@JDRanpariya
Copy link
Copy Markdown
Author

addressed the fixes

Copy link
Copy Markdown

@vorflux vorflux Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed commit b54d4e2 -- the 3 previously flagged issues are all properly fixed and tsc --noEmit passes. Two minor items remain (1 medium, 1 low).


Review with Vorflux

Comment thread src/write.ts Outdated
Comment thread src/discover.ts Outdated
- Fix hash length check from >2 to >1 (include root "#/" route)
- Add fullUrl before slicing to prevent exceeding --max by one
- Apply same threshold fix in index.ts
@JDRanpariya
Copy link
Copy Markdown
Author

addressed the fixes

@vorflux
Copy link
Copy Markdown

vorflux Bot commented May 2, 2026

Reviewed commit 6e8f1e7 -- all previously flagged issues are now resolved. No new issues found.

Summary of fixes across all rounds:

  • Path traversal guard uses relative() with ../absolute check (robust against sibling-prefix bypass)
  • Playwright missing-browser error surfaces a clear message with install instructions
  • TypeScript strict null check on bodyMatch[1] fixed with ?? ""
  • Worker count capped at 4 in browser mode, using activePool local variable to fix TS narrowing
  • Hash link extraction simplified to a single clear predicate
  • Hash-route threshold corrected from > 2 to > 1 (captures #/ root)
  • Discovery max-overflow fixed by unshifting fullUrl before slicing

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant