fix: extract plain-text URLs from Google Docs HTML export by gvonness-apolitical · Pull Request #20 · Entrolution/speed-read

gvonness-apolitical · 2026-02-03T13:17:42Z

Summary

Google Docs HTML exports can contain bare plain-text URLs alongside hyperlinked ones, and may split URLs across <span> elements — both cases were previously missed
Replaced the two-pass extraction (hrefs first, then plain-text) with a single-pass strategy that resolves anchor hrefs inline, strips HTML tags, and scans once
Preserves document order for all extracted URLs regardless of whether they were linked or bare
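
The single-pass strategy described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: the function name `extract_urls`, the regular expressions, and the sample markup are all assumptions, and the project's real implementation (including any handling of Google redirect wrappers or site-specific filtering) is not shown in this description.

```python
import html
import re

# Hypothetical patterns for illustration only; the PR's real patterns are not shown.
ANCHOR_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1[^>]*>.*?</a>',
    re.IGNORECASE | re.DOTALL,
)
SPAN_RE = re.compile(r'</?span\b[^>]*>', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def extract_urls(doc_html: str) -> list[str]:
    """Return linked and bare URLs in document order (sketch)."""
    # 1. Resolve each anchor inline: replace the whole <a>...</a> element with
    #    its href, padded with spaces so it cannot fuse with neighbouring text.
    text = ANCHOR_RE.sub(lambda m: f" {m.group(2)} ", doc_html)
    # 2. Drop <span> tags without inserting whitespace, so a URL that was
    #    split across adjacent spans is joined back together.
    text = SPAN_RE.sub("", text)
    # 3. Replace every other tag with a space so block boundaries separate URLs.
    text = TAG_RE.sub(" ", text)
    # 4. Decode entities such as &amp; before matching.
    text = html.unescape(text)
    # 5. Scan once; matches come back in the order they appear in the document.
    return URL_RE.findall(text)


sample = (
    '<p><a href="https://a.example/1">first</a></p>'
    '<p><span>https://b.exa</span><span>mple/2</span></p>'
    '<p><a href="https://c.example/3">third</a></p>'
)
print(extract_urls(sample))
```

Because anchors are rewritten in place rather than collected separately, linked and bare URLs end up in the same text stream, which is what lets a single scan preserve their relative order.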

Test plan

Added test: mixed content with both linked and bare URLs — all extracted
Added test: plain-text URLs in HTML with no Tumblr hrefs — still extracted
Added test: URL split across <span> tags — reassembled and extracted
Added test: interleaved linked and bare URLs preserve document order
All 236 existing + new tests pass
Clean production build

Google Docs can contain a mix of hyperlinked and bare plain-text URLs, and may split plain-text URLs across <span> elements. The previous two-pass approach skipped plain-text extraction when hrefs were found, and didn't strip HTML tags before regex matching. Replace the two-pass approach with a single-pass strategy: resolve anchor hrefs inline as plain text, strip remaining tags, then scan once — preserving document order for both linked and bare URLs.
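
To make the two failure modes in the commit message concrete, here is a hypothetical reconstruction of a two-pass extractor with the same shortcomings. It is a guess at the shape of the old logic based only on the description above, not the project's previous code; `extract_urls_two_pass` and both patterns are invented for illustration.

```python
import re

# Hypothetical patterns; the project's previous patterns are not shown in the PR.
HREF_RE = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def extract_urls_two_pass(doc_html: str) -> list[str]:
    """Sketch of a two-pass extractor that exhibits both described bugs."""
    # Pass 1: collect hrefs. If any exist, return early, so bare
    # plain-text URLs elsewhere in the document are never examined.
    hrefs = HREF_RE.findall(doc_html)
    if hrefs:
        return hrefs
    # Pass 2: runs only when no hrefs were found, and matches against raw
    # markup, so a URL split across <span> tags is cut off at the first tag.
    return URL_RE.findall(doc_html)


mixed = '<p><a href="https://a.example/1">first</a> https://b.example/2</p>'
split = '<p><span>https://b.exa</span><span>mple/2</span></p>'

print(extract_urls_two_pass(mixed))  # the bare URL is missing
print(extract_urls_two_pass(split))  # the URL is truncated at the span boundary
```

In this sketch the mixed document yields only the linked URL, and the split URL is returned in truncated form, which matches the two cases the new tests cover.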

gvonness-apolitical merged commit cf98816 into main Feb 3, 2026
2 checks passed

gvonness-apolitical deleted the fix/extract-plain-text-urls-from-google-docs branch February 3, 2026 13:18

gvonness-apolitical merged 1 commit into main from fix/extract-plain-text-urls-from-google-docs
