v0.1.1

ArkNill released this 16 Mar 23:57

e3c1f94

Initial Public Release

Universal web content extraction: turns a URL (HTML page, YouTube video, PDF, or DOCX) into LLM-ready markdown.

Highlights

HTML: BeautifulSoup + content density filtering (removes nav, sidebar, ads)
YouTube: Transcript extraction with timestamps and multi-language support
PDF: Text extraction with page structure (pdfplumber)
DOCX: Paragraph and heading extraction (python-docx)
Auto-fallback: httpx first, Playwright for JS-heavy pages
Async-first: Built on httpx and Playwright async APIs
CLI: markgrab <url> with markdown/text/JSON output
Anti-bot stealth: Opt-in Playwright stealth scripts
114 unit tests, all passing
MIT licensed
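
The auto-fallback idea in the list above can be illustrated with a minimal sketch. This is not markgrab's actual code or API: the names `fetch_with_fallback`, `looks_js_rendered`, `fast_fetch`, `browser_fetch`, and the 200-character threshold are all hypothetical, chosen only to show the pattern of trying a cheap HTTP fetch first and switching to a browser-based fetch when the result looks like an empty JavaScript shell.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Hypothetical type: any async function that takes a URL and returns HTML.
Fetcher = Callable[[str], Awaitable[str]]


def looks_js_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Crude heuristic (an assumption, not markgrab's rule): a page with
    very little visible text is probably rendered client-side."""
    # Drop script/style bodies, then all remaining tags.
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    visible = " ".join(text.split())
    return len(visible) < min_text_chars


async def fetch_with_fallback(
    url: str,
    fast_fetch: Fetcher,
    browser_fetch: Fetcher,
    min_text_chars: int = 200,
) -> tuple[str, str]:
    """Return (html, source), where source is 'fast' or 'browser'."""
    try:
        html = await fast_fetch(url)
    except Exception:
        # Network or HTTP failure: fall through to the browser path.
        html = None
    if html is not None and not looks_js_rendered(html, min_text_chars):
        return html, "fast"
    return await browser_fetch(url), "browser"
```

In a real setup, `fast_fetch` would presumably wrap an `httpx.AsyncClient` GET request and `browser_fetch` would wrap a Playwright page load followed by `page.content()`. Passing the fetchers in as parameters keeps the decision logic testable without network access.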

Install

pip install markgrab

Python 3.11+ required. See README for details.
