v0.1.1

ArkNill released this 16 Mar 23:57

Initial Public Release

Universal web content extraction — any URL to LLM-ready markdown.

Highlights

  • HTML: BeautifulSoup + content density filtering (removes nav, sidebar, ads)
  • YouTube: Transcript extraction with timestamps and multi-language support
  • PDF: Text extraction with page structure (pdfplumber)
  • DOCX: Paragraph and heading extraction (python-docx)
  • Auto-fallback: httpx first, Playwright for JS-heavy pages
  • Async-first: Built on httpx and Playwright async APIs
  • CLI: markgrab <url> with markdown/text/JSON output
  • Anti-bot stealth: Opt-in Playwright stealth scripts
  • 114 unit tests, all passing
  • MIT licensed
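
The auto-fallback highlight (a lightweight HTTP fetch first, a headless browser only when the page appears to need JavaScript) can be illustrated with a minimal sketch. This is not markgrab's actual implementation: the names `fetch_with_fallback`, `visible_text_length`, and `MIN_TEXT_CHARS`, and the "too little visible text" heuristic, are assumptions chosen for illustration. The two fetchers are passed in as plain async callables so the sketch runs without network access.

```python
import asyncio
import re
from typing import Awaitable, Callable, Tuple

# Any async callable that maps a URL to an HTML string.
Fetcher = Callable[[str], Awaitable[str]]

# Hypothetical threshold: below this many visible characters, assume the
# page is rendered client-side and needs a real browser.
MIN_TEXT_CHARS = 200


def visible_text_length(html: str) -> int:
    """Rough estimate of readable text: drop script/style blocks, then tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return len(" ".join(text.split()))


async def fetch_with_fallback(
    url: str,
    fast_fetch: Fetcher,
    browser_fetch: Fetcher,
    min_chars: int = MIN_TEXT_CHARS,
) -> Tuple[str, str]:
    """Try the cheap fetcher first; fall back to the browser fetcher.

    Returns (html, source), where source is "fast" or "browser".
    """
    try:
        html = await fast_fetch(url)
        if visible_text_length(html) >= min_chars:
            return html, "fast"
    except Exception:
        # Network or HTTP error on the fast path: let the browser path try.
        pass
    html = await browser_fetch(url)
    return html, "browser"
```

In a real setup, `fast_fetch` might wrap an `httpx.AsyncClient` request and `browser_fetch` might load the page with Playwright's async API and return the rendered page content. Injecting both keeps the fallback decision testable on its own.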

Install

pip install markgrab

Python 3.11+ required. See README for details.