v3.0.0 — Documents, media, YouTube & frontmatter-first output

@syswave-dev released this 10 Jun 11:49
· 1 commit to main since this release
00d7d91

PullMD v3 grows from a web-page reader into a general anything-to-Markdown service for agents, with a leaner default output. Everything beyond plain web extraction is opt-in and degrades gracefully — left unconfigured, v3 handles web pages exactly like v2, just with a cleaner body by default.

Breaking

  • Clean markdown body by default. The inline source-attribution line is no longer emitted in the body, for either web pages or Reddit posts: the **r/sub** · u/user · N ↑ · … line is gone, and subreddit, author, published, and upvotes are now frontmatter fields. Set PULLMD_SOURCE_HEADER=true to restore the legacy inline header. See MIGRATION.md.
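
For consumers that used to scrape the inline Reddit header, the change means reading the same values from frontmatter instead. The following is a minimal sketch, assuming the output begins with a `---`-delimited block of simple `key: value` lines; the exact frontmatter layout is not specified in these notes, the sample values are invented for illustration, and `split_frontmatter` is a hypothetical helper, not part of PullMD.

```python
def split_frontmatter(document: str) -> tuple[dict[str, str], str]:
    """Split a document into (frontmatter fields, markdown body).

    Assumption: frontmatter is a leading block delimited by '---' lines
    containing flat 'key: value' pairs. Nested YAML is not handled.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document  # no frontmatter block present
    fields: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the clean body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return fields, body
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, document  # unterminated block: treat everything as body


# Illustrative input only; field names come from the release notes,
# values are made up.
sample = """---
subreddit: example
author: someone
upvotes: 42
---
# Post title

Body text.
"""

fields, body = split_frontmatter(sample)
print(fields["subreddit"], fields["author"], fields["upvotes"])
print(body.splitlines()[0])
```

A real consumer would likely use a YAML parser rather than line splitting; the sketch only shows where the former header values now live.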

Added

  • Document conversion via the new markitdown sidecar: PDF, DOCX, PPTX, XLSX, EPUB, ZIP, CSV, JSON, XML — by URL (GET /api?url=…) or upload (POST /api/file, 25 MB), with PWA drag-and-drop & file picker. Conversions run sandboxed (subprocess timeout + memory cap).
  • Opt-in media tier (PULLMD_VISION_* / PULLMD_STT_*): image captioning and audio transcription via any OpenAI-compatible endpoint (cloud or local) — runs inside pullmd, no extra container.
  • Opt-in PDF OCR tier (PULLMD_PDF_OCR_* + ?pdf=ocr): table-grade PDF conversion via a vendor-neutral OCR provider (reference: Mistral OCR), with automatic fallback to the free path. Includes a PWA toggle.
  • Keyless YouTube transcripts (MARKITDOWN_YOUTUBE=true): title + description + transcript with clickable timecodes; per-request yt_timecodes / yt_chunk.
  • PULLMD_FRONTMATTER_FIELDS allowlist to trim frontmatter, plus source-specific fields (LLM usage, image size, audio seconds, YouTube duration/views, Reddit meta, pdf_pages).
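
As a rough illustration of the by-URL path, the sketch below builds a `GET /api?url=…` request URL with the optional `pdf=ocr` parameter. The base URL (`http://localhost:3000`) and the helper name `build_api_url` are assumptions made for the example; only the `/api` path, the `url` parameter, and `pdf=ocr` come from these notes.

```python
from urllib.parse import urlencode

# Assumption: host and port of a local PullMD instance (hypothetical).
BASE_URL = "http://localhost:3000"


def build_api_url(source_url: str, **options: str) -> str:
    """Build a GET /api request URL for the given source document.

    Extra keyword arguments are passed through as query parameters,
    for example pdf="ocr" to request the opt-in OCR tier.
    """
    params = {"url": source_url}
    params.update(options)
    # urlencode percent-encodes the source URL so it survives as one value.
    return f"{BASE_URL}/api?{urlencode(params)}"


plain = build_api_url("https://example.com/report.pdf")
with_ocr = build_api_url("https://example.com/report.pdf", pdf="ocr")
print(plain)
print(with_ocr)
```

Per-request options such as yt_timecodes or yt_chunk could be passed the same way, though their accepted values are not listed here.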

Changed

  • Claude Code skill bundle renamed web-reader → pullmd (GET /pullmd.zip; the old URL redirects). Existing installs: run rm -rf ~/.claude/skills/web-reader before installing the new zip.
  • MCP read_url description and skill instructions cover the v3 capabilities; the MCP server reports the real package version.

Fixed

  • Relative image/link URLs are resolved against the source page (no more broken images on share pages).
  • Media frontmatter survives the cache; provider errors degrade to plain extraction.
  • PWA: permalink bar moved above the result header.
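
The relative-URL fix above can be pictured with standard URL joining. This is a sketch of the general technique, not PullMD's actual implementation: `absolutize_links` is a hypothetical helper, the regular expression covers only simple inline Markdown links and images, and the page URL is invented.

```python
import re
from urllib.parse import urljoin

# Matches the opening of an inline Markdown link or image, "[text](" or
# "![alt](", followed by the target up to the first space or ")".
LINK_TARGET = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)")


def absolutize_links(markdown: str, page_url: str) -> str:
    """Resolve relative link and image targets against the source page URL."""
    def fix(match: re.Match) -> str:
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        return match.group(1) + urljoin(page_url, match.group(2))
    return LINK_TARGET.sub(fix, markdown)


page = "https://example.com/share/abc123"  # illustrative page URL
text = "![cover](img/cover.png) and [home](/) and [cdn](https://cdn.example.org/a.png)"
resolved = absolutize_links(text, page)
print(resolved)
```

Targets that are already absolute pass through untouched, so the rewrite is safe to apply to every link in the document.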

Full details in the CHANGELOG.

Images: aeternalabshq/{pullmd,pullmd-trafilatura,pullmd-playwright,pullmd-markitdown}:3.0.0 on Docker Hub and GHCR (multi-arch amd64+arm64). :latest now tracks v3 — pin :2 to stay on the v2 output format.