PKB Starter v0.6.7-alpha — MarkItDown Ingestion & Playwright Web Capture

Pre-release

Clockworkhg released this 13 Jun 05:26

🆕 What's New

MarkItDown Document Ingestion (Phase 1.5)

  • markitdown_convert.py — local document-to-Markdown pre-extraction engine (PDF, DOCX, PPTX, XLSX, XLS).
  • pkb_ingest.py — local file ingest orchestrator (import → MarkItDown → cache → wiki).
  • Runtime version detection via importlib.metadata.version("markitdown").
  • Fallback state machine: MarkItDown success → cache; failure → LLM direct read → _PENDING_CONVERSION.md.
  • Legacy .doc files return an explicit legacy_doc_unsupported status (convert them to .docx with Word or LibreOffice first).
  • Conversion cache in .pkb-cache/extractions/ (gitignored).
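
The fallback state machine above can be sketched roughly as follows. This is a minimal illustration, not the code in pkb_ingest.py: the `IngestStatus` enum, the `ingest` function, and the `convert` / `llm_read` callables are all hypothetical names. It also assumes one particular reading of the arrow chain, namely that _PENDING_CONVERSION.md is the last resort when the LLM direct read also yields nothing.

```python
from enum import Enum
from pathlib import Path
from typing import Callable, Optional


class IngestStatus(Enum):
    """Hypothetical outcome labels; only legacy_doc_unsupported is named in the notes."""
    CACHED = "cached"
    LLM_DIRECT_READ = "llm_direct_read"
    PENDING_CONVERSION = "pending_conversion"
    LEGACY_DOC_UNSUPPORTED = "legacy_doc_unsupported"


def ingest(
    path: Path,
    convert: Callable[[Path], str],
    llm_read: Callable[[Path], Optional[str]],
) -> IngestStatus:
    # Legacy .doc is rejected up front with an explicit status.
    if path.suffix.lower() == ".doc":
        return IngestStatus.LEGACY_DOC_UNSUPPORTED

    # Step 1: try the MarkItDown-style converter; success means "cache it".
    try:
        convert(path)
        return IngestStatus.CACHED
    except Exception:
        pass  # fall through to the next stage

    # Step 2: fall back to a direct read (assumed to return text or None).
    try:
        if llm_read(path):
            return IngestStatus.LLM_DIRECT_READ
    except Exception:
        pass

    # Step 3: nothing worked, so the file is left pending conversion.
    return IngestStatus.PENDING_CONVERSION
```

Passing the converter and reader in as callables keeps the sketch testable without MarkItDown installed; the real orchestrator may be structured quite differently.
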
Install the MarkItDown dependencies:
pip install -r tools/requirements-markitdown.txt
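
After installing, the runtime version detection mentioned above can be checked with a few lines of standard-library Python. The helper name `detect_version` is an assumption for illustration; only the `importlib.metadata.version("markitdown")` call comes from these notes.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def detect_version(package: str = "markitdown") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = detect_version()
    print(f"markitdown {found}" if found else "markitdown is not installed")
```

Returning None rather than raising lets a caller skip straight to the fallback path when MarkItDown is missing.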

Web Pack v3.1 — Playwright Dynamic Content Fallback

  • content_quality.py — content quality scoring that decides when Playwright fallback is needed.
  • playwright_renderer.py — optional Playwright Chromium DOM rendering.
  • network_capture.py — XHR/Fetch network response extraction with sensitive URL sanitization.
  • network_content.py — network body candidate extraction with deduplication.
  • selection_engine.py — three-way selector: HTTP static → Playwright DOM → Playwright Network.
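
To illustrate how a quality score might gate the three-way selection, here is a minimal sketch. The scoring heuristic (word count saturating at 200 words), the 0.5 threshold, and the names `Candidate`, `quality_score`, and `select` are all assumptions; the actual logic in content_quality.py and selection_engine.py is not described in these notes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    source: str  # "http_static", "playwright_dom", or "playwright_network"
    text: str


def quality_score(text: str) -> float:
    """Toy heuristic (assumed): word count scaled to [0, 1], saturating at 200 words."""
    return min(len(text.split()) / 200.0, 1.0)


def select(
    static_text: str,
    render: Optional[Callable[[], list[Candidate]]] = None,
    threshold: float = 0.5,
) -> Candidate:
    static = Candidate("http_static", static_text)
    # Static content wins outright if it is good enough or rendering is disabled.
    if render is None or quality_score(static.text) >= threshold:
        return static
    # Otherwise render lazily and keep the best-scoring candidate of all three routes.
    candidates = [static, *render()]
    return max(candidates, key=lambda c: quality_score(c.text))
```

Because `render` is only called when the static score falls below the threshold, the browser is never started for pages that static collection already handles well.
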
Flag             Behavior
--render         Enables Playwright only when static quality is insufficient
--headed         Visible Chromium window (auto-enables --render)
--debug-network  Sanitized network diagnostics (no bodies, headers, or cookies)
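
The way --headed implies --render can be sketched with argparse. The flag names come from the table above; the parser itself, the program name, and the `parse_flags` helper are hypothetical and may not match how the real CLI is wired.

```python
import argparse


def parse_flags(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="web-capture")  # hypothetical program name
    parser.add_argument("--render", action="store_true",
                        help="allow Playwright when static quality is insufficient")
    parser.add_argument("--headed", action="store_true",
                        help="show a visible Chromium window")
    parser.add_argument("--debug-network", action="store_true",
                        help="emit sanitized network diagnostics")
    args = parser.parse_args(argv)
    # A headed browser only makes sense when rendering is on, so imply it.
    if args.headed:
        args.render = True
    return args
```

For example, `parse_flags(["--headed"])` yields a namespace with both `headed` and `render` set to True, while `parse_flags([])` leaves every flag False.
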
Install the optional Playwright dependencies:
pip install -r tools/requirements-playwright.txt
playwright install chromium

⬆️ Upgrading

python tools/pkb_update_client.py --apply

No breaking changes: static collection behavior is unchanged, and Chromium is not launched unless --render is passed and the quality gates trigger.

🔒 Privacy & Security

  • Default behavior unchanged — no browser automation unless explicitly requested.
  • Does NOT bypass login, CAPTCHA, or access controls.
  • PKB-dedicated browser profile (not user's daily Chrome).
  • Safe mode does not persist login state.
  • MarkItDown cache is gitignored — never committed.
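
In the same spirit, the sensitive URL sanitization listed for network_capture.py could look something like the sketch below. The set of sensitive parameter names, the "REDACTED" placeholder, and the `sanitize_url` function are assumptions for illustration; the notes do not specify which parts of a URL the real module redacts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameter names that should never reach a log.
SENSITIVE_KEYS = {
    "token", "access_token", "api_key", "key",
    "password", "session", "sig", "signature",
}


def sanitize_url(url: str) -> str:
    """Redact sensitive query values and drop userinfo and fragment."""
    parts = urlsplit(url)
    query = [
        (k, "REDACTED" if k.lower() in SENSITIVE_KEYS else v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # Rebuild the host without any "user:password@" prefix.
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, urlencode(query), ""))
```

Matching parameter names case-insensitively catches variants such as `API_KEY`, and dropping the fragment avoids leaking tokens that some sites place after `#`.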

🧪 Test Results

244 passed, 0 failed
  - 145 Web Pack unit tests
  -  10 Chromium integration tests
  -  89 MarkItDown regression tests

📋 Known Limitations

Limitation              Note
Legacy .doc files       Not supported; convert to .docx with Word/LibreOffice first
OCR not enabled         Deferred to Phase 2+; scanned PDF images are not text-extracted
Playwright is optional  Run pip install -r tools/requirements-playwright.txt if needed
App-only pages          May still be uncollectable even with Playwright

Full Changelog: CHANGELOG.md