PKB Starter v0.6.7-alpha — MarkItDown Ingestion & Playwright Web Capture
Pre-release
🆕 What's New
MarkItDown Document Ingestion (Phase 1.5)
- markitdown_convert.py: local document-to-Markdown pre-extraction engine (PDF, DOCX, PPTX, XLSX, XLS).
- pkb_ingest.py: local file ingest orchestrator (import → MarkItDown → cache → wiki).
- Runtime version detection via importlib.metadata.version("markitdown").
- Fallback state machine: MarkItDown success → cache; failure → LLM direct read → _PENDING_CONVERSION.md.
- Legacy .doc returns explicit legacy_doc_unsupported status (use Word/LibreOffice to convert).
- Conversion cache in .pkb-cache/extractions/ (gitignored).
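
The version check and fallback order listed above can be sketched roughly as follows. This is an illustrative reading of the release notes, not the code in pkb_ingest.py: the function names, the convert and llm_read callables, and every status string except legacy_doc_unsupported are assumptions, and treating a failed LLM read as the trigger for _PENDING_CONVERSION.md is one interpretation of the arrow chain.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path
from typing import Callable, Optional, Tuple

# Formats the release notes list as handled by MarkItDown pre-extraction.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".xls"}


def markitdown_version() -> Optional[str]:
    """Return the installed markitdown version, or None if it is not installed."""
    try:
        return version("markitdown")
    except PackageNotFoundError:
        return None


def ingest(
    path: str,
    convert: Callable[[str], str],   # hypothetical MarkItDown wrapper
    llm_read: Callable[[str], str],  # hypothetical LLM direct-read step
) -> Tuple[str, Optional[str]]:
    """Return (status, markdown). Only 'legacy_doc_unsupported' is from the notes."""
    suffix = Path(path).suffix.lower()
    if suffix == ".doc":
        # Legacy Word files are rejected up front with an explicit status.
        return ("legacy_doc_unsupported", None)
    if suffix in SUPPORTED:
        try:
            text = convert(path)
            if text:
                return ("cached", text)  # assumed status: success goes to cache
        except Exception:
            pass  # fall through to the LLM direct read
    try:
        text = llm_read(path)
        if text:
            return ("llm_direct_read", text)  # assumed status name
    except Exception:
        pass
    # Nothing worked: the caller would record the file in _PENDING_CONVERSION.md.
    return ("pending_conversion", None)
```

Passing the converter and the LLM reader in as callables keeps the sketch runnable without MarkItDown installed; markitdown_version() returns None in that case.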
pip install -r tools/requirements-markitdown.txt

Web Pack v3.1: Playwright Dynamic Content Fallback
- content_quality.py: content quality scoring that decides when Playwright fallback is needed.
- playwright_renderer.py: optional Playwright Chromium DOM rendering.
- network_capture.py: XHR/Fetch network response extraction with sensitive URL sanitization.
- network_content.py: network body candidate extraction with deduplication.
- selection_engine.py: three-way selector (HTTP static → Playwright DOM → Playwright Network).
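
To make the quality gate and three-way selection concrete, here is a minimal sketch. The scoring heuristic (word count saturating at 200 words), the 0.5 threshold, and the Candidate, quality_score, and select names are all assumptions made for illustration; the real content_quality.py and selection_engine.py may score and rank candidates quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    source: str  # "http_static", "playwright_dom", or "playwright_network"
    text: str


def quality_score(text: str) -> float:
    """Hypothetical heuristic: word count scaled to 0..1, saturating at 200 words."""
    return min(1.0, len(text.split()) / 200)


def select(
    static: Candidate,
    dom: Optional[Candidate] = None,
    network: Optional[Candidate] = None,
    render: bool = False,
    threshold: float = 0.5,
) -> Candidate:
    """Keep the static result unless it scores below the gate and rendering is on."""
    if not render or quality_score(static.text) >= threshold:
        return static
    # Static content was too thin: compare it with whatever Playwright produced.
    candidates = [c for c in (static, dom, network) if c is not None]
    # max() returns the first maximal item, so ties favor the earlier, cheaper source.
    return max(candidates, key=lambda c: quality_score(c.text))
```

In this sketch the static result always wins when rendering is disabled, which mirrors the note that static collection behavior is unchanged by default.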
| Flag | Behavior |
|---|---|
| --render | Enables Playwright only when static quality is insufficient |
| --headed | Visible Chromium window (auto-enables --render) |
| --debug-network | Sanitized network diagnostics (no body/headers/cookies) |
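
The interaction between these flags, in particular --headed implying --render, could be wired up along these lines. This is a sketch using the standard argparse module; the program name, the positional url argument, and the function names are hypothetical and are not taken from the Web Pack CLI.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # "web-collect" is a placeholder program name, not the real entry point.
    parser = argparse.ArgumentParser(prog="web-collect")
    parser.add_argument("url", help="page to collect (hypothetical positional)")
    parser.add_argument("--render", action="store_true",
                        help="allow Playwright when static quality is insufficient")
    parser.add_argument("--headed", action="store_true",
                        help="show the Chromium window (implies --render)")
    parser.add_argument("--debug-network", action="store_true",
                        help="print sanitized network diagnostics")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.headed:
        # A visible browser only makes sense if rendering is allowed at all.
        args.render = True
    return args
```

With no flags given, all three options stay False, so nothing in this sketch would start a browser.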
pip install -r tools/requirements-playwright.txt
playwright install chromium

⬆️ Upgrading
python tools/pkb_update_client.py --apply

No breaking changes. Static collection behavior is unchanged, and Chromium is not launched unless the quality gates trigger.
🔒 Privacy & Security
- Default behavior unchanged — no browser automation unless explicitly requested.
- Does NOT bypass login, CAPTCHA, or access controls.
- PKB-dedicated browser profile (not user's daily Chrome).
- Safe mode does not persist login state.
- MarkItDown cache is gitignored — never committed.
🧪 Test Results
244 passed, 0 failed
- 145 Web Pack unit tests
- 10 Chromium integration tests
- 89 MarkItDown regression tests
📋 Known Limitations
| Limitation | Note |
|---|---|
| Legacy .doc files | Not supported; convert to .docx with Word/LibreOffice first |
| OCR not enabled | Phase 2+ — scanned PDF images are not text-extracted |
| Playwright is optional | pip install -r tools/requirements-playwright.txt if needed |
| App-only pages | May still be uncollectable even with Playwright |
Full Changelog: CHANGELOG.md