PKB Starter v0.8.0 — MinerU Phase 2 + OCR + Global Pipeline

Clockworkhg released this 03 Jul 09:08

What's New

🔥 MinerU Phase 2 — PDF OCR + Layout Analysis

  • High-quality PDF extraction: layout analysis, LaTeX formula recognition, 84-language OCR
  • Automatic routing: PDF → MinerU (Phase 2), DOCX/PPTX/XLSX → MarkItDown
  • New tool: tools/mineru_extract.py
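
The routing rule above can be sketched as a lookup on the file extension. This is a minimal illustration, not the code in tools/mineru_extract.py: the function name `pick_extractor`, the backend labels, and the error behaviour for unknown extensions are all assumptions.

```python
from pathlib import Path

# Extensions the release notes send to MarkItDown. The set name is hypothetical.
MARKITDOWN_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def pick_extractor(path: str) -> str:
    """Return a label for the extraction backend that should handle `path`."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    if suffix == ".pdf":
        return "mineru"
    if suffix in MARKITDOWN_EXTENSIONS:
        return "markitdown"
    # Assumption: unknown types are rejected rather than silently skipped.
    raise ValueError(f"no extractor registered for {suffix!r}")
```

Keeping the decision in one small function makes it easy to add a new format without touching the extractors themselves.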

📋 OUTPUT.md — Output Format Specification

  • Formal specification for PKB output formats (replaces the inapplicable DESIGN.md)
  • Agent report format, query response format, wiki page format

🔗 Obsidian MCP — Native Integration

  • .mcp.json now includes Obsidian MCP server config
  • Direct Obsidian vault access from Claude Code

🛠 Core Tools (NEW)

  • mineru_extract.py — PDF extraction engine
  • cnki_setup.py — CNKI infrastructure diagnostics
  • chrome_mcp_scraper.py — Chrome MCP data processor
  • runtime_detect.py — Multi-platform runtime detection

⚡ System Upgrades

  • CLAUDE.md rewrite: Skill routing table, tool catalog, hooks reference, L1-L5 query tiers
  • Hybrid search (/ask): BM25 + vector retrieval fused via RRF (reciprocal rank fusion), followed by a cross-encoder
  • Plan-as-Contract (/pkb Step 0): auto/manual ingest mode with artifact logging
  • Smart hot.md: Weighted composite scoring (type boost + size penalty + diversity)
  • Structured routing: [ROUTE] PRIMARY/ALSO format in hooks
  • Index health: /lint now checks index freshness and coverage
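
For readers unfamiliar with RRF, the fusion step in the hybrid search item can be sketched as follows. This is a generic reciprocal rank fusion example, not the /ask implementation: the function name, the constant `k=60` (a commonly used default), and the alphabetical tie-break are assumptions, and the cross-encoder stage is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # hypothetical BM25 ranking
vector_hits = ["b", "c", "a"]  # hypothetical vector-search ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Because RRF uses only ranks, it needs no score normalisation between BM25 and vector similarity, which is the usual reason for choosing it.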

🛡 Manifest Hardening

  • 3 path leak fixes (scihub_fetch, pkb_doctor, settings.json)
  • 7 coverage gaps closed (previously missing from sync manifest)
  • --validate flag: auto-detects manifest gaps before every sync
  • JSON-escaped path sanitization for settings.json
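
The last item matters because a Windows path stored inside a JSON file has every backslash doubled, so a search for the raw path misses it. A minimal sketch of handling both forms is shown below; the function name and the `<PKB_ROOT>` placeholder are hypothetical, and the real sync tooling may work differently.

```python
import json


def sanitize_settings_text(text: str, local_root: str,
                           placeholder: str = "<PKB_ROOT>") -> str:
    """Replace a local path in JSON text, in both raw and JSON-escaped form."""
    # json.dumps returns the quoted, escaped string; strip the two quotes
    # to get the form the path takes inside a JSON document.
    escaped_root = json.dumps(local_root)[1:-1]
    text = text.replace(escaped_root, placeholder)
    return text.replace(local_root, placeholder)


# Hypothetical settings.json content with a leaked local path.
settings_text = json.dumps({"vault": "C:\\Users\\me\\pkb\\wiki"})
cleaned = sanitize_settings_text(settings_text, "C:\\Users\\me\\pkb")
```

The cleaned text is still valid JSON, so it can be parsed and written back without further repair.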

📦 What's in the Box

  • 127 manifest mappings (up from 120)
  • 41 files changed, +3,374/-357 lines
  • 12 built-in commands, 44+ expandable skills