Releases: Clockworkhg/pkb-starter
PKB Starter v0.8.0 — MinerU Phase 2 + OCR + Global Pipeline
What's New
🔥 MinerU Phase 2 — PDF OCR + Layout Analysis
- High-quality PDF extraction: layout analysis, LaTeX formula recognition, 84-language OCR
- Automatic routing: PDF → MinerU (Phase 2), DOCX/PPTX/XLSX → MarkItDown
- New tool: `tools/mineru_extract.py`
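The routing rule above can be illustrated with a small extension lookup. This is a minimal sketch, not the project's code: the `ROUTES` table, the backend labels, and `pick_extractor` are hypothetical names, and only the extension-to-backend pairing comes from the release note.

```python
from pathlib import Path
from typing import Optional

# Hypothetical routing table: the pairing follows the release note,
# the labels are illustrative and not the project's identifiers.
ROUTES = {
    ".pdf": "mineru",
    ".docx": "markitdown",
    ".pptx": "markitdown",
    ".xlsx": "markitdown",
}


def pick_extractor(path: str) -> Optional[str]:
    """Return the backend label for a file, or None when no route applies."""
    return ROUTES.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ("paper.PDF", "slides.pptx", "notes.txt"):
        print(name, "->", pick_extractor(name))
```

Matching on the lowercased suffix keeps the lookup case-insensitive, which matters for files named on Windows.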
📋 OUTPUT.md — Output Format Specification
- Formal specification for PKB output formats (replaces inapplicable DESIGN.md)
- Agent report format, query response format, wiki page format
🔗 Obsidian MCP — Native Integration
- `.mcp.json` now includes Obsidian MCP server config
- Direct Obsidian vault access from Claude Code
🛠 Core Tools (NEW)
- `mineru_extract.py`: PDF extraction engine
- `cnki_setup.py`: CNKI infrastructure diagnostics
- `chrome_mcp_scraper.py`: Chrome MCP data processor
- `runtime_detect.py`: Multi-platform runtime detection
⚡ System Upgrades
- CLAUDE.md rewrite: Skill routing table, tool catalog, hooks reference, L1-L5 query tiers
- Hybrid search (`/ask`): BM25 + vector RRF + Cross-encoder pipeline
- Plan-as-Contract (`/pkb` Step 0): auto/manual ingest mode with artifact logging
- Smart hot.md: Weighted composite scoring (type boost + size penalty + diversity)
- Structured routing: `[ROUTE] PRIMARY/ALSO` format in hooks
- Index health: `/lint` now checks index freshness and coverage
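For readers unfamiliar with the RRF step in the hybrid search item, the sketch below shows plain reciprocal rank fusion over two ranked lists. It assumes the conventional constant `k=60` and a hypothetical `rrf_fuse` helper; the release note does not say which constant or tie-breaking rule `/ask` uses, and the cross-encoder rerank is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # assumed output of a keyword ranker
    vector_hits = ["b", "c", "a"]  # assumed output of a vector ranker
    print(rrf_fuse([bm25_hits, vector_hits]))
```

Because RRF uses only ranks, the BM25 and vector scores never need to be put on a common scale before fusing.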
🛡 Manifest Hardening
- 3 path leak fixes (scihub_fetch, pkb_doctor, settings.json)
- 7 coverage gaps closed (previously missing from sync manifest)
- `--validate` flag: auto-detects manifest gaps before every sync
- JSON-escaped path sanitization for settings.json
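Why JSON-escaped paths need their own handling: inside a JSON file, a Windows path such as `C:\Users\me\pkb` is stored with doubled backslashes, so a replacement that only looks for the raw form misses it. The sketch below is one way to cover both forms. `sanitize_paths` and the `<PKB_ROOT>` placeholder are hypothetical, not the project's sanitizer.

```python
import json


def sanitize_paths(text, private_root, placeholder="<PKB_ROOT>"):
    """Replace a private root path in its raw, JSON-escaped, and
    forward-slash forms. Longer variants go first so the escaped form
    is not partially matched by a shorter one."""
    variants = {
        private_root,
        json.dumps(private_root)[1:-1],    # JSON-escaped, quotes stripped
        private_root.replace("\\", "/"),   # forward-slash spelling
    }
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, placeholder)
    return text


if __name__ == "__main__":
    root = r"C:\Users\me\pkb"
    settings = json.dumps({"path": root + r"\wiki"})
    print(sanitize_paths(settings, root))
```

The output is still valid JSON, because only the path prefix is swapped and the remaining escaped separators are left untouched.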
📦 What's in the Box
- 127 manifest mappings (120 → 127)
- 41 files changed, +3,374/-357 lines
- 12 built-in commands, 44+ expandable skills
v0.6.15-starter: Anti-Degradation + Global Bridge + Sync Hardening
🛡️ Anti-Degradation System v1.0
- 15 bug fixes in hook pipeline (03_post_tool_use + 05_stop), validated by E2E test suite (582 lines)
- File-persistent write counter for index rebuild scheduling
- 4-field frontmatter validation (created/updated/tags/type)
- CRLF regex compatibility for Windows
- Atomic file writes (tempfile + rename) for hot.md
- UTF-8 BOM handling for Windows Notepad compatibility
- Active topics regex merge (preserve existing)
- Smart routing patterns expanded 11→26
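The atomic-write item above follows a common pattern: write to a temporary file in the same directory, then swap it into place. A minimal sketch, assuming a hypothetical `atomic_write_text` helper. It uses `os.replace` rather than `os.rename`, because `os.rename` fails on Windows when the target already exists; the hook's actual code may differ.

```python
import os
import tempfile


def atomic_write_text(path, text, encoding="utf-8"):
    """Write text so readers see either the old file or the new one,
    never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final swap is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "hot.md")
    atomic_write_text(target, "# Hot topics\n")
    print(open(target, encoding="utf-8").read())
```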
🌐 Global Bridge Engine v1.0
- `pkb_bridge.py` (1030 lines): cross-project `/ask-pkb` + `/pkb-capture`
- 4-layer PKB_ROOT discovery: env var → auto-detect → config → user prompt
- 9 security block patterns + 4 warning patterns
- Graceful degradation with timeout + fallback
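The 4-layer discovery order can be read as a chain of fallbacks. The sketch below is illustrative only. `discover_pkb_root` is a hypothetical function, and two details are assumptions the release note does not state: that a `wiki/` subdirectory marks a PKB root, and that the config file is JSON with a `pkb_root` key.

```python
import json
import os
from pathlib import Path


def discover_pkb_root(start, config_path, env=None, prompt=None):
    """Try each discovery layer in order; return the first usable root or None."""
    env = os.environ if env is None else env

    # Layer 1: explicit environment variable.
    value = env.get("PKB_ROOT")
    if value and Path(value).is_dir():
        return Path(value)

    # Layer 2: auto-detect by walking upward from the start directory.
    # ASSUMPTION: a "wiki" subdirectory marks a PKB root.
    start = Path(start)
    for candidate in (start, *start.parents):
        if (candidate / "wiki").is_dir():
            return candidate

    # Layer 3: config file. ASSUMPTION: JSON with a "pkb_root" key.
    config_path = Path(config_path)
    if config_path.is_file():
        try:
            data = json.loads(config_path.read_text(encoding="utf-8"))
            configured = data.get("pkb_root")
        except (ValueError, AttributeError):
            configured = None
        if configured and Path(configured).is_dir():
            return Path(configured)

    # Layer 4: ask the user, if a prompt callback was supplied.
    if prompt is not None:
        answer = prompt("Path to PKB root: ")
        if answer and Path(answer).is_dir():
            return Path(answer)
    return None
```

Passing `env` and `prompt` as parameters keeps the function testable without touching the real environment or blocking on input.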
🔧 Sync Pipeline Hardening
- Critical: binary file handling — BINARY_EXTENSIONS detection with read_bytes/write_bytes
- Critical: sanitize_patterns double-backslash bug fix — residual PKB paths now properly sanitized
- Manifest: 120 mappings / 0 duplicates / 0 dead entries
- 3× `errors='replace'` for Windows GBK safety
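The binary-handling and `errors='replace'` items fit together in one copy routine: binary files are copied byte for byte, while text files are decoded tolerantly and then sanitized. A minimal sketch, assuming a hypothetical `sync_file` helper and an illustrative `BINARY_EXTENSIONS` set (the real list is not given in the release note).

```python
from pathlib import Path

# ASSUMPTION: illustrative extension set, not the project's actual list.
BINARY_EXTENSIONS = {".png", ".jpg", ".gif", ".pdf", ".zip"}


def sync_file(src, dst, sanitize):
    """Copy src to dst, applying sanitize() to text files only."""
    src, dst = Path(src), Path(dst)
    if src.suffix.lower() in BINARY_EXTENSIONS:
        # Byte-for-byte copy: decoding binary data as text would corrupt it.
        dst.write_bytes(src.read_bytes())
    else:
        # errors="replace" keeps the sync going when a file is not valid
        # UTF-8 (for example, GBK-encoded text on Windows).
        text = src.read_text(encoding="utf-8", errors="replace")
        dst.write_text(sanitize(text), encoding="utf-8")
```

The trade-off of `errors="replace"` is that undecodable bytes become U+FFFD in the output, so it protects the pipeline rather than the content.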
📁 Obsidian Vault Migration
- Root `.obsidian/` → `wiki/.obsidian/` (single active vault)
- Dataview plugin + pkb-colors CSS snippet + dashboard-dataview
- Wiki templates (concept/paper/project/source)
📝 Documentation
- All version badges unified to v0.6.15-starter
- README skills count corrected: 7→3 built-in
- Stale root-level CLAUDE.md removed
- Optional tools properly qualified in docs
Full Changelog: https://github.com/Clockworkhg/pkb-starter/blob/master/CHANGELOG.md
PKB Starter v0.6.8-alpha — Scholarly Metadata Enrichment
🆕 What's New
- Scholarly Literature Detection: automatic recognition via DOI, ISSN, arXiv, PMID, and structured metadata signals.
- Crossref Metadata Enrichment: author, title, journal, year, volume, pages, publisher.
- OpenAlex Work & Source Metrics (optional): citation counts, open access status, source rankings.
- Local Journal-Ranking Registry: user-imported CSSCI, PKU Core, AMI, CSCD, and custom lists. Journal matching via DOI-resolved ISSN, ISSN, EISSN, ISSN-L, normalized names, and fuzzy matching.
- Citation Formatting: GB/T 7714 journal-article, APA 7 (citeproc-py), BibTeX, RIS, CSL-JSON export.
- Batch Enrichment: dry-run, write, only-missing, JSONL, resumable jobs, locked-page protection.
- Structured Literature Filtering: by ranking scheme, edition, level, year, journal, DOI, citation count, review status.
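Identifier-based detection of the kind listed above is typically regex-driven. The sketch below is an illustration under assumptions, not the project's detector: the patterns and the `detect_identifiers` helper are hypothetical, and the ISSN check uses the standard mod-11 check digit to reject look-alikes such as year ranges.

```python
import re

PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "arxiv": re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE),
    "pmid": re.compile(r"\bPMID:\s*(\d{1,8})\b", re.IGNORECASE),
    "issn": re.compile(r"\b\d{4}-\d{3}[\dXx]\b"),
}


def issn_is_valid(issn):
    """Check the ISSN mod-11 check digit (weights 8..2, 'X' means 10)."""
    digits = issn.replace("-", "").upper()
    total = sum(int(c) * w for c, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def detect_identifiers(text):
    """Return the first DOI, arXiv id, PMID, and valid ISSN found in text."""
    found = {}
    match = PATTERNS["doi"].search(text)
    if match:
        # Trailing punctuation usually belongs to the sentence, not the DOI.
        found["doi"] = match.group(0).rstrip(".,;)")
    for key in ("arxiv", "pmid"):
        match = PATTERNS[key].search(text)
        if match:
            found[key] = match.group(1)
    for match in PATTERNS["issn"].finditer(text):
        if issn_is_valid(match.group(0)):
            found["issn"] = match.group(0).upper()
            break
    return found
```

A page could then be treated as scholarly when the returned dict is non-empty, with the DOI passed on to Crossref enrichment.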
Usage
# Batch enrich existing literature
python tools/scholarly_enrich.py --scan wiki/ --write
python tools/scholarly_enrich.py --scan wiki/ --write --only-missing
python tools/scholarly_enrich.py --scan wiki/ --write --resume
# Filter literature
python tools/filter_literature.py --ranking CSSCI --year-from 2023 --min-citations 5
# Import journal rankings
python tools/import_journal_rankings.py import rankings.csv
⬆️ Upgrading
# Optional: install APA 7 formatting dependencies
pip install -r tools/requirements-scholarly.txt
# Optional: set OpenAlex API key for enhanced metrics
set OPENALEX_API_KEY=your_key_here
# Update system files
python tools/pkb_update_client.py --apply
Core PKB workflow works without optional dependencies.
🔒 Privacy & Security
- Private PKB content is never synchronized to the public repository.
- Imported journal-ranking datasets remain under `.pkb_local/scholarly/` (gitignored).
- Cache databases and resumable job state remain local.
- API keys read only from environment variables.
- Crossref/OpenAlex failures do not block ordinary `/pkb` ingestion.
- No complete proprietary journal-ranking lists are bundled.
🧪 Test Results
| Suite | Tests | Result |
|---|---|---|
| Private PKB | 611 | ✅ passed |
| pkb-starter template | 568 | ✅ passed |
| Fresh install (scholarly + CLI) | 568 | ✅ passed |
Full Changelog: CHANGELOG.md
PKB Starter v0.6.7-alpha — MarkItDown Ingestion & Playwright Web Capture
🆕 What's New
MarkItDown Document Ingestion (Phase 1.5)
- `markitdown_convert.py`: local document-to-Markdown pre-extraction engine (PDF, DOCX, PPTX, XLSX, XLS).
- `pkb_ingest.py`: local file ingest orchestrator (import → MarkItDown → cache → wiki).
- Runtime version detection via `importlib.metadata.version("markitdown")`.
- Fallback state machine: MarkItDown success → cache; failure → LLM direct read → `_PENDING_CONVERSION.md`.
- Legacy `.doc` returns explicit `legacy_doc_unsupported` status (use Word/LibreOffice to convert).
- Conversion cache in `.pkb-cache/extractions/` (gitignored).
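The version probe and the fallback chain above can be sketched together. `importlib.metadata.version` is the real standard-library call and raises `PackageNotFoundError` when the package is absent. Everything else here is hypothetical: `extract`, its two callables, and the status labels other than `legacy_doc_unsupported`. The chain reflects one reading of the release note, in which the pending marker is the last resort after both converters fail.

```python
from importlib.metadata import PackageNotFoundError, version


def markitdown_version():
    """Return the installed markitdown version string, or None if absent."""
    try:
        return version("markitdown")
    except PackageNotFoundError:
        return None


def extract(path, markitdown_convert, llm_read):
    """Run the fallback chain and return (status, text_or_None).

    markitdown_convert and llm_read are caller-supplied callables that
    take a path and return Markdown text, raising on failure.
    """
    if path.lower().endswith(".doc"):
        return "legacy_doc_unsupported", None
    steps = (
        ("markitdown_cached", markitdown_convert),
        ("llm_direct_read", llm_read),
    )
    for status, step in steps:
        try:
            return status, step(path)
        except Exception:
            continue  # fall through to the next stage
    # Neither stage produced text: the caller would write the pending marker.
    return "pending_conversion", None
```

Returning a status string alongside the text lets the orchestrator decide what to cache and what to flag, without inspecting exceptions itself.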
pip install -r tools/requirements-markitdown.txt
Web Pack v3.1: Playwright Dynamic Content Fallback
- `content_quality.py`: content quality scoring that decides when Playwright fallback is needed.
- `playwright_renderer.py`: optional Playwright Chromium DOM rendering.
- `network_capture.py`: XHR/Fetch network response extraction with sensitive URL sanitization.
- `network_content.py`: network body candidate extraction with deduplication.
- `selection_engine.py`: three-way selector: HTTP static → Playwright DOM → Playwright Network.
| Flag | Behavior |
|---|---|
| `--render` | Enables Playwright only when static quality is insufficient |
| `--headed` | Visible Chromium window (auto-enables `--render`) |
| `--debug-network` | Sanitized network diagnostics (no body/headers/cookies) |
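The quality-gated escalation (HTTP static → Playwright DOM → Playwright Network) can be pictured as a lazy staged selector. This is a sketch under stated assumptions: `select_content`, the `threshold` value, and the scoring callable are hypothetical, and the real `selection_engine.py` may weigh candidates differently.

```python
def select_content(stages, score, threshold=0.6, render=False):
    """Pick content from ordered (name, fetch) stages.

    stages: list of (name, zero-argument fetch callable), cheapest first.
    score:  callable mapping text to a quality value in [0, 1].
    Later stages run only when render is enabled and every earlier
    result scored below the threshold.
    Returns (name, text, quality) of the best candidate seen, or None.
    """
    best = None
    for index, (name, fetch) in enumerate(stages):
        if index > 0 and not render:
            break  # browser stages are opt-in
        text = fetch()
        quality = score(text)
        if best is None or quality > best[2]:
            best = (name, text, quality)
        if quality >= threshold:
            break  # good enough: do not escalate further
    return best
```

Keeping each stage behind a callable means Chromium is never started unless the static result fails the gate, which matches the opt-in behavior of `--render` described in the table.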
pip install -r tools/requirements-playwright.txt
playwright install chromium
⬆️ Upgrading
python tools/pkb_update_client.py --apply
No breaking changes: static collection behavior unchanged. Chromium not launched unless quality gates trigger.
🔒 Privacy & Security
- Default behavior unchanged — no browser automation unless explicitly requested.
- Does NOT bypass login, CAPTCHA, or access controls.
- PKB-dedicated browser profile (not user's daily Chrome).
- Safe mode does not persist login state.
- MarkItDown cache is gitignored — never committed.
🧪 Test Results
244 passed, 0 failed
- 145 Web Pack unit tests
- 10 Chromium integration tests
- 89 MarkItDown regression tests
📋 Known Limitations
| Limitation | Note |
|---|---|
| Legacy `.doc` files | Not supported; convert to `.docx` with Word/LibreOffice first |
| OCR not enabled | Phase 2+ — scanned PDF images are not text-extracted |
| Playwright is optional | pip install -r tools/requirements-playwright.txt if needed |
| App-only pages | May still be uncollectable even with Playwright |
Full Changelog: CHANGELOG.md