Releases: Clockworkhg/pkb-starter

PKB Starter v0.8.0 — MinerU Phase 2 + OCR + Global Pipeline

Clockworkhg released this 03 Jul 09:08

What's New

🔥 MinerU Phase 2 — PDF OCR + Layout Analysis

  • High-quality PDF extraction: layout analysis, LaTeX formula recognition, 84-language OCR
  • Automatic routing: PDF → MinerU (Phase 2), DOCX/PPTX/XLSX → MarkItDown
  • New tool: tools/mineru_extract.py
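
The routing rule above can be pictured as a lookup on the file extension. The sketch below is illustrative only: `ROUTES` and `route_extractor` are hypothetical names, and the real logic in the PKB tools may consider more than the suffix.

```python
from pathlib import Path
from typing import Optional

# Hypothetical routing table. The extensions come from the release notes;
# the labels "mineru" and "markitdown" are illustrative, not real identifiers.
ROUTES = {
    ".pdf": "mineru",
    ".docx": "markitdown",
    ".pptx": "markitdown",
    ".xlsx": "markitdown",
}


def route_extractor(path: str) -> Optional[str]:
    """Return the extractor label for a file, or None if no route applies."""
    return ROUTES.get(Path(path).suffix.lower())


print(route_extractor("papers/survey.PDF"))  # suffix match is case-insensitive
print(route_extractor("slides/talk.pptx"))
```

Keeping the table as data rather than branching code makes it easy to add a format without touching the dispatch function.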

📋 OUTPUT.md — Output Format Specification

  • Formal specification for PKB output formats (replaces inapplicable DESIGN.md)
  • Agent report format, query response format, wiki page format

🔗 Obsidian MCP — Native Integration

  • .mcp.json now includes Obsidian MCP server config
  • Direct Obsidian vault access from Claude Code

🛠 Core Tools (NEW)

  • mineru_extract.py — PDF extraction engine
  • cnki_setup.py — CNKI infrastructure diagnostics
  • chrome_mcp_scraper.py — Chrome MCP data processor
  • runtime_detect.py — Multi-platform runtime detection

⚡ System Upgrades

  • CLAUDE.md rewrite: Skill routing table, tool catalog, hooks reference, L1-L5 query tiers
  • Hybrid search (/ask): BM25 + vector results fused with RRF, then a cross-encoder stage
  • Plan-as-Contract (/pkb Step 0): auto/manual ingest mode with artifact logging
  • Smart hot.md: Weighted composite scoring (type boost + size penalty + diversity)
  • Structured routing: [ROUTE] PRIMARY/ALSO format in hooks
  • Index health: /lint now checks index freshness and coverage
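
For readers unfamiliar with the fusion step in hybrid search, here is a minimal sketch. It assumes "RRF" means Reciprocal Rank Fusion. The function name `rrf_fuse` and the constant `k=60` (a commonly used default) are assumptions, not taken from the PKB code, and the cross-encoder stage is omitted.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    where rank starts at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by id to keep the order deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # hypothetical keyword ranking
vector_hits = ["b", "d", "a"]  # hypothetical embedding ranking
print(rrf_fuse([bm25_hits, vector_hits]))
```

RRF needs only ranks, so the BM25 and vector scores never have to be put on a common scale.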

🛡 Manifest Hardening

  • 3 path leak fixes (scihub_fetch, pkb_doctor, settings.json)
  • 7 coverage gaps closed (previously missing from sync manifest)
  • --validate flag: auto-detects manifest gaps before every sync
  • JSON-escaped path sanitization for settings.json
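
The JSON-escaped sanitization item addresses a common pitfall: inside a JSON file a Windows path is stored with doubled backslashes, so a search for the raw path does not match. A minimal sketch of the idea follows. `sanitize_paths` and the `<PKB_ROOT>` placeholder are hypothetical, and the actual sync tool may work differently.

```python
import json


def sanitize_paths(text: str, private_root: str, placeholder: str = "<PKB_ROOT>") -> str:
    """Replace a private path in both its raw and JSON-escaped spellings."""
    variants = {
        private_root,
        # json.dumps doubles backslashes; drop the surrounding quotes.
        json.dumps(private_root)[1:-1],
        json.dumps(private_root, ensure_ascii=False)[1:-1],
    }
    # Replace longer spellings first so an escaped form is never half-matched.
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, placeholder)
    return text


root = "C:\\Users\\me\\pkb"                     # hypothetical private path
settings = json.dumps({"cwd": root + "\\tools"})  # stored with doubled backslashes
print(sanitize_paths(settings, root))
```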

📦 What's in the Box

  • 127 manifest mappings (120 → 127)
  • 41 files changed, +3,374/-357 lines
  • 12 built-in commands, 44+ expandable skills

v0.6.15-starter: Anti-Degradation + Global Bridge + Sync Hardening

Clockworkhg released this 24 Jun 14:39

v0.6.15-starter — Anti-Degradation + Global Bridge + Sync Pipeline Hardening

🛡️ Anti-Degradation System v1.0

  • 15 bug fixes in hook pipeline (03_post_tool_use + 05_stop), validated by E2E test suite (582 lines)
  • File-persistent write counter for index rebuild scheduling
  • 4-field frontmatter validation (created/updated/tags/type)
  • CRLF regex compatibility for Windows
  • Atomic file writes (tempfile + rename) for hot.md
  • UTF-8 BOM handling for Windows Notepad compatibility
  • Active topics regex merge (preserve existing)
  • Smart routing patterns expanded 11→26
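
The atomic-write and BOM items above follow a well-known pattern, sketched here under assumptions. The helper names `atomic_write_text` and `read_text_bom_safe` are hypothetical and are not the functions used in the hook pipeline.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text via a temp file plus rename so readers never see a partial file."""
    path = Path(path)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing target on both POSIX and Windows.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_text_bom_safe(path):
    """Read UTF-8 text, dropping a leading BOM and normalizing CRLF to LF."""
    raw = Path(path).read_bytes()
    return raw.decode("utf-8-sig").replace("\r\n", "\n")
```

Decoding with `utf-8-sig` accepts files with or without a BOM, which is what makes files saved from Windows Notepad safe to parse.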

🌐 Global Bridge Engine v1.0

  • pkb_bridge.py (1030 lines): cross-project /ask-pkb + /pkb-capture
  • 4-layer PKB_ROOT discovery: env var → auto-detect → config → user prompt
  • 9 security block patterns + 4 warning patterns
  • Graceful degradation with timeout + fallback
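
The four-layer discovery order can be read as a chain of fallbacks. The sketch below shows only that ordering. `discover_pkb_root` and its callback parameters are hypothetical, and how pkb_bridge.py actually auto-detects or reads its config is not described in these notes.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional

Layer = Callable[[], Optional[Path]]


def discover_pkb_root(
    env: Mapping[str, str],
    auto_detect: Layer,
    read_config: Layer,
    prompt_user: Layer,
) -> Optional[Path]:
    """Try env var, auto-detect, config, then user prompt; return the first valid directory."""
    layers = (
        lambda: Path(env["PKB_ROOT"]) if env.get("PKB_ROOT") else None,
        auto_detect,
        read_config,
        prompt_user,
    )
    for layer in layers:
        candidate = layer()
        # A layer that yields nothing, or a path that does not exist, falls through.
        if candidate is not None and candidate.is_dir():
            return candidate
    return None


# Usage with placeholder layers that find nothing:
root = discover_pkb_root(dict(os.environ), lambda: None, lambda: None, lambda: None)
```

Passing the layers in as callables keeps the ordering testable without touching the real environment or prompting anyone.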

🔧 Sync Pipeline Hardening

  • Critical: binary file handling — BINARY_EXTENSIONS detection with read_bytes/write_bytes
  • Critical: sanitize_patterns double-backslash bug fix — residual PKB paths now properly sanitized
  • Manifest: 120 mappings / 0 duplicates / 0 dead entries
  • errors='replace' for Windows GBK safety
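
A minimal sketch of the binary-versus-text split described above. The name `BINARY_EXTENSIONS` appears in the release notes, but its contents here are assumed, and `sync_file` is a hypothetical helper rather than the sync pipeline's real function.

```python
from pathlib import Path

# Named in the release notes; the actual membership of this set is an assumption.
BINARY_EXTENSIONS = {".png", ".jpg", ".pdf", ".zip", ".db"}


def sync_file(src, dst, transform=lambda text: text):
    """Copy one file, applying `transform` (e.g. sanitization) to text files only."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    if src.suffix.lower() in BINARY_EXTENSIONS:
        # Byte-for-byte copy: decoding a binary file as text would corrupt it.
        dst.write_bytes(src.read_bytes())
        return "binary"
    # errors="replace" turns undecodable bytes (e.g. GBK-encoded text) into
    # U+FFFD instead of raising UnicodeDecodeError.
    text = src.read_text(encoding="utf-8", errors="replace")
    dst.write_text(transform(text), encoding="utf-8")
    return "text"
```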

📁 Obsidian Vault Migration

  • Root .obsidian/ → wiki/.obsidian/ (single active vault)
  • Dataview plugin + pkb-colors CSS snippet + dashboard-dataview
  • Wiki templates (concept/paper/project/source)

📝 Documentation

  • All version badges unified to v0.6.15-starter
  • README skills count corrected: 7→3 built-in
  • Stale root-level CLAUDE.md removed
  • Optional tools properly qualified in docs

Full Changelog: https://github.com/Clockworkhg/pkb-starter/blob/master/CHANGELOG.md

PKB Starter v0.6.8-alpha — Scholarly Metadata Enrichment

Clockworkhg released this 13 Jun 12:16

🆕 What's New

  • Scholarly Literature Detection — automatic recognition via DOI, ISSN, arXiv, PMID, and structured metadata signals.

  • Crossref Metadata Enrichment — author, title, journal, year, volume, pages, publisher.

  • OpenAlex Work & Source Metrics (optional) — citation counts, open access status, source rankings.

  • Local Journal-Ranking Registry — user-imported CSSCI, PKU Core, AMI, CSCD, and custom lists. Journal matching via DOI-resolved ISSN, ISSN, EISSN, ISSN-L, normalized names, and fuzzy matching.

  • Citation Formatting — GB/T 7714 journal-article, APA 7 (citeproc-py), BibTeX, RIS, CSL-JSON export.

  • Batch Enrichment — dry-run, write, only-missing, JSONL, resumable jobs, locked-page protection.

  • Structured Literature Filtering — by ranking scheme, edition, level, year, journal, DOI, citation count, review status.
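
The journal-matching cascade (ISSN first, then normalized name, then fuzzy match) might look roughly like the sketch below. All function names, the registry shape, and the 0.9 similarity cutoff are assumptions for illustration, and the registry entry is made-up sample data.

```python
import difflib
import re
import unicodedata


def normalize_issn(value):
    """Return an ISSN as 'NNNN-NNNC', or None if it is malformed."""
    digits = re.sub(r"[^0-9Xx]", "", value or "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", digits):
        return None
    return f"{digits[:4]}-{digits[4:]}"


def normalize_name(name):
    """Case-fold, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name).casefold()
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def match_journal(registry, issn=None, name=None, cutoff=0.9):
    """Return (entry, method); method is 'issn', 'name', 'fuzzy', or 'none'."""
    wanted = normalize_issn(issn) if issn else None
    if wanted:
        for entry in registry:
            if wanted in {normalize_issn(i) for i in entry["issns"]}:
                return entry, "issn"
    if name:
        key = normalize_name(name)
        by_name = {normalize_name(e["name"]): e for e in registry}
        if key in by_name:
            return by_name[key], "name"
        close = difflib.get_close_matches(key, list(by_name), n=1, cutoff=cutoff)
        if close:
            return by_name[close[0]], "fuzzy"
    return None, "none"


# Made-up registry entry, for illustration only.
registry = [{"name": "Example Journal of Data Studies", "issns": ["1234-5679"]}]
print(match_journal(registry, issn="12345679")[1])
```

Trying exact identifiers before fuzzy names keeps false positives low, since similar journal titles are common.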

Usage

# Batch enrich existing literature
python tools/scholarly_enrich.py --scan wiki/ --write
python tools/scholarly_enrich.py --scan wiki/ --write --only-missing
python tools/scholarly_enrich.py --scan wiki/ --write --resume

# Filter literature
python tools/filter_literature.py --ranking CSSCI --year-from 2023 --min-citations 5

# Import journal rankings
python tools/import_journal_rankings.py import rankings.csv

⬆️ Upgrading

# Optional: install APA 7 formatting dependencies
pip install -r tools/requirements-scholarly.txt

# Optional: set OpenAlex API key for enhanced metrics (Windows cmd syntax;
# on macOS/Linux use: export OPENALEX_API_KEY=your_key_here)
set OPENALEX_API_KEY=your_key_here

# Update system files
python tools/pkb_update_client.py --apply

The core PKB workflow works without the optional dependencies.

🔒 Privacy & Security

  • Private PKB content is never synchronized to the public repository.
  • Imported journal-ranking datasets remain under .pkb_local/scholarly/ (gitignored).
  • Cache databases and resumable job state remain local.
  • API keys read only from environment variables.
  • Crossref/OpenAlex failures do not block ordinary /pkb ingestion.
  • No complete proprietary journal-ranking lists are bundled.

🧪 Test Results

| Suite | Tests | Result |
|---|---|---|
| Private PKB | 611 | ✅ passed |
| pkb-starter template | 568 | ✅ passed |
| Fresh install (scholarly + CLI) | 568 | ✅ passed |

Full Changelog: CHANGELOG.md

PKB Starter v0.6.7-alpha — MarkItDown Ingestion & Playwright Web Capture

Clockworkhg released this 13 Jun 05:26

🆕 What's New

MarkItDown Document Ingestion (Phase 1.5)

  • markitdown_convert.py — local document-to-Markdown pre-extraction engine (PDF, DOCX, PPTX, XLSX, XLS).
  • pkb_ingest.py — local file ingest orchestrator (import → MarkItDown → cache → wiki).
  • Runtime version detection via importlib.metadata.version("markitdown").
  • Fallback state machine: MarkItDown success → cache; failure → LLM direct read → _PENDING_CONVERSION.md.
  • Legacy .doc returns explicit legacy_doc_unsupported status (use Word/LibreOffice to convert).
  • Conversion cache in .pkb-cache/extractions/ (gitignored).
pip install -r tools/requirements-markitdown.txt
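
The runtime version check mentioned above uses the standard library's `importlib.metadata`. A minimal sketch, where `detect_version` is a hypothetical wrapper name:

```python
from importlib import metadata
from typing import Optional


def detect_version(package: str = "markitdown") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = detect_version()
if version is None:
    print("markitdown is not installed; the fallback path would be used")
else:
    print(f"markitdown {version} detected")
```

Querying the installed distribution avoids importing the package itself, so the check stays cheap and cannot fail on an import-time error.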

Web Pack v3.1 — Playwright Dynamic Content Fallback

  • content_quality.py — content quality scoring that decides when Playwright fallback is needed.
  • playwright_renderer.py — optional Playwright Chromium DOM rendering.
  • network_capture.py — XHR/Fetch network response extraction with sensitive URL sanitization.
  • network_content.py — network body candidate extraction with deduplication.
  • selection_engine.py — three-way selector: HTTP static → Playwright DOM → Playwright Network.

| Flag | Behavior |
|---|---|
| --render | Enables Playwright only when static quality is insufficient |
| --headed | Visible Chromium window (auto-enables --render) |
| --debug-network | Sanitized network diagnostics (no body/headers/cookies) |

pip install -r tools/requirements-playwright.txt
playwright install chromium
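
The three-way selection (HTTP static, then Playwright DOM, then Playwright Network) can be sketched as a quality-gated fallback. Everything here is hypothetical: `quality_score` is a stand-in based on text length, the 0.5 threshold is arbitrary, and the real content_quality.py and selection_engine.py may score and choose quite differently.

```python
def quality_score(text):
    """Stand-in score in [0, 1]: fraction of a 2000-character target reached."""
    return min(len(text.strip()) / 2000, 1.0)


def select_content(static_text, render_dom=None, capture_network=None, threshold=0.5):
    """Return (source, text), escalating only while quality stays below the threshold."""
    best = ("http_static", static_text)
    if quality_score(static_text) >= threshold:
        return best  # static result is good enough, so no browser work is triggered
    for source, fetch in (("playwright_dom", render_dom),
                          ("playwright_network", capture_network)):
        if fetch is None:
            continue  # this fallback is not enabled
        candidate = fetch()
        if quality_score(candidate) > quality_score(best[1]):
            best = (source, candidate)
        if quality_score(best[1]) >= threshold:
            break
    return best


# Usage with stub fetchers standing in for the Playwright steps:
source, text = select_content("too short", render_dom=lambda: "x" * 1500)
print(source)
```

Passing the fallbacks as callables means the expensive steps run lazily, which mirrors the note that Chromium is not launched unless the quality gate triggers.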

⬆️ Upgrading

python tools/pkb_update_client.py --apply

No breaking changes — static collection behavior unchanged. Chromium not launched unless quality gates trigger.

🔒 Privacy & Security

  • Default behavior unchanged — no browser automation unless explicitly requested.
  • Does NOT bypass login, CAPTCHA, or access controls.
  • PKB-dedicated browser profile (not user's daily Chrome).
  • Safe mode does not persist login state.
  • MarkItDown cache is gitignored — never committed.

🧪 Test Results

244 passed, 0 failed
  - 145 Web Pack unit tests
  -  10 Chromium integration tests
  -  89 MarkItDown regression tests

📋 Known Limitations

| Limitation | Note |
|---|---|
| Legacy .doc files | Not supported; convert to .docx with Word/LibreOffice first |
| OCR not enabled | Phase 2+: scanned PDF images are not text-extracted |
| Playwright is optional | pip install -r tools/requirements-playwright.txt if needed |
| App-only pages | May still be uncollectable even with Playwright |

Full Changelog: CHANGELOG.md