Releases · Clockworkhg/pkb-starter


PKB Starter v0.8.0 — MinerU Phase 2 + OCR + Global Pipeline

Latest

Clockworkhg released this 03 Jul 09:08

v0.8.0-starter

88be277

What's New

🔥 MinerU Phase 2 — PDF OCR + Layout Analysis

High-quality PDF extraction: layout analysis, LaTeX formula recognition, 84-language OCR
Automatic routing: PDF → MinerU (Phase 2), DOCX/PPTX/XLSX → MarkItDown
New tool: tools/mineru_extract.py
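
The routing rule above can be pictured as a lookup on the file extension. This is only a sketch: `choose_extractor` and the returned labels are hypothetical names for illustration, not the interface of `tools/mineru_extract.py`.

```python
from pathlib import Path

# Extensions the release notes say are routed to MarkItDown.
MARKITDOWN_EXTENSIONS = {".docx", ".pptx", ".xlsx"}


def choose_extractor(path: str) -> str:
    """Return a label for the extractor a file would be routed to (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "mineru"
    if suffix in MARKITDOWN_EXTENSIONS:
        return "markitdown"
    # Anything else is outside the routing rule described above.
    return "unsupported"
```

For example, `choose_extractor("thesis.PDF")` yields `"mineru"` because the suffix is compared case-insensitively.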

📋 OUTPUT.md — Output Format Specification

Formal specification for PKB output formats (replaces inapplicable DESIGN.md)
Agent report format, query response format, wiki page format

🔗 Obsidian MCP — Native Integration

.mcp.json now includes Obsidian MCP server config
Direct Obsidian vault access from Claude Code

🛠 Core Tools (NEW)

mineru_extract.py — PDF extraction engine
cnki_setup.py — CNKI infrastructure diagnostics
chrome_mcp_scraper.py — Chrome MCP data processor
runtime_detect.py — Multi-platform runtime detection

⚡ System Upgrades

CLAUDE.md rewrite: Skill routing table, tool catalog, hooks reference, L1-L5 query tiers
Hybrid search (/ask): BM25 + vector RRF + Cross-encoder pipeline
Plan-as-Contract (/pkb Step 0): auto/manual ingest mode with artifact logging
Smart hot.md: Weighted composite scoring (type boost + size penalty + diversity)
Structured routing: [ROUTE] PRIMARY/ALSO format in hooks
Index health: /lint now checks index freshness and coverage
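
The hybrid search item above names reciprocal rank fusion (RRF) as the step that merges BM25 and vector results. The sketch below shows the general technique only; the function name is hypothetical, `k = 60` is a commonly used default rather than a value taken from PKB, and the cross-encoder reranking stage is left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]      # illustrative BM25 ranking
vector_hits = ["b", "c", "a"]    # illustrative vector ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

In a pipeline like the one described, the fused list would typically be truncated and handed to the cross-encoder for final ordering.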

🛡 Manifest Hardening

3 path leak fixes (scihub_fetch, pkb_doctor, settings.json)
7 coverage gaps closed (previously missing from sync manifest)
--validate flag: auto-detects manifest gaps before every sync
JSON-escaped path sanitization for settings.json
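
Why JSON files need special handling: a Windows path stored in JSON has every backslash doubled, so a sanitizer that searches only for the raw path will miss it. The sketch below illustrates the idea under stated assumptions; `sanitize_paths` and the `<PKB_ROOT>` placeholder are hypothetical and not the project's actual sanitizer.

```python
import json


def sanitize_paths(text: str, private_root: str, placeholder: str = "<PKB_ROOT>") -> str:
    """Replace a private root path in both its raw and JSON-escaped forms."""
    # json.dumps produces the escaped spelling; strip the surrounding quotes.
    escaped_root = json.dumps(private_root, ensure_ascii=False)[1:-1]
    for needle in (escaped_root, private_root):
        text = text.replace(needle, placeholder)
    return text


root = r"C:\Users\me\pkb"  # illustrative private path
settings_text = json.dumps({"hook": root + r"\tools\check.py"})
cleaned = sanitize_paths(settings_text, root)
```

The cleaned text is still valid JSON, so it can be parsed again after sanitization.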

📦 What's in the Box

127 manifest mappings (120 → 127)
41 files changed, +3,374/-357 lines
12 built-in commands, 44+ expandable skills


v0.6.15-starter: Anti-Degradation + Global Bridge + Sync Hardening

Clockworkhg released this 24 Jun 14:39

v0.6.15-starter

f833da9

v0.6.15-starter — Anti-Degradation + Global Bridge + Sync Pipeline Hardening

🛡️ Anti-Degradation System v1.0

15 bug fixes in hook pipeline (03_post_tool_use + 05_stop), validated by E2E test suite (582 lines)
File-persistent write counter for index rebuild scheduling
4-field frontmatter validation (created/updated/tags/type)
CRLF regex compatibility for Windows
Atomic file writes (tempfile + rename) for hot.md
UTF-8 BOM handling for Windows Notepad compatibility
Active topics regex merge (preserve existing)
Smart routing patterns expanded 11→26
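
The atomic-write and BOM items above follow a well-known pattern: write to a temporary file in the same directory, then rename it over the target, and decode with `utf-8-sig` when reading. A minimal sketch, assuming hypothetical helper names (the hook pipeline's real functions are not shown in these notes):

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text: str, encoding: str = "utf-8") -> None:
    """Write text so readers see either the old file or the new one, never a partial file."""
    target = Path(path)
    # The temporary file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), prefix=target.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding, newline="") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites an existing target on both POSIX and Windows.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_text_without_bom(path) -> str:
    """Read UTF-8 text, dropping a leading BOM if an editor such as Notepad added one."""
    return Path(path).read_text(encoding="utf-8-sig")
```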

🌐 Global Bridge Engine v1.0

pkb_bridge.py (1030 lines): cross-project /ask-pkb + /pkb-capture
4-layer PKB_ROOT discovery: env var → auto-detect → config → user prompt
9 security block patterns + 4 warning patterns
Graceful degradation with timeout + fallback
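
The four discovery layers listed above can be read as a chain in which the first layer that yields a path wins. The sketch below is an illustration, not `pkb_bridge.py` itself: the function name, the injected callables, and the assumption that the environment variable is literally named `PKB_ROOT` are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional

Layer = Callable[[], Optional[Path]]


def discover_pkb_root(
    env: Mapping[str, str],
    auto_detect: Layer,
    read_config: Layer,
    prompt_user: Layer,
) -> Optional[Path]:
    """Try env var, auto-detect, config, then user prompt; return the first hit."""
    env_value = env.get("PKB_ROOT")  # assumed variable name
    if env_value:
        return Path(env_value)
    # Later layers are called only if every earlier layer returned nothing.
    for layer in (auto_detect, read_config, prompt_user):
        candidate = layer()
        if candidate is not None:
            return candidate
    return None
```

Passing the layers in as callables keeps the ordering testable without touching the real environment, and means the user is prompted only as a last resort.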

🔧 Sync Pipeline Hardening

Critical: binary file handling — BINARY_EXTENSIONS detection with read_bytes/write_bytes
Critical: sanitize_patterns double-backslash bug fix — residual PKB paths now properly sanitized
Manifest: 120 mappings / 0 duplicates / 0 dead entries
errors='replace' added in 3 places for Windows GBK safety
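
The binary-handling fix above comes down to copying bytes untouched for known binary extensions and sanitizing only text files. A minimal sketch, assuming a hypothetical `sync_file` helper and an illustrative subset of `BINARY_EXTENSIONS` (the real list is not given in these notes):

```python
from pathlib import Path
from typing import Callable

# Illustrative subset only; the project's actual BINARY_EXTENSIONS may differ.
BINARY_EXTENSIONS = {".png", ".jpg", ".pdf", ".zip"}


def sync_file(src, dst, sanitize: Callable[[str], str]) -> str:
    """Copy one file, sanitizing text but leaving binary content byte-for-byte intact."""
    src_path, dst_path = Path(src), Path(dst)
    if src_path.suffix.lower() in BINARY_EXTENSIONS:
        dst_path.write_bytes(src_path.read_bytes())
        return "binary"
    # errors="replace" avoids a crash on bytes that are not valid UTF-8.
    text = src_path.read_text(encoding="utf-8", errors="replace")
    dst_path.write_text(sanitize(text), encoding="utf-8")
    return "text"
```

Decoding a binary file as text and writing it back would corrupt it, which is why the extension check comes first.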

📁 Obsidian Vault Migration

Root .obsidian/ → wiki/.obsidian/ (single active vault)
Dataview plugin + pkb-colors CSS snippet + dashboard-dataview
Wiki templates (concept/paper/project/source)

📝 Documentation

All version badges unified to v0.6.15-starter
README skills count corrected: 7→3 built-in
Stale root-level CLAUDE.md removed
Optional tools properly qualified in docs

Full Changelog: https://github.com/Clockworkhg/pkb-starter/blob/master/CHANGELOG.md


PKB Starter v0.6.8-alpha — Scholarly Metadata Enrichment

Pre-release

Clockworkhg released this 13 Jun 12:16

v0.6.8-alpha

3952891

🆕 What's New

Scholarly Literature Detection — automatic recognition via DOI, ISSN, arXiv, PMID, and structured metadata signals.
Crossref Metadata Enrichment — author, title, journal, year, volume, pages, publisher.
OpenAlex Work & Source Metrics (optional) — citation counts, open access status, source rankings.
Local Journal-Ranking Registry — user-imported CSSCI, PKU Core, AMI, CSCD, and custom lists. Journal matching via DOI-resolved ISSN, ISSN, EISSN, ISSN-L, normalized names, and fuzzy matching.
Citation Formatting — GB/T 7714 journal-article, APA 7 (citeproc-py), BibTeX, RIS, CSL-JSON export.
Batch Enrichment — dry-run, write, only-missing, JSONL, resumable jobs, locked-page protection.
Structured Literature Filtering — by ranking scheme, edition, level, year, journal, DOI, citation count, review status.
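
Journal matching by ISSN, as described above, generally needs the identifier in one canonical form first. The sketch below normalizes an ISSN and verifies its standard check digit (weights 8 down to 2, modulo 11, with `X` standing for 10). The function name is hypothetical and this is not the registry's actual matching code.

```python
import re
from typing import Optional


def normalize_issn(raw: str) -> Optional[str]:
    """Return the ISSN as 'NNNN-NNNC', or None if it is malformed or fails the checksum."""
    compact = re.sub(r"[\s\-]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    # Weighted sum of the first seven digits with weights 8, 7, ..., 2.
    total = sum(int(digit) * weight for digit, weight in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    if compact[7] != expected:
        return None
    return f"{compact[:4]}-{compact[4:]}"
```

Rejecting identifiers with a bad check digit before lookup keeps typos from silently falling through to fuzzy name matching.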

Usage

# Batch enrich existing literature
python tools/scholarly_enrich.py --scan wiki/ --write
python tools/scholarly_enrich.py --scan wiki/ --write --only-missing
python tools/scholarly_enrich.py --scan wiki/ --write --resume

# Filter literature
python tools/filter_literature.py --ranking CSSCI --year-from 2023 --min-citations 5

# Import journal rankings
python tools/import_journal_rankings.py import rankings.csv

⬆️ Upgrading

# Optional: install APA 7 formatting dependencies
pip install -r tools/requirements-scholarly.txt

# Optional: set OpenAlex API key for enhanced metrics (Windows cmd syntax; on macOS/Linux use: export OPENALEX_API_KEY=your_key_here)
set OPENALEX_API_KEY=your_key_here

# Update system files
python tools/pkb_update_client.py --apply

The core PKB workflow works without the optional dependencies.

🔒 Privacy & Security

Private PKB content is never synchronized to the public repository.
Imported journal-ranking datasets remain under .pkb_local/scholarly/ (gitignored).
Cache databases and resumable job state remain local.
API keys are read only from environment variables.
Crossref/OpenAlex failures do not block ordinary /pkb ingestion.
No complete proprietary journal-ranking lists are bundled.

🧪 Test Results

| Suite | Tests | Result |
| --- | --- | --- |
| Private PKB | 611 | ✅ passed |
| pkb-starter template | 568 | ✅ passed |
| Fresh install (scholarly + CLI) | 568 | ✅ passed |

Full Changelog: CHANGELOG.md


PKB Starter v0.6.7-alpha — MarkItDown Ingestion & Playwright Web Capture

Pre-release

Clockworkhg released this 13 Jun 05:26

v0.6.7-alpha

74ba1f6

🆕 What's New

MarkItDown Document Ingestion (Phase 1.5)

markitdown_convert.py — local document-to-Markdown pre-extraction engine (PDF, DOCX, PPTX, XLSX, XLS).
pkb_ingest.py — local file ingest orchestrator (import → MarkItDown → cache → wiki).
Runtime version detection via importlib.metadata.version("markitdown").
Fallback state machine: MarkItDown success → cache; failure → LLM direct read → _PENDING_CONVERSION.md.
Legacy .doc returns explicit legacy_doc_unsupported status (use Word/LibreOffice to convert).
Conversion cache in .pkb-cache/extractions/ (gitignored).
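
The runtime version detection mentioned above relies on the standard-library call `importlib.metadata.version`, which raises `PackageNotFoundError` when a distribution is not installed. A minimal sketch of wrapping it so a missing optional dependency is reported rather than raised (the wrapper name is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def detect_package_version(name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


markitdown_version = detect_package_version("markitdown")
if markitdown_version is None:
    print("markitdown is not installed; conversion would need a fallback")
else:
    print(f"markitdown {markitdown_version} detected")
```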

pip install -r tools/requirements-markitdown.txt

Web Pack v3.1 — Playwright Dynamic Content Fallback

content_quality.py — content quality scoring that decides when Playwright fallback is needed.
playwright_renderer.py — optional Playwright Chromium DOM rendering.
network_capture.py — XHR/Fetch network response extraction with sensitive URL sanitization.
network_content.py — network body candidate extraction with deduplication.
selection_engine.py — three-way selector: HTTP static → Playwright DOM → Playwright Network.

| Flag | Behavior |
| --- | --- |
| `--render` | Enables Playwright only when static quality is insufficient |
| `--headed` | Visible Chromium window (auto-enables `--render`) |
| `--debug-network` | Sanitized network diagnostics (no body/headers/cookies) |
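
The three-way selection described above (HTTP static, then Playwright DOM, then Playwright Network, gated by a quality score) can be sketched as a loop that stops at the first source whose content is good enough. Everything here is an assumption for illustration: the function name, the scoring callable, and the threshold are hypothetical and not taken from `selection_engine.py` or `content_quality.py`.

```python
from typing import Callable, Optional, Sequence, Tuple

# A source is a label plus a zero-argument callable that fetches its content.
Source = Tuple[str, Callable[[], Optional[str]]]


def select_content(
    sources: Sequence[Source],
    score: Callable[[str], float],
    threshold: float,
) -> Optional[Tuple[str, str]]:
    """Return (label, text) from the first source meeting the threshold, else the best seen."""
    best: Optional[Tuple[float, str, str]] = None
    for label, fetch in sources:
        text = fetch()
        if text is None:
            continue
        quality = score(text)
        if quality >= threshold:
            # Good enough: later, more expensive sources are never invoked.
            return label, text
        if best is None or quality > best[0]:
            best = (quality, label, text)
    return (best[1], best[2]) if best is not None else None
```

Ordering the sources from cheapest to most expensive means a browser is started only when the static result scores below the threshold.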

pip install -r tools/requirements-playwright.txt
playwright install chromium

⬆️ Upgrading

python tools/pkb_update_client.py --apply

No breaking changes: static collection behavior is unchanged. Chromium is not launched unless the quality gates trigger.

🔒 Privacy & Security

Default behavior unchanged — no browser automation unless explicitly requested.
Does NOT bypass login, CAPTCHA, or access controls.
PKB-dedicated browser profile (not user's daily Chrome).
Safe mode does not persist login state.
MarkItDown cache is gitignored — never committed.

🧪 Test Results

244 passed, 0 failed
  - 145 Web Pack unit tests
  -  10 Chromium integration tests
  -  89 MarkItDown regression tests

📋 Known Limitations

| Limitation | Note |
| --- | --- |
| Legacy `.doc` files | Not supported; convert to `.docx` with Word/LibreOffice first |
| OCR not enabled | Planned for Phase 2+; scanned PDF images are not text-extracted |
| Playwright is optional | `pip install -r tools/requirements-playwright.txt` if needed |
| App-only pages | May still be uncollectable even with Playwright |

Full Changelog: CHANGELOG.md

