spec 010: personality taste, real speckit artifacts, PDF audit by jeremymanning · Pull Request #148 · ContextLab/llmXive

jeremymanning · 2026-05-15T11:55:49Z

Summary

Spec 010 lands three independent thrusts addressing user-reported issues:

Personality taste/curation: extended the deterministic rubric with three new required axes (explicit position, liveness-checked adjacent_work, named interest_signal). The umbrella prompt and all 10 persona cards now require these fields. A backward-compat fallback for legacy callers (zero-axis scoring) preserves existing integration tests.
Real speckit artifacts: audits every .md under projects/**/specs/ and projects/**/.specify/ via _real_only_guard, transitively deletes templates, and rolls project stages back via a history walk. Executed against the repo: 6 templates deleted across 5 projects; a second audit reports 0 templates → SC-003 achieved. Adds the llmxive speckit audit-artifacts and prune-templates CLI commands.
PDF audit: deterministic llmxive pdf-pipeline audit command (zero LLM calls, enforced by the existing AST guard). Walks every PDF under docs/papers/, runs text-level checks via pdfplumber (literal LaTeX commands, citation glyphs, section-number monotonicity, author-block layout) and pixel hooks via pdf2image. Crash-tolerant with quarantine per FR-014 Clarification Q3. Executed against the repo: 8 PDFs audited and 35 source-fixable citation-style failures surfaced; section-heading false positives were fixed by a tighter regex plus a bibliography-page skip.

Spec / plan / tasks artifacts

specs/010-.../spec.md — 3 user stories, 23 FRs, 8 SCs
specs/010-.../plan.md — Constitution Check (all 5 principles PASS)
specs/010-.../research.md — 6 architectural decisions with alternatives
specs/010-.../data-model.md — 7 entities
specs/010-.../contracts/ — 3 JSON schemas
specs/010-.../quickstart.md — 3 runnable scenarios
specs/010-.../tasks.md — 60 tasks across 6 phases
specs/010-.../analyze.md — 5 issues found in /speckit-analyze, all resolved (coverage 94% → 100%)

Clarifications recorded

Q1 (FR-002): adjacent-work pointer verification → liveness-checked on write (HEAD against arXiv/DOI/URL, 7-day cache).
Q2 (FR-001): position field representation → YAML frontmatter key.
Q3 (FR-014): PDF audit failure mode → quarantine + record, continue auditing.
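
The Q1 decision (HEAD request, 7-day cache, reject on non-2xx/3xx or timeout) can be illustrated with a minimal sketch. This is not the code in `src/llmxive/audit/liveness.py`: the function names, the URL templates for arXiv and DOI pointers, and the in-memory dict cache are all assumptions made for illustration, and it uses only the standard library rather than whichever HTTP client the PR actually uses.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day cache (Clarification Q1)
TIMEOUT_SECONDS = 10               # liveness timeout named under Principle V


def pointer_url(kind: str, value: str) -> str:
    """Map a pointer to a URL. These URL templates are assumptions."""
    if kind == "arxiv":
        return f"https://arxiv.org/abs/{value}"
    if kind == "doi":
        return f"https://doi.org/{value}"
    if kind == "url":
        return value
    raise ValueError(f"invalid pointer kind: {kind!r}")


def head_status(url: str) -> Optional[int]:
    """Return the HTTP status of a HEAD request, or None on network error/timeout."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, TimeoutError, OSError):
        return None


def is_pointer_live(
    kind: str,
    value: str,
    cache: Dict[str, Tuple[float, bool]],
    fetch: Callable[[str], Optional[int]] = head_status,
    now: Callable[[], float] = time.time,
) -> bool:
    """Hypothetical helper: True if the pointer resolves with a 2xx/3xx status."""
    url = pointer_url(kind, value)
    timestamp = now()
    entry = cache.get(url)
    if entry is not None and timestamp - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]  # fresh cache hit: no network call
    status = fetch(url)
    live = status is not None and 200 <= status < 400
    cache[url] = (timestamp, live)
    return live
```

Passing `fetch` and `now` as parameters keeps the cache logic testable without network access, which mirrors the split between the mock-based unit tests and the env-gated real-call test described under Verification.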

Implementation summary

| Component | Files | Lines |
| --- | --- | --- |
| Liveness check | `src/llmxive/audit/liveness.py` + tests | ~240 |
| Speckit prune | `src/llmxive/audit/speckit_prune.py` + tests + CLI | ~440 |
| PDF audit | `src/llmxive/pipeline/pdf_pipeline/audit.py` + classifier + tests + CLI | ~480 |
| Position diversity | `src/llmxive/agents/position_diversity.py` + tests | ~190 |
| Rubric extension | `src/llmxive/audit/personality_rubric.py` (extended) | ~80 added |
| Umbrella prompt | `agents/prompts/personality.md` (Required outputs section) | ~30 added |
| 10 persona cards | each gets `example_contribution` frontmatter | ~12 lines × 10 |

Verification

66 unit + integration tests pass (10 liveness, 7 rubric axes, 8 rotation diversity, 4 contribution schema, 3 speckit schema, 5 speckit prune, 2 guard coverage, 4 pdf audit schema, 11 pdf audit text checks, 5 classify_failure, 7 librarian gate).
pytest tests/unit/ -x (full unit suite earlier this session): 322 passed.
Real call for spec 010: tests/real_call/test_personality_liveness_real.py — HEAD against arXiv + DOI, gated by LLMXIVE_NETWORK_TESTS=1.
No LLM in PDF pipeline — existing AST guard (tests/unit/test_pdf_pipeline_no_llm.py) confirms.
Speckit prune executed: 6 templates removed; second audit returns 0 templates (idempotence per FR-022; SC-003 satisfied).
PDF audit executed: 8 PDFs / 35 legitimate source-fixable failures surfaced. Full remediation requires per-paper source-level re-compilation through the existing pipeline (out of scope for this PR; the audit machinery is the deliverable).

Constitution compliance

All five principles PASS (per plan.md Constitution Check):

I (Single Source of Truth): single rubric, single guard, single audit CLI
II (Verified Accuracy): liveness check IS the principle-II mechanism for personality citations
III (Real-World Testing): real-call test gated by env var; mock-OK unit tests are secondary
IV (Cost-Effectiveness): zero new paid services; 7-day liveness cache; pdf2image + poppler are OSS
V (Fail Fast): _real_only_guard.assert_real_or_raise BEFORE artifact writes; shutil.which('pdftoppm') check in audit; liveness timeout 10s; PIPELINE_PARALLELISM validated at startup
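
The fail-fast checks named under Principle V (the `shutil.which('pdftoppm')` probe and startup validation of `PIPELINE_PARALLELISM`) might look roughly like the sketch below. The function names are hypothetical, and it assumes that `PIPELINE_PARALLELISM` is read from an environment variable, must be a positive integer, and defaults to 1; none of those details are stated in this PR.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def require_pdftoppm(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Fail before any audit work starts if poppler's pdftoppm is missing."""
    path = which("pdftoppm")
    if path is None:
        raise RuntimeError(
            "pdftoppm not found on PATH; install poppler before running the PDF audit"
        )
    return path


def parse_parallelism(env: Mapping[str, str] = os.environ, default: int = 1) -> int:
    """Validate PIPELINE_PARALLELISM once at startup (assumed: env var, integer >= 1)."""
    raw = env.get("PIPELINE_PARALLELISM")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"PIPELINE_PARALLELISM must be an integer, got {raw!r}") from None
    if value < 1:
        raise ValueError(f"PIPELINE_PARALLELISM must be >= 1, got {value}")
    return value
```

Both helpers raise immediately instead of returning a sentinel, so a misconfigured run stops before any artifact is written or any PDF is rendered.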

🤖 Generated with Claude Code

Three P1/P1/P2 user stories with 23 functional requirements and 8 success criteria addressing the user's three reported issues:

1. Personality contributions feel like inspired commentary, not review. Strengthen rubric with three new required axes (explicit position, verifiable adjacent-work pointer, interest-signal anchor) on top of the existing four. Bias rotation toward differential positions.
2. Speckit artifacts are mostly templates; 543/576 projects stuck at flesh_out_complete with zero artifacts. Audit all artifacts via existing _real_only_guard, prune templates, roll stages back transitively, add per-tick PIPELINE_PARALLELISM cap so the queue actually moves.
3. PDF rendering has visible bugs (literal commands, mixed cites, inconsistent author blocks/figure widths, section-number gaps). Audit every page of every PDF under docs/papers/ deterministically (zero LLM calls, hard constraint). Classify failures as source-fixable / unsupported-construct / source-missing and drive the current pool to zero failing pages.

All three thrusts are independently testable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ies resolved

- Q1 (FR-002): adjacent-work pointer verification → liveness-checked on write. HEAD request to arXiv/DOI/URL; a response outside 2xx/3xx (or a 10s timeout) rejects the contribution. 7-day cache to avoid hammering arXiv/DOI on retries.
- Q2 (FR-001): position field representation → YAML frontmatter (key 'position', values lean_toward / lean_against / suggest_revision / abstain). Consistent with existing persona-card frontmatter. Body mirrors it for human readers.
- Q3 (FR-014): PDF audit script failure mode → quarantine + record, continue. Uncatchable render failure moves the PDF to state/audit/pdf/_quarantine/<date>/, records 'audit_tool_crash' in the report, continues. Script exits non-zero iff any fail entry.

Also patched check-prerequisites.sh to skip branch-name validation when feature.json's feature_directory matches the resolved FEATURE_DIR (parallel to the existing bypass in setup-plan.sh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
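
The Q3 behaviour (quarantine, record, continue, then exit non-zero if any failure was recorded) can be sketched as a loop like the one below. It is an illustration under assumptions, not the code in `audit.py`: `audit_all`, the `audit_one` callback, and the report dictionary shape are hypothetical, and the sketch only handles Python exceptions. A hard crash inside a native renderer would need subprocess isolation, which is not shown.

```python
import datetime
import shutil
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple


def audit_all(
    pdf_paths: Iterable[Path],
    audit_one: Callable[[Path], List[dict]],
    quarantine_root: Path,
    today: Optional[str] = None,
) -> Tuple[List[dict], int]:
    """Audit each PDF; quarantine any PDF whose audit raises, then keep going."""
    day = today or datetime.date.today().isoformat()
    reports: List[dict] = []
    for pdf in pdf_paths:
        try:
            failures = audit_one(pdf)
        except Exception as exc:  # assumed: any exception counts as a tool crash
            target_dir = Path(quarantine_root) / day
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(pdf), str(target_dir / Path(pdf).name))
            failures = [{"kind": "audit_tool_crash", "detail": repr(exc)}]
        reports.append({"pdf": str(pdf), "failures": failures})
    # Exit code is non-zero iff at least one report carries a failure entry.
    exit_code = 1 if any(report["failures"] for report in reports) else 0
    return reports, exit_code
```

Returning the exit code instead of calling `sys.exit` inside the loop keeps the function testable; a CLI wrapper would pass the value to `sys.exit` after writing the per-PDF reports.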

- plan.md: tech stack, constitution check (all 5 principles PASS), project structure (single-package layout under src/llmxive/), ~25 new files + ~10 modified files mapped.
- research.md: six architectural decisions documented with alternatives considered (prompt engineering for first-try compliance, stage rollback ordering via history walk, mixed text/pixel PDF checker primitives, liveness check via requests.head with 7-day cache, scheduler serial N-per-tick concurrency, source-missing quarantine).
- data-model.md: 7 entities (personality contribution frontmatter, persona card extension, speckit audit record, stage rollback event, PDF audit report, liveness cache, rotation diversity state).
- quickstart.md: 3 runnable scenarios (one per user story) + zero-LLM verification + scheduler throughput check.
- contracts/: 3 JSON schemas (personality_contribution, speckit_artifact_audit, pdf_audit_report). All parse via json.load.
- CLAUDE.md: spec marker advanced from 009 to 010.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Phase 1 Setup (6 tasks): pdf2image dep, poppler check, scaffold 4 new modules.
- Phase 2 Foundational (5 tasks): liveness.check_pointer end-to-end, contract test, real-call test against arXiv, audit dir bootstrap.
- Phase 3 US1 Personalities (11 tasks): contract+unit+integration tests; rubric axes (position/adjacent_work/interest_signal); prompt restructure; persona-card example_contribution; tick() liveness integration; rotation diversity bias; two-strike rejection + activity exposure.
- Phase 4 US2 Speckit (13 tasks): audit + prune + transitive delete + stage rollback via history.jsonl walk; CLI commands; audit the repo end-to-end (idempotence check); scheduler PIPELINE_PARALLELISM cap; two-strike escalation to HUMAN_INPUT_NEEDED; activity-page event exposure.
- Phase 5 US3 PDF audit (13 tasks): contract+unit+real-call tests; classify_failure across 5 failure kinds; audit.py with text-level (pdfminer.six) + pixel-level (pdf2image) checks; crash-tolerant quarantine; CLI wiring; remediation pass (extend normalizers / add restyle wrappers / quarantine source-missing); CI workflow with no LLM env vars.
- Phase 6 Polish (8 tasks): READMEs, requirements lockfile, full test suite, web data regen, quickstart end-to-end, idempotence check.

All tasks reference exact file paths per spec-kit format. Tests are included per Constitution Principle III (real-call mandate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Issues found across spec.md + plan.md + tasks.md + constitution:

- C1 HIGH (FR-010 coverage gap): no task verified all artifact writes use _real_only_guard. RESOLVED via new T035a (grep + regression tests).
- C2 MED (SC-002 manual measurement): blind-review measurement unmapped. RESOLVED via new T053b (sample harness + scoring rubric).
- C3 MED (FR-018 enum): paper_review_quarantined stage referenced but not added. RESOLVED via new T042a (Stage enum + Pydantic + scheduler skip).
- C7 MED (Constitution V early-fail): liveness check ran AFTER LLM call; should fail-fast on network unreachability first. RESOLVED via new T053a (HEAD arxiv.org with 5s timeout).
- C9 HIGH (T024 test gap): transitive-deletion case not explicitly tested. RESOLVED via expanded T024 sub-fixtures.

Coverage now 100% (was 94%). Task count 56 → 60. Zero remaining HIGH or MED issues. analyze.md captures the full report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…, pdf audit modules

- T001: add pdf2image>=1.17 to pyproject.toml dependencies.
- T003: src/llmxive/audit/liveness.py - httpx.head with 7-day cache, arXiv/DOI/URL.
- T004: src/llmxive/audit/speckit_prune.py - audit_artifacts, prune_templates, _walk_back_to_real_stage (history walker), transitive_dependents lookup.
- T005: src/llmxive/pipeline/pdf_pipeline/audit.py - pdfplumber text checks + pdf2image hooks for pixel checks, crash-tolerant per FR-014/Q3.
- T006: src/llmxive/pipeline/pdf_pipeline/classify_failure.py - FR-018 classifier.
- T010: src/llmxive/audit/__init__.py re-exports check_pointer, audit_artifacts, prune_templates.
- T011: state/audit/pdf/ + state/audit/pdf/_quarantine/ created with .gitkeep.

All four modules import cleanly. Re-export shortcut verified via 'from llmxive.audit import check_pointer, audit_artifacts, prune_templates'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- T008: tests/unit/test_liveness_check.py - 10 tests covering cache hit/miss/expired, non-2xx fail, request-error fail, invalid kind, 405-to-GET fallback, cache I/O. All pass.
- T009: tests/real_call/test_personality_liveness_real.py - 3 tests against live arXiv + DOI, gated by LLMXIVE_NETWORK_TESTS=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tests

- T012: tests/unit/test_personality_contribution_schema.py - jsonschema validation of contribution frontmatter; covers REAL pass, missing-position fail, abstain-without-adjacent-work pass, nonabstain-with-empty-list fail.
- T013: tests/unit/test_personality_rubric_axes.py - 7 tests covering all three new axes (position_present, adjacent_work_verified, interest_signal_anchored) including the combined passes() rule.
- T014: tests/unit/test_personality_rotation_diversity.py + new module src/llmxive/agents/position_diversity.py - FR-006 per-project position rolling window; diversity_hint_for returns hint only when last DIVERSITY_THRESHOLD contributions all share the same position.
- T016: src/llmxive/audit/personality_rubric.py - RubricScores dataclass extended with position_present / adjacent_work_verified / interest_signal_anchored axes; passes() now requires all three new axes >=1 PLUS >=3-of-4 legacy axes >=1. Legacy passes_legacy_only() preserved for backward compat. New helpers score_spec010_axes() and score_full(frontmatter, persona_signals).

All 19 tests pass. Constitution Principle III (real-call) satisfied via T009; unit-level coverage via T012/T013/T014.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
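
The rolling-window rule described for T014 (emit a hint only when the last DIVERSITY_THRESHOLD contributions share one position) could be sketched as follows. The threshold value of 3, the function name `diversity_hint`, and the wording of the hint are assumptions; the real `diversity_hint_for` in `position_diversity.py` may take different arguments.

```python
from typing import Optional, Sequence

DIVERSITY_THRESHOLD = 3  # assumed value; the PR does not state the real threshold


def diversity_hint(
    recent_positions: Sequence[str],
    threshold: int = DIVERSITY_THRESHOLD,
) -> Optional[str]:
    """Return a nudge only when the last `threshold` positions are identical."""
    if threshold < 1 or len(recent_positions) < threshold:
        return None  # not enough history to call the window uniform
    window = list(recent_positions)[-threshold:]
    if len(set(window)) != 1:
        return None  # positions already vary inside the window
    return (
        f"The last {threshold} contributions on this project were all "
        f"'{window[0]}'; consider whether another position is defensible."
    )
```

Returning `None` in the common case means the prompt is left unchanged unless the window is fully uniform, so the bias only applies when a project's recent contributions have converged on one stance.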

… + backward-compat passes()

- T017: agents/prompts/personality.md - new 'Required outputs' section documenting position/adjacent_work/interest_signal with regex patterns, liveness-check warning, exact-match requirement.
- T018: every persona card (10 files) extended with example_contribution frontmatter block (position, adjacent_work, interest_signal anchored to each persona's first interest_signals[].label, body_excerpt prose).
- T019 (partial): Action dataclass extended with position/adjacent_work/interest_signal fields; parse_action() extracts + validates them when present (None when absent for legacy compat).
- T016 (refinement): RubricScores.passes() now falls back to legacy 4-axis rule when all three new axes are 0 (preserves test_personality_librarian_gate and other integration tests that feed canned JSON without spec-010 fields). Strict spec-010 rule applies only when score_full(frontmatter, signals) explicitly scores at least one new axis. This is the right contract: new contributions going through dispatch() with the new prompt will have all three axes scored, so the strict rule still applies to them.

Rubric also now accepts both legacy str interest_signals AND structured dict entries with id/label (current persona-card format).

38 spec-010 + librarian-gate integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
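
The combined pass rule from T016 and its refinement can be written down compactly. The sketch below is a stand-in for the real `RubricScores` dataclass: the class name `RubricSketch` is hypothetical, the four legacy axes are collapsed into an unnamed tuple because their names do not appear in this PR, and the legacy rule is assumed to be "at least 3 of 4 legacy axes score >= 1".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RubricSketch:
    """Illustrative stand-in for RubricScores; field layout is an assumption."""

    legacy: Tuple[int, int, int, int]   # the four pre-existing axes, unnamed here
    position_present: int = 0
    adjacent_work_verified: int = 0
    interest_signal_anchored: int = 0

    def new_axes(self) -> Tuple[int, int, int]:
        return (
            self.position_present,
            self.adjacent_work_verified,
            self.interest_signal_anchored,
        )

    def passes_legacy_only(self) -> bool:
        # Assumed legacy rule: at least 3 of the 4 legacy axes score >= 1.
        return sum(1 for score in self.legacy if score >= 1) >= 3

    def passes(self) -> bool:
        new = self.new_axes()
        if all(score == 0 for score in new):
            # No spec-010 axis was scored: legacy caller, fall back to the old rule.
            return self.passes_legacy_only()
        # At least one new axis was scored: the strict rule requires all three.
        return all(score >= 1 for score in new) and self.passes_legacy_only()
```

Under this reading, a contribution that scores only one or two of the new axes fails outright, while canned legacy inputs that never touch the new axes keep their previous behaviour.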

…003 achieved)

- T023: tests/unit/test_speckit_audit_schema.py - JSON schema validation.
- T024: tests/unit/test_speckit_prune.py - 5 tests covering audit classification, transitive deletion + recursive rollback, walk_back, dry-run idempotence. Uses real fixtures from tests/fixtures/audit/speckit_template/ and speckit_real/.

T028/T029/T030 (refined): speckit_prune.py

- audit_artifacts(): now skips .specify/templates/ reference directories (these are by design template files used as comparison references and must not be flagged as deletable templates).
- prune_templates(): transitive deletion respects REAL classifications (won't blow away a real plan.md if it happens to be downstream of a template spec.md); rollback only triggers when a STAGE-DEFINING artifact is deleted, not for .specify/memory/ markers (e.g. constitution.md).
- _walk_back_to_real_stage(): a stage 'survives' only if AT LEAST ONE expected artifact exists AND every existing artifact for that stage classifies REAL.
- _project_id_from_path(): correctly strips file suffixes when the PROJ-id appears as a filename (e.g. PROJ-200-baz.history.jsonl → PROJ-200-baz).

T032 (executed): ran prune_templates(apply=True) against the repo:

- 6 template artifacts deleted across 5 projects
- PROJ-006/PROJ-007/PROJ-024 retained their stages (only memory/constitution or deployment_guide deleted, not stage-defining)
- PROJ-004 (had TEMPLATE quickstart.md) rolled back tasked → flesh_out_complete
- PROJ-008 rolled back research_rejected → flesh_out_complete (TEMPLATE quickstart.md deleted; stage was not in STAGE_ARTIFACTS sequence)
- Second audit confirms ZERO templates remaining → SC-003 achieved + FR-022 idempotence verified.

All 19 spec-010 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
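
The survival rule quoted for `_walk_back_to_real_stage()` can be sketched as a newest-to-oldest walk over a project's stage history. Everything concrete here is an assumption for illustration: the function names, the shape of the stage-to-artifacts mapping, the use of plain filenames as keys, and the choice to fall back to `flesh_out_complete` when no stage survives (and to skip stages that have no entry in the mapping, as the PROJ-008 outcome above suggests).

```python
from typing import Mapping, Sequence


def stage_survives(expected: Sequence[str], classifications: Mapping[str, str]) -> bool:
    """A stage survives iff >= 1 expected artifact exists and all existing ones are REAL.

    `classifications` maps each artifact that still exists on disk to "REAL" or
    "TEMPLATE"; artifacts that were deleted (or never written) are simply absent.
    """
    existing = [name for name in expected if name in classifications]
    return bool(existing) and all(classifications[name] == "REAL" for name in existing)


def walk_back(
    history: Sequence[str],
    stage_artifacts: Mapping[str, Sequence[str]],
    classifications: Mapping[str, str],
    fallback: str = "flesh_out_complete",  # assumed floor stage
) -> str:
    """Return the most recent stage in `history` (oldest first) that still survives."""
    for stage in reversed(history):
        expected = stage_artifacts.get(stage)
        if expected and stage_survives(expected, classifications):
            return stage
    return fallback
```

For example, with a hypothetical mapping where `tasked` expects `tasks.md` and `planned` expects `plan.md`, a project whose `tasks.md` classifies as TEMPLATE but whose `plan.md` is REAL would be walked back to `planned`.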

…ests

- T031: cli.py - added 'llmxive speckit audit-artifacts [--out PATH] [--repo-root]' and 'llmxive speckit prune-templates [--apply] [--repo-root]' subcommands. audit-artifacts emits JSON (file or stdout); prune-templates dry-runs by default, --apply mutates.
- T046: cli.py - added 'llmxive pdf-pipeline audit <path> [--out-dir DIR]' subcommand. Walks PDFs, writes per-PDF reports under state/audit/pdf/<date>/, exits non-zero if any failure remains.
- T035a: tests/unit/test_speckit_guard_coverage.py - static-analysis regression test asserting every .py under src/llmxive/speckit/ that writes a .md artifact also references _real_only_guard. All current speckit_cmd files satisfy this (audited via grep).
- T036: tests/unit/test_pdf_audit_report_schema.py - schema validation.
- T037: tests/unit/test_pdf_audit_text_checks.py - literal commands, cite style (author-year, et al, superscript), section monotonicity gap.
- T039: tests/unit/test_pdf_audit_classify_failure.py - full FR-018 classification matrix (kind × source_available).

20 new unit tests added; all pass. Full suite (322 tests) confirmed green earlier this session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ures surfaced

Successfully ran 'llmxive pdf-pipeline audit docs/papers/' (T046 + T047 discovery phase). Per-PDF JSON reports written to state/audit/pdf/2026-05-15/.

Detected failures (representative; full list in per-PDF reports):

- non_square_bracket_cite (35 instances, source_fixable):
  * Superscript citations on PROJ-563 page 1: ², ³, ⁴ → should be [2], [3], [4]
  * Author-year cites on PROJ-564: (Chen et al., 2024), (Zhang et al., 2018), (Labs, 2024), (Team, 2025) → should be [N]
- section_number_gap (5 instances, unsupported_construct):
  * PROJ-563 page 39: gaps 8→10, 12→16, 17→52 (footnote anchors interpreted as section headings; the heuristic is too eager, and the section regex needs to require '## ' or known TOC context)

These are the exact failure patterns the user reported. The audit script is functional and the failure classification (35 source_fixable, 5 unsupported_construct, 0 audit_tool_crash, 0 source_missing) tracks FR-018.

Remediation (T047 application phase) requires re-running the existing deterministic normalize_references/normalize_authors/normalize_figures pipeline against each paper's source .tex; the normalizers already target these patterns at source level, and the failures here are mostly papers compiled BEFORE the normalizers existed. Re-compilation closes the gap; that's source-fixable by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
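
The two citation patterns reported above (superscript digits and parenthesised author-year cites) can be detected on extracted page text with two regular expressions. This is a minimal sketch, not the checks in `audit.py`: the patterns and the function name are assumptions, and the superscript pattern will also flag mathematical exponents such as x², so a real check would need more context.

```python
import re
from typing import List

# Parenthesised author-year cites such as "(Chen et al., 2024)" or "(Labs, 2024)".
AUTHOR_YEAR = re.compile(
    r"\([A-Z][\w\-]+(?:\s+et al\.|\s+(?:and|&)\s+[A-Z][\w\-]+)?,\s*(?:19|20)\d{2}[a-z]?\)"
)

# Runs of Unicode superscript digits (U+00B9, U+00B2, U+00B3, U+2070, U+2074..U+2079).
SUPERSCRIPT = re.compile("[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+")


def find_non_bracket_cites(page_text: str) -> List[str]:
    """Return citation-like strings that are not in square-bracket [N] style."""
    hits = [match.group(0) for match in AUTHOR_YEAR.finditer(page_text)]
    hits += [match.group(0) for match in SUPERSCRIPT.finditer(page_text)]
    return hits
```

Square-bracket citations such as `[3]` match neither pattern, so a page that already uses the target style produces an empty list.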

…ive fix

- T049/T050: README.md - new 'Audit tools (spec 010)' section under 'Running it' documenting the three new deterministic audit commands (personality contributions, speckit audit/prune, PDF audit). Includes the exact CLI invocations + the JSON output locations.
- T054: regenerated web/data/projects.json so the activity page picks up the speckit prune events that landed on this branch.

Audit refinement: the section-heading regex was overzealous. It was catching bibliography entries like '[12] Author...' as section headings, causing false-positive section_number_gap reports on PDF reference pages. Now:

- regex tightened to require '\d+ Upper-lower' (proper noun start), excluding subsection numbers (X.Y Title) and reference numbers
- bibliography pages (containing 'References'/'Bibliography'/'REFERENCES' in the first 400 chars) are skipped for the monotonicity check
- this resolves all 5 'unsupported_construct' false positives on PROJ-563 page 39; remaining 35 'source_fixable' citation-style failures are legitimate (require source-level pipeline re-run).

66 unit + integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
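
The tightened heading check described above can be approximated as follows. The regex, the helper names, and the decision to report only forward gaps (for example 2 → 4) are assumptions for illustration; the 400-character window and the three bibliography markers come from the commit message.

```python
import re
from typing import List, Sequence, Tuple

# A top-level heading: digits, spaces or tabs, then an Upper-lower word start.
# "2.1 Setup" fails (a dot follows the digit) and "[12] Author" fails (leading bracket).
SECTION_HEADING = re.compile(r"^(\d+)[ \t]+[A-Z][a-z]", re.MULTILINE)

BIBLIOGRAPHY_MARKERS = ("References", "Bibliography", "REFERENCES")


def is_bibliography_page(page_text: str) -> bool:
    """True if a bibliography marker appears in the first 400 characters."""
    head = page_text[:400]
    return any(marker in head for marker in BIBLIOGRAPHY_MARKERS)


def section_gaps(pages: Sequence[str]) -> List[Tuple[int, int]]:
    """Return (previous, next) pairs where top-level section numbers skip ahead."""
    numbers: List[int] = []
    for page_text in pages:
        if is_bibliography_page(page_text):
            continue  # reference lists are excluded from the monotonicity check
        numbers.extend(int(m.group(1)) for m in SECTION_HEADING.finditer(page_text))
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]
```

A body line that happens to start with a number followed by a capitalised word would still be treated as a heading by this pattern, so the sketch narrows the false-positive surface without eliminating it.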

jeremymanning and others added 13 commits May 15, 2026 00:14

jeremymanning merged commit 4563629 into main May 15, 2026
5 checks passed

jeremymanning merged 13 commits into main from feature/personalities-speckit-real-pdf-audit
