fix(detect): detect Office source edits and stop re-parsing unchanged files (#1649, #1656) by TPAteeq · Pull Request #1660 · Graphify-Labs/graphify

TPAteeq · 2026-07-04T18:35:08Z

Summary

#1649 and #1656 both live in convert_office_file and pull in opposite
directions, but a single change — a source-content fingerprint — resolves both.

- #1649 (Office edits never re-enter --update): the sidecar name was keyed on a
  hash of the source path, never its content, and an early-return on
  out_path.exists() (the #1226 no-churn fix) meant a modified .docx/.xlsx was
  never re-converted. The graph silently froze on the stale version, with no error.
- #1656 (re-parses every file every run): convert_office_file fully parsed the
  binary before that early-return, so Office files were re-parsed on every run even when
  unchanged; PDFs were likewise re-parsed by count_words → extract_pdf_text on every
  run, with no caching.

The single-fingerprint approach

convert_office_file now fingerprints the source by its raw bytes (md5).
Hashing the raw bytes does not unzip/parse the OOXML container, so it stays cheap —
that is the whole point. The fingerprint is recorded in the sidecar header:

<!-- converted from report.docx | source-md5: 3b1e… -->
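
As a rough illustration of the fingerprint step, the sketch below hashes a source file's raw bytes and formats a header line in the shape shown above. The helper names `source_fingerprint` and `sidecar_header` are hypothetical, and the chunked read is an assumption; the actual code inside convert_office_file may look different.

```python
import hashlib
from pathlib import Path


def source_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the md5 hex digest of the file's raw bytes.

    The file is read as an opaque byte stream; the OOXML zip container
    is never opened, so the cost is one sequential read.
    """
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_header(source_name: str, fingerprint: str) -> str:
    """Format the first line of the sidecar (hypothetical helper)."""
    return f"<!-- converted from {source_name} | source-md5: {fingerprint} -->"
```

Here md5 serves only as a cheap change detector, not as a security measure, which is why a fast non-cryptographic use of it is acceptable in this role.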

On a later run:

- Sidecar exists and the stored fingerprint matches the source → return it
  without parsing (satisfies #1656) and without rewriting (so an unchanged
  source never churns its mtime, preserving the #1226 guarantee).
- Sidecar missing, or the fingerprint differs (or is absent, as in a legacy sidecar) →
  re-parse and rewrite. The rewritten sidecar's new mtime/md5 then flows through
  detect_incremental automatically, so the edit re-enters --update (satisfies #1649).
  No other file needs to change for change-detection to pick it up.
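
The two cases above can be sketched as a single decision function. This is a minimal sketch under stated assumptions, not the PR's actual code: `convert_if_changed`, `stored_fingerprint`, and the `parse` callback are hypothetical names, and the header pattern is inferred from the example header line in this description.

```python
import hashlib
import re
from pathlib import Path
from typing import Callable, Optional

# Assumed header shape, taken from the example line in the description.
_FP_RE = re.compile(r" \| source-md5: ([0-9a-f]{32}) -->\s*$")


def stored_fingerprint(sidecar: Path) -> Optional[str]:
    """Read the fingerprint from the sidecar's first line, if present."""
    with sidecar.open("r", encoding="utf-8") as fh:
        first_line = fh.readline()
    match = _FP_RE.search(first_line)
    return match.group(1) if match else None


def convert_if_changed(source: Path, sidecar: Path,
                       parse: Callable[[Path], str]) -> bool:
    """Return True when the sidecar was (re)written, False when reused."""
    current = hashlib.md5(source.read_bytes()).hexdigest()
    if sidecar.exists() and stored_fingerprint(sidecar) == current:
        return False  # unchanged: no parse, no write, mtime untouched
    body = parse(source)  # sidecar missing, source edited, or legacy sidecar
    header = f"<!-- converted from {source.name} | source-md5: {current} -->"
    sidecar.write_text(f"{header}\n{body}", encoding="utf-8")
    return True
```

A legacy sidecar has no stored fingerprint, so `stored_fingerprint` returns `None`, the comparison fails, and the file is converted once more and then reused on later runs.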

This threads the needle that a naïve "always re-convert" would miss: it neither freezes
edits (#1649) nor re-parses unchanged sources (#1656/#1226).

PDF / word-count re-parse (#1656)

Word counts are now cached in the manifest entry and reused for unchanged files, so
PDFs are no longer re-parsed for their word count on every incremental run:

- save_manifest seeds/refreshes each entry's word_count, reusing the previous
  value whenever the content hash is unchanged and only (re)parsing a genuinely
  new/changed file (video files are skipped, mirroring detect()).
- detect() accepts the previous run's per-file counts and reuses the cached count for
  any file whose mtime is unchanged instead of re-parsing.
- detect_incremental loads the manifest first, hands those cached counts down to
  detect(), and proceeds as before.
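
To make the mtime-based reuse in detect() concrete, here is a small sketch. The cache shape (path mapped to an `(mtime, word_count)` pair) and the names `word_count_for` and `total_words` are assumptions for illustration; the real manifest layout and function signatures are not shown in this description.

```python
import os
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical cache shape: path -> (mtime, word_count) from the previous run.
CachedCounts = Dict[str, Tuple[float, int]]


def word_count_for(path: str, cached: Optional[CachedCounts],
                   count_words: Callable[[str], int]) -> int:
    """Reuse the cached count when the mtime matches, else recount."""
    mtime = os.path.getmtime(path)
    if cached is not None:
        hit = cached.get(path)
        if hit is not None and hit[0] == mtime:
            return hit[1]  # unchanged file: skip the expensive parse
    return count_words(path)


def total_words(paths: Iterable[str], cached: Optional[CachedCounts],
                count_words: Callable[[str], int]) -> int:
    """Sum cached counts for unchanged files and fresh counts for the rest."""
    return sum(word_count_for(p, cached, count_words) for p in paths)
```

Passing `cached=None` reproduces a run with no previous manifest: every file is counted from scratch.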

total_words stays correct in every path (unchanged files contribute their cached
count; changed files are recomputed), so the benchmark command's corpus_words
keeps working — including via the skill's --update flow, which propagates
total_words into .graphify_detect.json.

Backward compatible: legacy manifests without word_count load unchanged (the existing
_normalise_entry dict passthrough preserves the new field, and old entries simply
recompute once on the next run).
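
The hash-based reuse in save_manifest, including the legacy case, might look roughly like the sketch below. The entry layout (a dict with `hash` and `word_count` keys) and the name `refresh_entry` are hypothetical; the real manifest distinguishes several hash kinds that this sketch does not model.

```python
from typing import Callable, Dict, Optional

# Hypothetical manifest entry: {"hash": <content hash>, "word_count": <int>}
Entry = Dict[str, object]


def refresh_entry(path: str, content_hash: str, previous: Optional[Entry],
                  count_words: Callable[[str], int]) -> Entry:
    """Build the new manifest entry, reusing word_count when safe."""
    unchanged = previous is not None and previous.get("hash") == content_hash
    cached = previous.get("word_count") if unchanged else None
    if isinstance(cached, int):
        words = cached  # same content and a cached count: reuse it
    else:
        words = count_words(path)  # new, changed, or legacy entry: count once
    return {"hash": content_hash, "word_count": words}
```

A legacy entry has a matching hash but no `word_count`, so it falls through to a single recount and is then cached like any other entry.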

Tests

Added to tests/test_detect.py:

- Edited vs. unchanged .docx and .xlsx via convert_office_file: asserts the
  converter re-parses after an edit and is not re-invoked when the source is unchanged.
- Legacy (pre-fingerprint) sidecar is upgraded once, then reused.
- End-to-end: an edited Office source re-enters detect_incremental().new_files, while
  an unchanged one stays in unchanged_files.
- An unchanged PDF is not re-parsed (extract_pdf_text not called again) on a second
  incremental run.

Test results

uv run pytest tests/test_detect.py tests/test_incremental.py tests/test_office_limits.py tests/test_manifest_ingest.py
# 153 passed

Full suite is green except one pre-existing, environment-only failure
(test_extract.py::test_collect_files_skips_hidden) caused solely by running inside a
.claude/worktrees/… checkout — it fails identically on the base commit with this
branch's changes stashed and is unrelated to this diff.

…word counts (Graphify-Labs#1649, Graphify-Labs#1656)

convert_office_file now fingerprints the SOURCE by its raw bytes (md5, which does NOT unzip/parse the OOXML container, so it stays cheap) and records it in the sidecar header. On a later run it re-parses and rewrites the sidecar only when that fingerprint differs, so an edited .docx/.xlsx re-enters --update (Graphify-Labs#1649), while an unchanged one is never re-parsed and its mtime never churns (Graphify-Labs#1656, preserving the Graphify-Labs#1226 no-churn guarantee). The single fingerprint resolves both issues, which previously pulled in opposite directions (the Graphify-Labs#1226 early-return skipped the write but still parsed every run, and keyed on the source PATH, not its CONTENT, so edits were silently frozen).

Word counts are now cached in the manifest entry and reused for unchanged files, so PDFs are no longer re-parsed for their word count on every incremental run (Graphify-Labs#1656). save_manifest seeds/refreshes the count (reusing the previous value when the content hash is unchanged); detect() reuses it when the file's mtime is unchanged. total_words stays correct, so the benchmark command's corpus_words keeps working. Legacy manifests without word_count recompute once (backward compatible via the existing dict passthrough in _normalise_entry).

Tests: edited/unchanged .docx and .xlsx via convert_office_file; legacy sidecar upgrade; edited Office file re-entering detect_incremental; and an unchanged PDF not re-parsed on a second incremental run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aphify-Labs#1649, Graphify-Labs#1656)

Address code-review findings on the source-fingerprint / word-count-cache work without altering the reviewed-correct core design:

- Anchor the sidecar fingerprint regex to the trailing ` | source-md5: <fp> -->` delimiter/terminator so a source filename that itself contains a "source-md5: <hex>" substring can no longer be captured as the fingerprint (which would make the real fingerprint never match, so the file re-parsed + rewrote + re-queued on every run). Regression test with a pathological filename asserts an unchanged source is parsed once.
- Future-proof the save_manifest word_count cache: content_unchanged now keys off the prior hash of the matching kind, so a semantic-only manifest (ast_hash never populated) can actually reuse its cached count. The kind="both"/"ast" paths are unchanged (still key off ast_hash), so a real content change still recomputes.
- Add the missing CHANGELOG `## Unreleased` entry covering both issues.

Preserves the Graphify-Labs#1226 no-churn guarantee and one-time-double-parse behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
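
The regex-anchoring fix described in this commit can be illustrated with a small comparison. Both patterns below are assumptions written for this example (the PR's actual regex is not quoted here); they show why an unanchored search picks up a fingerprint-like substring in the filename while a pattern tied to the trailing terminator does not.

```python
import re

# Unanchored: grabs the first "source-md5: <hex>" anywhere in the line.
LOOSE = re.compile(r"source-md5: ([0-9a-f]+)")
# Anchored to the trailing " | source-md5: <fp> -->" terminator.
ANCHORED = re.compile(r" \| source-md5: ([0-9a-f]+) -->\s*$")

real_fp = "0123456789abcdef0123456789abcdef"
# Pathological source filename that embeds a fingerprint-like substring.
name = "notes source-md5: deadbeef.docx"
header = f"<!-- converted from {name} | source-md5: {real_fp} -->"

loose_fp = LOOSE.search(header).group(1)        # "deadbeef" (from the filename)
anchored_fp = ANCHORED.search(header).group(1)  # the real fingerprint
```

With the loose pattern the stored value never equals the real fingerprint, so the source would be treated as changed on every run, which is the churn the commit message describes.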

TPAteeq and others added 2 commits July 5, 2026 00:01

TPAteeq wants to merge 2 commits into Graphify-Labs:v8 from TPAteeq:fix/office-source-change-detection
