fix(detect): detect Office source edits and stop re-parsing unchanged files (#1649, #1656)#1660
Open
TPAteeq wants to merge 2 commits into
Open
fix(detect): detect Office source edits and stop re-parsing unchanged files (#1649, #1656)#1660TPAteeq wants to merge 2 commits into
TPAteeq wants to merge 2 commits into
Conversation
…word counts (Graphify-Labs#1649, Graphify-Labs#1656) convert_office_file now fingerprints the SOURCE by its raw bytes (md5, which does NOT unzip/parse the OOXML container, so it stays cheap) and records it in the sidecar header. On a later run it re-parses and rewrites the sidecar only when that fingerprint differs — so an edited .docx/.xlsx re-enters --update (Graphify-Labs#1649), while an unchanged one is never re-parsed and its mtime never churns (Graphify-Labs#1656, preserving the Graphify-Labs#1226 no-churn guarantee). The single fingerprint resolves both issues, which previously pulled in opposite directions (the Graphify-Labs#1226 early-return skipped the write but still parsed every run, and keyed on the source PATH, not its CONTENT, so edits were silently frozen). Word counts are now cached in the manifest entry and reused for unchanged files, so PDFs are no longer re-parsed for their word count on every incremental run (Graphify-Labs#1656). save_manifest seeds/refreshes the count (reusing the previous value when the content hash is unchanged); detect() reuses it when the file's mtime is unchanged. total_words stays correct, so the benchmark command's corpus_words keeps working. Legacy manifests without word_count recompute once (backward compatible via the existing dict passthrough in _normalise_entry). Tests: edited/unchanged .docx and .xlsx via convert_office_file; legacy sidecar upgrade; edited Office file re-entering detect_incremental; and an unchanged PDF not re-parsed on a second incremental run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aphify-Labs#1649, Graphify-Labs#1656) Address code-review findings on the source-fingerprint / word-count-cache work without altering the reviewed-correct core design: - Anchor the sidecar fingerprint regex to the trailing ` | source-md5: <fp> -->` delimiter/terminator so a source filename that itself contains a "source-md5: <hex>" substring can no longer be captured as the fingerprint (which would make the real fingerprint never match, so the file re-parsed + rewrote + re-queued on every run). Regression test with a pathological filename asserts an unchanged source is parsed once. - Future-proof the save_manifest word_count cache: content_unchanged now keys off the prior hash of the matching kind, so a semantic-only manifest (ast_hash never populated) can actually reuse its cached count. The kind="both"/"ast" paths are unchanged (still key off ast_hash), so a real content change still recomputes. - Add the missing CHANGELOG `## Unreleased` entry covering both issues. Preserves the Graphify-Labs#1226 no-churn guarantee and one-time-double-parse behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1649
Fixes #1656
Summary
#1649and#1656both live inconvert_office_fileand pull in oppositedirections, but a single change — a source-content fingerprint — resolves both.
--update): the sidecar name was keyed on ahash of the source path, never its content, and an early-return on
out_path.exists()(the detect() creates duplicate converted .md files on macOS due to NFC/NFD Unicode path normalization inconsistency, causing --update to re-extract all Office files on every run #1226 no-churn fix) meant a modified.docx/.xlsxwasnever re-converted. The graph silently froze on the stale version, with no error.
convert_office_filefully parsed thebinary before that early-return, so Office files were re-parsed on every run even when
unchanged; PDFs were likewise re-parsed by
count_words→extract_pdf_texton everyrun, with no caching.
The single-fingerprint approach
convert_office_filenow fingerprints the source by its raw bytes (md5).Hashing the raw bytes does not unzip/parse the OOXML container, so it stays cheap —
that is the whole point. The fingerprint is recorded in the sidecar header:
On a later run:
without parsing (satisfies Perf: cache word_count in manifest — detect_incremental re-parses every PDF/docx on each run #1656) and without rewriting (so an unchanged
source never churns its mtime, preserving the detect() creates duplicate converted .md files on macOS due to NFC/NFD Unicode path normalization inconsistency, causing --update to re-extract all Office files on every run #1226 guarantee).
re-parse and rewrite. The rewritten sidecar's new mtime/md5 then flows through
detect_incrementalautomatically, so the edit re-enters--update(satisfies Office files (.docx/.xlsx): modified sources never re-enter --update (sidecar keyed on path hash, source content never checked) #1649).No other file needs to change for change-detection to pick it up.
This threads the needle that a naïve "always re-convert" would miss: it neither freezes
edits (#1649) nor re-parses unchanged sources (#1656/#1226).
PDF / word-count re-parse (#1656)
Word counts are now cached in the manifest entry and reused for unchanged files, so
PDFs are no longer re-parsed for their word count on every incremental run:
save_manifestseeds/refreshes each entry'sword_count, reusing the previousvalue whenever the content hash is unchanged and only (re)parsing a genuinely
new/changed file (video files are skipped, mirroring
detect()).detect()accepts the previous run's per-file counts and reuses the cached count forany file whose mtime is unchanged instead of re-parsing.
detect_incrementalloads the manifest first, hands those cached counts down todetect(), and proceeds as before.total_wordsstays correct in every path (unchanged files contribute their cachedcount; changed files are recomputed), so the
benchmarkcommand'scorpus_wordskeeps working — including via the skill's
--updateflow, which propagatestotal_wordsinto.graphify_detect.json.Backward compatible: legacy manifests without
word_countload unchanged (the existing_normalise_entrydict passthrough preserves the new field, and old entries simplyrecompute once on the next run).
Tests
Added to
tests/test_detect.py:.docxand.xlsxviaconvert_office_file— asserts theconverter re-parses after an edit and is not re-invoked when the source is unchanged.
detect_incremental().new_files, whilean unchanged one stays in
unchanged_files.extract_pdf_textnot called again) on a secondincremental run.
Test results
Full suite is green except one pre-existing, environment-only failure
(
test_extract.py::test_collect_files_skips_hidden) caused solely by running inside a.claude/worktrees/…checkout — it fails identically on the base commit with thisbranch's changes stashed and is unrelated to this diff.