Bug Report — graphify --update performs a destructive, net-negative fuzzy node merge on an already-current code graph
Bug report written by Opus 4.8
- Date observed: 2026-06-07 (~14:05–14:15 local, GMT+2)
- graphify package: graphifyy, installed via uv tool (interpreter path redacted)
- Skill version: latest version as of the date observed
- Host: Windows 11 (win32), PowerShell 7
- AI model: Opus 4.8 xhigh | Claude version 2.1.165
- Project type: a code-only TypeScript/React (Electron/Vite-class) corpus
- Severity: High (silent data loss is prevented only by the to_json guard; if a user/agent passes force=True, 16% of the graph is destroyed)
- Status: Reproduced, root-caused, damage averted (graph left intact)
Note: all project-specific names, file paths, source-file names, symbol/identifier names, dependency names, and commit hashes have been replaced with generic placeholders (<PROJECT_ROOT>, <ClassA>, <functionA>(), <source-file>, <third-party-dep-N>, <COMMIT_HASH>, …).
TL;DR
Running the manual incremental update (/graphify --update) on a graph that the in-session auto-updater had already brought current caused the merge step to produce a graph with fewer nodes than it started with:
- Existing graph.json: 1515 nodes / 2435 edges
- After build_merge: 1272 nodes / 2000 edges
- Net result: −243 nodes, −435 edges, +0 new nodes
The 243 lost nodes were distinct, real symbols (third-party dependency nodes, a service class plus several of its methods, standalone functions, persistence modules, test fixtures), collapsed by build_merge's global fuzzy deduplication (Deduplicated 243 node(s): 4 exact, 234 fuzzy). The new AST extraction contributed zero genuinely new nodes — all 476 re-extracted node IDs already existed in the graph — so the merge had nothing to add and the fuzzy pass was the only operation with any effect, and that effect was purely subtractive.
The to_json safety guard (Refusing to overwrite … pass force=True to override) correctly blocked the overwrite, which is the only reason no data was lost. The bug is therefore latent: it is one force=True away from destroying 16% of the graph, and the skill docs do not warn about it.
Affected components
- graphifyy library: build_merge() runs a global fuzzy dedup across the entire merged graph with a similarity threshold that is too loose for source-code graphs.
- /graphify skill: references/update.md instructs the incremental flow to call build_merge([new_extraction], graph_path=..., prune_sources=deleted or None). It only prunes deleted files; it never prunes the re-extracted files, delegating reconciliation entirely to the fuzzy dedup.
The destructive behavior emerges from the interaction of the two, not from either in isolation, and only under a specific (but common) scenario described below.
Environment / preconditions that trigger it
All four conditions held simultaneously:
- Graph already current. An in-session auto-updater (graphify update ., run after each modification) had already extracted every changed file. Re-extraction therefore produced 0 new node IDs.
- Code-only corpus. All 28 "changed" files were .ts / .tsx / package.json. The skill's code-only fast path skipped semantic extraction (AST-only, 0 LLM tokens).
- Repeated short identifiers. A TypeScript/React codebase has many same-named symbols across files (e.g. build, target, icon, generic test helpers, similar store/method names).
- Weak node provenance in the fuzzy key. Affected nodes carry source_location values that are bare line numbers ("L11", "L33", "L207") with no file qualifier, so the fuzzy matcher cannot use location to keep same-named-but-different-file symbols apart.
Timeline — what happened, step by step
- User invoked /graphify --update on <PROJECT_ROOT>.
- detect_incremental(.) reported 28 changed code files, 0 deletions (21 .ts, 6 .tsx, 1 package.json).
- All 28 fell in the code category (0 docs/papers/images), so the skill took the AST-only code path: no subagents, no LLM.
- extract() over the 28 files produced 476 AST nodes / 988 edges.
- Per update.md, the flow backed up graph.json → .graphify_old.json, then ran:
    G = build_merge([new_extraction], graph_path='graphify-out/graph.json', prune_sources=None)
- build_merge printed: Deduplicated 243 node(s) (4 exact, 234 fuzzy) and returned 1272 nodes / 2000 edges.
- Step 4 of the skill called to_json(G, communities, 'graphify-out/graph.json'), which printed:
    WARNING: new graph has 1272 nodes but existing graph.json has 1515.
    Refusing to overwrite — you may be missing chunk files from a previous session.
    Pass force=True to override.
  → graph.json was not overwritten.
- Investigation (see Evidence) confirmed the reduction was destructive, not legitimate cleanup: 243 distinct real nodes dropped, 0 nodes added.
- The downstream agent declined to force the overwrite, kept graph.json intact (1515 nodes), regenerated the report from the intact graph (reusing existing community assignments + labels, no re-cluster), and saved the manifest. A re-run of detect_incremental then reported 0 changed files (manifest consistent).
Evidence (collected during the session)
Node/edge counts at each stage
| Artifact | Nodes | Edges |
|---|---|---|
| graph.json (existing, pre-merge) | 1515 | 2435 |
| .graphify_old.json (backup of the above) | 1515 | 2435 |
| .graphify_ast.json (fresh AST of 28 files) | 476 | 988 |
| .graphify_extract.json (merge output) | 1272 | 2000 |
Diff of existing graph vs merge output
Old: 1515 New: 1272
Dropped (in old, not in new): 243
Added (in new, not in old): 0
Dropped by type: {'?': 243}
Added by type: {}
Are the re-extracted nodes actually new?
Raw AST nodes: 476
AST node IDs already in old graph: 476
Genuinely NEW AST node IDs (not in old): 0
→ The re-extraction added nothing. Every fresh node already existed. The merge's only net effect was the fuzzy collapse.
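The two comparisons above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the graph files use a layout with a top-level "nodes" list whose entries carry an "id" field; that layout and the helper names (load_graph, node_ids, diff_node_ids, already_present) are assumptions made for illustration, not the library's documented schema or API.

```python
import json
from pathlib import Path


def load_graph(path):
    """Read a graph JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def node_ids(graph):
    """Collect node IDs (assumed layout: {"nodes": [{"id": ...}, ...]})."""
    return {node["id"] for node in graph.get("nodes", [])}


def diff_node_ids(old, new):
    """Report which node IDs were dropped from, or added to, `old` to obtain `new`."""
    old_ids, new_ids = node_ids(old), node_ids(new)
    return {"dropped": sorted(old_ids - new_ids), "added": sorted(new_ids - old_ids)}


def already_present(existing, fresh):
    """Count fresh-extraction node IDs that the existing graph already contains."""
    return len(node_ids(fresh) & node_ids(existing))


# Tiny in-memory stand-ins for graph.json, the merge output, and the fresh AST.
existing = {"nodes": [{"id": "a"}, {"id": "b"}, {"id": "c"}]}
merged = {"nodes": [{"id": "a"}, {"id": "c"}]}
fresh_ast = {"nodes": [{"id": "a"}, {"id": "b"}]}

delta = diff_node_ids(existing, merged)
print("Dropped:", len(delta["dropped"]), "Added:", len(delta["added"]))
print("Fresh IDs already in existing graph:", already_present(existing, fresh_ast))
```

Against the real artifacts, the in-memory dicts would be replaced by load_graph() calls on graph.json, .graphify_extract.json and .graphify_ast.json.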
Sample of dropped (lost) nodes — all distinct, real symbols (names redacted)
<build-target-node> | L11
<third-party-dep-1> | L21
<third-party-dep-2> | L25
<third-party-dep-3> | L30
<third-party-dep-4> | L90
<functionA>() | L33
<ClassA> | L207
<ClassA>.<methodA>() | L301
<ClassA>.<methodB>() | L323
<PersistenceModuleA> | L37
<PersistenceModuleB> | L27
<testFixtureA> | L17
<source-file>.ts | L1
These are not duplicates — they are separate third-party dependency nodes (declared in package.json), separate service-class methods, separate standalone functions, separate persistence modules, and test fixtures. Collapsing them loses real graph structure. (Note: ~25 of the 243 were sampled directly; the remainder are inferred to be of the same kind given the 0-added / all-IDs-pre-existing result.)
Post-recovery integrity check of the preserved graph.json
Nodes: 1515 | unique IDs: 1515
Edges: 2435
Orphan edges (pointing to non-existent nodes): 0
Distinct communities: 108
built_at_commit: <COMMIT_HASH>
The original graph has perfect referential integrity; it did not need "cleaning."
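A check of this kind can be sketched as follows. The field names are assumptions: edges are looked up under "links" or "edges", and each edge is assumed to carry "source" and "target" node IDs. The function name integrity_report is hypothetical and not part of graphifyy.

```python
def integrity_report(graph):
    """Summarize node uniqueness and dangling edges for a graph dict."""
    nodes = graph.get("nodes", [])
    # Assumption: edges live under "links" (node-link style) or "edges".
    edges = graph.get("links", graph.get("edges", []))
    ids = [node["id"] for node in nodes]
    known = set(ids)
    # An orphan edge points at a node ID that is not in the graph.
    orphans = [
        edge for edge in edges
        if edge["source"] not in known or edge["target"] not in known
    ]
    return {
        "nodes": len(ids),
        "unique_ids": len(known),
        "edges": len(edges),
        "orphan_edges": len(orphans),
    }


sample = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "links": [
        {"source": "a", "target": "b"},
        {"source": "a", "target": "missing"},  # dangling on purpose
    ],
}
print(integrity_report(sample))
```

A graph with referential integrity reports nodes == unique_ids and orphan_edges == 0, which is what the preserved graph.json showed.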
Root cause analysis
Why the count went down instead of up
On a normal --update, new/changed files introduce new nodes, so the merged count rises and build_merge's fuzzy dedup (whatever it collapses) is masked by the increase. Here the auto-updater had already ingested these files, so re-extraction produced 0 new nodes. With nothing to add, the fuzzy pass is the sole mutation — and it is purely subtractive.
Why fuzzy dedup over-merges this graph
build_merge applies a global fuzzy similarity pass over the whole node set. In a code graph:
- Many nodes share identical short labels across different files.
- The disambiguating field source_location is just a line number ("L33"), not file:line, so two different functions on similar lines in different files look "close."
- The fuzzy threshold appears to be tuned for prose/entity corpora (where near-duplicate surface forms usually are the same entity), not for code (where the same short identifier in file A and file B is genuinely two different symbols). The exact threshold has not been verified against the source.
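Result: the matcher merges across files and erases 234 distinct symbols through fuzzy matches, plus 4 through exact matches, along with assorted edge collapses. (The log line's breakdown, 4 + 234 = 238, does not itemize the remaining 5 of the 243 deduplicated nodes.)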
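The over-merge mechanism can be illustrated with a toy deduplicator. This is not build_merge's implementation: the matching key (label plus bare source_location), the 0.85 threshold, and the use of difflib.SequenceMatcher are all assumptions chosen only to show how a file-blind fuzzy key collapses distinct symbols, while an exact (source_file, label) key keeps them apart.

```python
from difflib import SequenceMatcher


def fuzzy_dedup(nodes, key, threshold=0.85):
    """Keep a node only if its key is not 'close' to an already-kept node's key."""
    kept = []
    for node in nodes:
        k = key(node)
        if any(SequenceMatcher(None, k, key(other)).ratio() >= threshold for other in kept):
            continue  # treated as a duplicate and dropped
        kept.append(node)
    return kept


def exact_dedup(nodes):
    """Keep one node per exact (source_file, label) pair."""
    seen = {}
    for node in nodes:
        seen.setdefault((node["source_file"], node["label"]), node)
    return list(seen.values())


def bare_key(node):
    """Hypothetical file-blind key: label plus bare line number."""
    return f'{node["label"]}|{node["source_location"]}'


# Four distinct symbols: same short names, different files (hypothetical data).
nodes = [
    {"label": "build", "source_file": "src/a.ts", "source_location": "L33"},
    {"label": "build", "source_file": "src/b.ts", "source_location": "L33"},
    {"label": "target", "source_file": "src/a.ts", "source_location": "L11"},
    {"label": "target", "source_file": "src/c.ts", "source_location": "L12"},
]

print("file-blind fuzzy key keeps:", len(fuzzy_dedup(nodes, bare_key)))
print("exact (source_file, label) keeps:", len(exact_dedup(nodes)))
```

With the file-blind key, the two build nodes compare as identical and the two target nodes differ by a single character, so half of the symbols disappear; the exact key keeps all four.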
Why the skill flow amplifies it
references/update.md calls:
build_merge([new_extraction], graph_path='graph.json', prune_sources=deleted or None)
For re-extracted files it does not prune the old nodes first; it expects the fuzzy dedup to reconcile old vs new copies of the same file's nodes. That is exactly the operation that, on a code graph, collapses unrelated same-named symbols. A source-scoped replace (prune all nodes whose source_file ∈ changed-files, then add the fresh AST nodes) would be deterministic and lossless and would not need fuzzy matching at all.
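The source-scoped replace described above can be sketched on plain dicts. This is an illustration under stated assumptions, not graphifyy code: nodes and edges are assumed to carry a "source_file" field, edges are assumed to use "source"/"target" node IDs, and the function name source_scoped_replace is hypothetical.

```python
def source_scoped_replace(graph, fresh, changed_files):
    """Replace everything owned by `changed_files` with the fresh extraction.

    No similarity matching is involved: nodes are keyed by exact ID only.
    """
    changed = set(changed_files)
    # 1. Drop old nodes that belong to a re-extracted file.
    nodes = {n["id"]: n for n in graph["nodes"] if n.get("source_file") not in changed}
    # 2. Insert the fresh nodes (a fresh node wins on an ID clash).
    nodes.update({n["id"]: n for n in fresh["nodes"]})
    # 3. Same for edges, then discard any edge whose endpoint no longer exists.
    edges = [e for e in graph["edges"] if e.get("source_file") not in changed]
    edges.extend(fresh["edges"])
    edges = [e for e in edges if e["source"] in nodes and e["target"] in nodes]
    return {"nodes": list(nodes.values()), "edges": edges}


graph = {
    "nodes": [
        {"id": "a.build", "label": "build", "source_file": "a.ts"},
        {"id": "b.build", "label": "build", "source_file": "b.ts"},
    ],
    "edges": [
        {"source": "b.build", "target": "a.build", "source_file": "b.ts"},
        {"source": "a.build", "target": "b.build", "source_file": "a.ts"},
    ],
}
# Re-extraction of a.ts yields the same node and edge again.
fresh = {
    "nodes": [{"id": "a.build", "label": "build", "source_file": "a.ts"}],
    "edges": [{"source": "a.build", "target": "b.build", "source_file": "a.ts"}],
}

updated = source_scoped_replace(graph, fresh, ["a.ts"])
print(len(updated["nodes"]), len(updated["edges"]))
```

Re-extracting an unchanged file leaves the counts stable, and the same-named build symbol in b.ts is never compared against the one in a.ts.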
Why the in-session auto-updater does not show this (inferred)
The in-session graphify update . appears to do a per-file, source-scoped replace (remove the changed file's old nodes, insert its new nodes), so counts stay stable and nothing cross-file is collapsed. The skill's --update path instead routes through build_merge + global fuzzy dedup. The two paths diverge in their merge strategy, which is why an agent updating after every change stays safe, while a manual --update risks the shrink. (This divergence is inferred from observed behavior; it has not been confirmed against graphify update's source.)
Impact
- Latent silent data loss. With force=True (or any caller that ignores the warning), 243/1515 = ~16% of nodes and 435/2435 = ~18% of edges are destroyed in a single run.
- Erosion across repeated manual updates. Each forced --update can shave the graph further, since the fuzzy pass is re-applied to an already-deduped graph.
- Misleading guard message. The warning attributes the shrink to "missing chunk files from a previous session," which is the wrong diagnosis here (it was code-only, no chunks). A user could be misled into forcing the overwrite believing the shrink is benign.
- Docs gap. Maintainers reportedly recommend running a manual --update every few sessions, with no mention of this near-destructive edge case.
Steps to reproduce
- Build a graph for a code-only TypeScript/React (or similarly identifier-dense) project: /graphify .
- Let the in-session auto-updater (graphify update .) bring the graph current after edits (so re-extraction would yield 0 new node IDs).
- Run /graphify --update (or directly: build_merge([fresh_ast_extraction], graph_path='graph.json', prune_sources=None)).
- Observe Deduplicated N node(s) … fuzzy and a merged node count lower than the original, with to_json refusing to overwrite.
- Diff old vs merged: added == 0, dropped == N, dropped entries are distinct real symbols.
Suggested fixes (for graphify maintainers)
- Source-scoped replace in the --update path. In references/update.md, pass the re-extracted files (not only deleted ones) to prune_sources, so build_merge removes their old nodes and inserts the fresh AST deterministically, with no fuzzy matching of code symbols.
- Disable cross-file fuzzy merging for code nodes. Gate fuzzy dedup on node type/source; never fuzzy-merge two AST symbols from different source_files. Require exact (source_file, label) equality for code.
- Qualify the fuzzy key with the file. Use source_file:source_location (e.g. <source-file>.ts:L33) instead of bare L33 so same-named symbols in different files are never "close."
- Make a 0-new-node update a no-op. If the fresh extraction contributes no new node IDs and there are no deletions, skip build_merge entirely (the graph is already current) instead of running a destructive dedup.
- Never let a merge produce a net node loss without explicit, accurate justification. The to_json guard is good; improve its message to say "merge produced fewer nodes than the source graph (−N from fuzzy dedup)" rather than blaming missing chunk files, and have --update treat the guard as a hard stop (not a "pass force=True" nudge).
- Document the edge case. Warn that manual --update on an already-current, code-dense graph can shrink it, and recommend a clean full rebuild (/graphify .) as the safe refresh for such projects.
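Two of the suggestions above (the 0-new-node no-op and the hard-stop guard) can be sketched as small pre- and post-merge checks. Both function names (plan_update, assert_no_net_loss) and the error wording are hypothetical; nothing here is existing graphifyy API, and the checks operate on plain ID sets and counts.

```python
def plan_update(existing_ids, fresh_ids, deleted_files):
    """Decide whether a merge is needed at all.

    Returns "skip" when the fresh extraction adds no node IDs and nothing was deleted.
    """
    new_ids = set(fresh_ids) - set(existing_ids)
    if not new_ids and not deleted_files:
        return "skip"
    return "merge"


def assert_no_net_loss(old_count, new_count, deleted_files):
    """Hard stop: refuse a merge result that shrank the graph without any deletions."""
    if new_count < old_count and not deleted_files:
        lost = old_count - new_count
        raise RuntimeError(
            f"merge produced fewer nodes than the source graph (-{lost}); "
            "no files were deleted, so the loss comes from deduplication"
        )


# Scenario from this report: every fresh ID already exists, no deletions.
print(plan_update({"a", "b", "c"}, {"a", "b"}, deleted_files=[]))

try:
    assert_no_net_loss(old_count=1515, new_count=1272, deleted_files=[])
except RuntimeError as err:
    print("blocked:", err)
```

The first check would have skipped the merge in this session; the second would have stopped it with a message that names the actual cause instead of missing chunk files.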
Workaround (current, for downstream users)
- On code-only projects where the in-session updater already keeps the graph fresh, do not run /graphify --update manually; it can only shrink or stall.
- To refresh deliberately, do a clean full rebuild (/graphify .): it builds without merging against the existing graph, so there is no fuzzy collapse.
- Never pass force=True to to_json to silence the "refusing to overwrite" warning when the node count has dropped.
Confirmed vs inferred
- Confirmed (observed in this session): existing graph 1515/2435; merge output 1272/2000; Deduplicated 243 (4 exact, 234 fuzzy); 0 nodes added; 243 distinct nodes dropped (sampled); to_json refused to overwrite; original graph has 0 orphan edges. The downstream agent ran the skill's documented build_merge call verbatim; the destructive mechanism is inside the library, not in the execution.
- Inferred (not verified against source): that the in-session graphify update uses a source-scoped replace while the skill's --update uses global fuzzy dedup; the exact fuzzy threshold and matching key in build_merge. These can be confirmed by reading build_merge() and the graphify update CLI implementation in the installed package.