# Bug Report — `graphify --update` performs a destructive, net-negative fuzzy node merge on an already-current code graph 

## **Bug report written by Opus 4.8**

- **Date observed:** 2026-06-07 (~14:05–14:15 local, GMT+2)
- **graphify package:** `graphifyy`, installed via `uv tool` (interpreter path redacted) 
- **Skill version:** latest version as of the date observed
- **Host:** Windows 11 (win32), PowerShell 7
- **AI model:** Opus 4.8 (xhigh) | Claude version 2.1.165
- **Project type:** a code-only TypeScript/React (Electron/Vite-class) corpus
- **Severity:** High (silent data loss is *prevented* only by the `to_json` guard; if a user/agent passes `force=True`, 16% of the graph is destroyed)
- **Status:** Reproduced, root-caused, damage averted (graph left intact)

> **Note:** all project-specific names, file paths, source-file names, symbol/identifier names, dependency names, and commit hashes have been replaced with generic placeholders (`<PROJECT_ROOT>`, `<ClassA>`, `<functionA>()`, `<source-file>`, `<third-party-dep-N>`, `<COMMIT_HASH>`, …).

---

## TL;DR

Running the manual incremental update (`/graphify --update`) on a graph that the in-session auto-updater had **already brought current** caused the merge step to produce a graph with **fewer nodes than it started with**:

- Existing `graph.json`: **1515 nodes / 2435 edges**
- After `build_merge`: **1272 nodes / 2000 edges**
- **Net result: −243 nodes, −435 edges, +0 new nodes**

The 243 lost nodes were **distinct, real symbols** (third-party dependency nodes, a service class plus several of its methods, standalone functions, persistence modules, test fixtures), collapsed by `build_merge`'s **global fuzzy deduplication** (`Deduplicated 243 node(s) (4 exact, 234 fuzzy)`). The new AST extraction contributed **zero** genuinely new nodes: all 476 re-extracted node IDs already existed in the graph. The merge therefore had nothing to add, and the fuzzy pass was the *only* operation with any effect; that effect was purely subtractive.

The `to_json` safety guard (`Refusing to overwrite … pass force=True to override`) correctly blocked the overwrite, which is the only reason no data was lost. The bug is therefore **latent**: it is one `force=True` away from destroying 16% of the graph, and the skill docs do not warn about it.

---

## Affected components

1. **`graphifyy` library** — `build_merge()` runs a **global fuzzy dedup** across the entire merged graph with a similarity threshold that is too loose for source-code graphs.
2. **`/graphify` skill** — `references/update.md` instructs the incremental flow to call `build_merge([new_extraction], graph_path=..., prune_sources=deleted or None)`. It only prunes **deleted** files; it never prunes the **re-extracted** files, delegating reconciliation entirely to the fuzzy dedup.

The destructive behavior emerges from the **interaction** of the two, not from either in isolation, and only under a specific (but common) scenario described below.

---

## Environment / preconditions that trigger it

All four conditions held simultaneously:

1. **Graph already current.** An in-session auto-updater (`graphify update .`, run after each modification) had already extracted every changed file. Re-extraction therefore produced **0 new node IDs**.
2. **Code-only corpus.** All 28 "changed" files were `.ts` / `.tsx` / `package.json`. The skill's code-only fast path skipped semantic extraction (AST-only, 0 LLM tokens).
3. **Repeated short identifiers.** A TypeScript/React codebase has many same-named symbols across files (e.g. `build`, `target`, `icon`, generic test helpers, similar store/method names).
4. **Weak node provenance in the fuzzy key.** Affected nodes carry `source_location` values that are **bare line numbers** (`"L11"`, `"L33"`, `"L207"`) with no file qualifier, so the fuzzy matcher cannot use location to keep same-named-but-different-file symbols apart.

---

## Timeline — what happened, step by step

1. User invoked `/graphify --update` on `<PROJECT_ROOT>`.
2. `detect_incremental(.)` reported **28 changed code files**, **0 deletions** (21 `.ts`, 6 `.tsx`, 1 `package.json`).
3. All 28 fell in the `code` category (0 docs/papers/images) → skill took the **AST-only code path**, no subagents, no LLM.
4. `extract()` over the 28 files produced **476 AST nodes / 988 edges**.
5. Per `update.md`, the flow backed up `graph.json` → `.graphify_old.json`, then ran:
   ```python
   G = build_merge([new_extraction], graph_path='graphify-out/graph.json', prune_sources=None)
   ```
6. `build_merge` printed: `Deduplicated 243 node(s) (4 exact, 234 fuzzy)` and returned **1272 nodes / 2000 edges**.
7. Step 4 of the skill called `to_json(G, communities, 'graphify-out/graph.json')`, which printed:
   ```
   WARNING: new graph has 1272 nodes but existing graph.json has 1515.
   Refusing to overwrite — you may be missing chunk files from a previous session.
   Pass force=True to override.
   ```
   → `graph.json` was **not** overwritten.
8. Investigation (see Evidence) confirmed the reduction was **destructive, not legitimate cleanup**: 243 distinct real nodes dropped, 0 nodes added.
9. The downstream agent **declined to force the overwrite**, kept `graph.json` intact (1515 nodes), regenerated the report from the intact graph (reusing existing community assignments + labels, no re-cluster), and saved the manifest. A re-run of `detect_incremental` then reported **0 changed files** (manifest consistent).

---

## Evidence (collected during the session)

### Node/edge counts at each stage
| Artifact | Nodes | Edges |
|---|---|---|
| `graph.json` (existing, pre-merge) | 1515 | 2435 |
| `.graphify_old.json` (backup of the above) | 1515 | 2435 |
| `.graphify_ast.json` (fresh AST of 28 files) | 476 | 988 |
| `.graphify_extract.json` (merge output) | **1272** | **2000** |

### Diff of existing graph vs merge output
```
Old: 1515   New: 1272
Dropped (in old, not in new): 243
Added   (in new, not in old): 0
Dropped by type: {'?': 243}
Added   by type: {}
```
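The diff above comes down to two set differences on node IDs. The following is a minimal sketch of that comparison, assuming each graph is a dict with a `nodes` list whose entries carry an `id` key. That layout is an assumption about `graph.json`, not something confirmed against the `graphifyy` source, and `diff_node_ids` is a hypothetical helper, not part of the library.

```python
from typing import Any


def diff_node_ids(old: dict[str, Any], new: dict[str, Any]) -> dict[str, set[str]]:
    """Compare two graphs by node ID (assumed schema: {"nodes": [{"id": ...}, ...]})."""
    old_ids = {node["id"] for node in old["nodes"]}
    new_ids = {node["id"] for node in new["nodes"]}
    return {
        "dropped": old_ids - new_ids,  # present before the merge, missing after
        "added": new_ids - old_ids,    # introduced by the merge
    }


# Toy data standing in for graph.json and the merge output.
old_graph = {"nodes": [{"id": "a"}, {"id": "b"}, {"id": "c"}]}
merged_graph = {"nodes": [{"id": "a"}, {"id": "c"}]}

result = diff_node_ids(old_graph, merged_graph)
print(f"Dropped: {len(result['dropped'])}  Added: {len(result['added'])}")
```

A merge that only removes IDs and adds none, as in this report, shows up as a non-empty `dropped` set next to an empty `added` set.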

### Are the re-extracted nodes actually new?
```
Raw AST nodes: 476
AST node IDs already in old graph: 476
Genuinely NEW AST node IDs (not in old): 0
```
→ The re-extraction added nothing. Every fresh node already existed. The merge's only net effect was the fuzzy collapse.

### Sample of dropped (lost) nodes — all distinct, real symbols (names redacted)
```
<build-target-node>        | L11
<third-party-dep-1>        | L21
<third-party-dep-2>        | L25
<third-party-dep-3>        | L30
<third-party-dep-4>        | L90
<functionA>()              | L33
<ClassA>                   | L207
<ClassA>.<methodA>()       | L301
<ClassA>.<methodB>()       | L323
<PersistenceModuleA>       | L37
<PersistenceModuleB>       | L27
<testFixtureA>             | L17
<source-file>.ts           | L1
```
These are not duplicates — they are separate third-party dependency nodes (declared in `package.json`), separate service-class methods, separate standalone functions, separate persistence modules, and test fixtures. Collapsing them loses real graph structure. (Note: ~25 of the 243 were sampled directly; the remainder are inferred to be of the same kind given the 0-added / all-IDs-pre-existing result.)

### Post-recovery integrity check of the preserved `graph.json`
```
Nodes: 1515 | unique IDs: 1515
Edges: 2435
Orphan edges (pointing to non-existent nodes): 0
Distinct communities: 108
built_at_commit: <COMMIT_HASH>
```
The original graph has perfect referential integrity; it did not need "cleaning."
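A check of this kind can be sketched in a few lines. The example below assumes a graph dict with `nodes` (each with an `id`) and `edges` (each with `source` and `target`); both the schema and the `integrity_report` helper are hypothetical and only illustrate what "0 orphan edges" means.

```python
from typing import Any


def integrity_report(graph: dict[str, Any]) -> dict[str, int]:
    """Count nodes, unique IDs, edges, and edges whose endpoints are missing.

    Assumed (hypothetical) schema:
    {"nodes": [{"id": ...}], "edges": [{"source": ..., "target": ...}]}
    """
    ids = [node["id"] for node in graph["nodes"]]
    known = set(ids)
    orphans = [
        edge for edge in graph["edges"]
        if edge["source"] not in known or edge["target"] not in known
    ]
    return {
        "nodes": len(ids),
        "unique_ids": len(known),
        "edges": len(graph["edges"]),
        "orphan_edges": len(orphans),
    }


# Toy graph: the second edge points at a node that does not exist.
toy = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "edges": [{"source": "a", "target": "b"}, {"source": "a", "target": "ghost"}],
}
report = integrity_report(toy)
print(report)
```

A graph is referentially intact when `nodes == unique_ids` and `orphan_edges == 0`, which is what the preserved `graph.json` showed.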

---

## Root cause analysis

### Why the count went *down* instead of up
On a normal `--update`, new/changed files introduce **new** nodes, so the merged count rises and `build_merge`'s fuzzy dedup (whatever it collapses) is masked by the increase. Here the auto-updater had already ingested these files, so re-extraction produced **0 new nodes**. With nothing to add, the fuzzy pass is the *sole* mutation — and it is purely subtractive.

### Why fuzzy dedup over-merges this graph
`build_merge` applies a **global** fuzzy similarity pass over the whole node set. In a code graph:
- Many nodes share **identical short labels** across different files.
- The disambiguating field `source_location` is just a **line number** (`"L33"`), not `file:line`, so two different functions on similar lines in different files look "close."
- The fuzzy threshold is tuned for prose/entity corpora (where near-duplicate surface forms usually *are* the same entity), not for code (where the same short identifier in file A and file B is genuinely two different symbols).

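Result: the matcher merges across files and erases 234 distinct symbols through fuzzy matches, plus 4 through exact matches and assorted edge collapses. The logged breakdown (4 exact + 234 fuzzy) sums to 238; the log line does not itemize the remaining 5 of the 243 dropped nodes.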
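The over-merge can be illustrated with a toy matcher. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in for whatever similarity function `build_merge` uses; the threshold value, the node fields, and both helper functions are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Two distinct symbols that share a label; the file names are placeholders.
node_a = {"label": "build", "source_file": "a.ts", "source_location": "L33"}
node_b = {"label": "build", "source_file": "b.ts", "source_location": "L33"}

THRESHOLD = 0.85  # hypothetical value; the real threshold in build_merge is not known


def label_only_match(x: dict, y: dict) -> bool:
    """Fuzzy match on the label alone, ignoring which file the symbol lives in."""
    return SequenceMatcher(None, x["label"], y["label"]).ratio() >= THRESHOLD


def file_gated_match(x: dict, y: dict) -> bool:
    """Treat two code nodes as duplicates only when file and label are both equal."""
    return x["source_file"] == y["source_file"] and x["label"] == y["label"]


print(label_only_match(node_a, node_b))  # True: the two symbols would be merged
print(file_gated_match(node_a, node_b))  # False: different files keep them apart
```

The bare `source_location` value is identical for both nodes, so it offers no way to tell them apart; only the file does.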

### Why the skill flow amplifies it
`references/update.md` calls:
```python
build_merge([new_extraction], graph_path='graph.json', prune_sources=deleted or None)
```
For **re-extracted** files it does **not** prune the old nodes first; it expects the fuzzy dedup to reconcile old vs new copies of the same file's nodes. That is exactly the operation that, on a code graph, collapses unrelated same-named symbols. A **source-scoped replace** (prune all nodes whose `source_file` ∈ changed-files, then add the fresh AST nodes) would be deterministic and lossless and would not need fuzzy matching at all.
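A source-scoped replace of this kind could look roughly like the following. This is a sketch under stated assumptions, not the library's implementation: the node schema (`id`, `source_file`) and the `source_scoped_replace` helper are hypothetical, and edges are left out to keep the example short.

```python
from typing import Any


def source_scoped_replace(
    graph: dict[str, Any],
    fresh_nodes: list[dict[str, Any]],
    changed_files: set[str],
) -> dict[str, Any]:
    """Drop every node from a changed file, then add that file's fresh nodes.

    Assumed (hypothetical) schema: nodes are dicts with "id" and "source_file".
    """
    kept = [n for n in graph["nodes"] if n.get("source_file") not in changed_files]
    by_id = {n["id"]: n for n in kept}
    for node in fresh_nodes:
        by_id[node["id"]] = node  # exact-ID upsert, no similarity matching
    return {**graph, "nodes": list(by_id.values())}


# Two same-named symbols in different files; only a.ts is re-extracted.
graph = {"nodes": [
    {"id": "a_build", "label": "build", "source_file": "a.ts"},
    {"id": "b_build", "label": "build", "source_file": "b.ts"},
]}
fresh = [{"id": "a_build", "label": "build", "source_file": "a.ts"}]

updated = source_scoped_replace(graph, fresh, {"a.ts"})
print(len(updated["nodes"]))
```

Re-extracting an unchanged file leaves the node count where it was, and the same-named symbol in the other file is never touched. A full version would also need to drop and re-add the edges incident to the pruned nodes.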

---

## Why the in-session auto-updater does **not** show this (inferred)

The in-session `graphify update .` appears to do a **per-file, source-scoped replace** (remove the changed file's old nodes, insert its new nodes), so counts stay stable and nothing cross-file is collapsed. The skill's `--update` path instead routes through `build_merge` + global fuzzy dedup. The two paths diverge in their merge strategy, which is why an agent updating after every change stays safe, while a manual `--update` risks the shrink. *(This divergence is inferred from observed behavior; it has not been confirmed against `graphify update`'s source.)*

---

## Impact

- **Latent silent data loss.** With `force=True` (or any caller that ignores the warning), 243/1515 = **~16% of nodes** and **~18% of edges** are destroyed in a single run.
- **Erosion across repeated manual updates.** Each forced `--update` can shave the graph further, since the fuzzy pass is re-applied to an already-deduped graph.
- **Misleading guard message.** The warning attributes the shrink to "missing chunk files from a previous session," which is the *wrong* diagnosis here (it was code-only, no chunks). A user could be misled into forcing the overwrite believing the shrink is benign.
- **Docs gap.** Maintainers reportedly recommend running a manual `--update` every few sessions, with no mention of this near-destructive edge case.
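For reference, the loss percentages quoted in this section follow directly from the counts in the Evidence table:

```python
old_nodes, new_nodes = 1515, 1272
old_edges, new_edges = 2435, 2000

lost_nodes = old_nodes - new_nodes  # 243
lost_edges = old_edges - new_edges  # 435

print(f"nodes lost: {lost_nodes} ({lost_nodes / old_nodes:.1%})")  # 16.0%
print(f"edges lost: {lost_edges} ({lost_edges / old_edges:.1%})")  # 17.9%
```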

---

## Steps to reproduce

1. Build a graph for a code-only TypeScript/React (or similarly identifier-dense) project: `/graphify .`.
2. Let the in-session auto-updater (`graphify update .`) bring the graph current after edits (so re-extraction would yield 0 new node IDs).
3. Run `/graphify --update` (or directly: `build_merge([fresh_ast_extraction], graph_path='graph.json', prune_sources=None)`).
4. Observe `Deduplicated N node(s) … fuzzy` and a **merged node count lower than the original**, with `to_json` refusing to overwrite.
5. Diff old vs merged: `added == 0`, `dropped == N`, dropped entries are distinct real symbols.

---

## Suggested fixes (for graphify maintainers)

1. **Source-scoped replace in the `--update` path.** In `references/update.md`, pass the **re-extracted** files (not only deleted ones) to `prune_sources`, so `build_merge` removes their old nodes and inserts the fresh AST deterministically — no fuzzy matching of code symbols.
2. **Disable cross-file fuzzy merging for code nodes.** Gate fuzzy dedup on node type/source; never fuzzy-merge two AST symbols from different `source_file`s. Require exact `(source_file, label)` equality for code.
3. **Qualify the fuzzy key with the file.** Use `source_file:source_location` (e.g. `<source-file>.ts:L33`) instead of bare `L33` so same-named symbols in different files are never "close."
4. **Make a 0-new-node update a no-op.** If the fresh extraction contributes no new node IDs and there are no deletions, skip `build_merge` entirely (the graph is already current) instead of running a destructive dedup.
5. **Never let a merge produce a net node loss without explicit, accurate justification.** The `to_json` guard is good; improve its message to say *"merge produced fewer nodes than the source graph (−N from fuzzy dedup)"* rather than blaming missing chunk files, and have `--update` treat the guard as a hard stop (not a "pass force=True" nudge).
6. **Document the edge case.** Warn that manual `--update` on an already-current, code-dense graph can shrink it, and recommend a clean full rebuild (`/graphify .`) as the safe refresh for such projects.
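Fixes 4 and 5 can be sketched as two small guards placed around the merge. Both functions below are hypothetical helpers written for illustration; they do not exist in `graphifyy`, and the error wording is only one way to phrase a more accurate message.

```python
def is_noop_update(
    existing_ids: set[str], fresh_ids: set[str], deleted_files: list[str]
) -> bool:
    """True when the fresh extraction adds no node IDs and nothing was deleted."""
    return not deleted_files and fresh_ids <= existing_ids


def check_merge_result(old_count: int, new_count: int) -> None:
    """Hard stop on a shrinking merge, with a message that names the actual cause."""
    if new_count < old_count:
        raise RuntimeError(
            "merge produced fewer nodes than the source graph "
            f"({new_count - old_count} from dedup); refusing to write"
        )


existing = {"a", "b", "c"}
fresh = {"a", "b"}
print(is_noop_update(existing, fresh, deleted_files=[]))  # True: skip the merge

try:
    check_merge_result(old_count=1515, new_count=1272)
except RuntimeError as exc:
    print(exc)
```

The no-op test mirrors fix 4 literally. A changed file that has lost symbols also adds no new IDs, so a real implementation would need an extra check for removed nodes before skipping the merge.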

---

## Workaround (current, for downstream users)

- On code-only projects where the in-session updater already keeps the graph fresh, **do not run `/graphify --update` manually**; it can only shrink or stall.
- To refresh deliberately, do a **clean full rebuild** (`/graphify .`) — it builds without merging against the existing graph, so there is no fuzzy collapse.
- **Never pass `force=True`** to `to_json` to silence the "refusing to overwrite" warning when the node count has dropped.

---

## Confirmed vs inferred

- **Confirmed (observed in this session):** existing graph 1515/2435; merge output 1272/2000; `Deduplicated 243 (4 exact, 234 fuzzy)`; 0 nodes added; 243 distinct nodes dropped (sampled); `to_json` refused to overwrite; original graph has 0 orphan edges. The downstream agent ran the skill's documented `build_merge` call verbatim — the destructive mechanism is inside the library, not in the execution.
- **Inferred (not verified against source):** that the in-session `graphify update` uses a source-scoped replace while the skill's `--update` uses global fuzzy dedup; the exact fuzzy threshold and matching key in `build_merge`. These can be confirmed by reading `build_merge()` and the `graphify update` CLI implementation in the installed package.

