Add incremental invalidation/resolution#589
Add incremental invalidation/resolution#589st0012 wants to merge 2 commits intounresolve-primitivesfrom
Conversation
365da40 to
774d476
Compare
vinistock
left a comment
There was a problem hiding this comment.
I'm still trying to reason about the algorithm, so not done reviewing yet.
7c0e491 to
e8cdc32
Compare
vinistock
left a comment
There was a problem hiding this comment.
This PR is highly complex. I think the mechanism works, but it's hard to ship so many changes in one go. Here's what I think we should do to minimize the changes and ship with confidence:
- Move the rename of
resolve_alltoresolveto a separate PR - Create a PR implementing
unresolve_nameandunresolve_referencethat can be shipped separate from the algorithm - Add the
name_dependentshashmap ofIdentityHashMap<NameId, IdentityHashSet<NameDependent>>using aNameDependentenum. Populate this map during indexing when creating names, so that the global graph only has to merge the work at the end
With this foundation, it will be significantly easier for reviewers to focus on the algorithm. What do you think?
rust/rubydex/src/model/graph.rs
Outdated
| if let Some(name_set) = self.declaration_to_names.get_mut(&declaration_id) { | ||
| name_set.remove(&name_id); | ||
| if name_set.is_empty() { | ||
| self.declaration_to_names.remove(&declaration_id); | ||
| } | ||
| } |
There was a problem hiding this comment.
This is one of the reasons why I'm not a fan of the multiple hashmaps. We need to ensure that the data on the auxiliary maps is also consistent and the benefit is just avoiding tracing the graph from declaration -> definitions -> names. I'm not convinced tbh.
Also, to make sure we're making progress, I think you can probably ship a separate PR with just the unresolve_name and unresolve_reference methods separately.
There was a problem hiding this comment.
I'll address all the feedback first and then see how we can split this PR.
rust/rubydex/src/model/graph.rs
Outdated
| /// Reverse index: for each `NameId`, which definitions and constant references use it. | ||
| /// Eliminates O(D+R) scans during invalidation. | ||
| name_users: IdentityHashMap<NameId, Vec<NameUser>>, | ||
|
|
||
| /// Reverse index: for each `NameId`, which other names depend on it | ||
| /// (via nesting or `parent_scope`). Used for cascade invalidation. | ||
| name_dependents: IdentityHashMap<NameId, IdentityHashSet<NameId>>, |
There was a problem hiding this comment.
Why do we need these two maps? The idea of having an enum for name dependents is so that you can go NameId -> ReferenceId | DefinitionId -> NameId.
There was a problem hiding this comment.
I merged the maps but I think the value should also include NameId. So it'd be NameId -> ReferenceId | DefinitionId | NameId.
I prototyped using reference/definition to look up name but it doesn't work well with parent_scope/nesting cases. For example:
class Baz
include Foo
CONST
endThe reference CONST creates a name with nesting=baz_name, but it doesn't create a member declaration under Baz. So Baz.members() is empty — there's no path from baz_name to the reference's name through the declaration/definitions. When Baz's ancestors change (e.g., a new prepend Bar is added), we need to re-evaluate CONST, but without the explicit Name(NameId) entry under baz_name, we can't discover it.
There was a problem hiding this comment.
there's no path from baz_name to the reference's name through the declaration/definitions
This is the only reason why we need name_dependents. In your example, this is what I would expect the hashmap to look like:
# name_dependents
{
NameId(Baz) => Set[ReferenceId(Foo), ReferenceId(CONST)]
}
This allows the graph to remember which references and definitions will be potentially impacted by a name change. We then trace name_dependents and the rest of the graph to unresolve names.
In this case, if we had to unresolve all names due to a change to Baz, I would expect the algorithm to do something like this:
Bazchanged. Loop through name dependents- Regardless whether the dependent is a reference or definition, get its
name_id, pull the name from the graph and unresolve it - Now, unresolving the definition and reference may invalidate other things. Go back to 1. and invalidate the
name_idfor the reference/definition
This also involves invalidating ancestors, but you get the idea.
|
|
||
| /// Accumulated work items from update/delete operations. | ||
| /// Drained by `take_pending_work()` before resolution. | ||
| pending_work: Vec<Unit>, |
There was a problem hiding this comment.
We still need an answer for the ever growing memory if users don't call resolve.
995854d to
746e90c
Compare
Add methods for unwinding name resolution during incremental updates: - `unresolve_name`: converts a Resolved name back to Unresolved - `unresolve_reference`: detaches a reference from its declaration and unresolves the underlying name - `remove_name`: removes a name and cleans up reverse indices - `detach_name_from_declaration`: shared helper for reverse index cleanup Add `declaration_to_names` reverse index that tracks which names resolve to each declaration. Populated during `record_resolved_name`, cleaned up by the unresolve/remove primitives. Integrate `unresolve_reference` into `remove_definitions_for_document` to properly detach references during document updates.
a421b3c to
9ee49c6
Compare
9ee49c6 to
417635f
Compare
4980e3b to
0ac1ae6
Compare
Summary
Replace the full re-resolution strategy (
clear_declarations+ resolve everything from scratch) with incremental invalidation. Graph mutations (update/delete_document) now compute the minimal set of definitions, references, and ancestor chains that need re-resolution, and the resolver processes only that subset.How it works
Graph mutations follow a three-step pipeline:
invalidate— Detaches old definitions from declarations. Identifies affected declarations (definition removed, new definition/reference touching existing names). Feeds them intoinvalidate_graph, a unified worklist that processes declarations and names.remove_document_data— Removes old refs/defs/names/strings from maps. Cleans up empty declarations.extend— Merges newLocalGraphdata and queues new definitions/references for resolution.Work items are accumulated as
Unit(definitions, references, ancestors) and drained by the resolver. For the initial full index, this contains everything. For incremental updates, only the invalidated subset.Reverse index:
name_dependentsA single reverse index
name_dependents: NameId → Vec<NameDependent>maps each name to the definitions and references that depend on it. Each dependent is stored under its ownname_idand under thenesting/parent_scopeof that name, so invalidation can trace from any scope level directly to affected refs/defs without an intermediate name-to-name layer.Invalidation worklist
invalidate_graphprocesses two kinds of items:Other changes
without_resolutionoption — Skip accumulating pending work for tools that only need definitions.name_dependentspopulated during indexing —LocalGraphbuildsname_dependentsentries inadd_definitionandadd_constant_reference, then merged into the global graph duringextend.resolve_allrenamed toresolve, drainspending_workviaUnitenum instead of scanning all names.TODO: smarter ancestor invalidation
Currently, any definition change on a declaration enters ancestor-stale mode, which conservatively unresolves all nested references. A future optimization: compare old vs new definitions for ancestor-affecting fields (
mixins,parent_class). If they haven't changed, skip ancestor invalidation entirely — only re-resolve the definition itself.Compared to
mainCorrectness: identical — all declaration counts, definition counts, orphan rates, and linked/orphan breakdowns match exactly between main and the branch.
Performance (initial full index on 94,036 files):
Memory:
Resolution is 39% faster at the cost of 13.5% more memory from the reverse index and pending work accumulation. (Note: benchmarks predate some of the recent changes — numbers may have shifted slightly.)