Make it easier to store other types of link disambiguating information in the path hierarchy #828

d-ronnqvist · 2024-02-15T09:16:39Z

Bug/issue #, if applicable:

Summary

This is a small refactor to change how the path hierarchy internally stores its disambiguated elements, changing the implementation from nested dictionaries per name collision per node to small lists per name collision per node.

The benefit of a small list is that new disambiguating information can be added without increasing the depth of the nested dictionaries at each node.

Dependencies

None.

Testing

Nothing in particular.

Checklist

Make sure you check off the following items. If they cannot be completed, provide a reason.

~~[ ] Added tests~~
Ran the ./bin/test script and it succeeded
Updated documentation if necessary

d-ronnqvist · 2024-02-15T09:19:01Z

In the local benchmark I ran just now, this change shows no measurable difference in build time but a possible small improvement in memory usage

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                                   │ Change          │ main (f0a3d7a1)      │ current (15305e58)   │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Duration for 'bundle-registration'       │ -0.7118%¹       │ 2.521 sec            │ 2.503 sec            │
│ Duration for 'convert-total-time'        │ no change²      │ 5.04 sec             │ 4.958 sec            │
│ Duration for 'documentation-processing'  │ no change³      │ 2.379 sec            │ 2.316 sec            │
│ Duration for 'finalize-navigation-index' │ no change⁴      │ 0.062 sec            │ 0.062 sec            │
│ Peak memory footprint                    │ -0.9491%⁵       │ 537.7 MB             │ 532.6 MB             │
│ Data subdirectory size                   │ no change       │ 168.5 MB             │ 168.5 MB             │
│ Index subdirectory size                  │ no change       │ 1.5 MB               │ 1.5 MB               │
│ Total DocC archive size                  │ no change       │ 196.5 MB             │ 196.5 MB             │
│ Topic Anchor Checksum                    │ no change       │ 9ca210cb9901eb38d157 │ 9ca210cb9901eb38d157 │
│ Topic Graph Checksum                     │ no change       │ 311f4d32005c9bd7dd72 │ 311f4d32005c9bd7dd72 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

 1: There's a statistically significant difference between the two benchmark measurements.
    The values are different enough that the most probable explanation is that they're random samples from two different data sets.
    t-statistic : +2.572          degrees of freedom : 28       95% confidence critical value : +2.042

 2: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +1.297          degrees of freedom : 28       95% confidence critical value : +2.042

 3: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +0.990          degrees of freedom : 28       95% confidence critical value : +2.042

 4: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +1.103          degrees of freedom : 28       95% confidence critical value : +2.042

 5: There's a statistically significant difference between the two benchmark measurements.
    The values are different enough that the most probable explanation is that they're random samples from two different data sets.
    t-statistic : +5.870          degrees of freedom : 28       95% confidence critical value : +2.042

mayaepps

@patshaughnessy , @binamaniar , and I took a look at this today.
I'm leaving a few questions that we were curious about as we started to go through these changes.

mayaepps · 2024-02-22T21:47:58Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

+            // The 'hash' is more unique than the 'kind', so compare the 'hash' first.
+            self.hash == hash && self.kind == kind
+        }
+        var isPlaceholderValue: Bool {


When do we ever create a node with a nil kind and hash? It looks like has something to do with an "unfindable" symbol placeholder, but I'm not sure what that is.

It only happens when the convert service builds documentation for a single page. In this case, DocC is passed a "partial" symbol graph file that only contains a single symbol, which doesn't have to be a top-level symbol.

For example, if the convert service builds documentation for someFunction() and it's a member of SomeClass, then the symbol graph file will only contain the someFunction() symbol and not SomeClass but the function's path components still include "SomeClass". In this case, DocC creates an "unfindable" symbol placeholder named "SomeClass" that it someFunction() becomes a member of. This ensures that the nodes in the hierarchy are connected and that there's the same number of nodes—with the same names—between the module node and the someFunction() node as there would be in the full symbol graph.

Makes sense, thanks for explaining. It might be helpful to have a comment somewhere explaining what a placeholder value is in the context of the PathHierarchy.

I added a a couple of paragraphs of internal implementation documentation about this in 50cc145

mayaepps · 2024-02-22T21:49:52Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

-            // It is possible for articles and other non-symbols to collide with unfindable symbol placeholder nodes.
+    mutating func add(_ value: PathHierarchy.Node, kind: String?, hash: String?) {
+        // When adding new elements to the container, it's sufficient to check if the hash and kind match.
+        if let existing = storage.first(where: { $0.matches(kind: kind, hash: hash) }) {


Could there ever be multiple Elements with the same kind and hash? Do we need an assert for that? This looks safe as long as it's the only place we ever add new Elements.

Yes, if processing multiple symbol graphs with different language representations of the same symbol there could be multiple calls to add(_:kind:hash:) with the same kind and hash values. That's why this implementation checks if there's already an element with that kind and hash and merge the two nodes (combining the different language representations of that symbol).

The storage is a private property so this is the only palace that can add new elements.

mayaepps · 2024-02-22T23:35:19Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

-        // Given this expected amount of data, a nested dictionary implementation performs well.
-        private(set) var storage: [String: [String: PathHierarchy.Node]] = [:]
+        // Given this expected amount of data, linear searches through an array performs well.
+        private(set) var storage = ContiguousArray<Element>()


What made you decide to use an array (or a ContiguousArray<Element> to be precise) and not a Set? As a list of unique items, if we make Element conform to Equatable, Set<Element> would automatically guarantee that all the elements are unique.
Maybe the Elements aren't actually equal if the hash and kind are equal but not the node?

As a follow up question, after taking a look at ContiguousArray's documentation:

If the array’s Element type is a struct or enumeration, Array and ContiguousArray should have similar efficiency.

What is the benefit of using ContiguousArray over Array here?

My reason for not using a Set here is twofold:

First, when an element already exists, the DisambiguationContainer needs to merge the new value with the existing value. Because of this, it's not sufficient to rely on the Set's uniqueness. Colliding elements also need to be merged.

Second, the DisambiguationContainer also needs to merge new and existing values in cases other than exact kind and hash matches. Currently this only happens when the container has a single placeholder value but in the future—when there could be more types of disambiguation—the container would need to look for existing values in more ways.

I could implement this with a Set and still use storage.first(where: { $0.matches(kind: kind, hash: hash) }) to check for existing matches but the Set doesn't bring any benefit over an Array in that case.

I could implement this with a Set and check for exact kind and hash matches with Set.remove(_:) but it would only work for the exact kind and hash match.

let element = Element(...) if let existing = storage.remove(element) { existing.node.merge(with: value) storage.insert(existing) } else if ...

I could also separate the kind and hash from the node and create a compound key

struct Key { let kind: String? let hash: String? } private(set) var storage = [Key: Node]()

This makes it very easy to look for exact matches

let key = Key(kind: kind, hash: hash) if let existing = storage[key] { merge(with: value) } else if ...

but when there are other types of disambiguation this breaks down and need to use .first(where:) to inspect the individual values again.

Ultimately, since the purpose of this refactoring is to make it easier to store other types of disambiguation in the future (#643 already has one type of disambiguation that I want to add), I went with the solution that doesn't prioritize exact kind and hash matches.

I thought that there would be a bigger difference between ContiguousArray and Array but I can't measure any different except in extreme micro benchmarks. I can change it to a regular array if it makes the code easier to understand. Otherwise I don't mind keeping the ContiguousArray since it's a low-level implementation detail.

That makes sense, thank you! I wonder if it would be useful for future developers you were to add a comment (similar to your bullet points above) briefly explaining the unique requirements for the DisambiguationContainer's storage? I'm trying to think what may be useful for a future developer to know if they wanted to add a new type of disambiguation.

I briefly mentioned why a Set wouldn't help in a new comment in 50cc145

mayaepps

This looks like it simplifies the use of disambiguations in some places too, which is great!
(Reviewed together with @patshaughnessy and @binamaniar)

mayaepps · 2024-03-01T22:36:09Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

+            // The 'hash' is more unique than the 'kind', so compare the 'hash' first.
+            self.hash == hash && self.kind == kind
+        }
+        var isPlaceholderValue: Bool {


Makes sense, thanks for explaining. It might be helpful to have a comment somewhere explaining what a placeholder value is in the context of the PathHierarchy.

mayaepps · 2024-03-01T23:00:56Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

                    let component = PathParser.parse(pathComponent: component[...])
                    let nodeWithoutSymbol = Node(name: String(component.name))
                    nodeWithoutSymbol.isDisfavoredInCollision = true
+                    // If 'known disambiguated path components' was provided, then


Looks like part of the comment is missing here.

I updated this—and the surrounding internal documentation—in 50cc145

mayaepps · 2024-03-01T23:01:04Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy.swift

@@ -207,9 +207,17 @@ struct PathHierarchy {
                        parent.children[components.first!] == nil,
                        "Shouldn't create a new sparse node when symbol node already exist. This is an indication that a symbol is missing a relationship."
                    )
+                    guard knownDisambiguatedPathComponents != nil else {


Why did we not have to do this before?

We don't need to do this but it's a small optimization to avoid doing unnecessary work. Before we always parsed each placeholder node's path component for disambiguation but now we only do it if we know that the path hierarchy has been provided some custom path components (which could include disambiguation that we need to parse)

I added a brief comment that talks about this in 50cc145

mayaepps · 2024-03-01T23:03:43Z

Sources/SwiftDocC/Infrastructure/Link Resolution/PathHierarchy+Find.swift

-                // Subtree contains more than one match
-                throw Error.lookupCollision(kinds.map { ($0.value[hash]!, $0.key) })
+        switch disambiguation {
+        case .kindAndHash(let kind, let hash):


Could you avoid a nested switch statement here by using a guard to check whether disambiguation is nil?

I could do that now, but when we add another type of disambiguation we'll need both layers of switch statements.

d-ronnqvist · 2024-03-04T11:49:06Z

@swift-ci please test

d-ronnqvist · 2024-03-04T11:51:43Z

@swift-ci please test

Internally store disambiguated path hierarchy elements as small lists

15305e5

d-ronnqvist added the code cleanup Code cleanup *without* user facing changes label Feb 15, 2024

d-ronnqvist mentioned this pull request Feb 15, 2024

Support disambiguating links with type signature information #643

Open

3 tasks

d-ronnqvist added 2 commits February 21, 2024 15:55

Merge branch 'main' into link-disambiguation-refactor-2

2237eff

Merge branch 'main' into link-disambiguation-refactor-2

37b57d8

d-ronnqvist requested review from patshaughnessy, mayaepps and sofiaromorales and removed request for patshaughnessy and mayaepps February 22, 2024 08:19

mayaepps reviewed Feb 23, 2024

View reviewed changes

Merge branch 'main' into link-disambiguation-refactor-2

dbc7322

mayaepps approved these changes Mar 2, 2024

View reviewed changes

d-ronnqvist added 2 commits March 4, 2024 12:47

Continue to disfavor sparse nodes without disambiguation

2008ef7

Expand on internal comments to describe more implementation details

50cc145

d-ronnqvist merged commit 2f7b3aa into apple:main Mar 4, 2024
2 checks passed

d-ronnqvist deleted the link-disambiguation-refactor-2 branch March 4, 2024 11:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make it easier to store other types of link disambiguating information in the path hierarchy #828

Make it easier to store other types of link disambiguating information in the path hierarchy #828

d-ronnqvist commented Feb 15, 2024

d-ronnqvist commented Feb 15, 2024

mayaepps left a comment

mayaepps Feb 22, 2024

d-ronnqvist Feb 23, 2024

mayaepps Mar 1, 2024

d-ronnqvist Mar 4, 2024

mayaepps Feb 22, 2024

d-ronnqvist Feb 23, 2024

mayaepps Feb 22, 2024

mayaepps Feb 22, 2024

d-ronnqvist Feb 23, 2024

mayaepps Mar 1, 2024 •

edited

d-ronnqvist Mar 4, 2024

mayaepps left a comment •

edited

mayaepps Mar 1, 2024

mayaepps Mar 1, 2024

d-ronnqvist Mar 4, 2024

mayaepps Mar 1, 2024

d-ronnqvist Mar 4, 2024

d-ronnqvist Mar 4, 2024

mayaepps Mar 1, 2024

d-ronnqvist Mar 4, 2024

d-ronnqvist commented Mar 4, 2024

d-ronnqvist commented Mar 4, 2024

Make it easier to store other types of link disambiguating information in the path hierarchy #828

Make it easier to store other types of link disambiguating information in the path hierarchy #828

Conversation

d-ronnqvist commented Feb 15, 2024

Summary

Dependencies

Testing

Checklist

d-ronnqvist commented Feb 15, 2024

mayaepps left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayaepps Mar 1, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayaepps left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

d-ronnqvist commented Mar 4, 2024

d-ronnqvist commented Mar 4, 2024

mayaepps Mar 1, 2024 •

edited

mayaepps left a comment •

edited