Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it easier to store other types of link disambiguating information in the path hierarchy #828

Merged
merged 6 commits into from
Mar 4, 2024

Conversation

d-ronnqvist
Copy link
Contributor

Bug/issue #, if applicable:

Summary

This is a small refactor to change how the path hierarchy internally stores its disambiguated elements, changing the implementation from nested dictionaries per name collision per node to small lists per name collision per node.

The benefit of a small list is that new disambiguating information can be added without increasing the depth of the nested dictionaries at each node.

Dependencies

None.

Testing

Nothing in particular.

Checklist

Make sure you check off the following items. If they cannot be completed, provide a reason.

  • [ ] Added tests
  • Ran the ./bin/test script and it succeeded
  • Updated documentation if necessary

@d-ronnqvist d-ronnqvist added the code cleanup Code cleanup *without* user facing changes label Feb 15, 2024
@d-ronnqvist
Copy link
Contributor Author

In the local benchmark I ran just now, this change shows no measurable difference in build time but a possible small improvement in memory usage

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                                   │ Change          │ main (f0a3d7a1)      │ current (15305e58)   │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Duration for 'bundle-registration'       │ -0.7118%¹       │ 2.521 sec            │ 2.503 sec            │
│ Duration for 'convert-total-time'        │ no change²      │ 5.04 sec             │ 4.958 sec            │
│ Duration for 'documentation-processing'  │ no change³      │ 2.379 sec            │ 2.316 sec            │
│ Duration for 'finalize-navigation-index' │ no change⁴      │ 0.062 sec            │ 0.062 sec            │
│ Peak memory footprint                    │ -0.9491%⁵       │ 537.7 MB             │ 532.6 MB             │
│ Data subdirectory size                   │ no change       │ 168.5 MB             │ 168.5 MB             │
│ Index subdirectory size                  │ no change       │ 1.5 MB               │ 1.5 MB               │
│ Total DocC archive size                  │ no change       │ 196.5 MB             │ 196.5 MB             │
│ Topic Anchor Checksum                    │ no change       │ 9ca210cb9901eb38d157 │ 9ca210cb9901eb38d157 │
│ Topic Graph Checksum                     │ no change       │ 311f4d32005c9bd7dd72 │ 311f4d32005c9bd7dd72 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

 1: There's a statistically significant difference between the two benchmark measurements.
    The values are different enough that the most probable explanation is that they're random samples from two different data sets.
    t-statistic : +2.572          degrees of freedom : 28       95% confidence critical value : +2.042

 2: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +1.297          degrees of freedom : 28       95% confidence critical value : +2.042

 3: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +0.990          degrees of freedom : 28       95% confidence critical value : +2.042

 4: No statistically significant difference.
    The values are similar enough that the most probable explanation is that they're random samples from the same data set.
    t-statistic : +1.103          degrees of freedom : 28       95% confidence critical value : +2.042

 5: There's a statistically significant difference between the two benchmark measurements.
    The values are different enough that the most probable explanation is that they're random samples from two different data sets.
    t-statistic : +5.870          degrees of freedom : 28       95% confidence critical value : +2.042

Copy link
Contributor

@mayaepps mayaepps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patshaughnessy , @binamaniar , and I took a look at this today.
I'm leaving a few questions that we were curious about as we started to go through these changes.

// The 'hash' is more unique than the 'kind', so compare the 'hash' first.
self.hash == hash && self.kind == kind
}
var isPlaceholderValue: Bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we ever create a node with a nil kind and hash? It looks like has something to do with an "unfindable" symbol placeholder, but I'm not sure what that is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only happens when the convert service builds documentation for a single page. In this case, DocC is passed a "partial" symbol graph file that only contains a single symbol, which doesn't have to be a top-level symbol.

For example, if the convert service builds documentation for someFunction() and it's a member of SomeClass, then the symbol graph file will only contain the someFunction() symbol and not SomeClass but the function's path components still include "SomeClass". In this case, DocC creates an "unfindable" symbol placeholder named "SomeClass" that it someFunction() becomes a member of. This ensures that the nodes in the hierarchy are connected and that there's the same number of nodes—with the same names—between the module node and the someFunction() node as there would be in the full symbol graph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for explaining. It might be helpful to have a comment somewhere explaining what a placeholder value is in the context of the PathHierarchy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a a couple of paragraphs of internal implementation documentation about this in 50cc145

// It is possible for articles and other non-symbols to collide with unfindable symbol placeholder nodes.
mutating func add(_ value: PathHierarchy.Node, kind: String?, hash: String?) {
// When adding new elements to the container, it's sufficient to check if the hash and kind match.
if let existing = storage.first(where: { $0.matches(kind: kind, hash: hash) }) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there ever be multiple Elements with the same kind and hash? Do we need an assert for that? This looks safe as long as it's the only place we ever add new Elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if processing multiple symbol graphs with different language representations of the same symbol there could be multiple calls to add(_:kind:hash:) with the same kind and hash values. That's why this implementation checks if there's already an element with that kind and hash and merge the two nodes (combining the different language representations of that symbol).

The storage is a private property so this is the only palace that can add new elements.

// Given this expected amount of data, a nested dictionary implementation performs well.
private(set) var storage: [String: [String: PathHierarchy.Node]] = [:]
// Given this expected amount of data, linear searches through an array performs well.
private(set) var storage = ContiguousArray<Element>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What made you decide to use an array (or a ContiguousArray<Element> to be precise) and not a Set? As a list of unique items, if we make Element conform to Equatable, Set<Element> would automatically guarantee that all the elements are unique.
Maybe the Elements aren't actually equal if the hash and kind are equal but not the node?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow up question, after taking a look at ContiguousArray's documentation:

If the array’s Element type is a struct or enumeration, Array and ContiguousArray should have similar efficiency.

What is the benefit of using ContiguousArray over Array here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reason for not using a Set here is twofold:

  • First, when an element already exists, the DisambiguationContainer needs to merge the new value with the existing value. Because of this, it's not sufficient to rely on the Set's uniqueness. Colliding elements also need to be merged.
  • Second, the DisambiguationContainer also needs to merge new and existing values in cases other than exact kind and hash matches. Currently this only happens when the container has a single placeholder value but in the future—when there could be more types of disambiguation—the container would need to look for existing values in more ways.

I could implement this with a Set and still use storage.first(where: { $0.matches(kind: kind, hash: hash) }) to check for existing matches but the Set doesn't bring any benefit over an Array in that case.

I could implement this with a Set and check for exact kind and hash matches with Set.remove(_:) but it would only work for the exact kind and hash match.

let element = Element(...)
if let existing = storage.remove(element) {
  existing.node.merge(with: value)
  storage.insert(existing)
} else if ...

I could also separate the kind and hash from the node and create a compound key

struct Key {
    let kind: String?
    let hash: String?
}
private(set) var storage = [Key: Node]()

This makes it very easy to look for exact matches

let key = Key(kind: kind, hash: hash)
if let existing = storage[key] {
    merge(with: value)
} else if ...

but when there are other types of disambiguation this breaks down and need to use .first(where:) to inspect the individual values again.

Ultimately, since the purpose of this refactoring is to make it easier to store other types of disambiguation in the future (#643 already has one type of disambiguation that I want to add), I went with the solution that doesn't prioritize exact kind and hash matches.


I thought that there would be a bigger difference between ContiguousArray and Array but I can't measure any different except in extreme micro benchmarks. I can change it to a regular array if it makes the code easier to understand. Otherwise I don't mind keeping the ContiguousArray since it's a low-level implementation detail.

Copy link
Contributor

@mayaepps mayaepps Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thank you! I wonder if it would be useful for future developers you were to add a comment (similar to your bullet points above) briefly explaining the unique requirements for the DisambiguationContainer's storage? I'm trying to think what may be useful for a future developer to know if they wanted to add a new type of disambiguation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly mentioned why a Set wouldn't help in a new comment in 50cc145

Copy link
Contributor

@mayaepps mayaepps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it simplifies the use of disambiguations in some places too, which is great!
(Reviewed together with @patshaughnessy and @binamaniar)

// The 'hash' is more unique than the 'kind', so compare the 'hash' first.
self.hash == hash && self.kind == kind
}
var isPlaceholderValue: Bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for explaining. It might be helpful to have a comment somewhere explaining what a placeholder value is in the context of the PathHierarchy.

let component = PathParser.parse(pathComponent: component[...])
let nodeWithoutSymbol = Node(name: String(component.name))
nodeWithoutSymbol.isDisfavoredInCollision = true
// If 'known disambiguated path components' was provided, then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like part of the comment is missing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this—and the surrounding internal documentation—in 50cc145

@@ -207,9 +207,17 @@ struct PathHierarchy {
parent.children[components.first!] == nil,
"Shouldn't create a new sparse node when symbol node already exist. This is an indication that a symbol is missing a relationship."
)
guard knownDisambiguatedPathComponents != nil else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we not have to do this before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to do this but it's a small optimization to avoid doing unnecessary work. Before we always parsed each placeholder node's path component for disambiguation but now we only do it if we know that the path hierarchy has been provided some custom path components (which could include disambiguation that we need to parse)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a brief comment that talks about this in 50cc145

// Subtree contains more than one match
throw Error.lookupCollision(kinds.map { ($0.value[hash]!, $0.key) })
switch disambiguation {
case .kindAndHash(let kind, let hash):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you avoid a nested switch statement here by using a guard to check whether disambiguation is nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do that now, but when we add another type of disambiguation we'll need both layers of switch statements.

@d-ronnqvist
Copy link
Contributor Author

@swift-ci please test

1 similar comment
@d-ronnqvist
Copy link
Contributor Author

@swift-ci please test

@d-ronnqvist d-ronnqvist merged commit 2f7b3aa into apple:main Mar 4, 2024
2 checks passed
@d-ronnqvist d-ronnqvist deleted the link-disambiguation-refactor-2 branch March 4, 2024 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code cleanup Code cleanup *without* user facing changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants