Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce allocations in TokenSegment.TryMatch #12728

Open
kartheekp-ms opened this issue Jul 5, 2023 · 10 comments
Open

Reduce allocations in TokenSegment.TryMatch #12728

kartheekp-ms opened this issue Jul 5, 2023 · 10 comments
Assignees
Labels
Area:ContentFiles PackageReference contentFiles folder Category:BreakingChange Functionality:Restore Priority:1 High priority issues that must be resolved in the current sprint. TechDebt Technical debt Tenet:Performance Performance issues Type:Bug

Comments

@kartheekp-ms
Copy link
Contributor

Internal Tracking issue

@kartheekp-ms kartheekp-ms added Type:Bug Tenet:Performance Performance issues Functionality:Restore Area:ContentFiles PackageReference contentFiles folder labels Jul 5, 2023
@kartheekp-ms kartheekp-ms self-assigned this Jul 5, 2023
@kartheekp-ms kartheekp-ms added the TechDebt Technical debt label Jul 5, 2023
@Nigusu-Allehu
Copy link
Contributor

Nigusu-Allehu commented Aug 7, 2023

Proposed Solution

  • Removing substrings by using a ReadOnlySpan
  • Passing the full path and the indices to the parser.

Result

  • According to my attempts using Span (passing the full path and the indices) was able to help
    me reduce allocation by the TryMatch function, However it resulted in the helper functions
    making the allocations

  • I looked into replacing the helpers functions so that they also used Span. However the
    helper functions eventually call parses that almost always do lookups in a dictionary, which
    requires a string as a key(ReadOnlySpan can not be used for the dictionaries, it is stack only
    data type). As a result, eventually some function will end up producing the string.

Conclusion
Even though I was able to mitigate producing substrings in the TryMatch function and
almost all the functions it calls, all the parsers it uses eventually almost always do a lookup
in a dictionary and as a result end up producing a substring.
As a result, improving the performance of TryMatch by removing the substrings it produces
will only result in the better performance of TryMatch, however the system in general does
not improve in performance

possible solution

  • We currently, store file paths for a package content as a string. For example, "contetFiles/any/img.jpg"
  • This is formed by looping through the directories that would lead to img.jpg and concatenating them.
  • The TryMatch function is then tasked with tokenizing this path by delimiter "/" when we could have actually kept the file path to img.jpg as a list of tokens(the directories) which will lead to img.jpg. This way TryMatch won't have to tokenize a long file path string and result in unnecessary allocations. In addition, TryMatch won't have to loop through each character in a file path, making it faster.
  • Drawback: I am not sure yet, but if we need the file paths as a string at any point, we will have to do concatenation; possibly multiple times.

@davkean
Copy link

davkean commented Oct 23, 2023

@Nigusu-Allehu Can you point me to the dictionary lookup?

@Nigusu-Allehu
Copy link
Contributor

@Nigusu-Allehu Nigusu Yenework FTE Can you point me to the dictionary lookup?

The TryMatch method eventually, most of the time, calls on of the parsers in the ManagedCodeConventions.cs. Here is an example : https://github.com/NuGet/NuGet.Client/blob/b02fdbb36d34947c24496b2c0b505aa785bc657f/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L128

@zivkan
Copy link
Member

zivkan commented Oct 23, 2023

Is this code path only used by packages that have contentFiles/, or is it also used by lib/, build/, and so on?

@Nigusu-Allehu
Copy link
Contributor

It is used by all of them. https://github.com/NuGet/NuGet.Client/blob/b02fdbb36d34947c24496b2c0b505aa785bc657f/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L411 in here there is a list of patterns that a package could have for its asset files. TokenSegments in these patterns always use a parser. For example for the pattern "lib/{tfm}/{assembly}", tfm and assembly are TokenSegments and as a result a parser would be used to identify the tfm and assembly.

@zivkan
Copy link
Member

zivkan commented Oct 23, 2023

Ok. What I'm thinking is that NuGet could "pre-compute" assets lists per TFM on package extraction, and save the results in the values in the .nupkg.metadata computed file. Then, when NuGet is doing asset selection for a project, "all" it has to do is select the "closest TFM", then it already has the List<string> of matching files, and per-project restore won't need to use ManagedCodeConventions at all.

This will be a lot more difficult to implement than refactoring a single class for optimizations. But I think it has the potential for significant perf improvements. Other people should definately think about the suggestion, in case I'm way off base. CI restore might not benefit a whole lot, since they'll have to extract packages and do the asset pre-compute on every restore. But on developer boxes where the global packages folder isn't deleted frequently (for example, where customers use VS), I think it could really have a decent (maybe even big?) improvement.

I think NuGet's "convention based" approach makes manually creating packages a little easier, but I think it harms restore performance. We can't change the past (and I might be wrong that a more explicit package format would be faster to restore), but we can take advantage of NuGet's "generated on extraction" files. The .nupkg.metadata file already has a version property, so we can detect when an older version without the asset pre-compute exists, then do a one-time update.

Unfortunately, for read-only fallback folders, we'll need to keep using the convention based approach, but if whatever creates these fallback folder layouts updates to a newer NuGet, then the higher .nupkg.metadata file version will be pre-created there.

However, the big question is whether this idea will actually benefit perf. ManagedCodeConventions, and TokenSegment cause a lot of allocations, but so can deserializing "large" json files. Obviously, I suspect reading the json file will be better than doing convention based filtering every restore. But I don't think it's a 100% sure thing, so if there's a way to quickly prototype and benchmark to validate it's a good idea, that can tell us if it's worthwhile doing a big change.

@davkean
Copy link

davkean commented Oct 24, 2023

@Nigusu-Allehu Is this the table you a referring to? This is currently defined using strings, but doesn't have to be, this could be:

        private readonly Dictionary<ReadOnlyMemory<Char>, Dictionary<ReadOnlyMemory<Char>, object>> _table = new Dictionary<ReadOnlyMemory<Char>, Dictionary<ReadOnlyMemory<Char>, object>>(ReadOnlyMemoryCharOrdinalComparer.Instance);       

Where ReadOnlyMemoryCharOrdinalComparer is a comparer that delegates to MemoryExtensions.Equals and a custom GetHashCode. Or an implementation similar to Drew's StringPool which takes a Span and returns a "string".

@davkean
Copy link

davkean commented Oct 24, 2023

I do want to say that @zivkan's idea is a good one, but I don't want to lose sight here that there might be an easier approach we can take to reduce the allocations on this path first.

@zivkan
Copy link
Member

zivkan commented Oct 24, 2023

I intended to write "if we can't find easy wins" in my previous message. Looks like I forgot to 🤦‍♂️

@Nigusu-Allehu
Copy link
Contributor

@Nigusu-Allehu Nigusu Yenework FTE Is this the table you a referring to? This is currently defined using strings, but doesn't have to be, this could be:

        private readonly Dictionary<ReadOnlyMemory<Char>, Dictionary<ReadOnlyMemory<Char>, object>> _table = new Dictionary<ReadOnlyMemory<Char>, Dictionary<ReadOnlyMemory<Char>, object>>(ReadOnlyMemoryCharOrdinalComparer.Instance);       

Where ReadOnlyMemoryCharOrdinalComparer is a comparer that delegates to MemoryExtensions.Equals and a custom GetHashCode. Or an implementation similar to Drew's StringPool which takes a Span and returns a "string".

Thank you, David! I will look into it

@jeffkl jeffkl self-assigned this Mar 4, 2024
@nkolev92 nkolev92 added Priority:2 Issues for the current backlog. Priority:1 High priority issues that must be resolved in the current sprint. and removed Priority:2 Issues for the current backlog. labels Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area:ContentFiles PackageReference contentFiles folder Category:BreakingChange Functionality:Restore Priority:1 High priority issues that must be resolved in the current sprint. TechDebt Technical debt Tenet:Performance Performance issues Type:Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants