-
Notifications
You must be signed in to change notification settings - Fork 677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce allocations in TokenSegment.TryMatch #5676
Conversation
c114686
to
3555d15
Compare
@jeffkl Do you have any numbers on this? |
src/NuGet.Core/NuGet.Packaging/PublicAPI/netstandard2.0/PublicAPI.Unshipped.txt
Outdated
Show resolved
Hide resolved
@@ -12,8 +12,8 @@ namespace NuGet.ContentModel | |||
/// </summary> | |||
public class PatternTable | |||
{ | |||
private readonly Dictionary<string, Dictionary<string, object>> _table | |||
= new Dictionary<string, Dictionary<string, object>>(StringComparer.Ordinal); | |||
private readonly Dictionary<string, Dictionary<ReadOnlyMemory<char>, object>> _table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels risky to me, but I don't have a better suggestion.
If the ReadOnlyMemory<char>
comes from a char[]
, then there's nothing in the C# language preventing code from modifying values in the char[]
, and therefore using the ReadOnlyMemory<char>
as a dictionary key is going to be "corrupt", since the value and therefore hash code has now changed.
Even in .NET 8 and 9, I don't believe it's possible to have a Dictionary<string, TValue>
, and then use a ReadOnlyMemory<char>
to get values.
I was hoping that wring out some of these words would give me a better idea (kind of like rubber duck debugging), but unfortunately I haven't. I didn't know if using ReadOnlyMemory<char>
as a dictionary key world work, as it needs to implement GetHashCode "correctly", but if it does, then it's safe enough, as long as the key used to add into the dictionary is never modified. It's generally nicer when we can get the compiler to check things for us, but I guess in this case we're going to have to manually read and reason about the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write a debug-only assertion that the underlying type is a string via MemoryMarshal.TryGetString.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They keys are added from strings: https://github.com/NuGet/NuGet.Client/pull/5676/files#diff-209a6a4d96581f677475543a85fa1ef521d76b514da50a1f0e4f23012bfb02eaR39
Do you mean the lookup should verify the backing memory is a string?
https://github.com/NuGet/NuGet.Client/pull/5676/files#diff-209a6a4d96581f677475543a85fa1ef521d76b514da50a1f0e4f23012bfb02eaL64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the lookup should verify the backing memory is a string?
Yes. "Good" code is easy to read and verify it's doing the right thing. The .Add
call passes in a type that I can't figure out by reading this file alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values of this dictionary are added in the constructor only here which the caller specifies strings and this class stores them as a ReadOnlyMemory<char>
so that later on lookups can be ReadOnlyMemory<char>
.
It sounds like instead there should be a Debug.Assert()
in the lookup code here to verify that the caller specified a ReadOnlyMemory<char>
that is backed by a string?
if (charactersRemaining > 0) | ||
{ | ||
num1 = ((num1 << 5) + num1 + (num1 >> 27)) ^ pSpan0[obj.Length - 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this works correctly when charactersRemaining
is 1, 2, or 3, it's unintuitive to me how. It looks to me like it only considers the last character, which means when there's 2 or 3 remaining, this method will calculate the "wrong" hash (or it's easy to have a hash collection when only pSpan0[obj.Length - 2]
is the only different character).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, I borrowed this code from here: https://github.com/dotnet/msbuild/pull/8852/files#diff-88a07852f8662ec51afa7f83c50152dac604d637ef6526f442d16a0e94cd86d6R177
@@ -206,7 +206,7 @@ public TokenSegment(string token, char delimiter, bool matchOnly, PatternTable t | |||
} | |||
if (StringComparer.Ordinal.Equals(_token, "tfm")) | |||
{ | |||
item.Properties.Add("tfm_raw", substring); | |||
item.Properties.Add("tfm_raw", substring.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outside of this PR, but probably perfect case to have this pulled from a name table so we only ever create one of these strings per "raw TFM".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -22,7 +21,7 @@ public class ManagedCodeConventions | |||
|
|||
private static readonly ContentPropertyDefinition AnyProperty = new ContentPropertyDefinition( | |||
PropertyNames.AnyValue, | |||
parser: (o, t) => o); // Identity parser, all strings are valid for any | |||
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this run? I would be worried that this defeats the purpose of the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not an expert in this code path, but my understanding is that any pattern that has {any}
or
{any?}
path segment will execute this code path. Looking at the code https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L421-L573, almost all patterns except MSBuild related files have the said any
path segment in them. If all the parsers have .ToString()
in them, I am also worried that it might defeat the purpose of the change. However, a PerfView trace before and after the fix would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is for path segments that match on everything for example lib/{tfm}/{any}
. That case, the o.ToString()
makes a string out of something like PackageA.dll
which gets added to the properties in TokenSegment.TryMatch()
. This means the substring was used so its okay to realize it to a string? Since the point was to stop allocating a substring if its not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the point was to stop allocating a substring if its not used.
I agree with this goal. Is there any test that checks for the scenario where substrings are not used? The reason is that the pattern matching algorithm is executed during the pack and restore operation. Hence, I am curious to understand the scenario where substrings are not used. For example, I debugged first 3-unit tests in https://github.com/NuGet/NuGet.Client/blob/dev/test/NuGet.Core.Tests/NuGet.Packaging.Test/ContentModelTests/ContentModelLibTests.cs and noticed that substrings are invoked every time. Maybe I chose wrong tests to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kartheekp-ms Are you saying are created for every segment in those strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substrings are created if the segment satisfies a particular parser logic. For example, if it is a {any}
segment and path
token has xyz.dll
, the code creates a substring. Likewise looking at all the parsers logic, there is .ToString()
invocation which is invoked if the token satisfies a path segment.
@@ -161,18 +160,18 @@ private static object Locale_Parser(string name, PatternTable table) | |||
|
|||
if (name.Length == 2) | |||
{ | |||
return name; | |||
return name.ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern as above, when does this run? If it runs always, then this will defeat the purpose of the original change on all strings of two chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code path will be executed only when parsing resource assemblies because they have {locale}
path segment in them. https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L486-L497
src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs
Outdated
Show resolved
Hide resolved
@@ -12,8 +12,8 @@ namespace NuGet.ContentModel | |||
/// </summary> | |||
public class PatternTable | |||
{ | |||
private readonly Dictionary<string, Dictionary<string, object>> _table | |||
= new Dictionary<string, Dictionary<string, object>>(StringComparer.Ordinal); | |||
private readonly Dictionary<string, Dictionary<ReadOnlyMemory<char>, object>> _table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the lookup should verify the backing memory is a string?
Yes. "Good" code is easy to read and verify it's doing the right thing. The .Add
call passes in a type that I can't figure out by reading this file alone.
....Core.Tests/NuGet.Packaging.Test/ContentModelTests/ReadOnlyMemoryCharComparerOrdinalTests.cs
Outdated
Show resolved
Hide resolved
{ | ||
[Theory] | ||
[InlineData("AbCDeFg", "AbCDeFg", true)] | ||
[InlineData("AbCDeFG", "abcdefg", false)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand how the code works, or why this test passes. Can anybody explain it to me?
The for
loop processes multiples of 4 characters. Then there's an if (charactersRemaining > 0)
which runs num1 = whatever(num1) ^ pSpan0[obj.Length - 1];
In the case of 7 character test, the for
loop processes array indexes 0, 1, 2, and 3. charactersRemaining
should be 3, so, the if statement processes pSpan0[7 - 1]
or pSpan0[6]
.
How are characters 4 and 5 contributing to the hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, I tried debugging it but ran out of time. I did add a unit test to verify that the hash of ABCDEFGH
and ABCDefgH
are not the same which I think shows that its working as expected?
@jeffkl What's blocking this from making progress? |
7590e49
to
f1a0893
Compare
I was the NuGet on-call engineer last week. Hoping to get back to this today or tomorrow, I'm also helping with the 1ES PT effort. |
f1a0893
to
99c6c8b
Compare
@jeffkl - I have a slightly different understanding of why this code path is causing too many allocations. My understanding is that the current pattern-matching algorithm is suboptimal. For every asset in the package, we check it against all the patterns to find a match. Instead, we could have some kind of state machine that we traverse with each asset exactly once to determine the asset type. For example, in the image below, I have added numbers (1 to 10) to denote roots. Each asset will navigate through only one root, and if it reaches the terminal state on that path, then there is a match, meaning it could be a runtime assembly, a native assembly, or something else. |
Team Triage: Filed a tech debt issue https://github.com/NuGet/Client.Engineering/issues/2780. |
|
Can you share out a trace?
…________________________________
From: Jeff Kluge ***@***.***>
Sent: Friday, April 5, 2024 6:55 AM
To: NuGet/NuGet.Client ***@***.***>
Cc: David Kean ***@***.***>; Review requested ***@***.***>
Subject: Re: [NuGet/NuGet.Client] Reduce allocations in TokenSegment.TryMatch (PR #5676)
@jeffkl<https://github.com/jeffkl> Do you have any numbers on this?
Before:
image.png (view on web)<https://github.com/NuGet/NuGet.Client/assets/17556515/e3383573-3d28-40b8-8e27-ca1307d2c65a>
After:
image.png (view on web)<https://github.com/NuGet/NuGet.Client/assets/17556515/1a869020-04fb-4ac0-875d-877487766b75>
—
Reply to this email directly, view it on GitHub<#5676 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAINQIQFJT2BMFOPAGSYEZDY3WV4LAVCNFSM6AAAAABENHJ4UWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZYGA4TKMZZGE>.
You are receiving this because your review was requested.Message ID: ***@***.***>
|
|
I compared the traces. The highlighted areas where I believe the above change affects. We moved ~45 MB -> 22 MB, but I want to see this is repeatible (I see differences where I don't expect), so can you run the after a few times and make sure that the TryMatch numbers match below within a MB or so. If not, then we don't have a deterministic repro, and we should pick something larger. |
In the before GC stats, only 1 of the 4 had a gen 2 GC, which only paused for 14ms. All 4 of the after GC stats show a gen 2 GC, and takes around 150+ ms. |
@jeffkl could you address some of the pending concerns in this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several comments that haven't been addressed. Is this still viable to review or is it a work in progress?
@@ -206,7 +206,7 @@ public TokenSegment(string token, char delimiter, bool matchOnly, PatternTable t | |||
} | |||
if (StringComparer.Ordinal.Equals(_token, "tfm")) | |||
{ | |||
item.Properties.Add("tfm_raw", substring); | |||
item.Properties.Add("tfm_raw", substring.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -22,7 +21,7 @@ public class ManagedCodeConventions | |||
|
|||
private static readonly ContentPropertyDefinition AnyProperty = new ContentPropertyDefinition( | |||
PropertyNames.AnyValue, | |||
parser: (o, t) => o); // Identity parser, all strings are valid for any | |||
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this comment mean?
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any | |
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any <??> |
@@ -90,7 +88,7 @@ public ManagedCodeConventions(RuntimeGraph runtimeGraph) | |||
|
|||
props[PropertyNames.RuntimeIdentifier] = new ContentPropertyDefinition( | |||
PropertyNames.RuntimeIdentifier, | |||
parser: (o, t) => o, // Identity parser, all strings are valid runtime ids :) | |||
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids :) | |
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids |
[InlineData("ref/four/foo.dll", "lib/five/bar.dll", false, 4, 4)] | ||
public void Equals_ReturnsExpectedValue(string x, string y, bool expected, int? start = default, int? length = default) | ||
{ | ||
ReadOnlyMemoryCharComparerOrdinal.Instance.Equals(x.AsMemory(start ?? 0, length ?? x.Length), y.AsMemory(start ?? 0, length ?? y.Length)).Should().Be(expected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I personally think multi-lines with variables is far easier to read and parse than a single line.
…adOnlyMemoryCharComparerOrdinal
99c6c8b
to
b383710
Compare
This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 7 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch. |
Bug
Fixes: NuGet/Home#12728
Also closes internal workitem.
Regression? Last working version:
Description
This uses
ReadOnlyMemory<char>.Slice()
instead ofstring.Substring()
to reduce allocations when pattern matching. This also needs to cache the matches with aDictionary<ReadOnlyMemory<char>, object>
inPatternTable
so I needed to re-use an ordinal comparer that I found in another repo. The rest of the changes are fallout from changing the pattern we're looking for from astring
toReadOnlyMemory<char>
. The reason I'm usingReadOnlyMemory<T>
instead ofReadOnlySpan<T>
is thatReadOnlySpan<T>
is areadonly ref struct
and can't be used as a type parameter.I had to change the public API of
NuGet.ContentModel.ContentPropertyDefinition
andNuGet.ContentModel.PatternTable.TryLookup
. I'm not sure how widely those are used by anyone so I marked the associated issue as a "breaking change" just to be safe.PR Checklist
PR has a meaningful title
PR has a linked issue.
Described changes
Tests
Documentation