Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce allocations in TokenSegment.TryMatch #5676

Closed
wants to merge 6 commits into from

Conversation

jeffkl
Copy link
Contributor

@jeffkl jeffkl commented Mar 8, 2024

Bug

Fixes: NuGet/Home#12728

Also closes internal workitem.

Regression? Last working version:

Description

This uses ReadOnlyMemory<char>.Slice() instead of string.Substring() to reduce allocations when pattern matching. This also needs to cache the matches with a Dictionary<ReadOnlyMemory<char>, object> in PatternTable so I needed to re-use an ordinal comparer that I found in another repo. The rest of the changes are fallout from changing the pattern we're looking for from a string to ReadOnlyMemory<char>. The reason I'm using ReadOnlyMemory<T> instead of ReadOnlySpan<T> is that ReadOnlySpan<T> is a readonly ref struct and can't be used as a type parameter.

I had to change the public API of NuGet.ContentModel.ContentPropertyDefinition and NuGet.ContentModel.PatternTable.TryLookup. I'm not sure how widely those are used by anyone so I marked the associated issue as a "breaking change" just to be safe.

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception - Updated existing tests
    • OR
    • N/A
  • Documentation

    • Documentation PR or issue filled
    • OR
    • N/A

@jeffkl jeffkl requested review from davkean and Erarndt March 8, 2024 18:14
@jeffkl jeffkl self-assigned this Mar 8, 2024
@jeffkl jeffkl force-pushed the dev-jeffkl-tokensegment-trymatch branch from c114686 to 3555d15 Compare March 8, 2024 21:06
@jeffkl jeffkl marked this pull request as ready for review March 8, 2024 22:55
@jeffkl jeffkl requested a review from a team as a code owner March 8, 2024 22:55
Nigusu-Allehu
Nigusu-Allehu previously approved these changes Mar 8, 2024
@nkolev92
Copy link
Member

nkolev92 commented Mar 8, 2024

@jeffkl Do you have any numbers on this?

@@ -12,8 +12,8 @@ namespace NuGet.ContentModel
/// </summary>
public class PatternTable
{
private readonly Dictionary<string, Dictionary<string, object>> _table
= new Dictionary<string, Dictionary<string, object>>(StringComparer.Ordinal);
private readonly Dictionary<string, Dictionary<ReadOnlyMemory<char>, object>> _table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels risky to me, but I don't have a better suggestion.

If the ReadOnlyMemory<char> comes from a char[], then there's nothing in the C# language preventing code from modifying values in the char[], and therefore using the ReadOnlyMemory<char> as a dictionary key is going to be "corrupt", since the value and therefore hash code has now changed.

Even in .NET 8 and 9, I don't believe it's possible to have a Dictionary<string, TValue>, and then use a ReadOnlyMemory<char> to get values.

I was hoping that wring out some of these words would give me a better idea (kind of like rubber duck debugging), but unfortunately I haven't. I didn't know if using ReadOnlyMemory<char> as a dictionary key world work, as it needs to implement GetHashCode "correctly", but if it does, then it's safe enough, as long as the key used to add into the dictionary is never modified. It's generally nicer when we can get the compiler to check things for us, but I guess in this case we're going to have to manually read and reason about the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write a debug-only assertion that the underlying type is a string via MemoryMarshal.TryGetString.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the lookup should verify the backing memory is a string?

Yes. "Good" code is easy to read and verify it's doing the right thing. The .Add call passes in a type that I can't figure out by reading this file alone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values of this dictionary are added in the constructor only here which the caller specifies strings and this class stores them as a ReadOnlyMemory<char> so that later on lookups can be ReadOnlyMemory<char>.

It sounds like instead there should be a Debug.Assert() in the lookup code here to verify that the caller specified a ReadOnlyMemory<char> that is backed by a string?

Comment on lines 105 to 107
if (charactersRemaining > 0)
{
num1 = ((num1 << 5) + num1 + (num1 >> 27)) ^ pSpan0[obj.Length - 1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works correctly when charactersRemaining is 1, 2, or 3, it's unintuitive to me how. It looks to me like it only considers the last character, which means when there's 2 or 3 remaining, this method will calculate the "wrong" hash (or it's easy to have a hash collection when only pSpan0[obj.Length - 2] is the only different character).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -206,7 +206,7 @@ public TokenSegment(string token, char delimiter, bool matchOnly, PatternTable t
}
if (StringComparer.Ordinal.Equals(_token, "tfm"))
{
item.Properties.Add("tfm_raw", substring);
item.Properties.Add("tfm_raw", substring.ToString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside of this PR, but probably perfect case to have this pulled from a name table so we only ever create one of these strings per "raw TFM".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -22,7 +21,7 @@ public class ManagedCodeConventions

private static readonly ContentPropertyDefinition AnyProperty = new ContentPropertyDefinition(
PropertyNames.AnyValue,
parser: (o, t) => o); // Identity parser, all strings are valid for any
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this run? I would be worried that this defeats the purpose of the change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in this code path, but my understanding is that any pattern that has {any} or
{any?} path segment will execute this code path. Looking at the code https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L421-L573, almost all patterns except MSBuild related files have the said any path segment in them. If all the parsers have .ToString() in them, I am also worried that it might defeat the purpose of the change. However, a PerfView trace before and after the fix would help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is for path segments that match on everything for example lib/{tfm}/{any}. That case, the o.ToString() makes a string out of something like PackageA.dll which gets added to the properties in TokenSegment.TryMatch(). This means the substring was used so its okay to realize it to a string? Since the point was to stop allocating a substring if its not used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the point was to stop allocating a substring if its not used.

I agree with this goal. Is there any test that checks for the scenario where substrings are not used? The reason is that the pattern matching algorithm is executed during the pack and restore operation. Hence, I am curious to understand the scenario where substrings are not used. For example, I debugged first 3-unit tests in https://github.com/NuGet/NuGet.Client/blob/dev/test/NuGet.Core.Tests/NuGet.Packaging.Test/ContentModelTests/ContentModelLibTests.cs and noticed that substrings are invoked every time. Maybe I chose wrong tests to debug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kartheekp-ms Are you saying are created for every segment in those strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substrings are created if the segment satisfies a particular parser logic. For example, if it is a {any} segment and path token has xyz.dll, the code creates a substring. Likewise looking at all the parsers logic, there is .ToString() invocation which is invoked if the token satisfies a path segment.

@@ -161,18 +160,18 @@ private static object Locale_Parser(string name, PatternTable table)

if (name.Length == 2)
{
return name;
return name.ToString();
Copy link
Contributor

@davkean davkean Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as above, when does this run? If it runs always, then this will defeat the purpose of the original change on all strings of two chars.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code path will be executed only when parsing resource assemblies because they have {locale} path segment in them. https://github.com/NuGet/NuGet.Client/blob/dev/src/NuGet.Core/NuGet.Packaging/ContentModel/ManagedCodeConventions.cs#L486-L497

@@ -12,8 +12,8 @@ namespace NuGet.ContentModel
/// </summary>
public class PatternTable
{
private readonly Dictionary<string, Dictionary<string, object>> _table
= new Dictionary<string, Dictionary<string, object>>(StringComparer.Ordinal);
private readonly Dictionary<string, Dictionary<ReadOnlyMemory<char>, object>> _table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the lookup should verify the backing memory is a string?

Yes. "Good" code is easy to read and verify it's doing the right thing. The .Add call passes in a type that I can't figure out by reading this file alone.

{
[Theory]
[InlineData("AbCDeFg", "AbCDeFg", true)]
[InlineData("AbCDeFG", "abcdefg", false)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand how the code works, or why this test passes. Can anybody explain it to me?

The for loop processes multiples of 4 characters. Then there's an if (charactersRemaining > 0) which runs num1 = whatever(num1) ^ pSpan0[obj.Length - 1];

In the case of 7 character test, the for loop processes array indexes 0, 1, 2, and 3. charactersRemaining should be 3, so, the if statement processes pSpan0[7 - 1] or pSpan0[6].

How are characters 4 and 5 contributing to the hash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, I tried debugging it but ran out of time. I did add a unit test to verify that the hash of ABCDEFGH and ABCDefgH are not the same which I think shows that its working as expected?

@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Mar 22, 2024
@davkean
Copy link
Contributor

davkean commented Mar 26, 2024

@jeffkl What's blocking this from making progress?

@dotnet-policy-service dotnet-policy-service bot removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Mar 26, 2024
@jeffkl jeffkl force-pushed the dev-jeffkl-tokensegment-trymatch branch 2 times, most recently from 7590e49 to f1a0893 Compare March 26, 2024 12:51
@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 26, 2024

@jeffkl What's blocking this from making progress?

I was the NuGet on-call engineer last week. Hoping to get back to this today or tomorrow, I'm also helping with the 1ES PT effort.

@jeffkl jeffkl force-pushed the dev-jeffkl-tokensegment-trymatch branch from f1a0893 to 99c6c8b Compare March 27, 2024 18:49
@kartheekp-ms
Copy link
Contributor

@jeffkl - I have a slightly different understanding of why this code path is causing too many allocations. My understanding is that the current pattern-matching algorithm is suboptimal. For every asset in the package, we check it against all the patterns to find a match. Instead, we could have some kind of state machine that we traverse with each asset exactly once to determine the asset type. For example, in the image below, I have added numbers (1 to 10) to denote roots. Each asset will navigate through only one root, and if it reaches the terminal state on that path, then there is a match, meaning it could be a runtime assembly, a native assembly, or something else.

IMG_5536

@kartheekp-ms
Copy link
Contributor

@jeffkl - I have a slightly different understanding of why this code path is causing too many allocations. My understanding is that the current pattern-matching algorithm is suboptimal. For every asset in the package, we check it against all the patterns to find a match. Instead, we could have some kind of state machine that we traverse with each asset exactly once to determine the asset type. For example, in the image below, I have added numbers (1 to 10) to denote roots. Each asset will navigate through only one root, and if it reaches the terminal state on that path, then there is a match, meaning it could be a runtime assembly, a native assembly, or something else.

IMG_5536

Team Triage: Filed a tech debt issue https://github.com/NuGet/Client.Engineering/issues/2780.

@jeffkl
Copy link
Contributor Author

jeffkl commented Apr 4, 2024

@jeffkl Do you have any numbers on this?

Before:
image

After:
image

@davkean
Copy link
Contributor

davkean commented Apr 4, 2024 via email

@jeffkl
Copy link
Contributor Author

jeffkl commented Apr 4, 2024

Can you share out a trace?

________________________________ From: Jeff Kluge @.> Sent: Friday, April 5, 2024 6:55 AM To: NuGet/NuGet.Client @.> Cc: David Kean @.>; Review requested @.> Subject: Re: [NuGet/NuGet.Client] Reduce allocations in TokenSegment.TryMatch (PR #5676) @jeffklhttps://github.com/jeffkl Do you have any numbers on this? Before: image.png (view on web)https://github.com/NuGet/NuGet.Client/assets/17556515/e3383573-3d28-40b8-8e27-ca1307d2c65a After: image.png (view on web)https://github.com/NuGet/NuGet.Client/assets/17556515/1a869020-04fb-4ac0-875d-877487766b75 — Reply to this email directly, view it on GitHub<#5676 (comment)>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAINQIQFJT2BMFOPAGSYEZDY3WV4LAVCNFSM6AAAAABENHJ4UWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZYGA4TKMZZGE. You are receiving this because your review was requested.Message ID: @.***>

Before / After

@davkean
Copy link
Contributor

davkean commented Apr 4, 2024

I compared the traces. The highlighted areas where I believe the above change affects. We moved ~45 MB -> 22 MB, but I want to see this is repeatible (I see differences where I don't expect), so can you run the after a few times and make sure that the TryMatch numbers match below within a MB or so. If not, then we don't have a deterministic repro, and we should pick something larger.

image

@jeffkl
Copy link
Contributor Author

jeffkl commented Apr 4, 2024

Here are 4 runs of before:
image

And after:
image

@Erarndt
Copy link
Contributor

Erarndt commented Apr 4, 2024

Here are 4 runs of before: image

And after: image

Hmm... Something seems off. The metrics in the "after" look worse. Max peak MB from ~1,200 MB => ~1,700 MB, Total Pause from ~252 => ~460.

@zivkan
Copy link
Member

zivkan commented Apr 5, 2024

In the before GC stats, only 1 of the 4 had a gen 2 GC, which only paused for 14ms. All 4 of the after GC stats show a gen 2 GC, and takes around 150+ ms.

@davkean
Copy link
Contributor

davkean commented Apr 5, 2024

I'm not super convinced its testing the exact same thing; I see allocation increases/decreases in the places I expect, but increases in MSBuild code:

image

Why would MSBuild be doing more work?

@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Apr 12, 2024
@jeffkl jeffkl removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Apr 26, 2024
@jeffkl jeffkl reopened this Apr 26, 2024
@donnie-msft
Copy link
Contributor

Why would MSBuild be doing more work?

@jeffkl could you address some of the pending concerns in this PR?

Copy link
Contributor

@donnie-msft donnie-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several comments that haven't been addressed. Is this still viable to review or is it a work in progress?

@@ -206,7 +206,7 @@ public TokenSegment(string token, char delimiter, bool matchOnly, PatternTable t
}
if (StringComparer.Ordinal.Equals(_token, "tfm"))
{
item.Properties.Add("tfm_raw", substring);
item.Properties.Add("tfm_raw", substring.ToString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -22,7 +21,7 @@ public class ManagedCodeConventions

private static readonly ContentPropertyDefinition AnyProperty = new ContentPropertyDefinition(
PropertyNames.AnyValue,
parser: (o, t) => o); // Identity parser, all strings are valid for any
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean?

Suggested change
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any
parser: (o, t) => o.ToString()); // Identity parser, all strings are valid for any <??>

@@ -90,7 +88,7 @@ public ManagedCodeConventions(RuntimeGraph runtimeGraph)

props[PropertyNames.RuntimeIdentifier] = new ContentPropertyDefinition(
PropertyNames.RuntimeIdentifier,
parser: (o, t) => o, // Identity parser, all strings are valid runtime ids :)
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids :)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids :)
parser: (o, t) => o.ToString(), // Identity parser, all strings are valid runtime ids

[InlineData("ref/four/foo.dll", "lib/five/bar.dll", false, 4, 4)]
public void Equals_ReturnsExpectedValue(string x, string y, bool expected, int? start = default, int? length = default)
{
ReadOnlyMemoryCharComparerOrdinal.Instance.Equals(x.AsMemory(start ?? 0, length ?? x.Length), y.AsMemory(start ?? 0, length ?? y.Length)).Should().Be(expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I personally think multi-lines with variables is far easier to read and parse than a single line.

@jeffkl jeffkl marked this pull request as draft May 7, 2024 17:08
@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label May 14, 2024
@jeffkl jeffkl removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label May 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label May 23, 2024
@jeffkl jeffkl removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label May 24, 2024
@jeffkl jeffkl marked this pull request as ready for review May 24, 2024 16:59
@jeffkl jeffkl force-pushed the dev-jeffkl-tokensegment-trymatch branch from 99c6c8b to b383710 Compare May 24, 2024 17:01
@jeffkl jeffkl marked this pull request as draft May 24, 2024 17:03
@dotnet-policy-service dotnet-policy-service bot added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label May 31, 2024
Copy link
Contributor

This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 7 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed
Projects
None yet
8 participants