Arm64: Implement region write barriers #111636
I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy them up and add them somewhere in docs/
@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)
I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great. I've been using your |
Afair it's not bottle-necked in Write-Barrier + presumably, your PR is supposed to decrease average GC pause rather than WB's throughput? So you might want to look at the GC stats? the |
@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler
using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;
    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}
I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. the
    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz w12, LOCAL_LABEL(Exit)
(I guess I should've added an extra benchmark where the object we're storing is gen2.)
PS: feel free to call the bot yourself if needed.
Sorry for the delay. I would run the microbenchmarks with and without this change on the pertinent hardware on the following tests given below for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other considerations while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand - this can be achieved by:
Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests. As a note: the following for the regression that was created because of us moving to a More Precise Write Barrier for x64: #73783 - seems like one of the affected microbenchmarks is already in the aforementioned list. I remember |
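The two metrics named above (percentage difference in execution time, plus the standard error of each test) can be computed along these lines. This is a minimal sketch, not part of the PR or of any benchmarking tool: `Summary`, `summarize` and `percent_diff` are hypothetical names, and the sample standard deviation with an n-1 divisor is an assumption about how the error would be estimated.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: mean and standard error of one benchmark's timings.
struct Summary {
    double mean;
    double std_error;
};

// Computes the mean and the standard error (sample stddev / sqrt(n)).
Summary summarize(const std::vector<double>& samples) {
    const std::size_t n = samples.size();
    if (n == 0) {
        return {0.0, 0.0};
    }
    double sum = 0.0;
    for (double s : samples) {
        sum += s;
    }
    const double mean = sum / static_cast<double>(n);

    double squared = 0.0;
    for (double s : samples) {
        squared += (s - mean) * (s - mean);
    }
    // Sample standard deviation (n - 1 divisor); zero when only one sample exists.
    const double stddev =
        n > 1 ? std::sqrt(squared / static_cast<double>(n - 1)) : 0.0;
    return {mean, stddev / std::sqrt(static_cast<double>(n))};
}

// Percentage change of the comparand relative to the baseline.
// Negative means the comparand took less time.
double percent_diff(double baseline_mean, double comparand_mean) {
    return (comparand_mean - baseline_mean) / baseline_mean * 100.0;
}
```

A difference is only worth reporting when it is clearly larger than the standard errors of the two runs being compared, which is why both numbers are gathered together.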
As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.
running most of the tests as suggested, I don't see any differences. Everything seems within error margins:
@EgorBot -linux_ampere -linux_cobalt100 -windows_cobalt100 -profiler --envvars DOTNET_GCWriteBarrier:3
using System;
using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;
    static object Value = new();

    static MyBench()
    {
        // Collect twice before the benchmarks run
        GC.Collect();
        GC.Collect();
    }

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}
The WriteBarrier=3 results are a little better than expected. With this setting we're using the old write barrier, except it has one or two fewer checks to do, and it's showing a gain. Oddly, Windows has a 20% gain for nonephemeral!
I noticed that the ShadowUpdate code is never called, as Removing the DOTNET_GCWriteBarrier=0
DOTNET_GCWriteBarrier=3
It has removed all the slowdown added by this PR, and given additional perf when writebarrier=3. Looking at Am64, when (Note I'll be away for 2 weeks, so will implement when I get back) |
I did some runs of Orchard CMS based on Egor's script, on Cobalt 100: HEAD: PR: So a couple of percent better overall with the PR. I tried with the GCShadow checks removed, but the figures look identical to the PR.
GCShadow should be present in debug and checked builds of the runtime only, not in release builds. I assume that all perf measurements are done on a release build. Is that correct? If so, it makes sense that removing the GCShadow checks has no impact on the results.
Yes, on a release build WRITE_BARRIER_CHECK shouldn't be defined. I'll double check to make sure I've been using release for the microbenchmarks.
Orchard CMS results were using a Release build. So the figures above, with ~100 Requests/sec improvement are correct. However, my Ephemeral tests were using a Checked build. Here's using a Release build. These match better to the EgorBot results.
With DOTNET_GCWriteBarrier=3
Is there any additional testing anyone wanted?
I'm back from vacation and have asked @a74nh to please edit the original description of this PR to include a summary of the perf results so we'll have an easier time knowing the perf behavior (instead of having to read many comments on the PR).
Test Results
This comment will be extended as I gather results. It contains more details for the perf results in the top message. I intend to keep this comment up to date with the latest results.
All run on an 8 core Cobalt 100, Ubuntu 24.04.2
Ephemeral test (dotnet/performance)
With DOTNET_GCWriteBarrier=3
GCPerfsim
Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0
No environment variables set (bit region write barriers):
DOTNET_GCWriteBarrier=2 (byte region write barriers):
DOTNET_GCWriteBarrier=3 (server write barriers):
DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
Flags: -tc 2 -tagb 200 -tlgb 8 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0
No environment variables set (bit region write barriers):
DOTNET_GCWriteBarrier=2 (byte region write barriers):
DOTNET_GCWriteBarrier=3 (server write barriers):
DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
Orchard CMS benchmark
HEAD:
PR:
@a74nh and I have been looking at the profiles and we need to do a new run, as the earlier runs were doing mostly gen1 GCs with very few gen0 GCs, which made the comparison not meaningful. We did notice a problem with the runs @a74nh did, where the BGC pause times were much higher with the fix build, which I was going to take a look at.
The higher pause times were due to issues in the way the results were being gathered, which have now been fixed. New runs of GCPerfSim have been done with a meaningful number of GCs. Full results here: #111636 (comment) The best result is -21.06% gen0 pause time and -14.25% gen1 pause time. Meanwhile, GCWriteBarrier=3 is now showing no change from HEAD (as we wanted). A reduced version is in the top comment.
thanks so much, @a74nh, for your contribution and being patient with the perf data collection, discussion and meetings at odd hours :) this work is greatly appreciated!
Change-Id: Ia4f89dce9cb5aeedeeac16e54b7e35e9f255f68b
(@Maoni0 will merge this PR when all the data is collected)
Extend the Arm64 writebarrier function to support regions and use the WriteBarrierManager, similar to Amd64. This results in 10 different versions of the JIT_WriteBarrier, with the WriteBarrierManager deciding on which version to use.
Pseudo code for the writebarrier is included in GC-write-barriers.md
This is expected to make the writebarrier slower, but improve the performance of the GC. DOTNET_GCWriteBarrier=3 can be used to give the same functionality as before this change.
The behavior of the writebarrier is:
Before the PR: check ephemeral bounds, update a byte in the card table, mark the card bundle
After the PR:
DOTNET_GCWriteBarrier=1 (default, bit region write barriers): check ephemeral bounds, check regions, update a bit in the card table, mark the card bundle
DOTNET_GCWriteBarrier=2 (byte region write barriers): check ephemeral bounds, check regions, update a byte in the card table, mark the card bundle
DOTNET_GCWriteBarrier=3 (server write barriers): check ephemeral bounds, update a byte in the card table, mark the card bundle. This is the same as before the PR.
DOTNET_gcServer=1: update a byte in the card table, mark the card bundle.
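The steps in the list above can be sketched in C++ as follows. This is an illustrative model under stated assumptions, not the runtime's code: the real barriers are hand-written assembly selected by the WriteBarrierManager, and every name here (`ToyHeap`, `make_toy_heap`, `write_barrier`) and every shift constant is hypothetical. The region check models only the "storing into a gen 0 region" test quoted earlier in the thread; any further generation comparisons are left out. The DOTNET_gcServer=1 variant, which skips the ephemeral bounds check, is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical granularities, chosen only to make the sketch concrete.
constexpr unsigned kRegionShift     = 22; // one region-generation entry per 4 MiB
constexpr unsigned kCardBitShift    = 8;  // bit mode: one card bit per 256 bytes
constexpr unsigned kCardByteShift   = 11; // one card-table byte per 2 KiB
constexpr unsigned kCardBundleShift = 21; // one card-bundle byte per 2 MiB

enum class CardMode { Bit, Byte };

struct ToyHeap {
    std::uintptr_t base;            // start of the range the tables cover
    std::uintptr_t ephemeral_low;   // inclusive
    std::uintptr_t ephemeral_high;  // exclusive
    std::vector<std::uint8_t> region_generation; // 0 means gen 0
    std::vector<std::uint8_t> card_table;
    std::vector<std::uint8_t> card_bundle_table;
};

// Builds a 16 MiB toy heap: region 0 is gen 2, regions 1-2 are ephemeral,
// region 3 is gen 2.
ToyHeap make_toy_heap() {
    const std::size_t size = std::size_t{4} << kRegionShift;
    ToyHeap h;
    h.base = 0x40000000u;
    h.ephemeral_low = h.base + (std::uintptr_t{1} << kRegionShift);
    h.ephemeral_high = h.base + (std::uintptr_t{3} << kRegionShift);
    h.region_generation = {2, 0, 1, 2};
    h.card_table.assign(size >> kCardByteShift, 0);
    h.card_bundle_table.assign(size >> kCardBundleShift, 0);
    return h;
}

// Bookkeeping that follows the store `*dst = value`.
// Returns true if a card was newly marked.
bool write_barrier(ToyHeap& h, std::uintptr_t dst, std::uintptr_t value,
                   CardMode mode, bool check_regions) {
    const std::size_t heap_size = h.region_generation.size() << kRegionShift;
    assert(dst >= h.base && dst - h.base < heap_size);
    const std::uintptr_t offset = dst - h.base;

    // Step 1: check ephemeral bounds on the stored reference.
    if (value < h.ephemeral_low || value >= h.ephemeral_high) {
        return false;
    }
    // Step 2 (region barriers only): nothing to do when storing into gen 0.
    if (check_regions && h.region_generation[offset >> kRegionShift] == 0) {
        return false;
    }
    // Step 3: update a bit or a whole byte in the card table.
    std::uint8_t& card = h.card_table[offset >> kCardByteShift];
    const std::uint8_t mask =
        (mode == CardMode::Bit)
            ? static_cast<std::uint8_t>(1u << ((offset >> kCardBitShift) & 7u))
            : std::uint8_t{0xFF};
    if ((card & mask) == mask) {
        return false; // already marked
    }
    card |= mask;
    // Step 4: mark the card bundle covering this card.
    h.card_bundle_table[offset >> kCardBundleShift] = 0xFF;
    return true;
}
```

In this model, DOTNET_GCWriteBarrier=1 corresponds to `CardMode::Bit` with `check_regions` set, DOTNET_GCWriteBarrier=2 to `CardMode::Byte` with `check_regions` set, and DOTNET_GCWriteBarrier=3 to `CardMode::Byte` without the region check. The extra table lookup in step 2 is the cost the microbenchmarks see, while the narrower marking in step 3 is what leaves the GC with fewer cards to scan.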
Test results on an 8 core Cobalt 100.
Ephemeral test (dotnet/performance)
WB_nonephemeral : -20%
WB_ephemeral: -16%
WKS GC is calculating the generation of regions in addition to comparing with g_ephemeral_low/high. So while it might set fewer cards, it is more expensive and it shows.
With DOTNET_GCWriteBarrier=3:
WB_nonephemeral : +15%
WB_ephemeral: +1%
SVR GC WB also became more expensive but it sets way fewer cards (for nonephemeral it should set almost no cards).
GCPerfsim
Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0
No environment variables set:
Gen0 pause: -21.06%. Gen1 pause -14.25%
DOTNET_GCWriteBarrier=2:
Gen0 pause: -6.7%. Gen1 pause -2.78%
DOTNET_GCWriteBarrier=3 :
Gen0 pause: -1.37%. Gen1 pause -1.26%
DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
Gen0 pause: -7.24%. Gen1 pause -3.49%
The numbers above are for Linux. On Windows, with no environment variables set, the pause improvements are smaller but still quite noticeable, at around 8% to 10% for this GCPerfSim config.
Looking at the card marking speed, it has clearly improved.
Orchard CMS benchmark
+~2% reqs/sec