Arm64: Implement region write barriers #111636

Merged on May 17, 2025 (52 commits)

a74nh (Contributor) commented Jan 20, 2025

(@Maoni0 will merge this PR when all the data is collected)

Extend the Arm64 writebarrier function to support regions and use the WriteBarrierManager, similar to Amd64. This results in 10 different versions of the JIT_WriteBarrier, with the WriteBarrierManager deciding on which version to use.

Pseudo code for the writebarrier is included in GC-write-barriers.md

This is expected to make the writebarrier slower, but improve the performance of the GC. DOTNET_GCWriteBarrier=3 can be used to give the same functionality as before this change.

The behavior of the writebarrier is:
- Before the PR: check ephemeral bounds, update a byte in the card table, mark the card bundle.
- After the PR:
  - DOTNET_GCWriteBarrier=1 (default, bit region write barriers): check ephemeral bounds, check regions, update a bit in the card table, mark the card bundle.
  - DOTNET_GCWriteBarrier=2 (byte region write barriers): check ephemeral bounds, check regions, update a byte in the card table, mark the card bundle.
  - DOTNET_GCWriteBarrier=3 (server write barriers): check ephemeral bounds, update a byte in the card table, mark the card bundle. This is the same as before the PR.
  - DOTNET_gcServer=1: update a byte in the card table, mark the card bundle.
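
As a rough illustration of the variants listed above, the following Python model sketches which checks each one performs. It is not the runtime's code (the real barriers are Arm64 assembly, with pseudo code in GC-write-barriers.md). The shift constants, the `Heap` class, the `region_gen` table and the exact form of the region check are assumptions made only so that the sketch runs.

```python
from dataclasses import dataclass, field

# Hypothetical constants, chosen for illustration only.
REGION_SHIFT = 22      # assume 4 MiB regions
CARD_BYTE_SHIFT = 11   # assume one card-table byte covers 2 KiB
CARD_BIT_SHIFT = 8     # assume one card (one bit) covers 256 bytes
BUNDLE_SHIFT = 21      # assume one card-bundle entry covers 2 MiB


@dataclass
class Heap:
    ephemeral_low: int
    ephemeral_high: int
    region_gen: dict                                   # region index -> generation
    card_table: dict = field(default_factory=dict)     # byte index -> byte value
    card_bundles: set = field(default_factory=set)     # marked bundle indices


def write_barrier(heap, dst, ref, mode):
    """Model a store of reference `ref` at address `dst`.

    mode 1 = bit region, 2 = byte region, 3 = server-style barrier with
    the ephemeral check, "svr" = DOTNET_gcServer=1 (no ephemeral check).
    Returns True when a card was marked.
    """
    if mode != "svr":
        # Check ephemeral bounds: only references into the ephemeral range matter.
        if not (heap.ephemeral_low <= ref < heap.ephemeral_high):
            return False
    if mode in (1, 2):
        # Check regions (simplified assumption): only record the store when the
        # referenced object lives in a younger generation than the destination.
        dst_gen = heap.region_gen[dst >> REGION_SHIFT]
        ref_gen = heap.region_gen[ref >> REGION_SHIFT]
        if ref_gen >= dst_gen:
            return False
    byte_index = dst >> CARD_BYTE_SHIFT
    if mode == 1:
        # Bit barrier: set only the bit for the card containing dst.
        bit = (dst >> CARD_BIT_SHIFT) & 7
        heap.card_table[byte_index] = heap.card_table.get(byte_index, 0) | (1 << bit)
    else:
        # Byte barriers: mark the whole card-table byte.
        heap.card_table[byte_index] = 0xFF
    heap.card_bundles.add(dst >> BUNDLE_SHIFT)
    return True
```

In this model the region variants do more work per store but mark fewer (and, in bit mode, finer-grained) cards, which is the trade-off the measurements below are probing.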

Test results on an 8 core Cobalt 100.

Ephemeral test (dotnet/performance)

WB_nonephemeral : -20%
WB_ephemeral: -16%

WKS GC is calculating the generation of regions in addition to comparing with g_ephemeral_low/high. So while it might set fewer cards, it is more expensive, and it shows.

With DOTNET_GCWriteBarrier=3:
WB_nonephemeral : +15%
WB_ephemeral: +1%

SVR GC WB also became more expensive but it sets way fewer cards (for nonephemeral it should set almost no cards).

GCPerfsim

Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set:
Gen0 pause: -21.06%. Gen1 pause -14.25%

DOTNET_GCWriteBarrier=2:
Gen0 pause: -6.7%. Gen1 pause -2.78%

DOTNET_GCWriteBarrier=3 :
Gen0 pause: -1.37%. Gen1 pause -1.26%

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
Gen0 pause: -7.24%. Gen1 pause -3.49%

The numbers above are from Linux. On Windows, with no environment variables set, the pause improvements are smaller but still quite noticeable, around 8% to 10% for this GCPerfSim configuration.

|  | Baseline | 13608 | Diff: 13608 | Diff %: 13608 |
|--- |---:|---:|---:|---:|
| Process ID | 19732 | 13608 |  |  |
| Process Name | corerun | corerun |  |  |
| Commandline | corerun.exe GCPerfSim.dll -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0 | corerun.exe GCPerfSim.dll -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0 |  |  |
| Process Duration (Sec) | 35.945 | 32.834 | -3.111 | -8.655 |
| Total Allocated MB | 215,230.37 | 215,263.87 | 33.505 | 0.016 |
| Max Size Peak MB | 4,444.05 | 4,505.40 | 61.357 | 1.381 |
| GC Count | 38,865.00 | 38,728.00 | -137 | -0.353 |
| Heap Count | 1 | 1 | 0 | 0 |
| Gen0 Count | 3,076.00 | 3,646.00 | 570 | 18.531 |
| Gen1 Count | 35,774.00 | 35,067.00 | -707 | -1.976 |
| Ephemeral Count | 38,850.00 | 38,713.00 | -137 | -0.353 |
| Gen2 Blocking Count | 1 | 1 | 0 | 0 |
| BGC Count | 14 | 14 | 0 | 0 |
| Gen0 Total Pause Time MSec | 1,302.02 | 1,386.41 | 84.388 | 6.481 |
| Gen1 Total Pause Time MSec | 16,992.42 | 14,964.89 | -2,027.52 | -11.932 |
| Ephemeral Total Pause Time MSec | 18,294.43 | 16,351.30 | -1,943.14 | -10.621 |
| Blocking Gen2 Total Pause Time MSec | 2.319 | 2.271 | -0.048 | -2.07 |
| BGC Total Pause Time MSec | 4.225 | 4.44 | 0.215 | 5.081 |
| GC Pause Time % | 50.914 | 49.82 | -1.093 | -2.148 |
| Avg. Gen0 Pause Time (ms) | 0.423 | 0.38 | -0.043 | -10.165 |
| Avg. Gen1 Pause Time (ms) | 0.475 | 0.427 | -0.048 | -10.156 |
| Avg. Gen0 Promoted (mb) | 0.862 | 0.8 | -0.061 | -7.119 |
| Avg. Gen1 Promoted (mb) | 0.783 | 0.787 | 0.004 | 0.573 |
| Avg. Gen0 Speed (mb/ms) | 2.036 | 2.105 | 0.069 | 3.391 |
| Avg. Gen1 Speed (mb/ms) | 1.648 | 1.845 | 0.197 | 11.943 |
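
For readers checking the figures, the Diff and Diff % values above appear to be the comparand minus the baseline, and that difference relative to the baseline. A minimal Python sketch, with a hypothetical helper name `diff_and_percent`, reproduces three of the rows to within rounding:

```python
def diff_and_percent(baseline, comparand):
    """Return (comparand - baseline, that difference as a % of baseline)."""
    diff = comparand - baseline
    return diff, diff / baseline * 100.0


# (baseline, comparand) pairs copied from the table above.
rows = {
    "Process Duration (Sec)": (35.945, 32.834),
    "Gen0 Count": (3076.0, 3646.0),
    "Gen1 Total Pause Time MSec": (16992.42, 14964.89),
}

for name, (baseline, comparand) in rows.items():
    diff, pct = diff_and_percent(baseline, comparand)
    print(f"{name}: diff={diff:.3f}, diff%={pct:.3f}")
```

Negative percentages on the pause-time rows therefore mean the comparand spent less time paused than the baseline.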

Looking at the card marking speed, it's clearly improved:

[image]

Orchard CMS benchmark

+~2% reqs/sec

kunalspathak (Member) commented Jan 21, 2025

FYI - @Maoni0
@mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

a74nh (Contributor Author) commented Jan 21, 2025

I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy them up and add them somewhere in docs/.

EgorBo (Member) commented Jan 23, 2025

@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

a74nh (Contributor Author) commented Jan 23, 2025

> @a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great.

I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

EgorBo (Member) commented Jan 23, 2025

> I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

AFAIR it's not bottlenecked on the write barrier, and presumably your PR is supposed to decrease the average GC pause rather than improve the WB's throughput? So you might want to look at the GC stats. orchard.sh should have a USE_DOTNET_TRACE property that you need to set to 1 to grab traces (and set DOTNET_TRACE_ARGS to listen to GC events specifically).

EgorBo (Member) commented Jan 23, 2025

@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}

EgorBo (Member) commented Jan 23, 2025

I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. The WB_nonephemeral perf is mostly here: https://gist.github.com/EgorBot/a6db6579aba05de6a25f111513cb54b2#file-diff_asm_bcd38073-asm-L30 which is, I guess,

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz  w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where object we're storing is gen2)

PS: feel free to call the bot yourself if needed

mrsharm (Member) commented Jan 24, 2025

> FYI - @Maoni0 @mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

  1. Not removing the outliers: --outliers DontRemove.
  2. Setting a fixed number of invocations that'll be high enough to reduce the standard error: --invocationCount {InvocationCount}
  3. Setting a fixed number of iterations: --iterationCount 20.

- System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536*)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 10000)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536*)
- System.Collections.CtorGivenSize<String>.Array(size: 512)
- ByteMark.BenchBitOps
- System.IO.Tests.Perf_File.ReadAllBytes(size: 104857600)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Linq.Tests.Perf_Enumerable.ToArray*
- System.Collections.Tests.Perf_BitArray.BitArrayByteArrayCtor(size: 512)

Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests.

As a note, see #73783 for the regression that was created when we moved to a More Precise Write Barrier for x64. It seems one of the affected microbenchmarks is already in the list above. I remember StackWalk being extremely volatile, but it is still worth trying.

cshung (Contributor) commented Jan 24, 2025

As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.

a74nh (Contributor Author) commented Jan 27, 2025

> Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

Running most of the tests as suggested, I don't see any differences. Everything seems within error margins:



| Method                     | Job        | Toolchain                                                                          | length | pinned | Mean        | Error     | StdDev    | Median      | Min         | Max        | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Gen1   | Gen2   | Allocated | Alloc Ratio |
|--------------------------- |----------- |----------------------------------------------------------------------------------- |------- |------- |------------:|----------:|----------:|------------:|------------:|-----------:|------:|---------------- |--------:|-------:|-------:|-------:|----------:|------------:|
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   129.78 ns | 53.253 ns | 61.326 ns |   118.07 ns |   108.50 ns |   388.8 ns |  1.08 | Baseline        |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   137.49 ns | 53.415 ns | 61.512 ns |   125.97 ns |   116.80 ns |   396.9 ns |  1.15 | Same            |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   103.60 ns | 51.462 ns | 59.263 ns |    89.10 ns |    88.63 ns |   354.8 ns |  1.11 | Baseline        |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   103.35 ns | 51.294 ns | 59.070 ns |    88.76 ns |    88.21 ns |   353.4 ns |  1.10 | Same            |    0.65 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   744.34 ns |  7.498 ns |  8.634 ns |   741.62 ns |   735.19 ns |   764.7 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   743.07 ns |  9.170 ns | 10.561 ns |   740.52 ns |   732.56 ns |   763.7 ns |  1.00 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   735.06 ns | 10.791 ns | 12.426 ns |   728.98 ns |   720.78 ns |   757.2 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   748.82 ns |  8.844 ns | 10.185 ns |   743.99 ns |   736.23 ns |   767.8 ns |  1.02 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   626.94 ns | 39.042 ns | 44.961 ns |   618.03 ns |   588.73 ns |   805.0 ns |  1.00 | Baseline        |    0.09 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   623.92 ns | 74.318 ns | 85.585 ns |   601.31 ns |   589.99 ns |   983.1 ns |  1.00 | Same            |    0.15 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   142.84 ns | 17.866 ns | 20.575 ns |   138.18 ns |   134.39 ns |   228.9 ns |  1.01 | Baseline        |    0.17 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   149.25 ns | 16.513 ns | 19.016 ns |   146.35 ns |   137.79 ns |   227.3 ns |  1.06 | Same            |    0.16 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,592.21 ns | 32.371 ns | 37.278 ns | 2,585.44 ns | 2,550.16 ns | 2,707.3 ns |  1.00 | Baseline        |    0.02 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,475.21 ns | 76.425 ns | 88.011 ns | 2,436.47 ns | 2,379.59 ns | 2,637.6 ns |  0.96 | Same            |    0.04 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,438.40 ns | 43.482 ns | 50.074 ns | 2,444.35 ns | 2,330.27 ns | 2,527.3 ns |  1.00 | Baseline        |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,449.01 ns | 35.429 ns | 40.800 ns | 2,448.20 ns | 2,338.34 ns | 2,520.9 ns |  1.00 | Same            |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | ?      |    98.53 ns | 49.747 ns | 57.289 ns |    86.26 ns |    74.80 ns |   340.4 ns |  1.11 | Baseline        |    0.67 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | ?      |    95.01 ns | 48.560 ns | 55.922 ns |    80.60 ns |    79.98 ns |   331.4 ns |  1.07 | Same            |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | ?      |   546.14 ns | 49.634 ns | 57.159 ns |   533.12 ns |   520.12 ns |   784.7 ns |  1.01 | Baseline        |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | ?      |   551.71 ns | 52.751 ns | 60.748 ns |   537.58 ns |   528.97 ns |   807.3 ns |  1.02 | Same            |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |


| Method | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   205.72 ns | 129.897 ns | 149.589 ns |    84.26 ns |    71.82 ns |   404.32 ns |  1.78 | Baseline        |    1.85 |      - |     160 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   203.72 ns | 129.080 ns | 148.649 ns |    83.54 ns |    72.15 ns |   400.73 ns |  1.76 | Same            |    1.84 |      - |     160 B |        1.00 |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    25.58 ns |   0.439 ns |   0.505 ns |    25.63 ns |    23.68 ns |    26.00 ns |  1.00 | Baseline        |    0.03 |      - |         - |          NA |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    24.67 ns |   1.307 ns |   1.506 ns |    24.99 ns |    21.90 ns |    26.31 ns |  0.97 | Same            |    0.06 |      - |         - |          NA |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,591.60 ns |  74.221 ns |  85.473 ns | 3,559.69 ns | 3,555.19 ns | 3,919.99 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8224 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,571.79 ns |  69.881 ns |  80.475 ns | 3,551.91 ns | 3,546.31 ns | 3,911.55 ns |  0.99 | Same            |    0.03 | 0.1212 |    8224 B |        1.00 |


| Method   | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|--------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   145.80 ns | 116.856 ns | 134.571 ns |    72.70 ns |    72.08 ns |   426.39 ns |  1.59 | Baseline        |    1.70 |      - |     152 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   143.24 ns | 118.524 ns | 136.493 ns |    72.22 ns |    71.90 ns |   431.54 ns |  1.57 | Same            |    1.72 |      - |     152 B |        1.00 |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    26.41 ns |   0.836 ns |   0.963 ns |    26.88 ns |    24.34 ns |    27.34 ns |  1.00 | Baseline        |    0.05 |      - |         - |          NA |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    26.22 ns |   0.666 ns |   0.767 ns |    26.29 ns |    24.35 ns |    27.18 ns |  0.99 | Same            |    0.05 |      - |         - |          NA |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,483.97 ns |  61.051 ns |  70.306 ns | 3,466.17 ns | 3,458.38 ns | 3,780.31 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8216 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,526.84 ns |  71.010 ns |  81.775 ns | 3,504.11 ns | 3,480.61 ns | 3,840.66 ns |  1.01 | Same            |    0.03 | 0.1212 |    8216 B |        1.00 |


| Method | Job        | Toolchain                                                                          | Size | Mean     | Error   | StdDev  | Median   | Min      | Max      | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----- |---------:|--------:|--------:|---------:|---------:|---------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Array  | Job-CZKOLC | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 152.8 ns | 7.44 ns | 8.56 ns | 149.6 ns | 147.4 ns | 186.8 ns |  1.00 | Baseline        |    0.07 | 0.0606 |   4.02 KB |        1.00 |
| Array  | Job-FQHBTF | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 155.3 ns | 4.66 ns | 5.36 ns | 154.5 ns | 151.6 ns | 177.2 ns |  1.02 | Same            |    0.06 | 0.0606 |   4.02 KB |        1.00 |


| Method  | Job        | Toolchain                                                                          | input       | Mean      | Error    | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|-------- |----------- |----------------------------------------------------------------------------------- |------------ |----------:|---------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | ICollection |  41.88 ns | 9.097 ns | 10.476 ns |  37.78 ns |  36.30 ns |  80.16 ns |  1.04 | Baseline        |    0.30 | 0.0061 |     424 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | ICollection |  43.15 ns | 9.478 ns | 10.915 ns |  36.91 ns |  36.21 ns |  79.58 ns |  1.07 | Same            |    0.31 | 0.0061 |     424 B |        1.00 |
|         |            |                                                                                    |             |           |          |           |           |           |           |       |                 |         |        |           |             |
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | IEnumerable | 287.98 ns | 5.110 ns |  5.885 ns | 286.38 ns | 285.61 ns | 312.59 ns |  1.00 | Baseline        |    0.03 | 0.0061 |     456 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | IEnumerable | 289.73 ns | 4.845 ns |  5.580 ns | 287.99 ns | 287.74 ns | 313.07 ns |  1.01 | Same            |    0.03 | 0.0061 |     456 B |        1.00 |


| Method                | Job        | Toolchain                                                                          | Size | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|---------------------- |----------- |----------------------------------------------------------------------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 4    |  21.91 ns | 11.631 ns | 13.395 ns |  14.75 ns |  14.65 ns |  57.33 ns |  1.24 | Baseline        |    0.88 |      - |      64 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 4    |  22.39 ns | 11.757 ns | 13.540 ns |  15.89 ns |  15.70 ns |  60.19 ns |  1.27 | Same            |    0.90 |      - |      64 B |        1.00 |
|                       |            |                                                                                    |      |           |           |           |           |           |           |       |                 |         |        |           |             |
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 142.08 ns |  5.946 ns |  6.848 ns | 140.73 ns | 138.30 ns | 170.18 ns |  1.00 | Baseline        |    0.06 | 0.0076 |     568 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 139.35 ns |  5.774 ns |  6.650 ns | 137.68 ns | 136.98 ns | 167.37 ns |  0.98 | Same            |    0.06 | 0.0076 |     568 B |        1.00 |
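
One crude way to read "within error margins" from the tables above is to check whether the `Mean ± Error` intervals of the two toolchains overlap (BenchmarkDotNet reports Error as half of the 99.9% confidence interval). The Python sketch below does only that; it is a simplification, not the Mann-Whitney test behind the `MannWhitney(2%)` column, and the function name is made up for the example.

```python
def intervals_overlap(mean_a, error_a, mean_b, error_b):
    """True when [mean_a ± error_a] and [mean_b ± error_b] share any point."""
    low_a, high_a = mean_a - error_a, mean_a + error_a
    low_b, high_b = mean_b - error_b, mean_b + error_b
    return low_a <= high_b and low_b <= high_a


# AllocateArray, length 1000, pinned False (values in ns, from the first table).
print(intervals_overlap(129.78, 53.253, 137.49, 53.415))

# Add, 65536,65536 bits (values in ns, from the second table).
print(intervals_overlap(3591.60, 74.221, 3571.79, 69.881))
```

An overlap only says the two means cannot be told apart by this rough measure; a rank-based test such as the one in the tables is the stronger check.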

a74nh (Contributor Author) commented Mar 20, 2025

@EgorBot -linux_ampere -linux_cobalt100 -windows_cobalt100 -profiler --envvars DOTNET_GCWriteBarrier:3

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    static MyBench()
    {
        GC.Collect();
        GC.Collect();
    }

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}

a74nh (Contributor Author) commented Mar 20, 2025

The DOTNET_GCWriteBarrier=3 results are a little better than expected. In this mode we're using the old writebarrier, except that it has one or two fewer checks to do, and it's showing a gain. Oddly, Windows has a 20% gain for nonephemeral!

a74nh (Contributor Author) commented Mar 21, 2025

I noticed that the ShadowUpdate code is never called, as g_GCShadow is always 0. It is only ever set if DOTNET_HeapVerify is set.

Removing the g_GCShadow checks from the writebarrier gives:

DOTNET_GCWriteBarrier=0

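| Method          | Job        | Toolchain | Mean      | Error     | Ratio | Gen0   |
|---------------- |----------- |---------- |----------:|----------:|------:|-------:|
| WB_nonephemeral | Job-HYVKPP | HEAD      |  4.407 ns | 0.0572 ns |  1.00 |      - |
| WB_nonephemeral | Job-HJOGLM | PR        |  4.511 ns | 0.0579 ns |  1.02 |      - |
| WB_ephemeral    | Job-HYVKPP | HEAD      | 12.036 ns | 0.2587 ns |  1.00 | 0.0003 |
| WB_ephemeral    | Job-HJOGLM | PR        | 12.548 ns | 0.2553 ns |  1.04 | 0.0003 |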

DOTNET_GCWriteBarrier=3

| Method | Job | Toolchain | Mean | Error | Ratio | Gen0 |
|---|---|---|---|---|---|---|
| WB_nonephemeral | Job-BEPCOQ | HEAD | 4.421 ns | 0.1165 ns | 1.00 | - |
| WB_nonephemeral | Job-VWEJCQ | PR | 3.195 ns | 0.0033 ns | 0.72 | - |
| WB_ephemeral | Job-BEPCOQ | HEAD | 11.898 ns | 0.0826 ns | 1.00 | 0.0003 |
| WB_ephemeral | Job-VWEJCQ | PR | 11.674 ns | 0.1842 ns | 0.98 | 0.0003 |

It has removed all the slowdown added by this PR, and given additional perf when writebarrier=3.

Looking at Amd64, when g_GCShadow is set, it uses JIT_WriteBarrier_Debug in jithelpers_slow.S. Annoyingly, it's another complete copy of the writebarrier function. I'll look at doing something similar for Arm64, either by doing it the same way or by extending WriteBarrierManager to switch on shadow too, giving us 16 functions. Either way, I want to write the assembly using the macros to avoid copy/paste errors.

(Note I'll be away for 2 weeks, so will implement when I get back)

@a74nh
Contributor Author

a74nh commented Mar 21, 2025

I did some runs of Orchard CMS based on Egor's script, on Cobalt 100:

HEAD:
Requests/sec: 5171.91
Requests/sec: 5201.02
Requests/sec: 5235.64

PR:
Requests/sec: 5326.45
Requests/sec: 5309.99
Requests/sec: 5298.49

So a couple of percent better overall with the PR.
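As a quick check on "a couple of percent", the run averages can be compared directly. This is only a sketch over the six figures quoted above; three runs per build is a small sample, so the result is indicative rather than statistically rigorous.

```python
from statistics import mean

# Requests/sec figures copied from the Orchard CMS runs above.
head_runs = [5171.91, 5201.02, 5235.64]
pr_runs = [5326.45, 5309.99, 5298.49]

head_mean = mean(head_runs)
pr_mean = mean(pr_runs)
gain_pct = (pr_mean - head_mean) / head_mean * 100

print(f"HEAD mean: {head_mean:.2f} req/s")
print(f"PR mean:   {pr_mean:.2f} req/s")
print(f"Gain:      {pr_mean - head_mean:.2f} req/s ({gain_pct:.2f}%)")
```

The gain works out to roughly 2%, and the slowest PR run is still faster than the fastest HEAD run.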

I tried with the GCShadow checks removed, but the figures look identical to the PR.

@jkotas
Member

jkotas commented Mar 23, 2025

> I tried with the GCShadow checks removed, but the figures look identical to the PR.

GCShadow checks should be present in debug and checked builds of the runtime only. They should not be present in release builds of the runtime.

I assume that all perf measurements are done on a release build. Is that correct? So it makes sense that removing GCShadow checks has no impact on the results.

@a74nh
Contributor Author

a74nh commented Mar 26, 2025

> > I tried with the GCShadow checks removed, but the figures look identical to the PR.
>
> GCShadow checks should be present in debug and checked builds of the runtime only. They should not be present in release builds of the runtime.
>
> I assume that all perf measurements are done on a release build. Is that correct? So it makes sense that removing GCShadow checks has no impact on the results.

Yes, on a release build WRITE_BARRIER_CHECK shouldn't be defined. I'll double check to make sure I've been using release for the micro benchmarks.

@a74nh
Contributor Author

a74nh commented Apr 7, 2025

The Orchard CMS results were using a Release build. So the figures above, with an improvement of ~100 Requests/sec, are correct.

However, my Ephemeral tests were using a Checked build. Here are the results using a Release build. These match the EgorBot results better.

| Method | Toolchain | Mean | Error | Ratio | Gen0 |
|---|---|---|---|---|---|
| WB_nonephemeral | HEAD | 3.742 ns | 0.0046 ns | 1.00 | - |
| WB_nonephemeral | PR | 4.489 ns | 0.0077 ns | 1.20 | - |
| WB_ephemeral | HEAD | 5.314 ns | 0.1318 ns | 1.00 | 0.0004 |
| WB_ephemeral | PR | 6.176 ns | 0.0531 ns | 1.16 | 0.0003 |

With DOTNET_GCWriteBarrier=3

| Method | Toolchain | Mean | Error | Ratio | Gen0 |
|---|---|---|---|---|---|
| WB_nonephemeral | HEAD | 3.742 ns | 0.0077 ns | 1.00 | - |
| WB_nonephemeral | PR | 3.168 ns | 0.0063 ns | 0.85 | - |
| WB_ephemeral | HEAD | 5.467 ns | 0.0790 ns | 1.00 | 0.0004 |
| WB_ephemeral | PR | 5.409 ns | 0.0538 ns | 0.99 | 0.0003 |
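The Ratio column in the Release-build results can be sanity-checked by dividing each PR mean by its HEAD mean. This is an approximation: it assumes Ratio is close to a simple quotient of the reported means, which may not be exactly how BenchmarkDotNet derives it. The helper `ratios` is a hypothetical name used only here.

```python
# Mean times in ns, copied from the Release-build results above.
# Each entry is (HEAD mean, PR mean).
default_barrier = {"WB_nonephemeral": (3.742, 4.489), "WB_ephemeral": (5.314, 6.176)}
server_barrier = {"WB_nonephemeral": (3.742, 3.168), "WB_ephemeral": (5.467, 5.409)}


def ratios(results):
    """Return PR/HEAD per benchmark, rounded to two decimals like the Ratio column."""
    return {name: round(pr / head, 2) for name, (head, pr) in results.items()}


print("default barrier:       ", ratios(default_barrier))
print("DOTNET_GCWriteBarrier=3:", ratios(server_barrier))
```

The quotients agree with the reported ratios: about 20% and 16% slower with the default barrier, and about 15% faster to roughly level with DOTNET_GCWriteBarrier=3.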

@a74nh
Contributor Author

a74nh commented Apr 7, 2025

Is there any additional testing anyone wanted?

@Maoni0
Member

Maoni0 commented Apr 14, 2025

I'm back from vacation and have asked @a74nh to please edit the original description of this PR to include a summary of the perf results, so that it's easier to understand the perf behavior (instead of having to read many comments on the PR).

@a74nh
Contributor Author

a74nh commented Apr 17, 2025

Test Results

This comment will be extended as I gather results. It contains more detail on the perf results summarized in the top message. I intend to keep this comment up to date with the latest results.

All run on an 8 core Cobalt 100, Ubuntu 24.04.2

Ephemeral test (dotnet/performance)

| Method | Toolchain | Mean | Error | Ratio | Gen0 |
|---|---|---|---|---|---|
| WB_nonephemeral | HEAD | 3.742 ns | 0.0046 ns | 1.00 | - |
| WB_nonephemeral | PR | 4.489 ns | 0.0077 ns | 1.20 | - |
| WB_ephemeral | HEAD | 5.314 ns | 0.1318 ns | 1.00 | 0.0004 |
| WB_ephemeral | PR | 6.176 ns | 0.0531 ns | 1.16 | 0.0003 |

With DOTNET_GCWriteBarrier=3

| Method | Toolchain | Mean | Error | Ratio | Gen0 |
|---|---|---|---|---|---|
| WB_nonephemeral | HEAD | 3.742 ns | 0.0077 ns | 1.00 | - |
| WB_nonephemeral | PR | 3.168 ns | 0.0063 ns | 0.85 | - |
| WB_ephemeral | HEAD | 5.467 ns | 0.0790 ns | 1.00 | 0.0004 |
| WB_ephemeral | PR | 5.409 ns | 0.0538 ns | 0.99 | 0.0003 |

GCPerfSim

Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set (bit region write barriers):
AverageGen0PauseTimeDiffPercentage -21.06%
AverageGen1PauseTimeDiffPercentage -14.25%
AverageGen0Count: 2624 -> 2744
AverageGen1Count: 680 -> 673

DOTNET_GCWriteBarrier=2 (byte region write barriers):
AverageGen0PauseTimeDiffPercentage -6.7%
AverageGen1PauseTimeDiffPercentage -2.78%
AverageGen0Count: 3048 -> 3044
AverageGen1Count: 659 -> 659

DOTNET_GCWriteBarrier=3 (server write barriers):
AverageGen0PauseTimeDiffPercentage -1.37%
AverageGen1PauseTimeDiffPercentage -1.26%
AverageGen0Count: 3047 -> 3048
AverageGen1Count: 660 -> 658

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
AverageGen0PauseTimeDiffPercentage -7.24%
AverageGen1PauseTimeDiffPercentage -3.49%
AverageGen0Count: 239 -> 239
AverageGen1Count: 81 -> 81

Flags: -tc 2 -tagb 200 -tlgb 8 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set (bit region write barriers):
AverageGen0PauseTimeDiffPercentage -13.69%
AverageGen1PauseTimeDiffPercentage -5.7%
AverageGen0Count: 2957 -> 2957
AverageGen1Count: 750 -> 749

DOTNET_GCWriteBarrier=2 (byte region write barriers):
AverageGen0PauseTimeDiffPercentage -5.94%
AverageGen1PauseTimeDiffPercentage -1.19%
AverageGen0Count: 2958 -> 2959
AverageGen1Count: 749 -> 749

DOTNET_GCWriteBarrier=3 (server write barriers):
AverageGen0PauseTimeDiffPercentage +0.07%
AverageGen1PauseTimeDiffPercentage 0.00%
AverageGen0Count: 2960 -> 2957
AverageGen1Count: 748 -> 750

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
AverageGen0PauseTimeDiffPercentage -7.4%
AverageGen1PauseTimeDiffPercentage -3.04%
AverageGen0Count: 233 -> 233
AverageGen1Count: 81 -> 81

Orchard CMS benchmark

HEAD:
Requests/sec: 5171.91
Requests/sec: 5201.02
Requests/sec: 5235.64

PR:
Requests/sec: 5326.45
Requests/sec: 5309.99
Requests/sec: 5298.49

@Maoni0
Member

Maoni0 commented Apr 23, 2025

@a74nh and I have been looking at the profiles, and we need to do a new run: the earlier runs were doing mostly gen1 GCs with very few gen0 GCs, which made the comparison not meaningful. We did notice a problem with the runs @a74nh did, where the BGC pause times were much higher with the fix build, which I was going to take a look at.

@a74nh
Contributor Author

a74nh commented Apr 25, 2025

> @a74nh and I have been looking at the profiles, and we need to do a new run: the earlier runs were doing mostly gen1 GCs with very few gen0 GCs, which made the comparison not meaningful. We did notice a problem with the runs @a74nh did, where the BGC pause times were much higher with the fix build, which I was going to take a look at.

The higher pause times were due to issues in the way the results were being gathered, which has now been fixed.

New runs of GCPerfSim have been done with a meaningful number of GC collections.

Full results here: #111636 (comment)

The best result is -21.06% Gen0 pause time and -14.25% gen1 pause time.

Meanwhile, GCWriteBarrier=3 is showing no change from HEAD (as we wanted).

A reduced version is in the top comment.
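Both claims can be checked against the percentages in the full results comment. The sketch below simply tabulates the published pause-time differences; the dictionary layout and the short configuration labels are my own, chosen for illustration.

```python
# Pause-time differences (percent, PR vs HEAD) copied from the results comment.
# Keys: (value of -tlgb, configuration); values: (gen0 diff, gen1 diff).
PAUSE_DIFF = {
    (2, "default (bit region)"): (-21.06, -14.25),
    (2, "GCWriteBarrier=2"): (-6.7, -2.78),
    (2, "GCWriteBarrier=3"): (-1.37, -1.26),
    (2, "gcServer=1"): (-7.24, -3.49),
    (8, "default (bit region)"): (-13.69, -5.7),
    (8, "GCWriteBarrier=2"): (-5.94, -1.19),
    (8, "GCWriteBarrier=3"): (0.07, 0.00),
    (8, "gcServer=1"): (-7.4, -3.04),
}

# Largest gen0 pause reduction across all runs.
best_key = min(PAUSE_DIFF, key=lambda k: PAUSE_DIFF[k][0])

# Largest deviation from HEAD, in either direction, with GCWriteBarrier=3.
worst_wb3 = max(
    abs(diff)
    for (_, config), diffs in PAUSE_DIFF.items()
    if config == "GCWriteBarrier=3"
    for diff in diffs
)

print("best:", best_key, PAUSE_DIFF[best_key])
print("max |diff| with GCWriteBarrier=3:", worst_wb3)
```

The largest reduction comes from the default bit region barrier with `-tlgb 2`, while GCWriteBarrier=3 stays within about 1.4% of HEAD in every run.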

Member

@Maoni0 Maoni0 left a comment

thanks so much, @a74nh, for your contribution and being patient with the perf data collection, discussion and meetings at odd hours :) this work is greatly appreciated!

@Maoni0 Maoni0 enabled auto-merge (squash) May 16, 2025 23:20
Change-Id: Ia4f89dce9cb5aeedeeac16e54b7e35e9f255f68b
@Maoni0 Maoni0 merged commit e2ad5fc into dotnet:main May 17, 2025
96 checks passed
@a74nh a74nh deleted the precisewritebarriers_github branch May 17, 2025 08:46
@github-actions github-actions bot locked and limited conversation to collaborators Jun 17, 2025
Labels: arch-arm64, area-VM-coreclr, community-contribution