[Perf] Linux/x64: 3 Regressions on 11/2/2023 12:57:39 AM #94475

Closed
performanceautofiler bot opened this issue Nov 7, 2023 · 11 comments
Labels: arch-x64, area-CodeGen-coreclr, os-linux, PGO, Priority:2, runtime-coreclr, tenet-performance, tenet-performance-benchmarks

performanceautofiler bot commented Nov 7, 2023

Run Information

| Name | Value |
|---|---|
| Architecture | x64 |
| OS | ubuntu 22.04 |
| Queue | TigerUbuntu |
| Baseline | e4fbdb907bb187d7b5ba0668a84347c1058e3219 |
| Compare | 34bf55cd6448b3a19288623dca087151efe00367 |
| Diff | Diff |
| Configs | CompilationMode:tiered, RunKind:micro |

Regressions in LinqBenchmarks

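| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
|---|---|---|---|---|---|---|---|---|
| Order00LinqQueryX | 60.98 ms | 70.11 ms | 1.15 | 0.09 | True | | | |
| Order00LinqMethodX | 58.93 ms | 69.51 ms | 1.18 | 0.07 | True | | | |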

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'LinqBenchmarks*'

Payloads

Baseline
Compare

LinqBenchmarks.Order00LinqQueryX

ETL Files

Histogram

JIT Disasms

LinqBenchmarks.Order00LinqMethodX

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository


Run Information

| Name | Value |
|---|---|
| Architecture | x64 |
| OS | ubuntu 22.04 |
| Queue | TigerUbuntu |
| Baseline | e4fbdb907bb187d7b5ba0668a84347c1058e3219 |
| Compare | 34bf55cd6448b3a19288623dca087151efe00367 |
| Diff | Diff |
| Configs | CompilationMode:tiered, RunKind:micro |

Regressions in System.Linq.Tests.Perf_Enumerable

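| Benchmark | Baseline | Test | Test/Base | Test Quality | Edge Detector | Baseline IR | Compare IR | IR Ratio |
|---|---|---|---|---|---|---|---|---|
| SingleWithPredicate_LastElementMatches(input: List) | 216.78 ns | 259.35 ns | 1.20 | 0.11 | True | | | |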

Test Report

Repro

General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net8.0 --filter 'System.Linq.Tests.Perf_Enumerable*'

Payloads

Baseline
Compare

System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_LastElementMatches(input: List)

ETL Files

Histogram

JIT Disasms

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

performanceautofiler bot added the arch-x64, os-linux, PGO, runtime-coreclr, and untriaged labels on Nov 7, 2023
cincuranet removed the untriaged label on Nov 7, 2023
cincuranet transferred this issue from dotnet/perf-autofiling-issues on Nov 7, 2023
ghost added the needs-area-label and untriaged labels on Nov 7, 2023
cincuranet added the tenet-performance and tenet-performance-benchmarks labels and removed the untriaged and needs-area-label labels on Nov 7, 2023
@cincuranet
Contributor

Likely caused by #94247.

jeffschwMSFT added the area-CodeGen-coreclr label on Nov 8, 2023
ghost commented Nov 8, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: performanceautofiler[bot]
Assignees: -
Labels:

os-linux, tenet-performance, tenet-performance-benchmarks, arch-x64, area-CodeGen-coreclr, runtime-coreclr, PGO

Milestone: -

JulieLeeMSFT added this to the 9.0.0 milestone on Dec 8, 2023
AndyAyersMS added the Priority:2 label on May 8, 2024
@AndyAyersMS
Member

Collated Reports

| Notes | Recent Score | Orig Score | x64 ubuntu / x64 Windows / x64 amd (recent, orig) | Benchmark |
|---|---|---|---|---|
| | 1.18 | 1.16 | 1.18, 1.16 | System.Collections.IterateForEach(Int32).ConcurrentDictionary(Size: 512) |
| | 1.17 | 1.19 | 1.17, 1.19 | System.Collections.CtorFromCollection(Int32).SortedDictionary(Size: 512) |
| | 1.17 | 1.18 | 1.17, 1.18 | LinqBenchmarks.Order00LinqMethodX |
| | 1.14 | 1.15 | 1.14, 1.15 | LinqBenchmarks.Order00LinqQueryX |
| | 1.11 | 1.11 | 1.09, 1.12; 1.12, 1.11 | System.Collections.CtorFromCollection(Int32).ConcurrentDictionary(Size: 512) |
| | 1.06 | 1.11 | 1.06, 1.11 | System.Text.RegularExpressions.Tests.Perf_Regex_Common.MatchesWord(Options: Compiled) |
| | 1.02 | 1.07 | 1.02, 1.07 | System.Collections.CtorFromCollection(Int32).ImmutableArray(Size: 512) |
| | 0.18 | 1.20 | 0.18, 1.20 | System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_LastElementMatches(input: List) |
| | 0.16 | 0.91 | 0.16, 0.91 | System.Linq.Tests.Perf_Enumerable.SingleWithPredicate_FirstElementMatches(input: List) |

@AndyAyersMS
Member

LinqBenchmarks.Order00LinqQueryX

The regression has persisted, mostly on Linux.

image

@AndyAyersMS
Member

Can repro locally. Tried diffing vs latest main but things have changed quite a bit, so will look at the diffs wrt the narrower range above.

@AndyAyersMS
Member

Morphing in RPO causes a struct local to be marked as exposed before we see a block copy, so we change our copy expansion strategy from field-by-field to block.

base codegen

IN00c8: 0003C8 mov      edi, dword ptr [rbp-0x88]
IN00c9: 0003CE mov      dword ptr [rbp-0xA8], edi
IN00ca: 0003D4 mov      edi, dword ptr [rbp-0x84]
IN00cb: 0003DA mov      dword ptr [rbp-0xA4], edi
IN00cc: 0003E0 mov      rdi, qword ptr [rbp-0x80]
IN00cd: 0003E4 mov      qword ptr [rbp-0xA0], rdi
IN00ce: 0003EB mov      edi, dword ptr [rbp-0x98]
IN00cf: 0003F1 mov      dword ptr [rbp-0xB8], edi
IN00d0: 0003F7 mov      edi, dword ptr [rbp-0x94]
IN00d1: 0003FD mov      dword ptr [rbp-0xB4], edi
IN00d2: 000403 mov      rdi, qword ptr [rbp-0x90]
IN00d3: 00040A mov      qword ptr [rbp-0xB0], rdi
IN00d4: 000411 mov      esi, dword ptr [rbp-0xB4]
                            ; gcrRegs -[rsi]
IN00d5: 000417 or       rdi, rsi
IN00d6: 00041A jne      G_M4544_IG44

diff codegen

IN00c8: 0003C8 vmovups  xmm0, xmmword ptr [rbp-0x88]
IN00c9: 0003D0 vmovups  xmmword ptr [rbp-0xA8], xmm0
IN00ca: 0003D8 vmovups  xmm0, xmmword ptr [rbp-0x98]
IN00cb: 0003E0 vmovups  xmmword ptr [rbp-0xB8], xmm0
IN00cc: 0003E8 mov      rdi, qword ptr [rbp-0xB0]
IN00cd: 0003EF mov      esi, dword ptr [rbp-0xB4]
                            ; gcrRegs -[rsi]
IN00ce: 0003F5 or       rdi, rsi
IN00cf: 0003F8 jne      G_M4544_IG44
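
The two listings above perform the same copies with different access widths. As a rough illustration outside the JIT, the sketch below models a stack frame as a byte buffer and copies one 16-byte struct both ways. The 4/4/8-byte field layout is read off the base listing's `dword`/`dword`/`qword` moves; the function names and the buffer model are assumptions made for this example, not JIT code.

```python
import struct

# Assumed layout of one 16-byte struct slot: two 4-byte fields, then one
# 8-byte field (offsets 0, 4, 8), little-endian as on x64.
FIELDS = (("<I", 0), ("<I", 4), ("<Q", 8))
STRUCT_SIZE = 16


def copy_field_by_field(frame: bytearray, src: int, dst: int) -> None:
    """Copy each field with its own load/store pair (shape of the base codegen)."""
    for fmt, field_offset in FIELDS:
        (value,) = struct.unpack_from(fmt, frame, src + field_offset)
        struct.pack_into(fmt, frame, dst + field_offset, value)


def copy_block(frame: bytearray, src: int, dst: int) -> None:
    """Copy all 16 bytes in one access (shape of the diff codegen)."""
    frame[dst:dst + STRUCT_SIZE] = frame[src:src + STRUCT_SIZE]


if __name__ == "__main__":
    frame_a = bytearray(64)
    struct.pack_into("<IIQ", frame_a, 0, 1, 2, 3)
    frame_b = bytearray(frame_a)

    copy_field_by_field(frame_a, 0, 32)
    copy_block(frame_b, 0, 32)
    print("same bytes after both strategies:", frame_a == frame_b)
```

Both strategies leave identical bytes in the destination, which is why the difference shows up only in timing: what changes is the width of each memory access, not the result.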

Looks like the latter incurs a store-forwarding stall (likely on the esi load). This is on Linux under WSL2, so I can't easily attribute samples to offsets. Will see if I can run this natively somewhere.

@jakobbotsch this seems similar to the ldp issue, wonder if we should try something similar here. But late mitigation would be awkward (recommendation is to do a wider aligned load, and extract the part needed).
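
The recommendation mentioned in the parenthesis (one wider load, then extract the needed part) can be illustrated with plain integer arithmetic. This is only a sketch under assumptions: a little-endian 16-byte slot with the 4/4/8-byte layout suggested by the listings, and hypothetical helper names `wide_load` and `extract_field` that do not correspond to any JIT API.

```python
import struct


def wide_load(frame: bytes, offset: int, width: int = 16) -> int:
    """Model one wide load: read `width` bytes as a single little-endian integer."""
    return int.from_bytes(frame[offset:offset + width], "little")


def extract_field(wide: int, field_offset: int, field_size: int) -> int:
    """Extract a narrower field from an already-loaded wide value via shift and mask."""
    mask = (1 << (8 * field_size)) - 1
    return (wide >> (8 * field_offset)) & mask


if __name__ == "__main__":
    # One 16-byte slot: 4-byte field, 4-byte field, 8-byte field.
    slot = struct.pack("<IIQ", 0x11111111, 0x22222222, 0x3333333333333333)
    wide = wide_load(slot, 0)
    # The diff codegen reads the 4-byte field at slot offset 4 and the
    # 8-byte field at slot offset 8 with separate narrow loads.
    print(hex(extract_field(wide, 4, 4)))
    print(hex(extract_field(wide, 8, 8)))
```

In machine code the extraction would be register shuffles or shifts after a single load, which is the part described above as awkward to introduce late in the JIT.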

@jakobbotsch
Member

Similar case: #96524 (comment)

If we handle one it would be nice to see if we can handle both cases.
The stall here is likely on one of the xmm loads, I'd guess. It would probably be simple enough to just add a similar kind of heuristic in block morphing when determining if we should use the block copy for DNER locals. We would need to build loops, I think, but that shouldn't be too expensive.

@AndyAyersMS
Member

Yeah, it could be the stall is earlier -- we have narrow stores and then a wide load:

IN00c0: 00039B mov      dword ptr [V103 rbp-0x98], ebx
IN00c1: 0003A1 mov      ebx, dword ptr [rdi+0x04]
IN00c2: 0003A4 mov      dword ptr [V104 rbp-0x94], ebx
IN00c3: 0003AA mov      rdi, qword ptr [rdi+0x08]
IN00c4: 0003AE mov      qword ptr [V105 rbp-0x90], rdi
IN00c5: 0003B5 mov      rdi, 0x7FB304C17A58      ; System.Collections.Generic.GenericComparer`1[System.Decimal]
IN00c6: 0003BF cmp      qword ptr [rsi], rdi
IN00c7: 0003C2 jne      G_M4544_IG83
IN00c8: 0003C8 vmovups  xmm0, xmmword ptr [V71 rbp-0x88]
IN00c9: 0003D0 vmovups  xmmword ptr [V74 rbp-0xA8], xmm0
IN00ca: 0003D8 vmovups  xmm0, xmmword ptr [V72 rbp-0x98]
IN00cb: 0003E0 vmovups  xmmword ptr [V75 rbp-0xB8], xmm0
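
To make the narrow-stores-then-wide-load concern concrete, here is a deliberately simplified model. It is not a description of any specific CPU: it assumes a load can be forwarded only when the youngest overlapping in-flight store covers all of the load's bytes, and that it stalls otherwise. Real forwarding rules are more detailed and vary by microarchitecture. The offsets are the rbp-relative ones from the listings; `Access` and `classify_load` are names invented for this sketch.

```python
from typing import NamedTuple, Sequence


class Access(NamedTuple):
    offset: int  # start address relative to rbp (negative for stack locals)
    size: int    # access width in bytes


def classify_load(stores: Sequence[Access], load: Access) -> str:
    """Classify `load` as 'forward', 'stall', or 'no-overlap' under the simplified model.

    `stores` is ordered oldest to youngest; the youngest overlapping store decides.
    """
    load_end = load.offset + load.size
    for store in reversed(stores):
        store_end = store.offset + store.size
        overlaps = store.offset < load_end and load.offset < store_end
        if not overlaps:
            continue
        covers = store.offset <= load.offset and load_end <= store_end
        return "forward" if covers else "stall"
    return "no-overlap"


# Narrow stores to V103/V104/V105 (IN00c0 to IN00c4).
NARROW_STORES = [Access(-0x98, 4), Access(-0x94, 4), Access(-0x90, 8)]

if __name__ == "__main__":
    # Diff codegen: one 16-byte load spanning all three narrow stores.
    print("xmm load at rbp-0x98:", classify_load(NARROW_STORES, Access(-0x98, 16)))
    # Base codegen: loads that match each store's width.
    for load in NARROW_STORES:
        print(f"{load.size}-byte load at rbp{load.offset:+#x}:",
              classify_load(NARROW_STORES, load))
    # Diff codegen tail: narrow loads from a slot just written by a 16-byte store.
    wide_store = [Access(-0xB8, 16)]
    print("rdi load at rbp-0xB0:", classify_load(wide_store, Access(-0xB0, 8)))
    print("esi load at rbp-0xB4:", classify_load(wide_store, Access(-0xB4, 4)))
```

Under this model the 16-byte load at rbp-0x98 is the access that cannot be forwarded, while the later rdi and esi loads are fully covered by the preceding 16-byte store. That is consistent with the suggestion that the stall is on an xmm load rather than on the esi load, though only native profiling could confirm it.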

@AndyAyersMS
Member

Fixing this is probably too ambitious for .NET 9 at this point, so moving to 10.0.

AndyAyersMS changed the milestone from 9.0.0 to 10.0.0 on Aug 13, 2024
@AndyAyersMS
Member

The Linq benchmarks have since improved dramatically (via 34545d7), so it's not clear whether they are still suffering from this.

Other benchmarks also seem to have improved, so will close.

github-actions bot locked and limited conversation to collaborators on May 29, 2025