Skip to content

Using an argument as a Vector variable causes data to be copied back and forth #115075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
tfenise opened this issue Apr 26, 2025 · 4 comments
Open
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@tfenise
Copy link
Contributor

tfenise commented Apr 26, 2025

Description

Consider

static Vector256<float> M(Vector256<float> x, Vector256<float> y)
{
    x *= y;
    x *= y;
    x *= y;
    x *= y;
    return x;
}

which generates

; Assembly listing for method ConsoleApp1.Program:M(System.Runtime.Intrinsics.Vector256`1[float],System.Runtime.Intrinsics.Vector256`1[float]):System.Runtime.Intrinsics.Vector256`1[float] (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0x003F
       vzeroupper
       ret

; Total bytes of code 67

Note those unnecessary

vmovups  ymmword ptr [rdx], ymm0
vmovups  ymm0, ymmword ptr [rdx]

No problem with local variables.

static Vector256<float> M(Vector256<float> x, Vector256<float> y)
{
    var l_x = x;
    l_x *= y;
    l_x *= y;
    l_x *= y;
    l_x *= y;
    return l_x;
}
; Assembly listing for method ConsoleApp1.Program:M(System.Runtime.Intrinsics.Vector256`1[float],System.Runtime.Intrinsics.Vector256`1[float]):System.Runtime.Intrinsics.Vector256`1[float] (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [r8]
       vmulps   ymm1, ymm0, ymmword ptr [rdx]
       vmulps   ymm1, ymm1, ymm0
       vmulps   ymm1, ymm1, ymm0
       vmulps   ymm0, ymm1, ymm0
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0x001C
       vzeroupper
       ret

; Total bytes of code 32

Configuration

.NET 9, also 10.0.100-preview.5.25224.4
Windows 11, x64
Cannot reproduce on https://godbolt.org/, possibly because they are using *nix.

Regression?

Not known

@tfenise tfenise added the tenet-performance Performance related issue label Apr 26, 2025
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 26, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 26, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jnyrup
Copy link
Contributor

jnyrup commented Apr 26, 2025

Possibly a duplicate of #78883

@tfenise
Copy link
Contributor Author

tfenise commented Apr 26, 2025

Possibly a duplicate of #78883

Makes sense.

Although I think aliasing issues notwithstanding, in

vmovups  ymmword ptr [rdx], ymm0
vmovups  ymm0, ymmword ptr [rdx]

Clearly the second vmovups is unnecessary.

Take another example

public class TestClass
{
    int z;

    void M1(int y)
    {
        z *= y;
        z *= y;
        z *= y;
        z *= y;
    }

    void M2(int y)
    {
        ref int x = ref z;

        x *= y;
        x *= y;
        x *= y;
        x *= y;
    }
}

https://godbolt.org/z/xYfsds798

TestClass:M1(int):this (FullOpts):
; Emitting BLENDED_CODE for X64 with AVX - Unix
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  7,  7   )     ref  ->  rdi         this class-hnd single-def <TestClass>
;  V01 arg1         [V01,T01] (  6,  6   )     int  ->  rsi         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;  V03 cse0         [V03,T02] (  3,  3   )     int  ->  rax         "CSE #01: aggressive"
;  V04 cse1         [V04,T03] (  3,  3   )     int  ->  rax         "CSE #02: aggressive"
;  V05 cse2         [V05,T04] (  3,  3   )     int  ->  rax         "CSE #03: aggressive"
;
; Lcl frame size = 0
						;; size=0 bbWeight=1 PerfScore 0.00
       mov      eax, esi
       imul     eax, dword ptr [rdi+0x08]
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
						;; size=27 bbWeight=1 PerfScore 15.25
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 28, prolog size 0, PerfScore 16.25, instruction count 10, allocated bytes for code 28 (MethodHash=0b0a3fd5) for method TestClass:M1(int):this (FullOpts)
; ============================================================

TestClass:M2(int):this (FullOpts):
; Emitting BLENDED_CODE for X64 with AVX - Unix
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T02] (  3,  3   )     ref  ->  rdi         this class-hnd single-def <TestClass>
;  V01 arg1         [V01,T01] (  6,  6   )     int  ->  rsi         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;  V03 tmp1         [V03,T00] (  9, 18   )   byref  ->  rdi         single-def "dup spill"
;
; Lcl frame size = 0
						;; size=0 bbWeight=1 PerfScore 0.00
       add      rdi, 8
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       imul     esi, dword ptr [rdi]
       mov      dword ptr [rdi], esi
						;; size=30 bbWeight=1 PerfScore 25.00
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 31, prolog size 0, PerfScore 26.00, instruction count 13, allocated bytes for code 31 (MethodHash=8d69b9d6) for method TestClass:M2(int):this (FullOpts)
; ============================================================

M2 could be on par with M1 and avoid loading dword ptr [rdi] again and again.

@tannergooding
Copy link
Member

Clearly the second vmovups is unnecessary.

This depends on the hardware and several other considerations.

It isn't strictly a no-op as it zeros any "upper bits", such as on AVX-512 capable hardware. Likewise a load from memory requires presuming single-threaded state and no mutations may occur.

The .NET memory model does allow some presumptions here, but it requires tracking of the relevant data, knowledge of whether it could alias or not, etc.

For a real app calling M, it is highly likely that M is inlined and so the byref and any copy are completely removed giving the "optimal" codegen expected. It generally isn't sufficient to just look at or measure an isolated method like this.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label May 29, 2025
@JulieLeeMSFT JulieLeeMSFT added this to the Future milestone May 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants