Using an argument as a Vector variable causes data to be copied back and forth #115075

tfenise · 2025-04-26T11:11:12Z

Description

Consider

static Vector256<float> M(Vector256<float> x, Vector256<float> y)
{
    x *= y;
    x *= y;
    x *= y;
    x *= y;
    return x;
}

which generates

; Assembly listing for method ConsoleApp1.Program:M(System.Runtime.Intrinsics.Vector256`1[float],System.Runtime.Intrinsics.Vector256`1[float]):System.Runtime.Intrinsics.Vector256`1[float] (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmulps   ymm0, ymm0, ymmword ptr [r8]
       vmovups  ymmword ptr [rdx], ymm0
       vmovups  ymm0, ymmword ptr [rdx]
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0x003F
       vzeroupper
       ret

; Total bytes of code 67

Note those unnecessary

vmovups  ymmword ptr [rdx], ymm0
vmovups  ymm0, ymmword ptr [rdx]

No problem with local variables.

static Vector256<float> M(Vector256<float> x, Vector256<float> y)
{
    var l_x = x;
    l_x *= y;
    l_x *= y;
    l_x *= y;
    l_x *= y;
    return l_x;
}

; Assembly listing for method ConsoleApp1.Program:M(System.Runtime.Intrinsics.Vector256`1[float],System.Runtime.Intrinsics.Vector256`1[float]):System.Runtime.Intrinsics.Vector256`1[float] (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       vmovups  ymm0, ymmword ptr [r8]
       vmulps   ymm1, ymm0, ymmword ptr [rdx]
       vmulps   ymm1, ymm1, ymm0
       vmulps   ymm1, ymm1, ymm0
       vmulps   ymm0, ymm1, ymm0
       vmovups  ymmword ptr [rcx], ymm0
       mov      rax, rcx

G_M000_IG03:                ;; offset=0x001C
       vzeroupper
       ret

; Total bytes of code 32

Configuration

.NET 9, also 10.0.100-preview.5.25224.4
Windows 11, x64
Cannot reproduce on https://godbolt.org/, possibly because they are using *nix.

Regression?

Not known

dotnet-policy-service · 2025-04-26T11:11:46Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

jnyrup · 2025-04-26T15:18:59Z

Possibly a duplicate of #78883

tfenise · 2025-04-26T15:59:05Z

Possibly a duplicate of #78883

Makes sense.

Although I think aliasing issues notwithstanding, in

vmovups  ymmword ptr [rdx], ymm0
vmovups  ymm0, ymmword ptr [rdx]

Clearly the second vmovups is unnecessary.

Take another example

public class TestClass
{
    int z;

    void M1(int y)
    {
        z *= y;
        z *= y;
        z *= y;
        z *= y;
    }

    void M2(int y)
    {
        ref int x = ref z;

        x *= y;
        x *= y;
        x *= y;
        x *= y;
    }
}

https://godbolt.org/z/xYfsds798

TestClass:M1(int):this (FullOpts):
; Emitting BLENDED_CODE for X64 with AVX - Unix
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] (  7,  7   )     ref  ->  rdi         this class-hnd single-def <TestClass>
;  V01 arg1         [V01,T01] (  6,  6   )     int  ->  rsi         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;  V03 cse0         [V03,T02] (  3,  3   )     int  ->  rax         "CSE #01: aggressive"
;  V04 cse1         [V04,T03] (  3,  3   )     int  ->  rax         "CSE #02: aggressive"
;  V05 cse2         [V05,T04] (  3,  3   )     int  ->  rax         "CSE #03: aggressive"
;
; Lcl frame size = 0
						;; size=0 bbWeight=1 PerfScore 0.00
       mov      eax, esi
       imul     eax, dword ptr [rdi+0x08]
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
       imul     eax, esi
       mov      dword ptr [rdi+0x08], eax
						;; size=27 bbWeight=1 PerfScore 15.25
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 28, prolog size 0, PerfScore 16.25, instruction count 10, allocated bytes for code 28 (MethodHash=0b0a3fd5) for method TestClass:M1(int):this (FullOpts)
; ============================================================

TestClass:M2(int):this (FullOpts):
; Emitting BLENDED_CODE for X64 with AVX - Unix
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 this         [V00,T02] (  3,  3   )     ref  ->  rdi         this class-hnd single-def <TestClass>
;  V01 arg1         [V01,T01] (  6,  6   )     int  ->  rsi         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;  V03 tmp1         [V03,T00] (  9, 18   )   byref  ->  rdi         single-def "dup spill"
;
; Lcl frame size = 0
						;; size=0 bbWeight=1 PerfScore 0.00
       add      rdi, 8
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       mov      eax, esi
       imul     eax, dword ptr [rdi]
       mov      dword ptr [rdi], eax
       imul     esi, dword ptr [rdi]
       mov      dword ptr [rdi], esi
						;; size=30 bbWeight=1 PerfScore 25.00
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 31, prolog size 0, PerfScore 26.00, instruction count 13, allocated bytes for code 31 (MethodHash=8d69b9d6) for method TestClass:M2(int):this (FullOpts)
; ============================================================

M2 could be on par with M1 and avoid loading dword ptr [rdi] again and again.

tannergooding · 2025-04-28T18:02:08Z

Clearly the second vmovups is unnecessary.

This depends on the hardware and several other considerations.

It isn't strictly a no-op as it zeros any "upper bits", such as on AVX-512 capable hardware. Likewise a load from memory requires presuming single-threaded state and no mutations may occur.

The .NET memory model does allow some presumptions here, but it requires tracking of the relevant data, knowledge of whether it could alias or not, etc.

For a real app calling M, it is highly likely that M is inlined and so the byref and any copy are completely removed giving the "optimal" codegen expected. It generally isn't sufficient to just look at or measure an isolated method like this.

tfenise added the tenet-performance label Apr 26, 2025

ghost added the area-CodeGen-coreclr label Apr 26, 2025

dotnet-policy-service bot added the untriaged label Apr 26, 2025

JulieLeeMSFT removed the untriaged label May 29, 2025

JulieLeeMSFT added this to the Future milestone May 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using an argument as a Vector variable causes data to be copied back and forth #115075

Using an argument as a Vector variable causes data to be copied back and forth #115075

tfenise commented Apr 26, 2025

dotnet-policy-service bot commented Apr 26, 2025

Uh oh!

jnyrup commented Apr 26, 2025

Uh oh!

tfenise commented Apr 26, 2025

Uh oh!

tannergooding commented Apr 28, 2025

Uh oh!

Using an argument as a Vector variable causes data to be copied back and forth #115075

Using an argument as a Vector variable causes data to be copied back and forth #115075

Comments

tfenise commented Apr 26, 2025

Description

Configuration

Regression?

dotnet-policy-service bot commented Apr 26, 2025

Uh oh!

jnyrup commented Apr 26, 2025

Uh oh!

tfenise commented Apr 26, 2025

Uh oh!

tannergooding commented Apr 28, 2025

Uh oh!