Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running FLOPS in Windows #1

Open
Dr-Noob opened this issue Jun 27, 2020 · 10 comments
Open

Running FLOPS in Windows #1

Dr-Noob opened this issue Jun 27, 2020 · 10 comments
Labels
enhancement New feature or request

Comments

@Dr-Noob
Copy link
Owner

Dr-Noob commented Jun 27, 2020

I've tried running FLOPS in Windows:

First, one have to change some int and long to stdint's type (int32_t and int64_t). After that, I tried running it and the performance was horrible. Then, I figured out, looking at assembly, that the loop was compiled poorly. I noticed I was using a 32 bit compiler (mingw which is 32 bits). Lastly, I tried compiling it with a 64 bit compiler (I used mingw-w64), using the Windows build (MingW-W64-builds) and the Arch Linux one (mingw-w64-gcc-bin). Both gave me the same result: segmentation fault. I found somewhere that running it with 32 bits but having a segfault with 64 bits could be caused to some issues with the stack, which should be solved using -fno-stack-protector. This does not solve the segfault. I love you, Windows ❤️

@Dr-Noob Dr-Noob added the enhancement New feature or request label Sep 1, 2020
@Xemorph
Copy link

Xemorph commented Aug 27, 2021

I'm starting to be labeled a stalker, but back to the topic.
The program peakperf under Windows (compiled with MinGW64) runs terribly (see screenshot), the results shown are definitely not correct … I'll try to rebuild the program for MSVC & see if I can achieve something.

I've listed all the information below, if you have any questions just let me know :)

peakperf: AMD Ryzen 9 (Zen 2) 3900X @ 4.3GHz (Validation: see CPU-Z screenshot)

> peakperf.exe -b zen2 -w 0

------------------------------------------------------
    peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
        CPU: AMD Ryzen 9 3900X 12-Core Processor
  Microarch: Zen 2
  Benchmark: Zen 2 (AVX2)
 Iterations: 1.00e+09
      GFLOP: 0.30
    Threads: 24

   N┬║  Time(s)  GFLOP/s
    1  2.35462     0.13
    2  2.35085     0.13
    3  2.34874     0.13
    4  2.35693     0.13
    5  2.35561     0.13
    6  2.34889     0.13
    7  2.34961     0.13
    8  2.34183     0.13
    9  2.34504     0.13
   10  2.34760     0.13
------------------------------------------------------
 Average performance:      0.13 +- 0.00 GFLOP/s
------------------------------------------------------

CPU-Z: CPU Validation
image

GCC: GCC Version (MinGW64)
image

@Xemorph
Copy link

Xemorph commented Aug 28, 2021

Update:

I have first fixed all problems under MSVC and the result is just sobering ... The performance under Windows is catastrophic & I just don't know why.

I could now reach these values (Compiled with MSVC X64):

> MinSizeRel\peakperf.exe -w 0

------------------------------------------------------
    peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
        CPU: AMD Ryzen 9 3900X 12-Core Processor
  Microarch: Zen 2
  Benchmark: Zen 2 (AVX2)
 Iterations: 1.00e+09
      GFLOP: 3840.00
    Threads: 24

   N┬║  Time(s)  GFLOP/s
    1  3.34606  1147.62
    2  3.29593  1165.07
    3  3.29363  1165.89
    4  3.31118  1159.71
    5  3.29446  1165.59
    6  3.29978  1163.71
    7  3.28723  1168.16
    8  3.29356  1165.91
    9  3.30641  1161.38
   10  3.29375  1165.84
------------------------------------------------------
 Average performance:      1162.86 +- 5.59 GFLOP/s
------------------------------------------------------

@Dr-Noob , do you have any suggestions on what else I could test?

@Dr-Noob
Copy link
Owner Author

Dr-Noob commented Aug 28, 2021

In the case of peakperf, it is not as easy as the others programs (cpufetch, gpufetch). peakperf is extremely delicate; if the compiler does not generate the exact needed assembly instructions in the correct order, you will not reach the peak performance (you will not be even close to it, as you have experienced).

Honestly, I think that making peakperf work on Windows is quite hard (I don't know if it is even possible), so this is more of a thing that I should do on my own. That said, the thing I would do in your case is to check the assembly. The generated assembly will surely be wrong (you can compare it against the assembly generated in Linux), and you would need to make the compiler to generate the correct one. As I said, this is not trivial at all...

@Xemorph
Copy link

Xemorph commented Aug 30, 2021

Under Windows it is really not easy. MinGW only produces segfaults, I also found something about this on StackOverflow (but I don't know if this is still true, anyway I link it).

Windows GCC is buggy with 32-byte stack alignment for spilling __m256, so it's not in general safe to use GCC with -march= with anything that includes AVX.

MSVC produces only miserable assembly code and I can't optimize it further, so I'll have another look at clang (was also recommended on StackOverflow, maybe I can achieve something with it).

@Xemorph
Copy link

Xemorph commented Sep 1, 2021

Update

After intensive research, I was at least able to achieve better performance with Clang under Windows. But it doesn't come close to the performance under Linux ... The assembly of the compute functions looks good so far (have added compute_zen2 here respectively, since I own a Ryzen 9 3900X).

Now I have to do a profiling and look where there are still performance hotspots to eliminate them.

Result of the microbenchmark with the CPU AMD Ryzen 9 3900X under Windows 10:

C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc>peakperf.exe -w 0

------------------------------------------------------
    peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
        CPU: AMD Ryzen 9 3900X 12-Core Processor
  Microarch: Zen 2
  Benchmark: Zen 2 (AVX2)
 Iterations: 1.00e+09
      GFLOP: 3840.00
    Threads: 24

   N┬║  Time(s)  GFLOP/s
    1  2.34313  1638.84
    2  2.34213  1639.54
    3  2.33812  1642.34
    4  2.36314  1624.95
    5  2.34313  1638.83
    6  2.33912  1641.64
    7  2.34713  1636.04
    8  2.35013  1633.95
    9  2.34813  1635.34
   10  2.34413  1638.14
------------------------------------------------------
 Average performance:      1636.95 +- 4.72 GFLOP/s
------------------------------------------------------


C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc>

The assembler code from the file zen2.cpp:

	.text
	.def	 @feat.00;
	.scl	3;
	.type	0;
	.endef
	.globl	@feat.00
.set @feat.00, 0
	.intel_syntax noprefix
	.file	"zen2.cpp"
	.def	 "?compute_zen2@@YAXPEAT__m256@@T1@H@Z";
	.scl	2;
	.type	32;
	.endef
	.section	.text,"xr",one_only,"?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
	.globl	"?compute_zen2@@YAXPEAT__m256@@T1@H@Z" # -- Begin function ?compute_zen2@@YAXPEAT__m256@@T1@H@Z
	.p2align	4, 0x90
"?compute_zen2@@YAXPEAT__m256@@T1@H@Z": # @"?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
.seh_proc "?compute_zen2@@YAXPEAT__m256@@T1@H@Z"
# %bb.0:
	sub	rsp, 184
	movsxd	rax, esi
	lea	rdi, [rip + farr_zen2]
	.seh_stackalloc 184
	.seh_endprologue
	mov	edx, 1000000000
	lea	rsi, [rax + 4*rax]
	shl	rsi, 7
	vmovaps	ymm11, ymmword ptr [rsi + rdi + 160]
	vmovaps	ymm2, ymmword ptr [rsi + rdi + 32]
	vmovaps	ymm3, ymmword ptr [rsi + rdi + 96]
	vmovaps	ymm1, ymmword ptr [rsi + rdi]
	vmovaps	ymm10, ymmword ptr [rsi + rdi + 128]
	vmovaps	ymm9, ymmword ptr [rsi + rdi + 192]
	vmovaps	ymm8, ymmword ptr [rsi + rdi + 256]
	vmovaps	ymm7, ymmword ptr [rsi + rdi + 320]
	vmovaps	ymm6, ymmword ptr [rsi + rdi + 384]
	vmovaps	ymm5, ymmword ptr [rsi + rdi + 448]
	vmovaps	ymm4, ymmword ptr [rsi + rdi + 512]
	vmovaps	ymm12, ymmword ptr [rsi + rdi + 416]
	vmovaps	ymm13, ymmword ptr [rsi + rdi + 480]
	vmovaps	ymm14, ymmword ptr [rsi + rdi + 544]
	vmovaps	ymm15, ymmword ptr [rsi + rdi + 608]
	lea	rcx, [rdi + rsi]
	lea	rax, [rsi + rdi + 64]
	vmovups	ymmword ptr [rsp + 64], ymm11   # 32-byte Spill
	vmovaps	ymm11, ymmword ptr [rsi + rdi + 224]
	vmovups	ymmword ptr [rsp + 128], ymm2   # 32-byte Spill
	vmovups	ymmword ptr [rsp + 96], ymm3    # 32-byte Spill
	vmovaps	ymm2, ymmword ptr [rsi + rdi + 64]
	vmovaps	ymm3, ymmword ptr [rsi + rdi + 576]
	vmovups	ymmword ptr [rsp + 32], ymm11   # 32-byte Spill
	vmovaps	ymm11, ymmword ptr [rsi + rdi + 288]
	vmovups	ymmword ptr [rsp], ymm11        # 32-byte Spill
	vmovaps	ymm11, ymmword ptr [rsi + rdi + 352]
	.p2align	4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
	vfmadd213ps	ymm1, ymm0, ymmword ptr [rsp + 128] # 32-byte Folded Reload
                                        # ymm1 = (ymm0 * ymm1) + mem
	vfmadd213ps	ymm2, ymm0, ymmword ptr [rsp + 96] # 32-byte Folded Reload
                                        # ymm2 = (ymm0 * ymm2) + mem
	vfmadd213ps	ymm10, ymm0, ymmword ptr [rsp + 64] # 32-byte Folded Reload
                                        # ymm10 = (ymm0 * ymm10) + mem
	vfmadd213ps	ymm9, ymm0, ymmword ptr [rsp + 32] # 32-byte Folded Reload
                                        # ymm9 = (ymm0 * ymm9) + mem
	vfmadd213ps	ymm8, ymm0, ymmword ptr [rsp] # 32-byte Folded Reload
                                        # ymm8 = (ymm0 * ymm8) + mem
	vfmadd213ps	ymm7, ymm0, ymm11       # ymm7 = (ymm0 * ymm7) + ymm11
	vfmadd213ps	ymm6, ymm0, ymm12       # ymm6 = (ymm0 * ymm6) + ymm12
	vfmadd213ps	ymm5, ymm0, ymm13       # ymm5 = (ymm0 * ymm5) + ymm13
	vfmadd213ps	ymm4, ymm0, ymm14       # ymm4 = (ymm0 * ymm4) + ymm14
	vfmadd213ps	ymm3, ymm0, ymm15       # ymm3 = (ymm0 * ymm3) + ymm15
	dec	rdx
	jne	.LBB0_1
# %bb.2:
	vmovaps	ymmword ptr [rcx], ymm1
	vmovaps	ymmword ptr [rax], ymm2
	vmovaps	ymmword ptr [rax + 64], ymm10
	vmovaps	ymmword ptr [rax + 128], ymm9
	vmovaps	ymmword ptr [rax + 192], ymm8
	vmovaps	ymmword ptr [rax + 256], ymm7
	vmovaps	ymmword ptr [rax + 320], ymm6
	vmovaps	ymmword ptr [rax + 384], ymm5
	vmovaps	ymmword ptr [rax + 448], ymm4
	vmovaps	ymmword ptr [rax + 512], ymm3
	add	rsp, 184
	vzeroupper
	ret
	.seh_endproc
                                        # -- End function
	.lcomm	farr_zen2,327680,64             # @farr_zen2
	.section	.drectve,"yn"
	.ascii	" /DEFAULTLIB:libcmt.lib"
	.ascii	" /DEFAULTLIB:oldnames.lib"
	.addrsig
	.globl	_fltused

@Dr-Noob
Copy link
Owner Author

Dr-Noob commented Sep 1, 2021

What performance do you achieve in Linux? 1600 GFLOP/s doesn't look bad for your CPU.
What is the frequency of your CPU running peakperf? You can check this by running the program and then using ./freq.sh

@Xemorph
Copy link

Xemorph commented Sep 2, 2021

[...]
Now I have to do a profiling and look where there are still performance hotspots to eliminate them.

Result of the microbenchmark with the CPU AMD Ryzen 9 3900X under Windows 10:

C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc>peakperf.exe -w 0

------------------------------------------------------
    peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
        CPU: AMD Ryzen 9 3900X 12-Core Processor
  Microarch: Zen 2
  Benchmark: Zen 2 (AVX2)
 Iterations: 1.00e+09
      GFLOP: 3840.00
    Threads: 24

   N┬║  Time(s)  GFLOP/s
    1  2.34313  1638.84
    2  2.34213  1639.54
    3  2.33812  1642.34
    4  2.36314  1624.95
    5  2.34313  1638.83
    6  2.33912  1641.64
    7  2.34713  1636.04
    8  2.35013  1633.95
    9  2.34813  1635.34
   10  2.34413  1638.14
------------------------------------------------------
 Average performance:      1636.95 +- 4.72 GFLOP/s
------------------------------------------------------


C:\Users\[USER NAME]\Documents\AAA_Environment\peakperf.win.env\peakperf-clang\build-clang-msvc>

[...]

Okay, that's funny … Under Linux (headless) I achieve between 1675 GFLOP/s to 1700 GFLOP/s.
I took your formula and calculated the theoretical value.

N_CORES * FREQUENCY * FMA * UNITS * (SIZE_OF_VECTOR/32)

(12 * 4.3 * 2 * 2 * (256/3)) / 1000000000 = 1651.2 GFLOP/s
      ~^~~~~~
      Overclocked: Base clock is 3.8 GHz

Under Windows, the GFLOP/s calculation is a bit slower. I will do some more tests under Windows. The previous result may have been distorted by the antivirus program or other background services. But on the whole it already looks very good :)

@Dr-Noob
Copy link
Owner Author

Dr-Noob commented Sep 2, 2021

Yeah, that's what I meant, it looks like your last version is working in Windows because you are already achieving the peak performance. So, in the end, what you did was just to use clang instead of msvc and mingw? Did you do any changes to the source code?

@Xemorph
Copy link

Xemorph commented Sep 2, 2021

In the end, I did indeed use clang. The Cmake file I had to adapt partly, because some GCC options also clang does not understand under Windows. I have listed the biggest changes below.

File arch.cpp:

/* [+](Added) __attribute__((sysv_abi)) to the function pointers */
struct benchmark_cpu {
  void (*compute_function_256)(__m256*farr_ptr, __m256, int32_t) __attribute__((sysv_abi));
  void (*compute_function_512)(__m512 *farr_ptr, __m512, int32_t) __attribute__((sysv_abi));
  const char* name;
  double gflops;
  int32_t n_threads;
  bench_type benchmark_type;
};

/* ... */

/* [+](Added) __attribute__((ms_abi)) */
__attribute__((ms_abi)) bool compute_cpu (struct benchmark_cpu* bench, double* e_time)
{
    /* ... */
}

File zen2.hpp (respectively for the others too):

#ifndef __ZEN2__
#define __ZEN2__

#include "arch.hpp"

/* [+](Added) __attribute__((sysv_abi)) */
__attribute__((sysv_abi)) void compute_zen2(__m256 *farr_ptr, __m256 mult, int32_t index);

#endif

File zen2.cpp (respectively for the others too):

#include "zen2.hpp"
#define OP_PER_IT B_256_10_OP_IT

/* [+](Added) Keyword 'static' because clang throws a lot of warnings */
static TYPE farr_zen2[MAX_NUMBER_THREADS][SIZE] __attribute__((aligned(64)));  

/* [+](Added) __attribute__((sysv_abi))
 * [~](Updated)
 *     int index (changed to) -> int32_t index
 *     long i    (changed to) -> int64_t i
 */
__attribute__((sysv_abi)) void compute_zen2(TYPE *farr, TYPE mult, int32_t index)
{
  farr = farr_zen2[index];
    
  for(int64_t i=0; i < BENCHMARK_CPU_ITERS; i++)
  {
    farr[0]  = _mm256_fmadd_ps(mult, farr[0], farr[1]);
    farr[2]  = _mm256_fmadd_ps(mult, farr[2], farr[3]);
    farr[4]  = _mm256_fmadd_ps(mult, farr[4], farr[5]);
    farr[6]  = _mm256_fmadd_ps(mult, farr[6], farr[7]);
    farr[8]  = _mm256_fmadd_ps(mult, farr[8], farr[9]);
    farr[10] = _mm256_fmadd_ps(mult, farr[10], farr[11]);
    farr[12] = _mm256_fmadd_ps(mult, farr[12], farr[13]);
    farr[14] = _mm256_fmadd_ps(mult, farr[14], farr[15]);
    farr[16] = _mm256_fmadd_ps(mult, farr[16], farr[17]);
    farr[18] = _mm256_fmadd_ps(mult, farr[18], farr[19]);
  }
}

These are mainly the most important changes. Of course I had to implement other things like the method 'gettimeofday' myself ... Maybe it is even possible to get more performance out of it.

@Xemorph
Copy link

Xemorph commented Sep 2, 2021

My friend was kind enough to test the program on her computer. She has an i7 4790K @ 4.0 GHz and the results are horrible ... There is definitely something wrong yet, I'll have another look at the assembler code.

------------------------------------------------------
    peakperf (https://github.com/Dr-Noob/peakperf)
------------------------------------------------------
        CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  Microarch: Haswell
  Benchmark: Haswell (AVX2)
 Iterations: 1.00e+09
      GFLOP: 640.00
    Threads: 4

   N┬║  Time(s)  GFLOP/s
    1  1.93591   330.59
    2  1.90480   335.99
    3  1.92203   332.98
    4  1.95600   327.20
    5  1.96473   325.74
    6  2.05039   312.14
    7  2.30242   277.97
    8  2.00046   319.93
    9  2.03192   314.97
   10  2.11022   303.29
------------------------------------------------------
 Average performance:      317.16 +- 16.51 GFLOP/s
------------------------------------------------------

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants