[AMD] mir-glas is slower than OpenBLAS for DGEMM #20

Open
MigMuc opened this issue Apr 1, 2017 · 18 comments

@MigMuc

MigMuc commented Apr 1, 2017

I successfully compiled the benchmark gemm_report.d provided by mir-glas and ran it twice:
once comparing against OpenBLAS and once comparing against ACML-5.3.1.
As the benchmarks show, mir-glas does not reach full performance for large matrices.
Peak performance for my machine is about 23 GFLOPS in double precision.
ACML does not reach full performance in this benchmark either.
So I decided to compare with the dgemm.goto and dgemm.acml benchmark programs provided in
OpenBLAS/benchmark. There, ACML reaches peak performance too. Is there any overhead in calling
ACML from D?
[Attached benchmark plots: dgemm_bench, print]
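
For context on how a figure such as the 23 GFLOPS peak is compared against a measurement: GEMM benchmarks conventionally count 2·m·n·k floating-point operations per call and divide by the elapsed time. The following is a minimal sketch of that arithmetic in C. The function names are illustrative assumptions and are not taken from gemm_report.d or from the OpenBLAS benchmark programs.

```c
#include <stddef.h>

/* Floating-point operations for C = alpha*A*B + beta*C with an m x k matrix A
   and a k x n matrix B, using the conventional 2*m*n*k count
   (one multiply and one add per inner-product term). */
double gemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Achieved rate in GFLOP/s for one GEMM call that took `seconds`. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return gemm_flops(m, n, k) / seconds / 1e9;
}

/* Fraction of a stated peak rate, e.g. the roughly 23 GFLOP/s quoted above. */
double fraction_of_peak(double gflops, double peak_gflops)
{
    return gflops / peak_gflops;
}
```

With these helpers, a 1000 x 1000 x 1000 DGEMM that takes 0.5 s corresponds to 4 GFLOP/s; dividing a measured rate by the peak shows how far a library is from saturating the machine.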

@MigMuc
Author

MigMuc commented Apr 1, 2017

I have llvm version 3.9.1 installed.

@9il 9il added the performance label Apr 1, 2017
@9il
Member

9il commented Apr 1, 2017

Hey @MigMuc,

Is there any overhead calling ACML from D?

No, only the CBLAS function cblas_dgemm is called.

I have never tested GLAS on AMD CPUs. It would be awesome to have benchmarks for AMD. Benchmarks can be posted on the blog https://github.com/libmir/blog.

Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?

Possible factors that may influence performance:

  1. Computation kernel structure.
  2. CPU Cache usage by BLAS and other programs. You may want to close web browser and other programs to get correct benchmarks.
  3. Matrix transposition.
  4. Strange thermal behaviour.

Let's start with the computation kernels to optimize GLAS.

OpenBLAS uses sgemm_kernel_16x2_piledriver. This is strange, because this kernel does not use YMM registers, only XMM registers. Maybe YMM operations on Piledriver are emulated on top of XMM?

To see the GLAS DGEMM kernel, compile this gist with the -output-s flag. A command line example is in its first line. The example is for SGEMM; replace float[8] with double[4] to generate the DGEMM kernel.

Thanks!

@MigMuc
Author

MigMuc commented Apr 1, 2017

Hi @9il,

Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?

Yes, it has a Piledriver core.
So, to compile gemm_micro_kernel.d, I used the -mcpu=bdver2 flag after replacing

export extern(C)
auto dot_reg_basic_generic(
    const(__vector(float[8])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(float[8])[2][1][6] c,
)
{
    return dot_reg_basic(a, b, length, c);
}

with

export extern(C)
auto dot_reg_basic_generic(
    const(__vector(double[4])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(double[4])[2][1][6] c,
)
{
    return dot_reg_basic(a, b, length, c);
}

I got the following result:

gemm_micro_kernel.s.txt

@9il
Member

9il commented Apr 1, 2017

Please replace float with double for b

@MigMuc
Author

MigMuc commented Apr 1, 2017

gemm_micro_kernel.s.txt

@9il 9il changed the title mir-glas is slower than OpenBLAS for DGEMM [AMD] mir-glas is slower than OpenBLAS for DGEMM Apr 11, 2017
@RoyiAvital

Can one use mir-glas on Windows for C/C++ projects using Visual Studio?

@9il
Member

9il commented Jun 8, 2017

@RoyiAvital, yes. It has C headers. Note that it is single-threaded for now.

@RoyiAvital

@9il,
I'm interested in a linear algebra library for small matrices.
Hence I'm OK, for now, with a single-threaded implementation.

Is there a guide, or are there examples, showing how to use it from C code under Windows?

Thank You.

@9il
Member

9il commented Jun 9, 2017

@RoyiAvital,

  1. Build the library using the dub package manager: https://github.com/libmir/mir-glas#manual-compilation.
  2. Link it into your project like any ordinary C library.
  3. Include the headers from https://github.com/libmir/mir-glas/tree/master/include/glas in your project.

See also the examples folder.

@MigMuc
Author

MigMuc commented Jun 18, 2017

I spent some time doing benchmark tests and here they are:
[Attached benchmark plots: bench_sgemm, bench_dgemm, bench_cgemm, bench_zgemm]

@MigMuc MigMuc closed this as completed Jun 18, 2017
@MigMuc MigMuc reopened this Jun 18, 2017
@RoyiAvital

@MigMuc, could you please add labels for the axes?
I'm not sure whether higher or lower is better.

Thank You.

@MigMuc
Author

MigMuc commented Jun 18, 2017

As you can see, the performance varies quite a bit. In particular, AMD's own ACML is really weak in single-precision complex performance, where GLAS is the best. But there are two cases where GLAS could be substantially improved: the single- and double-precision cases.

Regarding the implementation of gemm in GLAS, as far as I can see there are a few lines in glas/internal/gemm.d:

auto re = s[0] * reg[n][0][m];
auto im = s[0] * reg[n][1][m];
re -= s[1] * reg[n][1][m];
im += s[1] * reg[n][0][m];
reg[n][0][m] = re;
reg[n][1][m] = im;

Is this the 1m implementation from BLIS for complex arithmetic?
I would like to test some blocking parameters, for example the blocking used in
https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_6x4_piledriver.S. Where can I set these parameters? Do you have any suggestions on how to proceed?
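
Taken on their own, the six quoted lines read as the conventional four-multiply complex product (s[0] + i·s[1]) · (re + i·im), with real and imaginary parts held in separate registers. Whether this corresponds to the BLIS 1m method is not answered in this thread. The following is a scalar sketch of the same arithmetic in C; the function name and the split re/im arrays are assumptions made for illustration, not GLAS code.

```c
#include <stddef.h>

/* Multiply each split-storage complex value (re[i], im[i]) in place by the
   complex scalar s0 + i*s1, using the same sequence of operations as the
   quoted GLAS lines: two multiplies, then one fused-style update per part. */
void scale_split_complex(double s0, double s1,
                         double *re, double *im, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        double r = s0 * re[i];   /* re  = s[0] * reg[n][0][m] */
        double m = s0 * im[i];   /* im  = s[0] * reg[n][1][m] */
        r -= s1 * im[i];         /* re -= s[1] * reg[n][1][m] */
        m += s1 * re[i];         /* im += s[1] * reg[n][0][m] */
        re[i] = r;
        im[i] = m;
    }
}
```

For example, scaling 4 + 5i by 2 + 3i this way yields -7 + 22i, which matches the textbook product (ac - bd) + (ad + bc)i.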

@RoyiAvital

Any chance of having Intel MKL there as well?

Thank You.

@MigMuc
Author

MigMuc commented Jun 18, 2017

This is an AMD CPU so I guess Intel MKL would not be optimized for this case. Probably it would work on this machine but I don't have MKL installed.

@MigMuc
Author

MigMuc commented Jun 18, 2017

@RoyiAvital: BTW, do you have any benchmarks you could provide? It would be great to have some comparisons with Intel CPUs as well.

@RoyiAvital

I have done some Intel MKL vs. OpenBLAS comparisons using MATLAB and Julia.

Have a look at Benchmark MATLAB & Julia for Matrix Operations.

But now I'm mostly interested in the performance of small matrices (up to ~1000 elements).

@MigMuc
Author

MigMuc commented Sep 18, 2017

Some time ago I did some benchmark testing with gemm. I would like to debug gemm_example.d in the examples folder to find out which blocking sizes the mir-cpuid package calculates for this particular CPU, and compare them with the blocking sizes used by OpenBLAS and BLIS. To do so, I changed the build type from --build=target-native to --build=debug in the dub.json file. But then I get linker errors:

The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-algorithm 0.6.13: target for configuration "library" is up to date.
mir-cpuid 0.5.2: target for configuration "library" is up to date.
gemm_example ~master: building configuration "application"...
Running pre-build commands...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-glas 0.2.3: building configuration "static"...
Compiling ../source/glas/precompiled/context.d...
Compiling ../source/glas/precompiled/l1d.d...
Compiling ../source/glas/precompiled/l1s.d...
Compiling ../source/glas/precompiled/l1c.d...
Compiling ../source/glas/precompiled/l1z.d...
Compiling ../source/glas/precompiled/l3c.d...
Compiling ../source/glas/precompiled/l3d.d...
Compiling ../source/glas/precompiled/l3s.d...
Compiling ../source/glas/precompiled/l3z.d...
Compiling ../source/glas/precompiled/utility.d...
Linking...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "release-nobounds" build using ldmd2 for x86_64.
mir-cpuid 0.5.2: building configuration "library"...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/amd.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/common.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/unified.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/intel.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/x86_any.d...
Linking...
Linking...
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas//libmir-glas.a(../.dub/build/static-debug-linux.posix-x86_64-ldc_2074-68AAD8DD4EB442FD2FE09072820FEAE2/home.miguel.Dokumente.DLang.mir-glas-0.2.3.mir-glas.source.glas.precompiled.context.d.o): In function `_D4glas11precompiled7context6memoryFNbNimZAv':
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:120: warning: undefined reference to `_D4glas8internal6memory10deallocateFNbNiAvZb'
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:121: warning: undefined reference to `_D4glas8internal6memory15alignedAllocateFNbNiNemkZAv'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1
ldmd2 failed with exit code 1.



What can I do in order to compile the whole package with debug info?

@9il
Member

9il commented Sep 18, 2017

The GLAS build system was created on the assumption that it is always built in release mode. Half of the files are simply never compiled, because all of their functions are marked as always inlined.

I recommend using C's printf to find the required information, or fixing the build configuration so that the required files are compiled and linked.
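
As a rough illustration of the printf approach, the sketch below prints a set of blocking parameters with a tag so they can be compared against OpenBLAS and BLIS values. It is written in C; from D the same C function is reachable through core.stdc.stdio. The struct, its field names (mc, kc, nc) and the helper name are hypothetical and do not mirror GLAS internals.

```c
#include <stdio.h>

/* Hypothetical container for cache-blocking parameters; the field names are
   an assumption for this sketch, not the names used inside GLAS. */
struct block_sizes {
    size_t mc;
    size_t kc;
    size_t nc;
};

/* Print one tagged line with the blocking parameters and return the number
   of characters written, as reported by fprintf. Passing stderr keeps the
   debug line apart from benchmark output on stdout. */
int print_block_sizes(FILE *out, const char *tag, struct block_sizes b)
{
    return fprintf(out, "[%s] mc=%zu kc=%zu nc=%zu\n", tag, b.mc, b.kc, b.nc);
}
```

A call such as print_block_sizes(stderr, "dgemm", sizes), placed where the parameters are computed, avoids the need for a debug build; the values passed in would come from wherever the library derives them.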
