Leonardo-Ding/gpu_sgemm

This is a simple demo showing how to optimize GEMM on the NVIDIA GPU platform using SASS-level optimization tricks.

Environment:

Ubuntu 16.04/14.04 (other Linux distributions may also work)

Software required:

CUDA 8.0

CMake 2.8 or above

Perl

Hardware required:

NVIDIA Pascal architecture GPU (such as GTX 1080/1070, Titan X, Tesla P4/P40/P100)

We first write a CUDA C level implementation in the matrixMul_cuda directory. First, the kernel performs an 'NT' GEMM operation (C = A*B, where A is a column-major matrix and B is a row-major matrix). Second, the output C is computed in 64x64 tiles, the input A and B tiles are 64x8, and we use 64 CUDA threads to compute each tile.
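As a rough illustration of that layout (a minimal sketch with made-up helper names, not the code in the matrixMul_cuda directory): with A column-major (lda = M) and B row-major (ldb = N), both 64x8 input tiles are contiguous along their 64-element dimension, and each of the 64 threads in a block accumulates an 8x8 micro-tile of the 64x64 output tile.

```cuda
// Index helpers for the 'NT' layout; names are illustrative, not from the repo.
// A: column-major M x K (lda = M); B: row-major K x N (ldb = N).
__device__ __forceinline__ float loadA(const float *A, int M, int row, int k)
{
    return A[row + k * M];   // a 64x8 A tile = 8 columns of 64 contiguous floats
}

__device__ __forceinline__ float loadB(const float *B, int N, int k, int col)
{
    return B[k * N + col];   // a 64x8 B tile = 8 rows of 64 contiguous floats
}

// With 64 threads per 64x64 C tile, thread tx (0..63) owns the 8x8 micro-tile
// of C whose top-left element is at
//   row = blockIdx.y * 64 + (tx % 8) * 8
//   col = blockIdx.x * 64 + (tx / 8) * 8
```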

Some of the tricks are listed as follows; a sketch combining them is given after the list:

  1. 8x8 loop unrolling to gain Instruction Level Parallelism (ILP),
  2. wide 128-bit load/store instructions to improve load/store efficiency,
  3. a double-buffer trick to remove one unnecessary __syncthreads() from the main loop,
  4. data prefetching to improve pipeline efficiency,
  5. reorganizing C through shared memory before storing it to global memory.
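To make the list concrete, below is a minimal, hedged sketch of how tricks 1-4 fit together in a 64x64-tile NT kernel. It is illustrative only: the kernel, helper and variable names are made up and do not come from this repository; it assumes M, N and K are multiples of 64, that A and B are 16-byte aligned so they can be read as float4, and that C is column-major (ldc = M); trick 5 (staging C through shared memory before the global store) is omitted for brevity.

```cuda
// Hypothetical helper (not from the repo): 128-bit loads of one 64x8 slice of A
// and one 64x8 slice of B into shared memory (trick 2).
__device__ __forceinline__
void load_slice_128(float *sa, float *sb, const float4 *A4, const float4 *B4,
                    int tileRow, int tileCol, int k0, int lda4, int ldb4, int tx)
{
    const int kSub = tx / 8;    // which of the 8 k-values this thread fills
    const int vec  = tx % 8;    // which float4 within the 64-float row
    float4 a0 = A4[(tileRow / 4 + vec)     + (k0 + kSub) * lda4];  // column-major A
    float4 a1 = A4[(tileRow / 4 + vec + 8) + (k0 + kSub) * lda4];
    float4 b0 = B4[(k0 + kSub) * ldb4 + (tileCol / 4 + vec)];      // row-major B
    float4 b1 = B4[(k0 + kSub) * ldb4 + (tileCol / 4 + vec + 8)];
    reinterpret_cast<float4 *>(sa + kSub * 64)[vec]     = a0;      // 128-bit shared stores
    reinterpret_cast<float4 *>(sa + kSub * 64)[vec + 8] = a1;
    reinterpret_cast<float4 *>(sb + kSub * 64)[vec]     = b0;
    reinterpret_cast<float4 *>(sb + kSub * 64)[vec + 8] = b1;
}

// Illustrative kernel: one 64x64 C tile per block, 64 threads, 8x8 accumulator
// per thread, double-buffered shared memory with prefetch (tricks 1-4).
__global__ void sgemm_nt_64x64_sketch(const float4 *A4, const float4 *B4,
                                      float *C, int M, int N, int K)
{
    const int tx = threadIdx.x;                           // 0..63
    const int tileRow = blockIdx.y * 64, tileCol = blockIdx.x * 64;
    const int rowBase = (tx % 8) * 8, colBase = (tx / 8) * 8;
    const int lda4 = M / 4, ldb4 = N / 4;                 // leading dims in float4 units

    __shared__ __align__(16) float sA[2][8 * 64];         // two 64x8 slices of A
    __shared__ __align__(16) float sB[2][8 * 64];         // two 64x8 slices of B
    float acc[8][8] = {};                                 // per-thread 8x8 micro-tile

    load_slice_128(sA[0], sB[0], A4, B4, tileRow, tileCol, 0, lda4, ldb4, tx);
    __syncthreads();

    int buf = 0;
    for (int k0 = 8; k0 < K; k0 += 8, buf ^= 1) {
        // Tricks 3+4: prefetch the next slice into the idle buffer while computing
        // on the current one. One __syncthreads() per iteration suffices: the buffer
        // written here is not read until the next iteration, and the buffer read
        // here was filled before the previous barrier.
        load_slice_128(sA[buf ^ 1], sB[buf ^ 1], A4, B4, tileRow, tileCol, k0, lda4, ldb4, tx);
        #pragma unroll
        for (int kk = 0; kk < 8; ++kk)                    // trick 1: fully unrolled 8x8 update
            for (int i = 0; i < 8; ++i)
                for (int j = 0; j < 8; ++j)
                    acc[i][j] += sA[buf][kk * 64 + rowBase + i] * sB[buf][kk * 64 + colBase + j];
        __syncthreads();
    }
    for (int kk = 0; kk < 8; ++kk)                        // epilogue: the last K slice
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j)
                acc[i][j] += sA[buf][kk * 64 + rowBase + i] * sB[buf][kk * 64 + colBase + j];

    for (int j = 0; j < 8; ++j)                           // plain write-back, column-major C
        for (int i = 0; i < 8; ++i)                       // (ldc = M); trick 5 is omitted here
            C[(tileRow + rowBase + i) + (tileCol + colBase + j) * M] = acc[i][j];
}
```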

However, the CUDA C code is not optimized very well for TLP by the nvcc compiler, because we must keep GPU register usage below 128 per thread to maintain a theoretical occupancy of 25%. So we decided to optimize the code at the SASS level. The detailed register allocation is not listed here, and this process is not easy. Luckily, we have a good assembler to help us: we use NervanaSystems/maxas as a reference and modified it to run code successfully on NVIDIA Pascal architecture GPUs in a CUDA 8.0 environment.
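For reference, the 25% figure follows from the register budget (assuming a GP104-class SM as in the GTX 1080, with 65536 32-bit registers and up to 2048 resident threads per SM): at 128 registers per thread, a 64-thread block needs 8192 registers, so at most 8 blocks (512 threads) can be resident on an SM, and 512 / 2048 gives 25% theoretical occupancy; exceeding 128 registers per thread pushes occupancy below that.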

Note: since this is only a simple demo showing how to optimize GPU code in SASS, some restrictions apply:

  1. The row and column sizes of the A and B matrices must be 64 or multiples of 64 (see the launch sketch after this list),
  2. This is only a matrix-multiply version for C = A*B, not a complete GEMM version for C = alpha*A*B + beta*C,
  3. To beat cuBLAS on large input matrix sizes you would need to write a 128x64, 128x128 or similar tile version, though our code outperforms cuBLAS on a 1024x1024 matrix multiply tested on a GTX 1080.
  4. Sorry to say our documentation is not detailed; anybody interested in this project can contact me.
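For illustration only, here is a hypothetical host-side launch of the sketch kernel above that enforces restriction 1; the actual entry point, API and launch configuration in this repository may differ:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical wrapper around the sketch kernel above; not the repo's actual API.
void launch_sgemm_nt(const float *dA, const float *dB, float *dC, int M, int N, int K)
{
    assert(M % 64 == 0 && N % 64 == 0 && K % 64 == 0);   // restriction 1
    dim3 grid(N / 64, M / 64);                           // one block per 64x64 C tile
    dim3 block(64);                                      // 64 threads per tile
    sgemm_nt_64x64_sketch<<<grid, block>>>(
        reinterpret_cast<const float4 *>(dA),
        reinterpret_cast<const float4 *>(dB), dC, M, N, K);
}
```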

Finally, many thanks to Scott Gray for his wonderful code. And most importantly, great appreciation to my kind friend Alex Xia (https://github.com/alex-xia-xia), who contributed so much to this project and gave me a lot of help.