
Why do TensorFlow gemm tests generate different performance results from PyTorch in fp16 mode with the same BERT config? #173

Closed
flygragon opened this issue Dec 16, 2021 · 26 comments


@flygragon

[screenshots: gemm test timings with PyTorch and with TensorFlow]
Gemm tests 0-2 with PyTorch show normal timings, but gemm tests 0-2 with TensorFlow report an incorrect time of 0.00 ms. The tests were run on the same P100 device.

@byshiue (Collaborator) commented Dec 16, 2021

What's the meaning of Gemm Test in PyTorch and TensorFlow?

@flygragon (Author)

> What's the meaning of Gemm Test in PyTorch and TensorFlow?

From the code, it looks like FasterTransformer runs the gemm test the first time it sees a given model config in order to generate the gemm algorithm parameters.

@byshiue (Collaborator) commented Dec 16, 2021

Do you mean running "./bin/xxx_gemm xxx" in some script? Or can you point to what you mean?

@flygragon (Author)

With the same model config, FasterTransformer generates different gemm parameters under TF and PyTorch in fp16 mode, and only PyTorch produces the correct inference result compared to fp32 mode. However, even when using the same gemm_config.in, TF still produces a wrong result.

@flygragon (Author)

> Do you mean running "./bin/xxx_gemm xxx" in some script? Or can you point to what you mean?

Yes, the same code that is embedded into the dynamic library. With the same model config, FasterTransformer generates different gemm parameters under TF and PyTorch in fp16 mode, and only PyTorch produces the correct inference result compared to fp32 mode. The key point, however, is that even when using the same gemm_config.in, TF still produces a wrong result.

@byshiue (Collaborator) commented Dec 16, 2021

Can you point to which line of code you mean? As I recall, we don't embed the gemm test into either the TF or the PyTorch op.
Did you compile the code with "-DSM=60"?

@flygragon (Author)

> Can you point to which line of code you mean? As I recall, we don't embed the gemm test into either the TF or the PyTorch op. Did you compile the code with "-DSM=60"?

I've compiled the code with "-DSM=60". If the gemm test is skipped, the encoder gemm uses the default algoId -1, which also produces a wrong answer.
I'm using the v4.0 branch. bert_encoder_transformer.h:412 runs the gemm test to find the fastest algoId, and the later cublasLtMatmul, cublasGemmEx, and cublasGemmBatchedEx calls use this algoId to compute the matmuls.
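
For context, this is roughly how the selected algoId ends up in a cublasGemmEx call. The block below is a minimal sketch assuming the CUDA 11 cuBLAS API, not the FasterTransformer source, and the helper name fp16_gemm_with_algo is made up for illustration; algoId -1 maps to CUBLAS_GEMM_DEFAULT, and the IDs 99-115 that appear in the gemm test log are the tensor-op algorithms CUBLAS_GEMM_DEFAULT_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP.

```cpp
// Sketch only (assumes the CUDA 11 cuBLAS API), not the FasterTransformer source:
// a column-major fp16 GEMM C = A * B where the algorithm is chosen by the
// algoId read from gemm_config.in.
#include <cublas_v2.h>
#include <cuda_fp16.h>

cublasStatus_t fp16_gemm_with_algo(cublasHandle_t handle,
                                   int m, int n, int k,
                                   const half* A,   // m x k, leading dimension m
                                   const half* B,   // k x n, leading dimension k
                                   half* C,         // m x n, leading dimension m
                                   int algoId)      // e.g. read from gemm_config.in
{
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    // -1 -> CUBLAS_GEMM_DEFAULT; 99..115 -> tensor-op algorithms (algo_99..algo_115 in the log)
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,
                        B, CUDA_R_16F, k,
                        &beta,
                        C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F,
                        static_cast<cublasGemmAlgo_t>(algoId));
}
```

In principle the algoId should only affect speed, not correctness, so getting a wrong answer with both algoId -1 and the tuned algoIds suggests the problem is elsewhere.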

@byshiue (Collaborator) commented Dec 17, 2021

Can you try to compile the code with -DCMAKE_BUILD_TYPE=DEBUG and run again?
Besides, try to run the gemm test directly like

./bin/encoder_gemm 16 512 12 64 1 0

and post the result here.

You can also try the v5.0_beta and run the gemm_test outside the op.

@flygragon (Author)

> Can you try to compile the code with -DCMAKE_BUILD_TYPE=DEBUG and run again? Besides, try to run the gemm test directly like
>
> ./bin/encoder_gemm 16 512 12 64 1 0
>
> and post the result here.
>
> You can also try the v5.0_beta and run the gemm_test outside the op.

[screenshot: output of ./bin/encoder_gemm 16 512 12 64 1 0]
The result of this command is shown in the screenshot. I'll try debug mode with TF soon.
What's the difference between v5.0_beta and v4.0?

@byshiue (Collaborator) commented Dec 17, 2021

The results seem correct.
The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

@flygragon (Author)

> The results seem correct. The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

[screenshot: gemm test output with TF in debug mode]
The screenshot shows the gemm test with TF in debug mode. It looks the same as the release build, because both use the cublas and cublasLt libraries, which have no debug mode of their own. I think whether the gemm test is run outside the framework makes no difference, because the gemm should use the fastest algoId, and that is only produced by the gemm test.

@flygragon (Author)

> The results seem correct. The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

Now the crucial problem is the wrong answer rather than the wrong algoId. I could try different algoIds until I get the right answer.
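
If it helps, below is a self-contained sketch of that kind of sweep: a hypothetical standalone test (not FasterTransformer code, assuming the CUDA 11 cuBLAS API) that runs the same fp16 GEMM with the default and each tensor-op algoId and reports the max error against a host fp32 reference. On a healthy setup every supported algoId should give roughly the same small error.

```cpp
// Hypothetical standalone check (not FasterTransformer code, assumes CUDA 11):
// sweep the default and tensor-op GEMM algorithms in fp16 and compare each
// result against a host fp32 reference.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 64, n = 64, k = 64;          // small shapes keep the check quick
    std::vector<float> hA(m * k), hB(k * n), hRef(m * n, 0.0f);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = 0.01f * float(i % 7);
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = 0.02f * float(i % 5);

    // Column-major fp32 reference: Ref = A * B
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                hRef[i + j * m] += hA[i + p * m] * hB[p + j * k];

    // fp16 copies of the inputs on the device
    std::vector<half> hA16(hA.size()), hB16(hB.size());
    for (size_t i = 0; i < hA.size(); ++i) hA16[i] = __float2half(hA[i]);
    for (size_t i = 0; i < hB.size(); ++i) hB16[i] = __float2half(hB[i]);
    half *dA, *dB, *dC;
    cudaMalloc(&dA, hA16.size() * sizeof(half));
    cudaMalloc(&dB, hB16.size() * sizeof(half));
    cudaMalloc(&dC, size_t(m) * n * sizeof(half));
    cudaMemcpy(dA, hA16.data(), hA16.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB16.data(), hB16.size() * sizeof(half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // algoId -1 is CUBLAS_GEMM_DEFAULT; 99..115 are the tensor-op algorithms from the log.
    for (int algoId = -1; algoId <= 115; algoId = (algoId == -1 ? 99 : algoId + 1)) {
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, dA, CUDA_R_16F, m, dB, CUDA_R_16F, k,
            &beta,  dC, CUDA_R_16F, m,
            CUBLAS_COMPUTE_16F, static_cast<cublasGemmAlgo_t>(algoId));
        if (st != CUBLAS_STATUS_SUCCESS) {     // this algo is not supported here
            printf("algo %d: status %d (skipped)\n", algoId, int(st));
            continue;
        }
        std::vector<half> hC(size_t(m) * n);
        cudaMemcpy(hC.data(), dC, hC.size() * sizeof(half), cudaMemcpyDeviceToHost);
        float maxErr = 0.0f;
        for (size_t i = 0; i < hC.size(); ++i)
            maxErr = std::max(maxErr, std::fabs(__half2float(hC[i]) - hRef[i]));
        printf("algo %d: max abs error vs fp32 = %g\n", algoId, maxErr);
    }

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

It should build with something like nvcc plus -lcublas.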

@byshiue (Collaborator) commented Dec 17, 2021

Can you post the log of the gemm test in TF? From the image, it seems that the program cannot run the GEMM successfully in TF.

@flygragon (Author) commented Dec 17, 2021

> Can you post the log of the gemm test in TF? From the image, it seems that the program cannot run the GEMM successfully in TF.

```
GEMM test 0: [M: 8192, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
algo_99 costs 0.712ms
algo_100 costs 0.729ms
algo_101 costs 0.699ms
algo_102 costs 0.674ms
algo_103 costs 0.663ms
algo_104 costs 0.671ms
algo_105 costs 0.665ms
algo_106 costs 0.661ms
algo_107 costs 0.654ms
algo_108 costs 0.649ms
algo_109 costs 0.647ms
algo_110 costs 0.654ms
algo_111 costs 0.646ms
algo_112 costs 0.649ms
algo_113 costs 0.646ms
algo_114 costs 0.653ms
algo_115 costs 0.650ms
fast_algo 111 costs 0.646 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 1: [M: 8192, K: 768, N: 3072] attr_output * inter_kernel
algo_99 costs 2.396ms
algo_100 costs 2.393ms
algo_101 costs 2.393ms
algo_102 costs 2.396ms
algo_103 costs 2.397ms
algo_104 costs 2.392ms
algo_105 costs 2.396ms
algo_106 costs 2.390ms
algo_107 costs 2.394ms
algo_108 costs 2.390ms
algo_109 costs 2.395ms
algo_110 costs 2.394ms
algo_111 costs 2.395ms
algo_112 costs 2.393ms
algo_113 costs 2.393ms
algo_114 costs 2.389ms
algo_115 costs 2.391ms
fast_algo 114 costs 2.389 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 2: [M: 8192, K: 3072, N: 768] inter_matmul * output_kernel
algo_99 costs 2.597ms
algo_100 costs 2.597ms
algo_101 costs 2.597ms
algo_102 costs 2.597ms
algo_103 costs 2.594ms
algo_104 costs 2.597ms
algo_105 costs 2.597ms
algo_106 costs 2.597ms
algo_107 costs 2.594ms
algo_108 costs 2.597ms
algo_109 costs 2.597ms
algo_110 costs 2.591ms
algo_111 costs 2.594ms
algo_112 costs 2.597ms
algo_113 costs 2.594ms
algo_114 costs 2.597ms
algo_115 costs 2.597ms
fast_algo 110 costs 2.591 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 3: [M: 512, K: 64, N: 512] attention batched Gemm1
algo_99 costs 0.575ms
algo_100 costs 0.575ms
algo_101 costs 0.575ms
algo_102 costs 0.575ms
algo_103 costs 0.574ms
algo_104 costs 0.575ms
algo_105 costs 0.578ms
algo_106 costs 0.579ms
algo_107 costs 0.575ms
algo_108 costs 0.575ms
algo_109 costs 0.575ms
algo_110 costs 0.575ms
algo_111 costs 0.575ms
algo_112 costs 0.575ms
algo_113 costs 0.575ms
algo_114 costs 0.575ms
algo_115 costs 0.574ms
fast_algo 103 costs 0.574 ms


GEMM test 4: [M: 512, K: 512, N: 64] attention batched Gemm2
algo_99 costs 0.902ms
algo_100 costs 0.902ms
algo_101 costs 0.902ms
algo_102 costs 0.902ms
algo_103 costs 0.902ms
algo_104 costs 0.901ms
algo_105 costs 0.902ms
algo_106 costs 0.902ms
algo_107 costs 0.902ms
algo_108 costs 0.902ms
algo_109 costs 0.903ms
algo_110 costs 0.902ms
algo_111 costs 0.902ms
algo_112 costs 0.901ms
algo_113 costs 0.902ms
algo_114 costs 0.902ms
algo_115 costs 0.902ms
fast_algo 104 costs 0.901 ms


GEMM test 5: [M: 8192, K: 768, N: 768] from_tensor * weight_QKV in BatchGemm
algo_99 costs 1.822ms
algo_100 costs 1.823ms
algo_101 costs 1.822ms
algo_102 costs 1.824ms
algo_103 costs 1.821ms
algo_104 costs 1.822ms
algo_105 costs 1.823ms
algo_106 costs 1.824ms
algo_107 costs 1.825ms
algo_108 costs 1.824ms
algo_109 costs 1.822ms
algo_110 costs 1.824ms
algo_111 costs 1.824ms
algo_112 costs 1.820ms
algo_113 costs 1.823ms
algo_114 costs 1.824ms
algo_115 costs 1.823ms
fast_algo 112 costs 1.820 ms
cublas Gemm Testing End

Encoder Gemm Testing End
```

I've read the source code of the fp16 gemm test, and found that under TF, tests 0-2 never actually complete LtHgemmCustomFind (the AlgoCount is different).
It's always PyTorch that produces valid exec_times for this part, not TF. I'm trying to modify the code so that gemm_config.in is written only after LtHgemmCustomFind has been called.
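
For reference, the "AlgoCount: 0" lines in the log above indicate that no cuBLASLt algorithm was found or timed for those shapes, which is why the reported time stays at 0.000000ms. Below is a minimal sketch of asking cuBLASLt for fp16 matmul candidates (assuming the CUDA 11 cuBLASLt API; this is not the LtHgemmCustomFind code itself, and query_lt_algos is a made-up helper name): if the query fails or returns zero candidates, e.g. because the wrong libcublasLt.so is loaded, there is nothing to benchmark.

```cpp
// Sketch only (assumes the CUDA 11 cuBLASLt API), not the LtHgemmCustomFind code:
// ask cuBLASLt for fp16 matmul algorithm candidates for a given m, n, k.
#include <cstdio>
#include <cublasLt.h>

int query_lt_algos(int m, int n, int k) {
    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, m, k, m);   // column-major m x k
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_16F, k, n, k);   // column-major k x n
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16F, m, n, m);   // column-major m x n

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    cublasLtMatmulHeuristicResult_t results[8] = {};
    int returnedResults = 0;
    cublasStatus_t st = cublasLtMatmulAlgoGetHeuristic(
        lt, op, Adesc, Bdesc, Cdesc, Cdesc, pref, 8, results, &returnedResults);

    // Zero candidates (or a failing status) means there is nothing to time,
    // which matches the "AlgoCount: 0" / 0.000000ms lines in the TF log.
    printf("status=%d, candidate algo count=%d\n", int(st), returnedResults);

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(Cdesc);
    cublasLtMatrixLayoutDestroy(Bdesc);
    cublasLtMatrixLayoutDestroy(Adesc);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    return returnedResults;
}
```

Calling query_lt_algos(8192, 768, 768), for example, corresponds to the shape of GEMM test 0 above.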

@byshiue (Collaborator) commented Dec 17, 2021

Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch.
I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0.
You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

@flygragon (Author)

> Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch. I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0. You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

I've looked at the gemm test log again, and I found that PyTorch did complete the cublasLt test successfully. libtf_fastertransformer.so depends on libcublasLt.so, but libpyt_fastertransformer.so doesn't, so the PyTorch test runs OK. It may be an environment problem.

@flygragon (Author)

> Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch. I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0. You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

Could you please provide the available TF versions?

@byshiue (Collaborator) commented Dec 17, 2021

How did you set up the environment? Do you use the nvcr.io/nvidia/tensorflow docker image?

@flygragon (Author) commented Dec 17, 2021

> How did you set up the environment? Do you use the nvcr.io/nvidia/tensorflow docker image?

I'm using a physical machine. I compiled libtf_fastertransformer.so against the TensorFlow installed by pip3; the TF version is 2.5.0, which depends on CUDA 11.0.

@byshiue (Collaborator) commented Dec 17, 2021

I suggest using the TF docker image first. There are many implicit issues in environment setup. Besides, we haven't verified the results on TF2, although the custom op implementation should be similar.

@flygragon (Author)

> I suggest using the TF docker image first. There are many implicit issues in environment setup. Besides, we haven't verified the results on TF2, although the custom op implementation should be similar.

If you used TF 1.x, you must have used an older version of cublas.

@byshiue (Collaborator) commented Dec 17, 2021

The TF 1 docker images in NGC use the latest cuda and cublas.

@flygragon (Author) commented Dec 17, 2021

> The TF 1 docker images in NGC use the latest cuda and cublas.

Will performance decrease because of the docker environment?

@byshiue (Collaborator) commented Dec 17, 2021

We haven't observed such a problem. And for experiments and testing, using docker prevents issues caused by environment setup.

@flygragon (Author)

> We haven't observed such a problem. And for experiments and testing, using docker prevents issues caused by environment setup.

Okay, I'll give it a try.

@byshiue (Collaborator) commented Apr 18, 2022

Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problem.

byshiue closed this as completed on Apr 18, 2022