
Why do TensorFlow gemm tests generate different performance results from PyTorch in fp16 mode with the same BERT config? #173

Closed
flygragon opened this issue Dec 16, 2021 · 26 comments


@flygragon

[screenshots: gemm test timings with PyTorch and with TensorFlow]
Gemm tests 0-2 with PyTorch show normal timings, but gemm tests 0-2 with TensorFlow report an incorrect time of 0.00 ms. The tests were run on the same P100 device.

@byshiue (Collaborator) commented Dec 16, 2021

What's the meaning of Gemm Test in PyTorch and TensorFlow?

@flygragon (Author)

> What's the meaning of Gemm Test in PyTorch and TensorFlow?

From the code, it looks like FasterTransformer runs the gemm test the first time it sees a given model config in order to generate the gemm algorithm parameters.

@byshiue (Collaborator) commented Dec 16, 2021

Do you mean running "./bin/xxx_gemm xxx" in some script? Or can you point to what you mean?

@flygragon (Author)

With the same model config, FasterTransformer generates different gemm parameters under TF and PyTorch in fp16 mode, and only PyTorch produces the correct inference result compared to fp32 mode. However, even when using the same gemm_config.in, TF still produces a wrong result.

@flygragon (Author)

> Do you mean running "./bin/xxx_gemm xxx" in some script? Or can you point to what you mean?

Yes, the same code that is embedded into the dynamic library. With the same model config, FasterTransformer generates different gemm parameters under TF and PyTorch in fp16 mode, and only PyTorch produces the correct inference result compared to fp32 mode. The key point, however, is that even when using the same gemm_config.in, TF still produces a wrong result.

@byshiue (Collaborator) commented Dec 16, 2021

Can you point to which line of code you mean? As I recall, we don't embed the gemm test into either the TF or the PyTorch op.
Did you compile the code with "-DSM=60"?

@flygragon (Author)

> Can you point to which line of code you mean? As I recall, we don't embed the gemm test into either the TF or the PyTorch op. Did you compile the code with "-DSM=60"?

I've compiled the code with "-DSM=60". If the gemm test is skipped, the encoder gemm uses the default algoId -1, which also produces a wrong answer.
I'm using the v4.0 branch. bert_encoder_transformer.h:412 runs the gemm test to find the fastest algoId, and the later cublasLtMatmul, cublasGemmEx, and cublasGemmBatchedEx calls use this algoId to compute the matmuls.
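
For context, this is roughly how the selected algoId ends up in a cublasGemmEx call. The block below is a minimal sketch assuming the CUDA 11 cuBLAS API, not the FasterTransformer source, and the helper name fp16_gemm_with_algo is made up for illustration; algoId -1 maps to CUBLAS_GEMM_DEFAULT, and the IDs 99-115 that appear in the gemm test log are the tensor-op algorithms CUBLAS_GEMM_DEFAULT_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP.

```cpp
// Sketch only (assumes the CUDA 11 cuBLAS API), not the FasterTransformer source:
// a column-major fp16 GEMM C = A * B where the algorithm is chosen by the
// algoId read from gemm_config.in.
#include <cublas_v2.h>
#include <cuda_fp16.h>

cublasStatus_t fp16_gemm_with_algo(cublasHandle_t handle,
                                   int m, int n, int k,
                                   const half* A,   // m x k, leading dimension m
                                   const half* B,   // k x n, leading dimension k
                                   half* C,         // m x n, leading dimension m
                                   int algoId)      // e.g. read from gemm_config.in
{
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);
    // -1 -> CUBLAS_GEMM_DEFAULT; 99..115 -> tensor-op algorithms (algo_99..algo_115 in the log)
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,
                        B, CUDA_R_16F, k,
                        &beta,
                        C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_16F,
                        static_cast<cublasGemmAlgo_t>(algoId));
}
```

In principle the algoId should only affect speed, not correctness, so getting a wrong answer with both algoId -1 and the tuned algoIds suggests the problem is elsewhere.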

@byshiue (Collaborator) commented Dec 17, 2021

Can you try to compile the code with -DCMAKE_BUILD_TYPE=DEBUG and run again?
Besides, try to run the gemm test directly like

./bin/encoder_gemm 16 512 12 64 1 0

and post the result here.

You can also try the v5.0_beta and run the gemm_test outside the op.

@flygragon (Author)

> Can you try to compile the code with -DCMAKE_BUILD_TYPE=DEBUG and run again? Besides, try to run the gemm test directly like
>
> ./bin/encoder_gemm 16 512 12 64 1 0
>
> and post the result here.
>
> You can also try the v5.0_beta and run the gemm_test outside the op.

[screenshot: output of ./bin/encoder_gemm 16 512 12 64 1 0]
The result of this command is shown in the screenshot. I'll try debug mode with TF soon.
What's the difference between v5.0_beta and v4.0?

@byshiue (Collaborator) commented Dec 17, 2021

The results seem correct.
The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

@flygragon (Author)

> The results seem correct. The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

[screenshot: gemm test output with TF in debug mode]
The screenshot shows the gemm test with TF in debug mode. It looks the same as the release build, because both use the cublas and cublasLt libraries, which have no debug mode of their own. I think whether the gemm test is run outside the framework makes no difference, because the gemm should use the fastest algoId, and that is only produced by the gemm test.

@flygragon (Author)

> The results seem correct. The functionality of v5.0_beta and v4.0 is similar, but we removed the automatic gemm test from the BERT op because there was no requirement for this feature. I also suggest running the gemm test outside the framework.

Now the crucial problem is the wrong answer rather than the wrong algoId. I could try different algoIds until I get the right answer.
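
If it helps, below is a self-contained sketch of that kind of sweep: a hypothetical standalone test (not FasterTransformer code, assuming the CUDA 11 cuBLAS API) that runs the same fp16 GEMM with the default and each tensor-op algoId and reports the max error against a host fp32 reference. On a healthy setup every supported algoId should give roughly the same small error.

```cpp
// Hypothetical standalone check (not FasterTransformer code, assumes CUDA 11):
// sweep the default and tensor-op GEMM algorithms in fp16 and compare each
// result against a host fp32 reference.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 64, n = 64, k = 64;          // small shapes keep the check quick
    std::vector<float> hA(m * k), hB(k * n), hRef(m * n, 0.0f);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = 0.01f * float(i % 7);
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = 0.02f * float(i % 5);

    // Column-major fp32 reference: Ref = A * B
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                hRef[i + j * m] += hA[i + p * m] * hB[p + j * k];

    // fp16 copies of the inputs on the device
    std::vector<half> hA16(hA.size()), hB16(hB.size());
    for (size_t i = 0; i < hA.size(); ++i) hA16[i] = __float2half(hA[i]);
    for (size_t i = 0; i < hB.size(); ++i) hB16[i] = __float2half(hB[i]);
    half *dA, *dB, *dC;
    cudaMalloc(&dA, hA16.size() * sizeof(half));
    cudaMalloc(&dB, hB16.size() * sizeof(half));
    cudaMalloc(&dC, size_t(m) * n * sizeof(half));
    cudaMemcpy(dA, hA16.data(), hA16.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB16.data(), hB16.size() * sizeof(half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // algoId -1 is CUBLAS_GEMM_DEFAULT; 99..115 are the tensor-op algorithms from the log.
    for (int algoId = -1; algoId <= 115; algoId = (algoId == -1 ? 99 : algoId + 1)) {
        cublasStatus_t st = cublasGemmEx(
            handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
            &alpha, dA, CUDA_R_16F, m, dB, CUDA_R_16F, k,
            &beta,  dC, CUDA_R_16F, m,
            CUBLAS_COMPUTE_16F, static_cast<cublasGemmAlgo_t>(algoId));
        if (st != CUBLAS_STATUS_SUCCESS) {     // this algo is not supported here
            printf("algo %d: status %d (skipped)\n", algoId, int(st));
            continue;
        }
        std::vector<half> hC(size_t(m) * n);
        cudaMemcpy(hC.data(), dC, hC.size() * sizeof(half), cudaMemcpyDeviceToHost);
        float maxErr = 0.0f;
        for (size_t i = 0; i < hC.size(); ++i)
            maxErr = std::max(maxErr, std::fabs(__half2float(hC[i]) - hRef[i]));
        printf("algo %d: max abs error vs fp32 = %g\n", algoId, maxErr);
    }

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

It should build with something like nvcc plus -lcublas.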

@byshiue (Collaborator) commented Dec 17, 2021

Can you post the log of the gemm test in TF? From the image, it seems that the program cannot run the GEMM successfully in TF.

@flygragon (Author) commented Dec 17, 2021

> Can you post the log of the gemm test in TF? From the image, it seems that the program cannot run the GEMM successfully in TF.

```
GEMM test 0: [M: 8192, K: 768, N: 768] from_tensor * weightQ/K/V, attr * output_kernel
algo_99 costs 0.712ms
algo_100 costs 0.729ms
algo_101 costs 0.699ms
algo_102 costs 0.674ms
algo_103 costs 0.663ms
algo_104 costs 0.671ms
algo_105 costs 0.665ms
algo_106 costs 0.661ms
algo_107 costs 0.654ms
algo_108 costs 0.649ms
algo_109 costs 0.647ms
algo_110 costs 0.654ms
algo_111 costs 0.646ms
algo_112 costs 0.649ms
algo_113 costs 0.646ms
algo_114 costs 0.653ms
algo_115 costs 0.650ms
fast_algo 111 costs 0.646 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 1: [M: 8192, K: 768, N: 3072] attr_output * inter_kernel
algo_99 costs 2.396ms
algo_100 costs 2.393ms
algo_101 costs 2.393ms
algo_102 costs 2.396ms
algo_103 costs 2.397ms
algo_104 costs 2.392ms
algo_105 costs 2.396ms
algo_106 costs 2.390ms
algo_107 costs 2.394ms
algo_108 costs 2.390ms
algo_109 costs 2.395ms
algo_110 costs 2.394ms
algo_111 costs 2.395ms
algo_112 costs 2.393ms
algo_113 costs 2.393ms
algo_114 costs 2.389ms
algo_115 costs 2.391ms
fast_algo 114 costs 2.389 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 2: [M: 8192, K: 3072, N: 768] inter_matmul * output_kernel
algo_99 costs 2.597ms
algo_100 costs 2.597ms
algo_101 costs 2.597ms
algo_102 costs 2.597ms
algo_103 costs 2.594ms
algo_104 costs 2.597ms
algo_105 costs 2.597ms
algo_106 costs 2.597ms
algo_107 costs 2.594ms
algo_108 costs 2.597ms
algo_109 costs 2.597ms
algo_110 costs 2.591ms
algo_111 costs 2.594ms
algo_112 costs 2.597ms
algo_113 costs 2.594ms
algo_114 costs 2.597ms
algo_115 costs 2.597ms
fast_algo 110 costs 2.591 ms
cublasLt Gemm Testing Beign
AlgoCount: 0
algo={ Id=0, tileIdx=0 (UNDEF) splitK=0 reduc=0 swizzle=0 custom=0 stages=0} status 0 time 0.000000ms workspace=0 mathMode=0 waves=0.000000
cublasLt Gemm Testing End


GEMM test 3: [M: 512, K: 64, N: 512] attention batched Gemm1
algo_99 costs 0.575ms
algo_100 costs 0.575ms
algo_101 costs 0.575ms
algo_102 costs 0.575ms
algo_103 costs 0.574ms
algo_104 costs 0.575ms
algo_105 costs 0.578ms
algo_106 costs 0.579ms
algo_107 costs 0.575ms
algo_108 costs 0.575ms
algo_109 costs 0.575ms
algo_110 costs 0.575ms
algo_111 costs 0.575ms
algo_112 costs 0.575ms
algo_113 costs 0.575ms
algo_114 costs 0.575ms
algo_115 costs 0.574ms
fast_algo 103 costs 0.574 ms


GEMM test 4: [M: 512, K: 512, N: 64] attention batched Gemm2
algo_99 costs 0.902ms
algo_100 costs 0.902ms
algo_101 costs 0.902ms
algo_102 costs 0.902ms
algo_103 costs 0.902ms
algo_104 costs 0.901ms
algo_105 costs 0.902ms
algo_106 costs 0.902ms
algo_107 costs 0.902ms
algo_108 costs 0.902ms
algo_109 costs 0.903ms
algo_110 costs 0.902ms
algo_111 costs 0.902ms
algo_112 costs 0.901ms
algo_113 costs 0.902ms
algo_114 costs 0.902ms
algo_115 costs 0.902ms
fast_algo 104 costs 0.901 ms


GEMM test 5: [M: 8192, K: 768, N: 768] from_tensor * weight_QKV in BatchGemm
algo_99 costs 1.822ms
algo_100 costs 1.823ms
algo_101 costs 1.822ms
algo_102 costs 1.824ms
algo_103 costs 1.821ms
algo_104 costs 1.822ms
algo_105 costs 1.823ms
algo_106 costs 1.824ms
algo_107 costs 1.825ms
algo_108 costs 1.824ms
algo_109 costs 1.822ms
algo_110 costs 1.824ms
algo_111 costs 1.824ms
algo_112 costs 1.820ms
algo_113 costs 1.823ms
algo_114 costs 1.824ms
algo_115 costs 1.823ms
fast_algo 112 costs 1.820 ms
cublas Gemm Testing End

Encoder Gemm Testing End
```

I've read the source code of the fp16 gemm test, and found that under TF, tests 0-2 never actually complete LtHgemmCustomFind (the AlgoCount is different).
It's always PyTorch that produces valid exec_times for this part, not TF. I'm trying to modify the code so that gemm_config.in is written only after LtHgemmCustomFind has been called.
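
For reference, the "AlgoCount: 0" lines in the log above indicate that no cuBLASLt algorithm was found or timed for those shapes, which is why the reported time stays at 0.000000ms. Below is a minimal sketch of asking cuBLASLt for fp16 matmul candidates (assuming the CUDA 11 cuBLASLt API; this is not the LtHgemmCustomFind code itself, and query_lt_algos is a made-up helper name): if the query fails or returns zero candidates, e.g. because the wrong libcublasLt.so is loaded, there is nothing to benchmark.

```cpp
// Sketch only (assumes the CUDA 11 cuBLASLt API), not the LtHgemmCustomFind code:
// ask cuBLASLt for fp16 matmul algorithm candidates for a given m, n, k.
#include <cstdio>
#include <cublasLt.h>

int query_lt_algos(int m, int n, int k) {
    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;
    cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_16F, m, k, m);   // column-major m x k
    cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_16F, k, n, k);   // column-major k x n
    cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_16F, m, n, m);   // column-major m x n

    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);

    cublasLtMatmulHeuristicResult_t results[8] = {};
    int returnedResults = 0;
    cublasStatus_t st = cublasLtMatmulAlgoGetHeuristic(
        lt, op, Adesc, Bdesc, Cdesc, Cdesc, pref, 8, results, &returnedResults);

    // Zero candidates (or a failing status) means there is nothing to time,
    // which matches the "AlgoCount: 0" / 0.000000ms lines in the TF log.
    printf("status=%d, candidate algo count=%d\n", int(st), returnedResults);

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(Cdesc);
    cublasLtMatrixLayoutDestroy(Bdesc);
    cublasLtMatrixLayoutDestroy(Adesc);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    return returnedResults;
}
```

Calling query_lt_algos(8192, 768, 768), for example, corresponds to the shape of GEMM test 0 above.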

@byshiue (Collaborator) commented Dec 17, 2021

Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch.
I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0.
You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

@flygragon (Author)

> Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch. I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0. You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

I've looked at the gemm test log again, and I found that PyTorch did complete the cublasLt test successfully. libtf_fastertransformer.so depends on libcublasLt.so, but libpyt_fastertransformer.so doesn't, so the PyTorch test runs OK. It may be an environment problem.

@flygragon (Author)

> Can you post the log under PyTorch? I wonder whether the cublasLt test succeeds under PyTorch. I guess the problem is that your environment does not support cublasLt, and hence the execution time is always 0. You can try to disable this branch https://github.com/NVIDIA/FasterTransformer/blob/v4.0/fastertransformer/gemm_test/encoder_gemm_func.cc#L747 to try to solve this issue.

Could you please provide the available TF versions?

@byshiue (Collaborator) commented Dec 17, 2021

How did you set up the environment? Do you use the nvcr.io/nvidia/tensorflow docker image?

@flygragon (Author) commented Dec 17, 2021

> How did you set up the environment? Do you use the nvcr.io/nvidia/tensorflow docker image?

I'm using a physical machine. I compiled libtf_fastertransformer.so against the TensorFlow installed by pip3; the TF version is 2.5.0, which depends on CUDA 11.0.

@byshiue (Collaborator) commented Dec 17, 2021

I suggest using the TF docker image first. There are many implicit issues in environment setup. Besides, we haven't verified the results on TF2, although the custom op implementation should be similar.

@flygragon (Author)

> I suggest using the TF docker image first. There are many implicit issues in environment setup. Besides, we haven't verified the results on TF2, although the custom op implementation should be similar.

If you used TF 1.x, you must have used an older version of cublas.

@byshiue (Collaborator) commented Dec 17, 2021

The TF 1 docker images in NGC use the latest cuda and cublas.

@flygragon (Author) commented Dec 17, 2021

> The TF 1 docker images in NGC use the latest cuda and cublas.

Will performance decrease because of the docker environment?

@byshiue (Collaborator) commented Dec 17, 2021

We haven't observed such a problem. And for experiments and testing, using docker prevents issues caused by environment setup.

@flygragon (Author)

> We haven't observed such a problem. And for experiments and testing, using docker prevents issues caused by environment setup.

Okay, I'll give it a try.

@byshiue (Collaborator) commented Apr 18, 2022

Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problem.

byshiue closed this as completed on Apr 18, 2022