Why does the TensorFlow GEMM test generate different performance results from PyTorch in fp16 mode with the same BERT config? #173
Comments
What's the meaning of the GEMM test in PyTorch and TensorFlow?
From the code, it seems that FasterTransformer runs the GEMM test the first time it sees a given model config, to generate the GEMM algorithm parameters.
Do you mean running "./bin/xxx_gemm xxx" in some script? Or can you point to what you mean?
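For context, the encoder GEMM test is a standalone binary run once per model shape. A minimal sketch of the invocation, assuming the v4.x-style `encoder_gemm` argument order (the exact binary name, argument order, and values vary by release and are illustrative here):

```bash
# Illustrative only: batch_size=16, seq_len=512, head_num=12,
# size_per_head=64 (a BERT-base-like config), is_fp16=1.
./bin/encoder_gemm 16 512 12 64 1
# The search results are written to gemm_config.in in the working
# directory, which the TF/PyTorch ops then read at runtime.
```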
With the same model config, FasterTransformer generates different GEMM parameters for TF and PyTorch in fp16 mode, and only PyTorch produces the correct result (matching fp32 mode). However, if the same gemm_config.in is used, TF still produces a wrong result.
Yes, the same code is embedded into the dynamic library. But as described above, the two frameworks generate different GEMM parameters for the same model config in fp16 mode, and only PyTorch matches the fp32 result. The key point is that even with the same gemm_config.in, TF produces a wrong result.
Can you point to which line of code you mean? As I remember, we don't embed the "gemm_test" into either the TF or the PyTorch op.
I've compiled the code with "-DSM=60". Without running the GEMM test, the encoder GEMM falls back to the default algoId -1, which also produces a wrong answer.
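For reference, that flag is passed at configure time. A typical build sequence, assuming the standard FasterTransformer CMake setup (paths and extra options are illustrative):

```bash
# Compute capability 6.0 corresponds to the P100, hence -DSM=60.
mkdir -p build && cd build
cmake -DSM=60 -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"
```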
Can you try to compile the code with
and post the result here? You can also try v5.0_beta and run the gemm_test outside the op.
The results seem correct.
Now the crucial problem is the wrong answer rather than the wrong algoId. I could try different algoIds until I get the right answer.
Can you post the log of the GEMM test in TF? From the image, it seems that the program cannot run GEMM successfully on TF.
Here is the TF log:

```
GEMM test 1: [M: 8192, K: 768, N: 3072] attr_output * inter_kernel
GEMM test 2: [M: 8192, K: 3072, N: 768] inter_matmul * output_kernel
GEMM test 3: [M: 512, K: 64, N: 512] attention batched Gemm1
GEMM test 4: [M: 512, K: 512, N: 64] attention batched Gemm2
GEMM test 5: [M: 8192, K: 768, N: 768] from_tensor * weight_QKV in BatchGemm
Encoder Gemm Testing End
```

I've read the source code for the float16 GEMM test and found that tests 0–2 never ran LtHgemmCustomFind with the different AlgoCount.
Can you post the log under PyTorch? I wonder whether the cublasLt testing succeeds on PyTorch.
I've looked at the GEMM test log again and found that PyTorch ran cublasLt successfully. libtf_fastertransformer.so depends on libcublasLt.so, but libpyt_fastertransformer.so doesn't, so PyTorch can pass the test. It may be an environment problem.
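One generic way to verify that dependency difference (a standard diagnostic, not from the thread; library names as posted above):

```bash
# Show which cuBLAS/cuBLASLt libraries each op library links against.
ldd libtf_fastertransformer.so  | grep -i cublas
ldd libpyt_fastertransformer.so | grep -i cublas
# A missing or version-mismatched libcublasLt.so would explain the
# cuBLASLt search silently failing (e.g. the 0.00 ms timings).
```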
Could you please provide the available TF versions? |
How did you set up the environment? Do you use the docker image of nvcr.io/nvidia/tensorflow?
I'm using a physical machine. I compiled libtf_fastertransformer.so against the TF installed by pip3; the TF version is 2.5.0, which depends on CUDA 11.0.
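A quick generic check of what the pip-installed TF and the local CUDA toolkit actually report (standard commands; output locations differ per install):

```bash
# TF version and the directory its shared libraries load from.
python3 -c "import tensorflow as tf; print(tf.__version__, tf.sysconfig.get_lib())"
# CUDA toolkit version used to build the custom op.
nvcc --version
```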
I suggest using the TF docker image first; there are many implicit issues in environment setup. Besides, we haven't verified the result on TF2, although the custom op implementation should be similar.
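For what it's worth, pulling and entering an NGC TF1 image looks like this (the tag is an example, not from the thread; pick one matching your driver from the NGC catalog):

```bash
# Requires the NVIDIA Container Toolkit for --gpus.
docker pull nvcr.io/nvidia/tensorflow:20.12-tf1-py3
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:20.12-tf1-py3
```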
If using TF 1.x, you must have used an older version of cublas.
The TF 1 docker images in NGC use the latest CUDA and cuBLAS.
Will the performance decrease due to the docker environment?
We haven't observed such a problem. For experiments and testing, using docker prevents the issues caused by environment setup.
Okay, I'll have a try. |
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problem.
GEMM tests 0–2 with PyTorch show normal timings, but GEMM tests 0–2 with TensorFlow report an incorrect timing of 0.00 ms. The tests were run on the same P100 device.