
Performance problems using tensorflow_probability #893

Closed
roblem opened this issue Mar 12, 2020 · 4 comments

roblem commented Mar 12, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes.

  • OS Platform and Distribution: Linux Ubuntu 18.04, using the upstream radeon kernel driver and launching ROCm scripts in the latest docker container with TF_ROCM_FUSION_ENABLE=1

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if applicable: N/A

  • TensorFlow installed from (source or binary): installed from docker as rocm/tensorflow:latest

  • TensorFlow version (use command below): v2.1.0-15-g5466af3 2.1.0

  • Python version: 3.5.2

  • Bazel version (if compiling from source): Build label: 0.29.1

  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

  • CUDA/cuDNN version: N/A (ROCm stack)

  • GPU model and memory: Radeon VII gfx906

Note: I installed tensorflow_probability in the ROCm docker container using the --no-deps option. After that I had to install an additional dependency or two.

Describe the current behavior

Use of tensorflow_probability methods (e.g. mcmc) is slower on the ROCm GPU stack than on the tf-gpu stack with Nvidia hardware. The test code shows that TensorFlow-only functions in the ROCm stack run at speeds comparable to the CUDA versions, but once tensorflow_probability routines are called on top of those functions, the ROCm stack is much slower. In the results below, NUTS uses the NUTS step method with adaptation (see this) and Function is the execution of the pure TensorFlow log-likelihood function with gradients; it excludes any JIT compile time, since the function is invoked once to warm up and then invoked again for timing. The likelihood uses some linalg and reduce operations. The numbers in the table are runtimes in seconds measured with time.time() differences. A minimal sketch of this setup appears after the table.

Platform                      NUTS (s)   Function (s)
ROCm GPU (Radeon VII)         398.01     0.05
TF GPU (Nvidia Tesla P100)    128.28     0.05

Eyeballing watch -n .1 rocm-smi as the script executes shows GPU usage at 100% most of the time for the GPU tests.
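
For reference, here is a minimal sketch of the structure being timed, using a toy stand-in likelihood and the step-size-adaptation wiring from the TFP NoUTurnSampler docstring pattern; the real model, data, and chain settings live in the gist linked below and will differ.

```python
import time

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Toy stand-in for the real likelihood: some linalg and reduce ops over fake data.
X = tf.random.normal([1000, 10])
y = tf.random.normal([1000])

def log_prob(beta):
    mu = tf.linalg.matvec(X, beta)
    return tf.reduce_sum(tfd.Normal(mu, 1.0).log_prob(y))

@tf.function
def value_and_grad(beta):
    # "Function" column: log-likelihood plus gradients.
    with tf.GradientTape() as tape:
        tape.watch(beta)
        lp = log_prob(beta)
    return lp, tape.gradient(lp, beta)

beta0 = tf.zeros([10])
value_and_grad(beta0)                     # warm-up call: tracing/JIT time excluded
start = time.time()
value_and_grad(beta0)
print("Function:", time.time() - start)

# "NUTS" column: NUTS with dual-averaging step-size adaptation, wired up as in
# the TFP NoUTurnSampler docstring example.
nuts = tfp.mcmc.NoUTurnSampler(target_log_prob_fn=log_prob, step_size=0.1)
adaptive_nuts = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=nuts,
    num_adaptation_steps=400,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
    step_size_getter_fn=lambda pkr: pkr.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio)

@tf.function
def run_chain():
    return tfp.mcmc.sample_chain(
        num_results=500,
        num_burnin_steps=500,
        current_state=beta0,
        kernel=adaptive_nuts,
        trace_fn=None)

start = time.time()
samples = run_chain()
print("NUTS:", time.time() - start)
```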

Describe the expected behavior

Given that the TensorFlow log-likelihood functions are comparable in execution times, I would expect the MCMC sampling times to be much closer. Instead, the ROCm stack takes roughly 3x longer to complete.

Standalone code to reproduce the issue
The script generating these results can be found at this gist

Other info / logs


roblem commented Mar 12, 2020

Adding some more results here that also include CPU. Turning on XLA compilation via @tf.function(experimental_compile=True) leads to huge differences (a minimal sketch of the decorator follows the table):

Platform                               NUTS (s)    Function (s)
ROCm CPU                               832.8751    0.2176
ROCm GPU (Radeon VII)                  386.89      0.0427
TF CPU (No XLA)                        648.65      0.36
TF GPU (No XLA) (Nvidia Tesla P100)    128.28      0.05
TF CPU (XLA)                           629.5549    0.7941
TF GPU (XLA) (Nvidia Tesla P100)       19.6332     0.0219

With XLA on (which is recommended), the CUDA stack is now approximately 20 times faster (NUTS column: ROCm GPU / TF GPU (XLA)).
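
For clarity, enabling XLA in these runs amounts to decorating the log-likelihood with the XLA flag; a minimal sketch, assuming the TF 2.1 API (the argument was renamed jit_compile in later releases) and reusing the toy log_prob inputs from the sketch above:

```python
import tensorflow as tf

# On TF 2.1 the XLA switch on tf.function is `experimental_compile`
# (renamed `jit_compile` in later TF releases). X, y, tfd are the toy
# stand-ins defined in the earlier sketch, not the real model.
@tf.function(experimental_compile=True)
def log_prob_xla(beta):
    mu = tf.linalg.matvec(X, beta)
    return tf.reduce_sum(tfd.Normal(mu, 1.0).log_prob(y))
```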


roblem commented Mar 13, 2020

Here are a few more benchmarks for a more realistic (not toy data) model from my own research. The likelihood function (pure TensorFlow) for Model 2, in particular, has a lot of linalg, scatter_nd, gather_nd, and reduce calls (an illustrative fragment of that op mix follows the table). From the function evaluation times, we see that ROCm is about 40% slower than the XLA CUDA call, which isn't ideal but is understandable given that XLA support hasn't been completed in ROCm yet.

Once the tensorflow_probability mcmc calls are included (e.g. Model 2 NUTS), performance is severely degraded, to the point that the ROCm GPU is slower even than stock TensorFlow on CPU.

Platform                 Model 1 NUTS (s)   Model 2 NUTS (s)   Model 2 Function (ms)
ROCm GPU (Radeon VII)    30.52              1002               14
CUDA CPU (XLA)           8.41               37.05              10
CUDA GPU (XLA)           3.55               8.5                10
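
For a sense of the op mix only (the Model 2 likelihood itself isn't shown here), a hypothetical fragment combining gather_nd, scatter_nd, linalg, and reduce calls of the kind it relies on:

```python
import tensorflow as tf

# Hypothetical fragment illustrating the op mix in the Model 2 likelihood
# (gather_nd / scatter_nd / linalg / reduce); not the actual model.
def likelihood_fragment(params, design, group_idx, y, n_groups):
    # params:    [n_groups, k] per-group coefficients
    # design:    [n_obs, k] covariates
    # group_idx: [n_obs, 1] integer group index per observation
    beta = tf.gather_nd(params, group_idx)              # [n_obs, k]
    mu = tf.reduce_sum(design * beta, axis=-1)          # [n_obs]
    resid = y - mu
    # Scatter squared residuals back to their groups.
    group_sse = tf.scatter_nd(group_idx, tf.square(resid), [n_groups])
    prec = tf.linalg.diag(tf.ones([n_groups]))          # placeholder precision matrix
    return -0.5 * tf.reduce_sum(tf.linalg.matvec(prec, group_sse))
```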


roblem commented Mar 24, 2020

XLA compilation failure might be the root cause of the performance issues outlined above. As I outline in #908, there seems to be a bug in tf.math.bincount that, when fixed, might enable XLA compilation and improve performance. Closing for now.
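
A minimal probe of that kind, purely as a sketch (not the repro from #908), is to wrap a tf.math.bincount call in an XLA-compiled tf.function and see whether compilation succeeds:

```python
import tensorflow as tf

# Sketch only: check whether tf.math.bincount survives XLA compilation
# (on TF 2.1 the flag is `experimental_compile`).
@tf.function(experimental_compile=True)
def compiled_bincount(idx):
    return tf.math.bincount(idx, minlength=5)

print(compiled_bincount(tf.constant([0, 1, 1, 3])))
```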


roblem commented May 2, 2020

The benchmark timings should be ignored here. Instead, please see #954.
