
Performance problems using tensorflow_probability #893

Closed
roblem opened this issue Mar 12, 2020 · 4 comments

roblem commented Mar 12, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes.

  • OS Platform and Distribution: Linux Ubuntu 18.04, using the upstream radeon kernel driver and launching ROCm scripts in the latest docker container with TF_ROCM_FUSION_ENABLE=1

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if applicable: N/A

  • TensorFlow installed from (source or binary): installed from docker as rocm/tensorflow:latest

  • TensorFlow version (use command below): v2.1.0-15-g5466af3 2.1.0

  • Python version: 3.5.2

  • Bazel version (if compiling from source): Build label: 0.29.1

  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

  • CUDA/cuDNN version: N/A (ROCm stack)

  • GPU model and memory: Radeon VII gfx906

Note: I installed tensorflow_probability in the ROCm docker container using the --no-deps option. After that I had to install an additional dependency or two.

Describe the current behavior

Use of tensorflow_probability methods (e.g. mcmc) is slower on the ROCm GPU stack than on the tf-gpu stack with Nvidia hardware. The test code shows that TensorFlow-only functions in the ROCm stack run at speeds comparable to the CUDA versions, but once tensorflow_probability routines are called on top of those functions, the ROCm stack is much slower. In the results below, NUTS uses the NUTS step method with adaptation (see this) and Function is the execution of the pure TensorFlow log-likelihood function with gradients; it excludes any JIT compile time, since the function is invoked once to warm up and then invoked again for timing. The likelihood uses some linalg and reduce operations. The numbers in the table are runtimes in seconds measured with time.time() differences. A minimal sketch of this setup appears after the table.

Platform                      NUTS (s)   Function (s)
ROCm GPU (Radeon VII)         398.01     0.05
TF GPU (Nvidia Tesla P100)    128.28     0.05

Eyeballing watch -n .1 rocm-smi as the script executes shows GPU usage at 100% most of the time for the GPU tests.
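
For reference, here is a minimal sketch of the structure being timed, using a toy stand-in likelihood and the step-size-adaptation wiring from the TFP NoUTurnSampler docstring pattern; the real model, data, and chain settings live in the gist linked below and will differ.

```python
import time

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Toy stand-in for the real likelihood: some linalg and reduce ops over fake data.
X = tf.random.normal([1000, 10])
y = tf.random.normal([1000])

def log_prob(beta):
    mu = tf.linalg.matvec(X, beta)
    return tf.reduce_sum(tfd.Normal(mu, 1.0).log_prob(y))

@tf.function
def value_and_grad(beta):
    # "Function" column: log-likelihood plus gradients.
    with tf.GradientTape() as tape:
        tape.watch(beta)
        lp = log_prob(beta)
    return lp, tape.gradient(lp, beta)

beta0 = tf.zeros([10])
value_and_grad(beta0)                     # warm-up call: tracing/JIT time excluded
start = time.time()
value_and_grad(beta0)
print("Function:", time.time() - start)

# "NUTS" column: NUTS with dual-averaging step-size adaptation, wired up as in
# the TFP NoUTurnSampler docstring example.
nuts = tfp.mcmc.NoUTurnSampler(target_log_prob_fn=log_prob, step_size=0.1)
adaptive_nuts = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=nuts,
    num_adaptation_steps=400,
    step_size_setter_fn=lambda pkr, new_step_size: pkr._replace(step_size=new_step_size),
    step_size_getter_fn=lambda pkr: pkr.step_size,
    log_accept_prob_getter_fn=lambda pkr: pkr.log_accept_ratio)

@tf.function
def run_chain():
    return tfp.mcmc.sample_chain(
        num_results=500,
        num_burnin_steps=500,
        current_state=beta0,
        kernel=adaptive_nuts,
        trace_fn=None)

start = time.time()
samples = run_chain()
print("NUTS:", time.time() - start)
```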

Describe the expected behavior

Given that the TensorFlow log-likelihood functions are comparable in execution times, I would expect the MCMC sampling times to be much closer. Instead, the ROCm stack takes roughly 3x longer to complete.

Standalone code to reproduce the issue
The script generating these results can be found at this gist

Other info / logs


roblem commented Mar 12, 2020

Adding some more results here that also include CPU. Turning on XLA compilation via @tf.function(experimental_compile=True) leads to huge differences (a minimal sketch of the decorator follows the table):

Platform                               NUTS (s)    Function (s)
ROCm CPU                               832.8751    0.2176
ROCm GPU (Radeon VII)                  386.89      0.0427
TF CPU (No XLA)                        648.65      0.36
TF GPU (No XLA) (Nvidia Tesla P100)    128.28      0.05
TF CPU (XLA)                           629.5549    0.7941
TF GPU (XLA) (Nvidia Tesla P100)       19.6332     0.0219

With XLA on (which is recommended), the CUDA stack is now approximately 20 times faster (NUTS column: ROCm GPU / TF GPU (XLA)).
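
For clarity, enabling XLA in these runs amounts to decorating the log-likelihood with the XLA flag; a minimal sketch, assuming the TF 2.1 API (the argument was renamed jit_compile in later releases) and reusing the toy log_prob inputs from the sketch above:

```python
import tensorflow as tf

# On TF 2.1 the XLA switch on tf.function is `experimental_compile`
# (renamed `jit_compile` in later TF releases). X, y, tfd are the toy
# stand-ins defined in the earlier sketch, not the real model.
@tf.function(experimental_compile=True)
def log_prob_xla(beta):
    mu = tf.linalg.matvec(X, beta)
    return tf.reduce_sum(tfd.Normal(mu, 1.0).log_prob(y))
```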


roblem commented Mar 13, 2020

Here are a few more benchmarks for a more realistic (not toy data) model from my own research. The likelihood function (pure TensorFlow) for Model 2, in particular, has a lot of linalg, scatter_nd, gather_nd, and reduce calls (an illustrative fragment of that op mix follows the table). From the function evaluation times, we see that ROCm is about 40% slower than the XLA CUDA call, which isn't ideal but is understandable given that XLA support hasn't been completed in ROCm yet.

Once the tensorflow_probability mcmc calls are included (e.g. Model 2 NUTS), performance is severely degraded, to the point that the ROCm GPU is slower even than stock TensorFlow on CPU.

Platform                 Model 1 NUTS (s)   Model 2 NUTS (s)   Model 2 Function (ms)
ROCm GPU (Radeon VII)    30.52              1002               14
CUDA CPU (XLA)           8.41               37.05              10
CUDA GPU (XLA)           3.55               8.5                10
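
For a sense of the op mix only (the Model 2 likelihood itself isn't shown here), a hypothetical fragment combining gather_nd, scatter_nd, linalg, and reduce calls of the kind it relies on:

```python
import tensorflow as tf

# Hypothetical fragment illustrating the op mix in the Model 2 likelihood
# (gather_nd / scatter_nd / linalg / reduce); not the actual model.
def likelihood_fragment(params, design, group_idx, y, n_groups):
    # params:    [n_groups, k] per-group coefficients
    # design:    [n_obs, k] covariates
    # group_idx: [n_obs, 1] integer group index per observation
    beta = tf.gather_nd(params, group_idx)              # [n_obs, k]
    mu = tf.reduce_sum(design * beta, axis=-1)          # [n_obs]
    resid = y - mu
    # Scatter squared residuals back to their groups.
    group_sse = tf.scatter_nd(group_idx, tf.square(resid), [n_groups])
    prec = tf.linalg.diag(tf.ones([n_groups]))          # placeholder precision matrix
    return -0.5 * tf.reduce_sum(tf.linalg.matvec(prec, group_sse))
```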


roblem commented Mar 24, 2020

XLA compilation failure might be the root cause of the performance issues outlined above. As I outline in #908, there seems to be a bug in tf.math.bincount that, when fixed, might enable XLA compilation and improve performance. Closing for now.
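
A minimal probe of that kind, purely as a sketch (not the repro from #908), is to wrap a tf.math.bincount call in an XLA-compiled tf.function and see whether compilation succeeds:

```python
import tensorflow as tf

# Sketch only: check whether tf.math.bincount survives XLA compilation
# (on TF 2.1 the flag is `experimental_compile`).
@tf.function(experimental_compile=True)
def compiled_bincount(idx):
    return tf.math.bincount(idx, minlength=5)

print(compiled_bincount(tf.constant([0, 1, 1, 3])))
```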


roblem commented May 2, 2020

The benchmark timings should be ignored here. Instead, please see #954.
