Tensorflow probability very slow on GPU (even with XLA) for some models #954
@roblem Thanks for the effort of putting together benchmarks and sharing such detailed results. I can confirm that I can reproduce the issue. I can try to root-cause it, but without the source code I can't guarantee to pinpoint and fix the exact issue in your Model 2. It is much easier for me to start the triage once you have isolated the issue to a narrower scope. To understand what causes this warning message to show up, we would typically exclude each op being used, one at a time. For example, in your shared code I can see the following operators being used:
I wonder if you could try to swap out or remove each of the ops above, one at a time, just to see how the runtime changes (don't worry about correctness for now). You will likely observe the biggest runtime drop after removing one of the ops in the list. Let me know if this works and whether you can isolate it to a specific op. Looking at the list above, I think it is less likely to be related to
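The per-op exclusion described above can be scripted with a small timing harness. This is a generic sketch in plain Python (the variant labels and toy workloads here are hypothetical stand-ins, not code from the issue): each variant is the log-likelihood with one suspect op stubbed out, and a large runtime drop for one label points at the slow op.

```python
import time

def time_fn(fn, n_repeats=5):
    """Return the best wall-clock time over n_repeats calls."""
    best = float("inf")
    for _ in range(n_repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def triage(variants, n_repeats=5):
    """Time each variant of the target function.

    `variants` maps a label (e.g. the op that was removed) to a
    callable implementing the function with that op stubbed out.
    """
    return {label: time_fn(fn, n_repeats) for label, fn in variants.items()}

# Toy usage: two "variants" with very different costs, standing in for
# the full likelihood and the likelihood with one op removed.
baseline = lambda: sum(i * i for i in range(50_000))
without_op = lambda: sum(range(1_000))
results = triage({"baseline": baseline, "without_suspect_op": without_op})
```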
The warning
To start triaging, here is a list of the functions behind each benchmark result submitted above. I have relabelled each row in the table to make it easier to compare with the timings above; each row in the timings above is now a column. Note that
Some thoughts on where to start triaging:
I have added two additional rows to the benchmarks table (15 and 16) and shown the functions used in the previous comment. Model 2 with Random Walk Metropolis runs very fast on the Radeon VII, which is consistent with the fast run times of the tensorflow function execution (12). Judging by this, the problem seems to be in either
Focusing on Model 2, here is a summary of times with different step methods (these are all timings on GPU only; I include the ID for matching to the results above where there is a match). All sampling and adaptive step methods are tensorflow probability ops (e.g.
Conclusions:
It is important to note that ALL of these timings use the same tensorflow ops, and that function calculations using these ops are faster under ROCM than CUDA, as was demonstrated by comparing
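The pattern above (Random Walk Metropolis fast, gradient-based kernels slow, while the likelihood itself is fast) is easier to reason about when you recall what each step method evaluates: random-walk Metropolis needs only one extra log-likelihood evaluation per step and no gradients, whereas NUTS/HMC differentiate the log-likelihood many times per step. A minimal pure-Python sketch of the random-walk accept/reject step (illustrative only, not the tensorflow_probability implementation):

```python
import math
import random

def random_walk_metropolis(log_prob, x0, n_steps, scale=0.5, seed=0):
    """Minimal 1-D random-walk Metropolis chain.

    Each step costs one extra log-probability evaluation and no
    gradients, unlike gradient-based kernels such as NUTS/HMC.
    """
    rng = random.Random(seed)
    chain = [x0]
    current_lp = log_prob(x0)
    for _ in range(n_steps):
        proposal = chain[-1] + rng.gauss(0.0, scale)
        proposal_lp = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if math.log(rng.random()) < proposal_lp - current_lp:
            chain.append(proposal)
            current_lp = proposal_lp
        else:
            chain.append(chain[-1])
    return chain

# Toy target: a standard normal (log-density up to a constant).
chain = random_walk_metropolis(lambda x: -0.5 * x * x, x0=3.0, n_steps=2000)
```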
Thanks for the effort. We discussed offline and confirmed that non-XLA mode also suffers in Model 2. I reran the
Hi, I just reran
Thanks for your comments. Yes, I also tend to think the performance issue is orthogonal to the warning message. More triage needs to be done, and I will keep the issue open in the meantime.
This is an update to issue #893, which was closed because XLA compilation was failing, making apples-to-apples comparisons impossible. XLA compilation was fixed with #908, and I include updated benchmarks here. This issue therefore supersedes #893, and all benchmarks there should be ignored.
System information
Describe the current behavior
I have run some benchmarks under both ROCM (on Radeon VII) and Cuda (on P100 and V100) stacks. For the most part, ROCM is very much on par with Cuda but in one example is failing fairly spectacularly in terms of runtimes (but does eventually return a reasonable result).
Describe the expected behavior
I would expect similar runtimes across all models considered here.
Standalone code to reproduce the issue
Other info / logs
rocm.log
The biggest difference I see in comparing the CUDA and ROCM logs is the repeated warning:
I don't see this warning on the CUDA stack. I see this warning in all cases, including the cases where ROCM compares favorably to CUDA in terms of timing. Also, the code runs without modification on both CUDA and ROCM.
Timing Benchmarks
Linear Regression
I have run two benchmarks using tensorflow probability's Markov chain Monte Carlo library. The first uses simple linear regression with a large number of parameters, and I time a few hundred samples using both random-walk Metropolis-Hastings (MHRW) and the No-U-Turn Sampler (NUTS) for generating proposals. NUTS is a gradient-based sampler, whereas MHRW is not. I also include timings for the likelihood function calculation. Columns are the same code run over different software stacks/hardware. Here are the timings:
These timings are single runs using differences in
time.time()
. We see that the Radeon VII holds its own here against CUDA and in some cases is much faster. The row most relevant for most researchers is the final one (NUTS Samples (GPU)): it shows that ROCM on the Radeon VII falls somewhere between a P100 and a V100. The code for these is here for MHRW and here for NUTS.
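One caveat when timing single runs with `time.time()`: with XLA enabled, the first call to a compiled function typically includes one-time tracing and compilation cost, which can swamp the steady-state number. A small hedged sketch of a timing helper that discards warm-up calls (a generic pattern, not the exact harness used for these benchmarks):

```python
import time

def benchmark(fn, warmup=1, repeats=3):
    """Time fn, discarding warm-up calls so one-time tracing /
    XLA compilation cost is not counted in the reported number."""
    for _ in range(warmup):
        fn()  # discarded: may include compile time on the first call
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Toy workload standing in for a compiled sampling call.
elapsed = benchmark(lambda: sum(i for i in range(10_000)))
```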
Two other models
The next set of benchmarks covers two models with many more parameters. Both are implemented using custom likelihood functions. Both models are similar to softmax, the difference being that Model 2 has many more parameters. We see that ROCM on GPU performs very well for Model 1 (faster than either CUDA stack) but is spectacularly slower on Model 2. In Model 2, the CUDA stack shows an approximate 6x (3.5x) GPU-over-CPU speedup on the V100 (P100), but the ROCM stack is so much slower on GPU that you are better off running on CPU. For each model, I run a function calculation over the log-likelihood, which is based solely on tensorflow ops. These functions are used when sampling.
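Since the actual models are private, here is a hypothetical sketch of the kind of custom softmax-style log-likelihood being described, in plain Python with the log-sum-exp trick for stability (in the real models this would be a chain of tensorflow ops such as matmuls and reductions evaluated on GPU; none of these names come from the issue):

```python
import math

def softmax_log_likelihood(logits, labels):
    """Log-likelihood of categorical labels under softmax(logits).

    Uses the log-sum-exp trick: log softmax(x)[y] = x[y] - logsumexp(x).
    """
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max before exponentiating
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += row[y] - log_z
    return total

# Tiny example: two observations over three categories.
ll = softmax_log_likelihood([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]], [0, 2])
```

A uniform logit row contributes exactly -log(K) for K categories, which is a handy sanity check for implementations like this.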
I am hesitant to publicly share the code for this example since it is from active, ongoing research that isn't published yet. I would be willing to share it privately, or I could follow tips to debug why CUDA is 150x faster than ROCM for Model 2 on GPU. For some of these benchmarks, I've been unable to run on both the P100 and V100.