HIP program state re-initialization logic #457
|
/cc @adityaaatluri @sabreshao for awareness. The patch to HIP for program state re-initialization has been updated so that it can be merged with the tip of HIP. Once this PR is merged, projects such as RCCL and PaddlePaddle should run properly. |
|
@whchung Is this PR fine to merge now? Or do we still need to verify some end applications? |
|
TensorFlow is fine, but PaddlePaddle is not. PaddlePaddle's dynamic shared library loading behavior is more complicated, and we hit ROCR runtime limitations there. The PR needs to be improved to construct HSA executables only for newly identified code objects. |
This commit supports kernels that are dynamically loaded through means such as dlopen() after the HIP runtime initializes.
|
Working on reducing the data structures that need to be reinitialized when we hit a missing kernel function address. Pushed one change to minimize the number of HSA executables created. Saw a perf improvement on ROCm 1.7.2, but the gain was smaller on ROCm 1.8.1. There is still room for improvement. |
Keep track of shared libraries already discovered. Do not build HSA executables for them.
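The bookkeeping described in this commit message could be sketched roughly as follows. This is a minimal illustration and not the code in the PR: the class name `DiscoveredLibraries`, the method `take_new`, and the use of library paths as keys are all assumptions made for the example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical tracker (not taken from program_state.cpp): remembers which
// shared libraries earlier scans have already seen.
class DiscoveredLibraries {
public:
    // Returns only the paths that no earlier call has seen, and records them.
    // A caller would build HSA executables for the returned libraries only.
    std::vector<std::string> take_new(const std::vector<std::string>& loaded) {
        std::vector<std::string> fresh;
        for (const auto& path : loaded) {
            // insert().second is true only when the path was not present yet.
            if (seen_.insert(path).second) {
                fresh.push_back(path);
            }
        }
        return fresh;
    }

    // Number of distinct libraries recorded so far.
    std::size_t size() const { return seen_.size(); }

private:
    std::unordered_set<std::string> seen_;
};
```

On each rescan, the caller would pass the full list of currently loaded libraries and process only what `take_new` returns, so libraries handled in an earlier pass are skipped.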
|
@mangupta |
|
@mangupta Earlier today the PaddlePaddle team reported an error, and I've fixed it in commit 32789a8. Now both PaddlePaddle and TensorFlow can properly load dynamic libs without bumping into ROCR runtime limits. Please help review this PR once again and consider merging it. Thanks. Also it would be preferable if this PR could be made into |
|
With the 3 commits below, PaddlePaddle can dynamically load the rccl library and run rccl kernels normally.
|
|
@mangupta thanks! @parallelo now that HIP PR 457 has been merged, I’ll submit a PR to change the Dockerfile for the TensorFlow build. |
This PR tries to re-initialize HIP runtime data structures in program_state.cpp. In applications such as TensorFlow, HIP kernels inside shared libraries may not be identified if the libraries are loaded by dlopen() after the runtime has initialized. Instead of raising an exception immediately, we re-initialize the data structures within program_state.cpp with a rebuild flag.
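The rebuild-on-miss idea in this description can be illustrated with a small sketch. It is written under assumptions and does not reproduce program_state.cpp: `KernelTable`, `Scanner`, and `lookup` are hypothetical names, and the scan callback stands in for whatever the runtime does to enumerate kernels in the currently loaded code objects.

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the runtime's kernel table (not the PR's code).
class KernelTable {
public:
    using Table = std::unordered_map<std::uintptr_t, std::string>;
    using Scanner = std::function<Table()>;  // assumed: rescans loaded code objects

    explicit KernelTable(Scanner scan)
        : scan_(std::move(scan)), table_(scan_()) {}

    // Looks up a kernel by host function address. On a miss, rebuilds the
    // table once before giving up, since a library may have been loaded
    // (for example via dlopen()) after the initial scan.
    const std::string& lookup(std::uintptr_t host_address) {
        auto it = table_.find(host_address);
        if (it == table_.end()) {
            table_ = scan_();  // rebuild instead of throwing immediately
            ++rebuilds_;
            it = table_.find(host_address);
            if (it == table_.end()) {
                throw std::runtime_error("kernel not found after rebuild");
            }
        }
        return it->second;
    }

    // How many times a miss has triggered a rebuild.
    int rebuilds() const { return rebuilds_; }

private:
    Scanner scan_;   // declared before table_ so it is initialized first
    Table table_;
    int rebuilds_ = 0;
};
```

The sketch rebuilds the whole table on every miss, which matches the first version described here. Combining it with tracking of already discovered libraries would avoid redoing work for code objects that were processed earlier.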