
HIP program state re-initialization logic#457

Merged
mangupta merged 3 commits into ROCm:master from whchung:hip-reinit
Jun 20, 2018

Conversation

@whchung
Contributor

@whchung whchung commented May 18, 2018

This PR re-initializes HIP runtime data structures in program_state.cpp. In applications such as TensorFlow, it was evident that HIP kernels inside shared libraries may not be identified when those libraries are loaded by dlopen() after the runtime has initialized. Instead of raising an exception immediately, we re-initialize the data structures within program_state.cpp under a rebuild flag.

@whchung whchung requested review from AlexVlx and mangupta May 18, 2018 15:16
@whchung
Contributor Author

whchung commented May 18, 2018

/cc @adityaaatluri @sabreshao for awareness

The patch to HIP for program state re-initialization has been updated so it can be merged with the tip of HIP. Once this PR is merged, projects such as RCCL and PaddlePaddle should run properly.

@whchung
Contributor Author

whchung commented May 21, 2018

The build issue in CI run #5 should be unrelated to this PR, since the PR already built successfully in CI run #4.

Meanwhile, please hold this PR while we confirm that end applications function properly with it.

@mangupta
Contributor

mangupta commented Jun 1, 2018

@whchung Is this PR fine to merge now? Or do we still need to verify some end applications?

@whchung
Contributor Author

whchung commented Jun 1, 2018

TensorFlow is fine, but PaddlePaddle is not. PaddlePaddle's dynamic shared library loading behavior is more complicated, and we hit ROCR runtime limitations. The PR needs to be improved to construct HSA executables only for newly identified code objects.

This commit supports kernels dynamically loaded through means such as
dlopen() after the HIP runtime initializes.
@whchung
Contributor Author

whchung commented Jun 15, 2018

Working on reducing the data structures that need to be re-initialized when we hit a missing kernel function address. Pushed one change to minimize the number of HSA executables created. Saw a performance improvement on ROCm 1.7.2, but the gain was smaller on ROCm 1.8.1. There is still room for improvement.

Keep track of shared libraries already discovered. Do not build HSA executables
for them.
@whchung
Contributor Author

whchung commented Jun 18, 2018

@mangupta
TensorFlow nightly builds have been using the updated PR for three days and everything looks fine. Now we just need confirmation from the PaddlePaddle team so we can merge this in.

@whchung
Contributor Author

whchung commented Jun 19, 2018

@mangupta Earlier today the PaddlePaddle team reported an error, and I've fixed it in commit 32789a8. Now both PaddlePaddle and TensorFlow can properly load dynamic libraries without hitting ROCR runtime limits. Please review this PR once again and consider merging it. Thanks.

Also, it would be preferable if this PR could make it into the roc-1.8.x release branch for the upcoming 1.8.2 release.

@jichangjichang

With the three commits below, PaddlePaddle can dynamically load the RCCL library and run RCCL kernels normally.

  • HIP program state re-initialization logic
  • Improve performance of re-initialization logic
  • Keep the map which tracks GPU kernel symbols to grow monotonically

@mangupta mangupta merged commit 8366272 into ROCm:master Jun 20, 2018
@whchung
Contributor Author

whchung commented Jun 20, 2018

@mangupta thanks!

@parallelo now that HIP PR 457 has been merged, I'll submit a PR to change the Dockerfile for the TensorFlow build.

