
HIP program state re-initialization logic#457

Merged
mangupta merged 3 commits into ROCm:master from whchung:hip-reinit
Jun 20, 2018

Conversation

@whchung
Contributor

@whchung whchung commented May 18, 2018

This PR re-initializes HIP runtime data structures in program_state.cpp. In applications such as TensorFlow, it was evident that HIP kernels inside shared libraries may not be identified when those libraries are loaded by dlopen() after the runtime has initialized. Instead of raising an exception immediately, we re-initialize the data structures within program_state.cpp under a rebuild flag.

@whchung whchung requested review from AlexVlx and mangupta May 18, 2018 15:16
@whchung
Contributor Author

whchung commented May 18, 2018

/cc @adityaaatluri @sabreshao for awareness

The patch to HIP for program state re-initialization has been updated so it can be merged with the tip of HIP. Once this PR is merged, projects such as RCCL and PaddlePaddle should run properly.

@whchung
Contributor Author

whchung commented May 21, 2018

The build issue in CI run #5 should be unrelated to this PR, since the PR already built successfully in CI run #4.

Meanwhile, please hold this PR while we confirm that end applications function properly with it.

@mangupta
Contributor

mangupta commented Jun 1, 2018

@whchung Is this PR fine to merge now? Or do we still need to verify some end applications?

@whchung
Contributor Author

whchung commented Jun 1, 2018

TensorFlow is fine, but PaddlePaddle is not. PaddlePaddle's dynamic shared library loading behavior is more complicated, and we hit ROCR runtime limitations. The PR needs to be improved to construct HSA executables only for newly identified code objects.

This commit supports kernels dynamically loaded through means such as
dlopen() after the HIP runtime initializes.
@whchung
Contributor Author

whchung commented Jun 15, 2018

Working on reducing the data structures that need to be re-initialized when we hit a missing kernel function address. Pushed one change to minimize the number of HSA executables created. Saw a performance improvement on ROCm 1.7.2, but the gain was smaller on ROCm 1.8.1. There is still room for improvement.

Keep track of shared libraries already discovered. Do not build HSA executables
for them.
@whchung
Contributor Author

whchung commented Jun 18, 2018

@mangupta
TensorFlow nightly builds have been using the updated PR for three days and everything looks fine. Now we just need confirmation from the PaddlePaddle team so we can merge this in.

@whchung
Contributor Author

whchung commented Jun 19, 2018

@mangupta Earlier today the PaddlePaddle team reported an error, and I've fixed it in commit 32789a8. Now both PaddlePaddle and TensorFlow can properly load dynamic libraries without hitting ROCR runtime limits. Please review this PR once again and consider merging it. Thanks.

Also, it would be preferable if this PR could make it into the roc-1.8.x release branch for the upcoming 1.8.2 release.

@jichangjichang

With the three commits below, PaddlePaddle can dynamically load the RCCL library and run RCCL kernels normally.

  • HIP program state re-initialization logic
  • Improve performance of re-initialization logic
  • Keep the map which tracks GPU kernel symbols to grow monotonically

@mangupta mangupta merged commit 8366272 into ROCm:master Jun 20, 2018
@whchung
Contributor Author

whchung commented Jun 20, 2018

@mangupta thanks!

@parallelo now that HIP PR 457 has been merged, I'll submit a PR to change the Dockerfile for the TensorFlow build.

