errors in pin-in-place path in HCC unpinned copy engine #27
/cc @scchan Please help collect a more detailed log of the copies with HCC_DB. |
Logging a new error with the same benchmark inputs. Gathering more detail next with HCC_DB. |
Ran with HCC_DB=0x48A. Output attached. |
0x48A is good for diagnosing faulty kernels but not very good for copies. Please check the HCC header file and try other flags so that more details of the copies are shown. |
I ran 0x0302 before your suggestion of 0x68A. Both attached. The 0x0302 run took longer for the error to appear. 0x68A showed the error relatively early. |
We need more logging, especially from the Python layer, to understand which command we are in. Also, to make the behavior more deterministic, could you try running the benchmark with additional env vars appended? |
verbose1_serialize.txt.gz |
Setting
did not avoid the exception. |
I suppose it is worth noting that our TF 1.3 sources ran fine. I added __func__, __FILE__, and __LINE__ output to the THROW_ERROR macro in unpinned_copy_engine.cpp, built HCC from source, and found the exception is thrown at line 193 with hsaErr = 4096, i.e. just the generic 0x1000 HSA_STATUS_ERROR. |
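For readers following along, a minimal sketch of that kind of instrumentation, assuming a snprintf-plus-throw style macro; the real THROW_ERROR in unpinned_copy_engine.cpp throws HCC's own exception type and looks different.

```cpp
#include <cstdio>
#include <stdexcept>

// Illustrative only: an error macro augmented so the exception message carries
// the function, file, and line where it fired, plus the raw HSA status value.
#define THROW_ERROR(msg, hsaErr)                                          \
  do {                                                                    \
    char buf_[256];                                                       \
    std::snprintf(buf_, sizeof(buf_), "%s (hsaErr=%d) at %s %s:%d",       \
                  (msg), static_cast<int>(hsaErr), __func__, __FILE__,    \
                  __LINE__);                                              \
    throw std::runtime_error(buf_);                                       \
  } while (0)

int main() {
  try {
    THROW_ERROR("HCC unpinned copy engine error", 4096);  // 4096 == 0x1000
  } catch (const std::exception& e) {
    std::fprintf(stderr, "%s\n", e.what());
  }
  return 0;
}
```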
I dug into hsa_amd_memory_lock to see what causes it to return HSA_STATUS_ERROR. It returns it in two different cases, so I changed the return code in the first case to differentiate it from the latter, then installed ROCR-Runtime from source and reran the benchmark code. It still returns HSA_STATUS_ERROR, which tells me it's failing in RegisterMemory (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/bfc4b9e98cb5e48d9f96287371d518cb444968ba/src/core/runtime/amd_memory_region.cpp#L611), in the call to hsaKmtRegisterMemoryToNodes(). |
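The bisection trick described there, in schematic form; none of these names are the real ROCR-Runtime symbols, they just stand in for a routine with two failure paths that share one generic status code.

```cpp
#include <cstddef>

// Schematic of the debugging technique only: temporarily give one of two
// identical failure returns a unique value so the caller can tell them apart.
enum Status { STATUS_SUCCESS = 0, STATUS_ERROR = 0x1000, STATUS_ERROR_DEBUG = 0x1001 };

static bool PinPages(void*, std::size_t) { return true; }          // stub
static bool RegisterToNodes(void*, std::size_t) { return false; }  // stub

Status LockMemory(void* ptr, std::size_t size) {
  if (!PinPages(ptr, size)) {
    return STATUS_ERROR_DEBUG;  // temporarily unique (was STATUS_ERROR)
  }
  if (!RegisterToNodes(ptr, size)) {
    return STATUS_ERROR;        // unchanged; the observed failure still lands here
  }
  return STATUS_SUCCESS;
}
```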
Let's step back a bit. By "our TF1.3 sources ran fine" do you mean that, on the same lower-level bits (KFD / ROCR / HCC / HIP), TF 1.3 + the older TF benchmarks work while TF 1.8 + the tip of the TF benchmarks does not? |
Yes, on the same lower-level bits TF 1.3 + the older TF benchmarks work while TF 1.8 + the tip of the TF benchmarks does not. |
Got a new error, but this time trying 1 ps and 1 worker. The worker failed during warmup with
|
According to @jeffdaily the issue only happens on a particular configuration of the ROCm stack and is not observable in others. Awaiting more info. |
Upgraded bazel from 0.10.1 to 0.12.0; no change. Switched to branch v1.8-rocm and can no longer reproduce the error. I've only tested these two branches (develop-upstream and v1.8-rocm), so I can't generalize; just something to keep in mind if this issue shows up again, perhaps for a v1.9-rocm release or in the develop-upstream branch. Also worth noting: we can't use the benchmarks repo beyond tensorflow/benchmarks@3b90c14. There are two okay commits on May 31st, but the last commit that day, tensorflow/benchmarks@fef2030, introduces native collective reduce mode (tensorflow.python.ops.collective_ops), which was added after v1.8. |
Reopening since using the grpc path causes the same issue again. Since TF 1.3, the grpc code path has been optimized so that it does not serialize tensors into a protobuf. Instead, it allocates a new Tensor on the host, copies from device to host, and then uses the host Tensor's byte buffer directly. I am using branch r1.8-rocm at or after commit 32b762d; that commit enables ROCm for the GPU device-to-host copy, otherwise you just get "No GPU device in process" at runtime when using grpc. I am running on two hosts, one designated the 'worker' and the other the 'ps' (parameter server).
The 'worker' python process terminates with the HCC unpinned copy engine error during warmup. |
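A schematic of that path in plain HIP/C++, purely for illustration; this is not TensorFlow's grpc_worker_service code, just the shape of it: every tensor returned over grpc now triggers a device-to-host copy into freshly allocated, unpinned host memory, which on ROCm goes through HCC's unpinned copy engine.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
  const size_t bytes = 1 << 20;  // pretend this is one response tensor

  void* device_tensor = nullptr;
  if (hipMalloc(&device_tensor, bytes) != hipSuccess) return 1;

  // "Allocate a new Tensor on the host": here just an ordinary heap buffer,
  // i.e. unpinned host memory.
  std::vector<char> host_tensor(bytes);

  // Device-to-host copy; with unpinned host memory this is the copy that goes
  // through HCC's pin-in-place or staging-buffer path.
  if (hipMemcpy(host_tensor.data(), device_tensor, bytes,
                hipMemcpyDeviceToHost) != hipSuccess) {
    std::fprintf(stderr, "D2H copy failed\n");
    return 1;
  }

  // The grpc response would now reference host_tensor.data()/bytes directly
  // instead of serializing the tensor into a protobuf.
  hipFree(device_tensor);
  return 0;
}
```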
I reduced |
@jeffdaily Per discussion I'll need your help to walk me through steps to reproduce the issue so we can dive deeper into this. I assume the comments above are for me to study, correct? |
@whchung yes I've been trying to capture as much detail as I can for you and others to be able to reproduce this issue. |
@jeffdaily we'll likely have to study the internals of grpc_worker_service.cc better. As an initial experiment please see what happens if we reduce kGrpcWorkerServiceThreadCount to 1. |
I agree I will need to study the distributed runtime better.
I set kGrpcWorkerServiceThreadCount to 1 and rebuilt TF. No change in behavior. Still getting the copy engine error during warmup.
|
@jeffdaily based on the comments/logs you gathered so far it seems we are failing at:
This routine tries to pin host memory pages so they are visible to the GPU. I'm guessing we may be pinning too much host memory, causing the ROCR runtime to fail. At the TensorFlow level we'll need to understand how many KB/MB of memory have been pinned this way. At the HCC runtime level, we can probably use |
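For context, the pin-in-place path boils down to hsa_amd_memory_lock / hsa_amd_memory_unlock in the ROCR runtime. A minimal sketch of that call, assuming hsa_init has been called and the GPU agent has already been found via hsa_iterate_agents; error handling is trimmed.

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstdio>

// Pin ("lock") ordinary host memory so the given GPU agent can DMA to/from it.
// `agent_ptr` receives the address the agent should use for the transfer.
hsa_status_t pin_for_agent(void* host_ptr, size_t size, hsa_agent_t gpu_agent,
                           void** agent_ptr) {
  hsa_status_t s =
      hsa_amd_memory_lock(host_ptr, size, &gpu_agent, /*num_agent=*/1, agent_ptr);
  if (s != HSA_STATUS_SUCCESS) {
    std::fprintf(stderr, "hsa_amd_memory_lock failed: 0x%x\n",
                 static_cast<unsigned>(s));
  }
  return s;
}

// Every successful lock must be balanced by an unlock, otherwise pinned pages
// accumulate; leaking pinned memory is one of the suspicions raised in this thread.
void unpin(void* host_ptr) { hsa_amd_memory_unlock(host_ptr); }
```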
It seems all transfers are done via |
When preparing the protobuf response for requested Tensors in ... @whchung I would like you to take a look at |
I just set HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) and was able to successfully run. |
@jeffdaily sounds good. let me rejoin you after i finish my initial sweep of tensorflow unit tests. |
/cc @fxkamd for awareness |
Per the discussion thus far it seems the issue is in the implementation of ... @jeffdaily for now please try running the benchmarks with |
@gstoner we need people with KFD knowledge here. We are seeing trivial host memory pinning give errors. |
What I know:
We suspected we might be leaking pinned memory; however, setting the threads to 1 should not change the number of times we call hsa_amd_memory_lock, and we don't see the problem single-threaded. The problem shows up when we're doing lots of H2D copies, either the tf_cnn_benchmarks.py with |
In your log it looked like you were getting into trouble because different threads were registering the same pointers to different nodes. The registration to specific nodes is just a vestige in the Thunk API. It's mostly replaced by mapping to specific nodes now. I think I should be able to add a workaround in the Thunk by ignoring the node list in hsaKmtRegisterMemoryToNodes. This may break some KFDTest unit tests though. A better fix would be in the Runtime to use hsaKmtRegisterMemory instead of hsaKmtRegisterMemoryToNodes. That would have the same effect without an ugly workaround in the Thunk. |
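For reference, the two Thunk entry points being contrasted there; the signatures below are from libhsakmt's hsakmt.h as I recall them, so treat this as a sketch and verify against the actual header.

```cpp
#include <hsakmt.h>

// Current ROCR path: registration is tied to an explicit node list, which is
// what different threads were racing on for the same pointer.
HSAKMT_STATUS register_to_nodes(void* ptr, HSAuint64 bytes,
                                HSAuint32* nodes, HSAuint64 num_nodes) {
  return hsaKmtRegisterMemoryToNodes(ptr, bytes, num_nodes, nodes);
}

// Suggested alternative: the node-agnostic variant has the same effect for
// this use case, so concurrent registrations of the same pointer can no
// longer disagree about which nodes it belongs to.
HSAKMT_STATUS register_node_agnostic(void* ptr, HSAuint64 bytes) {
  return hsaKmtRegisterMemory(ptr, bytes);
}
```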
Update on the race condition: it's in the HCC runtime. The reason I thought it was the function ... Turns out we weren't that far off. I had assumed there was only one copy engine; it turns out there are two copy engine instances per GPU, aka |
@jeffdaily Could you weigh in on why a file-scope mutex doesn't work? |
A file-scope mutex might work. My first file-scope mutex only protected the |
@jeffdaily I just want to have something working as the baseline, and let folks more dedicated to the HCC runtime improve it if needed. Note that in the HIP runtime layer there is a team working on switching the backend to something called VDI, which natively supports both Linux and Windows, so we may or may not see this issue again when we make the switch from HIP-HCC to HIP-VDI. |
I replaced the mutex in both H2DPinInPlace and D2HPinInPlace with a single file-scope mutex. Problem solved. However, my performance concerns might be warranted: the resnet benchmark on 4 GPUs is faster if we set HCC_UNPINNED_COPY_MODE=2 to use staged copies only. If we allow the pinned copies, it's slightly slower. |
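The shape of that fix, as a hedged sketch rather than the actual unpinned_copy_engine.cpp change: one file-scope mutex serializing both pin-in-place directions instead of a separate lock per routine. The copy bodies here are empty stand-ins.

```cpp
#include <cstddef>
#include <mutex>

// One mutex shared by both pin-in-place paths, so an H2D copy and a D2H copy
// (or the two copy-engine instances on the same GPU) can never pin/unpin the
// same host pages concurrently.
static std::mutex g_pin_in_place_mutex;

static void do_pinned_h2d_copy(void*, const void*, std::size_t) {}  // stub
static void do_pinned_d2h_copy(void*, const void*, std::size_t) {}  // stub

void H2DPinInPlace(void* dst, const void* src, std::size_t n) {
  std::lock_guard<std::mutex> lock(g_pin_in_place_mutex);
  do_pinned_h2d_copy(dst, src, n);
}

void D2HPinInPlace(void* dst, const void* src, std::size_t n) {
  std::lock_guard<std::mutex> lock(g_pin_in_place_mutex);  // same mutex
  do_pinned_d2h_copy(dst, src, n);
}
```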
Using the discussed fix of an 549.19 images/s pinned with set-check |
The conclusion of the ticket is that there are race conditions in HCC. A preliminary PR has been filed by @jeffdaily against the HCC runtime, and @scchan will continue to drive it to conclusion. Meanwhile, from the TensorFlow perspective, setting HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) is the workaround for now. |
Using latest develop-upstream branch and latest benchmarks master.
Running the tf_cnn_benchmarks.py code like so:
Eventually, during warmup, it produces the following message:
If you set
--local_parameter_device=gpu
instead, the problem doesn't manifest. However, the problem happens again even with
--local_parameter_device=gpu
during distributed training, running 1 worker and 1 server like so:
At least with the distributed training, my guess is that tensors are moving from GPU to CPU prior to being packed into protobufs and shipped via grpc. I'm not sure why this also happens during warm-up, except that I specified the parameter device to be CPU, forcing a device-to-host copy for storing the params.
misc system info
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
lscpu
AMD EPYC 7551 32-Core Processor
uname -a
Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset
rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]