
errors in pin-in-place path in HCC unpinned copy engine #27

Closed
jeffdaily opened this issue Jun 13, 2018 · 39 comments

jeffdaily commented Jun 13, 2018

Using the latest develop-upstream branch and the latest benchmarks master.
Running the tf_cnn_benchmarks.py code like so:

python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

Eventually, during warmup, it produces the following message:

terminate called after throwing an instance of 'Kalmar::runtime_exception'
  what():  HCC unpinned copy engine error
Aborted (core dumped)

If you set --local_parameter_device=gpu instead, the problem doesn't manifest.

However, the problem happens again, even with --local_parameter_device=gpu, during distributed training. Running 1 worker and 1 parameter server like so:

# worker
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc
# ps
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc

At least with the distributed training, my guess is that tensors are moving from GPU to CPU prior to being packed into protobufs and shipped via grpc. Not sure why this is also happening during warm-up, except that I specified the parameter device to be CPU, forcing a device-to-host copy to store the params.

misc system info

c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

lscpu

AMD EPYC 7551 32-Core Processor

uname -a
Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset

rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]


whchung commented Jun 14, 2018

/cc @scchan

Please help collect a more detailed log of the copies with HCC_DB.

@jeffdaily

Logging a new error, same inputs to benchmark.

Memory access fault by GPU node-6 (Agent handle: 0x17e9ab0) on address 0x531981e000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Gathering more detail next with HCC_DB.

@jeffdaily

Ran with HCC_DB=0x48A. Output attached.
debug.txt


whchung commented Jun 14, 2018

0x48A is good for diagnosing faulty kernels but not very good for copies. Please check the HCC header file and try other flags so more details of the copies are shown.


whchung commented Jun 14, 2018

@jeffdaily

I ran 0x0302 before your suggestion of 0x68A. Both attached. The 0x0302 run took longer for the error to appear. 0x68A showed the error relatively early.
debug_0x68A.txt
debug_0x0302.txt


whchung commented Jun 14, 2018

We need more logs, especially from the Python layer, to understand which command we are in.

Also, to make the behavior more deterministic, could you try rerunning the benchmark with the additional env vars HCC_SERIALIZE_KERNEL=3 HCC_SERIALIZE_COPY=3?


jeffdaily commented Jun 14, 2018

HCC_DB=0x68A
HCC_SERIALIZE_KERNEL=3
HCC_SERIALIZE_COPY=3
TF_CPP_MIN_VLOG_LEVEL=1

verbose1_serialize.txt.gz
It was 75MB uncompressed so I gzipped it before uploading.

@jeffdaily

Setting

HSA_ENABLE_SDMA=0
HSA_ENABLE_INTERRUPT=0
HSA_DISABLE_CACHE=1

did not avoid the exception.


jeffdaily commented Jun 14, 2018

I suppose it is worth noting that our TF1.3 sources ran fine.

I added __func__, __FILE__, and __LINE__ output to the THROW_ERROR macro in unpinned_copy_engine.cpp, built hcc from source, and found the exception is thrown at line 193.
https://github.com/RadeonOpenCompute/hcc/blob/39bf3750d60b2a395650acf9c280fc3a819ba5c6/lib/hsa/unpinned_copy_engine.cpp#L193

hsaErr = 4096. Just a generic 0x1000 HSA_STATUS_ERROR.
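
(For reference, a minimal sketch of that kind of instrumentation; the actual THROW_ERROR macro in unpinned_copy_engine.cpp wraps Kalmar::runtime_exception and its arguments may differ, so the macro name and signature below are illustrative assumptions.)

    // Illustrative only: an error macro extended to report where it fired.
    #include <sstream>
    #include <stdexcept>

    #define THROW_ERROR_WITH_LOCATION(msg, hsaErr)                         \
        do {                                                               \
            std::ostringstream os;                                         \
            os << (msg) << " (hsa status " << (hsaErr) << ") at "          \
               << __func__ << " " << __FILE__ << ":" << __LINE__;          \
            throw std::runtime_error(os.str());                            \
        } while (0)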


jeffdaily commented Jun 14, 2018

I dug into hsa_amd_memory_lock to see what causes it to return HSA_STATUS_ERROR. It returns it in two different cases, so I changed the return code in the first case to differentiate it from the latter, then installed ROCR-Runtime from source and reran the benchmark code. It still returns HSA_STATUS_ERROR, which tells me it's failing in RegisterMemory (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/bfc4b9e98cb5e48d9f96287371d518cb444968ba/src/core/runtime/amd_memory_region.cpp#L611), in the call to hsaKmtRegisterMemoryToNodes().
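
(For anyone trying to isolate the lock failure outside TensorFlow, here is a minimal standalone probe of hsa_amd_memory_lock; it is a sketch, not the benchmark's actual call path, and it assumes a working ROCm install with the HSA headers on the include path.)

    // Minimal sketch: lock (pin) a host buffer for one GPU agent, then unlock it.
    // Error checking is mostly omitted for brevity.
    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>
    #include <cstdio>
    #include <cstdlib>

    static hsa_status_t find_gpu(hsa_agent_t agent, void* data) {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {
            *static_cast<hsa_agent_t*>(data) = agent;
            return HSA_STATUS_INFO_BREAK;   // stop iterating once a GPU is found
        }
        return HSA_STATUS_SUCCESS;
    }

    int main() {
        hsa_init();
        hsa_agent_t gpu{};
        hsa_iterate_agents(find_gpu, &gpu);

        const size_t bytes = 1 << 20;       // 1 MiB host buffer
        void* host = malloc(bytes);
        void* agent_ptr = nullptr;

        hsa_status_t st = hsa_amd_memory_lock(host, bytes, &gpu, 1, &agent_ptr);
        printf("hsa_amd_memory_lock returned %d\n", static_cast<int>(st));

        if (st == HSA_STATUS_SUCCESS) hsa_amd_memory_unlock(host);
        free(host);
        hsa_shut_down();
        return st == HSA_STATUS_SUCCESS ? 0 : 1;
    }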


whchung commented Jun 14, 2018

Let's step back a bit. By "our TF1.3 sources ran fine", do you mean that on the same lower-level bits (KFD / ROCR / HCC / HIP), TF 1.3 + the older TF benchmarks works while TF 1.8 + the tip of TF benchmarks does not?

@jeffdaily

Yes, on the same lower-level bits, TF 1.3 + the older TF benchmarks works while TF 1.8 + the tip of TF benchmarks does not.

@jeffdaily

Got a new error, but this time trying 1 ps and 1 worker. The worker failed during warmup with

Memory access fault by GPU node-7 (Agent handle: 0x24c6ab0) on address 0x4001d03000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


whchung commented Jun 15, 2018

According to @jeffdaily the issue only happens on a particular configuration of the ROCm stack and is not observable in others. Awaiting more info.

@jeffdaily

Upgraded bazel from 0.10.1 to 0.12.0. No change.

Switched to branch v1.8-rocm. I can no longer reproduce the error. I've only tested these two branches (develop-upstream and v1.8-rocm), so I can't make any generalization. Just something to keep in mind if this issue shows up again, perhaps for a v1.9-rocm release or in the develop-upstream branch.

Also worth noting that we can't use the benchmarks repo beyond tensorflow/benchmarks@3b90c14. There are two okay commits on May 31st, but the last commit on May 31st, tensorflow/benchmarks@fef2030, uses the native collective reduce mode (tensorflow.python.ops.collective_ops), which was introduced after v1.8.

@jeffdaily

Reopening since using the grpc path causes the same issue again.

Since TF 1.3, the grpc code path has been optimized so that it does not serialize tensors into a protobuf. Instead, it allocates a new Tensor on the host, copies from device to host, and then uses the host Tensor's byte buffer directly.

Using branch r1.8-rocm, at or after commit 32b762d. That commit enables ROCm for the GPU device-to-host copy; otherwise you just get "No GPU device in process" at runtime when using grpc.

I am running using two hosts, one is designated the 'worker' and the other as 'ps' (parameter server).

# on 'worker' host prj47-rack-02
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0
# on 'ps' host prj47-rack-05
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0

The 'worker' python process terminates with the HCC unpinned copy engine error during warm up.

jeffdaily reopened this Jun 15, 2018
@jeffdaily

I reduced --num_gpus=4 down to --num_gpus=3 and was still able to reproduce. However, reducing it down to 2 or 1, I was unable to reproduce. Race condition?


whchung commented Jun 18, 2018

@jeffdaily Per our discussion, I'll need your help walking me through the steps to reproduce the issue so we can dive deeper into this. I assume the comments above are for me to study, correct?

@jeffdaily

@whchung yes, I've been trying to capture as much detail as I can so that you and others are able to reproduce this issue.


whchung commented Jun 18, 2018

@jeffdaily we'll likely have to study the internals of grpc_worker_service.cc more closely. As an initial experiment, please see what happens if we reduce kGrpcWorkerServiceThreadCount to 1.


jeffdaily commented Jun 18, 2018 via email


whchung commented Jun 19, 2018

@jeffdaily based on the comments / logs you gathered so far it seems we are failing at:

        hsa_status_t hsa_status = hsa_amd_memory_lock(const_cast<char *>(srcp), theseBytes, &_hsaAgent, 1, &locked_srcp);

in void UnpinnedCopyEngine::CopyHostToDevicePinInPlace(void* dst, const void* src, size_t sizeBytes, const hsa_signal_t *waitFor).

This routine tries to pin host memory pages so they are visible to the GPU. I'm guessing we may be pinning too much host memory, causing the ROCR runtime to fail.

At the TensorFlow level we'll need to understand how many KB/MB of memory have been pinned with this method.

At the HCC runtime level, we can probably use the HCC_H2D_PININPLACE_THRESHOLD env var to tune the threshold for pinning host memory. Refer to: https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L3746
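
(As a rough mental model only, not the actual HCC source: the engine presumably chooses between the staging-buffer and pin-in-place paths by comparing the transfer size against that threshold, along the lines of the sketch below; the function and enum names are assumptions.)

    // Illustrative sketch of H2D path selection; the real logic lives in
    // mcwamp_hsa.cpp / unpinned_copy_engine.cpp and may differ.
    #include <cstddef>

    enum class H2DCopyPath { kStagingBuffer, kPinInPlace };

    H2DCopyPath ChooseH2DPath(std::size_t sizeBytes, std::size_t pinInPlaceThreshold) {
        // Small copies go through pre-pinned staging buffers; copies at or above the
        // threshold (tunable via HCC_H2D_PININPLACE_THRESHOLD) pin the source pages
        // in place with hsa_amd_memory_lock.
        return sizeBytes >= pinInPlaceThreshold ? H2DCopyPath::kPinInPlace
                                                : H2DCopyPath::kStagingBuffer;
    }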


whchung commented Jun 19, 2018

It seems all transfers are done via CopyTensor::ViaDMA in core/common_runtime/copy_tensor.cc. Since we are failing in H2D copies, in the TensorFlow layer that's CopyHostToDevice, which eventually goes to GPUUtil::CopyCPUTensorToGPU in core/common_runtime/gpu/gpu_util.cc.

@jeffdaily

When preparing the protobuf response for a requested Tensor in GrpcWorker::GrpcRecvTensorAsync, there is also a call to GPUUtil::CopyGPUTensorToCPU in core/distributed_runtime/rpc/grpc_worker_service.cc, but only if the memory is non-DMA-able.

@whchung I would like you to take a look at core/distributed_runtime/rpc/grpc_tensor_coding.cc. When a worker is responding to a request for a Tensor, it encodes the response into a grpc ByteBuffer. See grpc::EncodeTensorToByteBuffer in grpc_tensor_coding.cc. I don't see where any host pinning is taking place.

@jeffdaily

I just set HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) and was able to successfully run.


whchung commented Jun 19, 2018

@jeffdaily sounds good. Let me rejoin you after I finish my initial sweep of the TensorFlow unit tests.


whchung commented Jun 20, 2018

/cc @fxkamd for awareness


whchung commented Jun 20, 2018

Per the discussion thus far, it seems there might be a resource leak in KFD(?) in the implementation of hsa_amd_memory_unlock.

@jeffdaily For now please try running the benchmarks with HCC_UNPINNED_COPY_MODE=2, but please also see if it's possible to change the memory allocator in TensorFlow and Eigen to only do aligned allocations so the ROCR runtime can pin the pages correctly without leaking. Check how AllocateRaw is used in TensorFlow and understand what alignment it is using. By default we probably need it to be page-aligned (4096).
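
(A minimal sketch of what forcing page-aligned allocations could look like at the TensorFlow layer, assuming the usual Allocator::AllocateRaw(alignment, num_bytes) interface; whether this is the right hook for the tensors involved here is exactly what needs to be checked.)

    // Sketch only: request page-aligned host memory through TensorFlow's Allocator.
    #include <cstddef>
    #include "tensorflow/core/framework/allocator.h"

    namespace {
    constexpr std::size_t kPageSize = 4096;

    void* AllocatePageAligned(tensorflow::Allocator* allocator, std::size_t num_bytes) {
        // Typical callers pass Allocator::kAllocatorAlignment (64 bytes); requesting
        // 4096 keeps each allocation on its own page boundary.
        return allocator->AllocateRaw(kPageSize, num_bytes);
    }
    }  // namespace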


whchung commented Jun 27, 2018

@gstoner we need people with KFD knowledge here. We are seeing trivial host memory pinning give errors.

@jeffdaily

What I know:

  • If the D2H and H2D copy routines in TF are turned into no-ops, there’s no problem.
  • If the D2H and H2D copy routines share a mutex, there’s no problem.
  • If HCC_UNPINNED_COPY_MODE=2 (forced Staging), there’s no problem.
  • If number of threads in TF is set to 1, there’s no problem.
  • There is a mutex within hcc/lib/hsa/unpinned_copy_engine.cpp that protects calls to hsa_amd_memory_lock, so if it is working properly, multiple threads shouldn't be pinning the same host address at the same time.

We suspected we might be leaking pinned memory; however, setting the thread count to 1 should not change the number of times we call hsa_amd_memory_lock, and we don't see the problem single-threaded.

The problem shows up when we're doing lots of H2D copies. Either tf_cnn_benchmarks.py with --local_parameter_device=cpu or the cifar10 multi-GPU example from the models tutorial will put the master gradients in host memory, forcing many H2D copies. What is interesting is that there should also be many D2H copies to get the individual gradient contributions onto the CPU, but we don't see those going through the unpinned copy engine.


fxkamd commented Jun 30, 2018

In your log it looked like you were getting into trouble because different threads were registering the same pointers to different nodes. The registration to specific nodes is just a vestige in the Thunk API. It's mostly replaced by mapping to specific nodes now.

I think I should be able to add a workaround in the Thunk by ignoring the node list in hsaKmtRegisterMemoryToNodes. This may break some KFDTest unit tests though.

A better fix would be in the Runtime to use hsaKmtRegisterMemory instead of hsaKmtRegisterMemoryToNodes. That would have the same effect without an ugly workaround in the Thunk.

@jeffdaily

Update on the race condition. It's in the HCC runtime. The reason I thought it was the function IsLockedPointer is because the behavior we were seeing was as if the mutex inside the UnpinnedCopyEngine::CopyHostToDevicePinInPlace was somehow not working. I was seeing multiple threads calling UnpinnedCopyEngine::CopyHostToDevicePinInPlace simultaneously and assumed T#0 and T#1 would check for locked memory, both would report false, and both would attempt to pin.

It turns out we weren't that far off. I had assumed there was only one copy engine, but there are two copy engine instances per GPU (aka copyDevice), one for each direction. One engine for H2D and another for D2H is probably okay, but the copy engines don't share a common mutex (and perhaps shouldn't?), so with the 4 GPUs I was evaluating, the H2D engines were all trying to lock the same host pointers.


whchung commented Jul 3, 2018

@jeffdaily Could you weigh in on why a file-scope mutex doesn't work?

@jeffdaily

A file-scope mutex might work. My first file-scope mutex only protected the IsLockedPointer function, but since the function is read-only it doesn't change any behavior. However, a file-scope mutex that protects the H2D and D2H PinInPlace functions, instead of the class instance member mutex, could work. Performance concerns?
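
(A sketch of that file-scope-mutex approach, with illustrative names rather than the actual HCC code: one mutex shared by every engine instance serializes both pin-in-place directions, which is exactly what raises the performance question.)

    // Illustrative: a single file-scope mutex shared by all copy engine instances.
    #include <mutex>

    namespace {
    std::mutex g_pinInPlaceMutex;   // file scope, shared across H2D and D2H engines
    }

    void CopyHostToDevicePinInPlaceSketch(/* dst, src, sizeBytes, waitFor */) {
        std::lock_guard<std::mutex> guard(g_pinInPlaceMutex);
        // hsa_amd_memory_lock(src, ...), async copy, hsa_amd_memory_unlock(src)
    }

    void CopyDeviceToHostPinInPlaceSketch(/* dst, src, sizeBytes, waitFor */) {
        std::lock_guard<std::mutex> guard(g_pinInPlaceMutex);
        // hsa_amd_memory_lock(dst, ...), async copy, hsa_amd_memory_unlock(dst)
    }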


whchung commented Jul 3, 2018

@jeffdaily Just want to have something working as the baseline. And let folks more dedicated to HCC runtime improve it, if needed.

Notice in HIP runtime layer, there is a team working on switching the backend to something called VDI, which natively supports both Linux and Windows. So we may or may not see this issue again when we make the switch from HIP-HCC to HIP-VDI.

@jeffdaily

I replaced the mutex in both H2DPinInPlace and D2HPinInPlace with a single file-scope mutex. Problem solved. However, my performance concerns might be warranted. The resnet benchmark on 4 GPUs is faster if we set HCC_UNPINNED_COPY_MODE=2 to use staged copies only. If we allow the pinned copies, it's slightly slower.
484.90 images/s pinned
560.08 images/s staged

@jeffdaily

Using the discussed fix of an unordered_set class-scope container to keep track of locked pointers, I get better performance than using the heavy-handed mutex. Still not as fast as using only staged copies. But much closer.

549.19 images/s pinned with set-check
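
(A sketch of the set-based idea, with assumed names: a class-scope unordered_set, guarded by a mutex, records which host pointers are currently locked so concurrent engines can avoid re-pinning the same pages instead of serializing every copy.)

    // Illustrative sketch of tracking locked host pointers with an unordered_set.
    #include <mutex>
    #include <unordered_set>

    class LockedPointerRegistry {
    public:
        // Returns true if the caller should pin ptr (it was not already tracked).
        bool TryAcquire(const void* ptr) {
            std::lock_guard<std::mutex> guard(_mutex);
            return _locked.insert(ptr).second;
        }
        // Called after hsa_amd_memory_unlock so the pointer can be pinned again later.
        void Release(const void* ptr) {
            std::lock_guard<std::mutex> guard(_mutex);
            _locked.erase(ptr);
        }
    private:
        std::mutex _mutex;
        std::unordered_set<const void*> _locked;
    };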

whchung changed the title from "Kalmar::runtime_exception: HCC unpinned copy engine error" to "errors in pin-in-place path in HCC unpinned copy engine" on Jul 9, 2018

whchung commented Jul 9, 2018

The conclusion of the ticket is that there are race conditions in HCC. A preliminary PR has been filed by @jeffdaily against the HCC runtime, and @scchan will continue to drive it to conclusion.

Meanwhile, from the TensorFlow perspective, setting HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) is the workaround for now.
