
errors in pin-in-place path in HCC unpinned copy engine #27

Closed
jeffdaily opened this issue Jun 13, 2018 · 39 comments

jeffdaily commented Jun 13, 2018

Using the latest develop-upstream branch and the latest benchmarks master.
Running the tf_cnn_benchmarks.py code like so:

python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

Eventually, during warmup, it produces the following message:

terminate called after throwing an instance of 'Kalmar::runtime_exception'
  what():  HCC unpinned copy engine error
Aborted (core dumped)

If you set --local_parameter_device=gpu instead, the problem doesn't manifest.

However, the problem happens again, even with --local_parameter_device=gpu, during distributed training. Running 1 worker and 1 parameter server like so:

# worker
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc
# ps
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc

At least with the distributed training, my guess is that tensors are moving from GPU to CPU prior to being packed into protobufs and shipped via grpc. Not sure why this is also happening during warm-up, except that I specified the parameter device to be CPU, forcing a device-to-host copy to store the params.

misc system info

c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

lscpu

AMD EPYC 7551 32-Core Processor

uname -a
Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset

rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]


whchung commented Jun 14, 2018

/cc @scchan

Please help collect a more detailed log of the copies with HCC_DB.

@jeffdaily

Logging a new error, same inputs to benchmark.

Memory access fault by GPU node-6 (Agent handle: 0x17e9ab0) on address 0x531981e000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Gathering more detail next with HCC_DB.

@jeffdaily

Ran with HCC_DB=0x48A. Output attached.
debug.txt


whchung commented Jun 14, 2018

0x48A is good for diagnosing faulty kernels but not very good for copies. Please check the HCC header file and try other flags so more details of the copies are shown.


whchung commented Jun 14, 2018

@jeffdaily

I ran 0x0302 before your suggestion of 0x68A. Both attached. The 0x0302 run took longer for the error to appear. 0x68A showed the error relatively early.
debug_0x68A.txt
debug_0x0302.txt


whchung commented Jun 14, 2018

We need more logs, especially from the Python layer, to understand which command we are in.

Also, to make the behavior more deterministic, could you try rerunning the benchmark with the additional env vars HCC_SERIALIZE_KERNEL=3 HCC_SERIALIZE_COPY=3?


jeffdaily commented Jun 14, 2018

HCC_DB=0x68A
HCC_SERIALIZE_KERNEL=3
HCC_SERIALIZE_COPY=3
TF_CPP_MIN_VLOG_LEVEL=1

verbose1_serialize.txt.gz
It was 75MB uncompressed so I gzipped it before uploading.

@jeffdaily

Setting

HSA_ENABLE_SDMA=0
HSA_ENABLE_INTERRUPT=0
HSA_DISABLE_CACHE=1

did not avoid the exception.


jeffdaily commented Jun 14, 2018

I suppose it is worth noting that our TF1.3 sources ran fine.

I added __func__, __FILE__, and __LINE__ output to the THROW_ERROR macro in unpinned_copy_engine.cpp, built hcc from source, and found the exception is thrown at line 193.
https://github.com/RadeonOpenCompute/hcc/blob/39bf3750d60b2a395650acf9c280fc3a819ba5c6/lib/hsa/unpinned_copy_engine.cpp#L193

hsaErr = 4096. Just a generic 0x1000 HSA_STATUS_ERROR.
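
(For reference, a minimal sketch of that kind of instrumentation; the actual THROW_ERROR macro in unpinned_copy_engine.cpp wraps Kalmar::runtime_exception and its arguments may differ, so the macro name and signature below are illustrative assumptions.)

    // Illustrative only: an error macro extended to report where it fired.
    #include <sstream>
    #include <stdexcept>

    #define THROW_ERROR_WITH_LOCATION(msg, hsaErr)                         \
        do {                                                               \
            std::ostringstream os;                                         \
            os << (msg) << " (hsa status " << (hsaErr) << ") at "          \
               << __func__ << " " << __FILE__ << ":" << __LINE__;          \
            throw std::runtime_error(os.str());                            \
        } while (0)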


jeffdaily commented Jun 14, 2018

I dug into hsa_amd_memory_lock to see what causes it to return HSA_STATUS_ERROR. It returns it in two different cases, so I changed the return code in the first case to differentiate it from the latter, then installed ROCR-Runtime from source and reran the benchmark code. It still returns HSA_STATUS_ERROR, which tells me it's failing in RegisterMemory (https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/bfc4b9e98cb5e48d9f96287371d518cb444968ba/src/core/runtime/amd_memory_region.cpp#L611), in the call to hsaKmtRegisterMemoryToNodes().
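
(For anyone trying to isolate the lock failure outside TensorFlow, here is a minimal standalone probe of hsa_amd_memory_lock; it is a sketch, not the benchmark's actual call path, and it assumes a working ROCm install with the HSA headers on the include path.)

    // Minimal sketch: lock (pin) a host buffer for one GPU agent, then unlock it.
    // Error checking is mostly omitted for brevity.
    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>
    #include <cstdio>
    #include <cstdlib>

    static hsa_status_t find_gpu(hsa_agent_t agent, void* data) {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {
            *static_cast<hsa_agent_t*>(data) = agent;
            return HSA_STATUS_INFO_BREAK;   // stop iterating once a GPU is found
        }
        return HSA_STATUS_SUCCESS;
    }

    int main() {
        hsa_init();
        hsa_agent_t gpu{};
        hsa_iterate_agents(find_gpu, &gpu);

        const size_t bytes = 1 << 20;       // 1 MiB host buffer
        void* host = malloc(bytes);
        void* agent_ptr = nullptr;

        hsa_status_t st = hsa_amd_memory_lock(host, bytes, &gpu, 1, &agent_ptr);
        printf("hsa_amd_memory_lock returned %d\n", static_cast<int>(st));

        if (st == HSA_STATUS_SUCCESS) hsa_amd_memory_unlock(host);
        free(host);
        hsa_shut_down();
        return st == HSA_STATUS_SUCCESS ? 0 : 1;
    }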


whchung commented Jun 14, 2018

Let's step back a bit. By "our TF1.3 sources ran fine", do you mean that on the same lower-level bits (KFD / ROCR / HCC / HIP), TF 1.3 + the older TF benchmarks works while TF 1.8 + the tip of TF benchmarks does not?

@jeffdaily

Yes, on the same lower-level bits, TF 1.3 + the older TF benchmarks works while TF 1.8 + the tip of TF benchmarks does not.

@jeffdaily

Got a new error, but this time trying 1 ps and 1 worker. The worker failed during warmup with

Memory access fault by GPU node-7 (Agent handle: 0x24c6ab0) on address 0x4001d03000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)


whchung commented Jun 15, 2018

According to @jeffdaily the issue only happens on a particular configuration of the ROCm stack and is not observable in others. Awaiting more info.

@jeffdaily

Upgraded bazel from 0.10.1 to 0.12.0. No change.

Switched to branch v1.8-rocm. I can no longer reproduce the error. I've only tested these two branches (develop-upstream and v1.8-rocm), so I can't make any generalization. Just something to keep in mind if this issue shows up again, perhaps for a v1.9-rocm release or in the develop-upstream branch.

Also worth noting that we can't use the benchmarks repo beyond tensorflow/benchmarks@3b90c14. There are two okay commits on May 31st, but the last commit on May 31st, tensorflow/benchmarks@fef2030, uses the native collective reduce mode (tensorflow.python.ops.collective_ops), which was introduced after v1.8.

@jeffdaily

Reopening since using the grpc path causes the same issue again.

Since TF 1.3, the grpc code path has been optimized so that it does not serialize tensors into a protobuf. Instead, it allocates a new Tensor on the host, copies from device to host, and then uses the host Tensor's byte buffer directly.

Using branch r1.8-rocm, at or after commit 32b762d. That commit enables ROCm for the GPU device-to-host copy; otherwise you just get "No GPU device in process" at runtime when using grpc.

I am running using two hosts, one is designated the 'worker' and the other as 'ps' (parameter server).

# on 'worker' host prj47-rack-02
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0
# on 'ps' host prj47-rack-05
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0

The 'worker' python process terminates with the HCC unpinned copy engine error during warm up.

jeffdaily reopened this Jun 15, 2018
@jeffdaily

I reduced --num_gpus=4 down to --num_gpus=3 and was still able to reproduce. However, reducing it down to 2 or 1, I was unable to reproduce. Race condition?


whchung commented Jun 18, 2018

@jeffdaily Per our discussion, I'll need your help walking me through the steps to reproduce the issue so we can dive deeper into this. I assume the comments above are for me to study, correct?

@jeffdaily

@whchung yes, I've been trying to capture as much detail as I can so that you and others are able to reproduce this issue.


whchung commented Jun 18, 2018

@jeffdaily we'll likely have to study the internals of grpc_worker_service.cc more closely. As an initial experiment, please see what happens if we reduce kGrpcWorkerServiceThreadCount to 1.


jeffdaily commented Jun 18, 2018 via email


whchung commented Jun 19, 2018

@jeffdaily based on the comments / logs you gathered so far it seems we are failing at:

        hsa_status_t hsa_status = hsa_amd_memory_lock(const_cast<char *>(srcp), theseBytes, &_hsaAgent, 1, &locked_srcp);

in void UnpinnedCopyEngine::CopyHostToDevicePinInPlace(void* dst, const void* src, size_t sizeBytes, const hsa_signal_t *waitFor).

This routine tries to pin host memory pages so they are visible to the GPU. I'm guessing we may be pinning too much host memory, causing the ROCR runtime to fail.

At the TensorFlow level we'll need to understand how many KB/MB of memory have been pinned with this method.

At the HCC runtime level, we can probably use the HCC_H2D_PININPLACE_THRESHOLD env var to tune the threshold for pinning host memory. Refer to: https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L3746
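
(As a rough mental model only, not the actual HCC source: the engine presumably chooses between the staging-buffer and pin-in-place paths by comparing the transfer size against that threshold, along the lines of the sketch below; the function and enum names are assumptions.)

    // Illustrative sketch of H2D path selection; the real logic lives in
    // mcwamp_hsa.cpp / unpinned_copy_engine.cpp and may differ.
    #include <cstddef>

    enum class H2DCopyPath { kStagingBuffer, kPinInPlace };

    H2DCopyPath ChooseH2DPath(std::size_t sizeBytes, std::size_t pinInPlaceThreshold) {
        // Small copies go through pre-pinned staging buffers; copies at or above the
        // threshold (tunable via HCC_H2D_PININPLACE_THRESHOLD) pin the source pages
        // in place with hsa_amd_memory_lock.
        return sizeBytes >= pinInPlaceThreshold ? H2DCopyPath::kPinInPlace
                                                : H2DCopyPath::kStagingBuffer;
    }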


whchung commented Jun 19, 2018

It seems all transfers are done via CopyTensor::ViaDMA in core/common_runtime/copy_tensor.cc. Since we are failing in H2D copies, in the TensorFlow layer that's CopyHostToDevice, which eventually goes to GPUUtil::CopyCPUTensorToGPU in core/common_runtime/gpu/gpu_util.cc.

@jeffdaily

When preparing the protobuf response for a requested Tensor in GrpcWorker::GrpcRecvTensorAsync, there is also a call to GPUUtil::CopyGPUTensorToCPU in core/distributed_runtime/rpc/grpc_worker_service.cc, but only if the memory is non-DMA-able.

@whchung I would like you to take a look at core/distributed_runtime/rpc/grpc_tensor_coding.cc. When a worker is responding to a request for a Tensor, it encodes the response into a grpc ByteBuffer. See grpc::EncodeTensorToByteBuffer in grpc_tensor_coding.cc. I don't see where any host pinning is taking place.

@jeffdaily

I just set HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) and was able to successfully run.


whchung commented Jun 19, 2018

@jeffdaily sounds good. Let me rejoin you after I finish my initial sweep of the TensorFlow unit tests.


whchung commented Jun 20, 2018

/cc @fxkamd for awareness


whchung commented Jun 20, 2018

Per the discussion thus far, it seems there might be a resource leak in KFD(?) in the implementation of hsa_amd_memory_unlock.

@jeffdaily For now please try running the benchmarks with HCC_UNPINNED_COPY_MODE=2, but please also see if it's possible to change the memory allocator in TensorFlow and Eigen to only do aligned allocations so the ROCR runtime can pin the pages correctly without leaking. Check how AllocateRaw is used in TensorFlow and understand what alignment it is using. By default we probably need it to be page-aligned (4096).
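
(A minimal sketch of what forcing page-aligned allocations could look like at the TensorFlow layer, assuming the usual Allocator::AllocateRaw(alignment, num_bytes) interface; whether this is the right hook for the tensors involved here is exactly what needs to be checked.)

    // Sketch only: request page-aligned host memory through TensorFlow's Allocator.
    #include <cstddef>
    #include "tensorflow/core/framework/allocator.h"

    namespace {
    constexpr std::size_t kPageSize = 4096;

    void* AllocatePageAligned(tensorflow::Allocator* allocator, std::size_t num_bytes) {
        // Typical callers pass Allocator::kAllocatorAlignment (64 bytes); requesting
        // 4096 keeps each allocation on its own page boundary.
        return allocator->AllocateRaw(kPageSize, num_bytes);
    }
    }  // namespace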


whchung commented Jun 27, 2018

@gstoner we need people with KFD knowledge here. We are seeing trivial host memory pinning give errors.

@jeffdaily

What I know:

  • If the D2H and H2D copy routines in TF are turned into no-ops, there’s no problem.
  • If the D2H and H2D copy routines share a mutex, there’s no problem.
  • If HCC_UNPINNED_COPY_MODE=2 (forced Staging), there’s no problem.
  • If number of threads in TF is set to 1, there’s no problem.
  • There is a mutex within hcc/lib/hsa/unpinned_copy_engine.cpp that protects calls to hsa_amd_memory_lock, so if it is working properly, multiple threads shouldn't be pinning the same host address at the same time.

We suspected we might be leaking pinned memory; however, setting the thread count to 1 should not change the number of times we call hsa_amd_memory_lock, and we don't see the problem single-threaded.

The problem shows up when we're doing lots of H2D copies. Either tf_cnn_benchmarks.py with --local_parameter_device=cpu or the cifar10 multi-GPU example from the models tutorial will put the master gradients in host memory, forcing many H2D copies. What is interesting is that there should also be many D2H copies to get the individual gradient contributions onto the CPU, but we don't see those going through the unpinned copy engine.


fxkamd commented Jun 30, 2018

In your log it looked like you were getting into trouble because different threads were registering the same pointers to different nodes. The registration to specific nodes is just a vestige in the Thunk API. It's mostly replaced by mapping to specific nodes now.

I think I should be able to add a workaround in the Thunk by ignoring the node list in hsaKmtRegisterMemoryToNodes. This may break some KFDTest unit tests though.

A better fix would be in the Runtime to use hsaKmtRegisterMemory instead of hsaKmtRegisterMemoryToNodes. That would have the same effect without an ugly workaround in the Thunk.

@jeffdaily

Update on the race condition. It's in the HCC runtime. The reason I thought it was the function IsLockedPointer is because the behavior we were seeing was as if the mutex inside the UnpinnedCopyEngine::CopyHostToDevicePinInPlace was somehow not working. I was seeing multiple threads calling UnpinnedCopyEngine::CopyHostToDevicePinInPlace simultaneously and assumed T#0 and T#1 would check for locked memory, both would report false, and both would attempt to pin.

It turns out we weren't that far off. I had assumed there was only one copy engine, but there are two copy engine instances per GPU (aka copyDevice), one for each direction. One engine for H2D and another for D2H is probably okay, but the copy engines don't share a common mutex (and perhaps shouldn't?), so with the 4 GPUs I was evaluating, the H2D engines were all trying to lock the same host pointers.


whchung commented Jul 3, 2018

@jeffdaily Could you weigh in on why a file-scope mutex doesn't work?

@jeffdaily

A file-scope mutex might work. My first file-scope mutex only protected the IsLockedPointer function, but since the function is read-only it doesn't change any behavior. However, a file-scope mutex that protects the H2D and D2H PinInPlace functions, instead of the class instance member mutex, could work. Performance concerns?
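
(A sketch of that file-scope-mutex approach, with illustrative names rather than the actual HCC code: one mutex shared by every engine instance serializes both pin-in-place directions, which is exactly what raises the performance question.)

    // Illustrative: a single file-scope mutex shared by all copy engine instances.
    #include <mutex>

    namespace {
    std::mutex g_pinInPlaceMutex;   // file scope, shared across H2D and D2H engines
    }

    void CopyHostToDevicePinInPlaceSketch(/* dst, src, sizeBytes, waitFor */) {
        std::lock_guard<std::mutex> guard(g_pinInPlaceMutex);
        // hsa_amd_memory_lock(src, ...), async copy, hsa_amd_memory_unlock(src)
    }

    void CopyDeviceToHostPinInPlaceSketch(/* dst, src, sizeBytes, waitFor */) {
        std::lock_guard<std::mutex> guard(g_pinInPlaceMutex);
        // hsa_amd_memory_lock(dst, ...), async copy, hsa_amd_memory_unlock(dst)
    }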


whchung commented Jul 3, 2018

@jeffdaily Just want to have something working as the baseline. And let folks more dedicated to HCC runtime improve it, if needed.

Notice in HIP runtime layer, there is a team working on switching the backend to something called VDI, which natively supports both Linux and Windows. So we may or may not see this issue again when we make the switch from HIP-HCC to HIP-VDI.

@jeffdaily

I replaced the mutex in both H2DPinInPlace and D2HPinInPlace with a single file-scope mutex. Problem solved. However, my performance concerns might be warranted. The resnet benchmark on 4 GPUs is faster if we set HCC_UNPINNED_COPY_MODE=2 to use staged copies only. If we allow the pinned copies, it's slightly slower.
484.90 images/s pinned
560.08 images/s staged

@jeffdaily

Using the discussed fix of an unordered_set class-scope container to keep track of locked pointers, I get better performance than using the heavy-handed mutex. Still not as fast as using only staged copies. But much closer.

549.19 images/s pinned with set-check
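
(A sketch of the set-based idea, with assumed names: a class-scope unordered_set, guarded by a mutex, records which host pointers are currently locked so concurrent engines can avoid re-pinning the same pages instead of serializing every copy.)

    // Illustrative sketch of tracking locked host pointers with an unordered_set.
    #include <mutex>
    #include <unordered_set>

    class LockedPointerRegistry {
    public:
        // Returns true if the caller should pin ptr (it was not already tracked).
        bool TryAcquire(const void* ptr) {
            std::lock_guard<std::mutex> guard(_mutex);
            return _locked.insert(ptr).second;
        }
        // Called after hsa_amd_memory_unlock so the pointer can be pinned again later.
        void Release(const void* ptr) {
            std::lock_guard<std::mutex> guard(_mutex);
            _locked.erase(ptr);
        }
    private:
        std::mutex _mutex;
        std::unordered_set<const void*> _locked;
    };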

whchung changed the title from "Kalmar::runtime_exception: HCC unpinned copy engine error" to "errors in pin-in-place path in HCC unpinned copy engine" on Jul 9, 2018

whchung commented Jul 9, 2018

The conclusion of the ticket is that there are race conditions in HCC. A preliminary PR has been filed by @jeffdaily against the HCC runtime, and @scchan will continue to drive it to conclusion.

Meanwhile, from the TensorFlow perspective, setting HCC_UNPINNED_COPY_MODE=2 (StagingBuffer) is the workaround for now.
