
GPU under-utilized when waiting for CPU to launch kernel #8818

Closed
panyx0718 opened this issue Mar 7, 2018 · 7 comments

Comments

@panyx0718
Contributor


Currently we run Ops one by one, synchronously. For Ops that the GPU can finish quickly, the CPU is too slow at launching the GPU kernels, so in many cases the GPU is under-utilized.

To mitigate this, we should schedule Ops in parallel (based on dependency information), so that we can better utilize both CPUs and GPUs.
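A minimal sketch of what dependency-based parallel scheduling could look like, assuming a pre-built op graph. This is illustrative only: `Op`, `pending`, and `RunGraphInParallel` are hypothetical names, not PaddlePaddle's actual executor API. Ops whose predecessors have all finished go into a ready queue, and a pool of worker threads drains it:

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical op node: the real executor would hold kernels and var deps.
struct Op {
  std::function<void()> run;    // e.g. launch this op's kernel
  std::vector<int> successors;  // ops that consume this op's outputs
  int pending = 0;              // count of unfinished predecessors
};

void RunGraphInParallel(std::vector<Op>& ops, int num_workers) {
  std::mutex mu;
  std::condition_variable cv;
  std::queue<int> ready;        // ops whose inputs are all available
  std::size_t finished = 0;

  for (std::size_t i = 0; i < ops.size(); ++i)
    if (ops[i].pending == 0) ready.push(static_cast<int>(i));

  auto worker = [&] {
    for (;;) {
      int id;
      {
        std::unique_lock<std::mutex> lk(mu);
        cv.wait(lk, [&] { return !ready.empty() || finished == ops.size(); });
        if (finished == ops.size()) return;  // whole graph done
        id = ready.front();
        ready.pop();
      }
      ops[id].run();  // runs outside the lock, so independent ops overlap
      {
        std::lock_guard<std::mutex> lk(mu);
        ++finished;
        for (int s : ops[id].successors)
          if (--ops[s].pending == 0) ready.push(s);  // successor became ready
      }
      cv.notify_all();
    }
  };

  std::vector<std::thread> pool;
  for (int i = 0; i < num_workers; ++i) pool.emplace_back(worker);
  for (auto& t : pool) t.join();
}
```

With multiple workers, independent ops (e.g. the SGD updates of different parameters) can be launched concurrently instead of waiting behind each other on a single CPU thread.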

@tonyyang-svail

tonyyang-svail commented Mar 7, 2018

@panyx0718 I believe the GPU kernel is launched asynchronously.

SGDKernel<T><<<grid, block, 0, ctx.cuda_device_context().stream()>>>(
    grad_data, param_data, learning_rate->data<T>(), param->numel(),
    param_out_data);

For many compute-intensive Ops (conv, MatMul, etc.), I find it hard to believe that the CPU, which launches the kernel, is slower than the GPU, which executes it.

@panyx0718
Contributor Author

panyx0718 commented Mar 7, 2018

The GPU kernel is launched asynchronously, but the CPU is not launching them fast enough to keep the GPU full. As the timeline shows, SGD and elementwise_mul are much faster on the GPU than on the CPU.

@tonyyang-svail

tonyyang-svail commented Mar 7, 2018

"SGD and elementwise_mul are much faster on the GPU than on the CPU."

Sure. What percentage of the total time do these ops account for?

@panyx0718
Contributor Author

I don't have exact numbers. Looking at the timeline, I estimate we could gain ~20% if we kept the GPU busy all the time (not just for sgd and elementwise_mul, but for other ops as well).

@chengduoZH
Contributor

GPU under-utilized when waiting for CPU to launch kernel

I don't think launching a kernel costs much time. The GPU is under-utilized because the amount of data each CUDA kernel processes is so small.

The two pictures below illustrate this.

A simple description: assume the GPU has a task queue (in fact, there really is one). In Figure 1, because the amount of data is large, each kernel takes a long time; the GPU may not have finished kernel1 by the time kernel2 is added to the queue, so the GPU timeline is densely packed.

Figure 1:
[figure: CPU launch vs. GPU execution timeline, large kernels]

In Figure 2, however, because the amount of data is small, each kernel takes very little time, and the task queue is empty most of the time.

Figure 2:
[figure: CPU launch vs. GPU execution timeline, small kernels]
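The queue model above can be put in numbers with a back-of-the-envelope simulation. All values here are assumptions for illustration, not measurements: the CPU takes `launch_us` microseconds to launch each kernel, while the GPU pops kernels off its queue and runs each for `kernel_us` microseconds, idling whenever the queue is empty.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of the CPU-launch / GPU-queue interaction described above.
// Returns the fraction of total GPU wall time spent idle, waiting for the
// CPU to enqueue the next kernel.
double GpuIdleFraction(int num_kernels, double launch_us, double kernel_us) {
  double gpu_free = 0.0;  // time at which the GPU finishes its current kernel
  double idle = 0.0;
  for (int i = 0; i < num_kernels; ++i) {
    double enqueued = (i + 1) * launch_us;        // CPU done launching kernel i
    double start = std::max(gpu_free, enqueued);  // GPU may have to wait
    idle += start - gpu_free;
    gpu_free = start + kernel_us;
  }
  return idle / gpu_free;
}
```

With a 5 µs launch cost, a 50 µs kernel (Figure 1's regime) leaves the GPU idle only for the very first launch, while a 1 µs kernel (Figure 2's regime) leaves it idle most of the time even though each individual launch is cheap.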

@panyx0718
Contributor Author

I don't mean that the "launch kernel" call itself takes too long. I mean the GPU sits waiting for the CPU to finish all of its own computation and launch the next kernel before it has anything to do.

@shanyi15
Collaborator

Hello, this issue has not been updated in the past month, so we will close it today. If you still need to follow up after it is closed, feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

Performance Tuning automation moved this from Doing to Done Aug 15, 2018