GPU CumOp can be speed up #6110

Open
khaotik opened this issue Jul 5, 2017 · 2 comments

khaotik (Contributor) commented Jul 5, 2017

Performance

I did a comparison with PyTorch. Here's the nvprof dump:

==22738== NVPROF is profiling process 22738, command: python3 _cumsum_profile.py
Using cuDNN version 6020 on context None
Preallocating 1200/2000 Mb (0.600000) on cuda
Mapped name None to device cuda: GeForce GTX 750 Ti (0000:01:00.0)
==22738== Profiling application: python3 _cumsum_profile.py
==22738== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 40.80%  129.91ms       202  643.12us  1.1520us  748.39us  [CUDA memcpy HtoD]
 23.37%  74.413ms       200  372.07us  5.5040us  757.54us  k_blockCumOp
 21.04%  66.996ms       100  669.96us  661.96us  885.19us  [CUDA memcpy DtoH]
  5.92%  18.858ms       100  188.58us  179.39us  199.30us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_downsweep, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, thrust::device_ptr<float>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.42%  14.072ms       100  140.72us  134.37us  148.55us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::accumulate_tiles, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.07%  12.963ms       100  129.63us  126.88us  132.61us  k_finalCumOp
  0.38%  1.2108ms       100  12.107us  10.976us  15.360us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=256, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=3>, unsigned long=256>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_scan_n, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, long, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=3)
  0.00%  5.1520us         1  5.1520us  5.1520us  5.1520us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)

k_blockCumOp and k_finalCumOp belong to Theano. The thrust kernels with the long names belong to PyTorch.

The code used for profiling:

import numpy as np
import theano as th
import theano.tensor as T
import torch

# Theano graph: cumulative sum of a vector (dtype = floatX).
s_x = T.vector()
s_y = T.extra_ops.cumsum(s_x)

fn = th.function([s_x], s_y)

# PyTorch equivalent: copy the array to the GPU, then cumsum along the last axis.
def fn_torch(x):
    return torch.cumsum(torch.from_numpy(x).cuda(), -1)

# 2**20 elements; run both implementations 100 times under nvprof.
x = np.random.rand(2**20).astype(th.config.floatX)
for _ in range(100):
    fn(x)
    fn_torch(x)

Interface

  • Only supports up to 3 dims
  • Only supports float32 (a quick way to check which op actually runs is sketched after this list)
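
A minimal sketch for checking which op a compiled function actually uses, assuming the gpuarray backend and that the GPU scan op is the GpuCumOp class from theano.gpuarray.extra_ops (which the k_blockCumOp / k_finalCumOp kernel names above suggest):

import theano as th
import theano.tensor as T
# Assumption: the GPU scan op is GpuCumOp in the gpuarray backend.
from theano.gpuarray.extra_ops import GpuCumOp

def uses_gpu_cumop(fn):
    # Walk the optimized graph and look for a GpuCumOp node.
    return any(isinstance(node.op, GpuCumOp)
               for node in fn.maker.fgraph.toposort())

s_x32 = T.fvector()  # float32 input
s_x64 = T.dvector()  # float64 input
fn32 = th.function([s_x32], T.extra_ops.cumsum(s_x32))
fn64 = th.function([s_x64], T.extra_ops.cumsum(s_x64))

print(uses_gpu_cumop(fn32))  # expected True on a GPU context
print(uses_gpu_cumop(fn64))  # expected False, since only float32 is supported

The same kind of check applies to the dimensionality limit in the first bullet.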
nouiz changed the title from "GPU CumOp issues" to "GPU CumOp can be speed up" on Jul 5, 2017
nouiz (Member) commented Jul 5, 2017

I updated the title. Mostly, you are saying that if we reuse the PyTorch kernel, we will speed up that op.

Do you have a Theano profile that shows this op taking significant time in a real model?
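
For reference, a minimal sketch of collecting such a per-op profile with Theano's built-in profiler (profile=True on theano.function, then fn.profile.summary(); the standalone cumsum graph below is only a stand-in for a real model):

import numpy as np
import theano as th
import theano.tensor as T

s_x = T.vector()
s_y = T.extra_ops.cumsum(s_x)

# profile=True makes Theano record per-apply and per-op timings for this function.
fn = th.function([s_x], s_y, profile=True)

x = np.random.rand(2**20).astype(th.config.floatX)
for _ in range(100):
    fn(x)

# Prints the per-op breakdown; CumOp / GpuCumOp appears with its share of the runtime.
fn.profile.summary()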

nouiz (Member) commented Jul 5, 2017

Note that the Thrust library ships with CUDA, so this would not require a new dependency.
