GPU CumOp can be speed up #6110

Open
khaotik opened this issue Jul 5, 2017 · 2 comments

khaotik (Contributor) commented Jul 5, 2017

Performance

I did a comparison with PyTorch. Here's the nvprof dump:

==22738== NVPROF is profiling process 22738, command: python3 _cumsum_profile.py
Using cuDNN version 6020 on context None
Preallocating 1200/2000 Mb (0.600000) on cuda
Mapped name None to device cuda: GeForce GTX 750 Ti (0000:01:00.0)
==22738== Profiling application: python3 _cumsum_profile.py
==22738== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 40.80%  129.91ms       202  643.12us  1.1520us  748.39us  [CUDA memcpy HtoD]
 23.37%  74.413ms       200  372.07us  5.5040us  757.54us  k_blockCumOp
 21.04%  66.996ms       100  669.96us  661.96us  885.19us  [CUDA memcpy DtoH]
  5.92%  18.858ms       100  188.58us  179.39us  199.30us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_downsweep, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, thrust::device_ptr<float>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.42%  14.072ms       100  140.72us  134.37us  148.55us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::accumulate_tiles, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.07%  12.963ms       100  129.63us  126.88us  132.61us  k_finalCumOp
  0.38%  1.2108ms       100  12.107us  10.976us  15.360us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=256, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=3>, unsigned long=256>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_scan_n, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, long, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=3)
  0.00%  5.1520us         1  5.1520us  5.1520us  5.1520us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)

k_blockCumOp and k_finalCumOp belong to Theano. The thrust kernels with the long names belong to PyTorch.

The code used for profiling:

import numpy as np
import theano as th
import theano.tensor as T
import torch

# Theano graph: cumulative sum of a vector (dtype = floatX).
s_x = T.vector()
s_y = T.extra_ops.cumsum(s_x)

fn = th.function([s_x], s_y)

# PyTorch equivalent: copy the array to the GPU, then cumsum along the last axis.
def fn_torch(x):
    return torch.cumsum(torch.from_numpy(x).cuda(), -1)

# 2**20 elements; run both implementations 100 times under nvprof.
x = np.random.rand(2**20).astype(th.config.floatX)
for _ in range(100):
    fn(x)
    fn_torch(x)

Interface

  • Only supports up to 3 dims
  • Only supports float32 (a quick way to check which op actually runs is sketched after this list)
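
A minimal sketch for checking which op a compiled function actually uses, assuming the gpuarray backend and that the GPU scan op is the GpuCumOp class from theano.gpuarray.extra_ops (which the k_blockCumOp / k_finalCumOp kernel names above suggest):

import theano as th
import theano.tensor as T
# Assumption: the GPU scan op is GpuCumOp in the gpuarray backend.
from theano.gpuarray.extra_ops import GpuCumOp

def uses_gpu_cumop(fn):
    # Walk the optimized graph and look for a GpuCumOp node.
    return any(isinstance(node.op, GpuCumOp)
               for node in fn.maker.fgraph.toposort())

s_x32 = T.fvector()  # float32 input
s_x64 = T.dvector()  # float64 input
fn32 = th.function([s_x32], T.extra_ops.cumsum(s_x32))
fn64 = th.function([s_x64], T.extra_ops.cumsum(s_x64))

print(uses_gpu_cumop(fn32))  # expected True on a GPU context
print(uses_gpu_cumop(fn64))  # expected False, since only float32 is supported

The same kind of check applies to the dimensionality limit in the first bullet.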
nouiz changed the title from "GPU CumOp issues" to "GPU CumOp can be speed up" on Jul 5, 2017
nouiz (Member) commented Jul 5, 2017

I updated the title. Mostly, you are saying that if we reuse the PyTorch kernel, we will speed up that op.

Do you have a Theano profile that shows this op taking significant time in a real model?
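
For reference, a minimal sketch of collecting such a per-op profile with Theano's built-in profiler (profile=True on theano.function, then fn.profile.summary(); the standalone cumsum graph below is only a stand-in for a real model):

import numpy as np
import theano as th
import theano.tensor as T

s_x = T.vector()
s_y = T.extra_ops.cumsum(s_x)

# profile=True makes Theano record per-apply and per-op timings for this function.
fn = th.function([s_x], s_y, profile=True)

x = np.random.rand(2**20).astype(th.config.floatX)
for _ in range(100):
    fn(x)

# Prints the per-op breakdown; CumOp / GpuCumOp appears with its share of the runtime.
fn.profile.summary()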

nouiz (Member) commented Jul 5, 2017

Note that the Thrust library ships with CUDA, so this would not require a new dependency.
