I did a comparison with PyTorch. Here's the nvprof dump:
```
==22738== NVPROF is profiling process 22738, command: python3 _cumsum_profile.py
Using cuDNN version 6020 on context None
Preallocating 1200/2000 Mb (0.600000) on cuda
Mapped name None to device cuda: GeForce GTX 750 Ti (0000:01:00.0)
==22738== Profiling application: python3 _cumsum_profile.py
==22738== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 40.80%  129.91ms       202  643.12us  1.1520us  748.39us  [CUDA memcpy HtoD]
 23.37%  74.413ms       200  372.07us  5.5040us  757.54us  k_blockCumOp
 21.04%  66.996ms       100  669.96us  661.96us  885.19us  [CUDA memcpy DtoH]
  5.92%  18.858ms       100  188.58us  179.39us  199.30us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_downsweep, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, thrust::device_ptr<float>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.42%  14.072ms       100  140.72us  134.37us  148.55us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=128, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=9>, unsigned long=128>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::accumulate_tiles, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::device_ptr<float>, thrust::system::cuda::detail::aligned_decomposition<long>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=9)
  4.07%  12.963ms       100  129.63us  126.88us  132.61us  k_finalCumOp
  0.38%  1.2108ms       100  12.107us  10.976us  15.360us  void thrust::system::cuda::detail::bulk_::detail::launch_by_value<unsigned int=256, thrust::system::cuda::detail::bulk_::detail::cuda_task<thrust::system::cuda::detail::bulk_::parallel_group<thrust::system::cuda::detail::bulk_::concurrent_group<thrust::system::cuda::detail::bulk_::agent<unsigned long=3>, unsigned long=256>, unsigned long=0>, thrust::system::cuda::detail::bulk_::detail::closure<thrust::system::cuda::detail::scan_detail::inclusive_scan_n, thrust::tuple<thrust::system::cuda::detail::bulk_::detail::cursor<unsigned int=1>, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, long, thrust::detail::normal_iterator<thrust::pointer<float, thrust::detail::execute_with_allocator<THCThrustAllocator, thrust::system::cuda::detail::execute_on_stream_base>, thrust::use_default, thrust::use_default>>, AddOp<float>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>>>(unsigned long=3)
  0.00%  5.1520us         1  5.1520us  5.1520us  5.1520us  void gemmSN_NN_kernel<float, float, float, int=128, int=2, int=4, int=8, int=2, int=4>(cublasGemmSmallNParams<float, float, float>, float const *, float const *, float, float, int)
```
k_blockCumOp and k_finalCumOp belong to Theano; the Thrust kernels with the long names belong to PyTorch. Over the 100 iterations that is roughly 74.4 ms + 13.0 ms ≈ 87.4 ms (≈ 870 µs per call) in Theano's cumsum kernels versus 18.9 ms + 14.1 ms + 1.2 ms ≈ 34.1 ms (≈ 340 µs per call) in PyTorch's, so Theano's implementation is about 2.5× slower on this GPU.
The code used for profiling:
```python
import numpy as np
import theano as th
import theano.tensor as T
import torch

# Theano: compile a graph computing a 1-D cumulative sum.
s_x = T.vector()
s_y = T.extra_ops.cumsum(s_x)
fn = th.function([s_x], s_y)

# PyTorch: the same operation via torch.cumsum on a CUDA tensor.
def fn_torch(x):
    return torch.cumsum(torch.from_numpy(x).cuda(), -1)

x = np.random.rand(2**20).astype(th.config.floatX)
for _ in range(100):
    fn(x)
    fn_torch(x)
```
(th.config.floatX is float32 here.)
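If wall-clock numbers are wanted in addition to the per-kernel times from nvprof, a small timing harness can be wrapped around the two functions above. This is only a sketch under my own naming (avg_seconds is a hypothetical helper, not part of the original script); the synchronize calls matter because PyTorch launches GPU work asynchronously:

```python
import time

def avg_seconds(f, x, n=100):
    # Hypothetical helper: warm up once so first-call overhead is excluded,
    # then report the average time per call over n iterations.
    f(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        f(x)
    # PyTorch enqueues kernels asynchronously, so synchronize before stopping
    # the clock; the Theano function already blocks, as it returns a NumPy array.
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n

print("theano cumsum :", avg_seconds(fn, x))
print("pytorch cumsum:", avg_seconds(fn_torch, x))
```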
I updated the title. Basically, you are saying that if we reuse the PyTorch kernel, we will speed up that op.
Do you have a Theano profile that shows this op taking significant time in a real model?
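For reference, a per-op Theano profile like that can be collected with Theano's built-in profiler. A minimal sketch, reusing the graph from the script above and assuming the usual profile=True flag on theano.function:

```python
# Sketch only: compile with profiling enabled, run the workload, then print
# the per-op timing summary (this is where the cumsum op would show up).
fn_prof = th.function([s_x], s_y, profile=True)
for _ in range(100):
    fn_prof(x)
fn_prof.profile.summary()
```

In a real model the same flag (or THEANO_FLAGS=profile=True) would show how much of the total runtime the cumsum op actually accounts for.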
Note that the Thrust library ships with CUDA, so this would not require a new dependency.