# PyCUDA installation

In [1]:
!pip install pycuda

Collecting pycuda
  Downloading https://files.pythonhosted.org/packages/5e/3f/5658c38579b41866ba21ee1b5020b8225cec86fe717e4b1c5c972de0a33c/pycuda-2019.1.2.tar.gz (1.6MB)
Collecting pytools>=2011.2
  Downloading https://files.pythonhosted.org/packages/66/c7/88a4f8b6f0f78d0115ec3320861a0cc1f6daa3b67e97c3c2842c33f9c089/pytools-2020.1.tar.gz (60kB)
Collecting appdirs>=1.4.0
  Downloading https://files.pythonhosted.org/packages/56/eb/810e700ed1349edde4cbdc1b2a21e28cdf115f9faf263f6bbf8447c1abf3/appdirs-1.4.3-py2.py3-none-any.whl
Collecting mako
  Downloading https://files.pythonhosted.org/packages/50/78/f6ade1e18aebda570eed33b7c534378d9659351cadce2fcbc7b31be5f615/Mako-1.1.2-py2.py3-none-any.whl (75kB)
Building wheels for collected packages: pycuda, pytools
  Building wheel for pycuda (setup.py) .



---



# Version 4: using ```ElementwiseKernel```

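The first part of the code is the same as in the previous versions: it imports the modules, creates the timing events, and copies two random single-precision arrays of shape ```(1, N)``` to the GPU.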

In [0]:
import numpy as np

import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

########
# MAIN #
########

start = cuda.Event()
end   = cuda.Event()

N = 100000

h_a = np.random.randn(1, N)
h_b = np.random.randn(1, N)

h_a = h_a.astype(np.float32)
h_b = h_b.astype(np.float32)

d_a = gpuarray.to_gpu(h_a)
d_b = gpuarray.to_gpu(h_b)
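
The ```astype(np.float32)``` conversion above matters because the kernel defined later in this version declares its array arguments as ```float *```. The following NumPy-only sketch shows the effect of the conversion on the element type and on the buffer size. The names ```h_x``` and ```h_x32``` are hypothetical and used only for this illustration.

```python
import numpy as np

N = 100000

# np.random.randn returns double-precision (float64) samples.
h_x = np.random.randn(1, N)
print(h_x.dtype, h_x.nbytes)

# Converting to float32 halves the buffer size and matches
# a `float *` argument on the device side.
h_x32 = h_x.astype(np.float32)
print(h_x32.dtype, h_x32.nbytes)
```

The shape is unchanged by the conversion; only the element type and the number of bytes differ.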

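In this example, the output array ```d_c``` is allocated explicitly. This is necessary because a kernel built with ```ElementwiseKernel``` writes its result into an array passed as an argument instead of returning a new one.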

In [0]:
d_c = gpuarray.empty_like(d_a)

Import the ```ElementwiseKernel``` class from the ```pycuda.elementwise``` module.

In [0]:
from pycuda.elementwise import ElementwiseKernel

```ElementwiseKernel``` lets you specify only the operation to be applied to each element; PyCUDA generates the surrounding kernel code, including the handling of the index ```i```. Here, to generalize Version #1, a general linear combination of ```d_a``` and ```d_b``` is computed. The resulting kernel object is stored in ```lin_comb```.
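
To make the per-element semantics concrete, here is a host-side sketch of the linear combination ```c[i] = a * x[i] + b * y[i]``` described above. It is a NumPy illustration only, not PyCUDA code: the function name ```lin_comb_reference``` and the small test arrays are hypothetical, and the explicit Python loop merely stands in for the work that the GPU distributes over many threads.

```python
import numpy as np

def lin_comb_reference(c, x, y, a, b):
    """Host-side illustration of c[i] = a * x[i] + b * y[i]."""
    # Cast the coefficients to single precision, mirroring `float a, float b`.
    a = np.float32(a)
    b = np.float32(b)
    # Treat the arrays as flat buffers: the operation is indexed linearly by i.
    c_flat = c.reshape(-1)
    x_flat = x.reshape(-1)
    y_flat = y.reshape(-1)
    for i in range(c_flat.size):
        c_flat[i] = a * x_flat[i] + b * y_flat[i]
    return c

# Small usage example with a preallocated output array.
x = np.arange(6, dtype=np.float32).reshape(1, 6)
y = np.ones((1, 6), dtype=np.float32)
out = np.empty_like(x)
lin_comb_reference(out, x, y, 2, 3)
print(out)
```

As in the GPU version, the output array is allocated by the caller and filled in place.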


In [0]:
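lin_comb = ElementwiseKernel(
        # Kernel arguments: output array, two input arrays, two scalar coefficients
        "float *d_c, float *d_a, float *d_b, float a, float b",
        # Operation applied at every index i
        "d_c[i] = a * d_a[i] + b * d_b[i]")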

Invoke ```lin_comb``` and time the call with the CUDA events created earlier.

In [5]:
start.record()
lin_comb(d_c, d_a, d_b, 2, 3)        # d_c = 2 * d_a + 3 * d_b
end.record()
end.synchronize()                    # wait until the kernel has finished
secs = start.time_till(end) * 1e-3   # time_till returns milliseconds
print("Processing time = %fs" % (secs))

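Processing time = 0.088942s

The first invocation most likely includes one-time overhead, such as generating and compiling the kernel. A second invocation reuses the compiled kernel and is much faster. The cell below is disabled by wrapping it in a string literal, since running it would overwrite ```d_c``` with different coefficients and the final check against ```2 * h_a + 3 * h_b``` would no longer hold. The time shown under it was presumably recorded while the call was active.

In [7]: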
'''start.record()
lin_comb(d_c, d_a, d_b, 1, 6)
end.record() 
end.synchronize()
secs = start.time_till(end) * 1e-3
print("Processing time = %fs" % (secs))'''

Processing time = 0.000111s


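As in the previous versions, the result is copied back to the host and checked against the same linear combination computed with NumPy.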

In [6]:
h_c = d_c.get()   # copy the result back to the host

# Compare with the same linear combination computed by NumPy
if np.array_equal(h_c, 2 * h_a + 3 * h_b):
  print("Test passed!")
else:
  print("Error!")

cuda.Context.synchronize()

Test passed!
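
The check above requires bitwise equality between the GPU and CPU results. That is not guaranteed in general, for example when the compiler fuses a multiply and an add into a single instruction with one rounding. A tolerance-based comparison is a more forgiving alternative. The sketch below is NumPy-only and does not touch the GPU: it emulates a differently rounded result by evaluating the expression in double precision and rounding once. The names ```ref32``` and ```alt``` and the tolerance values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
h_a = rng.standard_normal((1, 1000)).astype(np.float32)
h_b = rng.standard_normal((1, 1000)).astype(np.float32)

# Reference evaluated entirely in float32, as in the check above.
ref32 = 2 * h_a + 3 * h_b

# Same expression evaluated in float64 and rounded once to float32,
# standing in for a result that was rounded differently.
alt = (2 * h_a.astype(np.float64) + 3 * h_b.astype(np.float64)).astype(np.float32)

# Exact comparison may or may not succeed; the tolerance-based one should.
print("bitwise equal:", np.array_equal(ref32, alt))
print("close:", np.allclose(ref32, alt, rtol=1e-5, atol=1e-5))
```

In the notebook, the same idea would amount to comparing ```h_c``` with ```2 * h_a + 3 * h_b``` through ```np.allclose``` instead of ```np.array_equal```.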
