<a href="https://colab.research.google.com/github/geoffwoollard/gpu-speedups-mbptechtalk2020/blob/master/5_intro_pycuda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MBP Tech Talk 2020 :: Intro to PyCUDA
PyCUDA is a way to run a CUDA C kernel from Python. If you come across an existing CUDA C file and don't want to rewrite it in Python as a numba CUDA kernel, pycuda is one option.

You might also come across the `skcuda` library, which is built on top of `pycuda`. It is developed independently and includes [high level routines](https://scikit-cuda.readthedocs.io/en/latest/reference.html#high-level-routines) for linear algebra (matrix operations, SVD, PCA, eigenvectors and eigenvalues, etc).

The following notebook is based on the [pycuda tutorial](https://documen.tician.de/pycuda/index.html).

`pycuda` is not installed on Google Colab, but we can get it with `pip`:

In [0]:
!pip install pycuda

In [0]:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

In [0]:
# make a random 4x4 array and copy it over to the GPU
a = np.random.randn(4,4)
a = a.astype(np.float32) # single precision, to match the kernel's `float *` argument
a_gpu = cuda.mem_alloc(a.nbytes) # allocate memory on device
cuda.memcpy_htod(a_gpu,a) # transfer data to GPU
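
The size passed to `cuda.mem_alloc` above is `a.nbytes`, so the dtype conversion matters. As a CPU-only sketch (NumPy only, no GPU needed; the names `a64` and `a32` are just for illustration), the following shows how many bytes a 4x4 array occupies before and after the cast to `float32`:

```python
import numpy as np

a64 = np.random.randn(4, 4)    # NumPy's default dtype is float64
a32 = a64.astype(np.float32)   # 4 bytes per element, like a C `float`

# 16 elements in each array; only the per-element size differs
print(a64.dtype, a64.nbytes)
print(a32.dtype, a32.nbytes)
```

The `float32` array needs half the device memory of the `float64` one, and its layout is what a `float *` kernel argument expects.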

In [0]:
# The string below is in CUDA C. In large coding projects people separate this into a `.cu` file.
mod = SourceModule("""
  __global__ void doublify(float *a)
  {
    int idx = threadIdx.x + threadIdx.y*4; // comments allowed
    a[idx] *= 2;
  }
  """)

In [0]:
# SourceModule reports compiler errors, with error messages for troubleshooting. This cell fails on purpose.
SourceModule("""
  __global__ void doublify(float *a)
  {
    int idx = threadIdx.x + threadIdx.y*4 // missing ;
    a[idx] *= 2;
  }
  """)

In [0]:
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1), grid=(1,1,1))
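
To see how the `block=(4,4,1)` launch maps threads to array elements, here is a CPU-only sketch that emulates the kernel with plain loops. It is not how PyCUDA executes anything; `emulate_doublify` is a hypothetical helper, and each loop iteration stands in for one GPU thread with coordinates `(tx, ty)`.

```python
import numpy as np

def emulate_doublify(a, block=(4, 4, 1)):
    """CPU stand-in for the doublify kernel: one loop iteration per thread."""
    flat = a.ravel().copy()            # row-major copy, like the device buffer
    for ty in range(block[1]):         # plays the role of threadIdx.y
        for tx in range(block[0]):     # plays the role of threadIdx.x
            idx = tx + ty * 4          # same index formula as the kernel
            flat[idx] *= 2
    return flat.reshape(a.shape)

a = np.arange(16, dtype=np.float32).reshape(4, 4)
out = emulate_doublify(a)
print(out)
```

Each of the 16 `(tx, ty)` pairs produces a distinct `idx` from 0 to 15, so every element is doubled exactly once. The hard-coded `4` in the index formula is the row width, which is why this kernel only works for a 4x4 array launched with a 4x4 block.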

In [0]:
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled,a_gpu)
print(a_doubled)
print(2*a)
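
Comparing the two printouts by eye works for 16 numbers, but a programmatic check scales better. A minimal sketch with `np.allclose`, using a CPU-computed stand-in for the array copied back from the device (an assumption so that the snippet runs without a GPU):

```python
import numpy as np

a = np.random.randn(4, 4).astype(np.float32)

# Hypothetical stand-in for the result of cuda.memcpy_dtoh; computed on the CPU here.
a_doubled = a * np.float32(2)

# True when every element agrees within floating-point tolerance
matches = np.allclose(a_doubled, 2 * a)
print(matches)
```

In the notebook itself, the same `np.allclose(a_doubled, 2*a)` call can be applied directly to the array returned from the GPU.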

# `pycuda.gpuarray.GPUArray`
`pycuda.gpuarray.GPUArray` abstracts away much of the data transfer and memory allocation shown above.

In [0]:
import pycuda.gpuarray as gpuarray


In [0]:
a_gpu = gpuarray.to_gpu(np.random.randn(4,4).astype(np.float32))
a_doubled = (2*a_gpu).get()
print(a_doubled)
print(a_gpu)