# Part 1 - Programming Fancy Devices (with OpenCL)

## Setup

Before doing anything else, we need to import [PyOpenCL](https://documen.tician.de/pyopencl/) and [NumPy](http://www.numpy.org/).

In [None]:
import pyopencl
import numpy

## Build API

### Selecting the NVIDIA platform

1. Look at the platforms available
2. Select the first one, NVIDIA

In [None]:
print(pyopencl.get_platforms())
nvidia_platform = pyopencl.get_platforms()[0]
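
Platform order is driver-dependent, so index `0` is not guaranteed to be NVIDIA on every machine. A minimal sketch of selecting by name instead, assuming each platform object exposes a `name` attribute as PyOpenCL platforms do; `find_platform` and the stand-in platform names are hypothetical and only for illustration:

```python
from types import SimpleNamespace

def find_platform(platforms, keyword):
    """Return the first platform whose name contains `keyword` (case-insensitive)."""
    for platform in platforms:
        if keyword.lower() in platform.name.lower():
            return platform
    raise LookupError(f"no OpenCL platform matching {keyword!r}")

# Stand-in objects so the sketch runs without an OpenCL driver;
# with PyOpenCL installed, pass pyopencl.get_platforms() instead.
fake_platforms = [SimpleNamespace(name="Intel(R) OpenCL"),
                  SimpleNamespace(name="NVIDIA CUDA")]
print(find_platform(fake_platforms, "nvidia").name)
```

In the notebook this would be called as `find_platform(pyopencl.get_platforms(), "nvidia")`.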

### Creating a context

1. Get the devices from the platform
2. Use the devices to create the context

In [None]:
nvidia_devices = nvidia_platform.get_devices()
nvidia_context = pyopencl.Context(devices=nvidia_devices)

### Compiling a program

1. Specify the program source for a simple vector operation: $\vec{c} = \vec{a} + \vec{b}$
2. Create the program object
3. Build the program
4. Get the list of all the available kernels

In [None]:
program_source = """
kernel void sum(global float *a, 
                global float *b, 
                global float *c)
{
  int gid = get_global_id(0);
  c[gid] = a[gid] + b[gid];
}
"""
nvidia_program_source = pyopencl.Program(nvidia_context, program_source)
nvidia_program = nvidia_program_source.build()
print("Kernel Names:", nvidia_program.get_info(pyopencl.program_info.KERNEL_NAMES))
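
To see what the `sum` kernel does, it can help to emulate it on the host. The sketch below is plain Python, not OpenCL: `launch` and `sum_work_item` are hypothetical helpers that run the kernel body once per global id, one after another. A real device runs the work-items in parallel and in no guaranteed order.

```python
def sum_work_item(gid, a, b, c):
    # Mirrors the kernel body: each work-item handles exactly one element.
    c[gid] = a[gid] + b[gid]

def launch(kernel, global_size, *args):
    # Sequential stand-in for an OpenCL launch over a 1-D global size.
    for gid in range(global_size):
        kernel(gid, *args)

a_list = [1.0, 2.0, 3.0]
b_list = [10.0, 20.0, 30.0]
c_list = [0.0] * len(a_list)
launch(sum_work_item, len(a_list), a_list, b_list, c_list)
print(c_list)
```

Because no work-item reads an element that another one writes, the execution order does not affect the result.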

## Runtime API

### Creating the command queue

1. Create the queue using the existing context

In [None]:
nvidia_queue = pyopencl.CommandQueue(nvidia_context)

### Creating memory resources

(This will be explained in more detail in Part 2.)

1. Create the arrays with data in them
2. Create the OpenCL buffers

In [None]:
N = int(1e8)
a = numpy.random.rand(N).astype(numpy.float32)
b = numpy.random.rand(N).astype(numpy.float32)
c = numpy.empty_like(a)

In [None]:
a_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                  flags=pyopencl.mem_flags.READ_ONLY, 
                                  size=a.nbytes)
b_nvidia_buffer = pyopencl.Buffer(nvidia_context, 
                                  flags=pyopencl.mem_flags.READ_ONLY, 
                                  size=b.nbytes)
c_nvidia_buffer = pyopencl.Buffer(nvidia_context, 
                                  flags=pyopencl.mem_flags.WRITE_ONLY, 
                                  size=c.nbytes)
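
With `N = int(1e8)` single-precision values, each buffer is fairly large, so it is worth estimating the footprint before allocating. A rough sketch, assuming 4 bytes per `float32` and a device object exposing `max_mem_alloc_size` and `global_mem_size` as PyOpenCL devices do; `buffer_bytes`, `fits_on_device`, and the device limits below are hypothetical:

```python
from types import SimpleNamespace

def buffer_bytes(n_elements, itemsize=4):
    """Bytes needed for n_elements values (float32 is 4 bytes each)."""
    return n_elements * itemsize

def fits_on_device(nbytes_per_buffer, n_buffers, device):
    """Check one buffer against the per-allocation limit
    and all buffers together against global memory."""
    return (nbytes_per_buffer <= device.max_mem_alloc_size
            and nbytes_per_buffer * n_buffers <= device.global_mem_size)

per_buffer = buffer_bytes(int(1e8))
print(per_buffer / 1e6, "MB per buffer")
print(3 * per_buffer / 1e9, "GB for a, b and c")

# Made-up device limits, for illustration only
small_device = SimpleNamespace(max_mem_alloc_size=256 * 2**20,
                               global_mem_size=1 * 2**30)
print(fits_on_device(per_buffer, 3, small_device))
```

In the notebook, the check would be applied to each entry of `nvidia_devices` with `a.nbytes` as the per-buffer size.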

### Running the program

1. Copy the data from the arrays to the read buffers
2. Run the program
3. Read the data from the result buffer, and wait for the read to finish
4. Check the result

In [None]:
def run_gpu_program():
    #copying data onto GPU
    pyopencl.enqueue_copy(nvidia_queue,
                          src=a,
                          dest=a_nvidia_buffer)
    pyopencl.enqueue_copy(nvidia_queue,
                          src=b,
                          dest=b_nvidia_buffer)
    
    #running program
    kernel_arguments = (a_nvidia_buffer,b_nvidia_buffer,c_nvidia_buffer) 
    nvidia_program.sum(nvidia_queue,
                       a.shape, #global size
                       None, #local size
                       *kernel_arguments)

    #copying data off GPU
    copy_off_event = pyopencl.enqueue_copy(nvidia_queue,
                                           src=c_nvidia_buffer,
                                           dest=c)
    copy_off_event.wait()
    
def check_results(a, b, c):
    #element-wise comparison: a signed sum of differences can cancel out or go negative
    if not numpy.allclose(c, a + b):
        print("result does not match")
    else:
        print("result matches!")

#checking result
run_gpu_program()
check_results(a,b,c)
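
When validating device output against a host reference, the comparison should be element-wise: a signed sum of differences can report a match even when individual elements are wrong. The sketch below shows the effect with plain Python lists; `signed_sum_check` and `elementwise_check` are hypothetical helpers, and for NumPy arrays `numpy.allclose(c, a + b)` plays the same role as the element-wise version.

```python
import math

def signed_sum_check(reference, result):
    # Fragile: positive and negative errors cancel each other out.
    return sum(r - e for r, e in zip(result, reference)) <= 0.0

def elementwise_check(reference, result, rel_tol=1e-6):
    # Robust: every element must agree within a relative tolerance.
    return all(math.isclose(r, e, rel_tol=rel_tol)
               for r, e in zip(result, reference))

expected = [1.0, 2.0, 3.0]
wrong = [1.5, 1.5, 3.0]  # errors of +0.5 and -0.5 cancel in a signed sum

print(signed_sum_check(expected, wrong))   # reports a match
print(elementwise_check(expected, wrong))  # reports a mismatch
```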

## Module Challenge

Perform the vector addition example, as above, but using the Intel platform to program the instance's CPU:

In [None]:
#Building the Intel context
intel_platform = pyopencl.get_platforms()[1]
intel_devices = intel_platform.get_devices()
intel_context = pyopencl.Context(devices=intel_devices)

#Building the program
intel_program_source = pyopencl.Program(intel_context,program_source)
intel_program = intel_program_source.build()

#Memory buffers
a_intel_buffer = pyopencl.Buffer(intel_context,
                                 flags=pyopencl.mem_flags.READ_ONLY, 
                                 size=a.nbytes)
b_intel_buffer = pyopencl.Buffer(intel_context, 
                                 flags=pyopencl.mem_flags.READ_ONLY, 
                                 size=b.nbytes)
c_intel_buffer = pyopencl.Buffer(intel_context, 
                                 flags=pyopencl.mem_flags.WRITE_ONLY, 
                                 size=c.nbytes)
#Command Queue
intel_queue = pyopencl.CommandQueue(intel_context)

def run_cpu_program():
    #copying data onto CPU
    pyopencl.enqueue_copy(intel_queue,
                          src=a,
                          dest=a_intel_buffer)
    pyopencl.enqueue_copy(intel_queue,
                          src=b,
                          dest=b_intel_buffer)
    
    #running program
    kernel_arguments = (a_intel_buffer,b_intel_buffer,c_intel_buffer) 
    intel_program.sum(intel_queue,
                      a.shape, #global size
                      None, #local size
                      *kernel_arguments)

    #copying data off CPU
    copy_off_event = pyopencl.enqueue_copy(intel_queue,
                                           src=c_intel_buffer,
                                           dest=c)
    copy_off_event.wait()

#checking result
c.fill(0)  #clear any earlier result so the check exercises the CPU run
run_cpu_program()
if not numpy.allclose(c, a + b): print("result does not match")
else: print("result matches!")

### Bonus round

Compare the performance of the two versions using the `%timeit` magic function.

In [None]:
%timeit run_gpu_program()
%timeit run_cpu_program()
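
`%timeit` only works inside IPython or Jupyter. Outside a notebook, the standard-library `timeit` module gives a comparable measurement. A minimal sketch, where `time_call` and `dummy_workload` are hypothetical names and the workload is a placeholder for `run_gpu_program` or `run_cpu_program`:

```python
import timeit

def time_call(func, repeat=3, number=1):
    """Best wall-clock time in seconds for a single call of `func`."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

def dummy_workload():
    # Placeholder; in the notebook, pass run_gpu_program or run_cpu_program.
    sum(i * i for i in range(10_000))

best = time_call(dummy_workload)
print(f"best of 3: {best:.6f} s")
```

Taking the minimum over several repeats reduces the influence of one-off delays such as first-run driver initialisation.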