### Style 1

In this mode, there is a single, unified function definition. The array operations work (preferably) on 1-D arrays, which suits the functional-chain approach and can be integrated into oamap.

Each definition checks whether its inputs are GPUArrays or NumPy (CPU) arrays and performs the operation on the matching device. This avoids creating or copying a GPUArray every time the function is called.

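An example implementation of a dot() function on two arrays is shown below. Note that, as written, it returns the element-wise product of the two inputs and does not perform the final summation of a true dot product.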

In [2]:
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel
import numpy as np

In [3]:
def dot(lst1, lst2):
    if isinstance(lst1, gpuarray.GPUArray) and isinstance(lst2, gpuarray.GPUArray):  # both inputs already live on the GPU
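        assert len(lst1) == len(lst2)  # the kernel indexes both arrays with the same i
        # Element-wise kernel; PyCUDA supplies the per-element index i
        gpukern = ElementwiseKernel(
            "float *x, float *y, float *out",
            "out[i] = x[i] * y[i]",
            "gpukern",
        )
        out = gpuarray.empty_like(lst1)  # result stays on the GPU
        gpukern(lst1, lst2, out)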
        return out
    else:
        return np.multiply(lst1, lst2)


In [4]:
a = np.arange(4000000, dtype = np.float32)
b = np.arange(4000000, dtype = np.float32)

a_gpu = gpuarray.to_gpu(a.astype(np.float32))
b_gpu = gpuarray.to_gpu(b.astype(np.float32))

c = dot(a_gpu, b_gpu)  # passing a, b instead also works: the function determines the array type and picks the operation automatically
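
The same dispatch idea can be sketched so that it also runs on a machine without a GPU, and so that the kernel object is built only once instead of on every call. This is a minimal sketch and not the notebook's own implementation: the names `multiply`, `HAVE_GPU`, and `_mul_kernel` are hypothetical, and the optional import is an assumption made so the CPU path can be tried on its own.

```python
import numpy as np

try:  # PyCUDA is treated as optional in this sketch (assumption)
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel
    HAVE_GPU = True
except Exception:  # missing package, driver, or device
    gpuarray = None
    HAVE_GPU = False

_mul_kernel = None  # hypothetical module-level cache for the kernel object


def multiply(x, y):
    """Element-wise product on whichever device the inputs already live on."""
    global _mul_kernel
    if HAVE_GPU and isinstance(x, gpuarray.GPUArray) and isinstance(y, gpuarray.GPUArray):
        if _mul_kernel is None:  # build the kernel on first GPU use only
            _mul_kernel = ElementwiseKernel(
                "float *x, float *y, float *out",
                "out[i] = x[i] * y[i]",
                "mul_kernel",
            )
        out = gpuarray.empty_like(x)  # result stays on the GPU
        _mul_kernel(x, y, out)
        return out
    return np.multiply(x, y)  # CPU path for NumPy inputs


a = np.arange(8, dtype=np.float32)
b = np.arange(8, dtype=np.float32)
cpu_result = multiply(a, b)

if HAVE_GPU:
    # .get() copies the GPU result back to the host for comparison
    gpu_result = multiply(gpuarray.to_gpu(a), gpuarray.to_gpu(b)).get()
    assert np.allclose(cpu_result, gpu_result)
```

Mixed inputs (one GPUArray, one NumPy array) are not handled here; a fuller version would have to decide whether to copy one side or raise an error.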

### Limitations
This approach is, however, of limited use for GPU acceleration when the 1-D arrays are very small (fewer than roughly 50,000 to 60,000 elements). In that case the host-to-device copy overhead alone dominates the runtime of the code.
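
One way to act on this limitation is to decide per array whether a transfer to the GPU is worthwhile at all. The sketch below is an illustration only: `pick_backend` and `GPU_MIN_SIZE` are hypothetical names, and the threshold simply reuses the rough figure quoted above, which should be measured on the target hardware.

```python
import numpy as np

GPU_MIN_SIZE = 50_000  # assumed crossover point; measure on the target hardware


def pick_backend(arr, min_size=GPU_MIN_SIZE):
    """Return "gpu" for large 1-D arrays and "cpu" otherwise."""
    arr = np.asarray(arr)
    if arr.ndim == 1 and arr.size >= min_size:
        return "gpu"
    return "cpu"


small = np.arange(1_000, dtype=np.float32)
large = np.arange(4_000_000, dtype=np.float32)
print(pick_backend(small), pick_backend(large))  # prints: cpu gpu
```

A caller could then copy only the arrays tagged "gpu" to the device and leave the rest as NumPy arrays, so that small inputs never pay the copy cost.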

Furthermore, it quickly becomes complicated once multidimensional arrays are involved. This can be mitigated to some extent by using a more GPU-focused library such as PyTorch, or by vectorizing the code with Numba.