# **Performance best practices**

Array operations with GPUs can provide considerable speedups over CPU computing.

[CuPy](https://cupy.dev/) is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.

*  Most operations perform well on a GPU using CuPy. CuPy speeds up some operations more than 100X.

*  CuPy's interface is highly compatible with NumPy. CuPy supports various methods, indexing, data types, broadcasting and more. This [comparison table](https://docs.cupy.dev/en/stable/reference/comparison.html) lists NumPy functions and their corresponding CuPy implementations.
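
Because the two interfaces mirror each other, a single function can often serve both libraries. The following is a minimal sketch, assuming CuPy's `cupy.get_array_module` is used to pick the library that owns an array. The helper names `array_module` and `rms` are hypothetical, and the code falls back to NumPy alone when CuPy cannot be imported.

```python
import numpy as np

try:
    import cupy as cp  # optional: only present on machines with CuPy installed
except ImportError:
    cp = None


def array_module(x):
    """Return the library (numpy or cupy) that owns the array x."""
    if cp is not None:
        return cp.get_array_module(x)
    return np


def rms(x):
    """Root mean square, written once for both NumPy and CuPy arrays."""
    xp = array_module(x)
    return xp.sqrt(xp.mean(x * x))


cpu_result = rms(np.array([3.0, 4.0]))
print(float(cpu_result))
```

Passing a CuPy array to `rms` would run the same lines on the GPU, since `xp` then refers to `cupy`.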

[NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html) is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.



## **Benchmarking speed: NumPy vs CuPy**

In this tutorial, we will perform some operations using the NumPy and CuPy libraries and benchmark how long each takes.

In [1]:
# import libraries
import numpy as np
import cupy

Let's start by creating an array with NumPy and with CuPy and comparing the time each takes.

### **Creating an array with NumPy**

In [2]:
%%time
np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

CPU times: user 48 ms, sys: 8 ms, total: 56 ms
Wall time: 53.5 ms


### **Creating an array with CuPy**

In [3]:
%%time
cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))
# GPU kernels run asynchronously; wait for them to finish before the timer stops
cupy.cuda.Stream.null.synchronize()
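
The `synchronize()` call matters for timing: CuPy queues GPU work asynchronously, so a timer that stops before synchronization may measure little more than the kernel launch. The sketch below shows one way to fold that step into a timing helper. The name `timed` and its `sync` parameter are hypothetical, and the demonstration runs on the CPU so that it works without a GPU.

```python
import time


def timed(fn, *args, sync=None):
    """Run fn(*args) once and return (result, elapsed_seconds).

    sync is an optional zero-argument callable invoked before the timer
    stops, e.g. cupy.cuda.Stream.null.synchronize for GPU work.
    """
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()  # block until queued asynchronous work has finished
    elapsed = time.perf_counter() - start
    return result, elapsed


# CPU-only demonstration so the snippet runs without a GPU
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.6f}s")
```

With CuPy available, the equivalent call would be `timed(cupy.random.random, (1000, 1000), sync=cupy.cuda.Stream.null.synchronize)`.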

In [4]:
# print the numpy array
import numpy as np
np_var1 = np.random.random((1000, 1000))
np_var1 

array([[0.86139284, 0.51968382, 0.64682396, ..., 0.5719297 , 0.02804524,
        0.90956737],
       [0.77313878, 0.49454304, 0.42123998, ..., 0.90027678, 0.47534394,
        0.72142304],
       [0.25829272, 0.36737239, 0.20694988, ..., 0.95112   , 0.69233084,
        0.29654616],
       ...,
       [0.58576955, 0.01539602, 0.47200234, ..., 0.22742476, 0.31980811,
        0.66098713],
       [0.27765078, 0.99942745, 0.39060747, ..., 0.79641963, 0.53399145,
        0.39524481],
       [0.95848978, 0.72407359, 0.43415957, ..., 0.04734325, 0.59349766,
        0.78951583]])

In [5]:
# print the cupy array
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var1

array([[0.50679817, 0.88649618, 0.76166695, ..., 0.76044118, 0.59973805,
        0.02745095],
       [0.91170895, 0.91294462, 0.21397125, ..., 0.27013075, 0.90983984,
        0.2507654 ],
       [0.97663516, 0.36242495, 0.98456868, ..., 0.99710868, 0.99481914,
        0.39645813],
       ...,
       [0.67855648, 0.97348506, 0.31238627, ..., 0.48497102, 0.19246986,
        0.41768917],
       [0.87243899, 0.34477809, 0.83757912, ..., 0.25456894, 0.62435893,
        0.56858035],
       [0.54647418, 0.0370549 , 0.60770487, ..., 0.40473857, 0.47062977,
        0.80280961]])

### **Trigonometric functions**

Let's do some trigonometric operations and compare the time.

In [6]:
import numpy as np

np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

%timeit bool((np.sin(np_var1) ** 2 + np.cos(np_var1) ** 2 == 1).all())

165 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))
cupy.cuda.Stream.null.synchronize()

%timeit bool((cupy.sin(cp_var2) ** 2 + cupy.cos(cp_var2) ** 2 == 1).all())

63.3 ms ± 5.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
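
The timed expression compares each element with 1 exactly. That is fine as a workload for benchmarking, but as a correctness check it can fail for elements affected by floating-point rounding. A tolerance-based comparison is a common alternative. The sketch below uses `numpy.allclose` (CuPy provides a function of the same name), with a fixed seed chosen only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.random((1000, 1000))

identity = np.sin(x) ** 2 + np.cos(x) ** 2

# Exact equality may fail for elements affected by rounding error
exact_ok = bool((identity == 1).all())

# A tolerance-based check accepts differences at the level of float64 rounding
close_ok = bool(np.allclose(identity, 1.0))

print("exact:", exact_ok, "| allclose:", close_ok)
```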


### **Multiplying the array**

Let's do some arithmetic operations and compare the time.

In [8]:
import numpy as np

np_var1 = np.random.random((1000, 1000))

np_var1 *= 10

In [9]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))

cp_var1 *= 10
cupy.cuda.Stream.null.synchronize()

### **Performing multiple operations on the array**

In [10]:
import numpy as np

np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

np_var1 *= 10
np_var2 *= 20
np_var1 *= np_var2

In [11]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))

cp_var1 *= 10
cp_var2 *= 20
cp_var1 *= cp_var2
cupy.cuda.Stream.null.synchronize()
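
The arithmetic cells above run the operations without measuring them. One possible way to time them is sketched here, assuming `timeit.repeat` for the NumPy side and `cupyx.profiler.benchmark` for the CuPy side. The function name `scale_and_multiply` is hypothetical, and the CuPy part is skipped when CuPy or a usable GPU is not available.

```python
import timeit

import numpy as np


def scale_and_multiply(xp):
    """Same steps as the cells above, written against an array module xp."""
    a = xp.random.random((1000, 1000))
    b = xp.random.random((1000, 1000))
    a *= 10
    b *= 20
    a *= b
    return a


# NumPy: best of 5 runs, one call per run
numpy_best = min(timeit.repeat(lambda: scale_and_multiply(np), number=1, repeat=5))
print(f"NumPy best of 5: {numpy_best:.6f}s")

try:
    import cupy as cp
    from cupyx.profiler import benchmark

    # benchmark performs warm-up runs and reports both CPU and GPU times
    print(benchmark(scale_and_multiply, (cp,), n_repeat=20))
except Exception as exc:  # CuPy missing or no usable GPU
    print(f"Skipping CuPy timing: {exc}")
```

Note that this version times array creation together with the arithmetic. Creating the arrays outside the timed function would isolate the multiplications.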

# **Exercise**
You can find other operations in [the comparison table](https://docs.cupy.dev/en/stable/reference/comparison.html).

Implement each of the following operations with both NumPy and CuPy, then benchmark the times.
- numpy.absolute | cupy.absolute
- numpy.median | cupy.median
- numpy.resize | cupy.resize
- numpy.shape | cupy.shape
- numpy.ndarray.fill() | cupy.ndarray.fill()

In [14]:
import numpy as np
import cupy as cp
import time
import warnings
warnings.filterwarnings("ignore")

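def benchmark(label, numpy_func, cupy_func, *args, **kwargs):
    """Time a single call of numpy_func and cupy_func on the same arguments."""
    # NumPy: perf_counter is a monotonic, high-resolution clock
    start = time.perf_counter()
    numpy_result = numpy_func(*args, **kwargs)
    numpy_time = time.perf_counter() - start

    # CuPy: copy NumPy inputs to the GPU before starting the timer,
    # so the host-to-device transfer is not counted
    cp_args = [cp.asarray(arg) if isinstance(arg, np.ndarray) else arg for arg in args]
    start = time.perf_counter()
    cupy_result = cupy_func(*cp_args, **kwargs)
    # Kernels run asynchronously; wait for them before stopping the timer
    cp.cuda.Device(0).synchronize()
    cupy_time = time.perf_counter() - start

    print(f"{label} | NumPy: {numpy_time:.6f}s | CuPy: {cupy_time:.6f}s")
    return numpy_result, cupy_result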

In [17]:
# Generate large random array
np_data = np.random.randn(1000000).astype(np.float32)

In [19]:
# 1. absolute
benchmark("absolute", np.absolute, cp.absolute, np_data)

absolute | NumPy: 0.003473s | CuPy: 0.009852s


(array([0.29724455, 0.08564132, 0.8278056 , ..., 0.8533848 , 1.3613843 ,
        1.0353965 ], dtype=float32),
 array([0.29724455, 0.08564132, 0.8278056 , ..., 0.8533848 , 1.3613843 ,
        1.0353965 ], dtype=float32))

In [20]:
# 2. median
benchmark("median", np.median, cp.median, np_data)

median | NumPy: 0.033503s | CuPy: 1.756849s


(-0.0006930663, array(-0.00069307, dtype=float32))

In [21]:
# 3. resize
benchmark("resize", np.resize, cp.resize, np_data, (500000,))

resize | NumPy: 0.001640s | CuPy: 0.787733s


(array([-0.29724455, -0.08564132,  0.8278056 , ..., -1.3116686 ,
        -1.6371326 ,  0.5224464 ], dtype=float32),
 array([-0.29724455, -0.08564132,  0.8278056 , ..., -1.3116686 ,
        -1.6371326 ,  0.5224464 ], dtype=float32))

In [22]:
# 4. shape (accessed directly, not a function)
print("shape | NumPy:", np_data.shape, "| CuPy:", cp.asarray(np_data).shape)

shape | NumPy: (1000000,) | CuPy: (1000000,)


In [23]:
# 5. ndarray.fill()
# NumPy
start = time.perf_counter()
np_fill = np.empty((1000000,), dtype=np.float32)
np_fill.fill(7)
numpy_fill_time = time.perf_counter() - start

# CuPy
start = time.perf_counter()
cp_fill = cp.empty((1000000,), dtype=cp.float32)
cp_fill.fill(7)
# Wait for the GPU to finish before stopping the timer
cp.cuda.Device(0).synchronize()
cupy_fill_time = time.perf_counter() - start

print(f"fill() | NumPy: {numpy_fill_time:.6f}s | CuPy: {cupy_fill_time:.6f}s")

fill() | NumPy: 0.003195s | CuPy: 0.764591s
