# **Performance best practices**

Array operations with GPUs can provide considerable speedups over CPU computing.

[CuPy](https://cupy.dev/) is an open-source array library for GPU-accelerated computing with Python. CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.

*  Most operations perform well on a GPU using CuPy. CuPy speeds up some operations more than 100X.

*  CuPy's interface is highly compatible with NumPy. CuPy supports various methods, indexing, data types, broadcasting and more. This [comparison table](https://docs.cupy.dev/en/stable/reference/comparison.html) shows a list of NumPy and their corresponding CuPy implementations.

[NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html) is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.



## **Benchmarking speed- NumPy vs CuPy**

In this tutorial, we will perform some opeations using NumPy and CuPy library and we will benchmark the time.

In [1]:
# import libraries
import numpy as np
import cupy

Let's start with creating an array using NumPy and CuPy and compare the time.

### **Creating an array with NumPy**

In [2]:
%%time
np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

CPU times: user 52 ms, sys: 36 ms, total: 88 ms
Wall time: 89.4 ms


### **Creating and array with CuPy**

In [3]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))
cupy.cuda.Stream.null.synchronize()

In [4]:
# print the numpy array
import numpy as np
np_var1 = np.random.random((1000, 1000))
np_var1 

array([[0.35063281, 0.80893198, 0.91453072, ..., 0.60154181, 0.01145897,
        0.14707778],
       [0.09622752, 0.73849258, 0.60941948, ..., 0.45911144, 0.07324936,
        0.54811486],
       [0.63746221, 0.41509159, 0.48774599, ..., 0.43642978, 0.46748861,
        0.14921748],
       ...,
       [0.88984273, 0.30721908, 0.77000141, ..., 0.8578446 , 0.85904314,
        0.33674486],
       [0.24330214, 0.42415694, 0.80291595, ..., 0.20949958, 0.58662834,
        0.95382801],
       [0.78260394, 0.61864832, 0.81977013, ..., 0.00658037, 0.26171171,
        0.83665921]])

In [5]:
# print the cupy array
import cupy

cp_var1 = cupy.random.random((10000, 10000))
cp_var1

array([[0.65963383, 0.0358734 , 0.07628207, ..., 0.48937246, 0.52197267,
        0.33891188],
       [0.3073909 , 0.87452781, 0.07990874, ..., 0.68921197, 0.41215489,
        0.19447408],
       [0.36956123, 0.0063404 , 0.43942412, ..., 0.94645865, 0.50385998,
        0.87317452],
       ...,
       [0.41483496, 0.61972582, 0.18360737, ..., 0.18216225, 0.45737958,
        0.45824428],
       [0.53804461, 0.49404657, 0.85041006, ..., 0.52874339, 0.82241322,
        0.55631528],
       [0.11334664, 0.23737837, 0.36879398, ..., 0.88635032, 0.26332911,
        0.00368526]])

### **Trigonometric function**

Let's do some trigonometric operations and compare the time.

In [6]:
import numpy as np

np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

%timeit bool((np.sin(np_var1) ** 2 + np.cos(np_var1) ** 2 == 1).all())

164 ms ± 176 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))
cupy.cuda.Stream.null.synchronize()

%timeit bool((cupy.sin(cp_var2) ** 2 + cupy.cos(cp_var2) ** 2 == 1).all())

76.5 ms ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### **Multiplying the array**

Let's do some aritmetic operations and compare the time.

In [8]:
import numpy as np

np_var1 = np.random.random((1000, 1000))

np_var1 *= 10

In [9]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))

cp_var1 *= 10
cupy.cuda.Stream.null.synchronize()

### **Performing multiple operations on the array**

In [10]:
import numpy as np

np_var1 = np.random.random((1000, 1000))
np_var2 = np.random.random((1000, 1000))

np_var1 *= 10
np_var2 *= 20
np_var1 *= np_var2

In [11]:
import cupy

cp_var1 = cupy.random.random((1000, 1000))
cp_var2 = cupy.random.random((1000, 1000))

cp_var1 *= 10
cp_var2 *= 20
cp_var1 *= cp_var2
cupy.cuda.Stream.null.synchronize()

# **Exercise**
You can also find other operations from [the comparison table](https://docs.cupy.dev/en/stable/reference/comparison.html).

Use the following operations do NumPy and CuPy implementation and then benchmark the times. 
- numpy.absolute | cupy.absolute
- numpy.median | cupy.median
- numpy.resize | cupy.resize
- numpy.shape | cupy.shape
- numpy.ndarray.fill() | cupy.ndarray.fill()

In [12]:
import numpy as np
%time
np_var1 = np.absolute((1000, 1000))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 20.5 µs


In [13]:
import cupy

%time
# Create a cupy array from the tuple and apply absolute value
cp_var1 = cupy.absolute(cupy.array([1000, 1000]))
cupy.cuda.Stream.null.synchronize()

print(cp_var1)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21 µs
[1000 1000]


In [14]:
import numpy as np
%time
np_var1 = np.median((1000, 1000))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.2 µs


In [15]:
import cupy

%time
# Create a cupy array from the tuple and apply absolute value
cp_var1 = cupy.median(cupy.array([1000, 1000]))
cupy.cuda.Stream.null.synchronize()

print(cp_var1)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 24.1 µs
1000.0


In [16]:
import numpy as np
%time
np_var1 = np.resize(np.zeros((1000, 1000)), (2000, 2000))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.7 µs


In [17]:
import cupy

%time
# Create a Cupy array
cp_var1 = cupy.array(np.zeros((1000, 1000)))

# Resize the array to the new shape (for example, 2000x2000)
cp_var1_resized = cupy.resize(cp_var1, (2000, 2000))

cupy.cuda.Stream.null.synchronize()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.2 µs


In [18]:
import numpy as np
%time
np_var1 = np.shape((1000, 1000))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 24.6 µs


In [19]:
import cupy
%time
# Create a cupy array from the tuple and apply absolute value
cp_var1 = cupy.shape(cupy.array([1000, 1000]))
cupy.cuda.Stream.null.synchronize()

print(cp_var1)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 26.2 µs
(2,)


In [20]:
import numpy as np

%time
# Create an empty 1000x1000 array
np_var1 = np.empty((1000, 1000))

# Fill the array with a specific value (e.g., 5)
np_var1.fill(5)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 21.5 µs


In [21]:
import cupy

%time
# Create a 1000x1000 cupy array
cp_var1 = cupy.empty((1000, 1000))

# Fill the array with a specific value (e.g., 5)
cp_var1.fill(5)

# Or, use cupy.full() to directly create the array filled with the value
cp_var2 = cupy.full((1000, 1000), 5)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 22.6 µs
