# What is CuPy?
CuPy is a library that implements Numpy arrays on Nvidia GPUs by using the CUDA GPU library. 
<br>
The many CUDA cores GPUs have allows significant parallel speedup.
<br>
It shares a similar interface/API as Numpy.
<br>
It supports most of the array operations that Numpy has such as:
- Indexing
- Broadcasting
- Math on arrays
- Various matrix transformations

Able to write custom Python code that uses CUDA and GPU speedups.
-	Need small snippet of code in C++ format
-	CuPy will automatically do the GPU conversion (similar to Cython)

In [None]:
# pip install cupy

# Demo
Note: each operation is run thrice.

In [1]:
import numpy as np
import cupy as cp
import time

To switch between Numpy and CuPy, replace Numpy's `np` with CuPy’s `cp`.

## Array Creation
This creates a 3D array with 1 Billion 1’s for both Numpy and CuPy.

In [4]:
# Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,1000,100))
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,1000,100))
cp.cuda.Stream.null.synchronize() # new!
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.21465325355529785
0.014160394668579102
15.158705571363628


Cupy part has additionl line after initialization of cupy array. It ensures that code finishes executing on GPU before going to the next line.

- Numpy took 0.21 seconds
- CuPy took 0.014 seconds

## Mathematical Operations 
Multiply entire array by 5 and check the speed of Numpy vs CuPy.

In [7]:
# Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.13804173469543457
0.017012357711791992
8.11420362973863


- Numpy took 0.14
- CuPy took 0.017

## Working with Multiple Arrays
1. Multiple array by 5
2. Multiple array by itself
3. Add array to itself

In [10]:
# Numpy and CPU
s = time.time()
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu
cp.cuda.Stream.null.synchronize()
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.38594937324523926
0.05420041084289551
7.120783168303766


- Numpy took 0.39 seconds on CPU
- CuPy took 0.054 seconds on GPU

# Is CuPy Always Faster?
Speedups are dependant on size of the array.
<br>

## n = 100,00,00

In [13]:
# Numpy and CPU
s = time.time()
x_cpu = np.ones((100,100,100))
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu = cp.ones((100,100,100))
cp.cuda.Stream.null.synchronize() # new!
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.003010988235473633
0.0007617473602294922
3.952738654147105


In [16]:
# Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.002004861831665039
0.0009992122650146484
2.006442376521117


## n = 1000,00,00

In [19]:
# Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,100,100))
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,100,100))
cp.cuda.Stream.null.synchronize() # new!
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.02737903594970703
0.001996755599975586
13.711761194029851


In [22]:
# Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.02737903594970703
0.002001523971557617
13.679094699225729


# n = 1000,000,00

In [25]:
# Numpy and CPU
s = time.time()
x_cpu = np.ones((1000,1000,100))
e = time.time()
cpu_time = e - s
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu = cp.ones((1000,1000,100))
cp.cuda.Stream.null.synchronize() # new!
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

# Numpy and CPU
s = time.time()
x_cpu *= 5
e = time.time()
print(cpu_time)

# CuPy and GPU
s = time.time()
x_gpu *= 5
cp.cuda.Stream.null.synchronize()
e = time.time()
gpu_time = e - s
print(gpu_time)

print(cpu_time/gpu_time)

0.2261803150177002
0.013798236846923828
16.391972215502644
0.2261803150177002
0.017035484313964844
13.277011140345966


The table shows difference in speed with varying size of the array:

| Operation | Array Size | Speedup |
| --- | ----------- | --- |
| Create | 100,00,00 | 3.95 |
| Multiple by 5 | 100,00,00 | 2.01 |
| Create | 1000,00,00 | 13.7 |
| Multiple by 5 | 1000,00,00 | 13.7 |
| Create | 1000,000,00 | 16.4 |
| Multiple by 5 | 1000,000,00 | 13.3 |

The speedup drastically kicks up once processing about 10 million data points and gets much faster crossing the 100 million points mark. Below that, Numpy is faster.

Surprisingly, when n = 1000,000,00, cupy will always be slower than numpy on the first run. Cupy becomes faster from the second run onwards.

# Reference
https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56