# Benchmarking affine transforms using scipy, cupy and clEsperanto
Here we compare the performance of affine transforms implemented in [cupy](https://cupy.dev), [scipy](https://scipy.org) and [clEsperanto](https://github.com/clEsperanto/pyclesperanto_prototype/tree/master).

In [1]:
import pyclesperanto_prototype as cle
from skimage.io import imread, imshow
import numpy as np
import time
import cupy
from cupyx.scipy import ndimage as ndi
from scipy import ndimage as sndi
import stackview

In [2]:
cle.available_device_names()

['NVIDIA GeForce RTX 3050 Ti Laptop GPU',
 'gfx1035',
 'cupy backend (experimental)']

In [3]:
# To measure kernel execution time properly, we need to set this flag. It slows down the execution of workflows a bit, though.
cle.set_wait_for_kernel_finish(True)

# Select a GPU whose name contains the following string. This falls back to any other GPU if none with this name is found.
cle.select_device('TX')

<NVIDIA GeForce RTX 3050 Ti Laptop GPU on Platform: NVIDIA CUDA (1 refs)>

In [4]:
image = imread('../03b_image_processing/data/Haase_MRT_tfl3d1.tif')

In [5]:
stackview.insight(image[96])

| property | value |
|---|---|
| shape | (256, 256) |
| dtype | uint8 |
| size | 64.0 kB |
| min | 0 |
| max | 255 |


In [6]:
# scaling by factor
s = 0.5
matrix = np.asarray([
    [1/s, 0, 0, 0],
    [0, 1/s, 0, 0],
    [0, 0, 1/s, 0],
    [0, 0, 0, 1],
])
output_shape = tuple((np.asarray(image.shape) * s).astype(int))
print(output_shape)

(96, 128, 128)
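
The matrix above contains `1/s` rather than `s` because `scipy.ndimage.affine_transform` (and its cupyx counterpart) expects a matrix that maps output coordinates back to input coordinates. The clEsperanto cell further down passes `np.linalg.inv(matrix)` instead, which suggests it expects the transform in the opposite direction; that reading is inferred from this notebook, not from the clEsperanto documentation. The following sketch only uses numpy to illustrate the two directions. The names `output_to_input`, `input_to_output`, `voxel_out` and `voxel_in` are made up for this illustration.

```python
import numpy as np

s = 0.5  # scaling factor, as in the cell above

# Matrix as passed to scipy / cupyx: maps OUTPUT coordinates to INPUT coordinates
output_to_input = np.diag([1 / s, 1 / s, 1 / s, 1.0])

# Inverse matrix, as passed to clEsperanto in this notebook
input_to_output = np.linalg.inv(output_to_input)

# A voxel position in the scaled image, in homogeneous coordinates (z, y, x, 1)
voxel_out = np.array([48, 64, 64, 1])

# Position in the original image that is sampled for this output voxel
voxel_in = output_to_input @ voxel_out
print("sampled input position:", voxel_in[:3])

# Applying the inverse matrix brings us back to the output voxel
print("back in output space:", (input_to_output @ voxel_in)[:3])
```

With `s = 0.5`, every output voxel samples the input at twice its own coordinates, which is why the output image has half the size along each axis.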


## cupy

In [7]:
cuda_image = cupy.asarray(image)

cuda_scaled = cupy.ndarray(output_shape)
for i in range(0, 10):
    start_time = time.time()
    ndi.affine_transform(cuda_image, cupy.asarray(matrix), output=cuda_scaled, output_shape=output_shape)
    cupy.cuda.stream.get_current_stream().synchronize() # we need to wait here to measure time properly
    print("cupy affine transform duration: " + str(time.time() - start_time))
          
result = cupy.asnumpy(cuda_scaled)
stackview.insight(result[48])

cupy affine transform duration: 0.13012480735778809
cupy affine transform duration: 0.03827500343322754
cupy affine transform duration: 0.03521728515625
cupy affine transform duration: 0.030150413513183594
cupy affine transform duration: 0.030646800994873047
cupy affine transform duration: 0.02948594093322754
cupy affine transform duration: 0.030394315719604492
cupy affine transform duration: 0.0323793888092041
cupy affine transform duration: 0.032364606857299805
cupy affine transform duration: 0.02682209014892578


| property | value |
|---|---|
| shape | (128, 128) |
| dtype | float64 |
| size | 128.0 kB |
| min | -1.814630295061106e-06 |
| max | 255.00002926477683 |


## clEsperanto

In [8]:
ocl_image = cle.push(image)

ocl_scaled = cle.create(output_shape)
for i in range(0, 10):
    start_time = time.time()
    cle.affine_transform(ocl_image, ocl_scaled, transform=np.linalg.inv(matrix), linear_interpolation=True)
    print("clEsperanto affine transform duration: " + str(time.time() - start_time))

result = cle.pull(ocl_scaled)
stackview.insight(result[48])

clEsperanto affine transform duration: 0.01579570770263672
clEsperanto affine transform duration: 0.005071163177490234
clEsperanto affine transform duration: 0.00540924072265625
clEsperanto affine transform duration: 0.004240274429321289
clEsperanto affine transform duration: 0.00461268424987793
clEsperanto affine transform duration: 0.005407094955444336
clEsperanto affine transform duration: 0.00403904914855957
clEsperanto affine transform duration: 0.005263566970825195
clEsperanto affine transform duration: 0.0051195621490478516
clEsperanto affine transform duration: 0.004664421081542969


| property | value |
|---|---|
| shape | (128, 128) |
| dtype | float32 |
| size | 64.0 kB |
| min | 0.0 |
| max | 255.0 |


## Scipy

In [9]:
scaled = np.ndarray(output_shape)
for i in range(0, 10):
    start_time = time.time()
    sndi.affine_transform(image, matrix, output=scaled, output_shape=output_shape)
    print("scipy affine transform duration: " + str(time.time() - start_time))

stackview.insight(scaled[48])

scipy affine transform duration: 1.724757194519043
scipy affine transform duration: 1.7545890808105469
scipy affine transform duration: 1.785327434539795
scipy affine transform duration: 1.8502540588378906
scipy affine transform duration: 1.8263814449310303
scipy affine transform duration: 1.8311867713928223
scipy affine transform duration: 1.7970592975616455
scipy affine transform duration: 1.7730398178100586
scipy affine transform duration: 1.8295350074768066
scipy affine transform duration: 1.84840989112854


| property | value |
|---|---|
| shape | (128, 128) |
| dtype | float64 |
| size | 128.0 kB |
| min | -2.2358752113995016e-15 |
| max | 255.00000000000014 |
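
To put the three series of timings side by side, the sketch below computes a median per library and the resulting speedup relative to scipy. The durations are copied from the outputs above and rounded to five decimals. Skipping the first call of each series is our own choice, made because the first GPU call is visibly slower, presumably due to one-off initialization. Keep in mind that the comparison is not strictly like-for-like: as far as we know, scipy and cupyx default to cubic spline interpolation (`order=3`), while the clEsperanto call uses linear interpolation.

```python
from statistics import median

# Durations in seconds, copied from the outputs above and rounded to five decimals
durations = {
    "cupy": [0.13012, 0.03828, 0.03522, 0.03015, 0.03065,
             0.02949, 0.03039, 0.03238, 0.03236, 0.02682],
    "clEsperanto": [0.01580, 0.00507, 0.00541, 0.00424, 0.00461,
                    0.00541, 0.00404, 0.00526, 0.00512, 0.00466],
    "scipy": [1.72476, 1.75459, 1.78533, 1.85025, 1.82638,
              1.83119, 1.79706, 1.77304, 1.82954, 1.84841],
}

# Skip the first call of each series (assumed warm-up) and take the median of the rest
steady = {name: median(values[1:]) for name, values in durations.items()}

for name, seconds in steady.items():
    speedup = steady["scipy"] / seconds
    print(f"{name}: median {seconds * 1000:.1f} ms, {speedup:.0f}x relative to scipy")
```

The median is used instead of the mean so that a single slow call has little influence on the reported number.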


## Exercise
Run the benchmark using different input sizes. Make the input image much smaller, e.g. by keeping only every 2nd, 3rd or 4th voxel in X, Y and Z (reducing the number of voxels by a factor of about 8, 27 or 64). In which cases does it make sense to use a GPU, and in which does it not?
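
One possible starting point for this exercise is sketched below. It subsamples an image with strided slicing and times an arbitrary function with `time.perf_counter`. The helper `benchmark` is hypothetical and not part of any of the libraries used here. The synthetic `image` is a stand-in whose shape `(192, 256, 256)` is an assumption inferred from the printed output shape, and `small.mean()` is only a placeholder workload to be replaced by the affine transform calls from the cells above.

```python
import time
import numpy as np

def benchmark(func, repeats=10):
    """Call func() `repeats` times and return the durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations

# Synthetic stand-in for the MRT dataset (assumed shape, see text above)
image = np.zeros((192, 256, 256), dtype=np.uint8)

for step in (1, 2, 3, 4):
    # Keep every `step`-th voxel along Z, Y and X
    small = image[::step, ::step, ::step]
    reduction = image.size / small.size

    # Placeholder workload; replace with one of the affine transform calls
    durations = benchmark(lambda: small.mean(), repeats=3)

    print(f"step {step}: shape {small.shape}, "
          f"{reduction:.1f}x fewer voxels, fastest run {min(durations):.6f} s")
```

When timing GPU code this way, the function passed to `benchmark` has to wait for the GPU to finish, for example by synchronizing the cupy stream or by enabling `cle.set_wait_for_kernel_finish(True)` as done earlier in this notebook.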

## Exercise
Go back two weeks to the [exercise where we compared Voronoi-Otsu-Labeling in two libraries](https://github.com/ScaDS/BIDS-lecture-2024/blob/main/04a_image_segmentation/11_voronoi_otsu_labeling.ipynb). Benchmark these two functions. How much faster is one than the other on your laptop? How much faster is it on clara?