# clEsperanto: Introduction

The Python package `pyclesperanto` can be installed using `pip` or `mamba`. If the installation went well, importing the package should not raise any error.

In [66]:
import pyclesperanto as cle
import numpy as np

`clEsperanto` runs on __OpenCL__ (Open Computing Language), an open standard for parallel programming across diverse Processing Units, whether Graphics (GPU) or Central (CPU). Its main strength is its compatibility with a wide range of devices. At import, clesperanto automatically selects a device (the first one it finds). If you have more than one device, that may be neither the best one nor the one you want.

Hence, the first step is to survey your hardware and identify the best device to run our operations on. `clesperanto` provides the following functions to query and manage the Processing Units of your system.

### Exercise 1: System specification

Use `cle.info()` to fetch your full system specification. How many devices are available? Which device is best suited to your needs?

Here are a few quick definitions:

- __GLOBAL_MEM_SIZE__: Total memory of the device, in bytes
- __MAX_MEM_ALLOC_SIZE__: Maximum size of a single memory allocation, in bytes
- __MAX_COMPUTE_UNITS__: Number of parallel compute units of the Processing Unit (CPU cores, or GPU multiprocessors)
- __MAX_CLOCK_FREQUENCY__: Maximum clock frequency of the compute units, in MHz
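To make these numbers concrete, here is a small sketch that converts raw byte counts into GiB and into a number of `float32` pixels (4 bytes each). The memory values are placeholders: `GLOBAL_MEM_SIZE` reuses the `totalGlobalMem` that CuPy reports for the RTX 4090 in the CuPy section of this notebook, and `MAX_MEM_ALLOC_SIZE` simply assumes a quarter of it. Replace both with the values printed by `cle.info()` on your machine.

```python
# Placeholder values: replace them with the numbers printed by cle.info().
GLOBAL_MEM_SIZE = 25_358_630_912            # bytes (totalGlobalMem reported by CuPy for an RTX 4090)
MAX_MEM_ALLOC_SIZE = GLOBAL_MEM_SIZE // 4   # bytes, assumption: a quarter of the global memory
BYTES_PER_FLOAT32 = 4                       # pushed float64 data ends up as float32 on the device


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


def max_float32_elements(n_bytes: int) -> int:
    """Number of float32 values that fit in n_bytes."""
    return n_bytes // BYTES_PER_FLOAT32


def max_cube_side(n_bytes: int) -> int:
    """Side length of the largest cubic float32 volume that fits in n_bytes."""
    n = max_float32_elements(n_bytes)
    side = round(n ** (1 / 3))
    # Correct floating-point rounding so that side**3 <= n < (side + 1)**3
    while side**3 > n:
        side -= 1
    while (side + 1) ** 3 <= n:
        side += 1
    return side


print(f"Global memory : {bytes_to_gib(GLOBAL_MEM_SIZE):.2f} GiB")
print(f"Max allocation: {bytes_to_gib(MAX_MEM_ALLOC_SIZE):.2f} GiB")
print(f"Largest float32 cube in one allocation: {max_cube_side(MAX_MEM_ALLOC_SIZE)}^3 voxels")
```

A single allocation is bounded by MAX_MEM_ALLOC_SIZE, not by GLOBAL_MEM_SIZE, which is why the cube is computed from the former.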

In [67]:
# TODO - Check your system information

### Exercise 2: Select a device

Use `cle.select_device()` to select a specific device on your system.
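One possible approach is to list the device names and pick the first one matching a keyword you prefer. The sketch below assumes that `cle.list_available_devices()` returns a list of device names, that `cle.select_device()` accepts such a name, and that `cle.get_device()` returns the active device; check `help(cle.select_device)` for your installed version. The helpers `pick_device_name` and `select_preferred_device` are hypothetical and only illustrate the selection logic.

```python
from typing import Optional, Sequence


def pick_device_name(names: Sequence[str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first device name containing one of the preferred keywords.

    Keywords are tried in order of preference and matched case-insensitively.
    Falls back to the first listed device, or None if the list is empty.
    """
    for keyword in preferred:
        for name in names:
            if keyword.lower() in name.lower():
                return name
    return names[0] if names else None


def select_preferred_device(preferred=("RTX", "NVIDIA", "AMD", "Intel")):
    """Select a device with pyclesperanto (requires an OpenCL device)."""
    import pyclesperanto as cle

    # Assumption: list_available_devices() returns the device names as strings.
    names = cle.list_available_devices()
    name = pick_device_name(names, preferred)
    if name is not None:
        cle.select_device(name)
    return cle.get_device()
```

On a machine with OpenCL devices, `print(select_preferred_device())` should show the device that was activated. Keeping the matching logic in a plain function makes it easy to adapt the preference order without touching the library calls.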

In [68]:
# TODO - select a specific device

Device memory and main (host) memory are separate: before a device can access data, the data must be transferred from main memory to device memory, and the results must be transferred back to main memory once all processing is done.

Because of this, `clesperanto` comes with its own array type, compatible with `numpy` arrays, and a set of functions to send data to and read data from the device:
- `push()`: send an array from CPU to GPU
- `pull()`: read an array from GPU back to CPU
- `create()`: allocate an empty array on the GPU

In [69]:
np.random.seed(0)
np_arr = np.random.rand(5,10)  # CPU array
gpu_arr = cle.push(np_arr)    # GPU array
out_arr = cle.pull(gpu_arr)   # CPU array

print(f"np_arr:  shape={np_arr.shape}, dtype={np_arr.dtype}, device={np_arr.device}, type={type(np_arr)}")
print(f"gpu_arr: shape={gpu_arr.shape}, dtype={gpu_arr.dtype}, device={gpu_arr.device.name}, type={type(gpu_arr)}")
print(f"out_arr: shape={out_arr.shape}, dtype={out_arr.dtype}, device={out_arr.device}, type={type(out_arr)}")

np_arr:  shape=(5, 10), dtype=float64, device=cpu, type=<class 'numpy.ndarray'>
gpu_arr: shape=(5, 10), dtype=float32, device=NVIDIA GeForce RTX 4090, type=<class 'pyclesperanto._pyclesperanto._Array'>
out_arr: shape=(5, 10), dtype=float32, device=cpu, type=<class 'numpy.ndarray'>
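Note the `dtype` change in the output above: the `float64` NumPy array comes back as `float32`, so the round trip is not lossless. The sketch below reproduces that conversion with NumPy alone (no GPU needed), using `astype(np.float32)` as a stand-in for what `push()` appears to do, and measures the rounding error it introduces.

```python
import numpy as np

np.random.seed(0)
np_arr = np.random.rand(5, 10)            # float64, as in the cell above
as_f32 = np_arr.astype(np.float32)        # stand-in for the float64 -> float32 conversion

# Largest absolute rounding error introduced by the conversion
max_err = np.max(np.abs(np_arr - as_f32.astype(np.float64)))
print(f"element (0, 1) as float64: {np_arr[0, 1]!r}")
print(f"element (0, 1) as float32: {as_f32[0, 1]!r}")
print(f"max abs error: {max_err:.2e}")

# Converting back restores the dtype, but not the precision lost on the way.
restored = as_f32.astype(np_arr.dtype)
```

For values in [0, 1) the error stays below about 6e-8, which is negligible for most image processing but worth knowing when comparing GPU and CPU results with exact equality.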


In [70]:
print(gpu_arr)
print("\ngpu arr(0,0)=", gpu_arr[0,0])  # Access element (0, 0); clesperanto returns it as a 1x1 array

[[0.5488135  0.71518934 0.60276335 0.5448832  0.4236548  0.6458941
  0.4375872  0.891773   0.96366274 0.3834415 ]
 [0.79172504 0.5288949  0.56804454 0.92559665 0.07103606 0.0871293
  0.0202184  0.83261985 0.77815676 0.87001216]
 [0.9786183  0.7991586  0.46147937 0.7805292  0.11827443 0.639921
  0.14335328 0.9446689  0.5218483  0.41466194]
 [0.2645556  0.7742337  0.45615032 0.56843394 0.0187898  0.6176355
  0.6120957  0.616934   0.94374806 0.6818203 ]
 [0.3595079  0.43703195 0.6976312  0.06022547 0.6667667  0.67063785
  0.21038257 0.12892629 0.31542835 0.36371076]]

gpu arr(0,0)= [[0.5488135]]


### Exercise 3: Load an image onto your device

In [71]:
from skimage.io import imread
image = imread("https://imagej.net/ij/images/3_channel_inverted_luts.tif")

# TODO - check the image shape and push it to the GPU

### Exercise 4: What is the largest array you can push to your device?

One of the biggest limitations of GPU acceleration is the memory of your device: what is the largest amount of data you can push to it? Does it match your hardware specification?  
Here, we want to find your hardware limits and see the type of error you get when they are exceeded.
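One way to run this experiment is to grow a square array until the transfer fails. The sketch below takes the transfer function as a parameter (for instance `cle.push`, or `cupy.asarray` in the CuPy section), frees each array before allocating the next one, and reports the last size that succeeded. Because the exception raised on an out-of-memory error is library-specific, it deliberately catches `Exception`. `find_largest_push` is a hypothetical helper, not part of either library. Generating the NumPy array also consumes host RAM, so the CPU side may run out first.

```python
import numpy as np


def find_largest_push(push, start_side=1024, max_steps=20, dtype=np.float32):
    """Double the side of a square array until push() fails.

    Returns (last_successful_side, error), where error is the exception that
    stopped the search, or None if max_steps was reached without failure.
    """
    side = start_side
    last_ok = 0
    itemsize = np.dtype(dtype).itemsize
    for _ in range(max_steps):
        try:
            host = np.zeros((side, side), dtype=dtype)
            device_arr = push(host)
            del device_arr          # release the device buffer before the next attempt
            del host
        except Exception as err:    # out-of-memory errors differ between libraries
            return last_ok, err
        print(f"OK   side={side:>6}  ({side * side * itemsize / 1024**3:.2f} GiB)")
        last_ok = side
        side *= 2
    return last_ok, None
```

Usage: `last_side, error = find_largest_push(cle.push)`, then print `error` to see which exception your device raises.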

In [72]:
# TODO - generate a large numpy array and push it to the GPU until ... it crashes ?

You can monitor your device usage (memory occupancy and core load) with various OS applications:
- macOS: Activity Monitor > Window > GPU History
- Windows: Task Manager > Performance > GPU
- NVIDIA (Linux): run `watch -n0.1 nvidia-smi` in a terminal
- ...

Now that you have managed to fill up your device memory, delete the variable using `del`.
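Keep in mind that `del` only removes a name: the underlying buffer is released once no other reference to the array remains. In a notebook, other references can survive, for instance in IPython's output history (`_`, `Out[...]`) if the array was the last expression of a cell. The sketch below illustrates this Python behaviour with a hypothetical stand-in class, `FakeDeviceArray`, rather than a real device array.

```python
import gc
import weakref


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array; only used to observe its lifetime."""


arr = FakeDeviceArray()
alive = weakref.ref(arr)   # a weak reference does not keep the object alive

history = [arr]            # simulates another reference, e.g. the notebook output cache
del arr
gc.collect()
print("after del arr:         ", "alive" if alive() is not None else "freed")

history.clear()            # drop the last strong reference
gc.collect()
print("after clearing history:", "alive" if alive() is not None else "freed")
```

If the memory monitor does not drop after `del`, look for such leftover references before suspecting the library.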

In [73]:
# TODO - delete the GPU variable array

# CuPy: Introduction

> [WARNING]
> Only for NVIDIA hardware!

CuPy is a library for processing data using __CUDA__ (Compute Unified Device Architecture), a proprietary parallel computing platform for NVIDIA graphics cards __only__.

It is a NumPy/SciPy-compatible library for GPU acceleration and can act as a drop-in replacement, integrating seamlessly with the __NumPy__ ecosystem.
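The drop-in idea can be sketched with `cupy.get_array_module()`, which returns the `cupy` module for CuPy arrays and `numpy` otherwise, so a single function works on both CPU and GPU arrays. The function `normalize` is a hypothetical example, and the fallback keeps the sketch runnable on machines without CuPy.

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module
except ImportError:
    cp = None

    def get_array_module(*args):
        """Fallback when CuPy is not installed: always use NumPy."""
        return np


def normalize(arr):
    """Rescale values to [0, 1] using whichever library owns `arr`."""
    xp = get_array_module(arr)
    arr = arr.astype(xp.float32)
    lo, hi = arr.min(), arr.max()
    return (arr - lo) / (hi - lo)


cpu_result = normalize(np.array([2.0, 4.0, 6.0]))   # runs with NumPy, result stays on the CPU
print(cpu_result)
```

On an NVIDIA machine, `normalize(cp.asarray([2.0, 4.0, 6.0]))` runs the same code on the GPU and returns a CuPy array, without any change to the function body.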

In [74]:
import warnings

try:
    import cupy as xp
except ImportError:
    import numpy as xp
    warnings.warn("CuPy not found, using NumPy instead.")  # Warning(...) alone would not be shown

try:
    import cupyx.scipy.ndimage as xdi
except ImportError:
    import scipy.ndimage as xdi
    warnings.warn("CuPy not found, using SciPy instead.")

import numpy as np
import scipy.ndimage as ndi

print(xp)
print(xdi)
print()
print(np)
print(ndi)

<module 'cupy' from '/home/strigaud/Libraries/miniforge3/envs/skbe/lib/python3.12/site-packages/cupy/__init__.py'>
<module 'cupyx.scipy.ndimage' from '/home/strigaud/Libraries/miniforge3/envs/skbe/lib/python3.12/site-packages/cupyx/scipy/ndimage/__init__.py'>

<module 'numpy' from '/home/strigaud/Libraries/miniforge3/envs/skbe/lib/python3.12/site-packages/numpy/__init__.py'>
<module 'scipy.ndimage' from '/home/strigaud/Libraries/miniforge3/envs/skbe/lib/python3.12/site-packages/scipy/ndimage/__init__.py'>


Because CuPy is CUDA-based, only NVIDIA devices can be used. Devices are identified by their __index__, which follows the library's discovery order and can vary between systems and versions.

We can explore the device specifications and should recover information similar to what `clesperanto` reports, possibly with different naming conventions or units.

In [75]:
num_devices = xp.cuda.runtime.getDeviceCount()
for i in range(num_devices):
    print(f"Device {i}:")
    device_properties = xp.cuda.runtime.getDeviceProperties(i)
    for k, v in device_properties.items():
        print(f"- {k}: {v}")

Device 0:
- name: b'NVIDIA GeForce RTX 4090'
- totalGlobalMem: 25358630912
- sharedMemPerBlock: 49152
- regsPerBlock: 65536
- warpSize: 32
- maxThreadsPerBlock: 1024
- maxThreadsDim: (1024, 1024, 64)
- maxGridSize: (2147483647, 65535, 65535)
- clockRate: 2520000
- totalConstMem: 65536
- major: 8
- minor: 9
- textureAlignment: 512
- texturePitchAlignment: 32
- multiProcessorCount: 128
- kernelExecTimeoutEnabled: 1
- integrated: 0
- canMapHostMemory: 1
- computeMode: 0
- maxTexture1D: 131072
- maxTexture2D: (131072, 65536)
- maxTexture3D: (16384, 16384, 16384)
- concurrentKernels: 1
- ECCEnabled: 0
- pciBusID: 1
- pciDeviceID: 0
- pciDomainID: 0
- tccDriver: 0
- memoryClockRate: 10501000
- memoryBusWidth: 384
- l2CacheSize: 75497472
- maxThreadsPerMultiProcessor: 1536
- isMultiGpuBoard: 0
- cooperativeLaunch: 1
- cooperativeMultiDeviceLaunch: 1
- deviceOverlap: 1
- maxTexture1DMipmap: 32768
- maxTexture1DLinear: 268435456
- maxTexture1DLayered: (32768, 2048)
- maxTexture2DMipmap: (32768,
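To compare these fields with the `cle.info()` output, a few of them map onto the OpenCL terms defined earlier, with different units: `totalGlobalMem` is in bytes (like GLOBAL_MEM_SIZE), `multiProcessorCount` counts streaming multiprocessors (comparable to MAX_COMPUTE_UNITS), and `clockRate` is in kHz whereas MAX_CLOCK_FREQUENCY is in MHz. The helper below is a hypothetical sketch that works on a plain dictionary, exercised here with values copied from the output above.

```python
def summarize_cuda_properties(props: dict) -> dict:
    """Express a few CUDA device properties with OpenCL-style names and units."""
    name = props["name"]
    if isinstance(name, bytes):          # CuPy returns the name as bytes
        name = name.decode()
    return {
        "NAME": name,
        "GLOBAL_MEM_SIZE_GiB": props["totalGlobalMem"] / 1024**3,   # bytes -> GiB
        "MAX_COMPUTE_UNITS": props["multiProcessorCount"],          # streaming multiprocessors
        "MAX_CLOCK_FREQUENCY_MHz": props["clockRate"] / 1000,       # kHz -> MHz
    }


# Values copied from the RTX 4090 output above
rtx4090 = {
    "name": b"NVIDIA GeForce RTX 4090",
    "totalGlobalMem": 25358630912,
    "multiProcessorCount": 128,
    "clockRate": 2520000,
}
summary = summarize_cuda_properties(rtx4090)
for key, value in summary.items():
    print(f"{key}: {value}")
```

On your own machine, pass `xp.cuda.runtime.getDeviceProperties(i)` directly instead of the hand-written dictionary.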

Even though the interface allows drop-in replacement, data still has to be copied from CPU memory to GPU memory and back. CuPy relies on a NumPy-style API for these transfers: `cupy.asarray()` sends data to the GPU, and the `.get()` method (or `cupy.asnumpy()`) copies it back to the CPU.
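With the NumPy fallback used in the import cell, `xp.asarray()` still works but `.get()` does not, since NumPy arrays have no such method. A small hypothetical helper, `to_host`, can copy an array back to the host whichever library created it, relying on `cupy.asnumpy()` when CuPy is available.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def to_host(arr) -> np.ndarray:
    """Return `arr` as a NumPy array, copying it from the GPU if needed."""
    if cp is not None and isinstance(arr, cp.ndarray):
        return cp.asnumpy(arr)       # device -> host copy
    return np.asarray(arr)           # already on the host


host = to_host(np.arange(5))
print(type(host), host)
```

Using `to_host(gpu_arr)` instead of `gpu_arr.get()` keeps the notebook runnable whether or not CuPy was found.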

In [76]:
np.random.seed(0)
np_arr = np.random.rand(5,10)  # CPU array
gpu_arr = xp.asarray(np_arr)   # GPU array
out_arr = gpu_arr.get()        # CPU array

print(f"np_arr:  shape={np_arr.shape}, dtype={np_arr.dtype}, device={np_arr.device}, type={type(np_arr)}")
print(f"gpu_arr: shape={gpu_arr.shape}, dtype={gpu_arr.dtype}, device={gpu_arr.device}, type={type(gpu_arr)}")
print(f"out_arr: shape={out_arr.shape}, dtype={out_arr.dtype}, device={out_arr.device}, type={type(out_arr)}")

np_arr:  shape=(5, 10), dtype=float64, device=cpu, type=<class 'numpy.ndarray'>
gpu_arr: shape=(5, 10), dtype=float64, device=<CUDA Device 0>, type=<class 'cupy.ndarray'>
out_arr: shape=(5, 10), dtype=float64, device=cpu, type=<class 'numpy.ndarray'>


In [77]:
print(gpu_arr)
print("\ngpu arr(0,0)=", gpu_arr[0,0])  # Access the element at (0, 0)

[[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548  0.64589411
  0.43758721 0.891773   0.96366276 0.38344152]
 [0.79172504 0.52889492 0.56804456 0.92559664 0.07103606 0.0871293
  0.0202184  0.83261985 0.77815675 0.87001215]
 [0.97861834 0.79915856 0.46147936 0.78052918 0.11827443 0.63992102
  0.14335329 0.94466892 0.52184832 0.41466194]
 [0.26455561 0.77423369 0.45615033 0.56843395 0.0187898  0.6176355
  0.61209572 0.616934   0.94374808 0.6818203 ]
 [0.3595079  0.43703195 0.6976312  0.06022547 0.66676672 0.67063787
  0.21038256 0.1289263  0.31542835 0.36371077]]

gpu arr(0,0)= 0.5488135039273248
