### **Python in the loop**

Writing a fully functioning CUDA program requires a lot of C boilerplate. To cut that down, let's bring Python into the loop, so that we can rely on its extensive libraries for tasks such as reading and writing folders, files, images, and audio files.

Let's first divide the code we write into 3 parts:


*   The core kernel code
*   The data handling wrapper (memory allocations, data transfer and kernel invocation)
*   Other program code concerned with inputs and outputs.

We'll now show how to keep our kernel and wrapper C code (parts 1 and 2), compile it into a dynamic library, and then load this library from the Python program.

In [None]:
# Setup cuda environment
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc4jupyter

Collecting git+https://github.com/andreinechaev/nvcc4jupyter.git
  Cloning https://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-0o5j4oqy
  Running command git clone --filter=blob:none --quiet https://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-0o5j4oqy
  Resolved https://github.com/andreinechaev/nvcc4jupyter.git to commit 28f872a2f99a1b201bcd0db14fdbc5a496b9bfd7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: nvcc4jupyter
  Building wheel for nvcc4jupyter (pyproject.toml) ... done
  Created wheel for nvcc4jupyter: filename=nvcc4jupyter-1.2.1-py3-none-any.whl size=10742 sha256=fdb30b442a03a2b978d94f242f4f3839c3b517111d4c5bcc31496f40b5aacbba
  Stored in directory: /tmp/pip-ephem-wheel-cache-_dx40vwh/wheels/ef/1d/c6/f7e47f1aa1bc9d05c4120d94f90a79cf28603ef343b0dd43ff
Successfully bu

In [None]:
%%writefile sumArrayGPU.cu

// CUDA kernel function
__global__ void my_cuda_kernel(int *input, int *output, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < size) {
        output[tid] = input[tid] * 2;  // Example: double each element
    }
}

// Wrapper function to call the CUDA kernel
extern "C" void my_cuda_function(int *input, int *output, int size) {
    // Allocate device memory
    int *d_input, *d_output;
    cudaMalloc((void**)&d_input, size * sizeof(int));
    cudaMalloc((void**)&d_output, size * sizeof(int));

    // Copy input data to device
    cudaMemcpy(d_input, input, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch CUDA kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    my_cuda_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, d_output, size);

    // Copy result back to host
    cudaMemcpy(output, d_output, size * sizeof(int), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_input);
    cudaFree(d_output);
}

Writing sumArrayGPU.cu


In [None]:
# Compile the CUDA code into a shared library that the Python main program can load
# (-arch=sm_75 targets compute capability 7.5, e.g. a T4 GPU; adjust it for other GPUs)
!nvcc -arch=sm_75 -o sumArrayGPU.so -shared -Xcompiler -fPIC sumArrayGPU.cu
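
The same compile step can also be driven from plain Python, which helps when the code runs outside a notebook. The sketch below is only an illustration: the helper names `build_nvcc_command` and `compile_shared_library` are hypothetical, the flags are the ones used in the cell above, and it assumes `nvcc` is on the `PATH`.

```python
import shutil
import subprocess


def build_nvcc_command(source, output, arch="sm_75"):
    """Return the nvcc argument list for building a shared library.

    The flags mirror the notebook command; `arch` is an assumption
    that must match the GPU actually in use.
    """
    return [
        "nvcc",
        f"-arch={arch}",
        "-shared",
        "-Xcompiler", "-fPIC",
        "-o", output,
        source,
    ]


def compile_shared_library(source, output, arch="sm_75"):
    """Run nvcc if it is installed.

    Returns False when nvcc is not found; raises CalledProcessError
    when nvcc is found but the compilation fails.
    """
    if shutil.which("nvcc") is None:
        return False
    subprocess.run(build_nvcc_command(source, output, arch), check=True)
    return True


# Show the command that would be executed for the file written above.
print(" ".join(build_nvcc_command("sumArrayGPU.cu", "sumArrayGPU.so")))
```

Calling `compile_shared_library("sumArrayGPU.cu", "sumArrayGPU.so")` would then replace the `!nvcc ...` line.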

In [None]:
# Python code calling the compiled CUDA wrapper function

# ctypes bridges the gap between Python's dynamic data types and C's static ones.
import ctypes

# Load the CUDA library
cuda_lib = ctypes.CDLL('./sumArrayGPU.so')  # Update with the correct path

# Define the function prototype
cuda_lib.my_cuda_function.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
cuda_lib.my_cuda_function.restype = None

# Prepare data
input_data = [1, 2, 3, 4]
output_data = [0, 0, 0, 0]
size = len(input_data)

# Convert Python lists to ctypes arrays
input_array = (ctypes.c_int * size)(*input_data)
output_array = (ctypes.c_int * size)(*output_data)

# Call the CUDA function
cuda_lib.my_cuda_function(input_array, output_array, size)

# Print the result
result = list(output_array)
print("Result:", result)

Result: [2, 4, 6, 8]
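
To hide the ctypes conversions from the rest of the program, the call can be wrapped in an ordinary Python function. The following is a minimal sketch, not part of the original notebook: `double_elements` is a hypothetical helper, and `cpu_stand_in` is a pure-Python callback with the same C signature as `my_cuda_function`, used here only so the example runs without a GPU.

```python
import ctypes

# Same C signature as my_cuda_function: void f(int *input, int *output, int size)
PROTOTYPE = ctypes.CFUNCTYPE(
    None,
    ctypes.POINTER(ctypes.c_int),
    ctypes.POINTER(ctypes.c_int),
    ctypes.c_int,
)


def _double_on_cpu(input_ptr, output_ptr, size):
    # Hypothetical CPU stand-in that mimics what the CUDA kernel computes.
    for i in range(size):
        output_ptr[i] = input_ptr[i] * 2


cpu_stand_in = PROTOTYPE(_double_on_cpu)


def double_elements(values, c_function=cpu_stand_in):
    """Convert a list of ints to ctypes arrays, call c_function, return a list."""
    size = len(values)
    input_array = (ctypes.c_int * size)(*values)
    output_array = (ctypes.c_int * size)()  # zero-initialized by ctypes
    c_function(input_array, output_array, size)
    return list(output_array)


print(double_elements([1, 2, 3, 4]))
```

With the compiled library loaded and its `argtypes` set as in the cell above, passing `cuda_lib.my_cuda_function` as `c_function` should run the same wrapper on the GPU.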


In [None]:
# Python code calling the compiled CUDA wrapper function, this time with a 2D NumPy array

# ctypes bridges the gap between Python's dynamic data types and C's static ones.
import ctypes
import numpy as np

# Load the CUDA library
cuda_lib = ctypes.CDLL('./sumArrayGPU.so')  # Update with the correct path

# Define the function prototype
cuda_lib.my_cuda_function.argtypes = [ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
cuda_lib.my_cuda_function.restype = None

# Prepare data
input_data = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
output_data = np.zeros_like(input_data)
size = input_data.size

# Convert the flattened NumPy arrays to ctypes arrays
input_array = (ctypes.c_int * size)(*input_data.flatten())
output_array = (ctypes.c_int * size)(*output_data.flatten())

# Call the CUDA function
cuda_lib.my_cuda_function(input_array, output_array, size)

# Restore the original 2D shape and print the result
result = np.array(list(output_array)).reshape(input_data.shape)
print("Result:", result)

Result: [[ 2  4  6  8]
 [10 12 14 16]]
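
Unpacking a NumPy array element by element into a ctypes array copies every value through Python. An alternative is to hand the C function a pointer straight into the NumPy buffer with `ndarray.ctypes.data_as`. The sketch below assumes the array is converted to `np.intc` (NumPy's type for a C `int`) and made C-contiguous first; `as_c_int_pointer` is a hypothetical helper, and the loop stands in for the library call so the example runs without a GPU.

```python
import ctypes
import numpy as np


def as_c_int_pointer(array):
    """Return a C-contiguous np.intc array and an int* pointing into it.

    The returned array must stay alive for as long as the pointer is used.
    """
    contiguous = np.ascontiguousarray(array, dtype=np.intc)
    pointer = contiguous.ctypes.data_as(ctypes.POINTER(ctypes.c_int))
    return contiguous, pointer


input_data = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
input_c, input_ptr = as_c_int_pointer(input_data)

# The output buffer keeps the 2D shape, so no reshape is needed afterwards.
output_c = np.zeros_like(input_c)
output_ptr = output_c.ctypes.data_as(ctypes.POINTER(ctypes.c_int))

# Stand-in for cuda_lib.my_cuda_function(input_ptr, output_ptr, input_c.size):
# writes through the pointer land directly in output_c's memory.
for i in range(input_c.size):
    output_ptr[i] = input_ptr[i] * 2

print(output_c)
```

Because the pointers already have the type declared in `argtypes`, they could be passed to `cuda_lib.my_cuda_function` unchanged.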


In [None]:
import numpy as np

mat = np.array([[1,2,3],[4,5,6],[7,8,9]])

print(mat.flatten())

volume = np.array([[[1,1,1],
                    [2,2,2],],
                  [[3,3,3],
                   [4,4,4],],
                  [[5,5,5],
                   [6,6,6],]
                   ])

print(volume.flatten())

print(mat.flatten().reshape((3,3)))
print()
print(volume.flatten().reshape((3,2,3)))


[1 2 3 4 5 6 7 8 9]
[1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

[[[1 1 1]
  [2 2 2]]

 [[3 3 3]
  [4 4 4]]

 [[5 5 5]
  [6 6 6]]]
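
Since the kernel only sees a flat buffer, it helps to know where element `(i, j, k)` of a volume ends up after `flatten()`. NumPy flattens in row-major (C) order by default, so the last index varies fastest. The sketch below checks that mapping; `flat_index` is a hypothetical helper written for illustration, while `np.ravel_multi_index` and `np.unravel_index` are NumPy's built-in equivalents.

```python
import numpy as np


def flat_index(i, j, k, shape):
    """Row-major offset of element (i, j, k) in a flattened 3D array."""
    _, height, width = shape
    return (i * height + j) * width + k


# A 3 x 2 x 3 volume like the one above, filled with distinct values.
volume = np.arange(18).reshape(3, 2, 3)
flat = volume.flatten()

# Every element sits at the offset predicted by the row-major formula.
for i, j, k in np.ndindex(volume.shape):
    assert flat[flat_index(i, j, k, volume.shape)] == volume[i, j, k]

# NumPy provides the same mapping in both directions.
offset = np.ravel_multi_index((2, 1, 0), volume.shape)
print(offset, np.unravel_index(offset, volume.shape))
```

The same arithmetic is what a kernel would need in order to recover 2D or 3D coordinates from its flat thread index.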


### **Resources**

*   https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
*   https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/#overlapping_kernel_execution_and_data_transfers
*   https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
*   https://vitalitylearning.medium.com/using-c-c-and-cuda-functions-as-regular-python-functions-716f01f7ca22

