# 9.2. `nvmath-python`: Kernel fusion

Some computational problems have a low ratio of compute instructions to memory-access instructions, so their performance is limited by memory bandwidth rather than arithmetic. Kernel fusion often helps here because it reduces the number of memory accesses. This exercise illustrates how `nvmath-python` fuses simpler operations into a single composite kernel.

We illustrate this with two examples: a neural network propagation step and a fast Fourier transform (FFT).

## Advanced `matmul` with bias and epilog
After **Exercise 9.1**, it may not be clear why one would need `nvmath-python` for matrix multiplication. Indeed, for a basic `matmul` operation, using `nvmath-python` alongside CuPy does seem like overkill. However, in scientific computing and AI applications, `matmul`s are often combined with other operations. In neural networks, for example, the following usage pattern is quite common.

**Matrix-Matrix Multiplication with Bias and ReLU:**

$$C = \text{ReLU}(A \cdot B + b^T)$$

where:
- $A \in \mathbb{R}^{m \times k}$ is the input matrix
- $B \in \mathbb{R}^{k \times n}$ is the weight matrix  
- $b \in \mathbb{R}^{m}$ is the bias vector (transposed and broadcast to $m \times n$)
- $\text{ReLU}(x) = \max(0, x)$ is the Rectified Linear Unit activation function
- $C \in \mathbb{R}^{m \times n}$ is the output matrix

The bias vector $b$ is transposed to $b^T \in \mathbb{R}^{m \times 1}$ and automatically broadcast across all columns of the result matrix, so that $b_i$ is added to every entry of row $i$. The ReLU function is then applied element-wise to the result of the matrix multiplication plus bias.
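The formula can be checked by hand on a tiny case. The following is a minimal CPU-side sketch that uses NumPy as a stand-in for CuPy (the calls used here have the same names in both libraries). The `*_small` names and the input values are chosen for illustration only and are not part of the exercise.

```python
import numpy as np

# Tiny hand-checkable inputs (illustrative values, not from the exercise).
A_small = np.array([[1.0, 2.0, 0.0],
                    [0.0, -1.0, 1.0]])   # shape (m, k) = (2, 3)
B_small = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, -3.0]])        # shape (k, n) = (3, 2)
b_small = np.array([-2.0, 1.0])          # shape (m,) = (2,)

product = A_small @ B_small                  # shape (m, n)
biased = product + b_small.reshape(-1, 1)    # b_small[i] is added to every entry of row i
C_small = np.maximum(0, biased)              # element-wise ReLU

print(product)   # [[ 1.  2.] [ 2. -4.]]
print(biased)    # [[-1.  0.] [ 3. -3.]]
print(C_small)   # [[0. 0.] [3. 0.]]
```

Row 0 receives the bias $-2$ and row 1 receives the bias $1$, which shows that the bias varies along rows and is constant along columns.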



In [None]:
import cupy as cp

# Define the ReLU function
def relu(x):
    return cp.maximum(0, x)

# Matrix dimensions
# m, k, n = 2, 5, 4
m, k, n = 2000, 1000, 4000

# Create input matrix A (m x k)
A = cp.random.randn(m, k)

# Create weight matrix B (k x n)
B = cp.random.randn(k, n)

# Create bias vector b (m,) that will be transposed and broadcasted to (m x n)
b = cp.random.randn(m)

# Implement the formula: C = ReLU(A * B + b^T)
# Kernel 1: Matrix multiplication
matmul_result = cp.matmul(A, B)

# Kernel 2: Add bias (broadcasting happens automatically)
bias_result = matmul_result + b.reshape(-1, 1)

# Kernel 3: Apply ReLU activation
C = relu(bias_result)

# print(f"A: {A}")
# print(f"B: {B}")
# print(f"b: {b}")
# print(f"C: {C}")


print(f"Input matrix A shape: {A.shape}")
print(f"Weight matrix B shape: {B.shape}")
print(f"Bias vector b shape: {b.shape}")
print(f"Transposed bias b^T shape: {b.reshape(-1, 1).shape}")
print(f"Output matrix C shape: {C.shape}")
print(f"Output matrix C device: {C.device}")

**TODO: Validate the above implementation on small, easy-to-comprehend inputs by manually initializing the matrices and the bias, and by printing the results step by step**

`nvmath-python` leverages the `cuBLASLt` library, which provides a variety of options for implementing such computational patterns. Here is how the example above can be implemented using `nvmath-python`:

In [None]:
import nvmath

# Kernels 1, 2, and 3 are fused into a single kernel
C = nvmath.linalg.advanced.matmul(
    A,
    B,
    epilog=nvmath.linalg.advanced.MatmulEpilog.RELU_BIAS,
    epilog_inputs={"bias": b},
)

# print(f"A: {A}")
# print(f"B: {B}")
# print(f"b: {b}")
# print(f"C: {C}")

print(C.shape)
print(C.device)

**TODO: Ensure that the `nvmath-python` results match the CuPy results for the same small inputs**
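When comparing floating-point results from two different implementations, bitwise equality is usually too strict, because the operations may be carried out in a different order. The sketch below shows one way to compare within a tolerance. The helper name `results_match` and the tolerance values are assumptions made for illustration. It expects host (NumPy) arrays, so CuPy arrays would first need to be copied to the host, for example with `cp.asnumpy`.

```python
import numpy as np

def results_match(reference, candidate, rtol=1e-10, atol=1e-10):
    """Return True if two host arrays have the same shape and agree within tolerance."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    if reference.shape != candidate.shape:
        return False
    return bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))

# Two values that are mathematically equal but differ in the last bit.
x = np.array([0.1 + 0.2, 1.0])
y = np.array([0.3, 1.0])

print(np.array_equal(x, y))   # exact comparison
print(results_match(x, y))    # comparison within tolerance
```

The exact comparison fails on the first element while the tolerance-based comparison succeeds, which is the behavior one typically wants when validating a fused kernel against a step-by-step reference.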

All three kernels are fused into a single kernel using the JIT machinery behind `cuBLASLt`. In certain problem settings this may result in better overall performance, because the intermediate results no longer travel through global memory and the compute-to-memory-access ratio improves.
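A back-of-the-envelope estimate makes the benefit concrete. The sketch below assumes that the unfused version runs three separate kernels, each streaming the full $m \times n$ array through global memory. It ignores caches as well as the reads of `A`, `B`, and `b`, so the numbers are a rough illustration rather than a measurement.

```python
# Rough estimate of global-memory traffic for the m x n intermediate result.
m, n = 2000, 4000
bytes_per_element = 8          # float64, the default dtype of cp.random.randn

pass_bytes = m * n * bytes_per_element

# Unfused: matmul writes the product; the bias add reads and writes it;
# ReLU reads and writes it again, giving 5 passes over an m x n array.
unfused_bytes = 5 * pass_bytes

# Fused: bias and ReLU are applied before the single write of C, giving 1 pass.
fused_bytes = 1 * pass_bytes

print(f"unfused: {unfused_bytes / 1e6:.0f} MB, fused: {fused_bytes / 1e6:.0f} MB")
# unfused: 320 MB, fused: 64 MB
```

Under these assumptions the fused version moves one fifth of the output-sized data, although the actual speedup depends on how much of the total run time the matrix multiplication itself takes.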

## Using custom FFT callbacks written in Python

In the previous example, the epilog was chosen from a predefined set of activation functions and their gradients. This example illustrates the case where the epilog is a custom Python function, which is compiled into an intermediate representation (LTO-IR) and then fused with the FFT operation into a single kernel.

Specifically, we illustrate how to perform a convolution by providing a Python callback function as an epilog to the FFT operation.

To begin with, let's create some input data. We will use a batched 1D FFT and apply a sine-based filter in the frequency domain.

In [None]:
# Create the data for the batched 1-D FFT.
# Note: B is rebound here to the batch size; it no longer refers to the weight matrix above.
B, N = 256, 1024
a = cp.random.rand(B, N) + 1j * cp.random.rand(B, N)

# Create the data to use as a filter.
filter_data = cp.sin(a)

Next, we define the epilog function for the forward FFT. It implements the convolution, which corresponds to pointwise multiplication in the frequency domain. The epilog also scales the result by the FFT size `N`.

In [None]:
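def convolve(data_out, offset, data, filter_data, unused):
    # `data` is the FFT output value to be stored at flat index `offset`; apply the filter and scale by 1/N.
    data_out[offset] = data * filter_data[offset] / N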

Note that we access `data_out` and `filter_data` with a single integer `offset`, even though the output and `filter_data` are 2D tensors (batches of samples). Care must be taken to ensure that both arrays accessed here have the same memory layout.

The next step is to compile the epilog to an intermediate representation (LTO-IR). On a system with GPUs that have different compute capabilities, the `compute_capability` option must be passed to the `compile_prolog` or `compile_epilog` helpers. Alternatively, the epilog can be compiled in the context of the device on which the FFT that receives the epilog is executed. In this case we use the current device context, where the operands have been created:
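The layout requirement can be illustrated on the CPU. The sketch below uses NumPy (as a stand-in for CuPy) to show that the same flat offset points to different logical elements in a row-major and a column-major array. The array sizes and variable names are chosen for illustration only.

```python
import numpy as np

rows, cols = 2, 3
c_order = np.arange(rows * cols).reshape(rows, cols)   # row-major (C) layout
f_order = np.asfortranarray(c_order)                   # same values, column-major layout

# order="K" flattens in the order the elements sit in memory.
c_memory = c_order.ravel(order="K")   # [0 1 2 3 4 5]
f_memory = f_order.ravel(order="K")   # [0 3 1 4 2 5]

i, j = 1, 0
offset = i * cols + j   # flat offset of element (i, j) in a row-major array

print(c_order[i, j], c_memory[offset], f_memory[offset])   # 3 3 4
```

With the row-major array, the offset retrieves element `(1, 0)` as intended. With the column-major array, the same offset retrieves a different element, which is why a filter stored in a different layout than the FFT output would be applied to the wrong samples.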

In [None]:
with cp.cuda.Device():
    epilog = nvmath.fft.compile_epilog(convolve, "complex128", "complex128")

Finally, we perform the convolution as a forward FFT with the compiled epilog (the filter), followed by an inverse FFT:

In [None]:
r = nvmath.fft.fft(a, axes=[-1], epilog={"ltoir": epilog, "data": filter_data.data.ptr})
r = nvmath.fft.ifft(r, axes=[-1])
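For validation, the same pipeline can be written as separate steps on the CPU. The sketch below uses NumPy with small illustrative sizes. It assumes, as the $1/N$ scaling in the epilog suggests, that `nvmath.fft.ifft` leaves the inverse transform unnormalized, whereas `np.fft.ifft` applies the $1/N$ factor itself, so no explicit division appears here. It also checks the result against a direct circular convolution to confirm that pointwise multiplication in the frequency domain is indeed a convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, size = 4, 16                      # small illustrative sizes
signal = rng.random((batch, size)) + 1j * rng.random((batch, size))
filt = np.sin(signal)                    # same kind of filter as above

# Forward FFT, pointwise multiplication (the epilog's job), inverse FFT.
spectrum = np.fft.fft(signal, axis=-1)
reference = np.fft.ifft(spectrum * filt, axis=-1)   # np.fft.ifft includes the 1/size factor

# Direct circular convolution with the time-domain kernel of the filter.
kernel = np.fft.ifft(filt, axis=-1)
direct = np.zeros_like(signal)
for n in range(size):
    for m in range(size):
        direct[:, n] += signal[:, m] * kernel[:, (n - m) % size]

print(np.allclose(direct, reference))
```

On the GPU, the analogous check would compare `r` with the result of separate `cp.fft.fft`, multiplication, and `cp.fft.ifft` calls, using a tolerance-based comparison such as `cp.allclose`.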