In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<img src="./images/nvmath_head_panel@0.5x.png" alt="nvmath-python" />

# Getting Started with nvmath-python: Kernel Fusion

## Overview

This notebook provides a fundamental introduction to **nvmath-python** and explains how it integrates with the existing scientific computing ecosystem in Python. The focus is on understanding kernel fusion and how it enables performance gains in composite operations.

**Learning Objectives:**
* Understand how **nvmath-python** coexists with array libraries like [NumPy](https://numpy.org/), [CuPy](https://cupy.dev/), and [PyTorch](https://pytorch.org/)
* Apply `nvmath.linalg.advanced.matmul` to perform matrix multiplication operations
* Benchmark GPU codes using `cupyx.profiler.benchmark` for accurate performance measurements
* Explain the performance benefits of kernel fusion in composite operations
* Use NVIDIA Nsight profiling tools to visualize kernel execution
* Experiment with different matrix sizes to examine their effect on kernel fusion benefits

---
## Introduction

**nvmath-python** is a powerful library designed to bridge the gap between Python's scientific computing community and [NVIDIA CUDA-X math libraries](https://developer.nvidia.com/gpu-accelerated-libraries). While the community has done excellent work enabling GPU computing through libraries like [CuPy](https://cupy.dev/) and [PyTorch](https://pytorch.org/), these frameworks are often constrained to *[NumPy](https://numpy.org/)-like* APIs and do not always exploit the full potential and advanced features of underlying NVIDIA CUDA-X libraries. This is where **nvmath-python** becomes valuable.

**nvmath-python** reimagines how math library API design can be intuitive, Pythonic, and performant in sophisticated usage scenarios. Like other NumPy-like API libraries, **nvmath-python** implements core numerical algorithms useful in many scientific and engineering fields. However, **nvmath-python** does not aim to replace or duplicate those libraries but rather to co-exist with them.

This notebook focuses on *kernel fusion*, a key optimization technique that allows multiple operations to be combined into a single GPU kernel, dramatically improving performance by reducing memory bandwidth requirements and kernel launch overhead.

**Prerequisites:** To use this notebook, you will need:
- A computer equipped with an NVIDIA GPU
- Basic familiarity with NumPy-like array operations
- Understanding of matrix multiplication concepts
- Understanding of computational complexity and arithmetic intensity concepts

---
## Setup

This notebook requires the following Python libraries:
- `nvmath`: NVIDIA's mathematical library for Python
- `numpy`: For CPU array operations
- `cupy`: For GPU array operations  
- `torch` (optional): For PyTorch tensor operations

If you have not already installed these libraries, you can install them using:

```bash
pip install "nvmath-python[cu12,cpu,dx]" 'nvidia-cuda-nvcc-cu12==12.8.*' 'nvidia-cuda-nvrtc-cu12==12.8.*' --extra-index-url https://download.pytorch.org/whl/cu128 torch
pip install cupy-cuda12x
```

Note that the above commands install nvmath-python with [NVIDIA CUDA Toolkit (CTK)](https://developer.nvidia.com/cuda-toolkit) 12.x. To avoid dependency conflicts, the CuPy and PyTorch packages must also be installed for CTK 12.x. The first command also installs nvmath-python's optional CPU backend.

For detailed installation instructions, please refer to the [nvmath-python documentation](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#install-nvmath-python) as well as the installation instructions for CuPy and PyTorch.
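
Before running the rest of the notebook, it can help to confirm that the packages listed above are visible to the current Python environment. The following is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper name, and the check only tells you whether a module can be located, not whether it works with your GPU driver.

```python
from importlib import util


def check_packages(modules):
    """Map each module name to True/False depending on whether it can be located."""
    return {name: util.find_spec(name) is not None for name in modules}


# Module names taken from the Setup list above; `torch` is optional.
status = check_packages(["nvmath", "numpy", "cupy", "torch"])
for name, found in status.items():
    print(f"{name:8s} {'found' if found else 'MISSING'}")
```

A `MISSING` entry for `torch` is harmless if you skip the PyTorch cell.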

---
## nvmath-python is NOT an Array Library

First and foremost, **nvmath-python** **is NOT** an *array library*. It does not implement traditional array library functionality such as array *indexing* and *slicing*. The example below demonstrates typical NumPy array operations that are **not** part of **nvmath-python**'s scope:

In [None]:
# Basic NumPy array creation, indexing and slicing
import numpy as np

# 1D array
a = np.arange(10)  # create array with values from 0 to 9
print("a =", a)  # print the array
print("a[2] =", a[2])  # access the third element (index 2)
print("a[2:7:2] =", a[2:7:2])  # slice from index 2 to 6 with step 2

# 2D array
b = np.arange(12).reshape(3, 4)  # create 3x4 array (matrix)
print("b =", b)  # print the matrix
print("b[0:2, 1:4] (submatrix) =", b[0:2, 1:4])  # slice rows 0-1 and columns 1-3
print("b[:,1] (second column) =", b[:, 1])  # access all rows in second column
print("b[0] (first row) =", b[0])  # access first row
print("b[1,2] =", b[1, 2])  # access element at row 1, column 2

### Interoperability with Array Libraries

Instead, **nvmath-python** is designed to **co-exist with array libraries** such as NumPy, CuPy, and PyTorch. It accepts arrays from these libraries as inputs and returns results as arrays of the same type. The following examples demonstrate this interoperability:

In [None]:
import numpy as np
import nvmath

n, m, k = 2, 4, 5
a_cpu = np.random.rand(n, k)
b_cpu = np.random.rand(k, m)

c_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)  # matrix multiplication
print("c_cpu.shape =", c_cpu.shape)  # should be (2, 4)
print(type(c_cpu))  # should be numpy.ndarray

In [None]:
import nvmath
import cupy as cp

n, m, k = 2, 4, 5
a_gpu = cp.random.rand(n, k)
b_gpu = cp.random.rand(k, m)

c_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)  # matrix multiplication
print("c_gpu.shape =", c_gpu.shape)  # should be (2, 4)
print(type(c_gpu))  # should be cupy.ndarray

In [None]:
import torch
import nvmath

n, m, k = 2, 4, 5

# CPU tensors
a_cpu = torch.rand(n, k)
b_cpu = torch.rand(k, m)

# On CPU
c_cpu = nvmath.linalg.advanced.matmul(a_cpu, b_cpu)
print("c_cpu.shape =", c_cpu.shape)  # should be (2, 4)
print("type(c_cpu) =", type(c_cpu))  # should be torch.Tensor

# Run on GPU if available
if torch.cuda.is_available():
    # Move the tensors to GPU
    a_gpu = a_cpu.cuda()
    b_gpu = b_cpu.cuda()
    c_gpu = nvmath.linalg.advanced.matmul(a_gpu, b_gpu)
    print("c_gpu.shape =", c_gpu.shape)
    print("c_gpu device =", getattr(c_gpu, "device", "unknown"))
    print("type(c_gpu) =", type(c_gpu))

**Key Takeaways:**

- **nvmath-python** accepts arrays from multiple libraries: NumPy (CPU), CuPy (GPU), and PyTorch (CPU/GPU).
- To determine where a result resides, use simple type/device checks such as `isinstance(c, np.ndarray)`, `isinstance(c, cp.ndarray)`, or for PyTorch inspect `c.is_cuda` / `c.device`.
- **Remember**: Array semantics (indexing/slicing) are provided by the array library (NumPy/CuPy/PyTorch). **nvmath-python** focuses on mathematical operations and interoperates with those libraries rather than replacing them.
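
The type and device checks from the takeaways can be wrapped in one small helper. The sketch below is an assumption-laden illustration rather than part of **nvmath-python**: `describe_location` is a hypothetical name, and instead of `isinstance` it inspects the name of the package that defines the object's type, so it does not need all three array libraries to be installed.

```python
def describe_location(x):
    """Return a short description of where an array-like result lives.

    Looks at the package that defines type(x) instead of importing NumPy,
    CuPy, and PyTorch, so it works even when some of them are missing.
    """
    package = type(x).__module__.split(".")[0]
    if package == "numpy":
        return "NumPy array on CPU"
    if package == "cupy":
        return f"CuPy array on GPU {x.device.id}"
    if package == "torch":
        return f"PyTorch tensor on {x.device}"
    return f"unrecognized type: {type(x).__name__}"


# A plain Python list is not an array type that the helper knows about.
print(describe_location([1, 2, 3]))
```

Calling `describe_location(c_cpu)` or `describe_location(c_gpu)` on the results from the cells above should report the matching library and device.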

---
## Benchmarking GPU Codes with `cupyx.profiler.benchmark`

Because GPU kernels are launched asynchronously, a host call may return before the device work finishes. Naive timing methods (such as Python's `time.time()` function or Jupyter's `%%timeit` magic) therefore tend to capture mostly host-side launch overhead rather than true device execution time, unless you synchronize explicitly.

The `cupyx.profiler.benchmark` function uses CUDA events, proper synchronization, warm-ups, and repeated runs to produce stable, device-level timing measurements. It accounts for asynchronous execution, removes much of the noise introduced by Python overhead and one-time setup costs, and reports aggregated statistics so you get reproducible, comparable numbers.
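
The effect of asynchronous launches on naive timing can be reproduced without a GPU. The sketch below is only a CPU analogy: a worker thread stands in for the device, `fake_kernel` is a made-up function that sleeps instead of computing, and `future.result()` plays the role of a device synchronization.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_s=0.05):
    """Stand-in for device work; sleeps instead of computing."""
    time.sleep(duration_s)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Naive timing: stop the clock as soon as the "launch" call returns.
    t0 = time.perf_counter()
    future = pool.submit(fake_kernel)
    naive_s = time.perf_counter() - t0

    # Synchronized timing: wait for the work to finish before reading the clock.
    future.result()
    synced_s = time.perf_counter() - t0

print(f"naive  = {naive_s * 1e3:.3f} ms (launch overhead only)")
print(f"synced = {synced_s * 1e3:.3f} ms (includes the work itself)")
```

The naive number is a small fraction of the synchronized one, which is the same trap that unsynchronized host timers fall into with real kernels.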

Below is a helper function that benchmarks implementations and optionally compares them to report speedup/slowdown:

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

### Benchmarking Matrix Multiplication

Next, we compare how **nvmath-python** performs on `matmul` relative to CuPy. Keep the following notes in mind when benchmarking.

**Practical Benchmarking Notes:**

- Make sure data is already allocated on the device before benchmarking (transfer costs should be excluded unless you intend to measure them).
- Run several repeats and inspect distributions (median is usually more robust than min or mean).
- For end-to-end profiling (memory transfers, kernel launches, kernel internals), use NVIDIA tools such as Nsight Systems (`nsys`) and Nsight Compute (`ncu`); the older `nvprof` is deprecated and does not support recent GPU architectures. The `cupyx.profiler.benchmark` function focuses on accurate kernel timing within Python workflows.
- Watch out for GPU power/clock state and thermal throttling; for stable numbers, use consistent GPU governor/clock settings if available.
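
To inspect the distribution of timings rather than a single number, the repeated measurements can be summarized with the standard library. This is a minimal sketch: `summarize` is a hypothetical helper, and the sample values are made up for illustration. In practice you could pass it a flattened list of per-run GPU times, for example from the `gpu_times` attribute used in the helper above.

```python
import statistics


def summarize(times_s):
    """Return min, median, mean, and sample standard deviation of timings in seconds."""
    times = sorted(float(t) for t in times_s)
    return {
        "min": times[0],
        "median": statistics.median(times),
        "mean": statistics.fmean(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


# Made-up timings with one slow outlier, for illustration only.
sample = [0.010, 0.011, 0.010, 0.012, 0.050]
stats = summarize(sample)
print({name: round(value, 4) for name, value in stats.items()})
```

With the outlier present, the mean is pulled well above the median, which is why the median is often the more robust summary.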

In [None]:
m, n, k = 8192, 4096, 2048  # Use large enough sizes to get measurable times. Make sure it fits into GPU memory.

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)

# It's time to do real benchmarking to compare how nvmath-python
# performs on `matmul` relative to `cupy`:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a, b),  # nvmath-python implementation
    F_name="nvmath-python matmul",
    F_alternative=lambda: cp.matmul(a, b),  # CuPy implementation
    F_alternative_name="CuPy matmul",
)

### Understanding the Performance Results

Is there anything wrong with **nvmath-python**? Why does the "advanced" `matmul` run as well (or as poorly) as CuPy's `matmul`?

The explanation is actually very simple: both CuPy and **nvmath-python** rely on the same cuBLAS library, and both perform nothing more than a pure matrix-matrix multiplication. This is the reason (and the only reason) why we observe essentially **identical** performance. To make a difference, we need a more **sophisticated use case** that does not simply translate to a NumPy-like `matmul` call.
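
To compare measurements on a scale that does not depend on the matrix sizes, a timing can be converted into throughput. The sketch below assumes the usual count of one multiply and one add per term of the product; the helper names and the 10 ms timing are hypothetical and serve only to show the conversion.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) product."""
    return 2 * m * n * k


def tflops_per_second(flops, seconds):
    """Convert an operation count and a measured time into TFLOP/s."""
    return flops / seconds / 1e12


m, n, k = 8192, 4096, 2048  # sizes used in the benchmark above
flops = matmul_flops(m, n, k)
print(f"{flops:,} FLOPs per matmul")

# Hypothetical timing of 10 ms, chosen only to show the conversion.
print(f"{tflops_per_second(flops, 0.010):.2f} TFLOP/s")
```

Substituting the times returned by the `benchmark` helper gives the achieved throughput of each implementation on your GPU.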

---
## Composite Operations with nvmath-python

In many real-world scenarios, the `matmul` operation we considered earlier is chained with other operations. For example, in linear algebra, *General Matrix Multiplication* (GEMM) is a commonly used building block in scientific and engineering applications. In its simplified form, GEMM performs the following operation:

$$
D_{m \times n} \leftarrow \alpha \cdot (A_{m \times k} \cdot B_{k \times n}) + \beta \cdot C_{m \times n}
$$

Using an array library, one can easily implement this chained operation:

In [None]:
m, n, k = 10_000_000, 40, 10  # Take tall-and-skinny matrices to illustrate kernel fusion benefits

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)
c = cp.random.rand(m, n, dtype=cp.float32)

alpha = 1.5
beta = 0.5

d = alpha * a @ b + beta * c
d.shape

With **nvmath-python**, you can also perform the composite operation with a single function call:

In [None]:
d = nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta)
d.shape

Now let's benchmark each alternative:

In [None]:
benchmark(
    lambda: nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta),  # nvmath-python implementation
    F_name="nvmath-python GEMM",
    F_alternative=lambda: alpha * a @ b + beta * c,  # CuPy implementation
    F_alternative_name="CuPy equivalent",
)

### Why is nvmath-python Significantly Faster?

Both **nvmath-python** and CuPy perform the same composite operation, and both are accelerated by the cuBLAS library. So why is **nvmath-python** significantly faster?

The "magic" behind this performance difference is called *kernel fusion*. With CuPy, every basic operation is a separate function call on the host that asynchronously launches its own GPU kernel, and each intermediate result is written to GPU memory only to be read back by the next kernel. With **nvmath-python**, the chained operation is performed as a whole in a single fused kernel, so the overhead of multiple kernel launches disappears. More importantly, the fused kernel avoids those intermediate round trips to memory, which significantly increases the *arithmetic intensity* of the computation.
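
The effect on arithmetic intensity can be estimated with a back-of-the-envelope traffic model. The sketch below is an idealization, not a measurement: it assumes that each step of the GEMM formula runs as its own kernel in the unfused case, that every array is read or written exactly once per kernel, and that caches play no role. The function names are hypothetical, and the kernel sequence that an array library actually launches may differ.

```python
def unfused_bytes(m, n, k, itemsize=4):
    """Ideal memory traffic when each step of the formula is its own kernel.

    Steps: P = A @ B, P = alpha * P, T = beta * C, D = P + T.
    Every intermediate is written to memory and read back by the next step.
    """
    matmul = m * k + k * n + m * n  # read A and B, write P
    scale_p = 2 * m * n             # read P, write alpha * P
    scale_c = 2 * m * n             # read C, write beta * C
    add = 3 * m * n                 # read both operands, write D
    return itemsize * (matmul + scale_p + scale_c + add)


def fused_bytes(m, n, k, itemsize=4):
    """Ideal traffic for one fused kernel: read A, B, C once and write D once."""
    return itemsize * (m * k + k * n + 2 * m * n)


def gemm_flops(m, n, k):
    """2*m*n*k for the product plus 3*m*n for the two scalings and the addition."""
    return 2 * m * n * k + 3 * m * n


m, n, k = 10_000_000, 40, 10  # tall-and-skinny sizes used above, float32
for name, nbytes in [("unfused", unfused_bytes(m, n, k)), ("fused", fused_bytes(m, n, k))]:
    intensity = gemm_flops(m, n, k) / nbytes
    print(f"{name:8s} {nbytes / 1e9:6.2f} GB moved, {intensity:.2f} FLOP/byte")
```

Under these assumptions the fused kernel moves several times less data for the same arithmetic, which is the mechanism behind the higher arithmetic intensity described above.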

## Exercise: Evaluate performance of CuPy `@`-based vs. `matmul`-based implementations of GEMM

In the examples above, we implemented GEMM using the `@` operator for matrix multiplication. Implement a CuPy variant that uses the `matmul` function instead. Benchmark the `@` variant against the `matmul` variant and explain the performance difference (if any).

**Hint**: Consider operation precedence along with computational costs of each chained operation.

In [None]:
## Exercise: Evaluate performance of CuPy `@`-based vs. `matmul`-based implementations of GEMM

import nvmath # Facilitates shared objects loading required by CuPy (Workaround for CuPy being unable to find nvrtc installed in wheels)
import cupy as cp
import numpy as np
import cupyx as cpx

# Define GEMM parameters
m, n, k = 10_000_000, 40, 10

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)
c = cp.random.rand(m, n, dtype=cp.float32)

alpha = 1.5
beta = 0.5

# Benchmarking function

# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    pass # TODO: Implement this function

# Write two functions that implement GEMM using `@` operator and `matmul` function.
def gemm_operator_form(a, b, c, alpha, beta):
    # TODO: Implement this function
    pass

def gemm_matmul_form(a, b, c, alpha, beta):
    # TODO: Implement this function
    pass

# Benchmark the two implementations
benchmark(
    # TODO: Pass the two implementations to the benchmark function
)

# Compute the number of flops for the two implementations
def gemm_operator_form_flops(a, b, c):
    # TODO: Implement this function
    pass

def gemm_matmul_form_flops(a, b, c):
    # TODO: Implement this function
    pass

# Print the number of flops for the two implementations
print(f"GEMM operator form: {gemm_operator_form_flops(a, b, c) * 1e-9:.2f} GFLOP")
print(f"GEMM matmul form: {gemm_matmul_form_flops(a, b, c) * 1e-9:.2f} GFLOP")

---
## NVIDIA Nsight Plugin for JupyterLab

NVIDIA has created a useful JupyterLab plugin for the NVIDIA Nsight Tools, allowing performance profiling from within notebooks. The following command will install the plugin into your environment:

```bash
pip install jupyterlab-nvidia-nsight
```

After installation, you will see a new tab **NVIDIA Nsight** in JupyterLab's menu. In the menu, select **Profiling with Nsight Systems...**. This will restart the JupyterLab kernel. Select the cells you wish to profile and execute **Run and profile selected cells...**

In [None]:
# Profile this cell with Nsight Systems...
d = alpha * a @ b + beta * c
d.shape

After running the profiler and opening the generated report, you should see something like this:

<img src="./images/nsys-report-cupy.png" alt="Nsight Systems report showing CuPy GEMM with multiple kernels" width="75%"/>

Note the number of kernels and their execution times. You should see that the CuPy multiply kernel takes the longest and consumes more time than the SGEMM kernel.

Let us profile **nvmath-python**'s implementation now:

In [None]:
# Profile this cell with Nsight Systems...
d = nvmath.linalg.advanced.matmul(a, b, c, alpha=alpha, beta=beta)
type(d)

By opening the **Nsight Systems** UI, you should see something like this:

<img src="./images/nsys-report-nvmath.png" alt="Nsight Systems report showing nvmath-python GEMM with a single fused kernel" width="75%"/>

Note that the only kernel in the timeline is SGEMM. Everything else has been fused into that single kernel. This is the true reason why **nvmath-python**'s performance is much better in a composite operation like GEMM.

---
## GEMM with Fused Epilogs

The library goes one step further and allows fusion of the GEMM operation with an epilog function $f(x)$, which is applied element-wise to the result of the GEMM operation. Specifically:

$$
D_{m \times n} \leftarrow f(\alpha \cdot (A_{m \times k} \cdot B_{k \times n}) + \beta \cdot C_{m \times n})
$$
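
To pin down what the formula computes, here is a pure-Python reference on nested lists. It is a sketch of the mathematical semantics only, with hypothetical function names and ReLU chosen as an example epilog. It says nothing about performance and is not the library's interface, which is covered in the notebooks listed in this section.

```python
def gemm_with_epilog(a, b, c, alpha, beta, epilog):
    """Reference semantics: D = epilog(alpha * (A @ B) + beta * C), element-wise."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = []
    for i in range(rows):
        row = []
        for j in range(cols):
            dot = sum(a[i][p] * b[p][j] for p in range(inner))
            row.append(epilog(alpha * dot + beta * c[i][j]))
        d.append(row)
    return d


def relu(x):
    """Example element-wise epilog: max(x, 0)."""
    return max(x, 0.0)


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B equals A
c = [[0.0, 0.0], [-20.0, 0.0]]
d = gemm_with_epilog(a, b, c, alpha=1.0, beta=0.5, epilog=relu)
print(d)
```

Negative entries of the GEMM result are clamped to zero by the epilog, while positive entries pass through unchanged.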

For a deeper dive into GEMM epilogs and other advanced techniques of matrix-matrix multiplication in **nvmath-python**, please refer to the collection of notebooks in the `notebooks/matmul/` GitHub directory:
* [01_introduction.ipynb](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/01_introduction.ipynb)
* [02_epilogs.ipynb](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/02_epilogs.ipynb)
* [03_backpropagation.ipynb](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/03_backpropagation.ipynb)
* [04_fp8.ipynb](https://github.com/NVIDIA/nvmath-python/blob/main/notebooks/matmul/04_fp8.ipynb)

---
## Conclusion

In this notebook, we explored the fundamentals of **nvmath-python** and its key feature: *kernel fusion*. We demonstrated how **nvmath-python** integrates seamlessly with array libraries like NumPy, CuPy, and PyTorch, accepting their array types as inputs and returning results in the same format.

**Key Takeaways:**
- **nvmath-python** is not an array library but rather a mathematical operations library that coexists with existing array libraries.
- Kernel fusion combines multiple operations into a single GPU kernel, dramatically improving performance by reducing memory bandwidth requirements and kernel launch overhead.
- For simple operations like matrix multiplication, **nvmath-python** and CuPy perform similarly since both use cuBLAS. However, for composite operations like GEMM, **nvmath-python** shows significant speedups through kernel fusion.
- Proper benchmarking using `cupyx.profiler.benchmark` is essential for accurate performance measurements on GPUs.
- NVIDIA Nsight profiling tools provide deep insights into kernel execution and help visualize the benefits of kernel fusion.

**Next Steps:**
- Explore memory and execution spaces in the next notebook: [02_mem_exec_spaces.ipynb](02_mem_exec_spaces.ipynb)
- Learn about stateful APIs and autotuning: [03_stateful_api.ipynb](03_stateful_api.ipynb)
- Discover FFT callbacks: [04_callbacks.ipynb](04_callbacks.ipynb)
- Explore device APIs: [05_device_api.ipynb](05_device_api.ipynb)

---
## References

- NVIDIA nvmath-python documentation, "Installation Guide", https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html. Accessed: October 23, 2025.
- NVIDIA, "cuBLAS Library User Guide", https://docs.nvidia.com/cuda/cublas/. Accessed: October 23, 2025.
- NVIDIA, "Nsight Systems", https://developer.nvidia.com/nsight-systems. Accessed: October 23, 2025.
- Harris, Charles R., et al., "Array programming with NumPy", Nature, 585(7825), 357-362, 2020.
- Williams, Samuel, et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures", Communications of the ACM, 52(4), 65-76, 2009.
- Cormen, Thomas H., et al., "Introduction to Algorithms", 3rd Edition, MIT Press, 2009.
- Asanovic, Krste, et al., "The Landscape of Parallel Computing Research: A View from Berkeley", Technical Report UCB/EECS-2006-183, UC Berkeley, 2006.