# Chapter 6: Introduction to nvmath-python

<img src="images/chapter-06/nvmath-python.jpeg" style="width:600px;"/>

The **nvmath-python** (Beta) library brings the power of the NVIDIA math libraries to the Python ecosystem. The package aims to offer intuitive, Pythonic APIs that give users full access to the features of NVIDIA’s libraries in a variety of execution spaces. nvmath-python works seamlessly with existing Python array/tensor frameworks and focuses on providing functionality that is missing from those frameworks.

This library seeks to meet the needs of:
- Researchers seeking productivity, interoperability with other libraries and frameworks, and performance
- Library and framework developers seeking out-of-the-box performance and better maintainability through Python
- Kernel developers seeking the highest performance without the need to switch to CUDA

nvmath-python features:
- Low-level bindings to CUDA math libraries
- Pythonic high-level APIs (host and device), currently limited to extended matmul and FFTs
- Device functions callable in Numba kernels
- Interoperability with NumPy, CuPy, and PyTorch tensors


### Installation 

Please use the [nvmath-python installation guide](https://docs.nvidia.com/cuda/nvmath-python/latest/installation.html#) to set up the library for your hardware and Python environment.

In [None]:
!pip install "nvmath-python[cu12,dx,cpu]"

This command installs nvmath-python together with its CUDA 12 (`cu12`) and CPU (`cpu`) dependencies. The `dx` extra adds support for the device APIs alongside the host APIs, a combination that is supported only with CUDA 12.

In [None]:
# Using NumPy array on CPU for first version
import numpy as np
import nvmath
import cupy as cp

## Getting Started

### Prototype Your Problem on the CPU



In [None]:
shape = 64, 256, 128
axes = 0, 1

# Create our ndarray, stored on the CPU
a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

# We utilize the host APIs to process the input directly on the CPU
b = nvmath.fft.fft(a, axes=axes)
print(f"Input type = {type(a)}, FFT output type = {type(b)}") 

We can change where the FFT executes. Passing `execution="cuda"` copies the data to the GPU for us and performs the computation with cuFFT.

In [None]:
# Copy to the GPU to process with cuFFT
b = nvmath.fft.fft(a, axes=axes, execution="cuda")
print(f"Input type = {type(a)}, FFT output type = {type(b)}") 

# High-Level Modules
nvmath-python provides common, performant operations out of the box, without leaving Python.

This includes:
- Linear Algebra
- Fast Fourier Transform

The nvmath-python library enables the fusion of epilog operations, offering enhanced performance. Available epilog operations include:
- RELU: Applies the Rectified Linear Unit activation function.
- GELU: Applies the Gaussian Error Linear Unit activation function.
- BIAS: Adds a bias vector.
- SIGMOID: Applies the sigmoid function.
- TANH: Applies the hyperbolic tangent function.

These epilogs can be combined; for example, RELU and BIAS can be fused. Custom epilogs can also be defined as Python functions and compiled to LTO-IR.
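
To make concrete what a fused epilog computes, the following is a minimal plain-Python sketch of the arithmetic behind a matrix multiplication followed by a bias addition and a ReLU. It is only a reference for the math, not how nvmath-python implements it. The function name `matmul_bias_relu` is hypothetical, and the sketch assumes the bias is added before the ReLU is applied and that the bias holds one value per output row.

```python
# Plain-Python reference for "matmul, then BIAS, then RELU".
# Illustration only: the function name is hypothetical and this is not
# nvmath-python code. Assumes bias is applied before ReLU.

def matmul_bias_relu(a, b, bias):
    """Return relu(a @ b + bias) for list-of-lists matrices.

    a is m x k, b is k x n, and bias has one entry per output row (length m).
    """
    m, k, n = len(a), len(b), len(b[0])
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))  # matrix multiply
            acc += bias[i]                                   # bias step
            row.append(max(acc, 0.0))                        # ReLU step
        out.append(row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]   # identity, so a @ b == a
bias = [0.5, -10.0]
result = matmul_bias_relu(a, b, bias)
print(result)
```

A fused kernel performs the bias and activation steps while the product is still on chip, instead of writing the intermediate matrix to memory and reading it back for each extra step.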

## Linear Algebra

The nvmath-python library offers a specialized matrix multiplication interface to perform scaled matrix-matrix multiplication with predefined epilog operations as a single fused kernel. This kernel fusion can potentially lead to significantly better efficiency.

In addition, nvmath-python’s stateful APIs decompose such operations into planning, autotuning, and execution phases, which enables amortization of one-time preparatory costs across multiple executions.
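
The plan-once, execute-many idea can be illustrated with a toy class in plain Python. This is a sketch under stated assumptions, not the nvmath-python implementation: `ToyMatmul`, `reset_a`, and `plan_calls` are hypothetical names, and the "plan" here is just a one-time re-layout of `b`, standing in for the much more involved preparation a real library performs.

```python
class ToyMatmul:
    """Toy stand-in for a stateful matmul object (hypothetical, illustration only)."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.b_columns = None   # filled in by plan()
        self.plan_calls = 0

    def plan(self):
        # One-time preparation: store b column by column so that execute()
        # can take dot products of rows of a with columns of b directly.
        self.b_columns = [list(col) for col in zip(*self.b)]
        self.plan_calls += 1

    def reset_a(self, a):
        # Swap in a new left operand of the same shape. The plan stays valid
        # because it depends only on b.
        self.a = a

    def execute(self):
        if self.b_columns is None:
            raise RuntimeError("call plan() before execute()")
        return [[sum(x * y for x, y in zip(row, col)) for col in self.b_columns]
                for row in self.a]


b = [[1, 2], [3, 4]]
mm = ToyMatmul([[1, 0], [0, 1]], b)
mm.plan()                        # preparation cost is paid once
first = mm.execute()
mm.reset_a([[2, 0], [0, 2]])
second = mm.execute()            # reuses the existing plan
print(first, second, mm.plan_calls)
```

The more often `execute()` is called per `plan()`, the smaller the share of the preparation cost in each result.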

### Matmul with CuPy Arrays (Stateless)

This example demonstrates basic matrix multiplication of CuPy arrays.

nvmath-python supports multiple frameworks. The result of each operation is a tensor of the
same framework that was used to pass the inputs. It is also located on the same device as
the inputs.

This example is stateless in the sense that it uses a functional-style API.


In [None]:
# Prepare sample input data.
m, n, k = 123, 456, 789
a = cp.random.rand(m, k)
b = cp.random.rand(k, n)

# Perform the multiplication.
result = nvmath.linalg.advanced.matmul(a, b)

# Synchronize the default stream, since by default the execution is non-blocking for GPU
# operands.
cp.cuda.get_current_stream().synchronize()

# Check that the result is a CuPy array as well.
print(f"Inputs were of types {type(a)} and {type(b)} and the result is of type {type(result)}.")
assert isinstance(result, cp.ndarray)

### Matmul with CuPy Arrays (Stateful)

This example illustrates the use of stateful matrix multiplication objects. The stateful API is object-oriented. Stateful objects amortize the cost of preparation across multiple executions.

The inputs as well as the result are CuPy ndarrays.

In [None]:
# Prepare sample input data.
m, n, k = 123, 456, 789
a = cp.random.rand(m, k)
b = cp.random.rand(k, n)

# Use the stateful object as a context manager to automatically release resources.
with nvmath.linalg.advanced.Matmul(a, b) as mm:
    # Plan the matrix multiplication. Planning returns a sequence of algorithms that can be
    # configured.
    mm.plan()

    # Execute the matrix multiplication.
    result = mm.execute()

    # Synchronize the default stream, since by default the execution is non-blocking for GPU
    # operands.
    cp.cuda.get_current_stream().synchronize()
    print(f"Input types = {type(a), type(b)}, device = {a.device, b.device}")
    print(f"Result type = {type(result)}, device = {result.device}")

### Matmul with CuPy Arrays (Stateless with Epilog)

This example demonstrates usage of epilogs.

Epilogs allow you to execute extra computations after the matrix multiplication in a single
fused kernel. In this example we'll use the BIAS epilog, which adds bias to the result.

In [None]:
m, n, k = 64, 128, 256
a = cp.random.rand(m, k)
b = cp.random.rand(k, n)
bias = cp.random.rand(m, 1)

# Perform the multiplication with BIAS epilog.
epilog = nvmath.linalg.advanced.MatmulEpilog.BIAS
result = nvmath.linalg.advanced.matmul(a, b, epilog=epilog, epilog_inputs={"bias": bias})

# Synchronize the default stream, since by default the execution is non-blocking for GPU
# operands.
cp.cuda.get_current_stream().synchronize()
print(f"Inputs were of types {type(a)} and {type(b)}, the bias type is {type(bias)}, and the result is of type {type(result)}.")

## Fast Fourier Transform

Backed by the NVIDIA cuFFT library, nvmath-python provides a powerful set of APIs to perform N-dimensional discrete Fourier transforms. These include forward and inverse transforms for complex-to-complex, complex-to-real, and real-to-complex cases. The operations are available in a variety of precisions, both as host and device APIs.

The user can provide callback functions written in Python to selected nvmath-python operations like FFT, which results in a fused kernel and can lead to significantly better performance. Advanced users may benefit from nvmath-python device APIs that enable fusing core mathematical operations like FFT and matrix multiplication into a single kernel, bringing performance close to the theoretical maximum.
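
As background for the real-to-complex case mentioned above, the sketch below uses a naive plain-Python DFT (not nvmath-python) to show the Hermitian symmetry of the spectrum of a real signal. This symmetry is why real-to-complex transforms typically need to store only the first `n // 2 + 1` outputs. The helper `dft` and the sample values are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT of a list of numbers (illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


x = [0.5, -1.0, 2.0, 4.0, 0.0, 3.0]   # real-valued signal
n = len(x)
X = dft(x)

# Hermitian symmetry: X[n - k] equals the complex conjugate of X[k].
symmetry_error = max(abs(X[n - k] - X[k].conjugate()) for k in range(1, n))

# The first n // 2 + 1 outputs therefore carry all the information.
kept = X[: n // 2 + 1]
print(len(kept), symmetry_error)
```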

### FFTs with CuPy Arrays

The input as well as the result from the FFT operations are CuPy ndarrays, resulting
in effortless interoperability between nvmath-python and CuPy.

In [None]:
shape = 512, 256, 512
axes = 0, 1

a = cp.random.rand(*shape, dtype=cp.float64) + 1j * cp.random.rand(*shape, dtype=cp.float64)

# Forward FFT along the specified axes, batched along the complement.
b = nvmath.fft.fft(a, axes=axes)

# Inverse FFT along the specified axes, batched along the complement.
c = nvmath.fft.ifft(b, axes=axes)

# Synchronize the default stream
cp.cuda.get_current_stream().synchronize()
print(f"Input type = {type(a)}, device = {a.device}")
print(f"FFT output type = {type(b)}, device = {b.device}")
print(f"IFFT output type = {type(c)}, device = {c.device}")

### FFT with Callback

User-defined functions can be compiled to the LTO-IR format and provided as epilog or prolog to the FFT operation, allowing for Link-Time Optimization and fusing.

This example shows how to perform a convolution by providing a Python callback function as prolog to the IFFT operation.
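
The identity this example relies on can be checked with a small plain-Python sketch. It uses a naive DFT rather than nvmath-python and assumes the inverse transform is unnormalized (the cuFFT convention), which is why the pointwise product is scaled by `1 / n`, just as the callback in the cell below divides by `N`. The helpers `dft` and `circular_convolution` are illustrative names.

```python
import cmath


def dft(x, inverse=False):
    """Naive unnormalized DFT (or inverse DFT) of a list of numbers."""
    n = len(x)
    sign = 1 if inverse else -1
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def circular_convolution(x, h):
    """Direct circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[t] * h[(i - t) % n] for t in range(n)) for i in range(n)]


x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.0, -1.0, 0.0]
n = len(x)

# Frequency-domain route: multiply the spectra pointwise, scale by 1/n to
# compensate for the unnormalized inverse, then transform back.
X, H = dft(x), dft(h)
via_fft = dft([X[k] * H[k] / n for k in range(n)], inverse=True)

direct = circular_convolution(x, h)
max_error = max(abs(u - v) for u, v in zip(via_fft, direct))
print(direct, max_error)
```

In the notebook example the filter is supplied directly in the frequency domain, so only the multiplication and scaling steps run inside the prolog.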

In [None]:
B, N = 256, 1024
a = cp.random.rand(B, N, dtype=cp.float64) + 1j * cp.random.rand(B, N, dtype=cp.float64)

# Create the data to use as a filter.
filter_data = cp.sin(a)

# Define the prolog function for the inverse FFT.
# A convolution corresponds to pointwise multiplication in the frequency domain.
def convolve(data_in, offset, filter_data, unused):
    # Note we are accessing `data_in` and `filter_data` with a single `offset` integer,
    # even though the input and `filter_data` are 2D tensors (batches of samples).
    # Care must be taken to ensure that both arrays accessed here have the same memory
    # layout.
    return data_in[offset] * filter_data[offset] / N

# Compile the prolog to LTO-IR.
with cp.cuda.Device():
    prolog = nvmath.fft.compile_prolog(convolve, "complex128", "complex128")

# Perform the forward FFT, followed by the inverse FFT, applying the filter as a prolog.
r = nvmath.fft.fft(a, axes=[-1])
r = nvmath.fft.ifft(r, axes=[-1], prolog={
        "ltoir": prolog,
        "data": filter_data.data.ptr
    })


# Low-level Modules
The low-level modules provide direct access to CUDA internals and the CUDA C math libraries.

This includes:
- Device APIs
- Math Library Bindings

There is also access to host APIs (and host APIs with callbacks), but we will focus on the device side here.

## Device APIs

The device module of nvmath-python, `nvmath.device`, offers integration with NVIDIA’s high-performance computing libraries through device APIs for cuFFTDx, cuBLASDx, and cuRAND. Detailed documentation for these libraries can be found at [cuFFTDx](https://docs.nvidia.com/cuda/cufftdx/1.2.0/), [cuBLASDx](https://docs.nvidia.com/cuda/cublasdx/0.1.1/), and [cuRAND](https://docs.nvidia.com/cuda/curand/group__DEVICE.html#group__DEVICE) device APIs respectively.

Users may take advantage of the device module via the two approaches below:
- Numba Extensions: Users can access these device APIs via Numba by utilizing specific extensions that simplify the process of defining functions, querying device traits, and calling device functions.
- Third-party JIT Compilers: The APIs are also available through low-level interfaces in other JIT compilers, allowing advanced users to work directly with the raw device code.

This example shows how to use the cuRAND device APIs to sample single-precision values from a normal distribution.

In [None]:
from numba import cuda
from nvmath.device import random
compiled_apis = random.Compile()

threads, blocks = 128, 128
nthreads = blocks * threads

states = random.StatesPhilox4_32_10(nthreads)

# Next, define and launch a setup kernel, which will initialize the states using
# nvmath.device.random.init function.
@cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
def setup(states):
    i = cuda.grid(1)
    random.init(1234, i, 0, states[i])

setup[blocks, threads](states)

# With your states array ready, you can use samplers such as
# nvmath.device.random.normal2 to sample random values in your kernels.
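@cuda.jit(link=compiled_apis.files, extensions=compiled_apis.extension)
def kernel(states):
    i = cuda.grid(1)
    random_values = random.normal2(states[i])

# Launch the sampling kernel with the same grid as the setup kernel.
kernel[blocks, threads](states)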

## Math Library Bindings

Low-level Python bindings for C APIs from NVIDIA Math Libraries are exposed under the corresponding modules in `nvmath.bindings`. To access the Python bindings, use the modules for the corresponding libraries. Under the hood, nvmath-python lazily handles the run-time linking to the libraries for you.

The currently supported libraries along with the corresponding module names are listed as follows:
- [cuBLAS](https://docs.nvidia.com/cuda/cublas/) (`nvmath.bindings.cublas`)
- [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) (`nvmath.bindings.cublasLt`)
- [cuFFT](https://docs.nvidia.com/cuda/cufft/) (`nvmath.bindings.cufft`)
- [cuRAND](https://docs.nvidia.com/cuda/curand/index.html) (`nvmath.bindings.curand`)
- [cuSOLVER](https://docs.nvidia.com/cuda/cusolver/index.html) (`nvmath.bindings.cusolver`)
- [cuSOLVERDn](https://docs.nvidia.com/cuda/cusolver/index.html#cusolverdn-dense-lapack) (`nvmath.bindings.cusolverDn`)
- [cuSPARSE](https://docs.nvidia.com/cuda/cusparse/) (`nvmath.bindings.cusparse`)

Guidance on translating library function names from C to Python is documented here: https://docs.nvidia.com/cuda/nvmath-python/latest/bindings/index.html

## Links to References
nvmath-python home: https://developer.nvidia.com/nvmath-python 

nvmath-python documentation: https://docs.nvidia.com/cuda/nvmath-python/latest/index.html 

nvmath-python GitHub repository: https://github.com/NVIDIA/nvmath-python

Fusing Epilog Operations with Matrix Multiplication Using nvmath-python blog post: https://developer.nvidia.com/blog/fusing-epilog-operations-with-matrix-multiplication-using-nvmath-python/

# Examples

A complete set of examples is available in the nvmath-python GitHub repository: https://github.com/NVIDIA/nvmath-python/tree/main/examples