In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<img src="./images/nvmath_head_panel@0.5x.png" alt="nvmath-python" />

# Getting Started with nvmath-python: Stateful APIs and Autotuning

## Overview

This notebook introduces **nvmath-python**'s stateful APIs and autotuning capabilities. Understanding the difference between stateless and stateful APIs is crucial for optimizing performance in scenarios involving repeated operations, such as batch processing.

**Learning Objectives:**
* Understand the difference between stateless (function-form) and stateful (class-form) APIs
* Identify the four phases of **nvmath-python** operations: specification, planning, execution, and resource management
* Apply stateful APIs to amortize planning costs across multiple executions
* Use `nvmath.linalg.advanced.Matmul` class for stateful matrix multiplication
* Implement autotuning to find optimal kernel configurations
* Benchmark performance improvements from using stateful APIs and autotuning

---
## Introduction

In many scientific computing workloads, the same operation is performed repeatedly with different data but identical parameters (e.g., matrix shapes, data types, and operation configurations). In such scenarios, performing the full operation setup (specification and planning) for each execution is wasteful. 

**nvmath-python** addresses this through *stateful APIs* (also called *class-form APIs*), which allow you to separate the specification and planning phases from the execution phase. This enables you to amortize the overhead of setup across multiple executions, significantly improving performance in batch processing scenarios.

Additionally, **nvmath-python** provides *autotuning* capabilities that search for optimal kernel configurations beyond the default heuristics, further improving performance for specific problem sizes and configurations.

This notebook demonstrates how to use stateful APIs and autotuning to achieve optimal performance in repeated operations.

**Prerequisites:** To use this notebook, you will need:
- A computer equipped with an NVIDIA GPU
- An environment with properly installed Python libraries and (optionally) CUDA Toolkit
- Completion of previous notebooks on kernel fusion and memory spaces (recommended)
- Understanding of matrix multiplication and batch processing concepts

For installation instructions, please refer to the [nvmath-python documentation](https://docs.nvidia.com/cuda/nvmath-python/0.2.1/installation.html#install-nvmath-python).

---
## Setup

This notebook uses the same benchmarking helper function from previous notebooks:

In [None]:
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt

---
## Stateful and Stateless APIs

Let us consider a scenario typical in neural networks: performing a batch of matrix-matrix multiplications combined with bias and *ReLU* activation. This example will illustrate the differences between stateless and stateful APIs. 

In [None]:
import cupy as cp
import nvmath
from nvmath.linalg.advanced import MatmulEpilog

m, n, k = 124, 10, 15

a = cp.random.rand(m, k, dtype=cp.float32)
b = cp.random.rand(k, n, dtype=cp.float32)
d = cp.empty((m, n), dtype=cp.float32)
bias = cp.random.rand(m, dtype=cp.float32)

d = nvmath.linalg.advanced.matmul(a, b, epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias})

print("d type =", type(d))
print("d shape =", d.shape)

### Understanding Stateless APIs

What we have used so far is the *stateless* (or *function-form*) API for `matmul` in **nvmath-python**. This is a convenience API that allows you to perform a desired operation as a single function call and get the result. Under the hood, however, a lot of machinery operates before actual computation takes place. Let us illustrate this using **nvmath-python**'s logging capabilities:

In [None]:
# Demonstrate what happens under the hood using nvmath-python's logging capabilities
import logging

# Configure the root logger to INFO and include timestamps
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)  # ensure logging is enabled

logging.info("About to call matmul(); inspect logs for execution flow")

d = nvmath.linalg.advanced.matmul(a, b, epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias})

print("d type =", type(d))
print("d shape =", d.shape)

logging.info("Completed matmul() call")
logging.disable(logging.CRITICAL)  # disable logging

### The Four Phases of nvmath-python Operations

As you can see from the logging output, quite a bit of action happens during the call to `matmul`:

* **SPECIFICATION PHASE:** In this phase, **nvmath-python** analyzes input arguments and creates an internal `Matmul` object with all required data prepared for the underlying [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) library call. Note that this phase may introduce noticeable overhead when inputs are relatively small.

* **PLANNING PHASE:** This is where cuBLASLt analyzes the prepared data from the `Matmul` object and searches for a suitable algorithm to perform the operation efficiently. It uses internal heuristics to select the most promising algorithm. The planning phase also introduces noticeable overhead before actual computation begins.

* **EXECUTION PHASE:** This is the phase where actual computation occurs.

* **RESOURCE MANAGEMENT PHASE:** This is the final phase where `Matmul` resources are released. Did you notice the respective INFO line `The Matmul object's resources have been released` in the log?

### Motivation for Stateful APIs

Now imagine a workflow that consists of a **series of matrix multiplications**. Each multiplication then goes through all four phases again. If the matrix shapes, layouts, and input dtypes do not change between calls, it is desirable to perform *specification* and *planning* once and amortize their cost over *multiple executions*.
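To make the amortization argument concrete, here is a minimal cost-model sketch. The function names and the per-phase timings are hypothetical placeholders, not measurements of **nvmath-python**; the model also ignores resource management and any per-call cost of swapping operands. Substitute values taken from your own logs to get a realistic estimate.

```python
# Toy cost model: all timings are hypothetical, in microseconds.
def stateless_cost(n_calls, t_spec, t_plan, t_exec):
    """Every call repeats specification, planning, and execution."""
    return n_calls * (t_spec + t_plan + t_exec)


def stateful_cost(n_calls, t_spec, t_plan, t_exec):
    """Specification and planning happen once; only execution repeats."""
    return t_spec + t_plan + n_calls * t_exec


t_spec, t_plan, t_exec = 50, 200, 10  # assumed values for illustration only

for n_calls in (1, 8, 1024):
    slow = stateless_cost(n_calls, t_spec, t_plan, t_exec)
    fast = stateful_cost(n_calls, t_spec, t_plan, t_exec)
    print(f"n_calls={n_calls:5d}: stateless={slow:7d} us, stateful={fast:7d} us, ratio={slow / fast:.1f}x")
```

With a single call the two costs are identical; the gap grows with the number of calls and is largest when execution is cheap relative to setup, which is typical for small matrices.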

For exactly this scenario, **nvmath-python** offers the **stateful** (or **class-form**) API for matrix-matrix multiplication. With this API, you construct an instance of the `Matmul` class, whose `plan()` and `execute()` methods can be invoked separately.

Did you notice the log line stating `The Matmul operation has been created`? **nvmath-python** uses *stateful* APIs under the hood, and the *stateless* APIs are convenience wrappers around stateful API calls.
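The relationship between the two API forms can be sketched with a pure-Python stand-in. `ToyMatmul` and `toy_matmul` below are hypothetical names invented for this illustration; they perform no GPU work and are not part of **nvmath-python**. The sketch only records which phases run, assuming the function form simply wraps the class form.

```python
# Toy stand-in for the class-form/function-form relationship.
# ToyMatmul is hypothetical and performs no GPU work; it only records phases.
class ToyMatmul:
    def __init__(self, a, b, log):
        self.a, self.b, self.log = a, b, log
        self.planned = False
        log.append("specification")

    def plan(self):
        self.planned = True
        self.log.append("planning")

    def reset_operands(self, a, b):
        # Swap in new data of the same shape; the existing plan is reused.
        self.a, self.b = a, b

    def execute(self):
        if not self.planned:
            raise RuntimeError("plan() must be called before execute()")
        self.log.append("execution")
        # Plain-Python matrix product on nested lists.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*self.b)] for row in self.a]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append("resource release")
        return False


def toy_matmul(a, b, log):
    """Function-form wrapper: runs all four phases on every call."""
    with ToyMatmul(a, b, log) as mm:
        mm.plan()
        return mm.execute()


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

stateless_log = []
for _ in range(3):
    toy_matmul(a, b, stateless_log)

stateful_log = []
with ToyMatmul(a, b, stateful_log) as mm:
    mm.plan()
    for _ in range(3):
        mm.reset_operands(a, b)
        mm.execute()

print("stateless phases:", len(stateless_log))  # 3 calls x 4 phases each
print("stateful phases: ", len(stateful_log))  # 1 + 1 + 3 + 1
```

Three function-form calls record twelve phases, while the class form records six for the same three executions.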


### Stateful and stateless implementations for batched `matmul`

We will use a `for` loop to retrieve `a[i]`, `b[i]`, and `bias[i]` from the batch. For the stateful implementation, we use the `reset_operands()` method, which provides the new inputs for the next `execute()` call. Alternatively, you can update the operands in place with `a[i]`, `b[i]`, and `bias[i]` in each iteration.

**Question:** Which method is faster, resetting operands or in-place update?

In [None]:
import cupy as cp
import nvmath
from nvmath.linalg.advanced import MatmulEpilog

m, n, k, batch_size = 124, 10, 15, 8 # Sizes are small for demonstration purposes

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32)
bias = cp.random.rand(batch_size, m, dtype=cp.float32)


def matmul_batched_stateless(a, b, bias):
    for i in range(batch_size): # We use a for loop to retrieve the new `a[i]`, `b[i]`, and `bias[i]` in the batch.
        d[i] = nvmath.linalg.advanced.matmul(
            a[i], b[i], epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[i]}
        )

    return d


def matmul_batched_stateful(a, b, bias):
    with nvmath.linalg.advanced.Matmul(a[0], b[0]) as mm: # We create a Matmul object for the first batch element.
        mm.plan(epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[0]}) # We assume a[0] and b[0] are representative of the batch.
        d[0] = mm.execute() # The first execution uses the operands passed at construction, so no reset is needed.
        for i in range(1, batch_size):
            mm.reset_operands(a=a[i], b=b[i], epilog_inputs={"bias": bias[i]}) # Subsequent executions require resetting the operands.
            d[i] = mm.execute() # We execute with the new operands.

    return d

Now let us enable logging to see what happens under the hood in the stateless and stateful cases:

In [None]:
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)-8s %(message)s", force=True)
logging.disable(logging.NOTSET)

logging.info("*************** Stateless API ***************")

matmul_batched_stateless(a, b, bias)

logging.info("*************** Stateful API ***************")

matmul_batched_stateful(a, b, bias)

logging.disable(logging.CRITICAL)

### Comparing Stateless and Stateful APIs

You can see from the logging output that in the case of the stateless API there are 8 **SPECIFICATION**, 8 **PLANNING**, and 8 **EXECUTION** phases. In the case of the stateful API implementation, we managed to reduce the number of **SPECIFICATION** and **PLANNING** phases to just 1. Let us measure the performance impact:

In [None]:
m, n, k, batch_size = 124, 1024, 1512, 1024 # Select larger sizes for benchmarking purposes

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32)
bias = cp.random.rand(batch_size, m, dtype=cp.float32)

benchmark(
    lambda: matmul_batched_stateful(a, b, bias), "Stateful API", lambda: matmul_batched_stateless(a, b, bias), "Stateless API"
)

## Exercise: Batch Dimension vs. Batch Sequence

In the above example, we implemented batching as a sequence of matrices processed one by one in a loop. This is a common technique for streaming data or when the entire batch does not fit into GPU memory. An alternative approach is to add a dedicated batching dimension and operate on the batch as a single tensor. The **nvmath-python** library supports both use cases.

Implement a batching dimension approach and compare performance to the batch sequence approach. Explain the performance difference (if any).

In [None]:
## Exercise: Batch Dimension vs. Batch Sequence

import nvmath
from nvmath.linalg.advanced import MatmulEpilog
import cupy as cp
import numpy as np
import cupyx as cpx


# Helper function to benchmark two implementations F and (optionally) F_alternative
# When F_alternative is provided, in addition to raw performance numbers (seconds)
# speedup of F relative to F_alternative is reported
def benchmark(
    F, F_name="Implementation", F_alternative=None, F_alternative_name="Alternative implementation", n_repeat=10, n_warmup=1
):
    timing = cpx.profiler.benchmark(F, n_repeat=n_repeat, n_warmup=n_warmup)  # warm-up + repeated runs
    perf = np.min(timing.gpu_times)  # best time from repeated runs
    print(f"{F_name} performance = {perf:0.4f} sec")

    if F_alternative is not None:
        timing_alt = cpx.profiler.benchmark(F_alternative, n_repeat=n_repeat, n_warmup=n_warmup)
        perf_alt = np.min(timing_alt.gpu_times)
        print(f"{F_alternative_name} performance = {perf_alt:0.4f} sec")
        print(f"Speedup = {perf_alt / perf:0.4f}x")
    else:
        perf_alt = None

    return perf, perf_alt


m, n, k, batch_size = 64, 128, 256, 512 # Select larger sizes for benchmarking purposes

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
bias = cp.random.rand(batch_size, m, 1, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32) # Use pre-allocated resulting array to save memory


def matmul_batched_stateless(a, b, bias):
    # TODO: Implement stateless batching dimension approach
    pass


def matmul_batched_stateful_sequence(a, b, bias):
    # TODO: Implement stateful batching sequence approach
    pass

benchmark(
    lambda: matmul_batched_stateless(a, b, bias),
    "Stateless batched dimension approach",
    lambda: matmul_batched_stateful_sequence(a, b, bias),
    "Stateful with a batch sequence approach",
)

Why is `matmul_batched_stateless()` so much faster than `matmul_batched_stateful_sequence()`? In what situation might the latter be the preferred approach?

**Hint:** Consider the amount of parallelism that each approach can exploit, the arithmetic intensity of each approach, and the amount of memory each approach consumes.
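As a starting point for that reasoning, the following back-of-the-envelope sketch estimates the work and memory traffic of one float32 matrix product for the exercise sizes. The helper names are hypothetical, and the estimate is a simplification: it ignores the bias, the epilog, and caching effects.

```python
# Rough per-matrix estimate for a float32 matmul D = A @ B
# with A of shape (m, k), B of shape (k, n), and D of shape (m, n).
def matmul_flops(m, n, k):
    return 2 * m * n * k  # one multiply and one add per inner-product term


def matmul_bytes(m, n, k, itemsize=4):
    return itemsize * (m * k + k * n + m * n)  # read A and B, write D


m, n, k, batch_size = 64, 128, 256, 512  # sizes from the exercise cell

flops = matmul_flops(m, n, k)
nbytes = matmul_bytes(m, n, k)
print(f"per-matrix FLOPs       = {flops}")
print(f"per-matrix bytes moved = {nbytes}")
print(f"arithmetic intensity   = {flops / nbytes:.2f} FLOP/byte")
print(f"whole batch in memory  = {batch_size * nbytes / 2**20:.0f} MiB")
```

Each individual product is only about 4 MFLOP, so think about how much of the GPU a single such product can occupy, and about what holding the whole batch in memory at once implies.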

---
## Autotuning with nvmath-python

Using stateful APIs becomes even more essential when there is a need for autotuning. In many cases, the built-in heuristics for `matmul` kernel selection work reasonably well out-of-the-box. However, there are cases where the underlying cuBLASLt library may choose a suboptimal kernel, and additional tuning is required. 

Autotuning searches through multiple algorithm candidates and selects the one with the best performance for your specific problem configuration. While autotuning itself has a cost, this cost can be amortized through multiple executions using stateful APIs.
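The amortization claim can be quantified with a simple break-even estimate. The sketch below is illustrative only: `break_even_executions` is a hypothetical helper, and the timings are made-up numbers rather than measurements of **nvmath-python** or cuBLASLt.

```python
import math


def break_even_executions(t_autotune, t_exec_default, t_exec_tuned):
    """Number of executions after which autotuning has paid for itself.

    Returns None when the tuned kernel is not faster than the default one.
    """
    saving = t_exec_default - t_exec_tuned
    if saving <= 0:
        return None
    return math.ceil(t_autotune / saving)


# Hypothetical timings in microseconds, for illustration only.
print(break_even_executions(t_autotune=5000, t_exec_default=100, t_exec_tuned=80))  # 250
print(break_even_executions(t_autotune=5000, t_exec_default=100, t_exec_tuned=100))  # None
```

If the default heuristics already pick the fastest kernel, autotuning is pure overhead, so it pays off only for workloads with enough repeated executions.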

Let us see how autotuning works and how its cost can be amortized:

In [None]:
m, n, k, batch_size = 124, 1024, 1512, 1024

a = cp.random.rand(batch_size, m, k, dtype=cp.float32)
b = cp.random.rand(batch_size, k, n, dtype=cp.float32)
d = cp.empty((batch_size, m, n), dtype=cp.float32)
bias = cp.random.rand(batch_size, m, dtype=cp.float32)


def matmul_batched_stateful_autotuned(a, b, bias):
    with nvmath.linalg.advanced.Matmul(a[0], b[0]) as mm:
        mm.plan(epilog=MatmulEpilog(MatmulEpilog.RELU_BIAS), epilog_inputs={"bias": bias[0]})
        mm.autotune(iterations=5)
        d[0] = mm.execute()  # store the first result, computed from the operands passed at construction
        for i in range(1, batch_size):
            mm.reset_operands(a=a[i], b=b[i], epilog_inputs={"bias": bias[i]})
            d[i] = mm.execute()

    return d


benchmark(
    lambda: matmul_batched_stateful_autotuned(a, b, bias),
    "Stateful API with autotuning",
    lambda: matmul_batched_stateful(a, b, bias),
    "Stateful API without autotuning",
)

**Key Takeaways:**

- The stateless API is convenient for single operations but repeats specification and planning on every call.
- The stateful API performs specification and planning once and then executes many times, significantly improving performance for batched operations.
- **nvmath-python** operations have four phases: specification, planning, execution, and resource management.
- Autotuning finds optimal kernels when built-in heuristics are suboptimal, providing additional performance gains.
- The cost of autotuning can be amortized across many executions, making it worthwhile for production workloads.

---
## Conclusion

In this notebook, we explored **nvmath-python**'s stateful APIs and autotuning capabilities. These advanced features are essential for achieving optimal performance in production workloads involving repeated operations.

**Key Takeaways:**
- **nvmath-python** operations consist of four phases: specification, planning, execution, and resource management.
- Stateless (function-form) APIs are convenient but repeat all phases for each call.
- Stateful (class-form) APIs allow you to perform specification and planning once, then execute multiple times with different data.
- For batch processing, stateful APIs provide significant performance improvements by amortizing setup overhead.
- Autotuning searches for optimal kernels beyond default heuristics, providing additional performance gains.
- The cost of autotuning can be amortized across many executions using stateful APIs.
- Use the **nvmath-python** logging mechanism to understand which phases consume time and optimize accordingly.

**Next Steps:**
- Explore FFT callbacks in the next notebook: [04_callbacks.ipynb](04_callbacks.ipynb)
- Discover device APIs: [05_device_api.ipynb](05_device_api.ipynb)

---
## References

- NVIDIA nvmath-python documentation, "API Reference," https://docs.nvidia.com/cuda/nvmath-python/, Accessed: October 23, 2025.
- NVIDIA, "cuBLASLt Library User Guide," https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api, Accessed: October 23, 2025.
- Williams, Samuel, et al., "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Communications of the ACM, 52(4), 65-76, 2009.
- Cormen, Thomas H., et al., "Introduction to Algorithms," 3rd Edition, MIT Press, 2009.